Real-Time Facial Expression Recognition with Bengali Audio Feedback

An Interactive Exploration of a Multimodal Deep Learning System

Introduction: Bridging Communication Gaps

This project addresses a critical challenge in human-computer interaction: the inability of technology to understand non-verbal cues, which form a significant part of communication. Specifically, it focuses on bridging this gap for over 230 million Bengali speakers worldwide. The research introduces a novel system that not only recognizes facial expressions in real-time but also provides contextually appropriate feedback in the Bengali language. By integrating advanced deep learning models for both visual and linguistic processing, the system aims to create more natural, inclusive, and effective communication experiences. This interactive report allows you to explore the core components, methodology, and findings of this innovative research.

About the Research

This research was carried out under the supervision of Mr. Moin Mostakim.

Authors

Arnab Sarker Sangit

19201100

Sharthak Das

23241033

Supervision & Affiliation

Supervisor: Mr. Moin Mostakim

Senior Lecturer, Brac University

Institution: Brac University

Department of Computer Science and Engineering

The Proposed System Architecture

The core of this research is a dual-stream deep learning architecture that processes visual and textual information in parallel to achieve its goal. The flowchart below illustrates the complete journey from initial data input to the final Bengali audio output. Hover over each step to see a more detailed explanation of its function within the system. This design allows the model to learn from both what it 'sees' (facial features) and what it 'knows' (linguistic context), leading to a more robust and nuanced understanding of human emotion.

Input Data: Raw facial images and a corpus of Bengali text are fed into the system.
Data Preprocessing: Data is cleaned, standardized, and prepared for processing.

Image Stream (CNN)

Image Processing: Images are resized to 48x48 pixels, converted to grayscale, and normalized (a preprocessing sketch follows the flowchart).
Feature Extraction: A Convolutional Neural Network (CNN) extracts key visual features such as edges, textures, and shapes related to expressions.

Feedback Stream (LSTM)

Text Processing: Bengali text is tokenized into numerical representations for the model.
Sequence Learning: A Long Short-Term Memory (LSTM) network learns the patterns and structure of the Bengali language.
Feature Fusion & Classification: Visual features from the CNN and linguistic context from the LSTM are combined. A classifier then predicts the final emotion (a model sketch is shown below the flowchart).
Emotion Detected: The model outputs a specific emotion class (e.g., 'Happy', 'Sad').
Bengali Sentence Generation: The detected emotion is mapped to a pre-defined, contextually appropriate Bengali sentence.
Final Audio Feedback: A Text-to-Speech engine converts the Bengali sentence into audio output (a feedback-generation sketch is shown below).
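The image-processing step of the flowchart is straightforward to express in code. The following is a minimal sketch using OpenCV that mirrors the described pipeline (grayscale conversion, resizing to 48x48, and normalization); the function name and file path are illustrative, not taken from the project code.

```python
import cv2
import numpy as np

def preprocess_face(image_path: str) -> np.ndarray:
    """Load a face image and prepare it for the CNN stream.

    Steps mirror the flowchart: grayscale conversion, resize to 48x48,
    and normalization of pixel values to the [0, 1] range.
    """
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    image = cv2.resize(image, (48, 48))
    image = image.astype("float32") / 255.0
    # Add the batch and channel dimensions most CNN frameworks expect: (1, 48, 48, 1).
    return image.reshape(1, 48, 48, 1)
```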
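The exact layer configuration of the `Best_FER` model is not reproduced here; the sketch below shows a dual-stream CNN-plus-LSTM design of the kind the flowchart describes, written with Keras. All layer sizes, the vocabulary size, and the sequence length are assumptions chosen for illustration, not the paper's actual hyperparameters.

```python
from tensorflow.keras import Input, Model, layers

NUM_CLASSES = 7      # Angry, Disgust, Fear, Happy, Neutral, Sad, Surprise
VOCAB_SIZE = 10_000  # assumed Bengali vocabulary size
SEQ_LEN = 20         # assumed token-sequence length

# Image stream (CNN): learns visual features from 48x48 grayscale faces.
image_in = Input(shape=(48, 48, 1), name="face_image")
x = layers.Conv2D(32, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
image_features = layers.Dense(128, activation="relu")(x)

# Feedback stream (LSTM): learns linguistic context from tokenized Bengali text.
text_in = Input(shape=(SEQ_LEN,), name="bengali_tokens")
embedded = layers.Embedding(VOCAB_SIZE, 64)(text_in)
text_features = layers.LSTM(64)(embedded)

# Feature fusion and classification over the seven emotion classes.
fused = layers.Concatenate()([image_features, text_features])
emotion_out = layers.Dense(NUM_CLASSES, activation="softmax", name="emotion")(fused)

model = Model(inputs=[image_in, text_in], outputs=emotion_out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```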
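For the final two steps, the detected emotion is mapped to a pre-defined Bengali sentence and spoken aloud. The sketch below uses gTTS as an illustrative text-to-speech engine; the paper does not specify which engine the system uses, and the example sentences are placeholders rather than the project's actual phrase set.

```python
from gtts import gTTS

# Hypothetical emotion-to-sentence mapping; the real system uses its own
# curated, contextually appropriate Bengali phrases.
EMOTION_TO_SENTENCE = {
    "Happy": "আপনাকে খুশি দেখাচ্ছে।",   # "You look happy."
    "Sad": "আপনাকে দুঃখিত দেখাচ্ছে।",    # "You look sad."
}

def speak_feedback(emotion: str, out_path: str = "feedback.mp3") -> str:
    """Convert the sentence mapped to the detected emotion into Bengali audio."""
    # Fallback sentence: "Your emotion is unclear."
    sentence = EMOTION_TO_SENTENCE.get(emotion, "আপনার অনুভূতি বোঝা যাচ্ছে না।")
    tts = gTTS(text=sentence, lang="bn")  # 'bn' selects Bengali
    tts.save(out_path)
    return out_path
```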

Exploring the Dataset

A high-quality, diverse dataset is the foundation of any successful deep learning model. This project utilized a meticulously curated dataset of over 22,500 facial expression images. The bar chart below shows the distribution of the seven core emotions within the dataset. A relatively balanced distribution is crucial to prevent the model from becoming biased towards more frequently represented emotions. Below the chart, you can see one sample image for each emotional category, providing a glimpse into the visual data the model was trained on.

Sample Images Per Emotion

(One representative image is shown for each of the seven categories: Angry, Disgust, Fear, Happy, Neutral, Sad, Surprise.)
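The class balance discussed above can be checked directly from the image folders. The following is a minimal sketch assuming a directory-per-class layout; the `dataset/train` path and the `.jpg` extension are assumptions, not details from the paper.

```python
from pathlib import Path

# Hypothetical layout: one sub-folder per emotion, e.g. dataset/train/Happy/*.jpg
DATASET_DIR = Path("dataset/train")

counts = {
    class_dir.name: sum(1 for _ in class_dir.glob("*.jpg"))
    for class_dir in sorted(DATASET_DIR.iterdir())
    if class_dir.is_dir()
}

total = sum(counts.values())
for emotion, n in counts.items():
    # Report the absolute count and share of the dataset for each emotion.
    print(f"{emotion:>10}: {n:5d} images ({n / total:.1%})")
```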

Comparative Model Performance

To validate the effectiveness of the custom-built model (`Best_FER`), its performance was benchmarked against two well-established, pre-trained models: `ResNet-50` and `MobileNetV2`. Use the buttons below to switch between the models and compare their key performance metrics and learning curves. The chart visualizes the training accuracy (the model's performance on data it has seen) versus the validation accuracy (its performance on unseen data) over the training epochs. A smaller gap between these two lines generally indicates a more robust model that generalizes well to new, real-world data.

Custom Model (Best_FER)

Training Accuracy: 65.12%

Validation Accuracy: 50.34%
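As a rough illustration of how such learning curves are produced, here is a minimal sketch assuming a Keras-style `History` object returned by `model.fit`; the variable names are illustrative and not taken from the project code.

```python
import matplotlib.pyplot as plt

def plot_learning_curves(history):
    """Plot training vs. validation accuracy over epochs from a Keras History.

    A large gap between the two curves suggests overfitting; a small gap
    suggests the model generalizes better to unseen data.
    """
    acc = history.history["accuracy"]          # accuracy on training data
    val_acc = history.history["val_accuracy"]  # accuracy on held-out data
    epochs = range(1, len(acc) + 1)

    plt.plot(epochs, acc, label="Training accuracy")
    plt.plot(epochs, val_acc, label="Validation accuracy")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()
```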

Results Analysis: A Deeper Look

This section provides a detailed breakdown of the custom `Best_FER` model's performance. The confusion matrix on the left shows how the model's predictions align with the actual emotions. The diagonal values represent correct classifications. Hover over any cell to see a detailed explanation of what that number means. The chart on the right visualizes the Precision, Recall, and F1-Score for each emotion, which are key indicators of classification quality. A higher bar indicates better performance for that specific metric and emotion.

Confusion Matrix

Class-wise Performance Metrics
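Both visualizations rest on standard metrics that can be computed with scikit-learn. The sketch below assumes `y_true` and `y_pred` are the actual and predicted class indices for the test set; the function name is illustrative.

```python
from sklearn.metrics import classification_report, confusion_matrix

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

def summarize_results(y_true, y_pred):
    """Print the confusion matrix and per-class precision, recall, and F1-score.

    y_true and y_pred are sequences of integer class indices (0-6) for the
    test set: the actual labels and the model's predictions, respectively.
    """
    # Rows are actual emotions, columns are predicted emotions;
    # diagonal entries count correct classifications.
    print(confusion_matrix(y_true, y_pred))
    # Precision, recall, and F1-score for each emotion class.
    print(classification_report(y_true, y_pred, target_names=EMOTIONS))
```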

Conclusion & Future Work

This research successfully demonstrates the potential of a real-time facial expression recognition system with integrated Bengali audio feedback. The custom `Best_FER` model, with its hybrid CNN-LSTM architecture, proved superior to standard pre-trained models, highlighting the benefits of tailored solutions for specific tasks. The system provides a solid foundation for creating more emotionally intelligent and inclusive human-computer interfaces.

Future work will focus on improving the model's robustness to real-world challenges such as varying lighting conditions, partial facial occlusions, and diverse cultural expressions. Expanding the dataset and exploring more advanced multimodal fusion techniques, potentially incorporating biosignals or speech tonality, could lead to an even more nuanced and accurate understanding of human emotion, paving the way for truly empathetic technology.