Invention Title:

ARTIFICIAL INTELLIGENCE DEVICE FOR LIGHT TRANSFORMER-BASED EMOTION RECOGNITION (LTER) AND METHOD THEREOF

Publication number:

US20250209802

Publication date:

Section:

Physics

Class:

G06V10/806

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The patent application outlines a method for emotion recognition using an artificial intelligence (AI) device. The process receives a video segment comprising multiple frames together with an accompanying audio signal. These inputs are passed through modality-specific encoders to generate embeddings, which transformer blocks then convert into feature vectors. The system treats audio as the primary modality for emotion detection, pairing it with a lightweight visual encoder for facial expression recognition. This approach aims to improve both the efficiency and the accuracy of emotion recognition while addressing privacy concerns.

Background

Emotion recognition is crucial for improving human-computer interaction, enabling systems to understand user intent and provide empathetic responses. Existing methods face challenges such as complexity, high computational demands, and over-reliance on textual data, which can lead to inaccuracies. Many systems require transmitting raw data to remote servers, raising privacy issues. Thus, there is a need for more efficient, accurate, and privacy-conscious emotion recognition solutions that can operate effectively on resource-constrained devices.

Technical Approach

The proposed method processes the audio signal through an audio encoder to produce an audio embedding, and the frames of the video segment through a visual encoder to produce visual embeddings. Transformer blocks convert these embeddings into feature vectors, and cross-attention between the audio and visual streams lets each modality enrich the other's representation for emotion detection. A fusion module then combines the feature vectors, and a classifier module predicts the emotion from the fused output. The processing is kept efficient enough to support real-time applications.
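The pipeline above can be sketched end to end. This is a minimal illustrative sketch, not the patent's actual architecture: it assumes single-head scaled dot-product cross-attention, mean-pooling with concatenation as the fusion module, and a single linear layer as the classifier; all names, dimensions, and the seven-emotion label set are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keys, values):
    """Scaled dot-product attention: one modality's tokens attend to the other's."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)      # (T_q, T_k) similarity matrix
    return softmax(scores, axis=-1) @ values  # (T_q, d) attended features

rng = np.random.default_rng(0)
d = 16
audio_emb = rng.standard_normal((20, d))   # stand-in for audio-encoder output
visual_emb = rng.standard_normal((8, d))   # stand-in for visual-encoder output

# Audio as the primary modality: audio queries attend to visual keys/values.
audio_attended = cross_attention(audio_emb, visual_emb, visual_emb)

# Fusion module (assumed): pool each stream over time and concatenate.
fused = np.concatenate([audio_emb.mean(axis=0), audio_attended.mean(axis=0)])

# Classifier module (assumed): one linear layer over the fused vector.
num_emotions = 7
W = rng.standard_normal((num_emotions, fused.shape[0]))
probs = softmax(W @ fused)
predicted = int(np.argmax(probs))
```

In a trained system the attention projections, fusion weights, and classifier would be learned parameters; random matrices stand in for them here only to show the data flow and tensor shapes.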

Innovative Aspects

A significant innovation in this method is the prioritization of audio as the primary modality for emotion recognition, paired with a visual encoder specifically trained for facial expressions: the encoder is first pre-trained on face recognition and then fine-tuned on emotion classification. The system's architecture stacks multiple transformer blocks, each combining components such as multi-head attention and normalization blocks, to optimize feature extraction and processing.
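The transformer block described above, with multi-head attention and normalization, can be sketched as follows. This is a generic pre-norm transformer block under stated simplifications, not the patent's exact design: the query/key/value and feed-forward projections are omitted (identity weights) so the head-splitting, normalization, and residual structure stay visible.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalization block: zero mean, unit variance per token."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, num_heads):
    """Split the model dimension into heads, attend per head, re-concatenate.
    Learned Q/K/V projections are replaced by identity slices for brevity."""
    T, d = x.shape
    dh = d // num_heads
    heads = []
    for h in range(num_heads):
        q = k = v = x[:, h * dh:(h + 1) * dh]
        scores = q @ k.T / np.sqrt(dh)
        heads.append(softmax(scores) @ v)
    return np.concatenate(heads, axis=-1)

def transformer_block(x, num_heads=4):
    # Normalization block -> multi-head attention -> residual connection.
    x = x + multi_head_self_attention(layer_norm(x), num_heads)
    # Normalization block -> feed-forward (ReLU, weights omitted) -> residual.
    x = x + np.maximum(layer_norm(x), 0.0)
    return x

rng = np.random.default_rng(1)
tokens = rng.standard_normal((8, 16))  # 8 tokens, model dimension 16
out = transformer_block(tokens)
```

Stacking several such blocks, as the application describes, repeatedly refines the token representations while the residual connections and normalization keep the deep stack stable to train.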

Applications and Benefits

The AI device described in this application can be deployed across various platforms thanks to its lightweight architecture and efficient processing. Because processing can run on-device, it minimizes the need to transmit raw audio or video externally, addressing privacy concerns. By generating accurate emotion predictions from combined audio and visual features, the system promises enhanced human-computer interaction across applications including personal devices, customer service interfaces, and interactive educational tools.