US20240411997
2024-12-12
Physics
G06F40/30
A novel system integrates large language models (LLMs) with multimodal inputs to create a more engaging and empathetic user experience. The system maintains a response latency of under 500 milliseconds, allowing real-time interaction on consumer hardware. It interprets the user's emotional state through sentiment analysis of messages, facial expressions, and voice parameters, enabling the LLM to tailor its responses to the user's mood and mental state. Responses also take past interactions into account and are directed toward a desired emotional outcome.
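A minimal sketch of how a single conversational turn might be checked against the sub-500 ms target described above; the function names and the placeholder LLM call are illustrative assumptions, not elements of the disclosure.

    import time

    LATENCY_BUDGET_S = 0.5  # the sub-500 ms response target stated above

    def generate_reply(prompt):
        # Stand-in for the actual LLM call; hypothetical placeholder.
        return "placeholder reply"

    def timed_reply(prompt):
        # Generate a reply and report whether the turn stayed within the latency budget.
        start = time.monotonic()
        reply = generate_reply(prompt)
        elapsed = time.monotonic() - start
        return reply, elapsed, elapsed <= LATENCY_BUDGET_S

    reply, elapsed, within_budget = timed_reply("Hello!")
    print(f"{elapsed * 1000:.1f} ms, within budget: {within_budget}")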
Large language models such as ChatGPT and Llama are designed for general-purpose language generation, learning from vast amounts of text data. These models perform tasks like text classification and generation by predicting subsequent words from the input text. The disclosed system seeks to improve on these models by incorporating mood detection based on expressive behavioral and linguistic cues, combining linguistic and paralinguistic elements from multimodal inputs to craft mood-aware prompts that enrich the user experience.
The system utilizes various inputs, including facial expressions, voice features, text sentiment, and physiological measurements, to infer the user's mood and mental states. These inputs inform the customization of LLM prompts to foster an empathetic interaction. Key factors considered include the user's current mood while interacting with the LLM, reactions to previous responses, sentiment analysis of past interactions, and the system's empathetic objectives. This dynamic adjustment aims to align the LLM's outputs with desired emotional states.
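A minimal sketch of the prompt customization described above, assuming the detected mood, the recent sentiment trend, and the empathetic goal are available as plain descriptions; the parameter names and prompt wording are hypothetical, not text from the disclosure.

    def build_mood_aware_prompt(user_message, detected_mood, recent_sentiment, empathetic_goal):
        # Assemble an LLM prompt that encodes the user's inferred mood, the sentiment
        # trend of past interactions, and the desired emotional outcome.
        instructions = (
            f"The user currently appears {detected_mood}. "
            f"Sentiment over recent turns has been {recent_sentiment}. "
            f"Respond in a way that gently moves the user toward feeling {empathetic_goal}."
        )
        return f"{instructions}\n\nUser: {user_message}\nAssistant:"

    prompt = build_mood_aware_prompt(
        user_message="I bombed my presentation today.",
        detected_mood="frustrated and low-energy",
        recent_sentiment="increasingly negative",
        empathetic_goal="calmer and more encouraged",
    )
    print(prompt)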
Central to the system is the Valence-Arousal-Dominance (VAD) model, which defines emotions across three axes: valence (pleasantness), arousal (alertness), and dominance (control). The system sets empathetic goals to adjust user moods within this VAD space. It measures improvement by comparing the distance between current and target moods, adjusting responses accordingly. The goal is to produce outputs that mimic or enhance user moods, achieved through a prompt customization module that tailors responses based on multimodal input analysis.
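One way to quantify the improvement measure described above is a Euclidean distance in VAD space; the coordinate scale and the example values below are illustrative assumptions, since the disclosure does not fix a particular metric or range.

    import math

    def vad_distance(current, target):
        # Euclidean distance between two (valence, arousal, dominance) points.
        return math.sqrt(sum((c - t) ** 2 for c, t in zip(current, target)))

    # Illustrative values on a -1..1 scale per axis (an assumption).
    current_mood = (-0.6, 0.7, -0.2)   # unpleasant, agitated, low sense of control
    target_mood = (0.4, 0.2, 0.3)      # empathetic goal: calmer, more positive
    post_response_mood = (-0.2, 0.5, 0.0)  # mood estimated after the system's reply

    before = vad_distance(current_mood, target_mood)
    after = vad_distance(post_response_mood, target_mood)
    print(f"distance before: {before:.2f}, after: {after:.2f}, improved: {after < before}")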
The described system can operate on user devices or via cloud services, capturing video and audio through standard device cameras and microphones. It also employs non-contact techniques like photoplethysmography for physiological data collection. The system architecture includes modules for text input processing, mood estimation from voice and facial data, and a multimodal mood fusion module that integrates these inputs into a cohesive understanding of user states. This enables the LLM to generate responses that are emotionally congruent with its empathetic goals.
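A minimal sketch of a multimodal mood fusion step, assuming each modality yields its own VAD estimate with a confidence weight; the confidence-weighted averaging scheme and the example values are assumptions, as the disclosure does not specify the fusion method.

    def fuse_moods(estimates):
        # Confidence-weighted average of per-modality (valence, arousal, dominance) estimates.
        # `estimates` maps modality name -> ((v, a, d), confidence).
        total = sum(conf for _, conf in estimates.values())
        fused = [0.0, 0.0, 0.0]
        for vad, conf in estimates.values():
            for i, axis in enumerate(vad):
                fused[i] += axis * conf / total
        return tuple(fused)

    # Illustrative per-modality estimates (values and confidences are assumptions).
    fused_mood = fuse_moods({
        "text_sentiment": ((-0.5, 0.3, -0.1), 0.6),
        "facial_expression": ((-0.3, 0.6, -0.2), 0.8),
        "voice_features": ((-0.4, 0.7, -0.3), 0.7),
        "physiology_ppg": ((0.0, 0.8, 0.0), 0.4),  # e.g. elevated arousal from remote photoplethysmography
    })
    print(tuple(round(x, 2) for x in fused_mood))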