Invention Title:

EXPRESSING EMOTION IN SPEECH FOR CONVERSATIONAL AI SYSTEMS AND APPLICATIONS

Publication number:

US20250173938

Publication date:

Section:

Physics

Class:

G06T13/40

Inventors:

Applicant:

Smart overview of the Invention

The patent application describes systems and methods for expressing emotion in speech for conversational AI systems and applications. Using machine learning models, these systems determine both the emotional state associated with speech to be output by a character and values for variables tied to that emotional state and the speech itself, such as intensity, pitch, rate, volume, and emphasis. The character then outputs speech that accurately reflects these emotional nuances, enhancing interaction with the user.
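
To make the role of these variables concrete, the following minimal Python sketch models an emotional state together with its speech-related variables. The class and field names are hypothetical illustrations, not structures taken from the patent application.

```python
from dataclasses import dataclass, field

@dataclass
class EmotionalSpeechState:
    """Hypothetical container for an emotional state and its speech variables."""
    emotion: str               # e.g. "happy" or "sad"
    intensity: float           # strength of the emotion, e.g. in [0.0, 1.0]
    pitch: float               # relative pitch adjustment
    rate: float                # speaking-rate multiplier
    volume: float              # loudness scaling
    emphasis: list[str] = field(default_factory=list)  # words to stress

# The same sentence can be paired with different emotional states.
happy = EmotionalSpeechState("happy", 0.8, 1.2, 1.1, 1.0, ["good"])
sad = EmotionalSpeechState("sad", 0.5, 0.9, 0.85, 0.8)
```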

Background

Modern applications, such as gaming and multimedia platforms, often feature animated characters or digital avatars that interact with users. Traditionally, these systems determine a character's emotional state solely from text analysis, which can lead to inaccurate emotional expression. For instance, the phrase "Have a good day" could be spoken happily or sadly depending on context that text alone cannot capture. This limitation results in less realistic interactions and a narrower range of expressed emotion.

Innovative Approach

The disclosed systems improve upon conventional methods by using additional inputs alongside text to determine emotional states more accurately. These inputs include user data and character data, which let the system account for the context of an interaction. The approach also yields values for additional speech-related variables, such as pitch and volume, so characters can express a broader and more realistic range of emotions.
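
A minimal sketch of how such context might influence the emotion decision is shown below, assuming simple placeholder rules in place of the machine learning model described in the application; all names here are hypothetical.

```python
def determine_emotion(text: str, user_data: dict, character_data: dict) -> str:
    """Hypothetical context-aware emotion selection.

    The same text can map to different emotions depending on the surrounding
    context, which text-only analysis cannot capture.
    """
    # Placeholder rules standing in for a learned model.
    if user_data.get("recent_outcome") == "lost_game":
        return "sympathetic"
    if character_data.get("persona") == "cheerful_guide":
        return "happy"
    return "neutral"

# "Have a good day" rendered differently depending on context.
print(determine_emotion("Have a good day", {"recent_outcome": "lost_game"}, {}))  # sympathetic
print(determine_emotion("Have a good day", {}, {"persona": "cheerful_guide"}))    # happy
```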

Detailed Methodology

The system processes input data from users or characters using machine learning models designed to generate text and to determine emotions. Input data may include text, audio spectrograms, images, or profile information. By applying this data to a first and a second machine learning model, the system generates the text to be spoken and determines the associated emotional state along with corresponding variable values, such as intensity or pitch.
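
The two-stage flow could be sketched roughly as follows; the functions below are stand-ins for the first and second models, and their names, inputs, and outputs are assumptions made for illustration.

```python
def first_model_generate_text(inputs: dict) -> str:
    """Stand-in for the first model, which generates the text to be spoken."""
    # A real implementation would condition a language model on the inputs
    # (text, audio spectrograms, images, profile information).
    return "Have a good day"

def second_model_determine_state(inputs: dict, text: str) -> dict:
    """Stand-in for the second model, which assigns an emotion and variable values."""
    return {"emotion": "happy", "intensity": 0.7, "pitch": 1.1, "rate": 1.0, "volume": 1.0}

inputs = {"user_text": "I have to log off now.", "user_profile": {"name": "Alex"}}
response_text = first_model_generate_text(inputs)
state = second_model_determine_state(inputs, response_text)
```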

Speech Generation and Output

Once the emotional state is determined, a third machine learning model generates audio data that expresses that emotion in speech. This model accounts for variables such as intensity and pitch to produce audio that accurately conveys the intended emotional state. These determinations are updated continuously as the interaction proceeds, refining the emotional expression in real time as the character communicates with users or other characters.
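
A hedged sketch of this final stage is shown below; the synthesis function is a placeholder for the third model, and the silent waveform only illustrates the interface between the emotional state and the generated audio.

```python
import numpy as np

def third_model_synthesize(text: str, state: dict, sample_rate: int = 22050) -> np.ndarray:
    """Stand-in for the third model: text plus emotional state in, audio samples out.

    A real model would condition synthesis on the emotion and its variables
    (intensity, pitch, rate, volume); the silent placeholder waveform here
    only illustrates the interface.
    """
    duration_s = max(1.0, 0.4 * len(text.split())) / state.get("rate", 1.0)
    samples = np.zeros(int(sample_rate * duration_s), dtype=np.float32)
    return samples * state.get("volume", 1.0)

# Re-run each conversational turn so the expression is refined as the
# interaction proceeds.
audio = third_model_synthesize("Have a good day",
                               {"emotion": "happy", "rate": 1.0, "volume": 1.0})
```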