Invention Title:

EMOTIVE TEXT-TO-SPEECH WITH AUTO DETECTION OF EMOTIONS

Publication number:

US20250201233

Publication date:

Section:

Physics

Class:

G10L13/08

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The patent application outlines a process for enhancing text-to-speech (TTS) systems by integrating emotion detection capabilities. It begins with obtaining input text that represents a natural language response generated by an assistant large language model (LLM) in a user conversation. Capturing the response in its conversational context lets the system interpret the user's intent and the surrounding dialogue state.

A key feature of the system is its ability to predict the emotional state of the input text. This is achieved by processing the text through the assistant LLM, which is conditioned with an emotion detection task prompt. The output is an emotional state prediction that reflects the sentiment or feeling expressed in the natural language response.
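The emotion-prediction step described above can be sketched as re-prompting the assistant LLM with a classification task prompt. The label set, the prompt wording, and the `call_llm` function are illustrative assumptions, not details taken from the application; `call_llm` is stubbed with a trivial keyword heuristic so the sketch is self-contained.

```python
# Hypothetical emotion-detection task prompt applied to the LLM's own response.
EMOTION_LABELS = ["neutral", "happy", "sad", "angry", "surprised"]

TASK_PROMPT = (
    "Classify the dominant emotion of the following assistant response. "
    "Answer with exactly one of: {labels}.\n\nResponse: {text}\nEmotion:"
)

def call_llm(prompt: str) -> str:
    """Stub standing in for the assistant LLM; a real system queries the model."""
    lowered = prompt.lower()
    if "congratulations" in lowered or "great news" in lowered:
        return "happy"
    if "sorry" in lowered:
        return "sad"
    return "neutral"

def predict_emotion(response_text: str) -> str:
    """Condition the LLM with the task prompt and parse its label."""
    prompt = TASK_PROMPT.format(
        labels=", ".join(EMOTION_LABELS), text=response_text
    )
    label = call_llm(prompt).strip().lower()
    # Fall back to neutral if the model answers off-list.
    return label if label in EMOTION_LABELS else "neutral"

print(predict_emotion("Congratulations, you passed the exam!"))  # happy
```

Constraining the model to a closed label set keeps the downstream embedding lookup deterministic, which is one plausible reason for conditioning with an explicit task prompt rather than free-form sentiment description.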

Once the emotional state is determined, an emotional embedding is created for the input text. This embedding serves as a bridge between the detected emotion and the TTS model, allowing for nuanced and contextually appropriate speech synthesis. The emotional embedding ensures that the synthesized speech mirrors the emotional tone intended by the original text.
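One plausible form for the emotional embedding is a lookup table mapping each emotion label to a fixed-dimension vector, analogous to a learned `nn.Embedding` row. In a trained system these vectors would be learned jointly with the TTS model; the random values and the 8-dimensional size below are purely illustrative.

```python
import random

EMOTION_LABELS = ["neutral", "happy", "sad", "angry", "surprised"]
EMBED_DIM = 8  # assumed embedding width, not specified by the application

# One row per emotion, standing in for a learned embedding table.
random.seed(0)
EMOTION_EMBEDDINGS = {
    label: [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for label in EMOTION_LABELS
}

def emotion_embedding(label: str) -> list[float]:
    """Look up the embedding for a predicted emotion label."""
    # Unknown labels default to the neutral embedding.
    return EMOTION_EMBEDDINGS.get(label, EMOTION_EMBEDDINGS["neutral"])

vec = emotion_embedding("happy")
print(len(vec))  # 8
```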

The TTS model plays a critical role in transforming the input text and its associated emotional embedding into synthesized speech. By leveraging advanced speech synthesis techniques, the system generates speech that not only conveys linguistic content but also accurately reflects the predicted emotional state. This results in more engaging and human-like interactions.
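How the TTS model consumes the embedding is not spelled out in the overview; a common conditioning pattern (assumed here) is to broadcast the utterance-level emotion vector and concatenate it onto each text-frame feature before synthesis:

```python
def condition_on_emotion(text_features: list[list[float]],
                         emotion_vec: list[float]) -> list[list[float]]:
    """Append the utterance-level emotion embedding to every text-frame feature."""
    return [frame + emotion_vec for frame in text_features]

# Toy example: 3 text frames of dimension 4, emotion embedding of dimension 2.
frames = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
conditioned = condition_on_emotion(frames, [0.9, -0.5])
print(len(conditioned), len(conditioned[0]))  # 3 6
```

The conditioned frames would then feed the acoustic model in place of the plain text features, so the same sentence can be rendered with different prosody depending on the predicted emotion.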

In summary, this innovation enhances traditional TTS systems by incorporating emotion detection and synthesis, leading to more expressive and effective communication between users and AI assistants. By focusing on emotional context, it aims to improve user satisfaction and interaction quality in conversational AI applications.