Invention Title:

REAL-TIME SYSTEM FOR SPOKEN NATURAL STYLISTIC CONVERSATIONS WITH LARGE LANGUAGE MODELS

Publication number:

US20250363978

Publication date:

Section:

Physics

Class:

G10L13/10

Inventors:

Applicant:

Smart overview of the Invention

The disclosed techniques enable spoken, natural, stylistic conversations with large language models, moving beyond traditional text-based interaction. A user's speech is converted to text, and a prompt engine analyzes the sentiment it expresses. The system then generates a text response together with a style cue reflecting emotion, which a text-to-speech engine transforms back into speech, producing an experience akin to human conversation.

Background

Large language models have gained popularity due to their capabilities in processing various forms of data. Unlike many earlier AI models, they use self-attention mechanisms to capture complex context and generate new content. Despite this potential, interaction with such models is often restricted to text input. This limitation hinders broader applications and user accessibility, particularly for users without technical expertise or in scenarios where typing is impractical.

Summary of Techniques

The invention introduces a natural language interface for spoken conversations with large language models. A prompt engine assesses user speech for sentiment, which guides the model's responses. This approach broadens the applicability of language models beyond text inputs, allowing users to interact more naturally and intuitively through speech. The system uses training data and sentiment analysis to generate responses that feel lifelike and engaging.

System Functionality

A large language model equipped with a conversational profile uses training data to learn conversational patterns. Upon receiving user speech input, the system performs speech-to-text conversion and sentiment analysis. The model then crafts a text response based on this analysis and selects an appropriate sentiment or style cue. This response is converted into audio output with added inflection to express emotion, creating an immersive conversational experience.
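The turn-by-turn flow described above can be sketched as a single pipeline function. This is an illustrative outline only: the disclosure does not specify interfaces for its engines, so the four callables, the `StyledResponse` type, and the `STYLE_CUES` list below are all hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical style cues a model might attach to a response (illustrative,
# not an enumeration from the disclosure).
STYLE_CUES = ("neutral", "cheerful", "empathetic", "apologetic")

@dataclass
class StyledResponse:
    text: str        # the text reply generated by the language model
    style_cue: str   # sentiment/style cue used to shape the spoken output

def converse(speech_audio: bytes,
             speech_to_text,
             analyze_sentiment,
             generate_reply,
             text_to_speech) -> bytes:
    """One turn of the spoken-conversation pipeline.

    The four callables stand in for the engines named in the disclosure:
    a speech-to-text engine, the prompt engine's sentiment analysis, the
    large language model, and a text-to-speech engine.
    """
    user_text = speech_to_text(speech_audio)      # speech-to-text conversion
    sentiment = analyze_sentiment(user_text)      # prompt-engine sentiment analysis
    # The model crafts a text response and selects a style cue.
    reply: StyledResponse = generate_reply(user_text, sentiment)
    # Text-to-speech renders the reply with inflection matching the cue.
    return text_to_speech(reply.text, reply.style_cue)
```

In a real system each callable would wrap a model or service; keeping them as parameters makes the turn loop testable with stubs.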

Technical Implementation

The described system processes user input by continuously or periodically converting speech to text. A prompt engine evaluates sentiment through word choice and vocal subtleties such as tone and volume. The language model generates a response consistent with its learned conversational patterns, attaching a style cue for emotional expression. A text-to-speech engine then transforms the response into audio output, making conversations not only informative but also emotionally resonant.
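The combination of lexical and vocal signals can be illustrated with a toy scoring scheme. Everything here is an assumption for illustration: the word lists, the pitch/volume normalization, and the weights `w_lex` and `w_voice` are invented, not taken from the disclosure, which only states that word choice, tone, and volume inform the sentiment estimate.

```python
# Toy lexicons (hypothetical; a real system would use a trained classifier).
POSITIVE_WORDS = {"great", "thanks", "love", "happy"}
NEGATIVE_WORDS = {"bad", "angry", "hate", "problem"}

def lexical_score(text: str) -> float:
    """Word-choice signal in [-1, 1]: fraction of positive minus negative words."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    return (pos - neg) / len(words)

def combined_sentiment(text: str,
                       pitch_delta: float,
                       volume_delta: float,
                       w_lex: float = 0.6,
                       w_voice: float = 0.4) -> str:
    """Blend word choice with vocal subtleties into a coarse sentiment label.

    pitch_delta / volume_delta: deviation from the speaker's baseline,
    normalized to [-1, 1]; raised pitch and volume are read as heightened
    emotion. The weights are illustrative, not from the disclosure.
    """
    vocal = 0.5 * pitch_delta + 0.5 * volume_delta
    score = w_lex * lexical_score(text) + w_voice * vocal
    if score > 0.1:
        return "positive"
    if score < -0.1:
        return "negative"
    return "neutral"
```

The resulting label could then condition both the language model's prompt and the style cue passed to the text-to-speech engine.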