US20240169974
2024-05-23
Physics
G10L13/10
A novel system allows users to engage in spoken conversations with large language models, moving beyond traditional text-based interactions. The system converts user speech audio into text and employs a prompt engine to analyze the sentiment behind the spoken input. This sentiment informs the responses generated by the large language model, which is trained on diverse conversational data, allowing it to produce text responses that mimic human-like emotional expressions.
Current methods for interacting with large language models primarily rely on text inputs, which can limit accessibility for users unfamiliar with technology. This restriction can lead to intimidation or confusion, especially for those without a technical background. The focus on text-only interactions also confines the potential applications of these models, making it challenging to incorporate them into more dynamic and engaging environments.
The proposed system introduces a natural language interface that enables users to have fluid spoken conversations with large language models. By employing a conversational profile that includes training data and a sentiment set, the model can effectively analyze user input, determine the associated sentiment, and generate appropriate responses. This setup enhances the conversational experience by allowing the model to respond in a way that aligns with the emotional context of the user’s speech.
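The conversational profile described above, pairing training data with a sentiment set, can be pictured as a simple data structure. This is a minimal illustrative sketch, not the patent's actual implementation; all names (`ConversationalProfile`, `training_corpus`, `sentiment_set`) are assumptions introduced here.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationalProfile:
    # References to the diverse conversational data the model is trained on.
    training_corpus: list[str]
    # The sentiment set: maps a detected sentiment to a style cue the
    # model can attach to its text response.
    sentiment_set: dict[str, str] = field(default_factory=dict)

profile = ConversationalProfile(
    training_corpus=["customer-support dialogs", "casual chit-chat transcripts"],
    sentiment_set={
        "frustrated": "calm, reassuring tone",
        "excited": "upbeat, energetic tone",
        "neutral": "even, conversational tone",
    },
)

# Looking up a style cue for a detected sentiment, falling back to neutral:
cue = profile.sentiment_set.get("excited", profile.sentiment_set["neutral"])
```

Keeping the sentiment set as plain sentiment-to-cue mappings lets the same profile drive both response generation and the later text-to-speech stage.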
When a user provides speech input, it is converted into text through speech-to-text translation. The prompt engine then performs sentiment analysis on this translation, considering not only the words but also vocal nuances such as tone and inflection. With this information, the large language model formulates a text response and selects an appropriate style cue from its sentiment set to convey emotion effectively.
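The sentiment step above weighs both the transcript and vocal features before a style cue is chosen. A minimal rule-based sketch of that idea follows; the keyword lists, pitch/energy thresholds, and sentiment categories are illustrative assumptions, standing in for whatever analysis the prompt engine actually performs.

```python
def analyze_sentiment(transcript: str, pitch_hz: float, energy: float) -> str:
    """Combine lexical cues with vocal nuances (tone ~ pitch, inflection ~ energy)."""
    text = transcript.lower()
    # Positive wording, or a high-pitched energetic delivery, reads as excited.
    if any(w in text for w in ("thanks", "great", "love")) or (pitch_hz > 220 and energy > 0.7):
        return "excited"
    # Complaint wording, or a very forceful delivery, reads as frustrated.
    if any(w in text for w in ("broken", "again", "annoyed")) or energy > 0.9:
        return "frustrated"
    return "neutral"

def select_style_cue(sentiment: str, sentiment_set: dict[str, str]) -> str:
    """Pick the matching style cue from the sentiment set, defaulting to neutral."""
    return sentiment_set.get(sentiment, sentiment_set["neutral"])

sentiment = analyze_sentiment("This is great, thanks!", pitch_hz=240.0, energy=0.8)
```

A production system would replace these rules with a learned classifier over the audio and transcript, but the interface, sentiment in, style cue out, stays the same.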
The final output consists of an audio response generated through a text-to-speech engine, which interprets both the text response and the style cue. This process ensures that the audio output carries emotional inflections that reflect the sentiment determined from the user's input. As a result, users can enjoy immersive conversations that feel natural and lifelike, enhancing overall interaction with large language models.
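The full loop described in the paragraphs above can be sketched end to end. The speech-to-text, language-model, and text-to-speech stages below are stand-in stubs (their names and behavior are assumptions made for illustration); a real system would call actual engines in their place, with the text-to-speech stage interpreting both the text response and the style cue.

```python
def speech_to_text(audio: bytes) -> str:
    # Stub STT stage: a real engine would transcribe the audio.
    return "I love this, it works perfectly!"

def generate_response(prompt: str, sentiment: str) -> str:
    # Stub LLM stage: a real model would condition on the sentiment.
    return f"Glad to hear it! ({sentiment} reply)"

def text_to_speech(text: str, style_cue: str) -> bytes:
    # Stub TTS stage: a real engine would render audio whose emotional
    # inflection follows the style cue; here the cue is just prepended.
    return f"[{style_cue}] {text}".encode()

SENTIMENT_SET = {"excited": "upbeat tone", "neutral": "even tone"}

def converse(audio_in: bytes) -> bytes:
    transcript = speech_to_text(audio_in)
    sentiment = "excited" if "love" in transcript else "neutral"
    reply = generate_response(transcript, sentiment)
    cue = SENTIMENT_SET[sentiment]
    return text_to_speech(reply, cue)
```

The key design point the patent describes is visible in the last stage: the style cue travels alongside the text response, so the audio output carries the emotion determined from the user's input.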