US20240242703
2024-07-18
Physics
G10L13/027
An information processing device generates artificial speech data from multiple forms of input. Its circuitry extracts emotion features from speech data, together with the times at which those features occur, and combines them with text data derived from the same speech data to produce artificial speech that is more realistic and emotionally nuanced.
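As a minimal sketch of the data this paragraph describes, the following Python structures pair extracted emotion features and their timing with the text derived from the same speech; every name and field here is an illustrative assumption, not an interface from the patent.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class EmotionFeature:
        label: str         # e.g. "happy" or "sad"
        start_s: float     # time the feature begins, in seconds
        end_s: float       # time the feature ends, in seconds
        intensity: float   # normalized strength in [0, 1]

    @dataclass
    class SpeechAnalysis:
        text: str          # text data derived from the speech data
        emotions: List[EmotionFeature] = field(default_factory=list)

    def emotions_at(analysis: SpeechAnalysis, t: float) -> List[EmotionFeature]:
        """Return the emotion features active at time t (in seconds)."""
        return [e for e in analysis.emotions if e.start_s <= t < e.end_s]

Keeping the timing on each feature lets a synthesizer vary the emotional coloring over the course of an utterance rather than applying one emotion to the whole output.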
Human speech is not solely about the words spoken; it conveys emotions that enhance communication. Traditional text-to-speech systems often lack this emotional depth, resulting in robotic-sounding outputs. By integrating emotional features into artificial speech generation, the device aims to produce output that more closely mimics the nuances of natural human conversation.
The device combines text transcription, audio analysis, and visual cues from video to enhance speech generation. By analyzing both the audio and video signals, it captures emotional information that informs the synthetic voice output. This approach is intended to improve the realism of voice cloning and dubbing applications.
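One plausible reading of this multimodal step is a late fusion of per-modality emotion scores. The sketch below, with an invented weighting and invented score dictionaries, averages hypothetical audio and video classifier outputs into a single estimate; the patent does not specify this scheme.

    from typing import Dict

    def fuse_emotion_scores(
        audio_scores: Dict[str, float],
        video_scores: Dict[str, float],
        audio_weight: float = 0.6,  # assumed weighting, not from the patent
    ) -> Dict[str, float]:
        """Late-fuse per-emotion scores from the audio and video analyses."""
        labels = set(audio_scores) | set(video_scores)
        fused = {}
        for label in labels:
            a = audio_scores.get(label, 0.0)
            v = video_scores.get(label, 0.0)
            fused[label] = audio_weight * a + (1.0 - audio_weight) * v
        return fused

    # Example: the audio hears mild excitement while the video sees a smile.
    fused = fuse_emotion_scores({"excited": 0.4}, {"happy": 0.7, "excited": 0.2})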
The circuitry within the device includes processors, memory, and interfaces for communication and input/output operations. It extracts speech data from various sources, including digital recordings and live inputs, and generates text data through transcription or translation; this text serves as the foundation for artificial speech enriched with emotional context.
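Viewed as a processing pipeline, the stages in this paragraph might chain together as in the sketch below. Every function is a hypothetical stub standing in for the patent's circuitry, not a real API, so each stage simply passes data through.

    def extract_speech(source: bytes) -> bytes:
        return source  # stub: would isolate speech from a recording or live input

    def transcribe(speech: bytes) -> str:
        return "hello world"  # stub: speech-to-text

    def translate(text: str, language: str) -> str:
        return text  # stub: optional translation into the target language

    def extract_emotions(speech: bytes) -> list:
        return [("neutral", 0.0, 1.0)]  # stub: (label, start_s, end_s) features

    def synthesize(text: str, emotions: list) -> bytes:
        return text.encode()  # stub: emotion-conditioned speech synthesis

    def generate_emotional_speech(source: bytes, target_language: str = "en") -> bytes:
        """Chain the described stages: extract, transcribe/translate, synthesize."""
        speech = extract_speech(source)
        text = translate(transcribe(speech), target_language)
        emotions = extract_emotions(speech)
        return synthesize(text, emotions)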
Artificial speech generation relies on a trained model that incorporates acoustic and emotional parameters. Machine learning techniques such as neural networks refine the generation process, and training on examples of emotional speech teaches the model to replicate human-like emotional expression in synthesized voices.
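Under very loose assumptions, the training this paragraph alludes to could look like the PyTorch sketch below: a toy network that conditions on a learned emotion embedding alongside text features and is fit to examples of emotional speech. The architecture, dimensions, and batch are invented for illustration and do not reflect the patent's actual model.

    import torch
    import torch.nn as nn

    class EmotionalTTS(nn.Module):
        """Toy synthesis model conditioned on text features and an emotion label."""
        def __init__(self, text_dim=64, n_emotions=8, emo_dim=16, mel_dim=80):
            super().__init__()
            self.emotion_emb = nn.Embedding(n_emotions, emo_dim)  # learned emotional parameters
            self.decoder = nn.Sequential(
                nn.Linear(text_dim + emo_dim, 256),
                nn.ReLU(),
                nn.Linear(256, mel_dim),  # predicts acoustic (mel-spectrogram) frames
            )

        def forward(self, text_feats, emotion_ids):
            emo = self.emotion_emb(emotion_ids)  # (batch, emo_dim)
            return self.decoder(torch.cat([text_feats, emo], dim=-1))

    # One illustrative training step on a fabricated batch of "emotional speech".
    model = EmotionalTTS()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    text_feats = torch.randn(4, 64)           # stand-in text encodings
    emotion_ids = torch.randint(0, 8, (4,))   # stand-in emotion labels
    target_mels = torch.randn(4, 80)          # stand-in acoustic targets
    loss = nn.functional.mse_loss(model(text_feats, emotion_ids), target_mels)
    loss.backward()
    opt.step()

The design choice illustrated here is that emotion enters as a conditioning input rather than a post-processing effect, which is consistent with the claim that the model itself incorporates both acoustic and emotional parameters.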