US20240242703
2024-07-18
Physics
G10L13/027
An information processing device generates artificial speech data from multiple forms of input. Its circuitry extracts emotion features from speech data, together with the times at which those features occur, and combines them with text data derived from the same speech data to produce artificial speech that is more realistic and emotionally nuanced.
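As a minimal sketch of the data this paragraph describes, the following Python structures pair extracted emotion features and their timing with the text derived from the same speech; every name and field here is an illustrative assumption, not an interface from the patent.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class EmotionFeature:
        label: str         # e.g. "happy" or "sad"
        start_s: float     # time the feature begins, in seconds
        end_s: float       # time the feature ends, in seconds
        intensity: float   # normalized strength in [0, 1]

    @dataclass
    class SpeechAnalysis:
        text: str          # text data derived from the speech data
        emotions: List[EmotionFeature] = field(default_factory=list)

    def emotions_at(analysis: SpeechAnalysis, t: float) -> List[EmotionFeature]:
        """Return the emotion features active at time t (in seconds)."""
        return [e for e in analysis.emotions if e.start_s <= t < e.end_s]

Keeping the timing on each feature lets a synthesizer vary the emotional coloring over the course of an utterance rather than applying one emotion to the whole output.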
Human speech is not solely about the words spoken; it conveys emotions that enhance communication. Traditional text-to-speech systems often lack this emotional depth, resulting in robotic-sounding outputs. By integrating emotional features into artificial speech generation, the device aims to produce output that more closely mimics the nuances of natural human conversation.
The device combines text transcription, audio analysis, and visual cues from video to enhance speech generation. By analyzing both the audio and video signals, it captures emotional information that informs the synthetic voice output. This approach is intended to improve the realism of voice cloning and dubbing applications.
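One plausible reading of this multimodal step is a late fusion of per-modality emotion scores. The sketch below, with an invented weighting and invented score dictionaries, averages hypothetical audio and video classifier outputs into a single estimate; the patent does not specify this scheme.

    from typing import Dict

    def fuse_emotion_scores(
        audio_scores: Dict[str, float],
        video_scores: Dict[str, float],
        audio_weight: float = 0.6,  # assumed weighting, not from the patent
    ) -> Dict[str, float]:
        """Late-fuse per-emotion scores from the audio and video analyses."""
        labels = set(audio_scores) | set(video_scores)
        fused = {}
        for label in labels:
            a = audio_scores.get(label, 0.0)
            v = video_scores.get(label, 0.0)
            fused[label] = audio_weight * a + (1.0 - audio_weight) * v
        return fused

    # Example: the audio hears mild excitement while the video sees a smile.
    fused = fuse_emotion_scores({"excited": 0.4}, {"happy": 0.7, "excited": 0.2})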
The circuitry within the device includes processors, memory, and interfaces for communication and input/output operations. It extracts speech data from various sources, including digital recordings and live inputs, and generates text data through transcription or translation; this text serves as the foundation for artificial speech enriched with emotional context.
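Viewed as a processing pipeline, the stages in this paragraph might chain together as in the sketch below. Every function is a hypothetical stub standing in for the patent's circuitry, not a real API, so each stage simply passes data through.

    def extract_speech(source: bytes) -> bytes:
        return source  # stub: would isolate speech from a recording or live input

    def transcribe(speech: bytes) -> str:
        return "hello world"  # stub: speech-to-text

    def translate(text: str, language: str) -> str:
        return text  # stub: optional translation into the target language

    def extract_emotions(speech: bytes) -> list:
        return [("neutral", 0.0, 1.0)]  # stub: (label, start_s, end_s) features

    def synthesize(text: str, emotions: list) -> bytes:
        return text.encode()  # stub: emotion-conditioned speech synthesis

    def generate_emotional_speech(source: bytes, target_language: str = "en") -> bytes:
        """Chain the described stages: extract, transcribe/translate, synthesize."""
        speech = extract_speech(source)
        text = translate(transcribe(speech), target_language)
        emotions = extract_emotions(speech)
        return synthesize(text, emotions)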
Artificial speech generation relies on a trained model that incorporates acoustic and emotional parameters. Machine learning techniques such as neural networks refine the generation process, and training on examples of emotional speech teaches the model to replicate human-like emotional expression in synthesized voices.
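Under very loose assumptions, the training this paragraph alludes to could look like the PyTorch sketch below: a toy network that conditions on a learned emotion embedding alongside text features and is fit to examples of emotional speech. The architecture, dimensions, and batch are invented for illustration and do not reflect the patent's actual model.

    import torch
    import torch.nn as nn

    class EmotionalTTS(nn.Module):
        """Toy synthesis model conditioned on text features and an emotion label."""
        def __init__(self, text_dim=64, n_emotions=8, emo_dim=16, mel_dim=80):
            super().__init__()
            self.emotion_emb = nn.Embedding(n_emotions, emo_dim)  # learned emotional parameters
            self.decoder = nn.Sequential(
                nn.Linear(text_dim + emo_dim, 256),
                nn.ReLU(),
                nn.Linear(256, mel_dim),  # predicts acoustic (mel-spectrogram) frames
            )

        def forward(self, text_feats, emotion_ids):
            emo = self.emotion_emb(emotion_ids)  # (batch, emo_dim)
            return self.decoder(torch.cat([text_feats, emo], dim=-1))

    # One illustrative training step on a fabricated batch of "emotional speech".
    model = EmotionalTTS()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    text_feats = torch.randn(4, 64)           # stand-in text encodings
    emotion_ids = torch.randint(0, 8, (4,))   # stand-in emotion labels
    target_mels = torch.randn(4, 80)          # stand-in acoustic targets
    loss = nn.functional.mse_loss(model(text_feats, emotion_ids), target_mels)
    loss.backward()
    opt.step()

The design choice illustrated here is that emotion enters as a conditioning input rather than a post-processing effect, which is consistent with the claim that the model itself incorporates both acoustic and emotional parameters.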