Invention Title:

Speech Synthesizer and Method for Speech Synthesis

Publication number:

US20250182737

Section:

Physics

Class:

G10L13/027

Smart overview of the Invention

The application describes a speech synthesizer designed to improve the naturalness and emotional expressiveness of synthetic speech. It incorporates a processor with a speech analysis module for analyzing and processing natural-language content, and an emotion module for emotional modeling. These components work alongside a neural-network-based artificial intelligence (AI) system that is trained on both machine-generated and human-generated data to enhance the emotional quality of the synthesized speech.

Technical Components

The system includes several key components: a microphone for capturing audio, a memory module for storing acoustic data, and a processor that analyzes this data. The processor is equipped with modules for both speech analysis and emotional modeling, and these modules are connected to a neural-network AI system that suggests emotional adjustments based on its training data. Because that training data is enriched through human interaction, the system can produce more nuanced emotional expression in synthetic speech.
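
As a rough illustration of this arrangement, the following Python sketch wires the described components together. It is a minimal, hypothetical mock-up under these assumptions, not the patent's implementation; every class and method name (AcousticMemory, SpeechAnalysisModule, EmotionModule, and so on) is invented for clarity.

    # Hypothetical sketch of the described component arrangement.
    # All names are illustrative; none are taken from the patent text.
    from dataclasses import dataclass, field

    @dataclass
    class AcousticMemory:
        """Memory module storing captured acoustic data."""
        frames: list = field(default_factory=list)

        def store(self, samples):
            self.frames.append(samples)

    class SpeechAnalysisModule:
        """Analyzes and processes natural-language content."""
        def analyze(self, samples):
            return {"features": samples}  # placeholder feature extraction

    class EmotionModule:
        """Requests emotional adjustments from the neural-network AI."""
        def __init__(self, ai_system):
            self.ai = ai_system  # trained on machine- and human-generated data

        def suggest_adjustments(self, features):
            return self.ai.predict(features)

    class Synthesizer:
        """Microphone audio flows through memory, analysis, and emotion modeling."""
        def __init__(self, memory, analysis, emotion):
            self.memory, self.analysis, self.emotion = memory, analysis, emotion

        def process(self, mic_samples):
            self.memory.store(mic_samples)  # store acoustic data
            features = self.analysis.analyze(mic_samples)
            return self.emotion.suggest_adjustments(features)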

Methodology

The method for speech synthesis involves playing back synthetic or human speech, capturing listeners' responses in real time, converting those responses into machine-processable data, and using that data to train the neural network. This iterative process refines the AI's ability to model emotions in speech. The captured responses span a range of human emotions, such as joy, fear, and curiosity, providing detailed feedback that helps the AI system refine its suggestions for emotional modeling.
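
A minimal sketch of this loop, assuming placeholder callbacks for playback, response capture, and digitization, might look as follows; the function names and the example emotion labels are illustrative only.

    # Hypothetical sketch of the iterative training loop: play speech,
    # capture the listener's response, digitize it, and retrain the model.
    EMOTIONS = ["joy", "fear", "curiosity"]  # examples named in the application

    def training_iteration(model, utterance, play, capture, digitize):
        play(utterance)              # play back synthetic or human speech
        response = capture()         # real-time human response
        sample = digitize(response)  # machine-processable training datum
        model.train_on([sample])     # refine the emotional model
        return sample

    def run(model, utterances, play, capture, digitize, epochs=3):
        for _ in range(epochs):      # iterative refinement
            for u in utterances:
                training_iteration(model, u, play, capture, digitize)

    # Example wiring with stub callbacks:
    class StubModel:
        def train_on(self, batch):
            pass

    run(StubModel(), ["hello"], play=print,
        capture=lambda: {"expression": "joy"},
        digitize=lambda r: {"label": r["expression"]})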

Human Interaction and Feedback

Unlike traditional Speech Emotion Recognition (SER) systems, which rely solely on measurable acoustic features, this synthesizer incorporates feedback from human listeners. That feedback is integrated into the AI's training data, enabling more accurate and nuanced emotional expression in synthetic speech. In particular, human listeners' ability to discern subtle emotional cues, such as smiling, is digitized and used to enhance the AI's modeling capabilities.
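
One plausible way to fold such listener feedback into training data is to encode perceived cues as labeled vectors, as in the sketch below. The cue set and encoding are assumptions for illustration, not details from the application.

    # Hypothetical encoding of human listener feedback into a training
    # label. Unlike purely acoustic SER features, these values carry
    # perceptual judgments (e.g., "the speaker sounds like they are smiling").
    def encode_feedback(clip_id, listener_ratings):
        """listener_ratings maps cue name -> perceived strength in 0.0..1.0."""
        cues = ["smiling", "warmth", "hesitation"]  # illustrative cue set
        vector = [listener_ratings.get(cue, 0.0) for cue in cues]
        return {"clip": clip_id, "cue_vector": vector}

    sample = encode_feedback("utt_0042", {"smiling": 0.8, "warmth": 0.6})
    # -> {'clip': 'utt_0042', 'cue_vector': [0.8, 0.6, 0.0]}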

Prototype Results

A prototype of the speech synthesizer demonstrated significant improvements in the naturalness of synthetic speech, particularly in Western languages. The integration of pitch-contour training through human interaction contributed to more animated and emotionally expressive output. The system uses deep learning models such as the Generative Pre-trained Transformer (GPT) to process natural language, drawing on common libraries and tools to extend its speech synthesis capabilities.
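
Pitch-contour training of the kind mentioned above might, in the simplest case, amount to learning a per-utterance correction to the fundamental-frequency (F0) track. The sketch below shows one such adjustment; the liveliness and shift parameters are invented for illustration and do not come from the application.

    # Hypothetical pitch-contour adjustment: exaggerate F0 movement around
    # the utterance mean to make the contour sound more animated.
    def adjust_pitch_contour(f0_hz, liveliness=1.2, shift_hz=5.0):
        """Scale pitch excursions and shift voiced frames; 0.0 marks unvoiced."""
        voiced = [f for f in f0_hz if f > 0]
        if not voiced:
            return f0_hz
        mean_f0 = sum(voiced) / len(voiced)
        return [mean_f0 + liveliness * (f - mean_f0) + shift_hz if f > 0 else 0.0
                for f in f0_hz]

    # A flat contour becomes slightly more animated:
    print(adjust_pitch_contour([110.0, 115.0, 0.0, 120.0]))
    # -> [114.0, 120.0, 0.0, 126.0]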