Invention Title:

Method and System for a Parametric Speech Synthesis

Publication number:

US20240428778

Publication date:

2024-12-26

Section:

Physics

Class:

G10L13/10

Inventors:

Ophir Frieder (Chevy Chase, MD, United States)

Joe Garman (Washington, DC, United States)

Applicant:

Georgetown University (Washington, DC, United States)

Smart overview of the Invention

The disclosed parametric speech synthesis method generates speech in any voice, language, and accent. Its text-to-speech system has three parts: a text converter, a machine-learning model, and a decoder. The text converter transforms input text into phonemes, the machine-learning model turns those phonemes into acoustic features, and the decoder uses the features to create a speech signal that mimics the desired voice and language characteristics.

Background

Historically, speech synthesis has employed two main strategies: concatenative and parametric. Concatenative synthesis stitches together pre-recorded speech segments, which yields high quality but limited flexibility. In contrast, parametric synthesis requires less recorded data and allows for language transformation, but offers lower quality. The growing demand for multilingual and personalized speech synthesis calls for parametric methods that improve on both quality and flexibility.

System Components

The system comprises several key components:

Text Converter: Converts input text into phonemes using stored phonemes from various languages.
Machine-Learning Model: A neural network that stores voice patterns for multiple individuals and processes phonemes to generate acoustic features.
Decoder: Generates a speech signal from the acoustic features, simulating the speaker's voice in the desired language and accent.
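
The three components above form a simple pipeline, which can be sketched as follows. This is a minimal illustration, not the patented implementation: the lexicon, the `AcousticFrame` fields, the function names, and the rule-based stand-in for the machine-learning model are all assumptions made for the example. A real system would use a trained neural network and a proper vocoder in place of the sine-tone decoder.

```python
import math
from dataclasses import dataclass

# Hypothetical toy lexicon; a real text converter would store phonemes
# for many languages.
LEXICON = {"hi": ["HH", "AY"], "no": ["N", "OW"]}


def text_to_phonemes(text):
    """Text converter: look up each word's phonemes in the stored lexicon."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


@dataclass
class AcousticFrame:
    pitch_hz: float    # illustrative stand-in for real acoustic features
    duration_s: float


def phonemes_to_features(phonemes, speaker_pitch_hz):
    """Stand-in for the machine-learning model: one frame per phoneme.

    A trained network would predict these values; here they follow a
    fixed rule so the example stays self-contained.
    """
    return [AcousticFrame(speaker_pitch_hz + 10.0 * i, 0.05)
            for i, _ in enumerate(phonemes)]


def decode(frames, sample_rate=8000):
    """Decoder: render each frame as a sine tone and concatenate the samples."""
    samples = []
    for frame in frames:
        n = int(round(frame.duration_s * sample_rate))
        for k in range(n):
            samples.append(
                math.sin(2 * math.pi * frame.pitch_hz * k / sample_rate))
    return samples


def synthesize(text, speaker_pitch_hz=120.0):
    """Run the full pipeline: text -> phonemes -> features -> signal."""
    phonemes = text_to_phonemes(text)
    frames = phonemes_to_features(phonemes, speaker_pitch_hz)
    return decode(frames)
```

Calling `synthesize("hi no")` returns a list of audio samples. Changing `speaker_pitch_hz` is a crude analogue of selecting a different stored voice.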

Methodology

The process begins with training the system on speech corpora, which consist of recordings of spoken sentences with corresponding transcripts. An automatic speech recognizer (ASR) analyzes each recording: it converts the speech signal into Mel-Frequency Cepstral Coefficients (MFCCs) and identifies the phoneme spoken in each frame. These per-frame phoneme labels are then used to train a deep recurrent neural network (DRNN) to synthesize speech across various languages and speakers.
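
To make the per-frame labeling concrete, the sketch below splits a signal into overlapping frames and collapses one phoneme label per frame into phoneme segments with durations. It rests on several assumptions that do not come from the patent summary: the 25 ms frame length and 10 ms hop are common defaults rather than stated values, the labels are hand-written stand-ins for ASR output, and the MFCC computation itself is omitted. Collapsing labels into durations is one common way to use such alignments, not necessarily the disclosed one.

```python
from itertools import groupby


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def collapse_labels(frame_labels, hop_ms):
    """Run-length encode per-frame phoneme labels into (phoneme, duration_ms)."""
    return [(phoneme, sum(1 for _ in run) * hop_ms)
            for phoneme, run in groupby(frame_labels)]


# Assumed analysis settings: 16 kHz audio, 25 ms frames, 10 ms hop.
sample_rate = 16000
frame_len = sample_rate * 25 // 1000   # 400 samples
hop = sample_rate * 10 // 1000         # 160 samples

samples = [0.0] * 1200                 # placeholder audio (75 ms of silence)
frames = frame_signal(samples, frame_len, hop)

# Hypothetical ASR output: one phoneme label per frame.
frame_labels = ["HH", "HH", "AY", "AY", "AY", "sil"]
segments = collapse_labels(frame_labels, hop_ms=10)
# segments == [("HH", 20), ("AY", 30), ("sil", 10)]
```

Each frame would normally be converted to an MFCC vector before labeling. The resulting (phoneme, duration) pairs, together with the acoustic features, are the kind of aligned data a recurrent model could be trained on.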

Applications and Flexibility

The disclosed systems offer significant flexibility by enabling voice synthesis in any language with any accent. The DRNN model allows for seamless integration of new languages and speakers without needing extensive data for each new addition. Potential applications include personalized voice instructions, reading aloud texts, and creating entertainment content in multiple languages. This approach simplifies the development of multilingual speech synthesis systems while maintaining high-quality output.