US20240428778
2024-12-26
Physics
G10L13/10
Parametric speech synthesis is presented as a method for generating speech in any voice, language, and accent. The system performs text-to-speech conversion using a text converter, a machine-learning model, and a decoder. The text converter transforms input text into phonemes, the machine-learning model maps those phonemes to acoustic features, and the decoder turns the features into a speech signal with the desired voice and language characteristics.
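The three stages described above (text converter, machine-learning model, decoder) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the disclosure: the tiny `LEXICON`, the function names, the use of a `[pitch_hz, energy]` pair as the "acoustic features", and the sine-wave decoder. A real system would use a trained model and a proper vocoder in place of these stand-ins.

```python
import math
from typing import Dict, List

# Hypothetical toy lexicon; a real text converter would rely on a full
# pronunciation dictionary or a grapheme-to-phoneme model.
LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def text_to_phonemes(text: str) -> List[str]:
    """Text converter: map each word to its phoneme sequence."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(LEXICON.get(word, ["<unk>"]))
    return phonemes


def predict_acoustic_features(
    phonemes: List[str], frames_per_phoneme: int = 5
) -> List[List[float]]:
    """Stand-in for the machine-learning model.

    Emits a fixed number of [pitch_hz, energy] frames per phoneme using a
    deterministic toy mapping instead of a trained network.
    """
    features: List[List[float]] = []
    for phoneme in phonemes:
        pitch_hz = 100.0 + (sum(ord(c) for c in phoneme) % 100)
        for _ in range(frames_per_phoneme):
            features.append([pitch_hz, 1.0])
    return features


def decode(
    features: List[List[float]], sample_rate: int = 16000, frame_ms: int = 10
) -> List[float]:
    """Decoder: render each feature frame as a short sine-wave segment."""
    samples_per_frame = sample_rate * frame_ms // 1000
    signal: List[float] = []
    phase = 0.0
    for pitch_hz, energy in features:
        for _ in range(samples_per_frame):
            phase += 2.0 * math.pi * pitch_hz / sample_rate
            signal.append(energy * math.sin(phase))
    return signal


def synthesize(text: str) -> List[float]:
    """Chain the three stages: text -> phonemes -> features -> signal."""
    return decode(predict_acoustic_features(text_to_phonemes(text)))
```

Calling `synthesize("hello world")` runs the whole chain; the point of the sketch is only the data flow between the stages, not the audio quality of the toy decoder.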
Historically, speech synthesis has employed two main strategies: concatenative and parametric. Concatenative synthesis assembles output from pre-recorded speech segments, yielding high quality but limited flexibility. In contrast, parametric synthesis requires less recorded data and allows for language transformation, but offers lower quality. The growing demand for multilingual and personalized speech synthesis calls for advances in parametric methods that improve both quality and flexibility.
The system comprises three key components: a text converter, a machine-learning model, and a decoder.
Training begins with speech corpora: recordings of spoken sentences paired with their transcripts. An automatic speech recognizer (ASR) analyzes each recording, converting the speech signal into Mel-Frequency Cepstral Coefficients (MFCCs) and identifying the phoneme in each frame. This frame-level information is used to train a deep recurrent neural network (DRNN) to synthesize speech across various languages and speakers.
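As a rough illustration of how the ASR output might be organized for training, the Python sketch below pairs each frame's phoneme label with that frame's MFCC vector and also collapses the per-frame labels into phoneme runs with durations. The function names, the choice of phoneme as input and MFCC vector as target, and the duration step are assumptions for illustration; the disclosure does not specify this layout.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def build_training_pairs(
    frame_phonemes: Sequence[str],
    mfcc_frames: Sequence[Sequence[float]],
) -> List[Tuple[str, List[float]]]:
    """Pair each frame's phoneme label with that frame's MFCC vector.

    Assumes the ASR emitted exactly one label per MFCC frame.
    """
    if len(frame_phonemes) != len(mfcc_frames):
        raise ValueError("expected one phoneme label per MFCC frame")
    return [(p, list(m)) for p, m in zip(frame_phonemes, mfcc_frames)]


def phoneme_durations(frame_phonemes: Sequence[str]) -> List[Tuple[str, int]]:
    """Collapse consecutive identical labels into (phoneme, frame_count)."""
    return [(p, sum(1 for _ in run)) for p, run in groupby(frame_phonemes)]


# Toy data: six frames with two-dimensional stand-ins for MFCC vectors.
labels = ["HH", "HH", "AH", "AH", "AH", "L"]
mfccs = [[0.1, 0.2], [0.1, 0.3], [0.5, 0.1], [0.5, 0.2], [0.4, 0.2], [0.9, 0.7]]

pairs = build_training_pairs(labels, mfccs)
durations = phoneme_durations(labels)  # [("HH", 2), ("AH", 3), ("L", 1)]
```

In a multi-speaker, multi-language setup, each pair would presumably also carry speaker and language identifiers, but how the disclosure encodes those is not stated here.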
The disclosed systems offer significant flexibility by enabling voice synthesis in any language with any accent. The DRNN model allows new languages and speakers to be added without extensive data for each addition. Potential applications include personalized voice instructions, reading texts aloud, and creating entertainment content in multiple languages. This approach simplifies the development of multilingual speech synthesis systems while maintaining high-quality output.