Invention Title:

SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS

Publication number:

US20250095630

Section:

Physics

Class:

G10L13/04

Smart overview of the Invention

The patent application describes a system for synthesizing speech from text using neural networks, designed to replicate the voice of a target speaker. The process begins with an audio sample from the target speaker and the input text to be synthesized. A speaker encoder engine generates a speaker vector from the audio sample, which a spectrogram generation engine then uses to create an audio representation of the input text in the target speaker's voice. The synthesized speech can then be output for various applications.
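The two-stage flow above (speaker encoder → spectrogram generator) can be sketched with toy stand-ins. The function names and internals below are illustrative assumptions, not the patent's actual networks; real engines would be trained neural models.

```python
import numpy as np

def speaker_encoder(audio_sample: np.ndarray, dim: int = 4) -> np.ndarray:
    """Toy stand-in for the speaker encoder engine: maps a waveform to a
    fixed-size speaker vector (here, just per-position frame averages)."""
    frames = audio_sample.reshape(-1, dim)  # assumes length divisible by dim
    return frames.mean(axis=0)

def spectrogram_generator(text: str, speaker_vec: np.ndarray) -> np.ndarray:
    """Toy stand-in for the spectrogram generation engine: one spectrogram
    frame per character, conditioned on the speaker vector."""
    char_codes = np.array([ord(c) for c in text], dtype=float)
    # each character frame is scaled by the speaker vector
    return np.outer(char_codes, speaker_vec)

audio = np.arange(16.0)              # stand-in for the target speaker's sample
vec = speaker_encoder(audio)         # speaker vector from the audio sample
spec = spectrogram_generator("hi", vec)
print(spec.shape)                    # (2, 4): one frame per input character
```

The point is only the data flow: the speaker vector is computed once from the sample and then conditions every frame of the generated spectrogram.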

Technical Field

The system operates within the field of speech synthesis using neural networks. Neural networks are complex machine learning models that consist of multiple layers, each performing specific transformations on input data to generate desired outputs. These networks can be trained to recognize and reproduce specific patterns, such as human speech, by adjusting parameters through a process known as training.
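The "each layer performs a specific transformation" idea can be shown with a single minimal layer, a linear map followed by a nonlinearity. This is a generic sketch, not any layer from the patent's networks.

```python
import numpy as np

def layer(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One network layer: linear transform of the input, then ReLU.
    Training adjusts the parameters w and b."""
    return np.maximum(0.0, x @ w + b)

x = np.array([1.0, -2.0])   # input data
w = np.eye(2)               # weights (identity here, for illustration)
b = np.zeros(2)             # biases
out = layer(x, w, b)
print(out)                  # [1. 0.]: the negative component is clipped
```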

System Functionality

The system is capable of generating speech in the voice of speakers not included in its training data, using only brief untranscribed audio samples. It employs a sequence-to-sequence model to convert phonemes or graphemes into spectrograms conditioned on speaker embeddings. These embeddings are derived from a speaker encoder network trained on diverse datasets. This decoupling allows independent training for speaker verification and text-to-speech synthesis, optimizing data requirements for each task.

Advantages and Applications

This approach offers several advantages: it can synthesize speech from brief audio samples without requiring transcriptions, and it can generate speech in languages different from the one spoken in the sample audio. Because speaker modeling and speech synthesis are trained separately, the system can adapt to new speakers without extensive data collection or fine-tuning, making it suitable for applications such as real-time translation.

Implementation Details

The system utilizes a speaker verification neural network to generate embedding vectors that represent speaker characteristics. These vectors are computed by processing overlapping audio segments and averaging their embeddings. The spectrogram generation neural network uses these embeddings alongside phoneme or grapheme sequences to predict mel spectrograms, which are then converted into time-domain representations by a vocoder for playback. This architecture allows for flexible adaptation to new speakers while maintaining high-quality speech synthesis.
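The overlapping-segment averaging step can be sketched as follows. The window and hop sizes, the toy per-segment encoder, and the final L2 normalization are all assumptions for illustration; the text above specifies only that overlapping segments are embedded and their embeddings averaged.

```python
import numpy as np

def utterance_embedding(audio: np.ndarray, window: int = 800,
                        hop: int = 400, encode=None) -> np.ndarray:
    """Embed overlapping windows of the audio and average the results.
    The L2 normalization at the end is a common convention, assumed here."""
    if encode is None:
        # toy per-segment "encoder": two summary statistics per window
        encode = lambda seg: np.array([seg.mean(), seg.std()])
    embs = [encode(audio[s:s + window])
            for s in range(0, len(audio) - window + 1, hop)]
    avg = np.mean(embs, axis=0)          # average over overlapping segments
    return avg / np.linalg.norm(avg)

audio = np.arange(2000.0)    # stand-in for an untranscribed audio sample
emb = utterance_embedding(audio)
print(emb.shape)             # (2,): one fixed-size vector per utterance
```

A vector computed this way would then condition the spectrogram network, whose mel-spectrogram output a vocoder converts to a time-domain waveform for playback.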