Invention Title:

HIGH FIDELITY SPEECH SYNTHESIS WITH ADVERSARIAL NETWORKS

Publication number:

US20250245507

Publication date:

2025-07-31

Section:

Physics

Class:

G06N3/084

Inventors:

Karen Simonyan 🇬🇧 London, United Kingdom

Luis Carlos Cobo Rus 🇺🇸 San Francisco, CA, United States

Sander Etienne Lea Dieleman 🇬🇧 London, United Kingdom

Norman Casagrande 🇬🇧 London, United Kingdom

Jeffrey Donahue 🇬🇧 London, United Kingdom

Aidan Clark 🇬🇧 London, United Kingdom

Erich Konrad Elsen 🇺🇸 Naperville, IL, United States

Mikolaj Binkowski 🇬🇧 London, United Kingdom

Applicant:

GDM Holding LLC 🇺🇸 Mountain View, CA, United States

Smart overview of the Invention

The patent application describes a method for generating high-fidelity speech with a generative neural network. The network takes a conditioning text input and produces audio in which that text is spoken. It is a feedforward generative neural network trained adversarially against multiple discriminators. Each discriminator predicts whether an audio example is real or was generated by the network, and these predictions are used during training to update the generator.

Adversarial Training

Training uses both conditional and unconditional discriminators. Conditional discriminators receive the audio together with the conditioning text input, so they can penalize generated speech that does not match the text. Unconditional discriminators see only the audio and therefore judge its realism independently of the text. Combining the two pushes the generator toward audio that is both realistic and faithful to the input.
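
The way the two kinds of discriminator scores could feed into training can be sketched as follows. This is a minimal illustration, not the method claimed in the application: the summary does not name a loss function, so the hinge loss used here is an assumption, and the function names are hypothetical. Each discriminator is represented only by the scores it has already produced.

```python
def hinge_discriminator_loss(real_scores, fake_scores):
    """Hinge loss for one discriminator (an assumed choice of loss).

    real_scores: scores the discriminator gave to real audio examples.
    fake_scores: scores it gave to generated audio examples.
    The loss is zero once real scores are >= 1 and fake scores are <= -1.
    """
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores_per_discriminator):
    """Generator loss summed over all discriminators.

    fake_scores_per_discriminator: one list of scores per discriminator,
    conditional and unconditional alike. The generator is rewarded when
    any discriminator scores its output as real (high score).
    """
    return -sum(sum(s) / len(s) for s in fake_scores_per_discriminator)


def total_discriminator_loss(score_pairs):
    """Sum the hinge loss over (real_scores, fake_scores) pairs,
    one pair per discriminator in the ensemble."""
    return sum(hinge_discriminator_loss(r, f) for r, f in score_pairs)
```

In this sketch a conditional discriminator would compute its scores from the audio and the conditioning text input together, and an unconditional one from the audio alone; once the scores exist, both contribute to the losses in the same way.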

Generative Network Architecture

The generative neural network is feedforward: it produces an audio example in a single forward pass, whereas an autoregressive model generates its output one time step at a time, with each step depending on the previous ones. This allows faster generation of audio examples while maintaining high quality. The network is built from convolutional layers organized into generator blocks, which process the conditioning text input in sequence to produce the audio output.
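
A toy version of a chain of generator blocks is sketched below to show the single-pass structure. It rests on several assumptions that the summary does not state: each block upsamples its input by repetition and then applies one 1-D convolution, signals are single-channel Python lists, and the kernels are fixed rather than learned. All function names are hypothetical.

```python
def conv1d_same(signal, kernel, dilation=1):
    """1-D convolution with zero padding; output length equals input length.

    The kernel is centred on each position (odd kernel length assumed).
    """
    centre = (len(kernel) - 1) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - centre) * dilation
            if 0 <= idx < len(signal):  # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def upsample(signal, factor):
    """Repeat every value `factor` times (nearest-neighbour upsampling)."""
    return [x for x in signal for _ in range(factor)]


def generator_block(signal, kernel, factor):
    """One toy block: upsample, then convolve."""
    return conv1d_same(upsample(signal, factor), kernel)


def generate(features, blocks):
    """Run the blocks in sequence in a single forward pass.

    blocks: list of (kernel, upsampling_factor) pairs.
    No output sample is fed back as an input, unlike an autoregressive model.
    """
    x = list(features)
    for kernel, factor in blocks:
        x = generator_block(x, kernel, factor)
    return x
```

With two blocks whose factors are 2 and 3, a sequence of 4 conditioning frames becomes 24 output samples, and the whole output is available after one pass through the chain.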

Advantages and Innovations

Using both types of discriminators improves adherence to the input text while still evaluating the realism of the audio on its own terms. Each discriminator can be assigned its own window size and operate on audio windows of that size, so that different discriminators assess the audio at different frequencies; this enhances realism and reduces computational complexity. In addition, dilated convolutional layers enlarge the receptive field, enabling the network to learn dependencies across a range of frequencies.
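
Two of the ideas above lend themselves to a short sketch: how quickly stacked dilated convolutions widen the receptive field, and how each discriminator might be given a window of its own size. The receptive-field formula is the standard one for stacked stride-1 convolutions. The window sampling is an assumption (the summary says only that each discriminator is assigned a window size, not how windows are chosen), and the function names and sizes are hypothetical.

```python
import random


def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of stacked stride-1 convolutions.

    Each layer with dilation d adds (kernel_size - 1) * d samples.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def sample_window(audio, window_size, rng):
    """Pick one contiguous window of `window_size` samples at a random offset."""
    start = rng.randrange(len(audio) - window_size + 1)
    return audio[start:start + window_size]


def windows_for_discriminators(audio, window_sizes, seed=0):
    """Give each discriminator (keyed by its window size) its own window.

    Random offsets are an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return {w: sample_window(audio, w, rng) for w in window_sizes}
```

For kernel size 3, four layers with dilations 1, 2, 4 and 8 cover 31 samples, whereas four undilated layers cover only 9. A discriminator that scores a 240-sample window also does far less work than one that scores a full example, which is one way to read the reduction in computational complexity.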

Implementation Details

The system is implemented as computer programs running on one or more computers in one or more locations. It comprises a generative neural network, a discriminator neural network system, and a parameter updating system. The conditioning text input can include linguistic features such as phoneme and pitch information, which the generator blocks of the neural network process to produce the audio. Together, these components support efficient training and fast generation of high-fidelity speech.
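
One possible encoding of such a conditioning input is sketched below, purely as an illustration. The summary does not specify a feature format, so the per-frame layout (a one-hot phoneme vector followed by a log-pitch value), the tiny phoneme inventory, and the function names are all hypothetical.

```python
import math

# Hypothetical phoneme inventory, far smaller than any real one.
PHONEMES = ["sil", "HH", "AH", "L", "OW"]


def encode_frame(phoneme, pitch_hz):
    """Encode one frame as a one-hot phoneme vector plus log pitch.

    Unvoiced or silent frames (pitch_hz <= 0) get a log-pitch of 0.0,
    which is a convention chosen for this sketch.
    """
    one_hot = [1.0 if p == phoneme else 0.0 for p in PHONEMES]
    log_pitch = math.log(pitch_hz) if pitch_hz > 0 else 0.0
    return one_hot + [log_pitch]


def encode_sequence(frames):
    """Encode a list of (phoneme, pitch_hz) pairs into feature vectors."""
    return [encode_frame(p, f0) for p, f0 in frames]
```

A call such as `encode_sequence([("HH", 0.0), ("AH", 120.0), ("L", 118.0), ("OW", 110.0)])` yields four vectors of length 6, which would be the sequence handed to the first generator block under these assumptions.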