US20250245507
2025-07-31
Physics
G06N3/084
The patent application describes a method for generating high-fidelity speech with a generative neural network. The system takes a conditioning text input and produces audio of that text being spoken. The generator is a feedforward network trained adversarially against multiple discriminators. Each discriminator predicts whether an audio example is real or synthetic, and those predictions are used to update the generator during training.
Training uses both conditional and unconditional discriminators. Conditional discriminators receive the audio together with the conditioning text input, so they can check that the generated speech matches the text. Unconditional discriminators receive only the audio and judge its realism without regard to the text. Combining the two improves the system's ability to generate audio that is both realistic and faithful to the input.
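The difference between the two discriminator types is mostly one of interface, which a minimal sketch can make concrete. Everything named below is hypothetical and not taken from the application: the type aliases, the toy scoring functions, and in particular the hinge loss, which is an assumed choice because the summary does not say which adversarial loss is used.

```python
from typing import Callable, List, Sequence

Audio = Sequence[float]
Features = Sequence[float]

# Hypothetical interfaces: each discriminator returns a real-valued score,
# where a higher score means "looks more like real audio".
ConditionalD = Callable[[Audio, Features], float]
UnconditionalD = Callable[[Audio], float]


def discriminator_hinge_loss(real_score: float, fake_score: float) -> float:
    """Hinge loss for one discriminator (an assumed choice of loss)."""
    return max(0.0, 1.0 - real_score) + max(0.0, 1.0 + fake_score)


def generator_loss(
    fake_audio: Audio,
    features: Features,
    conditional: List[ConditionalD],
    unconditional: List[UnconditionalD],
) -> float:
    """Negated average score over the whole discriminator ensemble."""
    scores = [d(fake_audio, features) for d in conditional]
    scores += [d(fake_audio) for d in unconditional]
    return -sum(scores) / len(scores)


def toy_unconditional(audio: Audio) -> float:
    # Toy stand-in: looks at the audio only and penalises large jumps.
    jumps = [abs(b - a) for a, b in zip(audio, audio[1:])]
    return 1.0 - sum(jumps) / len(jumps)


def toy_conditional(audio: Audio, features: Features) -> float:
    # Toy stand-in: compares audio energy with the conditioning features.
    energy = sum(x * x for x in audio) / len(audio)
    target = sum(features) / len(features)
    return 1.0 - abs(energy - target)


audio = [0.0, 0.5, 0.0, -0.5]
features = [0.125]
loss = generator_loss(audio, features, [toy_conditional], [toy_unconditional])
```

In a real system each discriminator would be a trained network; the point here is only that the conditional ones take the conditioning features as a second argument while the unconditional ones do not, and that the generator is scored against all of them together.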
The generative neural network is feedforward: it produces the audio output in a single forward pass, unlike autoregressive models, which generate audio one time step at a time. This allows faster generation of audio examples while maintaining high quality. The network is built from convolutional layers organized into generator blocks, which process the conditioning text input in sequence to produce the audio output.
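The contrast between feedforward and autoregressive generation can be sketched in a few lines. This is an illustration under simplifying assumptions, not the architecture from the application: the block here is a single convolution followed by tanh, the kernels are arbitrary odd-length lists, and the sequence length is kept fixed, whereas a real speech generator would also raise the feature rate to the audio sample rate.

```python
import math
from typing import List, Sequence


def conv1d_same(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1-D convolution, zero-padded so the output keeps the input length."""
    half = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            i = t + k - half
            if 0 <= i < len(x):
                acc += w * x[i]
        out.append(acc)
    return out


def generator_block(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """One hypothetical generator block: convolution followed by tanh."""
    return [math.tanh(v) for v in conv1d_same(x, kernel)]


def feedforward_generate(
    features: Sequence[float], kernels: Sequence[Sequence[float]]
) -> List[float]:
    """Every output position is produced together in one forward pass."""
    h = list(features)
    for kernel in kernels:  # blocks are applied one after another
        h = generator_block(h, kernel)
    return h


def autoregressive_generate(features: Sequence[float], weight: float) -> List[float]:
    """Contrast case: each sample depends on the previously generated one."""
    out, prev = [], 0.0
    for f in features:  # one sequential step per output sample
        prev = math.tanh(f + weight * prev)
        out.append(prev)
    return out


features = [0.0, 1.0, 0.0, -1.0]
parallel = feedforward_generate(features, [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]])
sequential = autoregressive_generate(features, weight=0.5)
```

In the feedforward version the number of sequential stages equals the number of blocks, regardless of how long the output is. In the autoregressive version it grows with the number of output samples, because each step must wait for the previous one.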
Using both types of discriminator improves adherence to the input text and makes the evaluation of audio samples more diverse. Each discriminator can be assigned its own window size, so that different discriminators operate on the audio at different frequencies, which improves realism and reduces computational complexity. In addition, dilated convolutional layers enlarge the receptive fields, allowing the network to learn dependencies at a range of frequencies.
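Two of the ideas in this paragraph are easy to check numerically: giving each discriminator a window of its own size, and the way dilation enlarges a receptive field. The sketch below is illustrative only. The window sizes, the choice of a random window offset, and the dilation schedule are assumptions, not values from the application. The receptive-field formula is the standard one for a stack of stride-1 convolutions with the same kernel size.

```python
import random
from typing import List, Sequence


def random_window(audio: Sequence[float], size: int, rng: random.Random) -> List[float]:
    """Return a contiguous window of `size` samples at a random offset."""
    if size > len(audio):
        raise ValueError("window is larger than the audio example")
    start = rng.randint(0, len(audio) - size)  # randint is inclusive at both ends
    return list(audio[start:start + size])


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


rng = random.Random(0)
audio = [0.0] * 4800
window_sizes = [240, 480, 960]  # hypothetical: one size per discriminator
windows = [random_window(audio, size, rng) for size in window_sizes]

plain = receptive_field(3, [1, 1, 1, 1])    # 9 samples
dilated = receptive_field(3, [1, 2, 4, 8])  # 31 samples
```

Each discriminator processes only its own window rather than the full example, which is where the saving in computation comes from. With the same four layers and kernel size 3, the dilated stack covers 31 samples instead of 9.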
The system is implemented as computer programs that can run on computers in multiple locations. Its components include the generative neural network, a discriminator network system, and a parameter updating system. The conditioning text input can include linguistic features such as phonemes and pitch information, which are processed by the generator blocks of the neural network. Together, these components support efficient training and the generation of high-fidelity speech.