US20240127803
2024-04-18
Physics
G10L15/18
A voice morphing model converts arbitrary input voices into one target voice or a small number of target voices. It can be paired with an acoustic model that processes the morphed audio, so the acoustic model only has to recognize speech in the target voice, which improves the accuracy of automatic speech recognition (ASR). The two models can be trained separately or jointly.
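The arrangement described above can be sketched as a two-stage pipeline. This is only an illustrative outline: the names `recognize_with_morphing`, `Morpher`, and `AcousticModel` are hypothetical, and both models are represented as plain callables rather than any particular network architecture.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases; the source does not fix any data representation.
Audio = Sequence[float]
Morpher = Callable[[Audio], Audio]
AcousticModel = Callable[[Audio], List[str]]


def recognize_with_morphing(
    audio: Audio, morph: Morpher, acoustic_model: AcousticModel
) -> List[str]:
    """Morph the input audio toward the target voice, then infer phonemes."""
    morphed = morph(audio)          # stage 1: voice morphing model
    return acoustic_model(morphed)  # stage 2: acoustic model on morphed audio
```

Because the acoustic model only sees the output of `morph`, either stage can be swapped or retrained without changing the calling code.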
Training the voice morphing model involves generating morphed speech audio from input audio, which can vary in length and source. A voiceprint calculator derives a voiceprint from the morphed audio, and that voiceprint is compared to a target voiceprint to measure their similarity. Training adjusts the model parameters based on this comparison, minimizing the difference between the two voiceprints so that the morphed audio closely resembles the target voice.
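One way to turn the voiceprint comparison into a training signal is a distance between the two voiceprint vectors. The sketch below assumes voiceprints are fixed-length vectors and uses cosine distance; the source does not name a specific similarity measure, so both the metric and the function names are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two voiceprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def voiceprint_loss(morphed_vp: Sequence[float], target_vp: Sequence[float]) -> float:
    """Assumed loss: 0 when the voiceprints point the same way, up to 2 when opposite."""
    return 1.0 - cosine_similarity(morphed_vp, target_vp)
```

A training loop would lower this value by updating the morphing model's parameters; the gradient computation itself is left to whatever framework is used.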
To improve the quality of the morphed audio, the training process also accounts for noise and distortion. Including noise and distortion terms in the loss function teaches the model to produce clearer, more natural-sounding output, which in turn improves speech recognition performance.
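A loss that includes noise and distortion alongside voiceprint similarity could be formed as a weighted sum of the individual terms. The sketch below is a minimal illustration under that assumption; the weighting scheme, the default weights, and the names `LossWeights` and `total_morphing_loss` are hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical relative weights for each loss term."""
    voiceprint: float = 1.0
    noise: float = 0.1
    distortion: float = 0.1


def total_morphing_loss(
    voiceprint_term: float,
    noise_term: float,
    distortion_term: float,
    weights: Optional[LossWeights] = None,
) -> float:
    """Weighted sum of the voiceprint, noise, and distortion penalties."""
    w = weights if weights is not None else LossWeights()
    return (
        w.voiceprint * voiceprint_term
        + w.noise * noise_term
        + w.distortion * distortion_term
    )
```

Raising the noise or distortion weight pushes training toward cleaner output at the possible cost of a weaker match to the target voiceprint.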
The acoustic model infers phonemes from speech audio, which can be either spoken or synthesized. Training compares the recognized phonemes against transcriptions to refine the model's accuracy. Because the acoustic model receives audio in a single consistent target voice or a small set of voices, recognition improves across diverse user populations.
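The comparison of recognized phonemes against a transcription can be quantified with a phoneme error rate based on edit distance. The sketch below is an evaluation-style illustration only: the source does not state which comparison is used, and actual training would typically rely on a differentiable loss instead. The function names and the phoneme labels in the usage note are assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the length of the reference transcription."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, with the reference `["HH", "AH", "L", "OW"]` and the hypothesis `["HH", "EH", "L", "OW"]`, one of four phonemes is substituted, so the error rate is 0.25.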
The morphing and acoustic models can also be trained jointly, which improves their compatibility. In addition, the voice morphing model can be adapted for multiple target speakers, morphing audio into whichever target voice is selected by a speaker embedding. This lets the system accommodate different speaker characteristics while maintaining high recognition accuracy.
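Selecting a target voice by embedding can be sketched as conditioning the morphing model's input on a per-speaker vector. The example below assumes the embedding is concatenated to every feature frame; this conditioning scheme, the function name `condition_on_target`, and the speaker identifiers are hypothetical, since the source does not describe how embeddings are supplied to the model.

```python
from typing import Dict, List, Sequence


def condition_on_target(
    frames: Sequence[Sequence[float]],
    target_embeddings: Dict[str, Sequence[float]],
    target_id: str,
) -> List[List[float]]:
    """Append the chosen target-speaker embedding to each input feature frame."""
    if target_id not in target_embeddings:
        raise KeyError(f"unknown target speaker: {target_id}")
    embedding = list(target_embeddings[target_id])
    return [list(frame) + embedding for frame in frames]


# Usage: two hypothetical target voices, each with a 2-dimensional embedding.
embeddings = {"voice_a": [1.0, 0.0], "voice_b": [0.0, 1.0]}
conditioned = condition_on_target([[0.1, 0.2], [0.3, 0.4]], embeddings, "voice_b")
```

Switching `target_id` changes only the appended vector, so a single morphing model could serve several target voices.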