Invention Title:

Automatic Speech Recognition with Voice Personalization and Generalization

Publication number:

US20240127803

Publication date:

2024-04-18

Section:

Physics

Class:

G10L15/18

Inventor:

Keyvan Mohajer (Los Gatos, CA, United States)

Assignee:

SoundHound, Inc. (Santa Clara, CA, United States)

Applicant:

SoundHound, Inc. (Santa Clara, CA, United States)

Smart overview of the Invention

A voice morphing model converts speech from many different voices into a single target voice or a small number of target voices. An acoustic model then processes the morphed audio, and because that audio has already been morphed to match a target voice, speech recognition accuracy improves. The two models can be trained separately or jointly, giving flexibility in how high accuracy in automatic speech recognition (ASR) is reached.

Training Process for Voice Morphing

Training the voice morphing model starts by generating morphed speech audio from input audio, which can vary in length and source. A voiceprint calculator derives a voiceprint from the morphed audio, and that voiceprint is compared to a target voiceprint to measure similarity. Training then adjusts the morphing model's parameters to reduce the difference between the two voiceprints, so that the morphed audio increasingly resembles the target voice.
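The comparison-and-adjust loop described above can be sketched in a few lines. This is a toy illustration only, not the method of the publication: the morphing model is reduced to a hypothetical per-dimension gain vector, the voiceprint calculator to a simple L2 normalization, the similarity measure to cosine distance, and the parameter update to finite-difference gradient descent. All function names (`morph`, `voiceprint`, `voiceprint_loss`, `train_gains`) are assumptions made for this example.

```python
import math


def morph(features, gains):
    """Toy morphing model: scale each feature dimension by a learned gain."""
    return [f * g for f, g in zip(features, gains)]


def voiceprint(features):
    """Stand-in voiceprint calculator: L2-normalize the feature vector."""
    norm = math.sqrt(sum(x * x for x in features))
    return [x / norm for x in features]


def voiceprint_loss(morphed_vp, target_vp):
    """Cosine distance: 0 when the two voiceprints point the same way."""
    dot = sum(a * b for a, b in zip(morphed_vp, target_vp))
    norm_a = math.sqrt(sum(a * a for a in morphed_vp))
    norm_b = math.sqrt(sum(b * b for b in target_vp))
    return 1.0 - dot / (norm_a * norm_b)


def train_gains(features, target_vp, steps=200, lr=0.5, eps=1e-4):
    """Adjust the gains to reduce the voiceprint loss.

    Uses a forward finite-difference gradient estimate (an assumption made
    to keep the sketch dependency-free; a real system would backpropagate).
    """
    gains = [1.0] * len(features)
    for _ in range(steps):
        base = voiceprint_loss(voiceprint(morph(features, gains)), target_vp)
        grads = []
        for i in range(len(gains)):
            bumped = gains[:]
            bumped[i] += eps
            bumped_loss = voiceprint_loss(
                voiceprint(morph(features, bumped)), target_vp
            )
            grads.append((bumped_loss - base) / eps)
        gains = [g - lr * d for g, d in zip(gains, grads)]
    return gains


if __name__ == "__main__":
    source_features = [1.0, 2.0, 3.0]          # made-up input features
    target_vp = voiceprint([3.0, 2.0, 1.0])    # made-up target voiceprint
    learned = train_gains(source_features, target_vp)
    before = voiceprint_loss(voiceprint(source_features), target_vp)
    after = voiceprint_loss(voiceprint(morph(source_features, learned)), target_vp)
    print(f"loss before: {before:.4f}, after: {after:.6f}")
```

The point of the sketch is the shape of the loop: morph, compute a voiceprint, score it against the target, and update the morphing parameters in the direction that lowers the score.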

Noise and Distortion Reduction

To improve the quality of the morphed audio, the training loss function also includes components that account for noise and distortion. Because the model is penalized for these factors during training, it learns to produce clearer and more natural-sounding output, which in turn improves speech recognition performance.
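One plausible way to fold noise and distortion into the loss is a weighted sum of separate terms. The sketch below assumes that form; the source does not state how the terms are combined, and the weight values, the `LossWeights` class, and the `morphing_loss` function are hypothetical names chosen for illustration. The three input terms are taken as already-computed non-negative scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical relative weights for each loss component."""
    voiceprint: float = 1.0
    noise: float = 0.5
    distortion: float = 0.5


def morphing_loss(voiceprint_term, noise_term, distortion_term,
                  weights=LossWeights()):
    """Weighted sum of voiceprint mismatch, noise, and distortion scores.

    Lower is better for every term, so lowering the total pushes the
    morphing model toward the target voice and toward cleaner output.
    """
    return (weights.voiceprint * voiceprint_term
            + weights.noise * noise_term
            + weights.distortion * distortion_term)


if __name__ == "__main__":
    noisy = morphing_loss(voiceprint_term=0.2, noise_term=0.4, distortion_term=0.3)
    clean = morphing_loss(voiceprint_term=0.2, noise_term=0.1, distortion_term=0.1)
    print(noisy > clean)  # same voice match, but the cleaner output scores lower
```

With this structure, two outputs that match the target voice equally well are still ranked by how clean they are, which is the behavior the paragraph above describes.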

Acoustic Model Training

The acoustic model infers phonemes from speech audio, which can be either spoken or synthesized. Training compares the recognized phonemes against transcriptions and uses the mismatches to refine the model. Training on one consistent target voice, or on a small set of target voices, improves recognition across diverse user populations.
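The comparison of recognized phonemes against a transcription can be illustrated with a phoneme error rate based on edit distance. This is a sketch of the comparison step only: the source does not name a specific metric or loss, and gradient-based training would need a differentiable loss rather than this count. The function names and the ARPAbet-style phoneme labels are assumptions for the example.

```python
def phoneme_edit_distance(recognized, reference):
    """Levenshtein distance between two phoneme sequences.

    Counts the minimum number of substitutions, insertions, and deletions
    needed to turn `recognized` into `reference`.
    """
    prev = list(range(len(reference) + 1))
    for i, rec in enumerate(recognized, start=1):
        curr = [i]
        for j, ref in enumerate(reference, start=1):
            cost = 0 if rec == ref else 1
            curr.append(min(prev[j] + 1,          # drop a recognized phoneme
                            curr[j - 1] + 1,      # add a missing phoneme
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def phoneme_error_rate(recognized, reference):
    """Edit distance normalized by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return phoneme_edit_distance(recognized, reference) / len(reference)


if __name__ == "__main__":
    reference = ["HH", "AH", "L", "OW"]    # transcription of "hello"
    recognized = ["HH", "EH", "L", "OW"]   # one substituted phoneme
    print(phoneme_error_rate(recognized, reference))  # 0.25
```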

Joint and Multiple Target Training

The morphing and acoustic models can also be trained jointly, which improves how well they work together. In addition, the voice morphing model can be trained for multiple target speakers, so that it morphs audio into one of several voices selected by a target embedding. This lets the system accommodate different speaker characteristics while maintaining high recognition accuracy.
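Selecting a target voice by embedding can be pictured as conditioning the morphing model's input on that embedding. The sketch below uses one common conditioning approach, appending the target embedding to every input frame; the source does not say how the embedding is supplied to the model, so this is an assumption. The `TARGET_EMBEDDINGS` table, its values, and the `condition_on_target` function are hypothetical.

```python
# Hypothetical table of target-speaker embeddings (values are made up).
TARGET_EMBEDDINGS = {
    "target_a": [1.0, 0.0],
    "target_b": [0.0, 1.0],
}


def condition_on_target(frames, target_id, targets=TARGET_EMBEDDINGS):
    """Append the chosen target embedding to each audio feature frame.

    A morphing model fed these conditioned frames can learn to produce a
    different output voice for each embedding.
    """
    embedding = targets[target_id]
    return [list(frame) + list(embedding) for frame in frames]


if __name__ == "__main__":
    frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # two toy feature frames
    for row in condition_on_target(frames, "target_b"):
        print(row)
```

Switching `target_id` changes only the appended values, so the same model and the same input audio can be steered toward a different target voice.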