US20240185844
2024-06-06
Physics
G10L15/18
An automatic speech recognition (ASR) model processes input audio by receiving a sequence of acoustic frames. An audio encoder generates higher-order feature representations for these frames, and a context encoder generates context embeddings from previous transcriptions, improving the model's ability to understand and predict speech accurately.
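As a rough illustration, the sketch below (in PyTorch) shows an audio encoder turning log-mel acoustic frames into higher-order features and a context encoder embedding the wordpieces of a previous transcription. The layer choices and dimensions are assumptions for illustration, not the patent's specification.

# Minimal sketch, assuming illustrative sizes: an LSTM audio encoder and an
# embedding-based context encoder. Not the patent's exact architecture.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)

    def forward(self, frames):              # frames: (batch, T, n_mels)
        feats, _ = self.lstm(frames)        # higher-order features: (batch, T, hidden)
        return feats

class ContextEncoder(nn.Module):
    def __init__(self, vocab_size=1024, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, prev_wordpieces):     # prev_wordpieces: (batch, U_prev) token ids
        return self.embed(prev_wordpieces)  # context embeddings: (batch, U_prev, dim)

frames = torch.randn(1, 200, 80)            # ~2 s of 10 ms acoustic frames (toy input)
prev = torch.randint(0, 1024, (1, 12))      # wordpiece ids of the previous transcription
audio_feats = AudioEncoder()(frames)
context_emb = ContextEncoder()(prev)
print(audio_feats.shape, context_emb.shape)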
Conventional ASR systems often transcribe audio segments independently, which can cause inaccuracies when context from previous utterances is ignored. The proposed model incorporates this contextual information through a joint network that combines the outputs of the audio encoder, context encoder, and prediction network to improve recognition performance.
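A minimal sketch of how such a joint network might fuse the three signals, in the additive style common to RNN-T models; the projection sizes and additive fusion here are assumptions for illustration rather than the claimed formulation.

# Hedged sketch: fuse one encoded acoustic frame, the prediction-network state,
# and a pooled context vector into logits over wordpieces.
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    def __init__(self, audio_dim=256, pred_dim=256, ctx_dim=256,
                 joint_dim=320, vocab_size=1024):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.ctx_proj = nn.Linear(ctx_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)

    def forward(self, audio_t, pred_u, ctx):
        # audio_t: (batch, audio_dim)  one frame from the audio encoder
        # pred_u:  (batch, pred_dim)   prediction-network output for the hypothesis so far
        # ctx:     (batch, ctx_dim)    pooled context embedding of previous transcriptions
        fused = torch.tanh(self.audio_proj(audio_t)
                           + self.pred_proj(pred_u)
                           + self.ctx_proj(ctx))
        return self.out(fused)               # logits over the wordpiece vocabulary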
The context encoder employs self-attentive pooling to refine wordpiece embeddings derived from previous transcriptions. The pooling reweights the embeddings so that the most relevant wordpieces contribute most to the context representation, improving the model's use of context and, in turn, transcription accuracy.
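One common way to realize self-attentive pooling is a learned scorer that assigns a weight to each wordpiece embedding, with the weighted sum forming a single context vector. The tanh-plus-linear scorer below is an assumed, illustrative choice rather than the patent's stated design.

# Minimal self-attentive pooling over wordpiece embeddings.
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, embeddings):            # embeddings: (batch, U_prev, dim)
        weights = torch.softmax(self.score(embeddings), dim=1)  # (batch, U_prev, 1)
        return (weights * embeddings).sum(dim=1)                # pooled context: (batch, dim)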
The implementation executes a series of operations: inputting acoustic frames into the ASR model, generating feature representations, creating context embeddings, and producing probability distributions over candidate hypotheses. The final output is a transcription of the input utterance based on the highest-probability hypothesis, ensuring that contextual cues are considered throughout the process.
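The final selection step can be illustrated with a toy beam over hypotheses: each hypothesis accumulates a log-probability from the per-step distributions, and the transcription comes from the highest-scoring one. The three-token vocabulary and probabilities below are made up purely for demonstration.

# Toy sketch: pick the transcription of the highest-probability hypothesis
# given per-step probability distributions over wordpieces.
import math

def best_hypothesis(step_distributions, vocab, beam_size=2):
    # Each hypothesis is (wordpiece sequence, cumulative log-probability).
    beams = [([], 0.0)]
    for dist in step_distributions:               # dist: probability per wordpiece
        candidates = []
        for seq, logp in beams:
            for idx, p in enumerate(dist):
                candidates.append((seq + [vocab[idx]], logp + math.log(p)))
        # Keep only the most probable hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    seq, _ = beams[0]                             # highest-probability hypothesis
    return "".join(seq).replace("▁", " ").strip()

vocab = ["▁hel", "lo", "▁there"]
steps = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
print(best_hypothesis(steps, vocab))              # -> "hello there"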