Invention Title:

CONTEXT-AWARE END-TO-END ASR FUSION OF CONTEXT, ACOUSTIC AND TEXT REPRESENTATIONS

Publication number:

US20240185844

Section:

Physics

Class:

G10L15/18

Smart overview of the Invention

An automatic speech recognition (ASR) model processes input audio by receiving a sequence of acoustic frames. An audio encoder generates higher-order feature representations for these frames, while a context encoder creates context embeddings from previous transcriptions, giving the model conversational context that improves how accurately it understands and predicts speech.
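
The patent text does not include source code; the following is a minimal PyTorch sketch of the two encoders, with an LSTM standing in for the audio encoder and a plain embedding table standing in for the context encoder. All class names, layer choices, and dimensions here are illustrative assumptions, not the patent's stated design.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Maps a sequence of acoustic frames to higher-order feature representations."""
    def __init__(self, feat_dim: int = 80, hidden_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) -> (batch, time, hidden_dim)
        features, _ = self.lstm(frames)
        return features

class ContextEncoder(nn.Module):
    """Embeds wordpiece tokens from previous transcriptions into context vectors."""
    def __init__(self, vocab_size: int = 4096, embed_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, prev_tokens: torch.Tensor) -> torch.Tensor:
        # prev_tokens: (batch, num_tokens) -> (batch, num_tokens, embed_dim)
        return self.embed(prev_tokens)

# Example: one second of 80-dim filterbank frames plus a 5-token history.
frames = torch.randn(1, 100, 80)
prev_tokens = torch.randint(0, 4096, (1, 5))
audio_features = AudioEncoder()(frames)             # (1, 100, 256)
context_embeddings = ContextEncoder()(prev_tokens)  # (1, 5, 256)
```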

Integration of Contextual Information

Conventional ASR systems often transcribe each segment of audio independently, which can lead to inaccuracies when context from previous utterances is ignored. The proposed model incorporates that contextual information through a joint network that combines the outputs of the audio encoder, context encoder, and prediction network to improve recognition performance.
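
A common way to realize such a joint network in RNN-T-style models is to project each stream to a shared dimension, sum the projections, and apply a nonlinearity before the output layer. The sketch below assumes that additive fusion; the patent may specify a different combination.

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    """Fuses audio, context, and prediction-network representations into a
    probability distribution over output symbols (wordpieces plus blank)."""
    def __init__(self, audio_dim: int = 256, context_dim: int = 256,
                 pred_dim: int = 256, joint_dim: int = 512, vocab_size: int = 4096):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.context_proj = nn.Linear(context_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)  # index 0 reserved for blank

    def forward(self, audio_feat, context_vec, pred_feat):
        # Additive fusion of the three streams, then a nonlinearity and softmax.
        fused = torch.tanh(self.audio_proj(audio_feat)
                           + self.context_proj(context_vec)
                           + self.pred_proj(pred_feat))
        return torch.log_softmax(self.out(fused), dim=-1)

# One decoding step: fuse a single audio frame, a pooled context vector,
# and the prediction network's current dense representation.
joint = JointNetwork()
log_probs = joint(torch.randn(1, 256), torch.randn(1, 256), torch.randn(1, 256))
print(log_probs.shape)  # torch.Size([1, 4096])
```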

Functional Components of the ASR Model

  • Audio Encoder: Generates higher-order feature representations for acoustic frames.
  • Context Encoder: Produces context embeddings from prior transcriptions, potentially using a pre-trained BERT model.
  • Prediction Network: Creates dense representations based on the non-blank symbols output by the final Softmax layer (a minimal sketch of this component follows the list).
  • Joint Network: Generates probability distributions over speech recognition hypotheses by integrating the outputs of the other components.
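
The prediction-network sketch below conditions on previously emitted non-blank symbols through an LSTM over their embeddings; the LSTM and all dimensions are assumptions for illustration, not the patent's stated architecture.

```python
import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    """Produces a dense representation conditioned on the sequence of
    non-blank symbols emitted so far."""
    def __init__(self, vocab_size: int = 4096, embed_dim: int = 256,
                 hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, non_blank_symbols: torch.Tensor, state=None):
        # non_blank_symbols: (batch, seq_len) ids of previously emitted symbols.
        dense, state = self.lstm(self.embed(non_blank_symbols), state)
        return dense, state  # dense: (batch, seq_len, hidden_dim)

# Condition on three previously emitted non-blank symbols.
pred_net = PredictionNetwork()
dense, state = pred_net(torch.randint(0, 4096, (1, 3)))
print(dense.shape)  # torch.Size([1, 3, 256])
```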

Enhancements Through Self-Attentive Pooling

The context encoder employs self-attentive pooling to refine the wordpiece embeddings derived from previous transcriptions. The pooling reweights the embeddings so that the most relevant tokens contribute most to the pooled context representation, improving the model's grasp of context and, in turn, transcription accuracy.
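
Self-attentive pooling is a standard construction: score each embedding with a small feed-forward network, softmax the scores into weights, and take the weighted sum. The sketch below follows that recipe; the two-layer scorer and its dimensions are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Reweights a sequence of wordpiece embeddings and pools them into a
    single context vector, emphasizing the most relevant tokens."""
    def __init__(self, embed_dim: int = 256, attn_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, num_tokens, embed_dim)
        weights = torch.softmax(self.score(embeddings), dim=1)  # (batch, num_tokens, 1)
        return (weights * embeddings).sum(dim=1)                # (batch, embed_dim)

# Pool the embeddings of an 8-wordpiece previous transcription into one vector.
pooled = SelfAttentivePooling()(torch.randn(1, 8, 256))
print(pooled.shape)  # torch.Size([1, 256])
```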

Implementation and Operations

In operation, the model receives the acoustic frames, generates the feature representations, creates the context embeddings, and produces probability distributions over candidate hypotheses. The final transcription of the input utterance is the hypothesis with the highest probability, so contextual cues inform every step of the process.
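
Tying the pieces together, the sketch below reuses the illustrative modules from the earlier examples in a simplified greedy decode that tracks a single hypothesis and emits the highest-probability symbol at each step. The patent describes scoring multiple hypotheses; blank_id, sos_id, and max_symbols are assumed conventions, not values from the source.

```python
import torch

@torch.no_grad()
def greedy_decode(audio_encoder, context_encoder, pooling, pred_net, joint,
                  frames, prev_tokens, blank_id=0, sos_id=1, max_symbols=3):
    """Simplified greedy decode: keep one hypothesis and take the
    highest-probability symbol at each step."""
    audio_feats = audio_encoder(frames)                  # (1, T, D)
    context_vec = pooling(context_encoder(prev_tokens))  # (1, D)
    hyp, state = [sos_id], None
    for t in range(audio_feats.size(1)):
        for _ in range(max_symbols):  # cap symbols emitted per frame
            last = torch.tensor([[hyp[-1]]])
            dense, new_state = pred_net(last, state)
            log_probs = joint(audio_feats[:, t], context_vec, dense[:, -1])
            symbol = int(log_probs.argmax(dim=-1))
            if symbol == blank_id:
                break  # blank: advance to the next acoustic frame
            hyp.append(symbol)
            state = new_state  # advance prediction state only on non-blank
    return hyp[1:]  # drop the start-of-sequence marker

# Usage (with the sketch modules defined in the earlier examples):
# ids = greedy_decode(AudioEncoder(), ContextEncoder(), SelfAttentivePooling(),
#                     PredictionNetwork(), JointNetwork(),
#                     torch.randn(1, 100, 80), torch.randint(0, 4096, (1, 5)))
```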