US20240420686
2024-12-19
Physics
G10L15/16
Speech recognition using sequence-to-sequence models involves receiving audio data and processing it through an encoder to extract features indicative of the utterance's acoustic characteristics. An attender then generates a context vector from the encoder's output, which is used by a decoder to produce speech recognition scores. These scores are employed to generate a transcription of the utterance, which is provided as the output of the automated speech recognition (ASR) system.
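The encoder, attender, decoder flow described above can be pictured with a toy NumPy sketch. All shapes, weights, and variable names here are illustrative assumptions, not details from the patent:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical sizes: 5 encoder frames, 8-dim features, 10 output labels.
rng = np.random.default_rng(0)
enc_out = rng.standard_normal((5, 8))   # encoder features, one row per audio frame
dec_state = rng.standard_normal(8)      # current decoder state

# Attender: score each encoder frame against the decoder state, then
# form a context vector as the attention-weighted sum of frames.
attn_weights = softmax(enc_out @ dec_state)
context = attn_weights @ enc_out        # shape (8,)

# Decoder: combine context with state to produce speech recognition scores.
W = rng.standard_normal((10, 16))       # illustrative output projection
scores = softmax(W @ np.concatenate([dec_state, context]))
print(scores.argmax())                  # index of the highest-scoring label
```

In a real system the highest-scoring labels at each step would be assembled (e.g. by beam search) into the output transcription.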
The field of speech recognition has evolved with the use of neural network models that perform acoustic modeling and improve recognition quality. Traditional systems often require multiple input sources and separately trained models for different tasks. Sequence-to-sequence models, which fold the acoustic, pronunciation, and language models into a single neural network, offer improved accuracy and efficiency without needing additional components such as lexicons or text normalization modules.
Several techniques enhance speech recognition accuracy, including the use of Listen, Attend, and Spell (LAS) models and neural transducer models. These models utilize attention mechanisms between encoders and decoders to achieve high accuracy. Structural improvements such as word piece models and multi-headed attention processing allow for diverse linguistic unit outputs and multiple attention distributions, respectively. Optimization strategies include minimum word-error-rate training, scheduled sampling, synchronous training, and label smoothing.
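Of the optimization strategies listed, label smoothing is the simplest to show concretely: part of the target probability mass is redistributed uniformly across all classes so the model is not trained toward over-confident outputs. A minimal sketch, with an illustrative smoothing factor and class count:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: keep (1 - eps) of the mass on the true label and
    spread eps uniformly over all classes."""
    n = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n

target = np.zeros(4)
target[2] = 1.0
print(smooth_labels(target))  # [0.025 0.025 0.925 0.025]
```

The smoothed target remains a valid probability distribution (it sums to 1), so it can be used directly with a cross-entropy training loss.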
The application of these enhancements significantly improves ASR performance on tasks such as voice search. For instance, an enhanced model with a unidirectional LSTM encoder reduced the Word Error Rate (WER) from 9.2% to 5.6% on a 12,500-hour voice search task, outperforming a conventional system's 6.7% WER. On dictation tasks, the enhanced model achieved a 4.1% WER versus 5.0% for the traditional system, demonstrating its effectiveness across applications.
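WER, the metric quoted above, is the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis and the reference, divided by the number of reference words. A minimal sketch with a made-up example pair:

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("play some jazz music", "play sum jazz"))  # 0.5: 1 sub + 1 del over 4 words
```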
Sequence-to-sequence models can also serve streaming applications such as Voice Search, maintaining accuracy under low-latency requirements. Techniques such as widening the attention computation window and initializing neural transducer (NT) models from LAS-trained models help NT match LAS performance while providing streaming results with minimal delay. These advances highlight the potential for real-time ASR without sacrificing accuracy.
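One way to picture a bounded attention computation window is to let the attender see only a fixed span of recent encoder frames, so a context vector is available before the full utterance has been encoded. A toy sketch; the function name, window size, and shapes are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def windowed_attention(enc_out, dec_state, end_frame, window=3):
    """Streaming-style attention: attend only over the last `window`
    encoder frames up to `end_frame`, rather than the whole utterance."""
    start = max(0, end_frame - window)
    chunk = enc_out[start:end_frame]
    scores = chunk @ dec_state
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax over the window only
    return w @ chunk                     # context vector from partial audio

rng = np.random.default_rng(1)
enc = rng.standard_normal((10, 4))       # 10 frames seen so far, 4-dim features
ctx = windowed_attention(enc, rng.standard_normal(4), end_frame=6)
```

Widening `window` trades latency for context, which is the tuning knob the streaming techniques above adjust.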