US20240420686
2024-12-19
Physics
G10L15/16
Speech recognition using sequence-to-sequence models involves receiving audio data and processing it through an encoder to extract features indicative of the utterance's acoustic characteristics. An attender then generates a context vector from the encoder's output, which is used by a decoder to produce speech recognition scores. These scores are employed to generate a transcription of the utterance, which is provided as the output of the automated speech recognition (ASR) system.
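The encoder, attender, decoder flow described above can be pictured with a toy NumPy sketch. All shapes, weights, and variable names here are illustrative assumptions, not details from the patent:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical sizes: 5 encoder frames, 8-dim features, 10 output labels.
rng = np.random.default_rng(0)
enc_out = rng.standard_normal((5, 8))   # encoder features, one row per audio frame
dec_state = rng.standard_normal(8)      # current decoder state

# Attender: score each encoder frame against the decoder state, then
# form a context vector as the attention-weighted sum of frames.
attn_weights = softmax(enc_out @ dec_state)
context = attn_weights @ enc_out        # shape (8,)

# Decoder: combine context with state to produce speech recognition scores.
W = rng.standard_normal((10, 16))       # illustrative output projection
scores = softmax(W @ np.concatenate([dec_state, context]))
print(scores.argmax())                  # index of the highest-scoring label
```

In a real system the highest-scoring labels at each step would be assembled (e.g. by beam search) into the output transcription.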
The field of speech recognition has evolved with the use of neural network models that perform acoustic modeling and improve recognition quality. Traditional systems often require multiple input sources and separately trained models for different tasks. Sequence-to-sequence models, which fold the acoustic, pronunciation, and language models into a single neural network, offer improved accuracy and efficiency without needing additional components such as lexicons or text normalization modules.
Several techniques enhance speech recognition accuracy, including the use of Listen, Attend, and Spell (LAS) models and neural transducer models. These models utilize attention mechanisms between encoders and decoders to achieve high accuracy. Structural improvements such as word piece models and multi-headed attention processing allow for diverse linguistic unit outputs and multiple attention distributions, respectively. Optimization strategies include minimum word-error-rate training, scheduled sampling, synchronous training, and label smoothing.
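Of the optimization strategies listed, label smoothing is the simplest to show concretely: part of the target probability mass is redistributed uniformly across all classes so the model is not trained toward over-confident outputs. A minimal sketch, with an illustrative smoothing factor and class count:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: keep (1 - eps) of the mass on the true label and
    spread eps uniformly over all classes."""
    n = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n

target = np.zeros(4)
target[2] = 1.0
print(smooth_labels(target))  # [0.025 0.025 0.925 0.025]
```

The smoothed target remains a valid probability distribution (it sums to 1), so it can be used directly with a cross-entropy training loss.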
The application of these enhancements significantly improves ASR performance on tasks such as voice search. For instance, an enhanced model with a unidirectional LSTM encoder reduced the Word Error Rate (WER) from 9.2% to 5.6% on a 12,500-hour voice search task, outperforming a conventional system's 6.7% WER. On dictation tasks, the enhanced model achieved a 4.1% WER versus 5.0% for the traditional system, demonstrating its effectiveness across applications.
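WER, the metric quoted above, is the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis and the reference, divided by the number of reference words. A minimal sketch with a made-up example pair:

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("play some jazz music", "play sum jazz"))  # 0.5: 1 sub + 1 del over 4 words
```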
Sequence-to-sequence models can also serve streaming applications such as Voice Search, maintaining accuracy under low-latency requirements. Techniques such as widening the attention computation window and initializing neural transducer (NT) models from LAS-trained models help NT match LAS performance while providing streaming results with minimal delay. These advances highlight the potential for real-time ASR without sacrificing accuracy.
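One way to picture a bounded attention computation window is to let the attender see only a fixed span of recent encoder frames, so a context vector is available before the full utterance has been encoded. A toy sketch; the function name, window size, and shapes are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def windowed_attention(enc_out, dec_state, end_frame, window=3):
    """Streaming-style attention: attend only over the last `window`
    encoder frames up to `end_frame`, rather than the whole utterance."""
    start = max(0, end_frame - window)
    chunk = enc_out[start:end_frame]
    scores = chunk @ dec_state
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax over the window only
    return w @ chunk                     # context vector from partial audio

rng = np.random.default_rng(1)
enc = rng.standard_normal((10, 4))       # 10 frames seen so far, 4-dim features
ctx = windowed_attention(enc, rng.standard_normal(4), end_frame=6)
```

Widening `window` trades latency for context, which is the tuning knob the streaming techniques above adjust.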