Invention Title:

SIGNAL ENCODING USING LATENT FEATURE PREDICTION

Publication number:

US20250364001

Publication date:
Section:

Physics

Class:

G10L19/06

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

Techniques are detailed for encoding and decoding signals, particularly for speech coding in real-time communications. A neural network encodes the latent features of the current frame by predicting them from the reconstructed latent features of past frames. An extractor derives a residual-like feature from this prediction and the latent features produced by an encoder. This feature is quantized; during decoding, it is dequantized and combined with the prediction from prior frames to reconstruct the current frame's latent features, from which the signal is reconstructed.
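The encode/decode flow described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the predictor, extractor, and synthesizer are simple stand-in functions (the patent uses learned networks for each), and the scalar quantizer step size and feature dimension are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # latent feature dimension (illustrative)

def predict(past_features):
    """Predict the current frame's latent features from past
    reconstructed features (here: a simple average; a learned
    network in the actual scheme)."""
    return np.mean(past_features, axis=0)

def extract(latent, prediction):
    """Extractor: derive a residual-like feature from the encoder's
    latent features and the prediction (here: plain subtraction)."""
    return latent - prediction

def synthesize(residual, prediction):
    """Synthesizer: combine the dequantized residual-like feature
    with the prediction to reconstruct the frame's features."""
    return residual + prediction

def quantize(x, step=0.5):
    return np.round(x / step).astype(int)

def dequantize(q, step=0.5):
    return q * step

# Encoder side.
past = rng.normal(size=(3, D))       # reconstructed features of past frames
latent = rng.normal(size=D)          # encoder output for the current frame
pred = predict(past)
q = quantize(extract(latent, pred))  # only the residual-like part is coded

# Decoder side: forms the same prediction from the same past frames.
recon = synthesize(dequantize(q), predict(past))
print(np.max(np.abs(recon - latent)))  # small quantization error (<= step/2)
```

Because both sides compute the same prediction from previously reconstructed frames, only the residual-like feature needs to be transmitted.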

The field of digital audio has seen significant advancements since the 1970s, particularly with the rise of internet-based real-time communication tools like Microsoft Teams. Despite improvements in computing power and network infrastructure, there remains a need to enhance audio quality while minimizing data transmission requirements. Real-time audio processing is sensitive to delays, which can hinder effective communication. This innovation aims to address these challenges by improving coding efficiency through contextual coding.

Existing neural audio codecs fall into two categories: generative decoder models and end-to-end neural audio coding. The latter often uses the VQ-VAE framework, which learns encoding, quantization, and decoding in an end-to-end manner. However, these methods do not fully exploit temporal correlations across frames, leaving redundancy in the coded stream. The proposed innovations incorporate contextual coding into the VQ-VAE framework to remove this redundancy, enhancing coding efficiency through temporal prediction.
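The quantization step at the heart of a VQ-VAE-style codec maps each latent vector to the nearest entry in a learned codebook, so only an index is transmitted. A minimal sketch follows; the codebook size, dimension, and random values are illustrative, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, 4))  # 16 code vectors of dimension 4 (illustrative)

def vq_encode(z):
    """Map a latent vector to the index of its nearest code vector."""
    dists = np.sum((codebook - z) ** 2, axis=1)
    return int(np.argmin(dists))

def vq_decode(idx):
    """Look the code vector back up from its index."""
    return codebook[idx]

z = rng.normal(size=4)
idx = vq_encode(z)       # only this index needs to be transmitted
z_hat = vq_decode(idx)   # decoder's approximation of z
```

In an actual VQ-VAE the codebook is trained jointly with the encoder and decoder; the contextual-coding idea applies this quantizer to the residual-like feature rather than to the raw latent features.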

Contextual coding with temporal predictions is introduced into the VQ-VAE framework for neural audio coding. Unlike traditional predictive methods, which subtract the prediction from the samples to obtain a residual, this approach uses a learnable extractor and synthesizer to merge predictions with latent features. This method is particularly beneficial for low-latency speech encoding but can be adapted to signal types beyond audio.
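The contrast between a fixed subtraction residual and a learnable merge can be sketched as below. The linear weight matrices stand in for trained extractor and synthesizer networks; their shapes and values are illustrative assumptions, not details from the patent.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 6
# Stand-ins for trained networks: each maps [input, prediction] -> D values.
W_ex = rng.normal(size=(D, 2 * D)) * 0.1  # extractor weights (would be learned)
W_sy = rng.normal(size=(D, 2 * D)) * 0.1  # synthesizer weights (would be learned)

latent = rng.normal(size=D)  # encoder output for the current frame
pred = rng.normal(size=D)    # temporal prediction from past frames

# Classic predictive coding: the residual is a fixed subtraction.
residual_classic = latent - pred

# Learned variant: extractor and synthesizer are free to combine both
# inputs however training finds useful, not just by subtraction/addition.
residual_learned = W_ex @ np.concatenate([latent, pred])
recon = W_sy @ np.concatenate([residual_learned, pred])
```

The point of the learned pair is that the network can discard exactly the information the prediction already carries, rather than being restricted to an arithmetic difference.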

The TFNet codec exemplifies this approach by utilizing a neural network with time-frequency input. It processes audio samples using the Short-Time Fourier Transform (STFT) and applies power-law compression so that high-energy frequency components do not dominate low-energy ones. The encoder exploits local 2D correlations in the time-frequency plane, while temporal filters handle longer-term dependencies. Quantized features are coded efficiently, and decoding involves temporal filtering blocks followed by reconstruction. This design is resilient to packet losses in real-time communications, offering recovery capability and limiting error propagation.
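The time-frequency front end described above can be sketched as a framed STFT followed by power-law compression of the magnitudes. This is a hedged sketch: the frame length, hop size, window, and the 0.3 compression exponent are common illustrative choices, not values specified by the patent.

```python
import numpy as np

def stft(x, frame=256, hop=128):
    """Framed STFT: window each frame, then take a real FFT."""
    win = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop : i * hop + frame] * win for i in range(n)])
    return np.fft.rfft(frames, axis=1)

def power_law_compress(spec, alpha=0.3):
    """Compress magnitudes (keeping phase) so high-energy bins do not
    dominate low-energy ones."""
    mag, phase = np.abs(spec), np.angle(spec)
    return mag ** alpha * np.exp(1j * phase)

t = np.arange(4096) / 16000.0
x = np.sin(2 * np.pi * 440 * t)      # 440 Hz test tone at 16 kHz
spec = power_law_compress(stft(x))
print(spec.shape)                    # → (31, 129): 31 frames, 129 bins
```

The compressed spectrogram is what a TFNet-style encoder would consume as its 2D time-frequency input.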