Invention Title:

TEXT-TO-AUDIO CONVERSION WITH BYTE-ENCODING VECTORS

Publication number:

US20250104692

Publication date:

2025-03-27

Section:

Physics

Class:

G10L13/08

Inventors:

Kilian Quirin Weinberger Ithaca, NY, United States

Soham Ray Ithaca, NY, United States

Justin Robert Lovelace Ithaca, NY, United States

Felix Wu Issaquah, WA, United States

Kwangyoun Kim San Jose, CA, United States

Applicant:

ASAPP, INC. New York, NY, United States

Smart overview of the Invention

The patent application describes a method for converting text into audio using a diffusion model. The model processes the text together with noise vectors to generate encoded audio vectors, which are then decoded into an audio signal. The method aims to produce high-quality audio by using byte-encoding vectors, which improve the naturalness and quality of the generated speech. It can also generate audio that mimics a specific person's voice when given prompt audio from that person.

Background

Traditional text-to-speech (TTS) systems have relied on phonetic conversion and pre-recorded segments, often resulting in robotic-sounding speech. Neural network-based approaches have improved naturalness but require significant computational resources, limiting real-time application. The proposed method seeks to address these limitations by integrating advanced diffusion models with byte-encoding vectors for more efficient and natural speech synthesis.

Technical Details

The method involves several steps: receiving text, computing byte-encoding vectors from that text, and generating noise vectors. A neural network then processes these vectors; its layers include residual block layers and transformer layers with cross-attention. The process involves multiple stages of encoding and decoding and uses noise-schedule weights taken from a noise schedule, which may be a scaled schedule defined by a function such as a sigmoid or cosine.
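To make the first steps concrete, the following is a minimal sketch, not the implementation described in the application. It assumes that the text is first mapped to its UTF-8 byte values and that the noise schedule is expressed as a signal-retention weight between 1 (clean) and 0 (pure noise). The function names, the sigmoid range of -3 to 3, and the noising formula are illustrative assumptions. The byte-level encoder that would turn the byte values into byte-encoding vectors is a trained model and is not shown.

```python
import math


def text_to_byte_ids(text: str) -> list[int]:
    """Map text to its UTF-8 byte values (integers in 0..255)."""
    return list(text.encode("utf-8"))


def cosine_alpha_bar(t: float) -> float:
    """Cosine schedule: share of signal variance kept at time t in [0, 1]."""
    return math.cos(0.5 * math.pi * t) ** 2


def sigmoid_alpha_bar(t: float, start: float = -3.0, end: float = 3.0) -> float:
    """Sigmoid schedule, rescaled to run from 1 at t=0 down to 0 at t=1.

    The start/end range is an assumed, tunable choice.
    """
    def sig(x: float) -> float:
        return 1.0 / (1.0 + math.exp(-x))

    lo, hi = sig(start), sig(end)
    return (hi - sig(start + t * (end - start))) / (hi - lo)


def add_noise(x0: list[float], noise: list[float], alpha_bar: float) -> list[float]:
    """Blend clean values with noise using the schedule weight alpha_bar."""
    a = math.sqrt(alpha_bar)
    s = math.sqrt(1.0 - alpha_bar)
    return [a * x + s * n for x, n in zip(x0, noise)]


if __name__ == "__main__":
    print(text_to_byte_ids("Hi"))
    for t in (0.0, 0.5, 1.0):
        print(t, cosine_alpha_bar(t), sigmoid_alpha_bar(t))
```

Working on bytes rather than phonemes or word pieces means any input string has a valid encoding, including characters outside ASCII, which occupy more than one byte.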

System Architecture

The system comprises server computers configured to execute the method's steps, including receiving text input and generating an audio signal. The neural network architecture includes contracting and expanding paths, residual block layers, and upsampling layers. It also supports processing prompt encoded-audio vectors to produce speech resembling a specified person and can adjust the audio signal length based on requirements.
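The contracting and expanding paths can be pictured with a toy one-dimensional example. This is only a shape-level sketch under stated assumptions: the sequence holds single numbers rather than vectors, `toy_layer` is a hypothetical stand-in for a learned layer, downsampling averages adjacent pairs, upsampling repeats each value, and the skip connections add features from the contracting path. The transformer layers with cross-attention and the prompt conditioning are omitted.

```python
def toy_layer(x: list[float]) -> list[float]:
    """Hypothetical stand-in for a learned layer (here: a fixed scaling)."""
    return [0.1 * v for v in x]


def residual_block(x: list[float]) -> list[float]:
    """Residual block: output is the input plus the layer's transformation."""
    return [xi + fi for xi, fi in zip(x, toy_layer(x))]


def downsample(x: list[float]) -> list[float]:
    """Halve the sequence length by averaging adjacent pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def upsample(x: list[float]) -> list[float]:
    """Double the sequence length by repeating each value."""
    return [v for v in x for _ in range(2)]


def tiny_unet(x: list[float], depth: int = 2) -> list[float]:
    """Contracting path, bottleneck, then expanding path with skip connections."""
    if len(x) % (2 ** depth) != 0:
        raise ValueError("length must be divisible by 2**depth")
    skips = []
    h = x
    for _ in range(depth):            # contracting path
        h = residual_block(h)
        skips.append(h)
        h = downsample(h)
    h = residual_block(h)             # bottleneck
    for _ in range(depth):            # expanding path
        h = upsample(h)
        skip = skips.pop()
        h = [a + b for a, b in zip(h, skip)]
        h = residual_block(h)
    return h


if __name__ == "__main__":
    out = tiny_unet([float(i) for i in range(8)])
    print(len(out))
```

The output has the same length as the input, so a network of this shape can map a sequence of noisy encoded-audio vectors to a sequence of the same length.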

Implementation

The described method can be implemented as executable instructions stored on non-transitory, computer-readable media. These instructions direct processors to perform tasks such as computing byte-encoding vectors from text and reversing the diffusion process. The system samples noise from a Gaussian distribution, and the byte-encoding vectors provide semantic context that improves the quality of the generated speech.
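Reversing the diffusion process can be sketched as a loop that starts from Gaussian noise and moves step by step toward a clean estimate. The sketch below is an illustration under assumptions, not the procedure claimed in the application: it uses a cosine schedule, a deterministic DDIM-style update, and a hypothetical `denoise(x, t)` callable that returns an estimate of the clean values. In a real system that callable would be the trained network conditioned on the byte-encoding vectors, and the result would be passed to an audio decoder.

```python
import math
import random
from typing import Callable


def alpha_bar(t: float) -> float:
    """Cosine schedule weight: 1 at t=0 (clean), 0 at t=1 (pure noise)."""
    return math.cos(0.5 * math.pi * t) ** 2


def sample(
    denoise: Callable[[list[float], float], list[float]],
    length: int,
    steps: int = 50,
    seed: int = 0,
) -> list[float]:
    """Run the reverse process from Gaussian noise down to t=0."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # start from pure noise
    for i in range(steps, 0, -1):
        t, t_next = i / steps, (i - 1) / steps
        a_t, a_next = alpha_bar(t), alpha_bar(t_next)
        x0_hat = denoise(x, t)  # model's estimate of the clean values
        # Noise implied by the current sample and the clean estimate.
        eps = [
            (xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t)
            for xi, x0i in zip(x, x0_hat)
        ]
        # Re-noise the estimate to the lower noise level of the next step.
        x = [
            math.sqrt(a_next) * x0i + math.sqrt(1.0 - a_next) * e
            for x0i, e in zip(x0_hat, eps)
        ]
    return x


if __name__ == "__main__":
    target = [0.5, -0.25, 1.0]
    # Stand-in denoiser that always predicts the same clean values.
    print(sample(lambda x, t: target, length=len(target)))
```

With the stand-in denoiser the loop converges to the fixed target, which checks the update rule only; output quality in practice depends entirely on the trained network.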