US20240355346
2024-10-24
Physics
G10L21/013
The patent application describes a voice modification system that alters a speaker's accent while preserving the speaker's unique vocal characteristics. The system takes an audio waveform of a person's speech and outputs a modified version that retains elements such as voice timbre, intonation, and intensity but carries a target accent. It employs a bottleneck-based autoencoder that uses speech spectrograms as both input and output. To ensure the output preserves the speaker's identity and remains intelligible, training combines multiple loss functions, including ones derived from a speaker identity discriminator and a speech intelligibility scorer.
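The spectrogram-in, spectrogram-out bottleneck structure described above can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the application: the function `magnitude_spectrogram`, the class `BottleneckAutoencoder`, the frame length, hop size, bottleneck width, and the single linear layer on each side. The weights are random and untrained, so the sketch shows only how data flows through a narrow code and back to a spectrogram of the same shape.

```python
import numpy as np


def magnitude_spectrogram(waveform, frame_len=256, hop=128):
    """Return a (num_frames, frame_len // 2 + 1) magnitude spectrogram."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([
        waveform[i * hop:i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


class BottleneckAutoencoder:
    """Hypothetical single-layer encoder/decoder with a narrow code."""

    def __init__(self, num_bins, bottleneck_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Untrained illustrative weights; a real system would learn these.
        self.w_enc = rng.normal(scale=0.1, size=(num_bins, bottleneck_dim))
        self.w_dec = rng.normal(scale=0.1, size=(bottleneck_dim, num_bins))

    def encode(self, spec):
        # Compress each spectrogram frame to bottleneck_dim values.
        return np.tanh(spec @ self.w_enc)

    def decode(self, code):
        # ReLU keeps the reconstructed magnitudes non-negative.
        return np.maximum(code @ self.w_dec, 0.0)

    def forward(self, spec):
        return self.decode(self.encode(spec))


# One second of a 440 Hz tone at an assumed 8 kHz sample rate.
t = np.arange(8000) / 8000.0
waveform = np.sin(2 * np.pi * 440.0 * t)

spec = magnitude_spectrogram(waveform)
model = BottleneckAutoencoder(num_bins=spec.shape[1], bottleneck_dim=16)
code = model.encode(spec)
output_spec = model.forward(spec)
```

The bottleneck forces each frame through far fewer values than it has frequency bins, which is the property that lets such a model discard some attributes of the input while the decoder rebuilds the rest.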
The invention belongs to the field of speech processing and uses machine learning techniques to modify speech attributes. It relies on autoencoders, neural networks used for unsupervised representation learning, which can process signals such as images, videos, and audio waveforms to generate modified outputs. The stated aim is to modify a speaker's accent while keeping the speaker's voice recognizable.
The system comprises a computing system that processes an input audio waveform and outputs a modified version with a different accent. It uses an autoencoder architecture with competing discriminators to achieve this transformation. The decoder is trained with three loss functions: one for reconstruction error, one for speaker identity preservation, and one for speech intelligibility. Users can adjust both the degree of accent modification and how recognizable the speaker's voice remains in the output.
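One way to picture the three-part training objective is as a weighted sum of the individual terms. The sketch below is hypothetical: the application summary does not give formulas, so the mean squared error for reconstruction, the negative log-probability for speaker identity, the `1 - score` form for intelligibility, and the weights `w_speaker` and `w_intel` are all assumptions chosen for illustration. The inputs `same_speaker_prob` and `intelligibility_score` stand in for the outputs of the discriminator and the scorer, which are not modeled here.

```python
import numpy as np


def reconstruction_loss(target_spec, output_spec):
    """Mean squared error between two spectrograms (assumed form)."""
    diff = np.asarray(target_spec) - np.asarray(output_spec)
    return float(np.mean(diff ** 2))


def speaker_identity_loss(same_speaker_prob, eps=1e-8):
    """Negative log of the probability that the output matches the speaker."""
    return float(-np.log(np.clip(same_speaker_prob, eps, 1.0)))


def intelligibility_loss(intelligibility_score):
    """Penalty that shrinks as an assumed [0, 1] intelligibility score rises."""
    return float(1.0 - np.clip(intelligibility_score, 0.0, 1.0))


def total_loss(target_spec, output_spec, same_speaker_prob,
               intelligibility_score, w_speaker=1.0, w_intel=1.0):
    """Weighted sum of the three terms; the weights are hypothetical knobs."""
    return (
        reconstruction_loss(target_spec, output_spec)
        + w_speaker * speaker_identity_loss(same_speaker_prob)
        + w_intel * intelligibility_loss(intelligibility_score)
    )


# Illustrative values only, not results from the application.
target = np.zeros((4, 8))
output = np.ones((4, 8))
loss = total_loss(target, output, same_speaker_prob=0.5,
                  intelligibility_score=0.75, w_speaker=2.0, w_intel=4.0)
```

Under this assumed formulation, raising `w_speaker` pushes training toward outputs that keep the speaker recognizable, while raising `w_intel` favors intelligibility. Whether the user-facing adjustments described above are implemented through such weights is not stated in the summary.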
An "accent" in this context refers to a distinctive pronunciation or prosody associated with an individual or group. It can be influenced by native language, regional dialect, anatomical factors, or the process of learning a new language. The system can adapt to various accent types, including those that arise from physical conditions, such as slurred speech after a stroke, or from anatomical differences. The goal is to produce an output that aligns with a target accent while retaining the speaker's personality.
The technology has practical applications in enhancing communication for foreign speakers aiming for better intelligibility in different regions. It can also assist individuals with distinct speech patterns due to physical conditions or disabilities. The autoencoder framework is akin to those used in image processing, allowing for efficient real-time transformations of speech waveforms into modified outputs that are more understandable or relatable to specific audiences.