Invention Title:

VOICED-OVER MULTIMEDIA TRACK GENERATION

Publication number:

US20250210065

Publication date:

Section:

Physics

Class:

G11B27/036

Inventors:

Applicant:

Smart overview of the Invention

This publication highlights approaches for generating a final media track in a final language by altering an initial media track in an initial language. An audio generation model converts, or translates, the initial audio track into a final audio track. A video generation model manipulates lip movement in the initial video track based on the final audio and text, keeping speech and visuals synchronized. The final media file is created by merging the altered audio and video tracks.
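
As a rough illustration, the overall flow described above might be orchestrated as in the Python sketch below. Every interface here (AudioGenerationModel, VideoGenerationModel, the injected mux callable) is a hypothetical placeholder for this sketch, not an API named in the publication.

    # Hypothetical end-to-end dubbing pipeline; every interface below is an
    # illustrative assumption, not an API named in the publication.
    from dataclasses import dataclass
    from typing import Protocol

    class AudioGenerationModel(Protocol):
        def translate_track(self, audio_path: str, target_lang: str) -> tuple[str, str]:
            """Return (final_audio_path, final_text) in the target language."""

    class VideoGenerationModel(Protocol):
        def sync_lips(self, video_path: str, audio_path: str, text: str) -> str:
            """Return a video whose lip movement matches the final audio and text."""

    @dataclass
    class MediaFile:
        audio_path: str  # initial audio track (initial language)
        video_path: str  # initial video track

    def dub_media(media: MediaFile, target_lang: str,
                  audio_model: AudioGenerationModel,
                  video_model: VideoGenerationModel,
                  mux) -> str:
        # 1. Convert/translate the initial audio track into the final language.
        final_audio, final_text = audio_model.translate_track(media.audio_path, target_lang)
        # 2. Adjust lip movement in the initial video to match the final audio and text.
        final_video = video_model.sync_lips(media.video_path, final_audio, final_text)
        # 3. Merge the altered audio and video tracks into the final media file.
        return mux(final_audio, final_video)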

Background

The rise of digital technology and internet accessibility has increased global digital media consumption. However, language differences pose a barrier to content accessibility across regions. Traditionally, voice-over artists are hired to dub content into target languages, replacing the original audio tracks. This method is costly and time-consuming, especially when multiple speakers are involved.

Challenges and Conventional Solutions

Conventional solutions such as subtitles or text-to-speech (TTS) systems have limitations. Subtitles often distract users from the main content, while TTS systems fail to preserve vocal characteristics consistent with the original speaker's attributes, such as age and gender. Additionally, these solutions do not address lip synchronization in videos, which degrades the user experience.

Proposed Approach

The described approach converts an initial media file into a final-language version by processing both its audio and video elements. The process begins by converting the audio to text and segmenting it into individual sentences, each tagged with a speaker identifier through speaker diarization. Final audio characteristics are then determined from each speaker's attributes, and these characteristics guide the translation and generation of a final audio portion for each sentence using an audio generation model.
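
The per-sentence flow could be sketched as follows. The Sentence structure, the injected translate callable, and the audio_model.generate call are illustrative assumptions standing in for whatever transcription, diarization, translation, and audio generation components an implementation actually uses.

    # Per-sentence dubbing flow; all names below are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class Sentence:
        text: str        # transcribed text in the initial language
        speaker_id: str  # assigned by speaker diarization
        start_s: float   # start of the sentence in the initial track (seconds)

    def dub_sentences(sentences, speaker_attrs, translate, audio_model, target_lang):
        """speaker_attrs maps speaker_id -> characteristics (e.g. age, gender)
        used to steer the voice of the generated final audio; translate and
        audio_model are injected, assumed components."""
        portions = []
        for s in sentences:
            # Translate the sentence text into the final language.
            final_text = translate(s.text, target_lang)
            # Generate a final audio portion whose voice is consistent with
            # the original speaker's attributes.
            clip = audio_model.generate(final_text,
                                        characteristics=speaker_attrs[s.speaker_id])
            portions.append((s.start_s, clip))
        return portions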

Implementation

The final step generates the complete final audio track by merging the individual sentence portions. The audio generation model, trained on data from diverse speakers, produces output audio that matches the input text under the specified characteristics. This ensures the final media file maintains coherence between visual cues and dubbed audio, improving the user experience across languages.
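
One concrete way to merge the per-sentence portions into the complete final track is to overlay each generated clip onto a silent timeline at its original timestamp. The sketch below uses pydub for this; the choice of library, and the assumption that each portion is a (start_seconds, file_path) pair, are illustrative rather than anything the publication specifies.

    # Merging per-sentence portions into the complete final audio track.
    # pydub is an illustrative choice; the publication names no specific library.
    from pydub import AudioSegment

    def merge_portions(portions, total_duration_s, out_path):
        """portions: iterable of (start_seconds, path_to_generated_clip)."""
        # Silent timeline as long as the original track (pydub works in ms).
        timeline = AudioSegment.silent(duration=int(total_duration_s * 1000))
        for start_s, clip_path in portions:
            clip = AudioSegment.from_file(clip_path)
            # Place each dubbed sentence at the original sentence's timestamp
            # so the audio stays aligned with the visual cues.
            timeline = timeline.overlay(clip, position=int(start_s * 1000))
        timeline.export(out_path, format="wav")
        return out_path

Keeping each portion at its original timestamp preserves the alignment between the dubbed audio and the lip-synchronized video that the approach depends on.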