US20250210065
2025-06-26
Physics
G11B27/036
Approaches for generating a final media track in a final language by altering an initial media track in an initial language are described. An audio generation model converts or translates the initial audio track into a final audio track. A video generation model adjusts the lip movements in the initial video track based on the final audio and text, so that the lips stay synchronized with the dubbed speech. The final media file is created by merging the altered audio and video tracks.
The rise of digital technology and internet accessibility has increased global digital media consumption. However, language differences pose a barrier to content accessibility across regions. Traditionally, voice-over artists are hired to dub content into target languages, replacing the original audio tracks. This method is costly and time-consuming, especially when multiple speakers are involved.
Conventional solutions such as subtitles or text-to-speech (TTS) systems have limitations. Subtitles often distract users from the main content, while TTS systems fail to produce a voice consistent with the original speaker's attributes, such as age and gender. In addition, neither solution addresses lip synchronization in video, which degrades the user experience.
The described approach converts an initial media file into a final-language version by processing both its audio and video elements. The process begins by converting the audio to text and identifying individual sentences, each associated with a speaker identifier obtained through speaker diarization. Final audio characteristics are then determined from the attributes of each speaker. These characteristics guide the translation of each sentence and the generation of its final audio portion by an audio generation model.
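The per-sentence flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `Sentence`, `AudioCharacteristics`, and `dub_sentences` are hypothetical, the characteristics are limited to the age and gender attributes mentioned earlier, and the translation and audio generation models are replaced by trivial stand-in functions so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sentence:
    speaker_id: str  # label assumed to come from speaker diarization
    text: str        # transcribed sentence in the initial language
    start: float     # start time in seconds within the initial audio


@dataclass
class AudioCharacteristics:
    age: str     # hypothetical attribute values, e.g. "adult"
    gender: str  # e.g. "female"


def dub_sentences(
    sentences: List[Sentence],
    speaker_attributes: Dict[str, AudioCharacteristics],
    translate: Callable[[str], str],
    generate_audio: Callable[[str, AudioCharacteristics], List[float]],
) -> List[dict]:
    """Translate each sentence and generate one audio portion per sentence."""
    portions = []
    for sentence in sentences:
        # Characteristics are looked up per speaker, so every sentence
        # from the same speaker is generated with the same voice traits.
        traits = speaker_attributes[sentence.speaker_id]
        final_text = translate(sentence.text)
        samples = generate_audio(final_text, traits)
        portions.append(
            {
                "speaker_id": sentence.speaker_id,
                "start": sentence.start,
                "text": final_text,
                "samples": samples,
            }
        )
    return portions


# Stand-ins for the real models (assumptions for illustration only).
def fake_translate(text: str) -> str:
    return text.upper()


def fake_generate_audio(text: str, traits: AudioCharacteristics) -> List[float]:
    # One silent sample per character; a real model would synthesize speech.
    return [0.0] * len(text)
```

Passing the models in as callables keeps the sketch independent of any particular translation or speech synthesis library; a real system would substitute its own models for the two stand-ins.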
The complete final audio track is then generated by merging the individual sentence portions. The audio generation model, trained on diverse speaker data, produces output audio that matches the input text and the specified characteristics. Together with the lip-synchronized video, this keeps the visual cues and the dubbed audio in the final media file coherent, improving the user experience across languages.
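The merging of sentence portions into one track might look like the sketch below. The text does not say how portions are positioned, so this example assumes each portion carries the start time of its original sentence and is written onto a silent timeline at that offset, with overlapping samples summed. The function name `merge_portions` and the plain list-of-floats sample format are hypothetical.

```python
from typing import List, Tuple


def merge_portions(
    portions: List[Tuple[float, List[float]]], sample_rate: int
) -> List[float]:
    """Merge (start_seconds, samples) portions into a single audio track.

    Gaps between portions are left as silence (0.0); overlapping
    samples are summed.
    """
    if not portions:
        return []
    # The track must be long enough to hold the portion that ends last.
    total = max(
        int(round(start * sample_rate)) + len(samples)
        for start, samples in portions
    )
    track = [0.0] * total
    for start, samples in portions:
        offset = int(round(start * sample_rate))
        for i, value in enumerate(samples):
            track[offset + i] += value
    return track
```

Anchoring each portion to its original start time is one simple way to keep the dubbed speech roughly aligned with the video; a production system would also need to handle translated sentences that run longer than the originals.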