US20260004495
2026-01-01
Physics
G06T13/205
The invention pertains to a method, electronic device, and computer program product for generating video content. A reference image and speech are obtained, where the image specifies the head of a target object and the speech specifies its voice. A fusion vector is created by combining features extracted from the head and the voice. This vector then conditions the denoising of noise-initialized video frames, producing frames that depict the target object speaking and yielding a high-quality video.
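To make that flow concrete, the following is a minimal sketch of the fusion and conditional denoising steps, assuming PyTorch and a DDPM-style sampler; the module names (FusionConditioner, the denoiser callable) and the simplified update rule are illustrative placeholders rather than the patent's actual implementation.

```python
import torch
import torch.nn as nn

class FusionConditioner(nn.Module):
    """Combines head (image) features and voice (speech) features into one fusion vector."""
    def __init__(self, head_dim=512, voice_dim=512, fused_dim=768):
        super().__init__()
        self.proj = nn.Linear(head_dim + voice_dim, fused_dim)

    def forward(self, head_feat, voice_feat):
        # Concatenate both modality features and project them into a shared conditioning space.
        return self.proj(torch.cat([head_feat, voice_feat], dim=-1))


@torch.no_grad()
def generate_frames(denoiser, fusion_vec, num_frames, steps=50, shape=(3, 256, 256)):
    """Start from pure noise and iteratively denoise, conditioned on the fusion vector.

    `denoiser(frames, t, cond)` is assumed to predict the noise present at step t;
    the update rule below is a deliberately simplified placeholder for a real sampler.
    """
    frames = torch.randn(num_frames, *shape)                 # noise-initialized frames
    for t in reversed(range(steps)):
        t_batch = torch.full((num_frames,), t, dtype=torch.long)
        noise_pred = denoiser(frames, t_batch, fusion_vec)   # predicted noise at this step
        frames = frames - noise_pred / steps                 # simplified denoising update
    return frames.clamp(-1, 1)
```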
The technology aims to improve the generation of speaking avatars by driving character avatars with speech audio, converting static images into dynamic animations that express the speech content realistically. Such advancements allow virtual avatars to interact with users more naturally, making them more vivid and expressive. This is crucial for applications in video communication and entertainment, where realistic and expressive facial animations are required.
There are two main approaches to generating speaking avatars: reference-based and reference-free methods. Reference-based methods rely on additional videos or images to guide the generation process, while reference-free methods rely solely on the speech input. The invention employs a diffusion model that generates realistic speaking-head videos from a single identity image and a speech input, addressing challenges such as achieving high resolution and natural expression.
Generating a speaking avatar involves overcoming several challenges, such as encoding rich speech information and maintaining the speaker's appearance. The generated video frames must be high-resolution, realistic, and synchronized with the speech input in terms of lip movements and facial expressions. The invention addresses these challenges by using a multi-modal encoding solution that ensures alignment between speech and visual characteristics.
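One common way to encourage such speech-visual alignment during multi-modal encoding is a symmetric contrastive objective over paired embeddings; the InfoNCE-style loss sketched below is an illustrative assumption, not a formulation taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def alignment_loss(speech_emb, visual_emb, temperature=0.07):
    """Symmetric contrastive loss pulling matched speech/visual pairs together.

    speech_emb, visual_emb: (batch, dim) embeddings of paired speech clips and frames.
    """
    s = F.normalize(speech_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = s @ v.t() / temperature                 # pairwise similarity matrix
    targets = torch.arange(s.size(0))                # i-th speech clip matches i-th frame
    loss_s2v = F.cross_entropy(logits, targets)
    loss_v2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2v + loss_v2s)
```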
The invention proposes a novel framework that combines multi-modal encoding and super-resolution techniques to improve the quality and diversity of generated speaking avatars. A CLIP-based encoder captures semantic and style information from the speech and the image, while a diffusion-based decoder generates high-resolution frames. Complemented by an end-to-end loss function, this approach optimizes the generation process, making the framework suitable for applications such as virtual conferences and online education.
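Under the same assumptions, the training-time wiring of encoder, fusion, diffusion decoder, and end-to-end loss might look roughly as follows; clip_image_encoder, speech_encoder, and denoiser are hypothetical stand-ins for the CLIP-based encoder and diffusion-based decoder described above, and the combined objective simply adds the denoising term to the alignment term sketched earlier.

```python
import torch
import torch.nn as nn

class TalkingHeadGenerator(nn.Module):
    """Illustrative wiring of a CLIP-style encoder, fusion step, and diffusion decoder.

    `clip_image_encoder`, `speech_encoder`, and `denoiser` are assumed to be supplied
    by the caller; their names and feature sizes are placeholders, not patent terms.
    """
    def __init__(self, clip_image_encoder, speech_encoder, denoiser, feat_dim=512, fused_dim=768):
        super().__init__()
        self.image_enc = clip_image_encoder   # identity/style features from the reference image
        self.speech_enc = speech_encoder      # semantic/prosodic features from the speech
        self.denoiser = denoiser              # noise-prediction network of the diffusion decoder
        self.fuse = nn.Linear(2 * feat_dim, fused_dim)

    def training_step(self, ref_image, speech, target_frames, lambda_align=0.1):
        head = self.image_enc(ref_image)                     # (B, feat_dim)
        voice = self.speech_enc(speech)                      # (B, feat_dim)
        cond = self.fuse(torch.cat([head, voice], dim=-1))   # fusion vector

        # Diffusion-style objective: corrupt real frames with noise and predict that noise.
        noise = torch.randn_like(target_frames)
        t = torch.randint(0, 1000, (target_frames.size(0),))
        noisy = target_frames + noise                        # simplified noising (no schedule)
        denoise_loss = nn.functional.mse_loss(self.denoiser(noisy, t, cond), noise)

        # End-to-end loss: denoising term plus the speech-visual alignment term
        # (`alignment_loss` is the contrastive sketch from the previous code block).
        return denoise_loss + lambda_align * alignment_loss(voice, head)
```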