US20240428493
2024-12-26
Physics
G06T13/40
The patent application describes a method and device for synthesizing talking head videos. The method obtains speech data and observation data, extracts speech and non-speech features from them, performs temporal modeling on those features to obtain low-dimensional representations, and synthesizes the video from the representations together with additional non-speech features. The approach aims to improve efficiency and reduce the complexity of generating realistic virtual-human videos.
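The pipeline summarized above can be sketched as a chain of four stages. Everything here is an illustrative assumption: the function names, the list/dict data shapes, and the stand-in logic are hypothetical and not taken from the filing.

```python
# Toy sketch of the claimed pipeline (all names and shapes are assumptions).

def extract_speech_feature(speech_data):
    """Stand-in for a learned speech encoder: drop 'noise' tokens, keep the rest."""
    return [s for s in speech_data if s != "noise"]

def extract_non_speech_feature(observation_data):
    """Stand-in for appearance/motion feature extraction from observation data."""
    return {"appearance": observation_data.get("appearance"),
            "motion": observation_data.get("motion")}

def temporal_model(speech_feat, non_speech_feat):
    """Fuse features across time into a compact ('low-dimensional') shape code."""
    return {"shape_code": (tuple(speech_feat), non_speech_feat["motion"])}

def synthesize_video(low_dim_repr, texture_feat):
    """Combine the time-varying shape code with a texture feature per frame."""
    return [{"shape": low_dim_repr["shape_code"], "texture": texture_feat}
            for _ in range(3)]  # three placeholder frames

frames = synthesize_video(
    temporal_model(
        extract_speech_feature(["a", "noise", "b"]),
        extract_non_speech_feature({"appearance": "face", "motion": "lips"})),
    texture_feat="skin_texture")
```

In a real system each stand-in would be a neural module; the sketch only fixes the data flow the summary describes.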
The invention falls under the domain of video processing, specifically focusing on methods and devices for extracting video frame features. It leverages advancements in artificial intelligence to create virtual humans, which are digital characters that mimic human appearance and behavior. The technology aims to enhance the synthesis of talking head videos by addressing limitations in existing methods that rely heavily on autoregressive models.
The concept of virtual humans is gaining traction due to AI advancements, with talking head video synthesis being a key component. Traditional methods often rely on autoregressive models to manage dependencies between video frames, which increases complexity and processing time, particularly at high resolutions. The invention streamlines the process with a novel approach to feature extraction and temporal modeling.
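The complexity concern can be made concrete: an autoregressive generator must produce frames strictly in order, because each frame is a function of the previous one, whereas a representation without inter-frame dependencies lets every frame be produced independently (and hence in parallel). The toy comparison below is illustrative only; both "generators" are placeholders.

```python
def autoregressive(n_frames):
    """Each frame depends on the previous frame, so synthesis is sequential."""
    frames = [0]
    for _ in range(n_frames - 1):
        frames.append(frames[-1] + 1)  # must wait for frames[-1]
    return frames

def from_temporal_codes(codes):
    """Each frame depends only on its own precomputed code: order-free,
    so frames can be generated independently, which matters at high resolution."""
    return [c for c in codes]

seq = autoregressive(4)
par = from_temporal_codes([0, 1, 2, 3])
```

Both routes yield the same frames here; the difference is that the second has no sequential dependency to serialize the work.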
An electronic device is central to the invention, comprising a processor and storage for executing computer programs that implement the synthesis method. The device performs several steps: acquiring speech and observation data, extracting relevant features, conducting temporal modeling to derive low-dimensional representations, and finally synthesizing the video. The processor can be a CPU or other integrated circuit capable of executing these tasks efficiently.
The method involves extracting speech features from audio data while removing noise, and deriving non-speech features from observation data related to appearance and movement. Temporal modeling fuses these features into low-dimensional representations that capture essential shape-related information, such as lip movements. Video synthesis then combines these representations with texture-related features that are less sensitive to temporal changes, ensuring natural visual output.
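One way to read the shape/texture split above: the compact shape code is recomputed for every frame by the temporal model, while texture features, being less sensitive to temporal change, are extracted once and reused across frames. The sketch below illustrates that split under those assumptions; the windowed mean standing in for temporal modeling, and all names, are hypothetical.

```python
def shape_codes(speech_frames, window=2):
    """Toy temporal model: each frame's shape code is the mean of a sliding
    window over per-frame speech features (e.g. a lip-openness value)."""
    codes = []
    for i in range(len(speech_frames)):
        w = speech_frames[max(0, i - window + 1): i + 1]
        codes.append(sum(w) / len(w))
    return codes

def synthesize(codes, texture):
    """Pair every time-varying shape code with the one fixed texture feature."""
    return [(c, texture) for c in codes]

lip_openness = [0.0, 1.0, 1.0, 0.0]   # per-frame speech feature (illustrative)
texture = "face_texture"              # extracted once from observation data
video = synthesize(shape_codes(lip_openness), texture)
```

The point of the split is that only the low-dimensional codes vary frame to frame, so the temporally stable texture work is not repeated per frame.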