US20240346730
2024-10-17
Physics
G06T13/205
A method for generating a controllable talking face image is described. It takes a source image, a series of driving images from the same video, and input audio. The source and driving images are encoded into a visual space, yielding a source latent code and a driving latent code that together form a style latent code; an audio feature is extracted from the input audio to generate an audio latent code.
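As a rough illustration of this encoding stage, the sketch below (not taken from the patent) uses a small convolutional encoder for the visual space and an MLP over a mel-spectrogram window for the audio feature. The module names, network sizes, and the concatenation used to form the style latent code are all assumptions.

```python
# Hypothetical sketch of the encoding stage: a shared visual encoder maps the
# source image and each driving frame to latent codes, and a separate audio
# encoder maps an audio feature window to an audio latent code.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Maps a 3x256x256 image to a latent code in the visual space (assumed architecture)."""
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 128x128
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 64x64
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 32x32
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )
    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.net(img)

class AudioEncoder(nn.Module):
    """Maps an 80-bin mel-spectrogram window to an audio latent code (assumed architecture)."""
    def __init__(self, n_mels: int = 80, n_frames: int = 16, latent_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * n_frames, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim),
        )
    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.net(mel)

if __name__ == "__main__":
    enc_v, enc_a = VisualEncoder(), AudioEncoder()
    source = torch.randn(1, 3, 256, 256)          # source identity image
    driving = torch.randn(1, 3, 256, 256)         # one driving frame
    mel = torch.randn(1, 80, 16)                  # audio feature window
    w_src, w_drv = enc_v(source), enc_v(driving)  # source / driving latent codes
    w_aud = enc_a(mel)                            # audio latent code
    w_style = torch.cat([w_src, w_drv], dim=-1)   # one possible way to form the style latent code
    print(w_src.shape, w_drv.shape, w_aud.shape, w_style.shape)
```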
The method continues by mapping the source latent code to a canonical space, resulting in a canonical code. The driving latent code is merged with the audio latent code and mapped to a multimodal motion space, producing a motion code. Integrating the canonical and motion codes then yields a multimodal fused latent code.
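This mapping stage could be sketched with simple MLP mapping networks, as below. The element-wise addition used to fuse the canonical and motion codes, as well as the module names and dimensions, are illustrative assumptions rather than the patent's specific formulation.

```python
# Illustrative sketch of the mapping stage under assumed MLP mapping networks:
# source latent -> canonical code, (driving latent, audio latent) -> motion code,
# then fusion of the two into a multimodal fused latent code.
import torch
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 512) -> nn.Sequential:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class MultimodalMapper(nn.Module):
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.to_canonical = mlp(latent_dim, latent_dim)       # source latent -> canonical space
        self.to_motion = mlp(2 * latent_dim, latent_dim)      # (driving, audio) -> motion space

    def forward(self, w_src, w_drv, w_aud):
        w_can = self.to_canonical(w_src)                           # canonical code
        w_mot = self.to_motion(torch.cat([w_drv, w_aud], dim=-1))  # multimodal motion code
        w_fused = w_can + w_mot                                    # assumed fusion: element-wise addition
        return w_can, w_mot, w_fused

if __name__ == "__main__":
    mapper = MultimodalMapper()
    w_src, w_drv, w_aud = (torch.randn(1, 512) for _ in range(3))
    w_can, w_mot, w_fused = mapper(w_src, w_drv, w_aud)
    print(w_can.shape, w_mot.shape, w_fused.shape)
```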
In the final step, the multimodal fused latent code is transferred to a generative adversarial network (GAN), which generates the talking face image. The method specifies equations for linearly combining the codes in a way that ensures only motion features are included in the combined code.
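The patent's exact equations are not reproduced here, but one hedged way to realize a linear combination that contributes only motion features is to add the motion code as a residual relative to a neutral motion code on top of the canonical code, then synthesize with a GAN generator. The delta formulation, the alpha weight, and the toy generator below are assumptions standing in for a pretrained generator such as StyleGAN2.

```python
# Hedged sketch of the synthesis stage: linear combination of codes followed by
# image generation with a stand-in GAN generator.
import torch
import torch.nn as nn

def fuse_codes(w_can: torch.Tensor, w_mot: torch.Tensor, w_neutral: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    """Assumed linear combination: add only the motion offset (motion code minus
    a neutral/mean motion code) onto the canonical code."""
    return w_can + alpha * (w_mot - w_neutral)

class TinyGenerator(nn.Module):
    """Toy stand-in for a pretrained GAN generator mapping a latent code to an image."""
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 4 * 4)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=4), nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4), nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )
    def forward(self, w: torch.Tensor) -> torch.Tensor:
        x = self.fc(w).view(-1, 128, 4, 4)
        return self.up(x)

if __name__ == "__main__":
    w_can, w_mot, w_neutral = (torch.randn(1, 512) for _ in range(3))
    w_fused = fuse_codes(w_can, w_mot, w_neutral)
    frame = TinyGenerator()(w_fused)   # generated talking-face frame
    print(frame.shape)                 # torch.Size([1, 3, 64, 64])
```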
The method can be implemented by a device that executes program code on at least one processor. The device processes the source image, driving images, and audio input to derive all of the codes needed to generate the talking face image; its architecture includes several multilayer perceptrons for the encoding and decoding tasks.
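Putting the pieces together, such a device might run the pipeline per frame roughly as in the self-contained sketch below. All modules are untrained placeholders standing in for the patent's encoders, mapping MLPs, and GAN generator, and the frame/window pairing is an assumption.

```python
# Minimal end-to-end sketch: encode inputs, map to canonical/motion codes with
# small MLPs, fuse, and synthesize one output frame per driving frame.
import torch
import torch.nn as nn

D = 512  # assumed latent dimension

# Placeholder networks (stand-ins for the patent's encoders, MLPs, and generator).
visual_enc   = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, D))
audio_enc    = nn.Sequential(nn.Flatten(), nn.Linear(80 * 16, D))
to_canonical = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))
to_motion    = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, D))
generator    = nn.Sequential(nn.Linear(D, 3 * 64 * 64), nn.Tanh())

@torch.no_grad()
def generate_video(source_img, driving_frames, audio_windows):
    """Generate one output frame per (driving frame, audio window) pair."""
    w_src = visual_enc(source_img)
    w_can = to_canonical(w_src)                        # identity kept in canonical space
    frames = []
    for drv, mel in zip(driving_frames, audio_windows):
        w_drv = visual_enc(drv)
        w_aud = audio_enc(mel)
        w_mot = to_motion(torch.cat([w_drv, w_aud], dim=-1))
        w_fused = w_can + w_mot                        # multimodal fused latent code
        frames.append(generator(w_fused).view(1, 3, 64, 64))
    return torch.cat(frames, dim=0)

if __name__ == "__main__":
    src = torch.randn(1, 3, 64, 64)
    drv = [torch.randn(1, 3, 64, 64) for _ in range(4)]
    aud = [torch.randn(1, 80, 16) for _ in range(4)]
    video = generate_video(src, drv, aud)
    print(video.shape)  # torch.Size([4, 3, 64, 64])
```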
This face image generation method addresses limitations of previous technologies by enabling detailed control over various facial movements without requiring additional supervision such as facial keypoints. Potential applications span industries including film, entertainment, virtual assistants, and video conferencing, supporting more immersive human-machine interaction.