Invention Title:

FACE IMAGE GENERATION METHOD AND DEVICE FOR GENERATING FULLY-CONTROLLABLE TALKING FACE

Publication number:

US20240346730

Publication date:
Section:

Physics

Class:

G06T13/205

Inventors:

Applicants:

Smart overview of the Invention

A method for generating a controllable talking face image is described. It takes a source image, a series of driving images from the same video, and input audio. The source and driving images are encoded into a visual latent space to produce a style latent code that combines the source and driving latent codes, while an audio feature extracted from the input audio yields an audio latent code.
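The encoding stage can be pictured with a minimal numpy sketch. The linear "encoders," latent dimensions, and the concatenation used to form the style latent code are all illustrative assumptions; the patent does not specify these details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned encoders: a single linear map
# projects each input into a shared latent space (dim 64 is an
# arbitrary choice, not taken from the patent).
W_img = rng.standard_normal((64, 128))   # image encoder weights
W_aud = rng.standard_normal((64, 32))    # audio encoder weights

def encode_image(pixels):
    """Project a flattened image into the visual latent space."""
    return W_img @ pixels

def encode_audio(feature):
    """Project an extracted audio feature into an audio latent code."""
    return W_aud @ feature

source_img = rng.standard_normal(128)          # flattened source image
driving_imgs = rng.standard_normal((5, 128))   # frames from one video
audio_feat = rng.standard_normal(32)           # e.g. one audio feature slice

source_code = encode_image(source_img)
driving_codes = np.stack([encode_image(f) for f in driving_imgs])
audio_code = encode_audio(audio_feat)

# Style latent code: here simply the source code concatenated with the
# averaged driving codes -- one plausible way to "combine" them.
style_code = np.concatenate([source_code, driving_codes.mean(axis=0)])
print(style_code.shape)  # (128,)
```

In a real system the linear maps would be deep convolutional or audio encoders; the sketch only fixes the data flow from raw inputs to the three latent codes.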

Canonical and Motion Code Acquisition

The source latent code is then mapped to a canonical space, yielding a canonical code. A motion code is obtained by merging the driving latent code with the audio latent code and mapping the result to a multimodal motion space. Integrating the canonical and motion codes produces a multimodal fused latent code.
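The canonical/motion split can be sketched as below. The linear mappings, the concatenation used to merge the driving and audio codes, and the additive fusion are assumptions standing in for whatever learned mappings and fusion rule the patent specifies.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64

# Hypothetical learned mappings, modeled as linear maps for the sketch.
M_canon = rng.standard_normal((dim, dim))       # visual -> canonical space
M_motion = rng.standard_normal((dim, 2 * dim))  # (driving, audio) -> motion space

source_code = rng.standard_normal(dim)
driving_code = rng.standard_normal(dim)
audio_code = rng.standard_normal(dim)

# Step 1: map the source latent code to the canonical space.
canonical_code = M_canon @ source_code

# Step 2: merge driving and audio latent codes, then map the merged
# vector into the multimodal motion space.
motion_code = M_motion @ np.concatenate([driving_code, audio_code])

# Step 3: integrate canonical and motion codes into the fused code;
# plain addition stands in for the patent's fusion operation.
fused_code = canonical_code + motion_code
```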

Generating the Talking Face Image

The final step generates the talking face image by feeding the multimodal fused latent code into a generative adversarial network (GAN). Specific equations govern the linear combination of codes and ensure that only motion features enter the combined code.
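One way such a constraint can work is to combine only the *differences* between each motion code and the canonical code, so that identity information cannot leak into the motion term. The function below is a hypothetical illustration of that idea, not the patent's actual equations.

```python
import numpy as np

def combine(canonical, motions, weights):
    """Linearly combine motion codes and add the result to the canonical code.

    Restricting the combination to each motion code's offset from the
    canonical code is one way to guarantee that only motion features
    enter the combined code (the exact equations appear in the patent).
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "weights must sum to 1"
    motion_only = np.tensordot(weights, motions - canonical, axes=1)
    return canonical + motion_only

# Usage: three motion codes combined with convex weights.
canonical = np.zeros(4)
motions = np.ones((3, 4)) * np.array([[1.0], [2.0], [3.0]])
combined = combine(canonical, motions, [0.5, 0.25, 0.25])
print(combined)  # [1.75 1.75 1.75 1.75]
```

When every motion code equals the canonical code, the offsets vanish and the combination returns the canonical code unchanged, which is exactly the "motion-only" property the constraint is meant to enforce.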

Implementation Details

The method can be implemented by a device that executes program code on at least one processor. The device processes the source image, driving images, and audio input to derive all the codes needed to generate the talking face image. Its architecture uses multilayer perceptrons for the various encoding and decoding tasks.
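A minimal multilayer perceptron, of the kind the device could use for its encoding and decoding heads, can be written in a few lines. The layer sizes, initialization, and ReLU activation are illustrative choices, not details from the patent.

```python
import numpy as np

class MLP:
    """Minimal multilayer perceptron with ReLU hidden layers.

    `dims` lists the layer widths, e.g. [128, 256, 64] maps a
    128-dim input to a 64-dim latent through one 256-unit hidden layer.
    """

    def __init__(self, dims, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix per layer, scaled by 1/sqrt(fan_in).
        self.weights = [rng.standard_normal((o, i)) / np.sqrt(i)
                        for i, o in zip(dims[:-1], dims[1:])]

    def __call__(self, x):
        for W in self.weights[:-1]:
            x = np.maximum(W @ x, 0.0)  # ReLU on hidden layers
        return self.weights[-1] @ x     # linear output layer

encoder = MLP([128, 256, 64])
latent = encoder(np.ones(128))
print(latent.shape)  # (64,)
```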

Advantages and Applications

This face image generation method addresses limitations of earlier approaches by enabling fine-grained control over facial movements without requiring additional supervision such as facial keypoints. Potential applications span film, entertainment, virtual assistants, and video conferencing, enabling more immersive human-machine interaction.