Invention Title:

Photorealistic Talking Faces from Audio

Publication number:

US20240320892

Publication date:
Section:

Physics

Class:

G06T13/205

Inventors:

Applicant:

Drawings (4 of 13)

Smart overview of the Invention

A framework has been developed to create photorealistic 3D talking faces using only audio input. The system employs two machine-learned models to predict the geometry and the texture of a face from audio signals containing speech. The resulting 3D mesh model can be integrated into existing videos or virtual environments, enhancing a range of multimedia applications.

Background

Talking head videos are prevalent in various media formats, including news broadcasts and online courses. Traditional methods for synthesizing these videos often regress facial movements directly from audio, which limits realism and personalization. Newer techniques that predict 3D facial meshes from audio have emerged but often fall short in visual quality, restricting their use to gaming and VR environments.

System Components

The proposed computing system consists of processors and non-transitory computer-readable media that store two main machine-learned models: one for predicting face geometry and another for face texture. These models work together to generate a cohesive 3D representation of a face from audio input, allowing for more accurate and realistic facial animations.
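The two-model arrangement described above can be sketched as a small pipeline: one model maps audio features to per-vertex mesh geometry, the other to a face texture. The class and the linear maps below are purely illustrative stand-ins (hypothetical, not the patent's trained networks), assuming a fixed-size audio feature vector per frame.

```python
import numpy as np

class TalkingFacePipeline:
    """Illustrative sketch: two learned models map audio features to a
    textured 3D face. The random linear maps below are hypothetical
    placeholders for the patent's geometry and texture models."""

    def __init__(self, audio_dim=128, n_vertices=5000, tex_size=256, seed=0):
        rng = np.random.default_rng(seed)
        # "Geometry model": audio features -> per-vertex 3D positions.
        self.W_geo = rng.normal(size=(audio_dim, n_vertices * 3)) * 0.01
        # "Texture model": audio features -> a face texture (grayscale here).
        self.W_tex = rng.normal(size=(audio_dim, tex_size * tex_size)) * 0.01
        self.n_vertices = n_vertices
        self.tex_size = tex_size

    def __call__(self, audio_features):
        geometry = (audio_features @ self.W_geo).reshape(self.n_vertices, 3)
        texture = (audio_features @ self.W_tex).reshape(self.tex_size, self.tex_size)
        return geometry, texture

pipeline = TalkingFacePipeline()
frame_audio = np.zeros(128)      # one frame of audio features
mesh, tex = pipeline(frame_audio)
print(mesh.shape, tex.shape)     # (5000, 3) (256, 256)
```

The key structural point is that both outputs are driven by the same audio input, so geometry and texture stay mutually consistent frame by frame.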

Applications

The generated talking faces can be utilized in numerous applications, such as creating personalized avatars in gaming or VR, auto-translating videos with synchronized lip movements, and enhancing video editing capabilities. The ability to recreate visual aspects from audio alone opens new avenues for multimedia communication, where only audio is transmitted while the visual component is reconstructed as needed.

Technical Innovations

Innovative features include the use of audio as the primary input, enabling simpler data preparation and training processes. The system also incorporates a 3D decomposition method that separates head pose from speech-related facial movements, allowing for greater flexibility and realism. Personalized models can be trained for individual speakers, ensuring that unique facial expressions are captured effectively. Additionally, an auto-regressive framework enhances temporal consistency in the generated sequences, resulting in visually stable outputs.
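The auto-regressive framing mentioned above can be illustrated with a short sketch: each frame's face parameters are predicted from the current audio and the previous frame's output, which damps frame-to-frame jitter. The `step` function and the blending weight are hypothetical stand-ins for the learned predictor, not the patent's actual model.

```python
import numpy as np

def autoregressive_rollout(audio_frames, step_fn, init_state):
    """Feed each frame's prediction back in as conditioning for the next
    frame, yielding a temporally consistent sequence."""
    state = init_state
    outputs = []
    for a in audio_frames:
        state = step_fn(a, state)
        outputs.append(state)
    return np.stack(outputs)

def step(audio_feat, prev, alpha=0.7):
    # Illustrative step: blend a per-frame prediction with the previous
    # state; the recursion is what enforces temporal smoothness.
    raw = np.tanh(audio_feat)  # stand-in for a per-frame prediction
    return alpha * prev + (1 - alpha) * raw

frames = np.random.default_rng(1).normal(size=(10, 4))
seq = autoregressive_rollout(frames, step, init_state=np.zeros(4))
print(seq.shape)  # (10, 4)
```

Because each output depends on its predecessor, abrupt changes in the raw per-frame predictions are attenuated, which is the visually stable behavior the section describes.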