US20240320892
2024-09-26
Physics
G06T13/205
A framework has been developed to create photorealistic 3D talking faces from audio input alone. The system employs machine-learned models to predict both the geometry and the texture of a face from audio signals containing speech. The resulting textured 3D mesh can be integrated into existing videos or virtual environments, enhancing a range of multimedia applications.
Talking head videos are prevalent in many media formats, including news broadcasts and online courses. Traditional synthesis methods often rely on regressing facial movements from audio, which limits realism and makes personalization difficult. More recent techniques predict 3D facial meshes from audio, but their visual quality typically falls short of photorealism, restricting their use to gaming and VR environments.
The proposed computing system consists of processors and non-transitory computer-readable media that store two main machine-learned models: one for predicting face geometry and another for face texture. These models work together to generate a cohesive 3D representation of a face from audio input, allowing for more accurate and realistic facial animations.
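A minimal sketch of how such a two-model arrangement might be organized is given below. The module names, tensor shapes, audio feature choice, and mesh resolution are illustrative assumptions for the example, not details taken from the application.

```python
# Sketch of a geometry model and a texture model driven by the same audio
# features. Vertex count, texture resolution, and the use of mel-spectrogram
# frames are assumptions made for illustration.
import torch
import torch.nn as nn


class GeometryModel(nn.Module):
    """Predicts per-frame 3D vertex positions of a face mesh from audio features."""
    def __init__(self, audio_dim: int = 80, num_vertices: int = 5023):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, 256, batch_first=True)
        self.head = nn.Linear(256, num_vertices * 3)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim)
        h, _ = self.encoder(audio_feats)
        return self.head(h).view(*h.shape[:2], -1, 3)  # (batch, frames, V, 3)


class TextureModel(nn.Module):
    """Predicts a per-frame face texture (UV map) from the same audio features."""
    def __init__(self, audio_dim: int = 80, tex_res: int = 64):
        super().__init__()
        self.tex_res = tex_res
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(),
            nn.Linear(512, 3 * tex_res * tex_res),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        b, t, _ = audio_feats.shape
        return self.net(audio_feats).view(b, t, 3, self.tex_res, self.tex_res)


# The two outputs together describe a textured 3D face mesh for each audio frame.
audio = torch.randn(1, 25, 80)        # one second of hypothetical 25 fps features
vertices = GeometryModel()(audio)     # (1, 25, 5023, 3)
texture = TextureModel()(audio)       # (1, 25, 3, 64, 64)
```

Keeping geometry and texture in separate models lets each be trained and refined independently while both remain conditioned on the same audio signal.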
The generated talking faces can be utilized in numerous applications, such as creating personalized avatars in gaming or VR, auto-translating videos with synchronized lip movements, and enhancing video editing capabilities. The ability to recreate visual aspects from audio alone opens new avenues for multimedia communication, where only audio is transmitted while the visual component is reconstructed as needed.
Innovative features include the use of audio as the primary input, enabling simpler data preparation and training processes. The system also incorporates a 3D decomposition method that separates head pose from speech-related facial movements, allowing for greater flexibility and realism. Personalized models can be trained for individual speakers, ensuring that unique facial expressions are captured effectively. Additionally, an auto-regressive framework enhances temporal consistency in the generated sequences, resulting in visually stable outputs.
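To make the auto-regressive and pose-decomposition ideas concrete, the sketch below conditions each frame's prediction on the previous frame's output and splits the result into a rigid head pose and speech-driven vertex offsets. The network layout, function names, and pose parameterization (axis-angle rotation plus translation) are assumptions made for illustration, not the claimed design.

```python
# Illustrative auto-regressive rollout: the previous frame's geometry is fed
# back as input, and head pose is predicted separately from speech-driven
# vertex offsets so it can be replaced or stabilised downstream.
import torch
import torch.nn as nn


class AutoRegressiveFacePredictor(nn.Module):
    def __init__(self, audio_dim: int = 80, num_vertices: int = 5023):
        super().__init__()
        self.num_vertices = num_vertices
        in_dim = audio_dim + num_vertices * 3           # audio + previous geometry
        self.backbone = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.pose_head = nn.Linear(512, 6)              # 3 rotation + 3 translation
        self.offset_head = nn.Linear(512, num_vertices * 3)

    def step(self, audio_frame, prev_vertices):
        x = torch.cat([audio_frame, prev_vertices.flatten(-2)], dim=-1)
        h = self.backbone(x)
        pose = self.pose_head(h)                        # rigid head motion
        offsets = self.offset_head(h).view(-1, self.num_vertices, 3)
        return pose, offsets

    def forward(self, audio_feats, template_vertices):
        # audio_feats: (batch, frames, audio_dim); template: (batch, V, 3)
        prev = template_vertices
        poses, frames = [], []
        for t in range(audio_feats.shape[1]):
            pose, offsets = self.step(audio_feats[:, t], prev)
            prev = template_vertices + offsets          # speech-driven deformation only
            poses.append(pose)
            frames.append(prev)
        return torch.stack(poses, dim=1), torch.stack(frames, dim=1)
```

Because each prediction sees the preceding frame, abrupt frame-to-frame changes are discouraged, which is one simple way to obtain the temporal consistency described above.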