US20240428492
2024-12-26
Physics
G06T13/40
The patent application outlines a system for generating real-time animations for 3D avatars using input video and audio from a client device. A camera captures video of a user's face, and a microphone records audio. Trained machine learning models process these inputs to produce facial action coding system (FACS) weights, which are used to animate the avatar. The system integrates both video and audio data to create synchronized and realistic facial animations.
The invention pertains to computer-based virtual experiences, specifically methods and systems for creating robust facial animations from video in real time. This technology is applicable in various online platforms, including gaming and virtual environments, where users interact through avatars. The approach addresses limitations of conventional animation methods, which often require manual input for gestures and expressions.
Online platforms allow users to engage in multiplayer games and virtual environments, often using avatars to represent themselves. Traditional methods of animating avatars involve predefined gestures and movements, which can be limiting. The described technology seeks to enhance these interactions by automating avatar animation based on real-time video and audio inputs, thus providing a more immersive experience.
The method involves capturing video frames and audio frames and determining a blending term that indicates lapses in the audio signal. Video FACS weights are derived from a trained model that analyzes the video input, while audio FACS weights are obtained from a second model that processes the audio input. These weights are combined according to the blending term to produce final FACS weights, which drive the 3D avatar's facial animation.
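The blending step described above can be sketched as a convex combination of the two per-frame weight vectors. This is a minimal illustration, not the patent's actual implementation: the function name, the assumption that the blending term is a scalar in [0, 1], and the convention that 1.0 means the audio has lapsed (so the video weights are fully trusted) are all assumptions for this example.

```python
import numpy as np

def blend_facs_weights(video_weights, audio_weights, blending_term):
    """Blend per-frame FACS weight vectors from the video and audio models.

    blending_term in [0, 1] (illustrative convention): 1.0 when the audio
    signal has lapsed (rely entirely on video), 0.0 when audio is fully
    trusted.
    """
    video_weights = np.asarray(video_weights, dtype=float)
    audio_weights = np.asarray(audio_weights, dtype=float)
    b = float(np.clip(blending_term, 0.0, 1.0))
    # Convex combination: the final weights interpolate between the sources.
    return b * video_weights + (1.0 - b) * audio_weights
```

For example, with a blending term of 0.5 the result is the element-wise average of the two weight vectors, and with 1.0 the audio branch is ignored entirely.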
The system uses two trained machine learning models: one for video processing and one for audio processing. Each model includes an encoder and task-specific decoders that output parameters such as head poses and facial landmarks. A modality mixing component fuses the outputs of both models into the final FACS weights. The models undergo semi-supervised training to improve their accuracy in detecting facial movements and synchronizing them with audio cues.
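The two-branch structure above can be sketched with toy linear layers. This is an architectural illustration only, under assumptions not stated in the source: the layer sizes, the number of FACS weights (51 here), the sigmoid activation used to keep weights in (0, 1), and the three-angle head pose output are all hypothetical choices for the sketch, not the patent's actual models.

```python
import numpy as np

class Branch:
    """Stand-in for one trained model: a shared encoder plus
    task-specific decoder heads (FACS weights and head pose)."""

    def __init__(self, in_dim, latent_dim, n_facs, seed):
        rng = np.random.default_rng(seed)
        # Toy random weights in place of trained parameters.
        self.enc = rng.standard_normal((in_dim, latent_dim)) * 0.1
        self.facs_head = rng.standard_normal((latent_dim, n_facs)) * 0.1
        self.pose_head = rng.standard_normal((latent_dim, 3)) * 0.1

    def __call__(self, x):
        z = np.tanh(x @ self.enc)                        # shared encoder
        facs = 1.0 / (1.0 + np.exp(-(z @ self.facs_head)))  # FACS in (0, 1)
        pose = z @ self.pose_head                        # e.g. yaw/pitch/roll
        return facs, pose

def mix(video_facs, audio_facs, blending_term):
    """Modality-mixing component: fuse the two branches' FACS outputs."""
    b = float(np.clip(blending_term, 0.0, 1.0))
    return b * video_facs + (1.0 - b) * audio_facs

# Separate branches consume different feature sizes (e.g. face crops vs.
# audio features); both emit FACS vectors of the same length for mixing.
video_branch = Branch(in_dim=128, latent_dim=32, n_facs=51, seed=0)
audio_branch = Branch(in_dim=40, latent_dim=32, n_facs=51, seed=1)

v_facs, v_pose = video_branch(np.zeros(128))
a_facs, a_pose = audio_branch(np.zeros(40))
final_facs = mix(v_facs, a_facs, 0.7)
```

The design point the sketch shows is that the two modalities never share parameters; only their FACS outputs meet, in the mixing component, which makes the blending term the single control for trading off the two sources.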