Invention Title:

ROBUST FACIAL ANIMATION FROM VIDEO AND AUDIO

Publication number:

US20240428492

Publication date:
Section:

Physics

Class:

G06T13/40

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The patent application outlines a system for generating real-time animations for 3D avatars using input video and audio from a client device. A camera captures video of a user's face, and a microphone records audio. Trained machine learning models process these inputs to produce facial action coding system (FACS) weights, which are used to animate the avatar. The system integrates both video and audio data to create synchronized and realistic facial animations.

Technical Field

The invention pertains to computer-based virtual experiences, specifically focusing on methods and systems for creating robust facial animations from video in real-time. This technology is applicable in various online platforms, including gaming and virtual environments, where users interact through avatars. The approach addresses limitations of conventional animation methods, which often require manual input for gestures and expressions.

Background

Online platforms allow users to engage in multiplayer games and virtual environments, often using avatars to represent themselves. Traditional methods of animating avatars involve predefined gestures and movements, which can be limiting. The described technology seeks to enhance these interactions by automating avatar animation based on real-time video and audio inputs, thus providing a more immersive experience.

Methodology

The method captures video frames and audio frames, along with a blending term that identifies lapses in the audio signal. Video FACS weights are derived by a trained model analyzing the video input, while audio FACS weights are produced by a second model processing the audio input. The blending term combines the two sets of weights into final FACS weights, which drive the 3D avatar's facial animation.
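The blending step described above can be sketched as a simple per-weight interpolation. This is an illustrative reading, not the patent's actual formula: here `alpha` is the blending term in [0, 1], pushed toward 1 when the audio signal lapses so that the video-derived weights dominate. The function and variable names are assumptions.

```python
def blend_facs_weights(video_weights, audio_weights, alpha):
    """Combine per-frame video and audio FACS weights into final weights.

    alpha: blending term in [0, 1]; 1.0 means rely entirely on video
    (e.g., during an audio lapse), 0.0 entirely on audio.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("blending term must lie in [0, 1]")
    return [alpha * v + (1.0 - alpha) * a
            for v, a in zip(video_weights, audio_weights)]

# Example with three hypothetical FACS weights
# (e.g., jaw open, lip corner puller, brow raiser):
video = [0.8, 0.2, 0.1]
audio = [0.6, 0.4, 0.0]
final = blend_facs_weights(video, audio, alpha=0.75)
```

A per-weight interpolation like this keeps the output in the same range as its inputs, so the final weights remain valid FACS activations without any renormalization.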

Implementation Details

The system uses two trained machine learning models: one for video processing and another for audio processing. Each model includes an encoder and task-specific decoders that output relevant parameters such as head poses and facial landmarks. A modality mixing component fuses the outputs from both models into final FACS weights. The models undergo a semi-supervised training process to improve their accuracy in detecting facial movements and synchronizing them with audio cues.
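The two-model layout above can be sketched as an encoder feeding several task-specific decoders per modality, with a mixing step fusing the per-modality FACS outputs. All names, shapes, and the toy stand-in functions below are assumptions for illustration, not the patent's actual architecture:

```python
class ModalityModel:
    """One per-modality model: an encoder plus task-specific decoders."""

    def __init__(self, encode, decoders):
        self.encode = encode      # raw frame -> latent feature vector
        self.decoders = decoders  # task name -> (latent -> task output)

    def run(self, frame):
        latent = self.encode(frame)
        return {task: dec(latent) for task, dec in self.decoders.items()}

# Toy stand-ins for trained networks: the "encoder" just scales its input.
video_model = ModalityModel(
    encode=lambda f: [x * 0.5 for x in f],
    decoders={"facs": lambda z: z, "head_pose": lambda z: sum(z)},
)
audio_model = ModalityModel(
    encode=lambda f: [x * 0.25 for x in f],
    decoders={"facs": lambda z: z},
)

def mix_modalities(video_out, audio_out, alpha):
    """Fuse the per-modality FACS outputs into final FACS weights."""
    return [alpha * v + (1 - alpha) * a
            for v, a in zip(video_out["facs"], audio_out["facs"])]

final_facs = mix_modalities(video_model.run([0.8, 0.2]),
                            audio_model.run([0.4, 0.4]),
                            alpha=0.7)
```

Keeping auxiliary decoders (head pose, landmarks) alongside the FACS decoder mirrors the multi-task setup the section describes: the shared latent is supervised by several related signals, while only the FACS outputs pass through the mixing step.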