US20240412440
2024-12-12
Physics
G06T13/40
The disclosed technology involves animating characters by separating facial regions to enhance realism in conversational AI. Utilizing neural networks, the system generates high-fidelity facial animations from audio inputs. A key innovation is the decoupling of emotional states from audio influences during neural network training: audio drives lower-face movements, while implicit emotional states influence upper-face animations. Adversarial training further refines expressiveness through a discriminator that predicts whether generated emotional states match the real distribution.
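The following is a minimal sketch, in PyTorch, of the decoupling described above: one branch maps audio features to lower-face vertex offsets, a second branch maps an implicit emotional state to upper-face offsets, and a small discriminator supports the adversarial refinement. All module names, layer sizes, vertex counts, and the exact lower/upper split are illustrative assumptions, not the patent's actual architecture.

    # Sketch only: dimensions and the lower/upper vertex split are assumptions.
    import torch
    import torch.nn as nn

    AUDIO_DIM, EMO_DIM, N_LOWER, N_UPPER = 128, 32, 300, 200  # hypothetical sizes

    class DecoupledFaceAnimator(nn.Module):
        def __init__(self):
            super().__init__()
            # Audio features drive the lower face (lips, cheeks, jaw).
            self.lower_net = nn.Sequential(nn.Linear(AUDIO_DIM, 256), nn.ReLU(),
                                           nn.Linear(256, N_LOWER * 3))
            # An implicit emotional state drives the upper face (eyes, eyebrows).
            self.upper_net = nn.Sequential(nn.Linear(EMO_DIM, 128), nn.ReLU(),
                                           nn.Linear(128, N_UPPER * 3))

        def forward(self, audio_feat, implicit_emotion):
            lower = self.lower_net(audio_feat).view(-1, N_LOWER, 3)
            upper = self.upper_net(implicit_emotion).view(-1, N_UPPER, 3)
            return torch.cat([lower, upper], dim=1)  # per-vertex offsets

    # Discriminator for adversarial refinement: predicts whether a generated
    # implicit emotional state looks like one drawn from the real distribution.
    emotion_discriminator = nn.Sequential(nn.Linear(EMO_DIM, 64), nn.ReLU(),
                                          nn.Linear(64, 1))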
Compared to existing systems such as VOCA, FaceFormer, and MeshTalk, this approach offers improved generalization across multiple animated characters. A model can be trained first on a single character and then adapted to others using only a small dataset of 3 to 5 minutes of high-quality visual-audio pairs. This efficiency reduces the computing resources and time required relative to traditional methods, which often need extensive data and computing power.
The system enables explicit control over emotions, unlike conventional methods that lack semantic feature extraction for emotional control. Explicit emotions guide major expression styles, while implicit emotions add nuanced details. Conventional systems often reconstruct only one speaking style per character and fail to allow explicit emotional manipulation, a gap this technology addresses.
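A hedged sketch of how an explicit emotion label might guide the major expression style while an implicit state adds nuanced detail is shown below; the emotion set, embedding sizes, and the simple concatenate-and-project mixing are assumptions for illustration only.

    # Illustrative combination of explicit and implicit emotion; not the patent's design.
    import torch
    import torch.nn as nn

    EXPLICIT_EMOTIONS = ["neutral", "happy", "angry", "sad"]  # hypothetical label set
    EMO_DIM = 32

    class EmotionConditioner(nn.Module):
        def __init__(self):
            super().__init__()
            # The explicit label sets the major expression style.
            self.explicit_embed = nn.Embedding(len(EXPLICIT_EMOTIONS), EMO_DIM)
            # The implicit state, inferred elsewhere, adds nuanced detail.
            self.mix = nn.Linear(2 * EMO_DIM, EMO_DIM)

        def forward(self, explicit_id, implicit_state):
            style = self.explicit_embed(explicit_id)
            return self.mix(torch.cat([style, implicit_state], dim=-1))

    cond = EmotionConditioner()
    label = torch.tensor([EXPLICIT_EMOTIONS.index("happy")])
    implicit = torch.randn(1, EMO_DIM)
    emotion_code = cond(label, implicit)  # would condition the upper-face branch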
Applications for this technology span fields such as gaming, interactive communications, multimedia applications, and digital assistants. The neural networks process inputs such as audio data to output facial animation vertex positions. The architecture includes multiple layers that handle different aspects of animation generation, producing detailed and realistic facial expressions.
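The input/output contract described here can be illustrated with a deliberately simplified mapping from per-frame audio features to per-vertex positions on a neutral template mesh; the feature dimension, vertex count, and single linear layer are placeholders, not the multi-layer architecture the disclosure refers to.

    # Simplified I/O sketch: audio features in, animated vertex positions out.
    import torch
    import torch.nn as nn

    AUDIO_DIM, N_VERTS = 128, 5000                 # hypothetical sizes
    template = torch.zeros(N_VERTS, 3)             # neutral-pose mesh vertices
    audio_to_offsets = nn.Linear(AUDIO_DIM, N_VERTS * 3)

    def animate(audio_frames: torch.Tensor) -> torch.Tensor:
        """audio_frames: (T, AUDIO_DIM) -> animated vertex positions (T, N_VERTS, 3)."""
        offsets = audio_to_offsets(audio_frames).view(-1, N_VERTS, 3)
        return template.unsqueeze(0) + offsets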
Training involves decoupling audio effects from implicit emotional states to improve animation fidelity. Audio primarily drives lower facial regions such as the lips and cheeks, while emotions influence upper features such as the eyes and eyebrows. Initial training stages focus on lower facial expressions, using audio inputs combined with explicit emotional labels and random geometry data. Loss functions computed on these outputs update the neural network parameters to refine the animations.
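A sketch of such an initial training stage, under assumed dimensions and a standard per-vertex reconstruction loss, might look like the following; the optimizer choice, one-hot emotion encoding, and random-geometry noise input are illustrative assumptions rather than the disclosed loss functions.

    # Hypothetical first-stage training step for the lower face.
    import torch
    import torch.nn as nn

    AUDIO_DIM, EMO_CLASSES, GEOM_DIM, N_LOWER = 128, 4, 16, 300

    lower_face_net = nn.Sequential(
        nn.Linear(AUDIO_DIM + EMO_CLASSES + GEOM_DIM, 256), nn.ReLU(),
        nn.Linear(256, N_LOWER * 3))
    optimizer = torch.optim.Adam(lower_face_net.parameters(), lr=1e-4)
    criterion = nn.MSELoss()  # per-vertex reconstruction loss

    def train_step(audio_feat, emotion_label, target_lower_verts):
        # Explicit emotion as a one-hot label; random geometry data as an extra input.
        emotion = nn.functional.one_hot(emotion_label, EMO_CLASSES).float()
        geometry = torch.randn(audio_feat.size(0), GEOM_DIM)
        pred = lower_face_net(torch.cat([audio_feat, emotion, geometry], dim=-1))
        loss = criterion(pred.view(-1, N_LOWER, 3), target_lower_verts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()   # loss gradients update the network parameters
        return loss.item()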