US20250061634
2025-02-20
Physics
G06T13/205
The patent application presents systems and methods for animating virtual avatars or agents from input audio, emotions, and styles using machine learning. A deep neural network generates motion or deformation information for a character that represents the speech contained in the audio input. The character's facial components, such as the head, skin, eyes, and tongue, are modeled separately, allowing the network to output distinct motion data for each component. During training, a transformer-based audio encoder with frozen (locked) parameters is used to train a decoder on a weighted feature vector. The network's output is then rendered to create emotion-accurate facial animations.
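The architecture described above (a frozen audio encoder feeding a trainable decoder with one output per facial component) can be sketched as follows. This is a minimal illustrative model, not the patented implementation; all dimensions, component names, and weight shapes are hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
AUDIO_FEAT = 64      # per-frame feature size produced by the audio encoder
LATENT = 32          # shared latent size consumed by the decoder heads
COMPONENTS = {       # per-component motion output sizes (assumed values)
    "head": 9, "skin": 15, "eyes": 6, "tongue": 6,
}

# Encoder weights are frozen: they are never updated while the decoder trains.
W_enc = rng.standard_normal((AUDIO_FEAT, LATENT)) * 0.1

# Trainable decoder: one output head per separately modeled facial component.
heads = {name: rng.standard_normal((LATENT, dim)) * 0.1
         for name, dim in COMPONENTS.items()}

def animate_frame(audio_feat: np.ndarray) -> dict:
    """Map one frame of audio features to per-component motion data."""
    latent = np.tanh(audio_feat @ W_enc)   # frozen encoder forward pass
    return {name: latent @ W for name, W in heads.items()}

motions = animate_frame(rng.standard_normal(AUDIO_FEAT))
```

Keeping one head per component lets each part of the face (eyes, tongue, skin) receive motion data of a different dimensionality, matching the claim that components are modeled separately.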
Animating characters to appear as if they are speaking based on audio data is complex and time-consuming. Current machine-learning approaches often fail to create realistic animations, particularly in languages the model isn't trained on. This is especially problematic when animating virtual humans intended to appear as real people. The disclosed invention aims to automate this process for real-time applications, improving the realism of facial animations.
The described systems and methods have diverse applications across various industries. They can be used in non-autonomous and autonomous vehicles, robots, drones, and other machinery for purposes such as machine control, synthetic data generation, augmented reality, virtual reality, security surveillance, and more. They are also applicable in digital twinning, conversational AI, generative AI with large language models (LLMs), light transport simulation, collaborative content creation for 3D assets, and cloud computing.
The approach involves using deep neural networks such as CNNs or transformers to process raw audio input and extract features for animation. These networks output motion or deformation data used to render facial animations corresponding to the speech. The system can incorporate style and emotion vectors during training to condition the animations on different emotional states or styles. Facial components are modeled independently for more realistic animation effects.
The invention includes integrating transformer-based audio encoders into an Audio2Face model. Modifications involve adding transformer layers to existing CNN models pre-trained on diverse audio data. The system can process short audio samples without losing context from prior frames. Techniques such as a regularization loss reduce motion jitter during silence, enhancing animation quality. The system aims to produce realistic character behavior by considering emotional states alongside speech.
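A regularization loss of the kind mentioned above can be sketched as a temporal smoothness penalty that upweights frames flagged as silent, discouraging frame-to-frame motion changes when no one is speaking. The function name, weighting scheme, and silence mask below are assumptions for illustration, not the application's actual loss.

```python
import numpy as np

def jitter_regularization(motion: np.ndarray,
                          silence_mask: np.ndarray,
                          weight: float = 1.0) -> float:
    """Penalize frame-to-frame motion differences, weighted more heavily
    on silent frames, to suppress jitter during silence.

    motion:       (T, D) array of per-frame motion/deformation outputs
    silence_mask: (T,)  array, 1.0 where the audio is silent, else 0.0
    """
    diffs = np.diff(motion, axis=0)          # (T-1, D) frame-to-frame deltas
    per_frame = (diffs ** 2).sum(axis=1)     # squared motion change per step
    w = 1.0 + weight * silence_mask[1:]      # upweight transitions into silence
    return float((w * per_frame).mean())

# A still face for 3 frames, then a sudden jump at a non-silent frame.
motion = np.vstack([np.zeros((3, 4)), np.ones((1, 4))])
silent = np.array([1.0, 1.0, 1.0, 0.0])
loss = jitter_regularization(motion, silent)
```

Adding this term to the training objective penalizes exactly the spurious motion that shows up as visible jitter when the character should be holding still.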