US20260105671
2026-04-16
Physics
G06T13/205
Systems and methods are presented for audio-driven animation of digital avatars. The process identifies animations for a mesh based on audio data and specific speaking styles. From these animations the system derives vertex deltas, which are per-vertex offsets that move the mesh away from its neutral pose. A machine-learning model is then updated so that it generates output vertex deltas that animate the mesh according to varying speaking styles and audio inputs.
Traditional methods for synchronizing lip movements with audio on 3D meshes are often computationally expensive, and they typically cannot adapt to different speaking styles without significant additional overhead, such as a separate model for each style. The disclosed techniques aim to improve this process by using a single machine-learning model to handle multiple speaking styles and identities, reducing the computational burden.
The system employs one or more processors to identify animations for a mesh, creating vertex deltas using these animations and a neutral mesh pose. These deltas, along with audio data and speaking style indications, are used to update a machine-learning model. The model can then produce output vertex deltas for the mesh, allowing it to animate in response to different speaking styles and audio inputs. This method provides a more efficient way to synchronize animations with audio data without needing separate models for each speaking style.
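The model update described above can be illustrated with a minimal sketch. The source does not specify the model architecture, the audio features, or the loss, so everything here is an assumption chosen for brevity: a linear model, a one-hot style vector, a squared-error loss, and the hypothetical names `one_hot_style` and `training_step`.

```python
import numpy as np

# Toy dimensions; all three values are illustrative assumptions.
NUM_VERTICES = 4   # vertices in the mesh
AUDIO_DIM = 8      # size of one frame's audio feature vector
NUM_STYLES = 3     # number of supported speaking styles


def one_hot_style(style_index: int, num_styles: int = NUM_STYLES) -> np.ndarray:
    """Encode a speaking style indication as a one-hot style vector (assumption)."""
    vec = np.zeros(num_styles)
    vec[style_index] = 1.0
    return vec


def training_step(weights, audio_feat, style_vec, target_deltas, lr=0.01):
    """One gradient-descent step on 0.5 * squared error for a linear model.

    weights:       (NUM_VERTICES * 3, AUDIO_DIM + NUM_STYLES) matrix
    audio_feat:    (AUDIO_DIM,) audio features for one frame
    style_vec:     (NUM_STYLES,) style vector
    target_deltas: (NUM_VERTICES, 3) vertex deltas derived from the animation
    Returns the updated weights and the loss measured before the update.
    """
    x = np.concatenate([audio_feat, style_vec])   # model input
    y = target_deltas.reshape(-1)                 # flattened target deltas
    err = weights @ x - y                         # prediction error
    loss = 0.5 * float(err @ err)
    grad = np.outer(err, x)                       # d(loss) / d(weights)
    return weights - lr * grad, loss
```

Because the style vector is part of the input, a single set of weights serves every speaking style, which is the property the passage attributes to the single-model approach. A production system would likely use a deeper network over a window of audio frames.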
In various implementations, the system generates vertex deltas from the differences between the mesh vertices in an animation frame and the corresponding vertices in the neutral pose. A style vector is created from the speaking style indication and is used, together with the vertex deltas and audio data, to update the machine-learning model. The updated model can then be given new audio data and a new speaking style indication to produce vertex deltas for animating the mesh. The system may also blend multiple meshes to represent different identities, using weight values to determine the contribution of each identity to the final animation.
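The vertex-delta and identity-blending operations can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the claimed implementation: meshes are taken to be `(V, 3)` arrays sharing one topology, blending is assumed to be a weighted average with weights normalised to sum to one, and the function names are hypothetical.

```python
import numpy as np


def vertex_deltas(frame_vertices, neutral_vertices):
    """Per-vertex offsets between an animation frame and the neutral pose."""
    frame = np.asarray(frame_vertices, dtype=float)
    neutral = np.asarray(neutral_vertices, dtype=float)
    if frame.shape != neutral.shape:
        raise ValueError("frame and neutral meshes must have the same shape")
    return frame - neutral


def apply_deltas(neutral_vertices, deltas):
    """Animate a mesh by adding vertex deltas to its neutral pose."""
    return np.asarray(neutral_vertices, dtype=float) + np.asarray(deltas, dtype=float)


def blend_identities(identity_meshes, weights):
    """Weighted average of K identity meshes, each of shape (V, 3).

    Normalising the weights is an assumption; the source only says that
    weight values determine each identity's contribution.
    """
    meshes = np.asarray(identity_meshes, dtype=float)   # (K, V, 3)
    w = np.asarray(weights, dtype=float)                 # (K,)
    if w.shape != (meshes.shape[0],):
        raise ValueError("need exactly one weight per identity mesh")
    total = w.sum()
    if total == 0:
        raise ValueError("weights must not sum to zero")
    # Contract the identity axis: result has shape (V, 3).
    return np.tensordot(w / total, meshes, axes=1)
```

Under these assumptions, adding the deltas for a frame back onto the neutral pose recovers that frame exactly, and deltas predicted by the model could be applied the same way to a blended neutral mesh.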
The system can receive user input through a graphical interface to determine speaking styles. It uses this input, along with audio data, to generate vertex deltas and animate the facial mesh. The system can blend multiple facial meshes based on speaking style indications and generate animations synchronized with the audio data. This approach allows for flexible and efficient facial animation across different identities and speaking styles, enhancing the realism and adaptability of digital avatars.