US20240354996
2024-10-24
Physics
G06T9/00
Autoregressive content rendering focuses on generating temporally coherent videos, particularly of digital humans. The growing use of digital humans in contexts like gaming and the metaverse highlights the need for realistic, seamless animation. A central challenge is ensuring that generated videos are free of visual artifacts such as jitter and glitches, which disrupt the user experience.
Creating digital humans requires overcoming the limitations of existing technologies that rely on sparse and noisy input features such as keypoints and contours. These inputs often produce jittery or glitchy motion in regions they do not represent well, such as hair or clothing. In multi-modal settings, additional noise from audio data can further degrade the coherence of mouth and lip movements.
The proposed method uses an autoencoder network to generate a series of predicted images, each of which is fed back into the network as input for the next iteration. By encoding both the predicted images and keypoint images, the system produces temporally coherent video content, with the iterative decoding process maintaining smooth transitions between frames.
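The feedback loop can be illustrated with a minimal sketch in PyTorch. The names here (`generate_video`, `keypoint_frames`, `initial_frame`) are hypothetical, and the sketch simply assumes a `model` callable that maps a previously predicted frame and a keypoint image to the next frame; it is not the disclosed implementation.

```python
import torch

def generate_video(model, keypoint_frames, initial_frame):
    """Autoregressively generate frames by feeding each prediction back in.

    Assumptions (not from the disclosure):
      - model(previous_frame, keypoint_image) -> next predicted frame
      - keypoint_frames: tensor of shape (T, C, H, W)
      - initial_frame: tensor of shape (C, H, W) seeding the loop
    """
    predicted = initial_frame
    outputs = []
    with torch.no_grad():
        for keypoints in keypoint_frames:
            # The previously predicted frame is re-encoded together with the
            # current keypoint image, so each new frame is conditioned on the
            # last output and transitions stay smooth.
            predicted = model(
                predicted.unsqueeze(0), keypoints.unsqueeze(0)
            ).squeeze(0)
            outputs.append(predicted)
    return torch.stack(outputs)
```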
The autoencoder network comprises a first encoder for the predicted images and a second encoder for the keypoint images, along with a decoder that iteratively generates the predicted images. This arrangement enhances temporal coherence by leveraging information encoded from previous iterations. The technology can be implemented through various systems, devices, or computer program products.
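One plausible arrangement of such a dual-encoder autoencoder is sketched below. The class name, layer counts, and channel sizes are assumptions for illustration, not the patented architecture; the point is that one encoder consumes the previously predicted image, a second consumes the keypoint image, and the decoder fuses both codes to produce the next frame.

```python
import torch
import torch.nn as nn

class DualEncoderAutoencoder(nn.Module):
    """Illustrative dual-encoder autoencoder (all sizes and names assumed)."""

    def __init__(self, channels=3, hidden=64):
        super().__init__()

        def encoder():
            return nn.Sequential(
                nn.Conv2d(channels, hidden, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, hidden * 2, 4, stride=2, padding=1), nn.ReLU(),
            )

        self.image_encoder = encoder()      # encodes the previous predicted image
        self.keypoint_encoder = encoder()   # encodes the keypoint image
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden * 4, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, channels, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, previous_frame, keypoint_image):
        # Concatenate the two latent codes so the decoder sees both the
        # appearance of the prior frame and the target pose given by keypoints.
        code = torch.cat(
            [self.image_encoder(previous_frame),
             self.keypoint_encoder(keypoint_image)],
            dim=1,
        )
        return self.decoder(code)
```

Used with the loop above, an instance of this module would serve as the `model` argument, with each decoded frame becoming the next iteration's `previous_frame`.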
This approach enables the creation of high-resolution video content suitable for display on larger screens without visible artifacts. By ensuring smooth transitions between frames, the method enhances user engagement with digital humans in virtual environments. The technology supports applications ranging from realistic avatars in gaming to interactive virtual experiences.