Invention Title:

REAL-TIME EXTRACTION OF HUMAN POSES FROM VIDEO FOR ANIMATION OF AVATARS

Publication number:

US20250069259

Publication date:

Section:

Physics

Class:

G06T7/73

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The patent application details a method for extracting human poses from video data to animate avatars in real time. The approach obtains input video frames that capture a person's movements, detects keypoints in those frames, and determines corresponding 3D body poses. A spatial-temporal transformer encodes the spatial and temporal dimensions separately, improving the accuracy of the estimated joint angles and the depicted movement.
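The idea of encoding spatial and temporal dimensions separately can be illustrated with a toy factored self-attention pass: one attention step across joints within each frame, then one across frames for each joint. This is a minimal numpy sketch with identity projections and a single head, not the patent's actual architecture; all names and dimensions here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (tokens, dim). Single-head scaled dot-product attention with
    # identity query/key/value projections, for illustration only.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def spatial_temporal_encode(feats):
    # feats: (frames, joints, dim) embedded keypoint features.
    # Spatial pass: attend across joints within each frame.
    spatial = np.stack([self_attention(frame) for frame in feats])
    # Temporal pass: attend across frames for each joint independently.
    return np.stack(
        [self_attention(spatial[:, j]) for j in range(spatial.shape[1])],
        axis=1,
    )

T, J, D = 8, 17, 16  # frames, joints (17 as in COCO keypoints), feature dim
feats = np.random.default_rng(0).normal(size=(T, J, D))
out = spatial_temporal_encode(feats)
```

Factoring attention this way keeps the cost at roughly O(T·J² + J·T²) rather than O((T·J)²) for full joint attention over all frame-joint tokens.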

Technical Background

Creating high-quality animations traditionally involves time-consuming manual work or expensive motion capture setups. While machine learning offers alternatives by deriving motion from video images, existing techniques often result in low-quality animations and require significant computational resources. The described method aims to address these challenges by providing real-time output with improved pose accuracy.

Methodology

The method detects keypoints in the video frames and uses a spatial-temporal transformer to determine 3D poses, encoding the spatial and temporal inputs separately. The transformer outputs 6D circular representations of joint angles, which are then converted into 3D angles. The method also predicts global translation in 3D world coordinates, potentially using a second transformer for greater accuracy.
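A common way to realize a 6D rotation output, and convert it back to a 3D angle form, is to treat the six numbers as the first two columns of a rotation matrix, re-orthonormalize them with Gram-Schmidt, and then extract an axis-angle vector. The sketch below shows that standard construction as one plausible reading of the patent's 6D-to-3D conversion; it is not taken from the patent text itself.

```python
import numpy as np

def rot6d_to_matrix(r6):
    # r6: 6 values = first two columns of a rotation matrix.
    # Gram-Schmidt recovers a valid orthonormal 3x3 rotation.
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)  # completes a right-handed frame (det = +1)
    return np.stack([b1, b2, b3], axis=1)

def matrix_to_axis_angle(R):
    # Angle from the trace; axis from the skew-symmetric part of R.
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(angle, 0.0):
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return angle * axis / (2.0 * np.sin(angle))

R = rot6d_to_matrix(np.array([1.0, 0.1, 0.0, 0.0, 1.0, 0.2]))
```

The appeal of the 6D form is that it is continuous, so a network can regress it directly without the discontinuities that Euler angles and quaternions introduce.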

Additional Features

To improve animation quality, the method incorporates a smoothing filter with an optimization solver that minimizes jitter in the pose sequence. The filter reduces acceleration errors of the keypoints, using an alternating direction method of multipliers (ADMM) solver for the optimization. This yields smoother transitions and more realistic animations.
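The acceleration-penalty objective behind such a filter can be written as a least-squares problem: stay close to the observed trajectory while penalizing second differences (discrete accelerations). The sketch below solves the unconstrained version in closed form via the normal equations; the patent names ADMM, which would typically be used for the constrained or non-smooth variants of the same objective, so this is a simplified stand-in, not the patented solver.

```python
import numpy as np

def smooth_trajectory(y, lam=10.0):
    # y: (T,) one keypoint coordinate over time.
    # Solve  min_x ||x - y||^2 + lam * ||D2 x||^2
    # where D2 is the second-difference operator, via
    # (I + lam * D2^T D2) x = y.
    T = len(y)
    D2 = np.zeros((T - 2, T))
    for t in range(T - 2):
        D2[t, t:t + 3] = [1.0, -2.0, 1.0]  # discrete acceleration stencil
    A = np.eye(T) + lam * (D2.T @ D2)
    return np.linalg.solve(A, y)

t = np.linspace(0.0, 1.0, 50)
noisy = np.sin(2 * np.pi * t) + 0.1 * np.random.default_rng(0).normal(size=50)
smoothed = smooth_trajectory(noisy)
```

Larger `lam` trades fidelity to the raw keypoints for lower acceleration energy, i.e. less jitter in the resulting animation.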

Applications and Implementations

The system can be applied to virtual environments in which avatars mirror human movements captured in video data. Implementations include processors and memory storing software instructions for obtaining video input, detecting keypoints, and determining 3D poses. The approach can be embodied on a variety of devices and as instructions on non-transitory computer-readable media, offering flexibility in integrating with existing systems.
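The obtain-detect-determine sequence of software instructions can be sketched as a small pipeline with pluggable stages. Every name below (`PoseExtractionPipeline`, the stage callables, the dummy implementations) is hypothetical and only illustrates the staged structure described above, not an actual implementation from the patent.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class PoseExtractionPipeline:
    # Hypothetical stage names; each is a swappable callable so the same
    # pipeline can run on different devices or detector backends.
    detect_keypoints: Callable  # frames -> (T, J, 2) 2D keypoints
    lift_to_3d: Callable        # 2D keypoints -> (T, J, 3) 3D poses
    smooth: Callable            # 3D poses -> smoothed 3D poses

    def run(self, frames):
        kps = self.detect_keypoints(frames)
        poses = self.lift_to_3d(kps)
        return self.smooth(poses)

# Dummy stages standing in for real detector / transformer / filter models.
rng = np.random.default_rng(0)
pipe = PoseExtractionPipeline(
    detect_keypoints=lambda frames: rng.normal(size=(len(frames), 17, 2)),
    lift_to_3d=lambda kps: np.concatenate(
        [kps, np.zeros((*kps.shape[:2], 1))], axis=-1),
    smooth=lambda poses: poses,
)
poses = pipe.run(frames=range(8))  # 8 placeholder frames
```

Keeping each stage behind a plain callable interface is one way to get the flexibility the application claims for integrating with existing systems.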