US20250078377
2025-03-06
Physics
G06T13/40
The patent application describes a method for body tracking from monocular video: video frames of a human subject's movement are captured and processed into a 3D representation. A pre-trained neural network model analyzes 2D images extracted from the frames to determine the subject's pose. The method emphasizes estimating 3D positions of the upper-body joints, computing per-joint confidence scores, and selecting reliable keypoints to drive accurate animation of a 3D avatar.
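The confidence-scoring and keypoint-selection step might look like the following minimal sketch. The function name, threshold value, and array layout are illustrative assumptions, not taken from the application itself:

```python
import numpy as np

CONF_THRESHOLD = 0.5  # hypothetical cutoff; the application does not specify a value


def select_keypoints(keypoints_3d, confidences, threshold=CONF_THRESHOLD):
    """Keep only joints whose confidence score clears the threshold.

    keypoints_3d: (N, 3) array of estimated 3D joint positions
    confidences:  (N,) per-joint confidence scores in [0, 1]
    Returns the indices of reliable joints and their positions.
    """
    mask = confidences >= threshold
    return np.flatnonzero(mask), keypoints_3d[mask]
```

The selected subset would then be the input used to pose the avatar, with low-confidence joints left to smoothing or interpolation.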
This invention falls within the realm of computer graphics, focusing on tracking body movements without noticeable lag using a single camera feed. It is particularly useful for devices with limited processing capabilities, aiming to improve applications in gaming, virtual reality (VR), augmented reality (AR), and human-computer interaction by providing an accessible alternative to traditional motion capture systems that require specialized hardware.
The approach tackles several challenges inherent in monocular video-based body tracking. These include the difficulty of extrapolating 3D poses from 2D input data due to missing depth information, maintaining real-time performance without lag, and handling partial visibility or self-occlusion where parts of the body are obscured. The method aims to balance computational efficiency with accuracy, making it suitable for mobile or low-end devices.
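The depth-ambiguity problem mentioned above can be made concrete with a simple pinhole-camera projection (this example is illustrative and not part of the application): two distinct 3D points on the same viewing ray project to the identical 2D pixel, so a single frame cannot distinguish their depths.

```python
import numpy as np


def project_pinhole(point_3d, focal=1.0):
    """Project a camera-space 3D point to 2D with an ideal pinhole model."""
    x, y, z = point_3d
    return np.array([focal * x / z, focal * y / z])


# Two different 3D joint positions along the same viewing ray...
near = np.array([0.5, 0.25, 1.0])
far = near * 2.0  # same ray, twice the depth
# ...yield the same 2D observation, which is why monocular 3D pose
# estimation must rely on learned priors rather than geometry alone.
```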
The method includes several enhancements such as temporal smoothing across frames, calibration for camera distortions, and re-detection if confidence scores fall below a threshold. The neural network may use an attention mechanism to focus on keypoints during estimation, and joint positions of the avatar can be scaled to match the human subject's proportions. These features aim to improve tracking accuracy and reliability in various conditions.
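The temporal-smoothing and re-detection behaviors could be combined as in the sketch below. The class name, smoothing scheme (an exponential moving average), and both constants are assumptions for illustration; the application describes the features without prescribing an algorithm:

```python
import numpy as np

SMOOTH_ALPHA = 0.3        # hypothetical weight given to the newest frame
REDETECT_THRESHOLD = 0.3  # hypothetical mean-confidence floor


class PoseSmoother:
    """Exponential moving average over per-frame 3D joint estimates."""

    def __init__(self, alpha=SMOOTH_ALPHA):
        self.alpha = alpha
        self.state = None  # last smoothed pose, (N, 3)

    def update(self, joints_3d, confidences):
        # Signal a full re-detection when tracking confidence collapses.
        if confidences.mean() < REDETECT_THRESHOLD:
            self.state = None
            return None  # caller should rerun the detector from scratch
        if self.state is None:
            self.state = joints_3d
        else:
            # Blend the new estimate with the running state to damp jitter.
            self.state = self.alpha * joints_3d + (1 - self.alpha) * self.state
        return self.state
```

Returning `None` is one simple way to hand control back to the detection stage; an attention-based network or avatar-scaling step, as the application mentions, would sit upstream and downstream of this smoothing loop respectively.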