Invention Title:

Method for Monocular Acquisition of Realistic Three-Dimensional Scene Models

Publication number:

US20250148702

Publication date:
Section:

Physics

Class:

G06T15/40

Inventors:

Applicant:

Drawings (4 of 5)

Smart overview of the Invention

The disclosed method focuses on generating three-dimensional (3D) models from a single two-dimensional (2D) image using advanced neural networks. The process begins by receiving a 2D input image, from which a latent tensor is predicted. This tensor contains 3D geometrical data and information on occluded surfaces within the image. A neural network then uses both the input image and the latent tensor to predict a comprehensive 3D model of the scene.

Technical Background

Traditional methods for creating 3D models involve stereo camera setups and multi-view stereo reconstruction algorithms, which require multiple images or video frames. Monocular acquisition, however, seeks to achieve this with just one image, typically from consumer-grade cameras like those in smartphones. It involves inferring depth and other properties that are not directly visible in the input image, using projective geometry and semantic understanding of objects.

Monocular Acquisition Challenges

Creating 3D models from a single image is challenging due to the inherent one-to-many prediction problem, where multiple plausible 3D interpretations can exist for a single 2D image. Traditional machine learning models struggle with this, often averaging predictions in a way that does not accurately represent any single possibility. Denoising diffusion probabilistic models (DDPMs) have emerged as a solution, offering improved prediction plausibility by iteratively refining predictions through a noise-reduction process.

Innovative Approach

The method involves training a reconstruction neural network alongside an encoding network to create a paired dataset of input images and latent tensors. This training leverages auxiliary images taken from various viewpoints to enhance the learning process. The latent tensor is derived using either a denoising diffusion process or other pretrained networks such as auto-regressive or image translation networks, which are trained on the paired dataset using techniques like adversarial learning.

Applications and Implementation

The system can be applied to individual video frames captured by a single camera, allowing for the assembly and post-processing of these frames into coherent 3D videos. This innovative approach enhances the ability to render realistic 3D scenes for applications in virtual reality (VR), augmented reality (AR), and other immersive technologies, providing users with dynamic viewpoints and heightened realism.