US20240290059
2024-08-29
Physics
G06V10/25
A method for generating editable free-viewpoint videos has been developed, using a collection of videos captured from multiple perspectives of a scene that contains an environment and dynamic entities. Each dynamic entity is encapsulated within a 3D bounding box, and a machine learning model encodes the scene's environment and its dynamic entities. The model includes layers that represent the spatial and temporal characteristics of both the environment and the entities, enabling the rendered video content to be manipulated.
The machine learning model consists of two primary layers: an environment layer and a dynamic entity layer. The environment layer captures a continuous function of space and time for the overall environment, while the dynamic entity layer focuses on individual entities within the scene. The latter layer incorporates a deformation module that adjusts spatial coordinates based on timestamps and trained weights, along with a neural radiance module that determines color and density values for rendering.
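For illustration only, the two-layer design could be organized along the following lines; the module names, layer sizes, and input format below are assumptions made for this sketch and are not drawn from the filing.

```python
# Minimal sketch (not the patented implementation): an environment layer and a
# dynamic-entity layer, each built from small MLPs. Names and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=128, depth=4):
    """Plain MLP used by both the deformation and radiance modules."""
    layers, dim = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)


class NeuralRadianceModule(nn.Module):
    """Maps a (possibly deformed) 3D point and a timestamp to color + density."""
    def __init__(self):
        super().__init__()
        self.net = mlp(in_dim=4, out_dim=4)   # (x, y, z, t) -> (r, g, b, sigma)

    def forward(self, xyz, t):
        out = self.net(torch.cat([xyz, t], dim=-1))
        rgb = torch.sigmoid(out[..., :3])     # color in [0, 1]
        sigma = torch.relu(out[..., 3:])      # non-negative density
        return rgb, sigma


class DynamicEntityLayer(nn.Module):
    """Deformation module warps coordinates by timestamp; radiance module shades them."""
    def __init__(self):
        super().__init__()
        self.deform = mlp(in_dim=4, out_dim=3)       # (x, y, z, t) -> offset
        self.radiance = NeuralRadianceModule()

    def forward(self, xyz, t):
        offset = self.deform(torch.cat([xyz, t], dim=-1))
        return self.radiance(xyz + offset, t)


class EnvironmentLayer(nn.Module):
    """Continuous function of space and time for the background environment."""
    def __init__(self):
        super().__init__()
        self.radiance = NeuralRadianceModule()

    def forward(self, xyz, t):
        return self.radiance(xyz, t)
```

Keeping the deformation module separate from the radiance module is what allows a single entity appearance to be queried at warped coordinates for any timestamp.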
To enhance video rendering accuracy, point clouds are generated for each frame across multiple views, allowing for depth map reconstruction. Initial 2D bounding boxes are created for dynamic entities in each view, which are then transformed into 3D bounding boxes through a trajectory prediction network. This process facilitates precise tracking and manipulation of dynamic objects within the video.
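The trajectory prediction network itself is not reproduced here, but the depth-based lifting step can be sketched as follows; the intrinsics, detection window, and depth-extent heuristic are assumed values used only to illustrate how a per-view depth map turns a 2D detection into a 3D box candidate.

```python
# Hedged sketch of one step in the pipeline: back-projecting a 2D bounding box
# into camera space using a per-view depth map and intrinsics K. This is an
# illustrative heuristic, not the filing's trajectory prediction network.
import numpy as np


def lift_2d_box(box_2d, depth_map, K):
    """box_2d = (u_min, v_min, u_max, v_max) in pixels; returns a rough 3D AABB."""
    u0, v0, u1, v1 = box_2d
    patch = depth_map[v0:v1, u0:u1]
    z = np.median(patch[patch > 0])              # robust depth for the entity
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Back-project the two image-space corners at the estimated depth.
    x0, y0 = (u0 - cx) * z / fx, (v0 - cy) * z / fy
    x1, y1 = (u1 - cx) * z / fx, (v1 - cy) * z / fy
    # Assume a depth extent comparable to the box width (illustrative only).
    half_depth = 0.5 * abs(x1 - x0)
    return np.array([[x0, y0, z - half_depth],   # min corner
                     [x1, y1, z + half_depth]])  # max corner


# Example: a 640x480 depth map with a detection around the image centre.
depth = np.full((480, 640), 3.0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
print(lift_2d_box((280, 200, 360, 280), depth, K))
```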
The method enables various editing capabilities for dynamic entities in the scene, such as resizing or even removing objects entirely. The layers are designed to be disentangled, allowing users to manipulate individual elements without affecting the entire scene. The model can also apply transformations to bounding boxes or timestamps, further expanding the editing possibilities within rendered videos.
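As a hedged sketch of how such disentangled edits might be expressed, the following reuses the entity-layer interface from the earlier sketch; the per-entity scale, center, and time shift are illustrative parameters, not terms from the filing.

```python
# Illustrative only: because the environment and entity layers are disentangled,
# an edit can be expressed as a per-entity transform applied to query points
# before they reach that entity's layer.

def query_edited_entities(entities, xyz, t):
    """entities: list of (layer, center, scale, time_shift); layer is e.g. a
    DynamicEntityLayer as sketched above. Dropping a tuple from the list
    removes that entity from the rendered scene entirely."""
    outputs = []
    for layer, center, scale, time_shift in entities:
        local = (xyz - center) / scale                # scale > 1 renders the entity larger
        outputs.append(layer(local, t + time_shift))  # retimed independently of the scene
    return outputs
```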
Both the environment and dynamic entity layers utilize neural radiance modules that output color and density values based on deformed spatial coordinates. These modules can be structured using multi-layer perceptrons (MLPs), enhancing the model's ability to reconstruct novel views from encoded data. The overall design aims to improve upon traditional view synthesis methods by providing editable content suitable for diverse applications in virtual reality (VR) and augmented reality (AR).
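A minimal sketch of how the color and density values emitted by these modules could be composited into a pixel, following standard volume-rendering practice rather than any procedure quoted from the filing:

```python
# Hedged sketch: alpha-composite per-sample color and density along one ray.
import torch


def composite(rgb, sigma, deltas):
    """rgb: (N, 3), sigma: (N, 1), deltas: (N, 1) sample spacings along one ray."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                            # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones_like(trans[:1]), trans[:-1]], dim=0)  # transmittance T_i
    weights = alpha * trans
    return (weights * rgb).sum(dim=0)                                   # final pixel color
```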