US20240290059
2024-08-29
Physics
G06V10/25
A method for generating editable free-viewpoint videos has been developed, using a collection of videos captured from multiple perspectives of a scene that contains an environment and dynamic entities. Each dynamic entity is encapsulated within a 3D bounding box, and a machine learning model encodes the scene's environment and its dynamic entities. The model includes layers that represent the spatial and temporal characteristics of both the environment and the entities, enabling the rendered video content to be manipulated.
The machine learning model consists of two primary layers: an environment layer and a dynamic entity layer. The environment layer captures a continuous function of space and time for the overall environment, while the dynamic entity layer focuses on individual entities within the scene. The latter layer incorporates a deformation module that adjusts spatial coordinates based on timestamps and trained weights, along with a neural radiance module that determines color and density values for rendering.
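For illustration only, the two-layer design could be organized along the following lines; the module names, layer sizes, and input format below are assumptions made for this sketch and are not drawn from the filing.

```python
# Minimal sketch (not the patented implementation): an environment layer and a
# dynamic-entity layer, each built from small MLPs. Names and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=128, depth=4):
    """Plain MLP used by both the deformation and radiance modules."""
    layers, dim = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)


class NeuralRadianceModule(nn.Module):
    """Maps a (possibly deformed) 3D point and a timestamp to color + density."""
    def __init__(self):
        super().__init__()
        self.net = mlp(in_dim=4, out_dim=4)   # (x, y, z, t) -> (r, g, b, sigma)

    def forward(self, xyz, t):
        out = self.net(torch.cat([xyz, t], dim=-1))
        rgb = torch.sigmoid(out[..., :3])     # color in [0, 1]
        sigma = torch.relu(out[..., 3:])      # non-negative density
        return rgb, sigma


class DynamicEntityLayer(nn.Module):
    """Deformation module warps coordinates by timestamp; radiance module shades them."""
    def __init__(self):
        super().__init__()
        self.deform = mlp(in_dim=4, out_dim=3)       # (x, y, z, t) -> offset
        self.radiance = NeuralRadianceModule()

    def forward(self, xyz, t):
        offset = self.deform(torch.cat([xyz, t], dim=-1))
        return self.radiance(xyz + offset, t)


class EnvironmentLayer(nn.Module):
    """Continuous function of space and time for the background environment."""
    def __init__(self):
        super().__init__()
        self.radiance = NeuralRadianceModule()

    def forward(self, xyz, t):
        return self.radiance(xyz, t)
```

Keeping the deformation module separate from the radiance module is what allows a single entity appearance to be queried at warped coordinates for any timestamp.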
To enhance video rendering accuracy, point clouds are generated for each frame across multiple views, allowing for depth map reconstruction. Initial 2D bounding boxes are created for dynamic entities in each view, which are then transformed into 3D bounding boxes through a trajectory prediction network. This process facilitates precise tracking and manipulation of dynamic objects within the video.
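The trajectory prediction network itself is not reproduced here, but the depth-based lifting step can be sketched as follows; the intrinsics, detection window, and depth-extent heuristic are assumed values used only to illustrate how a per-view depth map turns a 2D detection into a 3D box candidate.

```python
# Hedged sketch of one step in the pipeline: back-projecting a 2D bounding box
# into camera space using a per-view depth map and intrinsics K. This is an
# illustrative heuristic, not the filing's trajectory prediction network.
import numpy as np


def lift_2d_box(box_2d, depth_map, K):
    """box_2d = (u_min, v_min, u_max, v_max) in pixels; returns a rough 3D AABB."""
    u0, v0, u1, v1 = box_2d
    patch = depth_map[v0:v1, u0:u1]
    z = np.median(patch[patch > 0])              # robust depth for the entity
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Back-project the two image-space corners at the estimated depth.
    x0, y0 = (u0 - cx) * z / fx, (v0 - cy) * z / fy
    x1, y1 = (u1 - cx) * z / fx, (v1 - cy) * z / fy
    # Assume a depth extent comparable to the box width (illustrative only).
    half_depth = 0.5 * abs(x1 - x0)
    return np.array([[x0, y0, z - half_depth],   # min corner
                     [x1, y1, z + half_depth]])  # max corner


# Example: a 640x480 depth map with a detection around the image centre.
depth = np.full((480, 640), 3.0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
print(lift_2d_box((280, 200, 360, 280), depth, K))
```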
The method enables various editing capabilities for dynamic entities in the scene, such as resizing or even removing objects entirely. The layers are designed to be disentangled, allowing users to manipulate individual elements without affecting the entire scene. The model can also apply transformations to bounding boxes or timestamps, further expanding the editing possibilities within rendered videos.
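As a hedged sketch of how such disentangled edits might be expressed, the following reuses the entity-layer interface from the earlier sketch; the per-entity scale, center, and time shift are illustrative parameters, not terms from the filing.

```python
# Illustrative only: because the environment and entity layers are disentangled,
# an edit can be expressed as a per-entity transform applied to query points
# before they reach that entity's layer.

def query_edited_entities(entities, xyz, t):
    """entities: list of (layer, center, scale, time_shift); layer is e.g. a
    DynamicEntityLayer as sketched above. Dropping a tuple from the list
    removes that entity from the rendered scene entirely."""
    outputs = []
    for layer, center, scale, time_shift in entities:
        local = (xyz - center) / scale                # scale > 1 renders the entity larger
        outputs.append(layer(local, t + time_shift))  # retimed independently of the scene
    return outputs
```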
Both the environment and dynamic entity layers utilize neural radiance modules that output color and density values based on deformed spatial coordinates. These modules can be structured using multi-layer perceptrons (MLPs), enhancing the model's ability to reconstruct novel views from encoded data. The overall design aims to improve upon traditional view synthesis methods by providing editable content suitable for diverse applications in virtual reality (VR) and augmented reality (AR).
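A minimal sketch of how the color and density values emitted by these modules could be composited into a pixel, following standard volume-rendering practice rather than any procedure quoted from the filing:

```python
# Hedged sketch: alpha-composite per-sample color and density along one ray.
import torch


def composite(rgb, sigma, deltas):
    """rgb: (N, 3), sigma: (N, 1), deltas: (N, 1) sample spacings along one ray."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                            # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones_like(trans[:1]), trans[:-1]], dim=0)  # transmittance T_i
    weights = alpha * trans
    return (weights * rgb).sum(dim=0)                                   # final pixel color
```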