US20250184581
2025-06-05
Electricity
H04N21/816
Techniques are introduced for creating three-dimensional (3D) videos from text inputs using machine learning models. The process involves inputting text and data indicative of multi-view images into a machine learning model, which associates the content of those images with the text. The model includes multiple sub-models, each corresponding to different camera parameters, that generate separate sets of multi-view images. These sub-models operate in parallel to produce the image sets efficiently, and the sets are then used to create a 3D video associated with the input text.
Machine learning models have become prevalent across industries for tasks such as content generation, and there is a growing need for methods that apply them to content creation more effectively. Existing 3D asset generation methods are often slow and inefficient, taking significant time to produce high-quality outputs. This delay is problematic when users wish to modify or edit the generated assets, because they must wait extended periods to view both the initial and the updated results.
The proposed system shortens the time needed to visualize 3D assets by generating 3D videos directly from text. It uses a two-stage machine learning pipeline: a first model generates multi-view images from user-provided text, and a second model refines those images into multiple sets rendered from different camera perspectives. Each set includes images from four orthogonal views, giving a comprehensive visualization of the object described by the text.
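A minimal sketch of this two-stage pipeline is given below. The model interfaces, function names, and the fixed set of four orthogonal azimuths are illustrative assumptions, not details drawn from the application itself.

    # Sketch of the two-stage text-to-3D-video pipeline (hypothetical interfaces).
    # Stage 1: a text-conditioned model produces an initial set of multi-view images.
    # Stage 2: a second model refines them into further sets at shifted camera poses.
    from dataclasses import dataclass
    from typing import Any, List

    ORTHOGONAL_AZIMUTHS = [0.0, 90.0, 180.0, 270.0]  # four orthogonal views (assumed)

    @dataclass
    class MultiViewSet:
        azimuths: List[float]  # camera azimuth per image, in degrees
        images: List[Any]      # one rendered view per azimuth

    def generate_initial_views(prompt: str, text_to_multiview_model) -> MultiViewSet:
        # First model: text in, four orthogonal views of the described object out.
        images = text_to_multiview_model(prompt, azimuths=ORTHOGONAL_AZIMUTHS)
        return MultiViewSet(ORTHOGONAL_AZIMUTHS, images)

    def refine_views(initial: MultiViewSet, refinement_model, offset_deg: float) -> MultiViewSet:
        # Second model: re-renders the object at camera poses shifted by offset_deg.
        shifted = [(a + offset_deg) % 360.0 for a in initial.azimuths]
        images = refinement_model(initial.images, azimuths=shifted)
        return MultiViewSet(shifted, images)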
The second machine learning model comprises several sub-models, each associated with specific camera parameters. These sub-models run concurrently, creating multiple image sets that reflect varying perspectives of the object. The system can adjust camera offsets between image sets, enabling smooth transitions in the resulting 3D video. This parallel processing significantly reduces the time required to produce a complete 360-degree view.
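One way the concurrent sub-models could be driven is sketched below, continuing the pipeline sketch above. The thread-pool scheduling, the 30-degree offset between sets, and the per-offset sub-model interface are assumptions made for illustration.

    # Run the refinement sub-models in parallel, each conditioned on its own camera offset.
    # With four orthogonal views per set and 30-degree offsets between sets, three sets
    # together cover a full 360 degrees at 30-degree spacing (0, 30, 60, 90, ..., 330).
    from concurrent.futures import ThreadPoolExecutor

    SET_OFFSETS_DEG = [0.0, 30.0, 60.0]  # assumed camera offsets between image sets

    def generate_all_sets(initial, sub_models):
        # One sub-model per camera offset; all image sets are produced concurrently.
        # refine_views is the helper defined in the preceding sketch.
        with ThreadPoolExecutor(max_workers=len(sub_models)) as pool:
            futures = [
                pool.submit(refine_views, initial, model, offset)
                for model, offset in zip(sub_models, SET_OFFSETS_DEG)
            ]
            return [f.result() for f in futures]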
An example scenario involves generating a 3D video of "a bulldog wearing a pirate hat" based on user input. The first model creates initial multi-view images from different angles, which the second model then expands into additional sets with varied perspectives. The final 3D video displays a smooth rotation of the bulldog, providing users with an immediate visualization that aids in deciding whether further text modifications are needed.
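To make the rotation concrete, a short sketch of how frames from the different sets might be interleaved into a single turntable video follows; the frame ordering by azimuth, the frame rate, and the write_video helper are assumptions, not details from the application.

    # Interleave images from all sets by azimuth to form one smooth 360-degree turntable.
    def assemble_turntable(view_sets):
        frames = []
        for view_set in view_sets:
            frames.extend(zip(view_set.azimuths, view_set.images))
        frames.sort(key=lambda pair: pair[0])  # order frames by camera angle
        return [image for _, image in frames]

    # Hypothetical usage for the "a bulldog wearing a pirate hat" example:
    # initial = generate_initial_views("a bulldog wearing a pirate hat", text_to_multiview_model)
    # sets = generate_all_sets(initial, sub_models)
    # write_video(assemble_turntable(sets), fps=12)  # write_video is an assumed helper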