US20250184581
2025-06-05
Electricity
H04N21/816
Techniques are introduced for creating three-dimensional (3D) videos from text inputs using machine learning models. The process involves inputting text and data indicative of multi-view images into a machine learning model, which associates the content of those images with the text. The model includes multiple sub-models, each corresponding to different camera parameters, that generate separate sets of multi-view images. These sub-models operate in parallel to produce the image sets efficiently, and the sets are then used to create a 3D video associated with the input text.
Machine learning models have become prevalent across industries for tasks such as content generation, and there is a growing need for methods that apply them to content creation more effectively. Existing 3D asset generation methods are often slow and inefficient, taking significant time to produce high-quality outputs. This delay is problematic when users wish to modify or edit the generated assets, because they must wait extended periods to view both the initial and the updated results.
The proposed system shortens the time needed to visualize 3D assets by generating 3D videos directly from text. It uses a two-stage machine learning pipeline: a first model generates multi-view images from user-provided text, and a second model refines those images into multiple sets rendered from different camera perspectives. Each set includes images from four orthogonal views, giving a comprehensive visualization of the object described by the text.
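A minimal sketch of this two-stage pipeline is given below. The model interfaces, function names, and the fixed set of four orthogonal azimuths are illustrative assumptions, not details drawn from the application itself.

    # Sketch of the two-stage text-to-3D-video pipeline (hypothetical interfaces).
    # Stage 1: a text-conditioned model produces an initial set of multi-view images.
    # Stage 2: a second model refines them into further sets at shifted camera poses.
    from dataclasses import dataclass
    from typing import Any, List

    ORTHOGONAL_AZIMUTHS = [0.0, 90.0, 180.0, 270.0]  # four orthogonal views (assumed)

    @dataclass
    class MultiViewSet:
        azimuths: List[float]  # camera azimuth per image, in degrees
        images: List[Any]      # one rendered view per azimuth

    def generate_initial_views(prompt: str, text_to_multiview_model) -> MultiViewSet:
        # First model: text in, four orthogonal views of the described object out.
        images = text_to_multiview_model(prompt, azimuths=ORTHOGONAL_AZIMUTHS)
        return MultiViewSet(ORTHOGONAL_AZIMUTHS, images)

    def refine_views(initial: MultiViewSet, refinement_model, offset_deg: float) -> MultiViewSet:
        # Second model: re-renders the object at camera poses shifted by offset_deg.
        shifted = [(a + offset_deg) % 360.0 for a in initial.azimuths]
        images = refinement_model(initial.images, azimuths=shifted)
        return MultiViewSet(shifted, images)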
The second machine learning model comprises several sub-models, each associated with specific camera parameters. These sub-models run concurrently, creating multiple image sets that reflect varying perspectives of the object. The system can adjust camera offsets between image sets, enabling smooth transitions in the resulting 3D video. This parallel processing significantly reduces the time required to produce a complete 360-degree view.
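One way the concurrent sub-models could be driven is sketched below, continuing the pipeline sketch above. The thread-pool scheduling, the 30-degree offset between sets, and the per-offset sub-model interface are assumptions made for illustration.

    # Run the refinement sub-models in parallel, each conditioned on its own camera offset.
    # With four orthogonal views per set and 30-degree offsets between sets, three sets
    # together cover a full 360 degrees at 30-degree spacing (0, 30, 60, 90, ..., 330).
    from concurrent.futures import ThreadPoolExecutor

    SET_OFFSETS_DEG = [0.0, 30.0, 60.0]  # assumed camera offsets between image sets

    def generate_all_sets(initial, sub_models):
        # One sub-model per camera offset; all image sets are produced concurrently.
        # refine_views is the helper defined in the preceding sketch.
        with ThreadPoolExecutor(max_workers=len(sub_models)) as pool:
            futures = [
                pool.submit(refine_views, initial, model, offset)
                for model, offset in zip(sub_models, SET_OFFSETS_DEG)
            ]
            return [f.result() for f in futures]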
An example scenario involves generating a 3D video of "a bulldog wearing a pirate hat" based on user input. The first model creates initial multi-view images from different angles, which the second model then expands into additional sets with varied perspectives. The final 3D video displays a smooth rotation of the bulldog, providing users with an immediate visualization that aids in deciding whether further text modifications are needed.
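To make the rotation concrete, a short sketch of how frames from the different sets might be interleaved into a single turntable video follows; the frame ordering by azimuth, the frame rate, and the write_video helper are assumptions, not details from the application.

    # Interleave images from all sets by azimuth to form one smooth 360-degree turntable.
    def assemble_turntable(view_sets):
        frames = []
        for view_set in view_sets:
            frames.extend(zip(view_set.azimuths, view_set.images))
        frames.sort(key=lambda pair: pair[0])  # order frames by camera angle
        return [image for _, image in frames]

    # Hypothetical usage for the "a bulldog wearing a pirate hat" example:
    # initial = generate_initial_views("a bulldog wearing a pirate hat", text_to_multiview_model)
    # sets = generate_all_sets(initial, sub_models)
    # write_video(assemble_turntable(sets), fps=12)  # write_video is an assumed helper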