Invention Title:

GENERATING VIDEOS USING SEQUENCES OF GENERATIVE NEURAL NETWORKS

Publication number:

US20240320965

Publication date:
Section:

Physics

Class:

G06V10/82

Inventors:

Applicant:

Drawings (4 of 20)

Smart overview of the Invention

A video generation system utilizes a sequence of generative neural networks to create videos based on text prompts. The process begins with receiving a text prompt that describes a specific scene. This prompt is then processed by a text encoder neural network to produce a contextual embedding, which serves as the foundation for generating the final video.

Processing through Generative Neural Networks

The system employs an initial generative neural network that takes the contextual embedding and generates an initial output video. This video has defined spatial and temporal resolutions. Subsequent generative neural networks receive the output from the previous network, further refining the video by enhancing either its spatial resolution, temporal resolution, or both.

Flexibility in Input Data

While the primary input is a text prompt, the system is designed to accommodate various types of conditioning inputs. These can include noise distributions, pre-existing videos or images, audio signals, and more. This versatility allows the system to cater to different video generation scenarios beyond just text-based prompts.

Training Mechanisms

The generative neural networks are trained using multiple examples that pair specific text prompts with corresponding target videos. Various training techniques, including masking and self-attention mechanisms, are employed to ensure that each network learns effectively while maintaining high-quality output across different resolutions.

Advantages of Cascaded Generative Networks

The use of a cascading approach in generative neural networks allows for high-definition video generation without overwhelming a single model. By incrementally refining both pixel resolution and frame rates, the system ensures that the final output video is both temporally consistent and visually rich, accurately reflecting the scene described in the initial text prompt.