US20240320965
2024-09-26
Physics
G06V10/82
A video generation system uses a sequence of generative neural networks to create videos from text prompts. The process begins with receiving a text prompt that describes a scene. A text encoder neural network processes this prompt into a contextual embedding, which serves as the conditioning input for generating the final video.
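As an illustration, the encoding step might look like the following minimal PyTorch sketch. The patent does not specify an encoder architecture, so SimpleTextEncoder, its vocabulary size, and its dimensions are hypothetical stand-ins for a generic transformer-style text encoder.

```python
import torch
import torch.nn as nn

class SimpleTextEncoder(nn.Module):
    """Hypothetical text encoder: token embeddings plus transformer layers
    producing one contextual embedding per token of the prompt."""
    def __init__(self, vocab_size=32000, dim=512, layers=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor of tokenized prompt
        return self.encoder(self.embed(token_ids))  # (batch, seq_len, dim)

# Toy usage: a tokenized prompt becomes a contextual embedding that
# conditions the generative networks downstream.
tokens = torch.randint(0, 32000, (1, 16))
context = SimpleTextEncoder()(tokens)
print(context.shape)  # torch.Size([1, 16, 512])
```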
The system employs an initial generative neural network that takes the contextual embedding and generates an initial output video at a base spatial resolution and temporal resolution. Each subsequent generative neural network receives the output video of the preceding network and refines it by increasing its spatial resolution, its temporal resolution, or both.
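A minimal sketch of the cascade follows. The learned super-resolution stages are stood in for by simple interpolation, and the (batch, channels, frames, height, width) tensor layout is an assumption; the point is only to show how each stage receives the previous stage's video and increases its spatial or temporal resolution.

```python
import torch
import torch.nn.functional as F

def spatial_upsample(video, scale=2):
    # Placeholder for a learned spatial super-resolution network:
    # doubles the height and width of every frame.
    b, c, t, h, w = video.shape
    frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    frames = F.interpolate(frames, scale_factor=scale, mode="bilinear",
                           align_corners=False)
    return frames.reshape(b, t, c, h * scale, w * scale).permute(0, 2, 1, 3, 4)

def temporal_upsample(video, scale=2):
    # Placeholder for a learned frame-interpolation network:
    # doubles the number of frames.
    return F.interpolate(video, scale_factor=(scale, 1, 1),
                         mode="trilinear", align_corners=False)

# Initial low-resolution output: 1 video, 3 channels, 8 frames, 32x32 pixels.
video = torch.randn(1, 3, 8, 32, 32)
for stage in (spatial_upsample, temporal_upsample, spatial_upsample):
    video = stage(video)
print(video.shape)  # torch.Size([1, 3, 16, 128, 128])
```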
While the primary input is a text prompt, the system is designed to accommodate various other conditioning inputs, including noise distributions, pre-existing videos or images, and audio signals, among others. This flexibility lets the system support video generation scenarios beyond text-based prompting.
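One way to organize such heterogeneous conditioning inputs is a simple container that records which signals are present. The ConditioningInputs class below is hypothetical and not part of the patent; it only illustrates that any subset of conditioning signals may be supplied.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class ConditioningInputs:
    """Hypothetical container for the conditioning signals mentioned
    in the text; any subset may be provided."""
    text_embedding: Optional[torch.Tensor] = None   # from the text encoder
    noise: Optional[torch.Tensor] = None            # sampled noise
    reference_video: Optional[torch.Tensor] = None  # pre-existing video/image
    audio: Optional[torch.Tensor] = None            # audio signal

    def active(self):
        # Return only the signals that were actually supplied.
        return {k: v for k, v in self.__dict__.items() if v is not None}

cond = ConditioningInputs(text_embedding=torch.randn(1, 16, 512),
                          noise=torch.randn(1, 3, 8, 32, 32))
print(sorted(cond.active()))  # ['noise', 'text_embedding']
```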
The generative neural networks are trained on examples that each pair a text prompt with a corresponding target video. Training techniques such as masking, together with self-attention mechanisms, help ensure that each network learns effectively while maintaining high-quality output across different resolutions.
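A toy training step under these assumptions might look as follows. The regression loss, the TinyStage network, and the exact form of masking are all assumptions; the sketch only illustrates pairing a prompt embedding with a target video and masking frames so the network must reconstruct them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyStage(nn.Module):
    """Stand-in for one generative stage: a single 3D convolution that
    maps a (masked) video to a video of the same shape. The real networks
    would also attend to the contextual embedding."""
    def __init__(self, channels=3):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, video, context):
        return self.conv(video)  # context ignored in this toy stand-in

def train_step(net, optimizer, context, target, mask_frames=2):
    # Masking (mentioned above): zero out a random run of frames so the
    # network must reconstruct them from the surrounding video.
    masked = target.clone()
    start = torch.randint(0, target.shape[2] - mask_frames, (1,)).item()
    masked[:, :, start:start + mask_frames] = 0.0
    loss = F.mse_loss(net(masked, context), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

net = TinyStage()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
target = torch.randn(1, 3, 8, 32, 32)  # target video from a training pair
context = torch.randn(1, 16, 512)      # embedding of its paired prompt
print(train_step(net, opt, context, target))
```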
The cascading approach enables high-definition video generation without requiring a single model to produce full-resolution output directly. By incrementally refining both pixel resolution and frame rate, the system ensures that the final output video is temporally consistent and visually rich, accurately reflecting the scene described in the initial text prompt.