US20250209806
2025-06-26
Physics
G06V10/82
The patent application describes a system and method for generating videos using sequences of generative neural networks (GNNs). The process begins with a text prompt that describes a scene. This text prompt is processed by a text encoder neural network to create a contextual embedding, which is then used by a sequence of GNNs to produce a video depicting the described scene. The system is designed to generate high-definition, temporally consistent videos from various types of conditioning inputs, such as text prompts, pre-existing videos, images, or audio signals.
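The application does not tie the conditioning step to any particular encoder. As a minimal sketch, assuming a frozen T5-style encoder from the Hugging Face transformers library (an assumption for illustration, not part of the application), the prompt-to-embedding step might look like this:

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Hypothetical choice of encoder; the application only requires a pre-trained
# text encoder that maps a prompt to a contextual embedding.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
text_encoder = T5EncoderModel.from_pretrained("t5-small")
text_encoder.eval()  # the encoder stays frozen; only the generative networks are trained

prompt = "a red kite flying over a snowy mountain ridge"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Contextual embedding: one vector per prompt token,
    # later consumed by every generative network in the cascade.
    context = text_encoder(**tokens).last_hidden_state  # shape [1, seq_len, d_model]
```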
The method involves multiple stages of processing. Initially, an input text prompt is transformed into a contextual embedding by a text encoder neural network. This embedding is fed into an initial generative neural network, which outputs an initial video at a specified spatial and temporal resolution. Each subsequent GNN in the sequence takes the video produced by the preceding network as input and increases either its spatial or its temporal resolution. The process continues until a final video at the target resolution and quality is achieved.
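The cascade itself can be read as a simple loop over stages. The sketch below is illustrative only: the tensor layout ([batch, frames, channels, height, width]), the stage names, and the interpolation-based placeholders are assumptions standing in for the trained diffusion-based GNNs described above.

```python
import torch
import torch.nn.functional as F

def run_cascade(context: torch.Tensor, stages) -> torch.Tensor:
    """Run a sequence of generative stages.

    `stages` is a list of callables; the first maps the contextual embedding
    to a low-resolution video, and each later stage maps (video, context) to
    a video with higher spatial or temporal resolution.
    """
    video = stages[0](context)             # e.g. [B, T, C, H, W] at base resolution
    for upsampler in stages[1:]:
        video = upsampler(video, context)  # raise spatial or temporal resolution
    return video

# Placeholder stages standing in for trained generative networks.
def base_stage(context):
    batch = context.shape[0]
    return torch.randn(batch, 16, 3, 24, 40)   # 16 frames at 24x40 pixels

def spatial_upsampler(video, context):
    # Double the spatial resolution of every frame.
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)
    frames = F.interpolate(frames, scale_factor=2, mode="bilinear", align_corners=False)
    return frames.reshape(b, t, c, h * 2, w * 2)

def temporal_upsampler(video, context):
    # Insert in-between frames (here by trivial interpolation along time).
    b, t, c, h, w = video.shape
    flat = video.permute(0, 2, 3, 4, 1).reshape(b, c * h * w, t)
    flat = F.interpolate(flat, scale_factor=2, mode="linear", align_corners=False)
    return flat.reshape(b, c, h, w, 2 * t).permute(0, 4, 1, 2, 3)

context = torch.randn(1, 8, 512)
final_video = run_cascade(context, [base_stage, spatial_upsampler, temporal_upsampler])
print(final_video.shape)  # torch.Size([1, 32, 3, 48, 80])
```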
Each generative neural network in the sequence may serve a different function. For example, some networks might improve spatial resolution through spatial self-attention and convolution, while others might increase temporal resolution by generating additional frames. The GNNs can be trained and sampled using techniques such as classifier-free guidance and progressive distillation. Additionally, noise conditioning augmentation can be applied to the lower-resolution video that conditions each upsampling stage, making the cascade more robust and further refining the video output.
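To illustrate two of the techniques named above, the following sketch shows classifier-free guidance at sampling time (blending conditional and unconditional predictions with a guidance weight) and one common variance-preserving form of noise conditioning augmentation. The function names, the toy denoiser, and the specific augmentation formula are assumptions for illustration, not taken from the application.

```python
import torch

def classifier_free_guidance(denoiser, noisy_video, t, context, guidance_weight=5.0):
    """Combine conditional and unconditional predictions at sampling time.

    `denoiser` stands in for one GNN of the cascade; passing context=None
    represents the null (unconditional) embedding.
    """
    eps_cond = denoiser(noisy_video, t, context)
    eps_uncond = denoiser(noisy_video, t, None)
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

def noise_condition(low_res_video, aug_level=0.1):
    """Noise conditioning augmentation (one common variance-preserving form):
    perturb the conditioning video and report the augmentation level so the
    network can account for it."""
    noise = torch.randn_like(low_res_video)
    augmented = (1.0 - aug_level ** 2) ** 0.5 * low_res_video + aug_level * noise
    return augmented, aug_level

def toy_denoiser(x, t, ctx):
    # Stands in for a trained video generation network.
    return torch.zeros_like(x)

video = torch.randn(1, 8, 3, 24, 40)
ctx = torch.randn(1, 8, 512)
guided_eps = classifier_free_guidance(toy_denoiser, video, t=0.5, context=ctx)
augmented, level = noise_condition(video)
```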
The generative neural networks are trained on diverse examples that pair text prompts with target videos depicting the corresponding scenes. Training can also include image-based examples, in which the frames depict variations of a scene described by the text prompt. When training on such image examples, temporal self-attention and convolution can be masked out so that the networks focus on spatial features. The text encoder is pre-trained and remains frozen throughout, so that the contextual embeddings it produces stay consistent.
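A rough sketch of how a spatio-temporal block might skip its temporal path when a training example is an image treated as a one-frame video is shown below. The block layout is hypothetical and, for brevity, uses only convolutions, whereas the application also mentions self-attention.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Illustrative block: the spatial path always runs; the temporal path
    is skipped (masked out) for image-only training examples."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.temporal_conv = nn.Conv1d(channels, channels, 3, padding=1)

    def forward(self, video: torch.Tensor, mask_temporal: bool) -> torch.Tensor:
        b, t, c, h, w = video.shape
        # Spatial path: applied to every frame independently.
        x = self.spatial_conv(video.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        if mask_temporal:
            return x  # image example: no information mixed across frames
        # Temporal path: mix information along the frame axis.
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        seq = self.temporal_conv(seq)
        return seq.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)

block = SpatioTemporalBlock(channels=3)
image_batch = torch.randn(2, 1, 3, 32, 32)   # images as single-frame "videos"
video_batch = torch.randn(2, 8, 3, 32, 32)
out_img = block(image_batch, mask_temporal=True)
out_vid = block(video_batch, mask_temporal=False)
```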
This system offers significant advantages in generating videos with high spatial and temporal resolution from textual descriptions. By using a cascade of GNNs, each conditioned on the contextual embedding of the input text prompt, the system scales up video resolution efficiently rather than relying on a single network to produce the final output quality directly. This approach allows detailed refinement of individual frames and smoother transitions between frames, yielding visually coherent, high-quality videos that align closely with the input descriptions.