US20240320965
2024-09-26
Physics
G06V10/82
A video generation system uses a sequence of generative neural networks to create videos from text prompts. The process begins with receiving a text prompt that describes a scene. A text encoder neural network processes this prompt into a contextual embedding, which serves as the conditioning input for generating the final video.
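As an illustration, the encoding step might look like the following minimal PyTorch sketch. The patent does not specify an encoder architecture, so SimpleTextEncoder, its vocabulary size, and its dimensions are hypothetical stand-ins for a generic transformer-style text encoder.

```python
import torch
import torch.nn as nn

class SimpleTextEncoder(nn.Module):
    """Hypothetical text encoder: token embeddings plus transformer layers
    producing one contextual embedding per token of the prompt."""
    def __init__(self, vocab_size=32000, dim=512, layers=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor of tokenized prompt
        return self.encoder(self.embed(token_ids))  # (batch, seq_len, dim)

# Toy usage: a tokenized prompt becomes a contextual embedding that
# conditions the generative networks downstream.
tokens = torch.randint(0, 32000, (1, 16))
context = SimpleTextEncoder()(tokens)
print(context.shape)  # torch.Size([1, 16, 512])
```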
The system employs an initial generative neural network that takes the contextual embedding and generates an initial output video at a base spatial resolution and temporal resolution. Each subsequent generative neural network receives the output video of the preceding network and refines it by increasing its spatial resolution, its temporal resolution, or both.
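A minimal sketch of the cascade follows. The learned super-resolution stages are stood in for by simple interpolation, and the (batch, channels, frames, height, width) tensor layout is an assumption; the point is only to show how each stage receives the previous stage's video and increases its spatial or temporal resolution.

```python
import torch
import torch.nn.functional as F

def spatial_upsample(video, scale=2):
    # Placeholder for a learned spatial super-resolution network:
    # doubles the height and width of every frame.
    b, c, t, h, w = video.shape
    frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    frames = F.interpolate(frames, scale_factor=scale, mode="bilinear",
                           align_corners=False)
    return frames.reshape(b, t, c, h * scale, w * scale).permute(0, 2, 1, 3, 4)

def temporal_upsample(video, scale=2):
    # Placeholder for a learned frame-interpolation network:
    # doubles the number of frames.
    return F.interpolate(video, scale_factor=(scale, 1, 1),
                         mode="trilinear", align_corners=False)

# Initial low-resolution output: 1 video, 3 channels, 8 frames, 32x32 pixels.
video = torch.randn(1, 3, 8, 32, 32)
for stage in (spatial_upsample, temporal_upsample, spatial_upsample):
    video = stage(video)
print(video.shape)  # torch.Size([1, 3, 16, 128, 128])
```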
While the primary input is a text prompt, the system is designed to accommodate various other conditioning inputs, including noise distributions, pre-existing videos or images, and audio signals, among others. This flexibility lets the system support video generation scenarios beyond text-based prompting.
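One way to organize such heterogeneous conditioning inputs is a simple container that records which signals are present. The ConditioningInputs class below is hypothetical and not part of the patent; it only illustrates that any subset of conditioning signals may be supplied.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class ConditioningInputs:
    """Hypothetical container for the conditioning signals mentioned
    in the text; any subset may be provided."""
    text_embedding: Optional[torch.Tensor] = None   # from the text encoder
    noise: Optional[torch.Tensor] = None            # sampled noise
    reference_video: Optional[torch.Tensor] = None  # pre-existing video/image
    audio: Optional[torch.Tensor] = None            # audio signal

    def active(self):
        # Return only the signals that were actually supplied.
        return {k: v for k, v in self.__dict__.items() if v is not None}

cond = ConditioningInputs(text_embedding=torch.randn(1, 16, 512),
                          noise=torch.randn(1, 3, 8, 32, 32))
print(sorted(cond.active()))  # ['noise', 'text_embedding']
```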
The generative neural networks are trained on examples that each pair a text prompt with a corresponding target video. Training techniques such as masking, together with self-attention mechanisms, help ensure that each network learns effectively while maintaining high-quality output across different resolutions.
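A toy training step under these assumptions might look as follows. The regression loss, the TinyStage network, and the exact form of masking are all assumptions; the sketch only illustrates pairing a prompt embedding with a target video and masking frames so the network must reconstruct them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyStage(nn.Module):
    """Stand-in for one generative stage: a single 3D convolution that
    maps a (masked) video to a video of the same shape. The real networks
    would also attend to the contextual embedding."""
    def __init__(self, channels=3):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, video, context):
        return self.conv(video)  # context ignored in this toy stand-in

def train_step(net, optimizer, context, target, mask_frames=2):
    # Masking (mentioned above): zero out a random run of frames so the
    # network must reconstruct them from the surrounding video.
    masked = target.clone()
    start = torch.randint(0, target.shape[2] - mask_frames, (1,)).item()
    masked[:, :, start:start + mask_frames] = 0.0
    loss = F.mse_loss(net(masked, context), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

net = TinyStage()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
target = torch.randn(1, 3, 8, 32, 32)  # target video from a training pair
context = torch.randn(1, 16, 512)      # embedding of its paired prompt
print(train_step(net, opt, context, target))
```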
The cascading approach enables high-definition video generation without requiring a single model to produce full-resolution output directly. By incrementally refining both pixel resolution and frame rate, the system ensures that the final output video is temporally consistent and visually rich, accurately reflecting the scene described in the initial text prompt.