Invention Title:

SYSTEMS AND METHODS FOR CONTROLLABLE VIDEO GENERATION

Publication number:

US20250175679

Publication date:

Section:

Electricity

Class:

H04N21/816

Inventors:

Applicant:

Smart overview of the Invention

The patent application describes a video generation framework built around a decoupled multimodal cross-attention module, which allows video generation to be conditioned simultaneously on an input image and a text description. By incorporating visual cues from the image, the system achieves zero-shot video generation with minimal fine-tuning and depicts the intended visual content more precisely than text-only conditioning.

Technical Field

The invention pertains to the domain of generative artificial intelligence (AI) systems, focusing on controllable video generation methods. Traditional text-to-video diffusion models often rely solely on textual descriptions, which can limit control over the visual and geometric aspects of generated videos. This framework addresses these limitations by integrating image-based inputs alongside text prompts.

Detailed Description

The described video generation system employs a U-Net-based denoising diffusion model that iteratively refines a noise vector conditioned on both image and text inputs. The model comprises multimodal video blocks (MVBs) featuring spatial-temporal layers and decoupled cross-attention layers that handle image and text inputs separately. This design improves video quality by maintaining frame coherence and visual consistency across the generated content.
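
One way to picture the spatial-temporal layers inside such a multimodal video block is the PyTorch-style sketch below. It is a minimal illustration under assumed tensor shapes and an assumed layer ordering (spatial convolution, per-frame self-attention, then cross-frame temporal attention); the class name SpatialTemporalLayers and all dimensions are hypothetical, and the decoupled cross-attention that follows these layers is sketched separately under Embodiments.

    import torch
    import torch.nn as nn

    class SpatialTemporalLayers(nn.Module):
        """Spatial convolution + per-frame self-attention + cross-frame temporal attention."""
        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.spatial_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
            self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, x):
            # x: (batch, frames, channels, height, width) latent video tensor
            b, f, c, h, w = x.shape

            # spatial convolution applied to every frame independently
            x = self.spatial_conv(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)

            # per-frame spatial self-attention over the h*w token grid
            tokens = x.permute(0, 1, 3, 4, 2).reshape(b * f, h * w, c)
            tokens = tokens + self.spatial_attn(tokens, tokens, tokens)[0]

            # temporal attention across frames at each spatial position
            tokens = tokens.reshape(b, f, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, f, c)
            tokens = tokens + self.temporal_attn(tokens, tokens, tokens)[0]

            # restore the (batch, frames, channels, height, width) layout
            return tokens.reshape(b, h * w, f, c).permute(0, 2, 3, 1).reshape(b, f, c, h, w)

In this sketch the spatial convolution and per-frame self-attention operate as they would in a text-to-image U-Net, which is why pre-trained text-to-image weights can be reused for them, while the temporal attention is the video-specific addition that keeps frames coherent.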

Embodiments

One embodiment utilizes spatial-temporal layers that include spatial convolution, self-attention, and temporal attention layers, allowing reuse of pre-trained weights from text-to-image models. The decoupled multimodal cross-attention layer simultaneously conditions video generation on image and text inputs, leveraging visual cues for enhanced temporal consistency. Additionally, a pre-trained image ControlNet module may be integrated to manage geometric structure without additional training.
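
The decoupled cross-attention layer can be sketched along the following lines: the video features produce a shared query, while the text and image contexts receive separate key/value projections whose attention outputs are combined. This is an illustrative assumption about the layer's internals rather than the claimed design; the module name, the additive combination, and all dimensions are made up for the example.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecoupledCrossAttention(nn.Module):
        """Shared query; separate key/value projections for text and image contexts."""
        def __init__(self, dim, ctx_dim, num_heads=8):
            super().__init__()
            self.num_heads = num_heads
            self.to_q = nn.Linear(dim, dim)
            # text branch
            self.to_k_text = nn.Linear(ctx_dim, dim)
            self.to_v_text = nn.Linear(ctx_dim, dim)
            # image branch
            self.to_k_img = nn.Linear(ctx_dim, dim)
            self.to_v_img = nn.Linear(ctx_dim, dim)
            self.to_out = nn.Linear(dim, dim)

        def _attend(self, q, k, v):
            b, n, d = q.shape
            h = self.num_heads
            # split heads: (batch, tokens, dim) -> (batch, heads, tokens, dim // heads)
            q, k, v = (t.reshape(b, -1, h, d // h).transpose(1, 2) for t in (q, k, v))
            out = F.scaled_dot_product_attention(q, k, v)
            return out.transpose(1, 2).reshape(b, n, d)

        def forward(self, x, text_ctx, image_ctx):
            # x: (batch, tokens, dim); text_ctx / image_ctx: (batch, ctx_tokens, ctx_dim)
            q = self.to_q(x)
            text_out = self._attend(q, self.to_k_text(text_ctx), self.to_v_text(text_ctx))
            image_out = self._attend(q, self.to_k_img(image_ctx), self.to_v_img(image_ctx))
            # the two conditioning streams are combined additively (assumed)
            return x + self.to_out(text_out + image_out)

Because the text branch has the same shape as the cross-attention found in text-to-image models, its weights could be initialized from a pre-trained model while only the image branch is trained from scratch; the pre-trained image ControlNet mentioned above would then inject geometric guidance through residual features added to the U-Net rather than through this attention layer. Both of these integration details are assumptions of the sketch.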

Training Framework

The training process involves a latent diffusion model that denoises Gaussian noise sequences guided by text and image prompts. During training, the model learns to progressively remove noise from latent representations over multiple iterations. This iterative denoising process enables the generation of videos that align closely with the conditioning inputs, improving overall quality and applicability in tasks like image animation and video editing.
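
A single training step of such a latent diffusion model typically looks like the sketch below: Gaussian noise is added to the latent video at a randomly sampled timestep, and the denoising network is trained to predict that noise given the text and image embeddings. This is a generic denoising-diffusion training step, not the patent's specific procedure; every name here (model, vae_encode, text_encoder, image_encoder, alphas_cumprod) is a hypothetical placeholder.

    import torch
    import torch.nn.functional as F

    def training_step(model, vae_encode, text_encoder, image_encoder,
                      video, text_prompt, image_prompt, alphas_cumprod):
        # encode the ground-truth video into per-frame latents (names are assumed)
        latents = vae_encode(video)                      # (batch, frames, c, h, w)

        # conditioning embeddings from the two modalities
        text_ctx = text_encoder(text_prompt)             # (batch, n_text, ctx_dim)
        image_ctx = image_encoder(image_prompt)          # (batch, n_img,  ctx_dim)

        # sample a diffusion timestep and the Gaussian noise to be predicted
        b = latents.shape[0]
        t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=latents.device)
        noise = torch.randn_like(latents)

        # standard forward diffusion: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise
        a_t = alphas_cumprod[t].view(b, 1, 1, 1, 1)
        noisy_latents = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise

        # the denoising U-Net predicts the noise, conditioned on both contexts
        noise_pred = model(noisy_latents, t, text_ctx=text_ctx, image_ctx=image_ctx)

        # mean-squared error between predicted and true noise
        return F.mse_loss(noise_pred, noise)

At inference time the same model runs in reverse: starting from pure Gaussian noise, the predicted noise is removed step by step until a clean latent video remains, which is then decoded back to pixel frames.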