Invention Title:

Video Diffusion Model

Publication number:

US20250238905

Publication date:
Section:

Physics

Class:

G06T5/60

Inventors:

Applicant:

Drawings: 9

Smart overview of the Invention

The video generation model discussed here is designed to synthesize videos with realistic, diverse, and coherent motion. It addresses the challenge of generating globally coherent motion over an entire video clip without incurring extreme computational costs. Prior text-to-video (T2V) techniques typically inflate a pre-trained text-to-image (T2I) model by adding temporal layers, which often results in high computational demands and limited global coherence due to temporal aliasing ambiguities in low-frame-rate videos.
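The inflation scheme mentioned above can be pictured as interleaving new temporal layers between the pre-trained spatial layers: spatial layers process each frame independently (time folded into the batch axis), while temporal layers mix information along each pixel's trajectory through time. The following NumPy sketch is illustrative only; the tensor layout, toy operations, and layer names are assumptions, not taken from the patent.

```python
import numpy as np

B, T, C, H, W = 1, 8, 4, 16, 16           # batch, frames, channels, height, width
video = np.random.randn(B, T, C, H, W)

def spatial_layer(x):
    # Pre-trained T2I layer: fold time into the batch so each frame is
    # processed independently, exactly as the image model expects.
    b, t, c, h, w = x.shape
    frames = x.reshape(b * t, c, h, w)
    return (frames * 0.5).reshape(b, t, c, h, w)   # stand-in for conv/attention

def temporal_layer(x):
    # Newly inserted layer: a simple [0.25, 0.5, 0.25] smoothing along the
    # time axis as a stand-in for temporal convolution or attention.
    pad = np.pad(x, ((0, 0), (1, 1), (0, 0), (0, 0), (0, 0)), mode="edge")
    return 0.25 * pad[:, :-2] + 0.5 * pad[:, 1:-1] + 0.25 * pad[:, 2:]

out = temporal_layer(spatial_layer(video))
print(out.shape)   # (1, 8, 4, 16, 16)
```

Note that the temporal resolution T is unchanged from input to output; keeping T fixed through the whole network is precisely the property of prior inflation-based T2V designs that the approach below departs from.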

Innovative Approach

This model introduces a machine-learned denoising diffusion approach that down-samples signals in both space and time. It performs most computations on a compact space-time representation, enabling the simultaneous generation of multiple frames (e.g., 80 frames at 16 frames per second) without relying on a cascade of temporal super-resolution (TSR) models. This design diverges from existing T2V methods that maintain a fixed temporal resolution across the network.
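The benefit of down-sampling in both space and time can be sketched with simple average pooling: most network computation runs on the compact representation, whose size shrinks by the product of the two down-sampling factors. The factors, resolutions, and pooling operation here are illustrative assumptions; only the 80-frame, 16 fps figure comes from the text above.

```python
import numpy as np

def pool(x, ft, fs):
    """Average-pool a (T, H, W) video by factor ft in time and fs in space."""
    T, H, W = x.shape
    return x.reshape(T // ft, ft, H // fs, fs, W // fs, fs).mean(axis=(1, 3, 5))

video = np.random.randn(80, 64, 64)    # 80 frames at 16 fps, toy 64x64 resolution
compact = pool(video, ft=4, fs=4)      # compact space-time representation
print(compact.shape)                   # (20, 16, 16)

ratio = video.size // compact.size
print(ratio)                           # 64: activations per layer shrink 64x
```

Because all frames coexist in the compact representation, the model can reason about the whole clip's motion at once instead of delegating temporal coherence to a cascade of TSR models.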

Technical Implementation

The model employs an inflation scheme for a pre-trained spatial super-resolution (SSR) model to increase the spatial resolution of the frames produced by the base denoising diffusion model. This SSR approach mitigates the high memory consumption associated with inflating image SSR models to be temporally aware. By extending a multi-diffusion approach to the temporal domain, it computes spatial super-resolution on overlapping temporal windows and aggregates the results into a globally coherent solution across the entire video clip.
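The windowed aggregation described above can be sketched as overlapping-window processing with weighted averaging. This is a simplified single-pass sketch: the window size, stride, and the nearest-neighbour `ssr` stand-in are assumptions, and the actual multi-diffusion method averages denoising predictions at each diffusion step rather than finished frames.

```python
import numpy as np

def ssr(window):
    """Stand-in spatial super-resolution: 2x nearest-neighbour upsampling."""
    return window.repeat(2, axis=1).repeat(2, axis=2)

def windowed_ssr(video, win=16, stride=8):
    """Apply SSR per temporal window, averaging frames where windows overlap."""
    T, H, W = video.shape
    out = np.zeros((T, H * 2, W * 2))
    weight = np.zeros((T, 1, 1))
    for s in range(0, T - win + 1, stride):
        out[s:s + win] += ssr(video[s:s + win])
        weight[s:s + win] += 1.0
    return out / weight          # overlap averaging enforces window agreement

video = np.random.randn(80, 32, 32)
hi = windowed_ssr(video)
print(hi.shape)                  # (80, 64, 64)
```

The overlap averaging is what stitches independently processed windows into one consistent clip: frames covered by two windows are constrained to agree with both.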

Computational Efficiency

By performing SSR over smaller windows while preserving global motion consistency, the model significantly reduces computational requirements compared to processing all frames simultaneously. This allows for efficient video synthesis without compromising on motion coherence or resolution quality, making it suitable for applications demanding high-quality video generation with lower computational overhead.
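A back-of-envelope calculation shows why windowing matters: peak activation memory scales with the window length rather than the clip length. The resolutions, channel count, and fp16 assumption below are illustrative numbers, not figures from the patent.

```python
# Peak-activation comparison for one SSR layer (illustrative numbers).
frames, h, w, c, bytes_per = 80, 1024, 1024, 4, 2   # fp16 activations
all_at_once = frames * h * w * c * bytes_per        # process whole clip at once
win = 16
windowed = win * h * w * c * bytes_per              # process one window at a time
print(all_at_once / windowed)                       # 5.0
```

Under these assumptions the windowed scheme needs one fifth of the peak memory, and the saving grows linearly with clip length since only one window is resident at a time.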

Applications and Benefits

The proposed technology is applicable in various fields such as entertainment, virtual reality, and training simulations. By optimizing memory usage and ensuring coherent motion in generated videos, it offers a practical solution to traditional T2V challenges. The innovative use of denoising diffusion models and SSR techniques provides a balanced approach to high-quality video generation with reduced computational costs.