Invention Title:

Video Diffusion Model For Virtual Try-On

Publication number:

US20260120176

Publication date:

2026-04-30

Section:

Physics

Class:

G06Q30/06432

Inventors:

Irena KEMELMAHER 🇺🇸 Seattle, WA, United States

Innfarn Yoo 🇺🇸 Fremont, CA, United States

Luyang Zhu 🇺🇸 Seattle, WA, United States

Yingwei Li 🇺🇸 San Jose, CA, United States

Nan Liu 🇺🇸 New York, NY, United States

Johanna Suvi Karras 🇺🇸 Seattle, WA, United States

Andreas Franz Lugmayr 🇺🇸 New York, NY, United States

Christopher Albert Lee 🇺🇸 New York, NY, United States

Applicant:

Google LLC 🇺🇸 Mountain View, CA, United States

Smart overview of the Invention

Systems and methods for video virtual try-on (VVT) use machine-learned video diffusion models to generate high-quality try-on videos of a person wearing a specified garment. Given a garment image and a person video as input, the system generates a video that preserves the person's identity and motion while showing the garment from various angles. The approach targets two central difficulties in VVT: preserving fabric dynamics and maintaining temporal consistency.

Challenges in VVT

Video virtual try-on is difficult because the garment must be synthesized realistically from changing viewpoints, with plausible motion dynamics such as folds and wrinkles. Try-on video data is scarce, and perfect ground-truth data is hard to acquire. Traditional flow-based methods often produce artifacts, especially under occlusion and large pose deformations, and they capture fine-grained fabric dynamics poorly.

Methodology

The proposed method is computer-implemented: a machine-learned diffusion model processes a noisy input to produce an initial prediction. Over multiple diffusion timesteps, split classifier-free guidance updates that prediction based on conditioning inputs such as clothing-agnostic images, garment descriptions, and pose data. The system outputs images depicting the person wearing the garment, with temporal consistency across frames.
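The summary does not give a formula for split classifier-free guidance, so the following is only a minimal sketch of one plausible form: conditioning groups are switched on one at a time, and each group's incremental effect on the prediction gets its own weight. The names `split_cfg`, `toy_denoiser`, `groups`, `weights`, and `uncond_weight` are hypothetical, the toy denoiser stands in for the learned model, and the code covers a single timestep rather than the full sampling loop.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
# Hypothetical denoiser interface: (noisy input, active conditioning) -> prediction.
Denoiser = Callable[[Vector, Dict[str, float]], Vector]


def split_cfg(
    denoiser: Denoiser,
    noisy: Vector,
    groups: Sequence[Tuple[str, float]],
    weights: Sequence[float],
    uncond_weight: float = 1.0,
) -> Vector:
    """Combine predictions for one timestep, weighting each conditioning group."""
    if len(groups) != len(weights):
        raise ValueError("need exactly one weight per conditioning group")
    active: Dict[str, float] = {}
    prev = denoiser(noisy, active)  # unconditional prediction (no groups active)
    guided = [uncond_weight * p for p in prev]
    for (name, value), weight in zip(groups, weights):
        active = {**active, name: value}  # switch on one more conditioning group
        cur = denoiser(noisy, active)
        # Add this group's incremental effect, scaled by its own weight.
        guided = [g + weight * (c - p) for g, c, p in zip(guided, cur, prev)]
        prev = cur
    return guided


def toy_denoiser(noisy: Vector, cond: Dict[str, float]) -> Vector:
    """Stand-in for the learned model: shifts the input by the summed conditioning."""
    shift = sum(cond.values())
    return [x + shift for x in noisy]


if __name__ == "__main__":
    # Assumed group order and values, chosen only for illustration.
    groups = [("agnostic", 1.0), ("garment", 2.0), ("pose", 3.0)]
    print(split_cfg(toy_denoiser, [0.0, 1.0], groups, [2.0, 1.0, 1.0]))
```

With every weight set to 1, the incremental terms telescope and the result equals the fully conditioned prediction. Raising a single weight strengthens only that group's influence, which is the kind of per-input control the section describes.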

Training Techniques

The diffusion model is trained progressively: it first learns to generate single denoised images, then videos with increasing numbers of frames. This progressive temporal training helps maintain temporal consistency while staying within computational constraints. Joint training on both image and video batches further improves the model, especially when video data is limited.
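The progressive schedule and the joint image/video batching can be sketched as follows. This is an illustration under stated assumptions, not the patent's training recipe: the doubling rule, the starting clip length of 8, and the mixing probability `image_prob` are all hypothetical, as are the function names. Only the single-image first stage and the growth toward longer clips come from the text.

```python
import random
from typing import List


def frame_schedule(start_frames: int, max_frames: int) -> List[int]:
    """Frames per training stage: single images first, then longer clips.

    Doubling between stages is an assumption made for this sketch.
    """
    if start_frames < 2 or max_frames < start_frames:
        raise ValueError("expected 2 <= start_frames <= max_frames")
    stages = [1]  # stage 0: single denoised images
    n = start_frames
    while n < max_frames:
        stages.append(n)
        n *= 2
    stages.append(max_frames)  # final stage trains at the full clip length
    return stages


def frames_for_step(stage_frames: int, rng: random.Random, image_prob: float) -> int:
    """Joint training: with probability image_prob, use an image batch.

    An image batch is treated here as a 1-frame clip; otherwise the step
    uses a video batch at the current stage's clip length.
    """
    return 1 if rng.random() < image_prob else stage_frames


if __name__ == "__main__":
    rng = random.Random(0)
    for stage, frames in enumerate(frame_schedule(8, 64)):
        sample = [frames_for_step(frames, rng, image_prob=0.25) for _ in range(4)]
        print(f"stage {stage}: clip length {frames}, sampled batch lengths {sample}")
```

Treating an image as a one-frame clip lets the same model consume both kinds of batch, so image data can supplement scarce video data at every stage.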

Innovations and Benefits

The key innovation is a diffusion-based architecture for VVT, which gives finer control over conditioning inputs and improves garment detail and temporal consistency. The model can generate a 64-frame video in a single inference pass and remains effective even when video training data is limited. The approach thereby addresses limitations of earlier VVT methods and offers a more robust basis for virtual try-on applications.