US20250299302
2025-09-25
Physics
G06T5/60
The patent application describes a system and method for multi-garment virtual try-on and editing, known as M&M VTO. It enables users to visualize how different combinations of garments would look on a person. Inputs can include images of multiple garments, an image of a person, and optionally a text description for garment layout. The output is a high-resolution visualization showing the person in the desired outfit layout.
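The described inputs and outputs can be sketched as a minimal request interface. All names here are illustrative assumptions; the application does not specify an API, only that the system accepts a person image, multiple garment images, and an optional layout text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical input bundle for a multi-garment try-on request.
# Field names are illustrative; the application specifies only the
# kinds of inputs, not any concrete data structure.
@dataclass
class TryOnRequest:
    person_image: bytes                 # photo of the person
    garment_images: List[bytes]         # one image per garment to combine
    layout_text: Optional[str] = None   # e.g. "shirt tucked in, sleeves rolled"

def validate(request: TryOnRequest) -> bool:
    """A well-formed request needs a person image and at least one garment."""
    return bool(request.person_image) and len(request.garment_images) >= 1
```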
In virtual shopping and fashion design, realistic representation of clothing on individuals is challenging without physical fitting. Current virtual try-on (VTO) technologies often focus on single garments, limiting the ability to visualize complete outfits. Existing methods typically use multi-stage models that may lose garment details at higher resolutions. Another issue is the loss of personal identity due to 'clothing-agnostic' representations that erase distinguishing features, reducing realism.
The proposed method utilizes a single-stage diffusion-based model to mix and match multiple garments while preserving intricate details. This approach eliminates the need for super-resolution cascading, directly synthesizing high-resolution images. The method also includes a unique architecture that separates denoising from feature extraction, allowing efficient fine-tuning to preserve person identity without requiring a large model per individual.
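The single-stage idea can be illustrated with a toy numerical sketch: a frozen feature-extraction branch conditions a denoising loop that produces the output directly at the target resolution, with no cascade of upscaling models. The linear "encoder" and update rule below are stand-ins chosen for clarity, not the patent's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(garment_images: np.ndarray, W_frozen: np.ndarray) -> np.ndarray:
    # Toy linear "encoder": stands in for the frozen feature-extraction
    # branch that is kept separate from the denoiser.
    return garment_images @ W_frozen

def denoise_step(noisy: np.ndarray, features: np.ndarray, t: float) -> np.ndarray:
    # Toy update: interpolate toward the conditioning features,
    # more strongly at low noise levels (small t).
    return t * noisy + (1.0 - t) * features

def single_stage_sample(features: np.ndarray, steps: int = 10,
                        shape: tuple = (4,)) -> np.ndarray:
    """Run one denoising loop at the target resolution -- no
    super-resolution cascade follows this single stage."""
    x = rng.standard_normal(shape)
    for i in range(steps, 0, -1):
        x = denoise_step(x, features, t=i / steps)
    return x
```

Because feature extraction is a separate branch, identity-preserving fine-tuning could in principle touch only a small adapter on that branch while the denoiser stays shared across users.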
An innovative feature is the use of text inputs to control garment layout. A text embedding model, fine-tuned for virtual try-on tasks, allows precise specification of garment attributes such as rolled sleeves or tucked shirts. The formulation can treat attribute extraction as an image captioning task, so that attribute labels are derived automatically from annotated images, enhancing accuracy and realism.
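A minimal sketch of the layout-text idea, assuming a small keyword-to-attribute mapping: free-form text is reduced to attribute tokens that would condition the model. The vocabulary and token names are hypothetical; the application mentions attributes like rolled sleeves and tucked shirts but does not define a parsing scheme.

```python
# Hypothetical layout-attribute vocabulary (illustrative only).
LAYOUT_ATTRIBUTES = {
    "rolled sleeves": "sleeves_rolled",
    "tucked": "shirt_tucked",
    "untucked": "shirt_untucked",
    "open jacket": "jacket_open",
}

def parse_layout(text: str) -> set:
    """Map free-form layout text to attribute tokens that would
    condition the try-on model."""
    text = text.lower()
    found = set()
    for phrase, token in LAYOUT_ATTRIBUTES.items():
        if phrase in text:
            found.add(token)
    # "tucked" is a substring of "untucked"; keep only the specific match
    if "shirt_untucked" in found:
        found.discard("shirt_tucked")
    return found
```

In the described system the analogous role is played by a fine-tuned text embedding model rather than keyword matching; this sketch only illustrates the mapping from layout language to controllable attributes.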
The disclosure includes a progressive training strategy that starts with low-resolution images and gradually moves to high-resolution ones, helping the model learn coarse structure before refining fine details. The system can be applied in settings such as online shopping platforms, allowing customers to virtually try on garments before purchasing. Throughout, the single-stage denoising diffusion model simplifies the process by generating the synthetic image directly from the input set, including any textual description used for layout accuracy.
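The progressive low-to-high-resolution strategy can be sketched as a stage schedule. The specific resolutions, doubling rule, and step counts below are assumptions for illustration; the application states only that training begins at low resolution and moves gradually to high resolution.

```python
def progressive_schedule(start: int = 256, target: int = 1024,
                         steps_per_stage: int = 1000) -> list:
    """Yield (resolution, training_steps) stages, doubling the image
    resolution each stage until the target is reached.
    Values are illustrative, not from the patent."""
    stages = []
    res = start
    while res < target:
        stages.append((res, steps_per_stage))
        res *= 2
    stages.append((target, steps_per_stage))
    return stages
```

A training loop would consume these stages in order, resuming the model weights from the previous stage so detail is refined on top of already-learned coarse structure.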