US20250299302
2025-09-25
Physics
G06T5/60
The patent application describes a system and method for multi-garment virtual try-on and editing, known as M&M VTO. It enables users to visualize how different combinations of garments would look on a person. Inputs can include images of multiple garments, an image of a person, and optionally a text description for garment layout. The output is a high-resolution visualization showing the person in the desired outfit layout.
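The described inputs and outputs can be sketched as a minimal request interface. All names here are illustrative assumptions; the application does not specify an API, only that the system accepts a person image, multiple garment images, and an optional layout text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical input bundle for a multi-garment try-on request.
# Field names are illustrative; the application specifies only the
# kinds of inputs, not any concrete data structure.
@dataclass
class TryOnRequest:
    person_image: bytes                 # photo of the person
    garment_images: List[bytes]         # one image per garment to combine
    layout_text: Optional[str] = None   # e.g. "shirt tucked in, sleeves rolled"

def validate(request: TryOnRequest) -> bool:
    """A well-formed request needs a person image and at least one garment."""
    return bool(request.person_image) and len(request.garment_images) >= 1
```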
In virtual shopping and fashion design, realistic representation of clothing on individuals is challenging without physical fitting. Current virtual try-on (VTO) technologies often focus on single garments, limiting the ability to visualize complete outfits. Existing methods typically use multi-stage models that may lose garment details at higher resolutions. Another issue is the loss of personal identity due to 'clothing-agnostic' representations that erase distinguishing features, reducing realism.
The proposed method utilizes a single-stage diffusion-based model to mix and match multiple garments while preserving intricate details. This approach eliminates the need for super-resolution cascading, directly synthesizing high-resolution images. The method also includes a unique architecture that separates denoising from feature extraction, allowing efficient fine-tuning to preserve person identity without requiring a large model per individual.
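The single-stage idea can be illustrated with a toy numerical sketch: a frozen feature-extraction branch conditions a denoising loop that produces the output directly at the target resolution, with no cascade of upscaling models. The linear "encoder" and update rule below are stand-ins chosen for clarity, not the patent's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(garment_images: np.ndarray, W_frozen: np.ndarray) -> np.ndarray:
    # Toy linear "encoder": stands in for the frozen feature-extraction
    # branch that is kept separate from the denoiser.
    return garment_images @ W_frozen

def denoise_step(noisy: np.ndarray, features: np.ndarray, t: float) -> np.ndarray:
    # Toy update: interpolate toward the conditioning features,
    # more strongly at low noise levels (small t).
    return t * noisy + (1.0 - t) * features

def single_stage_sample(features: np.ndarray, steps: int = 10,
                        shape: tuple = (4,)) -> np.ndarray:
    """Run one denoising loop at the target resolution -- no
    super-resolution cascade follows this single stage."""
    x = rng.standard_normal(shape)
    for i in range(steps, 0, -1):
        x = denoise_step(x, features, t=i / steps)
    return x
```

Because feature extraction is a separate branch, identity-preserving fine-tuning could in principle touch only a small adapter on that branch while the denoiser stays shared across users.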
An innovative feature is the use of text inputs to control garment layout. A text embedding model, fine-tuned for virtual try-on tasks, allows precise specification of garment attributes such as rolled sleeves or tucked shirts. The formulation can treat attribute extraction as an image captioning task, so that attribute labels are derived automatically from annotated images, enhancing accuracy and realism.
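A minimal sketch of the layout-text idea, assuming a small keyword-to-attribute mapping: free-form text is reduced to attribute tokens that would condition the model. The vocabulary and token names are hypothetical; the application mentions attributes like rolled sleeves and tucked shirts but does not define a parsing scheme.

```python
# Hypothetical layout-attribute vocabulary (illustrative only).
LAYOUT_ATTRIBUTES = {
    "rolled sleeves": "sleeves_rolled",
    "tucked": "shirt_tucked",
    "untucked": "shirt_untucked",
    "open jacket": "jacket_open",
}

def parse_layout(text: str) -> set:
    """Map free-form layout text to attribute tokens that would
    condition the try-on model."""
    text = text.lower()
    found = set()
    for phrase, token in LAYOUT_ATTRIBUTES.items():
        if phrase in text:
            found.add(token)
    # "tucked" is a substring of "untucked"; keep only the specific match
    if "shirt_untucked" in found:
        found.discard("shirt_tucked")
    return found
```

In the described system the analogous role is played by a fine-tuned text embedding model rather than keyword matching; this sketch only illustrates the mapping from layout language to controllable attributes.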
The disclosure includes a progressive training strategy that starts with low-resolution images and gradually moves to high-resolution ones, helping the model learn coarse structure before refining fine details. The system can be applied in settings such as online shopping platforms, allowing customers to virtually try on garments before purchasing. Throughout, the single-stage denoising diffusion model simplifies the process by generating the synthetic image directly from the input set, including any textual description used for layout accuracy.
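The progressive low-to-high-resolution strategy can be sketched as a stage schedule. The specific resolutions, doubling rule, and step counts below are assumptions for illustration; the application states only that training begins at low resolution and moves gradually to high resolution.

```python
def progressive_schedule(start: int = 256, target: int = 1024,
                         steps_per_stage: int = 1000) -> list:
    """Yield (resolution, training_steps) stages, doubling the image
    resolution each stage until the target is reached.
    Values are illustrative, not from the patent."""
    stages = []
    res = start
    while res < target:
        stages.append((res, steps_per_stage))
        res *= 2
    stages.append((target, steps_per_stage))
    return stages
```

A training loop would consume these stages in order, resuming the model weights from the previous stage so detail is refined on top of already-learned coarse structure.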