Invention Title:

TEXT GUIDED IMAGE EDITOR

Publication number:

US20250308115

Publication date:
Section:

Physics

Class:

G06T11/60

Inventor:

Assignee:

Applicant:

Smart overview of the Invention

The patent application describes a method for text-guided image editing, which involves using natural language prompts to make semantic changes to images. This approach leverages diffusion models to modify images based on textual instructions without requiring extensive retraining for each new edit. By utilizing a combination of base and edit prompts, the method allows for intuitive and controlled image manipulation, preserving the identity of the subject while implementing complex edits.

Methodology

The process begins by obtaining a base prompt and an edit prompt, which are converted into respective embeddings. At each time step, the base and edit embeddings are combined according to a mixing weight to determine new edit embeddings. The base embeddings are fed into a diffusion model through a base reverse process that updates a latent representation of the base image. Concurrently, the new edit embeddings are processed through an edit reverse process that updates an edit latent corresponding to the edited image. Cross-attention maps from the base reverse process are reused in the edit reverse process to refine the edits. A sketch of this loop structure appears below.
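
The following minimal sketch illustrates the described loop structure only. The names (encode_text, mix_embeddings, reverse_step), the linear mixing schedule, and the toy denoising update are assumptions introduced for readability, not formulas from the patent; the reuse of cross-attention maps from the base pass is indicated only as a comment.

```python
import numpy as np

def encode_text(prompt, dim=8):
    """Toy stand-in for a text encoder: deterministic pseudo-embedding per prompt."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(dim)

def mix_embeddings(base_emb, edit_emb, t, num_steps):
    """Hypothetical embedding mixer: the weight grows with the time step, so early
    (coarse) steps follow the base prompt and later steps follow the edit prompt.
    The linear schedule is an illustrative assumption."""
    w = t / max(num_steps - 1, 1)
    return (1.0 - w) * base_emb + w * edit_emb

def reverse_step(latent, emb, t):
    """Placeholder denoising step; a real system would call a diffusion U-Net here."""
    return latent - 0.1 * (latent - emb)  # toy update pulled toward the conditioning

base_emb = encode_text("a photo of a person")
edit_emb = encode_text("a photo of a person smiling")

num_steps = 10
base_latent = np.random.default_rng(0).standard_normal(8)
edit_latent = base_latent.copy()  # both reverse processes start from the same noise

for t in range(num_steps):
    new_edit_emb = mix_embeddings(base_emb, edit_emb, t, num_steps)
    base_latent = reverse_step(base_latent, base_emb, t)      # base reverse process
    # In the described method, the edit reverse process would also reuse the
    # cross-attention maps produced by the base pass; omitted in this toy sketch.
    edit_latent = reverse_step(edit_latent, new_edit_emb, t)  # edit reverse process
```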

Diffusion Model Utilization

The method employs stable diffusion models that synthesize images through progressive denoising. These models are conditioned using CLIP embeddings derived from text prompts via cross-attention mechanisms. This setup enables high-quality image generation and semantic editing, such as altering facial expressions while maintaining consistent identity traits. The method circumvents the need for finetuning diffusion models for each specific edit by using an embedding mixer that integrates base and edit prompts.
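
As a rough illustration of how base and edit prompts could be combined at the embedding level, the snippet below obtains CLIP text embeddings with the Hugging Face transformers library and blends them with a single weight. The linear interpolation and the fixed weight of 0.6 are assumptions made for the example; the summary does not specify how the embedding mixer combines the two prompts.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def embed(prompt):
    """Return the per-token CLIP embeddings a diffusion U-Net would cross-attend to."""
    tokens = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(tokens.input_ids).last_hidden_state  # shape (1, 77, 768)

base_emb = embed("a portrait photo of a woman")
edit_emb = embed("a portrait photo of a woman smiling")

weight = 0.6  # illustrative mixing weight; the actual schedule is not given here
mixed_emb = (1.0 - weight) * base_emb + weight * edit_emb
# mixed_emb can be supplied to a diffusion pipeline as the text conditioning
# (e.g. via a prompt-embedding argument) instead of re-encoding a new prompt,
# so no per-edit finetuning of the model is needed.
```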

Technical Details

Cross-attention mechanisms play a crucial role by focusing on relevant parts of sequences during processing. They operate with query, key, and value sequences representing data like words or pixels. The mechanism computes attention weights based on similarity measures between query and key vectors, producing cross-attention maps that guide the editing process. These maps help maintain contextual integrity between base and edited images by leveraging learned parameters during training.
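
The self-contained sketch below shows the standard scaled dot-product form of cross-attention that this description evokes: attention weights are computed from query-key similarity, normalized with a softmax, and the resulting map records how strongly each spatial position attends to each text token. The dimensions and random inputs are placeholders, not values from the patent.

```python
import numpy as np

def cross_attention(query, key, value):
    """Scaled dot-product cross-attention.

    query: (n_q, d)   e.g. image-patch features from the denoising network
    key:   (n_k, d)   e.g. projected text-token embeddings
    value: (n_k, d_v)

    Returns the attended output and the attention map that the described
    method would carry over from the base pass to the edit pass.
    """
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)            # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over the text tokens
    return attn @ value, attn                      # (n_q, d_v), (n_q, n_k)

rng = np.random.default_rng(0)
image_queries = rng.standard_normal((16, 32))      # 16 spatial positions
text_keys = rng.standard_normal((5, 32))           # 5 text tokens
text_values = rng.standard_normal((5, 32))

out, attn_map = cross_attention(image_queries, text_keys, text_values)
print(attn_map.shape)  # (16, 5): one weight per (spatial position, text token) pair
```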

Comparative Methods

The application contrasts its approach with existing methods such as DiffusionCLIP, which requires finetuning a separate model for each edit, a process that is both time-consuming and resource-intensive. It also discusses prior work that uses cross-attention maps to control edits without training an individual model for each change. By processing the base and edit prompts simultaneously and sharing cross-attention maps for the overlapping parts of the text, the proposed method achieves coherent image modifications more efficiently than these alternatives.