Invention Title:

Text-Based Real Image Editing with Diffusion Models

Publication number:

US20240355017

Section:

Physics

Class:

G06T11/60

Smart overview of the Invention

The disclosed technology applies text-based semantic edits to real digital images using diffusion models. The system receives an input image and a target text describing the desired change, optimizes a text embedding to match the input image, and fine-tunes a diffusion model so that the edited image it generates aligns with the target text while preserving the original image's structure and composition.

Challenges and Improvements

Traditional methods for semantic image editing are limited: they are often restricted to specific kinds of edits or require auxiliary inputs such as image masks. The proposed method addresses these issues by supporting a wide range of non-rigid edits on real images while preserving details such as the background and object identity, and it requires no inputs beyond the input image and the target text.

Methodology

The process consists of three main steps, sketched in code below. First, it optimizes a text embedding so that it aligns with the input image. Second, it fine-tunes the diffusion model so that, conditioned on this optimized embedding, it faithfully reconstructs the input image. Finally, it linearly interpolates between the optimized and target text embeddings to generate an edited image that balances fidelity to the input image with the intent of the target text.
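
A minimal sketch of these three steps in PyTorch follows. The denoiser is a toy linear stand-in and the "image" is a plain vector; every name and hyperparameter here (denoiser, reconstruction_loss, e_tgt, e_opt, eta, the learning rates, and the step counts) is an illustrative assumption, not the patent's actual implementation. Initializing the optimized embedding from the target embedding is likewise an assumption of this sketch.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    dim, steps = 64, 200

    # Toy stand-in for a pre-trained text-conditioned diffusion denoiser:
    # maps (noisy image, text embedding) -> predicted noise.
    denoiser = torch.nn.Linear(dim * 2, dim)

    x = torch.randn(dim)      # stand-in for the input image
    e_tgt = torch.randn(dim)  # embedding of the target text (kept fixed)

    def reconstruction_loss(e):
        """Simplified denoising objective on the input image, conditioned on e."""
        noise = torch.randn_like(x)
        pred = denoiser(torch.cat([x + noise, e]))
        return F.mse_loss(pred, noise)

    # Step 1: optimize a text embedding (started from the target embedding)
    # so it reconstructs the input image well; only the embedding is updated.
    e_opt = e_tgt.clone().requires_grad_(True)
    opt = torch.optim.Adam([e_opt], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        reconstruction_loss(e_opt).backward()
        opt.step()

    # Step 2: freeze the optimized embedding and fine-tune the model
    # around it, further improving reconstruction of the input image.
    e_opt = e_opt.detach()
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
    for _ in range(steps):
        opt.zero_grad()
        reconstruction_loss(e_opt).backward()
        opt.step()

    # Step 3: interpolate between the optimized and target embeddings;
    # eta trades fidelity to the input image against edit strength.
    eta = 0.7
    e_edit = eta * e_tgt + (1 - eta) * e_opt

In practice the denoiser would be a large text-conditioned diffusion network and the loss the standard diffusion training objective; only the structure of the three steps is meant to carry over.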

Capabilities and Applications

This method enables sophisticated edits while preserving high-resolution detail. For example, editing an image of a standing dog so that it lies down yields an output that keeps features such as fur color and background consistent with the input. The method also supports gradual edits: linearly interpolating between the optimized and target text embeddings with varying weights produces a smooth progression from the original image to the fully edited one, showcasing strong compositional capabilities.
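
Continuing the toy sketch above, sweeping the hypothetical interpolation weight eta from 0 to 1 yields the sequence of conditioning embeddings behind such a progression:

    # Gradual edits: sweep eta from 0 (reconstruct the input image) to 1
    # (full edit strength); reuses e_opt and e_tgt from the sketch above.
    for eta in (0.0, 0.25, 0.5, 0.75, 1.0):
        e = eta * e_tgt + (1 - eta) * e_opt
        # each intermediate e would condition the fine-tuned model to
        # synthesize one image in the progression of edits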

Technical Implementation

The system uses a pre-trained diffusion model: generation begins from random noise, which is refined iteratively until a photorealistic image is synthesized. Text embeddings derived from the target text guide each refinement step, allowing realistic, high-resolution outputs to be produced from simple text prompts. These embeddings can be computed with large language models or hybrid vision-language models, enhancing the realism of the generated images.
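
To illustrate that iterative refinement, the sketch below runs a deterministic DDIM-style sampling loop with the toy denoiser and edit embedding from the earlier sketch; the step count and noise schedule are assumptions made for this example, not values from the disclosure.

    T = 50
    abar = torch.linspace(0.9999, 0.02, T)  # toy cumulative-alpha schedule
    x_t = torch.randn(dim)                  # start from pure random noise
    for t in reversed(range(T)):
        # predict the noise component, conditioned on the edit embedding
        eps = denoiser(torch.cat([x_t, e_edit])).detach()
        # estimate the clean image implied by the current prediction
        x0_hat = (x_t - (1 - abar[t]).sqrt() * eps) / abar[t].sqrt()
        # step to the next, less noisy timestep (deterministic update)
        if t == 0:
            x_t = x0_hat
        else:
            x_t = abar[t - 1].sqrt() * x0_hat + (1 - abar[t - 1]).sqrt() * eps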