Invention Title:

Text-Driven Image Editing via Image-Specific Finetuning of Diffusion Models

Publication number:

US20260148448

Publication date:

2026-05-28

Section:

Physics

Class:

G06T11/60

Inventors:

Yossi Matias 🇮🇱 Tel Aviv, Israel

Yaniv Leviathan 🇺🇸 Sunnyvale, CA, United States

Daniel Walevski 🇮🇱 Tel Aviv, Israel

Matan Kalman 🇺🇸 Sunnyvale, CA, United States

Applicant:

Google LLC 🇺🇸 Mountain View, CA, United States

Smart overview of the Invention

The disclosure presents systems and methods for text-driven image editing, referred to as "UniTune". UniTune accepts an arbitrary image and a textual description of the desired edit, and carries out that edit while preserving high semantic and visual fidelity to the input image. It requires no additional inputs such as masks or sketches. With an appropriate choice of parameters, the system fine-tunes a large diffusion model, such as Imagen, on a single image, which keeps the output faithful to the input while still allowing expressive manipulations.

Text-Driven Image Editing

A core aspect is a computer-implemented method that uses a machine-learned diffusion model for text-driven image editing. The method takes a base image and an edit prompt, which is a natural language description of the desired edit. The model processes these inputs to generate an output image that incorporates the edit while remaining faithful to the base image overall. For example, the model can change an object's color, add elements, or alter lighting based on the edit prompt.

Finetuning with Tuples

The diffusion model is fine-tuned using finetuning tuples, each comprising the base image and a finetuning prompt. The finetuning prompt may include a rare token, which the model learns to associate with the specific features of the base image. This finetuning improves the model's ability to preserve the essential characteristics of the original image during editing.
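The pairing of a base image with a finetuning prompt can be sketched as a small data structure. This is only an illustration under stated assumptions: the names FinetuningTuple and make_finetuning_tuples, the placeholder rare token "zqxv", and the use of a nested list as a stand-in for pixel data are all hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class FinetuningTuple:
    """One finetuning example: the base image paired with a finetuning prompt."""
    base_image: Any          # stand-in for real pixel data (hypothetical)
    finetuning_prompt: str   # may consist of, or include, a rare token


def make_finetuning_tuples(base_image: Any,
                           rare_token: str = "zqxv",
                           num_tuples: int = 1) -> List[FinetuningTuple]:
    """Build tuples that all pair the same base image with a rare-token prompt.

    The rare token "zqxv" is an arbitrary placeholder chosen for this sketch.
    """
    if not rare_token.strip():
        raise ValueError("rare_token must be a non-empty string")
    if num_tuples < 1:
        raise ValueError("num_tuples must be at least 1")
    return [FinetuningTuple(base_image, rare_token) for _ in range(num_tuples)]


# Example: three identical tuples for a toy 1x2 "image".
tuples = make_finetuning_tuples([[0.0, 1.0]], rare_token="zqxv", num_tuples=3)
```

In a real training loop each tuple would presumably be fed to the diffusion model's denoising objective; that step is omitted here because the summary does not describe it.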

Combined Prompt Approach

A significant feature is the use of a combined prompt, formed by concatenating the finetuning prompt with the edit prompt. Conditioning on the combined prompt lets the model take into account both the features of the original image and the desired edit. For instance, when a sunset is requested for a beach scene, the combined prompt helps the model retain the beach elements while adding the sunset.
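The concatenation itself is simple string handling, sketched below. The function name build_combined_prompt, the single-space separator, and the rare token "zqxv" are assumptions made for illustration; the disclosure as summarized here states only that the two prompts are concatenated.

```python
def build_combined_prompt(finetuning_prompt: str,
                          edit_prompt: str,
                          separator: str = " ") -> str:
    """Concatenate the finetuning prompt and the edit prompt, in that order.

    Empty or whitespace-only parts are dropped so the result has no stray
    separators. The separator is an assumption of this sketch.
    """
    parts = [p.strip() for p in (finetuning_prompt, edit_prompt) if p.strip()]
    return separator.join(parts)


# Example: a rare-token finetuning prompt followed by the requested edit.
combined = build_combined_prompt("zqxv", "a beach at sunset")
print(combined)  # prints "zqxv a beach at sunset"
```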

Multi-Stage Diffusion Model

The machine-learned diffusion model may include a text-to-image diffusion model and one or more super-resolution diffusion models. The text-to-image model generates a low-resolution version of the output, which the super-resolution models then refine into a high-resolution image. Both the text-to-image model and the super-resolution models are fine-tuned using the finetuning tuples, enabling high-quality edited images even when the edit prompt specifies complex or detailed changes.
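The data flow through such a cascade can be sketched with stand-in callables. Nothing below is a diffusion model: toy_base_model returns a fixed 2x2 grid and toy_upsample_2x is plain nearest-neighbor upsampling, both hypothetical placeholders used only to show how a low-resolution output is passed through successive super-resolution stages conditioned on the same prompt.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # toy image: rows of pixel intensities


def run_cascade(prompt: str,
                base_model: Callable[[str], Image],
                sr_models: Sequence[Callable[[Image, str], Image]]) -> Image:
    """Generate a low-resolution image, then refine it stage by stage."""
    image = base_model(prompt)
    for sr_model in sr_models:
        # Each stage sees the previous output and the same text prompt.
        image = sr_model(image, prompt)
    return image


def toy_base_model(prompt: str) -> Image:
    """Placeholder for the text-to-image stage: returns a fixed 2x2 grid."""
    return [[0.0, 1.0], [1.0, 0.0]]


def toy_upsample_2x(image: Image, prompt: str) -> Image:
    """Placeholder for a super-resolution stage: nearest-neighbor 2x upsampling."""
    out: Image = []
    for row in image:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


# Example: one base stage followed by two 2x stages turns 2x2 into 8x8.
result = run_cascade("zqxv a beach at sunset",
                     toy_base_model,
                     [toy_upsample_2x, toy_upsample_2x])
```

Passing the stages as a sequence keeps the sketch agnostic about how many super-resolution models are used, which matches the "one or more" wording above.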