Invention Title:

LOCAL IMAGE AND SCENE EDITING BY TEXT INSTRUCTIONS

Publication number:

US20250078362

Publication date:
Section:

Physics

Class:

G06T11/60

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The patent application describes a method for editing images and 3D scenes using text instructions, leveraging a diffusion model. The method receives an input image or scene together with a text instruction from the user, then generates a relevance map from these inputs. The relevance map guides the editing process so that changes are confined to the areas the instruction refers to, while the rest of the original image or scene is preserved.

Background

Image editing plays a significant role in various fields such as social media and marketing, prompting the development of automated generative approaches. Diffusion models have gained popularity for their ability to generate high-quality images from textual descriptions. However, existing models like InstructPix2Pix (IP2P) and InstructNeRF2NeRF (IN2N) often fail to confine modifications within relevant boundaries, leading to unnecessary alterations. This patent addresses these challenges by introducing a method that localizes edits using relevance maps, ensuring fidelity to the original input.

Methodology

The proposed method uses a denoising diffusion model to edit local areas of an image or scene. It begins by receiving an input image and a text instruction, then generates a relevance map that captures where the model's predictions diverge when conditioned on the text instruction versus when the instruction is omitted. The relevance map guides the editing process, producing a rendered image that reflects the requested changes while preserving the rest of the original input.
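The discrepancy-based relevance map described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the mean-absolute-difference measure, and the fixed threshold are all assumptions; the two inputs stand in for the diffusion model's noise predictions with and without the text instruction.

```python
import numpy as np

def relevance_map(eps_cond: np.ndarray, eps_uncond: np.ndarray,
                  threshold: float = 0.5) -> np.ndarray:
    """Derive a per-pixel relevance map from two denoiser predictions.

    eps_cond   -- noise predicted WITH the text instruction (H, W, C)
    eps_uncond -- noise predicted WITHOUT the text instruction (H, W, C)
    Pixels where the two predictions disagree most are taken to be the
    regions the instruction refers to.
    """
    # Mean absolute discrepancy across channels.
    diff = np.abs(eps_cond - eps_uncond).mean(axis=-1)
    # Normalize to [0, 1] so the threshold is scale-independent.
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    # Binarize: 1 marks pixels the edit may change, 0 pixels to preserve.
    return (diff > threshold).astype(np.float32)
```

In practice the two predictions would come from the same denoising network evaluated twice per step; a soft (non-thresholded) map could equally be used for smoother blending.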

Applications in 3D Scenes

For 3D scenes, the method involves fitting a Neural Radiance Field (NeRF) to multiple images of an input scene. Text instructions are used to generate relevance maps for these images, which are then combined into a relevance field. This field guides the scene editing process, ensuring that changes are localized and relevant. The NeRF is updated iteratively with edited images and updated relevance maps to achieve the desired scene modifications.
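The fusion of per-view relevance maps into a relevance field might look like the sketch below. This is an assumption-laden simplification: the function name is hypothetical, the camera projection is abstracted into a caller-supplied `project` callable, and simple averaging stands in for whatever aggregation the patent actually specifies.

```python
import numpy as np

def fuse_relevance_field(points: np.ndarray, relevance_maps: list,
                         project) -> np.ndarray:
    """Fuse per-view 2D relevance maps into a per-point 3D relevance field.

    points         -- (N, 3) sample locations in the scene
    relevance_maps -- one (H, W) relevance map per training view
    project        -- callable mapping (points, view_index) -> (N, 2)
                      integer pixel coordinates (camera model is assumed)
    """
    H, W = relevance_maps[0].shape
    acc = np.zeros(len(points))
    for i, rmap in enumerate(relevance_maps):
        px = project(points, i)
        # Clamp to image bounds, then look up each point's 2D relevance.
        u = np.clip(px[:, 0], 0, W - 1)
        v = np.clip(px[:, 1], 0, H - 1)
        acc += rmap[v, u]
    # Average the votes from all views into a single per-point value.
    return acc / len(relevance_maps)
```

A real system would instead bake this field into the NeRF's volumetric representation and refresh it as the relevance maps are iteratively updated.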

Technical Implementation

The technical implementation includes generating noisy versions of input images and performing denoising steps guided by relevance maps. The diffusion model's output is used to create rendered images that align with the user's text instructions. By maintaining control over the level of noise introduced and focusing on specific regions, this approach reduces over-editing and preserves essential features of the original content.
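The two operations described here, injecting a controlled amount of noise and blending denoiser outputs under the relevance map, can be illustrated with the sketch below. The function names, the variance-preserving noise schedule, and the soft-mask blend are all assumptions chosen for clarity, not the patent's exact formulation.

```python
import numpy as np

def add_noise(x0: np.ndarray, t: float, rng) -> np.ndarray:
    """Create a noisy version of the input at noise level t in (0, 1].

    A larger t destroys more of the input, giving the edit more freedom;
    keeping t moderate preserves the original structure.
    """
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(1.0 - t) * x0 + np.sqrt(t) * noise

def guided_blend(pred_edited: np.ndarray, pred_original: np.ndarray,
                 relevance: np.ndarray) -> np.ndarray:
    """Blend two denoiser outputs with the relevance map as a soft mask.

    Inside relevant regions the edited prediction is used; elsewhere the
    prediction that reconstructs the original input is kept, which is
    what confines the edit and prevents over-editing.
    """
    m = relevance[..., None]  # broadcast the (H, W) mask over channels
    return m * pred_edited + (1.0 - m) * pred_original
```

Applying `guided_blend` at every denoising step, rather than once at the end, is what lets the edited and preserved regions stay mutually consistent in the final rendered image.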