Invention Title:

TEXT-DRIVEN DIFFUSION MODEL FOR ENHANCED IMAGE GENERATION

Publication number:

US20260112073

Publication date:

2026-04-23

Section:

Physics

Class:

G06T11/00

Inventors:

Ze Ming Zhao 🇨🇳 Beijing, China

Xiao Tian Xu 🇨🇳 Beijing, China

Xue Yin Zhuang 🇨🇳 Beijing, China

Applicant:

INTERNATIONAL BUSINESS MACHINES CORPORATION 🇺🇸 Armonk, NY, United States

Smart overview of the Invention

A method, computer system, and computer program product are provided for detail-enhanced text-to-image generation. The approach retrieves a text prompt and passes it through a trained region of interest model that identifies and extracts key details. A pre-trained large language model then turns these details into multiple structured text prompts. An interleaved retrospective algorithm arranges the prompts into a retrospective text sequence, which a progressive text-driven diffusion model uses to generate a detailed image.
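The data flow described above can be sketched as a chain of four stages. This is only an illustrative skeleton: the class name, field names, and the toy stand-in functions are all hypothetical and do not come from the publication, and no real model is invoked.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class DetailEnhancedPipeline:
    """Chains the four stages named in the overview (all names are hypothetical)."""

    extract_key_details: Callable[[str], List[str]]                 # region of interest model
    make_structured_prompts: Callable[[List[str]], List[str]]       # large language model
    build_retrospective_sequence: Callable[[List[str]], List[str]]  # interleaved retrospective algorithm
    generate_image: Callable[[List[str]], Any]                      # progressive diffusion model

    def run(self, text_prompt: str) -> Any:
        details = self.extract_key_details(text_prompt)
        prompts = self.make_structured_prompts(details)
        sequence = self.build_retrospective_sequence(prompts)
        return self.generate_image(sequence)


# Toy stand-ins so the sketch runs; none of these is a real model.
def toy_extract(text: str) -> List[str]:
    # Assumption: treat each sentence as one "key detail".
    return [s.strip() for s in text.split(".") if s.strip()]


pipeline = DetailEnhancedPipeline(
    extract_key_details=toy_extract,
    make_structured_prompts=lambda details: [f"detail: {d}" for d in details],
    build_retrospective_sequence=lambda prompts: list(prompts),
    generate_image=lambda sequence: {"conditioned_on": sequence},
)

result = pipeline.run("A red barn. Snow on the roof.")
```

Keeping each stage behind a plain callable means any one of them could be swapped for a trained model without changing the order of operations.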

Technical Background

The invention combines computer vision and natural language processing, which use machine learning to interpret visual and textual inputs, respectively. Computer vision allows systems to recognize objects and people in images, while natural language processing enables systems to understand and communicate in human language. The invention applies these technologies to text-to-image generation, addressing two limitations of existing models: difficulty processing lengthy text inputs and loss of detail accuracy in the generated images.

Enhanced Process

The process has three main steps: extracting key details from the text prompt, generating structured text prompts, and arranging those prompts into a retrospective sequence. The trained region of interest model, which may be based on BERT with a classifier, identifies the crucial portions of the text. The structured text prompts include both detailed and high-level summaries, and both kinds are needed to build a comprehensive retrospective text sequence. This sequence is then processed by a diffusion model trained to handle text of any character length, which addresses a limitation of current models such as DALL-E 2 and DALL-E 3.
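The summary does not spell out how the interleaved retrospective algorithm orders the prompts. As a minimal sketch under one assumed reading, the high-level summary could be repeated between detailed prompts so that later details are always preceded by a reminder of the overall scene. The function name and the `every` parameter are hypothetical.

```python
from typing import List


def interleave_retrospective(summary: str, details: List[str], every: int = 1) -> List[str]:
    """Start with the summary, then re-insert it after every `every` detail prompts.

    Assumed scheme for illustration only; the publication summary does not
    define the actual ordering rule.
    """
    if every < 1:
        raise ValueError("every must be at least 1")
    sequence = [summary]
    for index, detail in enumerate(details, start=1):
        sequence.append(detail)
        # Revisit the summary between groups of details, but not after the last one.
        if index % every == 0 and index < len(details):
            sequence.append(summary)
    return sequence


sequence = interleave_retrospective(
    "a farm in winter",
    ["a red barn", "snow on the roof", "a grey sky"],
)
```

With `every=1` the summary appears before each detail; a larger value revisits it less often, which shortens the sequence at the cost of fewer reminders.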

Advantages

The proposed method improves the accuracy and applicability of text-to-image generation by capturing and preserving key text details. The retrospective text sequence is designed so that information is not lost during the diffusion process. Because the model can process text of any length, it applies to a wider range of scenarios and addresses the loss of detail that occurs with longer text inputs. As a result, the generated images are more representative of the original text prompt.

Implementation Details

The system can be implemented as a method, computer system, or computer program product. It includes components such as a trained region of interest model and a progressive text-driven diffusion model. The invention can be integrated at various technical levels, with program instructions held on storage media such as RAM, ROM, and Flash memory. The order of processing and the operations described can be varied, which allows adaptation to different technological environments and requirements when generating detail-enhanced images.