Invention Title:

TEXT-TO-IMAGE SYSTEM AND METHOD

Publication number:

US20240386621

Publication date:

2024-11-21

Section:

Physics

Class:

G06T11/00

Inventors:

Tong Sun San Ramon, CA, United States

Jiuxiang Gu Baltimore, MD, United States

Ruiyi Zhang San Jose, CA, United States

Yufan Zhou Buffalo, NY, United States

Christopher Alan Tensmeyer Fulton, MD, United States

Tong Yu Fremont, CA, United States

Rajiv Jain Falls Church, VA, United States

Assignee:

Adobe Inc. San Jose, CA, United States

Applicant:

Adobe Inc. San Jose, CA, United States

Smart overview of the Invention

The text-to-image generation system uses a pre-trained multimodal model to create text-image pairs from bare images, that is, images without accompanying descriptive text. This approach addresses the difficulty of assembling the large datasets of high-quality text-image pairs that training such models typically requires. By generating text-image pairs automatically, the system reduces the need for labor-intensive manual captioning and filtering, saving time and resources.

Training Methodology

The process begins by inputting a collection of bare images into a pre-trained multimodal model, which generates a corresponding text description for each image. The resulting text-image pairs are then used to train a text-to-image generation model. This model includes a generator and a discriminator: the generator creates images based on the generated text-image pairs, while the discriminator evaluates these images against the original set of images, and its feedback is used to improve the realism and accuracy of the generator's output.
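The first step of this process, turning bare images into text-image pairs, can be sketched as follows. This is a minimal illustration of the data flow only, not the patented implementation: the names TextImagePair, build_pairs and toy_describe are hypothetical, and toy_describe is a stand-in for the pre-trained multimodal model, which in a real system would operate on the image content itself.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TextImagePair:
    """One generated training example (hypothetical structure)."""
    image_id: str  # identifier of the bare image
    text: str      # description produced for that image


def build_pairs(
    image_ids: Iterable[str],
    describe: Callable[[str], str],
) -> List[TextImagePair]:
    """Pair every bare image with a generated description.

    `describe` plays the role of the pre-trained multimodal model.
    """
    return [TextImagePair(image_id=i, text=describe(i)) for i in image_ids]


def toy_describe(image_id: str) -> str:
    # Assumption: placeholder for the multimodal model's output.
    return f"a generated description of {image_id}"


pairs = build_pairs(["img_001", "img_002"], toy_describe)
```

The resulting list of pairs is what would be handed to the generator and discriminator during training; no manually written caption is involved at any point.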

Model Components

The pre-trained multimodal model comprises an image encoder and a text encoder, capable of processing large datasets with millions of text-image pairs. The generator in the text-to-image model produces images influenced by both the generated text and the original images. The discriminator's role is to compare these generated images against real ones, providing feedback that refines the generator's output.
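The feedback loop between generator and discriminator can be illustrated with the standard adversarial objective. The source does not state which loss the system uses, so the binary cross-entropy form below is an assumption chosen for illustration, and the function names are hypothetical. Scores are taken to be the discriminator's estimated probability that an image is real.

```python
import math
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def discriminator_loss(
    real_scores: Sequence[float],
    fake_scores: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Low when real images score near 1 and generated images near 0."""
    real_term = _mean([-math.log(s + eps) for s in real_scores])
    fake_term = _mean([-math.log(1.0 - s + eps) for s in fake_scores])
    return real_term + fake_term


def generator_loss(fake_scores: Sequence[float], eps: float = 1e-12) -> float:
    """Low when the discriminator scores generated images near 1."""
    return _mean([-math.log(s + eps) for s in fake_scores])
```

Under this assumed objective the two losses pull in opposite directions on the generated images: the discriminator is rewarded for scoring them low, the generator for getting them scored high, which is the sense in which the discriminator's comparison refines the generator's output.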

Efficiency and Scalability

The system is designed to minimize or eliminate reliance on manually created text-image pairs by using generated pairs instead. The generation and training steps can be repeated multiple times to further train the model. During training, the system also checks the cosine similarity between each paired text and image as a quality control, requiring a value of at least 0.27.
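The similarity check can be sketched as a simple filter. The 0.27 threshold comes from the text above; everything else is an assumption made for illustration: the function names are hypothetical, and each text and image is assumed to be represented by an embedding vector of equal length, such as the outputs of the text encoder and image encoder.

```python
import math
from typing import List, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.27  # value quoted in the text above

Embedding = Sequence[float]


def cosine_similarity(a: Embedding, b: Embedding) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def filter_pairs(
    pairs: Sequence[Tuple[Embedding, Embedding]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> List[Tuple[Embedding, Embedding]]:
    """Keep only (text, image) embedding pairs that meet the threshold."""
    return [
        (text, image)
        for text, image in pairs
        if cosine_similarity(text, image) >= threshold
    ]


candidates = [
    ([1.0, 0.0], [0.5, 1.0]),  # similarity about 0.45, kept
    ([1.0, 0.0], [0.2, 1.0]),  # similarity about 0.20, dropped
]
kept = filter_pairs(candidates)
```

Whether the system discards pairs below the threshold or handles them some other way is not stated in the source; discarding is the simplest reading and is what this sketch does.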

Applications

The described system allows custom text-to-image models to be created without extensive manual dataset preparation. It supports a range of devices, from powerful personal computers to resource-limited mobile devices, making it suitable for different computing environments. The result is an efficient way to develop tailored models with minimal manual intervention.