Invention Title:

TEXT-BASED IMAGE GENERATION USING AN IMAGE-TRAINED TEXT

Publication number:

US20240320873

Publication date:

Section:

Physics

Class:

G06T11/00

Inventors:

Applicant:

Smart overview of the Invention

A method for generating images from text prompts obtains a text prompt and encodes it with a text encoder that was trained jointly with an image generation model. The encoder outputs a text embedding, which the image generation model takes as input to produce a synthetic image. Because the encoder was trained together with the model, the generated image aligns more closely with the text description.
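The data flow described above (prompt, then text embedding, then synthetic image) can be sketched in a few lines of Python. This is only an illustrative toy, not the patented implementation: the class names `ToyTextEncoder` and `ToyImageGenerator`, the constants `EMBED_DIM` and `IMAGE_SIDE`, and the use of seeded random weights in place of learned parameters are all assumptions made for the example.

```python
import math
import random
from dataclasses import dataclass, field

EMBED_DIM = 8    # assumed toy embedding width
IMAGE_SIDE = 4   # assumed toy image size (4x4 grayscale)


@dataclass
class ToyTextEncoder:
    """Hypothetical stand-in for the jointly trained text encoder."""
    seed: int = 0
    _table: dict = field(default_factory=dict)

    def _token_vector(self, token):
        # Seeded random vectors stand in for learned token embeddings.
        if token not in self._table:
            rng = random.Random(f"{self.seed}:{token}")
            self._table[token] = [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
        return self._table[token]

    def encode(self, prompt):
        tokens = prompt.lower().split()
        if not tokens:
            raise ValueError("prompt must contain at least one token")
        vectors = [self._token_vector(t) for t in tokens]
        # Average the token vectors into a single text embedding.
        return [sum(column) / len(vectors) for column in zip(*vectors)]


@dataclass
class ToyImageGenerator:
    """Hypothetical stand-in for the image generation model."""
    seed: int = 1

    def __post_init__(self):
        rng = random.Random(self.seed)
        # One weight row per pixel; random values stand in for learned weights.
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
                        for _ in range(IMAGE_SIDE * IMAGE_SIDE)]

    def generate(self, embedding):
        # Each pixel is a sigmoid of a weighted sum of the embedding.
        pixels = [1.0 / (1.0 + math.exp(-sum(w * e for w, e in zip(row, embedding))))
                  for row in self.weights]
        return [pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE] for r in range(IMAGE_SIDE)]


def generate_image(prompt, encoder, generator):
    """Obtain a prompt, encode it, and generate an image from the embedding."""
    embedding = encoder.encode(prompt)
    return generator.generate(embedding)
```

Calling `generate_image("a red bicycle", ToyTextEncoder(), ToyImageGenerator())` returns a 4x4 grid of values between 0 and 1. The point of the sketch is the ordering of the steps, not the quality of the output.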

Joint Training for Improved Alignment

The text encoder and the image generation model are trained together rather than separately. Joint training improves text-image alignment: the generated images reflect the input prompts more accurately. Conventional systems, by contrast, use a fixed text encoder, which often yields weaker alignment.
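A small numerical toy can illustrate why updating the encoder together with the generator may help. The sketch below uses plain gradient descent on a linear model with a scalar embedding. It is not the training procedure from the publication, and the function name `train`, the feature vectors, the targets, and the hyperparameters are all assumptions. The fixed encoder maps both prompts to the same embedding, so the generator cannot tell them apart. When the encoder is also updated, the two prompts can be separated.

```python
def train(update_encoder, steps=2000, lr=0.05):
    """Gradient descent on a toy encoder + generator; returns the final loss.

    Assumed setup: embedding = w . x (encoder), image = g * embedding (generator).
    """
    # Two toy prompts as 2-D feature vectors, each paired with a scalar "image".
    data = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.2)]
    w = [0.5, 0.5]  # encoder weights: both prompts start with the same embedding
    g = 1.0         # generator weight

    for _ in range(steps):
        grad_g, grad_w = 0.0, [0.0, 0.0]
        for x, target in data:
            emb = w[0] * x[0] + w[1] * x[1]
            err = g * emb - target
            grad_g += 2.0 * err * emb            # d(loss)/d(g)
            grad_w[0] += 2.0 * err * g * x[0]    # d(loss)/d(w0), via the chain rule
            grad_w[1] += 2.0 * err * g * x[1]    # d(loss)/d(w1), via the chain rule
        g -= lr * grad_g
        if update_encoder:
            # Joint training: the image loss also updates the encoder.
            w[0] -= lr * grad_w[0]
            w[1] -= lr * grad_w[1]

    return sum((g * (w[0] * x[0] + w[1] * x[1]) - t) ** 2 for x, t in data)


frozen_loss = train(update_encoder=False)  # fixed encoder, generator only
joint_loss = train(update_encoder=True)    # encoder and generator together
print(f"fixed encoder loss: {frozen_loss:.4f}, joint loss: {joint_loss:.6f}")
```

In this toy the fixed-encoder run stalls at a nonzero loss because both prompts share one embedding, while the joint run drives the loss toward zero. Real systems are far larger and nonlinear, so this shows the mechanism only.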

Training Data Utilization

The training data consists of ground-truth images paired with corresponding text prompts. During training, the text encoder produces a provisional text embedding from each prompt, and the image generation model produces a provisional image from that embedding. Training on these pairs teaches the text encoder to produce text embeddings that work well as inputs to the image generation model.
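One way to picture a single pass over a training pair is sketched below. The overview does not say how the provisional image is scored, so the mean squared error against the ground-truth image is an assumption made for illustration. The names `TrainingPair`, `bag_of_words`, and `provisional_pass`, the vocabulary, and the weight matrices are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List

VOCAB = ["red", "blue", "circle", "square"]  # assumed toy vocabulary


@dataclass(frozen=True)
class TrainingPair:
    prompt: str
    ground_truth: List[float]  # flattened pixel values of the ground-truth image


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def bag_of_words(prompt):
    """Count how often each vocabulary word appears in the prompt."""
    tokens = prompt.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def provisional_pass(pair, encoder_w, generator_w):
    """Return (provisional embedding, provisional image, loss) for one pair."""
    embedding = matvec(encoder_w, bag_of_words(pair.prompt))
    image = matvec(generator_w, embedding)
    # Assumed loss: mean squared error against the ground-truth image.
    loss = sum((p - t) ** 2 for p, t in zip(image, pair.ground_truth)) / len(image)
    return embedding, image, loss


encoder_w = [[1.0, 0.0, 0.5, 0.0],
             [0.0, 1.0, 0.0, 0.5]]   # 2-D embedding from 4 vocabulary counts
generator_w = [[1.0, 0.0],
               [0.0, 1.0],
               [0.5, 0.5]]           # 3 pixels from the 2-D embedding

pair = TrainingPair("red circle", [1.0, 0.0, 1.0])
embedding, image, loss = provisional_pass(pair, encoder_w, generator_w)
```

A training loop would use a loss like this to adjust both weight matrices. Here the pass only shows which quantities are provisional and where the ground-truth image enters.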

Architecture of the Image Generation System

The system comprises one or more processors and memory components that store the text encoder and the image generation model. The text encoder converts text prompts into embeddings, and the image generation model generates synthetic images from those embeddings. Because the two components are trained to work together, the generated images depict the prompt more accurately.
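The two-component structure can be expressed as a pair of interfaces held by one system object. The sketch below is a hypothetical arrangement: `TextEncoder`, `ImageGenerationModel`, `ImageGenerationSystem`, and the trivial stand-ins `LengthEncoder` and `RowPainter` are names invented for this example and do not come from the publication.

```python
from dataclasses import dataclass
from typing import List, Protocol

Embedding = List[float]
Image = List[List[float]]


class TextEncoder(Protocol):
    def encode(self, prompt: str) -> Embedding: ...


class ImageGenerationModel(Protocol):
    def generate(self, embedding: Embedding) -> Image: ...


@dataclass
class ImageGenerationSystem:
    """Holds both components, as the memory would in the described system."""
    text_encoder: TextEncoder
    image_model: ImageGenerationModel

    def run(self, prompt: str) -> Image:
        # Text prompt -> embedding -> synthetic image.
        return self.image_model.generate(self.text_encoder.encode(prompt))


class LengthEncoder:
    """Trivial stand-in: one number per word (its length)."""
    def encode(self, prompt: str) -> Embedding:
        return [float(len(word)) for word in prompt.split()]


class RowPainter:
    """Trivial stand-in: paints the embedding as a single row of pixels."""
    def generate(self, embedding: Embedding) -> Image:
        return [[value / 10.0 for value in embedding]]


system = ImageGenerationSystem(LengthEncoder(), RowPainter())
image = system.run("a red bicycle")
```

Keeping the components behind narrow interfaces means either one can be swapped for a stand-in during testing, while the system object fixes the order in which they are called.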

Advantages Over Conventional Systems

Compared with conventional image generation systems, this approach depicts the content of a text prompt more accurately. The jointly trained text encoder produces embeddings better suited to the image generation model than a conventional encoder does, so generated images align more closely with their descriptions. For users, this means images that match their prompts more reliably.