Invention Title:

TEACHING LANGUAGE MODELS TO DRAW SKETCHES

Publication number:

US20240394936

Publication date:

Section:

Physics

Class:

G06T11/20

Inventors:

Applicant:

Smart overview of the Invention

A processor-implemented method for image generation uses an artificial neural network (ANN) to create sketches from inputs such as images or text prompts. The ANN interprets the input to determine virtual brush strokes or commands that guide an image drawing application. The method generates an ordered list of these strokes or commands, which the drawing application then executes to render the output sketch.
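As a concrete illustration (not taken from the publication), the Python sketch below shows how a model-generated stroke list might drive a drawing application, with Pillow's ImageDraw standing in for that application. The command schema, field names, and coordinates are assumptions made for this example only.

# Minimal sketch (illustrative only): executing a model-generated stroke
# list with Pillow's ImageDraw standing in for the "image drawing
# application". The command schema (op/points/width) is a hypothetical
# example, not the format defined in the publication.
from PIL import Image, ImageDraw

# A hypothetical list of virtual brush strokes, as a model might emit them.
stroke_list = [
    {"op": "line",  "points": [(40, 200), (128, 60)],  "width": 3},
    {"op": "line",  "points": [(128, 60), (216, 200)], "width": 3},
    {"op": "curve", "points": [(60, 160), (128, 120), (196, 160)], "width": 2},
]

def execute_strokes(strokes, size=(256, 256)):
    """Replay each stroke command onto a blank canvas and return the sketch."""
    canvas = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(canvas)
    for cmd in strokes:
        if cmd["op"] == "line":
            draw.line(cmd["points"], fill="black", width=cmd["width"])
        elif cmd["op"] == "curve":
            # Approximate the curve as a polyline through its control points.
            draw.line(cmd["points"], fill="black", width=cmd["width"])
    return canvas

sketch = execute_strokes(stroke_list)
sketch.save("sketch.png")

Any comparable rasterizer or vector-graphics backend could play the same role; the essential point is that the model emits structured drawing commands rather than raw pixels.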

Technical Background

Artificial neural networks, including convolutional neural networks (CNNs), are utilized in various technologies such as image and speech recognition. These networks are increasingly applied to solve complex problems in generative artificial intelligence, such as training large language models (LLMs) to understand visual and textual data. However, training LLMs to grasp spatio-temporal relationships in visual data remains a challenge.

Summary of the Innovation

In the disclosed method, a processor receives an input and processes it with an ANN to determine virtual brush strokes or commands for an image drawing application. The processor generates a list of these strokes or commands, which are then executed to create the desired output image. The approach can also be embodied as an apparatus or as a non-transitory computer-readable medium containing program code for performing these operations.
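At a high level, the claimed steps could be outlined as in the following Python sketch. The function and method names (generate_sketch, predict_commands, execute) are hypothetical placeholders, not interfaces defined in the publication.

def generate_sketch(user_input, model, drawing_app):
    """Hypothetical end-to-end pipeline: input -> ANN -> commands -> image."""
    # 1. The ANN processes the input (an image or text prompt) and
    #    determines virtual brush strokes / drawing commands.
    commands = model.predict_commands(user_input)   # assumed interface
    # 2. The list of strokes or commands is executed by the image
    #    drawing application to produce the output image.
    return drawing_app.execute(commands)            # assumed interface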

Application of Language-Vision Models

The innovation addresses challenges in using LLMs for image generation by adapting them into language-vision models. Cross-attention modules are added to measure interactions between image features and LLM hidden states. A visual feedback loop may also be included so that the model can monitor the progress of the image generation task, allowing the LLM to produce virtual brush strokes auto-regressively, conditioning each stroke on what has been drawn so far.
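A minimal PyTorch sketch of one such cross-attention block appears below, assuming queries come from the LLM hidden states and keys/values from an image encoder. The class name, dimensions, and residual wiring are illustrative assumptions, not the publication's architecture.

# Minimal PyTorch sketch (assumptions, not the publication's design):
# a cross-attention block that lets LLM hidden states attend to image
# features, one plausible reading of the added cross-attention modules.
import torch
import torch.nn as nn

class VisualCrossAttention(nn.Module):
    def __init__(self, llm_dim=1024, image_dim=768, num_heads=8):
        super().__init__()
        self.img_proj = nn.Linear(image_dim, llm_dim)  # align feature widths
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, hidden_states, image_features):
        # Queries come from the LLM; keys/values come from the image
        # encoder, so the attention weights measure token-to-patch
        # interactions between the two modalities.
        img = self.img_proj(image_features)
        attended, _ = self.attn(hidden_states, img, img)
        # Residual connection keeps the pretrained LLM pathway intact.
        return self.norm(hidden_states + attended)

# Toy shapes: batch of 2, 16 text tokens, 49 image patches.
block = VisualCrossAttention()
h = torch.randn(2, 16, 1024)
v = torch.randn(2, 49, 768)
out = block(h, v)   # -> (2, 16, 1024)

In an auto-regressive loop, the strokes emitted so far could be rendered and re-encoded into image_features at each step, which is one way to realize the visual feedback loop described above.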

Potential Implications

The described technology could significantly enhance the capabilities of LLMs in handling multimodal tasks that require detailed visual-textual reasoning. By integrating cross-attention mechanisms and visual feedback loops, the approach could improve accuracy in generating fine-grained visual outputs, expanding the applicability of LLMs in diverse technological contexts.