US20240394936
2024-11-28
Physics
G06T11/20
A processor-implemented method for image generation uses an artificial neural network (ANN) to create sketches from inputs such as images or text prompts. The ANN interprets the input to determine virtual brush strokes or commands that guide an image drawing application: it generates a list of strokes or commands, which is then executed to render a sketch based on the initial input.
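The patent does not specify the format of the stroke or command list here; the following is a minimal sketch of one plausible representation and of how a drawing application might execute it. The Stroke fields, the normalized coordinate convention, and the render_strokes() routine are illustrative assumptions, not the disclosed format.

```python
# Hypothetical stroke/command list and a simple drawing routine that executes it.
from dataclasses import dataclass
from typing import List, Tuple
from PIL import Image, ImageDraw


@dataclass
class Stroke:
    start: Tuple[float, float]   # (x, y) start point, normalized to [0, 1]
    end: Tuple[float, float]     # (x, y) end point, normalized to [0, 1]
    width: float                 # brush width in pixels
    intensity: int               # grayscale value, 0 (black) to 255 (white)


def render_strokes(strokes: List[Stroke], size: int = 256) -> Image.Image:
    """Execute a list of virtual brush strokes on a blank canvas."""
    canvas = Image.new("L", (size, size), color=255)  # white background
    draw = ImageDraw.Draw(canvas)
    for s in strokes:
        draw.line(
            [(s.start[0] * size, s.start[1] * size),
             (s.end[0] * size, s.end[1] * size)],
            fill=s.intensity,
            width=max(1, int(s.width)),
        )
    return canvas


# Example: two strokes forming a simple corner.
sketch = render_strokes([
    Stroke(start=(0.2, 0.8), end=(0.2, 0.2), width=3, intensity=0),
    Stroke(start=(0.2, 0.2), end=(0.8, 0.2), width=3, intensity=0),
])
sketch.save("sketch.png")
```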
Artificial neural networks, including convolutional neural networks (CNNs), are used in a variety of technologies such as image and speech recognition. They are increasingly applied to complex problems in generative artificial intelligence, for example training large language models (LLMs) to understand both visual and textual data. However, training LLMs to grasp spatio-temporal relationships in visual data remains a challenge.
In the disclosed method, a processor receives an input and processes it with an ANN to determine virtual brush strokes or commands for an image drawing application, generating a list of these strokes or commands to create the desired output image. The approach can also be embodied as an apparatus or as a non-transitory computer-readable medium storing program code that performs these operations.
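As a rough illustration of that flow, the sketch below assumes a simple feed-forward ANN that maps an input embedding to a fixed-length list of stroke parameters, which could then be handed to a drawing routine such as the one above. The network shape, the six-value stroke parameterization (start point, end point, width, intensity), and the name StrokePredictor are assumptions for illustration, not the claimed architecture.

```python
# Hypothetical end-to-end flow: input embedding -> ANN -> list of stroke parameters.
import torch
import torch.nn as nn


class StrokePredictor(nn.Module):
    def __init__(self, input_dim: int = 512, num_strokes: int = 32, params_per_stroke: int = 6):
        super().__init__()
        self.num_strokes = num_strokes
        self.params_per_stroke = params_per_stroke
        self.net = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_strokes * params_per_stroke),
            nn.Sigmoid(),  # keep every stroke parameter in [0, 1]
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, input_dim) embedding of the input image or text prompt
        out = self.net(features)
        return out.view(-1, self.num_strokes, self.params_per_stroke)


# Example: one 512-dim input embedding -> 32 strokes with 6 parameters each.
model = StrokePredictor()
embedding = torch.randn(1, 512)
stroke_list = model(embedding)   # shape: (1, 32, 6)
print(stroke_list.shape)
```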
The innovation addresses challenges in using LLMs for image generation by adapting them into language-vision models. Cross-attention modules are added to measure interactions between image features and LLM hidden states. A visual feedback loop may also be included to monitor the progress of image generation tasks, allowing LLMs to generate images by producing virtual brush strokes in an auto-regressive manner.
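The sketch below illustrates, under assumed dimensions and names, how such a cross-attention module and visual feedback loop might fit together: LLM hidden states attend to image features extracted from the partially drawn canvas, and a stroke head emits one set of stroke parameters per auto-regressive step. VisualCrossAttention, StrokeHead, and encode_canvas() are hypothetical stand-ins, not the patent's implementation.

```python
# Hypothetical cross-attention between LLM hidden states and image features,
# plus an auto-regressive stroke-generation loop with visual feedback.
import torch
import torch.nn as nn


class VisualCrossAttention(nn.Module):
    """Queries come from LLM hidden states; keys/values from image features."""
    def __init__(self, hidden_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, llm_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=llm_hidden, key=image_feats, value=image_feats)
        return self.norm(llm_hidden + attended)  # residual connection


class StrokeHead(nn.Module):
    """Maps an attended hidden state to one stroke's parameters."""
    def __init__(self, hidden_dim: int = 768, params_per_stroke: int = 6):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, params_per_stroke)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.proj(hidden))


def encode_canvas(canvas: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for a visual feedback encoder; a real system would use a CNN/ViT."""
    return canvas.flatten(1).unsqueeze(1).expand(-1, 16, -1)[..., :768]


# Auto-regressive generation with visual feedback (toy dimensions throughout).
cross_attn, head = VisualCrossAttention(), StrokeHead()
llm_hidden = torch.randn(1, 1, 768)     # current LLM hidden state
canvas = torch.zeros(1, 32, 32)         # blank canvas the strokes are drawn onto

strokes = []
for _ in range(4):
    image_feats = encode_canvas(canvas)          # visual feedback on progress so far
    attended = cross_attn(llm_hidden, image_feats)
    strokes.append(head(attended[:, -1]))        # emit the next stroke's parameters
    # (rendering the stroke onto `canvas` and updating llm_hidden would happen here)
print(torch.stack(strokes).shape)                # (4, 1, 6)
```

In a complete system, each emitted stroke would be rendered onto the canvas before the next step, so the feedback features reflect the actual progress of the image generation task.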
The described technology could significantly enhance the capabilities of LLMs in handling multimodal tasks that require detailed visual-textual reasoning. By integrating cross-attention mechanisms and visual feedback loops, the model improves accuracy in generating fine-grained visual outputs, expanding the applicability of LLMs in diverse technological contexts.