US20240386623
2024-11-21
Physics
G06T11/00
The patent describes a method for image generation using diffusion models, with a focus on controllable generation. It introduces a system with a fixed diffusion model and a trainable diffusion model. The fixed model is pretrained on a large dataset and remains unchanged during further training. The trainable model is designed to control the image generation process by modifying the fixed model's internal representations. This approach allows task-specific adjustments guided by visual conditions and task instructions.
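A minimal sketch of the fixed/trainable pairing described above, assuming a PyTorch-style setup. The class name PretrainedDenoiser and its architecture are illustrative placeholders, not details taken from the patent.

```python
import copy
import torch
import torch.nn as nn

class PretrainedDenoiser(nn.Module):
    """Hypothetical stand-in for the pretrained diffusion denoiser."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, 4, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

fixed_model = PretrainedDenoiser()
fixed_model.requires_grad_(False)             # pretrained weights stay frozen
fixed_model.eval()

trainable_model = copy.deepcopy(fixed_model)  # starts as a copy of the fixed model
trainable_model.requires_grad_(True)          # only this copy receives gradient updates
```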
This invention relates to machine learning systems used in generative tasks such as image generation. It particularly focuses on enhancing controllable image generation through denoising diffusion models (DDMs). These models traditionally generate images based on conditioning inputs such as text prompts or input images like sketches. The novel approach aims to improve performance across various tasks without training a separate model for each task, which would be resource-intensive.
The proposed system involves a dual-model setup in which a trainable DDM modulates a fixed DDM to handle multiple tasks. The trainable model starts as a copy of the fixed model but remains adjustable. It uses convolutional layers to convert visual conditions and task instructions into feature maps, which then modulate the fixed model's internal representations during image generation. This enables the system to perform defined tasks efficiently and to adapt to new, unseen tasks by leveraging existing knowledge.
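One way such a conditioning pathway could look is sketched below: convolutional layers map a visual condition (e.g. a sketch image) and a task-instruction embedding to feature maps that are added to the fixed model's intermediate activations. Module names, dimensions, and the zero-initialized output layer are assumptions for illustration, not claims about the patent's exact architecture.

```python
import torch
import torch.nn as nn

class ConditionAdapter(nn.Module):
    """Illustrative adapter turning a visual condition and instruction into feature maps."""
    def __init__(self, cond_channels: int = 3, instr_dim: int = 512, feat_channels: int = 64):
        super().__init__()
        # Convolutional stack over the visual condition.
        self.visual = nn.Sequential(
            nn.Conv2d(cond_channels, feat_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
        )
        # Projects the task-instruction embedding to a per-channel scale.
        self.instruction = nn.Linear(instr_dim, feat_channels)
        # Zero-initialized output conv so the adapter initially leaves the
        # fixed model's behavior unchanged (an assumed design choice).
        self.out = nn.Conv2d(feat_channels, feat_channels, 1)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, visual_condition: torch.Tensor, instruction_emb: torch.Tensor) -> torch.Tensor:
        feats = self.visual(visual_condition)
        scale = self.instruction(instruction_emb)[:, :, None, None]
        return self.out(feats * (1 + scale))

# The resulting feature maps would be added to intermediate activations of the
# fixed denoiser, steering generation without updating its frozen weights:
#   modulated_hidden = fixed_hidden + adapter(visual_condition, instruction_emb)
```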
The described method offers significant advantages in terms of computational efficiency and resource usage. By using a unified model structure, it reduces the need for multiple models, saving memory and computation power while maintaining or improving performance across diverse image-related tasks. Moreover, it enhances the ability to perform unseen tasks by combining knowledge from existing tasks, thus streamlining the training and execution processes in neural network-based image processing.
At generation time, the framework progressively denoises a random noise vector, conditioned on user inputs such as text prompts. During training, noise is incrementally added to latent image representations to produce the data used to train the denoising model. This iterative process allows the model to learn to reverse the noise addition, yielding clear images aligned with the user inputs. The framework supports efficient training of diffusion models for generating or editing images based on conditioning inputs.
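A minimal sketch of one training iteration under these assumptions: a standard DDPM-style linear noise schedule and a denoiser that accepts noisy latents, a timestep, and conditioning. The schedule values and the denoiser signature are illustrative, not specified by the patent.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_t)

def training_step(denoiser, latents, cond):
    """One iteration: corrupt the latents, then learn to predict the added noise."""
    b = latents.shape[0]
    t = torch.randint(0, T, (b,))                    # random timestep per sample
    noise = torch.randn_like(latents)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noisy = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise  # forward noising
    pred = denoiser(noisy, t, cond)                  # assumed signature: (latents, timestep, conditioning)
    return F.mse_loss(pred, noise)                   # learn to reverse the noise addition
```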