US20240303873
2024-09-12
Physics
G06T11/00
A method for generating images uses a natural language prompt to guide the process. Initially, a noised image vector is received and passed through a series of iterations of a denoising diffusion model (DDM). In each iteration, a forward pass of the DDM produces a denoised image vector conditioned on the provided prompt. This iterative process continues until the image is sufficiently refined, and a final image is generated from the optimized noise vector.
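A minimal sketch of this denoising loop is shown below, assuming a toy PyTorch setup; TinyDenoiser, the linear update rule, and the random prompt embedding are illustrative stand-ins rather than the claimed DDM or its scheduler.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Toy stand-in for the denoising diffusion model (DDM)."""
    def __init__(self, dim=64, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim, 128), nn.ReLU(), nn.Linear(128, dim)
        )

    def forward(self, latent, prompt_emb):
        # Predict the noise to subtract, conditioned on the prompt embedding.
        return self.net(torch.cat([latent, prompt_emb], dim=-1))

def generate(ddm, noised_vector, prompt_emb, num_steps=50, step_size=0.1):
    """Iteratively refine a noised image vector with prompt-conditioned forward passes."""
    latent = noised_vector
    for _ in range(num_steps):
        predicted_noise = ddm(latent, prompt_emb)       # forward pass of the DDM
        latent = latent - step_size * predicted_noise   # illustrative update rule
    return latent                                       # decoded to an image downstream

ddm = TinyDenoiser()
x_T = torch.randn(1, 64)          # received noised image vector
prompt_emb = torch.randn(1, 16)   # stand-in for an encoded text prompt
image_vector = generate(ddm, x_T, prompt_emb)
```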
Latent optimization enhances the image generation process by refining the initial random noise vector without altering the model parameters. This approach employs backpropagation-like techniques to optimize the latent vector based on gradients derived from generated images. A differentiable loss function evaluates the quality of generated images, allowing for adjustments to the input vector to improve outcomes over multiple iterations.
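One way such an optimization loop could look is sketched below, reusing the toy generate and ddm from the previous example; the Adam optimizer, the step counts, and the mean-squared loss are assumptions, not the claimed loss function.

```python
def optimize_latent(ddm, prompt_emb, loss_fn, steps=20, lr=0.05, dim=64):
    """Refine the initial noise vector by gradient descent; model weights stay frozen."""
    noise = torch.randn(1, dim, requires_grad=True)      # the only trainable tensor
    optimizer = torch.optim.Adam([noise], lr=lr)
    for p in ddm.parameters():
        p.requires_grad_(False)                          # model parameters unchanged
    for _ in range(steps):
        image_vector = generate(ddm, noise, prompt_emb)  # differentiable generation
        loss = loss_fn(image_vector)                     # differentiable quality score
        optimizer.zero_grad()
        loss.backward()                                  # gradients flow back to the noise
        optimizer.step()
    return noise.detach()

# Illustrative loss: pull the generated vector toward an arbitrary target.
target = torch.randn(1, 64)
quality_loss = lambda vec: ((vec - target) ** 2).mean()
optimized_noise = optimize_latent(ddm, prompt_emb, quality_loss)
```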
Existing methods for computing gradients in latent optimization often face significant limitations. Storing all intermediate latents during the forward pass can lead to excessive memory consumption, while recomputing these latents at each backward step results in high computational demands. These inefficiencies necessitate a more effective framework for image generation that minimizes resource usage.
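To make the trade-off concrete, the sketch below contrasts the two conventional options using the toy components defined earlier; neither corresponds to the proposed method, and the checkpointing variant relies on standard PyTorch gradient checkpointing.

```python
from torch.utils.checkpoint import checkpoint

def generate_store_all(ddm, latent, prompt_emb, num_steps=50, step_size=0.1):
    # Option 1: keep the full autograd graph. Every intermediate latent is
    # retained for the backward pass, so memory grows linearly with num_steps.
    for _ in range(num_steps):
        latent = latent - step_size * ddm(latent, prompt_emb)
    return latent

def generate_checkpointed(ddm, latent, prompt_emb, num_steps=50, step_size=0.1):
    # Option 2: gradient checkpointing. Intermediates are discarded and then
    # recomputed during the backward pass, trading memory for extra compute.
    def step(x):
        return x - step_size * ddm(x, prompt_emb)
    for _ in range(num_steps):
        latent = checkpoint(step, latent, use_reentrant=False)
    return latent
```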
The proposed method introduces Direct Optimization of Diffusion Latents (DOODL), which enables efficient optimization of diffusion noise vectors. By utilizing invertible diffusion processes, this approach allows for backpropagation with constant memory requirements and accurate gradient calculations without needing to store or recompute intermediate latents. This innovation significantly reduces both memory and computational overhead compared to traditional methods.
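The sketch below illustrates the core idea with an EDICT-style invertible coupled update, assuming the toy denoiser above; the mixing coefficient and step size are illustrative, and the constant-memory backward pass, which would recompute intermediates via the inverse rather than storing them, is only indicated in comments.

```python
def coupled_step(x, y, ddm, prompt_emb, step_size=0.1, p=0.93):
    # Two coupled latents are alternately updated, each using the other,
    # then mixed; every operation is algebraically invertible.
    x = x - step_size * ddm(y, prompt_emb)
    y = y - step_size * ddm(x, prompt_emb)
    x_mix = p * x + (1 - p) * y
    y_mix = p * y + (1 - p) * x_mix
    return x_mix, y_mix

def coupled_step_inverse(x_mix, y_mix, ddm, prompt_emb, step_size=0.1, p=0.93):
    # Exact inverse of coupled_step: during backpropagation, the input latents
    # of each step can be recomputed from its outputs instead of being stored,
    # which keeps the memory footprint constant in the number of steps.
    y = (y_mix - (1 - p) * x_mix) / p
    x = (x_mix - (1 - p) * y) / p
    y = y + step_size * ddm(x, prompt_emb)
    x = x + step_size * ddm(y, prompt_emb)
    return x, y

x0, y0 = torch.randn(1, 64), torch.randn(1, 64)
x1, y1 = coupled_step(x0, y0, ddm, prompt_emb)
xr, yr = coupled_step_inverse(x1, y1, ddm, prompt_emb)
assert torch.allclose(xr, x0, atol=1e-4) and torch.allclose(yr, y0, atol=1e-4)
```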
The advantages of this improved framework include enhanced image quality, as measured by compositionality metrics, and support for diverse conditioning methods beyond text prompts. The system can incorporate additional modalities, such as reference images, without requiring retraining of existing networks. Ultimately, this method streamlines the image generation process while broadening its applicability across various domains.
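As a sketch of how a reference image could be incorporated without retraining, the example below plugs an image-based loss into the latent-optimization loop from the earlier sketch; the frozen linear embed encoder and the cosine-similarity loss are hypothetical stand-ins for a pretrained image encoder, not the claimed conditioning mechanism.

```python
import torch.nn.functional as F

embed = nn.Linear(64, 32)                 # hypothetical frozen image encoder
for p in embed.parameters():
    p.requires_grad_(False)

reference_vector = torch.randn(1, 64)     # stand-in for an encoded reference image
ref_emb = embed(reference_vector)

def reference_loss(image_vector):
    # Lower loss when the generated vector's embedding matches the reference.
    return 1 - F.cosine_similarity(embed(image_vector), ref_emb).mean()

# The same optimization loop accepts this new conditioning signal unchanged.
noise_for_reference = optimize_latent(ddm, prompt_emb, reference_loss)
```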