US20250078349
2025-03-06
Physics
G06T11/60
The patent application outlines a method for generating images using a controllable diffusion model. The method obtains a content input containing a target spatial layout and a style input specifying a target style. An image processing apparatus encodes these inputs into a spatial layout mask and a style embedding, respectively; an image generation model then synthesizes an image that combines the specified layout with the specified style.
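The two encoding steps can be sketched as follows. This is a minimal illustration only: the function names, the intensity-quantization layout mask, and the statistics-based style embedding are all hypothetical stand-ins for the patent's learned content and style encoders.

```python
import numpy as np

def encode_content(content_image: np.ndarray, num_classes: int = 4) -> np.ndarray:
    """Hypothetical content encoder: quantize the input into a coarse
    per-pixel spatial layout mask (one discrete label per region)."""
    # Bucket pixel intensities into `num_classes` discrete layout labels.
    bins = np.linspace(content_image.min(), content_image.max() + 1e-8,
                       num_classes + 1)
    return np.digitize(content_image, bins[1:-1])

def encode_style(style_image: np.ndarray, dim: int = 8) -> np.ndarray:
    """Hypothetical style encoder: summarize global image statistics
    into a fixed-length style embedding vector."""
    flat = style_image.reshape(-1)
    # Take the mean and standard deviation of `dim // 2` chunks.
    chunks = np.array_split(flat, dim // 2)
    feats = [f(c) for c in chunks for f in (np.mean, np.std)]
    return np.asarray(feats)
```

In the actual system both encoders would be trained networks whose outputs serve as conditional inputs to the diffusion model; the fixed-length embedding versus spatial mask distinction shown here mirrors the asymmetry described in the application.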
This technology relates to machine learning applications in image processing, specifically image generation with diffusion models. Diffusion models generate images by learning to iteratively denoise random noise into samples that resemble the training data. Traditional models often struggle with controllability and editability because they rely on fixed embeddings that limit flexibility. The proposed system addresses these limitations by incorporating multiple latent spaces, giving more precise control over image attributes during generation.
The system employs a content encoder and a style encoder to process the respective inputs into a spatial layout mask and a style embedding. These outputs serve as conditional inputs to the diffusion model, which is typically structured as a U-Net. The model is trained end-to-end, jointly learning the content and style latent spaces, which improves its ability to preserve content while transferring style. A weight scheduler manages the relative influence of the content and style conditions during the denoising phase of image generation.
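The weight scheduler described above might behave as in the following sketch. The linear crossover and the function name are assumptions for illustration; the application does not specify the schedule's exact shape.

```python
import numpy as np

def condition_weights(t: int, num_steps: int = 1000) -> tuple:
    """Hypothetical weight scheduler: as denoising proceeds from
    t = num_steps (pure noise) down to t = 0 (clean image), shift
    emphasis from the content condition to the style condition."""
    # Fraction of the reverse (denoising) process completed so far.
    progress = 1.0 - t / num_steps
    w_style = progress           # style influence grows toward the end
    w_content = 1.0 - progress   # content dominates the early, noisy steps
    return w_content, w_style
```

The key design point is that the two weights vary with the timestep rather than staying fixed, so structural and stylistic conditioning can dominate different phases of generation.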
The training process computes an objective function based on spatial content and style attributes, which guides the joint training of the encoders and the diffusion model. During inference, timestep scheduling exploits the inductive bias of the diffusion model: structural information from the content input is incorporated early in the denoising process, while style information is emphasized in later stages. This matches the model's tendency to establish low-frequency layout first and high-frequency detail later.
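The inference-time timestep scheduling can be sketched as a loop that switches its conditioning signal partway through denoising. The function names, the placeholder update rule, and the 60/40 switch point are all hypothetical; `step` stands in for one U-Net denoising update.

```python
import numpy as np

def denoise_with_schedule(x_T, layout_mask, style_emb,
                          num_steps=50, switch_frac=0.6):
    """Hypothetical inference loop: apply the content condition for the
    first `switch_frac` of the steps (coarse, low-frequency structure),
    then the style condition for the rest (fine, high-frequency detail)."""
    def step(x, cond):
        # Placeholder update: nudge x toward the (broadcastable) condition.
        return 0.9 * x + 0.1 * cond

    x = x_T
    applied = []
    for i in range(num_steps):
        if i < int(switch_frac * num_steps):
            x = step(x, layout_mask)   # early steps: structural information
            applied.append("content")
        else:
            x = step(x, style_emb)     # later steps: style information
            applied.append("style")
    return x, applied
```

A hard switch is the simplest form; the weight-scheduler variant described earlier would instead blend both conditions with timestep-dependent weights.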
This technology can be applied in any domain requiring image editing or translation. By efficiently combining content and style inputs, the system generates high-quality synthesized images with improved controllability and editability. Potential applications include producing content-preserving style-transferred images and performing reference-based image translation. The detailed architecture and processes are elaborated with illustrative figures in the patent documentation.