Invention Title:

AUTOMATIC IMAGE GENERATION USING LATENT STRUCTURAL DIFFUSION

Publication number:

US20250104290

Publication date:
Section:

Physics

Class:

G06T11/00

Inventors:

Applicants:

Smart overview of the Invention

Automatic image generation involves using generative machine learning models to create images based on specific inputs. The process begins by accessing a plurality of inputs, including a text prompt that describes the desired image and structural data indicative of its features. A first generative model processes these inputs to produce intermediate outputs, which are then refined by a second generative model to generate the final image. The resulting image is displayed on a user device, letting users produce complex images from simple inputs.
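The two-stage flow above can be sketched in a toy form. This is a minimal stand-in, not the patented models: the iterative blending in `first_stage` merely mimics conditioning on a prompt embedding and a structural map, and `second_stage` stands in for the refiner; all function names and shapes are illustrative assumptions.

```python
import numpy as np

def first_stage(text_embedding, structure, steps=4):
    """Toy stand-in for the first generative model: iteratively pull a
    random latent toward the text and structural conditioning signals."""
    rng = np.random.default_rng(0)
    latent = rng.standard_normal(structure.shape)
    for _ in range(steps):
        latent = 0.5 * latent + 0.25 * structure + 0.25 * text_embedding
    return latent

def second_stage(intermediate):
    """Toy refiner: squash the intermediate output into image range [0, 1]."""
    return 1.0 / (1.0 + np.exp(-intermediate))

text_embedding = np.full((8, 8), 0.2)  # stand-in for an encoded prompt
structure = np.ones((8, 8))            # stand-in for a depth/pose map
image = second_stage(first_stage(text_embedding, structure))
assert image.shape == (8, 8) and 0.0 <= image.min() and image.max() <= 1.0
```

The point of the sketch is the data flow: both inputs feed the first model, and only its intermediate output reaches the second model.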

Technical Background

Generative artificial intelligence (AI) systems, such as text-to-image models, transform natural language descriptions into corresponding images. Despite advancements, generating hyperrealistic images, especially those depicting humans, remains challenging due to the complexity of human anatomy and articulation. Traditional models like diffusion models or Generative Adversarial Networks (GANs) often struggle with this task due to limitations in capturing intricate structural details from text prompts alone.

Innovative Approach

The described technique leverages latent diffusion models enhanced with structural information to improve image realism. By incorporating data such as depth and pose maps, these models can better capture and reflect the structural intricacies of subjects like humans or complex objects. This method extends traditional latent diffusion models by conditioning image generation on both visual appearance and structural data, enhancing the coherence and quality of the generated images.
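One common way to condition a latent diffusion model on structural data is to stack maps such as depth and pose as extra channels alongside the image latent; the summary does not specify the exact conditioning mechanism, so the channel-concatenation scheme below is an assumption for illustration.

```python
import numpy as np

def condition_on_structure(latent, depth_map, pose_map):
    """Stack structural maps as extra channels so a denoiser can attend to
    both visual appearance and structure (channel-concat conditioning)."""
    return np.concatenate([latent, depth_map, pose_map], axis=0)

latent = np.zeros((4, 16, 16))      # 4-channel image latent
depth = np.ones((1, 16, 16))        # single-channel depth map
pose = 0.5 * np.ones((1, 16, 16))   # single-channel pose heatmap
conditioned = condition_on_structure(latent, depth, pose)
assert conditioned.shape == (6, 16, 16)
```

Because the maps share the latent's spatial grid, each pixel of the denoiser's input carries both appearance and structural information for the same location.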

Model Framework

A unified model framework is proposed for generating realistic human images with diverse layouts. This framework utilizes a large-scale human-centric dataset featuring comprehensive annotations like captions and pose data. The method involves using a first generative model to create intermediate outputs from input data, which are then processed by a second model to refine and generate the final image. The framework ensures spatial alignment of structural features across intermediate outputs for improved realism.
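A human-centric training sample with the annotations mentioned above might be represented as a record like the following; the field names and keypoint format are hypothetical, since the summary only says the dataset carries captions and pose data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class HumanCentricSample:
    """Hypothetical record for one annotated human-centric training image."""
    image_path: str
    caption: str
    pose_keypoints: List[Tuple[float, float, float]]  # (x, y, visibility)
    depth_path: Optional[str] = None  # optional structural annotation

sample = HumanCentricSample(
    image_path="img/000001.jpg",
    caption="a person jogging along a beach at sunrise",
    pose_keypoints=[(0.50, 0.10, 1.0), (0.48, 0.25, 1.0)],
)
assert sample.depth_path is None and len(sample.pose_keypoints) == 2
```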

Implementation Details

The first generative model employs a diffusion process in latent space, denoising intermediate outputs such as depth maps and surface normal maps. These outputs are spatially aligned to reflect consistent structural features. The second model acts as a refiner, further enhancing the image based on these intermediate outputs. This two-step process allows for high-quality image generation by focusing on both appearance and underlying structure, making it particularly effective for complex subjects like human figures.
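The spatial-alignment idea can be demonstrated with a toy joint denoiser: three same-sized maps (image latent, depth, surface normals) are updated together at every step, sharing features so their structure converges. The cross-branch averaging is a deliberately simplified stand-in for the model's shared representation, not the actual denoising network.

```python
import numpy as np

def joint_denoise(rgb, depth, normal, steps=10, mix=0.1):
    """Toy joint denoising: each spatially aligned branch is nudged toward
    a shared cross-branch feature at every step, so structural features
    stay consistent across all intermediate outputs."""
    for _ in range(steps):
        shared = (rgb + depth + normal) / 3.0  # stand-in for shared features
        rgb = (1 - mix) * rgb + mix * shared
        depth = (1 - mix) * depth + mix * shared
        normal = (1 - mix) * normal + mix * shared
    return rgb, depth, normal

rng = np.random.default_rng(1)
rgb, depth, normal = (rng.standard_normal((32, 32)) for _ in range(3))
before = np.abs(rgb - depth).mean()
rgb2, depth2, normal2 = joint_denoise(rgb, depth, normal)
after = np.abs(rgb2 - depth2).mean()
assert after < before  # branches agree more closely after joint steps
```

Each step shrinks the branch-to-branch disagreement by a constant factor, which is the toy analogue of the two-step process keeping depth, normals, and appearance structurally consistent before the refiner runs.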