Invention Title:

CUSTOM IMAGE AND CONCEPT COMBINER USING DIFFUSION MODELS

Publication number:

US20250278816

Section:

Physics

Class:

G06T5/50

Smart Overview of the Invention

The patent application describes a system and method for generating images using a machine learning model that incorporates multiple input modalities, such as images and text. The system processes these inputs to create embeddings, which are then used by the model to generate an output image. The generated image includes elements from the input images and adheres to the structure and content described by the text input.
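As a rough illustration of this flow, the PyTorch sketch below encodes an image and a text prompt into embeddings and concatenates them into a single conditioning vector for a generative model. All module names, dimensions, and architectures here are illustrative assumptions, not components disclosed in the application.

```python
# Minimal sketch of the multimodal conditioning flow described above.
# ImageEncoder, TextEncoder, and all dimensions are illustrative stand-ins.
import torch
import torch.nn as nn

EMBED_DIM = 256

class ImageEncoder(nn.Module):
    """Maps an input image to a fixed-size embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, EMBED_DIM),
        )

    def forward(self, x):
        return self.net(x)

class TextEncoder(nn.Module):
    """Maps a sequence of token ids to a mean-pooled fixed-size embedding."""
    def __init__(self, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMBED_DIM)

    def forward(self, token_ids):
        return self.embed(token_ids).mean(dim=1)

# Combine the per-modality embeddings into one conditioning vector.
image_encoder, text_encoder = ImageEncoder(), TextEncoder()
reference_image = torch.randn(1, 3, 64, 64)       # stand-in reference image
prompt_tokens = torch.randint(0, 10000, (1, 12))  # stand-in tokenized prompt

condition = torch.cat(
    [image_encoder(reference_image), text_encoder(prompt_tokens)], dim=-1
)
print(condition.shape)  # torch.Size([1, 512])
```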

Background

Image generation involves complex computational processes that combine computer vision and natural language processing techniques. Traditional systems often struggle to integrate multiple input modalities effectively and therefore produce images that fall short of specific user intent. The described invention addresses these limitations by accepting a variety of input types, enabling more precise control over the image generation process.

System Architecture

The system employs a machine learning model, such as a diffusion model or another generative model, trained on diverse input modalities, including reference images, randomly generated image portions, and text prompts. Dedicated encoders process these inputs to generate corresponding embeddings. The model is trained iteratively with these embeddings, learning to arrange image portions semantically so that they align with the structure of the reference image.
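The following is a minimal sketch of one training step under such an architecture, using a standard DDPM-style noise-prediction objective as a stand-in for whatever objective the application actually uses. The Denoiser network, noise schedule, and dimensions are assumptions for illustration only.

```python
# One hedged diffusion training step, conditioned on the combined embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    """Predicts the noise added to a flattened image, given t and conditioning."""
    def __init__(self, image_dim=3 * 64 * 64, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + cond_dim + 1, 1024),
            nn.ReLU(),
            nn.Linear(1024, image_dim),
        )

    def forward(self, x_t, t, cond):
        t_feat = t.float().unsqueeze(-1) / T  # normalized timestep feature
        return self.net(torch.cat([x_t, t_feat, cond], dim=-1))

model = Denoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x0 = torch.randn(8, 3 * 64 * 64)  # stand-in training images (flattened)
cond = torch.randn(8, 512)        # combined image+text embeddings

t = torch.randint(0, T, (8,))
noise = torch.randn_like(x0)
a = alphas_cumprod[t].unsqueeze(-1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward diffusion q(x_t | x_0)

optimizer.zero_grad()
loss = F.mse_loss(model(x_t, t, cond), noise)  # predict the injected noise
loss.backward()
optimizer.step()
```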

Training and Inference Phases

During the training phase, the system uses reference images and text inputs to generate embeddings that guide the ML model toward accurate reconstructions of the reference image. In the inference phase, similar inputs are processed to generate new images that integrate aspects of all provided modalities. This dual-phase approach lets the system create detailed and contextually relevant output images.
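A hedged sketch of the inference phase follows: starting from pure noise, the model iteratively denoises while conditioned on the combined embeddings. It reuses the toy Denoiser and noise schedule from the training sketch above; the sampler actually used in the application may differ.

```python
# Iterative DDPM-style sampling, guided by the combined embeddings.
import torch

@torch.no_grad()
def sample(model, cond, image_dim=3 * 64 * 64, T=1000):
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(cond.shape[0], image_dim)  # start from pure noise
    for step in reversed(range(T)):
        t = torch.full((cond.shape[0],), step, dtype=torch.long)
        eps = model(x, t, cond)  # predicted noise, guided by the embeddings
        # Standard DDPM posterior mean for x_{t-1}.
        x = (x - betas[step] / (1 - alphas_cumprod[step]).sqrt() * eps) \
            / alphas[step].sqrt()
        if step > 0:
            x = x + betas[step].sqrt() * torch.randn_like(x)  # re-inject noise
    return x

# generated = sample(model, cond)  # cond combines image and text embeddings
```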

Technical Benefits

The invention offers significant technical advantages by allowing users to control the content of generated images through diverse input combinations. It improves on existing systems by jointly conditioning on reference images and textual instructions, which increases how faithfully the output reflects the requested content. Users gain creative control over output images by supplying different visual and textual inputs that guide the trained ML model toward the desired result.