Invention Title:

MACHINE LEARNING DIFFUSION MODEL WITH IMAGE ENCODER TRAINED FOR SYNTHETIC IMAGE GENERATION

Publication number:

US20240282016

Publication date:

2024-08-22

Section:

Physics

Class:

G06T11/00

Inventors:

Xiao YANG Los Angeles, CA, United States

Bingchen LIU Los Angeles, CA, United States

Qing YAN Los Angeles, CA, United States

Yizhe ZHU Los Angeles, CA, United States

Applicant:

Lemon Inc. Grand Cayman, Cayman Islands

Smart overview of the Invention

The invention describes a machine learning system designed to generate synthesized images of users efficiently. It comprises three main components: an image encoder, a text encoder, and a diffusion model. The image encoder processes a user's input image to create embeddings that capture the user's visual features. The text encoder then transforms these embeddings into an input feature vector, which the diffusion model uses to produce a synthesized image of the user.
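The data flow described above (image, then embeddings, then input feature vector, then synthesized image) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than the patented implementation: the function names, the `prompt` argument, the vector sizes, and the toy arithmetic are assumptions chosen only to show how the three components connect. A real system would use trained neural networks at each stage.

```python
import random
from typing import List

Vector = List[float]
Image = List[List[float]]


def image_encoder(image: Image, dim: int = 4) -> Vector:
    """Toy stand-in: average-pool the pixels into a `dim`-length embedding."""
    flat = [p for row in image for p in row]
    chunk = max(1, len(flat) // dim)
    return [sum(flat[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]


def text_encoder(image_embedding: Vector, prompt: str = "") -> Vector:
    """Toy stand-in: combine the image embedding with a crude prompt feature.

    The `prompt` argument is an assumption; the overview only says the text
    encoder turns the image embeddings into an input feature vector.
    """
    prompt_feature = (sum(ord(c) for c in prompt) % 256) / 255.0
    return image_embedding + [prompt_feature]


def diffusion_model(features: Vector, size: int = 4, steps: int = 10,
                    seed: int = 0) -> Image:
    """Toy stand-in: start from noise and step toward a conditioned target.

    A real diffusion model would apply a learned denoising network here.
    """
    rng = random.Random(seed)
    target = sum(features) / len(features)
    image = [[rng.random() for _ in range(size)] for _ in range(size)]
    for _ in range(steps):
        # Each step removes half of the remaining gap to the target value.
        image = [[p + 0.5 * (target - p) for p in row] for row in image]
    return image


def synthesize(user_image: Image, prompt: str = "") -> Image:
    embedding = image_encoder(user_image)       # visual features of the user
    features = text_encoder(embedding, prompt)  # input feature vector
    return diffusion_model(features)            # synthesized image


user_image = [[0.2, 0.4], [0.6, 0.8]]
result = synthesize(user_image, prompt="portrait in watercolor style")
```

The point of the sketch is the ordering of the stages: the diffusion step never sees the raw user image, only the feature vector produced by the two encoders.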

Background on Generative Models

Generative models in machine learning serve various applications, including image-to-text generation and style transfer. Recent advancements have enabled models to create photorealistic images based on text prompts. Diffusion models are a specific type of generative model that can generate diverse content but traditionally require extensive fine-tuning with multiple images of a user to achieve accurate results.

Challenges with Conventional Diffusion Models

Conventional diffusion models face significant drawbacks, primarily the need for numerous training images (often ten or more) from the same user, which can be burdensome. Variations in lighting, facial expressions, and other visual characteristics can hinder the model's ability to learn effectively. Additionally, fine-tuning these models is time-consuming and resource-intensive, often taking around ten minutes even on advanced hardware.

Innovative Approach to Image Synthesis

The proposed system addresses these challenges by using a pre-trained image encoder that can be fine-tuned with just a single image of the user. This approach significantly reduces both the number of required training images and the time needed to generate synthesized images: approximately two seconds, compared with the roughly ten minutes of fine-tuning that conventional methods require.

System Architecture and Functionality

The computing system includes processors, memory, and I/O modules that work together with a social media application. Through this application, users capture an image of themselves and start the generation of synthesized images with stylized features based on that input. The resulting images can vary in clothing, pose, and artistic style while preserving the user's distinctive visual characteristics.