Invention Title:

Visual Object Consistency in Image Generation Models

Publication number:

US20250157093

Publication date:

Section:

Physics

Class:

G06T11/00

Inventors:

Applicant:

Smart overview of the Invention

Systems and methods are provided for generating self-consistent synthetic imagery based on a textual prompt. These approaches tackle the challenge of producing consistent character images across different contexts, a common limitation of existing text-to-image generative models. This innovation is particularly beneficial in creative fields such as book illustration, branding, comic creation, presentation design, and webpage design, where visual consistency is essential.

Challenges with Existing Methods

Current techniques often depend on multiple pre-existing images or on labor-intensive manual processes, and they struggle to maintain a consistent character identity, especially for novel or imaginary characters. Because these methods are typically trained on specific datasets, they do not generalize well to new characters, which restricts their versatility and applicability.

Proposed Solution

The disclosure presents a fully automated solution for consistent character generation using only a text prompt. Implementations iteratively customize a pre-trained text-to-image model, using images generated by the model itself as training data. At each step, the generated images are clustered by visual cohesion, the most cohesive subset is selected, and the model is trained on that subset; the process repeats until a convergence metric is satisfied, progressively strengthening character identity consistency.
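
The following is a minimal sketch of this iterative loop, assuming hypothetical generate_images, embed, and fine_tune callables that stand in for a real text-to-image pipeline and its feature extractor; the k-means clustering and the centroid-distance cohesion score are illustrative choices, not details taken from the disclosure.

    # Hypothetical sketch of the iterative self-training loop described above.
    # generate_images, embed, and fine_tune are placeholders, not APIs from
    # the patent or from any specific library.
    from typing import Callable, List
    import numpy as np
    from sklearn.cluster import KMeans

    def consistent_character_loop(
        prompt: str,
        generate_images: Callable[[str, int], List[np.ndarray]],  # prompt, count -> images
        embed: Callable[[np.ndarray], np.ndarray],                # image -> latent vector
        fine_tune: Callable[[List[np.ndarray]], None],            # update model weights
        images_per_iter: int = 64,
        n_clusters: int = 8,
        max_iters: int = 10,
        convergence_tol: float = 0.05,
    ) -> None:
        """Iteratively fine-tune a text-to-image model on its own most
        cohesive outputs until a convergence metric is satisfied."""
        prev_cohesion = None
        for _ in range(max_iters):
            images = generate_images(prompt, images_per_iter)
            embeddings = np.stack([embed(img) for img in images])

            # Cluster the latent embeddings and pick the tightest cluster.
            km = KMeans(n_clusters=n_clusters, n_init="auto").fit(embeddings)
            best_label, best_cohesion = None, -np.inf
            for c in range(n_clusters):
                members = embeddings[km.labels_ == c]
                if len(members) < 2:
                    continue
                # Cohesion here: negative mean distance to the cluster
                # centroid, so tighter clusters score higher.
                cohesion = -float(np.mean(
                    np.linalg.norm(members - km.cluster_centers_[c], axis=1)))
                if cohesion > best_cohesion:
                    best_label, best_cohesion = c, cohesion

            # Fine-tune the model on the most self-consistent subset.
            selected = [im for im, lbl in zip(images, km.labels_) if lbl == best_label]
            fine_tune(selected)

            # Convergence check: stop once the cohesion score stabilizes.
            if prev_cohesion is not None and abs(best_cohesion - prev_cohesion) < convergence_tol:
                break
            prev_cohesion = best_cohesion

Feeding the model's own filtered outputs back as training data is what removes the need for pre-existing reference images, since each iteration narrows the model toward one visual identity.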

Advantages

This method does not require pre-existing images, allowing consistent and diverse images to be generated from just a text description. Because it is fully automated and domain-agnostic, it applies to a wide range of characters and contexts and eliminates the need for manual intervention or ad hoc solutions.

Implementation Details

A computer-implemented method generates self-consistent synthetic imagery by processing a textual prompt with a machine-learned image generation model over multiple update iterations. In each iteration, the model generates synthetic images, and a visually cohesive subset is selected as training data. Cohesion is assessed by embedding the images in a latent space, clustering the embeddings, and evaluating a cohesion measure for each cluster, so that training focuses on the most consistent images.
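
One possible cohesion measure over a cluster's latent embeddings is sketched below. The mean pairwise cosine similarity shown is an illustrative choice, as the disclosure does not commit to a particular measure here, and the cluster_cohesion function name is hypothetical.

    import numpy as np

    def cluster_cohesion(embeddings: np.ndarray) -> float:
        """Mean pairwise cosine similarity within one cluster of image
        embeddings (shape n x d, with n >= 2); higher means more cohesive."""
        normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        sims = normed @ normed.T
        n = embeddings.shape[0]
        # Average over off-diagonal entries only (exclude self-similarity).
        return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

Under a measure of this kind, the cluster that scores highest would be retained as the training subset for the next update iteration, matching the selection step described above.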