Invention Title:

STYLE TAILORING LATENT DIFFUSION MODELS FOR HUMAN EXPRESSION

Publication number:

US20250157106

Publication date:

Section:

Physics

Class:

G06T11/60

Inventors:

Applicant:

Smart overview of the Invention

Overview: The described system uses a finetuned latent diffusion model to generate visual content from a descriptive text or audio input. It first creates an initial latent representation from the input description, then refines that representation through an iterative denoising process. During denoising, the system samples data points from content and style distributions at different timesteps to produce a final image latent, which is decoded into images that are visually aligned with the input description.
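
For illustration, the sketch below walks through that flow in Python: denoise a random initial latent step by step under a prompt embedding, then decode the result. The callables unet and vae_decoder, the latent shape, the cosine noise schedule, and the deterministic DDIM-style update are all assumptions made for brevity, not details fixed by the patent.

import math
import torch

def alpha_bar(t: torch.Tensor, T: int = 1000) -> torch.Tensor:
    # Cosine cumulative noise schedule (an assumption for this sketch).
    return torch.cos((t / T) * math.pi / 2) ** 2

@torch.no_grad()
def generate(prompt_emb, unet, vae_decoder, steps: int = 50):
    # Initial latent representation, drawn as pure noise.
    latent = torch.randn(1, 4, 64, 64)
    ts = torch.linspace(999.0, 0.0, steps + 1)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = unet(latent, t, prompt_emb)        # predicted noise at timestep t
        ab_t, ab_p = alpha_bar(t), alpha_bar(t_prev)
        # Estimate the clean latent, then step to the next timestep (DDIM, eta = 0).
        x0 = (latent - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        latent = ab_p.sqrt() * x0 + (1 - ab_p).sqrt() * eps
    return vae_decoder(latent)                   # decode the final image latent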

Technological Field: This innovation sits in the domain of image generation and enhancement using latent diffusion models. It addresses the challenge of producing high-quality images that are both visually appealing and aligned with user prompts, providing a solution that balances style adherence with prompt fidelity.

Background: Traditional diffusion-based text-to-image models have advanced significantly, allowing users to create new and diverse visual scenes. However, these models often struggle to balance style alignment against prompt fidelity. The proposed system seeks to overcome this limitation, improving prompt alignment, visual diversity, and stylistic adherence simultaneously.

Methodology: The system processes descriptive inputs to generate an initial latent representation using a finetuned Latent Diffusion Model (LDM). A denoising process refines this representation, and data points are sampled from content and style distributions at different timesteps to form a final image latent. The final latent is decoded into one or more visually aligned images corresponding to the input description, and the resulting images can be displayed on a user interface.
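
The distinctive step is the timestep-dependent use of the two distributions. One hedged way to realize it, sketched below, routes the noise prediction to a content-tuned model while noise is high and to a style-tuned model below a cutoff t_switch; both models and the cutoff value are illustrative assumptions rather than parameters stated in the patent.

def predict_noise(latent, t, prompt_emb, content_unet, style_unet,
                  t_switch: float = 300.0):
    # Early, high-noise steps sample from the content distribution to fix
    # layout and semantics; later, low-noise steps sample from the style
    # distribution to impose the target look. Both UNets are hypothetical.
    if t > t_switch:
        return content_unet(latent, t, prompt_emb)   # content distribution
    return style_unet(latent, t, prompt_emb)         # style distribution

Substituting predict_noise for the single unet call in the earlier denoising sketch yields a final image latent shaped by both distributions before decoding.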

Applications: The described system can be implemented in various forms, including as a method, apparatus, or computer program product. It is designed to enhance image generation processes in virtual environments such as the Metaverse, where immersive and visually coherent content is essential for user interaction and engagement.