US20250078346
2025-03-06
Physics
G06T11/60
The patent application outlines a system for generating multi-layer images using machine learning models. A first model converts text input into a single-layer image containing multiple objects, for which masks and attributes are generated. A second machine learning model then creates a textual description for each object, and these descriptions are used to produce an individual image per object. The individual images are combined into a multi-layer image using the generated masks.
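The staged flow described above (text prompt, then per-object masks and descriptions, then per-object images assembled into layers) can be sketched as a minimal pipeline. All function names, the toy "object detection," and the data shapes here are hypothetical stand-ins, not the application's actual components:

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """One object-specific layer: an image plus the mask that places it."""
    name: str
    image: str   # placeholder for pixel data
    mask: str    # placeholder for the object's segmentation mask

def generate_multilayer_image(prompt: str) -> list[Layer]:
    # Step 1 (first ML model, stubbed): text -> single-layer image with
    # multiple objects; here we "detect" capitalized words as objects.
    objects = [w for w in prompt.split() if w.istitle()]
    layers = []
    for obj in objects:
        # Step 2: a mask and attributes are generated per object (stubbed).
        mask = f"mask({obj})"
        # Step 3 (second ML model, stubbed): an object-specific description.
        description = f"a {obj.lower()} rendered alone"
        # Step 4: each description drives generation of an individual image.
        image = f"image[{description}]"
        layers.append(Layer(obj, image, mask))
    # Step 5: the per-object images become layers keyed by their masks.
    return layers
```

The stub returns one layer per detected object, mirroring the claim that each object receives its own image and mask before composition.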
Recent advancements in text-to-image models have focused on generating flat images from text prompts. These models typically combine text encoders and generative networks to create images from textual inputs. However, such approaches often result in monolithic images that lack layer segmentation, which is crucial for content creators who require layered images for compositing tasks.
The application addresses limitations in current models that do not handle transparency or generate layered images effectively. Existing approaches often treat each generated image independently, leading to inconsistent layers when combined. The disclosed methods aim to generate consistent multi-layer images directly from text inputs without requiring extensive post-processing or specialized training data that includes alpha-channel images.
The system utilizes two machine learning models: one for generating initial single-layer images and another for creating object-specific textual descriptions. By segmenting the single-layer image into multiple object-based layers and using masks to manage transparency, the system can produce a coherent multi-layer image. This process conserves computational resources by avoiding the need for training on alpha-channel data.
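Using masks to manage transparency when merging object layers amounts to standard alpha compositing. A minimal pure-Python sketch under the assumption that each binary mask acts as a per-pixel alpha (the application's exact blending rule is not specified here):

```python
def composite(background, layers):
    """Composite object layers over a background using binary masks.

    background: 2D grid of pixel values.
    layers: list of (image, mask) pairs; mask[y][x] == 1 means the
    object's pixel replaces whatever lies beneath it at that position.
    """
    # Copy so the caller's background is left untouched.
    out = [row[:] for row in background]
    for image, mask in layers:
        for y, row in enumerate(mask):
            for x, alpha in enumerate(row):
                # alpha 1 -> take the layer pixel; alpha 0 -> keep beneath.
                out[y][x] = alpha * image[y][x] + (1 - alpha) * out[y][x]
    return out
```

Because transparency comes entirely from the generated masks, no alpha-channel training data is needed, consistent with the resource-saving point above.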
Embodiments of the system may include capabilities such as detecting object edges to maintain consistency between layers and filling empty regions in segmented images. The approach ensures that generated multi-layer images align closely with the initial single-layer image's attributes, enhancing the quality and usability of the final output without manual intervention.
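Detecting object edges from a segmentation mask can be done with a simple neighbor check; the sketch below is one plausible approach (a 4-connected boundary test), not the application's specific algorithm:

```python
def mask_edges(mask):
    """Return a grid marking boundary pixels of a binary mask.

    A pixel is an edge if it belongs to the object (value 1) and at
    least one 4-connected neighbor is outside it (value 0 or off-grid).
    """
    h, w = len(mask), len(mask[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] != 1:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or mask[ny][nx] == 0:
                    edges[y][x] = 1  # object pixel touching the outside
                    break
    return edges
```

Comparing such edge maps between the initial single-layer image and each generated layer is one way to verify that object boundaries stay aligned across layers.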