US20240320872
2024-09-26
Physics
G06T11/00
The disclosed method involves generating images using a machine learning model that integrates both text and image prompts. The process begins by obtaining embeddings for a text prompt and an image prompt. These embeddings are then mapped into a joint embedding space, so that the textual and visual information can be combined.
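A minimal sketch of this step in PyTorch, assuming small feed-forward mapping networks and illustrative embedding dimensions; none of these module names or sizes are specified in the source:

```python
# Sketch: obtain text/image embeddings and map them into a joint embedding space.
# All module names and dimensions are illustrative assumptions, not the patent's
# actual architecture.
import torch
import torch.nn as nn

TEXT_DIM, IMAGE_DIM, JOINT_DIM = 768, 1024, 512

class MappingNetwork(nn.Module):
    """Projects a modality-specific embedding into the joint space."""
    def __init__(self, in_dim: int, joint_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, joint_dim),
            nn.GELU(),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Stand-ins for the outputs of pretrained text and image encoders.
text_embedding = torch.randn(1, TEXT_DIM)    # embedding of the text prompt
image_embedding = torch.randn(1, IMAGE_DIM)  # embedding of the image prompt

text_mapper = MappingNetwork(TEXT_DIM, JOINT_DIM)
image_mapper = MappingNetwork(IMAGE_DIM, JOINT_DIM)

joint_text_embedding = text_mapper(text_embedding)      # shape (1, JOINT_DIM)
joint_image_embedding = image_mapper(image_embedding)   # shape (1, JOINT_DIM)
```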
In the system, the text embedding is transformed into a joint text embedding, while the image embedding is converted into a joint image embedding. This mapping lets the model use both types of input within a single representation space. By conditioning the image generation model on the mapped embeddings, the system can create images that accurately reflect the descriptions and styles provided by the user.
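Continuing the sketch, a toy generator conditioned on the two joint embeddings by concatenating them with a noise vector; the concatenation scheme, the network layout, and all dimensions are assumptions rather than the patent's actual conditioning mechanism:

```python
# Hedged sketch: condition a toy image generation model on the joint text and
# joint image embeddings.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy generator that synthesizes an image from noise plus conditioning."""
    def __init__(self, joint_dim: int, noise_dim: int = 128, image_size: int = 64):
        super().__init__()
        self.image_size = image_size
        self.net = nn.Sequential(
            nn.Linear(noise_dim + 2 * joint_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 3 * image_size * image_size),
            nn.Tanh(),
        )

    def forward(self, noise, joint_text, joint_image):
        # Concatenate noise with both joint embeddings as the conditioning input.
        cond = torch.cat([noise, joint_text, joint_image], dim=-1)
        out = self.net(cond)
        return out.view(-1, 3, self.image_size, self.image_size)

generator = ConditionalGenerator(joint_dim=512)
noise = torch.randn(1, 128)
# In practice, the joint embeddings come from the mapping networks sketched above.
image = generator(noise, torch.randn(1, 512), torch.randn(1, 512))
print(image.shape)  # torch.Size([1, 3, 64, 64])
```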
The apparatus includes one or more processors and memory that together carry out image generation. Key components are a text mapping network and an image mapping network, both trained to produce joint embeddings, along with an image generation model that synthesizes images from these embeddings with high fidelity to the original prompts.
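One illustrative way to wire these components into a single apparatus, reusing the MappingNetwork and ConditionalGenerator sketches above; the class and method names here are hypothetical:

```python
# Illustrative composition of the described components into one module.
import torch
import torch.nn as nn

class ImageGenerationApparatus(nn.Module):
    def __init__(self, text_mapper: nn.Module, image_mapper: nn.Module,
                 generator: nn.Module, noise_dim: int = 128):
        super().__init__()
        self.text_mapper = text_mapper    # maps text embedding -> joint text embedding
        self.image_mapper = image_mapper  # maps image embedding -> joint image embedding
        self.generator = generator        # synthesizes an image from the joint embeddings
        self.noise_dim = noise_dim

    @torch.no_grad()
    def generate(self, text_embedding: torch.Tensor,
                 image_embedding: torch.Tensor) -> torch.Tensor:
        joint_text = self.text_mapper(text_embedding)
        joint_image = self.image_mapper(image_embedding)
        noise = torch.randn(text_embedding.shape[0], self.noise_dim)
        return self.generator(noise, joint_text, joint_image)

# Usage (with the sketches above):
# apparatus = ImageGenerationApparatus(text_mapper, image_mapper, generator)
# output = apparatus.generate(text_embedding, image_embedding)
```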
This approach improves on traditional image generation systems, which typically rely on only one type of input (either text or image). By integrating both inputs into a single framework, the system can produce more accurate and contextually relevant images, and the ability to generate images under diverse textual conditions broadens its applicability across domains.
Users interact with the system through a user interface, providing both text and image prompts. The generated output may initially be a low-resolution image, which can then be upscaled to high resolution using techniques such as generative adversarial networks (GANs). This lets users obtain detailed, high-resolution images tailored to their specific inputs.
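A hedged sketch of the final upscaling step, using a small PixelShuffle-based network as a stand-in for the GAN-based upscaler described above; the layer choices and 4x scale factor are assumptions:

```python
# Toy super-resolution generator: upscales the low-resolution output to a
# higher resolution. Not the patent's actual upscaling network.
import torch
import torch.nn as nn

class SuperResolutionGenerator(nn.Module):
    """Upscales a low-resolution image by a fixed factor using PixelShuffle."""
    def __init__(self, channels: int = 3, scale: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, channels * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),  # rearranges channels into spatial resolution
        )

    def forward(self, low_res: torch.Tensor) -> torch.Tensor:
        return self.net(low_res)

upscaler = SuperResolutionGenerator()
low_res = torch.randn(1, 3, 64, 64)   # e.g. the generator's initial low-res output
high_res = upscaler(low_res)
print(high_res.shape)                 # torch.Size([1, 3, 256, 256])
```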