US20240249456
2024-07-25
Physics
G06T11/60
Methods and systems are described for generating images using a sequence of generative neural networks. The process begins by receiving an input text prompt, which is a sequence of text tokens in a natural language. A text encoder neural network processes the prompt to produce contextual embeddings that capture the meaning of the text. These embeddings are then passed through the sequence of generative neural networks, which produces a final output image depicting the scene described by the prompt.
Image generation uses a sequence (cascade) of generative neural networks rather than a single model. An initial network processes the contextual embeddings to create an initial image at a low resolution. Each subsequent network receives both the contextual embeddings and the image produced by the previous network, and outputs an image at a higher resolution. Image quality therefore improves step by step, and later stages can correct artifacts introduced by earlier ones.
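The data flow of such a cascade can be sketched in a few lines of Python. This is only an illustration of how the stages are chained, not the patented implementation: `encode_text`, `base_generator`, and `upsampling_stage` are hypothetical stand-ins, images are plain nested lists, and nearest-neighbour upsampling takes the place of a learned generative network.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]   # toy grayscale image: rows of pixel values
Embeddings = List[float]    # toy stand-in for contextual embeddings


def encode_text(tokens: Sequence[str]) -> Embeddings:
    """Hypothetical text encoder: one number per token (scaled token length)."""
    return [len(tok) / 10.0 for tok in tokens]


def base_generator(emb: Embeddings, size: int = 4) -> Image:
    """Hypothetical first stage: low-resolution image from the embeddings alone."""
    level = sum(emb) / max(len(emb), 1)
    return [[level for _ in range(size)] for _ in range(size)]


def upsampling_stage(emb: Embeddings, prev: Image, factor: int = 2) -> Image:
    """Hypothetical later stage, conditioned on the embeddings AND the previous image.

    Nearest-neighbour upsampling stands in for a learned generative model.
    """
    bias = 0.01 * len(emb)  # placeholder for real conditioning on the embeddings
    out: Image = []
    for row in prev:
        # Repeat each pixel `factor` times horizontally...
        wide = [pixel + bias for pixel in row for _ in range(factor)]
        # ...and each widened row `factor` times vertically.
        out.extend(list(wide) for _ in range(factor))
    return out


Stage = Callable[[Embeddings, Image], Image]


def generate(tokens: Sequence[str], later_stages: Sequence[Stage]) -> Image:
    """Run the cascade: encode once, then feed each stage the previous output."""
    emb = encode_text(tokens)
    image = base_generator(emb)
    for stage in later_stages:
        image = stage(emb, image)
    return image


final = generate(["a", "red", "bicycle"], [upsampling_stage, upsampling_stage])
print(len(final), len(final[0]))  # 16 16
```

The embeddings are computed once and reused by every stage, while the image is the only value that changes from stage to stage. Starting from the assumed 4x4 base image, two stages with a factor of 2 give a 16x16 result.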
Although the primary focus is on text prompts, the system can accept other types of conditioning input. These include noise samples drawn from a distribution, existing images, audio signals describing a scene, or combinations of these. The same cascade can therefore generate images from sources other than textual descriptions, and it is intended to produce high-resolution output for each input type.
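One way to picture how different conditioning inputs could feed the same cascade is to map each input type to a common vector form before generation. The sketch below is an assumption made for illustration: the class names, the `to_conditioning_vector` function, and the toy per-type encodings are all hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class TextPrompt:
    tokens: List[str]


@dataclass
class NoiseSample:
    values: List[float]


@dataclass
class ImageInput:
    pixels: List[List[float]]


@dataclass
class AudioInput:
    samples: List[float]


Conditioning = Union[TextPrompt, NoiseSample, ImageInput, AudioInput]


def to_conditioning_vector(item: Conditioning) -> List[float]:
    """Map one conditioning input to a flat vector (toy encodings, all hypothetical)."""
    if isinstance(item, TextPrompt):
        return [len(tok) / 10.0 for tok in item.tokens]
    if isinstance(item, NoiseSample):
        return list(item.values)
    if isinstance(item, ImageInput):
        # One value per non-empty row: the row's mean intensity.
        return [sum(row) / len(row) for row in item.pixels if row]
    if isinstance(item, AudioInput):
        # A single value: the peak amplitude of the signal.
        return [max((abs(s) for s in item.samples), default=0.0)]
    raise TypeError(f"unsupported conditioning input: {type(item).__name__}")


def combine(items: Sequence[Conditioning]) -> List[float]:
    """Concatenate the vectors of several inputs, covering the 'combinations' case."""
    combined: List[float] = []
    for item in items:
        combined.extend(to_conditioning_vector(item))
    return combined


vector = combine([TextPrompt(["red", "bicycle"]), NoiseSample([0.5, -0.25])])
print(len(vector))  # 4
```

A real system would use a learned encoder per modality; the point here is only that the downstream generative networks can stay unchanged once every input type is reduced to the same kind of conditioning vector.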
The main advantage of the described system is that it produces high-resolution images that closely reflect their text descriptions. Because the image is built up through a sequence of generative neural networks rather than generated at high resolution in a single step, the computational burden of direct high-resolution generation is reduced. The later stages also mitigate distortions and artifacts that may arise in the lower-resolution outputs. Together, these properties improve both the quality and the efficiency of image generation.