US20240362830
2024-10-31
Physics
G06T11/00
The method involves a computing device receiving a textual description of a scene and using a neural network to generate an image rendition of that scene. The neural network is trained so that image renditions associated with the same textual description are mutually similar, while renditions associated with different descriptions are dissimilar. This is achieved by leveraging the mutual information between corresponding image-to-image and text-to-image pairs. The method concludes by predicting the final image rendition of the scene from the input description.
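A minimal sketch of how such a contrastive objective might be expressed is given below, assuming an InfoNCE-style loss (a standard lower bound on mutual information) with separate text-to-image and image-to-image terms; the encoder outputs, function names, and temperature are illustrative assumptions, not the patent's actual implementation.

```python
# Hypothetical sketch of the contrastive objective described above (not the
# patent's implementation): an InfoNCE-style loss, a lower bound on mutual
# information, applied to text-to-image and image-to-image pairs.
import torch
import torch.nn.functional as F

def info_nce(queries: torch.Tensor, keys: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """queries[i] should match keys[i]; all other keys in the batch act as negatives."""
    queries = F.normalize(queries, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = queries @ keys.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)

def combined_contrastive_loss(text_emb, real_img_emb, fake_img_emb, temperature=0.1):
    """Assumed combination: align generated images with their descriptions
    (text-to-image term) and with real images sharing the same description
    (image-to-image term)."""
    text_to_image = info_nce(text_emb, fake_img_emb, temperature)
    image_to_image = info_nce(real_img_emb, fake_img_emb, temperature)
    return text_to_image + image_to_image
```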
Text-to-image synthesis uses neural networks to create coherent images with high semantic fidelity from textual descriptions. Descriptive sentences are favored because they are an intuitive way to express visual concepts. Challenges arise from the unstructured nature of language and the differing statistical properties of text and images. Deep generative models such as GANs have improved image quality, but complex scenes remain difficult to render accurately.
Contrastive learning plays a central role in self-supervised representation learning: representations of positive pairs are pulled together while representations of negative pairs are pushed apart, keeping image representations consistent across views. The approach can also be integrated into adversarial training, where it helps disentangle features for applications such as face generation. Applied as a regularizer, a contrastive loss can improve the fidelity of generated images.
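For reference, a generic NT-Xent-style contrastive loss over positive and negative pairs is sketched below; it is a standard self-supervised formulation over two augmented views per image, not the patent's specific loss, and the temperature and batch layout are assumptions.

```python
# Generic NT-Xent contrastive loss over two augmented views of the same images;
# a standard self-supervised formulation, shown here only for illustration.
import torch
import torch.nn.functional as F

def nt_xent(view1: torch.Tensor, view2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """view1[i] and view2[i] embed two augmentations of the same image (a positive
    pair); all other embeddings in the batch serve as negatives."""
    z = F.normalize(torch.cat([view1, view2], dim=0), dim=-1)   # (2B, D)
    sim = z @ z.t() / temperature                               # (2B, 2B) cosine similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))                  # exclude self-similarity
    b = view1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(sim.device)
    return F.cross_entropy(sim, targets)

# When used as a regularizer in adversarial training, the term is simply weighted
# into the generator objective, e.g. g_loss = adv_loss + lambda_c * nt_xent(z1, z2).
```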
Generative Adversarial Networks (GANs) are central to producing high-quality images from text inputs. Techniques such as AttnGAN refine image regions by attending to the most relevant words of the description, achieving high fidelity in narrow domains. These models nonetheless struggle with complex scenes containing multiple objects, as found in datasets like MS-COCO. Hierarchical approaches attempt to model individual object instances but require detailed instance-level labels, which complicates real-world application.
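The word-level attention idea behind models like AttnGAN can be sketched as follows; the tensor shapes, projection-free dot-product scoring, and names are simplifying assumptions rather than the published architecture.

```python
# Simplified word-level attention in the spirit of AttnGAN: each image region
# attends over the word embeddings of the description so the generator can
# refine regions using the most relevant words. Shapes and names are assumed.
import torch
import torch.nn.functional as F

def word_attention(region_feats: torch.Tensor,   # (B, N, D) image region features
                   word_embs: torch.Tensor,      # (B, T, D) word embeddings
                   temperature: float = 1.0) -> torch.Tensor:
    """Returns a word-conditioned context vector for every image region, shape (B, N, D)."""
    scores = torch.bmm(region_feats, word_embs.transpose(1, 2)) / temperature  # (B, N, T)
    attn = F.softmax(scores, dim=-1)              # each region distributes attention over words
    context = torch.bmm(attn, word_embs)          # (B, N, D) attended word context per region
    return context
```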
The patent outlines various implementations of the method, including computer programs and systems designed to carry out the described functions. Key elements include training data pairing textual descriptions with associated images, training neural networks with contrastive losses, and outputting the trained models for generating images from text. These implementations underscore the method's applicability across different computing environments.
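A hypothetical training skeleton matching this description might look like the following; the encoder, generator, and dataloader components are placeholders, and the loss reuses the combined_contrastive_loss sketch from earlier rather than the patent's exact objective.

```python
# Illustrative training skeleton for the described implementations; all components
# and hyperparameters are placeholders, not the patent's actual configuration.
import torch

def train(text_encoder, image_encoder, generator, dataloader, epochs=10, lr=2e-4):
    params = (list(text_encoder.parameters())
              + list(image_encoder.parameters())
              + list(generator.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for captions, images in dataloader:       # training data: (text, image) pairs
            text_emb = text_encoder(captions)
            real_emb = image_encoder(images)
            fake_images = generator(text_emb)
            fake_emb = image_encoder(fake_images)
            # Contrastive objective from the earlier sketch (assumed, not the patent's exact loss).
            loss = combined_contrastive_loss(text_emb, real_emb, fake_emb)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return generator                              # trained model output for text-to-image generation
```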