US20240362830
2024-10-31
Physics
G06T11/00
The method involves a computing device receiving a textual description of a scene and using a neural network to generate an image rendition of that scene. The neural network is trained so that image renditions associated with the same textual description are mutually similar, while renditions associated with different descriptions are dissimilar. This is achieved by leveraging the mutual information between corresponding image-to-image and text-to-image pairs. The method concludes by predicting the final image rendition of the scene from the input description.
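A minimal sketch of how such a contrastive objective might be expressed is given below, assuming an InfoNCE-style loss (a standard lower bound on mutual information) with separate text-to-image and image-to-image terms; the encoder outputs, function names, and temperature are illustrative assumptions, not the patent's actual implementation.

```python
# Hypothetical sketch of the contrastive objective described above (not the
# patent's implementation): an InfoNCE-style loss, a lower bound on mutual
# information, applied to text-to-image and image-to-image pairs.
import torch
import torch.nn.functional as F

def info_nce(queries: torch.Tensor, keys: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """queries[i] should match keys[i]; all other keys in the batch act as negatives."""
    queries = F.normalize(queries, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = queries @ keys.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)

def combined_contrastive_loss(text_emb, real_img_emb, fake_img_emb, temperature=0.1):
    """Assumed combination: align generated images with their descriptions
    (text-to-image term) and with real images sharing the same description
    (image-to-image term)."""
    text_to_image = info_nce(text_emb, fake_img_emb, temperature)
    image_to_image = info_nce(real_img_emb, fake_img_emb, temperature)
    return text_to_image + image_to_image
```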
Text-to-image synthesis uses neural networks to create coherent images with high semantic fidelity from textual descriptions. Descriptive sentences are favored because they are an intuitive way to express visual concepts. Challenges arise from the unstructured nature of language and the differing statistical properties of text and images. Deep generative models such as GANs have improved image quality, but complex scenes remain difficult to render accurately.
Contrastive learning plays a central role in self-supervised representation learning: representations of positive pairs are pulled together while representations of negative pairs are pushed apart, keeping image representations consistent across views. The approach can also be integrated into adversarial training, where it helps disentangle features for applications such as face generation. Applied as a regularizer, a contrastive loss can improve the fidelity of generated images.
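For reference, a generic NT-Xent-style contrastive loss over positive and negative pairs is sketched below; it is a standard self-supervised formulation over two augmented views per image, not the patent's specific loss, and the temperature and batch layout are assumptions.

```python
# Generic NT-Xent contrastive loss over two augmented views of the same images;
# a standard self-supervised formulation, shown here only for illustration.
import torch
import torch.nn.functional as F

def nt_xent(view1: torch.Tensor, view2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """view1[i] and view2[i] embed two augmentations of the same image (a positive
    pair); all other embeddings in the batch serve as negatives."""
    z = F.normalize(torch.cat([view1, view2], dim=0), dim=-1)   # (2B, D)
    sim = z @ z.t() / temperature                               # (2B, 2B) cosine similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))                  # exclude self-similarity
    b = view1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(sim.device)
    return F.cross_entropy(sim, targets)

# When used as a regularizer in adversarial training, the term is simply weighted
# into the generator objective, e.g. g_loss = adv_loss + lambda_c * nt_xent(z1, z2).
```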
Generative Adversarial Networks (GANs) are central to producing high-quality images from text inputs. Techniques such as AttnGAN refine image regions by attending to the most relevant words of the description, achieving high fidelity in narrow domains. These models nonetheless struggle with complex scenes containing multiple objects, as found in datasets like MS-COCO. Hierarchical approaches attempt to model individual object instances but require detailed instance-level labels, which complicates real-world application.
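The word-level attention idea behind models like AttnGAN can be sketched as follows; the tensor shapes, projection-free dot-product scoring, and names are simplifying assumptions rather than the published architecture.

```python
# Simplified word-level attention in the spirit of AttnGAN: each image region
# attends over the word embeddings of the description so the generator can
# refine regions using the most relevant words. Shapes and names are assumed.
import torch
import torch.nn.functional as F

def word_attention(region_feats: torch.Tensor,   # (B, N, D) image region features
                   word_embs: torch.Tensor,      # (B, T, D) word embeddings
                   temperature: float = 1.0) -> torch.Tensor:
    """Returns a word-conditioned context vector for every image region, shape (B, N, D)."""
    scores = torch.bmm(region_feats, word_embs.transpose(1, 2)) / temperature  # (B, N, T)
    attn = F.softmax(scores, dim=-1)              # each region distributes attention over words
    context = torch.bmm(attn, word_embs)          # (B, N, D) attended word context per region
    return context
```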
The patent outlines various implementations of the method, including computer programs and systems designed to carry out the described functions. Key elements include training data pairing textual descriptions with associated images, training neural networks with contrastive losses, and outputting the trained models for generating images from text. These implementations underscore the method's applicability across different computing environments.
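A hypothetical training skeleton matching this description might look like the following; the encoder, generator, and dataloader components are placeholders, and the loss reuses the combined_contrastive_loss sketch from earlier rather than the patent's exact objective.

```python
# Illustrative training skeleton for the described implementations; all components
# and hyperparameters are placeholders, not the patent's actual configuration.
import torch

def train(text_encoder, image_encoder, generator, dataloader, epochs=10, lr=2e-4):
    params = (list(text_encoder.parameters())
              + list(image_encoder.parameters())
              + list(generator.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for captions, images in dataloader:       # training data: (text, image) pairs
            text_emb = text_encoder(captions)
            real_emb = image_encoder(images)
            fake_images = generator(text_emb)
            fake_emb = image_encoder(fake_images)
            # Contrastive objective from the earlier sketch (assumed, not the patent's exact loss).
            loss = combined_contrastive_loss(text_emb, real_emb, fake_emb)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return generator                              # trained model output for text-to-image generation
```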