US20250005293
2025-01-02
Physics
G06F40/40
The patent application describes a system that integrates large language models (LLMs) and vision language models (VLMs) to facilitate enhanced human-computer interaction through multimodal dialogs. This approach combines visual data processing with natural language inputs to generate responsive content, thereby enabling a more dynamic and context-aware interaction with virtual assistants or chatbots.
LLMs are machine learning models trained on large text corpora to perform natural language processing tasks such as language generation and question answering. VLMs, in turn, handle tasks that combine visual data with natural language, such as image captioning and visual question answering, and are trained on datasets that pair images with text so the models learn to interpret both modalities together.
The system processes digital images via VLMs to generate outputs that reflect an environmental state. These outputs are used to construct LLM prompts alongside natural language inputs. The LLMs then process these prompts to generate content responsive to the initial natural language input; that content is subsequently rendered on output devices. This process allows the system to resolve ambiguities in user input through synthetic follow-up queries and the integration of visual context.
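The pipeline described above can be sketched in a few lines. This is an illustrative outline only, not the application's implementation: the function names (`describe_image`, `generate`, `build_prompt`, `respond`) and the stub return values are assumptions standing in for real VLM and LLM calls.

```python
def describe_image(image_bytes: bytes) -> str:
    """Stand-in for a VLM call that returns an environmental-state description."""
    return "refrigerator containing eggs, spinach, and cheddar cheese"

def generate(prompt: str) -> str:
    """Stand-in for an LLM call that produces responsive content."""
    return f"Response conditioned on: {prompt}"

def build_prompt(nl_input: str, vlm_output: str) -> str:
    # Per the application, the LLM prompt is constructed from both the
    # natural-language input and the VLM-derived environmental state.
    return f"Environment: {vlm_output}\nUser request: {nl_input}"

def respond(nl_input: str, image_bytes: bytes) -> str:
    # Image -> VLM output -> combined prompt -> LLM -> content for rendering.
    vlm_output = describe_image(image_bytes)
    prompt = build_prompt(nl_input, vlm_output)
    return generate(prompt)
```

In a deployed system the two stubs would be replaced by calls to actual VLM and LLM backends; the structural point is only that the VLM output and the user's text enter a single prompt.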
In practice, the system can detect items in an environment and their attributes or locations using processed images. For example, when a user asks for a dinner recipe, the system can generate follow-up queries like "What food is available?" and use VLMs to identify available ingredients from images of the user's environment, such as the contents of a refrigerator.
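The ambiguity-resolution loop in the recipe example can be sketched as follows. All names here are hypothetical, and the simple keyword rule in `synthesize_follow_up` merely stands in for an LLM deciding that the request is underspecified; the hard-coded VLM answer stands in for visual question answering over images of the environment.

```python
from typing import Optional

def synthesize_follow_up(nl_input: str) -> Optional[str]:
    # Stand-in for an LLM judging the request ambiguous and emitting a
    # synthetic follow-up query, as in the "dinner recipe" example.
    if "recipe" in nl_input.lower():
        return "What food is available?"
    return None

def answer_with_vlm(query: str, images: list) -> str:
    # Stand-in for a VLM answering the follow-up query from environment
    # images (e.g., the contents of a refrigerator).
    return "eggs, spinach, cheddar cheese"

def resolve(nl_input: str, images: list) -> str:
    # Augment the original request with visual context before prompting the LLM.
    follow_up = synthesize_follow_up(nl_input)
    if follow_up is None:
        return nl_input
    answer = answer_with_vlm(follow_up, images)
    return f"{nl_input} (available ingredients: {answer})"
```

Unambiguous requests pass through unchanged; ambiguous ones are enriched with the VLM's answer before the final LLM prompt is built.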
The system may utilize various processors, including CPUs, GPUs, or TPUs, to execute instructions stored in memory for performing these tasks. The integration of LLMs and VLMs involves prompt engineering to facilitate data exchange between the models, enabling the creation of responses enriched with contextual data derived from visual inputs.
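As one illustration of the prompt-engineering step, detected items and their attributes or locations could be serialized into a text block that the LLM prompt carries. The item schema below (a list of dicts with `name` and `attributes` keys) is an assumption for this sketch, not a format specified in the application.

```python
def format_environment(items: list) -> str:
    # Serialize detected items and their attributes/locations into a
    # prompt-ready text block (hypothetical schema).
    lines = []
    for item in items:
        attrs = ", ".join(f"{k}={v}" for k, v in item.get("attributes", {}).items())
        lines.append(f"- {item['name']} ({attrs})" if attrs else f"- {item['name']}")
    return "Detected items:\n" + "\n".join(lines)

items = [
    {"name": "eggs", "attributes": {"location": "top shelf", "count": 6}},
    {"name": "spinach", "attributes": {"location": "crisper drawer"}},
]
```

The resulting block would be concatenated with the user's request when constructing the LLM prompt, so the generated response is enriched with contextual data derived from the visual inputs.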