Invention Title:

GENERATING TAILORED MULTI-MODAL RESPONSE(S) THROUGH UTILIZATION OF LARGE LANGUAGE MODEL(S) AND/OR OTHER GENERATIVE MODEL(S)

Publication number:

US20250217585

Section:

Physics

Class:

G06F40/20

Smart overview of the Invention

The patent application discusses a system for generating tailored multi-modal responses using large language models (LLMs). These responses integrate both textual and multimedia content to address specific user requests, such as creating slide presentations or assisting with tasks. The system processes natural language (NL) inputs from users and generates responses that are rendered on the user's device, ensuring the multimedia content is logically arranged with respect to the textual content and context of use.
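
To make the shape of such a tailored response concrete, the following minimal sketch models it as an ordered list of segments, each holding either generated text or a reference to a multimedia item, so a client device can render them in order. The class names (TextSegment, MediaSegment, MultiModalResponse) and their fields are illustrative assumptions, not terms taken from the application.

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class TextSegment:
    """A span of LLM-generated natural language text."""
    text: str

@dataclass
class MediaSegment:
    """A reference to multimedia content (image, video, etc.) rendered inline."""
    media_type: str          # e.g. "image" or "video" (assumed labels)
    uri: str                 # location of the asset
    caption: str = ""        # optional caption tying the asset to nearby text

@dataclass
class MultiModalResponse:
    """An ordered mix of text and media, rendered top-to-bottom on the device."""
    segments: List[Union[TextSegment, MediaSegment]] = field(default_factory=list)

    def render_outline(self) -> str:
        """Return a plain-text outline showing how segments would be arranged."""
        lines = []
        for seg in self.segments:
            if isinstance(seg, TextSegment):
                lines.append(f"[text]  {seg.text}")
            else:
                lines.append(f"[{seg.media_type}] {seg.uri}  ({seg.caption})")
        return "\n".join(lines)

# Example: media placed directly after the text it supports.
response = MultiModalResponse(segments=[
    TextSegment("Step 1: Connect the router to power."),
    MediaSegment("image", "https://example.com/router-power.png", "Power port location"),
    TextSegment("Step 2: Join the setup Wi-Fi network printed on the label."),
])
print(response.render_outline())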

Background

Large language models excel in natural language processing tasks like language generation and machine translation. They are trained on vast datasets, enabling them to generate responses to NL inputs. However, traditional multi-modal responses often lack contextual integration between text and multimedia, leading to a suboptimal user experience. This issue is particularly pronounced on devices with limited display capabilities or for users with input challenges.

Implementation

The described system enhances user interaction by aligning multimedia content with textual information in a contextually relevant manner. For instance, if a user requests help configuring a router, the system might generate step-by-step instructions, with each step presented as a slide containing both text and multimedia elements such as images or videos. This approach not only aids task completion but also conserves computational resources by reducing the number of follow-up inputs the user must provide.
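
As a rough illustration of that flow, the sketch below turns a how-to request into a sequence of per-step slides, each pairing one instruction with a media placeholder. The generate_steps stub stands in for whatever LLM call the actual system uses; its canned output, the Slide type, and the find_media helper are assumptions made purely for the example.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Slide:
    """One step of a task, shown as text plus an optional supporting asset."""
    step_number: int
    instruction: str
    media_uri: Optional[str] = None

def generate_steps(request: str) -> List[str]:
    """Placeholder for an LLM call that returns numbered instructions.

    A real system would prompt a large language model with the user's
    request; canned output is returned here so the sketch runs on its own.
    """
    return [
        "Connect the router's WAN port to your modem with an Ethernet cable.",
        "Plug in the router and wait for the status light to turn solid.",
        "Join the setup network and open the address printed on the label.",
    ]

def find_media(instruction: str) -> Optional[str]:
    """Placeholder media lookup: map an instruction to an illustrative asset."""
    if "cable" in instruction.lower():
        return "https://example.com/wan-cable.png"
    if "light" in instruction.lower():
        return "https://example.com/status-light.mp4"
    return None

def build_step_slides(request: str) -> List[Slide]:
    """Assemble one slide per generated step, attaching media where available."""
    return [
        Slide(step_number=i + 1, instruction=step, media_uri=find_media(step))
        for i, step in enumerate(generate_steps(request))
    ]

for slide in build_step_slides("Help me configure my new router"):
    print(slide)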

Applications

In scenarios where existing documents are available, such as user manuals, the system can extract relevant information using LLMs to generate concise, easy-to-follow instructions. These instructions are interwoven with multimedia content to enhance clarity and usability. Additionally, references to original sources can be included to verify information accuracy and mitigate potential errors from LLM-generated output.
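
The sketch below shows one way such grounding could work: relevant passages are pulled from a manual with a naive keyword-overlap score (standing in for whatever retrieval the system actually uses), condensed by a stubbed LLM call, and returned together with the source passages so the output can be checked against the original document. The function names and the scoring heuristic are assumptions for the example.

from dataclasses import dataclass
from typing import List

@dataclass
class GroundedInstruction:
    """A generated instruction plus the manual passages it was drawn from."""
    text: str
    sources: List[str]

def retrieve_passages(manual: List[str], request: str, top_k: int = 2) -> List[str]:
    """Rank manual passages by naive keyword overlap with the request."""
    request_words = set(request.lower().split())
    scored = sorted(
        manual,
        key=lambda passage: len(request_words & set(passage.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def summarize_with_llm(passages: List[str], request: str) -> str:
    """Placeholder for an LLM call that condenses passages into a short answer."""
    return "Press and hold the reset button for ten seconds until the light blinks."

def answer_from_manual(manual: List[str], request: str) -> GroundedInstruction:
    """Generate a concise instruction and keep references to its source passages."""
    passages = retrieve_passages(manual, request)
    return GroundedInstruction(
        text=summarize_with_llm(passages, request),
        sources=passages,
    )

manual = [
    "Section 4.2: To reset the device, press and hold the reset button for ten seconds.",
    "Section 2.1: The power adapter must be connected before initial setup.",
    "Section 4.3: The status light blinks while the device restarts.",
]
result = answer_from_manual(manual, "How do I reset the device?")
print(result.text)
print("Sources:", result.sources)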

Versatility

Beyond task assistance, the technology can generate presentation slides based on NL inputs. For example, a request for a slide deck about pizza history would result in a structured presentation with slides containing balanced text and multimedia elements. Speaker notes can be included to aid presenters. This process reduces interaction complexity and is particularly beneficial for devices with limited input capabilities or users with input difficulties.
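
To show what a slide-deck request might yield, the sketch below asks a stubbed LLM for a structured (JSON) outline and parses it into slide objects carrying a title, bullet text, a media suggestion, and speaker notes. The JSON schema, the generate_outline stub, and the DeckSlide fields are illustrative assumptions rather than the format described in the application.

import json
from dataclasses import dataclass
from typing import List

@dataclass
class DeckSlide:
    """One presentation slide with text, a media suggestion, and speaker notes."""
    title: str
    bullets: List[str]
    media_hint: str
    speaker_notes: str

def generate_outline(topic: str) -> str:
    """Placeholder for an LLM call asked to return a JSON slide outline."""
    return json.dumps([
        {
            "title": "Origins in Naples",
            "bullets": ["Flatbreads with toppings sold to workers",
                        "Tomatoes added in the 1700s"],
            "media_hint": "historic photo of a Neapolitan pizzeria",
            "speaker_notes": "Open with the street-food roots of pizza in Naples.",
        },
        {
            "title": "Pizza Goes Global",
            "bullets": ["Emigration spreads pizza worldwide",
                        "Regional styles emerge"],
            "media_hint": "map of regional pizza styles",
            "speaker_notes": "Contrast a few regional styles to keep the audience engaged.",
        },
    ])

def build_deck(topic: str) -> List[DeckSlide]:
    """Parse the structured outline into slide objects a renderer could display."""
    return [DeckSlide(**slide) for slide in json.loads(generate_outline(topic))]

for slide in build_deck("the history of pizza"):
    print(slide.title, "-", slide.speaker_notes)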