US20250005293
2025-01-02
Physics
G06F40/40
The patent application describes a system that integrates large language models (LLMs) and vision language models (VLMs) to facilitate enhanced human-computer interaction through multimodal dialogs. This approach combines visual data processing with natural language inputs to generate responsive content, thereby enabling a more dynamic and context-aware interaction with virtual assistants or chatbots.
LLMs are machine learning models trained on large text corpora to perform natural language processing tasks such as language generation and question answering. VLMs, in turn, handle tasks that combine visual data with natural language, such as image captioning and visual question answering, and are trained on datasets that pair images with text so the models learn to interpret both modalities together.
The system processes digital images via VLMs to generate outputs that reflect an environmental state. These outputs are used to construct LLM prompts alongside natural language inputs. The LLMs then process these prompts to generate content responsive to the initial natural language input; that content is subsequently rendered on output devices. This process allows the system to resolve ambiguities in user input through synthetic follow-up queries and the integration of visual context.
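The pipeline described above can be sketched in a few lines. This is an illustrative outline only, not the application's implementation: the function names (`describe_image`, `generate`, `build_prompt`, `respond`) and the stub return values are assumptions standing in for real VLM and LLM calls.

```python
def describe_image(image_bytes: bytes) -> str:
    """Stand-in for a VLM call that returns an environmental-state description."""
    return "refrigerator containing eggs, spinach, and cheddar cheese"

def generate(prompt: str) -> str:
    """Stand-in for an LLM call that produces responsive content."""
    return f"Response conditioned on: {prompt}"

def build_prompt(nl_input: str, vlm_output: str) -> str:
    # Per the application, the LLM prompt is constructed from both the
    # natural-language input and the VLM-derived environmental state.
    return f"Environment: {vlm_output}\nUser request: {nl_input}"

def respond(nl_input: str, image_bytes: bytes) -> str:
    # Image -> VLM output -> combined prompt -> LLM -> content for rendering.
    vlm_output = describe_image(image_bytes)
    prompt = build_prompt(nl_input, vlm_output)
    return generate(prompt)
```

In a deployed system the two stubs would be replaced by calls to actual VLM and LLM backends; the structural point is only that the VLM output and the user's text enter a single prompt.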
In practice, the system can detect items in an environment and their attributes or locations using processed images. For example, when a user asks for a dinner recipe, the system can generate follow-up queries like "What food is available?" and use VLMs to identify available ingredients from images of the user's environment, such as the contents of a refrigerator.
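The ambiguity-resolution loop in the recipe example can be sketched as follows. All names here are hypothetical, and the simple keyword rule in `synthesize_follow_up` merely stands in for an LLM deciding that the request is underspecified; the hard-coded VLM answer stands in for visual question answering over images of the environment.

```python
from typing import Optional

def synthesize_follow_up(nl_input: str) -> Optional[str]:
    # Stand-in for an LLM judging the request ambiguous and emitting a
    # synthetic follow-up query, as in the "dinner recipe" example.
    if "recipe" in nl_input.lower():
        return "What food is available?"
    return None

def answer_with_vlm(query: str, images: list) -> str:
    # Stand-in for a VLM answering the follow-up query from environment
    # images (e.g., the contents of a refrigerator).
    return "eggs, spinach, cheddar cheese"

def resolve(nl_input: str, images: list) -> str:
    # Augment the original request with visual context before prompting the LLM.
    follow_up = synthesize_follow_up(nl_input)
    if follow_up is None:
        return nl_input
    answer = answer_with_vlm(follow_up, images)
    return f"{nl_input} (available ingredients: {answer})"
```

Unambiguous requests pass through unchanged; ambiguous ones are enriched with the VLM's answer before the final LLM prompt is built.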
The system may utilize various processors, including CPUs, GPUs, or TPUs, to execute instructions stored in memory for performing these tasks. The integration of LLMs and VLMs involves prompt engineering to facilitate data exchange between the models, enabling the creation of responses enriched with contextual data derived from visual inputs.
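As one illustration of the prompt-engineering step, detected items and their attributes or locations could be serialized into a text block that the LLM prompt carries. The item schema below (a list of dicts with `name` and `attributes` keys) is an assumption for this sketch, not a format specified in the application.

```python
def format_environment(items: list) -> str:
    # Serialize detected items and their attributes/locations into a
    # prompt-ready text block (hypothetical schema).
    lines = []
    for item in items:
        attrs = ", ".join(f"{k}={v}" for k, v in item.get("attributes", {}).items())
        lines.append(f"- {item['name']} ({attrs})" if attrs else f"- {item['name']}")
    return "Detected items:\n" + "\n".join(lines)

items = [
    {"name": "eggs", "attributes": {"location": "top shelf", "count": 6}},
    {"name": "spinach", "attributes": {"location": "crisper drawer"}},
]
```

The resulting block would be concatenated with the user's request when constructing the LLM prompt, so the generated response is enriched with contextual data derived from the visual inputs.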