Invention Title:

SYSTEMS AND METHODS FOR INTERACTING WITH A LARGE LANGUAGE MODEL

Publication number:

US20250104243

Publication date:

Section:

Physics

Class:

G06T7/10

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The disclosed systems and methods are designed to enhance interaction with multimodal machine learning models, particularly through graphical user interfaces (GUIs). These methods involve displaying an image to a user, receiving a textual prompt, and generating input data that combines both the image and the text. The multimodal model then processes this input data to identify specific locations in the image, providing outputs that include location indications. These outputs are visually represented within the GUI to improve user interaction.
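The step of combining an image and a textual prompt into a single piece of input data can be sketched as follows. The function name and payload schema here are illustrative assumptions, not the format specified in the disclosure; real multimodal APIs define their own input schemas:

```python
import base64


def build_multimodal_input(image_bytes: bytes, prompt: str) -> dict:
    """Combine an image and a textual prompt into one input payload.

    The dict layout is a hypothetical example: the image is base64-encoded
    so the whole payload is text-serializable alongside the prompt.
    """
    return {
        "prompt": prompt,
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }


payload = build_multimodal_input(b"\x89PNG...", "Where is the cat?")
```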

Context and Challenges

Traditional machine learning models, especially multimodal large language models (LLMs), process various data types but often rely on text-based interactions. Users typically interact with these models via GUIs, which can be cumbersome, especially when precise selection of image elements is required. This process is challenging for users with visual impairments or those unfamiliar with GUI functionalities. Current models mainly provide textual responses, which can be resource-intensive and may not effectively guide users in identifying visual elements in images.

Innovative Solutions

The proposed solutions address these challenges by enabling multimodal LLMs to generate visual responses, enhancing user experience. The system uses prompt engineering to conditionally identify locations in images based on user prompts. Outputs include location indications that are visually emphasized within the GUI, aiding users in identifying specific image elements without relying solely on text descriptions.
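The prompt-engineering step described above could take the form of wrapping the user's prompt with an instruction that conditions the model to emit coordinates when the request refers to visual content. The instruction wording and function name below are hypothetical, a minimal sketch of the idea rather than the claimed implementation:

```python
def engineer_location_prompt(user_prompt: str) -> str:
    """Prepend an instruction that conditionally elicits image locations.

    The model is asked to emit a location indication only when the user's
    request actually refers to something visible in the image.
    """
    instruction = (
        "If the request refers to an element visible in the image, include "
        "its location as normalized coordinates (x, y) in your response, "
        "in addition to any textual answer."
    )
    return f"{instruction}\n\nUser: {user_prompt}"


engineered = engineer_location_prompt("Which button submits the form?")
```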

Technical Implementation

The system includes methods for displaying emphasis indicators at identified locations in images, for example by placing a cursor at the location or by rendering an updated image with the emphasis drawn in. Input data is generated by combining images with a spatial encoding that lets the model reference positions within the image. Outputs can pair textual responses with graphical emphasis at the indicated locations. The system also supports sequential location indications and image segmentation, allowing visual characteristics to be modified within specified segments.
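Rendering an emphasis indicator requires mapping a model-reported location back onto the displayed image. A minimal sketch, assuming the model returns normalized (0-1) coordinates and the GUI draws a circular highlight of fixed radius (both assumptions, not details from the disclosure):

```python
def emphasis_bbox(width: int, height: int,
                  norm_x: float, norm_y: float,
                  radius: int = 12) -> tuple:
    """Map a normalized model location to a pixel-space bounding box.

    Returns (left, top, right, bottom) for a circular emphasis indicator
    centered on the indicated location; a GUI layer would draw inside it.
    """
    cx = round(norm_x * width)
    cy = round(norm_y * height)
    return (cx - radius, cy - radius, cx + radius, cy + radius)


box = emphasis_bbox(640, 480, 0.5, 0.5)  # center of a 640x480 image
```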

System Architecture

The disclosed embodiments feature a system comprising processors and non-transitory computer-readable media containing instructions for executing operations such as providing GUIs associated with multimodal models. These systems are capable of processing user prompts and images to generate outputs that visually guide users within the interface. Additionally, server configurations are described for handling online requests from client devices, facilitating remote interaction with the machine learning model.
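The server-side handling of client requests might reduce to parsing a request carrying the prompt and image, invoking the model, and serializing the location output back to the client. The JSON shape and the `model` callable below are hypothetical stand-ins for whatever schema and inference backend a deployment actually uses:

```python
import json


def handle_request(raw_body: bytes, model) -> bytes:
    """Serve one client request against the multimodal model.

    `model` is any callable taking (prompt, image) and returning a dict
    with "text" and "location" keys; the JSON wire format is illustrative.
    """
    req = json.loads(raw_body)
    output = model(req["prompt"], req["image"])
    return json.dumps(
        {"text": output["text"], "location": output["location"]}
    ).encode("utf-8")


# Usage with a stand-in model that always "finds" the element:
fake_model = lambda prompt, image: {"text": "found", "location": [0.2, 0.3]}
response = handle_request(
    b'{"prompt": "find the logo", "image": "..."}', fake_model
)
```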