Invention Title:

VISUAL CHAIN-OF-THOUGHT REASONING FOR MULTIMODAL LANGUAGE MODELS

Publication number:

US20250278573

Publication date:

Section:

Physics

Class:

G06F40/40

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

Multimodal Large Language Models (MLLMs) are a class of artificial intelligence models designed to process and generate content across different forms of communication, or modalities, such as text and images. They excel at tasks that require understanding and producing information from multiple modalities simultaneously; common applications include image captioning and visual question answering, where textual and visual elements must be handled together. Despite these capabilities, MLLMs struggle with complex structured reasoning tasks that require integrating information across modalities.

Challenges in Structured Reasoning Tasks

Structured reasoning tasks involve multi-step problem solving and often require integrating text and image data. Current methods for improving MLLM performance on these tasks rely on Chain-of-Thought (CoT) reasoning, which decomposes a problem into intermediate steps. CoT reasoning is typically introduced in one of two ways: instruction fine-tuning or prompt engineering. Instruction fine-tuning is effective but demands extensive resources and training data. Prompt engineering is far less resource-intensive, yet existing CoT prompts were designed for unimodal (text-only) contexts and are therefore less effective on multimodal tasks.
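
For contrast with the visual prompts described below, a conventional unimodal CoT prompt produced by prompt engineering might look like the following sketch; the wording is illustrative and not taken from the disclosure.

    # A conventional text-only CoT prompt (hypothetical wording), shown for
    # contrast with the multimodal v-CoT prompt described in the next section.
    unimodal_cot_prompt = (
        "Q: A train travels 60 miles in 1.5 hours. What is its average speed?\n"
        "A: Let's think step by step."
    )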

Visual Chain-of-Thought (v-CoT) Reasoning

The proposed system introduces a Visual Chain-of-Thought (v-CoT) method that improves MLLM performance on structured reasoning tasks. A v-CoT prompt is generated by combining an input image with a natural language task description and a series of v-CoT instructions. These instructions direct the MLLM to first describe the information in the image that is relevant to the task, then reason over that information, and finally determine a solution. The solution is included in the model's output, which is presented on the user interface of the multimodal assistant system.
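
A minimal sketch of this prompt assembly is given below, assuming a generic chat-style MLLM interface that accepts interleaved image and text parts; the function name, payload structure, and instruction wording are illustrative assumptions, not taken from the disclosure.

    # Minimal sketch of v-CoT prompt assembly. The payload format and the
    # exact instruction wording are assumptions for illustration only.
    V_COT_INSTRUCTIONS = [
        "First, describe the information in the image that is relevant to the task.",
        "Next, reason step by step over the described information.",
        "Finally, use that reasoning to determine a solution to the task.",
    ]

    def build_v_cot_prompt(image_bytes: bytes, task_description: str) -> list:
        """Combine the input image, the natural language task description,
        and the v-CoT instructions into a single multimodal prompt."""
        return [
            {"type": "image", "data": image_bytes},
            {"type": "text", "text": task_description},
            {"type": "text", "text": "\n".join(V_COT_INSTRUCTIONS)},
        ]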

System Implementation

The system receives multimodal input, comprising an image and a task description, through a user interface. A prompt generating component creates the v-CoT prompt by integrating the input image, the task description, and the v-CoT instructions. The MLLM consumes this prompt and, applying CoT reasoning, generates an output that solves the task. The output is then displayed on the user interface, providing seamless interaction with complex multimodal data.
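
The end-to-end flow can be summarized with the following sketch, which reuses the hypothetical build_v_cot_prompt helper above; the mllm object and its generate method are assumptions, since the disclosure does not name a specific model interface.

    # Illustrative request flow for the multimodal assistant system.
    def handle_user_request(image_bytes: bytes, task_description: str, mllm) -> str:
        prompt = build_v_cot_prompt(image_bytes, task_description)  # prompt generating component
        output = mllm.generate(prompt)                              # MLLM applies v-CoT reasoning
        return output                                               # rendered on the user interface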

Technical Implications

This approach leverages existing MLLM capabilities while addressing their limitations on structured reasoning tasks. By introducing visual elements into CoT reasoning prompts, the system improves the model's ability to integrate visual information into its reasoning. Because it does not require the extensive resources of instruction fine-tuning, it offers a practical way to improve MLLM performance in complex multimodal scenarios.