Invention Title:

METHOD AND SYSTEM FOR QUERYING A VIDEO BY INCORPORATING VIDEO ANALYTICS AND LARGE LANGUAGE MODELS

Publication number:

US20250299491

Publication date:

Section:

Physics

Class:

G06V20/52

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The patent application introduces a method and system that leverage video analytics and large language models (LLMs) to enhance video querying. It improves the interaction between users and video content by enabling natural-language queries about objects within video streams. The system uses video analytics to extract relevant features from video frames and incorporates these features into prompts for an LLM, which then generates responses to the user's query.
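
As a rough illustration of this flow, the sketch below combines analytics-derived detections with a natural-language question to form a text prompt for an LLM. The names used here (DetectedObject, build_prompt and the example fields) are hypothetical, chosen for illustration rather than taken from the application itself.

    from dataclasses import dataclass

    @dataclass
    class DetectedObject:
        object_id: int
        label: str            # e.g. "person", "vehicle"
        first_seen_s: float   # time the object enters the scene
        last_seen_s: float    # time the object leaves the scene

    def build_prompt(user_query: str, objects: list) -> str:
        """Combine the user's question with a textual summary of analytics output."""
        lines = [
            f"- {o.label} #{o.object_id}, visible {o.first_seen_s:.0f}s to {o.last_seen_s:.0f}s"
            for o in objects
        ]
        return (
            "Objects detected by video analytics:\n"
            + "\n".join(lines)
            + f"\n\nUser question: {user_query}"
        )

    # Example usage with made-up analytics output:
    objects = [
        DetectedObject(1, "person", 3.0, 41.0),
        DetectedObject(2, "vehicle", 10.0, 25.0),
    ]
    print(build_prompt("Did a vehicle enter the parking lot?", objects))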

Technical Field

The invention pertains to video surveillance, specifically to enhancing surveillance systems with advanced video analytics and LLMs. It addresses the limitations of current LLMs in processing video data, which stem from computational constraints, by presenting a method that combines the strengths of both technologies for efficient video content analysis.

Background

Traditional closed-circuit television (CCTV) systems are limited to transmitting signals to specific monitors and lack open broadcast capabilities. Video content analytics (VCA) enhances these systems by automatically analyzing video for events. However, VCA alone cannot efficiently handle complex queries without additional computational resources, and LLMs, while powerful at natural language processing, struggle to process high-frame-rate video directly because of their model size and processing demands.

Invention Summary

The proposed system addresses these challenges by utilizing video analytics to process and summarize video data, which is then used to generate prompts for LLMs. This involves receiving a user query about objects in a video stream, selecting relevant objects through video analytics, and forming a prompt that combines user input with video analytics data. The LLM processes this prompt to provide a detailed response, effectively bridging the gap between raw video data and user-understandable information.
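
A minimal sketch of the selection and prompt-forming steps described above, assuming analytics output arrives as simple dictionaries; the keyword match used for relevance is only a stand-in for whatever selection logic the claimed method actually applies, and the function names are illustrative.

    def select_relevant_objects(user_query: str, detections: list) -> list:
        """Keep only detections whose class label appears in the user's question."""
        terms = set(user_query.lower().split())
        return [d for d in detections if d["label"].lower() in terms]

    def form_prompt(user_query: str, relevant: list) -> str:
        """Merge the query with a short textual summary of the selected objects."""
        summary = "; ".join(f'{d["label"]} seen at {d["timestamp_s"]}s' for d in relevant)
        return (
            f"Video analytics summary: {summary or 'no matching objects'}.\n"
            f"User question: {user_query}"
        )

    # Example with made-up detections:
    detections = [
        {"label": "person", "timestamp_s": 12},
        {"label": "dog", "timestamp_s": 30},
    ]
    relevant = select_relevant_objects("When did the person appear?", detections)
    print(form_prompt("When did the person appear?", relevant))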

Detailed Description

The system efficiently processes input from both recorded and live video streams, using video analytics to track and classify objects within the scene. The analytics component provides metadata such as object trajectories, classifications, and activities, which are crucial for forming comprehensive prompts for LLMs. This approach allows the LLM to focus on key information without processing every frame, conserving computational resources. Users can interactively refine queries based on initial LLM responses, enhancing the system's usability in dynamic surveillance environments.
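
The following sketch illustrates one way such per-object metadata and follow-up questions could be carried through an interactive refinement loop. The ask_llm callable, the metadata summary format, and the function name are assumptions made for illustration, not the application's specified interface.

    from typing import Callable, List

    def refine_query(ask_llm: Callable[[str], str],
                     metadata_summary: str,
                     questions: List[str]) -> List[str]:
        """Send each follow-up question together with the analytics summary and prior turns."""
        history: List[str] = []
        answers: List[str] = []
        for question in questions:
            prompt = metadata_summary + "\n" + "\n".join(history) + f"\nUser: {question}"
            answer = ask_llm(prompt)
            history.extend([f"User: {question}", f"Assistant: {answer}"])
            answers.append(answer)
        return answers

    # Example with a stubbed model call and made-up metadata:
    metadata_summary = (
        "Object 7 (person): trajectory gate -> lobby, activity: loitering, 14:02-14:09."
    )
    answers = refine_query(
        lambda prompt: "(model response)",  # stand-in for a real LLM call
        metadata_summary,
        ["Who loitered near the gate?", "How long did they stay?"],
    )
    print(answers)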