US20250292574
2025-09-18
Physics
G06V20/49
The patent application describes a scene-parsing method that processes images with trained machine learning models. An input image is partitioned into multiple patches, and each patch is analyzed to produce a semantic interpretation of its content. The pipeline combines an image segmentation model with a vision transformer to handle the complexity of visual data.
Initially, the input image is divided into several smaller sections, or patches, using a trained image segmentation model. This partitioning enables detailed examination of different parts of the image, since each smaller, more manageable component can be analyzed more precisely than the image as a whole.
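The application does not tie this partitioning step to a particular architecture. As a minimal sketch, assuming a segmentation model that returns per-pixel class logits, the patches could be extracted as bounding-box crops of each predicted segment; all names and shapes below are illustrative, not taken from the filing.

```python
import torch

def partition_into_patches(image: torch.Tensor, seg_model) -> list:
    """Split an image tensor (C, H, W) into one patch per segment.

    `seg_model` stands in for the trained image segmentation model;
    it is assumed to map a (1, C, H, W) batch to (1, K, H, W) logits
    over K segment classes. This helper is hypothetical.
    """
    with torch.no_grad():
        logits = seg_model(image.unsqueeze(0))    # (1, K, H, W)
    label_map = logits.argmax(dim=1).squeeze(0)   # (H, W) segment ids

    patches = []
    for seg_id in label_map.unique():
        ys, xs = (label_map == seg_id).nonzero(as_tuple=True)
        # Tight bounding-box crop around the pixels of this segment.
        patches.append(image[:, ys.min():ys.max() + 1,
                             xs.min():xs.max() + 1])
    return patches
```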
After partitioning the image, each patch is transformed into a patch embedding using a vision transformer model. These embeddings are multidimensional numerical representations that capture essential features and characteristics of each patch, enabling further processing and interpretation.
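The filing does not specify the transformer configuration. The sketch below shows only the ViT-style embedding step, resizing each variable-size patch to a fixed resolution and projecting the flattened pixels to a D-dimensional vector; a full vision transformer would additionally pass all patch tokens through self-attention layers. The sizes here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbedder(nn.Module):
    """ViT-style patch embedding sketch. The 32x32 input size and
    768-dim output are illustrative; the application does not fix them."""

    def __init__(self, patch_size: int = 32, channels: int = 3,
                 embed_dim: int = 768):
        super().__init__()
        self.size = (patch_size, patch_size)
        self.proj = nn.Linear(channels * patch_size * patch_size, embed_dim)

    def forward(self, patch: torch.Tensor) -> torch.Tensor:
        # Normalize every patch to the same resolution, then flatten
        # and linearly project, as in the ViT tokenizer.
        x = F.interpolate(patch.unsqueeze(0), size=self.size,
                          mode="bilinear", align_corners=False)
        return self.proj(x.flatten(start_dim=1)).squeeze(0)  # (embed_dim,)
```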
The next step involves generating word embeddings from the patch embeddings using a trained patch-label similarity model. These word embeddings provide a semantic representation of the visual information contained within each patch, bridging the gap between visual data and textual interpretation.
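The patch-label similarity model is likewise unspecified. One plausible reading, shown below as an assumption rather than the claimed design, is a CLIP-style joint space: each patch embedding is projected into the word-embedding space and compared against a vocabulary of word vectors, yielding a similarity-weighted word embedding for that patch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchLabelSimilarity(nn.Module):
    """Hypothetical patch-label similarity model: a soft nearest-word
    lookup over an assumed vocabulary of trained word vectors."""

    def __init__(self, embed_dim: int = 768, word_dim: int = 300,
                 vocab_size: int = 10_000):
        super().__init__()
        self.proj = nn.Linear(embed_dim, word_dim)
        # One word vector per candidate label; randomly initialized
        # here, trained jointly in practice.
        self.words = nn.Parameter(torch.randn(vocab_size, word_dim))

    def forward(self, patch_embedding: torch.Tensor) -> torch.Tensor:
        query = self.proj(patch_embedding)               # (word_dim,)
        sims = F.cosine_similarity(query.unsqueeze(0),
                                   self.words, dim=1)    # (vocab_size,)
        return sims.softmax(dim=0) @ self.words          # (word_dim,)
```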
Finally, a trained label prediction model utilizes the generated word embeddings to produce a text label that corresponds to the original input image. This label effectively summarizes the content or scene depicted in the image, making it easier to categorize and understand without visual inspection.
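A simple way to realize this final step, again as a sketch under assumed dimensions rather than the claimed design, is to pool the per-patch word embeddings and score a fixed set of candidate scene labels.

```python
import torch
import torch.nn as nn

class LabelPredictor(nn.Module):
    """Hypothetical label prediction head: mean-pools the per-patch
    word embeddings and scores candidate scene labels."""

    def __init__(self, word_dim: int = 300, num_labels: int = 50):
        super().__init__()
        self.classifier = nn.Linear(word_dim, num_labels)

    def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
        pooled = word_embeddings.mean(dim=0)   # (word_dim,)
        return self.classifier(pooled)         # (num_labels,) logits


# End-to-end sketch; `seg_model`, `image`, and `LABELS` are assumed inputs.
# embedder, sim, head = PatchEmbedder(), PatchLabelSimilarity(), LabelPredictor()
# patches = partition_into_patches(image, seg_model)
# words = torch.stack([sim(embedder(p)) for p in patches])
# label = LABELS[head(words).argmax().item()]
```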