US20250292574
2025-09-18
Physics
G06V20/49
The patent application describes a scene-parsing method that processes images with trained machine learning models. An input image is partitioned into multiple patches, and each patch is analyzed to produce a semantic interpretation of its content. The pipeline combines an image segmentation model with a vision transformer to handle the complexity of visual data.
Initially, the input image is divided into several smaller sections, or patches, using a trained image segmentation model. This partitioning enables detailed examination of different parts of the image, since each smaller, more manageable component can be analyzed more precisely than the image as a whole.
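The application does not tie this partitioning step to a particular architecture. As a minimal sketch, assuming a segmentation model that returns per-pixel class logits, the patches could be extracted as bounding-box crops of each predicted segment; all names and shapes below are illustrative, not taken from the filing.

```python
import torch

def partition_into_patches(image: torch.Tensor, seg_model) -> list:
    """Split an image tensor (C, H, W) into one patch per segment.

    `seg_model` stands in for the trained image segmentation model;
    it is assumed to map a (1, C, H, W) batch to (1, K, H, W) logits
    over K segment classes. This helper is hypothetical.
    """
    with torch.no_grad():
        logits = seg_model(image.unsqueeze(0))    # (1, K, H, W)
    label_map = logits.argmax(dim=1).squeeze(0)   # (H, W) segment ids

    patches = []
    for seg_id in label_map.unique():
        ys, xs = (label_map == seg_id).nonzero(as_tuple=True)
        # Tight bounding-box crop around the pixels of this segment.
        patches.append(image[:, ys.min():ys.max() + 1,
                             xs.min():xs.max() + 1])
    return patches
```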
After partitioning the image, each patch is transformed into a patch embedding using a vision transformer model. These embeddings are multidimensional numerical representations that capture essential features and characteristics of each patch, enabling further processing and interpretation.
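The filing does not specify the transformer configuration. The sketch below shows only the ViT-style embedding step, resizing each variable-size patch to a fixed resolution and projecting the flattened pixels to a D-dimensional vector; a full vision transformer would additionally pass all patch tokens through self-attention layers. The sizes here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbedder(nn.Module):
    """ViT-style patch embedding sketch. The 32x32 input size and
    768-dim output are illustrative; the application does not fix them."""

    def __init__(self, patch_size: int = 32, channels: int = 3,
                 embed_dim: int = 768):
        super().__init__()
        self.size = (patch_size, patch_size)
        self.proj = nn.Linear(channels * patch_size * patch_size, embed_dim)

    def forward(self, patch: torch.Tensor) -> torch.Tensor:
        # Normalize every patch to the same resolution, then flatten
        # and linearly project, as in the ViT tokenizer.
        x = F.interpolate(patch.unsqueeze(0), size=self.size,
                          mode="bilinear", align_corners=False)
        return self.proj(x.flatten(start_dim=1)).squeeze(0)  # (embed_dim,)
```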
The next step involves generating word embeddings from the patch embeddings using a trained patch-label similarity model. These word embeddings provide a semantic representation of the visual information contained within each patch, bridging the gap between visual data and textual interpretation.
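The patch-label similarity model is likewise unspecified. One plausible reading, shown below as an assumption rather than the claimed design, is a CLIP-style joint space: each patch embedding is projected into the word-embedding space and compared against a vocabulary of word vectors, yielding a similarity-weighted word embedding for that patch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchLabelSimilarity(nn.Module):
    """Hypothetical patch-label similarity model: a soft nearest-word
    lookup over an assumed vocabulary of trained word vectors."""

    def __init__(self, embed_dim: int = 768, word_dim: int = 300,
                 vocab_size: int = 10_000):
        super().__init__()
        self.proj = nn.Linear(embed_dim, word_dim)
        # One word vector per candidate label; randomly initialized
        # here, trained jointly in practice.
        self.words = nn.Parameter(torch.randn(vocab_size, word_dim))

    def forward(self, patch_embedding: torch.Tensor) -> torch.Tensor:
        query = self.proj(patch_embedding)               # (word_dim,)
        sims = F.cosine_similarity(query.unsqueeze(0),
                                   self.words, dim=1)    # (vocab_size,)
        return sims.softmax(dim=0) @ self.words          # (word_dim,)
```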
Finally, a trained label prediction model utilizes the generated word embeddings to produce a text label that corresponds to the original input image. This label effectively summarizes the content or scene depicted in the image, making it easier to categorize and understand without visual inspection.
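A simple way to realize this final step, again as a sketch under assumed dimensions rather than the claimed design, is to pool the per-patch word embeddings and score a fixed set of candidate scene labels.

```python
import torch
import torch.nn as nn

class LabelPredictor(nn.Module):
    """Hypothetical label prediction head: mean-pools the per-patch
    word embeddings and scores candidate scene labels."""

    def __init__(self, word_dim: int = 300, num_labels: int = 50):
        super().__init__()
        self.classifier = nn.Linear(word_dim, num_labels)

    def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
        pooled = word_embeddings.mean(dim=0)   # (word_dim,)
        return self.classifier(pooled)         # (num_labels,) logits


# End-to-end sketch; `seg_model`, `image`, and `LABELS` are assumed inputs.
# embedder, sim, head = PatchEmbedder(), PatchLabelSimilarity(), LabelPredictor()
# patches = partition_into_patches(image, seg_model)
# words = torch.stack([sim(embedder(p)) for p in patches])
# label = LABELS[head(words).argmax().item()]
```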