US20240144651
2024-05-02
Physics
G06V10/764
A method and apparatus are introduced for enhancing vision-language understanding through a machine learning model. The focus is on training a model that can classify images based on both known and novel classes. This is achieved by utilizing a structured training dataset that includes various class names, which serves as the foundation for the model's learning process.
The training process generates augmented textual prompts for each class name in the dataset. These prompts are input into a frozen text encoder, which outputs text embeddings representing the class names. The method also concatenates learnable soft prompts to the class names to produce a second set of embeddings, then minimizes a cross-entropy loss between the soft-prompt embeddings and the textual-prompt embeddings. Because the text encoder is frozen, training adjusts the learnable soft prompts rather than the encoder weights, and the loss keeps each learned prompt close to the textual prompts of its own class.
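The loss described above can be illustrated with a minimal sketch. It is not the claimed implementation: the function name `text_to_text_cross_entropy`, the cosine-similarity logits, and the `temperature` value are all assumptions made for illustration, and the embeddings are toy arrays standing in for the outputs of the frozen text encoder.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def text_to_text_cross_entropy(soft_emb, prompt_emb, prompt_labels, temperature=0.07):
    """Cross-entropy between textual-prompt and soft-prompt embeddings.

    soft_emb:      (C, D) one embedding per class (soft prompt + class name).
    prompt_emb:    (N, D) embeddings of the augmented textual prompts.
    prompt_labels: (N,)   class index of each augmented textual prompt.
    The temperature is an assumed hyperparameter, not taken from the source.
    """
    soft = l2_normalize(np.asarray(soft_emb, dtype=float))
    text = l2_normalize(np.asarray(prompt_emb, dtype=float))
    logits = text @ soft.T / temperature                 # (N, C)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    labels = np.asarray(prompt_labels)
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy data: three classes whose soft-prompt embeddings are orthogonal.
soft_emb = np.eye(3)
prompt_emb = np.eye(3)  # one textual prompt per class
aligned_loss = text_to_text_cross_entropy(soft_emb, prompt_emb, [0, 1, 2])
shuffled_loss = text_to_text_cross_entropy(soft_emb, prompt_emb, [1, 2, 0])
```

In this toy setup the loss is small when each textual prompt is paired with the soft-prompt embedding of its own class and large when the pairing is shuffled, which is the signal a training loop would use to update the soft prompts.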
An apparatus is also described for utilizing the trained vision-language model to classify images. This includes an interface for receiving input text data containing class names, storage for class names and their corresponding embeddings, and processing capabilities to generate and compare embeddings. If a generated embedding is novel, it can be added to the existing list in storage, enhancing the model's adaptability.
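The storage and comparison behaviour of such an apparatus can be sketched as follows. Everything named here is hypothetical: `toy_text_encoder` is a deterministic stand-in for the trained model's text branch, `ClassEmbeddingStore` is an illustrative container, and treating an embedding as novel when its cosine similarity to every stored embedding falls below `novelty_threshold` is an assumption, since the source does not say how novelty is decided.

```python
import zlib
import numpy as np

def toy_text_encoder(class_name, dim=16):
    # Hypothetical stand-in for the trained text branch: a deterministic
    # unit vector derived from the class name.
    rng = np.random.default_rng(zlib.crc32(class_name.encode("utf-8")))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class ClassEmbeddingStore:
    """Keeps class names with their embeddings and classifies by similarity."""

    def __init__(self, novelty_threshold=0.999):
        self.names = []
        self.embeddings = []
        self.novelty_threshold = novelty_threshold  # assumed novelty criterion

    def is_novel(self, emb):
        # Novel if no stored embedding is (nearly) identical to this one.
        if not self.embeddings:
            return True
        sims = np.stack(self.embeddings) @ emb
        return bool(sims.max() < self.novelty_threshold)

    def add_if_novel(self, name, emb):
        # Append the embedding only when it is not already represented.
        if self.is_novel(emb):
            self.names.append(name)
            self.embeddings.append(emb)
            return True
        return False

    def classify(self, image_emb):
        # Return the stored class whose embedding is most similar to the image.
        image_emb = image_emb / np.linalg.norm(image_emb)
        sims = np.stack(self.embeddings) @ image_emb
        return self.names[int(sims.argmax())]

store = ClassEmbeddingStore()
for name in ["cat", "dog", "cat", "zebra"]:
    store.add_if_novel(name, toy_text_encoder(name))

# A real system would use an image encoder here; the toy text embedding
# is reused as the "image" embedding purely to exercise the lookup.
predicted = store.classify(toy_text_encoder("dog"))
```

The second "cat" is skipped because its embedding is already stored, so adding a class requires only a new text embedding and no change to stored entries.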
A key challenge in prompt learning is base class overfitting, where accuracy improves for trained classes but drops for unseen classes. The proposed solution combines soft prompt learning with traditional prompt engineering by incorporating a cross-entropy loss that aligns learned prompts with textual prompts. This dual approach aims to enhance recognition accuracy across both base and novel classes.
The techniques target soft prompt learning for vision-language models and the base class overfitting that accompanies it. By combining learnable prompts with textual information, the method aims to improve performance in zero-shot scenarios, making the model more robust for downstream image classification tasks.