Invention Title:

COMMON SENSE REASONING FOR DEEPFAKE DETECTION

Publication number:

US20250225773

Section:

Physics

Class:

G06V10/7715


Overview of the Invention

The disclosed method addresses the challenge of detecting deepfake images by utilizing a system that integrates both visual and textual analysis. This approach allows users to input a textual inquiry along with an image, which is then processed by a deepfake detection model. The model includes components such as an image encoder, a text encoder, and a language model to generate a comprehensive textual analysis that determines the authenticity of the image and identifies specific visual features contributing to this classification.
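The patent summary names three components (an image encoder, a text encoder, and a language model) but does not specify their interfaces. The following minimal sketch illustrates how such a pipeline could be wired together; every function name and body here is a hypothetical stand-in, not the patented implementation.

```python
from typing import List

def image_encoder(image_pixels: List[float]) -> List[float]:
    """Stand-in for a visual backbone that maps pixels to an embedding.
    Here it simply averages the pixel values (illustrative only)."""
    return [sum(image_pixels) / max(len(image_pixels), 1)]

def text_encoder(question: str) -> List[float]:
    """Stand-in for a text backbone that maps the user's question
    to an embedding (here, just its length)."""
    return [float(len(question))]

def language_model(image_emb: List[float], text_emb: List[float]) -> str:
    """Stand-in for the generative language model that fuses both
    embeddings into a free-form textual analysis. A real system would
    decode an explanation; this placeholder picks a canned answer."""
    if image_emb[0] < 0.5:
        return "The image looks fake because of blurred facial boundaries."
    return "The image looks authentic."

def detect_deepfake(image_pixels: List[float], question: str) -> str:
    """End-to-end flow described in the overview: encode the image and
    the textual inquiry, then generate a textual answer."""
    return language_model(image_encoder(image_pixels), text_encoder(question))
```

The key design point conveyed by the summary is that the output is generated text rather than a bare real/fake label, so the language model sits at the end of the pipeline and consumes both modalities.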

Background

The prevalence of generative machine-learning techniques has significantly increased the production of deepfakes, posing potential risks related to misinformation and security threats. Traditional deepfake detection methods primarily function as binary classifiers, using techniques such as convolutional neural networks (CNNs). However, these methods often cannot provide detailed textual explanations for their decisions, which is crucial for understanding why an image is classified as fake. The current need is for detection models that can incorporate common-sense reasoning to explain anomalies in images.

Summary of the Invention

The proposed system extends deepfake detection from merely identifying fake images to a more nuanced task called Deepfake Detection Visual Question Answering (DD-VQA). This task involves generating answers that not only state whether an image is fake but also provide textual explanations grounded in common-sense knowledge. The DD-VQA task encourages models to focus on cognitive aspects of authenticity, moving beyond simple recognition-level features. Users can pose specific questions about facial components, allowing the model to simulate human intuition in explaining its classification.
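To make the task format concrete, the following sketch shows what general and component-specific DD-VQA question-answer pairs might look like; the question and answer texts are invented for illustration and are not drawn from the patent's actual dataset.

```python
# Hypothetical DD-VQA examples: each answer states a decision plus a
# common-sense reason, rather than a bare real/fake label.
dd_vqa_examples = [
    {
        # General question about the whole face.
        "question": "Does this face look real?",
        "answer": "No, the face looks fake because the skin texture "
                  "is unnaturally smooth.",
    },
    {
        # Component-specific question about a facial region.
        "question": "Do the eyes look real?",
        "answer": "No, the eyes look fake because their reflections "
                  "are inconsistent with each other.",
    },
]

def is_fake(answer: str) -> bool:
    """Toy decision extraction: read the yes/no verdict from the free-form
    answer, so the same output supports both classification and explanation."""
    return answer.strip().lower().startswith("no")
```

This dual structure (decision plus reason) is what distinguishes DD-VQA from the binary-classifier formulation described in the background.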

Dataset and Model Training

To train the deepfake detection model effectively, a novel dataset named DD-VQA is introduced, comprising image-question-answer triplets. This dataset includes images sourced from public databases such as FaceForensics++ and features questions designed to probe both general and component-specific aspects of image authenticity. Annotators provide both an authenticity decision and a reason grounded in common-sense knowledge. The system further enhances model training through contrastive learning, improving its ability to distinguish between real and fake images across the image and text modalities.
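The summary states only that contrastive learning is used, not which loss. A common formulation for pulling matched cross-modal embeddings together and pushing mismatched ones apart is an InfoNCE-style loss, sketched below in pure Python as an assumption about how such training could work, not as the patent's specific formulation.

```python
import math
from typing import List

def dot(u: List[float], v: List[float]) -> float:
    """Inner product between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor: List[float],
             positive: List[float],
             negatives: List[List[float]],
             temperature: float = 0.1) -> float:
    """InfoNCE-style contrastive loss: low when the anchor (e.g. a fake
    image embedding) is closer to its positive (e.g. the matching 'fake'
    text embedding) than to the negatives (e.g. real-image embeddings)."""
    pos = math.exp(dot(anchor, positive) / temperature)
    negs = sum(math.exp(dot(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + negs))
```

Under this kind of objective, embeddings of fake images and their textual explanations cluster together while being separated from real-image embeddings, which is one plausible way the cross-modal discrimination described above could be realized.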

Conclusion

The introduction of the DD-VQA task and its corresponding dataset represents a significant advancement in deepfake detection, enabling models to provide decisions supported by textual explanations based on common-sense reasoning. The multi-modal Transformer model serves as a benchmark, with enhanced representation learning through contrastive learning formulations. This approach improves both the performance and generalization ability of deepfake detection models, offering better interpretability alongside detection accuracy.