US20250005925
2025-01-02
Physics
G06V20/41
A multi-modal framework is disclosed for detecting video manipulations, commonly known as deepfakes. The approach combines audio and video data to improve detection accuracy. The framework uses two distinct pipelines: a vision encoder and an audio-video encoder. The vision encoder identifies facial artifacts, while the audio-video encoder detects inconsistencies between spoken words and lip movements. Together, the two modalities strengthen the detection process.
Deepfakes leverage deep learning to create manipulated media, posing a significant threat to the authenticity of digital content. The rise of deepfake technologies, such as generative adversarial networks (GANs), has made it difficult to distinguish real media from fake. Traditional detection methods typically analyze either audio or video alone, lacking the combined analysis needed for comprehensive detection; this limitation stems in part from the scarcity of datasets containing both modalities. Thus, a need exists for methods that integrate audio and video for real-time detection.
The disclosed method processes a video comprising multiple frames and accompanying audio using a processor. A first neural network applies self-attention mechanisms to the video frames to identify facial artifacts, producing a first embedding. A second neural network applies cross-attention mechanisms to compare lip movements with the audio content, producing a second embedding. The two embeddings are then used together to determine whether the video has been altered.
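The two attention mechanisms described above can be sketched as follows. This is an illustrative simplification, not the patent's actual implementation: it shows scaled dot-product attention applied in self-attention form (frame features attending to themselves) and in cross-attention form (lip-region features as queries against audio features as keys and values). All dimensions, feature sources, and names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16                                 # assumed feature dimension
frames = rng.standard_normal((8, d))   # 8 per-frame feature vectors
lips   = rng.standard_normal((8, d))   # lip-region features per frame
audio  = rng.standard_normal((20, d))  # 20 audio feature steps

# First network: self-attention over frames, pooled into a first embedding.
first_embedding = attention(frames, frames, frames).mean(axis=0)

# Second network: cross-attention, lip queries attend to audio,
# pooled into a second embedding.
second_embedding = attention(lips, audio, audio).mean(axis=0)

print(first_embedding.shape, second_embedding.shape)  # (16,) (16,)
```

Mean-pooling over the sequence axis is one simple way to reduce attention outputs to fixed-size embeddings; the patent does not specify the pooling strategy.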
The multi-modal model integrates a vision encoder and an audio+lip encoder to detect deepfakes effectively. The vision encoder processes video content to find facial anomalies using self-attention mechanisms, while the audio+lip encoder assesses both audio and video inputs for lip-audio discrepancies via cross-attention mechanisms. This dual-modality approach outperforms traditional single-modality methods in detecting manipulated videos.
An exemplary computing device suitable for implementing this model includes a processor, memory, display screen, user interface, and network communications module. This device can be configured as various forms of computers or mobile devices to perform the described operations. The processor executes instructions stored in memory to enable the device's functionality, while the display and user interface facilitate interaction with the system.