Invention Title:

MACHINE-LEARNING-BASED DETECTION OF FAKE VIDEOS

Publication number:

US20250166358

Section:

Physics

Class:

G06V10/774

Smart overview of the Invention

The patent application pertains to methods for detecting fake videos, particularly those generated using AI technologies. These techniques address the growing societal issue of AI-generated content being used for malicious purposes, such as spreading disinformation. The focus is on identifying discrepancies between audio and visual data in videos, which can help determine their authenticity.

Background

Generative AI has advanced to the point of producing high-quality multimedia content, but it also poses risks when used for deceitful activities. AI-generated videos can be misleading because they may combine real and synthetic elements across the audio and visual modalities. Detection methods that analyze only one modality can therefore overlook fake content present in the other. In addition, current models typically rely on supervised learning, which limits their ability to generalize beyond the specific audio-visual correspondences found in their training datasets.

Summary

The application introduces a machine-learning approach for training models to detect fake videos. The model comprises visual and audio encoders, along with networks that translate embeddings between modalities: an audio-to-visual (A2V) network and a visual-to-audio (V2A) network. By generating sequences of image and audio embeddings and updating them with synthetic counterparts produced from the opposite modality, the system learns intrinsic audio-visual correspondences. This cross-modal learning enables the model to distinguish real from fake videos with improved accuracy.
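
The arrangement described above can be illustrated with a compact PyTorch sketch. This is not the patented implementation: the module names (VisualEncoder, AudioEncoder, CrossModalNet), the shared embedding width, and the tile and audio-segment shapes are assumptions made for the example; only the overall structure of two encoders plus A2V and V2A translation networks follows the description.

```python
# Minimal sketch of the model components, assuming PyTorch. Module names,
# layer choices, the 256-dim shared embedding, and the tile/segment sizes
# are illustrative assumptions, not details from the application.
import torch.nn as nn

EMB_DIM = 256  # assumed shared embedding width for both modalities


class VisualEncoder(nn.Module):
    """Encodes a sequence of image tiles into visual embeddings."""

    def __init__(self, tile_dim: int = 3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(start_dim=2),       # (B, T, 3, 32, 32) -> (B, T, tile_dim)
            nn.Linear(tile_dim, EMB_DIM),
            nn.ReLU(),
            nn.Linear(EMB_DIM, EMB_DIM),
        )

    def forward(self, tiles):              # tiles: (B, T, 3, 32, 32)
        return self.net(tiles)             # -> (B, T, EMB_DIM)


class AudioEncoder(nn.Module):
    """Encodes a sequence of audio segments into audio embeddings."""

    def __init__(self, segment_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(segment_dim, EMB_DIM),
            nn.ReLU(),
            nn.Linear(EMB_DIM, EMB_DIM),
        )

    def forward(self, segments):           # segments: (B, T, segment_dim)
        return self.net(segments)          # -> (B, T, EMB_DIM)


class CrossModalNet(nn.Module):
    """Translates embeddings of one modality into synthetic embeddings
    of the other modality (used for both A2V and V2A)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMB_DIM, EMB_DIM),
            nn.GELU(),
            nn.Linear(EMB_DIM, EMB_DIM),
        )

    def forward(self, emb):
        return self.net(emb)


visual_encoder = VisualEncoder()
audio_encoder = AudioEncoder()
a2v = CrossModalNet()  # audio embedding -> synthetic visual embedding
v2a = CrossModalNet()  # visual embedding -> synthetic audio embedding
```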

Technical Advantages

The described techniques offer notable technical benefits, such as reducing the computational resources required for fake-video detection. The cross-modal learning method yields broadly applicable models capable of analyzing diverse video types, and classifiers trained this way can accurately interpret a wide range of audio-visual correspondences, identifying videos with any combination of real or fake audio and visuals.

Training Methodology

The training process involves generating image tiles and audio segments from input videos, encoding them into embedding sequences, and transforming those sequences with the A2V and V2A networks. The updated embeddings are then decoded to reconstruct the original data segments, and the model is optimized with a dual-objective loss function. Finally, the classifier is trained on labeled videos to classify them from combined sequences of real and synthetic embeddings, using a cross-entropy loss objective.
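
As a rough illustration of these two training stages, the sketch below reuses the encoders and A2V/V2A networks from the previous snippet. It pairs a dual-objective reconstruction loss, in which each modality is decoded from embeddings synthesized out of the opposite modality, with a cross-entropy classification objective over combined real and synthetic embedding sequences. The decoders, the temporal pooling, the loss weighting, and the classifier head are hypothetical choices for the example, not details from the application.

```python
# Hedged sketch of the two training objectives, reusing visual_encoder,
# audio_encoder, a2v, v2a, and EMB_DIM from the previous snippet. Decoders,
# pooling, loss weighting, and the classifier head are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

visual_decoder = nn.Linear(EMB_DIM, 3 * 32 * 32)  # embedding -> image tile
audio_decoder = nn.Linear(EMB_DIM, 1024)          # embedding -> audio segment
classifier = nn.Sequential(                        # real-vs-fake head
    nn.Linear(2 * EMB_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
)


def reconstruction_loss(tiles, segments):
    """Dual-objective loss: each modality is reconstructed from embeddings
    synthesized out of the opposite modality (A2V and V2A)."""
    v_emb = visual_encoder(tiles)                  # (B, T, EMB_DIM)
    a_emb = audio_encoder(segments)                # (B, T, EMB_DIM)
    synth_v = a2v(a_emb)                           # synthetic visual embeddings
    synth_a = v2a(v_emb)                           # synthetic audio embeddings
    loss_visual = F.mse_loss(visual_decoder(synth_v), tiles.flatten(start_dim=2))
    loss_audio = F.mse_loss(audio_decoder(synth_a), segments)
    return loss_visual + loss_audio                # combined dual objective


def classification_loss(tiles, segments, labels):
    """Cross-entropy objective on combined sequences of real embeddings and
    their synthetic cross-modal counterparts, using labeled videos."""
    with torch.no_grad():                          # encoders assumed frozen here
        v_emb = visual_encoder(tiles)
        a_emb = audio_encoder(segments)
        combined = torch.cat([v_emb, a2v(a_emb)], dim=-1)  # (B, T, 2*EMB_DIM)
    logits = classifier(combined.mean(dim=1))      # pool over time -> (B, 2)
    return F.cross_entropy(logits, labels)         # labels: (B,), 0 = real, 1 = fake
```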