US20250363996
2025-11-27
Physics
G10L17/26
Audio deepfake detection (ADD) is essential to prevent the misuse of synthesized speech generated by AI models. Current ADD systems face challenges such as poor generalization to unseen data and limited interpretability. A novel framework is introduced that uses Style-Linguistics Mismatch (SLIM) to differentiate between real and fake audio. The approach includes a self-supervised pretraining stage that uses only real samples, enabling the model to learn the dependency between style and linguistics in genuine speech. SLIM outperforms existing models on out-of-domain datasets while maintaining competitive performance on in-domain data, improving both detection accuracy and explainability.
The rise of generative models has made it easy to create realistic audio deepfakes using text-to-speech (TTS) or voice conversion (VC) systems. These deepfakes pose risks such as impersonation and fraud. State-of-the-art (SOTA) ADD systems use large self-supervised learning (SSL) models, but they struggle with new, unseen generative techniques. Improving these models often requires costly retraining and fine-tuning. Furthermore, existing systems offer little transparency into their decision-making process, eroding user trust.
Existing ADD systems typically rely on fully supervised training with SSL frontends and backend classifiers. Although techniques such as data augmentation with neural vocoders have improved performance, generalization remains a significant issue. Interpretability is also limited: current models often key on artifacts left by voice synthesis, which may become less detectable as generative methods improve. While some works apply explainable AI (XAI) methods for interpretation, these are often sensitive to training setups and yield inconsistent explanations.
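The conventional pipeline described above can be sketched as follows. This is a minimal, illustrative toy, not the patented method: the "frontend" here is a stand-in for a large SSL model (e.g., a wav2vec-style encoder), the summary-statistic features and the logistic-regression backend are assumptions chosen to keep the example self-contained.

```python
# Toy sketch of a fully supervised ADD pipeline: frozen frontend
# embedding + trainable backend classifier on labeled real/fake clips.
import math
import random

def frontend_embed(waveform):
    """Stand-in for an SSL frontend: summary statistics as the 'embedding'."""
    n = len(waveform)
    mean = sum(waveform) / n
    var = sum((x - mean) ** 2 for x in waveform) / n
    # Zero-crossing rate, a crude spectral proxy.
    zcr = sum(1 for a, b in zip(waveform, waveform[1:]) if a * b < 0) / (n - 1)
    return [mean, var, zcr]

def train_backend(embeddings, labels, lr=0.5, epochs=200):
    """Logistic-regression backend trained with plain stochastic gradient descent."""
    w = [0.0] * len(embeddings[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of binary cross-entropy w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, waveform):
    """Probability that a clip is fake under the trained backend."""
    z = sum(wi * xi for wi, xi in zip(w, frontend_embed(waveform))) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: "real" clips are smooth sinusoids; "fakes" carry extra noise.
random.seed(0)
real = [[math.sin(0.1 * t) for t in range(200)] for _ in range(20)]
fake = [[math.sin(0.1 * t) + random.gauss(0, 0.5) for t in range(200)]
        for _ in range(20)]
X = [frontend_embed(c) for c in real + fake]
y = [0] * len(real) + [1] * len(fake)
w, b = train_backend(X, y)
```

Because the backend is trained only on the artifacts present in its training data, a pipeline like this can fail on generators it has never seen, which is the generalization gap discussed above.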
Speech is analyzed by decomposing it into style and linguistics subspaces. Style refers to non-verbal attributes like emotions and speaker identity, while linguistics pertains to the textual content. Most voice generative models assume these subspaces are independent; however, real speech shows dependencies between them. For instance, emotional states can influence word choices. The proposed ADD framework leverages this dependency by training a model to recognize mismatches between style and linguistics in fake audio.
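The dependency idea above can be illustrated with a small numerical sketch. This is an assumption-laden toy, not the disclosed model: the style and linguistics "features" are random vectors, and the dependency is modeled as a simple per-dimension linear map fit on real samples only. At test time, the prediction error of that map acts as a mismatch score: real speech, where the subspaces covary, scores low; generated speech that draws the two subspaces independently scores high.

```python
# Toy illustration of style-linguistics mismatch scoring (illustrative only).
import random

random.seed(1)
DIM = 4

def real_sample():
    # In real speech, style covaries with linguistics (e.g., emotional state
    # influencing word choice); model that here as style ~ 2*linguistics + noise.
    ling = [random.gauss(0, 1) for _ in range(DIM)]
    style = [2 * v + random.gauss(0, 0.1) for v in ling]
    return ling, style

def fake_sample():
    # Generators often draw style and linguistics independently.
    ling = [random.gauss(0, 1) for _ in range(DIM)]
    style = [random.gauss(0, 2) for _ in range(DIM)]
    return ling, style

def fit_dependency(samples, lr=0.05, epochs=100):
    """Stage 1: fit a per-dimension map style_i ~ a_i * ling_i on REAL data only."""
    a = [0.0] * DIM
    for _ in range(epochs):
        for ling, style in samples:
            for i in range(DIM):
                err = a[i] * ling[i] - style[i]
                a[i] -= lr * err * ling[i]
    return a

def mismatch_score(a, ling, style):
    """Squared error of the learned dependency; high error suggests fake audio."""
    return sum((a[i] * ling[i] - style[i]) ** 2 for i in range(DIM)) / DIM

train_real = [real_sample() for _ in range(100)]
a = fit_dependency(train_real)

real_scores = [mismatch_score(a, *real_sample()) for _ in range(50)]
fake_scores = [mismatch_score(a, *fake_sample()) for _ in range(50)]
```

Note that the pretraining stage never sees a fake sample: because the score measures deviation from the learned real-speech dependency rather than generator-specific artifacts, it degrades more gracefully on unseen generation methods.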
The disclosed techniques can be applied across various domains to enhance security and integrity. For example, integrating the system into communication platforms can automatically terminate calls identified as fake or alert users about potential threats. In virtual meetings, it could flag or block participants using deepfake audio. Automated transcription services could use it to detect fake audio before converting speech to text. Content platforms might employ these methods to filter or flag broadcasts containing fake audio. These applications demonstrate the broad utility of the proposed ADD system.
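A platform integration of the kind described above might map a detector score to an action. The function name and thresholds below are hypothetical, shown only to make the call-handling policy concrete.

```python
# Hypothetical policy mapping a deepfake probability to a platform action.
def handle_call_chunk(score, terminate_above=0.9, warn_above=0.6):
    """Decide what a communication platform does with a scored audio chunk."""
    if score > terminate_above:
        return "terminate"   # automatically end a call identified as fake
    if score > warn_above:
        return "alert_user"  # warn the user about a potential threat
    return "allow"           # pass the audio through unchanged
```

The same pattern applies to the other deployments mentioned: a meeting platform could block rather than terminate, and a transcription service could reject flagged audio before converting speech to text.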