US20250363996
2025-11-27
Physics
G10L17/26
Audio deepfake detection (ADD) is essential to prevent the misuse of synthesized speech generated by AI models. Current ADD systems face challenges such as poor generalization to unseen data and limited interpretability. A novel framework is introduced that uses Style-Linguistics Mismatch (SLIM) to differentiate between real and fake audio. The approach includes a self-supervised pretraining stage that uses only real samples, enabling the model to learn the dependency between style and linguistics in genuine speech. SLIM outperforms existing models on out-of-domain datasets while maintaining competitive performance on in-domain data, improving both detection accuracy and explainability.
The rise of generative models has made it easy to create realistic audio deepfakes using text-to-speech (TTS) or voice conversion (VC) systems. These deepfakes pose risks such as impersonation and fraud. State-of-the-art (SOTA) ADD systems use large self-supervised learning (SSL) models, but they struggle with new, unseen generative techniques. Improving these models often requires costly retraining and fine-tuning. Furthermore, existing systems offer little transparency into their decision-making process, eroding user trust.
Existing ADD systems typically rely on fully supervised training with SSL frontends and backend classifiers. Although techniques such as data augmentation with neural vocoders have improved performance, generalization remains a significant issue. Interpretability is also limited: current models often key on artifacts left by voice synthesis, which may become less detectable as generative methods improve. While some works apply explainable AI (XAI) methods for interpretation, these are often sensitive to training setups and yield inconsistent explanations.
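The conventional pipeline described above can be sketched as follows. This is a minimal, illustrative toy, not the patented method: the "frontend" here is a stand-in for a large SSL model (e.g., a wav2vec-style encoder), the summary-statistic features and the logistic-regression backend are assumptions chosen to keep the example self-contained.

```python
# Toy sketch of a fully supervised ADD pipeline: frozen frontend
# embedding + trainable backend classifier on labeled real/fake clips.
import math
import random

def frontend_embed(waveform):
    """Stand-in for an SSL frontend: summary statistics as the 'embedding'."""
    n = len(waveform)
    mean = sum(waveform) / n
    var = sum((x - mean) ** 2 for x in waveform) / n
    # Zero-crossing rate, a crude spectral proxy.
    zcr = sum(1 for a, b in zip(waveform, waveform[1:]) if a * b < 0) / (n - 1)
    return [mean, var, zcr]

def train_backend(embeddings, labels, lr=0.5, epochs=200):
    """Logistic-regression backend trained with plain stochastic gradient descent."""
    w = [0.0] * len(embeddings[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of binary cross-entropy w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, waveform):
    """Probability that a clip is fake under the trained backend."""
    z = sum(wi * xi for wi, xi in zip(w, frontend_embed(waveform))) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: "real" clips are smooth sinusoids; "fakes" carry extra noise.
random.seed(0)
real = [[math.sin(0.1 * t) for t in range(200)] for _ in range(20)]
fake = [[math.sin(0.1 * t) + random.gauss(0, 0.5) for t in range(200)]
        for _ in range(20)]
X = [frontend_embed(c) for c in real + fake]
y = [0] * len(real) + [1] * len(fake)
w, b = train_backend(X, y)
```

Because the backend is trained only on the artifacts present in its training data, a pipeline like this can fail on generators it has never seen, which is the generalization gap discussed above.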
Speech is analyzed by decomposing it into style and linguistics subspaces. Style refers to non-verbal attributes like emotions and speaker identity, while linguistics pertains to the textual content. Most voice generative models assume these subspaces are independent; however, real speech shows dependencies between them. For instance, emotional states can influence word choices. The proposed ADD framework leverages this dependency by training a model to recognize mismatches between style and linguistics in fake audio.
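The dependency idea above can be illustrated with a small numerical sketch. This is an assumption-laden toy, not the disclosed model: the style and linguistics "features" are random vectors, and the dependency is modeled as a simple per-dimension linear map fit on real samples only. At test time, the prediction error of that map acts as a mismatch score: real speech, where the subspaces covary, scores low; generated speech that draws the two subspaces independently scores high.

```python
# Toy illustration of style-linguistics mismatch scoring (illustrative only).
import random

random.seed(1)
DIM = 4

def real_sample():
    # In real speech, style covaries with linguistics (e.g., emotional state
    # influencing word choice); model that here as style ~ 2*linguistics + noise.
    ling = [random.gauss(0, 1) for _ in range(DIM)]
    style = [2 * v + random.gauss(0, 0.1) for v in ling]
    return ling, style

def fake_sample():
    # Generators often draw style and linguistics independently.
    ling = [random.gauss(0, 1) for _ in range(DIM)]
    style = [random.gauss(0, 2) for _ in range(DIM)]
    return ling, style

def fit_dependency(samples, lr=0.05, epochs=100):
    """Stage 1: fit a per-dimension map style_i ~ a_i * ling_i on REAL data only."""
    a = [0.0] * DIM
    for _ in range(epochs):
        for ling, style in samples:
            for i in range(DIM):
                err = a[i] * ling[i] - style[i]
                a[i] -= lr * err * ling[i]
    return a

def mismatch_score(a, ling, style):
    """Squared error of the learned dependency; high error suggests fake audio."""
    return sum((a[i] * ling[i] - style[i]) ** 2 for i in range(DIM)) / DIM

train_real = [real_sample() for _ in range(100)]
a = fit_dependency(train_real)

real_scores = [mismatch_score(a, *real_sample()) for _ in range(50)]
fake_scores = [mismatch_score(a, *fake_sample()) for _ in range(50)]
```

Note that the pretraining stage never sees a fake sample: because the score measures deviation from the learned real-speech dependency rather than generator-specific artifacts, it degrades more gracefully on unseen generation methods.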
The disclosed techniques can be applied across various domains to enhance security and integrity. For example, integrating the system into communication platforms can automatically terminate calls identified as fake or alert users about potential threats. In virtual meetings, it could flag or block participants using deepfake audio. Automated transcription services could use it to detect fake audio before converting speech to text. Content platforms might employ these methods to filter or flag broadcasts containing fake audio. These applications demonstrate the broad utility of the proposed ADD system.
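A platform integration of the kind described above might map a detector score to an action. The function name and thresholds below are hypothetical, shown only to make the call-handling policy concrete.

```python
# Hypothetical policy mapping a deepfake probability to a platform action.
def handle_call_chunk(score, terminate_above=0.9, warn_above=0.6):
    """Decide what a communication platform does with a scored audio chunk."""
    if score > terminate_above:
        return "terminate"   # automatically end a call identified as fake
    if score > warn_above:
        return "alert_user"  # warn the user about a potential threat
    return "allow"           # pass the audio through unchanged
```

The same pattern applies to the other deployments mentioned: a meeting platform could block rather than terminate, and a transcription service could reject flagged audio before converting speech to text.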