Exploring Self-supervised Embeddings and Synthetic Data Augmentation for Robust Audio Deepfake Detection

Authors: Juan Manuel Martín Doñas, Aitor Álvarez Muniain, Eros Rosello, Ángel M. Gómez, Antonio M. Peinado

Date: 01.09.2024


Abstract

This work explores the performance of large speech self-supervised models as robust audio deepfake detectors. Despite the current trend of fine-tuning the upstream network, in this paper, we revisit the use of pre-trained models as feature extractors to adapt specialized downstream audio deepfake classifiers. The goal is to keep the general knowledge of the audio foundation model to extract discriminative features to feed a simplified deepfake classifier. In addition, the generalization capabilities of the system are improved by augmenting the training corpora using additional synthetic data from different vocoder algorithms. This strategy is also complemented by various data augmentations covering challenging acoustic conditions. Our proposal is evaluated on different benchmark datasets for audio deepfake and anti-spoofing tasks, showing state-of-the-art performance. Furthermore, we analyze the relevant parts of the downstream classifier to achieve a robust system.
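For illustration only, the sketch below shows the general pattern the abstract describes: a pre-trained self-supervised speech model kept frozen and used purely as a feature extractor, with a small downstream classifier trained on its embeddings. The specific upstream model (torchaudio's XLSR-53 wav2vec 2.0 bundle), the temporal average pooling, and the two-layer head are assumptions made for this example, not the architecture reported in the paper.

# Minimal sketch (assumptions noted above), not the authors' exact pipeline.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_XLSR53
ssl_model = bundle.get_model().eval()           # upstream stays frozen
for p in ssl_model.parameters():
    p.requires_grad = False

class DeepfakeClassifier(torch.nn.Module):
    """Simplified downstream head: pool SSL embeddings, then score bona fide vs. spoof."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(feat_dim, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, 2),         # bona fide / spoof logits
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                   # embeddings come from the frozen upstream
            feats, _ = ssl_model.extract_features(waveform)
        pooled = feats[-1].mean(dim=1)          # temporal average pooling of the last layer
        return self.net(pooled)

clf = DeepfakeClassifier(feat_dim=1024)         # XLSR-53 embedding size is 1024
waveform = torch.randn(1, 16000)                # 1 s of dummy 16 kHz audio
print(clf(waveform).shape)                      # torch.Size([1, 2])

In this setup only the downstream head is optimized, which keeps the general knowledge of the foundation model intact; training data would additionally be expanded with vocoded speech and acoustic augmentations, as described in the abstract.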

BibTeX

@Article{martindonas2024exploring,
  author   = {Juan Manuel Martín Doñas and Aitor Álvarez Muniain and Eros Rosello and Ángel M. Gómez and Antonio M. Peinado},
  title    = {Exploring Self-supervised Embeddings and Synthetic Data Augmentation for Robust Audio Deepfake Detection},
  pages    = {2085-2089},
  keywords = {anti-spoofing; audio deepfake detection; data augmentation; self-supervised models; vocoders},
  abstract = {This work explores the performance of large speech self-supervised models as robust audio deepfake detectors. Despite the current trend of fine-tuning the upstream network, in this paper, we revisit the use of pre-trained models as feature extractors to adapt specialized downstream audio deepfake classifiers. The goal is to keep the general knowledge of the audio foundation model to extract discriminative features to feed a simplified deepfake classifier. In addition, the generalization capabilities of the system are improved by augmenting the training corpora using additional synthetic data from different vocoder algorithms. This strategy is also complemented by various data augmentations covering challenging acoustic conditions. Our proposal is evaluated on different benchmark datasets for audio deepfake and anti-spoofing tasks, showing state-of-the-art performance. Furthermore, we analyze the relevant parts of the downstream classifier to achieve a robust system.},
  date     = {2024-09-01},
}