Anonymizing Dysarthric Speech: Investigating the Effects of Voice Conversion on Pathological Information Preservation
Authors: Abner Hernández, Paula Andrea Pérez, Tomás Arias, Seung Hee Yang, Juan Rafael Orozco, Andreas Maier
Date: 09.09.2024
Abstract
Acquiring speech data is a crucial step in the development of speech recognition systems and related speech-based machine learning models. However, protecting privacy is an increasing concern that must be addressed. This study investigates voice conversion (VC) as a strategy for anonymizing the speech of individuals with dysarthria. We specifically focus on training a variety of VC models using self-supervised speech representations, such as Wav2Vec and its multi-lingual variant, Wav2Vec2.0 (XLSR). The converted voices maintain a word error rate that is within 1% with respect to the original recordings. The Equal Error Rate (EER) showed a significant increase, from 1.52% to 41.18% on the LibriSpeech test set, and from 3.75% to 42.19% on speakers from the VCTK corpus, indicating a substantial decrease in speaker verification performance. A similar trend is observed with dysarthric speech, where the EER varied from 16.45% to 43.46%. Additionally, our study includes classification experiments on dysarthric vs. healthy speech data to demonstrate that anonymized voices can still yield speech features essential for distinguishing between healthy and pathological speech. The impact of voice conversion is investigated by covering aspects such as articulation, prosody, phonation, and phonology.
BIB_text
title = {Anonymizing Dysarthric Speech: Investigating the Effects of Voice Conversion on Pathological Information Preservation},
pages = {149-160},
keywds = {Dysarthria; Medical Data; Speech Representation; Voice Anonymization; Voice Conversion},
abstract = {
Acquiring speech data is a crucial step in the development of speech recognition systems and related speech-based machine learning models. However, protecting privacy is an increasing concern that must be addressed. This study investigates voice conversion (VC) as a strategy for anonymizing the speech of individuals with dysarthria. We specifically focus on training a variety of VC models using self-supervised speech representations, such as Wav2Vec and its multi-lingual variant, Wav2Vec2.0 (XLSR). The converted voices maintain a word error rate that is within 1% with respect to the original recordings. The Equal Error Rate (EER) showed a significant increase, from 1.52% to 41.18% on the LibriSpeech test set, and from 3.75% to 42.19% on speakers from the VCTK corpus, indicating a substantial decrease in speaker verification performance. A similar trend is observed with dysarthric speech, where the EER varied from 16.45% to 43.46%. Additionally, our study includes classification experiments on dysarthric vs. healthy speech data to demonstrate that anonymized voices can still yield speech features essential for distinguishing between healthy and pathological speech. The impact of voice conversion is investigated by covering aspects such as articulation, prosody, phonation, and phonology.
},
isbn = {978-3-031-70565-6},
doi = {10.1007/978-3-031-70566-3_14},
date = {2024-09-09},
}