When Whisper Meets TTS: Domain Adaptation Using only Synthetic Speech Data

Authors: Juan Camilo Vasquez Correa, Haritz Arzelus Irazusta, Juan Manuel Martín Doñas, Joaquin Arellano Goicoechea, Ander González Docasal, Aitor Álvarez Muniain

Date: 04.09.2023


Abstract

Automatic Speech Recognition is among the most important areas of Artificial Intelligence research today. One of the most notable advances in this area is the development of end-to-end models, which have shown state-of-the-art performance in many benchmark scenarios. Despite these recent improvements, such architectures still require large amounts of transcribed speech data for training, which can be challenging to obtain in low-resource languages or in specific domains due to privacy concerns. This study proposes a methodology to fine-tune Whisper-based models using only synthetic speech. The aim is to enable the training of robust systems for specific domains and low-resource languages, where large labeled corpora are difficult to collect. Our approach is based on language model adaptation: only the decoder of the model is fine-tuned, so that the network can learn specific vocabulary that is not initially available. The proposed methodology is evaluated with data from different languages and domains. In addition, Parameter Efficient Fine-Tuning strategies were used to efficiently adapt the large pre-trained Whisper models. This is one of the first studies to consider the effect of using only synthetic speech for domain adaptation of speech recognition systems on non-English data, providing word error rate reductions in low-resource languages of between 2 and 30 points, depending on the Whisper version.
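The decoder-only adaptation described in the abstract can be sketched in PyTorch: freeze every encoder parameter so that gradient updates touch only the decoder (the language-model component). The tiny model, layer sizes, and `freeze_encoder` helper below are illustrative assumptions for this sketch, not the paper's actual Whisper implementation.

```python
import torch.nn as nn

# Hypothetical stand-in for a Whisper-style encoder-decoder model;
# the dimensions and layer counts are illustrative, not Whisper's real config.
class TinyEncDec(nn.Module):
    def __init__(self, d_model=32, vocab=100):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.embed = nn.Embedding(vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab)

def freeze_encoder(model: nn.Module) -> None:
    """Freeze all encoder parameters so only the decoder side
    receives gradient updates during fine-tuning."""
    for p in model.encoder.parameters():
        p.requires_grad = False

model = TinyEncDec()
freeze_encoder(model)

# Only decoder, embedding, and output-head parameters remain trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable}/{total}")
```

An optimizer would then be built over `filter(lambda p: p.requires_grad, model.parameters())`; the paper additionally applies Parameter Efficient Fine-Tuning, which would further reduce the trainable set (e.g. via low-rank adapters) rather than updating the full decoder.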

BIB_text

@article{
title = {When Whisper Meets TTS: Domain Adaptation Using only Synthetic Speech Data},
pages = {226-238},
keywords = {Domain Adaptation; Parameter Efficient Fine-Tuning; Speech Recognition; Text to Speech; Whisper},
abstract = {Automatic Speech Recognition is among the most important areas of Artificial Intelligence research today. One of the most notable advances in this area is the development of end-to-end models, which have shown state-of-the-art performance in many benchmark scenarios. Despite these recent improvements, such architectures still require large amounts of transcribed speech data for training, which can be challenging to obtain in low-resource languages or in specific domains due to privacy concerns. This study proposes a methodology to fine-tune Whisper-based models using only synthetic speech. The aim is to enable the training of robust systems for specific domains and low-resource languages, where large labeled corpora are difficult to collect. Our approach is based on language model adaptation: only the decoder of the model is fine-tuned, so that the network can learn specific vocabulary that is not initially available. The proposed methodology is evaluated with data from different languages and domains. In addition, Parameter Efficient Fine-Tuning strategies were used to efficiently adapt the large pre-trained Whisper models. This is one of the first studies to consider the effect of using only synthetic speech for domain adaptation of speech recognition systems on non-English data, providing word error rate reductions in low-resource languages of between 2 and 30 points, depending on the Whisper version.},
isbn = {978-303140497-9},
date = {2023-09-04},
}
Vicomtech

Parque Científico y Tecnológico de Gipuzkoa,
Paseo Mikeletegi 57,
20009 Donostia / San Sebastián (Spain)

+(34) 943 309 230

Zorrotzaurreko Erribera 2, Deusto,
48014 Bilbao (Spain)
