The Vicomtech Speech Transcription Systems for the Albayzín-RTVE 2020 Speech to Text Transcription Challenge
Authors: Iván Gonzalez Torre
Date: 24.03.2021
Abstract
This paper describes Vicomtech’s submission to the Albayzín-RTVE 2020 Speech to Text Transcription Challenge,
which calls for automatic speech transcription systems to be evaluated on realistic TV shows.
A total of four systems were built and submitted to the evaluation challenge: a primary system and three contrastive systems. These recognition engines are different versions, evolutions and configurations of two main architectures. The first architecture includes a hybrid DNN-HMM acoustic model, where factorized TDNN layers with and without initial CNN layers were trained to provide posterior probabilities for the HMM states. The language model for decoding corresponds to a modified Kneser-Ney smoothed 3-gram model, whilst an RNNLM was used in some systems for rescoring the initial lattices. The second architecture was based on the QuartzNet architecture proposed by Nvidia with the aim of building smaller and lighter ASR models with SOTA-level accuracy. A modified Kneser-Ney smoothed 5-gram model was employed to re-score the initial hypotheses of this E2E model.
The results obtained for each TV program in the final test set are also presented, along with the hardware resources and computation time needed by each system to process the released evaluation data.
BIB_text
title = {The Vicomtech Speech Transcription Systems for the Albayzín-RTVE 2020 Speech to Text Transcription Challenge},
pages = {104-107},
keywds = {
albayzín evaluations, speech recognition, deep learning, convolutional neural networks, recurrent neural networks
},
abstract = {
This paper describes Vicomtech’s submission to the Albayzín-RTVE 2020 Speech to Text Transcription Challenge,
which calls for automatic speech transcription systems to be evaluated on realistic TV shows.
A total of four systems were built and submitted to the evaluation challenge: a primary system and three contrastive systems. These recognition engines are different versions, evolutions and configurations of two main architectures. The first architecture includes a hybrid DNN-HMM acoustic model, where factorized TDNN layers with and without initial CNN layers were trained to provide posterior probabilities for the HMM states. The language model for decoding corresponds to a modified Kneser-Ney smoothed 3-gram model, whilst an RNNLM was used in some systems for rescoring the initial lattices. The second architecture was based on the QuartzNet architecture proposed by Nvidia with the aim of building smaller and lighter ASR models with SOTA-level accuracy. A modified Kneser-Ney smoothed 5-gram model was employed to re-score the initial hypotheses of this E2E model.
The results obtained for each TV program in the final test set are also presented, along with the hardware resources and computation time needed by each system to process the released evaluation data.
},
date = {2021-03-24},
}