Speech Technologies for the Audiovisual and Multimedia Interaction Environments
Author:
Directors: Arantza del Pozo Echezarreta (Vicomtech), Andoni Arruti (University)
University: UPV/EHU
Date: 22.07.2016
Place: Facultad de Informática, Donostia-San Sebastián
The progress of technology, easy access to powerful machines and electronic devices, social networks, the virtually unlimited storage space available on the Internet and, ultimately, everything that the new Digital Era encompasses have driven a huge increase in the amount of content created and publicly shared on a daily basis. This content may include text, images, video and/or audio. The generation of such a vast amount of content has led to the development of new methodologies for its optimal indexing and mining and for the automatic extraction of semantic information in different applications and domains, such as security, surveillance, information access and retrieval, audiovisual media or forensics, among others.

Audio analysis, in particular, can be used in a wide range of applications, given the large amount of information that can be extracted from each audio recording. Depending on the type of application, audio analysis can involve information extraction at different levels: the linguistic level (speech transcription), language identification, the paralinguistic level (e.g. emotions), the speaker level (number of speakers, gender, segmentation, identification), the acoustic level (background or isolated noises, etc.), the classification of audio segments (e.g. music, noise, speech) or music analysis. Audio analysis must continually deal with the variability introduced by the particularities of each speaker, the acoustic environment, volume changes, accents, speaking styles, overlapped speech, etc. Most of these aspects still pose a great challenge for the speech community. Moreover, given their statistical nature, most of the solutions implemented for audio analysis remain highly domain-dependent and require adaptation when the application domain differs notably from the training data conditions.

This dissertation covers several advanced audio and speech processing technologies that can be applied to audiovisual and human-computer interaction environments.
It includes an analysis of their applicability and current state, and details the main contributions made to these fields. Finally, several of the developed technological solutions are described, together with their transfer to a number of companies for industrial use.