Contributions to Document-Level Neural Machine Translation
Author:
Supervisors: Thierry Etchegoyhen (Vicomtech), Gorka Labaka (University)
University: Universidad del País Vasco - Euskal Herriko Unibertsitatea
Date: 11.03.2025
Neural machine translation (NMT) systems can achieve high translation quality at the sentence level, but they still struggle with document-level phenomena such as coreference resolution, lexical cohesion and discourse coherence, which leads to inconsistencies and inaccuracies across sentences within the same translated text. This thesis contributes to addressing these limitations by developing resources and methodologies specifically designed to capture extrasentential linguistic phenomena in NMT, with a focus on translation involving low-resource languages, Basque in particular.
The thesis is structured around four research areas. First, we address general improvements to sentence-level models, as a basis for document-level modelling. Second, we focus on building corpora for document-level translation. Third, we explore modelling variants to improve context-aware translation. Finally, we analyse the strengths and weaknesses of context-aware models.
Primary contributions include the creation of the first dataset specifically designed for Basque-Spanish and Basque-French context-aware translation, along with novel data augmentation techniques to enhance the training of context-aware models. Regarding model development, this work introduces novel approaches, evaluates the impact of pretraining, explores the identification of contextual information via learnable source factors, and studies the promotion of target-language context in standard architectures, achieving consistent improvements. Additionally, the thesis provides a comprehensive analysis of contextual NMT, highlighting specific strengths and limitations, particularly those related to context length, complexity, and syntactic functions. It also includes an in-depth exploration of the impact of context on gender bias in machine translation.
Our findings suggest that the datasets and methods developed throughout the thesis can provide significant improvements in translation quality for intersentential linguistic phenomena.