Knowledge Transfer for Active Learning in Textual Anonymisation
Authors: Laura García Sardiña, Manex Serras Saenz
Abstract
Data privacy compliance has gained a lot of attention in recent years. Automating the de-identification process is a challenging task that often requires annotating in-domain data from scratch, as annotated resources for such scenarios are usually lacking. In this work, knowledge from a classifier learnt on an annotated source dataset is transferred to speed up the training of a binary personal data identification classifier in a pool-based Active Learning setting, for a new, initially unlabelled target dataset that differs in language and domain. To this end, knowledge from the source classifier is used for seed selection and for uncertainty-based query selection strategies. In the experiments, multiple entropy-based criteria and input diversity measures are combined. Results show a significant improvement on the anonymisation label from the first batch, speeding up the classifier’s learning curve in the target domain and reaching top performance with less than 10% of the total training data, thus demonstrating the usefulness of the proposed approach even when the anonymisation domains diverge significantly.
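The uncertainty-based query selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binary classifier that outputs a probability of "personal data" for each unlabelled pool item, and it ranks items by Shannon entropy. The function names, the batch size, and the scores are hypothetical; the abstract does not specify the exact entropy criteria or diversity measures used, so this is only one plausible instance of the general idea.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli distribution with P(positive) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_queries(pool_probs, batch_size):
    """Return the indices of the batch_size most uncertain pool items.

    pool_probs is a hypothetical input: one predicted P(personal data) per
    unlabelled item, e.g. scores produced by a source-domain classifier.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: binary_entropy(pool_probs[i]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:batch_size]


# Illustrative scores only, not data from the paper.
probs = [0.02, 0.48, 0.91, 0.55, 0.10]
print(select_queries(probs, 2))  # prints [1, 3]
```

Items whose predicted probability is closest to 0.5 have the highest entropy, so they are sent to the annotator first; confidently scored items are left in the pool.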
BIB_text
title = {Knowledge Transfer for Active Learning in Textual Anonymisation},
pages = {155-166},
keywds = {Knowledge Transfer, Active Learning, Seed Selection, Query Selection Strategy, Textual Anonymisation},
abstract = {
Data privacy compliance has gained a lot of attention in recent years. Automating the de-identification process is a challenging task that often requires annotating in-domain data from scratch, as annotated resources for such scenarios are usually lacking. In this work, knowledge from a classifier learnt on an annotated source dataset is transferred to speed up the training of a binary personal data identification classifier in a pool-based Active Learning setting, for a new, initially unlabelled target dataset that differs in language and domain. To this end, knowledge from the source classifier is used for seed selection and for uncertainty-based query selection strategies. In the experiments, multiple entropy-based criteria and input diversity measures are combined. Results show a significant improvement on the anonymisation label from the first batch, speeding up the classifier’s learning curve in the target domain and reaching top performance with less than 10% of the total training data, thus demonstrating the usefulness of the proposed approach even when the anonymisation domains diverge significantly.
},
isbn = {978-3-030-00810-9},
doi = {10.1007/978-3-030-00810-9_14},
}