Effect of incorporating metadata to the generation of synthetic time series in a healthcare context
Authors: Ane Alberdi, Bamidis Panagiotis, Evdokimos Konstantinidis
Date: 01.06.2023
Abstract
Synthetic data is becoming the way forward to manage legal and regulatory aspects of biomedical research involving personal and clinical data. As no matches are expected between artificial instances and real samples and/or subjects, external researchers performing secondary analyses could benefit significantly from unlimited access to uncompromised information. In this context, one of the main objectives of the H2020 VITALISE project is to develop a platform that provides those external researchers with synthetic data generated from real data collected in Living Labs. In addition, while some time-series-specific synthetic data generation models exist, only a few of them consider metadata (e.g., patient demographics) as part of the time series generation process itself. Therefore, the objective of this research is to perform a comparative assessment of two synthetic data generation models that use and process subject metadata differently: the Wasserstein GAN with Gradient Penalty (WGAN-GP) and DoppelGANger (DGAN). To achieve this goal while ensuring that the analyses were data-independent, we selected two healthcare-related longitudinal datasets: (1) Treadmill Maximal Effort Test (TMET) measurements from the University of Málaga; and (2) a hypotension subset derived from the MIMIC-III v1.4 database. After generating the synthetic data, we assessed three pivotal aspects: resemblance to the original data, utility, and level of privacy. The main conclusion is that using metadata as context variables, and the methodology used to take them into account, proved significant and valuable, with the DGAN model offering better results overall. A more extensive time-series-specific evaluation is left as the main avenue for future research.
BIB_text
title = {Effect of incorporating metadata to the generation of synthetic time series in a healthcare context},
pages = {910--916},
keywds = {
health data; shareable data; synthetic data; time series
},
abstract = {
Synthetic data is becoming the way forward to manage legal and regulatory aspects of biomedical research involving personal and clinical data. As no matches are expected between artificial instances and real samples and/or subjects, external researchers performing secondary analyses could benefit significantly from unlimited access to uncompromised information. In this context, one of the main objectives of the H2020 VITALISE project is to develop a platform that provides those external researchers with synthetic data generated from real data collected in Living Labs. In addition, while some time-series-specific synthetic data generation models exist, only a few of them consider metadata (e.g., patient demographics) as part of the time series generation process itself. Therefore, the objective of this research is to perform a comparative assessment of two synthetic data generation models that use and process subject metadata differently: the Wasserstein GAN with Gradient Penalty (WGAN-GP) and DoppelGANger (DGAN). To achieve this goal while ensuring that the analyses were data-independent, we selected two healthcare-related longitudinal datasets: (1) Treadmill Maximal Effort Test (TMET) measurements from the University of Málaga; and (2) a hypotension subset derived from the MIMIC-III v1.4 database. After generating the synthetic data, we assessed three pivotal aspects: resemblance to the original data, utility, and level of privacy. The main conclusion is that using metadata as context variables, and the methodology used to take them into account, proved significant and valuable, with the DGAN model offering better results overall. A more extensive time-series-specific evaluation is left as the main avenue for future research.
},
isbn = {979-835031224-9},
date = {2023-06-01},
}