Iterative application of UMAP-based algorithms for fully synthetic healthcare tabular data generation

Bibliographic Details
Authors: Lázaro Trilles, Carla; Angulo Bahón, Cecilio (ORCID: 0000-0001-9589-8199)
Resource type: article
Publication date: 2024
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/429521
Online access: https://hdl.handle.net/2117/429521
https://dx.doi.org/10.3390/a17120591
Access level: open access
Keywords: Fully synthetic data
UMAP
Healthcare tabular data
Data augmentation
UPC subject areas::Computer science::Computer science applications::Bioinformatics
UPC subject areas::Computer science::Artificial intelligence::Machine learning
Description
Summary: Building on a previously developed partially synthetic data generation algorithm that uses data visualization techniques, this study extends that algorithm to generate fully synthetic tabular healthcare data. In this enhanced form, the algorithm serves as an alternative to conventional methods based on Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). By iteratively applying the original methodology, the adapted algorithm employs UMAP (Uniform Manifold Approximation and Projection), a dimensionality reduction technique, to validate generated samples through low-dimensional clustering. This approach has been successfully applied to three healthcare domains: prostate cancer, breast cancer, and cardiovascular disease. The generated synthetic data have been rigorously evaluated for fidelity and utility. Results show that the UMAP-based algorithm outperforms GAN- and VAE-based generation methods across different scenarios. In fidelity assessments, it achieved smaller maximum distances between the cumulative distribution functions of real and synthetic data for different attributes. In utility evaluations, the UMAP-based synthetic datasets enhanced machine learning model performance, particularly in classification tasks. In conclusion, this method represents a robust solution for generating secure, high-quality synthetic healthcare data, effectively addressing data scarcity challenges.
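
The fidelity criterion mentioned in the summary, the maximum distance between the cumulative distribution functions of real and synthetic data, corresponds to what is commonly called the two-sample Kolmogorov-Smirnov statistic. The sketch below shows one way such a per-attribute distance could be computed using only the Python standard library. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `ecdf` and `max_cdf_distance` and the sample values are hypothetical, and this record does not say whether the study computed the distance in exactly this way.

```python
from bisect import bisect_right


def ecdf(sorted_sample, x):
    """Empirical CDF: fraction of values in sorted_sample that are <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def max_cdf_distance(real, synthetic):
    """Largest absolute gap between the empirical CDFs of two samples.

    Both empirical CDFs are step functions that only change at observed
    values, so it is enough to compare them at every observed point.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    points = real_sorted + synth_sorted
    return max(
        abs(ecdf(real_sorted, x) - ecdf(synth_sorted, x)) for x in points
    )


# Hypothetical values for a single numeric attribute (illustration only,
# not data from the study).
real_ages = [54, 61, 67, 70, 72]
synthetic_ages = [55, 60, 66, 71, 73]
print(max_cdf_distance(real_ages, synthetic_ages))
```

A value of 0 means the two empirical distributions coincide and a value of 1 means they do not overlap at all, so a smaller distance indicates higher fidelity for that attribute.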