A Synthetic Data Generation System based on the Variational-Autoencoder Technique and the Linked Data Paradigm

Dos Santos, Ricardo; Aguilar, Jose

doi:10.1007/s13748-024-00328-x

Files

Article (504.1 KB)

Identifiers

URI: https://hdl.handle.net/20.500.12761/1828

ISSN: 2192-6352

DOI: 10.1007/s13748-024-00328-x

Metadata

Author(s)

Dos Santos, Ricardo; Aguilar, Jose

Date

2024-06-30

Abstract

Synthetic data generation has become increasingly popular, driven by the need to create data in specific contexts and to study unknown scenarios, among other reasons. Synthetic data is also a critical component in training machine learning models when little data is available. This work proposes a Synthetic Data Generation System (SDGS) architecture that allows synthetic data generation to be fully automated. SDGS is based on the Variational Autoencoder (VAE) learning technique and has three main capabilities. The first is the ability to extract data samples from multiple sources using the Linked Data (LD) paradigm. The second is the ability to merge datasets to increase the amount of information that can be provided to the VAE-based synthetic data generator. The third is a Feature Engineering layer that creates new features by generating or extracting information from the dataset and then selects the features that provide the best information for the VAE model. A case study is described in detail to show the new functionalities of the SDGS, such as dataset extraction from different sources using LD, dataset merging using pivots, and the application of different feature engineering methods. Finally, two metrics are used to evaluate the quality of the generated datasets in different case studies. The first is accuracy, used to analyze the performance of the models generated with the new SDGS functionalities, with results above 90%. The second is the two-sample Hotelling's T-squared test, used to determine the quality of the synthetic data generated by the system; the synthetic datasets obtained are very similar to the original datasets.
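As a rough illustration of the second metric, the sketch below computes the textbook two-sample Hotelling's T-squared statistic and its F-scaled form for an original and a synthetic sample. It is not the authors' implementation: the function names `hotelling_t2` and `solve`, the equal-covariance assumption, and the toy data are all assumptions made for this example, and it uses only the Python standard library.

```python
from statistics import fmean


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Pick the row with the largest entry in this column for stability.
        best = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[best][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        m[col], m[best] = m[best], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def hotelling_t2(x, y):
    """Two-sample Hotelling's T-squared for row-wise samples x and y.

    Returns (t2, f_stat, df1, df2). Assumes both samples share one
    covariance matrix and that the pooled covariance is invertible.
    """
    n1, n2, p = len(x), len(y), len(x[0])
    mean_x = [fmean(col) for col in zip(*x)]
    mean_y = [fmean(col) for col in zip(*y)]

    # Pooled covariance: summed scatter matrices divided by n1 + n2 - 2.
    pooled = [[0.0] * p for _ in range(p)]
    for rows, mean in ((x, mean_x), (y, mean_y)):
        for row in rows:
            d = [v - mu for v, mu in zip(row, mean)]
            for i in range(p):
                for j in range(p):
                    pooled[i][j] += d[i] * d[j]
    pooled = [[v / (n1 + n2 - 2) for v in r] for r in pooled]

    diff = [a - b for a, b in zip(mean_x, mean_y)]
    z = solve(pooled, diff)  # z = S_pooled^{-1} (mean_x - mean_y)
    t2 = (n1 * n2) / (n1 + n2) * sum(d * w for d, w in zip(diff, z))

    # Under the null hypothesis of equal means, f_stat follows F(df1, df2).
    df1, df2 = p, n1 + n2 - p - 1
    f_stat = t2 * df2 / (p * (n1 + n2 - 2))
    return t2, f_stat, df1, df2


# Toy data (hypothetical): "original" rows and a shifted "synthetic" copy.
original = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
synthetic = [[2.0, 2.0], [3.0, 1.0], [4.0, 4.0], [5.0, 3.0]]
print(hotelling_t2(original, synthetic))
```

A p-value can then be read from the upper tail of the F distribution with (df1, df2) degrees of freedom, for example with `scipy.stats.f.sf(f_stat, df1, df2)` if SciPy is available. A large p-value means the test finds no evidence that the mean vectors of the two samples differ, which is the sense in which a synthetic dataset would be judged similar to the original one.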