Data pre-processing pipeline generation for AutoETL

Data pre-processing plays a key role in a data analytics process (e.g., applying a classification algorithm on a predictive task). It encompasses a broad range of activities that span from correcting errors to selecting the most relevant features for the analysis phase. There is no clear evidence, o...

Bibliographic Details
Authors: Giovanelli, Joseph; Bilalli, Besim (ORCID 0000-0002-0575-2389); Abelló Gamazo, Alberto (ORCID 0000-0002-3223-2186)
Resource type: article
Publication date: 2022
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/362396
Online access: https://hdl.handle.net/2117/362396
https://dx.doi.org/10.1016/j.is.2021.101957
Access level: open access
Keywords: Big data
Data mining
Algorithms
Data pre-processing pipelines
Data analytics
Dades massives
Mineria de dades
Algorismes
Àrees temàtiques de la UPC::Informàtica::Sistemes d'informació
id ES_3d2fa2f38f019334d89ce2cb81b4607d
oai_identifier_str oai:upcommons.upc.edu:2117/362396
network_acronym_str ES
network_name_str España
repository_id_str
spelling This work was supported by the DOGO4ML project, funded by the Spanish Ministerio de Ciencia e Innovación under project/funding scheme PID2020-117191RB-I00/AEI/10.13039/501100011033. We thank University of Bologna for issuing a grant for the author’s research stay at Universitat Politècnica de Catalunya.
Peer Reviewed
dc.title.none.fl_str_mv Data pre-processing pipeline generation for AutoETL
dc.creator.none.fl_str_mv Giovanelli, Joseph
Bilalli, Besim|||0000-0002-0575-2389
Abelló Gamazo, Alberto|||0000-0002-3223-2186
dc.subject.none.fl_str_mv Big data
Data mining
Algorithms
Data pre-processing pipelines
Data analytics
Dades massives
Mineria de dades
Algorismes
Àrees temàtiques de la UPC::Informàtica::Sistemes d'informació
description Data pre-processing plays a key role in a data analytics process (e.g., applying a classification algorithm on a predictive task). It encompasses a broad range of activities that span from correcting errors to selecting the most relevant features for the analysis phase. There is no clear evidence, and there are no defined rules, on how pre-processing transformations impact the final results of the analysis. The problem is exacerbated when transformations are combined into pre-processing pipeline prototypes. Data scientists cannot easily foresee the impact of pipeline prototypes and hence require a method to discriminate between them and find the most relevant ones (e.g., those with the highest positive impact) for their study at hand. Once found, these prototypes can be instantiated and optimized, e.g., using Bayesian Optimization. In this work, we study the impact of transformations when chained together into prototypes, and the impact of transformations when instantiated via various operators. We develop and scrutinize a generic method for generating pre-processing pipelines, as a step towards AutoETL. We make use of rules that enable the construction of prototypes (i.e., define the order of transformations), and rules that guide the instantiation of the transformations inside the prototypes (i.e., define the operator for each transformation). Optimizing our effective pipeline prototypes yields results that, compared to an exhaustive search, reach 90% of the predictive accuracy in the median, at a time cost that is 24 times smaller.
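The distinction the description draws between a prototype (an order of transformations) and its instantiation (one operator per transformation) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the article: the transformation names, the operator lists, the precedence rules, and the functions `prototypes` and `instantiate` are all hypothetical. The sketch enumerates the search space exhaustively, which corresponds to the baseline the description compares against, not to the rule-guided method or the Bayesian Optimization step.

```python
from itertools import permutations, product

# Hypothetical transformations and their candidate operators (illustrative only).
OPERATORS = {
    "impute": ["mean", "median"],
    "encode": ["one_hot", "ordinal"],
    "normalize": ["standard", "min_max"],
}

# Hypothetical ordering rules: (a, b) means transformation a must run before b.
PRECEDENCE = [("impute", "encode"), ("encode", "normalize")]


def prototypes(transformations, precedence):
    """Yield every ordering of the transformations that satisfies all rules."""
    for order in permutations(transformations):
        position = {name: i for i, name in enumerate(order)}
        if all(position[a] < position[b] for a, b in precedence):
            yield order


def instantiate(prototype, operators):
    """Yield every pipeline that picks one operator per transformation."""
    choices = [[(step, op) for op in operators[step]] for step in prototype]
    yield from product(*choices)


if __name__ == "__main__":
    for proto in prototypes(list(OPERATORS), PRECEDENCE):
        for pipeline in instantiate(proto, OPERATORS):
            print(" -> ".join(f"{step}[{op}]" for step, op in pipeline))
```

In this toy setting the ordering rules shrink the set of prototypes before any operator is chosen, which is the intuition behind building prototypes first and instantiating them afterwards.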
publishDate 2022
dc.date.none.fl_str_mv 2022
2022-09-01
2022
2022-02-15
dc.type.none.fl_str_mv journal article
http://purl.org/coar/resource_type/c_6501
AM
http://purl.org/coar/version/c_ab4af688f83e57aa
dc.type.openaire.fl_str_mv info:eu-repo/semantics/article
format article
dc.identifier.none.fl_str_mv https://hdl.handle.net/2117/362396
https://dx.doi.org/10.1016/j.is.2021.101957
url https://hdl.handle.net/2117/362396
https://dx.doi.org/10.1016/j.is.2021.101957
dc.language.none.fl_str_mv English
eng
language_invalid_str_mv English
language eng
dc.relation.none.fl_str_mv Agencia Estatal de Investigación http://doi.org/10.13039/501100011033 Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020 PID2020-117191RB-I00 DESARROLLO, OPERATIVA Y GOBERNANZA DE DATOS PARA SISTEMAS SOFTWARE BASADOS EN APRENDIZAJE AUTOMATICO
dc.rights.none.fl_str_mv open access
http://purl.org/coar/access_right/c_abf2
Attribution-NonCommercial-NoDerivatives 4.0 International
https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights.openaire.fl_str_mv info:eu-repo/semantics/openAccess
rights_invalid_str_mv open access
http://purl.org/coar/access_right/c_abf2
Attribution-NonCommercial-NoDerivatives 4.0 International
https://creativecommons.org/licenses/by-nc-nd/4.0/
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Elsevier
publisher.none.fl_str_mv Elsevier
dc.source.none.fl_str_mv reponame:UPCommons. Portal del coneixement obert de la UPC
instname:Universitat Politècnica de Catalunya (UPC)
instname_str Universitat Politècnica de Catalunya (UPC)
reponame_str UPCommons. Portal del coneixement obert de la UPC
collection UPCommons. Portal del coneixement obert de la UPC