Data pre-processing pipeline generation for AutoETL
Data pre-processing plays a key role in a data analytics process (e.g., applying a classification algorithm on a predictive task). It encompasses a broad range of activities that span from correcting errors to selecting the most relevant features for the analysis phase. There is no clear evidence, o...
| Authors: | Giovanelli, Joseph; Bilalli, Besim; Abelló Gamazo, Alberto |
|---|---|
| Resource type: | article |
| Publication date: | 2022 |
| Country: | Spain |
| Institution: | Universitat Politècnica de Catalunya (UPC) |
| Repository: | UPCommons. Portal del coneixement obert de la UPC |
| Language: | English |
| OAI Identifier: | oai:upcommons.upc.edu:2117/362396 |
| Online access: | https://hdl.handle.net/2117/362396 https://dx.doi.org/10.1016/j.is.2021.101957 |
| Access Level: | open access |
| Keywords: | Big data; Data mining; Algorithms; Data pre-processing pipelines; Data analytics; Dades massives; Mineria de dades; Algorismes; Àrees temàtiques de la UPC::Informàtica::Sistemes d'informació |
| id |
ES_3d2fa2f38f019334d89ce2cb81b4607d |
|---|---|
| oai_identifier_str |
oai:upcommons.upc.edu:2117/362396 |
| network_acronym_str |
ES |
| network_name_str |
España |
| spelling |
This work was supported by the DOGO4ML project, funded by the Spanish Ministerio de Ciencia e Innovación under project/funding scheme PID2020-117191RB-I00/AEI/10.13039/501100011033. We thank the University of Bologna for issuing a grant for the author's research stay at Universitat Politècnica de Catalunya. Peer Reviewed. 2026-05-27T15:37:01Z |
| dc.title.none.fl_str_mv |
Data pre-processing pipeline generation for AutoETL |
| dc.creator.none.fl_str_mv |
Giovanelli, Joseph; Bilalli, Besim (ORCID 0000-0002-0575-2389); Abelló Gamazo, Alberto (ORCID 0000-0002-3223-2186) |
| dc.subject.none.fl_str_mv |
Big data; Data mining; Algorithms; Data pre-processing pipelines; Data analytics; Dades massives; Mineria de dades; Algorismes; Àrees temàtiques de la UPC::Informàtica::Sistemes d'informació |
| description |
Data pre-processing plays a key role in a data analytics process (e.g., applying a classification algorithm on a predictive task). It encompasses a broad range of activities that span from correcting errors to selecting the most relevant features for the analysis phase. There is no clear evidence, or rules defined, on how pre-processing transformations impact the final results of the analysis. The problem is exacerbated when transformations are combined into pre-processing pipeline prototypes. Data scientists cannot easily foresee the impact of pipeline prototypes and hence require a method to discriminate between them and find the most relevant ones (e.g., those with the highest positive impact) for their study at hand. Once found, these prototypes can be instantiated and optimized, e.g., using Bayesian Optimization. In this work, we study the impact of transformations when chained together into prototypes, and the impact of transformations when instantiated via various operators. We develop and scrutinize a generic method for generating pre-processing pipelines, as a step towards AutoETL. We make use of rules that enable the construction of prototypes (i.e., define the order of transformations), and rules that guide the instantiation of the transformations inside the prototypes (i.e., define the operator for each transformation). Optimizing our effective pipeline prototypes yields results that, compared to an exhaustive search, reach 90% of the predictive accuracy in the median, at a time cost that is 24 times smaller. |
| publishDate |
2022 |
| dc.date.none.fl_str_mv |
2022; 2022-09-01; 2022; 2022-02-15 |
| dc.type.none.fl_str_mv |
journal article (http://purl.org/coar/resource_type/c_6501); AM (http://purl.org/coar/version/c_ab4af688f83e57aa) |
| dc.type.openaire.fl_str_mv |
info:eu-repo/semantics/article |
| format |
article |
| dc.identifier.none.fl_str_mv |
https://hdl.handle.net/2117/362396 https://dx.doi.org/10.1016/j.is.2021.101957 |
| dc.language.none.fl_str_mv |
English; eng |
| dc.relation.none.fl_str_mv |
Agencia Estatal de Investigación (http://doi.org/10.13039/501100011033); Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020; PID2020-117191RB-I00; DESARROLLO, OPERATIVA Y GOBERNANZA DE DATOS PARA SISTEMAS SOFTWARE BASADOS EN APRENDIZAJE AUTOMATICO |
| dc.rights.none.fl_str_mv |
open access (http://purl.org/coar/access_right/c_abf2); Attribution-NonCommercial-NoDerivatives 4.0 International (https://creativecommons.org/licenses/by-nc-nd/4.0/) |
| dc.rights.openaire.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf |
| dc.publisher.none.fl_str_mv |
Elsevier |
| dc.source.none.fl_str_mv |
reponame: UPCommons. Portal del coneixement obert de la UPC; instname: Universitat Politècnica de Catalunya (UPC) |
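
The abstract distinguishes pipeline prototypes (an ordering of transformations) from their instantiation (an operator chosen for each transformation). The following is a minimal sketch of that distinction only, not the authors' method: the transformation names, operator names, ordering rules, and the helper functions `prototypes` and `instantiate` are all hypothetical, and the optimization step (e.g., Bayesian Optimization) is left out.

```python
from itertools import permutations, product

# Hypothetical transformations and candidate operators, for illustration only.
OPERATORS = {
    "impute": ["mean", "median"],
    "encode": ["one_hot", "ordinal"],
    "normalize": ["standard", "min_max"],
}

# Hypothetical ordering rules: (a, b) means transformation a must precede b.
ORDER_RULES = [("impute", "normalize"), ("encode", "normalize")]


def prototypes(transformations, rules):
    """Yield every ordering of the transformations that satisfies all rules."""
    for order in permutations(transformations):
        position = {name: i for i, name in enumerate(order)}
        if all(position[a] < position[b] for a, b in rules):
            yield order


def instantiate(prototype, operators):
    """Yield concrete pipelines: one operator picked for each transformation."""
    choices = [[(step, op) for op in operators[step]] for step in prototype]
    yield from product(*choices)


if __name__ == "__main__":
    for proto in prototypes(OPERATORS, ORDER_RULES):
        pipelines = list(instantiate(proto, OPERATORS))
        print(" -> ".join(proto), f"({len(pipelines)} instantiated pipelines)")
```

In this toy setting the ordering rules narrow six possible orderings down to two prototypes, which illustrates why rule-based pruning can shrink the search space before any operator or parameter tuning is attempted.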