An integration data tool for joinable tables based on Apache Spark

Bibliographic Details
Author: Flores Herrera, Javier de Jesús (ORCID: 0000-0002-2998-9962)
Resource type: master's thesis
Publication date: 2020
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/335717
Online access: https://hdl.handle.net/2117/335717
Access level: open access
Keywords: Big data
data discovery
data integration
attribute profiling
random forest
data fusion
joinable attributes
quality join
Dades massives (big data)
Anàlisi de dades (data analysis)
Àrees temàtiques de la UPC::Informàtica (UPC subject areas: Computer science)
Description
Summary: Data analysts carry out exploratory programming in notebooks for a range of analytical tasks. One such task is data discovery, which consists of finding attributes that might join. This is time-consuming, and new techniques are needed to suggest joinable attributes and speed up data analysis. Those attributes should produce high-quality joins, which we define as joins between attributes that share a high number of unique values. In this thesis, we aim to find high-quality joinable attributes through a three-step approach: attribute profiling, classification, and ranking. We create five categorical labels to represent the quality of the join that two attributes might have, and we use a one-vs-the-rest strategy to build the machine learning models. We aim to integrate data discovery with notebooks and well-known data management tools, so we prototype our techniques on top of mature tools for exploratory and large-scale data processing, namely Jupyter and Apache Spark. We ran four experiments with real datasets to validate our approach. The experiments suggest that the approach generalizes to finding high-quality joins regardless of topic. Our solution can reduce the time needed to find joinable attributes, without requiring manual data exploration across multiple datasets.
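
Illustrative note: the summary defines a high-quality join as one between attributes that share a high number of unique values, and lists ranking as the last step. The sketch below is only a minimal illustration of that idea in plain Python, not the thesis's implementation (which is prototyped on Jupyter and Apache Spark and uses learned models). The function names, the containment score, and the toy tables are all assumptions introduced here.

```python
from itertools import combinations


def distinct_overlap(left, right):
    """Return (shared distinct count, containment relative to the smaller side).

    Containment is an assumed score for illustration; the summary only says
    that good joins share a high number of unique values.
    """
    a, b = set(left), set(right)
    shared = len(a & b)
    smaller = min(len(a), len(b))
    containment = shared / smaller if smaller else 0.0
    return shared, containment


def rank_candidate_joins(tables):
    """Score every cross-table attribute pair, highest overlap first.

    `tables` maps a table name to {attribute name: list of values}.
    """
    scored = []
    for (t1, cols1), (t2, cols2) in combinations(tables.items(), 2):
        for c1, v1 in cols1.items():
            for c2, v2 in cols2.items():
                shared, containment = distinct_overlap(v1, v2)
                scored.append(((t1, c1), (t2, c2), shared, containment))
    # Sort by shared distinct values, then by containment, both descending.
    scored.sort(key=lambda row: (row[2], row[3]), reverse=True)
    return scored


# Hypothetical toy data, made up for this example.
toy_tables = {
    "customers": {"id": [1, 2, 3, 4], "country": ["ES", "FR", "DE", "ES"]},
    "orders": {"customer_id": [1, 2, 2, 3], "ship_country": ["ES", "FR", "IT"]},
}

for left, right, shared, containment in rank_candidate_joins(toy_tables):
    print(left, right, shared, round(containment, 2))
```

On real data, a set-based comparison like this would likely be replaced by distributed profiling, since collecting every distinct value per attribute does not scale to large tables.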