A framework to operationalize and automate the data integration lifecycle

Bibliographic Details
Author: Flores Herrera, Javier de Jesús
Resource type: doctoral thesis
Status: Published version
Publication date: 2025
Country: Spain
Institution: CBUC, CESCA
Repository: TDR. Tesis Doctorales en Red
OAI Identifier: oai:www.tdx.cat:10803/695267
Online access: http://hdl.handle.net/10803/695267
https://dx.doi.org/10.5821/dissertation-2117-442278
Access level: open access
Keywords: Data Integration
Data Discovery
Knowledge Graphs
Data Wrangling
Àrees temàtiques de la UPC::Informàtica
004 - Informàtica
Description
Summary: (English) Data plays a key role in today’s world. Many organizations collect and store massive amounts of data from many different data sources. As a result, these data collections show a diversity in structure and semantics that grows as the data sources expand and evolve. These factors challenge traditional data management methods, which depend on fixed structures and stable conditions. There is a mismatch between old assumptions and new realities, where it is not enough to just collect data and run conventional tools. Instead, we must rethink how we integrate data to support high variety, handle large-scale collections, and accommodate newly available data. This PhD thesis proposes innovative and advanced techniques to support and automate the data integration lifecycle. First, we describe how to represent and standardize data sources using graph-based schemas. These schemas provide a solid foundation for all steps of the data integration lifecycle. Next, we introduce an integration method that leverages graph-based schemas to add new data incrementally without disrupting existing integration structures. This approach ensures that data integration remains flexible and scalable as organizations grow. We also help users find the right datasets to integrate. By focusing on data discovery, we reduce the time spent exploring irrelevant data sources and suggest relevant ones for integration. To this end, we focus first on facilitating the discovery of joinable attributes among datasets. We propose a new qualitative metric and use data profiles and learning models to decide which attributes are worth joining. To further enhance data discovery, we introduce contextual pre-filtering. Using data profiles and graph-based schemas, we can focus on promising datasets before applying data discovery tools. This pre-filtering step not only boosts the accuracy of existing data discovery tools but also optimizes their performance by narrowing the search space.
In summary, this thesis helps bridge the gap between conventional data methods and modern, diverse data ecosystems. The results contribute to the field of data integration by offering scalable and automated solutions that match the changing needs of data integration today.