A framework to operationalize and automate the data integration lifecycle
| Author: | |
|---|---|
| Resource type: | doctoral thesis |
| Status: | Published version |
| Publication date: | 2025 |
| Country: | Spain |
| Institution: | CBUC, CESCA |
| Repository: | TDR. Tesis Doctorales en Red |
| OAI Identifier: | oai:www.tdx.cat:10803/695267 |
| Online access: | http://hdl.handle.net/10803/695267 https://dx.doi.org/10.5821/dissertation-2117-442278 |
| Access level: | open access |
| Keywords: | Data Integration; Data Discovery; Knowledge Graphs; Data Wrangling; Àrees temàtiques de la UPC::Informàtica; 004 - Informàtica |
| Summary: | (English) Data plays a key role in today’s world. Many organizations collect and store massive amounts of data from many different data sources. As a result, these data collections show a diversity in structure and semantics that grows as the data sources expand and evolve. These factors challenge traditional data management methods, which depend on fixed structures and stable conditions. There is a mismatch between old assumptions and new realities: it is no longer enough to collect data and run conventional tools. Instead, we must rethink how we integrate data to support high variety, handle large-scale collections, and accommodate newly available data. This PhD thesis proposes techniques to support and automate the data integration lifecycle. First, we describe how to represent and standardize data sources using graph-based schemas. These schemas provide a solid foundation for all steps of the data integration lifecycle. Next, we introduce an integration method that leverages graph-based schemas to add new data incrementally without disrupting existing integration structures. This approach keeps data integration flexible and scalable as organizations grow. We also help users find the right datasets to integrate. By focusing on data discovery, we reduce the time spent exploring irrelevant data sources and suggest relevant ones for integration. To this end, we first facilitate the discovery of joinable attributes among datasets. We propose a new qualitative metric and use data profiles and learning models to decide which attributes are worth joining. To further enhance data discovery, we introduce contextual pre-filtering. Using data profiles and graph-based schemas, we can focus on promising datasets before applying data discovery tools. This pre-filtering step not only boosts the accuracy of existing data discovery tools but also improves their performance by narrowing the search space. In summary, this thesis helps bridge the gap between conventional data methods and modern, diverse data ecosystems. The results contribute to the field of data integration by offering scalable and automated solutions that match the changing needs of data integration today. |