A framework to operationalize and automate the data integration lifecycle

Flores Herrera, Javier de Jesús


Bibliographic Details
Author:	Flores Herrera, Javier de Jesús
Format:	doctoral thesis
Status:	Published version
Publication Date:	2025
Country:	Spain
Institution:	CBUC, CESCA
Repository:	TDR. Tesis Doctorales en Red
OAI Identifier:	oai:www.tdx.cat:10803/695267
Online Access:	http://hdl.handle.net/10803/695267 https://dx.doi.org/10.5821/dissertation-2117-442278
Access Level:	Open access
Keyword:	Data Integration; Data Discovery; Knowledge Graphs; Data Wrangling; Àrees temàtiques de la UPC::Informàtica; 004 - Informàtica

Description
Summary:	Data plays a key role in today’s world. Many organizations collect and store massive amounts of data from many different data sources. As a result, these data collections show a diversity in structure and semantics that grows as the data sources expand and evolve. These factors challenge traditional data management methods, which depend on fixed structures and stable conditions. There is a mismatch between old assumptions and new realities, where it is not enough to just collect data and run conventional tools. Instead, we must rethink how we integrate data to support high variety, handle large-scale collections, and accommodate newly available data. This PhD thesis proposes techniques to support and automate the data integration lifecycle. First, we describe how to represent and standardize data sources using graph-based schemas. These schemas provide a solid foundation for all steps of the data integration lifecycle. Next, we introduce an integration method that leverages graph-based schemas to add new data incrementally without disrupting existing integration structures. This approach ensures that data integration remains flexible and scalable as organizations grow. We also help users find the right datasets to integrate. By focusing on data discovery, we reduce the time spent exploring irrelevant data sources and suggest relevant ones for integration. To this end, we focus first on facilitating the discovery of joinable attributes among datasets. We propose a new qualitative metric and use data profiles and learning models to decide which attributes are worth joining. To further enhance data discovery, we introduce contextual pre-filtering. Using data profiles and graph-based schemas, we can focus on promising datasets before applying data discovery tools. This pre-filtering step not only boosts the accuracy of existing data discovery tools but also improves their performance by narrowing the search space.
In summary, this thesis helps bridge the gap between conventional data methods and modern, diverse data ecosystems. The results contribute to the field of data integration by offering scalable and automated solutions that match the changing needs of data integration today.


Similar Items