Freyja: Efficient join discovery in data lakes


Bibliographic Details
Authors: Maynou Yelamos, Marc; Nadal Francesch, Sergi (ORCID: 0000-0002-8565-952X); Panadero Palenzuela, Raquel; Flores Herrera, Javier de Jesús (ORCID: 0000-0002-2998-9962); Romero Moral, Óscar (ORCID: 0000-0001-6350-8328); Queralt Calafat, Anna (ORCID: 0000-0003-2782-2955)
Resource type: article
Publication date: 2026
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/452744
Online access: https://hdl.handle.net/2117/452744
https://dx.doi.org/10.1109/TKDE.2026.3656786
Access level: open access
Keywords: Data discovery
Join discovery
Big data processing
Data lakes
Data profiling
UPC subject areas::Computer science::Information systems
Description
Summary: We study the problem of efficiently computing rankings of joinable attributes in data lakes. Traditional set-overlap measures produce numerous false positives in this scenario, while modern, more accurate Table Representation Learning (TRL) techniques incur prohibitive computational costs. In contrast to the state of the art, we adopt a novel notion of join quality tailored to data lakes, relying on a metric that combines multiset Jaccard and cardinality proportion. The proposed metric merges the best of both worlds by leveraging syntactic measures while achieving accuracy scores comparable to those of TRL approaches. Generating rankings of joinable pairs is highly scalable at both preparation and query time, since we train a general-purpose predictive model. Predictions are based on data profiles, succinct and efficiently computed representations of dataset characteristics. Our experiments show that our system, Freyja, matches or improves upon the results obtained by the state of the art while reducing execution costs by orders of magnitude.
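
Illustrative sketch: the summary names two syntactic ingredients of the join quality metric, multiset Jaccard and cardinality proportion, without defining them in this record. The Python sketch below shows one common reading of each, as an aid to intuition only. The function names, the min/max-count formulation of multiset Jaccard, and the use of distinct-value counts for cardinality proportion are all assumptions made here; the paper's exact definitions, and the way it combines the two measures, may differ and are not reproduced.

```python
from collections import Counter


def multiset_jaccard(a, b):
    """Overlap of two columns treated as multisets (bags) of values.

    Assumed formulation: sum of per-value minimum counts divided by the
    sum of per-value maximum counts. Other definitions exist.
    """
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())  # per-value min of counts
    union = sum((ca | cb).values())         # per-value max of counts
    return intersection / union if union else 0.0


def cardinality_proportion(a, b):
    """Ratio of the smaller to the larger number of distinct values.

    Assumed formulation; returns 0.0 when both columns are empty.
    """
    da, db = len(set(a)), len(set(b))
    largest = max(da, db)
    return min(da, db) / largest if largest else 0.0


# Hypothetical columns from two datasets in a data lake.
countries_a = ["es", "es", "fr", "de"]
countries_b = ["es", "fr", "fr", "it", "pt"]

overlap = multiset_jaccard(countries_a, countries_b)
proportion = cardinality_proportion(countries_a, countries_b)
```

Both values lie in [0, 1] under these assumed definitions. The sketch keeps them as separate numbers because the record does not state how they are combined into a single quality score.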