Freyja: Efficient join discovery in data lakes
| Authors: | , , , , , |
|---|---|
| Resource type: | article |
| Publication date: | 2026 |
| Country: | Spain |
| Institution: | Universitat Politècnica de Catalunya (UPC) |
| Repository: | UPCommons. Portal del coneixement obert de la UPC |
| Language: | English |
| OAI Identifier: | oai:upcommons.upc.edu:2117/452744 |
| Online access: | https://hdl.handle.net/2117/452744 https://dx.doi.org/10.1109/TKDE.2026.3656786 |
| Access level: | open access |
| Keywords: | Data discovery; Join discovery; Big data processing; Data lakes; Data profiling; Àrees temàtiques de la UPC::Informàtica::Sistemes d'informació |
| Summary: | We study the problem of efficiently computing rankings of joinable attributes in data lakes. Traditional set-overlap measures produce numerous false positives in this scenario, while modern, more accurate Table Representation Learning (TRL) techniques incur prohibitive computational costs. In contrast to the state of the art, we adopt a novel notion of join quality tailored to data lakes, relying on a metric that combines multiset Jaccard and cardinality proportion. The proposed metric merges the best of both worlds by leveraging syntactic measures while achieving accuracy scores comparable to those of TRL approaches. Generating rankings of joinable pairs is highly scalable at both preparation and query time, since we train a general-purpose predictive model. Predictions are based on data profiles, which are succinct, efficiently computed representations of dataset characteristics. Our experiments show that our system, Freyja, matches or improves upon the results obtained by the state of the art while reducing execution costs by orders of magnitude. |
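
The summary names two syntactic ingredients of the join-quality metric, multiset Jaccard and cardinality proportion, but this record does not give their definitions or say how they are combined. As a rough illustration only, the sketch below computes both quantities for two columns under common textbook definitions, which are assumptions here: multiset Jaccard as the sum of per-value minimum counts over the sum of per-value maximum counts, and cardinality proportion as the smaller distinct-value count divided by the larger. The function names and sample columns are hypothetical and are not taken from Freyja.

```python
from collections import Counter


def multiset_jaccard(a, b):
    """Assumed definition: sum of per-value min counts / sum of per-value max counts."""
    ca, cb = Counter(a), Counter(b)
    keys = set(ca) | set(cb)
    # Counter returns 0 for values missing from one side.
    inter = sum(min(ca[k], cb[k]) for k in keys)
    union = sum(max(ca[k], cb[k]) for k in keys)
    return inter / union if union else 0.0


def cardinality_proportion(a, b):
    """Assumed definition: smaller distinct-value count / larger distinct-value count."""
    da, db = len(set(a)), len(set(b))
    if max(da, db) == 0:
        return 0.0
    return min(da, db) / max(da, db)


# Two hypothetical columns taken from different tables in a data lake.
country_codes = ["ES", "ES", "FR", "DE"]
shipping_codes = ["ES", "FR", "FR", "IT", "PT", "NL"]

mj = multiset_jaccard(country_codes, shipping_codes)
cp = cardinality_proportion(country_codes, shipping_codes)
print(f"multiset Jaccard = {mj:.2f}, cardinality proportion = {cp:.2f}")
```

How the two values feed into a single quality score, and how the predictive model estimates that score from data profiles, is described in the article itself and is not reproduced here.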