Test-driving information theory-based compositional distributional semantics: A case study on Spanish song lyrics

Bibliographic Details
Authors: Ghajari Espinosa, Adrián; Benito Santos, Alejandro; Ros Muñoz, Salvador; Fresno Fernández, Víctor Diego; González-Blanco García, Elena
Resource type: article
Publication date: 2025
Country: Spain
Institution: Universidad Nacional de Educación a Distancia
Repository: e-spacio. Repositorio Institucional de la UNED
Language: English
OAI Identifier: oai:e-spacio.uned.es:20.500.14468/26536
Online access: https://hdl.handle.net/20.500.14468/26536
Access level: open access
Keywords: 33 Technological Sciences
compositional distributional semantics
semantic textual similarity
word embeddings
song lyrics
Description
Summary: Song lyrics pose unique challenges for semantic similarity assessment due to their metaphorical language, structural patterns, and cultural nuances, all of which often challenge standard natural language processing (NLP) approaches. These challenges stem from a tension between compositional and distributional semantics: while lyrics follow compositional structures, their meaning depends heavily on context and interpretation. The Information Theory-based Compositional Distributional Semantics framework offers a principled approach by integrating information theory with compositional rules and distributional representations. We evaluate eight embedding models on Spanish song lyrics, including multilingual, monolingual contextual, and static embeddings. Results show that multilingual models consistently outperform monolingual alternatives, with the domain-adapted ALBERTI achieving the highest F1 macro scores (78.92 ± 10.86). Our analysis reveals that monolingual models generate highly anisotropic embedding spaces, significantly impacting performance with traditional metrics. The Information Contrast Model metric proves particularly effective, providing improvements of up to 18.04 percentage points over cosine similarity. Additionally, composition functions maintaining longer accumulated vector norms consistently outperform standard averaging approaches. Our findings have important implications for NLP applications and challenge standard practices in similarity calculation, showing that effectiveness varies with both task nature and model characteristics.
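Illustrative note: two ideas in the summary can be made concrete with a small sketch. It is not the authors' code; the toy vectors and the helper names (`compose_sum`, `compose_mean`, `mean_pairwise_cosine`) are assumptions introduced here for illustration. The sketch shows that summing word vectors keeps a longer accumulated norm than averaging them, that cosine similarity cannot see this difference, and that mean pairwise cosine similarity is one common rough proxy for how anisotropic an embedding space is.

```python
import math
from itertools import combinations


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm."""
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine similarity; ignores vector length by construction."""
    return dot(u, v) / (norm(u) * norm(v))


def compose_sum(vectors):
    """Hypothetical composition: component-wise sum (norm accumulates)."""
    return [sum(component) for component in zip(*vectors)]


def compose_mean(vectors):
    """Hypothetical composition: component-wise average (norm shrinks)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def mean_pairwise_cosine(vectors):
    """Rough anisotropy proxy: values near 1 mean vectors share a direction."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy "word embeddings" (made-up values, not from any evaluated model).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

summed = compose_sum(words)
averaged = compose_mean(words)

print("norm of sum: ", norm(summed))
print("norm of mean:", norm(averaged))
print("cosine(sum, mean):", cosine(summed, averaged))
print("mean pairwise cosine:", mean_pairwise_cosine(words))
```

Because the sum and the mean point in the same direction, cosine similarity treats them as identical, while a norm-sensitive measure can distinguish them. This gives one intuition for why the choice of composition function and similarity metric can interact.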