Measuring language distance for historical texts in Basque

Bibliographic Details
Authors: Estarrona Ibarloza, Ainara, Etxeberria Uztarroz, Izaskun, Padilla Moyano, Manuel, Soraluze Irureta, Ander
Resource type: article
Publication date: 2023
Country: Spain
Institution: Universidad del País Vasco
Repository: Addi. Archivo Digital para la Docencia y la Investigación
OAI Identifier: oai:addi.ehu.eus:10810/69932
Online access: http://hdl.handle.net/10810/69932
Access level: open access
Keywords: language distance
dialectology
Basque dialects
historical texts
perplexity
Description
Summary: Measuring the distance between languages, dialects and language varieties, both synchronically and diachronically, is a topic of growing interest in NLP. Based on our Syntactically Annotated Historical COrpus in BAsque (SAHCOBA) and on previous work on the perplexity-based language distance proposed by Gamallo, Pichel and Alegria (2017, 2020), we have compared historical corpora with current texts in the standard variety and calculated the language distances between them. As standard Basque is based on the central dialects, the starting hypothesis is that the oldest texts and the dialects at the extremes will be the most distant. The results obtained have largely confirmed the thesis of traditional dialectology: peripheral dialects show a strong idiosyncrasy and are more distant from the rest.
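
As an illustration of what a perplexity-based distance between two text samples can look like, here is a minimal Python sketch. It is not the authors' implementation: the character trigram order, the add-one smoothing, and the choice to average the two directions are assumptions made for this example, and the function names (`train`, `perplexity`, `distance`) are hypothetical. The record does not state the settings used in the cited work.

```python
import math
from collections import Counter


def ngrams(text, n):
    # Pad on the left so every character has a full (n-1)-character context.
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


def train(text, n):
    # Count character n-grams and their (n-1)-character contexts.
    grams = ngrams(text, n)
    gram_counts = Counter(grams)
    context_counts = Counter(g[:-1] for g in grams)
    return gram_counts, context_counts, set(text)


def perplexity(model, text, n):
    # Perplexity of a non-empty `text` under `model`, with add-one smoothing
    # (an assumption of this sketch, chosen for brevity).
    gram_counts, context_counts, vocab = model
    v = len(vocab) + 1  # one extra slot for characters unseen in training
    test = ngrams(text, n)
    log_sum = 0.0
    for g in test:
        p = (gram_counts[g] + 1) / (context_counts[g[:-1]] + v)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(test))


def distance(a, b, n=3):
    # Symmetric distance: mean of the perplexities in both directions
    # (assumed formulation for this example).
    return (perplexity(train(a, n), b, n) + perplexity(train(b, n), a, n)) / 2


if __name__ == "__main__":
    # Toy strings only; these are not SAHCOBA data.
    sample_a = "the cat sat on the mat " * 5
    sample_b = "zxq vwk jjp " * 5
    print(distance(sample_a, sample_a), distance(sample_a, sample_b))
```

Under this sketch, a text compared with itself yields a lower value than a text compared with an unrelated one, which is the intuition behind using perplexity as a proxy for distance between varieties.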