Measuring language distance for historical texts in Basque

Bibliographic details
Authors: Estarrona Ibarloza, Ainara; Etxeberria Uztarroz, Izaskun; Padilla Moyano, Manuel; Soraluze Irureta, Ander
Document type: article
Publication date: 2023
Country: Spain
Resources: Universidad del País Vasco
Repository: Addi. Archivo Digital para la Docencia y la Investigación
OAI Identifier: oai:addi.ehu.eus:10810/69932
Online access: http://hdl.handle.net/10810/69932
Access level: Open access
Keywords: language distance
dialectology
basque dialects
historical texts
perplexity
Description
Abstract: Measuring the distance between languages, dialects and language varieties, both synchronically and diachronically, is a topic of growing interest in NLP. Building on our Syntactically Annotated Historical COrpus in BAsque (SAHCOBA) and on previous work on perplexity-based language distance by Gamallo, Pichel and Alegria (2017, 2020), we compared historical corpora with current texts in the standard variety and calculated the language distances between them. Because standard Basque is based on the central dialects, our starting hypothesis was that the oldest texts and the dialects at the extremes would be the most distant. The results largely confirm the thesis of traditional dialectology: peripheral dialects show strong idiosyncrasy and are more distant from the rest.
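
To illustrate what a perplexity-based distance between two text samples can look like, here is a minimal sketch. It is not the authors' implementation: the character trigram model, the add-one smoothing, the symmetric averaging of the two perplexities, and all function names (`char_ngrams`, `train`, `perplexity`, `distance`) are assumptions chosen for brevity, and the setup in the cited work may differ.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return one character n-gram per position, left-padded with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


def train(text, n=3):
    """Count n-grams and their (n-1)-character contexts in a training text."""
    ngrams = Counter(char_ngrams(text, n))
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    return ngrams, contexts, set(text)


def perplexity(model, text, n=3):
    """Perplexity of `text` under `model`, using add-one smoothing (an assumption)."""
    if not text:
        raise ValueError("text must not be empty")
    ngrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot reserved for unseen characters
    grams = char_ngrams(text, n)
    log_sum = 0.0
    for gram in grams:
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + v)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(grams))


def distance(a, b, n=3):
    """Symmetric distance: mean of the two cross-perplexities (an assumption)."""
    return (perplexity(train(a, n), b, n) + perplexity(train(b, n), a, n)) / 2


if __name__ == "__main__":
    # Toy strings only; these are not data from SAHCOBA.
    sample_a = "the cat sat on the mat and the cat ate"
    sample_b = "the cat sat on the hat and the rat ate"
    sample_c = "zxqv jkwp zzqx vvkj pqwz xjkq"
    print(distance(sample_a, sample_b))
    print(distance(sample_a, sample_c))
```

In this sketch a lower value means the two samples share more character sequences. With real corpora, the samples would need comparable size and spelling normalization before the values could be meaningfully compared.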