Using confidence and informativeness criteria to improve POS-tagging in Amazigh

Bibliographic Details
Authors: Outahajala, Mohamed; Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo
Resource type: article
Publication date: 2015
Country: Spain
Institution: Universitat Politècnica de València (UPV)
Repository: RiuNet. Repositorio Institucional de la Universitat Politècnica de València
Language: English
OAI Identifier: oai:riunet.upv.es:10251/63906
Online access: https://riunet.upv.es/handle/10251/63906
Access level: open access
Keywords: POS-tagging
Amazigh
Conditional random fields
Support vector machines
Out of vocabulary
Self training
Computer languages and systems
Description
Summary: Amazigh is used by tens of millions of people, mainly for oral communication. However, like all languages newly investigated in natural language processing, it is resource-scarce. The main aim of this paper is to present the results of our POS taggers, which are based on two state-of-the-art sequence labeling techniques, namely Conditional Random Fields and Support Vector Machines, and which make use of a small manually annotated corpus of only 20k tokens. Since creating labeled data is a very time-consuming task while obtaining unlabeled data is less so, we gathered a set of unlabeled Amazigh data, which we preprocessed and tokenized. The paper also addresses the use of semi-supervised techniques to improve POS tagging accuracy. We used an adapted self-training algorithm that combines a confidence measure with a function of out-of-vocabulary (OOV) words to select data for self-training. Using this language-independent method, we obtained encouraging results.
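
The selection step described in the summary, combining a confidence measure with a function of out-of-vocabulary words, might look roughly like the sketch below. This is an illustration only, not the authors' algorithm: the record does not give the actual confidence measure or OOV function, so the minimum-token-confidence filter, the OOV-rate ranking, the names `oov_rate` and `select_for_self_training`, the threshold, the batch size, and the placeholder tokens are all assumptions.

```python
from typing import List, Set, Tuple

# One automatically tagged sentence: (tokens, predicted tags, per-token confidences).
TaggedSentence = Tuple[List[str], List[str], List[float]]


def oov_rate(tokens: List[str], vocabulary: Set[str]) -> float:
    """Fraction of tokens that do not occur in the labeled training data."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)


def select_for_self_training(
    candidates: List[TaggedSentence],
    vocabulary: Set[str],
    min_confidence: float = 0.9,  # hypothetical threshold, not from the paper
    top_k: int = 2,               # hypothetical batch size, not from the paper
) -> List[TaggedSentence]:
    """Keep confidently tagged sentences, then prefer those with more OOV words."""
    # Confidence criterion (assumed): every token must be tagged confidently.
    confident = [s for s in candidates if s[0] and min(s[2]) >= min_confidence]
    # Informativeness criterion (assumed): rank by OOV rate, highest first.
    confident.sort(key=lambda s: oov_rate(s[0], vocabulary), reverse=True)
    return confident[:top_k]


# Placeholder data: tokens and tags are dummies, not real Amazigh text.
vocabulary = {"a", "b", "c"}
candidates = [
    (["a", "b"], ["X", "Y"], [0.99, 0.98]),  # confident, OOV rate 0.0
    (["a", "z"], ["X", "Y"], [0.95, 0.93]),  # confident, OOV rate 0.5
    (["q", "r"], ["X", "Y"], [0.97, 0.40]),  # one low-confidence token: rejected
]
selected = select_for_self_training(candidates, vocabulary, top_k=1)
```

In a full self-training loop, the selected sentences would be added to the labeled corpus with their predicted tags and the tagger retrained, repeating until no candidates pass the filter. Filtering on confidence first limits the noise introduced by wrong automatic tags, while ranking by OOV rate favors sentences that extend the tagger's vocabulary coverage.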