Using confidence and informativeness criteria to improve POS-tagging in Amazigh

Amazigh is used by tens of millions of people, mainly for oral communication. However, like all languages newly investigated in natural language processing, it is resource-scarce. The main aim of this paper is to present our POS tagger results based on two state-of-the-art sequence labeling...

Bibliographic Details
Authors: Outahajala, Mohamed; Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo
Resource type: article
Publication date: 2015
Country: Spain
Institution: Universitat Politècnica de València (UPV)
Repository: RiuNet. Repositorio Institucional de la Universitat Politècnica de València
Language: English
OAI Identifier: oai:riunet.upv.es:10251/63906
Online access: https://riunet.upv.es/handle/10251/63906
Access Level: open access
Keywords: POS-tagging
Amazigh
Conditional random fields
Support vector machines
Out of vocabulary
Self training
LENGUAJES Y SISTEMAS INFORMATICOS
id ES_b2cd7fd5e6bfeec85362e02b9b36c726
acknowledgements The first author wants to grant CODESRIA. The work of the third author was carried out in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and the European Commission WIQ-EI IRSES (no. 269180) and DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research projects.
dc.contributor.none.fl_str_mv Departamento de Sistemas Informáticos y Computación
Escuela Técnica Superior de Ingeniería Informática
Centro de Investigación Pattern Recognition and Human Language Technology
VLC/CAMPUS
Repositorio Institucional de la Universitat Politècnica de València Riunet
description Amazigh is used by tens of millions of people, mainly for oral communication. However, like all languages newly investigated in natural language processing, it is resource-scarce. The main aim of this paper is to present our POS tagger results based on two state-of-the-art sequence labeling techniques, namely Conditional Random Fields and Support Vector Machines, using a small manually annotated corpus of only 20k tokens. Since creating labeled data is a very time-consuming task while obtaining unlabeled data is less so, we gathered a set of unlabeled Amazigh data, which we preprocessed and tokenized. The paper also addresses the use of semi-supervised techniques to improve POS tagging accuracy. We used an adapted self-training algorithm that combines a confidence measure with a function of out-of-vocabulary words to select data for self-training. Using this language-independent method, we obtained encouraging results.
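The selection step described in the abstract can be illustrated with a minimal Python sketch. The record does not give the paper's actual confidence measure or out-of-vocabulary (OOV) function, so everything here is an assumption made for illustration: the names `oov_ratio`, `selection_score` and `select_for_self_training`, the use of mean per-token confidence, the multiplicative OOV bonus, and the values of `min_confidence`, `oov_weight` and `top_k` are all hypothetical.

```python
from typing import List, Set, Tuple

# One automatically tagged sentence: (token, predicted_tag, tagger_confidence).
# The confidence values are assumed to come from the tagger (hypothetical input).
TaggedSentence = List[Tuple[str, str, float]]


def oov_ratio(tokens: List[str], vocabulary: Set[str]) -> float:
    """Fraction of tokens not seen in the labeled training vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok not in vocabulary) / len(tokens)


def mean_confidence(sentence: TaggedSentence) -> float:
    """Average per-token confidence of the predicted tags."""
    return sum(conf for _, _, conf in sentence) / len(sentence)


def selection_score(sentence: TaggedSentence, vocabulary: Set[str],
                    oov_weight: float = 0.5) -> float:
    """Assumed scoring rule: confidence, boosted for sentences rich in OOV words."""
    tokens = [tok for tok, _, _ in sentence]
    return mean_confidence(sentence) * (1.0 + oov_weight * oov_ratio(tokens, vocabulary))


def select_for_self_training(tagged: List[TaggedSentence], vocabulary: Set[str],
                             min_confidence: float = 0.9,
                             top_k: int = 2) -> List[TaggedSentence]:
    """Keep confidently tagged sentences, then prefer the most informative ones."""
    # Confidence criterion: drop empty sentences and those tagged with low confidence.
    candidates = [s for s in tagged if s and mean_confidence(s) >= min_confidence]
    # Informativeness criterion: rank the survivors by the OOV-aware score.
    ranked = sorted(candidates, key=lambda s: selection_score(s, vocabulary), reverse=True)
    return ranked[:top_k]
```

In a self-training loop of this kind, the selected sentences would be added to the labeled set with their predicted tags and the tagger retrained; the sketch covers only the selection step.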
dc.date.none.fl_str_mv 2015
2015-01-01
dc.type.none.fl_str_mv journal article
http://purl.org/coar/resource_type/c_6501
VoR
http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.openaire.fl_str_mv info:eu-repo/semantics/article
dc.relation.none.fl_str_mv Ministerio de Economía y Competitividad http://dx.doi.org/10.13039/501100003329 TIN2012-38603-C02-01 DIANA-APPLICATIONS: FINDING HIDDEN KNOWLEDGE IN TEXTS: APPLICATIONS
dc.rights.none.fl_str_mv open access
http://purl.org/coar/access_right/c_abf2
All rights reserved
http://rightsstatements.org/vocab/InC/1.0/
dc.rights.openaire.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv IOS Press