Using confidence and informativeness criteria to improve POS-tagging in Amazigh

Outahajala, Mohamed; Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo


Amazigh is used by tens of millions of people, mainly for oral communication. However, like all languages newly investigated in natural language processing, it is resource-scarce. The main aim of this paper is to present the results of our POS taggers, which are based on two state-of-the-art sequence labeling...


Bibliographic Details
Authors:	Outahajala, Mohamed; Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo
Resource type:	article
Publication date:	2015
Country:	Spain
Institution:	Universitat Politècnica de València (UPV)
Repository:	RiuNet. Repositorio Institucional de la Universitat Politècnica de València
Language:	English
OAI Identifier:	oai:riunet.upv.es:10251/63906
Online access:	https://riunet.upv.es/handle/10251/63906
Access Level:	open access
Keywords:	POS-tagging; Amazigh; Conditional random fields; Support vector machines; Out of vocabulary; Self training; LENGUAJES Y SISTEMAS INFORMATICOS

id	ES_b2cd7fd5e6bfeec85362e02b9b36c726
oai_identifier_str	oai:riunet.upv.es:10251/63906
network_acronym_str	ES
network_name_str	Spain
repository_id_str
acknowledgments	The first author wants to grant CODESRIA. The work of the third author was carried out in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and the European Commission WIQ-EI IRSES (no. 269180) and DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research projects.
dc.title.none.fl_str_mv	Using confidence and informativeness criteria to improve POS-tagging in Amazigh
dc.creator.none.fl_str_mv	Outahajala, Mohamed; Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo
author	Outahajala, Mohamed
author_facet	Outahajala, Mohamed; Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo
author_role	author
author2	Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo
author2_role	author author author
dc.contributor.none.fl_str_mv	Departamento de Sistemas Informáticos y Computación; Escuela Técnica Superior de Ingeniería Informática; Centro de Investigación Pattern Recognition and Human Language Technology; VLC/CAMPUS; Repositorio Institucional de la Universitat Politècnica de València Riunet
dc.subject.none.fl_str_mv	POS-tagging; Amazigh; Conditional random fields; Support vector machines; Out of vocabulary; Self training; LENGUAJES Y SISTEMAS INFORMATICOS
topic	POS-tagging; Amazigh; Conditional random fields; Support vector machines; Out of vocabulary; Self training; LENGUAJES Y SISTEMAS INFORMATICOS
description	Amazigh is used by tens of millions of people, mainly for oral communication. However, like all languages newly investigated in natural language processing, it is resource-scarce. The main aim of this paper is to present the results of our POS taggers, which are based on two state-of-the-art sequence labeling techniques, namely Conditional Random Fields and Support Vector Machines, and trained on a small manually annotated corpus of only 20k tokens. Since creating labeled data is a very time-consuming task while obtaining unlabeled data is less so, we gathered a set of unlabeled Amazigh data, which we preprocessed and tokenized. The paper also addresses the use of semi-supervised techniques to improve POS tagging accuracy. We used an adapted self-training algorithm that combines a confidence measure with a function of out-of-vocabulary words to select data for self-training. Using this language-independent method, we obtained encouraging results.
publishDate	2015
dc.date.none.fl_str_mv	2015 2015-01-01
dc.type.none.fl_str_mv	journal article http://purl.org/coar/resource_type/c_6501 VoR http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.openaire.fl_str_mv	info:eu-repo/semantics/article
format	article
dc.identifier.none.fl_str_mv	https://riunet.upv.es/handle/10251/63906
url	https://riunet.upv.es/handle/10251/63906
dc.language.none.fl_str_mv	English eng
language_invalid_str_mv	English
language	eng
dc.relation.none.fl_str_mv	Ministerio de Economía y Competitividad http://dx.doi.org/10.13039/501100003329 TIN2012-38603-C02-01 DIANA-APPLICATIONS: FINDING HIDDEN KNOWLEDGE IN TEXTS: APPLICATIONS
dc.rights.none.fl_str_mv	open access http://purl.org/coar/access_right/c_abf2 All rights reserved http://rightsstatements.org/vocab/InC/1.0/
dc.rights.openaire.fl_str_mv	info:eu-repo/semantics/openAccess
rights_invalid_str_mv	open access http://purl.org/coar/access_right/c_abf2 All rights reserved http://rightsstatements.org/vocab/InC/1.0/
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf application/pdf
dc.publisher.none.fl_str_mv	IOS Press
publisher.none.fl_str_mv	IOS Press
dc.source.none.fl_str_mv	reponame:RiuNet. Repositorio Institucional de la Universitat Politècnica de València instname:Universitat Politècnica de València (UPV)
instname_str	Universitat Politècnica de València (UPV)
reponame_str	RiuNet. Repositorio Institucional de la Universitat Politècnica de València
collection	RiuNet. Repositorio Institucional de la Universitat Politècnica de València
repository.name.fl_str_mv
repository.mail.fl_str_mv
_version_	1869417085329735680
score	15,300724
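
The abstract describes selecting unlabeled sentences for self-training by combining a confidence measure with a function of out-of-vocabulary (OOV) words, but this record does not give the exact formula. The Python sketch below is one possible reading, not the authors' method: the names `oov_rate`, `selection_score`, `select_for_self_training`, the weight `alpha`, the threshold `min_confidence`, and the multiplicative combination are all assumptions made for illustration. The tagger is any callable that returns predicted tags and a confidence in [0, 1], such as a wrapper around a CRF or SVM tagger.

```python
from typing import Callable, List, Sequence, Set, Tuple

Sentence = Sequence[str]
# Assumed tagger interface: sentence -> (predicted tags, confidence in [0, 1]).
Tagger = Callable[[Sentence], Tuple[List[str], float]]


def oov_rate(sentence: Sentence, vocabulary: Set[str]) -> float:
    """Fraction of tokens that do not occur in the labeled training vocabulary."""
    if not sentence:
        return 0.0
    return sum(tok not in vocabulary for tok in sentence) / len(sentence)


def selection_score(confidence: float, oov: float, alpha: float = 0.5) -> float:
    """Hypothetical combination: favor confident sentences that add new words."""
    return confidence * (1.0 + alpha * oov)


def select_for_self_training(
    unlabeled: Sequence[Sentence],
    tagger: Tagger,
    vocabulary: Set[str],
    min_confidence: float = 0.9,
    top_k: int = 100,
    alpha: float = 0.5,
) -> List[Tuple[List[str], List[str]]]:
    """Return up to top_k (tokens, predicted tags) pairs to add to the training set."""
    candidates = []
    for sent in unlabeled:
        tags, conf = tagger(sent)
        if conf < min_confidence:
            continue  # discard sentences the tagger is unsure about
        score = selection_score(conf, oov_rate(sent, vocabulary), alpha)
        candidates.append((score, list(sent), tags))
    # Highest combined score first; ties keep their original order.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(sent, tags) for _, sent, tags in candidates[:top_k]]
```

In a full self-training loop, the selected pairs would be appended to the labeled corpus, the tagger retrained, and the vocabulary updated before the next round. With `alpha = 0`, the selection reduces to plain confidence-based self-training.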
