Using confidence and informativeness criteria to improve POS-tagging in amazigh
Amazigh is used by tens of millions of people, mainly for oral communication. However, like all languages newly investigated in natural language processing, it is resource-scarce. The main aim of this paper is to present the results of our POS taggers, which are based on two state-of-the-art sequence labeling...
| Authors: | Outahajala, Mohamed; Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo |
|---|---|
| Resource type: | article |
| Publication date: | 2015 |
| Country: | Spain |
| Institution: | Universitat Politècnica de València (UPV) |
| Repository: | RiuNet. Repositorio Institucional de la Universitat Politècnica de València |
| Language: | English |
| OAI Identifier: | oai:riunet.upv.es:10251/63906 |
| Online access: | https://riunet.upv.es/handle/10251/63906 |
| Access Level: | open access |
| Keywords: | POS-tagging; Amazigh; Conditional random fields; Support vector machines; Out of vocabulary; Self training; LENGUAJES Y SISTEMAS INFORMATICOS |
| id |
ES_b2cd7fd5e6bfeec85362e02b9b36c726 |
|---|---|
| oai_identifier_str |
oai:riunet.upv.es:10251/63906 |
| network_acronym_str |
ES |
| network_name_str |
Spain |
| repository_id_str |
|
| spelling |
Using confidence and informativeness criteria to improve POS-tagging in amazigh; Outahajala, Mohamed; Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo; POS-tagging; Amazigh; Conditional random fields; Support vector machines; Out of vocabulary; Self training; LENGUAJES Y SISTEMAS INFORMATICOS; Amazigh is used by tens of millions of people, mainly for oral communication. However, like all languages newly investigated in natural language processing, it is resource-scarce. The main aim of this paper is to present the results of our POS taggers, which are based on two state-of-the-art sequence labeling techniques, namely Conditional Random Fields and Support Vector Machines, using a small manually annotated corpus of only 20k tokens. Since creating labeled data is a very time-consuming task while obtaining unlabeled data is less so, we gathered a set of unlabeled Amazigh data, which we preprocessed and tokenized. The paper also addresses the use of semi-supervised techniques to improve POS tagging accuracy. An adapted self-training algorithm, which combines a confidence measure with a function of out-of-vocabulary (OOV) words to select data for self-training, has been used. Using this language-independent method, we have managed to obtain encouraging results. The first author wants to grant CODESRIA. The work of the third author was carried out in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and the European Commission WIQ-EI IRSES (no. 269180) and DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research projects; IOS Press; Departamento de Sistemas Informáticos y Computación; Escuela Técnica Superior de Ingeniería Informática; Centro de Investigación Pattern Recognition and Human Language Technology; VLC/CAMPUS; Repositorio Institucional de la Universitat Politècnica de València Riunet; 2015; 2015-01-01; journal article; http://purl.org/coar/resource_type/c_6501; VoR; http://purl.org/coar/version/c_970fb48d4fbd8a85; info:eu-repo/semantics/article; application/pdf; application/pdf; https://riunet.upv.es/handle/10251/63906; reponame:RiuNet. Repositorio Institucional de la Universitat Politècnica de València; instname:Universitat Politècnica de València (UPV); English; eng; Ministerio de Economía y Competitividad; http://dx.doi.org/10.13039/501100003329; TIN2012-38603-C02-01; DIANA-APPLICATIONS: FINDING HIDDEN KNOWLEDGE IN TEXTS: APPLICATIONS; open access; http://purl.org/coar/access_right/c_abf2; All rights reserved; http://rightsstatements.org/vocab/InC/1.0/; info:eu-repo/semantics/openAccess; oai:riunet.upv.es:10251/63906; 2026-06-13T07:49:27Z |
| dc.title.none.fl_str_mv |
Using confidence and informativeness criteria to improve POS-tagging in amazigh |
| title |
Using confidence and informativeness criteria to improve POS-tagging in amazigh |
| spellingShingle |
Using confidence and informativeness criteria to improve POS-tagging in amazigh; Outahajala, Mohamed; POS-tagging; Amazigh; Conditional random fields; Support vector machines; Out of vocabulary; Self training; LENGUAJES Y SISTEMAS INFORMATICOS |
| dc.creator.none.fl_str_mv |
Outahajala, Mohamed; Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo |
| author |
Outahajala, Mohamed |
| author_facet |
Outahajala, Mohamed; Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo |
| author_role |
author |
| author2 |
Benajiba, Yassine; Zenkouar, Lahbib; Rosso, Paolo |
| author2_role |
author; author; author |
| dc.contributor.none.fl_str_mv |
Departamento de Sistemas Informáticos y Computación; Escuela Técnica Superior de Ingeniería Informática; Centro de Investigación Pattern Recognition and Human Language Technology; VLC/CAMPUS; Repositorio Institucional de la Universitat Politècnica de València Riunet |
| dc.subject.none.fl_str_mv |
POS-tagging; Amazigh; Conditional random fields; Support vector machines; Out of vocabulary; Self training; LENGUAJES Y SISTEMAS INFORMATICOS |
| topic |
POS-tagging; Amazigh; Conditional random fields; Support vector machines; Out of vocabulary; Self training; LENGUAJES Y SISTEMAS INFORMATICOS |
| description |
Amazigh is used by tens of millions of people, mainly for oral communication. However, like all languages newly investigated in natural language processing, it is resource-scarce. The main aim of this paper is to present the results of our POS taggers, which are based on two state-of-the-art sequence labeling techniques, namely Conditional Random Fields and Support Vector Machines, using a small manually annotated corpus of only 20k tokens. Since creating labeled data is a very time-consuming task while obtaining unlabeled data is less so, we gathered a set of unlabeled Amazigh data, which we preprocessed and tokenized. The paper also addresses the use of semi-supervised techniques to improve POS tagging accuracy. An adapted self-training algorithm, which combines a confidence measure with a function of out-of-vocabulary (OOV) words to select data for self-training, has been used. Using this language-independent method, we have managed to obtain encouraging results. |
| publishDate |
2015 |
| dc.date.none.fl_str_mv |
2015; 2015-01-01 |
| dc.type.none.fl_str_mv |
journal article; http://purl.org/coar/resource_type/c_6501; VoR; http://purl.org/coar/version/c_970fb48d4fbd8a85 |
| dc.type.openaire.fl_str_mv |
info:eu-repo/semantics/article |
| format |
article |
| dc.identifier.none.fl_str_mv |
https://riunet.upv.es/handle/10251/63906 |
| url |
https://riunet.upv.es/handle/10251/63906 |
| dc.language.none.fl_str_mv |
English; eng |
| language_invalid_str_mv |
English |
| language |
eng |
| dc.relation.none.fl_str_mv |
Ministerio de Economía y Competitividad; http://dx.doi.org/10.13039/501100003329; TIN2012-38603-C02-01; DIANA-APPLICATIONS: FINDING HIDDEN KNOWLEDGE IN TEXTS: APPLICATIONS |
| dc.rights.none.fl_str_mv |
open access; http://purl.org/coar/access_right/c_abf2; All rights reserved; http://rightsstatements.org/vocab/InC/1.0/ |
| dc.rights.openaire.fl_str_mv |
info:eu-repo/semantics/openAccess |
| rights_invalid_str_mv |
open access; http://purl.org/coar/access_right/c_abf2; All rights reserved; http://rightsstatements.org/vocab/InC/1.0/ |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf; application/pdf |
| dc.publisher.none.fl_str_mv |
IOS Press |
| publisher.none.fl_str_mv |
IOS Press |
| dc.source.none.fl_str_mv |
reponame:RiuNet. Repositorio Institucional de la Universitat Politècnica de València; instname:Universitat Politècnica de València (UPV) |
| instname_str |
Universitat Politècnica de València (UPV) |
| reponame_str |
RiuNet. Repositorio Institucional de la Universitat Politècnica de València |
| collection |
RiuNet. Repositorio Institucional de la Universitat Politècnica de València |
| repository.name.fl_str_mv |
|
| repository.mail.fl_str_mv |
|
| _version_ |
1869417085329735680 |
| score |
15.300724 |
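
The abstract in this record describes an adapted self-training algorithm that selects automatically tagged data by combining a confidence measure with a function of out-of-vocabulary (OOV) words. The record does not give the exact formulas, so the following Python sketch is only one plausible reading: the `TaggedSentence` type, the use of the minimum per-token probability as the confidence score, the OOV ratio as the informativeness score, the threshold values, and the placeholder tokens and tags are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class TaggedSentence:
    """An automatically tagged sentence (hypothetical container type)."""
    tokens: List[str]
    tags: List[str]
    confidences: List[float]  # assumed: per-token probability reported by the tagger


def sentence_confidence(sent: TaggedSentence) -> float:
    """Confidence criterion (assumed): the weakest per-token probability."""
    return min(sent.confidences)


def oov_ratio(sent: TaggedSentence, vocabulary: Set[str]) -> float:
    """Informativeness criterion (assumed): share of tokens unseen in the labeled corpus."""
    unseen = sum(1 for token in sent.tokens if token not in vocabulary)
    return unseen / len(sent.tokens)


def select_for_self_training(
    candidates: List[TaggedSentence],
    vocabulary: Set[str],
    min_confidence: float = 0.9,  # illustrative threshold, not from the paper
    min_oov: float = 0.1,         # illustrative threshold, not from the paper
    max_oov: float = 0.5,         # illustrative threshold, not from the paper
) -> List[TaggedSentence]:
    """Keep sentences that are tagged confidently and also contain some new words."""
    selected = []
    for sent in candidates:
        if not sent.tokens:
            continue  # skip empty sentences to avoid division by zero
        confident = sentence_confidence(sent) >= min_confidence
        informative = min_oov <= oov_ratio(sent, vocabulary) <= max_oov
        if confident and informative:
            selected.append(sent)
    return selected


# Placeholder tokens and tags; real input would be tokenized Amazigh text.
vocabulary = {"w1", "w2", "w3"}
candidates = [
    TaggedSentence(["w1", "w2", "w9"], ["N", "V", "N"], [0.99, 0.95, 0.93]),
    TaggedSentence(["w1", "w2"], ["N", "V"], [0.99, 0.98]),
    TaggedSentence(["w8", "w9"], ["N", "N"], [0.60, 0.95]),
]
for sent in select_for_self_training(candidates, vocabulary):
    print(sent.tokens, sent.tags)
```

In a self-training loop, the selected sentences would be added to the labeled corpus and the tagger retrained, repeating until no further candidates pass both criteria. Requiring some OOV words favors sentences that extend the tagger's vocabulary, while the upper bound avoids sentences the tagger has little basis to label.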