A Proposal for wide-coverage Spanish named entity recognition

Bibliographic Details
Authors: Arévalo, M.; Carreras Pérez, Xavier; Màrquez Villodre, Lluís (ORCID 0009-0009-0593-368X); Martí Antonin, Maria Antònia; Padró, Lluís (ORCID 0000-0003-4738-5019); Simon, Maria José
Resource type: technical report
Publication date: 2002
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/97522
Online access: https://hdl.handle.net/2117/97522
Access level: open access
Keywords: Named entity recognition for Spanish
Machine learning
AdaBoost
Àrees temàtiques de la UPC::Informàtica
Description
Summary: This paper presents a proposal for wide-coverage Named Entity Recognition for Spanish. First, a linguistic description of the typology of Named Entities is proposed. Following this definition, an architecture of sequential processes is described for addressing the recognition and classification of strong and weak Named Entities. The former are treated using Machine Learning techniques (AdaBoost) and simple attributes requiring untagged corpora, complemented with external information sources (a list of trigger words and a gazetteer). The latter are approached through a context-free grammar for recognizing syntactic patterns. An in-depth evaluation of the first task on real corpora is presented to validate the appropriateness of the approach. A preliminary version of the context-free grammar is qualitatively evaluated, also with good results, on a small hand-tagged corpus.
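
As a rough illustration of the strong-entity step mentioned in the summary, the sketch below combines simple binary attributes with trigger-word and gazetteer lookups and trains a small discrete AdaBoost over one-feature decision stumps. This record does not give the report's actual feature set, AdaBoost variant, or resources, so everything here is an assumption made for illustration: the word lists, the feature names, the function names (`extract_features`, `train_adaboost`, `predict`), and the toy sentence are all hypothetical.

```python
import math

# Hypothetical toy resources; the report's real trigger-word list and
# gazetteer are not reproduced in this record.
TRIGGER_WORDS = {"presidente", "río", "calle", "empresa"}
GAZETTEER = {"madrid", "barcelona", "ebro"}


def extract_features(tokens, i):
    """Binary attributes for the token at position i (illustrative choices)."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else ""
    return {
        "capitalized": word[:1].isupper(),
        "sentence_initial": i == 0,
        "in_gazetteer": word.lower() in GAZETTEER,
        "prev_is_trigger": prev_word in TRIGGER_WORDS,
    }


def stump(features, name, polarity):
    """Predict `polarity` when the feature is on, `-polarity` otherwise."""
    return polarity if features[name] else -polarity


def train_adaboost(examples, labels, rounds=3):
    """Discrete AdaBoost with one-feature stumps; labels are +1 or -1."""
    n = len(examples)
    weights = [1.0 / n] * n
    names = sorted(examples[0])
    model = []  # list of (alpha, feature name, polarity)
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for name in names:
            for polarity in (1, -1):
                err = sum(
                    w for x, y, w in zip(examples, labels, weights)
                    if stump(x, name, polarity) != y
                )
                if best is None or err < best[0]:
                    best = (err, name, polarity)
        err, name, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, name, polarity))
        # Increase the weight of misclassified examples, then renormalize.
        weights = [
            w * math.exp(-alpha * y * stump(x, name, polarity))
            for x, y, w in zip(examples, labels, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, features):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump(features, name, pol) for alpha, name, pol in model)
    return 1 if score > 0 else -1


# Toy training sentence: +1 marks a token that belongs to a named entity.
tokens = ["El", "presidente", "Aznar", "visitó", "Madrid", "ayer"]
labels = [-1, -1, 1, -1, 1, -1]
examples = [extract_features(tokens, i) for i in range(len(tokens))]
model = train_adaboost(examples, labels, rounds=3)
```

On this toy data no single attribute separates the classes: capitalization alone wrongly fires on the sentence-initial "El", while the gazetteer and trigger-word attributes each cover only one of the two entities. The weighted vote of three stumps separates them, which is the kind of combination of weak evidence the summary attributes to the AdaBoost stage. A real system would need many more attributes and a properly sized corpus.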