Towards a Protein-Protein Interaction information extraction system: recognizing named entities

Bibliographic Details
Authors: Danger Mercaderes, Roxana María; Pla Santamaría, Ferran; Molina Marco, Antonio (ORCID: 0000-0001-6537-8803); Rosso, Paolo
Resource type: article
Publication date: 2014
Country: Spain
Institution: Universitat Politècnica de València (UPV)
Repository: RiuNet, Institutional Repository of the Universitat Politècnica de València
Language: English
OAI identifier: oai:riunet.upv.es:10251/47925
Online access: https://riunet.upv.es/handle/10251/47925
Access level: open access
Keywords: Biomedical named entity recognition
Protein-Protein interactions
Dictionary look-up
Conditional random field
Support vector machine
Computer Languages and Systems (Lenguajes y Sistemas Informáticos)
Description
Abstract: The majority of biological functions of any living being are related to Protein-Protein Interactions (PPI). PPI discoveries are reported in the form of research publications, whose volume grows day after day. Consequently, automatic PPI information extraction systems are a pressing need for biologists. This paper is mainly concerned with the named entity detection module of PPIES (the PPI information extraction system we are implementing), which recognizes twelve entity types relevant in the PPI context. It is composed of two sub-modules: a dictionary look-up with extensive normalization and acronym detection, and a Conditional Random Field classifier. The dictionary look-up module has been tested on the Interaction Method Task (IMT), where it improves by approximately 10% on current solutions that do not use Machine Learning (ML). The second module has been used to create a classifier using the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA 04) data set. It uses no external resources and no complex or ad hoc post-processing, and obtains 77.25%, 75.04% and 76.13% for precision, recall, and F1-measure, respectively, improving on all previous results obtained for this data set.
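
The three scores reported for the JNLPBA 04 classifier can be cross-checked against each other, since the F1-measure is conventionally the harmonic mean of precision and recall. The following is a minimal sketch of that arithmetic, assuming the standard balanced F1 definition; the function name `f1_score` and the variable names are illustrative and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, expressed in percent.
reported_precision = 77.25
reported_recall = 75.04

f1 = f1_score(reported_precision, reported_recall)
print(f"{f1:.2f}")  # prints 76.13
```

Under that assumption, the computed value agrees with the 76.13% stated in the abstract.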