HESML V2R1 Java software library of semantic similarity measures for the biomedical domain


Bibliographic Details
Authors: Lara-Clares, Alicia; Lastra-Díaz, Juan José; Garcia-Serrano, Ana M.
Resource type: dataset
Status: Published version
Publication date: 2022
Country: Spain
Institution: Consorcio Madroño
Repository: e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
OAI Identifier:doi:10.21950/AQLSMV
Online access: https://doi.org/10.21950/AQLSMV
Access level: open access
Keywords: Computer and Information Science
HESML
semantic measures library
Ontology-based semantic similarity measures
Word embeddings
Information Content (IC) models
WordNet
UMLS
SNOMED-CT
MeSH
Gene Ontology (GO)
Sentence embeddings
BERT
sentence similarity
biomedical sentence similarity
Description
Summary: This dataset introduces HESML V2R1, the sixth release of the Half-Edge Semantic Measures Library (HESML) detailed in [24]. HESML V2R1 is a linearly scalable and efficient Java software library of ontology-based semantic similarity measures and Information Content (IC) models for ontologies such as WordNet, SNOMED-CT, MeSH, GO, and any other ontology based on the OBO file format. HESML V2R1 also implements most of the sentence similarity methods in the biomedical domain, together with a set of sentence pre-processing configurations and the integration of the three main biomedical NER tools: Metamap [3], MetamapLite [7] and cTAKES [31]. It implements most ontology-based semantic similarity measures and IC models reported in the literature, and supports the evaluation of three pre-trained word embedding models for the general domain and 33 pre-trained embeddings and language models. It also provides an XML-based input file format for specifying reproducible word/concept similarity experiments based on WordNet, SNOMED-CT, MeSH, or GO without software coding, as well as the software clients needed to run the sentence-based experiments in the biomedical domain.

HESML V2R1 introduces the following novelties: (1) the software implementation of a new package for the evaluation of sentence similarity methods; (2) the software implementation of most of the sentence similarity methods in the biomedical domain; (3) the implementation of a new package for sentence pre-processing together with a set of sentence pre-processing configurations; (4) the integration of the three main biomedical NER tools, Metamap [3], MetamapLite [7] and cTAKES [31]; (5) the software implementation of a parser based on the averaging Simple Word EMbeddings (SWEM) models introduced by Shen et al. [32] for efficiently loading and evaluating FastText-based [4] and other word embedding models; (6) the integration of Python wrappers for the evaluation of BERT [8], Universal Sentence Encoder (USE) [5] and Flair [1] models; and finally, (7) the software implementation of a new string-based sentence similarity method, called LiBlock, based on the aggregation of the Li et al. [29] similarity and Block distance [9] measures, as well as eight new variants of the ontology-based methods proposed by Sogancioglu et al. [33], and a new pre-trained word embedding model based on FastText [4] and trained on the full text of the articles in the PMC-BioC corpus [6].

The HESML library is freely distributed for any non-commercial purpose under a CC BY-NC-SA 4.0 license, subject to citing the two main HESML papers [24] as an attribution requirement. However, the HESML distribution also includes other datasets, databases and data files whose use requires an attribution acknowledgement by any user of HESML. We therefore urge HESML users to comply with the licensing terms of the other resources distributed with the library, as detailed in its companion release notes.
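The averaging SWEM idea mentioned in novelty (5) can be illustrated with a minimal sketch: a sentence vector is the mean of the vectors of its known tokens, and two sentences are compared by cosine similarity. This is not HESML's API. The class name `SwemAverageSketch`, its methods, and the toy two-dimensional embedding table are all hypothetical and serve only to show the technique under those assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of averaging word embeddings (SWEM-style); not part of HESML.
class SwemAverageSketch {

    /** Returns the mean vector of the tokens present in the table; unknown tokens are skipped. */
    public static double[] averageEmbedding(String[] tokens, Map<String, double[]> embeddings, int dim) {
        double[] mean = new double[dim];
        int found = 0;
        for (String token : tokens) {
            double[] vec = embeddings.get(token.toLowerCase());
            if (vec == null) {
                continue; // out-of-vocabulary token
            }
            for (int i = 0; i < dim; i++) {
                mean[i] += vec[i];
            }
            found++;
        }
        if (found > 0) {
            for (int i = 0; i < dim; i++) {
                mean[i] /= found;
            }
        }
        return mean; // zero vector when no token is known
    }

    /** Cosine similarity; returns 0.0 if either vector has zero norm. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy embedding table with made-up values (assumption for illustration only).
        Map<String, double[]> toy = new HashMap<>();
        toy.put("heart", new double[]{1.0, 0.0});
        toy.put("attack", new double[]{0.0, 1.0});
        toy.put("cardiac", new double[]{0.9, 0.1});
        toy.put("arrest", new double[]{0.1, 0.9});

        double[] s1 = averageEmbedding(new String[]{"Heart", "attack"}, toy, 2);
        double[] s2 = averageEmbedding(new String[]{"cardiac", "arrest"}, toy, 2);
        System.out.println("similarity = " + cosine(s1, s2));
    }
}
```

In practice the table would be loaded from a pre-trained model file rather than built by hand, and the tokens would come from whichever sentence pre-processing configuration is in use.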