Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]

[Description of methods used for collection/generation of data] The corpus statistics and methods are explained in the following article: Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Lexical Simplification in Spanish Texts For Patients: The Complex Word Identification Task...


Bibliographic Details
Authors: Ortega Riba, Federico; Campillos-Llanos, Leonardo
Resource type: dataset
Publication date: 2024
Country: Spain
Institution: Consejo Superior de Investigaciones Científicas (CSIC)
Repository: DIGITAL.CSIC. Repositorio Institucional del CSIC
OAI Identifier: oai:digital.csic.es:10261/373675
Online access: http://hdl.handle.net/10261/373675
https://doi.org/10.20350/digitalCSIC/16706
Access Level: open access
Keywords: Patient information documents
Annotated corpus
Medical text simplification
Biomedical natural language processing
Consent forms
Clinical trials
Linguistics
Medical sciences
Linguistic research
Ciencias médicas
id ES_b4e89550ecbd368944df36d285878bc8
oai_identifier_str oai:digital.csic.es:10261/373675
network_acronym_str ES
network_name_str Spain
repository_id_str
dc.title.none.fl_str_mv Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
title Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
spellingShingle Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
Ortega Riba, Federico
Patient information documents
Annotated corpus
Medical text simplification
Biomedical natural language processing
Consent forms
Clinical trials
Linguistics
Medical sciences
Linguistic research
Ciencias médicas
title_short Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
title_full Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
title_fullStr Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
title_full_unstemmed Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
title_sort Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
dc.creator.none.fl_str_mv Ortega Riba, Federico
Campillos-Llanos, Leonardo
author Ortega Riba, Federico
author_facet Ortega Riba, Federico
Campillos-Llanos, Leonardo
author_role author
author2 Campillos-Llanos, Leonardo
author2_role author
dc.contributor.none.fl_str_mv Agencia Estatal de Investigación (España)
Ministerio de Ciencia, Innovación y Universidades (España)
Campillos-Llanos, Leonardo [0000-0003-3040-1756]
Campillos-Llanos, Leonardo [leonardo.campillos@csic.es]
Consejo Superior de Investigaciones Científicas [https://ror.org/02gfc7t72]
dc.subject.none.fl_str_mv Patient information documents
Annotated corpus
Medical text simplification
Biomedical natural language processing
Consent forms
Clinical trials
Linguistics
Medical sciences
Linguistic research
Ciencias médicas
topic Patient information documents
Annotated corpus
Medical text simplification
Biomedical natural language processing
Consent forms
Clinical trials
Linguistics
Medical sciences
Linguistic research
Ciencias médicas
description [Description of methods used for collection/generation of data] The corpus statistics and methods are explained in the following article: Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Lexical Simplification in Spanish Texts For Patients: The Complex Word Identification Task". (Under review). [Methods for processing the data] Manual annotation of complex words (CW) according to the criteria defined in the guideline explained in the companion article.
publishDate 2024
dc.date.none.fl_str_mv 2024
dc.type.none.fl_str_mv info:eu-repo/semantics/dataset
http://purl.org/coar/resource_type/c_ddb1
format dataset
dc.identifier.none.fl_str_mv http://hdl.handle.net/10261/373675
https://doi.org/10.20350/digitalCSIC/16706
url http://hdl.handle.net/10261/373675
https://doi.org/10.20350/digitalCSIC/16706
dc.language.none.fl_str_mv English
language_invalid_str_mv English
dc.relation.none.fl_str_mv info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-116001RA-C33
Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Complex Word Identification for Lexical Simplification in Spanish Texts for Patients". Procesamiento del lenguaje natural, 74, pp. 95-108. http://hdl.handle.net/10261/387368
The BRAT annotation tool is needed to display the annotated (.ann) files. To download and install BRAT, please access: https://brat.nlplab.org/
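
For readers who only need the annotated spans rather than the BRAT display, the .ann files can also be read as plain text. The following is a minimal sketch, assuming the files follow the standard BRAT standoff layout for text-bound annotations (ID, then type and character offsets, then the covered text, separated by tabs). The entity type `CW`, the sample content and the helper name `parse_ann` are illustrative assumptions, not taken from the dataset.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    ident: str                     # annotation ID, e.g. "T1"
    label: str                     # entity type (assumed here to be "CW")
    spans: list[tuple[int, int]]   # character offsets; several if discontinuous
    text: str                      # surface text covered by the annotation


def parse_ann(content: str) -> list[Entity]:
    """Collect text-bound annotations (lines starting with 'T') from .ann content."""
    entities = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # skip notes, relations, attributes, blank lines
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line; ignored in this sketch
        ident, meta, text = parts[0], parts[1], parts[2]
        label, _, offsets = meta.partition(" ")
        spans = []
        for chunk in offsets.split(";"):  # ';' separates discontinuous fragments
            start, end = chunk.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ident, label, spans, text))
    return entities


# Hypothetical .ann content for the text "La anestesia general"
sample = "T1\tCW 3 12\tanestesia\n#1\tAnnotatorNotes T1\texample note\n"
entities = parse_ann(sample)
for entity in entities:
    print(entity.ident, entity.label, entity.spans, entity.text)
```

In practice the content would come from reading each .ann file with UTF-8 encoding; the offsets refer to character positions in the matching .txt file.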

dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv txt
ann
text/csv
json
dc.publisher.none.fl_str_mv DIGITAL.CSIC
publisher.none.fl_str_mv DIGITAL.CSIC
dc.source.none.fl_str_mv reponame:DIGITAL.CSIC. Repositorio Institucional del CSIC
instname:Consejo Superior de Investigaciones Científicas (CSIC)
instname_str Consejo Superior de Investigaciones Científicas (CSIC)
reponame_str DIGITAL.CSIC. Repositorio Institucional del CSIC
collection DIGITAL.CSIC. Repositorio Institucional del CSIC
repository.name.fl_str_mv
repository.mail.fl_str_mv
_version_ 1869417299142770688
spelling Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
We are very grateful to the following colleagues, who revised a subset of texts a second time so that inter-annotator agreement could be computed: Ana R. Terroba-Reinares (Fundación Rioja Salud) [ORCID: 0000-0003-1582-6481]; Ana Valverde-Mateos (Unidad de Terminología Médica, Real Academia Nacional de Medicina de España) [ORCID: 0000-0003-1610-0770].
This dataset was collected in the CLARA-MeD project (PID2020-116001RA-C33), funded by the Spanish government through MICIU/AEI/10.13039/501100011033/, in the project call “Proyectos I+D+i Retos Investigación”.
The corpus is made up of 225 texts in Spanish annotated with complex words (CW). It is intended for training and evaluating models and for running experiments on complex word identification in Spanish medical texts. The corpus contains three text types:
• Consent forms (75 texts)
• Clinical trial announcements (75 texts)
• Patient information leaflets (75 texts)
The dataset includes:
- ANN: BRAT annotation files (.ann) and the corresponding text files (.txt), organized in three folders, each with one subfolder per text type:
• TRAIN:
▫ ci: 51 consent forms ('consentimientos informados')
▫ eudract: 51 clinical trial announcements from REEC and EudraCT
▫ info: 51 patient-oriented information leaflets
• DEV:
▫ ci: 9 consent forms (note in source: "FALTA UNO", i.e. one missing)
▫ eudract: 9 clinical trial announcements
▫ info: 9 patient-oriented information leaflets
• TEST:
▫ ci: 15 consent forms
▫ eudract: 15 clinical trial announcements
▫ info: 15 patient-oriented information leaflets
- JSON files for transformer models, split into TRAIN, DEV and TEST.
- CSV files with the processed data for TRAIN, DEV and TEST, which were used for the machine learning experiments. Each file has the following fields:
• Token
• Label: encodes the class (CW, 'complex word') and whether the token is at the Beginning of the entity (B), Inside it (I) or Outside any entity (O).
Peer reviewed
2026-05-22T06:33:51Z
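
The Token and Label fields of the CSV files can be turned back into complex-word spans by grouping each B token with the I tokens that follow it. The sketch below illustrates that grouping under several assumptions that this record does not confirm: the header names are exactly `Token` and `Label`, the delimiter is a comma, and labels are written as `B-CW`, `I-CW` and `O`. The sample rows and the function name `bio_to_spans` are invented for illustration.

```python
import csv
import io


def bio_to_spans(tokens: list[str], labels: list[str]) -> list[str]:
    """Group tokens into complex-word spans using B/I/O prefixes."""
    spans = []
    current = []
    for token, label in zip(tokens, labels):
        if label.startswith("B"):
            if current:                      # close the previous span
                spans.append(" ".join(current))
            current = [token]                # open a new span
        elif label.startswith("I") and current:
            current.append(token)            # continue the open span
        else:                                # "O" or a stray "I" with no open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:                              # flush a span that ends the sequence
        spans.append(" ".join(current))
    return spans


# Hypothetical rows in the assumed layout; real files would be opened with
# open(path, newline="", encoding="utf-8") instead of io.StringIO.
sample_csv = """Token,Label
La,O
resonancia,B-CW
magnética,I-CW
y,O
la,O
biopsia,B-CW
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
tokens = [row["Token"] for row in rows]
labels = [row["Label"] for row in rows]
result = bio_to_spans(tokens, labels)
print(result)
```

Matching on the first letter of the label keeps the sketch usable whether the files write `B-CW` or a bare `B`; stricter parsing would be preferable once the actual label spelling is known.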