Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
[Description of methods used for collection/generation of data] The corpus statistics and methods are explained in the following article: Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Lexical Simplification in Spanish Texts For Patients: The Complex Word Identification Task...
| Authors: | Ortega Riba, Federico; Campillos-Llanos, Leonardo |
|---|---|
| Resource type: | dataset |
| Publication date: | 2024 |
| Country: | Spain |
| Institution: | Consejo Superior de Investigaciones Científicas (CSIC) |
| Repository: | DIGITAL.CSIC. Repositorio Institucional del CSIC |
| OAI Identifier: | oai:digital.csic.es:10261/373675 |
| Online access: | http://hdl.handle.net/10261/373675 https://doi.org/10.20350/digitalCSIC/16706 |
| Access Level: | open access |
| Keywords: | Patient information documents; Annotated corpus; Medical text simplification; Biomedical natural language processing; Consent forms; Clinical trials; Linguistics; Medical sciences; Linguistic research; Ciencias médicas |
| id |
ES_b4e89550ecbd368944df36d285878bc8 |
|---|---|
| oai_identifier_str |
oai:digital.csic.es:10261/373675 |
| network_acronym_str |
ES |
| network_name_str |
Spain |
| repository_id_str |
|
| dc.title.none.fl_str_mv |
Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET] |
| title |
Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET] |
| spellingShingle |
Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]; Ortega Riba, Federico; Patient information documents; Annotated corpus; Medical text simplification; Biomedical natural language processing; Consent forms; Clinical trials; Linguistics; Medical sciences; Linguistic research; Ciencias médicas |
| title_short |
Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET] |
| title_full |
Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET] |
| title_fullStr |
Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET] |
| title_full_unstemmed |
Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET] |
| title_sort |
Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET] |
| dc.creator.none.fl_str_mv |
Ortega Riba, Federico; Campillos-Llanos, Leonardo |
| author |
Ortega Riba, Federico |
| author_facet |
Ortega Riba, Federico; Campillos-Llanos, Leonardo |
| author_role |
author |
| author2 |
Campillos-Llanos, Leonardo |
| author2_role |
author |
| dc.contributor.none.fl_str_mv |
Agencia Estatal de Investigación (España); Ministerio de Ciencia, Innovación y Universidades (España); Campillos-Llanos, Leonardo [0000-0003-3040-1756]; Campillos-Llanos, Leonardo [leonardo.campillos@csic.es]; Consejo Superior de Investigaciones Científicas [https://ror.org/02gfc7t72] |
| dc.subject.none.fl_str_mv |
Patient information documents; Annotated corpus; Medical text simplification; Biomedical natural language processing; Consent forms; Clinical trials; Linguistics; Medical sciences; Linguistic research; Ciencias médicas |
| topic |
Patient information documents; Annotated corpus; Medical text simplification; Biomedical natural language processing; Consent forms; Clinical trials; Linguistics; Medical sciences; Linguistic research; Ciencias médicas |
| description |
[Description of methods used for collection/generation of data] The corpus statistics and methods are explained in the following article: Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Lexical Simplification in Spanish Texts For Patients: The Complex Word Identification Task". (Under review). [Methods for processing the data] Manual annotation of complex words (CW) according to the criteria defined in the guideline explained in the companion article. |
| publishDate |
2024 |
| dc.date.none.fl_str_mv |
2024 |
| dc.type.none.fl_str_mv |
info:eu-repo/semantics/dataset http://purl.org/coar/resource_type/c_ddb1 |
| format |
dataset |
| dc.identifier.none.fl_str_mv |
http://hdl.handle.net/10261/373675 https://doi.org/10.20350/digitalCSIC/16706 |
| url |
http://hdl.handle.net/10261/373675 https://doi.org/10.20350/digitalCSIC/16706 |
| dc.language.none.fl_str_mv |
English |
| language_invalid_str_mv |
English |
| dc.relation.none.fl_str_mv |
info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-116001RA-C33
Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Complex Word Identification for Lexical Simplification in Spanish Texts for Patients". Procesamiento del lenguaje natural, 74, pp. 95-108. http://hdl.handle.net/10261/387368
The BRAT annotation tool is needed to display the annotated (.ann) files. To download and install BRAT, please access: https://brat.nlplab.org/
Yes |
| dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
txt ann text/csv json |
| dc.publisher.none.fl_str_mv |
DIGITAL.CSIC |
| publisher.none.fl_str_mv |
DIGITAL.CSIC |
| dc.source.none.fl_str_mv |
reponame:DIGITAL.CSIC. Repositorio Institucional del CSIC instname:Consejo Superior de Investigaciones Científicas (CSIC) |
| instname_str |
Consejo Superior de Investigaciones Científicas (CSIC) |
| reponame_str |
DIGITAL.CSIC. Repositorio Institucional del CSIC |
| collection |
DIGITAL.CSIC. Repositorio Institucional del CSIC |
| repository.name.fl_str_mv |
|
| repository.mail.fl_str_mv |
|
| _version_ |
1869417299142770688 |
| spelling |
Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
Ortega Riba, Federico; Campillos-Llanos, Leonardo
Patient information documents; Annotated corpus; Medical text simplification; Biomedical natural language processing; Consent forms; Clinical trials; Linguistics; Medical sciences; Linguistic research; Ciencias médicas

[Description of methods used for collection/generation of data] The corpus statistics and methods are explained in the following article: Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Lexical Simplification in Spanish Texts For Patients: The Complex Word Identification Task". (Under review).
[Methods for processing the data] Manual annotation of complex words (CW) according to the criteria defined in the guideline explained in the companion article.

We greatly thank the following colleagues, who reviewed a subset of texts a second time so that the inter-annotator agreement could be computed: Ana R. Terroba-Reinares (Fundación Rioja Salud) [ORCID: 0000-0003-1582-6481]; Ana Valverde-Mateos (Unidad de Terminología Médica, Real Academia Nacional de Medicina de España) [ORCID: 0000-0003-1610-0770].

The corpus is made up of 225 texts in Spanish annotated with complex words (CW). This resource is aimed at training and evaluating models and at running experiments on complex word identification in Spanish medical texts. The corpus contains three text types:
• Consent forms (75 texts)
• Clinical trial announcements (75 texts)
• Patient information leaflets (75 texts)

This dataset was collected in the CLARA-MeD project (PID2020-116001RA-C33), with funding from the Spanish government by MICIU/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.

- ANN: Contains BRAT annotated files (.ann) and the corresponding text files (.txt). These are separated into three folders, each with one subfolder per text type:
• TRAIN:
▫ ci: 51 consent forms ('consentimientos informados')
▫ eudract: 51 clinical trial announcements from REEC and EudraCT
▫ info: 51 patient-oriented information leaflets
• DEV:
▫ ci: 9 consent forms (source note: one missing)
▫ eudract: 9 clinical trial announcements
▫ info: 9 patient-oriented information leaflets
• TEST:
▫ ci: 15 consent forms
▫ eudract: 15 clinical trial announcements
▫ info: 15 patient-oriented information leaflets
- JSON files for transformer models: These are separated into TRAIN, DEV and TEST.
- CSV files with the processed data, corresponding to the TRAIN, DEV and TEST data. These were used for the machine learning experiments. Each file has the following fields:
• Token
• Label: it encodes the class (CW, 'complex word') and whether the token is at the Beginning of the entity (B), Inside it (I) or Outside it (O).

Peer reviewed
DIGITAL.CSIC
Agencia Estatal de Investigación (España); Ministerio de Ciencia, Innovación y Universidades (España); Campillos-Llanos, Leonardo [0000-0003-3040-1756]; Campillos-Llanos, Leonardo [leonardo.campillos@csic.es]; Consejo Superior de Investigaciones Científicas [https://ror.org/02gfc7t72]
2024
info:eu-repo/semantics/dataset http://purl.org/coar/resource_type/c_ddb1
txt ann text/csv json
http://hdl.handle.net/10261/373675 https://doi.org/10.20350/digitalCSIC/16706
reponame:DIGITAL.CSIC. Repositorio Institucional del CSIC instname:Consejo Superior de Investigaciones Científicas (CSIC)
English
info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-116001RA-C33
Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Complex Word Identification for Lexical Simplification in Spanish Texts for Patients". Procesamiento del lenguaje natural, 74, pp. 95-108. http://hdl.handle.net/10261/387368
The BRAT annotation tool is needed to display the annotated (.ann) files. To download and install BRAT, please access: https://brat.nlplab.org/
Yes
info:eu-repo/semantics/openAccess
oai:digital.csic.es:10261/373675
2026-05-22T06:33:51Z |
| score |
15.81155 |
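
The record describes CSV files with a Token field and a Label field that marks each token as the Beginning (B), Inside (I) or Outside (O) of a complex word (CW). The following is a minimal sketch of how such token-level labels could be grouped back into complex-word spans. Several details are assumptions rather than facts from the record: the exact tag strings (`B-CW`, `I-CW`, `O`), the comma delimiter, the presence of a header row, and the function names `read_rows` and `bio_to_spans`, which are hypothetical. The sample sentence is invented for illustration and is not taken from the corpus.

```python
import csv
import io


def read_rows(handle):
    """Read (token, label) pairs from a CSV with 'Token' and 'Label' columns.

    Assumption: comma-delimited with a header row; the real files may differ.
    """
    reader = csv.DictReader(handle)
    return [(row["Token"], row["Label"]) for row in reader]


def bio_to_spans(rows, label_type="CW"):
    """Group consecutive B-/I- tokens into space-joined spans.

    Assumption: tags are spelled 'B-CW', 'I-CW' and 'O'.
    An I- tag with no open span is treated as Outside.
    """
    begin, inside = f"B-{label_type}", f"I-{label_type}"
    spans, current = [], []
    for token, label in rows:
        if label == begin:
            # A new span starts; close any span that is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == inside and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Invented sample in the assumed layout (not corpus data).
SAMPLE = """Token,Label
El,O
infarto,B-CW
de,I-CW
miocardio,I-CW
es,O
grave,O
"""

rows = read_rows(io.StringIO(SAMPLE))
print(bio_to_spans(rows))
```

To run this on a real file, `read_rows` could be given an open file handle instead of the in-memory sample, after checking the actual column names and tag spellings in the downloaded CSV.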