Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]

Ortega Riba, Federico; Campillos-Llanos, Leonardo

[Description of methods used for collection/generation of data] The corpus statistics and methods are explained in the following article: Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Lexical Simplification in Spanish Texts For Patients: The Complex Word Identification Task...

Bibliographic Details
Authors:	Ortega Riba, Federico; Campillos-Llanos, Leonardo
Resource type:	dataset
Publication date:	2024
Country:	Spain
Institution:	Consejo Superior de Investigaciones Científicas (CSIC)
Repository:	DIGITAL.CSIC. Repositorio Institucional del CSIC
OAI Identifier:	oai:digital.csic.es:10261/373675
Online access:	http://hdl.handle.net/10261/373675 https://doi.org/10.20350/digitalCSIC/16706
Access Level:	open access
Keywords:	Patient information documents; Annotated corpus; Medical text simplification; Biomedical natural language processing; Consent forms; Clinical trials; Linguistics; Medical sciences; Linguistic research; Ciencias médicas

id	ES_b4e89550ecbd368944df36d285878bc8
oai_identifier_str	oai:digital.csic.es:10261/373675
network_acronym_str	ES
network_name_str	Spain
repository_id_str
dc.title.none.fl_str_mv	Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [DATASET]
dc.creator.none.fl_str_mv	Ortega Riba, Federico; Campillos-Llanos, Leonardo
author	Ortega Riba, Federico
author_facet	Ortega Riba, Federico; Campillos-Llanos, Leonardo
author_role	author
author2	Campillos-Llanos, Leonardo
author2_role	author
dc.contributor.none.fl_str_mv	Agencia Estatal de Investigación (España); Ministerio de Ciencia, Innovación y Universidades (España); Campillos-Llanos, Leonardo [0000-0003-3040-1756]; Campillos-Llanos, Leonardo [leonardo.campillos@csic.es]; Consejo Superior de Investigaciones Científicas [https://ror.org/02gfc7t72]
dc.subject.none.fl_str_mv	Patient information documents; Annotated corpus; Medical text simplification; Biomedical natural language processing; Consent forms; Clinical trials; Linguistics; Medical sciences; Linguistic research; Ciencias médicas
topic	Patient information documents; Annotated corpus; Medical text simplification; Biomedical natural language processing; Consent forms; Clinical trials; Linguistics; Medical sciences; Linguistic research; Ciencias médicas
description	[Description of methods used for collection/generation of data] The corpus statistics and methods are explained in the following article: Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Lexical Simplification in Spanish Texts For Patients: The Complex Word Identification Task". (Under review). [Methods for processing the data] Manual annotation of complex words (CW) according to the criteria defined in the guideline explained in the companion article.
publishDate	2024
dc.date.none.fl_str_mv	2024
dc.type.none.fl_str_mv	info:eu-repo/semantics/dataset http://purl.org/coar/resource_type/c_ddb1
format	dataset
dc.identifier.none.fl_str_mv	http://hdl.handle.net/10261/373675 https://doi.org/10.20350/digitalCSIC/16706
url	http://hdl.handle.net/10261/373675 https://doi.org/10.20350/digitalCSIC/16706
dc.language.none.fl_str_mv	English
language_invalid_str_mv	English
dc.relation.none.fl_str_mv	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-116001RA-C33; Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Complex Word Identification for Lexical Simplification in Spanish Texts for Patients". Procesamiento del lenguaje natural, 74, pp. 95-108. http://hdl.handle.net/10261/387368; The BRAT annotation tool is needed to display the annotated (.ann) files. To download and install BRAT, see https://brat.nlplab.org/; Yes
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	txt; ann; text/csv; json
dc.publisher.none.fl_str_mv	DIGITAL.CSIC
publisher.none.fl_str_mv	DIGITAL.CSIC
dc.source.none.fl_str_mv	reponame:DIGITAL.CSIC. Repositorio Institucional del CSIC instname:Consejo Superior de Investigaciones Científicas (CSIC)
instname_str	Consejo Superior de Investigaciones Científicas (CSIC)
reponame_str	DIGITAL.CSIC. Repositorio Institucional del CSIC
collection	DIGITAL.CSIC. Repositorio Institucional del CSIC
repository.name.fl_str_mv
repository.mail.fl_str_mv
_version_	1869417299142770688
spelling	We greatly thank the following colleagues, who double-reviewed a subset of texts so that the inter-annotator agreement could be computed: Ana R. Terroba-Reinares (Fundación Rioja Salud) [ORCID: 0000-0003-1582-6481]; Ana Valverde-Mateos (Unidad de Terminología Médica, Real Academia Nacional de Medicina de España) [ORCID: 0000-0003-1610-0770].

The corpus is made up of 225 texts in Spanish annotated with complex words (CW). It is intended for training and evaluating models and for running experiments on complex word identification in Spanish medical texts. It contains three text types:
• Consent forms (75 texts)
• Clinical trial announcements (75 texts)
• Patient information leaflets (75 texts)

This dataset was collected in the CLARA-MeD project (PID2020-116001RA-C33), with funding from the Spanish government by MICIU/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.

The dataset is distributed in three parts:
- ANN: BRAT annotated files (.ann) and their corresponding text files (.txt), separated into three folders, each with one subfolder per text type:
  • TRAIN:
    ▫ ci: 51 consent forms ('consentimientos informados')
    ▫ eudract: 51 clinical trial announcements from REEC and EudraCT
    ▫ info: 51 patient-oriented information leaflets
  • DEV:
    ▫ ci: 9 consent forms (note in the source record: 'FALTA UNO', i.e. 'one missing')
    ▫ eudract: 9 clinical trial announcements
    ▫ info: 9 patient-oriented information leaflets
  • TEST:
    ▫ ci: 15 consent forms
    ▫ eudract: 15 clinical trial announcements
    ▫ info: 15 patient-oriented information leaflets
- JSON files for transformer models, separated into TRAIN, DEV and TEST.
- CSV files with the processed data, corresponding to the TRAIN, DEV and TEST data. These were used for the machine learning experiments. Each file has the following fields:
  • Token
  • Label: encodes the class (CW, 'complex word') and whether the token is at the Beginning (B) of the entity, Inside (I) it, or Outside (O) any entity.

Peer reviewed
2026-05-22T06:33:51Z
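
The Token and Label fields described for the CSV files can be read back into complex-word spans. The following is a minimal sketch, not the authors' code. It assumes a comma-delimited file with a header row named exactly `Token` and `Label`, and label strings of the form `B-CW`, `I-CW` and `O`; the record states only that the label encodes the class and the B/I/O position, so the exact strings and delimiter are assumptions. The sample rows are invented for illustration and are not taken from the corpus.

```python
import csv
import io


def read_rows(handle):
    """Read (token, label) pairs from a CSV handle.

    Assumes a header row with 'Token' and 'Label' columns (hypothetical names
    based on the field list in the record).
    """
    reader = csv.DictReader(handle)
    return [(row["Token"], row["Label"]) for row in reader]


def decode_bio(rows):
    """Group BIO-labelled tokens into complex-word spans.

    A 'B-' label opens a new span, an 'I-' label extends the open span,
    and anything else closes it. An 'I-' label with no open span is
    treated as outside.
    """
    spans, current = [], []
    for token, label in rows:
        if label.startswith("B-"):
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Invented sample in the assumed layout; not real corpus data.
SAMPLE = (
    "Token,Label\n"
    "El,O\n"
    "infarto,B-CW\n"
    "de,I-CW\n"
    "miocardio,I-CW\n"
    "es,O\n"
    "grave,O\n"
)

spans = decode_bio(read_rows(io.StringIO(SAMPLE)))
print(spans)
```

To use it on the distributed files, replace the `io.StringIO(SAMPLE)` handle with an opened CSV file and adjust the column names, delimiter and label strings to whatever the files actually contain.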