Complex Word Identification for Lexical Simplification in Spanish Texts for Patients

Ortega Riba, Federico; Campillos-Llanos, Leonardo; Samy, Doa

Bibliographic Details
Authors:	Ortega Riba, Federico; Campillos-Llanos, Leonardo; Samy, Doa
Resource type:	article
Status:	Published version
Publication date:	2025
Country:	Spain
Institution:	Consejo Superior de Investigaciones Científicas (CSIC)
Repository:	DIGITAL.CSIC. Repositorio Institucional del CSIC
OAI Identifier:	oai:digital.csic.es:10261/387368
Online access:	http://hdl.handle.net/10261/387368
Access Level:	open access
Keywords:	Automatic Text Simplification; Language Resources; Corpora; Computational linguistics

Description
Summary:	[EN] This work describes the task of complex word identification (CWI) in Spanish medical texts for patients. Identifying complex words is the first step in lexical simplification, which aims to overcome the language gap between patients and healthcare professionals, enable access to information, and ensure unambiguous terminology for effective and clear communication. As part of the task, we created an annotation guideline for medical complex words and compiled a corpus of 225 texts (162,575 tokens). A total of 18,203 complex words (single-word and multiword) were manually labeled; each text was annotated by two linguists, with high inter-annotator agreement (F1 = 84.42%). The corpus was used to train two machine learning classifiers (Support Vector Machines and Logistic Regression) as baselines, along with seven transformer-based deep learning models. The models were selected according to two factors: language (Spanish or multilingual) and domain (general or medical). On the test set, the best-performing transformer model achieves an overall average F1 score of 79.02 (±0.65).
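
The inter-annotator agreement mentioned in the summary is reported as an F1 score. A minimal sketch of how such a pairwise score can be computed follows. It assumes exact-match comparison of annotated token spans, which the summary does not specify; the function name `pairwise_f1`, the span representation, and the toy annotations are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# A span is (start_token, end_token), end exclusive. Hypothetical representation.
Span = Tuple[int, int]


def pairwise_f1(annotator_a: Set[Span], annotator_b: Set[Span]) -> float:
    """F1 between two annotators, treating A as reference and B as prediction.

    Assumption: a span counts as agreed only when both boundaries match exactly.
    """
    if not annotator_a and not annotator_b:
        return 1.0  # both annotators marked nothing: full agreement
    matches = len(annotator_a & annotator_b)
    if matches == 0:
        return 0.0
    precision = matches / len(annotator_b)
    recall = matches / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Toy annotations (invented for illustration): the annotators disagree on one span.
a = {(0, 1), (4, 6), (9, 10), (12, 14)}
b = {(0, 1), (4, 6), (9, 11), (12, 14)}
agreement = pairwise_f1(a, b)
print(f"{agreement:.2%}")  # prints 75.00%
```

With exact matching, the score is symmetric, so it does not matter which annotator is treated as the reference. A relaxed (overlap-based) criterion would give higher values and would need a different matching step.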
