Corpus-based sentence deletion and split decisions for Spanish text simplification

Stajner, Sanja; Drndarevic, Biljana; Saggion, Horacio


This study addresses the automatic simplification of texts in Spanish in order to make them more accessible to people with cognitive disabilities. A corpus analysis of original and manually simplified news articles was undertaken in order to identify and quantify relevant operations to be implemente...


Bibliographic details
Authors:	Stajner, Sanja; Drndarevic, Biljana; Saggion, Horacio
Document type:	article
Status:	Published version
Publication date:	2013
Country:	Spain
Resources:	Universitat Pompeu Fabra
Repository:	Repositorio Digital de la UPF
OAI Identifier:	oai:repositori.upf.edu:10230/36106
Online access:	http://hdl.handle.net/10230/36106
Access Level:	Open access
Keywords:	Spanish text simplification; Supervised learning; Sentence classification; Simplificación de textos en español; Aprendizaje supervisado; Clasificación de frases

id	ES_93d0bd3d68b2d7ccab186c31fdff9c6f
oai_identifier_str	oai:repositori.upf.edu:10230/36106
network_acronym_str	ES
network_name_str	España
repository_id_str
dc.title.none.fl_str_mv	Corpus-based sentence deletion and split decisions for Spanish text simplification | Eliminación de frases y decisiones de división basadas en corpus para simplificación de textos en español
title	Corpus-based sentence deletion and split decisions for Spanish text simplification
spellingShingle	Corpus-based sentence deletion and split decisions for Spanish text simplification | Stajner, Sanja | Spanish text simplification | Supervised learning | Sentence classification | Simplificación de textos en español | Aprendizaje supervisado | Clasificación de frases
title_short	Corpus-based sentence deletion and split decisions for Spanish text simplification
title_full	Corpus-based sentence deletion and split decisions for Spanish text simplification
title_fullStr	Corpus-based sentence deletion and split decisions for Spanish text simplification
title_full_unstemmed	Corpus-based sentence deletion and split decisions for Spanish text simplification
title_sort	Corpus-based sentence deletion and split decisions for Spanish text simplification
dc.creator.none.fl_str_mv	Stajner, Sanja | Drndarevic, Biljana | Saggion, Horacio
author	Stajner, Sanja
author_facet	Stajner, Sanja | Drndarevic, Biljana | Saggion, Horacio
author_role	author
author2	Drndarevic, Biljana | Saggion, Horacio
author2_role	author | author
dc.subject.none.fl_str_mv	Spanish text simplification | Supervised learning | Sentence classification | Simplificación de textos en español | Aprendizaje supervisado | Clasificación de frases
topic	Spanish text simplification | Supervised learning | Sentence classification | Simplificación de textos en español | Aprendizaje supervisado | Clasificación de frases
description	This study addresses the automatic simplification of texts in Spanish in order to make them more accessible to people with cognitive disabilities. A corpus analysis of original and manually simplified news articles was undertaken in order to identify and quantify relevant operations to be implemented in a text simplification system. The articles were further compared at sentence and text level by means of automatic feature extraction and various machine learning classification algorithms, using three different groups of features (POS frequencies, syntactic information, and text complexity measures) with the aim of identifying features that help separate original documents from their simple equivalents. Finally, it was investigated whether these features can be used to decide upon simplification operations to be carried out at the sentence level (split, delete, and reduce). Automatic classification of original sentences into those to be kept and those to be eliminated outperformed the classification that was previously conducted on the same corpus. Kept sentences were further classified into those to be split or significantly reduced in length and those to be left largely unchanged, with the overall F-measure up to 0.92. Both experiments were conducted and compared on two different sets of features: all features and the best subset returned by an attribute selection algorithm.
publishDate	2013
dc.date.none.fl_str_mv	2013 | 2018 | 2018
dc.type.none.fl_str_mv	info:eu-repo/semantics/article | info:eu-repo/semantics/publishedVersion
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://hdl.handle.net/10230/36106
url	http://hdl.handle.net/10230/36106
dc.language.none.fl_str_mv	Inglés
language_invalid_str_mv	Inglés
dc.relation.none.fl_str_mv	Computacion y Sistemas. 2013; 17(2):251-62. | info:eu-repo/grantAgreement/EC/FP7/287607 | info:eu-repo/grantAgreement/ES/3PN/TSI-020302-2010-84 | info:eu-repo/grantAgreement/ES/3PN/TIN2012-38584-C06-03
dc.rights.none.fl_str_mv	© Computing Research Center (CIC-IPN) | info:eu-repo/semantics/openAccess
rights_invalid_str_mv	© Computing Research Center (CIC-IPN)
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf | application/pdf
dc.publisher.none.fl_str_mv	Centro de Investigación en Computación
publisher.none.fl_str_mv	Centro de Investigación en Computación
dc.source.none.fl_str_mv	reponame:Repositorio Digital de la UPF | instname:Universitat Pompeu Fabra
instname_str	Universitat Pompeu Fabra
reponame_str	Repositorio Digital de la UPF
collection	Repositorio Digital de la UPF
repository.name.fl_str_mv
repository.mail.fl_str_mv
_version_	1869413628763963392
spelling	Corpus-based sentence deletion and split decisions for Spanish text simplification | Eliminación de frases y decisiones de división basadas en corpus para simplificación de textos en español
Stajner, Sanja | Drndarevic, Biljana | Saggion, Horacio
Spanish text simplification | Supervised learning | Sentence classification | Simplificación de textos en español | Aprendizaje supervisado | Clasificación de frases
This study addresses the automatic simplification of texts in Spanish in order to make them more accessible to people with cognitive disabilities. A corpus analysis of original and manually simplified news articles was undertaken in order to identify and quantify relevant operations to be implemented in a text simplification system. The articles were further compared at sentence and text level by means of automatic feature extraction and various machine learning classification algorithms, using three different groups of features (POS frequencies, syntactic information, and text complexity measures) with the aim of identifying features that help separate original documents from their simple equivalents. Finally, it was investigated whether these features can be used to decide upon simplification operations to be carried out at the sentence level (split, delete, and reduce). Automatic classification of original sentences into those to be kept and those to be eliminated outperformed the classification that was previously conducted on the same corpus. Kept sentences were further classified into those to be split or significantly reduced in length and those to be left largely unchanged, with the overall F-measure up to 0.92. Both experiments were conducted and compared on two different sets of features: all features and the best subset returned by an attribute selection algorithm.
Este estudio aborda el problema de simplificación automática de textos en español con el fin de hacerlos más accesible a las personas con discapacidades cognitivas. Análisis de corpus de artículos originales y artículos simplificados manualmente se ha realizado para identificar y calificar relevantes operaciones que tienen que ser implementadas en el sistema de simplificación de textos. Luego los artículos se han comparado al nivel de frase y texto mediante extracción automática de características y diversos algoritmos de aprendizaje de máquina para clasificación usando tres distintos grupos de características (frecuencias de partes de oración (POS), información sintáctica y medidas de la complejidad de textos) con el propósito de identificar las características que ayuden a distinguir los documentos originales de sus simples equivalentes. Finalmente, se ha investigado la posibilidad de usar esas características en operaciones de simplificación a nivel de frase (dividir, eliminar y reducir). Clasificación automática de frases originales en las que deben preservarse y las que deben eliminarse ha superado la clasificación anterior sobre el mismo corpus. Las frases guardadas luego se clasificaron en las que se dividen o reducen de manera significativa en su longitud y las que se quedan sin cambios mayores con la F-medida de 0.92. Ambos experimentos se realizaron y compararon sobre dos distintos conjuntos de características: el de todas características y el mejor subconjunto recuperado por el algoritmo de selección de atributos.
The research described in this paper was partially funded by the European Commission under the Seventh (FP7 - 2007-2013) Framework Programme for Research and Technological Development (FIRST 287607). This publication [communication] reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein. We acknowledge partial support from the following grants: Avanza Competitiveness grant number TSI-020302-2010-84 from the Ministry of Industry, Tourism and Trade, Spain and grant number TIN2012-38584-C06-03 and fellowship RYC-2009-04291 (Programa Ramón y Cajal 2009) from the Spanish Ministry of Economy and Competitiveness.
Centro de Investigación en Computación
2018 | 2018 | 2013
info:eu-repo/semantics/article | info:eu-repo/semantics/publishedVersion
application/pdf | application/pdf
http://hdl.handle.net/10230/36106
reponame:Repositorio Digital de la UPF | instname:Universitat Pompeu Fabra
Inglés
Computacion y Sistemas. 2013; 17(2):251-62. | info:eu-repo/grantAgreement/EC/FP7/287607 | info:eu-repo/grantAgreement/ES/3PN/TSI-020302-2010-84 | info:eu-repo/grantAgreement/ES/3PN/TIN2012-38584-C06-03
© Computing Research Center (CIC-IPN) | info:eu-repo/semantics/openAccess
oai:repositori.upf.edu:10230/36106
2026-06-12T07:21:37Z
score	15,81155
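
The abstract in this record reports an overall F-measure of up to 0.92 for sentence-level classification, without saying how the overall figure is aggregated. As a rough illustration only, the sketch below computes per-class F1 and a support-weighted average over a keep/delete decision. The weighted-average reading of "overall", the function names, and the toy labels are all assumptions made for this example; none of them come from the paper or its data.

```python
from collections import Counter


def per_class_f1(gold, predicted):
    """Return {label: F1} for every label seen in gold or predicted."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def weighted_f_measure(gold, predicted):
    """Average per-class F1, weighting each class by its count in gold."""
    scores = per_class_f1(gold, predicted)
    support = Counter(gold)
    return sum(scores[label] * n for label, n in support.items()) / len(gold)


# Hypothetical labels for six sentences (not taken from the corpus).
gold = ["keep", "keep", "keep", "delete", "delete", "keep"]
predicted = ["keep", "keep", "delete", "delete", "keep", "keep"]

print(per_class_f1(gold, predicted))
print(weighted_f_measure(gold, predicted))
```

The same two functions would apply unchanged to the second decision described in the abstract (split or reduce versus leave largely unchanged), since they only depend on the label sequences.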
