Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan–Spanish language pair

Bibliographic Details
Authors: Farrús, Mireia; Costa-jussà, Marta R.; Mariño Acebal, José B.; Poch, Marc; Hernández, Adolfo; Henríquez, Carlos; Fonollosa, José A Rodriguez
Resource type: article
Status: Version accepted for publication
Publication date: 2011
Country: Spain
Institution: Various* (Consorci de Biblioteques Universitàries de Catalunya, Centre de Serveis Científics i Acadèmics de Catalunya)
Repository: Recercat. Dipòsit de la Recerca de Catalunya
OAI Identifier: oai:recercat.cat:10230/32733
Online access: http://hdl.handle.net/10230/32733
http://dx.doi.org/10.1007/s10579-011-9137-0
Access level: open access
Keywords: Statistical machine translation
N-gram-based translation
Linguistic knowledge
Grammatical categories
Description
Summary: This work aims to improve an N-gram-based statistical machine translation system between Catalan and Spanish, trained on an aligned Spanish-Catalan parallel corpus of 1.7 million sentences taken from the newspaper El Periódico. Starting from a linguistic error analysis of this baseline system, orthographic, morphological, lexical, semantic and syntactic problems are addressed with a set of techniques. The proposed solutions include the development and application of additional statistical techniques, text pre- and post-processing tasks, and rules based on grammatical categories, as well as lexical categorization. The improved system performs clearly better than the baseline in both human and automatic evaluations, with a gain of about 1.1 BLEU points in the Spanish-to-Catalan direction and about 0.5 points in the reverse direction. The final system is freely available online as a linguistic resource.