A large Spanish-Catalan parallel corpus release for machine translation


Bibliographic Details
Authors: Costa-jussà, Marta R.; Fonollosa, José A Rodriguez; Mariño Acebal, José B.; Poch, Marc; Farrús, Mireia
Resource type: article
Status: Published version
Publication date: 2014
Country: Spain
Institution: Various* (Consorci de Biblioteques Universitáries de Catalunya, Centre de Serveis Científics i Acadèmics de Catalunya)
Repository: Recercat. Dipósit de la Recerca de Catalunya
OAI Identifier: oai:recercat.cat:10230/26266
Online access: http://hdl.handle.net/10230/26266
Access Level: open access
Keywords: Catalan-Spanish parallel corpus
Machine translation
Description
Abstract: We present a large Spanish-Catalan parallel corpus extracted from ten years of the paper edition of a bilingual Catalan newspaper. The produced corpus of 7.5M parallel sentences (around 180M words per language) is useful for many natural language applications. We report excellent results when building a statistical machine translation system trained on this parallel corpus. The Spanish-Catalan corpus is partially available via ELDA (Evaluations and Language Resources Distribution Agency) under catalog number ELRA-W0053.