Transcription of Spanish Historical Handwritten Documents with Deep Neural Networks

Granell, Emilio; Chammas, Edgard; Likforman-Sulem, Laurence; Mokbel, Chafic; Cirstea, Bogdan-Ionut; Martínez-Hinarejos, Carlos-D. (ORCID: 0000-0002-6139-2891)

[EN] The digitization of historical handwritten document images is important for the preservation of cultural heritage. Moreover, the transcription of text images obtained from digitization is necessary to provide efficient information access to the content of these documents. Handwritten Text Recog...

Bibliographic Details
Authors:	Granell, Emilio; Chammas, Edgard; Likforman-Sulem, Laurence; Mokbel, Chafic; Cirstea, Bogdan-Ionut; Martínez-Hinarejos, Carlos-D. (ORCID: 0000-0002-6139-2891)
Format:	article
Publication Date:	2018
Country:	España
Institution:	Universitat Politècnica de València (UPV)
Repository:	RiuNet. Repositorio Institucional de la Universitat Politècnica de València
Language:	English
OAI Identifier:	oai:riunet.upv.es:10251/120670
Online Access:	https://riunet.upv.es/handle/10251/120670
Access Level:	Open access
Keyword:	Character-level language model; Historical handwritten transcription; Out-of-vocabulary word recognition; Word structure retrieval; LENGUAJES Y SISTEMAS INFORMATICOS

id	ES_5463a1f0028aa187bbde84e583e4e8d6
oai_identifier_str	oai:riunet.upv.es:10251/120670
network_acronym_str	ES
network_name_str	España
repository_id_str
spelling	Work partially supported by projects READ: Recognition and Enrichment of Archival Documents - 674943 (European Union's H2020) and CoMUN-HaT: Context, Multimodality and User Collaboration in Handwritten Text Processing - TIN2015-70924-C2-1-R (MINECO/FEDER), and a DGA-MRIS (Direction Generale de l'Armement - Mission pour la Recherche et l'Innovation Scientifique) scholarship.
dc.title.none.fl_str_mv	Transcription of Spanish Historical Handwritten Documents with Deep Neural Networks
dc.creator.none.fl_str_mv	Granell, Emilio; Chammas, Edgard; Likforman-Sulem, Laurence; Mokbel, Chafic; Cirstea, Bogdan-Ionut; Martínez-Hinarejos, Carlos-D. (ORCID: 0000-0002-6139-2891)
dc.contributor.none.fl_str_mv	Departamento de Sistemas Informáticos y Computación; Escuela Técnica Superior de Ingeniería Informática; Centro de Investigación Pattern Recognition and Human Language Technology; Ministerio de Economía, Industria y Competitividad; Repositorio Institucional de la Universitat Politècnica de València Riunet
dc.subject.none.fl_str_mv	Character-level language model; Historical handwritten transcription; Out-of-vocabulary word recognition; Word structure retrieval; LENGUAJES Y SISTEMAS INFORMATICOS
description	[EN] The digitization of historical handwritten document images is important for the preservation of cultural heritage. Moreover, transcribing the text images obtained from digitization is necessary to provide efficient access to the content of these documents. Handwritten Text Recognition (HTR), which produces transcriptions from text images, has become an important research topic in image processing and computational language processing. State-of-the-art HTR systems are, however, far from perfect. One difficulty is that they have to cope with image noise and handwriting variability. Another is the large number of Out-Of-Vocabulary (OOV) words in ancient historical texts. One solution to the OOV problem is to use external lexical resources, but such resources may be scarce or unavailable given the nature and age of these documents. This work proposes a solution that avoids this limitation: it combines a powerful optical recognition system, which copes with image noise and variability, with a language model based on sub-lexical units, which models OOV words. This language modeling approach reduces the size of the lexicon while increasing its coverage. Experiments are first conducted on the publicly available Rodrigo dataset, a digitized ancient Spanish manuscript, with a recognizer based on Hidden Markov Models (HMMs). They show that sub-lexical units outperform word units in terms of Word Error Rate (WER), Character Error Rate (CER), and OOV word accuracy rate. The approach is then applied to deep net classifiers, namely Bidirectional Long Short-Term Memory networks (BLSTMs) and Convolutional Recurrent Neural Networks (CRNNs). Results show that CRNNs outperform HMMs and BLSTMs, reaching the lowest WER and CER for this dataset and significantly improving OOV recognition.
publishDate	2018
dc.date.none.fl_str_mv	2018 2018-01-01
dc.type.none.fl_str_mv	journal article http://purl.org/coar/resource_type/c_6501 VoR http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.openaire.fl_str_mv	info:eu-repo/semantics/article
format	article
dc.identifier.none.fl_str_mv	https://riunet.upv.es/handle/10251/120670
url	https://riunet.upv.es/handle/10251/120670
dc.language.none.fl_str_mv	Inglés eng
language_invalid_str_mv	Inglés
language	eng
dc.relation.none.fl_str_mv	European Commission (https://doi.org/10.13039/501100000780), H2020, 674943, Recognition and Enrichment of Archival Documents; Ministerio de Economía y Competitividad (http://dx.doi.org/10.13039/501100003329), TIN2015-70924-C2-1-R, CONTEXTO, MULTIMODALIDAD Y COLABORACION DEL USUARIO EN PROCESADO DE TEXTO MANUSCRITO
dc.rights.none.fl_str_mv	open access (http://purl.org/coar/access_right/c_abf2); Attribution (BY) (http://creativecommons.org/licenses/by/4.0/)
dc.rights.openaire.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	MDPI AG
dc.source.none.fl_str_mv	reponame:RiuNet. Repositorio Institucional de la Universitat Politècnica de València; instname:Universitat Politècnica de València (UPV)
instname_str	Universitat Politècnica de València (UPV)
reponame_str	RiuNet. Repositorio Institucional de la Universitat Politècnica de València
collection	RiuNet. Repositorio Institucional de la Universitat Politècnica de València
repository.name.fl_str_mv
repository.mail.fl_str_mv

Similar Items