Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish

Gimeno-Gómez, David (ORCID: 0000-0002-7375-9515); Martínez-Hinarejos, Carlos-D. (ORCID: 0000-0002-6139-2891)

Bibliographic Details
Authors:	Gimeno-Gómez, David (ORCID: 0000-0002-7375-9515); Martínez-Hinarejos, Carlos-D. (ORCID: 0000-0002-6139-2891)
Resource type:	article
Publication date:	2023
Country:	Spain
Institution:	Universitat Politècnica de València (UPV)
Repository:	RiuNet. Repositorio Institucional de la Universitat Politècnica de València
Language:	English
OAI Identifier:	oai:riunet.upv.es:10251/204394
Online access:	https://riunet.upv.es/handle/10251/204394
Access Level:	open access
Keywords:	Visual speech recognition; Speaker adaptation; Fine-tuning; Adapters; Spanish language; End-to-end architectures; Languages and Computer Systems (LENGUAJES Y SISTEMAS INFORMATICOS)

id	ES_eef6ecf32afddbb578ef5791a4f18cef
oai_identifier_str	oai:riunet.upv.es:10251/204394
network_acronym_str	ES
network_name_str	España
repository_id_str
spelling	This work was partially supported by Grant CIACIF/2021/295, funded by Generalitat Valenciana, and by Grant PID2021-124719OB-I00 under the LLEER (PID2021-124719OB-I00) project, funded by MCIN/AEI/10.13039/501100011033/ and by ERDF EU, A way of making Europe.
dc.title.none.fl_str_mv	Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish
title	Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish
spellingShingle	Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish Gimeno-Gómez, David|||0000-0002-7375-9515 Visual speech recognition Speaker adaptation Fine-tuning Adapters Spanish language End-to-end architectures LENGUAJES Y SISTEMAS INFORMATICOS
title_short	Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish
title_full	Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish
title_fullStr	Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish
title_full_unstemmed	Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish
title_sort	Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish
dc.creator.none.fl_str_mv	Gimeno-Gómez, David|||0000-0002-7375-9515 Martínez-Hinarejos, Carlos-D.|||0000-0002-6139-2891
author	Gimeno-Gómez, David|||0000-0002-7375-9515
author_facet	Gimeno-Gómez, David|||0000-0002-7375-9515 Martínez-Hinarejos, Carlos-D.|||0000-0002-6139-2891
author_role	author
author2	Martínez-Hinarejos, Carlos-D.|||0000-0002-6139-2891
author2_role	author
dc.contributor.none.fl_str_mv	Departamento de Sistemas Informáticos y Computación Escuela Técnica Superior de Ingeniería Informática Centro de Investigación Pattern Recognition and Human Language Technology GENERALITAT VALENCIANA AGENCIA ESTATAL DE INVESTIGACION European Regional Development Fund Repositorio Institucional de la Universitat Politècnica de València Riunet
dc.subject.none.fl_str_mv	Visual speech recognition Speaker adaptation Fine-tuning Adapters Spanish language End-to-end architectures LENGUAJES Y SISTEMAS INFORMATICOS
topic	Visual speech recognition Speaker adaptation Fine-tuning Adapters Spanish language End-to-end architectures LENGUAJES Y SISTEMAS INFORMATICOS
description	[EN] Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. However, although remarkable results have recently been reached in the field, this task remains an open research problem due to different challenges, such as visual ambiguities, the intra-personal variability among speakers, and the complex modeling of silence. Nonetheless, these challenges can be alleviated when the task is approached from a speaker-dependent perspective. Our work focuses on the adaptation of end-to-end VSR systems to a specific speaker. Hence, we propose two different adaptation methods based on the conventional fine-tuning technique or the so-called Adapters. We conduct a comparative study in terms of performance while considering different deployment aspects such as training time and storage cost. Results on the Spanish LIP-RTVE database show that both methods are able to obtain recognition rates comparable to the state of the art, even when only a limited amount of training data is available. Although it incurs a deterioration in performance, the Adapters-based method presents a more scalable and efficient solution, significantly reducing the training time and storage cost by up to 80%.
publishDate	2023
dc.date.none.fl_str_mv	2023 2023-05-26
dc.type.none.fl_str_mv	journal article http://purl.org/coar/resource_type/c_6501 VoR http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.openaire.fl_str_mv	info:eu-repo/semantics/article
format	article
dc.identifier.none.fl_str_mv	https://riunet.upv.es/handle/10251/204394
url	https://riunet.upv.es/handle/10251/204394
dc.language.none.fl_str_mv	Inglés eng
language_invalid_str_mv	Inglés
language	eng
dc.relation.none.fl_str_mv	Agencia Estatal de Investigación http://dx.doi.org/10.13039/501100011033 Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023 PID2021-124719OB-I00 LECTURA DE LABIOS EN ESPAÑOL EN ESCENARIOS REALISTAS Generalitat Valenciana https://doi.org/10.13039/501100003359 CIACIF/2021/295 Contributions to Automatic Lipreading for Spanish European Regional Development Fund https://doi.org/10.13039/501100008530 C22/ERDF
dc.rights.none.fl_str_mv	open access http://purl.org/coar/access_right/c_abf2 Reconocimiento (by) http://creativecommons.org/licenses/by/4.0/
dc.rights.openaire.fl_str_mv	info:eu-repo/semantics/openAccess
rights_invalid_str_mv	open access http://purl.org/coar/access_right/c_abf2 Reconocimiento (by) http://creativecommons.org/licenses/by/4.0/
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	MDPI AG
publisher.none.fl_str_mv	MDPI AG
dc.source.none.fl_str_mv	reponame:RiuNet. Repositorio Institucional de la Universitat Politècnica de València instname:Universitat Politècnica de València (UPV)
instname_str	Universitat Politècnica de València (UPV)
reponame_str	RiuNet. Repositorio Institucional de la Universitat Politècnica de València
collection	RiuNet. Repositorio Institucional de la Universitat Politècnica de València
repository.name.fl_str_mv
repository.mail.fl_str_mv
