Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish

[EN] Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. However, although remarkable results have recently been reached in the field, this task remains an open research problem due to different challenges, such as visual ambiguities, th...

Bibliographic Details
Authors: Gimeno-Gómez, David (ORCID 0000-0002-7375-9515); Martínez-Hinarejos, Carlos-D. (ORCID 0000-0002-6139-2891)
Resource type: article
Publication date: 2023
Country: Spain
Institution: Universitat Politècnica de València (UPV)
Repository: RiuNet. Repositorio Institucional de la Universitat Politècnica de València
Language: English
OAI Identifier: oai:riunet.upv.es:10251/204394
Online access: https://riunet.upv.es/handle/10251/204394
Access level: open access
Keywords: Visual speech recognition
Speaker adaptation
Fine-tuning
Adapters
Spanish language
End-to-end architectures
LENGUAJES Y SISTEMAS INFORMATICOS
id ES_eef6ecf32afddbb578ef5791a4f18cef
oai_identifier_str oai:riunet.upv.es:10251/204394
network_acronym_str ES
network_name_str España
repository_id_str
funding This work was partially supported by Grant CIACIF/2021/295 funded by Generalitat Valenciana and by Grant PID2021-124719OB-I00 under the LLEER (PID2021-124719OB-I00) project funded by MCIN/AEI/10.13039/501100011033/ and by ERDF EU, "A way of making Europe".
dc.title.none.fl_str_mv Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish
dc.creator.none.fl_str_mv Gimeno-Gómez, David (ORCID 0000-0002-7375-9515)
Martínez-Hinarejos, Carlos-D. (ORCID 0000-0002-6139-2891)
dc.contributor.none.fl_str_mv Departamento de Sistemas Informáticos y Computación
Escuela Técnica Superior de Ingeniería Informática
Centro de Investigación Pattern Recognition and Human Language Technology
GENERALITAT VALENCIANA
AGENCIA ESTATAL DE INVESTIGACION
European Regional Development Fund
Repositorio Institucional de la Universitat Politècnica de València Riunet
dc.subject.none.fl_str_mv Visual speech recognition
Speaker adaptation
Fine-tuning
Adapters
Spanish language
End-to-end architectures
LENGUAJES Y SISTEMAS INFORMATICOS
description [EN] Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. However, although remarkable results have recently been reached in the field, this task remains an open research problem due to different challenges, such as visual ambiguities, the intra-personal variability among speakers, and the complex modeling of silence. Nonetheless, these challenges can be alleviated when the task is approached from a speaker-dependent perspective. Our work focuses on the adaptation of end-to-end VSR systems to a specific speaker. Hence, we propose two different adaptation methods based on the conventional fine-tuning technique or the so-called Adapters. We conduct a comparative study in terms of performance while considering different deployment aspects such as training time and storage cost. Results on the Spanish LIP-RTVE database show that both methods are able to obtain recognition rates comparable to the state of the art, even when only a limited amount of training data is available. Although it incurs a deterioration in performance, the Adapters-based method presents a more scalable and efficient solution, significantly reducing the training time and storage cost by up to 80%.
publishDate 2023
dc.date.none.fl_str_mv 2023
2023-05-26
dc.type.none.fl_str_mv journal article
http://purl.org/coar/resource_type/c_6501
VoR
http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.openaire.fl_str_mv info:eu-repo/semantics/article
format article
dc.identifier.none.fl_str_mv https://riunet.upv.es/handle/10251/204394
url https://riunet.upv.es/handle/10251/204394
dc.language.none.fl_str_mv English
eng
language eng
dc.relation.none.fl_str_mv Agencia Estatal de Investigación http://dx.doi.org/10.13039/501100011033 Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023 PID2021-124719OB-I00 LECTURA DE LABIOS EN ESPAÑOL EN ESCENARIOS REALISTAS
Generalitat Valenciana https://doi.org/10.13039/501100003359 CIACIF/2021/295 Contributions to Automatic Lipreading for Spanish
European Regional Development Fund https://doi.org/10.13039/501100008530 C22/ERDF
dc.rights.none.fl_str_mv open access
http://purl.org/coar/access_right/c_abf2
Attribution (by)
http://creativecommons.org/licenses/by/4.0/
dc.rights.openaire.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv MDPI AG
dc.source.none.fl_str_mv reponame:RiuNet. Repositorio Institucional de la Universitat Politècnica de València
instname:Universitat Politècnica de València (UPV)
instname_str Universitat Politècnica de València (UPV)
reponame_str RiuNet. Repositorio Institucional de la Universitat Politècnica de València
collection RiuNet. Repositorio Institucional de la Universitat Politècnica de València