Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish
[EN] Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. However, although remarkable results have recently been reached in the field, this task remains an open research problem due to different challenges, such as visual ambiguities, th...
| Authors: | Gimeno-Gómez, David; Martínez-Hinarejos, Carlos-D. |
|---|---|
| Resource type: | article |
| Publication date: | 2023 |
| Country: | Spain |
| Institution: | Universitat Politècnica de València (UPV) |
| Repository: | RiuNet. Repositorio Institucional de la Universitat Politècnica de València |
| Language: | English |
| OAI Identifier: | oai:riunet.upv.es:10251/204394 |
| Online access: | https://riunet.upv.es/handle/10251/204394 |
| Access Level: | open access |
| Keywords: | Visual speech recognition; Speaker adaptation; Fine-tuning; Adapters; Spanish language; End-to-end architectures; LENGUAJES Y SISTEMAS INFORMATICOS |
| id |
ES_eef6ecf32afddbb578ef5791a4f18cef |
|---|---|
| oai_identifier_str |
oai:riunet.upv.es:10251/204394 |
| network_acronym_str |
ES |
| network_name_str |
España |
| repository_id_str |
|
| spelling |
Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish; Gimeno-Gómez, David|||0000-0002-7375-9515; Martínez-Hinarejos, Carlos-D.|||0000-0002-6139-2891; Visual speech recognition; Speaker adaptation; Fine-tuning; Adapters; Spanish language; End-to-end architectures; LENGUAJES Y SISTEMAS INFORMATICOS; [EN] Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. However, although remarkable results have recently been reached in the field, this task remains an open research problem due to different challenges, such as visual ambiguities, the intra-personal variability among speakers, and the complex modeling of silence. Nonetheless, these challenges can be alleviated when the task is approached from a speaker-dependent perspective. Our work focuses on the adaptation of end-to-end VSR systems to a specific speaker. Hence, we propose two different adaptation methods based on the conventional fine-tuning technique or the so-called Adapters. We conduct a comparative study in terms of performance while considering different deployment aspects such as training time and storage cost. Results on the Spanish LIP-RTVE database show that both methods are able to obtain recognition rates comparable to the state of the art, even when only a limited amount of training data is available. Although it incurs a deterioration in performance, the Adapters-based method presents a more scalable and efficient solution, significantly reducing the training time and storage cost by up to 80%.; This work was partially supported by the Grant CIACIF/2021/295 funded by Generalitat Valenciana and by the Grant PID2021-124719OB-I00 under the LLEER (PID2021-124719OB-100) project funded by MCIN/AEI/10.13039/501100011033/ and by ERDF EU, A way of making Europe.; MDPI AG; Departamento de Sistemas Informáticos y Computación; Escuela Técnica Superior de Ingeniería Informática; Centro de Investigación Pattern Recognition and Human Language Technology; GENERALITAT VALENCIANA; AGENCIA ESTATAL DE INVESTIGACION; European Regional Development Fund; Repositorio Institucional de la Universitat Politècnica de València Riunet; 2023; 2023-05-26; journal article; http://purl.org/coar/resource_type/c_6501; VoR; http://purl.org/coar/version/c_970fb48d4fbd8a85; info:eu-repo/semantics/article; application/pdf; https://riunet.upv.es/handle/10251/204394; reponame:RiuNet. Repositorio Institucional de la Universitat Politécnica de Valéncia; instname:Universitat Politècnica de València (UPV); Inglés; eng; Agencia Estatal de Investigación http://dx.doi.org/10.13039/501100011033 Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023 PID2021-124719OB-I00 LECTURA DE LABIOS EN ESPAÑOL EN ESCENARIOS REALISTAS; Generalitat Valenciana https://doi.org/10.13039/501100003359 CIACIF/2021/295 Contributions to Automatic Lipreading for Spanish; European Regional Development Fund https://doi.org/10.13039/501100008530 C22/ERDF; open access; http://purl.org/coar/access_right/c_abf2; Reconocimiento (by); http://creativecommons.org/licenses/by/4.0/; info:eu-repo/semantics/openAccess; oai:riunet.upv.es:10251/204394; 2026-06-13T07:49:27Z |
| dc.title.none.fl_str_mv |
Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish |
| title |
Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish |
| spellingShingle |
Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish; Gimeno-Gómez, David|||0000-0002-7375-9515; Visual speech recognition; Speaker adaptation; Fine-tuning; Adapters; Spanish language; End-to-end architectures; LENGUAJES Y SISTEMAS INFORMATICOS |
| title_short |
Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish |
| title_full |
Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish |
| title_fullStr |
Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish |
| title_full_unstemmed |
Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish |
| title_sort |
Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish |
| dc.creator.none.fl_str_mv |
Gimeno-Gómez, David|||0000-0002-7375-9515; Martínez-Hinarejos, Carlos-D.|||0000-0002-6139-2891 |
| author |
Gimeno-Gómez, David|||0000-0002-7375-9515 |
| author_facet |
Gimeno-Gómez, David|||0000-0002-7375-9515; Martínez-Hinarejos, Carlos-D.|||0000-0002-6139-2891 |
| author_role |
author |
| author2 |
Martínez-Hinarejos, Carlos-D.|||0000-0002-6139-2891 |
| author2_role |
author |
| dc.contributor.none.fl_str_mv |
Departamento de Sistemas Informáticos y Computación; Escuela Técnica Superior de Ingeniería Informática; Centro de Investigación Pattern Recognition and Human Language Technology; GENERALITAT VALENCIANA; AGENCIA ESTATAL DE INVESTIGACION; European Regional Development Fund; Repositorio Institucional de la Universitat Politècnica de València Riunet |
| dc.subject.none.fl_str_mv |
Visual speech recognition; Speaker adaptation; Fine-tuning; Adapters; Spanish language; End-to-end architectures; LENGUAJES Y SISTEMAS INFORMATICOS |
| topic |
Visual speech recognition; Speaker adaptation; Fine-tuning; Adapters; Spanish language; End-to-end architectures; LENGUAJES Y SISTEMAS INFORMATICOS |
| description |
[EN] Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. However, although remarkable results have recently been reached in the field, this task remains an open research problem due to different challenges, such as visual ambiguities, the intra-personal variability among speakers, and the complex modeling of silence. Nonetheless, these challenges can be alleviated when the task is approached from a speaker-dependent perspective. Our work focuses on the adaptation of end-to-end VSR systems to a specific speaker. Hence, we propose two different adaptation methods based on the conventional fine-tuning technique or the so-called Adapters. We conduct a comparative study in terms of performance while considering different deployment aspects such as training time and storage cost. Results on the Spanish LIP-RTVE database show that both methods are able to obtain recognition rates comparable to the state of the art, even when only a limited amount of training data is available. Although it incurs a deterioration in performance, the Adapters-based method presents a more scalable and efficient solution, significantly reducing the training time and storage cost by up to 80%. |
| publishDate |
2023 |
| dc.date.none.fl_str_mv |
2023 2023-05-26 |
| dc.type.none.fl_str_mv |
journal article; http://purl.org/coar/resource_type/c_6501; VoR; http://purl.org/coar/version/c_970fb48d4fbd8a85 |
| dc.type.openaire.fl_str_mv |
info:eu-repo/semantics/article |
| format |
article |
| dc.identifier.none.fl_str_mv |
https://riunet.upv.es/handle/10251/204394 |
| url |
https://riunet.upv.es/handle/10251/204394 |
| dc.language.none.fl_str_mv |
Inglés eng |
| language_invalid_str_mv |
Inglés |
| language |
eng |
| dc.relation.none.fl_str_mv |
Agencia Estatal de Investigación http://dx.doi.org/10.13039/501100011033 Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023 PID2021-124719OB-I00 LECTURA DE LABIOS EN ESPAÑOL EN ESCENARIOS REALISTAS; Generalitat Valenciana https://doi.org/10.13039/501100003359 CIACIF/2021/295 Contributions to Automatic Lipreading for Spanish; European Regional Development Fund https://doi.org/10.13039/501100008530 C22/ERDF |
| dc.rights.none.fl_str_mv |
open access; http://purl.org/coar/access_right/c_abf2; Reconocimiento (by); http://creativecommons.org/licenses/by/4.0/ |
| dc.rights.openaire.fl_str_mv |
info:eu-repo/semantics/openAccess |
| rights_invalid_str_mv |
open access; http://purl.org/coar/access_right/c_abf2; Reconocimiento (by); http://creativecommons.org/licenses/by/4.0/ |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf |
| dc.publisher.none.fl_str_mv |
MDPI AG |
| publisher.none.fl_str_mv |
MDPI AG |
| dc.source.none.fl_str_mv |
reponame:RiuNet. Repositorio Institucional de la Universitat Politécnica de Valéncia; instname:Universitat Politècnica de València (UPV) |
| instname_str |
Universitat Politècnica de València (UPV) |
| reponame_str |
RiuNet. Repositorio Institucional de la Universitat Politécnica de Valéncia |
| collection |
RiuNet. Repositorio Institucional de la Universitat Politécnica de Valéncia |
| repository.name.fl_str_mv |
|
| repository.mail.fl_str_mv |
|
| _version_ |
1869423794422022144 |