Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish

Gimeno-Gómez, David (ORCID: 0000-0002-7375-9515); Martínez-Hinarejos, Carlos-D. (ORCID: 0000-0002-6139-2891)


Bibliographic Details
Authors:	Gimeno-Gómez, David (ORCID: 0000-0002-7375-9515); Martínez-Hinarejos, Carlos-D. (ORCID: 0000-0002-6139-2891)
Resource type:	article
Publication date:	2023
Country:	Spain
Institution:	Universitat Politècnica de València (UPV)
Repository:	RiuNet. Institutional Repository of the Universitat Politècnica de València
Language:	English
OAI Identifier:	oai:riunet.upv.es:10251/204394
Online access:	https://riunet.upv.es/handle/10251/204394
Access level:	open access
Keywords:	Visual speech recognition; Speaker adaptation; Fine-tuning; Adapters; Spanish language; End-to-end architectures; LENGUAJES Y SISTEMAS INFORMATICOS

Description
Summary:	[EN] Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. However, although remarkable results have recently been reached in the field, this task remains an open research problem due to different challenges, such as visual ambiguities, the intra-personal variability among speakers, and the complex modeling of silence. Nonetheless, these challenges can be alleviated when the task is approached from a speaker-dependent perspective. Our work focuses on the adaptation of end-to-end VSR systems to a specific speaker. Hence, we propose two different adaptation methods based on the conventional fine-tuning technique or the so-called Adapters. We conduct a comparative study in terms of performance while considering different deployment aspects such as training time and storage cost. Results on the Spanish LIP-RTVE database show that both methods are able to obtain recognition rates comparable to the state of the art, even when only a limited amount of training data is available. Although it incurs a deterioration in performance, the Adapters-based method presents a more scalable and efficient solution, significantly reducing the training time and storage cost by up to 80%.
