Language modelling for speaker diarization in telephonic interviews

Bibliographic Details
Authors: India Massana, Miquel Àngel (ORCID 0000-0002-3107-3662); Hernando Pericás, Francisco Javier (ORCID 0000-0002-1730-8154); Rodríguez Fonollosa, José Adrián (ORCID 0000-0001-9513-7939)
Resource type: article
Publication date: 2022
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/374077
Online access: https://hdl.handle.net/2117/374077
https://dx.doi.org/10.1016/j.csl.2022.101441
Access level: open access
Keywords: Speech processing systems
Neural networks (Computer science)
Speaker diarization
Language modelling
Acoustic modelling
LSTM neural networks
Speech processing
UPC subject areas::Telecommunications engineering::Signal processing::Speech and acoustic signal processing
Description
Summary: The aim of this paper is to investigate the benefit of combining language and acoustic modelling for speaker diarization. Although conventional systems use only acoustic features, in some scenarios linguistic data carry highly discriminative speaker information that is even more reliable than the acoustic information. In this study we analyze how an appropriate fusion of both kinds of features obtains good results in these cases. The proposed system is based on an iterative algorithm in which an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a Call-Center database composed of telephone interview recordings. The combination of acoustic features and linguistic content shows an 84.29% improvement in word-level DER compared with an HMM/VB baseline system. The results of this study confirm that linguistic content can be used efficiently for some speaker recognition tasks.
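
The summary reports its result as an improvement in word-level DER (diarization error rate). This record does not give the paper's exact metric definition, so the following is only a minimal sketch under stated assumptions: word-level DER is taken to be the fraction of words assigned to the wrong speaker under the best one-to-one mapping between hypothesis and reference speaker labels, and the improvement is taken to be relative to the baseline. The function names `word_level_der` and `relative_improvement` and all values in the usage lines are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def word_level_der(reference, hypothesis):
    """Fraction of words with the wrong speaker label (assumed definition).

    Both arguments are sequences with one speaker label per word.
    Hypothesis labels are arbitrary cluster names, so every one-to-one
    mapping onto reference labels is tried and the best one is kept.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    if not reference:
        return 0.0

    ref_speakers = sorted(set(reference))
    hyp_speakers = sorted(set(hypothesis))
    # Pad with None so that surplus hypothesis speakers map to "no speaker".
    padding = [None] * max(0, len(hyp_speakers) - len(ref_speakers))
    targets = ref_speakers + padding

    best_errors = len(reference)
    for assignment in permutations(targets, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, assignment))
        errors = sum(mapping[h] != r for r, h in zip(reference, hypothesis))
        best_errors = min(best_errors, errors)
    return best_errors / len(reference)


def relative_improvement(baseline_der, system_der):
    """Relative DER reduction in percent (assumed reading of 'improvement')."""
    return 100.0 * (baseline_der - system_der) / baseline_der


# Hypothetical toy labels, one per word; not data from the paper.
reference = ["A", "A", "B", "B", "A"]
hypothesis = ["s1", "s1", "s0", "s1", "s1"]
toy_der = word_level_der(reference, hypothesis)
toy_gain = relative_improvement(0.5, 0.1)
```

The exhaustive search over label mappings is only practical for a handful of speakers, which suits two-party telephone interviews; an evaluation with many speakers would normally use an optimal assignment algorithm instead.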