Multimodal multilingual models: improving image-text embedding alignment across languages with limited data

Bibliographic Details
Author: Pikabea Mentxaka, Iñigo
Resource type: master's thesis
Publication date: 2025
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/446130
Online access: https://hdl.handle.net/2117/446130
Access level: open access
Keywords: Deep learning (Machine learning)
Natural language processing (Computer science)
Computer vision
Deep learning
Visual language models
Multimodality
LLMs
Aprenentatge profund (Aprenentatge automàtic)
Tractament del llenguatge natural (Informàtica)
Visió per ordinador
Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic
Description
Summary: Rapid advances in Visual Language Models (VLMs) have transformed multimodal understanding, but these models often generate English responses regardless of the input language. This phenomenon, termed Image-induced Fidelity Loss (IFL), stems from limited multimodal multilingual training data. To address this, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model's original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages (from 2.7% to 88.7% in German and from 4.4% to 92.9% in Spanish) without degrading visual performance. We also explore model merging, which improves language fidelity but at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.
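
The two ideas in the summary, mixing text-only multilingual samples into the visual instruction-tuning stream and reporting linguistic fidelity as a percentage, can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names, the sample dictionary layout, the `text_ratio` parameter, and the reading of fidelity as "share of responses whose detected language matches the expected language" are all assumptions made for illustration; the record gives neither the mixing ratio nor the language detector used.

```python
import random
from typing import Dict, Iterable, Iterator, List, Optional, Sequence

# Hypothetical sample layout: "image" is None for text-only samples.
Sample = Dict[str, Optional[str]]


def mix_samples(
    multimodal: Iterable[Sample],
    text_only: Sequence[Sample],
    text_ratio: float,
    seed: int = 0,
) -> Iterator[Sample]:
    """Yield every multimodal sample, interleaving text-only samples so that
    they make up roughly `text_ratio` of the resulting stream (assumed scheme)."""
    if not 0.0 <= text_ratio < 1.0:
        raise ValueError("text_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Expected number of text-only samples per multimodal sample.
    expected = text_ratio / (1.0 - text_ratio)
    for sample in multimodal:
        yield sample
        count = int(expected)
        # Randomly round the fractional part so the long-run ratio holds.
        if rng.random() < expected - count:
            count += 1
        for _ in range(count):
            if text_only:
                yield rng.choice(text_only)


def language_fidelity(expected: List[str], detected: List[str]) -> float:
    """Percentage of responses whose detected language equals the expected one."""
    if len(expected) != len(detected):
        raise ValueError("expected and detected must have the same length")
    if not expected:
        return 0.0
    matches = sum(e == d for e, d in zip(expected, detected))
    return 100.0 * matches / len(expected)


if __name__ == "__main__":
    images = [{"image": f"img{i}.jpg", "text": "Describe the image."} for i in range(4)]
    texts = [
        {"image": None, "text": "Wie ist das Wetter heute?"},
        {"image": None, "text": "¿Qué hora es?"},
    ]
    stream = list(mix_samples(images, texts, text_ratio=0.5))
    print(len(stream), "samples in the mixed stream")
    print(language_fidelity(["de", "de", "es", "es"], ["de", "en", "es", "es"]))
```

In practice the detected language codes would come from a language-identification tool applied to the model's responses; that step is left out here because the record does not say which tool the thesis used.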