Egocentric video description based on temporally-linked sequences

Bibliographic Details
Authors: Bolaños, Marc; Peris-Abril, Álvaro; Soler, Sergi; Radeva, Petia; Casacuberta Nolla, Francisco (ORCID: 0000-0002-8497-5598)
Resource type: article
Publication date: 2018
Country: Spain
Institution: Universitat Politècnica de València (UPV)
Repository: RiuNet. Repositorio Institucional de la Universitat Politècnica de València
Language: English
OAI Identifier: oai:riunet.upv.es:10251/141941
Online access: https://riunet.upv.es/handle/10251/141941
Access level: open access
Keywords: Egocentric vision
Video description
Deep learning
Multi-modal learning
LENGUAJES Y SISTEMAS INFORMATICOS
Description
Summary: [EN] Egocentric vision consists of acquiring images throughout the day from a first-person point of view using wearable cameras. The automatic analysis of this information makes it possible to discover daily patterns for improving the user's quality of life. A natural topic that arises in egocentric vision is storytelling, that is, how to understand and tell the story lying behind the pictures. In this paper, we tackle storytelling as an egocentric sequence description problem. We propose a novel methodology that exploits information from temporally neighboring events, precisely matching the nature of egocentric sequences. Furthermore, we present a new method for multimodal data fusion consisting of a multi-input attention recurrent network. We also release the EDUB-SegDesc dataset. This is the first dataset for egocentric image sequence description, consisting of 1339 events with 3991 descriptions, acquired by 11 people over 55 days. Finally, we prove that our proposal outperforms classical attentional encoder-decoder methods for video description.