Visual content-based web page categorization with deep transfer learning and metric learning.

Bibliographic Details
Authors: López Sánchez, Daniel; González Arrieta, María Angélica; Corchado Rodríguez, Juan Manuel
Resource type: article
Status: Published version
Publication date: 2019
Country: Spain
Institution: Universidad de Salamanca (USAL)
Repository: GREDOS. Repositorio Institucional de la Universidad de Salamanca
OAI Identifier: oai:gredos.usal.es:10366/157119
Online access: http://hdl.handle.net/10366/157119
Access level: open access
Keywords: Web page categorization
Metric learning
Transfer learning
Deep learning
1203.17 Computer science
Description
Summary: The growing amount of online multimedia content challenges current search, recommendation and information retrieval systems. Information in the form of visual elements is highly valuable in a range of web mining tasks. However, mining these resources is difficult because of the complexity and variability of images, and the cost of collecting datasets large enough to train accurate deep learning models. This paper proposes a novel framework for categorizing web pages on the basis of their visual content. It explores the joint application of a transfer learning strategy and metric learning techniques to build a Deep Convolutional Neural Network (DCNN) for feature extraction, even when training data is scarce. The experimental results show that the proposed approach outperforms state-of-the-art handcrafted image descriptors and achieves high categorization accuracy. In addition, we address the problem of over-time learning, so that the proposed framework can learn to identify new web page categories as new labeled images are provided at test time. As a result, prior knowledge of the complete set of possible web categories is not necessary in the initial training phase.
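
The over-time learning idea in the summary can be illustrated with a minimal sketch. It assumes that images have already been mapped to fixed-length feature vectors by a feature extractor such as the DCNN, and that new categories are handled by a nearest-class-mean rule over those vectors. The summary does not say which classifier the paper uses, so this rule is an assumption chosen for illustration, and the class name `NearestCentroidOverTime`, the labels, and the toy 2-D vectors are all hypothetical.

```python
import math


class NearestCentroidOverTime:
    """Nearest-class-mean classifier whose category set can grow at any time.

    Hypothetical illustration: embeddings stand in for DCNN features.
    """

    def __init__(self):
        self._sums = {}    # label -> element-wise running sum of embeddings
        self._counts = {}  # label -> number of embeddings seen for that label

    def add_example(self, embedding, label):
        """Register one labeled embedding; unseen labels create a new category."""
        if label not in self._sums:
            self._sums[label] = [0.0] * len(embedding)
            self._counts[label] = 0
        self._sums[label] = [s + x for s, x in zip(self._sums[label], embedding)]
        self._counts[label] += 1

    def centroid(self, label):
        """Mean embedding of all examples seen so far for the given label."""
        n = self._counts[label]
        return [s / n for s in self._sums[label]]

    def predict(self, embedding):
        """Return the label whose centroid is closest in Euclidean distance."""
        if not self._sums:
            raise ValueError("no labeled examples have been added yet")
        return min(
            self._sums,
            key=lambda label: math.dist(embedding, self.centroid(label)),
        )


# Initial training phase: only two categories are known.
clf = NearestCentroidOverTime()
clf.add_example([0.9, 0.1], "news")
clf.add_example([1.1, -0.1], "news")
clf.add_example([0.0, 1.0], "shopping")
print(clf.predict([1.0, 0.2]))

# Later, a labeled image from a previously unseen category arrives.
clf.add_example([-1.0, -1.0], "sports")
print(clf.predict([-0.9, -0.8]))
```

Because each category is summarized only by a running sum and a count, adding a category does not require retraining the feature extractor or revisiting earlier examples. This matches the summary's claim that the full set of categories need not be known in advance, under the assumptions stated above.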