A cellular-based evolutionary approach for the extraction of emerging patterns in massive data streams

Bibliographic Details
Authors: García-Vico, Ángel M.; Carmona, Cristóbal J.; González, Pedro; del Jesus, María José
Resource type: article
Status: Accepted version for publication
Publication date: 2021
Country: Spain
Institution: Universidad de Jaén
Repository: RUJA. Repositorio Institucional de la Producción Científica de la Universidad de Jaén
OAI Identifier: oai:ruja.ujaen.es:10953/4304
Online access: https://doi.org/10.1016/j.eswa.2021.115419
https://hdl.handle.net/10953/4304
Access level: open access
Keywords: Big data
Data stream mining
Evolutionary algorithms
Fuzzy logic
Emerging pattern mining
Description
Summary: Today, the large number of connected devices continuously generates immense amounts of data that must be processed by new distributed data stream mining approaches. In this paper we present a new approach for extracting descriptive emerging patterns from massive data streams that arrive from different sources through Apache Kafka and Apache Spark Streaming. Its objective is to monitor the state of the system with respect to a variable of interest. For this purpose, the proposed algorithm is a cellular-based multi-objective evolutionary fuzzy system that uses an informed strategy for efficient data processing and a re-initialisation and filtering mechanism to eliminate redundant and low-reliability patterns. The experimental study demonstrates an interpretability improvement of 25% in the extraction of high-interest knowledge by the proposed algorithm, which would make it easier for experts to analyse the problem. Finally, the proposed algorithm is up to five times faster than another proposal when processing the same amount of data. In this experimental study, up to 750,000 instances were processed in approximately four seconds.
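
For readers unfamiliar with the term, an emerging pattern is conventionally a pattern whose support is markedly higher in one class of the variable of interest than in the remaining data, as measured by its growth rate. The following is a minimal sketch of that general definition only, not of the proposed algorithm: it assumes crisp attribute-value patterns rather than the fuzzy ones used in the paper, and the names `support`, `growth_rate`, and the toy sensor data are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: an instance (and a pattern) is a mapping
# from attribute name to a crisp categorical value. The paper uses fuzzy
# patterns; crisp matching is an assumption made for brevity.
Instance = Dict[str, str]


def support(pattern: Instance, data: List[Instance]) -> float:
    """Fraction of instances that match every attribute-value pair of the pattern."""
    if not data:
        return 0.0
    covered = sum(
        all(row.get(attr) == value for attr, value in pattern.items())
        for row in data
    )
    return covered / len(data)


def growth_rate(pattern: Instance, target: List[Instance], rest: List[Instance]) -> float:
    """Ratio of the pattern's support in the target class to its support elsewhere."""
    s_target = support(pattern, target)
    s_rest = support(pattern, rest)
    if s_rest == 0.0:
        # Conventionally infinite when the pattern covers only the target class.
        return float("inf") if s_target > 0.0 else 0.0
    return s_target / s_rest


# Toy batch: instances of the class of interest versus all other instances.
target = [
    {"temp": "high", "load": "high"},
    {"temp": "high", "load": "low"},
    {"temp": "high", "load": "high"},
    {"temp": "low", "load": "high"},
]
rest = [
    {"temp": "low", "load": "low"},
    {"temp": "high", "load": "low"},
    {"temp": "low", "load": "high"},
    {"temp": "low", "load": "low"},
]

pattern = {"temp": "high"}
print(growth_rate(pattern, target, rest))
```

A pattern is usually reported as emerging when its growth rate exceeds a chosen threshold greater than one; in a streaming setting such a measure would be recomputed on each incoming batch.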