A cellular-based evolutionary approach for the extraction of emerging patterns in massive data streams

Bibliographic details
Authors: García-Vico, Ángel M.; Carmona, Cristóbal J.; González, Pedro; del Jesus, María José
Document type: Article
Status: Version accepted for publication
Publication date: 2021
Country: Spain
Resources: Universidad de Jaén
Repository: RUJA. Repositorio Institucional de la Producción Científica de la Universidad de Jaén
OAI identifier: oai:ruja.ujaen.es:10953/4304
Online access: https://doi.org/10.1016/j.eswa.2021.115419
https://hdl.handle.net/10953/4304
Access level: Open access
Keywords: Big data
Data stream mining
Evolutionary algorithms
Fuzzy logic
Emerging pattern mining
Description
Abstract: Today, the number of existing devices generates immense amounts of data on a continuous basis that must be processed by new distributed data stream mining approaches. In this paper we present a new approach for extracting descriptive emerging patterns in massive data streams from different sources through Apache Kafka and Apache Spark Streaming, whose objective is to monitor the state of the system with respect to a variable of interest. For this purpose, the proposed algorithm is a cellular-based multi-objective evolutionary fuzzy system that uses an informed strategy for efficient data processing and a re-initialisation and filtering mechanism to eliminate redundant and low-reliability patterns. The experimental study carried out demonstrates an interpretability improvement of 25% in the extraction of high-interest knowledge by the proposed algorithm, which would make it easier for experts to analyse the problem. Finally, the proposed algorithm is up to five times faster than another proposal when processing the same amount of data. In this experimental study, up to 750,000 instances have been processed in approximately four seconds.