MapReduce performance model for Hadoop 2.x

Bibliographic Details
Authors: Glushkova, Daria (ORCID: 0000-0002-8906-4793); Jovanovic, Petar (ORCID: 0000-0003-4635-6646); Abelló Gamazo, Alberto (ORCID: 0000-0002-3223-2186)
Resource type: article
Publication date: 2018
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/124328
Online access: https://hdl.handle.net/2117/124328
https://dx.doi.org/10.1016/j.is.2017.11.006
Access level: open access
Keywords: Electronic data processing -- Distributed processing
Cost effectiveness
Hadoop 2.x
MapReduce performance model
Processament distribuït de dades (distributed data processing)
Cost-eficàcia (cost-effectiveness)
UPC subject areas::Computer science::Computer architecture::Distributed architectures
Description
Summary: MapReduce is a popular programming model for distributed processing of large data sets. Apache Hadoop is one of the most common open-source implementations of this paradigm. Performance analysis of concurrent job executions has been recognized as a challenging problem; at the same time, it may provide reasonably accurate job response time estimates at significantly lower cost than experimental evaluation of real setups. In this paper, we tackle the challenge of defining a MapReduce performance model for Hadoop 2.x. While there are several efficient approaches for modeling the performance of MapReduce workloads in Hadoop 1.x, they cannot be applied to Hadoop 2.x because of its fundamental architectural changes and dynamic resource allocation. The proposed solution is therefore based on an existing performance model for Hadoop 1.x, but takes the architectural changes into account and captures the execution flow of a MapReduce job using a queuing network model. This way, the cost model reflects the intra-job synchronization constraints that arise from contention at shared resources. The accuracy of our solution is validated by comparing our model estimates against measurements in a real Hadoop 2.x setup.