Mapreduce performance model for Hadoop 2.x

MapReduce is a popular programming model for distributed processing of large data sets. Apache Hadoop is one of the most common open-source implementations of this paradigm. Performance analysis of concurrent job executions is recognized as a challenging problem; at the same time, it may pro...

Bibliographic details
Authors: Glushkova, Daria (ORCID 0000-0002-8906-4793); Jovanovic, Petar (ORCID 0000-0003-4635-6646); Abelló Gamazo, Alberto (ORCID 0000-0002-3223-2186)
Document type: article
Publication date: 2018
Country: Spain
Resources: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/124328
Online access: https://hdl.handle.net/2117/124328
https://dx.doi.org/10.1016/j.is.2017.11.006
Access level: Open access
Keywords: Electronic data processing -- Distributed processing
Cost effectiveness
Hadoop 2.x
MapReduce performance model
Processament distribuït de dades
Cost-eficàcia
Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors::Arquitectures distribuïdes
id ES_2b4a3d0f2801ecc66e65d2de10a731b2
network_acronym_str ES
network_name_str España
description MapReduce is a popular programming model for distributed processing of large data sets. Apache Hadoop is one of the most common open-source implementations of this paradigm. Performance analysis of concurrent job executions is recognized as a challenging problem; at the same time, it may provide reasonably accurate job response time estimates at significantly lower cost than experimental evaluation of real setups. In this paper, we tackle the challenge of defining a MapReduce performance model for Hadoop 2.x. While there are several efficient approaches for modeling the performance of MapReduce workloads in Hadoop 1.x, they cannot be applied to Hadoop 2.x because of its fundamental architectural changes and dynamic resource allocation. The proposed solution is therefore based on an existing performance model for Hadoop 1.x, but takes the architectural changes into consideration and captures the execution flow of a MapReduce job using a queueing network model. This way, the cost model reflects the intra-job synchronization constraints that occur due to contention at shared resources. The accuracy of our solution is validated by comparing our model estimates against measurements in a real Hadoop 2.x setup.
Peer Reviewed
publishDate 2018
dc.date.none.fl_str_mv 2018
2018-11-15
2019
2019-01-01
dc.type.none.fl_str_mv journal article
http://purl.org/coar/resource_type/c_6501
AM
http://purl.org/coar/version/c_ab4af688f83e57aa
dc.type.openaire.fl_str_mv info:eu-repo/semantics/article
format article
dc.language.none.fl_str_mv English
eng
language eng
dc.rights.none.fl_str_mv open access
http://purl.org/coar/access_right/c_abf2
Attribution-NonCommercial-NoDerivs 3.0 Spain
http://creativecommons.org/licenses/by-nc-nd/3.0/es/
dc.rights.openaire.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Elsevier
dc.source.none.fl_str_mv reponame:UPCommons. Portal del coneixement obert de la UPC
instname:Universitat Politècnica de Catalunya (UPC)