Towards resilient EU HPC systems: A blueprint

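This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in the allocation of available resources, as well as guiding researchers and research funding towards the enhancement of resilience approaches with the highest priority and utility. Although our work is focused on the needs of next-generation HPC systems in Europe, the principles and evaluations are applicable globally.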

Bibliographic Details
Authors: Radojković, Petar; Marazakis, Manolis; Carpenter, Paul Matthew; Jeyapaul, Reiley; Gizopoulos, Dimitris; Schulz, Martin; Armejach Sanosa, Adrià (ORCID 0000-0003-2869-668X); Ayguadé Parra, Eduard (ORCID 0000-0002-5146-103X); Canal Corretger, Ramon (ORCID 0000-0003-4542-204X); Moretó Planas, Miquel (ORCID 0000-0002-9848-8758); Salami, Behzad; Unsal, Osman Sabri
Resource type: technical report
Publication date: 2020
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/330695
Online access: https://hdl.handle.net/2117/330695
Access level: open access
Keywords: High performance computing -- Europe
Càlcul intensiu (Informàtica) -- Europa
Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors
id ES_96eb0062646948c76bbf9e3011dbcefa
oai_identifier_str oai:upcommons.upc.edu:2117/330695
network_acronym_str ES
network_name_str España
repository_id_str
spelling This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the projects ECOSCALE (grant agreement No 671632), EPI (grant agreement No 826647), EuroEXA (grant agreement No 754337), Eurolab4HPC (grant agreement No 800962), EVOLVE (grant agreement No 825061), EXA2PRO (grant agreement No 801015), ExaNest (grant agreement No 671553), ExaNoDe (grant agreement No 671578), EXDCI-2 (grant agreement No 800957), LEGaTO (grant agreement No 780681), MB2020 (grant agreement No 779877), RECIPE (grant agreement No 801137) and SDK4ED (grant agreement No 780572). The work was also supported by the European Commission’s Seventh Framework Programme under the project CLERECO (grant agreement No 611404), the NCSA-Inria-ANL-BSC-JSC-Riken-UTK Joint-Laboratory for Extreme Scale Computing (JLESC, https://jlesc.github.io/), the OMPI-X project (No ECP-2.3.1.17) and the Spanish Government through the Severo Ochoa programme (SEV-2015-0493). This work was sponsored in part by the U.S. Department of Energy's Office of Advanced Scientific Computing Research, program managers Robinson Pino and Lucy Nowell. This manuscript has been authored by UT-Battelle, LLC under Contract No DE-AC05-00OR22725 with the U.S. Department of Energy.
dc.title.none.fl_str_mv Towards resilient EU HPC systems: A blueprint
dc.creator.none.fl_str_mv Radojković, Petar
Marazakis, Manolis
Carpenter, Paul Matthew
Jeyapaul, Reiley
Gizopoulos, Dimitris
Schulz, Martin
Armejach Sanosa, Adrià|||0000-0003-2869-668X
Ayguadé Parra, Eduard|||0000-0002-5146-103X
Canal Corretger, Ramon|||0000-0003-4542-204X
Moretó Planas, Miquel|||0000-0002-9848-8758
Salami, Behzad
Unsal, Osman Sabri
dc.subject.none.fl_str_mv High performance computing -- Europe
Càlcul intensiu (Informàtica) -- Europa
Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors
description This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in the allocation of available resources, as well as guiding researchers and research funding towards the enhancement of resilience approaches with the highest priority and utility. Although our work is focused on the needs of next generation HPC systems in Europe, the principles and evaluations are applicable globally.
publishDate 2020
dc.date.none.fl_str_mv 2020
2020-04-01
2020
2020-10-23
dc.type.none.fl_str_mv report
http://purl.org/coar/resource_type/c_93fc
AO
http://purl.org/coar/version/c_b1a7d7d4d402bcce
dc.type.openaire.fl_str_mv info:eu-repo/semantics/report
format report
dc.identifier.none.fl_str_mv https://hdl.handle.net/2117/330695
url https://hdl.handle.net/2117/330695
dc.language.none.fl_str_mv Inglés
eng
language_invalid_str_mv Inglés
language eng
dc.relation.none.fl_str_mv European Commission http://dx.doi.org/10.13039/100011102 Seventh Framework Programme 611404 Cross-Layer Early Reliability Evaluation for the Computing cOntinuum
European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 801137 REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems
European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 800962 Consolidation of European Research Excellence in Exascale HPC Systems
European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 826647 SGA1 (Specific Grant Agreement 1) OF THE EUROPEAN PROCESSOR INITIATIVE (EPI)
European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 671578 European Exascale Processor Memory Node Design
European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 780681 Low Energy Toolset for Heterogeneous Computing
European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 779877 Mont-Blanc 2020, European scalable, modular and power efficient HPC processor
dc.rights.none.fl_str_mv open access
http://purl.org/coar/access_right/c_abf2
dc.rights.openaire.fl_str_mv info:eu-repo/semantics/openAccess
rights_invalid_str_mv open access
http://purl.org/coar/access_right/c_abf2
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:UPCommons. Portal del coneixement obert de la UPC
instname:Universitat Politècnica de Catalunya (UPC)