Towards resilient EU HPC systems: A blueprint
This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC sy...
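| Authors: | Radojković, Petar; Marazakis, Manolis; Carpenter, Paul Matthew; Jeyapaul, Reiley; Gizopoulos, Dimitris; Schulz, Martin; Armejach Sanosa, Adrià; Ayguadé Parra, Eduard; Canal Corretger, Ramon; Moretó Planas, Miquel; Salami, Behzad; Unsal, Osman Sabri |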
|---|---|
| Resource type: | technical report |
| Publication date: | 2020 |
| Country: | Spain |
| Institution: | Universitat Politècnica de Catalunya (UPC) |
| Repository: | UPCommons. Portal del coneixement obert de la UPC |
| Language: | English |
| OAI Identifier: | oai:upcommons.upc.edu:2117/330695 |
| Acceso en línea: | https://hdl.handle.net/2117/330695 |
| Access Level: | open access |
| Keywords: | High performance computing -- Europe; Càlcul intensiu (Informàtica) -- Europa; Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors |
| id |
ES_96eb0062646948c76bbf9e3011dbcefa |
|---|---|
| oai_identifier_str |
oai:upcommons.upc.edu:2117/330695 |
| network_acronym_str |
ES |
| network_name_str |
España |
| repository_id_str |
|
| spelling |
Towards resilient EU HPC systems: A blueprint; Radojković, Petar; Marazakis, Manolis; Carpenter, Paul Matthew; Jeyapaul, Reiley; Gizopoulos, Dimitris; Schulz, Martin; Armejach Sanosa, Adrià|||0000-0003-2869-668X; Ayguadé Parra, Eduard|||0000-0002-5146-103X; Canal Corretger, Ramon|||0000-0003-4542-204X; Moretó Planas, Miquel|||0000-0002-9848-8758; Salami, Behzad; Unsal, Osman Sabri; High performance computing -- Europe; Càlcul intensiu (Informàtica) -- Europa; Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors
This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in the allocation of available resources, as well as guiding researchers and research funding towards the enhancement of resilience approaches with the highest priority and utility. Although our work is focused on the needs of next generation HPC systems in Europe, the principles and evaluations are applicable globally.
This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the projects ECOSCALE (grant agreement No 671632), EPI (grant agreement No 826647), EuroEXA (grant agreement No 754337), Eurolab4HPC (grant agreement No 800962), EVOLVE (grant agreement No 825061), EXA2PRO (grant agreement No 801015), ExaNest (grant agreement No 671553), ExaNoDe (grant agreement No 671578), EXDCI-2 (grant agreement No 800957), LEGaTO (grant agreement No 780681), MB2020 (grant agreement No 779877), RECIPE (grant agreement No 801137) and SDK4ED (grant agreement No 780572).
The work was also supported by the European Commission’s Seventh Framework Programme under the projects CLERECO (grant agreement No 611404), the NCSA-Inria-ANL-BSC-JSC-Riken-UTK Joint-Laboratory for Extreme Scale Computing – JLESC (https://jlesc.github.io/), OMPI-X project (No ECP-2.3.1.17) and the Spanish Government through Severo Ochoa programme (SEV-2015-0493). This work was sponsored in part by the U.S. Department of Energy's Office of Advanced Scientific Computing Research, program managers Robinson Pino and Lucy Nowell. This manuscript has been authored by UT-Battelle, LLC under Contract No DE-AC05-00OR22725 with the U.S. Department of Energy.
2020; 2020-04-01; 2020; 2020-10-23; report; http://purl.org/coar/resource_type/c_93fc; AO; http://purl.org/coar/version/c_b1a7d7d4d402bcce; info:eu-repo/semantics/report; application/pdf; https://hdl.handle.net/2117/330695; reponame:UPCommons. Portal del coneixement obert de la UPC; instname:Universitat Politècnica de Catalunya (UPC); Inglés; eng
European Commission http://dx.doi.org/10.13039/100011102 Seventh Framework Programme 611404 Cross-Layer Early Reliability Evaluation for the Computing cOntinuum; European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 801137 REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems; European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 800962 Consolidation of European Research Excellence in Exascale HPC Systems; European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 826647 SGA1 (Specific Grant Agreement 1) OF THE EUROPEAN PROCESSOR INITIATIVE (EPI); European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 671578 European Exascale Processor Memory Node Design; European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 780681 Low Energy Toolset for Heterogeneous Computing; European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 779877 Mont-Blanc 2020, European scalable, modular and power efficient HPC processor
open access; http://purl.org/coar/access_right/c_abf2; info:eu-repo/semantics/openAccess; oai:upcommons.upc.edu:2117/330695; 2026-05-27T15:37:01Z |
| dc.title.none.fl_str_mv |
Towards resilient EU HPC systems: A blueprint |
| title |
Towards resilient EU HPC systems: A blueprint |
| spellingShingle |
Towards resilient EU HPC systems: A blueprint; Radojković, Petar; High performance computing -- Europe; Càlcul intensiu (Informàtica) -- Europa; Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors |
| title_short |
Towards resilient EU HPC systems: A blueprint |
| title_full |
Towards resilient EU HPC systems: A blueprint |
| title_fullStr |
Towards resilient EU HPC systems: A blueprint |
| title_full_unstemmed |
Towards resilient EU HPC systems: A blueprint |
| title_sort |
Towards resilient EU HPC systems: A blueprint |
| dc.creator.none.fl_str_mv |
Radojković, Petar; Marazakis, Manolis; Carpenter, Paul Matthew; Jeyapaul, Reiley; Gizopoulos, Dimitris; Schulz, Martin; Armejach Sanosa, Adrià|||0000-0003-2869-668X; Ayguadé Parra, Eduard|||0000-0002-5146-103X; Canal Corretger, Ramon|||0000-0003-4542-204X; Moretó Planas, Miquel|||0000-0002-9848-8758; Salami, Behzad; Unsal, Osman Sabri |
| author |
Radojković, Petar |
| author_facet |
Radojković, Petar; Marazakis, Manolis; Carpenter, Paul Matthew; Jeyapaul, Reiley; Gizopoulos, Dimitris; Schulz, Martin; Armejach Sanosa, Adrià|||0000-0003-2869-668X; Ayguadé Parra, Eduard|||0000-0002-5146-103X; Canal Corretger, Ramon|||0000-0003-4542-204X; Moretó Planas, Miquel|||0000-0002-9848-8758; Salami, Behzad; Unsal, Osman Sabri |
| author_role |
author |
| author2 |
Marazakis, Manolis; Carpenter, Paul Matthew; Jeyapaul, Reiley; Gizopoulos, Dimitris; Schulz, Martin; Armejach Sanosa, Adrià|||0000-0003-2869-668X; Ayguadé Parra, Eduard|||0000-0002-5146-103X; Canal Corretger, Ramon|||0000-0003-4542-204X; Moretó Planas, Miquel|||0000-0002-9848-8758; Salami, Behzad; Unsal, Osman Sabri |
| author2_role |
author author author author author author author author author author author |
| dc.subject.none.fl_str_mv |
High performance computing -- Europe; Càlcul intensiu (Informàtica) -- Europa; Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors |
| topic |
High performance computing -- Europe; Càlcul intensiu (Informàtica) -- Europa; Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors |
| description |
This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in the allocation of available resources, as well as guiding researchers and research funding towards the enhancement of resilience approaches with the highest priority and utility. Although our work is focused on the needs of next generation HPC systems in Europe, the principles and evaluations are applicable globally. |
| publishDate |
2020 |
| dc.date.none.fl_str_mv |
2020 2020-04-01 2020 2020-10-23 |
| dc.type.none.fl_str_mv |
report http://purl.org/coar/resource_type/c_93fc AO http://purl.org/coar/version/c_b1a7d7d4d402bcce |
| dc.type.openaire.fl_str_mv |
info:eu-repo/semantics/report |
| format |
report |
| dc.identifier.none.fl_str_mv |
https://hdl.handle.net/2117/330695 |
| url |
https://hdl.handle.net/2117/330695 |
| dc.language.none.fl_str_mv |
Inglés eng |
| language_invalid_str_mv |
Inglés |
| language |
eng |
| dc.relation.none.fl_str_mv |
European Commission http://dx.doi.org/10.13039/100011102 Seventh Framework Programme 611404 Cross-Layer Early Reliability Evaluation for the Computing cOntinuum; European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 801137 REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems; European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 800962 Consolidation of European Research Excellence in Exascale HPC Systems; European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 826647 SGA1 (Specific Grant Agreement 1) OF THE EUROPEAN PROCESSOR INITIATIVE (EPI); European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 671578 European Exascale Processor Memory Node Design; European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 780681 Low Energy Toolset for Heterogeneous Computing; European Commission http://doi.org/10.13039/100010661 Horizon 2020 Framework Programme 779877 Mont-Blanc 2020, European scalable, modular and power efficient HPC processor |
| dc.rights.none.fl_str_mv |
open access http://purl.org/coar/access_right/c_abf2 |
| dc.rights.openaire.fl_str_mv |
info:eu-repo/semantics/openAccess |
| rights_invalid_str_mv |
open access http://purl.org/coar/access_right/c_abf2 |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf |
| dc.source.none.fl_str_mv |
reponame:UPCommons. Portal del coneixement obert de la UPC instname:Universitat Politècnica de Catalunya (UPC) |
| instname_str |
Universitat Politècnica de Catalunya (UPC) |
| reponame_str |
UPCommons. Portal del coneixement obert de la UPC |
| collection |
UPCommons. Portal del coneixement obert de la UPC |
| repository.name.fl_str_mv |
|
| repository.mail.fl_str_mv |
|
| _version_ |
1869414005431336960 |