Towards resilient EU HPC systems: A blueprint


Bibliographic Details
Authors: Radojković, Petar; Marazakis, Manolis; Carpenter, Paul Matthew; Jeyapaul, Reiley; Gizopoulos, Dimitris; Schulz, Martin; Armejach Sanosa, Adrià (ORCID 0000-0003-2869-668X); Ayguadé Parra, Eduard (ORCID 0000-0002-5146-103X); Canal Corretger, Ramon (ORCID 0000-0003-4542-204X); Moretó Planas, Miquel (ORCID 0000-0002-9848-8758); Salami, Behzad; Unsal, Osman Sabri
Resource type: technical report
Publication date: 2020
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/330695
Online access: https://hdl.handle.net/2117/330695
Access level: open access
Keywords: High performance computing -- Europe
Càlcul intensiu (Informàtica) -- Europa
Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors
Description
Summary: This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in allocating available resources and in guiding researchers and research funding towards the resilience approaches with the highest priority and utility. Although our work is focused on the needs of next-generation HPC systems in Europe, the principles and evaluations are applicable globally.