On the design and development of programming models for exascale systems

High Performance Computing (HPC) systems have been evolving over time to adapt to the scientific community requirements. We are currently approaching to the Exascale era. Exascale systems will incorporate a large number of nodes, each of them containing many computing resources. Besides that, not on...

Full description

Bibliographic Details
Author: Maroñas, Marcos
Format: doctoral thesis
Status:Published version
Publication Date:2021
Country:España
Institution:CBUC, CESCA
Repository:TDR. Tesis Doctorales en Red
OAI Identifier:oai:www.tdx.cat:10803/671783
Online Access:http://hdl.handle.net/10803/671783
https://dx.doi.org/10.5821/dissertation-2117-346655
Access Level:Open access
Keyword:Àrees temàtiques de la UPC::Informàtica
004
Description
Summary:High Performance Computing (HPC) systems have been evolving over time to adapt to the scientific community requirements. We are currently approaching to the Exascale era. Exascale systems will incorporate a large number of nodes, each of them containing many computing resources. Besides that, not only the computing resources, but memory hierarchies are becoming more deep and complex. Overall, Exascale systems will present several challenges in terms of performance, programmability and fault tolerance. Regarding programmability, the more complex a system architecture is, the more complex to properly exploit the system. The programmability is closely related to the performance, because the performance a system can deliver is useless if users are not able to write programs that obtain such performance. This stresses the importance of programming models as a tool to easily write programs that can reach the peak performance of the system. Finally, it is well known that more components lead to more errors. The combination of large executions with a low Mean Time To Failure (MTTF) may jeopardize application progress. Thus, all the efforts done to improve performance become pointless if applications hardly finish. To prevent that, we must apply fault tolerance. The main goal of this thesis is to enable non-expert users to exploit complex Exascale systems. To that end, we have enhanced state-of-the-art parallel programming models to cope with three key Exascale challenges: programmability, performance and fault tolerance. The first set of contributions focuses on the efficient management of modern multicore/manycore processors. We propose a new kind of task that combines the key advantages of tasks with the key advantages of worksharing techniques. The use of this new task type alleviates granularity issues, thereby enhancing performance in several scenarios. We also propose the introduction of dependences in the taskloop construct so that programmers can easily apply blocking techniques. Finally, we extend taskloop construct to support the creation of the new kind of tasks instead of regular tasks. The second set of contributions focuses on the efficient management of modern memory hierarchies, focused on NUMA domains. By using the information that users provide in the dependences annotations, we build a system that tracks data location. Later, we use this information to take scheduling decisions that maximize data locality. Our last set of contributions focuses on fault tolerance. We propose a programming model that provides application-level checkpoint/restart in an easy and portable way. Our programming model offers a set of compiler directives to abstract users from system-level nuances. Then, it leverages state-of-the-art libraries to deliver high performance and includes several redundancy schemes.