Floating-point arithmetic paradigms for high-performance computing: software algorithms and hardware designs


Bibliographic Details
Author: Ledoux Pardo, Luis Eduardo
Resource type: doctoral thesis
Publication date: 2024
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/454987
Online access: https://hdl.handle.net/2117/454987
https://dx.doi.org/10.5821/dissertation-2117-454987
Access level: open access
Keywords: 004 - Informàtica
Àrees temàtiques de la UPC::Informàtica
Description
Summary: This dissertation examines how numbers are represented and computed in computer architectures, focusing on the limitations of the IEEE 754 standard. Workloads in AI, HPC, and scientific simulation make efficient and precise numerical representations crucial. The core problem is that IEEE 754 floating-point arithmetic is inefficient for these modern workloads, which results in higher energy consumption, inadequate precision, and suboptimal performance, especially in energy-constrained environments and high-precision applications. This work investigates these challenges, proposes solutions, and evaluates their impact on computational efficiency and accuracy.
To address these challenges, the thesis explores arithmetic computation at several levels, from algorithmic concepts to metal and silicon structures. It introduces mechanisms that make numerical representations more adaptable, so that precision can be adjusted to the computational task, which results in more efficient circuits. With a focus on arithmetic performance, it addresses energy consumption and highlights the importance of efficient arithmetic logic units. It also shows how these solutions can be integrated into various software frameworks, and it reveals a correlation between numerical requirements and internal precision, an underexploited aspect of general-purpose floating-point formats.
First, it develops a framework for generating Posit operators in hardware, which improves accuracy and performance in tasks such as image classification. The Posit Operator Framework, described in SystemVerilog, enables the construction of Multi-Layer Perceptrons for inference engines and is applicable in POWER9/CAPI2 environments with FPGA acceleration.
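For readers unfamiliar with the Posit format mentioned above, the following minimal sketch decodes a posit bit pattern into an exact rational value. It is an illustration only and is not taken from the thesis, whose operators are described in SystemVerilog. The function name `decode_posit`, the default width of 8 bits, and the default of 2 exponent bits (the value fixed by the 2022 Posit Standard) are assumptions made for this example.

```python
from fractions import Fraction
from typing import Optional


def decode_posit(bits: int, n: int = 8, es: int = 2) -> Optional[Fraction]:
    """Decode an n-bit posit with es exponent bits (hypothetical helper).

    Returns None for NaR (Not a Real), otherwise the exact value.
    """
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        return None  # NaR: sign bit set, all other bits zero

    negative = (bits >> (n - 1)) == 1
    if negative:
        bits = (-bits) & mask  # negative posits are stored in two's complement

    body = bits & ((1 << (n - 1)) - 1)  # the n-1 bits after the sign

    # Regime: a run of identical bits, ended by the opposite bit (or the end).
    first = (body >> (n - 2)) & 1
    run = 0
    i = n - 2
    while i >= 0 and ((body >> i) & 1) == first:
        run += 1
        i -= 1
    k = run - 1 if first == 1 else -run

    i -= 1  # skip the terminating regime bit, if present
    remaining = max(i + 1, 0)  # bits left for exponent and fraction
    tail = body & ((1 << remaining) - 1)

    # Exponent: up to es bits; bits cut off by the regime count as zero.
    exp_bits = min(es, remaining)
    exponent = (tail >> (remaining - exp_bits)) << (es - exp_bits) if exp_bits else 0

    # Fraction: whatever is left, with an implicit leading one.
    frac_bits = remaining - exp_bits
    frac = tail & ((1 << frac_bits) - 1)
    significand = Fraction((1 << frac_bits) + frac, 1 << frac_bits)

    value = significand * Fraction(2) ** (k * (1 << es) + exponent)
    return -value if negative else value
```

With the assumed 8-bit, es = 2 configuration, `decode_posit(0x40)` is 1 and `decode_posit(0x44)` is 3/2. The regime field is variable-length, so values near 1 keep more fraction bits than values near the extremes of the range, and this tapered precision is the property that distinguishes posits from fixed-field IEEE 754 formats.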
Second, it presents a generator for Systolic Arrays optimized for Matrix-Matrix Multiplication (MMM), showing the impact of custom hardware configurations on accuracy and energy efficiency. The MMM units are fully parametrizable and adapted to the numerical specifications of the workload, facilitated by a core generator with automated pipelining. These units can be evaluated with CAPI2 on FPGA and POWER9 systems, where they achieve up to two tera floating-point operations per second, and they have also been used successfully in ASIC generation.
Additionally, it establishes an open-source framework to integrate MMM units into high-level software, offering energy savings and enhanced precision for applications such as AI and scientific computing. The methodology maps General Matrix-Matrix Multiplication (GEMM) calls in BLAS libraries to the accelerators via the OpenCAPI coherent link, and saturates the 22 GB/s bandwidth by tuning computer formats to accommodate more Processing Elements while preserving accuracy.
Finally, the resurgence of vector processing prompts a reevaluation of division algorithms, revealing opportunities to use smaller and slower computing units so that more units fit within varied energy and power budgets. This approach involves a broad Design Space Exploration. To support it, the thesis develops an open-source EDA ASIC flow that generates multiple chip designs in parallel, enabling systematic exploration of power, performance, and area across various process design kits to identify optimal configurations.
Together, these contributions form an interdisciplinary thesis that advances solutions to computing challenges from an arithmetic perspective, overcoming the "arithmetic wall."
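The systolic-array Matrix-Matrix Multiplication units described in the summary can be illustrated with a small cycle-level software model. The sketch below assumes an output-stationary dataflow with skewed operand arrival, in which the Processing Element at position (i, j) receives operand pair number t - i - j on cycle t. This dataflow, the function name `systolic_matmul`, and the use of plain Python numbers instead of a custom number format are all assumptions for illustration. The summary does not state which dataflow the generator actually produces.

```python
def systolic_matmul(a, b):
    """Cycle-level model of an output-stationary systolic array (hypothetical).

    a is an M x K matrix and b is a K x N matrix, both as lists of rows.
    Returns the product and the number of cycles the skewed schedule needs.
    """
    m, k, n = len(a), len(b), len(b[0])
    # One accumulator per Processing Element; PE (i, j) owns output c[i][j].
    acc = [[0] * n for _ in range(m)]

    # The last PE, at (m-1, n-1), consumes its final operand pair on
    # cycle (k-1) + (m-1) + (n-1), so the schedule spans m + n + k - 2 cycles.
    cycles = m + n + k - 2
    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                s = t - i - j  # index of the operand pair reaching PE (i, j)
                if 0 <= s < k:
                    acc[i][j] += a[i][s] * b[s][j]
    return acc, cycles


if __name__ == "__main__":
    product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, cycles)
```

In this model each accumulator plays the role of the internal precision of a Processing Element. Replacing the `+=` with an operation that rounds to a narrower format is one way to experiment with the relation between operand format and accumulator width that the summary mentions.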