Quantisation of LLMs on modern CPU architectures


Bibliographic Details
Author: Centeno Fidalgo, Hugo
Format: master thesis
Publication Date: 2026
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:dnet:upcommonspor::11b0ad12620762382a5dcaeaac130d02
Online Access: https://hdl.handle.net/2117/460726
Access Level: Open access
Keyword: Artificial intelligence
Graphics processing units
High performance computing
Computer architecture
Quantization
Large-scale language models
CPU training
Mixed precision
bfloat16 (BF16)
FP32
AVX-512
AMX
ISA extensions
Compiler vectorization
Matrix multiplication (GEMM)
BLAS
Intel oneMKL
AMD AOCL
OpenMP
Parallelization
Performance optimization
Profiling
Paraver
High-performance computing (HPC)
MareNostrum 5
UPC subject areas::Computer science::Artificial intelligence
UPC subject areas::Computer science::Computer architecture
Description
Summary: The rapid growth of large language models (LLMs) has intensified the demand for efficient training and inference solutions. While GPUs dominate this domain, their cost, power consumption, and limited availability motivate the exploration of alternative platforms. This work investigates the feasibility of running and training LLMs efficiently on modern CPUs by leveraging recent instruction set extensions with bfloat16 support. Building on Karpathy's GPT-2 implementation llm.c, this work focuses on optimization, parallelization, and integration of the BF16 data format. BF16 reduces memory footprint and bandwidth requirements while preserving numerical stability in neural network training. The implementation explores both compiler-driven vectorization and optimized BLAS backends that exploit AVX-512 and AMX instructions on Intel and AMD architectures. Performance evaluation is conducted using Paraver on two state-of-the-art HPC platforms, analyzing training throughput, scalability, and kernel-level behavior. Results show that naive BF16 implementations offer limited benefits due to conversion overheads, whereas BLAS-based BF16 matrix multiplication significantly improves performance, achieving speedups over FP32. The study highlights that modern CPUs, when paired with appropriate software stacks, can deliver competitive performance for LLM workloads, especially in memory-bound scenarios. Overall, this work demonstrates that mixed-precision execution and architecture-aware optimizations enable CPUs to become a viable platform for LLM training and inference, opening new opportunities for deployment in HPC, edge, and resource-constrained environments.
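
The conversion overhead mentioned in the summary can be illustrated with a minimal C sketch. It is not taken from the thesis or from llm.c: the names bf16_t, fp32_to_bf16, bf16_to_fp32, and dot_bf16 are hypothetical, and the code assumes IEEE 754 binary32 floats. BF16 keeps the upper 16 bits of an FP32 value (sign, 8 exponent bits, 7 mantissa bits), so storage is halved, but a naive kernel has to widen every operand back to FP32 before each multiply.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical storage type: a BF16 value held as raw 16 bits. */
typedef uint16_t bf16_t;

/* FP32 -> BF16 using round-to-nearest-even on the discarded 16 bits. */
static bf16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);              /* type-pun without UB */
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        /* NaN: truncate and force a mantissa bit so it stays a NaN. */
        return (bf16_t)((bits >> 16) | 0x0040u);
    }
    /* Adding 0x7FFF plus the lowest kept bit rounds ties to even. */
    uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return (bf16_t)((bits + rounding_bias) >> 16);
}

/* BF16 -> FP32 is exact: place the 16 bits in the upper half. */
static float bf16_to_fp32(bf16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Naive dot product: BF16 storage, FP32 accumulation.
 * Each element is widened before use, which is the per-element
 * conversion cost that a software-only BF16 path pays. */
static float dot_bf16(const bf16_t *a, const bf16_t *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += bf16_to_fp32(a[i]) * bf16_to_fp32(b[i]);
    }
    return acc;
}
```

In this sketch the memory traffic per element is halved relative to FP32, while the arithmetic still runs in FP32 after an extra shift per operand. A BLAS backend with native BF16 instructions would avoid that widening step in software, which is consistent with the contrast the summary draws, although the thesis itself should be consulted for the kernels actually used.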