Quantisation of LLMs on modern CPU architectures

Centeno Fidalgo, Hugo

Bibliographic Details
Author:	Centeno Fidalgo, Hugo
Format:	master thesis
Publication Date:	2026
Country:	Spain
Institution:	Universitat Politècnica de Catalunya (UPC)
Repository:	UPCommons. Portal del coneixement obert de la UPC
Language:	English
OAI Identifier:	oai:dnet:upcommonspor::11b0ad12620762382a5dcaeaac130d02
Online Access:	https://hdl.handle.net/2117/460726
Access Level:	Open access
Keyword:	Artificial intelligence; Graphics processing units; High performance computing (HPC); Computer architecture; Quantization; Large-scale language models; CPU training; Mixed precision; bfloat16 (BF16); FP32; AVX-512; AMX; ISA extensions; Compiler vectorization; Matrix multiplication (GEMM); BLAS; Intel oneMKL; AMD AOCL; OpenMP; Parallelization; Performance optimization; Profiling; Paraver; MareNostrum 5; UPC subject areas::Computer science::Artificial intelligence; UPC subject areas::Computer science::Computer architecture

Description
Summary:	The rapid growth of large language models (LLMs) has intensified the demand for efficient training and inference solutions. While GPUs dominate this domain, their cost, power consumption, and limited availability motivate the exploration of alternative platforms. This work investigates the feasibility of running and training LLMs efficiently on modern CPUs by leveraging recent instruction set extensions with bfloat16 support. Building on Karpathy's GPT-2 implementation llm.c, this work focuses on optimization, parallelization and integration of the BF16 data format. BF16 reduces memory footprint and bandwidth requirements while preserving numerical stability in neural network training. The implementation explores both compiler-driven vectorization and optimized BLAS backends that exploit AVX-512 and AMX instructions on Intel and AMD architectures. Performance evaluation is conducted using Paraver on two state-of-the-art HPC platforms, analyzing training throughput, scalability, and kernel-level behavior. Results show that naive BF16 implementations offer limited benefits due to conversion overheads, whereas BLAS-based BF16 matrix multiplication significantly improves performance, achieving speedups over FP32. The study highlights that modern CPUs, when paired with appropriate software stacks, can deliver competitive performance for LLM workloads, especially in memory-bound scenarios. Overall, this work demonstrates that mixed-precision execution and architecture-aware optimizations enable CPUs to become a viable platform for LLM training and inference, opening new opportunities for deployment in HPC, edge, and resource-constrained environments.
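To illustrate the summary's point about BF16 storage and conversion overhead, the following is a minimal C sketch and not code from the thesis or from llm.c. The names `bf16_t`, `fp32_to_bf16`, `bf16_to_fp32`, and `dot_bf16` are hypothetical. It assumes BF16 is stored as the upper 16 bits of an IEEE-754 binary32 value and uses round-to-nearest-even when narrowing.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical storage type: BF16 as the upper 16 bits of an FP32 value. */
typedef uint16_t bf16_t;

/* Narrow FP32 to BF16 using round-to-nearest-even on the discarded bits. */
static bf16_t fp32_to_bf16(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);           /* bit copy, no aliasing issues */
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        /* NaN: force a quiet-NaN mantissa bit so truncation cannot yield Inf. */
        return (bf16_t)((bits >> 16) | 0x0040u);
    }
    uint32_t lsb = (bits >> 16) & 1u;         /* lowest bit that is kept */
    bits += 0x7fffu + lsb;                    /* ties round to the even value */
    return (bf16_t)(bits >> 16);
}

/* Widen BF16 to FP32: exact, the low 16 mantissa bits become zero. */
static float bf16_to_fp32(bf16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Naive dot product: BF16 inputs, FP32 accumulation.
   Each element is widened before use, which is the kind of per-element
   conversion work a scalar BF16 path adds on top of the arithmetic. */
static float dot_bf16(const bf16_t *a, const bf16_t *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += bf16_to_fp32(a[i]) * bf16_to_fp32(b[i]);
    }
    return acc;
}
```

In this sketch each stored value takes 2 bytes instead of 4, which halves memory traffic, while the 8-bit exponent of FP32 is kept unchanged. A tuned BLAS backend would typically perform the widening and multiply-accumulate in hardware BF16 instructions rather than element by element as done here.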

Similar Items