Quantisation of LLMs on modern CPU architectures
| Author: | |
|---|---|
| Format: | master thesis |
| Publication Date: | 2026 |
| Country: | Spain |
| Institution: | Universitat Politècnica de Catalunya (UPC) |
| Repository: | UPCommons. Portal del coneixement obert de la UPC |
| Language: | English |
| OAI Identifier: | oai:dnet:upcommonspor::11b0ad12620762382a5dcaeaac130d02 |
| Online Access: | https://hdl.handle.net/2117/460726 |
| Access Level: | Open access |
| Keyword: | Artificial intelligence; Graphics processing units; High performance computing (HPC); Computer architecture; Quantization; Large-scale language models; CPU training; Mixed precision; bfloat16 (BF16); FP32; AVX-512; AMX; ISA extensions; Compiler vectorization; Matrix multiplication (GEMM); BLAS; Intel oneMKL; AMD AOCL; OpenMP; Parallelization; Performance optimization; Profiling; Paraver; MareNostrum 5; UPC subject areas::Computer science::Artificial intelligence; UPC subject areas::Computer science::Computer architecture |
| Summary: | The rapid growth of large language models (LLMs) has intensified the demand for efficient training and inference solutions. While GPUs dominate this domain, their cost, power consumption, and limited availability motivate the exploration of alternative platforms. This work investigates the feasibility of running and training LLMs efficiently on modern CPUs by leveraging recent instruction set extensions with bfloat16 support. Building on Karpathy's GPT-2 implementation llm.c, this work focuses on optimization, parallelization and integration of the BF16 data format. BF16 reduces memory footprint and bandwidth requirements while preserving numerical stability in neural network training. The implementation explores both compiler-driven vectorization and optimized BLAS backends that exploit AVX-512 and AMX instructions on Intel and AMD architectures. Performance evaluation is conducted using Paraver on two state-of-the-art HPC platforms, analyzing training throughput, scalability, and kernel-level behavior. Results show that naive BF16 implementations offer limited benefits due to conversion overheads, whereas BLAS-based BF16 matrix multiplication significantly improves performance, achieving speedups over FP32. The study highlights that modern CPUs, when paired with appropriate software stacks, can deliver competitive performance for LLM workloads, especially in memory-bound scenarios. Overall, this work demonstrates that mixed-precision execution and architecture-aware optimizations enable CPUs to become a viable platform for LLM training and inference, opening new opportunities for deployment in HPC, edge, and resource-constrained environments. |