Cache-aware optimization of matrix multiplication and matrix factorizations on multicore processors

Martínez-Pérez, Héctor; Catalán, Sandra; Igual, Francisco D.; Herrero, José R.; Rodríguez-Sánchez, Rafael; Quintana-Ortí, Enrique S. (ORCID: 0000-0002-5454-165X)

Bibliographic Details
Authors:	Martínez-Pérez, Héctor; Catalán, Sandra; Igual, Francisco D.; Herrero, José R.; Rodríguez-Sánchez, Rafael; Quintana-Ortí, Enrique S. (ORCID: 0000-0002-5454-165X)
Resource type:	article
Publication date:	2025
Country:	Spain
Institution:	Universitat Politècnica de València (UPV)
Repository:	RiuNet. Repositorio Institucional de la Universitat Politècnica de València
Language:	English
OAI Identifier:	oai:riunet.upv.es:10251/230322
Online access:	https://riunet.upv.es/handle/10251/230322
Access Level:	open access
Keywords:	Dense linear algebra; Computer architecture; Multicore processors; Cache memory; Matrix factorization

Description
Summary:	This paper advocates a careful customization of the special general matrix multiplication (GEMM) kernels that are invoked from blocked routines for several relevant matrix factorizations in LAPACK, in order to improve their performance on modern multicore processors with hierarchical cache memories. To achieve this, we leverage a refined analytical model to dynamically tune the cache configuration parameters of GEMM for these kernels, taking into account the matrix operands' dimensions, in order to improve cache occupancy. In addition, toward the same goal, we accommodate a flexible development of architecture-specific micro-kernels for GEMM that allows us to select the option that, depending on the operands' dimensions, improves cache utilization. Our experiments for the LU and QR factorizations on two platforms, equipped with ARM (NVIDIA Carmel) and x86 (AMD EPYC) multicore processors, demonstrate the benefits of this approach in terms of better cache utilization and, in general, higher performance. Moreover, they also reveal the delicate balance between optimizing for multi-threaded parallelism versus cache usage, as well as the positive effects of software prefetching.
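
To make the idea of tuning the GEMM cache configuration parameters to the operands' dimensions more concrete, the following is a minimal sketch in Python. It is not the analytical model from the paper: it assumes a deliberately simplified capacity rule (each packed buffer may use a fixed fraction of one cache level), and every name in it (`choose_blocking`, `Blocking`, `fill`, the `mr` x `nr` micro-kernel shape, and the default cache sizes) is hypothetical and chosen only for illustration. It shows that when the inner dimension `k` is small, as in the trailing update of a blocked factorization, clamping `kc` to `k` leaves room to enlarge `mc` and `nc`.

```python
from dataclasses import dataclass


@dataclass
class Blocking:
    mc: int  # rows of the packed block of A (targets L2 in this sketch)
    nc: int  # columns of the packed panel of B (targets L3 in this sketch)
    kc: int  # shared inner dimension of both packed buffers


def round_up(x: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= x."""
    return ((x + multiple - 1) // multiple) * multiple


def choose_blocking(m: int, n: int, k: int,
                    mr: int = 8, nr: int = 12,
                    l1_bytes: int = 64 * 1024,
                    l2_bytes: int = 2 * 1024 * 1024,
                    l3_bytes: int = 4 * 1024 * 1024,
                    elem_bytes: int = 8,
                    fill: float = 0.5) -> Blocking:
    """Pick (mc, nc, kc) for an m x n x k GEMM under a simplified model.

    Assumption: each packed buffer may occupy at most `fill` of one cache
    level. The default sizes are illustrative, not measured values.
    """
    # kc: an mr x kc micro-panel of A and a kc x nr micro-panel of B share L1.
    kc = int(fill * l1_bytes) // ((mr + nr) * elem_bytes)
    kc = max(1, min(kc, k))  # never exceed the actual inner dimension

    # mc: the packed mc x kc block of A should fit in the L2 budget.
    mc = int(fill * l2_bytes) // (kc * elem_bytes)
    mc = max(mr, (mc // mr) * mr)       # keep mc a multiple of mr
    mc = min(mc, round_up(m, mr))       # no larger than the operand needs

    # nc: the packed kc x nc panel of B should fit in the L3 budget.
    nc = int(fill * l3_bytes) // (kc * elem_bytes)
    nc = max(nr, (nc // nr) * nr)       # keep nc a multiple of nr
    nc = min(nc, round_up(n, nr))

    return Blocking(mc=mc, nc=nc, kc=kc)


if __name__ == "__main__":
    # Large square GEMM versus a trailing update with a small inner dimension.
    print(choose_blocking(4000, 4000, 4000))
    print(choose_blocking(4000, 4000, 128))
```

With a fixed configuration sized for large square operands, the packed block of A in the second call would occupy only part of its cache budget; recomputing `mc` and `nc` from the clamped `kc` is the kind of dimension-aware adjustment the summary describes, here under much cruder assumptions.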
