A case for malleable thread-level linear algebra libraries: The LU factorization with partial pivoting


Bibliographic Details
Authors: Catalán Pallarés, Sandra; Herrero Zaragoza, José Ramón (ORCID: 0000-0002-4060-367X); Quintana Ortí, Enrique Salvador; Rodríguez Sánchez, Rafael; Van De Geijn, Robert
Resource type: article
Publication date: 2019
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier:oai:upcommons.upc.edu:2117/129939
Online access: https://hdl.handle.net/2117/129939
https://dx.doi.org/10.1109/ACCESS.2019.2895541
Access level: open access
Keywords: Linear systems
Algebras, Linear
Solution of linear systems
Multi-threading
Workload balancing
Thread malleability
Basic linear algebra subprograms
BLAS
Linear algebra package
LAPACK
Sistemes lineals
Àlgebra lineal
UPC subject areas::Computer science::Theoretical computer science
Description
Summary: We propose two novel techniques for overcoming load imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. The second technique allows a fast task to alert the slower task of completion, enforcing the early termination (ET) of the slower task and a smooth transition of the factorization procedure into the next iteration. The two mechanisms are instantiated via a new malleable thread-level implementation of the basic linear algebra subprograms (BLAS), and their benefits are illustrated via an implementation of the LU factorization with partial pivoting enhanced with look-ahead. Concretely, our experimental results on an Intel Xeon system with 12 cores show the benefits of combining WS+ET, reporting competitive performance in comparison with a task-parallel runtime-based solution.
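The early termination (ET) idea in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `lu_panel_with_early_exit`, the `should_stop` callback, and the choice of an unblocked right-looking variant that polls the flag once per column are all assumptions made for illustration. The sketch factorizes a panel with partial pivoting and stops cleanly when another task signals completion, reporting how many columns were finished.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]


def lu_panel_with_early_exit(
    a: Matrix, should_stop: Callable[[], bool]
) -> Tuple[List[int], int]:
    """In-place LU with partial pivoting on an m x n panel (illustrative only).

    Returns (pivots, done): pivots[k] is the row exchanged with row k, and
    done is the number of columns factorized before stopping.
    """
    m, n = len(a), len(a[0])
    pivots: List[int] = []
    for k in range(min(m, n)):
        # ET check (hypothetical): the faster task sets a flag that is
        # polled here, once per column.
        if should_stop():
            return pivots, k
        # Partial pivoting: choose the largest-magnitude entry in column k,
        # on or below the diagonal.
        p = max(range(k, m), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ZeroDivisionError("zero pivot encountered")
        a[k], a[p] = a[p], a[k]
        pivots.append(p)
        # Scale the multipliers and update the trailing part of each row.
        for i in range(k + 1, m):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return pivots, min(m, n)
```

In a two-team setting, `should_stop` could be the `is_set` method of a `threading.Event` that the other team sets when it finishes. Because this variant updates the trailing columns as it goes, the first `done` columns hold a valid partial factorization when the loop exits early, so the next iteration can resume from column `done`.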