Thread to core assignment in SMT on-chip multiprocessors

State-of-the-art high-performance processors like the IBM POWER5 and Intel i7 show a trend in industry towards on-chip Multiprocessors (CMP) involving Simultaneous Multithreading (SMT) in each core. In these processors, the way in which applications are assigned to cores plays a key role in the perf...

Descripción completa

Detalles Bibliográficos
Autores: Acosta Ojeda, Carmelo Alexis, Cazorla Almeida, Francisco Javier, Ramírez Bellido, Alejandro, Valero Cortés, Mateo|||0000-0003-2917-2482
Tipo de recurso: artículo
Fecha de publicación:2009
País:España
Institución:Universitat Politècnica de Catalunya (UPC)
Repositorio:UPCommons. Portal del coneixement obert de la UPC
Idioma:inglés
OAI Identifier:oai:upcommons.upc.edu:2117/6860
Acceso en línea:https://hdl.handle.net/2117/6860
https://dx.doi.org/10.1109/SBAC-PAD.2009.13
Access Level:acceso abierto
Palabra clave:Multiprocessors
Multiprocessadors
Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors
Descripción
Sumario:State-of-the-art high-performance processors like the IBM POWER5 and Intel i7 show a trend in industry towards on-chip Multiprocessors (CMP) involving Simultaneous Multithreading (SMT) in each core. In these processors, the way in which applications are assigned to cores plays a key role in the performance of each application and the overall system performance. In this paper we show that the system throughput highly depends on the Thread to Core Assignment (TCA), regardless the SMT Instruction Fetch (IFetch) Policy implemented in the cores. Our results indicate that a good TCA can improve the results of any underlying IFetch Policy, yielding speedups of up to 28%. Given the relevance of TCA, we propose an algorithm to manage it in CMP+SMT processors. The proposed throughput-oriented TCA Algorithm takes into account the workload characteristics and the underlying SMT IFetch Policy. Our results show that the TCA Algorithm obtains thread-to-core assignments 3% close to the optimal assignation for each case, yielding system throughput improvements up to 21%.