Fetch unit design for scalable simultaneous multithreading (ScSMT)


Bibliographic Details
Authors: Moure, Juan Carlos; Rexachs del Rosario, Dolores; Luque Fadón, Emilio
Resource type: article
Status: Published version
Publication date: 2001
Country: Argentina
Institution: Universidad Nacional de La Plata
Repository: SEDICI (UNLP)
Language: English
OAI Identifier: oai:sedici.unlp.edu.ar:10915/9404
Online access: http://sedici.unlp.edu.ar/handle/10915/9404
Access Level: open access
Keywords: Computer Science
Parallel processor
Threads
Processor architecture
Computing
Description
Summary: Continuous IC process enhancements make it possible to integrate on a single chip the resources required for simultaneously executing multiple control flows or threads, exploiting different levels of thread-level parallelism: application-, function-, and loop-level. Scalable simultaneous multithreading combines static and dynamic mechanisms to assemble a complexity-effective design that provides high instructions-per-cycle rates without sacrificing cycle time or single-thread performance. This paper addresses the design of the fetch unit for a high-performance, scalable, simultaneous multithreaded processor. We present the detailed microarchitecture of a clustered and reconfigurable fetch unit based on an existing single-thread fetch unit. To minimize the occurrence of fetch hazards, the fetch unit dynamically adapts to the available thread-level parallelism and to the fetch characteristics of the active threads, working as a single shared unit or as two separate clusters. It combines static and dynamic methods in a complexity-efficient way. The design is supported by a simulation-based analysis of different instruction cache and branch target buffer configurations in the context of a multithreaded execution workload. Average reductions in miss rates of between 30% and 60%, and peak reductions greater than 200%, are obtained.