Runtime home mapping for effective memory resource usage
In tiled Chip Multiprocessors (CMPs) last-level cache (LLC) banks are usually shared but distributed among the tiles. A static mapping of cache blocks to the LLC banks leads to poor efficiency since a block may be mapped away from the tiles actually accessing it. Dynamic policies either rely on the...
| Autores: | , |
|---|---|
| Tipo de recurso: | artículo |
| Fecha de publicación: | 2014 |
| País: | España |
| Institución: | Universitat Politècnica de València (UPV) |
| Repositorio: | RiuNet. Repositorio Institucional de la Universitat Politécnica de Valéncia |
| Idioma: | inglés |
| OAI Identifier: | oai:riunet.upv.es:10251/50718 |
| Acceso en línea: | https://riunet.upv.es/handle/10251/50718 |
| Access Level: | acceso abierto |
| Palabra clave: | Chip multiprocessors Network-on-chip Cache hierarchy Coherence protocols ARQUITECTURA Y TECNOLOGIA DE COMPUTADORES |
| Sumario: | In tiled Chip Multiprocessors (CMPs) last-level cache (LLC) banks are usually shared but distributed among the tiles. A static mapping of cache blocks to the LLC banks leads to poor efficiency since a block may be mapped away from the tiles actually accessing it. Dynamic policies either rely on the static mapping of blocks to a set of banks (D-NUCA) or rely on the OS to dynamically load pages to statically mapped addresses (first-touch). In this paper, we propose Runtime Home Mapping (RHM), a new dynamic approach where the LLC home bank is determined at runtime by the memory controller when the block is fetched from main memory, trying to map each block as close as possible to the requestor thus speeding up execution time and lowering message latencies. Block migration and replication provide further improvements to basic RHM. Also, in a further optimization we eliminate the directory structure. All these optimizations involve specific NoC optimizations and co-designs. Results with PARSEC and SPLASH-2 applications show, when compared with alternative solutions, that RHM achieves a 41% and 35% average reduction in load and store latencies respectively compared to static mapping. This leads to an average reduction of 28% in applications execution. |
|---|