4 resultados para ASSOCIATIVITY
em Indian Institute of Science - Bangalore - Índia
Resumo:
CMPs enable simultaneous execution of multiple applications on the same platforms that share cache resources. Diversity in the cache access patterns of these simultaneously executing applications can potentially trigger inter-application interference, leading to cache pollution. Whereas a large cache can ameliorate this problem, the issues of larger power consumption with increasing cache size, amplified at sub-100nm technologies, makes this solution prohibitive. In this paper in order to address the issues relating to power-aware performance of caches, we propose a caching structure that addresses the following: 1. Definition of application-specific cache partitions as an aggregation of caching units (molecules). The parameters of each molecule namely size, associativity and line size are chosen so that the power consumed by it and access time are optimal for the given technology. 2. Application-Specific resizing of cache partitions with variable and adaptive associativity per cache line, way size and variable line size. 3. A replacement policy that is transparent to the partition in terms of size, heterogeneity in associativity and line size. Through simulation studies we establish the superiority of molecular cache (caches built as aggregations of molecules) that offers a 29% power advantage over that of an equivalently performing traditional cache.
Resumo:
Packet forwarding is a memory-intensive application requiring multiple accesses through a trie structure. The efficiency of a cache for this application critically depends on the placement function to reduce conflict misses. Traditional placement functions use a one-level mapping that naively partitions trie-nodes into cache sets. However, as a significant percentage of trie nodes are not useful, these schemes suffer from a non-uniform distribution of useful nodes to sets. This in turn results in increased conflict misses. Newer organizations such as variable associativity caches achieve flexibility in placement at the expense of increased hit-latency. This makes them unsuitable for L1 caches.We propose a novel two-level mapping framework that retains the hit-latency of one-level mapping yet incurs fewer conflict misses. This is achieved by introducing a secondlevel mapping which reorganizes the nodes in the naive initial partitions into refined partitions with near-uniform distribution of nodes. Further as this remapping is accomplished by simply adapting the index bits to a given routing table the hit-latency is not affected. We propose three new schemes which result in up to 16% reduction in the number of misses and 13% speedup in memory access time. In comparison, an XOR-based placement scheme known to perform extremely well for general purpose architectures, can obtain up to 2% speedup in memory access time.
Resumo:
It is essential to accurately estimate the working set size (WSS) of an application for various optimizations such as to partition cache among virtual machines or reduce leakage power dissipated in an over-allocated cache by switching it OFF. However, the state-of-the-art heuristics such as average memory access latency (AMAL) or cache miss ratio (CMR) are poorly correlated to the WSS of an application due to 1) over-sized caches and 2) their dispersed nature. Past studies focus on estimating WSS of an application executing on a uniprocessor platform. Estimating the same for a chip multiprocessor (CMP) with a large dispersed cache is challenging due to the presence of concurrently executing threads/processes. Hence, we propose a scalable, highly accurate method to estimate WSS of an application. We call this method ``tagged WSS (TWSS)'' estimation method. We demonstrate the use of TWSS to switch-OFF the over-allocated cache ways in Static and Dynamic NonUniform Cache Architectures (SNUCA, DNUCA) on a tiled CMP. In our implementation of adaptable way SNUCA and DNUCA caches, decision of altering associativity is taken by each L2 controller. Hence, this approach scales better with the number of cores present on a CMP. It gives overall (geometric mean) 26% and 19% higher energy-delay product savings compared to AMAL and CMR heuristics on SNUCA, respectively.