46 resultados para cache-oblivious


Relevância:

20.00% 20.00%

Publicador:

Resumo:

One of the challenges for accurately estimating Worst Case Execu-tion Time(WCET) of executables is to accurately predict their cache behaviour. Various techniques have been developed to predict the cache contents at different program points to estimate the execution time of memory-accessing instructions. One of the most widely used techniques is Abstract Interpretation based Must Analysis, which de-termines the cache blocks guaranteed to be present in the cache, and hence provides safe estimation of cache hits and misses. However,Must Analysis is highly imprecise, and platforms using Must Analysis have been known to produce blown-up WCET estimates. In our work, we propose to use May Analysis to assist the Must Analysis cache up-date and make it more precise. We prove the safety of our approach as well as provide examples where our Improved Must Analysis provides better precision. Further, we also detect a serious flaw in the original Persistence Analysis, and use Must and May Analysis to assist the Persistence Analysis cache update, to make it safe and more precise than the known solutions to the problem.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Cache analysis plays a very important role in obtaining precise Worst Case Execution Time (WCET) estimates of programs for real-time systems. While Abstract Interpretation based approaches are almost universally used for cache analysis, they fail to take advantage of its unique requirement: it is not necessary to find the guaranteed cache behavior that holds across all executions of a program. We only need the cache behavior along one particular program path, which is the path with the maximum execution time. In this work, we introduce the concept of cache miss paths, which allows us to use the worst-case path information to improve the precision of AI-based cache analysis. We use Abstract Interpretation to determine the cache miss paths, and then integrate them in the IPET formulation. An added advantage is that this further allows us to use infeasible path information for cache analysis. Experimentally, our approach gives more precise WCETs as compared to AI-based cache analysis, and we also provide techniques to trade-off analysis time with precision to provide scalability.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper, we present Bi-Modal Cache - a flexible stacked DRAM cache organization which simultaneously achieves several objectives: (i) improved cache hit ratio, (ii) moving the tag storage overhead to DRAM, (iii) lower cache hit latency than tags-in-SRAM, and (iv) reduction in off-chip bandwidth wastage. The Bi-Modal Cache addresses the miss rate versus off-chip bandwidth dilemma by organizing the data in a bi-modal fashion - blocks with high spatial locality are organized as large blocks and those with little spatial locality as small blocks. By adaptively selecting the right granularity of storage for individual blocks at run-time, the proposed DRAM cache organization is able to make judicious use of the available DRAM cache capacity as well as reduce the off-chip memory bandwidth consumption. The Bi-Modal Cache improves cache hit latency despite moving the metadata to DRAM by means of a small SRAM based Way Locator. Further by leveraging the tremendous internal bandwidth and capacity that stacked DRAM organizations provide, the Bi-Modal Cache enables efficient concurrent accesses to tags and data to reduce hit time. Through detailed simulations, we demonstrate that the Bi-Modal Cache achieves overall performance improvement (in terms of Average Normalized Turnaround Time (ANTT)) of 10.8%, 13.8% and 14.0% in 4-core, 8-core and 16-core workloads respectively.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Background: Animals that hoard food to mediate seasonal deficits in resource availability might be particularly vulnerable to climate-mediated reductions in the quality and accessibility of food during the caching season. Central-place foragers might be additionally impacted by climatic constraints on their already restricted foraging range. Aims: We sought evidence for these patterns in a study of the American pika (Ochotona princeps), a territorial, central-place forager sensitive to climate. Methods: Pika food caches and available forage were re-sampled using historical methods at two long-term study sites, to quantify changes over two decades. Taxa that changed in availability or use were analysed for primary and secondary metabolites. Results: Both sites trended towards warmer summers, and snowmelt trended earlier at the lower latitude site. Graminoid cover increased at each site, and caching trends appeared to reflect available forage rather than primary metabolites. Pikas at the lower latitude site preferred species higher in secondary metabolites, known to provide higher-nutrient winter forage. However, caching of lower-nutrient graminoids increased in proportion with graminoid availability at that site. Conclusions: If our results represent trends in climate, cache quality and available forage, we predict that pikas at the lower latitude site will soon face nutritional deficiencies.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Loads that miss in L1 or L2 caches and waiting for their data at the head of the ROB cause significant slow down in the form of commit stalls. We identify that most of these commit stalls are caused by a small set of loads, referred to as LIMCOS (Loads Incurring Majority of COmmit Stalls). We propose simple history-based classifiers that track commit stalls suffered by loads to help us identify this small set of loads. We study an application of these classifiers to prefetching. The classifiers are used to train the prefetcher to focus on the misses suffered by LIMCOS. This, referred to as focused prefetching, results in a 9.8% gain in IPC over naive GHB based delta correlation prefetcher along with a 20.3% reduction in memory traffic for a set of 17 memory-intensive SPEC2000 benchmarks. Another important impact of focused prefetching is a 61% improvement in the accuracy of prefetches. We demonstrate that the proposed classification criterion performs better than other existing criteria like criticality and delinquent loads. Also we show that the criterion of focusing on commit stalls is robust enough across cache levels and can be applied to any prefetcher without any modifications to the prefetcher.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

A major concern of embedded system architects is the design for low power. We address one aspect of the problem in this paper, namely the effect of executable code compression. There are two benefits of code compression – firstly, a reduction in the memory footprint of embedded software, and secondly, potential reduction in memory bus traffic and power consumption. Since decompression has to be performed at run time it is achieved by hardware. We describe a tool called COMPASS which can evaluate a range of strategies for any given set of benchmarks and display compression ratios. Also, given an execution trace, it can compute the effect on bus toggles, and cache misses for a range of compression strategies. The tool is interactive and allows the user to vary a set of parameters, and observe their effect on performance. We describe an implementation of the tool and demonstrate its effectiveness. To the best of our knowledge this is the first tool proposed for such a purpose.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

CD-ROMs have proliferated as a distribution media for desktop machines for a large variety of multimedia applications (targeted for a single-user environment) like encyclopedias, magazines and games. With CD-ROM capacities up to 3 GB being available in the near future, they will form an integral part of Video on Demand (VoD) servers to store full-length movies and multimedia. In the first section of this paper we look at issues related to the single- user desktop environment. Since these multimedia applications are highly interactive in nature, we take a pragmatic approach, and have made a detailed study of the multimedia application behavior in terms of the I/O request patterns generated to the CD-ROM subsystem by tracing these patterns. We discuss prefetch buffer design and seek time characteristics in the context of the analysis of these traces. We also propose an adaptive main-memory hosted cache that receives caching hints from the application to reduce the latency when the user moves from one node of the hyper graph to another. In the second section we look at the use of CD-ROM in a VoD server and discuss the problem of scheduling multiple request streams and buffer management in this scenario. We adapt the C-SCAN (Circular SCAN) algorithm to suit the CD-ROM drive characteristics and prove that it is optimal in terms of buffer size management. We provide computationally inexpensive relations by which this algorithm can be implemented. We then propose an admission control algorithm which admits new request streams without disrupting the continuity of playback of the previous request streams. The algorithm also supports operations such as fast forward and replay. Finally, we discuss the problem of optimal placement of MPEG streams on CD-ROMs in the third section.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

We describe the design of a directory-based shared memory architecture on a hierarchical network of hypercubes. The distributed directory scheme comprises two separate hierarchical networks for handling cache requests and transfers. Further, the scheme assumes a single address space and each processing element views the entire network as contiguous memory space. The size of individual directories stored at each node of the network remains constant throughout the network. Although the size of the directory increases with the network size, the architecture is scalable. The results of the analytical studies demonstrate superior performance characteristics of our scheme compared with those of other schemes.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Simulation is an important means of evaluating new microarchitectures. With the invention of multi-core (CMP) platforms, simulators are becoming larger and more complex. However, with the availability of CMPs with larger caches and higher operating frequency, the wall clock time required for simulating an application has become comparatively shorter. Reducing this simulation time further is a great challenge, especially in the case of multi-threaded workload due to indeterminacy introduced due to simultaneously executing various threads. In this paper, we propose a technique for speeding multi-core simulation. The model of the processor core and cache are replaced with functional models, to achieve speedup. A timed Petri net model is used to estimate the execution time of the processor and the memory access latencies are estimated using hit/miss information obtained from the functional model of the cache. This model can be used to predict performance of data parallel applications or multiprogramming workload on CMP platform with various cache hierarchies and shared bus interconnect. The error in estimation of the execution time of an application is within 6%. The speedup achieved ranges between an average of 2x--4x over the cycle accurate simulator.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

In this paper we propose a new method of data handling for web servers. We call this method Network Aware Buffering and Caching (NABC for short). NABC facilitates reduction of data copies in web server's data sending path, by doing three things: (1) Layout the data in main memory in a way that protocol processing can be done without data copies (2) Keep a unified cache of data in kernel and ensure safe access to it by various processes and kernel and (3) Pass only the necessary meta data between processes so that bulk data handling time spent during IPC can be reduced. We realize NABC by implementing a set of system calls and an user library. The end product of the implementation is a set of APIs specifically designed for use by the web servers. We port an in house web server called SWEET, to NABC APIs and evaluate performance using a range of workloads both simulated and real. The results show a very impressive gain of 12% to 21% in throughput for static file serving and 1.6 to 4 times gain in throughput for lightweight dynamic content serving for a server using NABC APIs over the one using UNIX APIs.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Road transportation, as an important requirement of modern society, is presently hindered by restrictions in emission legislations as well as the availability of petroleum fuels, and as a consequence, the fuel cost. For nearly 270 years, we burned our fossil cache and have come to within a generation of exhausting the liquid part of it. Besides, to reduce the greenhouse gases, and to obey the environmental laws of most countries, it would be necessary to replace a significant number of the petroleum-fueled internal-combustion-engine vehicles (ICEVs) with electric cars in the near future. In this article, we briefly describe the merits and demerits of various proposed electrochemical systems for electric cars, namely the storage batteries, fuel cells and electrochemical supercapacitors, and determine the power and energy requirements of a modern car. We conclude that a viable electric car could be operated with a 50 kW polymer-electrolyte fuel cell stack to provide power for cruising and climbing, coupled in parallel with a 30 kW supercapacitor and/or battery bank to deliver additional short-term burst-power during acceleration.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

We describe a System-C based framework we are developing, to explore the impact of various architectural and microarchitectural level parameters of the on-chip interconnection network elements on its power and performance. The framework enables one to choose from a variety of architectural options like topology, routing policy, etc., as well as allows experimentation with various microarchitectural options for the individual links like length, wire width, pitch, pipelining, supply voltage and frequency. The framework also supports a flexible traffic generation and communication model. We provide preliminary results of using this framework to study the power, latency and throughput of a 4x4 multi-core processing array using mesh, torus and folded torus, for two different communication patterns of dense and sparse linear algebra. The traffic consists of both Request-Response messages (mimicing cache accesses)and One-Way messages. We find that the average latency can be reduced by increasing the pipeline depth, as it enables higher link frequencies. We also find that there exists an optimum degree of pipelining which minimizes energy-delay product.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

In sensor networks, routing algorithms should be designed such that packet losses due to wireless links are reduced.In this paper, we present a ”potential”-based routing scheme to find routes with high packet delivery ratios. The basic idea is to define a scalar potential value at each node in the network and forward data to the neighbour with the highest potential.For a simple 2-relay network, we propose a potential function that takes into account wireless channel state. Markov-chain based analysis provides analytical expressions for packet delivery ratio. Considerable improvement can be observed compared to a channel-state-oblivious policy. This motivates us to define a channel-state-dependent potential function in a general network context. Simulations show that for a relatively slowly changing wireless network, our approach can provide up to 20% better performance than the commonly- used shortest-hop-count-based routing.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Multiple Clock Domain processors provide an attractive solution to the increasingly challenging problems of clock distribution and power dissipation. They allow their chips to be partitioned into different clock domains, and each domain’s frequency (voltage) to be independently configured. This flexibility adds new dimensions to the Dynamic Voltage and Frequency Scaling problem, while providing better scope for saving energy and meeting performance demands. In this paper, we propose a compiler directed approach for MCD-DVFS. We build a formal petri net based program performance model, parameterized by settings of microarchitectural components and resource configurations, and integrate it with our compiler passes for frequency selection.Our model estimates the performance impact of a frequency setting, unlike the existing best techniques which rely on weaker indicators of domain performance such as queue occupancies(used by online methods) and slack manifestation for a particular frequency setting (software based methods).We evaluate our method with subsets of SPECFP2000,Mediabench and Mibench benchmarks. Our mean energy savings is 60.39% (versus 33.91% of the best software technique)in a memory constrained system for cache miss dominated benchmarks, and we meet the performance demands.Our ED2 improves by 22.11% (versus 18.34%) for other benchmarks. For a CPU with restricted frequency settings, our energy consumption is within 4.69% of the optimal.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Digest caches have been proposed as an effective method tospeed up packet classification in network processors. In this paper, weshow that the presence of a large number of small flows and a few largeflows in the Internet has an adverse impact on the performance of thesedigest caches. In the Internet, a few large flows transfer a majority ofthe packets whereas the contribution of several small flows to the totalnumber of packets transferred is small. In such a scenario, the LRUcache replacement policy, which gives maximum priority to the mostrecently accessed digest, tends to evict digests belonging to the few largeflows. We propose a new cache management algorithm called SaturatingPriority (SP) which aims at improving the performance of digest cachesin network processors by exploiting the disparity between the number offlows and the number of packets transferred. Our experimental resultsdemonstrate that SP performs better than the widely used LRU cachereplacement policy in size constrained caches. Further, we characterizethe misses experienced by flow identifiers in digest caches.