990 results for data cache
Abstract:
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs. For STMs to be adopted widely in performance-critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with data cache stall cycles contributing more than 50% of the execution cycles in a majority of the benchmarks. We find that, on average, misses occurring inside the STM account for 62% of the total data cache miss latency cycles experienced by the applications, and that cache performance is adversely impacted by certain inherent characteristics of the STM itself. These observations motivate us to propose a set of compiler transformations targeted at making STMs cache friendly. We find that the STM's fine-grained and application-unaware locking is a major contributor to its poor cache behavior, and we therefore propose selective Lock-Data Colocation (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint-access parallel suffer from costly coherence misses caused by centralized global timestamp updates, and we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications, reducing data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the eight STAMP applications.
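To make the coherence-miss problem concrete, here is a minimal C sketch (our illustration, not the paper's code) contrasting a single global commit timestamp, which every committing transaction updates, with hypothetical per-partition timestamps of the kind SPTS suggests:

```c
/* A minimal sketch contrasting a global commit timestamp with
 * per-partition timestamps. In TL2-style STMs every commit increments
 * one shared counter, so the cache line holding it ping-pongs between
 * cores even when transactions touch disjoint data. Names are ours. */
#include <stdatomic.h>

#define NPARTITIONS 16
#define CACHELINE   64

/* Centralized design: one counter, one contended cache line. */
atomic_ulong global_ts;

/* Per-partition design: each counter sits on its own cache line, so
 * disjoint-access-parallel transactions commit without invalidating
 * each other's cached copies. */
struct { atomic_ulong ts; char pad[CACHELINE - sizeof(atomic_ulong)]; }
    part_ts[NPARTITIONS];

unsigned long commit_global(void)  { return atomic_fetch_add(&global_ts, 1); }
unsigned long commit_part(int p)   { return atomic_fetch_add(&part_ts[p].ts, 1); }
```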
Abstract:
The first-level data cache in modern processors has become a major consumer of energy due to its increasing size and high frequency of access. To reduce this energy consumption, we propose in this paper a straightforward filtering technique based on a highly accurate forwarding predictor. Specifically, a simple structure predicts whether a load instruction will obtain its data via forwarding from the load-store structure, thus avoiding the data cache access, or whether it will be provided by the data cache. This mechanism reduces data cache energy consumption by an average of 21.5% with a negligible performance penalty of less than 0.1%. Furthermore, we also target static cache energy consumption by disabling a portion of the sets of the L2 associative cache. Overall, when both proposals are combined, the total L1 and L2 energy consumption is reduced by an average of 29.2% with a performance penalty of just 0.25%.
Keywords: energy consumption; filtering; forwarding predictor; cache hierarchy
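As a rough illustration of such a filtering predictor, the following C sketch assumes a PC-indexed table of 2-bit saturating counters; the paper's actual predictor organization may differ:

```c
/* A minimal sketch, assuming a PC-indexed table of 2-bit saturating
 * counters. A load first consults the predictor: if forwarding is
 * predicted, only the load-store structure is probed and the L1 data
 * cache access is skipped, saving its access energy; a misprediction
 * simply falls back to the cache and retrains the counter. */
#include <stdint.h>
#include <stdbool.h>

#define PRED_ENTRIES 1024
static uint8_t pred[PRED_ENTRIES];        /* 2-bit counters, values 0..3 */

static bool predict_forwarding(uint64_t pc) {
    return pred[pc % PRED_ENTRIES] >= 2;  /* predict if weakly/strongly set */
}

static void train(uint64_t pc, bool forwarded) {
    uint8_t *c = &pred[pc % PRED_ENTRIES];
    if (forwarded) { if (*c < 3) (*c)++; }
    else           { if (*c > 0) (*c)--; }
}
```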
Abstract:
Knowledge of a program's worst-case execution time (WCET) is essential in validating real-time systems and helps in effective scheduling. One popular approach used in industry is to measure the execution time of program components on the target architecture and combine the measurements using static analysis of the program. Measurements need to be taken in the least intrusive way in order to avoid affecting the accuracy of the estimated WCET. Several programs exhibit phase behavior, wherein the program's dynamic execution is composed of phases; each phase, distinct from the others, exhibits homogeneous behavior with respect to cycles per instruction (CPI), data cache misses, etc. In this paper, we show that phase behavior has important implications for timing analysis. We exploit the homogeneity of a phase to reduce instrumentation overhead while ensuring that the accuracy of the WCET estimate is not greatly affected. We propose a model for estimating WCET using the static worst-case instruction counts of individual phases and a function of the measured average CPI. We describe a WCET analyzer built on this model which targets two different architectures. The WCET analyzer is observed to give safe estimates for most benchmarks considered in this paper, and the tightness of its WCET estimates improves on Chronos, a well-known static WCET analyzer, for most benchmarks.
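A plausible form of such a phase-based model (the notation below is assumed for illustration, not taken from the paper) is:

```latex
% With phases i = 1..n, static worst-case instruction count WCIC_i for
% phase i, measured average CPI for that phase, and clock period T:
\mathrm{WCET} \;\le\; T \sum_{i=1}^{n} \mathrm{WCIC}_i \cdot f\!\left(\overline{\mathrm{CPI}}_i\right)
% where f applies a safety margin to the measured average CPI so that
% the combined estimate remains a safe upper bound.
```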
Abstract:
Sparse matrix-vector multiplication (SMVM) is a fundamental operation in many scientific and engineering applications. Sparse matrices often have thousands of rows and columns in which most of the entries are zero and the non-zero data is spread across the matrix. This sparsity degrades data locality, reducing the effectiveness of the data cache in general-purpose processors and thus their performance efficiency compared with dense matrix multiplication. In this paper, we propose a parallel processing solution for SMVM on a many-core architecture. The architecture is tested with known benchmarks using a ZYNQ-7020 FPGA. It is scalable in the number of core elements and limited only by the available memory bandwidth. It achieves performance efficiencies of up to almost 70% and better performance than previous FPGA designs.
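For reference, the standard CSR (compressed sparse row) kernel below shows the irregular, data-dependent gather that undermines the data cache:

```c
/* Standard CSR sparse matrix-vector multiply. The access x[col[j]] is
 * a data-dependent gather: consecutive iterations rarely touch the
 * same cache line, unlike the streaming accesses of dense multiply. */
void spmv_csr(int nrows, const int *rowptr, const int *col,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += val[j] * x[col[j]];   /* irregular gather from x */
        y[i] = sum;
    }
}
```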
Abstract:
This dissertation studies context-aware applications and the algorithms proposed for the client side. The required context-aware infrastructure is discussed in depth to illustrate that such an infrastructure collects the mobile user's context information, registers service providers, derives the mobile user's current context, distributes user context among context-aware applications, and provides tailored services. The proposed approach tries to strike a balance between the context server and mobile devices: context acquisition is centralized at the server to ensure the usability of context information among mobile devices, while context reasoning remains at the application level. Hence, centralized context acquisition with distributed context reasoning is viewed as the better overall solution. The context-aware search application is designed and implemented at the server side. A new algorithm is proposed that takes user context profiles into consideration. By promoting feedback on the dynamics of the system, any prior user selection is saved for further analysis so that it may contribute to the results of a subsequent search. On the basis of these developments at the server side, various solutions are provided at the client side. A proxy software component is set up for the purpose of data collection. This research endorses the belief that the proxy at the client side should contain the context reasoning component; the implementation of such a component lends credence to this belief in that context-aware applications are able to derive user context profiles. Furthermore, a context cache scheme is implemented to manage the cache on the client device in order to minimize processing requirements and other resources (bandwidth, CPU cycles, power). Java and MySQL platforms are used to implement the proposed architecture and to test scenarios derived from a user's daily activities. To meet the practical demands of a testing environment without the heavy cost of establishing a comprehensive infrastructure, a software simulation using the free Yahoo search API is provided as a means to evaluate the effectiveness of the design approach realistically. The integration of the Yahoo search engine into the context-aware architecture demonstrates how a context-aware application can meet user demands for tailored services and products in and around the user's environment. The test results show that the overall design is highly effective, providing new features and enriching the mobile user's experience through a broad scope of potential applications.
Abstract:
Data caching is an important technique in mobile computing environments for improving data availability and access latency, particularly because these environments are characterized by narrow-bandwidth wireless links and frequent disconnections. The cache replacement policy plays a vital role in the performance of a cached mobile environment, since the amount of data stored in a client cache is small. In this paper we review some of the well-known cache replacement policies proposed for mobile data caches, and compare them after classifying them based on the criteria used for evicting documents. In addition, this paper suggests some alternative techniques for cache replacement.
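As a baseline for the recency criterion that many surveyed policies refine (with document size, retrieval cost, or location validity), here is a minimal LRU eviction sketch in C; all names are illustrative:

```c
/* Minimal LRU victim selection for a small client cache: evict the
 * entry with the oldest last_used stamp, preferring free slots. */
#define SLOTS 8
struct entry { char key[32]; unsigned long last_used; int valid; };
static struct entry cache[SLOTS];
static unsigned long clock_tick;

static int lru_victim(void) {
    int victim = 0;
    for (int i = 1; i < SLOTS; i++) {
        if (!cache[i].valid) return i;                /* free slot first */
        if (cache[i].last_used < cache[victim].last_used) victim = i;
    }
    return victim;                                    /* least recently used */
}

static void touch(int i) { cache[i].last_used = ++clock_tick; }
```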
Abstract:
Mobile applications are being deployed on a massive scale in various mobile sensor grid database systems. Given the limited resources of mobile devices, how to process the huge number of queries from mobile users against distributed sensor grid databases becomes a critical problem for such systems. While the fundamental semantic cache technique has been investigated for query optimization in sensor grid database systems, the problem remains difficult because more realistic multi-dimensional constraints have not been considered in existing methods. To solve this problem, a new semantic cache scheme is presented in this paper for location-dependent data queries in distributed sensor grid database systems. It considers multi-dimensional constraints or factors in a unified cost model architecture, determines the parameters of the cost model using the concept of Nash equilibrium from game theory, and makes semantic cache decisions from the established cost model. The scenarios of three factors, namely semantics, time, and location, are investigated as special cases that improve existing methods. Experiments are conducted to demonstrate the presented semantic cache scheme for distributed sensor grid database systems.
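A heavily simplified sketch of what a unified multi-factor cost model might look like; the weights here are plain parameters, whereas the paper derives them via a Nash-equilibrium formulation, and all names are ours:

```c
/* Each cached segment is scored on semantic similarity to the query,
 * temporal freshness, and location proximity; the weighted sum drives
 * both replacement and whether a query can be answered locally. */
struct segment { double semantic_sim, freshness, proximity; };

double cache_benefit(const struct segment *s,
                     double w_s, double w_t, double w_l) {
    return w_s * s->semantic_sim + w_t * s->freshness + w_l * s->proximity;
}
/* Replacement keeps the segments with the highest cache_benefit();
 * a query answerable from a high-benefit segment skips the server. */
```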
Analyzing Cache Performance Bottlenecks of STM Applications and Addressing Them with the Compiler's Help
Abstract:
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs as an alternative to traditional lock-based synchronization. However, the adoption of STM in mainstream software has been quite low due to its considerable overheads and poor cache/memory performance. In this paper, we perform a detailed study of the cache behavior of STM applications and quantify the impact of different STM factors on the cache misses experienced by the applications. Based on our analysis, we propose a compiler-driven Lock-Data Colocation (LDC) transformation targeted at reducing the cache overheads of STM. We show that LDC is effective in improving the cache behavior of STM applications, reducing data cache miss latency and improving execution time.
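To illustrate the intuition behind LDC (our sketch, not the compiler's actual transformation):

```c
/* In a word-based STM the metadata lock for an address normally lives
 * in a separate global lock table, so a transactional access touches
 * two unrelated cache lines; colocating the lock word with the data it
 * guards lets a single miss fetch both. */
#include <stdatomic.h>

/* Before: lock resides in a hashed global table. */
extern atomic_uint lock_table[1 << 20];
#define LOCK_FOR(addr) \
    (&lock_table[((unsigned long)(addr) >> 3) & ((1 << 20) - 1)])

/* After (transformed layout): lock embedded next to the field it
 * guards, on the same cache line. */
struct account {
    atomic_uint stm_lock;   /* colocated STM metadata */
    long balance;           /* guarded data           */
};
```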
Abstract:
In this paper we present a cache coherence protocol for multistage interconnection network (MIN)-based multiprocessors with two distinct private caches: private-blocks caches (PCache) containing blocks private to a process, and shared-blocks caches (SCache) containing data accessible by all processes. The architecture is extended by a coherence control bus connecting all shared-blocks cache controllers. Timing problems due to variable transit delays through the MIN are dealt with by introducing transient states in the proposed cache coherence protocol. The impact of the coherence protocol on system performance is evaluated through a three-phase performance study. Assuming homogeneity of all nodes, a single-node queuing model (phase 3) is developed to analyze system performance. This model is solved for processor and coherence bus utilizations using the mean value analysis (MVA) technique, with shared-block steady-state probabilities (phase 1) and communication delays (phase 2) as input parameters. The performance of our system is compared to that of a system with an equivalent-sized unified cache and to a multiprocessor implementing a directory-based coherence protocol. System performance measures are verified through simulation.
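A minimal sketch of how transient states appear in such a protocol (state names are illustrative, not the paper's):

```c
/* While an invalidation or data reply is still in flight through the
 * MIN, the block sits in a transient state; requests that hit a
 * transient block must be buffered or retried rather than serviced. */
enum block_state {
    INVALID, SHARED, MODIFIED,   /* stable states */
    S_TO_M_PENDING,              /* upgrade issued, acks outstanding  */
    I_TO_S_PENDING               /* read issued, data not yet arrived */
};

int can_service_now(enum block_state s) {
    return s == INVALID || s == SHARED || s == MODIFIED;
}
```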
Abstract:
In this paper we propose a new method of data handling for web servers, which we call Network Aware Buffering and Caching (NABC for short). NABC reduces data copies in a web server's data-sending path by doing three things: (1) laying out the data in main memory so that protocol processing can be done without data copies, (2) keeping a unified cache of data in the kernel and ensuring safe access to it by various processes and the kernel, and (3) passing only the necessary metadata between processes so that the bulk data handling time spent during IPC is reduced. We realize NABC by implementing a set of system calls and a user library, yielding a set of APIs specifically designed for use by web servers. We port an in-house web server called SWEET to the NABC APIs and evaluate performance using a range of workloads, both simulated and real. The results show an impressive gain of 12% to 21% in throughput for static file serving and a 1.6 to 4 times gain in throughput for lightweight dynamic content serving for a server using the NABC APIs over one using the UNIX APIs.
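The NABC APIs themselves are not shown in the abstract; as a rough analogue of the copy-avoidance idea, the standard POSIX scatter-gather path below sends a header and a cached body in one call without assembling a contiguous user-space buffer:

```c
/* Our illustration, not the NABC API: the response header and the
 * cached body stay in separate memory regions and are handed to the
 * kernel as a scatter-gather list, avoiding a user-space copy. */
#include <sys/uio.h>

ssize_t send_response(int sock, const char *hdr, size_t hdr_len,
                      const void *cached_body, size_t body_len) {
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,         .iov_len = hdr_len  },
        { .iov_base = (void *)cached_body, .iov_len = body_len },
    };
    return writev(sock, iov, 2);   /* one syscall, no bulk-data copy */
}
```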
Abstract:
Today's feature-rich multimedia products require embedded system solutions with complex Systems-on-Chip (SoC) to meet market expectations of high performance at low cost and low energy consumption. The memory architecture of an embedded system strongly influences critical system design objectives such as area, power, and performance. Hence the embedded system designer performs a complete memory architecture exploration to custom-design a memory architecture for a given set of applications, and is further interested in multiple optimal design points to address various market segments. However, tight time-to-market constraints enforce a short design cycle. In this paper we address the multi-level, multi-objective memory architecture exploration problem through a combination of exhaustive-search-based memory exploration at the outer level and a two-step integrated data layout for SPRAM-Cache-based hybrid architectures at the inner level: the first step partitions data between SPRAM and cache, and the second step performs cache-conscious data layout. We formulate the cache-conscious data layout as a graph partitioning problem and show that our approach gives up to 34% improvement over an existing approach while also optimizing the off-chip memory address space. We experimented with 3 embedded multimedia applications; for each application our approach explores several hundred memory configurations, yielding several optimal design points in a few hours of computation on a standard desktop.
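One common way to set up the graph-partitioning formulation the abstract mentions (our reading, not the authors' code) is a profiled affinity graph over data objects:

```c
/* Vertices are data objects; affinity[i][j] counts how often i and j
 * are accessed close together in time. A cache-conscious layout tries
 * to place high-affinity pairs in the same cache line or in
 * non-conflicting sets, which is the graph-partitioning objective. */
#define NOBJ 64
extern unsigned affinity[NOBJ][NOBJ];   /* profiled temporal affinity */

/* Greedy pass: for a seed object, pick the unplaced neighbor with the
 * highest affinity to join its cache-line-sized group. */
int best_partner(int seed, const int *placed) {
    int best = -1;
    unsigned best_w = 0;
    for (int j = 0; j < NOBJ; j++)
        if (!placed[j] && j != seed && affinity[seed][j] > best_w) {
            best_w = affinity[seed][j];
            best = j;
        }
    return best;   /* -1 if no unplaced neighbor remains */
}
```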
Abstract:
As the gap between processor and memory speeds continues to grow, memory performance becomes a key bottleneck for many applications. Compilers therefore increasingly seek to modify an application's data layout to improve cache locality and cache reuse. Whole Program Structure Layout (WPSL) transformations can significantly increase the spatial locality of data and reduce the runtime of programs that use link-based data structures by increasing cache line utilization. However, in production compilers, WPSL transformations do not realize their full performance potential due to a number of factors: structure layout decisions made on the basis of whole-program aggregated affinity/hotness of structure fields can be suboptimal for local code regions, and WPSL is restricted in applicability in production compilers for type-unsafe languages like C/C++ due to the extensive legality checks and field-sensitive pointer analysis required over the entire application. To overcome the issues associated with WPSL, we propose a Region-Based Structure Layout (RBSL) optimization framework that uses selective data copying. We describe our RBSL framework, implemented in the production C/C++ compiler on HP-UX IA-64, and show that, acting in complement to the existing and mature WPSL transformation framework in our compiler, RBSL improves application performance on pointer-intensive SPEC benchmarks by 3% to 28% over WPSL.
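A minimal sketch of the selective-copying idea behind a region-based layout (our illustration, not the HP-UX compiler's output):

```c
/* For a hot region that reads only `key`, the hot fields of a linked
 * structure are gathered into a dense scratch array so each cache
 * line carries useful data; the original layout is untouched
 * elsewhere, avoiding whole-program legality checks. */
struct node { int key; char payload[56]; struct node *next; };

long sum_keys_region(struct node *head, int *scratch, int cap) {
    int n = 0;
    for (struct node *p = head; p && n < cap; p = p->next)
        scratch[n++] = p->key;        /* region-local copy of hot field */
    long sum = 0;
    for (int i = 0; i < n; i++)       /* dense, cache-friendly scan */
        sum += scratch[i];
    return sum;
}
```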