27 resultados para MEMORY PERFORMANCE

em Indian Institute of Science - Bangalore - Índia


Relevância:

70.00% 70.00%

Publicador:

Resumo:

Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs as an alternative to traditional lock based synchronization. However adoption of STM in mainstream software has been quite low due to its considerable overheads and its poor cache/memory performance. In this paper, we perform a detailed study of the cache behavior of STM applications and quantify the impact of different STM factors on the cache misses experienced by the applications. Based on our analysis, we propose a compiler driven Lock-Data Colocation (LDC), targeted at reducing the cache overheads on STM. We show that LDC is effective in improving the cache behavior of STM applications by reducing the dcache miss latency and improving execution time performance.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

As the gap between processor and memory continues to grow Memory performance becomes a key performance bottleneck for many applications. Compilers therefore increasingly seek to modify an application’s data layout to improve cache locality and cache reuse. Whole program Structure Layout [WPSL] transformations can significantly increase the spatial locality of data and reduce the runtime of programs that use link-based data structures, by increasing the cache line utilization. However, in production compilers WPSL transformations do not realize the entire performance potential possible due to a number of factors. Structure layout decisions made on the basis of whole program aggregated affinity/hotness of structure fields, can be sub optimal for local code regions. WPSL is also restricted in applicability in production compilers for type unsafe languages like C/C++ due to the extensive legality checks and field sensitive pointer analysis required over the entire application. In order to overcome the issues associated with WPSL, we propose Region Based Structure Layout (RBSL) optimization framework, using selective data copying. We describe our RBSL framework, implemented in the production compiler for C/C++ on HP-UX IA-64. We show that acting in complement to the existing and mature WPSL transformation framework in our compiler, RBSL improves application performance in pointer intensive SPEC benchmarks ranging from 3% to 28% over WPSL

Relevância:

40.00% 40.00%

Publicador:

Resumo:

In this paper, the design and implementation of a single shared bus, shared memory multiprocessing system using Intel's single board computers is presented. The hardware configuration and the operating system developed to execute the parallel algorithms are discussed. The performance evaluation studies carried out on Image are outlined.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Software transactional memory (STM) has been proposed as a promising programming paradigm for shared memory multi-threaded programs as an alternative to conventional lock based synchronization primitives. Typical STM implementations employ a conflict detection scheme, which works with uniform access granularity, tracking shared data accesses either at word/cache line or at object level. It is well known that a single fixed access tracking granularity cannot meet the conflicting goals of reducing false conflicts without impacting concurrency adversely. A fine grained granularity while improving concurrency can have an adverse impact on performance due to lock aliasing, lock validation overheads, and additional cache pressure. On the other hand, a coarse grained granularity can impact performance due to reduced concurrency. Thus, in general, a fixed or uniform granularity access tracking (UGAT) scheme is application-unaware and rarely matches the access patterns of individual application or parts of an application, leading to sub-optimal performance for different parts of the application(s). In order to mitigate the disadvantages associated with UGAT scheme, we propose a Variable Granularity Access Tracking (VGAT) scheme in this paper. We propose a compiler based approach wherein the compiler uses inter-procedural whole program static analysis to select the access tracking granularity for different shared data structures of the application based on the application's data access pattern. We describe our prototype VGAT scheme, using TL2 as our STM implementation. Our experimental results reveal that VGAT-STM scheme can improve the application performance of STAMP benchmarks from 1.87% to up to 21.2%.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Loads that miss in L1 or L2 caches and waiting for their data at the head of the ROB cause significant slow down in the form of commit stalls. We identify that most of these commit stalls are caused by a small set of loads, referred to as LIMCOS (Loads Incurring Majority of COmmit Stalls). We propose simple history-based classifiers that track commit stalls suffered by loads to help us identify this small set of loads. We study an application of these classifiers to prefetching. The classifiers are used to train the prefetcher to focus on the misses suffered by LIMCOS. This, referred to as focused prefetching, results in a 9.8% gain in IPC over naive GHB based delta correlation prefetcher along with a 20.3% reduction in memory traffic for a set of 17 memory-intensive SPEC2000 benchmarks. Another important impact of focused prefetching is a 61% improvement in the accuracy of prefetches. We demonstrate that the proposed classification criterion performs better than other existing criteria like criticality and delinquent loads. Also we show that the criterion of focusing on commit stalls is robust enough across cache levels and can be applied to any prefetcher without any modifications to the prefetcher.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The design, implementation and evaluation are described of a dual-microcomputer system based on the concept of shared memory. Shared memory is useful for passing large blocks of data and it also provides a means to hold and work with shared data. In addition to the shared memory, a separate bus between the I/O ports of the microcomputers is provided. This bus is utilized for interprocessor synchronization. Software routines helpful in applying the dual-microcomputer system to realistic problems are presented. Performance evaluation of the system is carried out using benchmarks.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In prediction phase, the hierarchical tree structure obtained from the test image is used to predict every central pixel of an image by its four neighboring pixels. The prediction scheme generates the predicted error image, to which the wavelet/sub-band coding algorithm can be applied to obtain efficient compression. In quantization phase, we used a modified SPIHT algorithm to achieve efficiency in memory requirements. The memory constraint plays a vital role in wireless and bandwidth-limited applications. A single reusable list is used instead of three continuously growing linked lists as in case of SPIHT. This method is error resilient. The performance is measured in terms of PSNR and memory requirements. The algorithm shows good compression performance and significant savings in memory. (C) 2006 Elsevier B.V. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The effect of thermal cycling on the load-controlled tension-tension fatigue behavior of a Ni-Ti-Fe shape memory alloy (SMA) at room temperature was studied. Considerable strain accumulation was observed to occur in this alloy under both quasi-static and cyclic loading conditions. Though, in all cases, steady-state is reached within the first 50-100 cycles, the accumulated steady-state strain, epsilon(p.ss), is much smaller in thermally cycled alloy. As a result, the fatigue performance of them was found to be significantly enhanced vis-a-vis the as-solutionized alloy. Furthermore, under load-controlled conditions, the fatigue life of Ni-Ti-Fe alloys was found to be exclusively dependent on epsilon(p.ss). Observations made by profilometry and differential scanning calorimetry (DSC) indicate that the 200-500% enhancement in fatigue life of thermally cycled alloy is due to the homogeneous distribution of the accumulated fatigue strain. (C) 2010 Elsevier B.V. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We describe the design of a directory-based shared memory architecture on a hierarchical network of hypercubes. The distributed directory scheme comprises two separate hierarchical networks for handling cache requests and transfers. Further, the scheme assumes a single address space and each processing element views the entire network as contiguous memory space. The size of individual directories stored at each node of the network remains constant throughout the network. Although the size of the directory increases with the network size, the architecture is scalable. The results of the analytical studies demonstrate superior performance characteristics of our scheme compared with those of other schemes.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A single source network is said to be memory-free if all of the internal nodes (those except the source and the sinks) do not employ memory but merely send linear combinations of the symbols received at their incoming edges on their outgoing edges. In this work, we introduce network-error correction for single source, acyclic, unit-delay, memory-free networks with coherent network coding for multicast. A convolutional code is designed at the source based on the network code in order to correct network- errors that correspond to any of a given set of error patterns, as long as consecutive errors are separated by a certain interval which depends on the convolutional code selected. Bounds on this interval and the field size required for constructing the convolutional code with the required free distance are also obtained. We illustrate the performance of convolutional network error correcting codes (CNECCs) designed for the unit-delay networks using simulations of CNECCs on an example network under a probabilistic error model.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Neural network models of associative memory exhibit a large number of spurious attractors of the network dynamics which are not correlated with any memory state. These spurious attractors, analogous to "glassy" local minima of the energy or free energy of a system of particles, degrade the performance of the network by trapping trajectories starting from states that are not close to one of the memory states. Different methods for reducing the adverse effects of spurious attractors are examined with emphasis on the role of synaptic asymmetry. (C) 2002 Elsevier Science B.V. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this paper we propose a new method of data handling for web servers. We call this method Network Aware Buffering and Caching (NABC for short). NABC facilitates reduction of data copies in web server's data sending path, by doing three things: (1) Layout the data in main memory in a way that protocol processing can be done without data copies (2) Keep a unified cache of data in kernel and ensure safe access to it by various processes and kernel and (3) Pass only the necessary meta data between processes so that bulk data handling time spent during IPC can be reduced. We realize NABC by implementing a set of system calls and an user library. The end product of the implementation is a set of APIs specifically designed for use by the web servers. We port an in house web server called SWEET, to NABC APIs and evaluate performance using a range of workloads both simulated and real. The results show a very impressive gain of 12% to 21% in throughput for static file serving and 1.6 to 4 times gain in throughput for lightweight dynamic content serving for a server using NABC APIs over the one using UNIX APIs.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Today's feature-rich multimedia products require embedded system solution with complex System-on-Chip (SoC) to meet market expectations of high performance at a low cost and lower energy consumption. The memory architecture of the embedded system strongly influences critical system design objectives like area, power and performance. Hence the embedded system designer performs a complete memory architecture exploration to custom design a memory architecture for a given set of applications. Further, the designer would be interested in multiple optimal design points to address various market segments. However, tight time-to-market constraints enforces short design cycle time. In this paper we address the multi-level multi-objective memory architecture exploration problem through a combination of exhaustive-search based memory exploration at the outer level and a two step based integrated data layout for SPRAM-Cache based architectures at the inner level. We present a two step integrated approach for data layout for SPRAM-Cache based hybrid architectures with the first step as data-partitioning that partitions data between SPRAM and Cache, and the second step is the cache conscious data layout. We formulate the cache-conscious data layout as a graph partitioning problem and show that our approach gives up to 34% improvement over an existing approach and also optimizes the off-chip memory address space. We experimented our approach with 3 embedded multimedia applications and our approach explores several hundred memory configurations for each application, yielding several optimal design points in a few hours of computation on a standard desktop.