100 resultados para cache placement


Relevância:

10.00% 10.00%

Publicador:

Resumo:

We describe a System-C based framework we are developing, to explore the impact of various architectural and microarchitectural level parameters of the on-chip interconnection network elements on its power and performance. The framework enables one to choose from a variety of architectural options like topology, routing policy, etc., as well as allows experimentation with various microarchitectural options for the individual links like length, wire width, pitch, pipelining, supply voltage and frequency. The framework also supports a flexible traffic generation and communication model. We provide preliminary results of using this framework to study the power, latency and throughput of a 4x4 multi-core processing array using mesh, torus and folded torus, for two different communication patterns of dense and sparse linear algebra. The traffic consists of both Request-Response messages (mimicing cache accesses)and One-Way messages. We find that the average latency can be reduced by increasing the pipeline depth, as it enables higher link frequencies. We also find that there exists an optimum degree of pipelining which minimizes energy-delay product.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

In this paper we incorporate a novel approach to synthesize a class of closed-loop feedback control, based on the variational structure assignment. Properties of a viscoelastic system are used to design an active feedback controller for an undamped structural system with distributed sensor, actuator and controller. Wave dispersion properties of onedimensional beam system have been studied. Efficiency of the chosen viscoelastic model in enhancing damping and stability properties of one-dimensional viscoelastic bar have been analyzed. The variational structure is projected on a solution space of a closed-loop system involving a weakly damped structure with distributed sensor and actuator with controller. These assign the phenomenology based internal strain rate damping parameter of a viscoelastic system to the usual elastic structure but with active control. In the formulation a model of cantilever beam with non-collocated actuator and sensor has been considered. The formulation leads to the matrix identification problem of two dynamic stiffness matrices. The method has been simplified to obtain control system gains for the free vibration control of a cantilever beam system with collocated actuator-sensor, using quadratic optimal control and pole-placement methods.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Multiple Clock Domain processors provide an attractive solution to the increasingly challenging problems of clock distribution and power dissipation. They allow their chips to be partitioned into different clock domains, and each domain’s frequency (voltage) to be independently configured. This flexibility adds new dimensions to the Dynamic Voltage and Frequency Scaling problem, while providing better scope for saving energy and meeting performance demands. In this paper, we propose a compiler directed approach for MCD-DVFS. We build a formal petri net based program performance model, parameterized by settings of microarchitectural components and resource configurations, and integrate it with our compiler passes for frequency selection.Our model estimates the performance impact of a frequency setting, unlike the existing best techniques which rely on weaker indicators of domain performance such as queue occupancies(used by online methods) and slack manifestation for a particular frequency setting (software based methods).We evaluate our method with subsets of SPECFP2000,Mediabench and Mibench benchmarks. Our mean energy savings is 60.39% (versus 33.91% of the best software technique)in a memory constrained system for cache miss dominated benchmarks, and we meet the performance demands.Our ED2 improves by 22.11% (versus 18.34%) for other benchmarks. For a CPU with restricted frequency settings, our energy consumption is within 4.69% of the optimal.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Digest caches have been proposed as an effective method tospeed up packet classification in network processors. In this paper, weshow that the presence of a large number of small flows and a few largeflows in the Internet has an adverse impact on the performance of thesedigest caches. In the Internet, a few large flows transfer a majority ofthe packets whereas the contribution of several small flows to the totalnumber of packets transferred is small. In such a scenario, the LRUcache replacement policy, which gives maximum priority to the mostrecently accessed digest, tends to evict digests belonging to the few largeflows. We propose a new cache management algorithm called SaturatingPriority (SP) which aims at improving the performance of digest cachesin network processors by exploiting the disparity between the number offlows and the number of packets transferred. Our experimental resultsdemonstrate that SP performs better than the widely used LRU cachereplacement policy in size constrained caches. Further, we characterizethe misses experienced by flow identifiers in digest caches.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The specific side-chain orientations of the phenyl group in the polypeptides poly-S-benzyl-L-cysteine, poly-S-carbobenzoxy-L-cysteine and poly-O-carbobenzoxy-L-serine in the beta-structure have been studied by spectral measurements in solutions. All the three polypeptides exhibit aromatic CD bands, indicating the asymmetric placement of the side-chain phenyl rings when the polypeptide backbone takes up the antiparallel beta-structure. Supporting evidence for this is derived from n.m.r. spectra of the polypeptides, which show upfield shift of the phenyl protons due to the stacking of the aromatic rings. Molecular model building studies reveal the stacking of alternate phenyl groups along the polypeptide chain.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

As the gap between processor and memory continues to grow Memory performance becomes a key performance bottleneck for many applications. Compilers therefore increasingly seek to modify an application’s data layout to improve cache locality and cache reuse. Whole program Structure Layout [WPSL] transformations can significantly increase the spatial locality of data and reduce the runtime of programs that use link-based data structures, by increasing the cache line utilization. However, in production compilers WPSL transformations do not realize the entire performance potential possible due to a number of factors. Structure layout decisions made on the basis of whole program aggregated affinity/hotness of structure fields, can be sub optimal for local code regions. WPSL is also restricted in applicability in production compilers for type unsafe languages like C/C++ due to the extensive legality checks and field sensitive pointer analysis required over the entire application. In order to overcome the issues associated with WPSL, we propose Region Based Structure Layout (RBSL) optimization framework, using selective data copying. We describe our RBSL framework, implemented in the production compiler for C/C++ on HP-UX IA-64. We show that acting in complement to the existing and mature WPSL transformation framework in our compiler, RBSL improves application performance in pointer intensive SPEC benchmarks ranging from 3% to 28% over WPSL

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This paper describes techniques to estimate the worst case execution time of executable code on architectures with data caches. The underlying mechanism is Abstract Interpretation, which is used for the dual purposes of tracking address computations and cache behavior. A simultaneous numeric and pointer analysis using an abstraction for discrete sets of values computes safe approximations of access addresses which are then used to predict cache behavior using Must Analysis. A heuristic is also proposed which generates likely worst case estimates. It can be used in soft real time systems and also for reasoning about the tightness of the safe estimate. The analysis methods can handle programs with non-affine access patterns, for which conventional Presburger Arithmetic formulations or Cache Miss Equations do not apply. The precision of the estimates is user-controlled and can be traded off against analysis time. Executables are analyzed directly, which, apart from enhancing precision, renders the method language independent.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Today's feature-rich multimedia products require embedded system solution with complex System-on-Chip (SoC) to meet market expectations of high performance at a low cost and lower energy consumption. The memory architecture of the embedded system strongly influences these parameters. Hence the embedded system designer performs a complete memory architecture exploration. This problem is a multi-objective optimization problem and can be tackled as a two-level optimization problem. The outer level explores various memory architecture while the inner level explores placement of data sections (data layout problem) to minimize memory stalls. Further, the designer would be interested in multiple optimal design points to address various market segments. However, tight time-to-market constraints enforces short design cycle time. In this paper we address the multi-level multi-objective memory architecture exploration problem through a combination of Multi-objective Genetic Algorithm (Memory Architecture exploration) and an efficient heuristic data placement algorithm. At the outer level the memory architecture exploration is done by picking memory modules directly from a ASIC memory Library. This helps in performing the memory architecture exploration in a integrated framework, where the memory allocation, memory exploration and data layout works in a tightly coupled way to yield optimal design points with respect to area, power and performance. We experimented our approach for 3 embedded applications and our approach explores several thousand memory architecture for each application, yielding a few hundred optimal design points in a few hours of computation time on a standard desktop.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Software transactional memory (STM) has been proposed as a promising programming paradigm for shared memory multi-threaded programs as an alternative to conventional lock based synchronization primitives. Typical STM implementations employ a conflict detection scheme, which works with uniform access granularity, tracking shared data accesses either at word/cache line or at object level. It is well known that a single fixed access tracking granularity cannot meet the conflicting goals of reducing false conflicts without impacting concurrency adversely. A fine grained granularity while improving concurrency can have an adverse impact on performance due to lock aliasing, lock validation overheads, and additional cache pressure. On the other hand, a coarse grained granularity can impact performance due to reduced concurrency. Thus, in general, a fixed or uniform granularity access tracking (UGAT) scheme is application-unaware and rarely matches the access patterns of individual application or parts of an application, leading to sub-optimal performance for different parts of the application(s). In order to mitigate the disadvantages associated with UGAT scheme, we propose a Variable Granularity Access Tracking (VGAT) scheme in this paper. We propose a compiler based approach wherein the compiler uses inter-procedural whole program static analysis to select the access tracking granularity for different shared data structures of the application based on the application's data access pattern. We describe our prototype VGAT scheme, using TL2 as our STM implementation. Our experimental results reveal that VGAT-STM scheme can improve the application performance of STAMP benchmarks from 1.87% to up to 21.2%.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Many web sites incorporate dynamic web pages to deliver customized contents to their users. However, dynamic pages result in increased user response times due to their construction overheads. In this paper, we consider mechanisms for reducing these overheads by utilizing the excess capacity with which web servers are typically provisioned. Specifically, we present a caching technique that integrates fragment caching with anticipatory page pre-generation in order to deliver dynamic pages faster during normal operating situations. A feedback mechanism is used to tune the page pre-generation process to match the current system load. The experimental results from a detailed simulation study of our technique indicate that, given a fixed cache budget, page construction speedups of more than fifty percent can be consistently achieved as compared to a pure fragment caching approach.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This paper presents a method for placement of Phasor Measurement Units, ensuring the monitoring of vulnerable buses which are obtained based on transient stability analysis of the overall system. Real-time monitoring of phase angles across different nodes, which indicates the proximity to instability, the very purpose will be well defined if the PMUs are placed at buses which are more vulnerable. The issue is to identify the key buses where the PMUs should be placed when the transient stability prediction is taken into account considering various disturbances. Integer Linear Programming technique with equality and inequality constraints is used to find out the optimal placement set with key buses identified from transient stability analysis. Results on IEEE-14 bus system are presented to illustrate the proposed approach.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

We propose a novel technique for reducing the power consumed by the on-chip cache in SNUCA chip multicore platform. This is achieved by what we call a "remap table", which maps accesses to the cache banks that are as close as possible to the cores, on which the processes are scheduled. With this technique, instead of using all the available cache, we use a portion of the cache and allocate lesser cache to the application. We formulate the problem as an energy-delay (ED) minimization problem and solve it offline using a scalable genetic algorithm approach. Our experiments show up to 40% of savings in the memory sub-system power consumption and 47% savings in energy-delay product (ED).

Relevância:

10.00% 10.00%

Publicador:

Resumo:

We propose a novel technique for reducing the power consumed by the on-chip cache in SNUCA chip multicore platform. This is achieved by what we call a "remap table", which maps accesses to the cache banks that are as close as possible to the cores, on which the processes are scheduled. With this technique, instead of using all the available cache, we use a portion of the cache and allocate lesser cache to the application. We formulate the problem as an energy-delay (ED) minimization problem and solve it offline using a scalable genetic algorithm approach. Our experiments show up to 40% of savings in the memory sub-system power consumption and 47% savings in energy-delay product (ED).

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Control of flow in duct networks has a myriad of applications ranging from heating, ventilation, and air-conditioning to blood flow networks. The system considered here provides vent velocity inputs to a novel 3-D wind display device called the TreadPort Active Wind Tunnel. An error-based robust decentralized sliding-mode control method with nominal feedforward terms is developed for individual ducts while considering cross coupling between ducts and model uncertainty as external disturbances in the output. This approach is important due to limited measurements, geometric complexities, and turbulent flow conditions. Methods for resolving challenges such as turbulence, electrical noise, valve actuator design, and sensor placement are presented. The efficacy of the controller and the importance of feedforward terms are demonstrated with simulations based upon an experimentally validated lumped parameter model and experiments on the physical system. Results show significant improvement over traditional control methods and validate prior assertions regarding the importance of decentralized control in practice.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Two models for AF relaying, namely, fixed gain and fixed power relaying, have been extensively studied in the literature given their ability to harness spatial diversity. In fixed gain relaying, the relay gain is fixed but its transmit power varies as a function of the source-relay channel gain. In fixed power relaying, the relay transmit power is fixed, but its gain varies. We revisit and generalize the fundamental two-hop AF relaying model. We present an optimal scheme in which an average power constrained AF relay adapts its gain and transmit power to minimize the symbol error probability (SEP) at the destination. Also derived are insightful and practically amenable closed-form bounds for the optimal relay gain. We then analyze the SEP of MPSK, derive tight bounds for it, and characterize the diversity order for Rayleigh fading. Also derived is an SEP approximation that is accurate to within 0.1 dB. Extensive results show that the scheme yields significant energy savings of 2.0-7.7 dB at the source and relay. Optimal relay placement for the proposed scheme is also characterized, and is different from fixed gain or power relaying. Generalizations to MQAM and other fading distributions are also discussed.