580 resultados para Benchmarks
Resumo:
Just-in-Time (JIT) compilers for Java can be augmented by making use of runtime profile information to produce better quality code and hence achieve higher performance. In a JIT compilation environment, the profile information obtained can be readily exploited in the same run to aid recompilation and optimization of frequently executed (hot) methods. This paper discusses a low overhead path profiling scheme for dynamically profiling AT produced native code. The profile information is used in recompilation during a subsequent invocation of the hot method. During recompilation tree regions along the hot paths are enlarged and instruction scheduling at the superblock level is performed. We have used the open source LaTTe AT compiler framework for our implementation. Our results on a SPARC platform for SPEC JVM98 benchmarks indicate that (i) there is a significant reduction in the number of tree regions along the hot paths, and (ii) profile aided recompilation in LaTTe achieves performance comparable to that of adaptive LaTTe in spite of retranslation and profiling overheads.
Resumo:
Multiple Clock Domain processors provide an attractive solution to the increasingly challenging problems of clock distribution and power dissipation. They allow their chips to be partitioned into different clock domains, and each domain’s frequency (voltage) to be independently configured. This flexibility adds new dimensions to the Dynamic Voltage and Frequency Scaling problem, while providing better scope for saving energy and meeting performance demands. In this paper, we propose a compiler directed approach for MCD-DVFS. We build a formal petri net based program performance model, parameterized by settings of microarchitectural components and resource configurations, and integrate it with our compiler passes for frequency selection.Our model estimates the performance impact of a frequency setting, unlike the existing best techniques which rely on weaker indicators of domain performance such as queue occupancies(used by online methods) and slack manifestation for a particular frequency setting (software based methods).We evaluate our method with subsets of SPECFP2000,Mediabench and Mibench benchmarks. Our mean energy savings is 60.39% (versus 33.91% of the best software technique)in a memory constrained system for cache miss dominated benchmarks, and we meet the performance demands.Our ED2 improves by 22.11% (versus 18.34%) for other benchmarks. For a CPU with restricted frequency settings, our energy consumption is within 4.69% of the optimal.
Resumo:
MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.
Resumo:
Context-sensitive points-to analysis is critical for several program optimizations. However, as the number of contexts grows exponentially, storage requirements for the analysis increase tremendously for large programs, making the analysis non-scalable. We propose a scalable flow-insensitive context-sensitive inclusion-based points-to analysis that uses a specially designed multi-dimensional bloom filter to store the points-to information. Two key observations motivate our proposal: (i) points-to information (between pointer-object and between pointer-pointer) is sparse, and (ii) moving from an exact to an approximate representation of points-to information only leads to reduced precision without affecting correctness of the (may-points-to) analysis. By using an approximate representation a multi-dimensional bloom filter can significantly reduce the memory requirements with a probabilistic bound on loss in precision. Experimental evaluation on SPEC 2000 benchmarks and two large open source programs reveals that with an average storage requirement of 4MB, our approach achieves almost the same precision (98.6%) as the exact implementation. By increasing the average memory to 27MB, it achieves precision upto 99.7% for these benchmarks. Using Mod/Ref analysis as the client, we find that the client analysis is not affected that often even when there is some loss of precision in the points-to representation. We find that the NoModRef percentage is within 2% of the exact analysis while requiring 4MB (maximum 15MB) memory and less than 4 minutes on average for the points-to analysis. Another major advantage of our technique is that it allows to trade off precision for memory usage of the analysis.
Resumo:
As the gap between processor and memory continues to grow Memory performance becomes a key performance bottleneck for many applications. Compilers therefore increasingly seek to modify an application’s data layout to improve cache locality and cache reuse. Whole program Structure Layout [WPSL] transformations can significantly increase the spatial locality of data and reduce the runtime of programs that use link-based data structures, by increasing the cache line utilization. However, in production compilers WPSL transformations do not realize the entire performance potential possible due to a number of factors. Structure layout decisions made on the basis of whole program aggregated affinity/hotness of structure fields, can be sub optimal for local code regions. WPSL is also restricted in applicability in production compilers for type unsafe languages like C/C++ due to the extensive legality checks and field sensitive pointer analysis required over the entire application. In order to overcome the issues associated with WPSL, we propose Region Based Structure Layout (RBSL) optimization framework, using selective data copying. We describe our RBSL framework, implemented in the production compiler for C/C++ on HP-UX IA-64. We show that acting in complement to the existing and mature WPSL transformation framework in our compiler, RBSL improves application performance in pointer intensive SPEC benchmarks ranging from 3% to 28% over WPSL
Resumo:
In achieving higher instruction level parallelism, software pipelining increases the register pressure in the loop. The usefulness of the generated schedule may be restricted to cases where the register pressure is less than the available number of registers. Spill instructions need to be introduced otherwise. But scheduling these spill instructions in the compact schedule is a difficult task. Several heuristics have been proposed to schedule spill code. These heuristics may generate more spill code than necessary, and scheduling them may necessitate increasing the initiation interval. We model the problem of register allocation with spill code generation and scheduling in software pipelined loops as a 0-1 integer linear program. The formulation minimizes the increase in initiation interval (II) by optimally placing spill code and simultaneously minimizes the amount of spill code produced. To the best of our knowledge, this is the first integrated formulation for register allocation, optimal spill code generation and scheduling for software pipelined loops. The proposed formulation performs better than the existing heuristics by preventing an increase in II in 11.11% of the loops and generating 18.48% less spill code on average among the loops extracted from Perfect Club and SPEC benchmarks with a moderate increase in compilation time.
Resumo:
With the emergence of voltage scaling as one of the most powerful power reduction techniques, it has been important to support voltage scalable statistical static timing analysis (SSTA) in deep submicrometer process nodes. In this paper, we propose a single delay model of logic gate using neural network which comprehensively captures process, voltage, and temperature variation along with input slew and output load. The number of simulation programs with integrated circuit emphasis (SPICE) required to create this model over a large voltage and temperature range is found to be modest and 4x less than that required for a conventional table-based approach with comparable accuracy. We show how the model can be used to derive sensitivities required for linear SSTA for an arbitrary voltage and temperature. Our experimentation on ISCAS 85 benchmarks across a voltage range of 0.9-1.1V shows that the average error in mean delay is less than 1.08% and average error in standard deviation is less than 2.85%. The errors in predicting the 99% and 1% probability point are 1.31% and 1%, respectively, with respect to SPICE. The two potential applications of voltage-aware SSTA have been presented, i.e., one for improving the accuracy of timing analysis by considering instance-specific voltage drops in power grids and the other for determining optimum supply voltage for target yield for dynamic voltage scaling applications.
Resumo:
Superscalar processors currently have the potential to fetch multiple basic blocks per cycle by employing one of several recently proposed instruction fetch mechanisms. However, this increased fetch bandwidth cannot be exploited unless pipeline stages further downstream correspondingly improve. In particular,register renaming a large number of instructions per cycle is diDcult. A large instruction window, needed to receive multiple basic blocks per cycle, will slow down dependence resolution and instruction issue. This paper addresses these and related issues by proposing (i) partitioning of the instruction window into multiple blocks, each holding a dynamic code sequence; (ii) logical partitioning of the registerjle into a global file and several local jles, the latter holding registers local to a dynamic code sequence; (iii) the dynamic recording and reuse of register renaming information for registers local to a dynamic code sequence. Performance studies show these mechanisms improve performance over traditional superscalar processors by factors ranging from 1.5 to a little over 3 for the SPEC Integer programs. Next, it is observed that several of the loops in the benchmarks display vector-like behavior during execution, even if the static loop bodies are likely complex for compile-time vectorization. A dynamic loop vectorization mechanism that builds on top of the above mechanisms is briefly outlined. The mechanism vectorizes up to 60% of the dynamic instructions for some programs, albeit the average number of iterations per loop is quite small.
Resumo:
Software transactional memory (STM) has been proposed as a promising programming paradigm for shared memory multi-threaded programs as an alternative to conventional lock based synchronization primitives. Typical STM implementations employ a conflict detection scheme, which works with uniform access granularity, tracking shared data accesses either at word/cache line or at object level. It is well known that a single fixed access tracking granularity cannot meet the conflicting goals of reducing false conflicts without impacting concurrency adversely. A fine grained granularity while improving concurrency can have an adverse impact on performance due to lock aliasing, lock validation overheads, and additional cache pressure. On the other hand, a coarse grained granularity can impact performance due to reduced concurrency. Thus, in general, a fixed or uniform granularity access tracking (UGAT) scheme is application-unaware and rarely matches the access patterns of individual application or parts of an application, leading to sub-optimal performance for different parts of the application(s). In order to mitigate the disadvantages associated with UGAT scheme, we propose a Variable Granularity Access Tracking (VGAT) scheme in this paper. We propose a compiler based approach wherein the compiler uses inter-procedural whole program static analysis to select the access tracking granularity for different shared data structures of the application based on the application's data access pattern. We describe our prototype VGAT scheme, using TL2 as our STM implementation. Our experimental results reveal that VGAT-STM scheme can improve the application performance of STAMP benchmarks from 1.87% to up to 21.2%.
Resumo:
Instruction reuse is a microarchitectural technique that improves the execution time of a program by removing redundant computations at run-time. Although this is the job of an optimizing compiler, they do not succeed many a time due to limited knowledge of run-time data. In this paper we examine instruction reuse of integer ALU and load instructions in network processing applications. Specifically, this paper attempts to answer the following questions: (1) How much of instruction reuse is inherent in network processing applications?, (2) Can reuse be improved by reducing interference in the reuse buffer?, (3) What characteristics of network applications can be exploited to improve reuse?, and (4) What is the effect of reuse on resource contention and memory accesses? We propose an aggregation scheme that combines the high-level concept of network traffic i.e. "flows" with a low level microarchitectural feature of programs i.e. repetition of instructions and data along with an architecture that exploits temporal locality in incoming packet data to improve reuse. We find that for the benchmarks considered, 1% to 50% of instructions are reused while the speedup achieved varies between 1% and 24%. As a side effect, instruction reuse reduces memory traffic and can therefore be considered as a scheme for low power.
Resumo:
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs. In order for STMs to be adopted widely for performance critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial, as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date, of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with the data cache stall cycles contributing to more than 50% of the execution cycles in a majority of the benchmarks. We find that on an average, misses occurring inside the STM account for 62% of total data cache miss latency cycles experienced by the applications and the cache performance is impacted adversely due to certain inherent characteristics of the STM itself. The above observations motivate us to propose a set of specific compiler transformations targeted at making the STMs cache friendly. We find that STM's fine grained and application unaware locking is a major contributor to its poor cache behavior. Hence we propose selective Lock Data co-location (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint access parallel, suffer from costly coherence misses caused by the centralized global time stamp updates and hence we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications by reducing the data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the 8 STAMP applications.
Resumo:
Practical usage of machine learning is gaining strategic importance in enterprises looking for business intelligence. However, most enterprise data is distributed in multiple relational databases with expert-designed schema. Using traditional single-table machine learning techniques over such data not only incur a computational penalty for converting to a flat form (mega-join), even the human-specified semantic information present in the relations is lost. In this paper, we present a practical, two-phase hierarchical meta-classification algorithm for relational databases with a semantic divide and conquer approach. We propose a recursive, prediction aggregation technique over heterogeneous classifiers applied on individual database tables. The proposed algorithm was evaluated on three diverse datasets. namely TPCH, PKDD and UCI benchmarks and showed considerable reduction in classification time without any loss of prediction accuracy. (C) 2012 Elsevier Ltd. All rights reserved.
Resumo:
Points-to analysis is a key compiler analysis. Several memory related optimizations use points-to information to improve their effectiveness. Points-to analysis is performed by building a constraint graph of pointer variables and dynamically updating it to propagate more and more points-to information across its subset edges. So far, the structure of the constraint graph has been only trivially exploited for efficient propagation of information, e.g., in identifying cyclic components or to propagate information in topological order. We perform a careful study of its structure and propose a new inclusion-based flow-insensitive context-sensitive points-to analysis algorithm based on the notion of dominant pointers. We also propose a new kind of pointer-equivalence based on dominant pointers which provides significantly more opportunities for reducing the number of pointers tracked during the analysis. Based on this hitherto unexplored form of pointer-equivalence, we develop a new context-sensitive flow-insensitive points-to analysis algorithm which uses incremental dominator update to efficiently compute points-to information. Using a large suite of programs consisting of SPEC 2000 benchmarks and five large open source programs we show that our points-to analysis is 88% faster than BDD-based Lazy Cycle Detection and 2x faster than Deep Propagation. We argue that our approach of detecting dominator-based pointer-equivalence is a key to improve points-to analysis efficiency.
Resumo:
Knowledge about program worst case execution time (WCET) is essential in validating real-time systems and helps in effective scheduling. One popular approach used in industry is to measure execution time of program components on the target architecture and combine them using static analysis of the program. Measurements need to be taken in the least intrusive way in order to avoid affecting accuracy of estimated WCET. Several programs exhibit phase behavior, wherein program dynamic execution is observed to be composed of phases. Each phase being distinct from the other, exhibits homogeneous behavior with respect to cycles per instruction (CPI), data cache misses etc. In this paper, we show that phase behavior has important implications on timing analysis. We make use of the homogeneity of a phase to reduce instrumentation overhead at the same time ensuring that accuracy of WCET is not largely affected. We propose a model for estimating WCET using static worst case instruction counts of individual phases and a function of measured average CPI. We describe a WCET analyzer built on this model which targets two different architectures. The WCET analyzer is observed to give safe estimates for most benchmarks considered in this paper. The tightness of the WCET estimates are observed to be improved for most benchmarks compared to Chronos, a well known static WCET analyzer.
Resumo:
Most of the existing WCET estimation methods directly estimate execution time, ET, in cycles. We propose to study ET as a product of two factors, ET = IC * CPI, where IC is instruction count and CPI is cycles per instruction. Considering directly the estimation of ET may lead to a highly pessimistic estimate since implicitly these methods may be using worst case IC and worst case CPI. We hypothesize that there exists a functional relationship between CPI and IC such that CPI=f(IC). This is ascertained by computing the covariance matrix and studying the scatter plots of CPI versus IC. IC and CPI values are obtained by running benchmarks with a large number of inputs using the cycle accurate architectural simulator, Simplescalar on two different architectures. It is shown that the benchmarks can be grouped into different classes based on the CPI versus IC relationship. For some benchmarks like FFT, FIR etc., both IC and CPI are almost a constant irrespective of the input. There are other benchmarks that exhibit a direct or an inverse relationship between CPI and IC. In such a case, one can predict CPI for a given IC as CPI=f(IC). We derive the theoretical worst case IC for a program, denoted as SWIC, using integer linear programming(ILP) and estimate WCET as SWIC*f(SWIC). However, if CPI decreases sharply with IC then measured maximum cycles is observed to be a better estimate. For certain other benchmarks, it is observed that the CPI versus IC relationship is either random or CPI remains constant with varying IC. In such cases, WCET is estimated as the product of SWIC and measured maximum CPI. It is observed that use of the proposed method results in tighter WCET estimates than Chronos, a static WCET analyzer, for most benchmarks for the two architectures considered in this paper.