580 resultados para Benchmarks


Relevância:

10.00% 10.00%

Publicador:

Resumo:

Pervasive use of pointers in large-scale real-world applications continues to make points-to analysis an important optimization-enabler. Rapid growth of software systems demands a scalable pointer analysis algorithm. A typical inclusion-based points-to analysis iteratively evaluates constraints and computes a points-to solution until a fixpoint. In each iteration, (i) points-to information is propagated across directed edges in a constraint graph G and (ii) more edges are added by processing the points-to constraints. We observe that prioritizing the order in which the information is processed within each of the above two steps can lead to efficient execution of the points-to analysis. While earlier work in the literature focuses only on the propagation order, we argue that the other dimension, that is, prioritizing the constraint processing, can lead to even higher improvements on how fast the fixpoint of the points-to algorithm is reached. This becomes especially important as we prove that finding an optimal sequence for processing the points-to constraints is NP-Complete. The prioritization scheme proposed in this paper is general enough to be applied to any of the existing points-to analyses. Using the prioritization framework developed in this paper, we implement prioritized versions of Andersen's analysis, Deep Propagation, Hardekopf and Lin's Lazy Cycle Detection and Bloom Filter based points-to analysis. In each case, we report significant improvements in the analysis times (33%, 47%, 44%, 20% respectively) as well as the memory requirements for a large suite of programs, including SPEC 2000 benchmarks and five large open source programs.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The effectiveness of the last-level shared cache is crucial to the performance of a multi-core system. In this paper, we observe and make use of the DelinquentPC - Next-Use characteristic to improve shared cache performance. We propose a new PC-centric cache organization, NUcache, for the shared last level cache of multi-cores. NUcache logically partitions the associative ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, selected by a PC selection mechanism, are allowed to enter the DeliWays. The PC selection mechanism is an intelligent cost-benefit analysis based algorithm that utilizes Next-Use information to select the set of PCs that can maximize the hits experienced in DeliWays. Performance evaluation reveals that NUcache improves the performance over a baseline design by 9.6%, 30% and 33% respectively for dual, quad and eight core workloads comprised of SPEC benchmarks. We also show that NUcache is more effective than other well-known cache-partitioning algorithms.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Data Prefetchers identify and make use of any regularity present in the history/training stream to predict future references and prefetch them into the cache. The training information used is typically the primary misses seen at a particular cache level, which is a filtered version of the accesses seen by the cache. In this work we demonstrate that extending the training information to include secondary misses and hits along with primary misses helps improve the performance of prefetchers. In addition to empirical evaluation, we use the information theoretic metric entropy, to quantify the regularity present in extended histories. Entropy measurements indicate that extended histories are more regular than the default primary miss only training stream. Entropy measurements also help corroborate our empirical findings. With extended histories, further benefits can be achieved by triggering prefetches during secondary misses also. In this paper we explore the design space of extended prefetch histories and alternative prefetch trigger points for delta correlation prefetchers. We observe that different prefetch schemes benefit to a different extent with extended histories and alternative trigger points. Also the best performing design point varies on a per-benchmark basis. To meet these requirements, we propose a simple adaptive scheme that identifies the best performing design point for a benchmark-prefetcher combination at runtime. In SPEC2000 benchmarks, using all the L2 accesses as history for prefetcher improves the performance in terms of both IPC and misses reduced over techniques that use only primary misses as history. The adaptive scheme improves the performance of CZone prefetcher over Baseline by 4.6% on an average. These performance gains are accompanied by a moderate reduction in the memory traffic requirements.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

High-level loop transformations are a key instrument in mapping computational kernels to effectively exploit the resources in modern processor architectures. Nevertheless, selecting required compositions of loop transformations to achieve this remains a significantly challenging task; current compilers may be off by orders of magnitude in performance compared to hand-optimized programs. To address this fundamental challenge, we first present a convex characterization of all distinct, semantics-preserving, multidimensional affine transformations. We then bring together algebraic, algorithmic, and performance analysis results to design a tractable optimization algorithm over this highly expressive space. Our framework has been implemented and validated experimentally on a representative set of benchmarks running on state-of-the-art multi-core platforms.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

GPUs have been used for parallel execution of DOALL loops. However, loops with indirect array references can potentially cause cross iteration dependences which are hard to detect using existing compilation techniques. Applications with such loops cannot easily use the GPU and hence do not benefit from the tremendous compute capabilities of GPUs. In this paper, we present an algorithm to compute at runtime the cross iteration dependences in such loops. The algorithm uses both the CPU and the GPU to compute the dependences. Specifically, it effectively uses the compute capabilities of the GPU to quickly collect the memory accesses performed by the iterations by executing the slice functions generated for the indirect array accesses. Using the dependence information, the loop iterations are levelized such that each level contains independent iterations which can be executed in parallel. Another interesting aspect of the proposed solution is that it pipelines the dependence computation of the future level with the actual computation of the current level to effectively utilize the resources available in the GPU. We use NVIDIA Tesla C2070 to evaluate our implementation using benchmarks from Polybench suite and some synthetic benchmarks. Our experiments show that the proposed technique can achieve an average speedup of 6.4x on loops with a reasonable number of cross iteration dependences.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite that we used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite we find concurrent execution is often no better than serialized execution. We identify that the lack of control over resource allocation to kernels is a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage. We then propose several elastic-kernel aware concurrency policies that offer significantly better performance and concurrency compared to the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil 2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads when compared to the current CUDA concurrency implementation.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The ability to perform strong updates is the main contributor to the precision of flow-sensitive pointer analysis algorithms. Traditional flow-sensitive pointer analyses cannot strongly update pointers residing in the heap. This is a severe restriction for Java programs. In this paper, we propose a new flow-sensitive pointer analysis algorithm for Java that can perform strong updates on heap-based pointers effectively. Instead of points-to graphs, we represent our points-to information as maps from access paths to sets of abstract objects. We have implemented our analysis and run it on several large Java benchmarks. The results show considerable improvement in precision over the points-to graph based flow-insensitive and flow-sensitive analyses, with reasonable running time.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Software transactional memory(STM) is a promising programming paradigm for shared memory multithreaded programs. While STM offers the promise of being less error-prone and more programmer friendly compared to traditional lock-based synchronization, it also needs to be competitive in performance in order for it to be adopted in mainstream software. A major source of performance overheads in STM is transactional aborts. Conflict resolution and aborting a transaction typically happens at the transaction level which has the advantage that it is automatic and application agnostic. However it has a substantial disadvantage in that STM declares the entire transaction as conflicting and hence aborts it and re-executes it fully, instead of partially re-executing only those part(s) of the transaction, which have been affected due to the conflict. This "Re-execute Everything" approach has a significant adverse impact on STM performance. In order to mitigate the abort overheads, we propose a compiler aided Selective Reconciliation STM (SR-STM) scheme, wherein certain transactional conflicts can be reconciled by performing partial re-execution of the transaction. Ours is a selective hybrid approach which uses compiler analysis to identify those data accesses which are legal and profitable candidates for reconciliation and applies partial re-execution only to these candidates selectively while other conflicting data accesses are handled by the default STM approach of abort and full re-execution. We describe the compiler analysis and code transformations required for supporting selective reconciliation. We find that SR-STM is effective in reducing the transactional abort overheads by improving the performance for a set of five STAMP benchmarks by 12.58% on an average and up to 22.34%.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Precise pointer analysis is a problem of interest to both the compiler and the program verification community. Flow-sensitivity is an important dimension of pointer analysis that affects the precision of the final result computed. Scaling flow-sensitive pointer analysis to millions of lines of code is a major challenge. Recently, staged flow-sensitive pointer analysis has been proposed, which exploits a sparse representation of program code created by staged analysis. In this paper we formulate the staged flow-sensitive pointer analysis as a graph-rewriting problem. Graph-rewriting has already been used for flow-insensitive analysis. However, formulating flow-sensitive pointer analysis as a graph-rewriting problem adds additional challenges due to the nature of flow-sensitivity. We implement our parallel algorithm using Intel Threading Building Blocks and demonstrate considerable scaling (upto 2.6x) for 8 threads on a set of 10 benchmarks. Compared to the sequential implementation of staged flow-sensitive analysis, a single threaded execution of our implementation performs better in 8 of the benchmarks.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Counter systems are a well-known and powerful modeling notation for specifying infinite-state systems. In this paper we target the problem of checking liveness properties in counter systems. We propose two semi decision techniques towards this, both of which return a formula that encodes the set of reachable states of the system that satisfy a given liveness property. A novel aspect of our techniques is that they use reachability analysis techniques, which are well studied in the literature, as black boxes, and are hence able to compute precise answers on a much wider class of systems than previous approaches for the same problem. Secondly, they compute their results by iterative expansion or contraction, and hence permit an approximate solution to be obtained at any point. We state the formal properties of our techniques, and also provide experimental results using standard benchmarks to show the usefulness of our approaches. Finally, we sketch an extension of our liveness checking approach to check general CTL properties.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Cis-peptide embedded segments are rare in proteins but often highlight their important role in molecular function when they do occur. The high evolutionary conservation of these segments illustrates this observation almost universally, although no attempt has been made to systematically use this information for the purpose of function annotation. In the present study, we demonstrate how geometric clustering and level-specific Gene Ontology molecular-function terms (also known as annotations) can be used in a statistically significant manner to identify cis-embedded segments in a protein linked to its molecular function. The present study identifies novel cis-peptide fragments, which are subsequently used for fragment-based function annotation. Annotation recall benchmarks interpreted using the receiver-operator characteristic plot returned an area-under-curve >0.9, corroborating the utility of the annotation method. In addition, we identified cis-peptide fragments occurring in conjunction with functionally important trans-peptide fragments, providing additional insights into molecular function. We further illustrate the applicability of our method in function annotation where homology-based annotation transfer is not possible. The findings of the present study add to the repertoire of function annotation approaches and also facilitate engineering, design and allied studies around the cis-peptide neighborhood of proteins.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The problem addressed in this paper is sound, scalable, demand-driven null-dereference verification for Java programs. Our approach consists conceptually of a base analysis, plus two major extensions for enhanced precision. The base analysis is a dataflow analysis wherein we propagate formulas in the backward direction from a given dereference, and compute a necessary condition at the entry of the program for the dereference to be potentially unsafe. The extensions are motivated by the presence of certain ``difficult'' constructs in real programs, e.g., virtual calls with too many candidate targets, and library method calls, which happen to need excessive analysis time to be analyzed fully. The base analysis is hence configured to skip such a difficult construct when it is encountered by dropping all information that has been tracked so far that could potentially be affected by the construct. Our extensions are essentially more precise ways to account for the effect of these constructs on information that is being tracked, without requiring full analysis of these constructs. The first extension is a novel scheme to transmit formulas along certain kinds of def-use edges, while the second extension is based on using manually constructed backward-direction summary functions of library methods. We have implemented our approach, and applied it on a set of real-life benchmarks. The base analysis is on average able to declare about 84% of dereferences in each benchmark as safe, while the two extensions push this number up to 91%. (C) 2014 Elsevier B.V. All rights reserved.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Routing is a very important step in VLSI physical design. A set of nets are routed under delay and resource constraints in multi-net global routing. In this paper a delay-driven congestion-aware global routing algorithm is developed, which is a heuristic based method to solve a multi-objective NP-hard optimization problem. The proposed delay-driven Steiner tree construction method is of O(n(2) log n) complexity, where n is the number of terminal points and it provides n-approximation solution of the critical time minimization problem for a certain class of grid graphs. The existing timing-driven method (Hu and Sapatnekar, 2002) has a complexity O(n(4)) and is implemented on nets with small number of sinks. Next we propose a FPTAS Gradient algorithm for minimizing the total overflow. This is a concurrent approach considering all the nets simultaneously contrary to the existing approaches of sequential rip-up and reroute. The algorithms are implemented on ISPD98 derived benchmarks and the drastic reduction of overflow is observed. (C) 2014 Elsevier Inc. All rights reserved.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Branch divergence is a very commonly occurring performance problem in GPGPU in which the execution of diverging branches is serialized to execute only one control flow path at a time. Existing hardware mechanism to reconverge threads using a stack causes duplicate execution of code for unstructured control flow graphs. Also the stack mechanism cannot effectively utilize the available parallelism among diverging branches. Further, the amount of nested divergence allowed is also limited by depth of the branch divergence stack. In this paper we propose a simple and elegant transformation to handle all of the above mentioned problems. The transformation converts an unstructured CFG to a structured CFG without duplicating user code. It incurs only a linear increase in the number of basic blocks and also the number of instructions. Our solution linearizes the CFG using a predicate variable. This mechanism reconverges the divergent threads as early as possible. It also reduces the depth of the reconvergence stack. The available parallelism in nested branches can be effectively extracted by scheduling the basic blocks to reduce the effect of stalls due to memory accesses. It can also increase execution efficiency of nested loops with different trip counts for different threads. We implemented the proposed transformation at PTX level using the Ocelot compiler infrastructure. We evaluated the technique using various benchmarks to show that it can be effective in handling the performance problem due to divergence in unstructured CFGs.