169 results for Parallelizing Compilers


Relevance:

10.00% 10.00%

Publisher:

Abstract:

With the emergence of multi-core processors into the mainstream, parallel programming is no longer the specialized domain it once was. There is a growing need for systems that allow programmers to reason more easily about data dependencies and inherent parallelism in general-purpose programs. Many of these programs are written in popular imperative programming languages like Java and C#. In this thesis I present a system for reasoning about side-effects of evaluation in an abstract and composable manner that is suitable for use by both programmers and automated tools such as compilers. The goal of developing such a system is both to facilitate the automatic exploitation of the inherent parallelism present in imperative programs and to allow programmers to reason about dependencies which may be limiting the parallelism available for exploitation in their applications. Previous work on languages and type systems for parallel computing has tended to focus on providing the programmer with tools to facilitate the manual parallelization of programs; programmers must decide when and where it is safe to employ parallelism without the assistance of the compiler or other automated tools. None of the existing systems combine abstraction and composition with parallelization and correctness checking to produce a framework which helps both programmers and automated tools to reason about inherent parallelism. In this work I present a system for abstractly reasoning about side-effects and data dependencies in modern, imperative, object-oriented languages using a type and effect system based on ideas from Ownership Types. I have developed sufficient conditions for the safe, automated detection and exploitation of a number of task, data and loop parallelism patterns in terms of ownership relationships. To validate my work, I have applied my ideas to the C# version 3.0 language to produce a language extension called Zal.
I have implemented a compiler for the Zal language as an extension of the GPC# research compiler as a proof of concept of my system. I have used it to parallelize a number of real-world applications to demonstrate the feasibility of my proposed approach. In addition to this empirical validation, I present an argument for the correctness of the type system and language semantics I have proposed, as well as sketches of proofs for the correctness of the sufficient conditions for parallelization proposed.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Bana et al. proposed the relation of formal indistinguishability (FIR), i.e. an equivalence between two terms built from an abstract algebra. Later, Ene et al. extended it to cover active adversaries and random oracles. This notion enables a framework to verify computational indistinguishability while still offering the simplicity and formality of symbolic methods. We are in the process of building an automated tool for checking FIR between two terms. First, we extend the work by Ene et al. further, by covering ordered sorts and simplifying the way to cope with random oracles. Second, we investigate the possibility of combining algebras, since this makes the tool scalable and able to cover a wide class of cryptographic schemes. Specifically, we show that the combined algebra is still computationally sound, as long as each algebra is sound. Third, we design some proving strategies and implement the tool. Basically, the strategies allow us to find a sequence of intermediate terms, which are formally indistinguishable, between two given terms. FIR between the two given terms is then guaranteed by the transitivity of FIR. Finally, we show applications of the work, e.g. to key exchanges and encryption schemes. In the future, the tool should be easy to extend to cover many schemes. This work continues our previous research on the use of compilers to aid in automated proofs for key exchange.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

This year marks the completion of data collection for year three (Wave 3) of the CAUSEE study. This report uses data from the first three years and focuses on the process of learning and adaptation in the business creation process. Most start-ups need to change their business model, their product, their marketing plan, their market or something else about the business to be successful. PayPal changed their product at least five times, moving from handheld security, to enterprise apps, to consumer apps, to a digital wallet, to payments between handhelds before finally stumbling on the model that made them a multi-billion dollar company revolving around email-based payments. PayPal is not alone, and anecdotes abound of start-ups changing direction: Symantec started as an artificial intelligence company, Apple started selling plans to build computers and Microsoft tried to peddle compilers before licensing an operating system out of New Mexico. To what extent do Australian new ventures change and adapt as their ideas and business develop? As a longitudinal study, CAUSEE was designed specifically to observe development in the venture creation process. In this research briefing paper, we compare development over time of randomly sampled Nascent Firms (NF) and Young Firms (YF), concentrating on the surviving cases. We also compare NFs with YFs at each yearly interval. The 'high potential' oversample is not used in this report.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

The topic of this study is the most renowned anthology of essays written in Literary Chinese, Guwen guanzhi, compiled and edited by Wu Chengquan (Chucai) and Wu Dazhi (Diaohou), and first published during the Qing dynasty, in 1695. Because of the low social standing of the compilers, their anthology remained outside the recommended study materials produced by members of the established literati and used for preparing students in the imperial civil-service examinations. However, since the end of the imperial era, Guwen guanzhi has risen to a position as the classical anthology par excellence. Today it is widely used as required or supplementary reading material of Literary Chinese in middle schools both in Mainland China and on Taiwan. The goal of this study is to explain the longevity of the anthology. So far, Guwen guanzhi has not been a topic of any published academic study, and the opinions expressed on it in various sources are widely discrepant. Through a comparative study with a dozen classical Chinese anthologies in use during the early Qing dynasty, this study reveals the extent to which the compilers of Guwen guanzhi modelled their work after other selections. Altogether 86% of the texts in Guwen guanzhi originate from another Qing era anthology, Guwen xiyi, often copied character by character. However, the notes and commentaries are all different. Concentrating on the characteristics unique to Guwen guanzhi (the commentaries and certain peculiarities in the selection of texts), this study then discusses the possible reasons for the popularity of Guwen guanzhi over the competing readers during the Qing era. Most remarkably, Guwen guanzhi put into practice the egalitarian, educational ideals of the Ming philosopher Wang Shouren (Yangming). Thus Guwen guanzhi suited the self-enlightenment needs of the “subordinate classes”, in particular the rising middle class comprised mainly of merchants.
The lack of moral teleology, together with the compact size, relative comprehensiveness of the selection and good notes and comments, has made Guwen guanzhi well suited for the new society since the abolition of the imperial examination system. Through a content analysis, based on a sample of the texts, this study measures the relative emphasis on centralism and localism (both in concrete and spiritual terms) expressed in the texts of Guwen guanzhi. The analysis shows that the texts manifest some bias towards emphasising innate virtue at the expense of state-defined morality. This may reflect a hidden critique of intellectual oppression by the centralised imperial rule. During the early decades of the Qing era, such critique was often linked to Ming-loyalism. Finally, this study concludes that the kind of “spiritual localism” that Guwen guanzhi manifests gives it the potential to undermine monolithic orthodoxy even in today’s Chinese societies. This study has progressed hand in hand with the translation of a selection of texts from Guwen guanzhi into Finnish, published by Gaudeamus Helsinki University Press: Jadekasvot – Valittuja tarinoita Kiinan muinaisajoilta (2005), Jadelähde – Valittuja kirjoituksia Kiinan keskiajalta (2007) and Jadepeili – Valittuja kirjoituksia keisarillisen Kiinan kulta-ajoilta (2008). All translations are critical editions, complete with extensive notation. The trilogy is the first comprehensive translation based on Guwen guanzhi in a European language.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Parallel programming and effective partitioning of applications for embedded many-core architectures require optimization algorithms. However, these algorithms have to quickly evaluate thousands of different partitions. We present a fast performance estimator embedded in a parallelizing compiler for streaming applications. The estimator combines a single execution-based simulation and an analytic approach. Experimental results demonstrate that the estimator has a mean error of 2.6% and computes its estimation 2848 times faster than a cycle-accurate simulator.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

In recent years, parallel computers have been attracting attention for simulating artificial neural networks (ANN). This is due to the inherent parallelism in ANN. This work is aimed at studying ways of parallelizing adaptive resonance theory (ART), a popular neural network algorithm. The core computations of ART are separated and different strategies of parallelizing ART are discussed. We present mapping strategies for the ART2-A neural network onto ring and mesh architectures. The required parallel architecture is simulated using a parallel architectural simulator, PROTEUS, and parallel programs are written using a superset of C for the algorithms presented. A simulation-based scalability study of the algorithm-architecture match is carried out. The various overheads are identified in order to suggest ways of improving the performance. Our main objective is to determine the performance of the ART2-A network on different parallel architectures. (C) 1999 Elsevier Science B.V. All rights reserved.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Just-in-Time (JIT) compilers for Java can be augmented by making use of runtime profile information to produce better quality code and hence achieve higher performance. In a JIT compilation environment, the profile information obtained can be readily exploited in the same run to aid recompilation and optimization of frequently executed (hot) methods. This paper discusses a low-overhead path profiling scheme for dynamically profiling JIT-produced native code. The profile information is used in recompilation during a subsequent invocation of the hot method. During recompilation, tree regions along the hot paths are enlarged and instruction scheduling at the superblock level is performed. We have used the open source LaTTe JIT compiler framework for our implementation. Our results on a SPARC platform for SPEC JVM98 benchmarks indicate that (i) there is a significant reduction in the number of tree regions along the hot paths, and (ii) profile-aided recompilation in LaTTe achieves performance comparable to that of adaptive LaTTe in spite of retranslation and profiling overheads.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

As the gap between processor and memory continues to grow, memory performance becomes a key performance bottleneck for many applications. Compilers therefore increasingly seek to modify an application’s data layout to improve cache locality and cache reuse. Whole Program Structure Layout (WPSL) transformations can significantly increase the spatial locality of data and reduce the runtime of programs that use link-based data structures, by increasing the cache line utilization. However, in production compilers WPSL transformations do not realize their entire performance potential, due to a number of factors. Structure layout decisions made on the basis of whole-program aggregated affinity/hotness of structure fields can be suboptimal for local code regions. WPSL is also restricted in applicability in production compilers for type-unsafe languages like C/C++ due to the extensive legality checks and field-sensitive pointer analysis required over the entire application. In order to overcome the issues associated with WPSL, we propose the Region Based Structure Layout (RBSL) optimization framework, using selective data copying. We describe our RBSL framework, implemented in the production compiler for C/C++ on HP-UX IA-64. We show that, acting in complement to the existing and mature WPSL transformation framework in our compiler, RBSL improves application performance in pointer-intensive SPEC benchmarks by 3% to 28% over WPSL.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

In this paper we explore an implementation of a high-throughput, streaming application on REDEFINE-v2, which is an enhancement of REDEFINE. REDEFINE is a polymorphic ASIC combining the flexibility of a programmable solution with the execution speed of an ASIC. In REDEFINE, Compute Elements are arranged in an 8x8 grid connected via a Network on Chip (NoC) called RECONNECT, to realize the various macrofunctional blocks of an equivalent ASIC. For a 1024-FFT we carry out an application-architecture design space exploration by examining the various characterizations of Compute Elements in terms of the size of the instruction store. We further study the impact of using application-specific, vectorized FUs. By setting up different partitions of the FFT algorithm for persistent execution on REDEFINE-v2, we derive the benefits of setting up pipelined execution for higher performance. The impact of the REDEFINE-v2 micro-architecture for any arbitrary N-point FFT (N > 4096) is also analyzed. We report the various algorithm-architecture tradeoffs in terms of area and execution speed in comparison with an ASIC implementation. In addition, we compare the performance gain with respect to a GPP.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Knowledge about program worst case execution time (WCET) is essential in validating real-time systems and helps in effective scheduling. One popular approach used in industry is to measure the execution time of program components on the target architecture and combine them using static analysis of the program. Measurements need to be taken in the least intrusive way in order to avoid affecting the accuracy of the estimated WCET. Several programs exhibit phase behavior, wherein program dynamic execution is observed to be composed of phases. Each phase, being distinct from the others, exhibits homogeneous behavior with respect to cycles per instruction (CPI), data cache misses, etc. In this paper, we show that phase behavior has important implications for timing analysis. We make use of the homogeneity of a phase to reduce instrumentation overhead while ensuring that the accuracy of the WCET is not largely affected. We propose a model for estimating WCET using static worst case instruction counts of individual phases and a function of measured average CPI. We describe a WCET analyzer built on this model which targets two different architectures. The WCET analyzer is observed to give safe estimates for most benchmarks considered in this paper. The tightness of the WCET estimates is observed to be improved for most benchmarks compared to Chronos, a well-known static WCET analyzer.
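The phase-based model in this abstract can be illustrated with a minimal Python sketch. The abstract only states that the estimate combines static worst case instruction counts per phase with a function of measured average CPI; the `Phase` type, the function names and the fixed 20% safety margin below are hypothetical choices for illustration, not the authors' actual analyzer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Phase:
    """One program phase (hypothetical container, not from the paper)."""
    name: str
    worst_case_instructions: int   # static worst case instruction count
    measured_cpi: Sequence[float]  # CPI samples measured on the target


def average(samples: Sequence[float]) -> float:
    """Arithmetic mean of the measured CPI samples."""
    return sum(samples) / len(samples)


def with_margin(avg_cpi: float, margin: float = 1.2) -> float:
    """Assumed CPI function: inflate the average by a fixed safety margin."""
    return margin * avg_cpi


def estimate_wcet_cycles(
    phases: Sequence[Phase],
    cpi_function: Callable[[float], float] = with_margin,
) -> float:
    """Sum over phases of (worst case instructions) * f(average CPI)."""
    return sum(
        p.worst_case_instructions * cpi_function(average(p.measured_cpi))
        for p in phases
    )


if __name__ == "__main__":
    phases = [
        Phase("init", 1000, [1.0, 2.0]),
        Phase("loop", 2000, [2.0, 2.0]),
    ]
    print(estimate_wcet_cycles(phases))
```

Because each phase is assumed to be homogeneous, a few CPI samples per phase stand in for instrumenting every component, which is the overhead reduction the abstract describes. Whether an estimate is safe depends entirely on the choice of CPI function, which this sketch does not attempt to justify.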

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Data prefetchers identify and make use of any regularity present in the history/training stream to predict future references and prefetch them into the cache. The training information used is typically the primary misses seen at a particular cache level, which is a filtered version of the accesses seen by the cache. In this work we demonstrate that extending the training information to include secondary misses and hits along with primary misses helps improve the performance of prefetchers. In addition to empirical evaluation, we use the information-theoretic metric of entropy to quantify the regularity present in extended histories. Entropy measurements indicate that extended histories are more regular than the default primary-miss-only training stream. Entropy measurements also help corroborate our empirical findings. With extended histories, further benefits can be achieved by also triggering prefetches during secondary misses. In this paper we explore the design space of extended prefetch histories and alternative prefetch trigger points for delta correlation prefetchers. We observe that different prefetch schemes benefit to different extents from extended histories and alternative trigger points. Also, the best performing design point varies on a per-benchmark basis. To meet these requirements, we propose a simple adaptive scheme that identifies the best performing design point for a benchmark-prefetcher combination at runtime. In SPEC2000 benchmarks, using all the L2 accesses as history for the prefetcher improves the performance in terms of both IPC and misses reduced over techniques that use only primary misses as history. The adaptive scheme improves the performance of the CZone prefetcher over Baseline by 4.6% on average. These performance gains are accompanied by a moderate reduction in the memory traffic requirements.
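As a rough illustration of using entropy to quantify regularity, the Python sketch below computes the Shannon entropy of the address deltas in a training stream. The abstract does not say exactly how the authors compute entropy, so treating each delta as a symbol and using its empirical frequency is an assumption, and the address streams are toy values rather than benchmark data.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def deltas(addresses: Sequence[int]) -> List[int]:
    """Differences between consecutive addresses in a training stream."""
    return [b - a for a, b in zip(addresses, addresses[1:])]


def shannon_entropy(symbols: Sequence[Hashable]) -> float:
    """Shannon entropy in bits of the empirical symbol distribution."""
    total = len(symbols)
    if total == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Toy streams (made up for illustration): a constant-stride stream
    # and a stream whose deltas are all different.
    strided = [0, 64, 128, 192, 256]
    scattered = [0, 64, 192, 448, 960]
    print(shannon_entropy(deltas(strided)))
    print(shannon_entropy(deltas(scattered)))
```

A lower value means fewer distinct deltas dominate the stream, so a delta correlation prefetcher has a more predictable pattern to learn from.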

Relevance:

10.00% 10.00%

Publisher:

Abstract:

High-level loop transformations are a key instrument in mapping computational kernels to effectively exploit the resources in modern processor architectures. Nevertheless, selecting required compositions of loop transformations to achieve this remains a significantly challenging task; current compilers may be off by orders of magnitude in performance compared to hand-optimized programs. To address this fundamental challenge, we first present a convex characterization of all distinct, semantics-preserving, multidimensional affine transformations. We then bring together algebraic, algorithmic, and performance analysis results to design a tractable optimization algorithm over this highly expressive space. Our framework has been implemented and validated experimentally on a representative set of benchmarks running on state-of-the-art multi-core platforms.
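To make "semantics-preserving" concrete, here is a much simpler textbook check than the convex characterization the abstract refers to: for a loop nest with constant dependence distance vectors, a unimodular linear transformation T is legal when every transformed distance vector T·d remains lexicographically positive. The Python sketch below assumes uniform dependences and that the caller supplies a unimodular T; the function names are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[int]]
Vector = Sequence[int]


def lex_positive(v: Vector) -> bool:
    """True if the first non-zero component of v is positive."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the all-zero vector is not lexicographically positive


def transform(t: Matrix, d: Vector) -> List[int]:
    """Matrix-vector product T * d."""
    return [sum(a * b for a, b in zip(row, d)) for row in t]


def is_legal(t: Matrix, distance_vectors: Sequence[Vector]) -> bool:
    """Legal if every dependence still points forward in the new order."""
    return all(lex_positive(transform(t, d)) for d in distance_vectors)


if __name__ == "__main__":
    interchange = [[0, 1], [1, 0]]  # swap the two loops
    skew = [[1, 0], [1, 1]]         # skew the inner loop by the outer one
    deps = [(1, -1)]                # example dependence distance vector
    print(is_legal(interchange, deps))  # interchange reverses this dependence
    print(is_legal(skew, deps))         # skewing keeps it forward
```

With the distance vector (1, -1), interchange maps the dependence to (-1, 1) and is rejected, while skewing maps it to (1, 0) and is accepted. General affine dependences need the polyhedral machinery the abstract describes.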

Relevance:

10.00% 10.00%

Publisher:

Abstract:

We discuss the computational bottlenecks in molecular dynamics (MD) and describe the challenges in parallelizing the computation-intensive tasks. We present a hybrid algorithm using MPI (Message Passing Interface) with OpenMP threads for parallelizing a generalized MD computation scheme for systems with short-range interatomic interactions. The algorithm is discussed in the context of nano-indentation of chromium films with carbon indenters using the Embedded Atom Method potential for Cr-Cr interactions and the Morse potential for Cr-C interactions. We study the performance of our algorithm for a range of MPI-thread combinations and find the performance to depend strongly on the computational task and load sharing in the multi-core processor. The algorithm scaled poorly with MPI alone, and our hybrid schemes were observed to outperform the pure message passing scheme, despite utilizing the same number of processors or cores in the cluster. The speed-up achieved by our algorithm compared favorably with that achieved by standard MD packages. (C) 2013 Elsevier Inc. All rights reserved.