169 resultados para Parallelizing Compilers


Relevância:

10.00% 10.00%

Publicador:

Resumo:

Realizing scalable performance on high performance computing systems is not straightforward for single-phenomenon codes (such as computational fluid dynamics [CFD]). This task is magnified considerably when the target software involves the interactions of a range of phenomena that have distinctive solution procedures involving different discretization methods. The problems of addressing the key issues of retaining data integrity and the ordering of the calculation procedures are significant. A strategy for parallelizing this multiphysics family of codes is described for software exploiting finite-volume discretization methods on unstructured meshes using iterative solution procedures. A mesh partitioning-based SPMD approach is used. However, since different variables use distinct discretization schemes, this means that distinct partitions are required; techniques for addressing this issue are described using the mesh-partitioning tool, JOSTLE. In this contribution, the strategy is tested for a variety of test cases under a wide range of conditions (e.g., problem size, number of processors, asynchronous / synchronous communications, etc.) using a variety of strategies for mapping the mesh partition onto the processor topology.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This chapter discusses the code parallelization environment, where a number of tools that address the main tasks, such as code parallelization, debugging, and optimization are available. The parallelization tools include ParaWise and CAPO, which enable the near automatic parallelization of real world scientific application codes for shared and distributed memory-based parallel systems. The chapter discusses the use of ParaWise and CAPO to transform the original serial code into an equivalent parallel code that contains appropriate OpenMP directives. Additionally, as user involvement can introduce errors, a relative debugging tool (P2d2) is also available and can be used to perform near automatic relative debugging of an OpenMP program that has been parallelized either using the tools or manually. In order for these tools to be effective in parallelizing a range of applications, a high quality fully inter-procedural dependence analysis, as well as user interaction is vital to the generation of efficient parallel code and in the optimization of the backtracking and speculation process used in relative debugging. Results of parallelized NASA codes are discussed and show the benefits of using the environment.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

A simulation program has been developed to calculate the power-spectral density of thin avalanche photodiodes, which are used in optical networks. The program extends the time-domain analysis of the dead-space multiplication model to compute the autocorrelation function of the APD impulse response. However, the computation requires a large amount of memory space and is very time consuming. We describe our experiences in parallelizing the code using both MPI and OpenMP. Several array partitioning schemes and scheduling policies are implemented and tested Our results show that the OpenMP code is scalable up to 64 processors on an SGI Origin 2000 machine and has small average errors.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

An important factor for high-speed optical communication is the availability of ultrafast and low-noise photodetectors. Among the semiconductor photodetectors that are commonly used in today’s long-haul and metro-area fiber-optic systems, avalanche photodiodes (APDs) are often preferred over p-i-n photodiodes due to their internal gain, which significantly improves the receiver sensitivity and alleviates the need for optical pre-amplification. Unfortunately, the random nature of the very process of carrier impact ionization, which generates the gain, is inherently noisy and results in fluctuations not only in the gain but also in the time response. Recently, a theory characterizing the autocorrelation function of APDs has been developed by us which incorporates the dead-space effect, an effect that is very significant in thin, high-performance APDs. The research extends the time-domain analysis of the dead-space multiplication model to compute the autocorrelation function of the APD impulse response. However, the computation requires a large amount of memory space and is very time consuming. In this research, we describe our experiences in parallelizing the code in MPI and OpenMP using CAPTools. Several array partitioning schemes and scheduling policies are implemented and tested. Our results show that the code is scalable up to 64 processors on a SGI Origin 2000 machine and has small average errors.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Code parallelization using OpenMP for shared memory systems is relatively easier than using message passing for distributed memory systems. Despite this, it is still a challenge to use OpenMP to parallelize application codes in a way that yields effective scalable performance when executed on a shared memory parallel system. We describe an environment that will assist the programmer in the various tasks of code parallelization and this is achieved in a greatly reduced time frame and level of skill required. The parallelization environment includes a number of tools that address the main tasks of parallelism detection, OpenMP source code generation, debugging and optimization. These tools include a high quality, fully interprocedural dependence analysis with user interaction capabilities to facilitate the generation of efficient parallel code, an automatic relative debugging tool to identify erroneous user decisions in that interaction and also performance profiling to identify bottlenecks. Finally, experiences of parallelizing some NASA application codes are presented to illustrate some of the benefits of using the evolving environment.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

A FORTRAN 90 program is presented which calculates the total cross sections, and the electron energy spectra of the singly and doubly differential cross sections for the single target ionization of neutral atoms ranging from hydrogen up to and including argon. The code is applicable for the case of both high and low Z projectile impact in fast ion-atom collisions. The theoretical models provided for the program user are based on two quantum mechanical approximations which have proved to be very successful in the study of ionization in ion-atom collisions. These are the continuum-distorted-wave (CDW) and continuum-distorted-wave eikonal-initial-state (CDW-EIS) approximations. The codes presented here extend previously published. codes for single ionization of. target hydrogen [Crothers and McCartney, Comput. Phys. Commun. 72 (1992) 288], target helium [Nesbitt, O'Rourke and Crothers, Comput. Phys. Commun. 114 (1998) 385] and target atoms ranging from lithium to neon [O'Rourke, McSherry and Crothers, Comput. Phys. Commun. 131 (2000) 129]. Cross sections for all of these target atoms may be obtained as limiting cases from the present code. Title of program: ARGON Catalogue identifier: ADSE Program summary URL: http://cpc.cs.qub.ac.uk/cpc/summaries/ADSE Program obtainable from: CPC Program Library Queen's University of Belfast, N. Ireland Licensing provisions: none Computer for which the program is designed and others on which it is operable: Computers: Four by 200 MHz Pro Pentium Linux server, DEC Alpha 21164; Four by 400 MHz Pentium 2 Xeon 450 Linux server, IBM SP2 and SUN Enterprise 3500 Installations: Queen's University, Belfast Operating systems under which the program has been tested: Red-hat Linux 5.2, Digital UNIX Version 4.0d, AIX, Solaris SunOS 5.7 Compilers: PGI workstations, DEC CAMPUS Programming language used: FORTRAN 90 with MPI directives No. of bits in a word: 64, except on Linux servers 32 Number of processors used: any number Has the code been vectorized or parallelized? Parallelized using MPI No. of bytes in distributed program, including test data, etc.: 32 189 Distribution format: tar gzip file Keywords: Single ionization, cross sections, continuum-distorted-wave model, continuum- distorted-wave eikonal-initial-state model, target atoms, wave treatment Nature of physical problem: The code calculates total, and differential cross sections for the single ionization of target atoms ranging from hydrogen up to and including argon by both light and heavy ion impact. Method of solution: ARGON allows the user to calculate the cross sections using either the CDW or CDW-EIS [J. Phys. B 16 (1983) 3229] models within the wave treatment. Restrictions on the complexity of the program: Both the CDW and CDW-EIS models are two-state perturbative approximations. Typical running time: Times vary according to input data and number of processors. For one processor the test input data for double differential cross sections (40 points) took less than one second, whereas the test input for total cross sections (20 points) took 32 minutes. Unusual features of the program: none (C) 2003 Elsevier B.V All rights reserved.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Traditional static analysis fails to auto-parallelize programs with a complex control and data flow. Furthermore, thread-level parallelism in such programs is often restricted to pipeline parallelism, which can be hard to discover by a programmer. In this paper we propose a tool that, based on profiling information, helps the programmer to discover parallelism. The programmer hand-picks the code transformations from among the proposed candidates which are then applied by automatic code transformation techniques.

This paper contributes to the literature by presenting a profiling tool for discovering thread-level parallelism. We track dependencies at the whole-data structure level rather than at the element level or byte level in order to limit the profiling overhead. We perform a thorough analysis of the needs and costs of this technique. Furthermore, we present and validate the belief that programs with complex control and data flow contain significant amounts of exploitable coarse-grain pipeline parallelism in the program’s outer loops. This observation validates our approach to whole-data structure dependencies. As state-of-the-art compilers focus on loops iterating over data structure members, this observation also explains why our approach finds coarse-grain pipeline parallelism in cases that have remained out of reach for state-of-the-art compilers. In cases where traditional compilation techniques do find parallelism, our approach allows to discover higher degrees of parallelism, allowing a 40% speedup over traditional compilation techniques. Moreover, we demonstrate real speedups on multiple hardware platforms.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The Cell Broadband Engine (BE) Architecture is a new heterogeneous multi-core architecture targeted at compute-intensive workloads. The architecture of the Cell BE has several features that are unique in high-performance general-purpose processors, most notably the extensive support for vectorization, scratch pad memories and explicit programming of direct memory accesses (DMAs) and mailbox communication. While these features strongly increase programming complexity, it is generally claimed that significant speedups can be obtained by using Cell BE processors. This paper presents our experiences with using the Cell BE architecture to accelerate Clustal W, a bio-informatics program for multiple sequence alignment. We report on how we apply the unique features of the Cell BE to Clustal W and how important each is in obtaining high performance. By making extensive use of vectorization and by parallelizing the application across all cores, we demonstrate a speedup of 24.4 times when using 16 synergistic processor units on a QS21 Cell Blade compared to single-thread execution on the power processing unit. As the Cell BE exploits a large number of slim cores, our highly optimized implementation is just 3.8 times faster than a 3-thread version running on an Intel Core2 Duo, as the latter processor exploits a small number of fat cores.

Relevância:

10.00% 10.00%

Publicador:

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Data flow techniques have been around since the early '70s when they were used in compilers for sequential languages. Shortly after their introduction they were also consideredas a possible model for parallel computing, although the impact here was limited. Recently, however, data flow has been identified as a candidate for efficient implementation of various programming models on multi-core architectures. In most cases, however, the burden of determining data flow "macro" instructions is left to the programmer, while the compiler/run time system manages only the efficient scheduling of these instructions. We discuss a structured parallel programming approach supporting automatic compilation of programs to macro data flow and we show experimental results demonstrating the feasibility of the approach and the efficiency of the resulting "object" code on different classes of state-of-the-art multi-core architectures. The experimental results use different base mechanisms to implement the macro data flow run time support, from plain pthreads with condition variables to more modern and effective lock- and fence-free parallel frameworks. Experimental results comparing efficiency of the proposed approach with those achieved using other, more classical, parallel frameworks are also presented. © 2012 IEEE.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

On multiprocessors with explicitly managed memory hierarchies (EMM), software has the responsibility of moving data in and out of fast local memories. This task can be complex and error-prone even for expert programmers. Before we can allow compilers to handle the complexity for us, we must identify the abstractions that are general enough to allow us to write applications with reasonable effort, yet speci?c enough to exploit the vast on-chip memory bandwidth of EMM multi-processors. To this end, we compare two programming models against hand-tuned codes on the STI Cell, paying attention to programmability and performance. The ?rst programming model, Sequoia, abstracts the memory hierarchy as private address spaces, each corresponding to a parallel task. The second, Cellgen, is a new framework which provides OpenMP-like semantics and the abstraction of a shared address spaces divided into private and shared data. We compare three applications programmed using these models against their hand-optimized counterparts in terms of abstractions, programming complexity, and performance.