131 resultados para parallel programs


Relevância:

100.00% 100.00%

Publicador:

Resumo:

With proliferation of chip multicores (CMPs) on desktops and embedded platforms, multi-threaded programs have become ubiquitous. Existence of multiple threads may cause resource contention, such as, in on-chip shared cache and interconnects, depending upon how they access resources. Hence, we propose a tool - Thread Contention Predictor (TCP) to help quantify the number of threads sharing data and their sharing pattern. We demonstrate its use to predict a more profitable shared, last level on-chip cache (LLC) access policy on CMPs. Our cache configuration predictor is 2.2 times faster compared to the cycle-accurate simulations. We also demonstrate its use for identifying hot data structures in a program which may cause performance degradation due to false data sharing. We fix layout of such data structures and show up-to 10% and 18% improvement in execution time and energy-delay product (EDP), respectively.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

In recent years, parallel computers have been attracting attention for simulating artificial neural networks (ANN). This is due to the inherent parallelism in ANN. This work is aimed at studying ways of parallelizing adaptive resonance theory (ART), a popular neural network algorithm. The core computations of ART are separated and different strategies of parallelizing ART are discussed. We present mapping strategies for ART 2-A neural network onto ring and mesh architectures. The required parallel architecture is simulated using a parallel architectural simulator, PROTEUS and parallel programs are written using a superset of C for the algorithms presented. A simulation-based scalability study of the algorithm-architecture match is carried out. The various overheads are identified in order to suggest ways of improving the performance. Our main objective is to find out the performance of the ART2-A network on different parallel architectures. (C) 1999 Elsevier Science B.V. All rights reserved.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

We introduce ToleRace, a runtime system that allows programs to detect and even tolerate asymmetric data races. Asymmetric races are race conditions where one thread correctly acquires and releases a lock for a shared variable while another thread improperly accesses the same variable. ToleRace provides approximate isolation in the critical sections of lock-based parallel programs by creating a local copy of each shared variable when entering a critical section, operating on the local copies, and propagating the appropriate copies upon leaving the critical section. We start by characterizing all possible interleavings that can cause races and precisely describe the effect of ToleRace in each case. Then, we study the theoretical aspects of an oracle that knows exactly what type of interleaving has occurred. Finally, we present software implementations of ToleRace and evaluate them on multithreaded applications from the SPLASH2 and PARSEC suites.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Rapid advancements in multi-core processor architectures coupled with low-cost, low-latency, high-bandwidth interconnects have made clusters of multi-core machines a common computing resource. Unfortunately, writing good parallel programs that efficiently utilize all the resources in such a cluster is still a major challenge. Various programming languages have been proposed as a solution to this problem, but are yet to be adopted widely to run performance-critical code mainly due to the relatively immature software framework and the effort involved in re-writing existing code in the new language. In this paper, we motivate and describe our initial study in exploring CUDA as a programming language for a cluster of multi-cores. We develop CUDA-For-Clusters (CFC), a framework that transparently orchestrates execution of CUDA kernels on a cluster of multi-core machines. The well-structured nature of a CUDA kernel, the growing popularity, support and stability of the CUDA software stack collectively make CUDA a good candidate to be considered as a programming language for a cluster. CFC uses a mixture of source-to-source compiler transformations, a work distribution runtime and a light-weight software distributed shared memory to manage parallel executions. Initial results on running several standard CUDA benchmark programs achieve impressive speedups of up to 7.5X on a cluster with 8 nodes, thereby opening up an interesting direction of research for further investigation.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy-even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Task-parallel languages are increasingly popular. Many of them provide expressive mechanisms for intertask synchronization. For example, OpenMP 4.0 will integrate data-driven execution semantics derived from the StarSs research language. Compared to the more restrictive data-parallel and fork-join concurrency models, the advanced features being introduced into task-parallelmodels in turn enable improved scalability through load balancing, memory latency hiding, mitigation of the pressure on memory bandwidth, and, as a side effect, reduced power consumption. In this article, we develop a systematic approach to compile loop nests into concurrent, dynamically constructed graphs of dependent tasks. We propose a simple and effective heuristic that selects the most profitable parallelization idiom for every dependence type and communication pattern. This heuristic enables the extraction of interband parallelism (cross-barrier parallelism) in a number of numerical computations that range from linear algebra to structured grids and image processing. The proposed static analysis and code generation alleviates the burden of a full-blown dependence resolver to track the readiness of tasks at runtime. We evaluate our approach and algorithms in the PPCG compiler, targeting OpenStream, a representative dataflow task-parallel language with explicit intertask dependences and a lightweight runtime. Experimental results demonstrate the effectiveness of the approach.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The unsteady incompressible viscous fluid flow between two parallel infinite disks which are located at a distance h(t*) at time t* has been studied. The upper disk moves towards the lower disk with velocity h'(t*). The lower disk is porous and rotates with angular velocity Omega(t*). A magnetic field B(t*) is applied perpendicular to the two disks. It has been found that the governing Navier-Stokes equations reduce to a set of ordinary differential equations if h(t*), a(t*) and B(t*) vary with time t* in a particular manner, i.e. h(t*) = H(1 - alpha t*)(1/2), Omega(t*) = Omega(0)(1 - alpha t*)(-1), B(t*) = B-0(1 - alpha t*)(-1/2). These ordinary differential equations have been solved numerically using a shooting method. For small Reynolds numbers, analytical solutions have been obtained using a regular perturbation technique. The effects of squeeze Reynolds numbers, Hartmann number and rotation of the disk on the flow pattern, normal force or load and torque have been studied in detail

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The elastodynamic response of a pair of parallel rigid strips embedded in an infinite orthotropic medium due to elastic waves incident normally on the strips has been investigated. The mixed boundary value problem has been solved by the Integral Equation method. The normal stress and the vertical displacement have been derived in closed form. Numerical values of stress intensity factors at inner and outer edges of the strips and vertical displacement at points in the plane of the strips for several orthotropic materials have been calculated and plotted graphically to show the effect of material orthotropy.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We consider the problem of deciding whether the output of a boolean circuit is determined by a partial assignment to its inputs. This problem is easily shown to be hard, i.e., co-Image Image -complete. However, many of the consequences of a partial input assignment may be determined in linear time, by iterating the following step: if we know the values of some inputs to a gate, we can deduce the values of some outputs of that gate. This process of iteratively deducing some of the consequences of a partial assignment is called propagation. This paper explores the parallel complexity of propagation, i.e., the complexity of determining whether the output of a given boolean circuit is determined by propagating a given partial input assignment. We give a complete classification of the problem into those cases that are Image -complete and those that are unlikely to be Image complete.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The paper presents two new algorithms for the direct parallel solution of systems of linear equations. The algorithms employ a novel recursive doubling technique to obtain solutions to an nth-order system in n steps with no more than 2n(n −1) processors. Comparing their performance with the Gaussian elimination algorithm (GE), we show that they are almost 100% faster than the latter. This speedup is achieved by dispensing with all the computation involved in the back-substitution phase of GE. It is also shown that the new algorithms exhibit error characteristics which are superior to GE. An n(n + 1) systolic array structure is proposed for the implementation of the new algorithms. We show that complete solutions can be obtained, through these single-phase solution methods, in 5n−log2n−4 computational steps, without the need for intermediate I/O operations.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A new method of specifying the syntax of programming languages, known as hierarchical language specifications (HLS), is proposed. Efficient parallel algorithms for parsing languages generated by HLS are presented. These algorithms run on an exclusive-read exclusive-write parallel random-access machine. They require O(n) processors and O(log2n) time, where n is the length of the string to be parsed. The most important feature of these algorithms is that they do not use a stack.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper, the design and implementation of a single shared bus, shared memory multiprocessing system using Intel's single board computers is presented. The hardware configuration and the operating system developed to execute the parallel algorithms are discussed. The performance evaluation studies carried out on Image are outlined.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A new parallel algorithm for transforming an arithmetic infix expression into a par se tree is presented. The technique is based on a result due to Fischer (1980) which enables the construction of the parse tree, by appropriately scanning the vector of precedence values associated with the elements of the expression. The algorithm presented here is suitable for execution on a shared memory model of an SIMD machine with no read/write conflicts permitted. It uses O(n) processors and has a time complexity of O(log2n) where n is the expression length. Parallel algorithms for generating code for an SIMD machine are also presented.