966 resultados para scalable parallel programming
Resumo:
Over the last three decades, computer architects have been able to achieve an increase in performance for single processors by, e.g., increasing clock speed, introducing cache memories and using instruction level parallelism. However, because of power consumption and heat dissipation constraints, this trend is going to cease. In recent times, hardware engineers have instead moved to new chip architectures with multiple processor cores on a single chip. With multi-core processors, applications can complete more total work than with one core alone. To take advantage of multi-core processors, parallel programming models are proposed as promising solutions for more effectively using multi-core processors. This paper discusses some of the existent models and frameworks for parallel programming, leading to outline a draft parallel programming model for Ada.
Resumo:
This paper describes our plans to evaluate the present state of affairs concerning parallel programming and its systems. Three subprojects are proposed: a survey among programmers and scientists, a comparison of parallel programming systems using a standard set of test programs, and a wiki resource for the parallel programming community - the Parawiki. We would like to invite you to participate and turn these subprojects into true community efforts.
Resumo:
In this publication, we report on an online survey that was carried out among parallel programmers. More than 250 people worldwide have submitted answers to our questions, and their responses are analyzed here. Although not statistically sound, the data we provide give useful insights about which parallel programming systems and languages are known and in actual use. For instance, the collected data indicate that for our survey group MPI and (to a lesser extent) C are the most widely used parallel programming system and language, respectively.
Resumo:
The process of developing software that takes advantage of multiple processors is commonly referred to as parallel programming. For various reasons, this process is much harder than the sequential case. For decades, parallel programming has been a problem for a small niche only: engineers working on parallelizing mostly numerical applications in High Performance Computing. This has changed with the advent of multi-core processors in mainstream computer architectures. Parallel programming in our days becomes a problem for a much larger group of developers. The main objective of this thesis was to find ways to make parallel programming easier for them. Different aims were identified in order to reach the objective: research the state of the art of parallel programming today, improve the education of software developers about the topic, and provide programmers with powerful abstractions to make their work easier. To reach these aims, several key steps were taken. To start with, a survey was conducted among parallel programmers to find out about the state of the art. More than 250 people participated, yielding results about the parallel programming systems and languages in use, as well as about common problems with these systems. Furthermore, a study was conducted in university classes on parallel programming. It resulted in a list of frequently made mistakes that were analyzed and used to create a programmers' checklist to avoid them in the future. For programmers' education, an online resource was setup to collect experiences and knowledge in the field of parallel programming - called the Parawiki. Another key step in this direction was the creation of the Thinking Parallel weblog, where more than 50.000 readers to date have read essays on the topic. For the third aim (powerful abstractions), it was decided to concentrate on one parallel programming system: OpenMP. Its ease of use and high level of abstraction were the most important reasons for this decision. Two different research directions were pursued. The first one resulted in a parallel library called AthenaMP. It contains so-called generic components, derived from design patterns for parallel programming. These include functionality to enhance the locks provided by OpenMP, to perform operations on large amounts of data (data-parallel programming), and to enable the implementation of irregular algorithms using task pools. AthenaMP itself serves a triple role: the components are well-documented and can be used directly in programs, it enables developers to study the source code and learn from it, and it is possible for compiler writers to use it as a testing ground for their OpenMP compilers. The second research direction was targeted at changing the OpenMP specification to make the system more powerful. The main contributions here were a proposal to enable thread-cancellation and a proposal to avoid busy waiting. Both were implemented in a research compiler, shown to be useful in example applications, and proposed to the OpenMP Language Committee.
Resumo:
A novel two-step paradigm was used to investigate the parallel programming of consecutive, stimulus-elicited ('reflexive') and endogenous ('voluntary') saccades. The mean latency of voluntary saccades, made following the first reflexive saccades in two-step conditions, was significantly reduced compared to that of voluntary saccades made in the single-step control trials. The latency of the first reflexive saccades was modulated by the requirement to make a second saccade: first saccade latency increased when a second voluntary saccade was required in the opposite direction to the first saccade, and decreased when a second saccade was required in the same direction as the first reflexive saccade. A second experiment confirmed the basic effect and also showed that a second reflexive saccade may be programmed in parallel with a first voluntary saccade. The results support the view that voluntary and reflexive saccades can be programmed in parallel on a common motor map. (c) 2006 Elsevier Ltd. All rights reserved.
Resumo:
Graph analytics is an important and computationally demanding class of data analytics. It is essential to balance scalability, ease-of-use and high performance in large scale graph analytics. As such, it is necessary to hide the complexity of parallelism, data distribution and memory locality behind an abstract interface. The aim of this work is to build a scalable graph analytics framework that does not demand significant parallel programming experience based on NUMA-awareness.
The realization of such a system faces two key problems:
(i)~how to develop a scale-free parallel programming framework that scales efficiently across NUMA domains; (ii)~how to efficiently apply graph partitioning in order to create separate and largely independent work items that can be distributed among threads.
Resumo:
A raíz de la aparición de los procesadores dotados de varios “cores”, la programación paralela, un concepto que, por otra parte no era nada nuevo y se conocía desde hace décadas, sufrió un nuevo impulso, pues se creía que se podía superar el techo tecnológico que había estado limitando el rendimiento de esta programación durante años. Este impulso se ha ido manteniendo hasta la actualidad, movido por la necesidad de sistemas cada vez más potentes y gracias al abaratamiento de los costes de fabricación. Esta tendencia ha motivado la aparición de nuevo software y lenguajes con componentes orientados precisamente al campo de la programación paralela. Este es el caso del lenguaje Go, desarrollado por Google y lanzado en 2009. Este lenguaje se basa en modelos de concurrencia que lo hacen muy adecuados para abordar desarrollos de naturaleza paralela. Sin embargo, la programación paralela es un campo complejo y heterogéneo, y los programadores son reticentes a utilizar herramientas nuevas, en beneficio de aquellas que ya conocen y les son familiares. Un buen ejemplo son aquellas implementaciones de lenguajes conocidos, pero orientadas a programación paralela, y que siguen las directrices de un estándar ampliamente reconocido y aceptado. Este es el caso del estándar OpenMP, un Interfaz de Programación de Aplicaciones (API) flexible, portable y escalable, orientado a la programación paralela multiproceso en arquitecturas multi-core o multinucleo. Dicho estándar posee actualmente implementaciones en los lenguajes C, C++ y Fortran. Este proyecto nace como un intento de aunar ambos conceptos: un lenguaje emergente con interesantes posibilidades en el campo de la programación paralela, y un estándar reputado y ampliamente extendido, con el que los programadores se encuentran familiarizados. El objetivo principal es el desarrollo de un conjunto de librerías del sistema (que engloben directivas de compilación o pragmas, librerías de ejecución y variables de entorno), soportadas por las características y los modelos de concurrencia propios de Go; y que añadan funcionalidades propias del estándar OpenMP. La idea es añadir funcionalidades que permitan programar en lenguaje Go utilizando la sintaxis que OpenMP proporciona para otros lenguajes, como Fortan y C/C++ (concretamente, similar a esta última), y, de esta forma, dotar al usuario de Go de herramientas para programar estructuras de procesamiento paralelo de forma sencilla y transparente, de la misma manera que lo haría utilizando C/C++.---ABSTRACT---As a result of the appearance of processors equipped with multiple "cores ", parallel programming, a concept which, moreover, it was not new and it was known for decades, suffered a new impulse, because it was believed they could overcome the technological ceiling had been limiting the performance of this program for years. This impulse has been maintained until today, driven by the need for ever more powerful systems and thanks to the decrease in manufacturing costs. This trend has led to the emergence of new software and languages with components guided specifically to the field of parallel programming. This is the case of Go language, developed by Google and released in 2009. This language is based on concurrency models that make it well suited to tackle developments in parallel nature. However, parallel programming is a complex and heterogeneous field, and developers are reluctant to use new tools to benefit those who already know and are familiar. A good example are those implementations from well-known languages, but parallel programming oriented, and witch follow the guidelines of a standard widely recognized and accepted. This is the case of the OpenMP standard, an application programming interface (API), flexible, portable and scalable, parallel programming oriented, and designed for multi-core architectures. This standard currently has implementations in C, C ++ and Fortran. This project was born as an attempt to combine two concepts: an emerging language, with interesting possibilities in the field of parallel programming, and a reputed and widespread standard, with which programmers are familiar with. The main objective is to develop a set of system libraries (which includes compiler directives or pragmas, runtime libraries and environment variables), supported by the characteristics and concurrency patterns of Go; and that add custom features from the OpenMP standard. The idea is to add features that allow programming in Go language using the syntax OpenMP provides for other languages, like Fortran and C / C ++ (specifically, similar to the latter ), and, in this way, provide Go users with tools for programming parallel structures easily and, in the same way they would using C / C ++.
Resumo:
Despite the apparent simplicity of the OpenMP directive shared memory programming model and the sophisticated dependence analysis and code generation capabilities of the ParaWise/CAPO tools, experience shows that a level of expertise is required to produce efficient parallel code. In a real world application the investigation of a single loop in a generated parallel code can soon become an in-depth inspection of numerous dependencies in many routines. The additional understanding of dependencies is also needed to effectively interpret the information provided and supply the required feedback. The ParaWise Expert Assistant has been developed to automate this investigation and present questions to the user about, and in the context of, their application code. In this paper, we demonstrate that knowledge of dependence information and OpenMP are no longer essential to produce efficient parallel code with the Expert Assistant. It is hoped that this will enable a far wider audience to use the tools and subsequently, exploit the benefits of large parallel systems.
Resumo:
An approach to the management of non-functional concerns in massively parallel and/or distributed architectures that marries parallel programming patterns with autonomic computing is presented. The necessity and suitability of the adoption of autonomic techniques are evidenced. Issues arising in the implementation of autonomic managers taking care of multiple concerns and of coordination among hierarchies of such autonomic managers are discussed. Experimental results are presented that demonstrate the feasibility of the approach.
Resumo:
Data flow techniques have been around since the early '70s when they were used in compilers for sequential languages. Shortly after their introduction they were also consideredas a possible model for parallel computing, although the impact here was limited. Recently, however, data flow has been identified as a candidate for efficient implementation of various programming models on multi-core architectures. In most cases, however, the burden of determining data flow "macro" instructions is left to the programmer, while the compiler/run time system manages only the efficient scheduling of these instructions. We discuss a structured parallel programming approach supporting automatic compilation of programs to macro data flow and we show experimental results demonstrating the feasibility of the approach and the efficiency of the resulting "object" code on different classes of state-of-the-art multi-core architectures. The experimental results use different base mechanisms to implement the macro data flow run time support, from plain pthreads with condition variables to more modern and effective lock- and fence-free parallel frameworks. Experimental results comparing efficiency of the proposed approach with those achieved using other, more classical, parallel frameworks are also presented. © 2012 IEEE.
Resumo:
This paper introduces hybrid address spaces as a fundamental design methodology for implementing scalable runtime systems on many-core architectures without hardware support for cache coherence. We use hybrid address spaces for an implementation of MapReduce, a programming model for large-scale data processing, and the implementation of a remote memory access (RMA) model. Both implementations are available on the Intel SCC and are portable to similar architectures. We present the design and implementation of HyMR, a MapReduce runtime system whereby different stages and the synchronization operations between them alternate between a distributed memory address space and a shared memory address space, to improve performance and scalability. We compare HyMR to a reference implementation and we find that HyMR improves performance by a factor of 1.71× over a set of representative MapReduce benchmarks. We also compare HyMR with Phoenix++, a state-of-art implementation for systems with hardware-managed cache coherence in terms of scalability and sustained to peak data processing bandwidth, where HyMR demon- strates improvements of a factor of 3.1× and 3.2× respectively. We further evaluate our hybrid remote memory access (HyRMA) programming model and assess its performance to be superior of that of message passing.
Resumo:
This paper presents a new programming methodology for introducing and tuning parallelism in Erlang programs, using source-level code refactoring from sequential source programs to parallel programs written using our skeleton library, Skel. High-level cost models allow us to predict with reasonable accuracy the parallel performance of the refactored program, enabling programmers to make informed decisions about which refactorings to apply. Using our approach, we demonstrate easily obtainable, significant and scalable speedups of up to 21 on a 24-core machine over the sequential code.
Resumo:
Processors with large numbers of cores are becoming commonplace. In order to utilise the available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge. In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks, their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks, rather than threads. We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, only by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called “Global Sharing”, which improves performance in multiprogramming situations. We use OpenMP, as the most popular model for shared-memory parallel programming as the main GPRM competitor for solving three well-known problems on both platforms: LU factorisation of Sparse Matrices, Image Convolution, and Linked List Processing. We focus on proposing solutions that best fit into the GPRM’s model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for the LU Factorisation results in notable performance improvement for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM’s task creation and distribution for very short computations using the Image Convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for Linked List processing and performs better than OpenMP implementations on the Xeon Phi. The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without sacrificing productivity.
Resumo:
Parallel computing is currently used in many engineering problems. However, because of limitations in curriculum design, it is not always possible to offer students specific formal teaching in this topic. Furthermore, parallel machines are still too expensive for many institutions. The latest microprocessors, such as Intel’s Pentium III and IV, embody single instruction multiple-data (SIMD) type parallel features, which makes them a viable solution for introducing parallel computing concepts to students. Final year projects have been initiated utilizing SSE (streaming SIMD extensions) features and it has been observed that students can easily learn parallel programming concepts after going through some programming exercises. They can now experiment with parallel algorithms on their own PCs at home. Keywords
Resumo:
Many common activities, like reading, scanning scenes, or searching for an inconspicuous item in a cluttered environment, entail serial movements of the eyes that shift the gaze from one object to another. Previous studies have shown that the primate brain is capable of programming sequential saccadic eye movements in parallel. Given that the onset of saccades directed to a target are unpredictable in individual trials, what prevents a saccade during parallel programming from being executed in the direction of the second target before execution of another saccade in the direction of the first target remains unclear. Using a computational model, here we demonstrate that sequential saccades inhibit each other and share the brain's limited processing resources (capacity) so that the planning of a saccade in the direction of the first target always finishes first. In this framework, the latency of a saccade increases linearly with the fraction of capacity allocated to the other saccade in the sequence, and exponentially with the duration of capacity sharing. Our study establishes a link between the dual-task paradigm and the ramp-to-threshold model of response time to identify a physiologically viable mechanism that preserves the serial order of saccades without compromising the speed of performance.