986 resultados para parallel architectures


Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper describes the ParaPhrase project, a new 3-year targeted research project funded under EU Framework 7 Objective 3.4 (Computer Systems), starting in October 2011. ParaPhrase aims to follow a new approach to introducing parallelism using advanced refactoring techniques coupled with high-level parallel design patterns. The refactoring approach will use these design patterns to restructure programs defined as networks of software components into other forms that are more suited to parallel execution. The programmer will be aided by high-level cost information that will be integrated into the refactoring tools. The implementation of these patterns will then use a well-understood algorithmic skeleton approach to achieve good parallelism. A key ParaPhrase design goal is that parallel components are intended to match heterogeneous architectures, defined in terms of CPU/GPU combinations, for example. In order to achieve this, the ParaPhrase approach will map components at link time to the available hardware, and will then re-map them during program execution, taking account of multiple applications, changes in hardware resource availability, the desire to reduce communication costs etc. In this way, we aim to develop a new approach to programming that will be able to produce software that can adapt to dynamic changes in the system environment. Moreover, by using a strong component basis for parallelism, we can achieve potentially significant gains in terms of reducing sharing at a high level of abstraction, and so in reducing or even eliminating the costs that are usually associated with cache management, locking, and synchronisation. © 2013 Springer-Verlag Berlin Heidelberg.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We propose a methodology for optimizing the execution of data parallel (sub-)tasks on CPU and GPU cores of the same heterogeneous architecture. The methodology is based on two main components: i) an analytical performance model for scheduling tasks among CPU and GPU cores, such that the global execution time of the overall data parallel pattern is optimized; and ii) an autonomic module which uses the analytical performance model to implement the data parallel computations in a completely autonomic way, requiring no programmer intervention to optimize the computation across CPU and GPU cores. The analytical performance model uses a small set of simple parameters to devise a partitioning-between CPU and GPU cores-of the tasks derived from structured data parallel patterns/algorithmic skeletons. The model takes into account both hardware related and application dependent parameters. It computes the percentage of tasks to be executed on CPU and GPU cores such that both kinds of cores are exploited and performance figures are optimized. The autonomic module, implemented in FastFlow, executes a generic map (reduce) data parallel pattern scheduling part of the tasks to the GPU and part to CPU cores so as to achieve optimal execution time. Experimental results on state-of-the-art CPU/GPU architectures are shown that assess both performance model properties and autonomic module effectiveness. © 2013 IEEE.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We introduce a new parallel pattern derived from a specific application domain and show how it turns out to have application beyond its domain of origin. The pool evolution pattern models the parallel evolution of a population subject to mutations and evolving in such a way that a given fitness function is optimized. The pattern has been demonstrated to be suitable for capturing and modeling the parallel patterns underpinning various evolutionary algorithms, as well as other parallel patterns typical of symbolic computation. In this paper we introduce the pattern, we discuss its implementation on modern multi/many core architectures and finally present experimental results obtained with FastFlow and Erlang implementations to assess its feasibility and scalability.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

As the complexity of computing systems grows, reliability and energy are two crucial challenges asking for holistic solutions. In this paper, we investigate the interplay among concurrency, power dissipation, energy consumption and voltage-frequency scaling for a key numerical kernel for the solution of sparse linear systems. Concretely, we leverage a task-parallel implementation of the Conjugate Gradient method, equipped with an state-of-the-art pre-conditioner embedded in the ILUPACK software, and target a low-power multi core processor from ARM.In addition, we perform a theoretical analysis on the impact of a technique like Near Threshold Voltage Computing (NTVC) from the points of view of increased hardware concurrency and error rate.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Energy consumption is an important concern in modern multicore processors. The energy consumed by a multicore processor during the execution of an application can be minimized by tuning the hardware state utilizing knobs such as frequency, voltage etc. The existing theoretical work on energy minimization using Global DVFS (Dynamic Voltage and Frequency Scaling), despite being thorough, ignores the time and the energy consumed by the CPU on memory accesses and the dynamic energy consumed by the idle cores. This article presents an analytical energy-performance model for parallel workloads that accounts for the time and the energy consumed by the CPU chip on memory accesses in addition to the time and energy consumed by the CPU on CPU instructions. In addition, the model we present also accounts for the dynamic energy consumed by the idle cores. The existing work on global DVFS for parallel workloads shows that using a single frequency for the entire duration of a parallel application is not energy optimal and that varying the frequency according to the changes in the parallelism of the workload can save energy. We present an analytical framework around our energy-performance model to predict the operating frequencies (that depend upon the amount of parallelism) for global DVFS that minimize the overall CPU energy consumption. We show how the optimal frequencies in our model differ from the optimal frequencies in a model that does not account for memory accesses. We further show how the memory intensity of an application affects the optimal frequencies.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Embedded real-time applications increasingly present high computation requirements, which need to be completed within specific deadlines, but that present highly variable patterns, depending on the set of data available in a determined instant. The current trend to provide parallel processing in the embedded domain allows providing higher processing power; however, it does not address the variability in the processing pattern. Dimensioning each device for its worst-case scenario implies lower average utilization, and increased available, but unusable, processing in the overall system. A solution for this problem is to extend the parallel execution of the applications, allowing networked nodes to distribute the workload, on peak situations, to neighbour nodes. In this context, this report proposes a framework to develop parallel and distributed real-time embedded applications, transparently using OpenMP and Message Passing Interface (MPI), within a programming model based on OpenMP. The technical report also devises an integrated timing model, which enables the structured reasoning on the timing behaviour of these hybrid architectures.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The recent trends of chip architectures with higher number of heterogeneous cores, and non-uniform memory/non-coherent caches, brings renewed attention to the use of Software Transactional Memory (STM) as a fundamental building block for developing parallel applications. Nevertheless, although STM promises to ease concurrent and parallel software development, it relies on the possibility of aborting conflicting transactions to maintain data consistency, which impacts on the responsiveness and timing guarantees required by embedded real-time systems. In these systems, contention delays must be (efficiently) limited so that the response times of tasks executing transactions are upper-bounded and task sets can be feasibly scheduled. In this paper we assess the use of STM in the development of embedded real-time software, defending that the amount of contention can be reduced if read-only transactions access recent consistent data snapshots, progressing in a wait-free manner. We show how the required number of versions of a shared object can be calculated for a set of tasks. We also outline an algorithm to manage conflicts between update transactions that prevents starvation.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Over the last three decades, computer architects have been able to achieve an increase in performance for single processors by, e.g., increasing clock speed, introducing cache memories and using instruction level parallelism. However, because of power consumption and heat dissipation constraints, this trend is going to cease. In recent times, hardware engineers have instead moved to new chip architectures with multiple processor cores on a single chip. With multi-core processors, applications can complete more total work than with one core alone. To take advantage of multi-core processors, parallel programming models are proposed as promising solutions for more effectively using multi-core processors. This paper discusses some of the existent models and frameworks for parallel programming, leading to outline a draft parallel programming model for Ada.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Single processor architectures are unable to provide the required performance of high performance embedded systems. Parallel processing based on general-purpose processors can achieve these performances with a considerable increase of required resources. However, in many cases, simplified optimized parallel cores can be used instead of general-purpose processors achieving better performance at lower resource utilization. In this paper, we propose a configurable many-core architecture to serve as a co-processor for high-performance embedded computing on Field-Programmable Gate Arrays. The architecture consists of an array of configurable simple cores with support for floating-point operations interconnected with a configurable interconnection network. For each core it is possible to configure the size of the internal memory, the supported operations and number of interfacing ports. The architecture was tested in a ZYNQ-7020 FPGA in the execution of several parallel algorithms. The results show that the proposed many-core architecture achieves better performance than that achieved with a parallel generalpurpose processor and that up to 32 floating-point cores can be implemented in a ZYNQ-7020 SoC FPGA.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The recent technological advancements and market trends are causing an interesting phenomenon towards the convergence of High-Performance Computing (HPC) and Embedded Computing (EC) domains. On one side, new kinds of HPC applications are being required by markets needing huge amounts of information to be processed within a bounded amount of time. On the other side, EC systems are increasingly concerned with providing higher performance in real-time, challenging the performance capabilities of current architectures. The advent of next-generation many-core embedded platforms has the chance of intercepting this converging need for predictable high-performance, allowing HPC and EC applications to be executed on efficient and powerful heterogeneous architectures integrating general-purpose processors with many-core computing fabrics. To this end, it is of paramount importance to develop new techniques for exploiting the massively parallel computation capabilities of such platforms in a predictable way. P-SOCRATES will tackle this important challenge by merging leading research groups from the HPC and EC communities. The time-criticality and parallelisation challenges common to both areas will be addressed by proposing an integrated framework for executing workload-intensive applications with real-time requirements on top of next-generation commercial-off-the-shelf (COTS) platforms based on many-core accelerated architectures. The project will investigate new HPC techniques that fulfil real-time requirements. The main sources of indeterminism will be identified, proposing efficient mapping and scheduling algorithms, along with the associated timing and schedulability analysis, to guarantee the real-time and performance requirements of the applications.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Most of the commercial and financial data are stored in decimal fonn. Recently, support for decimal arithmetic has received increased attention due to the growing importance in financial analysis, banking, tax calculation, currency conversion, insurance, telephone billing and accounting. Performing decimal arithmetic with systems that do not support decimal computations may give a result with representation error, conversion error, and/or rounding error. In this world of precision, such errors are no more tolerable. The errors can be eliminated and better accuracy can be achieved if decimal computations are done using Decimal Floating Point (DFP) units. But the floating-point arithmetic units in today's general-purpose microprocessors are based on the binary number system, and the decimal computations are done using binary arithmetic. Only few common decimal numbers can be exactly represented in Binary Floating Point (BF P). ln many; cases, the law requires that results generated from financial calculations performed on a computer should exactly match with manual calculations. Currently many applications involving fractional decimal data perform decimal computations either in software or with a combination of software and hardware. The performance can be dramatically improved by complete hardware DFP units and this leads to the design of processors that include DF P hardware.VLSI implementations using same modular building blocks can decrease system design and manufacturing cost. A multiplexer realization is a natural choice from the viewpoint of cost and speed.This thesis focuses on the design and synthesis of efficient decimal MAC (Multiply ACeumulate) architecture for high speed decimal processors based on IEEE Standard for Floating-point Arithmetic (IEEE 754-2008). The research goal is to design and synthesize deeimal'MAC architectures to achieve higher performance.Efficient design methods and architectures are developed for a high performance DFP MAC unit as part of this research.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The process of developing software that takes advantage of multiple processors is commonly referred to as parallel programming. For various reasons, this process is much harder than the sequential case. For decades, parallel programming has been a problem for a small niche only: engineers working on parallelizing mostly numerical applications in High Performance Computing. This has changed with the advent of multi-core processors in mainstream computer architectures. Parallel programming in our days becomes a problem for a much larger group of developers. The main objective of this thesis was to find ways to make parallel programming easier for them. Different aims were identified in order to reach the objective: research the state of the art of parallel programming today, improve the education of software developers about the topic, and provide programmers with powerful abstractions to make their work easier. To reach these aims, several key steps were taken. To start with, a survey was conducted among parallel programmers to find out about the state of the art. More than 250 people participated, yielding results about the parallel programming systems and languages in use, as well as about common problems with these systems. Furthermore, a study was conducted in university classes on parallel programming. It resulted in a list of frequently made mistakes that were analyzed and used to create a programmers' checklist to avoid them in the future. For programmers' education, an online resource was setup to collect experiences and knowledge in the field of parallel programming - called the Parawiki. Another key step in this direction was the creation of the Thinking Parallel weblog, where more than 50.000 readers to date have read essays on the topic. For the third aim (powerful abstractions), it was decided to concentrate on one parallel programming system: OpenMP. Its ease of use and high level of abstraction were the most important reasons for this decision. Two different research directions were pursued. The first one resulted in a parallel library called AthenaMP. It contains so-called generic components, derived from design patterns for parallel programming. These include functionality to enhance the locks provided by OpenMP, to perform operations on large amounts of data (data-parallel programming), and to enable the implementation of irregular algorithms using task pools. AthenaMP itself serves a triple role: the components are well-documented and can be used directly in programs, it enables developers to study the source code and learn from it, and it is possible for compiler writers to use it as a testing ground for their OpenMP compilers. The second research direction was targeted at changing the OpenMP specification to make the system more powerful. The main contributions here were a proposal to enable thread-cancellation and a proposal to avoid busy waiting. Both were implemented in a research compiler, shown to be useful in example applications, and proposed to the OpenMP Language Committee.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We propose a bridge between two important parallel programming paradigms: data parallelism and communicating sequential processes (CSP). Data parallel pipelined architectures obtained with the Alpha language can be embedded in a control intensive application expressed in CSP-based Handel formalism. The interface is formally defined from the semantics of the languages Alpha and Handel. This work will ease the design of compute intensive applications on FPGAs.