459 resultados para Parallelism
Resumo:
The use of accelerators, with compute architectures different and distinct from the CPU, has become a new research frontier in high-performance computing over the past ?ve years. This paper is a case study on how the instruction-level parallelism offered by three accelerator technologies, FPGA, GPU and ClearSpeed, can be exploited in atomic physics. The algorithm studied is the evaluation of two electron integrals, using direct numerical quadrature, a task that arises in the study of intermediate energy electron scattering by hydrogen atoms. The results of our ‘productivity’ study show that while each accelerator is viable, there are considerable differences in the implementation strategies that must be followed on each.
Resumo:
Quantum-dot Cellular Automata (QCA) technology is a promising potential alternative to CMOS technology. To explore the characteristics of QCA and suitable design methodologies, digital circuit design approaches have been investigated. Due to the inherent wire delay in QCA, pipelined architectures appear to be a particularly suitable design technique. Also, because of the pipeline nature of QCA technology, it is not suitable for complicated control system design. Systolic arrays take advantage of pipelining, parallelism and simple local control. Therefore, an investigation into these architectures in QCA technology is provided in this paper. Two case studies, (a matrix multiplier and a Galois Field multiplier) are designed and analyzed based on both multilayer and coplanar crossings. The performance of these two types of interconnections are compared and it is found that even though coplanar crossings are currently more practical, they tend to occupy a larger design area and incur slightly more delay. A general semi-conductor QCA systolic array design methodology is also proposed. It is found that by applying a systolic array structure in QCA design, significant benefits can be achieved particularly with large systolic arrays, even more so than when applied in CMOS-based technology.
Resumo:
Computing has recently reached an inflection point with the introduction of multicore processors. On-chip thread-level parallelism is doubling approximately every other year. Concurrency lends itself naturally to allowing a program to trade performance for power savings by regulating the number of active cores; however, in several domains, users are unwilling to sacrifice performance to save power. We present a prediction model for identifying energy-efficient operating points of concurrency in well-tuned multithreaded scientific applications and a runtime system that uses live program analysis to optimize applications dynamically. We describe a dynamic phase-aware performance prediction model that combines multivariate regression techniques with runtime analysis of data collected from hardware event counters to locate optimal operating points of concurrency. Using our model, we develop a prediction-driven phase-aware runtime optimization scheme that throttles concurrency so that power consumption can be reduced and performance can be set at the knee of the scalability curve of each program phase. The use of prediction reduces the overhead of searching the optimization space while achieving near-optimal performance and power savings. A thorough evaluation of our approach shows a reduction in power consumption of 10.8 percent, simultaneous with an improvement in performance of 17.9 percent, resulting in energy savings of 26.7 percent.
Resumo:
This implementation of a two-dimensional discrete cosine transform demonstrates the development of a suitable architectural style for a specific technology-in this case, the Xilinx XC6200 FPGA series. The design exploits distributed arithmetic, parallelism, and pipelining to achieve a high-performance custom-computing implementation.
Resumo:
Speeding up sequential programs on multicores is a challenging problem that is in urgent need of a solution. Automatic parallelization of irregular pointer-intensive codes, exempli?ed by the SPECint codes, is a very hard problem. This paper shows that, with a helping hand, such auto-parallelization is possible and fruitful. This paper makes the following contributions: (i) A compiler framework for extracting pipeline-like parallelism from outer program loops is presented. (ii) Using a light-weight programming model based on annotations, the programmer helps the compiler to ?nd thread-level parallelism. Each of the annotations speci?es only a small piece of semantic information that compiler analysis misses, e.g. stating that a variable is dead at a certain program point. The annotations are designed such that correctness is easily veri?ed. Furthermore, we present a tool for suggesting annotations to the programmer. (iii) The methodology is applied to autoparallelize several SPECint benchmarks. For the benchmark with most parallelism (hmmer), we obtain a scalable 7-fold speedup on an AMD quad-core dual processor. The annotations constitute a parallel programming model that relies extensively on a sequential program representation. Hereby, the complexity of debugging is not increased and it does not obscure the source code. These properties could prove valuable to increase the ef?ciency of parallel programming.
Resumo:
The prevalence of multicore processors is bound to drive most kinds of software development towards parallel programming. To limit the difficulty and overhead of parallel software design and maintenance, it is crucial that parallel programming models allow an easy-to-understand, concise and dense representation of parallelism. Parallel programming models such as Cilk++ and Intel TBBs attempt to offer a better, higher-level abstraction for parallel programming than threads and locking synchronization. It is not straightforward, however, to express all patterns of parallelism in these models. Pipelines are an important parallel construct, although difficult to express in Cilk and TBBs in a straightfor- ward way, not without a verbose restructuring of the code. In this paper we demonstrate that pipeline parallelism can be easily and concisely expressed in a Cilk-like language, which we extend with input, output and input/output dependency types on procedure arguments, enforced at runtime by the scheduler. We evaluate our implementation on real applications and show that our Cilk-like scheduler, extended to track and enforce these dependencies has performance comparable to Cilk++.
Resumo:
A number of high-performance VLSI architectures for real-time image coding applications are described. In particular, attention is focused on circuits for computing the 2-D DCT (discrete cosine transform) and for 2-D vector quantization. The former circuits are based on Winograd algorithms and comprise a number of bit-level systolic arrays with a bit-serial, word-parallel input. The latter circuits exhibit a similar data organization and consist of a number of inner product array circuits. Both circuits are highly regular and allow extremely high data rates to be achieved through extensive use of parallelism.
Resumo:
The study of parallel evolution facilitates the discovery of common rules of diversification. Here, we examine the repeated evolution of thick lips in Midas cichlid fishes (the Amphilophus citrinellus species complex) - from two Great Lakes and two crater lakes in Nicaragua - to assess whether similar changes in ecology, phenotypic trophic traits and gene expression accompany parallel trait evolution. Using next-generation sequencing technology, we characterize transcriptome-wide differential gene expression in the lips of wild-caught sympatric thick- and thin-lipped cichlids from all four instances of repeated thick-lip evolution. Six genes (apolipoprotein D, myelin-associated glycoprotein precursor, four-and-a-half LIM domain protein 2, calpain-9, GTPase IMAP family member 8-like and one hypothetical protein) are significantly underexpressed in the thick-lipped morph across all four lakes. However, other aspects of lips' gene expression in sympatric morphs differ in a lake-specific pattern, including the magnitude of differentially expressed genes (97-510). Generally, fewer genes are differentially expressed among morphs in the younger crater lakes than in those from the older Great Lakes. Body shape, lower pharyngeal jaw size and shape, and stable isotopes (dC and dN) differ between all sympatric morphs, with the greatest differentiation in the Great Lake Nicaragua. Some ecological traits evolve in parallel (those related to foraging ecology; e.g. lip size, body and head shape) but others, somewhat surprisingly, do not (those related to diet and food processing; e.g. jaw size and shape, stable isotopes). Taken together, this case of parallelism among thick- and thin-lipped cichlids shows a mosaic pattern of parallel and nonparallel evolution. © 2012 Blackwell Publishing Ltd.
Resumo:
With the emergence of multicore and manycore processors, engineers must design and develop software in drastically new ways to benefit from the computational power of all cores. However, developing parallel software is much harder than sequential software because parallelism can't be abstracted away easily. Authors Hans Vandierendonck and Tom Mens provide an overview of technologies and tools to support developers in this complex and error-prone task. © 2012 IEEE.
Resumo:
We describe an approach aimed at addressing the issue of joint exploitation of control (stream) and data parallelism in a skeleton based parallel programming environment, based on annotations and refactoring. Annotations drive efficient implementation of a parallel computation. Refactoring is used to transform the associated skeleton tree into a more efficient, functionally equivalent skeleton tree. In most cases, cost models are used to drive the refactoring process. We show how sample use case applications/kernels may be optimized and discuss preliminary experiments with FastFlow assessing the theoretical results. © 2013 Springer-Verlag.
Resumo:
Recent trends towards increasingly parallel computers mean that there needs to be a seismic shift in programming practice. The time is rapidly approaching when most programming will be for parallel systems. However, most programming techniques in use today are geared towards sequential, or occasionally small-scale parallel, programming. While refactoring has so far mainly been applied to sequential programs, it is our contention that refactoring can play a key role in significantly improving the programmability of parallel systems, by allowing the programmer to apply a set of well-defined transformations in order to parallelise their programs. In this paper, we describe a new language-independent refactoring approach that helps introduce and tune parallelism through high-level design patterns targeting a set of well-specified parallel skeletons. We believe this new refactoring process is the key to allowing programmers to truly start thinking in parallel. © 2012 ACM.
Resumo:
We describe a lightweight prototype framework (LIBERO) designed for experimentation with behavioural skeletons-components implementing a well-known parallelism exploitation pattern and a rule-based autonomic manager taking care of some non-functional feature related to pattern computation. LIBERO supports multiple autonomic managers within the same behavioural skeleton, each taking care of a different non-functional concern. We introduce LIBERO-built on plain Java and JBoss-and discuss how multiple managers may be coordinated to achieve a common goal using a two-phase coordination protocol developed in earlier work. We present experimental results that demonstrate how the prototype may be used to investigate autonomic management of multiple, independent concerns. © 2011 Springer-Verlag Berlin Heidelberg.
Resumo:
Recent trends in computing systems, such as multi-core processors and cloud computing, expose tens to thousands of processors to the software. Software developers must respond by introducing parallelism in their software. To obtain highest performance, it is not only necessary to identify parallelism, but also to reason about synchronization between threads and the communication of data from one thread to another. This entry gives an overview on some of the most common abstractions that are used in parallel programming, namely explicit vs. implicit expression of parallelism and shared and distributed memory. Several parallel programming models are reviewed and categorized by means of these abstractions. The pros and cons of parallel programming models from the perspective of performance and programmability are discussed.