848 results for parallel benchmarks
Abstract:
A method is outlined for optimising graph partitions which arise in mapping unstructured mesh calculations to parallel computers. The method employs a combination of iterative techniques to both evenly balance the workload and minimise the number and volume of interprocessor communications. These techniques are designed to work efficiently in parallel as well as sequentially, and when combined with a fast direct partitioning technique (such as the Greedy algorithm) to give an initial partition, the resulting two-stage process proves itself to be both a powerful and flexible solution to the static graph-partitioning problem. The algorithms can also be used for dynamic load-balancing, and a clustering technique can additionally be employed to speed up the whole process. Experiments indicate that the resulting parallel code can provide high quality partitions, independent of the initial partition, within a few seconds.
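As a rough illustration of the first stage of such a two-stage process, the sketch below grows each part breadth-first from a seed vertex, in the spirit of a Greedy-style direct partitioner, and measures the resulting edge cut. It is a minimal Python sketch under assumed data structures (an adjacency-list dict), not the algorithm used in the paper.

```python
from collections import deque

def greedy_partition(adj, num_parts):
    """Graph-growing initial partition: grow each part by BFS from a seed
    until it holds roughly |V| / num_parts vertices.

    adj: dict mapping vertex -> iterable of neighbouring vertices.
    Returns: dict mapping vertex -> part index.
    """
    n = len(adj)
    target = (n + num_parts - 1) // num_parts   # vertices per part (rounded up)
    part = {}                                   # vertex -> part index
    unassigned = set(adj)

    for p in range(num_parts):
        if not unassigned:
            break
        # Seed the part with any unassigned vertex and grow it breadth-first.
        frontier = deque([next(iter(unassigned))])
        size = 0
        while frontier and size < target:
            v = frontier.popleft()
            if v not in unassigned:
                continue
            part[v] = p
            unassigned.discard(v)
            size += 1
            frontier.extend(u for u in adj[v] if u in unassigned)

    # Any leftovers (e.g. disconnected pieces) go to the last part.
    for v in unassigned:
        part[v] = num_parts - 1
    return part

def edge_cut(adj, part):
    """Number of edges whose endpoints lie in different parts
    (adjacency is assumed undirected, i.e. stored in both directions)."""
    return sum(1 for v in adj for u in adj[v] if part[v] != part[u]) // 2
```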
Abstract:
In many areas of simulation, a crucial component for efficient numerical computations is the use of solution-driven adaptive features: locally adapted meshing or re-meshing, and dynamically changing computational tasks. The full advantages of high performance computing (HPC) technology can thus only be exploited when efficient parallel adaptive solvers are realised. The resulting requirement for HPC software is dynamic load balancing, which for many mesh-based applications means dynamic mesh re-partitioning. The DRAMA project has been initiated to address this issue, with a particular focus on the requirements of industrial Finite Element codes, but codes using Finite Volume formulations will also be able to make use of the project results.
Abstract:
We present a dynamic distributed load balancing algorithm for parallel, adaptive finite element simulations using preconditioned conjugate gradient solvers based on domain decomposition. The load balancer is designed to maintain good partition aspect ratios. It calculates a balancing flow using different versions of diffusion and a variant of breadth-first search. Elements to be migrated are chosen according to a cost function aiming at the optimization of subdomain shapes. We show how to use information from the second step to guide the first. Experimental results using Bramble's preconditioner and comparisons to existing state-of-the-art load balancers show the benefits of the construction.
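The balancing-flow idea can be illustrated with a first-order diffusion iteration on the processor graph. The sketch below is only a generic illustration of that idea, not the paper's scheme (which also uses a breadth-first-search variant and shape-aware costs); the function name and inputs are assumed placeholders.

```python
def diffusion_balancing_flow(proc_graph, load, alpha=None, iters=200):
    """First-order diffusion sketch: neighbouring processors repeatedly exchange
    load in proportion to their load difference; the accumulated exchanges give
    a balancing flow for each edge of the processor graph.

    proc_graph: dict processor -> list of neighbouring processors (integer ids).
    load:       dict processor -> current workload (e.g. number of elements).
    Returns (flow, load): flow[(i, j)] is the net load to migrate from i to j.
    """
    if alpha is None:
        # A conservative diffusion constant keeps the iteration stable.
        alpha = 1.0 / (1 + max(len(nbrs) for nbrs in proc_graph.values()))
    load = dict(load)
    flow = {}
    for _ in range(iters):
        delta = {p: 0.0 for p in proc_graph}
        for i in proc_graph:
            for j in proc_graph[i]:
                if i < j:                        # visit each undirected edge once
                    f = alpha * (load[i] - load[j])
                    delta[i] -= f
                    delta[j] += f
                    flow[(i, j)] = flow.get((i, j), 0.0) + f
        for p in proc_graph:
            load[p] += delta[p]
    return flow, load
```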
Abstract:
As the complexity of parallel applications increases, the performance limitations resulting from computational load imbalance become dominant. Mapping the problem space to the processors in a parallel machine in a manner that balances the workload of each processor will typically reduce the run-time. In many cases the computation time required for a given calculation cannot be predetermined, even at run-time, and so a static partition of the problem yields poor performance. For problems in which the computational load across the discretisation is dynamic and inhomogeneous, for example multi-physics problems involving fluid and solid mechanics with phase changes, the workload for a static subdomain will change over the course of a computation and cannot be estimated beforehand. For such applications the mapping of loads to processors must change dynamically at run-time in order to maintain reasonable efficiency. The issues of dynamic load balancing are examined in the context of PHYSICA, a three-dimensional unstructured mesh multi-physics continuum mechanics computational modelling code.
Abstract:
A method is outlined for optimising graph partitions which arise in mapping unstructured mesh calculations to parallel computers. The method employs a relative gain iterative technique to both evenly balance the workload and minimise the number and volume of interprocessor communications. A parallel graph reduction technique is also briefly described and can be used to give a global perspective to the optimisation. The algorithms work efficiently in parallel as well as sequentially and when combined with a fast direct partitioning technique (such as the Greedy algorithm) to give an initial partition, the resulting two-stage process proves itself to be both a powerful and flexible solution to the static graph-partitioning problem. Experiments indicate that the resulting parallel code can provide high quality partitions, independent of the initial partition, within a few seconds. The algorithms can also be used for dynamic load-balancing, reusing existing partitions, and in this case the procedures are much faster than static techniques, provide partitions of similar or higher quality and, in comparison, involve the migration of only a fraction of the data.
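The kind of gain-driven boundary migration such an iterative optimiser performs can be sketched as follows: each vertex may move to the neighbouring subdomain where it has the most neighbours, when that reduces the edge cut and respects a crude balance bound. This is not the paper's relative gain formulation; the names and the balance rule are assumptions for illustration only.

```python
from collections import Counter

def refine_partition(adj, part, num_parts, max_passes=10):
    """Gain-based refinement sketch: sweep over vertices and move a vertex to
    the neighbouring part holding most of its neighbours whenever that strictly
    reduces the number of cut edges and keeps part sizes within a simple bound.

    adj:  dict vertex -> iterable of neighbours (undirected graph).
    part: dict vertex -> current part index (modified in place and returned).
    """
    n = len(adj)
    max_size = (n // num_parts) + 1            # crude balance constraint
    sizes = Counter(part.values())
    for _ in range(max_passes):
        moved = 0
        for v in adj:
            home = part[v]
            neigh_parts = Counter(part[u] for u in adj[v])
            best, best_count = home, neigh_parts.get(home, 0)
            for p, count in neigh_parts.items():
                if p != home and count > best_count and sizes[p] < max_size:
                    best, best_count = p, count
            if best != home:                   # positive gain: fewer cut edges
                part[v] = best
                sizes[home] -= 1
                sizes[best] += 1
                moved += 1
        if moved == 0:                         # no improving move found
            break
    return part
```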
Abstract:
A parallel method for the dynamic partitioning of unstructured meshes is described. The method introduces a new iterative optimisation technique known as relative gain optimisation, which both balances the workload and attempts to minimise the interprocessor communications overhead. Experiments on a series of adaptively refined meshes indicate that the algorithm provides partitions of equivalent or higher quality to those of static partitioners (which do not reuse the existing partition), and does so much more rapidly. Perhaps more importantly, the algorithm incurs only a small fraction of the data migration required by the static partitioners.
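For reference, the amount of data migration being compared can be measured as the fraction of mesh entities whose owner changes between the old and new partitions; the small sketch below (with hypothetical names) computes exactly that quantity.

```python
def migration_fraction(old_part, new_part, weight=None):
    """Fraction of the mesh (by entity count, or by weight if given) that
    changes owner when moving from old_part to new_part.

    old_part, new_part: dict entity -> owning part.
    weight:             optional dict entity -> data volume.
    """
    weight = weight or {v: 1.0 for v in old_part}
    total = sum(weight.values())
    moved = sum(weight[v] for v in old_part if new_part.get(v) != old_part[v])
    return moved / total
```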
Abstract:
The central product of the DRAMA (Dynamic Re-Allocation of Meshes for parallel Finite Element Applications) project is a library comprising a variety of tools for dynamic re-partitioning of unstructured Finite Element (FE) applications. The input to the DRAMA library is the computational mesh, and corresponding costs, partitioned into sub-domains. The core library functions then perform a parallel computation of a mesh re-allocation that will re-balance the costs based on the DRAMA cost model. We discuss the basic features of this cost model, which allows a general approach to load identification, modelling and imbalance minimisation. Results from crash simulations are presented which show the necessity for multi-phase/multi-constraint partitioning components.
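The sketch below is not the DRAMA cost model itself; it only illustrates, with assumed inputs, how a per-subdomain cost combining calculation and communication terms leads to an imbalance measure that re-partitioning then seeks to minimise.

```python
def subdomain_costs(calc_cost, comm_cost):
    """Per-subdomain total cost sketch: calculation cost plus the cost of the
    subdomain's interfaces.

    calc_cost: dict subdomain -> accumulated element calculation cost.
    comm_cost: dict (subdomain_a, subdomain_b) -> cost of the shared interface.
    """
    total = dict(calc_cost)
    for (a, b), c in comm_cost.items():
        total[a] = total.get(a, 0.0) + c
        total[b] = total.get(b, 0.0) + c
    return total

def imbalance(costs):
    """Load imbalance as maximum cost over average cost (1.0 = perfectly balanced)."""
    avg = sum(costs.values()) / len(costs)
    return max(costs.values()) / avg
```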
Abstract:
“Parallel Ruptures: Jews of Bessarabia and Transnistria between Romanian Nationalism and Soviet Communism, 1918-1940,” explores the political and social debates that took place in Jewish communities in Romanian-held Bessarabia and the Moldovan Autonomous Soviet Socialist Republic during the interwar era. Both had been part of the Russian Pale of Settlement until its dissolution in 1917; they were then divided by the Romanian Army’s occupation of Bessarabia in 1918 with the establishment of a well-guarded border along the Dniester River between two newly-formed states, Greater Romania and the Soviet Union. At its core, the project focuses in comparative context on the traumatic and multi-faceted confrontation with these two modernizing states: exclusion, discrimination and growing violence in Bessarabia; destruction of religious tradition, agricultural resettlement, and socialist re-education and assimilation in Soviet Transnistria. It also examines the similarities in both states’ striving to create model subjects usable by the homeland, as well as commonalities within Jewish responses on both sides of the border. Contacts between Jews on either side of the border remained significant after 1918 despite the efforts of both states to curb them, thereby necessitating a transnational view in order to examine Jewish political and social life in borderland regions. The desire among Jewish secular leaders to mold their co-religionists into modern Jews reached across state borders and ideological divides and sought to manipulate respective governments to establish these goals, however unsuccessful in the final analysis. Finally, the strained relations between Jews in peripheral borderlands and those at the national/imperial cores, Moscow and Bucharest, shed light on the complex circumstances surrounding the inclusion versus exclusion debates at the heart of all interwar European states and the complicated negotiations that took place within all minority communities that responded to state policies.
Abstract:
Scientific applications rely heavily on floating point data types. Floating point operations are complex and require complicated hardware that is both area and power intensive. The emergence of massively parallel architectures like Rigel creates new challenges and poses new questions with respect to floating point support. The massively parallel aspect of Rigel places great emphasis on area efficient, low power designs. At the same time, Rigel is a general purpose accelerator and must provide high performance for a wide class of applications. This thesis presents an analysis of various floating point unit (FPU) components with respect to Rigel, and attempts to present a candidate design of an FPU that balances performance, area, and power and is suitable for massively parallel architectures like Rigel.
Abstract:
Small-colony variants (SCVs) are commonly observed in evolution experiments and clinical isolates, being associated with antibiotic resistance and persistent infections. We recently observed the repeated emergence of Escherichia coli SCVs during adaptation to the interaction with macrophages. To identify the genetic targets underlying the emergence of this clinically relevant morphotype, we performed whole-genome sequencing of independently evolved SCV clones. We uncovered novel mutational targets, not previously associated with SCVs (e.g. cydA, pepP) and observed widespread functional parallelism. All SCV clones had mutations in genes related to the electron-transport chain. As SCVs emerged during adaptation to macrophages, and often show increased antibiotic resistance, we measured SCV fitness inside macrophages and determined their antibiotic resistance profiles. SCVs had a fitness advantage inside macrophages and showed increased aminoglycoside resistance in vitro, but had collateral sensitivity to other antibiotics (e.g. tetracycline). Importantly, we observed similar results in vivo. SCVs had a fitness advantage upon colonization of the mouse gut, which could be tuned by antibiotic treatment: kanamycin (an aminoglycoside) increased SCV fitness, but tetracycline strongly reduced it. Our results highlight the power of using experimental evolution as the basis for identifying the causes and consequences of adaptation during host-microbe interactions.
Abstract:
Processors with large numbers of cores are becoming commonplace. In order to utilise the available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge. In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks and their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks, rather than threads. We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, simply by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called “Global Sharing”, which improves performance in multiprogramming situations. We use OpenMP, the most popular model for shared-memory parallel programming, as the main GPRM competitor for solving three well-known problems on both platforms: LU factorisation of Sparse Matrices, Image Convolution, and Linked List Processing. We focus on proposing solutions that best fit into GPRM’s model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for the LU Factorisation results in notable performance improvement for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM’s task creation and distribution for very short computations using the Image Convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for Linked List processing and performs better than OpenMP implementations on the Xeon Phi. The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without sacrificing productivity.
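The granularity point, combining smaller tasks into larger ones to amortise per-task scheduling overhead, can be illustrated generically. The Python sketch below is not GPRM (whose runtime and API differ); it simply chunks fine-grained work items into coarser tasks before handing them to a thread pool, and the names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def run_chunked(work_items, worker, chunk_size, max_workers=4):
    """Combine fine-grained work items into coarser tasks so that per-task
    scheduling overhead is paid once per chunk rather than once per item.

    work_items: list of inputs; worker: function applied to each input.
    Returns the list of results in the original order.
    """
    def run_chunk(chunk):
        # One scheduled task processes a whole chunk of items.
        return [worker(item) for item in chunk]

    chunks = [work_items[i:i + chunk_size]
              for i in range(0, len(work_items), chunk_size)]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk_result in pool.map(run_chunk, chunks):
            results.extend(chunk_result)
    return results
```

Choosing `chunk_size` trades scheduling overhead against load balance: larger chunks mean fewer, cheaper scheduling decisions but coarser work distribution.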
Abstract:
Solving linear systems is an important problem for scientific computing. Exploiting parallelism is essential for solving complex systems, and this traditionally involves writing parallel algorithms on top of a library such as MPI. The SPIKE family of algorithms is one well-known example of a parallel solver for linear systems. The Hierarchically Tiled Array data type extends traditional data-parallel array operations with explicit tiling and allows programmers to directly manipulate tiles. The tiles of the HTA data type map naturally to the block nature of many numeric computations, including the SPIKE family of algorithms. The higher level of abstraction of the HTA enables the same program to be portable across different platforms. Current implementations target both shared-memory and distributed-memory models. In this thesis we present a proof-of-concept for portable linear solvers. We implement two algorithms from the SPIKE family using the HTA library. We show that our implementations of SPIKE exploit the abstractions provided by the HTA to produce a compact, clean code that can run on both shared-memory and distributed-memory models without modification. We discuss how we map the algorithms to HTA programs as well as examine their performance. We compare the performance of our HTA codes to comparable codes written in MPI as well as current state-of-the-art linear algebra routines.
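HTA itself is a C++/MATLAB library; the Python sketch below only illustrates the underlying idea of an explicitly tiled array with per-tile access and a tiled operation, under assumed class and method names, not the HTA API or the SPIKE algorithm.

```python
import numpy as np

class TiledMatrix:
    """Minimal tiled-matrix sketch: store a dense matrix as a grid of tiles and
    expose per-tile access, in the spirit of explicitly tiled array types."""

    def __init__(self, matrix, tile_rows, tile_cols):
        self.tile_rows, self.tile_cols = tile_rows, tile_cols
        r, c = matrix.shape
        # Edge tiles may be smaller if the shape is not an exact multiple.
        self.tiles = [[matrix[i:i + tile_rows, j:j + tile_cols].copy()
                       for j in range(0, c, tile_cols)]
                      for i in range(0, r, tile_rows)]

    def tile(self, i, j):
        """Direct access to block (i, j)."""
        return self.tiles[i][j]

    def matvec(self, x):
        """Tiled matrix-vector product: each block row accumulates the products
        of its tiles with the matching slices of x. Block rows are independent,
        so they could be computed in parallel or on separate nodes."""
        out = []
        for row in self.tiles:
            acc = np.zeros(row[0].shape[0])
            for j, tile in enumerate(row):
                start = j * self.tile_cols
                acc += tile @ x[start:start + tile.shape[1]]
            out.append(acc)
        return np.concatenate(out)

# Usage: A = TiledMatrix(np.random.rand(6, 6), 2, 3); y = A.matvec(np.ones(6))
```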