993 resultados para distributed memory


Relevância:

100.00% 100.00%

Publicador:

Resumo:

We consider the often-studied problem of sorting, for a parallel computer. Given an input array distributed evenly over p processors, the task is to compute the sorted output array, also distributed over the p processors. Many existing algorithms take the approach of approximately load-balancing the output, leaving each processor with Θ(n/p) elements. However, in many cases, approximate load-balancing leads to inefficiencies in both the sorting itself and in further uses of the data after sorting. We provide a deterministic parallel sorting algorithm that uses parallel selection to produce any output distribution exactly, particularly one that is perfectly load-balanced. Furthermore, when using a comparison sort, this algorithm is 1-optimal in both computation and communication. We provide an empirical study that illustrates the efficiency of exact data splitting, and shows an improvement over two sample sort algorithms.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Modelos de tomada de decisão necessitam refletir os aspectos da psi- cologia humana. Com este objetivo, este trabalho é baseado na Sparse Distributed Memory (SDM), um modelo psicologicamente e neuro- cientificamente plausível da memória humana, publicado por Pentti Kanerva, em 1988. O modelo de Kanerva possui um ponto crítico: um item de memória aquém deste ponto é rapidamente encontrado, e items além do ponto crítico não o são. Kanerva calculou este ponto para um caso especial com um seleto conjunto de parâmetros (fixos). Neste trabalho estendemos o conhecimento deste ponto crítico, através de simulações computacionais, e analisamos o comportamento desta “Critical Distance” sob diferentes cenários: em diferentes dimensões; em diferentes números de items armazenados na memória; e em diferentes números de armazenamento do item. Também é derivada uma função que, quando minimizada, determina o valor da “Critical Distance” de acordo com o estado da memória. Um objetivo secundário do trabalho é apresentar a SDM de forma simples e intuitiva para que pesquisadores de outras áreas possam imaginar como ela pode ajudá-los a entender e a resolver seus problemas.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A parallel technique, for a distributed memory machine, based on domain decomposition for solving the Navier-Stokes equations in cartesian and cylindrical coordinates in two dimensions with free surfaces is described. It is based on the code by Tome and McKee (J. Comp. Phys. 110 (1994) 171-186) and Tome (Ph.D. Thesis, University of Strathclyde, Glasgow, 1993) which in turn is based on the SMAC method by Amsden and Harlow (Report LA-4370, Los Alamos Scientific Laboratory, 1971), which solves the Navier-Stokes equations in three steps: the momentum and Poisson equations and particle movement, These equations are discretized by explicit and 5-point finite differences. The parallelization is performed by splitting the computation domain into vertical panels and assigning each of these panels to a processor. All the computation can then be performed using nearest neighbour communication. Test runs comparing the performance of the parallel with the serial code, and a discussion of the load balancing question are presented. PVM is used for communication between processes. (C) 1999 Elsevier B.V. B.V. All rights reserved.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A large class of computational problems are characterised by frequent synchronisation, and computational requirements which change as a function of time. When such a problem is solved on a message passing multiprocessor machine [5], the combination of these characteristics leads to system performance which deteriorate in time. As the communication performance of parallel hardware steadily improves so load balance becomes a dominant factor in obtaining high parallel efficiency. Performance can be improved with periodic redistribution of computational load; however, redistribution can sometimes be very costly. We study the issue of deciding when to invoke a global load re-balancing mechanism. Such a decision policy must actively weigh the costs of remapping against the performance benefits, and should be general enough to apply automatically to a wide range of computations. This paper discusses a generic strategy for Dynamic Load Balancing (DLB) in unstructured mesh computational mechanics applications. The strategy is intended to handle varying levels of load changes throughout the run. The major issues involved in a generic dynamic load balancing scheme will be investigated together with techniques to automate the implementation of a dynamic load balancing mechanism within the Computer Aided Parallelisation Tools (CAPTools) environment, which is a semi-automatic tool for parallelisation of mesh based FORTRAN codes.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

As the complexity of parallel applications increase, the performance limitations resulting from computational load imbalance become dominant. Mapping the problem space to the processors in a parallel machine in a manner that balances the workload of each processors will typically reduce the run-time. In many cases the computation time required for a given calculation cannot be predetermined even at run-time and so static partition of the problem returns poor performance. For problems in which the computational load across the discretisation is dynamic and inhomogeneous, for example multi-physics problems involving fluid and solid mechanics with phase changes, the workload for a static subdomain will change over the course of a computation and cannot be estimated beforehand. For such applications the mapping of loads to process is required to change dynamically, at run-time in order to maintain reasonable efficiency. The issue of dynamic load balancing are examined in the context of PHYSICA, a three dimensional unstructured mesh multi-physics continuum mechanics computational modelling code.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A method is outlined for optimising graph partitions which arise in mapping unstructured mesh calculations to parallel computers. The method employs a relative gain iterative technique to both evenly balance the workload and minimise the number and volume of interprocessor communications. A parallel graph reduction technique is also briefly described and can be used to give a global perspective to the optimisation. The algorithms work efficiently in parallel as well as sequentially and when combined with a fast direct partitioning technique (such as the Greedy algorithm) to give an initial partition, the resulting two-stage process proves itself to be both a powerful and flexible solution to the static graph-partitioning problem. Experiments indicate that the resulting parallel code can provide high quality partitions, independent of the initial partition, within a few seconds. The algorithms can also be used for dynamic load-balancing, reusing existing partitions and in this case the procedures are much faster than static techniques, provide partitions of similar or higher quality and, in comparison, involve the migration of a fraction of the data.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Dissertação para obtenção do Grau de Mestre em Engenharia Informática

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The past few decades have seen a considerable increase in the number of parallel and distributed systems. With the development of more complex applications, the need for more powerful systems has emerged and various parallel and distributed environments have been designed and implemented. Each of the environments, including hardware and software, has unique strengths and weaknesses. There is no single parallel environment that can be identified as the best environment for all applications with respect to hardware and software properties. The main goal of this thesis is to provide a novel way of performing data-parallel computation in parallel and distributed environments by utilizing the best characteristics of difference aspects of parallel computing. For the purpose of this thesis, three aspects of parallel computing were identified and studied. First, three parallel environments (shared memory, distributed memory, and a network of workstations) are evaluated to quantify theirsuitability for different parallel applications. Due to the parallel and distributed nature of the environments, networks connecting the processors in these environments were investigated with respect to their performance characteristics. Second, scheduling algorithms are studied in order to make them more efficient and effective. A concept of application-specific information scheduling is introduced. The application- specific information is data about the workload extractedfrom an application, which is provided to a scheduling algorithm. Three scheduling algorithms are enhanced to utilize the application-specific information to further refine their scheduling properties. A more accurate description of the workload is especially important in cases where the workunits are heterogeneous and the parallel environment is heterogeneous and/or non-dedicated. The results obtained show that the additional information regarding the workload has a positive impact on the performance of applications. Third, a programming paradigm for networks of symmetric multiprocessor (SMP) workstations is introduced. The MPIT programming paradigm incorporates the Message Passing Interface (MPI) with threads to provide a methodology to write parallel applications that efficiently utilize the available resources and minimize the overhead. The MPIT allows for communication and computation to overlap by deploying a dedicated thread for communication. Furthermore, the programming paradigm implements an application-specific scheduling algorithm. The scheduling algorithm is executed by the communication thread. Thus, the scheduling does not affect the execution of the parallel application. Performance results achieved from the MPIT show that considerable improvements over conventional MPI applications are achieved.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Traditional psychometric theory and practice classify people according to broad ability dimensions but do not examine how these mental processes occur. Hunt and Lansman (1975) proposed a 'distributed memory' model of cognitive processes with emphasis on how to describe individual differences based on the assumption that each individual possesses the same components. It is in the quality of these components ~hat individual differences arise. Carroll (1974) expands Hunt's model to include a production system (after Newell and Simon, 1973) and a response system. He developed a framework of factor analytic (FA) factors for : the purpose of describing how individual differences may arise from them. This scheme is to be used in the analysis of psychometric tes ts . Recent advances in the field of information processing are examined and include. 1) Hunt's development of differences between subjects designated as high or low verbal , 2) Miller's pursuit of the magic number seven, plus or minus two, 3) Ferguson's examination of transfer and abilities and, 4) Brown's discoveries concerning strategy teaching and retardates . In order to examine possible sources of individual differences arising from cognitive tasks, traditional psychometric tests were searched for a suitable perceptual task which could be varied slightly and administered to gauge learning effects produced by controlling independent variables. It also had to be suitable for analysis using Carroll's f ramework . The Coding Task (a symbol substitution test) found i n the Performance Scale of the WISe was chosen. Two experiments were devised to test the following hypotheses. 1) High verbals should be able to complete significantly more items on the Symbol Substitution Task than low verbals (Hunt, Lansman, 1975). 2) Having previous practice on a task, where strategies involved in the task may be identified, increases the amount of output on a similar task (Carroll, 1974). J) There should be a sUbstantial decrease in the amount of output as the load on STM is increased (Miller, 1956) . 4) Repeated measures should produce an increase in output over trials and where individual differences in previously acquired abilities are involved, these should differentiate individuals over trials (Ferguson, 1956). S) Teaching slow learners a rehearsal strategy would improve their learning such that their learning would resemble that of normals on the ,:same task. (Brown, 1974). In the first experiment 60 subjects were d.ivided·into high and low verbal, further divided randomly into a practice group and nonpractice group. Five subjects in each group were assigned randomly to work on a five, seven and nine digit code throughout the experiment. The practice group was given three trials of two minutes each on the practice code (designed to eliminate transfer effects due to symbol similarity) and then three trials of two minutes each on the actual SST task . The nonpractice group was given three trials of two minutes each on the same actual SST task . Results were analyzed using a four-way analysis of variance . In the second experiment 18 slow learners were divided randomly into two groups. one group receiving a planned strategy practioe, the other receiving random practice. Both groups worked on the actual code to be used later in the actual task. Within each group subjects were randomly assigned to work on a five, seven or nine digit code throughout. Both practice and actual tests consisted on three trials of two minutes each. Results were analyzed using a three-way analysis of variance . It was found in t he first experiment that 1) high or low verbal ability by itself did not produce significantly different results. However, when in interaction with the other independent variables, a difference in performance was noted . 2) The previous practice variable was significant over all segments of the experiment. Those who received previo.us practice were able to score significantly higher than those without it. J) Increasing the size of the load on STM severely restricts performance. 4) The effect of repeated trials proved to be beneficial. Generally, gains were made on each successive trial within each group. S) In the second experiment, slow learners who were allowed to practice randomly performed better on the actual task than subjeots who were taught the code by means of a planned strategy. Upon analysis using the Carroll scheme, individual differences were noted in the ability to develop strategies of storing, searching and retrieving items from STM, and in adopting necessary rehearsals for retention in STM. While these strategies may benef it some it was found that for others they may be harmful . Temporal aspects and perceptual speed were also found to be sources of variance within individuals . Generally it was found that the largest single factor i nfluencing learning on this task was the repeated measures . What e~ables gains to be made, varies with individuals . There are environmental factors, specific abilities, strategy development, previous learning, amount of load on STM , perceptual and temporal parameters which influence learning and these have serious implications for educational programs .

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Cache-coherent non uniform memory access (ccNUMA) architecture is a standard design pattern for contemporary multicore processors, and future generations of architectures are likely to be NUMA. NUMA architectures create new challenges for managed runtime systems. Memory-intensive applications use the system’s distributed memory banks to allocate data, and the automatic memory manager collects garbage left in these memory banks. The garbage collector may need to access remote memory banks, which entails access latency overhead and potential bandwidth saturation for the interconnection between memory banks. This dissertation makes five significant contributions to garbage collection on NUMA systems, with a case study implementation using the Hotspot Java Virtual Machine. It empirically studies data locality for a Stop-The-World garbage collector when tracing connected objects in NUMA heaps. First, it identifies a locality richness which exists naturally in connected objects that contain a root object and its reachable set— ‘rooted sub-graphs’. Second, this dissertation leverages the locality characteristic of rooted sub-graphs to develop a new NUMA-aware garbage collection mechanism. A garbage collector thread processes a local root and its reachable set, which is likely to have a large number of objects in the same NUMA node. Third, a garbage collector thread steals references from sibling threads that run on the same NUMA node to improve data locality. This research evaluates the new NUMA-aware garbage collector using seven benchmarks of an established real-world DaCapo benchmark suite. In addition, evaluation involves a widely used SPECjbb benchmark and Neo4J graph database Java benchmark, as well as an artificial benchmark. The results of the NUMA-aware garbage collector on a multi-hop NUMA architecture show an average of 15% performance improvement. Furthermore, this performance gain is shown to be as a result of an improved NUMA memory access in a ccNUMA system. Fourth, the existing Hotspot JVM adaptive policy for configuring the number of garbage collection threads is shown to be suboptimal for current NUMA machines. The policy uses outdated assumptions and it generates a constant thread count. In fact, the Hotspot JVM still uses this policy in the production version. This research shows that the optimal number of garbage collection threads is application-specific and configuring the optimal number of garbage collection threads yields better collection throughput than the default policy. Fifth, this dissertation designs and implements a runtime technique, which involves heuristics from dynamic collection behavior to calculate an optimal number of garbage collector threads for each collection cycle. The results show an average of 21% improvements to the garbage collection performance for DaCapo benchmarks.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

In this and a preceding paper, we provide an introduction to the Fujitsu VPP range of vector-parallel supercomputers and to some of the computational chemistry software available for the VPP. Here, we consider the implementation and performance of seven popular chemistry application packages. The codes discussed range from classical molecular dynamics to semiempirical and ab initio quantum chemistry. All have evolved from sequential codes, and have typically been parallelised using a replicated data approach. As such they are well suited to the large-memory/fast-processor architecture of the VPP. For one code, CASTEP, a distributed-memory data-driven parallelisation scheme is presented. (C) 2000 Published by Elsevier Science B.V. All rights reserved.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Numerical methods related to Krylov subspaces are widely used in large sparse numerical linear algebra. Vectors in these subspaces are manipulated via their representation onto orthonormal bases. Nowadays, on serial computers, the method of Arnoldi is considered as a reliable technique for constructing such bases. However, although easily parallelizable, this technique is not as scalable as expected for communications. In this work we examine alternative methods aimed at overcoming this drawback. Since they retrieve upon completion the same information as Arnoldi's algorithm does, they enable us to design a wide family of stable and scalable Krylov approximation methods for various parallel environments. We present timing results obtained from their implementation on two distributed-memory multiprocessor supercomputers: the Intel Paragon and the IBM Scalable POWERparallel SP2. (C) 1997 by John Wiley & Sons, Ltd.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Dissertação para obtenção do Grau de Mestre em Engenharia Biomédica

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Background: Parallel T-Coffee (PTC) was the first parallel implementation of the T-Coffee multiple sequence alignment tool. It is based on MPI and RMA mechanisms. Its purpose is to reduce the execution time of the large-scale sequence alignments. It can be run on distributed memory clusters allowing users to align data sets consisting of hundreds of proteins within a reasonable time. However, most of the potential users of this tool are not familiar with the use of grids or supercomputers. Results: In this paper we show how PTC can be easily deployed and controlled on a super computer architecture using a web portal developed using Rapid. Rapid is a tool for efficiently generating standardized portlets for a wide range of applications and the approach described here is generic enough to be applied to other applications, or to deploy PTC on different HPC environments. Conclusions: The PTC portal allows users to upload a large number of sequences to be aligned by the parallel version of TC that cannot be aligned by a single machine due to memory and execution time constraints. The web portal provides a user-friendly solution.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

A parallel pseudo-spectral method for the simulation in distributed memory computers of the shallow-water equations in primitive form was developed and used on the study of turbulent shallow-waters LES models for orographic subgrid-scale perturbations. The main characteristics of the code are: momentum equations integrated in time using an accurate pseudo-spectral technique; Eulerian treatment of advective terms; and parallelization of the code based on a domain decomposition technique. The parallel pseudo-spectral code is efficient on various architectures. It gives high performance onvector computers and good speedup on distributed memory systems. The code is being used for the study of the interaction mechanisms in shallow-water ows with regular as well as random orography with a prescribed spectrum of elevations. Simulations show the evolution of small scale vortical motions from the interaction of the large scale flow and the small-scale orographic perturbations. These interactions transfer energy from the large-scale motions to the small (usually unresolved) scales. The possibility of including the parametrization of this effects in turbulent LES subgrid-stress models for the shallow-water equations is addressed.