135 results for Parallel execution
Abstract:
Many meteorological phenomena occur at different locations simultaneously. These phenomena vary temporally and spatially. It is essential to track these multiple phenomena for accurate weather prediction. Efficient analysis requires high-resolution simulations, which can be conducted by introducing finer-resolution nested simulations (nests) at the locations of these phenomena. Simultaneous tracking of these multiple weather phenomena requires simultaneous execution of the nests on different subsets of the maximum number of processors for the main weather simulation. Dynamic variation in the number of these nests requires efficient processor reallocation strategies. In this paper, we have developed strategies for efficient partitioning and repartitioning of the nests among the processors. As a case study, we consider an application of tracking multiple organized cloud clusters in tropical weather systems. We first present a parallel data analysis algorithm to detect such clouds. We have developed a tree-based hierarchical diffusion method which reallocates processors for the nests such that the redistribution cost is lower. We achieve this by a novel tree reorganization approach. We show that our approach exhibits up to 25% lower redistribution cost and 53% fewer hop-bytes than the processor reallocation strategy that does not consider the existing processor allocation.
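The hop-bytes figure quoted above is commonly defined as the sum, over all messages, of message size multiplied by the number of network hops travelled. The sketch below illustrates that definition only. The 2D mesh without wraparound, the rank coordinates, and the message list are illustrative assumptions, not the topology or data used in the paper.

```python
def manhattan_hops(a, b):
    """Hops between two (x, y) positions, assuming a 2D mesh without wraparound."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def hop_bytes(messages, coords):
    """Sum of (bytes * hops) over all messages.

    messages: iterable of (src_rank, dst_rank, n_bytes)
    coords:   dict mapping rank -> (x, y) position on the mesh (hypothetical layout)
    """
    return sum(n * manhattan_hops(coords[s], coords[d]) for s, d, n in messages)


# Hypothetical 2x2 mesh and redistribution traffic.
coords = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}
messages = [(0, 1, 100), (0, 3, 50), (2, 3, 10)]
total = hop_bytes(messages, coords)
```

A reallocation that keeps data close to where it already resides lowers this total, either by sending fewer bytes or by sending them over fewer hops.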
Abstract:
We present a nonequilibrium strong-coupling approach to inhomogeneous systems of ultracold atoms in optical lattices. We demonstrate its application to the Mott-insulating phase of a two-dimensional Fermi-Hubbard model in the presence of a trap potential. Since the theory is formulated self-consistently, the numerical implementation relies on a massively parallel evaluation of the self-energy and the Green's function at each lattice site, employing thousands of CPUs. While the computation of the self-energy is straightforward to parallelize, the evaluation of the Green's function requires the inversion of a large sparse 10^d x 10^d matrix, with d > 6. As a crucial ingredient, our solution heavily relies on the smallness of the hopping as compared to the interaction strength and yields a widely scalable realization of a rapidly converging iterative algorithm which evaluates all elements of the Green's function. Results are validated by comparing with the homogeneous case via the local-density approximation. These calculations also show that the local-density approximation is valid in nonequilibrium setups without mass transport.
Abstract:
Programming for parallel architectures that do not have a shared address space is extremely difficult due to the need for explicit communication between memories of different compute devices. A heterogeneous system with CPUs and multiple GPUs and a distributed-memory cluster are examples of such systems. Past works that try to automate data movement for distributed-memory architectures can lead to excessive redundant communication. In this paper, we propose an automatic data movement scheme that minimizes the volume of communication between compute devices in heterogeneous and distributed-memory systems. We show that by partitioning data dependences in a particular non-trivial way, one can generate data movement code that results in the minimum volume for a vast majority of cases. The techniques are applicable to any sequence of affine loop nests and work on top of any choice of loop transformations, parallelization, and computation placement. The data movement code generated minimizes the volume of communication for a particular configuration of these. We use a combination of powerful static analyses relying on the polyhedral compiler framework and lightweight runtime routines they generate, to build a source-to-source transformation tool that automatically generates communication code. We demonstrate that the tool is scalable and leads to substantial gains in efficiency. On a heterogeneous system, the communication volume is reduced by a factor of 11X to 83X over state-of-the-art, translating into a mean execution time speedup of 1.53X. On a distributed-memory cluster, our scheme reduces the communication volume by a factor of 1.4X to 63.5X over state-of-the-art, resulting in a mean speedup of 1.55X. In addition, our scheme yields a mean speedup of 2.19X over hand-optimized UPC codes.
Abstract:
We show that every graph of maximum degree 3 can be represented as the intersection graph of axis parallel boxes in three dimensions, that is, every vertex can be mapped to an axis parallel box such that two boxes intersect if and only if their corresponding vertices are adjacent. In fact, we construct a representation in which any two intersecting boxes touch just at their boundaries.
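The definition of an intersection graph of axis-parallel boxes can be made concrete with a small check: two such boxes intersect exactly when their intervals overlap on every axis. The sketch below builds the intersection graph of a hand-picked set of boxes. The box coordinates and vertex names are illustrative assumptions, and the code only verifies a given representation; it does not implement the paper's construction.

```python
from itertools import combinations


def boxes_intersect(a, b):
    """True if two boxes share at least one point.

    Each box is ((x_lo, x_hi), (y_lo, y_hi), (z_lo, z_hi)) with closed intervals,
    so boxes that merely touch at a boundary count as intersecting.
    """
    return all(lo1 <= hi2 and lo2 <= hi1 for (lo1, hi1), (lo2, hi2) in zip(a, b))


def intersection_graph(boxes):
    """Edge set of the intersection graph of a dict {vertex: box}."""
    return {
        frozenset((u, v))
        for u, v in combinations(boxes, 2)
        if boxes_intersect(boxes[u], boxes[v])
    }


# Hypothetical representation of the path a - b - c: neighbours touch on a face.
boxes = {
    "a": ((0, 1), (0, 1), (0, 1)),
    "b": ((1, 2), (0, 1), (0, 1)),
    "c": ((2, 3), (0, 1), (0, 1)),
}
edges = intersection_graph(boxes)
```

Here adjacent boxes meet only at a shared face, which mirrors the boundary-touching property stated in the abstract.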
Abstract:
In this paper we present a massively parallel open source solver for Richards equation, named the RichardsFOAM solver. This solver has been developed in the framework of the open source generalist computational fluid dynamics tool box OpenFOAM® and is capable of dealing with large scale problems in both space and time. The source code for RichardsFOAM may be downloaded from the CPC program library website. It exhibits good parallel performance (up to ~90% parallel efficiency with 1024 processors both in strong and weak scaling), and the conditions required for obtaining such performance are analysed and discussed. This performance enables the mechanistic modelling of water fluxes at the scale of experimental watersheds (up to a few square kilometres of surface area), and on time scales of decades to a century. Such a solver can be useful in various applications, such as environmental engineering for long term transport of pollutants in soils, water engineering for assessing the impact of land settlement on water resources, or in the study of weathering processes on the watersheds. (C) 2014 Elsevier B.V. All rights reserved.
Abstract:
The binding of ligand 5,10,15,20-tetra(N-methyl-4-pyridyl)porphine (TMPyP4) with telomeric and genomic G-quadruplex DNA has been extensively studied. However, a comparative study of interactions of TMPyP4 with different conformations of human telomeric G-quadruplex DNA, namely, parallel propeller-type (PP), antiparallel basket-type (AB), and mixed hybrid-type (MH) G-quadruplex DNA, has not been done. We considered all the possible binding sites in each of the G-quadruplex DNA structures and docked TMPyP4 to each one of them. The resultant most potent sites for binding were analyzed from the mean binding free energy of the complexes. Molecular dynamics simulations were then carried out, and analysis of the binding free energy of the TMPyP4-G-quadruplex complex showed that the binding of TMPyP4 with parallel propeller-type G-quadruplex DNA is preferred over the other two G-quadruplex DNA conformations. The results obtained from the change in solvent excluded surface area (SESA) and solvent accessible surface area (SASA) also support the more pronounced binding of the ligand with the parallel propeller-type G-quadruplex DNA.
Abstract:
Task-parallel languages are increasingly popular. Many of them provide expressive mechanisms for intertask synchronization. For example, OpenMP 4.0 will integrate data-driven execution semantics derived from the StarSs research language. Compared to the more restrictive data-parallel and fork-join concurrency models, the advanced features being introduced into task-parallel models in turn enable improved scalability through load balancing, memory latency hiding, mitigation of the pressure on memory bandwidth, and, as a side effect, reduced power consumption. In this article, we develop a systematic approach to compile loop nests into concurrent, dynamically constructed graphs of dependent tasks. We propose a simple and effective heuristic that selects the most profitable parallelization idiom for every dependence type and communication pattern. This heuristic enables the extraction of interband parallelism (cross-barrier parallelism) in a number of numerical computations that range from linear algebra to structured grids and image processing. The proposed static analysis and code generation alleviate the burden of a full-blown dependence resolver to track the readiness of tasks at runtime. We evaluate our approach and algorithms in the PPCG compiler, targeting OpenStream, a representative dataflow task-parallel language with explicit intertask dependences and a lightweight runtime. Experimental results demonstrate the effectiveness of the approach.
Abstract:
The ac-side terminal voltages of parallel-connected converters are different if the line reactive drops of the individual converters are different. This could result either from differences in per-phase inductances or from differences in the line currents of the converters. In such cases, the modulating signals are different for the converters. Hence, the common-mode (CM) voltages for the converters, injected by conventional space vector pulsewidth modulation (CSVPWM) to increase dc-bus utilization, are different. Consequently, significant low-frequency zero-sequence circulating currents result. This paper proposes a new modulation method for parallel-connected converters with unequal terminal voltages. This method does not cause low-frequency zero-sequence circulating currents and is comparable with CSVPWM in terms of dc-bus utilization and device power loss. Experimental results are presented at a power level of 150 kVA from a circulating-power test setup, where the differences in converter terminal voltages are quite significant.
Abstract:
An area-efficient, wideband RF frequency synthesizer, which simultaneously generates multiple local oscillator (LO) signals, is designed. It is suitable for parallel wideband RF spectrum sensing in cognitive radios. The frequency synthesizer consists of an injection locked oscillator cascade (ILOC) where all the LO signals are derived from a single reference oscillator. The ILOC is implemented in a 130-nm technology with an active area of . It generates 4 uniformly spaced LO carrier frequencies from 500 MHz to 2 GHz. This design is the first known implementation of a CMOS based ILOC for wide-band RF spectrum sensing applications.
Abstract:
Prediction of queue waiting times of jobs submitted to production parallel batch systems is important to provide overall estimates to users and can also help meta-schedulers make scheduling decisions. In this work, we have developed a framework for predicting ranges of queue waiting times for jobs by employing multi-class classification of similar jobs in history. Our hierarchical prediction strategy first predicts the point wait time of a job using a dynamic k-Nearest Neighbor (kNN) method. It then performs a multi-class classification using Support Vector Machines (SVMs) among all the classes of the jobs. The probabilities given by the SVM for the class predicted using kNN and its neighboring classes are used to provide a set of ranges of predicted wait times with probabilities. We have used these predictions and probabilities in a meta-scheduling strategy that distributes jobs to different queues/sites in a multi-queue/grid environment for minimizing wait times of the jobs. Experiments with different production supercomputer job traces show that our prediction strategies can give correct predictions for about 77-87% of the jobs, and also result in about 12% improved accuracy when compared to the next best existing method. Experiments with our meta-scheduling strategy using different production and synthetic job traces for various system sizes, partitioning schemes and different workloads show that the meta-scheduling strategy gives much improved performance when compared to existing scheduling policies by reducing the overall average queue waiting times of the jobs by about 47%.
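The first stage described above, a point estimate from similar past jobs that is then mapped to a wait-time class, can be sketched with a plain k-nearest-neighbour average. This is a minimal illustration under stated assumptions: the job features (requested processors, requested hours), the history records, the class boundaries, and the fixed `k` are all hypothetical, and neither the paper's dynamic kNN variant nor its SVM stage is reproduced.

```python
import math


def knn_wait_time(history, job, k=3):
    """Mean wait (minutes) of the k past jobs closest to `job`.

    history: list of ((procs, req_hours), wait_minutes) records
    job:     (procs, req_hours) of the new job
    Features are compared by unscaled Euclidean distance (an assumption).
    """
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], job))[:k]
    return sum(wait for _, wait in nearest) / len(nearest)


def wait_class(minutes, bounds=(60, 240)):
    """Index of the wait-time range containing `minutes` (hypothetical ranges)."""
    for i, upper in enumerate(bounds):
        if minutes < upper:
            return i
    return len(bounds)


# Hypothetical job history: ((processors, requested hours), observed wait in minutes).
history = [
    ((16, 1.0), 10),
    ((32, 2.0), 30),
    ((64, 4.0), 120),
    ((512, 12.0), 600),
    ((1024, 24.0), 900),
]
predicted = knn_wait_time(history, (32, 1.5), k=3)
predicted_class = wait_class(predicted)
```

In practice the features would need scaling so that processor counts do not dominate the distance, and a classifier would supply probabilities for the predicted class and its neighbours.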
Abstract:
In concentrated solar power (CSP) generating stations, incident solar energy is reflected from a large number of mirrors or heliostats to a faraway receiver. In typical CSP installations, the mirror needs to be moved about two axes independently using two actuators in series, with the mirror effectively mounted at a single point. A three degree-of-freedom parallel manipulator, namely the 3-RPS parallel manipulator, is proposed to track the sun. The proposed 3-RPS parallel manipulator supports the load of the mirror, structure and wind loading at three points, resulting in less deflection, and thus a much larger mirror can be moved with the required tracking accuracy and without increasing the weight of the support structure. The kinematics equations to determine motion of the actuated prismatic joints in the 3-RPS parallel manipulator such that the sun's rays are reflected on to a stationary receiver are developed. Using finite element analysis, it is shown that for the same mirror size, wind loading and maximum deflection requirement, the weight of the support structure is between 15% and 60% less with the 3-RPS parallel manipulator when compared to azimuth-elevation or the target-aligned configurations.
Abstract:
Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy, even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.
Abstract:
Quantum cellular automata (QCA) is a new technology at the nanometer scale and has been considered as one of the alternatives to CMOS technology. In this paper, we describe the design and layout of a serial memory and parallel memory, showing the layout of individual memory cells. Assuming that we can fabricate cells which are separated by 10 nm, memory capacities of over 1.6 Gbit/cm^2 can be achieved. Simulations on the proposed memories were carried out using QCADesigner, a layout and simulation tool for QCA. During the design, we have tried to reduce the number of cells as well as the area, which is found to be 86.16 sq mm and 0.12 nm^2 with the QCA-based memory cell. We have also achieved an increase in efficiency of 40%. These circuits are the building blocks of nano processors and help us understand the nano devices of the future.
Abstract:
The crystal structure of a tripeptide Boc-Leu-Val-Ac(12)c-OMe (1) is determined, which incorporates a bulky 1-aminocyclododecane-1-carboxylic acid (Ac(12)c) side chain. The peptide adopts a semi-extended backbone conformation for Leu and Val residues, while the backbone torsion angles of the Cα,α-dialkylated residue Ac(12)c are in the helical region of the Ramachandran map. The molecular packing of 1 revealed a unique supramolecular twisted parallel β-sheet coiling into a helical architecture in crystals, with the bulky hydrophobic Ac(12)c side chains projecting outward from the helical column. This arrangement resembles the packing of peptide helices in crystal structures. Although short oligopeptides often assemble as parallel or anti-parallel β-sheets in crystals, twisted or helical β-sheet formation has been observed in a few examples of dipeptide crystal structures. Peptide 1 presents the first example of a tripeptide showing twisted β-sheet assembly in crystals. Copyright (c) 2016 European Peptide Society and John Wiley & Sons, Ltd.