961 resultados para parallel algorithm
Resumo:
An efficient parallelization algorithm for the Fast Multipole Method which aims to alleviate the parallelization bottleneck arising from lower job-count closer to root levels is presented. An electrostatic problem of 12 million non-uniformly distributed mesh elements is solved with 80-85% parallel efficiency in matrix setup and matrix-vector product using 60GB and 16 threads on shared memory architecture.
Resumo:
We present a nonequilibrium strong-coupling approach to inhomogeneous systems of ultracold atoms in optical lattices. We demonstrate its application to the Mott-insulating phase of a two-dimensional Fermi-Hubbard model in the presence of a trap potential. Since the theory is formulated self-consistently, the numerical implementation relies on a massively parallel evaluation of the self-energy and the Green's function at each lattice site, employing thousands of CPUs. While the computation of the self-energy is straightforward to parallelize, the evaluation of the Green's function requires the inversion of a large sparse 10(d) x 10(d) matrix, with d > 6. As a crucial ingredient, our solution heavily relies on the smallness of the hopping as compared to the interaction strength and yields a widely scalable realization of a rapidly converging iterative algorithm which evaluates all elements of the Green's function. Results are validated by comparing with the homogeneous case via the local-density approximation. These calculations also show that the local-density approximation is valid in nonequilibrium setups without mass transport.
Resumo:
The boxicity (resp. cubicity) of a graph G(V, E) is the minimum integer k such that G can be represented as the intersection graph of axis parallel boxes (resp. cubes) in R-k. Equivalently, it is the minimum number of interval graphs (resp. unit interval graphs) on the vertex set V, such that the intersection of their edge sets is E. The problem of computing boxicity (resp. cubicity) is known to be inapproximable, even for restricted graph classes like bipartite, co-bipartite and split graphs, within an O(n(1-epsilon))-factor for any epsilon > 0 in polynomial time, unless NP = ZPP. For any well known graph class of unbounded boxicity, there is no known approximation algorithm that gives n(1-epsilon)-factor approximation algorithm for computing boxicity in polynomial time, for any epsilon > 0. In this paper, we consider the problem of approximating the boxicity (cubicity) of circular arc graphs intersection graphs of arcs of a circle. Circular arc graphs are known to have unbounded boxicity, which could be as large as Omega(n). We give a (2 + 1/k) -factor (resp. (2 + log n]/k)-factor) polynomial time approximation algorithm for computing the boxicity (resp. cubicity) of any circular arc graph, where k >= 1 is the value of the optimum solution. For normal circular arc (NCA) graphs, with an NCA model given, this can be improved to an additive two approximation algorithm. The time complexity of the algorithms to approximately compute the boxicity (resp. cubicity) is O(mn + n(2)) in both these cases, and in O(mn + kn(2)) = O(n(3)) time we also get their corresponding box (resp. cube) representations, where n is the number of vertices of the graph and m is its number of edges. Our additive two approximation algorithm directly works for any proper circular arc graph, since their NCA models can be computed in polynomial time. (C) 2014 Elsevier B.V. All rights reserved.
Resumo:
Support vector machines (SVM) are a popular class of supervised models in machine learning. The associated compute intensive learning algorithm limits their use in real-time applications. This paper presents a fully scalable architecture of a coprocessor, which can compute multiple rows of the kernel matrix in parallel. Further, we propose an extended variant of the popular decomposition technique, sequential minimal optimization, which we call hybrid working set (HWS) algorithm, to effectively utilize the benefits of cached kernel columns and the parallel computational power of the coprocessor. The coprocessor is implemented on Xilinx Virtex 7 field-programmable gate array-based VC707 board and achieves a speedup of upto 25x for kernel computation over single threaded computation on Intel Core i5. An application speedup of upto 15x over software implementation of LIBSVM and speedup of upto 23x over SVMLight is achieved using the HWS algorithm in unison with the coprocessor. The reduction in the number of iterations and sensitivity of the optimization time to variation in cache size using the HWS algorithm are also shown.
Resumo:
This paper introduces a new technique called species conservation for evolving parallel subpopulations. The technique is based on the concept of dividing the population into several species according to their similarity. Each of these species is built around a dominating individual called the species seed. Species seeds found in the current generation are saved (conserved) by moving them into the next generation. Our technique has proved to be very effective in finding multiple solutions of multimodal optimization problems. We demonstrate this by applying it to a set of test problems, including some problems known to be deceptive to genetic algorithms.
Resumo:
A parallel strategy for solving multidimensional tridiagonal equations is investigated in this paper. We present in detail an improved version of single parallel partition (SPP) algorithm in conjunction with message vectorization, which aggregates several communication messages into one to reduce the communication cost. We show the resulting block SPP can achieve good speedup for a wide range of message vector length (MVL), especially when the number of grid points in the divided direction is large. Instead of only using the largest possible MVL, we adopt numerical tests and modeling analysis to determine an optimal MVL so that significant improvement in speedup can be obtained.
Resumo:
We investigate the 2d O(3) model with the standard action by Monte Carlo simulation at couplings β up to 2.05. We measure the energy density, mass gap and susceptibility of the model, and gather high statistics on lattices of size L ≤ 1024 using the Floating Point Systems T-series vector hypercube and the Thinking Machines Corp.'s Connection Machine 2. Asymptotic scaling does not appear to set in for this action, even at β = 2.10, where the correlation length is 420. We observe a 20% difference between our estimate m/Λ^─_(Ms) = 3.52(6) at this β and the recent exact analytical result . We use the overrelaxation algorithm interleaved with Metropolis updates and show that decorrelation time scales with the correlation length and the number of overrelaxation steps per sweep. We determine its effective dynamical critical exponent to be z' = 1.079(10); thus critical slowing down is reduced significantly for this local algorithm that is vectorizable and parallelizable.
We also use the cluster Monte Carlo algorithms, which are non-local Monte Carlo update schemes which can greatly increase the efficiency of computer simulations of spin models. The major computational task in these algorithms is connected component labeling, to identify clusters of connected sites on a lattice. We have devised some new SIMD component labeling algorithms, and implemented them on the Connection Machine. We investigate their performance when applied to the cluster update of the two dimensional Ising spin model.
Finally we use a Monte Carlo Renormalization Group method to directly measure the couplings of block Hamiltonians at different blocking levels. For the usual averaging block transformation we confirm the renormalized trajectory (RT) observed by Okawa. For another improved probabilistic block transformation we find the RT, showing that it is much closer to the Standard Action. We then use this block transformation to obtain the discrete β-function of the model which we compare to the perturbative result. We do not see convergence, except when using a rescaled coupling β_E to effectively resum the series. For the latter case we see agreement for m/ Λ^─_(Ms) at , β = 2.14, 2.26, 2.38 and 2.50. To three loops m/Λ^─_(Ms) = 3.047(35) at β = 2.50, which is very close to the exact value m/ Λ^─_(Ms) = 2.943. Our last point at β = 2.62 disagrees with this estimate however.
Resumo:
167 p.
Resumo:
On the basis of signed-digit negabinary representation, parallel two-step addition and one-step subtraction can be performed for arbitrary-length negabinary operands.; The arithmetic is realized by signed logic operations and optically implemented by spatial encoding and decoding techniques. The proposed algorithm and optical system are simple, reliable, and practicable, and they have the property of parallel processing of two-dimensional data. This leads to an efficient design for the optical arithmetic and logic unit. (C) 1997 Optical Society of America.
Resumo:
A compact two-step modified-signed-digit arithmetic-logic array processor is proposed. When the reference digits are programmed, both addition and subtraction can be performed by the same binary logic operations regardless of the sign of the input digits. The optical implementation and experimental demonstration with an electron-trapping device are shown. Each digit is encoded by a single pixel, and no polarization is included. Any combinational logic can be easily performed without optoelectronic and electro-optic conversions of the intermediate results. The system is compact, general purpose, simple to align, and has a high signal-to-noise ratio. (C) 1999 Optical Society of America.
Resumo:
A novel, to our knowledge, two-step digit-set-restricted modified signed-digit (MSD) addition-subtraction algorithm is proposed. With the introduction of the reference digits, the operand words are mapped into an intermediate carry word with all digits restricted to the set {(1) over bar, 0} and an intermediate sum word with all digits restricted to the set {0, 1}, which can be summed to form the final result without carry generation. The operation can be performed in parallel by use of binary logic. An optical system that utilizes an electron-trapping device is suggested for accomplishing the required binary logic operations. By programming of the illumination of data arrays, any complex logic operations of multiple variables can be realized without additional temporal latency of the intermediate results. This technique has a high space-bandwidth product and signal-to-noise ratio. The main structure can be stacked to construct a compact optoelectronic MSD adder-subtracter. (C) 1999 Optical Society of America.
Resumo:
FRAME3D, a program for the nonlinear seismic analysis of steel structures, has previously been used to study the collapse mechanisms of steel buildings up to 20 stories tall. The present thesis is inspired by the need to conduct similar analysis for much taller structures. It improves FRAME3D in two primary ways.
First, FRAME3D is revised to address specific nonlinear situations involving large displacement/rotation increments, the backup-subdivide algorithm, element failure, and extremely narrow joint hysteresis. The revisions result in superior convergence capabilities when modeling earthquake-induced collapse. The material model of a steel fiber is also modified to allow for post-rupture compressive strength.
Second, a parallel FRAME3D (PFRAME3D) is developed. The serial code is optimized and then parallelized. A distributed-memory divide-and-conquer approach is used for both the global direct solver and element-state updates. The result is an implicit finite-element hybrid-parallel program that takes advantage of the narrow-band nature of very tall buildings and uses nearest-neighbor-only communication patterns.
Using three structures of varied sized, PFRAME3D is shown to compute reproducible results that agree with that of the optimized 1-core version (displacement time-history response root-mean-squared errors are ~〖10〗^(-5) m) with much less wall time (e.g., a dynamic time-history collapse simulation of a 60-story building is computed in 5.69 hrs with 128 cores—a speedup of 14.7 vs. the optimized 1-core version). The maximum speedups attained are shown to increase with building height (as the total number of cores used also increases), and the parallel framework can be expected to be suitable for buildings taller than the ones presented here.
PFRAME3D is used to analyze a hypothetical 60-story steel moment-frame tube building (fundamental period of 6.16 sec) designed according to the 1994 Uniform Building Code. Dynamic pushover and time-history analyses are conducted. Multi-story shear-band collapse mechanisms are observed around mid-height of the building. The use of closely-spaced columns and deep beams is found to contribute to the building's “somewhat brittle” behavior (ductility ratio ~2.0). Overall building strength is observed to be sensitive to whether a model is fracture-capable.
Resumo:
This paper presents a pseudo-time-step method to calculate a (vector) Green function for the adjoint linearised Euler equations as a scattering problem in the frequency domain, for use as a jet-noise propagation prediction tool. A method of selecting the acoustics-related solution in a truncated spatial domain while suppressing any possible shear-layer-type instability is presented. Numerical tests for 3-D axisymmetrical parallel mean flows against semi-analytical reference solutions indicate that the new iterative algorithm is capable of producing accurate solutions with modest computational requirements.
Resumo:
In application of the Balancing Domain Decomposition by Constraints (BDDC) to a case with many substructures, solving the coarse problem exactly becomes the bottleneck which spoils scalability of the solver. However, it is straightforward for BDDC to substitute the exact solution of the coarse problem by another step of BDDC method with subdomains playing the role of elements. In this way, the algorithm of three-level BDDC method is obtained. If this approach is applied recursively, multilevel BDDC method is derived. We present a detailed description of a recently developed parallel implementation of this algorithm. The implementation is applied to an engineering problem of linear elasticity and a benchmark problem of Stokes flow in a cavity. Results by the multilevel approach are compared to those by the standard (two-level) BDDC method.