989 resultados para parallel computation


Relevância:

40.00% 40.00%

Publicador:

Resumo:

This article describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large datasets. An example context concerns common biological studies using high-throughput technologies generating many, very large datasets and requiring increasingly high-dimensional mixture models with large numbers of mixture components.We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, give examples of the benefits of GPU implementations in terms of processing speed and scale-up in ability to analyze large datasets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithms software design can lead to vast speed-up and, critically, enable statistical analyses that presently will not be performed due to compute time limitations in traditional computational environments. Supplementalmaterials are provided with all source code, example data, and details that will enable readers to implement and explore the GPU approach in this mixture modeling context. © 2010 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

During 1990's the Wavelet Transform emerged as an important signal processing tool with potential applications in time-frequency analysis and non-stationary signal processing.Wavelets have gained popularity in broad range of disciplines like signal/image compression, medical diagnostics, boundary value problems, geophysical signal processing, statistical signal processing,pattern recognition,underwater acoustics etc.In 1993, G. Evangelista introduced the Pitch- synchronous Wavelet Transform, which is particularly suited for pseudo-periodic signal processing.The work presented in this thesis mainly concentrates on two interrelated topics in signal processing,viz. the Wavelet Transform based signal compression and the computation of Discrete Wavelet Transform. A new compression scheme is described in which the Pitch-Synchronous Wavelet Transform technique is combined with the popular linear Predictive Coding method for pseudo-periodic signal processing. Subsequently,A novel Parallel Multiple Subsequence structure is presented for the efficient computation of Wavelet Transform. Case studies also presented to highlight the potential applications.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

A general approach is presented for implementing discrete transforms as a set of first-order or second-order recursive digital filters. Clenshaw's recurrence formulae are used to formulate the second-order filters. The resulting structure is suitable for efficient implementation of discrete transforms in VLSI or FPGA circuits. The general approach is applied to the discrete Legendre transform as an illustration.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The Streaming SIMD extension (SSE) is a special feature embedded in the Intel Pentium III and IV classes of microprocessors. It enables the execution of SIMD type operations to exploit data parallelism. This article presents improving computation performance of a railway network simulator by means of SSE. Voltage and current at various points of the supply system to an electrified railway line are crucial for design, daily operation and planning. With computer simulation, their time-variations can be attained by solving a matrix equation, whose size mainly depends upon the number of trains present in the system. A large coefficient matrix, as a result of congested railway line, inevitably leads to heavier computational demand and hence jeopardizes the simulation speed. With the special architectural features of the latest processors on PC platforms, significant speed-up in computations can be achieved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The Streaming SIMD extension (SSE) is a special feature that is available in the Intel Pentium III and P4 classes of microprocessors. As its name implies, SSE enables the execution of SIMD (Single Instruction Multiple Data) operations upon 32-bit floating-point data therefore, performance of floating-point algorithms can be improved. In electrified railway system simulation, the computation involves the solving of a huge set of simultaneous linear equations, which represent the electrical characteristic of the railway network at a particular time-step and a fast solution for the equations is desirable in order to simulate the system in real-time. In this paper, we present how SSE is being applied to the railway network simulation.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A composite SaaS (Software as a Service) is a software that is comprised of several software components and data components. The composite SaaS placement problem is to determine where each of the components should be deployed in a cloud computing environment such that the performance of the composite SaaS is optimal. From the computational point of view, the composite SaaS placement problem is a large-scale combinatorial optimization problem. Thus, an Iterative Cooperative Co-evolutionary Genetic Algorithm (ICCGA) was proposed. The ICCGA can find reasonable quality of solutions. However, its computation time is noticeably slow. Aiming at improving the computation time, we propose an unsynchronized Parallel Cooperative Co-evolutionary Genetic Algorithm (PCCGA) in this paper. Experimental results have shown that the PCCGA not only has quicker computation time, but also generates better quality of solutions than the ICCGA.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

As computational models in fields such as medicine and engineering get more refined, resource requirements are increased. In a first instance, these needs have been satisfied using parallel computing and HPC clusters. However, such systems are often costly and lack flexibility. HPC users are therefore tempted to move to elastic HPC using cloud services. One difficulty in making this transition is that HPC and cloud systems are different, and performance may vary. The purpose of this study is to evaluate cloud services as a means to minimise both cost and computation time for large-scale simulations, and to identify which system properties have the most significant impact on performance. Our simulation results show that, while the performance of Virtual CPU (VCPU) is satisfactory, network throughput may lead to difficulties.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We consider the problem of deciding whether the output of a boolean circuit is determined by a partial assignment to its inputs. This problem is easily shown to be hard, i.e., co-Image Image -complete. However, many of the consequences of a partial input assignment may be determined in linear time, by iterating the following step: if we know the values of some inputs to a gate, we can deduce the values of some outputs of that gate. This process of iteratively deducing some of the consequences of a partial assignment is called propagation. This paper explores the parallel complexity of propagation, i.e., the complexity of determining whether the output of a given boolean circuit is determined by propagating a given partial input assignment. We give a complete classification of the problem into those cases that are Image -complete and those that are unlikely to be Image complete.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The paper presents two new algorithms for the direct parallel solution of systems of linear equations. The algorithms employ a novel recursive doubling technique to obtain solutions to an nth-order system in n steps with no more than 2n(n −1) processors. Comparing their performance with the Gaussian elimination algorithm (GE), we show that they are almost 100% faster than the latter. This speedup is achieved by dispensing with all the computation involved in the back-substitution phase of GE. It is also shown that the new algorithms exhibit error characteristics which are superior to GE. An n(n + 1) systolic array structure is proposed for the implementation of the new algorithms. We show that complete solutions can be obtained, through these single-phase solution methods, in 5n−log2n−4 computational steps, without the need for intermediate I/O operations.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Tridiagonal diagonally dominant linear systems arise in many scientific and engineering applications. The standard Thomas algorithm for solving such systems is inherently serial forming a bottleneck in computation. Algorithms such as cyclic reduction and SPIKE reduce a single large tridiagonal system into multiple small independent systems which can be solved in parallel. We have developed portable cyclic reduction and SPIKE algorithm OpenCL implementations with the intent to target a range of co-processors in a heterogeneous computing environment including Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs) and other multi-core processors. In this paper, we evaluate these designs in the context of solver performance, resource efficiency and numerical accuracy.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Although various strategies have been developed for scheduling parallel applications with independent tasks, very little work exists for scheduling tightly coupled parallel applications on cluster environments. In this paper, we compare four different strategies based on performance models of tightly coupled parallel applications for scheduling the applications on clusters. In addition to algorithms based on existing popular optimization techniques, we also propose a new algorithm called Box Elimination that searches the space of performance model parameters to determine the best schedule of machines. By means of real and simulation experiments, we evaluated the algorithms on single cluster and multi-cluster setups. We show that our Box Elimination algorithm generates up to 80% more efficient schedule than other algorithms. We also show that the execution times of the schedules produced by our algorithm are more robust against the performance modeling errors.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We consider single-source, single-sink multi-hop relay networks, with slow-fading Rayleigh fading links and single-antenna relay nodes operating under the half-duplex constraint. While two hop relay networks have been studied in great detail in terms of the diversity-multiplexing tradeoff (DMT), few results are available for more general networks. In this two-part paper, we identify two families of networks that are multi-hop generalizations of the two hop network: K-Parallel-Path (KPP) networks and Layered networks. In the first part, we initially consider KPP networks, which can be viewed as the union of K node-disjoint parallel paths, each of length > 1. The results are then generalized to KPP(I) networks, which permit interference between paths and to KPP(D) networks, which possess a direct link from source to sink. We characterize the optimal DMT of KPP(D) networks with K >= 4, and KPP(I) networks with K >= 3. Along the way, we derive lower bounds for the DMT of triangular channel matrices, which are useful in DMT computation of various protocols. As a special case, the DMT of two-hop relay network without direct link is obtained. Two key implications of the results in the two-part paper are that the half-duplex constraint does not necessarily entail rate loss by a factor of two, as previously believed and that, simple AF protocols are often sufficient to attain the best possible DMT.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper deals with the development of a new model for the cooling process on the runout table of hot strip mills, The suitability of different numerical methods for the solution of the proposed model equation from the point of view of accuracy and computation time are studied, Parallel solutions for the model equation are proposed.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Workstation clusters equipped with high performance interconnect having programmable network processors facilitate interesting opportunities to enhance the performance of parallel application run on them. In this paper, we propose schemes where certain application level processing in parallel database query execution is performed on the network processor. We evaluate the performance of TPC-H queries executing on a high end cluster where all tuple processing is done on the host processor, using a timed Petri net model, and find that tuple processing costs on the host processor dominate the execution time. These results are validated using a small cluster. We therefore propose 4 schemes where certain tuple processing activity is offloaded to the network processor. The first 2 schemes offload the tuple splitting activity - computation to identify the node on which to process the tuples, resulting in an execution time speedup of 1.09 relative to the base scheme, but with I/O bus becoming the bottleneck resource. In the 3rd scheme in addition to offloading tuple processing activity, the disk and network interface are combined to avoid the I/O bus bottleneck, which results in speedups up to 1.16, but with high host processor utilization. Our 4th scheme where the network processor also performs apart of join operation along with the host processor, gives a speedup of 1.47 along with balanced system resource utilizations. Further we observe that the proposed schemes perform equally well even in a scaled architecture i.e., when the number of processors is increased from 2 to 64

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper presents a decentralized/peer-to-peer architecture-based parallel version of the vector evaluated particle swarm optimization (VEPSO) algorithm for multi-objective design optimization of laminated composite plates using message passing interface (MPI). The design optimization of laminated composite plates being a combinatorially explosive constrained non-linear optimization problem (CNOP), with many design variables and a vast solution space, warrants the use of non-parametric and heuristic optimization algorithms like PSO. Optimization requires minimizing both the weight and cost of these composite plates, simultaneously, which renders the problem multi-objective. Hence VEPSO, a multi-objective variant of the PSO algorithm, is used. Despite the use of such a heuristic, the application problem, being computationally intensive, suffers from long execution times due to sequential computation. Hence, a parallel version of the PSO algorithm for the problem has been developed to run on several nodes of an IBM P720 cluster. The proposed parallel algorithm, using MPI's collective communication directives, establishes a peer-to-peer relationship between the constituent parallel processes, deviating from the more common master-slave approach, in achieving reduction of computation time by factor of up to 10. Finally we show the effectiveness of the proposed parallel algorithm by comparing it with a serial implementation of VEPSO and a parallel implementation of the vector evaluated genetic algorithm (VEGA) for the same design problem. (c) 2012 Elsevier Ltd. All rights reserved.