Biblioteca Digital

205 resultados para parallel applications

em Indian Institute of Science - Bangalore - Índia

Grids with multiple batch systems for performance enhancement of multi-component and parameter sweep parallel applications

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this work, we evaluate the benefits of using Grids with multiple batch systems to improve the performance of multi-component and parameter sweep parallel applications by reduction in queue waiting times. Using different job traces of different loads, job distributions and queue waiting times corresponding to three different queuing policies(FCFS, conservative and EASY backfilling), we conducted a large number of experiments using simulators of two important classes of applications. The first simulator models Community Climate System Model (CCSM), a prominent multi-component application and the second simulator models parameter sweep applications. We compare the performance of the applications when executed on multiple batch systems and on a single batch system for different system and application configurations. We show that there are a large number of configurations for which application execution using multiple batch systems can give improved performance over execution on a single system.

A strategy for scheduling tightly coupled parallel applications on clusters

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Although various strategies have been developed for scheduling parallel applications with independent tasks, very little work exists for scheduling tightly coupled parallel applications on cluster environments. In this paper, we compare four different strategies based on performance models of tightly coupled parallel applications for scheduling the applications on clusters. In addition to algorithms based on existing popular optimization techniques, we also propose a new algorithm called Box Elimination that searches the space of performance model parameters to determine the best schedule of machines. By means of real and simulation experiments, we evaluated the algorithms on single cluster and multi-cluster setups. We show that our Box Elimination algorithm generates up to 80% more efficient schedule than other algorithms. We also show that the execution times of the schedules produced by our algorithm are more robust against the performance modeling errors.

Strategies for Rescheduling Tightly-Coupled Parallel Applications in Multi-Cluster Grids

Relevância:

100.00% 100.00%

Publicador:

Resumo:

As computational Grids are increasingly used for executing long running multi-phase parallel applications, it is important to develop efficient rescheduling frameworks that adapt application execution in response to resource and application dynamics. In this paper, three strategies or algorithms have been developed for deciding when and where to reschedule parallel applications that execute on multi-cluster Grids. The algorithms derive rescheduling plans that consist of potential points in application execution for rescheduling and schedules of resources for application execution between two consecutive rescheduling points. Using large number of simulations, it is shown that the rescheduling plans developed by the algorithms can lead to large decrease in application execution times when compared to executions without rescheduling on dynamic Grid resources. The rescheduling plans generated by the algorithms are also shown to be competitive when compared to the near-optimal plans generated by brute-force methods. Of the algorithms, genetic algorithm yielded the most efficient rescheduling plans with 9-12% smaller average execution times than the other algorithms.

Performance Modeling based on Multidimensional Surface Learning for Performance Predictions of Parallel Applications in Non-Dedicated Environments

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Modeling the performance behavior of parallel applications to predict the execution times of the applications for larger problem sizes and number of processors has been an active area of research for several years. The existing curve fitting strategies for performance modeling utilize data from experiments that are conducted under uniform loading conditions. Hence the accuracy of these models degrade when the load conditions on the machines and network change. In this paper, we analyze a curve fitting model that attempts to predict execution times for any load conditions that may exist on the systems during application execution. Based on the experiments conducted with the model for a parallel eigenvalue problem, we propose a multi-dimensional curve-fitting model based on rational polynomials for performance predictions of parallel applications in non-dedicated environments. We used the rational polynomial based model to predict execution times for 2 other parallel applications on systems with large load dynamics. In all the cases, the model gave good predictions of execution times with average percentage prediction errors of less than 20%

Middleware for Long-running Applications on Batch Grids. Student Research Symposium poster

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Computational grids with multiple batch systems (batch grids) can be powerful infrastructures for executing long-running multicomponent parallel applications. In this paper, we have constructed a middleware framework for executing such long-running applications spanning multiple submissions to the queues on multiple batch systems. We have used our framework for execution of a foremost long-running multi-component application for climate modeling, the Community Climate System Model (CCSM). Our framework coordinates the distribution, execution, migration and restart of the components of CCSM on the multiple queues where the component jobs of the different queues can have different queue waiting and startup times.

Adaptive Executions of Multi-Physics Coupled Applications on Batch Grids

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Long running multi-physics coupled parallel applications have gained prominence in recent years. The high computational requirements and long durations of simulations of these applications necessitate the use of multiple systems of a Grid for execution. In this paper, we have built an adaptive middleware framework for execution of long running multi-physics coupled applications across multiple batch systems of a Grid. Our framework, apart from coordinating the executions of the component jobs of an application on different batch systems, also automatically resubmits the jobs multiple times to the batch queues to continue and sustain long running executions. As the set of active batch systems available for execution changes, our framework performs migration and rescheduling of components using a robust rescheduling decision algorithm. We have used our framework for improving the application throughput of a foremost long running multi-component application for climate modeling, the Community Climate System Model (CCSM). Our real multi-site experiments with CCSM indicate that Grid executions can lead to improved application throughput for climate models.

Large improvements in application throughput of long-running multi-component applications using batch grids

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Computational grids with multiple batch systems (batch grids) can be powerful infrastructures for executing long-running multi-component parallel applications. In this paper, we evaluate the potential improvements in throughput of long-running multi-component applications when the different components of the applications are executed on multiple batch systems of batch grids. We compare the multiple batch executions with executions of the components on a single batch system without increasing the number of processors used for executions. We perform our analysis with a foremost long-running multi-component application for climate modeling, the Community Climate System Model (CCSM). We have built a robust simulator that models the characteristics of both the multi-component application and the batch systems. By conducting large number of simulations with different workload characteristics and queuing policies of the systems, processor allocations to components of the application, distributions of the components to the batch systems and inter-cluster bandwidths, we show that multiple batch executions lead to 55% average increase in throughput over single batch executions for long-running CCSM. We also conducted real experiments with a practical middleware infrastructure and showed that multi-site executions lead to effective utilization of batch systems for executions of CCSM and give higher simulation throughput than single-site executions. Copyright (c) 2011 John Wiley & Sons, Ltd.

Accelerating multi-core simulators

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Simulation is an important means of evaluating new microarchitectures. With the invention of multi-core (CMP) platforms, simulators are becoming larger and more complex. However, with the availability of CMPs with larger caches and higher operating frequency, the wall clock time required for simulating an application has become comparatively shorter. Reducing this simulation time further is a great challenge, especially in the case of multi-threaded workload due to indeterminacy introduced due to simultaneously executing various threads. In this paper, we propose a technique for speeding multi-core simulation. The model of the processor core and cache are replaced with functional models, to achieve speedup. A timed Petri net model is used to estimate the execution time of the processor and the memory access latencies are estimated using hit/miss information obtained from the functional model of the cache. This model can be used to predict performance of data parallel applications or multiprogramming workload on CMP platform with various cache hierarchies and shared bus interconnect. The error in estimation of the execution time of an application is within 6%. The speedup achieved ranges between an average of 2x--4x over the cycle accurate simulator.

Application-oriented adaptive MPI_Bcast for grids

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Due to the importance of collective communications in scientific parallel applications, many strategies have been devised for optimizing collective communications for different kinds of parallel environments. There has been an increasing interest to evolve efficient broadcast algorithms for computational grids. In this paper, we present application-oriented adaptive techniques that take into account resource characteristics as well as the application's usage of broadcasts for deriving efficient broadcast trees. In particular, we consider two broadcast parameters used in the application, namely, the broadcast message sizes and the time interval between the broadcasts. The results indicate that our adaptive strategies can provide 20% average improvement in performance over the popular MPICH-G2's MPI_Bcast implementation for loaded network conditions.

Geometric Representation of Graphs in Low Dimension Using Axis Parallel Boxes

Relevância:

30.00% 30.00%

Publicador:

Resumo:

An axis-parallel k-dimensional box is a Cartesian product R-1 x R-2 x...x R-k where R-i (for 1 <= i <= k) is a closed interval of the form [a(i), b(i)] on the real line. For a graph G, its boxicity box(G) is the minimum dimension k, such that G is representable as the intersection graph of (axis-parallel) boxes in k-dimensional space. The concept of boxicity finds applications in various areas such as ecology, operations research etc. A number of NP-hard problems are either polynomial time solvable or have much better approximation ratio on low boxicity graphs. For example, the max-clique problem is polynomial time solvable on bounded boxicity graphs and the maximum independent set problem for boxicity d graphs, given a box representation, has a left perpendicular1 + 1/c log n right perpendicular(d-1) approximation ratio for any constant c >= 1 when d >= 2. In most cases, the first step usually is computing a low dimensional box representation of the given graph. Deciding whether the boxicity of a graph is at most 2 itself is NP-hard. We give an efficient randomized algorithm to construct a box representation of any graph G on n vertices in left perpendicular(Delta + 2) ln nright perpendicular dimensions, where Delta is the maximum degree of G. This algorithm implies that box(G) <= left perpendicular(Delta + 2) ln nright perpendicular for any graph G. Our bound is tight up to a factor of ln n. We also show that our randomized algorithm can be derandomized to get a polynomial time deterministic algorithm. Though our general upper bound is in terms of maximum degree Delta, we show that for almost all graphs on n vertices, their boxicity is O(d(av) ln n) where d(av) is the average degree.

Framework for enabling highly available distributed applications for utility computing

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The move towards IT outsourcing is the first step towards an environment where compute infrastructure is treated as a service. In utility computing this IT service has to honor Service Level Agreements (SLA) in order to meet the desired Quality of Service (QoS) guarantees. Such an environment requires reliable services in order to maximize the utilization of the resources and to decrease the Total Cost of Ownership (TCO). Such reliability cannot come at the cost of resource duplication, since it increases the TCO of the data center and hence the cost per compute unit. We, in this paper, look into aspects of projecting impact of hardware failures on the SLAs and techniques required to take proactive recovery steps in case of a predicted failure. By maintaining health vectors of all hardware and system resources, we predict the failure probability of resources based on observed hardware errors/failure events, at runtime. This inturn influences an availability aware middleware to take proactive action (even before the application is affected in case the system and the application have low recoverability). The proposed framework has been prototyped on a system running HP-UX. Our offline analysis of the prediction system on hardware error logs indicate no more than 10% false positives. This work to the best of our knowledge is the first of its kind to perform an end-to-end analysis of the impact of a hardware fault on application SLAs, in a live system.

Parallel circuit partitioning on a reduced array architecture

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The physical design of a VLSI circuit involves circuit partitioning as a subtask. Typically, it is necessary to partition a large electrical circuit into several smaller circuits such that the total cross-wiring is minimized. This problem is a variant of the more general graph partitioning problem, and it is known that there does not exist a polynomial time algorithm to obtain an optimal partition. The heuristic procedure proposed by Kernighan and Lin1,2 requires O(n2 log2n) time to obtain a near-optimal two-way partition of a circuit with n modules. In the VLSI context, due to the large problem size involved, this computational requirement is unacceptably high. This paper is concerned with the hardware acceleration of the Kernighan-Lin procedure on an SIMD architecture. The proposed parallel partitioning algorithm requires O(n) processors, and has a time complexity of O(n log2n). In the proposed scheme, the reduced array architecture is employed with due considerations towards cost effectiveness and VLSI realizability of the architecture.The authors are not aware of any earlier attempts to parallelize a circuit partitioning algorithm in general or the Kernighan-Lin algorithm in particular. The use of the reduced array architecture is novel and opens up the possibilities of using this computing structure for several other applications in electronic design automation.

Analyzing Cache Performance Bottlenecks of STM Applications and addressing them with Compiler's help

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs as an alternative to traditional lock based synchronization. However adoption of STM in mainstream software has been quite low due to its considerable overheads and its poor cache/memory performance. In this paper, we perform a detailed study of the cache behavior of STM applications and quantify the impact of different STM factors on the cache misses experienced by the applications. Based on our analysis, we propose a compiler driven Lock-Data Colocation (LDC), targeted at reducing the cache overheads on STM. We show that LDC is effective in improving the cache behavior of STM applications by reducing the dcache miss latency and improving execution time performance.

Toward Effective Scalar Hardware for Highly Vectorizable Applications

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The performance of a program will ultimately be limited by its serial (scalar) portion, as pointed out by Amdahl′s Law. Reported studies thus far of instruction-level parallelism have mixed data-parallel program portions with scalar program portions, often leading to contradictory and controversial results. We report an instruction-level behavioral characterization of scalar code containing minimal data-parallelism, extracted from highly vectorized programs of the PERFECT benchmark suite running on a Cray Y-MP system. We classify scalar basic blocks according to their instruction mix, characterize the data dependencies seen in each class, and, as a first step, measure the maximum intrablock instruction-level parallelism available. We observe skewed rather than balanced instruction distributions in scalar code and in individual basic block classes of scalar code; nonuniform distribution of parallelism across instruction classes; and, as expected, limited available intrablock parallelism. We identify frequently occurring data-dependence patterns and discuss new instructions to reduce latency. Toward effective scalar hardware, we study latency-pipelining trade-offs and restricted multiple instruction issue mechanisms.

Parallel computing concepts and methods for floquet analysis of helicopter trim and stability

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Floquet analysis is widely used for small-order systems (say, order M < 100) to find trim results of control inputs and periodic responses, and stability results of damping levels and frequencies, Presently, however, it is practical neither for design applications nor for comprehensive analysis models that lead to large systems (M > 100); the run time on a sequential computer is simply prohibitive, Accordingly, a massively parallel Floquet analysis is developed with emphasis on large systems, and it is implemented on two SIMD or single-instruction, multiple-data computers with 4096 and 8192 processors, The focus of this development is a parallel shooting method with damped Newton iteration to generate trim results; the Floquet transition matrix (FTM) comes out as a byproduct, The eigenvalues and eigenvectors of the FTM are computed by a parallel QR method, and thereby stability results are generated, For illustration, flap and flap-lag stability of isolated rotors are treated by the parallel analysis and by a corresponding sequential analysis with the conventional shooting and QR methods; linear quasisteady airfoil aerodynamics and a finite-state three-dimensional wake model are used, Computational reliability is quantified by the condition numbers of the Jacobian matrices in Newton iteration, the condition numbers of the eigenvalues and the residual errors of the eigenpairs, and reliability figures are comparable in both the parallel and sequential analyses, Compared to the sequential analysis, the parallel analysis reduces the run time of large systems dramatically, and the reduction increases with increasing system order; this finding offers considerable promise for design and comprehensive-analysis applications.

«
1
2
3
4
5
6
7
8
...
13
14
»