980 resultados para parallel applications


Relevância:

70.00% 70.00%

Publicador:

Resumo:

We assert that companies can make more money and research institutions can improve their performance if inexpensive clusters and enterprise grids are exploited. In this paper, we have demonstrated that our claim is valid by showing the study of how programming environments, tools and middleware could be used for the execution of parallel and sequential applications, multiple parallel applications executing simultaneously on a non-dedicated cluster, and parallel applications on an enterprise grid and that the execution performance was improved. For this purpose an execution environment, and parallel and sequential benchmark applications selected for, and used in, the experiments were characterised.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Cloud computing is the most recent realisation of computing as a utility. Recently, fields with substantial computational requirements, e.g., biology, are turning to clouds for cheap, on-demand provisioning of resources. Of interest to this paper is the execution of compute intensive applications on hybrid clouds. If application requirements exceed private cloud resource capacity, clients require scaling down their applications. The outcome of this research is Web technology realising a new form of cloud called HPC Hybrid Deakin (H2D) Cloud -- an experimental hybrid cloud capable of utilising both local and remote computational services for single large embarrassingly parallel applications.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Cloud computing offers massive scalability and elasticity required by many scien-tific and commercial applications. Combining the computational and data handling capabilities of clouds with parallel processing also has the potential to tackle Big Data problems efficiently. Science gateway frameworks and workflow systems enable application developers to implement complex applications and make these available for end-users via simple graphical user interfaces. The integration of such frameworks with Big Data processing tools on the cloud opens new oppor-tunities for application developers. This paper investigates how workflow sys-tems and science gateways can be extended with Big Data processing capabilities. A generic approach based on infrastructure aware workflows is suggested and a proof of concept is implemented based on the WS-PGRADE/gUSE science gateway framework and its integration with the Hadoop parallel data processing solution based on the MapReduce paradigm in the cloud. The provided analysis demonstrates that the methods described to integrate Big Data processing with workflows and science gateways work well in different cloud infrastructures and application scenarios, and can be used to create massively parallel applications for scientific analysis of Big Data.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

As the complexity of parallel applications increase, the performance limitations resulting from computational load imbalance become dominant. Mapping the problem space to the processors in a parallel machine in a manner that balances the workload of each processors will typically reduce the run-time. In many cases the computation time required for a given calculation cannot be predetermined even at run-time and so static partition of the problem returns poor performance. For problems in which the computational load across the discretisation is dynamic and inhomogeneous, for example multi-physics problems involving fluid and solid mechanics with phase changes, the workload for a static subdomain will change over the course of a computation and cannot be estimated beforehand. For such applications the mapping of loads to process is required to change dynamically, at run-time in order to maintain reasonable efficiency. The issue of dynamic load balancing are examined in the context of PHYSICA, a three dimensional unstructured mesh multi-physics continuum mechanics computational modelling code.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Ordinary desktop computers continue to obtain ever more resources – in-creased processing power, memory, network speed and bandwidth – yet these resources spend much of their time underutilised. Cycle stealing frameworks harness these resources so they can be used for high-performance computing. Traditionally cycle stealing systems have used client-server based architectures which place significant limits on their ability to scale and the range of applica-tions they can support. By applying a fully decentralised network model to cycle stealing the limits of centralised models can be overcome. Using decentralised networks in this manner presents some difficulties which have not been encountered in their previous uses. Generally decentralised ap-plications do not require any significant fault tolerance guarantees. High-performance computing on the other hand requires very stringent guarantees to ensure correct results are obtained. Unfortunately mechanisms developed for traditional high-performance computing cannot be simply translated because of their reliance on a reliable storage mechanism. In the highly dynamic world of P2P computing this reliable storage is not available. As part of this research a fault tolerance system has been created which provides considerable reliability without the need for a persistent storage. As well as increased scalability, fully decentralised networks offer the ability for volunteers to communicate directly. This ability provides the possibility of supporting applications whose tasks require direct, message passing style communication. Previous cycle stealing systems have only supported embarrassingly parallel applications and applications with limited forms of communication so a new programming model has been developed which can support this style of communication within a cycle stealing context. In this thesis I present a fully decentralised cycle stealing framework. The framework addresses the problems of providing a reliable fault tolerance sys-tem and supporting direct communication between parallel tasks. The thesis includes a programming model for developing cycle stealing applications with direct inter-process communication and methods for optimising object locality on decentralised networks.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Simulation is an important means of evaluating new microarchitectures. With the invention of multi-core (CMP) platforms, simulators are becoming larger and more complex. However, with the availability of CMPs with larger caches and higher operating frequency, the wall clock time required for simulating an application has become comparatively shorter. Reducing this simulation time further is a great challenge, especially in the case of multi-threaded workload due to indeterminacy introduced due to simultaneously executing various threads. In this paper, we propose a technique for speeding multi-core simulation. The model of the processor core and cache are replaced with functional models, to achieve speedup. A timed Petri net model is used to estimate the execution time of the processor and the memory access latencies are estimated using hit/miss information obtained from the functional model of the cache. This model can be used to predict performance of data parallel applications or multiprogramming workload on CMP platform with various cache hierarchies and shared bus interconnect. The error in estimation of the execution time of an application is within 6%. The speedup achieved ranges between an average of 2x--4x over the cycle accurate simulator.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Due to the importance of collective communications in scientific parallel applications, many strategies have been devised for optimizing collective communications for different kinds of parallel environments. There has been an increasing interest to evolve efficient broadcast algorithms for computational grids. In this paper, we present application-oriented adaptive techniques that take into account resource characteristics as well as the application's usage of broadcasts for deriving efficient broadcast trees. In particular, we consider two broadcast parameters used in the application, namely, the broadcast message sizes and the time interval between the broadcasts. The results indicate that our adaptive strategies can provide 20% average improvement in performance over the popular MPICH-G2's MPI_Bcast implementation for loaded network conditions.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This thesis introduces the Named-State Register File, a fine-grain, fully-associative register file. The NSF allows fast context switching between concurrent threads as well as efficient sequential program performance. The NSF holds more live data than conventional register files, and requires less spill and reload traffic to switch between contexts. This thesis demonstrates an implementation of the Named-State Register File and estimates the access time and chip area required for different organizations. Architectural simulations of large sequential and parallel applications show that the NSF can reduce execution time by 9% to 17% compared to alternative register files.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

We propose a data flow based run time system as an efficient tool for supporting execution of parallel code on heterogeneous architectures hosting both multicore CPUs and GPUs. We discuss how the proposed run time system may be the target of both structured parallel applications developed using algorithmic skeletons/parallel design patterns and also more "domain specific" programming models. Experimental results demonstrating the feasibility of the approach are presented. © 2012 World Scientific Publishing Company.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Many scientific applications are programmed using hybrid programming models that use both message passing and shared memory, due to the increasing prevalence of large-scale systems with multicore, multisocket nodes. Previous work has shown that energy efficiency can be improved using software-controlled execution schemes that consider both the programming model and the power-aware execution capabilities of the system. However, such approaches have focused on identifying optimal resource utilization for one programming model, either shared memory or message passing, in isolation. The potential solution space, thus the challenge, increases substantially when optimizing hybrid models since the possible resource configurations increase exponentially. Nonetheless, with the accelerating adoption of hybrid programming models, we increasingly need improved energy efficiency in hybrid parallel applications on large-scale systems. In this work, we present new software-controlled execution schemes that consider the effects of dynamic concurrency throttling (DCT) and dynamic voltage and frequency scaling (DVFS) in the context of hybrid programming models. Specifically, we present predictive models and novel algorithms based on statistical analysis that anticipate application power and time requirements under different concurrency and frequency configurations. We apply our models and methods to the NPB MZ benchmarks and selected applications from the ASC Sequoia codes. Overall, we achieve substantial energy savings (8.74 percent on average and up to 13.8 percent) with some performance gain (up to 7.5 percent) or negligible performance loss.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Autonomic management can be used to improve the QoS provided by parallel/distributed applications. We discuss behavioural skeletons introduced in earlier work: rather than relying on programmer ability to design “from scratch” efficient autonomic policies, we encapsulate general autonomic controller features into algorithmic skeletons. Then we leave to the programmer the duty of specifying the parameters needed to specialise the skeletons to the needs of the particular application at hand. This results in the programmer having the ability to fast prototype and tune distributed/parallel applications with non-trivial autonomic management capabilities. We discuss how behavioural skeletons have been implemented in the framework of GCM(the Grid ComponentModel developed within the CoreGRID NoE and currently being implemented within the GridCOMP STREP project). We present results evaluating the overhead introduced by autonomic management activities as well as the overall behaviour of the skeletons. We also present results achieved with a long running application subject to autonomic management and dynamically adapting to changing features of the target architecture.
Overall the results demonstrate both the feasibility of implementing autonomic control via behavioural skeletons and the effectiveness of our sample behavioural skeletons in managing the “functional replication” pattern(s).

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The use of efficient synchronization mechanisms is crucial for implementing fine grained parallel programs on modern shared cache multi-core architectures. In this paper we study this problem by considering Single-Producer/Single- Consumer (SPSC) coordination using unbounded queues. A novel unbounded SPSC algorithm capable of reducing the row synchronization latency and speeding up Producer-Consumer coordination is presented. The algorithm has been extensively tested on a shared-cache multi-core platform and a sketch proof of correctness is presented. The queues proposed have been used as basic building blocks to implement the FastFlow parallel framework, which has been demonstrated to offer very good performance for fine-grain parallel applications. © 2012 Springer-Verlag.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper presents a scalable, statistical ‘black-box’ model for predicting the performance of parallel programs on multi-core non-uniform memory access (NUMA) systems. We derive a model with low overhead, by reducing data collection and model training time. The model can accurately predict the behaviour of parallel applications in response to changes in their concurrency, thread layout on NUMA nodes, and core voltage and frequency. We present a framework that applies the model to achieve significant energy and energy-delay-square (ED2) savings (9% and 25%, respectively) along with performance improvement (10% mean) on an actual 16-core NUMA system running realistic application workloads. Our prediction model proves substantially more accurate than previous efforts.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

We introduce and address the problem of concurrent autonomic management of different non-functional concerns in parallel applications build as a hierarchical composition of behavioural skeletons. We first define the problems arising when multiple concerns are dealt with by independent managers, then we propose a methodology supporting coordinated management, and finally we discuss how autonomic management of multiple concerns may be implemented in a typical use case. Being based on the behavioural skeleton concept proposed in the CoreGRID GCM, it is anticipated that the methodology will be readily integrated into the current reference implementation of GCM based on Java Pro Active and running on top of major grid middleware systems.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

In 2006 the Route load balancing algorithm was proposed and compared to other techniques aiming at optimizing the process allocation in grid environments. This algorithm schedules tasks of parallel applications considering computer neighborhoods (where the distance is defined by the network latency). Route presents good results for large environments, although there are cases where neighbors do not have an enough computational capacity nor communication system capable of serving the application. In those situations the Route migrates tasks until they stabilize in a grid area with enough resources. This migration may take long time what reduces the overall performance. In order to improve such stabilization time, this paper proposes RouteGA (Route with Genetic Algorithm support) which considers historical information on parallel application behavior and also the computer capacities and load to optimize the scheduling. This information is extracted by using monitors and summarized in a knowledge base used to quantify the occupation of tasks. Afterwards, such information is used to parameterize a genetic algorithm responsible for optimizing the task allocation. Results confirm that RouteGA outperforms the load balancing carried out by the original Route, which had previously outperformed others scheduling algorithms from literature.