246 resultados para parallel architectures

em QUB Research Portal - Research Directory and Institutional Repository for Queen's University Belfast


Relevância:

60.00% 60.00%

Publicador:

Resumo:

Speeding up sequential programs on multicores is a challenging problem that is in urgent need of a solution. Automatic parallelization of irregular pointer-intensive codes, exempli?ed by the SPECint codes, is a very hard problem. This paper shows that, with a helping hand, such auto-parallelization is possible and fruitful. This paper makes the following contributions: (i) A compiler framework for extracting pipeline-like parallelism from outer program loops is presented. (ii) Using a light-weight programming model based on annotations, the programmer helps the compiler to ?nd thread-level parallelism. Each of the annotations speci?es only a small piece of semantic information that compiler analysis misses, e.g. stating that a variable is dead at a certain program point. The annotations are designed such that correctness is easily veri?ed. Furthermore, we present a tool for suggesting annotations to the programmer. (iii) The methodology is applied to autoparallelize several SPECint benchmarks. For the benchmark with most parallelism (hmmer), we obtain a scalable 7-fold speedup on an AMD quad-core dual processor. The annotations constitute a parallel programming model that relies extensively on a sequential program representation. Hereby, the complexity of debugging is not increased and it does not obscure the source code. These properties could prove valuable to increase the ef?ciency of parallel programming.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

We propose a dynamic verification approach for large-scale message passing programs to locate correctness bugs caused by unforeseen nondeterministic interactions. This approach hinges on an efficient protocol to track the causality between nondeterministic message receive operations and potentially matching send operations. We show that causality tracking protocols that rely solely on logical clocks fail to capture all nuances of MPI program behavior, including the variety of ways in which nonblocking calls can complete. Our approach is hinged on formally defining the matches-before relation underlying the MPI standard, and devising lazy update logical clock based algorithms that can correctly discover all potential outcomes of nondeterministic receives in practice. can achieve the same coverage as a vector clock based algorithm while maintaining good scalability. LLCP allows us to analyze realistic MPI programs involving a thousand MPI processes, incurring only modest overheads in terms of communication bandwidth, latency, and memory consumption. © 2011 IEEE.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Task dataflow languages simplify the specification of parallel programs by dynamically detecting and enforcing dependencies between tasks. These languages are, however, often restricted to a single level of parallelism. This language design is reflected in the runtime system, where a master thread explicitly generates a task graph and worker threads execute ready tasks and wake-up their dependents. Such an approach is incompatible with state-of-the-art schedulers such as the Cilk scheduler, that minimize the creation of idle tasks (work-first principle) and place all task creation and scheduling off the critical path. This paper proposes an extension to the Cilk scheduler in order to reconcile task dependencies with the work-first principle. We discuss the impact of task dependencies on the properties of the Cilk scheduler. Furthermore, we propose a low-overhead ticket-based technique for dependency tracking and enforcement at the object level. Our scheduler also supports renaming of objects in order to increase task-level parallelism. Renaming is implemented using versioned objects, a new type of hyper object. Experimental evaluation shows that the unified scheduler is as efficient as the Cilk scheduler when tasks have no dependencies. Moreover, the unified scheduler is more efficient than SMPSS, a particular implementation of a task dataflow language.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Energy efficiency is an essential requirement for all contemporary computing systems. We thus need tools to measure the energy consumption of computing systems and to understand how workloads affect it. Significant recent research effort has targeted direct power measurements on production computing systems using on-board sensors or external instruments. These direct methods have in turn guided studies of software techniques to reduce energy consumption via workload allocation and scaling. Unfortunately, direct energy measurements are hampered by the low power sampling frequency of power sensors. The coarse granularity of power sensing limits our understanding of how power is allocated in systems and our ability to optimize energy efficiency via workload allocation.
We present ALEA, a tool to measure power and energy consumption at the granularity of basic blocks, using a probabilistic approach. ALEA provides fine-grained energy profiling via sta- tistical sampling, which overcomes the limitations of power sens- ing instruments. Compared to state-of-the-art energy measurement tools, ALEA provides finer granularity without sacrificing accuracy. ALEA achieves low overhead energy measurements with mean error rates between 1.4% and 3.5% in 14 sequential and paral- lel benchmarks tested on both Intel and ARM platforms. The sampling method caps execution time overhead at approximately 1%. ALEA is thus suitable for online energy monitoring and optimization. Finally, ALEA is a user-space tool with a portable, machine-independent sampling method. We demonstrate two use cases of ALEA, where we reduce the energy consumption of a k-means computational kernel by 37% and an ocean modelling code by 33%, compared to high-performance execution baselines, by varying the power optimization strategy between basic blocks.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

How can GPU acceleration be obtained as a service in a cluster? This question has become increasingly significant due to the inefficiency of installing GPUs on all nodes of a cluster. The research reported in this paper is motivated to address the above question by employing rCUDA (remote CUDA), a framework that facilitates Acceleration-as-a-Service (AaaS), such that the nodes of a cluster can request the acceleration of a set of remote GPUs on demand. The rCUDA framework exploits virtualisation and ensures that multiple nodes can share the same GPU. In this paper we test the feasibility of the rCUDA framework on a real-world application employed in the financial risk industry that can benefit from AaaS in the production setting. The results confirm the feasibility of rCUDA and highlight that rCUDA achieves similar performance compared to CUDA, provides consistent results, and more importantly, allows for a single application to benefit from all the GPUs available in the cluster without loosing efficiency.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Graph analytics is an important and computationally demanding class of data analytics. It is essential to balance scalability, ease-of-use and high performance in large scale graph analytics. As such, it is necessary to hide the complexity of parallelism, data distribution and memory locality behind an abstract interface. The aim of this work is to build a scalable graph analytics framework that does not demand significant parallel programming experience based on NUMA-awareness.
The realization of such a system faces two key problems:
(i)~how to develop a scale-free parallel programming framework that scales efficiently across NUMA domains; (ii)~how to efficiently apply graph partitioning in order to create separate and largely independent work items that can be distributed among threads.