3 resultados para distributed shared memory

em Glasgow Theses Service


Relevância:

80.00% 80.00%

Publicador:

Resumo:

Processors with large numbers of cores are becoming commonplace. In order to utilise the available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge. In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks, their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks, rather than threads. We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, only by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called “Global Sharing”, which improves performance in multiprogramming situations. We use OpenMP, as the most popular model for shared-memory parallel programming as the main GPRM competitor for solving three well-known problems on both platforms: LU factorisation of Sparse Matrices, Image Convolution, and Linked List Processing. We focus on proposing solutions that best fit into the GPRM’s model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for the LU Factorisation results in notable performance improvement for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM’s task creation and distribution for very short computations using the Image Convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for Linked List processing and performs better than OpenMP implementations on the Xeon Phi. The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without sacrificing productivity.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Cache-coherent non uniform memory access (ccNUMA) architecture is a standard design pattern for contemporary multicore processors, and future generations of architectures are likely to be NUMA. NUMA architectures create new challenges for managed runtime systems. Memory-intensive applications use the system’s distributed memory banks to allocate data, and the automatic memory manager collects garbage left in these memory banks. The garbage collector may need to access remote memory banks, which entails access latency overhead and potential bandwidth saturation for the interconnection between memory banks. This dissertation makes five significant contributions to garbage collection on NUMA systems, with a case study implementation using the Hotspot Java Virtual Machine. It empirically studies data locality for a Stop-The-World garbage collector when tracing connected objects in NUMA heaps. First, it identifies a locality richness which exists naturally in connected objects that contain a root object and its reachable set— ‘rooted sub-graphs’. Second, this dissertation leverages the locality characteristic of rooted sub-graphs to develop a new NUMA-aware garbage collection mechanism. A garbage collector thread processes a local root and its reachable set, which is likely to have a large number of objects in the same NUMA node. Third, a garbage collector thread steals references from sibling threads that run on the same NUMA node to improve data locality. This research evaluates the new NUMA-aware garbage collector using seven benchmarks of an established real-world DaCapo benchmark suite. In addition, evaluation involves a widely used SPECjbb benchmark and Neo4J graph database Java benchmark, as well as an artificial benchmark. The results of the NUMA-aware garbage collector on a multi-hop NUMA architecture show an average of 15% performance improvement. Furthermore, this performance gain is shown to be as a result of an improved NUMA memory access in a ccNUMA system. Fourth, the existing Hotspot JVM adaptive policy for configuring the number of garbage collection threads is shown to be suboptimal for current NUMA machines. The policy uses outdated assumptions and it generates a constant thread count. In fact, the Hotspot JVM still uses this policy in the production version. This research shows that the optimal number of garbage collection threads is application-specific and configuring the optimal number of garbage collection threads yields better collection throughput than the default policy. Fifth, this dissertation designs and implements a runtime technique, which involves heuristics from dynamic collection behavior to calculate an optimal number of garbage collector threads for each collection cycle. The results show an average of 21% improvements to the garbage collection performance for DaCapo benchmarks.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The Internet has grown in size at rapid rates since BGP records began, and continues to do so. This has raised concerns about the scalability of the current BGP routing system, as the routing state at each router in a shortest-path routing protocol will grow at a supra-linearly rate as the network grows. The concerns are that the memory capacity of routers will not be able to keep up with demands, and that the growth of the Internet will become ever more cramped as more and more of the world seeks the benefits of being connected. Compact routing schemes, where the routing state grows only sub-linearly relative to the growth of the network, could solve this problem and ensure that router memory would not be a bottleneck to Internet growth. These schemes trade away shortest-path routing for scalable memory state, by allowing some paths to have a certain amount of bounded “stretch”. The most promising such scheme is Cowen Routing, which can provide scalable, compact routing state for Internet routing, while still providing shortest-path routing to nearly all other nodes, with only slightly stretched paths to a very small subset of the network. Currently, there is no fully distributed form of Cowen Routing that would be practical for the Internet. This dissertation describes a fully distributed and compact protocol for Cowen routing, using the k-core graph decomposition. Previous compact routing work showed the k-core graph decomposition is useful for Cowen Routing on the Internet, but no distributed form existed. This dissertation gives a distributed k-core algorithm optimised to be efficient on dynamic graphs, along with with proofs of its correctness. The performance and efficiency of this distributed k-core algorithm is evaluated on large, Internet AS graphs, with excellent results. This dissertation then goes on to describe a fully distributed and compact Cowen Routing protocol. This protocol being comprised of a landmark selection process for Cowen Routing using the k-core algorithm, with mechanisms to ensure compact state at all times, including at bootstrap; a local cluster routing process, with mechanisms for policy application and control of cluster sizes, ensuring again that state can remain compact at all times; and a landmark routing process is described with a prioritisation mechanism for announcements that ensures compact state at all times.