4 results for benchmarks
in Glasgow Theses Service
Abstract:
Processors with large numbers of cores are becoming commonplace. In order to utilise the available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge. In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks and their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks, rather than threads. We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases simply by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called “Global Sharing”, which improves performance in multiprogramming situations.
We use OpenMP, the most popular model for shared-memory parallel programming, as the main GPRM competitor for solving three well-known problems on both platforms: LU Factorisation of Sparse Matrices, Image Convolution, and Linked List Processing. We focus on proposing solutions that best fit GPRM’s model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for the LU Factorisation results in notable performance improvement for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM’s task creation and distribution for very short computations using the Image Convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for Linked List processing and performs better than OpenMP implementations on the Xeon Phi. The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without sacrificing productivity.
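The idea of combining smaller tasks into larger ones can be illustrated with a minimal sketch. This is not GPRM code: it is plain Python, and the function names (`convolve_rows`, `chunk_bounds`, `convolve_coarse`), the zero padding at the borders, and the row-wise grouping are all assumptions made for illustration. Instead of one task per row, the rows are grouped into at most `num_tasks` contiguous chunks, so the number of tasks is controlled directly.

```python
from concurrent.futures import ThreadPoolExecutor


def convolve_rows(image, kernel, row_start, row_end):
    """Convolve rows [row_start, row_end) of image with kernel (zero padding)."""
    height, width = len(image), len(image[0])
    k = len(kernel) // 2  # assumption: square kernel with odd side length
    out = []
    for r in range(row_start, row_end):
        row = []
        for c in range(width):
            acc = 0
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    rr, cc = r - dr, c - dc  # flipped indices: true convolution
                    if 0 <= rr < height and 0 <= cc < width:
                        acc += image[rr][cc] * kernel[dr + k][dc + k]
            row.append(acc)
        out.append(row)
    return out


def chunk_bounds(n_rows, num_tasks):
    """Split n_rows into at most num_tasks contiguous, near-equal ranges."""
    num_tasks = max(1, min(num_tasks, n_rows))
    base, extra = divmod(n_rows, num_tasks)
    bounds, start = [], 0
    for i in range(num_tasks):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def convolve_coarse(image, kernel, num_tasks):
    """Run one task per chunk of rows rather than one task per row."""
    bounds = chunk_bounds(len(image), num_tasks)
    with ThreadPoolExecutor(max_workers=len(bounds)) as pool:
        parts = list(pool.map(lambda b: convolve_rows(image, kernel, *b), bounds))
    return [row for part in parts for row in part]
```

The sketch only shows how the work is grouped. Python threads will not reproduce the speedups reported in the abstract, and the result is the same whatever value of `num_tasks` is chosen; only the number and size of tasks change.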
Abstract:
This PhD thesis contains three main chapters on macro finance, with a focus on the term structure of interest rates and the applications of state-of-the-art Bayesian econometrics. Except for Chapter 1 and Chapter 5, which set out the general introduction and conclusion, each of the chapters can be considered as a standalone piece of work. In Chapter 2, we model and predict the term structure of US interest rates in a data-rich environment. We allow the model dimension and parameters to change over time, accounting for model uncertainty and sudden structural changes. The proposed time-varying parameter Nelson-Siegel Dynamic Model Averaging (DMA) predicts yields better than standard benchmarks. DMA performs better since it incorporates more macro-finance information during recessions. The proposed method allows us to estimate plausible real-time term premia, whose countercyclicality weakened during the financial crisis. Chapter 3 investigates global term structure dynamics using a Bayesian hierarchical factor model augmented with macroeconomic fundamentals. More than half of the variation in the bond yields of seven advanced economies is due to global co-movement. Our results suggest that global inflation is the most important factor among global macro fundamentals. Non-fundamental factors are essential in driving global co-movements, and are closely related to sentiment and economic uncertainty. Lastly, we analyze asymmetric spillovers in global bond markets connected to diverging monetary policies. Chapter 4 proposes a no-arbitrage framework of term structure modeling with learning and model uncertainty. The representative agent considers parameter instability, as well as the uncertainty in learning speed and model restrictions. The empirical evidence shows that apart from observational variance, parameter instability is the dominant source of predictive variance when compared with uncertainty in learning speed or model restrictions.
When accounting for ambiguity aversion, the out-of-sample predictability of excess returns implied by the learning model can be translated into significant and consistent economic gains over the Expectations Hypothesis benchmark.
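For readers unfamiliar with the Nelson-Siegel curve named in the abstract, the following is a minimal sketch of its standard static three-factor form (level, slope, curvature). It does not reproduce the thesis's time-varying parameter or Dynamic Model Averaging machinery. The function names and the parameter values used in the usage note are illustrative assumptions, not estimates from the thesis.

```python
import math


def nelson_siegel_loadings(tau, lam):
    """Return (level, slope, curvature) factor loadings at maturity tau > 0."""
    x = lam * tau
    slope = -math.expm1(-x) / x  # (1 - e^{-x}) / x, numerically stable for small x
    curvature = slope - math.exp(-x)
    return 1.0, slope, curvature


def nelson_siegel_yield(tau, beta0, beta1, beta2, lam):
    """Yield at maturity tau given level, slope and curvature factors."""
    level, slope, curvature = nelson_siegel_loadings(tau, lam)
    return beta0 * level + beta1 * slope + beta2 * curvature
```

With hypothetical factors `beta0=0.04`, `beta1=-0.02`, `beta2=0.01`, the yield approaches `beta0 + beta1` at very short maturities and `beta0` at very long ones, which is why the first two factors are read as the long-run level and the (negative of the) slope of the curve.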
Abstract:
Cache-coherent non-uniform memory access (ccNUMA) architecture is a standard design pattern for contemporary multicore processors, and future generations of architectures are likely to be NUMA. NUMA architectures create new challenges for managed runtime systems. Memory-intensive applications use the system’s distributed memory banks to allocate data, and the automatic memory manager collects garbage left in these memory banks. The garbage collector may need to access remote memory banks, which entails access latency overhead and potential bandwidth saturation for the interconnection between memory banks. This dissertation makes five significant contributions to garbage collection on NUMA systems, with a case study implementation using the Hotspot Java Virtual Machine. It empirically studies data locality for a Stop-The-World garbage collector when tracing connected objects in NUMA heaps. First, it identifies a locality richness which exists naturally in connected objects that contain a root object and its reachable set: ‘rooted sub-graphs’. Second, this dissertation leverages the locality characteristic of rooted sub-graphs to develop a new NUMA-aware garbage collection mechanism. A garbage collector thread processes a local root and its reachable set, which is likely to have a large number of objects in the same NUMA node. Third, a garbage collector thread steals references from sibling threads that run on the same NUMA node to improve data locality. This research evaluates the new NUMA-aware garbage collector using seven benchmarks from the established real-world DaCapo benchmark suite. In addition, evaluation involves the widely used SPECjbb benchmark and a Neo4J graph database Java benchmark, as well as an artificial benchmark. The results of the NUMA-aware garbage collector on a multi-hop NUMA architecture show an average performance improvement of 15%.
Furthermore, this performance gain is shown to result from improved NUMA memory access in a ccNUMA system. Fourth, the existing Hotspot JVM adaptive policy for configuring the number of garbage collection threads is shown to be suboptimal for current NUMA machines. The policy uses outdated assumptions and generates a constant thread count. In fact, the Hotspot JVM still uses this policy in the production version. This research shows that the optimal number of garbage collection threads is application-specific and that configuring the optimal number of garbage collection threads yields better collection throughput than the default policy. Fifth, this dissertation designs and implements a runtime technique, which involves heuristics from dynamic collection behavior to calculate an optimal number of garbage collector threads for each collection cycle. The results show an average improvement of 21% in garbage collection performance for DaCapo benchmarks.
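The third contribution, stealing references only from sibling threads on the same NUMA node, can be sketched as follows. This is an illustrative Python model, not Hotspot code: the `Worker` class, the `steal_from_sibling` function, and the choice of the sibling with the longest queue as victim are all assumptions, since the abstract does not describe the victim-selection policy or what happens when no sibling has work.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical model of a GC thread pinned to one NUMA node."""
    worker_id: int
    numa_node: int
    queue: deque = field(default_factory=deque)  # pending object references


def steal_from_sibling(thief, workers):
    """Steal one reference from a worker on the thief's own NUMA node.

    Returns None when no same-node sibling has pending work; workers on
    other nodes are never considered, which keeps traced objects node-local.
    """
    siblings = [
        w for w in workers
        if w is not thief and w.numa_node == thief.numa_node and w.queue
    ]
    if not siblings:
        return None
    # Assumption: pick the sibling with the most pending references.
    victim = max(siblings, key=lambda w: len(w.queue))
    # Steal from the opposite end to the one an owner would typically pop from.
    return victim.queue.popleft()
```

In this model a thief on node 0 ignores a heavily loaded worker on node 1, trading some load balance for locality, which is the trade-off the abstract attributes the locality gain to.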