Biblioteca Digital

Task-based dataflow programming models and runtimes are emerging as promising candidates for programming multicore and manycore architectures. These programming models analyze task dependencies at runtime and schedule independent tasks concurrently on the processing elements. In such models, cache locality, which is critical for performance, becomes more challenging in the presence of fine-grain tasks and in architectures with many simple cores.

This paper presents a combined hardware-software approach to improve cache locality and offer better performance in terms of execution time and energy in the memory system. We propose the explicit bulk prefetcher (EBP) and epoch-based cache management (ECM) to help runtimes prefetch task data and guide the replacement decisions in caches. The runtime software can use this hardware support to expose its internal knowledge about the tasks to the architecture and achieve more efficient task-based execution. Our combined scheme outperforms HW-only prefetchers and state-of-the-art replacement policies, improves performance by an average of 17%, generates on average 26% fewer L2 misses, and consumes on average 28% less energy in the components of the memory system.
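
The runtime-side dependency analysis that task-based dataflow models rely on can be illustrated with a small sketch. This is not the runtime evaluated in the paper, and it does not model EBP or ECM; the function names `build_dependencies` and `schedule_waves`, and the representation of a task as a `(name, inputs, outputs)` tuple, are assumptions made purely for illustration. The sketch derives read-after-write, write-after-write, and write-after-read dependencies from the declared data of each task, then groups tasks into waves that could run concurrently.

```python
from collections import defaultdict


def build_dependencies(tasks):
    """Derive task dependencies from declared inputs and outputs.

    tasks: list of (name, inputs, outputs) tuples in program order
           (a hypothetical representation, not a real runtime API).
    Returns a dict mapping each task name to the set of tasks it must wait for.
    """
    last_writer = {}              # datum -> task that last wrote it
    readers = defaultdict(list)   # datum -> tasks that read it since the last write
    deps = {name: set() for name, _, _ in tasks}

    for name, inputs, outputs in tasks:
        for datum in inputs:
            if datum in last_writer:          # read-after-write
                deps[name].add(last_writer[datum])
        for datum in outputs:
            if datum in last_writer:          # write-after-write
                deps[name].add(last_writer[datum])
            deps[name].update(readers[datum]) # write-after-read
        for datum in inputs:
            readers[datum].append(name)
        for datum in outputs:
            last_writer[datum] = name
            readers[datum] = []
        deps[name].discard(name)
    return deps


def schedule_waves(deps):
    """Group tasks into waves; every task in a wave has all dependencies done."""
    done, waves = set(), []
    pending = dict(deps)
    while pending:
        ready = sorted(t for t, d in pending.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del pending[t]
    return waves


# Example: A and B are independent, C and D both need A's output,
# and E overwrites x only after its earlier readers (C and D) have finished.
tasks = [
    ("A", [], ["x"]),
    ("B", [], ["y"]),
    ("C", ["x", "y"], ["z"]),
    ("D", ["x"], ["w"]),
    ("E", ["z", "w"], ["x"]),
]
waves = schedule_waves(build_dependencies(tasks))
print(waves)
```

A real runtime would dispatch ready tasks as soon as their predecessors complete rather than in lockstep waves; the wave grouping is used here only to make the available concurrency visible.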

Kernel-Level Scheduling for the Nano-Threads Programming Model

A Quantitative Evaluation of Synchronization Algorithms and Disciplines on ccNUMA Systems: The Case of the SGI Origin2000

A Case for User-Level Page Migration

Is Data Distribution Necessary in OpenMP?

The Trade-Off Between Implicit and Explicit Data Distribution in Shared-Memory Programming Paradigms

Scaling Irregular Parallel Codes with Minimal Programming Effort

Multigrain Parallel Delaunay Mesh Generation: Challenges and Opportunities for Multithreaded Architectures

Online Power-Performance Adaptation of Multithreaded Programs using Event-Based Prediction

A parallel pattern for iterative stencil + reduce

Abstract:

We advocate the Loop-of-stencil-reduce pattern as a means of simplifying the implementation of data-parallel programs on heterogeneous multi-core platforms. Loop-of-stencil-reduce is general enough to subsume map, reduce, map-reduce, stencil, stencil-reduce, and, crucially, their usage in a loop in both data-parallel and streaming applications, or a combination of both. The pattern makes it possible to deploy a single stencil computation kernel on different GPUs. We discuss the implementation of Loop-of-stencil-reduce in FastFlow, a framework for implementing applications based on parallel patterns. Experiments are presented to illustrate the use of Loop-of-stencil-reduce in developing data-parallel kernels running on heterogeneous systems.
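
The shape of the pattern can be sketched sequentially as follows. This is not the FastFlow interface (FastFlow is a C++ framework, and its API is not reproduced here); the function `loop_of_stencil_reduce` and its parameters `stencil`, `reduce_op`, and `keep_going` are hypothetical names chosen for illustration. Each iteration applies a stencil to every element, reduces the result to a single value, and uses that value to decide whether to loop again.

```python
def loop_of_stencil_reduce(data, stencil, reduce_op, keep_going, max_iters=1000):
    """Sequential sketch of the Loop-of-stencil-reduce pattern.

    stencil(data, i)       -> new value of element i, computed from its neighbourhood
    reduce_op(new, old)    -> single value summarising the iteration
    keep_going(reduced)    -> True if another iteration is needed
    All parameter names are illustrative assumptions, not a library API.
    """
    iterations = 0
    for _ in range(max_iters):
        # Stencil phase: every element is independent, so this maps to a parallel kernel.
        new = [stencil(data, i) for i in range(len(data))]
        # Reduce phase: collapse the iteration into one value.
        reduced = reduce_op(new, data)
        data = new
        iterations += 1
        # Loop phase: the reduced value drives the termination condition.
        if not keep_going(reduced):
            break
    return data, iterations


def average_neighbours(data, i):
    """1D three-point stencil with fixed boundary values."""
    if i == 0 or i == len(data) - 1:
        return data[i]
    return (data[i - 1] + data[i + 1]) / 2.0


def max_change(new, old):
    """Reduction: largest absolute change between two iterations."""
    return max(abs(a - b) for a, b in zip(new, old))


# Example: relax a 1D rod with ends held at 0 and 100 until it stops changing.
result, iterations = loop_of_stencil_reduce(
    [0.0, 0.0, 0.0, 0.0, 100.0],
    average_neighbours,
    max_change,
    keep_going=lambda change: change > 1e-9,
)
print(iterations, [round(v, 3) for v in result])
```

In this formulation, a plain map corresponds to a stencil that reads only element `i`, and a single stencil-reduce corresponds to `max_iters=1`, which is one way to read the claim that the pattern subsumes the simpler ones.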

14 results for supercomputing

in QUB Research Portal - Research Directory and Institutional Repository for Queen's University Belfast
