999 resultados para Hardware Transactional Memory
Resumo:
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs. In order for STMs to be adopted widely for performance critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial, as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date, of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with the data cache stall cycles contributing to more than 50% of the execution cycles in a majority of the benchmarks. We find that on an average, misses occurring inside the STM account for 62% of total data cache miss latency cycles experienced by the applications and the cache performance is impacted adversely due to certain inherent characteristics of the STM itself. The above observations motivate us to propose a set of specific compiler transformations targeted at making the STMs cache friendly. We find that STM's fine grained and application unaware locking is a major contributor to its poor cache behavior. Hence we propose selective Lock Data co-location (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint access parallel, suffer from costly coherence misses caused by the centralized global time stamp updates and hence we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications by reducing the data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the 8 STAMP applications.
Resumo:
Recent trends in computing systems, such as multi-core processors and cloud computing, expose tens to thousands of processors to the software. Software developers must respond by introducing parallelism in their software. To obtain highest performance, it is not only necessary to identify parallelism, but also to reason about synchronization between threads and the communication of data from one thread to another. This entry gives an overview on some of the most common abstractions that are used in parallel programming, namely explicit vs. implicit expression of parallelism and shared and distributed memory. Several parallel programming models are reviewed and categorized by means of these abstractions. The pros and cons of parallel programming models from the perspective of performance and programmability are discussed.
Resumo:
How can GPU acceleration be obtained as a service in a cluster? This question has become increasingly significant due to the inefficiency of installing GPUs on all nodes of a cluster. The research reported in this paper is motivated to address the above question by employing rCUDA (remote CUDA), a framework that facilitates Acceleration-as-a-Service (AaaS), such that the nodes of a cluster can request the acceleration of a set of remote GPUs on demand. The rCUDA framework exploits virtualisation and ensures that multiple nodes can share the same GPU. In this paper we test the feasibility of the rCUDA framework on a real-world application employed in the financial risk industry that can benefit from AaaS in the production setting. The results confirm the feasibility of rCUDA and highlight that rCUDA achieves similar performance compared to CUDA, provides consistent results, and more importantly, allows for a single application to benefit from all the GPUs available in the cluster without loosing efficiency.
Resumo:
Heutzutage haben selbst durchschnittliche Computersysteme mehrere unabhängige Recheneinheiten (Kerne). Wird ein rechenintensives Problem in mehrere Teilberechnungen unterteilt, können diese parallel und damit schneller verarbeitet werden. Obwohl die Entwicklung paralleler Programme mittels Abstraktionen vereinfacht werden kann, ist es selbst für Experten anspruchsvoll, effiziente und korrekte Programme zu schreiben. Während traditionelle Programmiersprachen auf einem eher geringen Abstraktionsniveau arbeiten, bieten funktionale Programmiersprachen wie z.B. Haskell, Möglichkeiten zur fortgeschrittenen Abstrahierung. Das Ziel der vorliegenden Dissertation war es, zu untersuchen, wie gut verschiedene Arten der Abstraktion das Programmieren mit Concurrent Haskell unterstützen. Concurrent Haskell ist eine Bibliothek für Haskell, die parallele Programmierung auf Systemen mit gemeinsamem Speicher ermöglicht. Im Mittelpunkt der Dissertation standen zwei Forschungsfragen. Erstens wurden verschiedene Synchronisierungsansätze verglichen, die sich in ihrem Abstraktionsgrad unterscheiden. Zweitens wurde untersucht, wie Abstraktionen verwendet werden können, um die Komplexität der Parallelisierung vor dem Entwickler zu verbergen. Bei dem Vergleich der Synchronisierungsansätze wurden Locks, Compare-and-Swap Operationen und Software Transactional Memory berücksichtigt. Die Ansätze wurden zunächst bezüglich ihrer Eignung für die Synchronisation einer Prioritätenwarteschlange auf Basis von Skiplists untersucht. Anschließend wurden verschiedene Varianten des Taskpool Entwurfsmusters implementiert (globale Taskpools sowie private Taskpools mit und ohne Taskdiebstahl). Zusätzlich wurde für das Entwurfsmuster eine Abstraktionsschicht entwickelt, welche eine einfache Formulierung von Taskpool-basierten Algorithmen erlaubt. Für die Untersuchung der Frage, ob Haskells Abstraktionsmethoden die Komplexität paralleler Programmierung verbergen können, wurden zunächst stencil-basierte Algorithmen betrachtet. Es wurde eine Bibliothek entwickelt, die eine deklarative Beschreibung von stencil-basierten Algorithmen sowie ihre parallele Ausführung erlaubt. Mit Hilfe dieses deklarativen Interfaces wurde die parallele Implementation vollständig vor dem Anwender verborgen. Anschließend wurde eine eingebettete domänenspezifische Sprache (EDSL) für Knoten-basierte Graphalgorithmen sowie eine entsprechende Ausführungsplattform entwickelt. Die Plattform erlaubt die automatische parallele Verarbeitung dieser Algorithmen. Verschiedene Beispiele zeigten, dass die EDSL eine knappe und dennoch verständliche Formulierung von Graphalgorithmen ermöglicht.
Resumo:
Software Transactional Memory (STM) systems have poor performance under high contention scenarios. Since many transactions compete for the same data, most of them are aborted, wasting processor runtime. Contention management policies are typically used to avoid that, but they are passive approaches as they wait for an abort to happen so they can take action. More proactive approaches have emerged, trying to predict when a transaction is likely to abort so its execution can be delayed. Such techniques are limited, as they do not replace the doomed transaction by another or, when they do, they rely on the operating system for that, having little or no control on which transaction should run. In this paper we propose LUTS, a Lightweight User-Level Transaction Scheduler, which is based on an execution context record mechanism. Unlike other techniques, LUTS provides the means for selecting another transaction to run in parallel, thus improving system throughput. Moreover, it avoids most of the issues caused by pseudo parallelism, as it only launches as many system-level threads as the number of available processor cores. We discuss LUTS design and present three conflict-avoidance heuristics built around LUTS scheduling capabilities. Experimental results, conducted with STMBench7 and STAMP benchmark suites, show LUTS efficiency when running high contention applications and how conflict-avoidance heuristics can improve STM performance even more. In fact, our transaction scheduling techniques are capable of improving program performance even in overloaded scenarios. © 2011 Springer-Verlag.
Resumo:
Software transaction memory (STM) systems have been used as an approach to improve performance, by allowing the concurrent execution of atomic blocks. However, under high-contention workloads, STM-based systems can considerably degrade performance, as transaction conflict rate increases. Contention management policies have been used as a way to select which transaction to abort when a conflict occurs. In general, contention managers are not capable of avoiding conflicts, as they can only select which transaction to abort and the moment it should restart. Since contention managers act only after a conflict is detected, it becomes harder to effectively increase transaction throughput. More proactive approaches have emerged, aiming at predicting when a transaction is likely to abort, postponing its execution. Nevertheless, most of the proposed proactive techniques are limited, as they do not replace the doomed transaction by another or, when they do, they rely on the operating system for that, having little or no control on which transaction to run. This article proposes LUTS, a lightweight user-level transaction scheduler. Unlike other techniques, LUTS provides the means for selecting another transaction to run in parallel, thus improving system throughput. We discuss LUTS design and propose a dynamic conflict-avoidance heuristic built around its scheduling capabilities. Experimental results, conducted with the STAMP and STMBench7 benchmark suites, running on TinySTM and SwissTM, show how our conflict-avoidance heuristic can effectively improve STM performance on high contention applications. © 2012 Springer Science+Business Media, LLC.
Resumo:
Pós-graduação em Ciência da Computação - IBILCE
Resumo:
Pós-graduação em Ciência da Computação - IBILCE
Resumo:
La memoria transaccional (TM) constituye un paradigma de concurrencia optimista en arquitecturas multinúcleo que puede ser de utilidad en la explotación de paralelismo en aplicaciones irregulares, en las que la información sobre las dependencias de datos no está disponible hasta la ejecución. Este trabajo presenta y discute cómo aprovechar las características de un sistema STM (software transactio- nal memory) en patrones de computación que involucren operaciones de reducción, ligadas frecuentemente a aplicaciones irregulares. Con el fin de comparar el uso de enfoques STM en esta clase de patrones con otras soluciones más clásicas, se ha implementa do como prueba de concepto un sistema STM, que denominaremos ReduxSTM, que combina dos ideas: una consolidación (commit) ordenada de las transacciones, que asegura una equivalencia con la ejecución secuencial del código; y una extensión del mecanismo de privatización subyacente al sistema STM que contempla las operaciones de reducción.
Resumo:
Generation of hardware architectures directly from dataflow representations is increasingly being considered as research moves toward system level design methodologies. Creation of networks of IP cores to implement actor functionality is a common approach to the problem, but often the memory sub-systems produced using these techniques are inefficiently utilised. This paper explores some of the issues in terms of memory organisation and accesses when developing systems from these high level representations. Using a template matching design study, challenges such as modelling memory reuse and minimising buffer requirements are examined, yielding results with significantly less memory requirements and costly off-chip memory accesses.
Resumo:
This paper presents a hardware solution for network flow processing at full line rate. Advanced memory architecture using DDR3 SDRAMs is proposed to cope with the flow match limitations in packet throughput, number of supported flows and number of packet header fields (or tuples) supported for flow identifications. The described architecture has been prototyped for accommodating 8 million flows, and tested on an FPGA platform achieving a minimum of 70 million lookups per second. This is sufficient to process internet traffic flows at 40 Gigabit Ethernet.
Resumo:
The memory hierarchy is the main bottleneck in modern computer systems as the gap between the speed of the processor and the memory continues to grow larger. The situation in embedded systems is even worse. The memory hierarchy consumes a large amount of chip area and energy, which are precious resources in embedded systems. Moreover, embedded systems have multiple design objectives such as performance, energy consumption, and area, etc. Customizing the memory hierarchy for specific applications is a very important way to take full advantage of limited resources to maximize the performance. However, the traditional custom memory hierarchy design methodologies are phase-ordered. They separate the application optimization from the memory hierarchy architecture design, which tend to result in local-optimal solutions. In traditional Hardware-Software co-design methodologies, much of the work has focused on utilizing reconfigurable logic to partition the computation. However, utilizing reconfigurable logic to perform the memory hierarchy design is seldom addressed. In this paper, we propose a new framework for designing memory hierarchy for embedded systems. The framework will take advantage of the flexible reconfigurable logic to customize the memory hierarchy for specific applications. It combines the application optimization and memory hierarchy design together to obtain a global-optimal solution. Using the framework, we performed a case study to design a new software-controlled instruction memory that showed promising potential.
Resumo:
In questa tesi sono stati apportati due importanti contributi nel campo degli acceleratori embedded many-core. Abbiamo implementato un runtime OpenMP ottimizzato per la gestione del tasking model per sistemi a processori strettamente accoppiati in cluster e poi interconnessi attraverso una network on chip. Ci siamo focalizzati sulla loro scalabilità e sul supporto di task di granularità fine, come è tipico nelle applicazioni embedded. Il secondo contributo di questa tesi è stata proporre una estensione del runtime di OpenMP che cerca di prevedere la manifestazione di errori dati da fenomeni di variability tramite una schedulazione efficiente del carico di lavoro.