987 resultados para Software Transaction Memory
Resumo:
Software transactional memory (STM) has been proposed as a promising programming paradigm for shared memory multi-threaded programs as an alternative to conventional lock based synchronization primitives. Typical STM implementations employ a conflict detection scheme, which works with uniform access granularity, tracking shared data accesses either at word/cache line or at object level. It is well known that a single fixed access tracking granularity cannot meet the conflicting goals of reducing false conflicts without impacting concurrency adversely. A fine grained granularity while improving concurrency can have an adverse impact on performance due to lock aliasing, lock validation overheads, and additional cache pressure. On the other hand, a coarse grained granularity can impact performance due to reduced concurrency. Thus, in general, a fixed or uniform granularity access tracking (UGAT) scheme is application-unaware and rarely matches the access patterns of individual application or parts of an application, leading to sub-optimal performance for different parts of the application(s). In order to mitigate the disadvantages associated with UGAT scheme, we propose a Variable Granularity Access Tracking (VGAT) scheme in this paper. We propose a compiler based approach wherein the compiler uses inter-procedural whole program static analysis to select the access tracking granularity for different shared data structures of the application based on the application's data access pattern. We describe our prototype VGAT scheme, using TL2 as our STM implementation. Our experimental results reveal that VGAT-STM scheme can improve the application performance of STAMP benchmarks from 1.87% to up to 21.2%.
Resumo:
The recent trends of chip architectures with higher number of heterogeneous cores, and non-uniform memory/non-coherent caches, brings renewed attention to the use of Software Transactional Memory (STM) as a fundamental building block for developing parallel applications. Nevertheless, although STM promises to ease concurrent and parallel software development, it relies on the possibility of aborting conflicting transactions to maintain data consistency, which impacts on the responsiveness and timing guarantees required by embedded real-time systems. In these systems, contention delays must be (efficiently) limited so that the response times of tasks executing transactions are upper-bounded and task sets can be feasibly scheduled. In this paper we assess the use of STM in the development of embedded real-time software, defending that the amount of contention can be reduced if read-only transactions access recent consistent data snapshots, progressing in a wait-free manner. We show how the required number of versions of a shared object can be calculated for a set of tasks. We also outline an algorithm to manage conflicts between update transactions that prevents starvation.
Resumo:
Poster for IRP project 'Side-effects in Software Transactional Memory: Extending Deuce with TwilightSTM'
Resumo:
Transactional memory (TM) is a new synchronization mechanism devised to simplify parallel programming, thereby helping programmers to unleash the power of current multicore processors. Although software implementations of TM (STM) have been extensively analyzed in terms of runtime performance, little attention has been paid to an equally important constraint faced by nearly all computer systems: energy consumption. In this work we conduct a comprehensive study of energy and runtime tradeoff sin software transactional memory systems. We characterize the behavior of three state-of-the-art lock-based STM algorithms, along with three different conflict resolution schemes. As a result of this characterization, we propose a DVFS-based technique that can be integrated into the resolution policies so as to improve the energy-delay product (EDP). Experimental results show that our DVFS-enhanced policies are indeed beneficial for applications with high contention levels. Improvements of up to 59% in EDP can be observed in this scenario, with an average EDP reduction of 16% across the STAMP workloads. © 2012 IEEE.
Resumo:
Software transaction memory (STM) systems have been used as an approach to improve performance, by allowing the concurrent execution of atomic blocks. However, under high-contention workloads, STM-based systems can considerably degrade performance, as transaction conflict rate increases. Contention management policies have been used as a way to select which transaction to abort when a conflict occurs. In general, contention managers are not capable of avoiding conflicts, as they can only select which transaction to abort and the moment it should restart. Since contention managers act only after a conflict is detected, it becomes harder to effectively increase transaction throughput. More proactive approaches have emerged, aiming at predicting when a transaction is likely to abort, postponing its execution. Nevertheless, most of the proposed proactive techniques are limited, as they do not replace the doomed transaction by another or, when they do, they rely on the operating system for that, having little or no control on which transaction to run. This article proposes LUTS, a lightweight user-level transaction scheduler. Unlike other techniques, LUTS provides the means for selecting another transaction to run in parallel, thus improving system throughput. We discuss LUTS design and propose a dynamic conflict-avoidance heuristic built around its scheduling capabilities. Experimental results, conducted with the STAMP and STMBench7 benchmark suites, running on TinySTM and SwissTM, show how our conflict-avoidance heuristic can effectively improve STM performance on high contention applications. © 2012 Springer Science+Business Media, LLC.
Resumo:
Hardware vendors make an important effort creating low-power CPUs that keep battery duration and durability above acceptable levels. In order to achieve this goal and provide good performance-energy for a wide variety of applications, ARM designed the big.LITTLE architecture. This heterogeneous multi-core architecture features two different types of cores: big cores oriented to performance and little cores, slower and aimed to save energy consumption. As all the cores have access to the same memory, multi-threaded applications must resort to some mutual exclusion mechanism to coordinate the access to shared data by the concurrent threads. Transactional Memory (TM) represents an optimistic approach for shared-memory synchronization. To take full advantage of the features offered by software TM, but also benefit from the characteristics of the heterogeneous big.LITTLE architectures, our focus is to propose TM solutions that take into account the power/performance requirements of the application and what it is offered by the architecture. In order to understand the current state-of-the-art and obtain useful information for future power-aware software TM solutions, we have performed an analysis of a popular TM library running on top of an ARM big.LITTLE processor. Experiments show, in general, better scalability for the LITTLE cores for most of the applications except for one, which requires the computing performance that the big cores offer.
Resumo:
Software Transactional Memory (STM) systems have poor performance under high contention scenarios. Since many transactions compete for the same data, most of them are aborted, wasting processor runtime. Contention management policies are typically used to avoid that, but they are passive approaches as they wait for an abort to happen so they can take action. More proactive approaches have emerged, trying to predict when a transaction is likely to abort so its execution can be delayed. Such techniques are limited, as they do not replace the doomed transaction by another or, when they do, they rely on the operating system for that, having little or no control on which transaction should run. In this paper we propose LUTS, a Lightweight User-Level Transaction Scheduler, which is based on an execution context record mechanism. Unlike other techniques, LUTS provides the means for selecting another transaction to run in parallel, thus improving system throughput. Moreover, it avoids most of the issues caused by pseudo parallelism, as it only launches as many system-level threads as the number of available processor cores. We discuss LUTS design and present three conflict-avoidance heuristics built around LUTS scheduling capabilities. Experimental results, conducted with STMBench7 and STAMP benchmark suites, show LUTS efficiency when running high contention applications and how conflict-avoidance heuristics can improve STM performance even more. In fact, our transaction scheduling techniques are capable of improving program performance even in overloaded scenarios. © 2011 Springer-Verlag.
Resumo:
Software transactional memory has the potential to greatly simplify development of concurrent software, by supporting safe composition of concurrent shared-state abstractions. However, STM semantics are defined in terms of low-level reads and writes on individual memory locations, so implementations are unable to take advantage of the properties of user-defined abstractions. Consequently, the performance of transactions over some structures can be disappointing. ----- ----- We present Modular Transactional Memory, our framework which allows programmers to extend STM with concurrency control algorithms tailored to the data structures they use in concurrent programs. We describe our implementation in Concurrent Haskell, and two example structures: a finite map which allows concurrent transactions to operate on disjoint sets of keys, and a non-deterministic channel which supports concurrent sources and sinks. ----- ----- Our approach is based on previous work by others on boosted and open-nested transactions, with one significant development: transactions are given types which denote the concurrency control algorithms they employ. Typed transactions offer a higher level of assurance for programmers reusing transactional code, and allow more flexible abstract concurrency control.
Resumo:
Software transactional memory(STM) is a promising programming paradigm for shared memory multithreaded programs. While STM offers the promise of being less error-prone and more programmer friendly compared to traditional lock-based synchronization, it also needs to be competitive in performance in order for it to be adopted in mainstream software. A major source of performance overheads in STM is transactional aborts. Conflict resolution and aborting a transaction typically happens at the transaction level which has the advantage that it is automatic and application agnostic. However it has a substantial disadvantage in that STM declares the entire transaction as conflicting and hence aborts it and re-executes it fully, instead of partially re-executing only those part(s) of the transaction, which have been affected due to the conflict. This "Re-execute Everything" approach has a significant adverse impact on STM performance. In order to mitigate the abort overheads, we propose a compiler aided Selective Reconciliation STM (SR-STM) scheme, wherein certain transactional conflicts can be reconciled by performing partial re-execution of the transaction. Ours is a selective hybrid approach which uses compiler analysis to identify those data accesses which are legal and profitable candidates for reconciliation and applies partial re-execution only to these candidates selectively while other conflicting data accesses are handled by the default STM approach of abort and full re-execution. We describe the compiler analysis and code transformations required for supporting selective reconciliation. We find that SR-STM is effective in reducing the transactional abort overheads by improving the performance for a set of five STAMP benchmarks by 12.58% on an average and up to 22.34%.
Analyzing Cache Performance Bottlenecks of STM Applications and addressing them with Compiler's help
Resumo:
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs as an alternative to traditional lock based synchronization. However adoption of STM in mainstream software has been quite low due to its considerable overheads and its poor cache/memory performance. In this paper, we perform a detailed study of the cache behavior of STM applications and quantify the impact of different STM factors on the cache misses experienced by the applications. Based on our analysis, we propose a compiler driven Lock-Data Colocation (LDC), targeted at reducing the cache overheads on STM. We show that LDC is effective in improving the cache behavior of STM applications by reducing the dcache miss latency and improving execution time performance.
Resumo:
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs. In order for STMs to be adopted widely for performance critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial, as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date, of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with the data cache stall cycles contributing to more than 50% of the execution cycles in a majority of the benchmarks. We find that on an average, misses occurring inside the STM account for 62% of total data cache miss latency cycles experienced by the applications and the cache performance is impacted adversely due to certain inherent characteristics of the STM itself. The above observations motivate us to propose a set of specific compiler transformations targeted at making the STMs cache friendly. We find that STM's fine grained and application unaware locking is a major contributor to its poor cache behavior. Hence we propose selective Lock Data co-location (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint access parallel, suffer from costly coherence misses caused by the centralized global time stamp updates and hence we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications by reducing the data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the 8 STAMP applications.
Resumo:
Parallel computing on a network of workstations can saturate the communication network, leading to excessive message delays and consequently poor application performance. We examine empirically the consequences of integrating a flow control protocol, called Warp control [Par93], into Mermera, a software shared memory system that supports parallel computing on distributed systems [HS93]. For an asynchronous iterative program that solves a system of linear equations, our measurements show that Warp succeeds in stabilizing the network's behavior even under high levels of contention. As a result, the application achieves a higher effective communication throughput, and a reduced completion time. In some cases, however, Warp control does not achieve the performance attainable by fixed size buffering when using a statically optimal buffer size. Our use of Warp to regulate the allocation of network bandwidth emphasizes the possibility for integrating it with the allocation of other resources, such as CPU cycles and disk bandwidth, so as to optimize overall system throughput, and enable fully-shared execution of parallel programs.
Resumo:
Heutzutage haben selbst durchschnittliche Computersysteme mehrere unabhängige Recheneinheiten (Kerne). Wird ein rechenintensives Problem in mehrere Teilberechnungen unterteilt, können diese parallel und damit schneller verarbeitet werden. Obwohl die Entwicklung paralleler Programme mittels Abstraktionen vereinfacht werden kann, ist es selbst für Experten anspruchsvoll, effiziente und korrekte Programme zu schreiben. Während traditionelle Programmiersprachen auf einem eher geringen Abstraktionsniveau arbeiten, bieten funktionale Programmiersprachen wie z.B. Haskell, Möglichkeiten zur fortgeschrittenen Abstrahierung. Das Ziel der vorliegenden Dissertation war es, zu untersuchen, wie gut verschiedene Arten der Abstraktion das Programmieren mit Concurrent Haskell unterstützen. Concurrent Haskell ist eine Bibliothek für Haskell, die parallele Programmierung auf Systemen mit gemeinsamem Speicher ermöglicht. Im Mittelpunkt der Dissertation standen zwei Forschungsfragen. Erstens wurden verschiedene Synchronisierungsansätze verglichen, die sich in ihrem Abstraktionsgrad unterscheiden. Zweitens wurde untersucht, wie Abstraktionen verwendet werden können, um die Komplexität der Parallelisierung vor dem Entwickler zu verbergen. Bei dem Vergleich der Synchronisierungsansätze wurden Locks, Compare-and-Swap Operationen und Software Transactional Memory berücksichtigt. Die Ansätze wurden zunächst bezüglich ihrer Eignung für die Synchronisation einer Prioritätenwarteschlange auf Basis von Skiplists untersucht. Anschließend wurden verschiedene Varianten des Taskpool Entwurfsmusters implementiert (globale Taskpools sowie private Taskpools mit und ohne Taskdiebstahl). Zusätzlich wurde für das Entwurfsmuster eine Abstraktionsschicht entwickelt, welche eine einfache Formulierung von Taskpool-basierten Algorithmen erlaubt. Für die Untersuchung der Frage, ob Haskells Abstraktionsmethoden die Komplexität paralleler Programmierung verbergen können, wurden zunächst stencil-basierte Algorithmen betrachtet. Es wurde eine Bibliothek entwickelt, die eine deklarative Beschreibung von stencil-basierten Algorithmen sowie ihre parallele Ausführung erlaubt. Mit Hilfe dieses deklarativen Interfaces wurde die parallele Implementation vollständig vor dem Anwender verborgen. Anschließend wurde eine eingebettete domänenspezifische Sprache (EDSL) für Knoten-basierte Graphalgorithmen sowie eine entsprechende Ausführungsplattform entwickelt. Die Plattform erlaubt die automatische parallele Verarbeitung dieser Algorithmen. Verschiedene Beispiele zeigten, dass die EDSL eine knappe und dennoch verständliche Formulierung von Graphalgorithmen ermöglicht.
Resumo:
The growing demand for large-scale virtualization environments, such as the ones used in cloud computing, has led to a need for efficient management of computing resources. RAM memory is the one of the most required resources in these environments, and is usually the main factor limiting the number of virtual machines that can run on the physical host. Recently, hypervisors have brought mechanisms for transparent memory sharing between virtual machines in order to reduce the total demand for system memory. These mechanisms “merge” similar pages detected in multiple virtual machines into the same physical memory, using a copy-on-write mechanism in a manner that is transparent to the guest systems. The objective of this study is to present an overview of these mechanisms and also evaluate their performance and effectiveness. The results of two popular hypervisors (VMware and KVM) using different guest operating systems (Linux and Windows) and different workloads (synthetic and real) are presented herein. The results show significant performance differences between hypervisors according to the guest system workloads and execution time.
Resumo:
La memoria transaccional (TM) constituye un paradigma de concurrencia optimista en arquitecturas multinúcleo que puede ser de utilidad en la explotación de paralelismo en aplicaciones irregulares, en las que la información sobre las dependencias de datos no está disponible hasta la ejecución. Este trabajo presenta y discute cómo aprovechar las características de un sistema STM (software transactio- nal memory) en patrones de computación que involucren operaciones de reducción, ligadas frecuentemente a aplicaciones irregulares. Con el fin de comparar el uso de enfoques STM en esta clase de patrones con otras soluciones más clásicas, se ha implementa do como prueba de concepto un sistema STM, que denominaremos ReduxSTM, que combina dos ideas: una consolidación (commit) ordenada de las transacciones, que asegura una equivalencia con la ejecución secuencial del código; y una extensión del mecanismo de privatización subyacente al sistema STM que contempla las operaciones de reducción.