134 resultados para Hardware Transactional Memory
em Indian Institute of Science - Bangalore - Índia
Resumo:
Software transactional memory (STM) has been proposed as a promising programming paradigm for shared memory multi-threaded programs as an alternative to conventional lock based synchronization primitives. Typical STM implementations employ a conflict detection scheme, which works with uniform access granularity, tracking shared data accesses either at word/cache line or at object level. It is well known that a single fixed access tracking granularity cannot meet the conflicting goals of reducing false conflicts without impacting concurrency adversely. A fine grained granularity while improving concurrency can have an adverse impact on performance due to lock aliasing, lock validation overheads, and additional cache pressure. On the other hand, a coarse grained granularity can impact performance due to reduced concurrency. Thus, in general, a fixed or uniform granularity access tracking (UGAT) scheme is application-unaware and rarely matches the access patterns of individual application or parts of an application, leading to sub-optimal performance for different parts of the application(s). In order to mitigate the disadvantages associated with UGAT scheme, we propose a Variable Granularity Access Tracking (VGAT) scheme in this paper. We propose a compiler based approach wherein the compiler uses inter-procedural whole program static analysis to select the access tracking granularity for different shared data structures of the application based on the application's data access pattern. We describe our prototype VGAT scheme, using TL2 as our STM implementation. Our experimental results reveal that VGAT-STM scheme can improve the application performance of STAMP benchmarks from 1.87% to up to 21.2%.
Resumo:
Software transactional memory(STM) is a promising programming paradigm for shared memory multithreaded programs. While STM offers the promise of being less error-prone and more programmer friendly compared to traditional lock-based synchronization, it also needs to be competitive in performance in order for it to be adopted in mainstream software. A major source of performance overheads in STM is transactional aborts. Conflict resolution and aborting a transaction typically happens at the transaction level which has the advantage that it is automatic and application agnostic. However it has a substantial disadvantage in that STM declares the entire transaction as conflicting and hence aborts it and re-executes it fully, instead of partially re-executing only those part(s) of the transaction, which have been affected due to the conflict. This "Re-execute Everything" approach has a significant adverse impact on STM performance. In order to mitigate the abort overheads, we propose a compiler aided Selective Reconciliation STM (SR-STM) scheme, wherein certain transactional conflicts can be reconciled by performing partial re-execution of the transaction. Ours is a selective hybrid approach which uses compiler analysis to identify those data accesses which are legal and profitable candidates for reconciliation and applies partial re-execution only to these candidates selectively while other conflicting data accesses are handled by the default STM approach of abort and full re-execution. We describe the compiler analysis and code transformations required for supporting selective reconciliation. We find that SR-STM is effective in reducing the transactional abort overheads by improving the performance for a set of five STAMP benchmarks by 12.58% on an average and up to 22.34%.
Analyzing Cache Performance Bottlenecks of STM Applications and addressing them with Compiler's help
Resumo:
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs as an alternative to traditional lock based synchronization. However adoption of STM in mainstream software has been quite low due to its considerable overheads and its poor cache/memory performance. In this paper, we perform a detailed study of the cache behavior of STM applications and quantify the impact of different STM factors on the cache misses experienced by the applications. Based on our analysis, we propose a compiler driven Lock-Data Colocation (LDC), targeted at reducing the cache overheads on STM. We show that LDC is effective in improving the cache behavior of STM applications by reducing the dcache miss latency and improving execution time performance.
Resumo:
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs. In order for STMs to be adopted widely for performance critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial, as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date, of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with the data cache stall cycles contributing to more than 50% of the execution cycles in a majority of the benchmarks. We find that on an average, misses occurring inside the STM account for 62% of total data cache miss latency cycles experienced by the applications and the cache performance is impacted adversely due to certain inherent characteristics of the STM itself. The above observations motivate us to propose a set of specific compiler transformations targeted at making the STMs cache friendly. We find that STM's fine grained and application unaware locking is a major contributor to its poor cache behavior. Hence we propose selective Lock Data co-location (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint access parallel, suffer from costly coherence misses caused by the centralized global time stamp updates and hence we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications by reducing the data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the 8 STAMP applications.
Resumo:
In this paper, the design and implementation of a single shared bus, shared memory multiprocessing system using Intel's single board computers is presented. The hardware configuration and the operating system developed to execute the parallel algorithms are discussed. The performance evaluation studies carried out on Image are outlined.
Resumo:
Memory models of shared memory concurrent programs define the values a read of a shared memory location is allowed to see. Such memory models are typically weaker than the intuitive sequential consistency semantics to allow efficient execution. In this paper, we present WOMM (abbreviation for Weak Operational Memory Model) that formally unifies two sources of weak behavior in hardware memory models: reordering of instructions and weakly consistent memory. We show that a large number of optimizations are allowed by WOMM. We also show that WOMM is weaker than a number of hardware memory models. Consequently, if a program behaves correctly under WOMM, it will be correct with respect to those hardware memory models. Hence, WOMM can be used as a formally specified abstraction of the hardware memory models. Moreover; unlike most weak memory models, WOMM is described using operational semantics, making it easy to integrate into a model checker for concurrent programs. We further show that WOMM has an important property - it has sequential consistency semantics for datarace-free programs.
Resumo:
Today's SoCs are complex designs with multiple embedded processors, memory subsystems, and application specific peripherals. The memory architecture of embedded SoCs strongly influences the power and performance of the entire system. Further, the memory subsystem constitutes a major part (typically up to 70%) of the silicon area for the current day SoC. In this article, we address the on-chip memory architecture exploration for DSP processors which are organized as multiple memory banks, where banks can be single/dual ported with non-uniform bank sizes. In this paper we propose two different methods for physical memory architecture exploration and identify the strengths and applicability of these methods in a systematic way. Both methods address the memory architecture exploration for a given target application by considering the application's data access characteristics and generates a set of Pareto-optimal design points that are interesting from a power, performance and VLSI area perspective. To the best of our knowledge, this is the first comprehensive work on memory space exploration at physical memory level that integrates data layout and memory exploration to address the system objectives from both hardware design and application software development perspective. Further we propose an automatic framework that explores the design space identifying 100's of Pareto-optimal design points within a few hours of running on a standard desktop configuration.
Resumo:
Fast content addressable data access mechanisms have compelling applications in today's systems. Many of these exploit the powerful wildcard matching capabilities provided by ternary content addressable memories. For example, TCAM based implementations of important algorithms in data mining been developed in recent years; these achieve an an order of magnitude speedup over prevalent techniques. However, large hardware TCAMs are still prohibitively expensive in terms of power consumption and cost per bit. This has been a barrier to extending their exploitation beyond niche and special purpose systems. We propose an approach to overcome this barrier by extending the traditional virtual memory hierarchy to scale up the user visible capacity of TCAMs while mitigating the power consumption overhead. By exploiting the notion of content locality (as opposed to spatial locality), we devise a novel combination of software and hardware techniques to provide an abstraction of a large virtual ternary content addressable space. In the long run, such abstractions enable applications to disassociate considerations of spatial locality and contiguity from the way data is referenced. If successful, ideas for making content addressability a first class abstraction in computing systems can open up a radical shift in the way applications are optimized for memory locality, just as storage class memories are soon expected to shift away from the way in which applications are typically optimized for disk access locality.
Resumo:
In this paper we propose a fully parallel 64K point radix-4(4) FFT processor. The radix-4(4) parallel unrolled architecture uses a novel radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. The radix-4(4) block can take all 256 inputs in parallel and can use the select control signals to generate one out of the 256 outputs. The resultant 64K point FFT processor shows significant reduction in intermediate memory but with increased hardware complexity. Compared to the state-of-art implementation 5], our architecture shows reduced latency with comparable throughput and area. The 64K point FFT architecture was synthesized using a 130nm CMOS technology which resulted in a throughput of 1.4 GSPS and latency of 47.7 mu s with a maximum clock frequency of 350MHz. When compared to 5], the latency is reduced by 303 mu s with 50.8% reduction in area.
Resumo:
High end network security applications demand high speed operation and large rule set support. Packet classification is the core functionality that demands high throughput in such applications. This paper proposes a packet classification architecture to meet such high throughput. We have implemented a Firewall with this architecture in reconflgurable hardware. We propose an extension to Distributed Crossproducting of Field Labels (DCFL) technique to achieve scalable and high performance architecture. The implemented Firewall takes advantage of inherent structure and redundancy of rule set by using our DCFL Extended (DCFLE) algorithm. The use of DCFLE algorithm results in both speed and area improvement when it is implemented in hardware. Although we restrict ourselves to standard 5-tuple matching, the architecture supports additional fields. High throughput classification invariably uses Ternary Content Addressable Memory (TCAM) for prefix matching, though TCAM fares poorly in terms of area and power efficiency. Use of TCAM for port range matching is expensive, as the range to prefix conversion results in large number of prefixes leading to storage inefficiency. Extended TCAM (ETCAM) is fast and the most storage efficient solution for range matching. We present for the first time a reconfigurable hardware implementation of ETCAM. We have implemented our Firewall as an embedded system on Virtex-II Pro FPGA based platform, running Linux with the packet classification in hardware. The Firewall was tested in real time with 1 Gbps Ethernet link and 128 sample rules. The packet classification hardware uses a quarter of logic resources and slightly over one third of memory resources of XC2VP30 FPGA. It achieves a maximum classification throughput of 50 million packet/s corresponding to 16 Gbps link rate for the worst case packet size. The Firewall rule update involves only memory re-initialization in software without any hardware change.
Resumo:
We had earlier proposed a hypothesis to explain the mechanism of perpetuation of immunological memory based on the operation of idiotypic network in the complete absence of antigen. Experimental evidences were provided for memory maintenance through anti-idiotypic antibody (Ab2) carrying the internal image of the antigen. In the present work, we describe a structural basis for such memory perpetuation by molecular modeling and structural analysis studies. A three-dimensional model of Ab2 was generated and the structure of the antigenic site on the hemagglutinin protein H of Rinderpest virus was modeled using the structural template of hemagglutinin protein of Measles virus. Our results show that a large portion of heavy chain containing the CDR regions of Ab2 resembles the domain of the hemagglutinin housing the epitope regions. The similarity demonstrates that an internal image of the H antigen is formed in Ab2, which provides a structural basis for functional mimicry demonstrated earlier. This work brings out the importance of the structural similarity between a domain of hemagglutinin protein to that of its corresponding Ab2. It provides evidence that Ab2 is indeed capable of functioning as surrogate antigen and provides support to earlier proposed relay hypothesis which has provided a mechanism for the maintenance of immunological memory.
Resumo:
High end network security applications demand high speed operation and large rule set support. Packet classification is the core functionality that demands high throughput in such applications. This paper proposes a packet classification architecture to meet such high throughput. We have Implemented a Firewall with this architecture in reconfigurable hardware. We propose an extension to Distributed Crossproducting of Field Labels (DCFL) technique to achieve scalable and high performance architecture. The implemented Firewall takes advantage of inherent structure and redundancy of rule set by using, our DCFL Extended (DCFLE) algorithm. The use of DCFLE algorithm results In both speed and area Improvement when It is Implemented in hardware. Although we restrict ourselves to standard 5-tuple matching, the architecture supports additional fields.High throughput classification Invariably uses Ternary Content Addressable Memory (TCAM) for prefix matching, though TCAM fares poorly In terms of area and power efficiency. Use of TCAM for port range matching is expensive, as the range to prefix conversion results in large number of prefixes leading to storage inefficiency. Extended TCAM (ETCAM) is fast and the most storage efficient solution for range matching. We present for the first time a reconfigurable hardware Implementation of ETCAM. We have implemented our Firewall as an embedded system on Virtex-II Pro FPGA based platform, running Linux with the packet classification in hardware. The Firewall was tested in real time with 1 Gbps Ethernet link and 128 sample rules. The packet classification hardware uses a quarter of logic resources and slightly over one third of memory resources of XC2VP30 FPGA. It achieves a maximum classification throughput of 50 million packet/s corresponding to 16 Gbps link rate for file worst case packet size. The Firewall rule update Involves only memory re-initialiization in software without any hardware change.
Resumo:
We had earlier proposed a hypothesis to explain the mechanism of perpetuation of immunological memory based on the operation of idiotypic network in the complete absence of antigen. Experimental evidences were provided for memory maintenance through anti-idiotypic antibody (Ab(2)) carrying the internal image of the antigen. In the present work, we describe a structural basis for such memory perpetuation by molecular modeling and structural analysis studies. A three-dimensional model of Ab(2) was generated and the structure of the antigenic site on the hemagglutinin protein H of Rinderpest virus was modeled using the structural template of hemagglutinin protein of Measles virus. Our results show that a large portion of heavy chain containing the CDR regions of Ab(2) resembles the domain of the hemagglutinin housing the epitope regions. The similarity demonstrates that an internal image of the H antigen is formed in Ab(2), which provides a structural basis for functional mimicry demonstrated earlier. This work brings out the importance of the structural similarity between a domain of hemagglutinin protein to that of its corresponding Ab(2). It provides evidence that Ab(2) is indeed capable of functioning as surrogate antigen and provides support to earlier proposed relay hypothesis which has provided a mechanism for the maintenance of immunological memory.
Resumo:
Using the promeasure technique, we give an alternative evaluation of a path integral corresponding to a quadratic action with a generalized memory.
Resumo:
A constitutive modeling approach for shape memory alloy (SMA) wire by taking into account the microstructural phase inhomogeneity and the associated solid-solid phase transformation kinetics is reported in this paper. The approach is applicable to general thermomechanical loading. Characterization of various scales in the non-local rate sensitive kinetics is the main focus of this paper. Design of SMA materials and actuators not only involve an optimal exploitation of the hysteresis loops during loading-unloading, but also accounts for fatigue and training cycle identifications. For a successful design of SMA integrated actuator systems, it is essential to include the microstructural inhomogeneity effects and the loading rate dependence of the martensitic evolution, since these factors play predominant role in fatigue. In the proposed formulation, the evolution of new phase is assumed according to Weibull distribution. Fourier transformation and finite difference methods are applied to arrive at the analytical form of two important scaling parameters. The ratio of these scaling parameters is of the order of 10(6) for stress-free temperature-induced transformation and 10(4) for stress-induced transformation. These scaling parameters are used in order to study the effect of microstructural variation on the thermo-mechanical force and interface driving force. It is observed that the interface driving force is significant during the evolution. Increase in the slopes of the transformation start and end regions in the stress-strain hysteresis loop is observed for mechanical loading with higher rates.