158 resultados para Software Transactional Memory (STM)

em Indian Institute of Science - Bangalore - Índia


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Software transactional memory (STM) has been proposed as a promising programming paradigm for shared memory multi-threaded programs as an alternative to conventional lock based synchronization primitives. Typical STM implementations employ a conflict detection scheme, which works with uniform access granularity, tracking shared data accesses either at word/cache line or at object level. It is well known that a single fixed access tracking granularity cannot meet the conflicting goals of reducing false conflicts without impacting concurrency adversely. A fine grained granularity while improving concurrency can have an adverse impact on performance due to lock aliasing, lock validation overheads, and additional cache pressure. On the other hand, a coarse grained granularity can impact performance due to reduced concurrency. Thus, in general, a fixed or uniform granularity access tracking (UGAT) scheme is application-unaware and rarely matches the access patterns of individual application or parts of an application, leading to sub-optimal performance for different parts of the application(s). In order to mitigate the disadvantages associated with UGAT scheme, we propose a Variable Granularity Access Tracking (VGAT) scheme in this paper. We propose a compiler based approach wherein the compiler uses inter-procedural whole program static analysis to select the access tracking granularity for different shared data structures of the application based on the application's data access pattern. We describe our prototype VGAT scheme, using TL2 as our STM implementation. Our experimental results reveal that VGAT-STM scheme can improve the application performance of STAMP benchmarks from 1.87% to up to 21.2%.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs as an alternative to traditional lock based synchronization. However adoption of STM in mainstream software has been quite low due to its considerable overheads and its poor cache/memory performance. In this paper, we perform a detailed study of the cache behavior of STM applications and quantify the impact of different STM factors on the cache misses experienced by the applications. Based on our analysis, we propose a compiler driven Lock-Data Colocation (LDC), targeted at reducing the cache overheads on STM. We show that LDC is effective in improving the cache behavior of STM applications by reducing the dcache miss latency and improving execution time performance.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Software transactional memory(STM) is a promising programming paradigm for shared memory multithreaded programs. While STM offers the promise of being less error-prone and more programmer friendly compared to traditional lock-based synchronization, it also needs to be competitive in performance in order for it to be adopted in mainstream software. A major source of performance overheads in STM is transactional aborts. Conflict resolution and aborting a transaction typically happens at the transaction level which has the advantage that it is automatic and application agnostic. However it has a substantial disadvantage in that STM declares the entire transaction as conflicting and hence aborts it and re-executes it fully, instead of partially re-executing only those part(s) of the transaction, which have been affected due to the conflict. This "Re-execute Everything" approach has a significant adverse impact on STM performance. In order to mitigate the abort overheads, we propose a compiler aided Selective Reconciliation STM (SR-STM) scheme, wherein certain transactional conflicts can be reconciled by performing partial re-execution of the transaction. Ours is a selective hybrid approach which uses compiler analysis to identify those data accesses which are legal and profitable candidates for reconciliation and applies partial re-execution only to these candidates selectively while other conflicting data accesses are handled by the default STM approach of abort and full re-execution. We describe the compiler analysis and code transformations required for supporting selective reconciliation. We find that SR-STM is effective in reducing the transactional abort overheads by improving the performance for a set of five STAMP benchmarks by 12.58% on an average and up to 22.34%.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs. In order for STMs to be adopted widely for performance critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial, as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date, of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with the data cache stall cycles contributing to more than 50% of the execution cycles in a majority of the benchmarks. We find that on an average, misses occurring inside the STM account for 62% of total data cache miss latency cycles experienced by the applications and the cache performance is impacted adversely due to certain inherent characteristics of the STM itself. The above observations motivate us to propose a set of specific compiler transformations targeted at making the STMs cache friendly. We find that STM's fine grained and application unaware locking is a major contributor to its poor cache behavior. Hence we propose selective Lock Data co-location (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint access parallel, suffer from costly coherence misses caused by the centralized global time stamp updates and hence we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications by reducing the data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the 8 STAMP applications.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The Java Memory Model (JMM) provides a semantics of Java multithreading for any implementation platform. The JMM is defined in a declarative fashion with an allowed program execution being defined in terms of existence of "commit sequences" (roughly, the order in which actions in the execution are committed). In this work, we develop OpMM, an operational under-approximation of the JMM. The immediate motivation of this work lies in integrating a formal specification of the JMM with software model checkers. We show how our operational memory model description can be integrated into a Java Path Finder (JPF) style model checker for Java programs.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The literature contains many examples of digital procedures for the analytical treatment of electroencephalograms, but there is as yet no standard by which those techniques may be judged or compared. This paper proposes one method of generating an EEG, based on a computer program for Zetterberg's simulation. It is assumed that the statistical properties of an EEG may be represented by stationary processes having rational transfer functions and achieved by a system of software fillers and random number generators.The model represents neither the neurological mechanism response for generating the EEG, nor any particular type of EEG record; transient phenomena such as spikes, sharp waves and alpha bursts also are excluded. The basis of the program is a valid ‘partial’ statistical description of the EEG; that description is then used to produce a digital representation of a signal which if plotted sequentially, might or might not by chance resemble an EEG, that is unimportant. What is important is that the statistical properties of the series remain those of a real EEG; it is in this sense that the output is a simulation of the EEG. There is considerable flexibility in the form of the output, i.e. its alpha, beta and delta content, which may be selected by the user, the same selected parameters always producing the same statistical output. The filtered outputs from the random number sequences may be scaled to provide realistic power distributions in the accepted EEG frequency bands and then summed to create a digital output signal, the ‘stationary EEG’. It is suggested that the simulator might act as a test input to digital analytical techniques for the EEG, a simulator which would enable at least a substantial part of those techniques to be compared and assessed in an objective manner. The equations necessary to implement the model are given. The program has been run on a DEC1090 computer but is suitable for any microcomputer having more than 32 kBytes of memory; the execution time required to generate a 25 s simulated EEG is in the region of 15 s.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The design, implementation and evaluation are described of a dual-microcomputer system based on the concept of shared memory. Shared memory is useful for passing large blocks of data and it also provides a means to hold and work with shared data. In addition to the shared memory, a separate bus between the I/O ports of the microcomputers is provided. This bus is utilized for interprocessor synchronization. Software routines helpful in applying the dual-microcomputer system to realistic problems are presented. Performance evaluation of the system is carried out using benchmarks.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multi-core architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on modern Graphics Processing Units (GPUs), as they support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem - both scheduling and assignment of filters to processors - as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in GPU, to exploit data parallelism, and multiprocessors, to exploit task and pipelin parallelism. Further it takes into consideration the synchronization and bandwidth limitations of GPUs, and yields speedups between 1.87X and 36.83X over a single threaded CPU.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Today's SoCs are complex designs with multiple embedded processors, memory subsystems, and application specific peripherals. The memory architecture of embedded SoCs strongly influences the power and performance of the entire system. Further, the memory subsystem constitutes a major part (typically up to 70%) of the silicon area for the current day SoC. In this article, we address the on-chip memory architecture exploration for DSP processors which are organized as multiple memory banks, where banks can be single/dual ported with non-uniform bank sizes. In this paper we propose two different methods for physical memory architecture exploration and identify the strengths and applicability of these methods in a systematic way. Both methods address the memory architecture exploration for a given target application by considering the application's data access characteristics and generates a set of Pareto-optimal design points that are interesting from a power, performance and VLSI area perspective. To the best of our knowledge, this is the first comprehensive work on memory space exploration at physical memory level that integrates data layout and memory exploration to address the system objectives from both hardware design and application software development perspective. Further we propose an automatic framework that explores the design space identifying 100's of Pareto-optimal design points within a few hours of running on a standard desktop configuration.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Fast content addressable data access mechanisms have compelling applications in today's systems. Many of these exploit the powerful wildcard matching capabilities provided by ternary content addressable memories. For example, TCAM based implementations of important algorithms in data mining been developed in recent years; these achieve an an order of magnitude speedup over prevalent techniques. However, large hardware TCAMs are still prohibitively expensive in terms of power consumption and cost per bit. This has been a barrier to extending their exploitation beyond niche and special purpose systems. We propose an approach to overcome this barrier by extending the traditional virtual memory hierarchy to scale up the user visible capacity of TCAMs while mitigating the power consumption overhead. By exploiting the notion of content locality (as opposed to spatial locality), we devise a novel combination of software and hardware techniques to provide an abstraction of a large virtual ternary content addressable space. In the long run, such abstractions enable applications to disassociate considerations of spatial locality and contiguity from the way data is referenced. If successful, ideas for making content addressability a first class abstraction in computing systems can open up a radical shift in the way applications are optimized for memory locality, just as storage class memories are soon expected to shift away from the way in which applications are typically optimized for disk access locality.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Most Java programmers would agree that Java is a language that promotes a philosophy of “create and go forth”. By design, temporary objects are meant to be created on the heap, possibly used and then abandoned to be collected by the garbage collector. Excessive generation of temporary objects is termed “object churn” and is a form of software bloat that often leads to performance and memory problems. To mitigate this problem, many compiler optimizations aim at identifying objects that may be allocated on the stack. However, most such optimizations miss large opportunities for memory reuse when dealing with objects inside loops or when dealing with container objects. In this paper, we describe a novel algorithm that detects bloat caused by the creation of temporary container and String objects within a loop. Our analysis determines which objects created within a loop can be reused. Then we describe a source-to-source transformation that efficiently reuses such objects. Empirical evaluation indicates that our solution can reduce upto 40% of temporary object allocations in large programs, resulting in a performance improvement that can be as high as a 20% reduction in the run time, specifically when a program has a high churn rate or when the program is memory intensive and needs to run the GC often.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Exploiting the performance potential of GPUs requires managing the data transfers to and from them efficiently which is an error-prone and tedious task. In this paper, we develop a software coherence mechanism to fully automate all data transfers between the CPU and GPU without any assistance from the programmer. Our mechanism uses compiler analysis to identify potential stale accesses and uses a runtime to initiate transfers as necessary. This allows us to avoid redundant transfers that are exhibited by all other existing automatic memory management proposals. We integrate our automatic memory manager into the X10 compiler and runtime, and find that it not only results in smaller and simpler programs, but also eliminates redundant memory transfers. Tested on eight programs ported from the Rodinia benchmark suite it achieves (i) a 1.06x speedup over hand-tuned manual memory management, and (ii) a 1.29x speedup over another recently proposed compiler--runtime automatic memory management system. Compared to other existing runtime-only and compiler-only proposals, it also transfers 2.2x to 13.3x less data on average.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We study by means of experiments and Monte Carlo simulations, the scattering of light in random media, to determine the distance up to which photons travel along almost undeviated paths within a scattering medium, and are therefore capable of casting a shadow of an opaque inclusion embedded within the medium. Such photons are isolated by polarisation discrimination wherein the plane of linear polarisation of the input light is continuously rotated and the polarisation preserving component of the emerging light is extracted by means of a Fourier transform. This technique is a software implementation of lock-in detection. We find that images may be recovered to a depth far in excess of that predicted by the diffusion theory of photon propagation. To understand our experimental results, we perform Monte Carlo simulations to model the random walk behaviour of the multiply scattered photons. We present a. new definition of a diffusing photon in terms of the memory of its initial direction of propagation, which we then quantify in terms of an angular correlation function. This redefinition yields the penetration depth of the polarisation preserving photons. Based on these results, we have formulated a model to understand shadow formation in a turbid medium, the predictions of which are in good agreement with our experimental results.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

High end network security applications demand high speed operation and large rule set support. Packet classification is the core functionality that demands high throughput in such applications. This paper proposes a packet classification architecture to meet such high throughput. We have implemented a Firewall with this architecture in reconflgurable hardware. We propose an extension to Distributed Crossproducting of Field Labels (DCFL) technique to achieve scalable and high performance architecture. The implemented Firewall takes advantage of inherent structure and redundancy of rule set by using our DCFL Extended (DCFLE) algorithm. The use of DCFLE algorithm results in both speed and area improvement when it is implemented in hardware. Although we restrict ourselves to standard 5-tuple matching, the architecture supports additional fields. High throughput classification invariably uses Ternary Content Addressable Memory (TCAM) for prefix matching, though TCAM fares poorly in terms of area and power efficiency. Use of TCAM for port range matching is expensive, as the range to prefix conversion results in large number of prefixes leading to storage inefficiency. Extended TCAM (ETCAM) is fast and the most storage efficient solution for range matching. We present for the first time a reconfigurable hardware implementation of ETCAM. We have implemented our Firewall as an embedded system on Virtex-II Pro FPGA based platform, running Linux with the packet classification in hardware. The Firewall was tested in real time with 1 Gbps Ethernet link and 128 sample rules. The packet classification hardware uses a quarter of logic resources and slightly over one third of memory resources of XC2VP30 FPGA. It achieves a maximum classification throughput of 50 million packet/s corresponding to 16 Gbps link rate for the worst case packet size. The Firewall rule update involves only memory re-initialization in software without any hardware change.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We had earlier proposed a hypothesis to explain the mechanism of perpetuation of immunological memory based on the operation of idiotypic network in the complete absence of antigen. Experimental evidences were provided for memory maintenance through anti-idiotypic antibody (Ab2) carrying the internal image of the antigen. In the present work, we describe a structural basis for such memory perpetuation by molecular modeling and structural analysis studies. A three-dimensional model of Ab2 was generated and the structure of the antigenic site on the hemagglutinin protein H of Rinderpest virus was modeled using the structural template of hemagglutinin protein of Measles virus. Our results show that a large portion of heavy chain containing the CDR regions of Ab2 resembles the domain of the hemagglutinin housing the epitope regions. The similarity demonstrates that an internal image of the H antigen is formed in Ab2, which provides a structural basis for functional mimicry demonstrated earlier. This work brings out the importance of the structural similarity between a domain of hemagglutinin protein to that of its corresponding Ab2. It provides evidence that Ab2 is indeed capable of functioning as surrogate antigen and provides support to earlier proposed relay hypothesis which has provided a mechanism for the maintenance of immunological memory.