44 resultados para forced execution of obligations


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Relentless CMOS scaling coupled with lower design tolerances is making ICs increasingly susceptible to wear-out related permanent faults and transient faults, necessitating on-chip fault tolerance in future chip microprocessors (CMPs). In this paper we introduce a new energy-efficient fault-tolerant CMP architecture known as Redundant Execution using Critical Value Forwarding (RECVF). RECVF is based on two observations: (i) forwarding critical instruction results from the leading to the trailing core enables the latter to execute faster, and (ii) this speedup can be exploited to reduce energy consumption by operating the trailing core at a lower voltage-frequency level. Our evaluation shows that RECVF consumes 37% less energy than conventional dual modular redundant (DMR) execution of a program. It consumes only 1.26 times the energy of a non-fault-tolerant baseline and has a performance overhead of just 1.2%.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Very Long Instruction Word (VLIW) architectures exploit instruction level parallelism (ILP) with the help of the compiler to achieve higher instruction throughput with minimal hardware. However, control and data dependencies between operations limit the available ILP, which not only hinders the scalability of VLIW architectures, but also result in code size expansion. Although speculation and predicated execution mitigate ILP limitations due to control dependencies to a certain extent, they increase hardware cost and exacerbate code size expansion. Simultaneous multistreaming (SMS) can significantly improve operation throughput by allowing interleaved execution of operations from multiple instruction streams. In this paper we study SMS for VLIW architectures and quantify the benefits associated with it using a case study of the MPEG-2 video decoder. We also propose the notion of virtual resources for VLIW architectures, which decouple architectural resources (resources exposed to the compiler) from the microarchitectural resources, to limit code size expansion. Our results for a VLIW architecture demonstrate that: (1) SMS delivers much higher throughput than that achieved by speculation and predicated execution, (2) the increase in performance due to the addition of speculation and predicated execution support over SMS averages around 12%. The minor increase in performance might not warrant the additional hardware complexity involved, and (3) the notion of virtual resources is very effective in reducing no-operations (NOPs) and consequently reduce code size with little or no impact on performance.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Computational grids with multiple batch systems (batch grids) can be powerful infrastructures for executing long-running multicomponent parallel applications. In this paper, we have constructed a middleware framework for executing such long-running applications spanning multiple submissions to the queues on multiple batch systems. We have used our framework for execution of a foremost long-running multi-component application for climate modeling, the Community Climate System Model (CCSM). Our framework coordinates the distribution, execution, migration and restart of the components of CCSM on the multiple queues where the component jobs of the different queues can have different queue waiting and startup times.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The origin of hydrodynamic turbulence in rotating shear flows is investigated, with particular emphasis on the flows whose angular velocity decreases but whose specific angular momentum increases with the increasing radial coordinate. Such flows are Rayleigh stable, but must be turbulent in order to explain the observed data. Such a mismatch between the linear theory and the observations/experiments is more severe when any hydromagnetic/magnetohydrodynamic instability and then the corresponding turbulence therein is ruled out. This work explores the effect of stochastic noise on such hydrodynamic flows. We essentially concentrate on a small section of such a flow, which is nothing but a plane shear flow supplemented by the Coriolis effect. This also mimics a small section of an astrophysical accretion disc. It is found that such stochastically driven flows exhibit large temporal and spatial correlations of perturbation velocities and hence large energy dissipations of perturbation, which presumably generate the instability. A range of angular velocity (Omega) profiles of the background flow, starting from that of a constant specific angular momentum (lambda = Omega r(2); r being the radial coordinate) to a constant circular velocity (v(phi) = Omega r), is explored. However, all the background angular velocities exhibit identical growth and roughness exponents of their perturbations, revealing a unique universality class for the stochastically forced hydrodynamics of rotating shear flows. This work, to the best of our knowledge, is the first attempt to understand the origin of instability and turbulence in three-dimensional Rayleigh stable rotating shear flows by introducing additive noise to the underlying linearized governing equations. This has important implications to resolve the turbulence problem in astrophysical hydrodynamic flows such as accretion discs.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Muscle development is a multistep process which includes myoblast diversification, proliferation, migration, fusion, differentiation and growth. A hierarchical exhibition of myogenic factors is important for dexterous execution of progressive events in muscle formation. EWG (erect wing) is a transcription factor known to have a role in indirect flight muscle development (IFM) in Drosophila. We marked out the precise spatio-temporal expression profile of EWG in the myoblasts, and in the developing muscles. Mutant adult flies null for EWG in myoblasts show variable number of IFM, suggesting that EWG is required for patterning of the IFM. The remnant muscle found in the EWG null flies show proper assembly of the structural proteins, which implies that some myoblasts manage to fuse, develop and differentiate normally indicating that EWG is not required for differentiation program per se. However, when EWG expression is extended beyond its expression window in a wild type background, muscle thinning is observed implying EWG function in protein synthesis inhibition. Mis-expression studies in wing disc myoblasts hinted at its role in myoblast proliferation. We thus conclude that EWG is important for regulating fusion events which in turn decides the IFM pattern. Also IFM in EWG null mutants show clumps containing broken fibres and an altered mitochondrial morphology. The vertebrate homolog of EWG is nuclear respiratory factor1 (NRF1) which is known to have a function in mitochondrial biogenesis and protection against oxidative stress. Gene expression for inner mitochondrial membrane protein, Opa1-like was found to be absent in these mutants. Also, these flies were more sensitive to oxidative stress, indicating a compromised mitochondrial functioning. Our results therefore demonstrate that EWG functions in maintaining muscles’ structural integrity by ensuing proper mitochondrial activity.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Pervasive use of pointers in large-scale real-world applications continues to make points-to analysis an important optimization-enabler. Rapid growth of software systems demands a scalable pointer analysis algorithm. A typical inclusion-based points-to analysis iteratively evaluates constraints and computes a points-to solution until a fixpoint. In each iteration, (i) points-to information is propagated across directed edges in a constraint graph G and (ii) more edges are added by processing the points-to constraints. We observe that prioritizing the order in which the information is processed within each of the above two steps can lead to efficient execution of the points-to analysis. While earlier work in the literature focuses only on the propagation order, we argue that the other dimension, that is, prioritizing the constraint processing, can lead to even higher improvements on how fast the fixpoint of the points-to algorithm is reached. This becomes especially important as we prove that finding an optimal sequence for processing the points-to constraints is NP-Complete. The prioritization scheme proposed in this paper is general enough to be applied to any of the existing points-to analyses. Using the prioritization framework developed in this paper, we implement prioritized versions of Andersen's analysis, Deep Propagation, Hardekopf and Lin's Lazy Cycle Detection and Bloom Filter based points-to analysis. In each case, we report significant improvements in the analysis times (33%, 47%, 44%, 20% respectively) as well as the memory requirements for a large suite of programs, including SPEC 2000 benchmarks and five large open source programs.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite that we used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite we find concurrent execution is often no better than serialized execution. We identify that the lack of control over resource allocation to kernels is a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage. We then propose several elastic-kernel aware concurrency policies that offer significantly better performance and concurrency compared to the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil 2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads when compared to the current CUDA concurrency implementation.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Software transactional memory(STM) is a promising programming paradigm for shared memory multithreaded programs. While STM offers the promise of being less error-prone and more programmer friendly compared to traditional lock-based synchronization, it also needs to be competitive in performance in order for it to be adopted in mainstream software. A major source of performance overheads in STM is transactional aborts. Conflict resolution and aborting a transaction typically happens at the transaction level which has the advantage that it is automatic and application agnostic. However it has a substantial disadvantage in that STM declares the entire transaction as conflicting and hence aborts it and re-executes it fully, instead of partially re-executing only those part(s) of the transaction, which have been affected due to the conflict. This "Re-execute Everything" approach has a significant adverse impact on STM performance. In order to mitigate the abort overheads, we propose a compiler aided Selective Reconciliation STM (SR-STM) scheme, wherein certain transactional conflicts can be reconciled by performing partial re-execution of the transaction. Ours is a selective hybrid approach which uses compiler analysis to identify those data accesses which are legal and profitable candidates for reconciliation and applies partial re-execution only to these candidates selectively while other conflicting data accesses are handled by the default STM approach of abort and full re-execution. We describe the compiler analysis and code transformations required for supporting selective reconciliation. We find that SR-STM is effective in reducing the transactional abort overheads by improving the performance for a set of five STAMP benchmarks by 12.58% on an average and up to 22.34%.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Many meteorological phenomena occur at different locations simultaneously. These phenomena vary temporally and spatially. It is essential to track these multiple phenomena for accurate weather prediction. Efficient analysis require high-resolution simulations which can be conducted by introducing finer resolution nested simulations, nests at the locations of these phenomena. Simultaneous tracking of these multiple weather phenomena requires simultaneous execution of the nests on different subsets of the maximum number of processors for the main weather simulation. Dynamic variation in the number of these nests require efficient processor reallocation strategies. In this paper, we have developed strategies for efficient partitioning and repartitioning of the nests among the processors. As a case study, we consider an application of tracking multiple organized cloud clusters in tropical weather systems. We first present a parallel data analysis algorithm to detect such clouds. We have developed a tree-based hierarchical diffusion method which reallocates processors for the nests such that the redistribution cost is less. We achieve this by a novel tree reorganization approach. We show that our approach exhibits up to 25% lower redistribution cost and 53% lesser hop-bytes than the processor reallocation strategy that does not consider the existing processor allocation.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Precise pointer analysis is a problem of interest to both the compiler and the program verification community. Flow-sensitivity is an important dimension of pointer analysis that affects the precision of the final result computed. Scaling flow-sensitive pointer analysis to millions of lines of code is a major challenge. Recently, staged flow-sensitive pointer analysis has been proposed, which exploits a sparse representation of program code created by staged analysis. In this paper we formulate the staged flow-sensitive pointer analysis as a graph-rewriting problem. Graph-rewriting has already been used for flow-insensitive analysis. However, formulating flow-sensitive pointer analysis as a graph-rewriting problem adds additional challenges due to the nature of flow-sensitivity. We implement our parallel algorithm using Intel Threading Building Blocks and demonstrate considerable scaling (upto 2.6x) for 8 threads on a set of 10 benchmarks. Compared to the sequential implementation of staged flow-sensitive analysis, a single threaded execution of our implementation performs better in 8 of the benchmarks.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Matroidal networks were introduced by Dougherty et al. and have been well studied in the recent past. It was shown that a network has a scalar linear network coding solution if and only if it is matroidal associated with a representable matroid. A particularly interesting feature of this development is the ability to construct (scalar and vector) linearly solvable networks using certain classes of matroids. Furthermore, it was shown through the connection between network coding and matroid theory that linear network coding is not always sufficient for general network coding scenarios. The current work attempts to establish a connection between matroid theory and network-error correcting and detecting codes. In a similar vein to the theory connecting matroids and network coding, we abstract the essential aspects of linear network-error detecting codes to arrive at the definition of a matroidal error detecting network (and similarly, a matroidal error correcting network abstracting from network-error correcting codes). An acyclic network (with arbitrary sink demands) is then shown to possess a scalar linear error detecting (correcting) network code if and only if it is a matroidal error detecting (correcting) network associated with a representable matroid. Therefore, constructing such network-error correcting and detecting codes implies the construction of certain representable matroids that satisfy some special conditions, and vice versa. We then present algorithms that enable the construction of matroidal error detecting and correcting networks with a specified capability of network-error correction. Using these construction algorithms, a large class of hitherto unknown scalar linearly solvable networks with multisource, multicast, and multiple-unicast network-error correcting codes is made available for theoretical use and practical implementation, with parameters, such as number of information symbols, number of sinks, number of coding nodes, error correcting capability, and so on, being arbitrary but for computing power (for the execution of the algorithms). The complexity of the construction of these networks is shown to be comparable with the complexity of existing algorithms that design multicast scalar linear network-error correcting codes. Finally, we also show that linear network coding is not sufficient for the general network-error correction (detection) problem with arbitrary demands. In particular, for the same number of network errors, we show a network for which there is a nonlinear network-error detecting code satisfying the demands at the sinks, whereas there are no linear network-error detecting codes that do the same.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Designing and implementing thread-safe multithreaded libraries can be a daunting task as developers of these libraries need to ensure that their implementations are free from concurrency bugs, including deadlocks. The usual practice involves employing software testing and/or dynamic analysis to detect. deadlocks. Their effectiveness is dependent on well-designed multithreaded test cases. Unsurprisingly, developing multithreaded tests is significantly harder than developing sequential tests for obvious reasons. In this paper, we address the problem of automatically synthesizing multithreaded tests that can induce deadlocks. The key insight to our approach is that a subset of the properties observed when a deadlock manifests in a concurrent execution can also be observed in a single threaded execution. We design a novel, automatic, scalable and directed approach that identifies these properties and synthesizes a deadlock revealing multithreaded test. The input to our approach is the library implementation under consideration and the output is a set of deadlock revealing multithreaded tests. We have implemented our approach as part of a tool, named OMEN1. OMEN is able to synthesize multithreaded tests on many multithreaded Java libraries. Applying a dynamic deadlock detector on the execution of the synthesized tests results in the detection of a number of deadlocks, including 35 real deadlocks in classes documented as thread-safe. Moreover, our experimental results show that dynamic analysis on multithreaded tests that are either synthesized randomly or developed by third-party programmers are ineffective in detecting the deadlocks.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Branch divergence is a very commonly occurring performance problem in GPGPU in which the execution of diverging branches is serialized to execute only one control flow path at a time. Existing hardware mechanism to reconverge threads using a stack causes duplicate execution of code for unstructured control flow graphs. Also the stack mechanism cannot effectively utilize the available parallelism among diverging branches. Further, the amount of nested divergence allowed is also limited by depth of the branch divergence stack. In this paper we propose a simple and elegant transformation to handle all of the above mentioned problems. The transformation converts an unstructured CFG to a structured CFG without duplicating user code. It incurs only a linear increase in the number of basic blocks and also the number of instructions. Our solution linearizes the CFG using a predicate variable. This mechanism reconverges the divergent threads as early as possible. It also reduces the depth of the reconvergence stack. The available parallelism in nested branches can be effectively extracted by scheduling the basic blocks to reduce the effect of stalls due to memory accesses. It can also increase execution efficiency of nested loops with different trip counts for different threads. We implemented the proposed transformation at PTX level using the Ocelot compiler infrastructure. We evaluated the technique using various benchmarks to show that it can be effective in handling the performance problem due to divergence in unstructured CFGs.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Subtle concurrency errors in multithreaded libraries that arise because of incorrect or inadequate synchronization are often difficult to pinpoint precisely using only static techniques. On the other hand, the effectiveness of dynamic race detectors is critically dependent on multithreaded test suites whose execution can be used to identify and trigger races. Usually, such multithreaded tests need to invoke a specific combination of methods with objects involved in the invocations being shared appropriately to expose a race. Without a priori knowledge of the race, construction of such tests can be challenging. In this paper, we present a lightweight and scalable technique for synthesizing precisely these kinds of tests. Given a multithreaded library and a sequential test suite, we describe a fully automated analysis that examines sequential execution traces, and produces as its output a concurrent client program that drives shared objects via library method calls to states conducive for triggering a race. Experimental results on a variety of well-tested Java libraries yield 101 synthesized multithreaded tests in less than four minutes. Analyzing the execution of these tests using an off-the-shelf race detector reveals 187 harmful races, including several previously unreported ones.