77 resultados para Fault tolerance

em Indian Institute of Science - Bangalore - Índia


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Multiprocessor systems which afford a high degree of parallelism are used in a variety of applications. The extremely stringent reliability requirement has made the provision of fault-tolerance an important aspect in the design of such systems. This paper presents a review of the various approaches towards tolerating hardware faults in multiprocessor systems. It. emphasizes the basic concepts of fault tolerant design and the various problems to be taken care of by the designer. An indepth survey of the various models, techniques and methods for fault diagnosis is given. Further, we consider the strategies for fault-tolerance in specialized multiprocessor architectures which have the ability of dynamic reconfiguration and are suited to VLSI implementation. An analysis of the state-óf-the-art is given which points out the major aspects of fault-tolerance in such architectures.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Relentless CMOS scaling coupled with lower design tolerances is making ICs increasingly susceptible to wear-out related permanent faults and transient faults, necessitating on-chip fault tolerance in future chip microprocessors (CMPs). In this paper we introduce a new energy-efficient fault-tolerant CMP architecture known as Redundant Execution using Critical Value Forwarding (RECVF). RECVF is based on two observations: (i) forwarding critical instruction results from the leading to the trailing core enables the latter to execute faster, and (ii) this speedup can be exploited to reduce energy consumption by operating the trailing core at a lower voltage-frequency level. Our evaluation shows that RECVF consumes 37% less energy than conventional dual modular redundant (DMR) execution of a program. It consumes only 1.26 times the energy of a non-fault-tolerant baseline and has a performance overhead of just 1.2%.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Exascale systems of the future are predicted to have mean time between failures (MTBF) of less than one hour. Malleable applications, where the number of processors on which the applications execute can be changed during executions, can make use of their malleability to better tolerate high failure rates. We present AdFT, an adaptive fault tolerance framework for long running malleable applications to maximize application performance in the presence of failures. AdFT framework includes cost models for evaluating the benefits of various fault tolerance actions including checkpointing, live-migration and rescheduling, and runtime decisions for dynamically selecting the fault tolerance actions at different points of application execution to maximize performance. Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications yielding up to 23% improvement in application performance, and is effective even for petascale systems and beyond.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A new hybrid multilevel power converter topology is presented in this paper. The proposed power converter topology uses only one DC source and floating capacitors charged to asymmetrical voltage levels, are used for generating different voltage levels. The SVPWM based control strategy used in this converter maintains the capacitor voltages at the required levels in the entire modulation range including the over-modulation region. For the voltage levels: nine and above, the number of components required in the proposed topology is significantly lower, compared to the conventional multilevel inverter topologies. The number of capacitors required in this topology reduces drastically compared to the conventional flying capacitor topology, when the number of levels in the inverter output increases. This topology has better fault tolerance, as it is capable of operating with reduced number of levels, in the entire modulation range, in the event of any failure in the H-bridges. The transient as well as the steady state performance of the nine-level version of the proposed topology is experimentally verified in the entire modulation range including the over-modulation region.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Exascale systems of the future are predicted to have mean time between failures (MTBF) of less than one hour. At such low MTBFs, employing periodic checkpointing alone will result in low efficiency because of the high number of application failures resulting in large amount of lost work due to rollbacks. In such scenarios, it is highly necessary to have proactive fault tolerance mechanisms that can help avoid significant number of failures. In this work, we have developed a mechanism for proactive fault tolerance using partial replication of a set of application processes. Our fault tolerance framework adaptively changes the set of replicated processes periodically based on failure predictions to avoid failures. We have developed an MPI prototype implementation, PAREP-MPI that allows changing the replica set. We have shown that our strategy involving adaptive process replication significantly outperforms existing mechanisms providing up to 20 percent improvement in application efficiency even for exascale systems.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The fault-tolerant multiprocessor (ftmp) is a bus-based multiprocessor architecture with real-time and fault- tolerance features and is used in critical aerospace applications. A preliminary performance evaluation is of crucial importance in the design of such systems. In this paper, we review stochastic Petri nets (spn) and developspn-based performance models forftmp. These performance models enable efficient computation of important performance measures such as processing power, bus contention, bus utilization, and waiting times.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Fault-tolerance is due to the semiconductor technology development important, not only for safety-critical systems but also for general-purpose (non-safety critical) systems. However, instead of guaranteeing that deadlines always are met, it is for general-purpose systems important to minimize the average execution time (AET) while ensuring fault-tolerance. For a given job and a soft (transient) error probability, we define mathematical formulas for AET that includes bus communication overhead for both voting (active replication) and rollback-recovery with checkpointing (RRC). And, for a given multi-processor system-on-chip (MPSoC), we define integer linear programming (ILP) models that minimize AET including bus communication overhead when: (1) selecting the number of checkpoints when using RRC, (2) finding the number of processors and job-to-processor assignment when using voting, and (3) defining fault-tolerance scheme (voting or RRC) per job and defining its usage for each job. Experiments demonstrate significant savings in AET.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

With the advent of Internet, video over IP is gaining popularity. In such an environment, scalability and fault tolerance will be the key issues. Existing video on demand (VoD) service systems are usually neither scalable nor tolerant to server faults and hence fail to comply to multi-user, failure-prone networks such as the Internet. Current research areas concerning VoD often focus on increasing the throughput and reliability of single server, but rarely addresses the smooth provision of service during server as well as network failures. Reliable Server Pooling (RSerPool), being capable of providing high availability by using multiple redundant servers as single source point, can be a solution to overcome the above failures. During a possible server failure, the continuity of service is retained by another server. In order to achieve transparent failover, efficient state sharing is an important requirement. In this paper, we present an elegant, simple, efficient and scalable approach which has been developed to facilitate the transfer of state by the client itself, using extended cookie mechanism, which ensures that there is no noticeable change in disruption or the video quality.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper presents on overview of the issues in precisely defining, specifying and evaluating the dependability of software, particularly in the context of computer controlled process systems. Dependability is intended to be a generic term embodying various quality factors and is useful for both software and hardware. While the developments in quality assurance and reliability theories have proceeded mostly in independent directions for hardware and software systems, we present here the case for developing a unified framework of dependability—a facet of operational effectiveness of modern technological systems, and develop a hierarchical systems model helpful in clarifying this view. In the second half of the paper, we survey the models and methods available for measuring and improving software reliability. The nature of software “bugs”, the failure history of the software system in the various phases of its lifecycle, the reliability growth in the development phase, estimation of the number of errors remaining in the operational phase, and the complexity of the debugging process have all been considered to varying degrees of detail. We also discuss the notion of software fault-tolerance, methods of achieving the same, and the status of other measures of software dependability such as maintainability, availability and safety.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper is aimed at reviewing the notion of Byzantine-resilient distributed computing systems, the relevant protocols and their possible applications as reported in the literature. The three agreement problems, namely, the consensus problem, the interactive consistency problem, and the generals problem have been discussed. Various agreement protocols for the Byzantine generals problem have been summarized in terms of their performance and level of fault-tolerance. The three classes of Byzantine agreement protocols discussed are the deterministic, randomized, and approximate agreement protocols. Finally, application of the Byzantine agreement protocols to clock synchronization is highlighted.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Our main result is a new sequential method for the design of decentralized control systems. Controller synthesis is conducted on a loop-by-loop basis, and at each step the designer obtains an explicit characterization of the class C of all compensators for the loop being closed that results in closed-loop system poles being in a specified closed region D of the s-plane, instead of merely stabilizing the closed-loop system. Since one of the primary goals of control system design is to satisfy basic performance requirements that are often directly related to closed-loop pole location (bandwidth, percentage overshoot, rise time, settling time), this approach immediately allows the designer to focus on other concerns such as robustness and sensitivity. By considering only compensators from class C and seeking the optimum member of that set with respect to sensitivity or robustness, the designer has a clearly-defined limited optimization problem to solve without concern for loss of performance. A solution to the decentralized tracking problem is also provided. This design approach has the attractive features of expandability, the use of only 'local models' for controller synthesis, and fault tolerance with respect to certain types of failure.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Mobile ad-hoc networks (MANETs) have recently drawn significant research attention since they offer unique benefits and versatility with respect to bandwidth spatial reuse, intrinsic fault tolerance, and low-cost rapid deployment. This paper addresses the issue of delay sensitive realtime data transport in these type of networks. An effective QoS mechanism is thereby required for the speedy transport of the realtime data. QoS issue in MANET is an open-end problem. Various QoS measures are incorporated in the upperlayers of the network, but a few techniques addresses QoS techniques in the MAC layer. There are quite a few QoS techniques in the MAC layer for the infrastructure based wireless network. The goal and the challenge is to achieve a QoS delivery and a priority access to the real time traffic in adhoc wireless environment, while maintaining democracy in the resource allocation. We propose a MAC layer protocol called "FCP based FAMA protocol", which allocates the channel resources to the needy in a more democratic way, by examining the requirements, malicious behavior and genuineness of the request. We have simulated both the FAMA as well as FCP based FAMA and tested in various MANET conditions. Simulated results have clearly shown a performance improvement in the channel utilization and a decrease in the delay parameters in the later case. Our new protocol outperforms the other QoS aware MAC layer protocols.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Relentless CMOS scaling coupled with lower design tolerances is making ICs increasingly susceptible to wear-out related permanent faults and transient faults, necessitating on-chip fault tolerance in future chip microprocessors (CMPs). In this paper, we describe a power-efficient architecture for redundant execution on chip multiprocessors (CMPs) which when coupled with our per-core dynamic voltage and frequency scaling (DVFS) algorithm significantly reduces the energy overhead of redundant execution without sacrificing performance. Our evaluation shows that this architecture has a performance overhead of only 0.3% and consumes only 1.48 times the energy of a non-fault-tolerant baseline.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Among the intelligent safety technologies for road vehicles, active suspensions controlled by embedded computing elements for preventing rollover have received a lot of attention. The existing models for synthesizing and allocating forces in such suspensions are conservatively based on the constraints that are valid until no wheels lift off the ground. However, the fault tolerance of the rollover-preventive systems can be enhanced if the smart/active suspensions can intervene in the more severe situation in which the wheels have just lifted off the ground. The difficulty in computing control in the last situation is that the vehicle dynamics then passes into the regime that yields a model involving disjunctive constraints on the dynamics. Simulation of dynamics with disjunctive constraints in this context becomes necessary to estimate, synthesize, and allocate the intended hardware realizable forces in an active suspension. In this paper, we give an algorithm for the previously mentioned problem by solving it as a disjunctive dynamic optimization problem. Based on this, we synthesize and allocate the roll-stabilizing time-dependent active suspension forces in terms of sensor output data. We show that the forces obtained from disjunctive dynamics are comparable with existing force allocations and, hence, are possibly realizable in the existing hardware framework toward enhancing the safety and fault tolerance.