860 results for Fault-tolerance


Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper presents an architecture (Multi-μ) being implemented to study and develop software-based fault-tolerant mechanisms for real-time systems, using the Ada language (Ada 95) and Commercial Off-The-Shelf (COTS) components. Several issues regarding fault tolerance are presented, and mechanisms to achieve fault tolerance by software active replication in Ada 95 are discussed. The Multi-μ architecture, based on a specifically proposed Fault Tolerance Manager (FTManager), is then described. Finally, some considerations are made about the work being done and essential future developments.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

It is imperative to accept that failures can and will occur, even in meticulously designed distributed systems, and to design proper measures to counter those failures. Passive replication minimises resource consumption by activating redundant replicas only in case of failures, since providing and applying state updates is typically less resource-demanding than requesting execution. However, most existing solutions for passive fault tolerance are designed and configured at design time, explicitly and statically identifying the most critical components and their number of replicas, and so lack the flexibility needed to handle the runtime dynamics of distributed component-based embedded systems. This paper proposes a cost-effective adaptive fault tolerance solution with a significantly lower overhead than a strict active redundancy-based approach, achieving high error coverage with the minimum amount of redundancy. The activation of passive replicas is coordinated through a feedback-based coordination model that reduces the complexity of the needed interactions among components until a new collective global service solution is determined, improving the overall maintainability and robustness of the system.
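The basic passive replication idea described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the adaptive, feedback-based solution the paper proposes: the `Replica` and `PassiveGroup` classes, the integer state, and the "promote the first live backup" rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a stateful component (hypothetical: state is a running total)."""
    name: str
    state: int = 0
    alive: bool = True

    def execute(self, request: int) -> int:
        # Only the active replica pays the cost of executing the request.
        self.state += request
        return self.state

    def apply_update(self, new_state: int) -> None:
        # Passive replicas only copy the resulting state.
        self.state = new_state


@dataclass
class PassiveGroup:
    """A primary plus passive backups; a backup is activated only on failure."""
    primary: Replica
    backups: list[Replica] = field(default_factory=list)

    def handle(self, request: int) -> int:
        if not self.primary.alive:
            self._fail_over()
        result = self.primary.execute(request)
        for backup in self.backups:
            if backup.alive:
                backup.apply_update(result)
        return result

    def _fail_over(self) -> None:
        # Assumed policy: promote the first live backup, drop dead ones.
        live = [b for b in self.backups if b.alive]
        if not live:
            raise RuntimeError("no live replica left")
        self.primary, self.backups = live[0], live[1:]


group = PassiveGroup(Replica("p"), [Replica("b1"), Replica("b2")])
group.handle(5)
group.handle(7)
group.primary.alive = False   # inject a crash fault into the primary
total = group.handle(1)       # served by the promoted backup "b1"
```

Because the backups already hold the last state update, the promoted replica continues from the same total without re-executing earlier requests.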

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit. This thesis studies and presents fault tolerance methods for network-on-chip (NoC), which is a design paradigm targeted for very large systems-on-chip. In a NoC, resources such as processors and memories are connected to a communication network, comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels. The thesis studies the origin of faults in modern technologies and explains their classification into transient, intermittent and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model. The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method, especially against transient faults, at this abstraction level. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults. Therefore, other solutions against them are presented. The introduction of spare wires and split transmissions is shown to provide good tolerance against intermittent and permanent errors, and their combination with error control coding is illustrated.
At the network layer, positioned above the data link layer, fault tolerance can be achieved with the design of fault-tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis, together with realizations in both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levels.
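As a minimal sketch of error control coding at the data link layer, the example below uses a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. The choice of Hamming(7,4) is an assumption made for illustration; the abstract does not say which codes the thesis studies, and the function names are hypothetical.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Bit positions (1-based): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = sent.copy()
received[4] ^= 1                      # a transient fault flips one wire
recovered = hamming74_decode(received)
```

A code of this kind handles one flipped bit per codeword; two simultaneous flips are miscorrected, which is one way to see why additional measures are needed when several wires are affected.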

Relevance:

100.00% 100.00%

Publisher:

Abstract:

An n-dimensional Möbius cube, 0MQ(n) or 1MQ(n), is a variation of the n-dimensional cube Q(n) which possesses many attractive properties, such as significantly smaller communication delay and stronger graph-embedding capabilities. In some practical situations, the fault tolerance of a distributed memory multiprocessor system can be measured more precisely by the connectivity of the underlying graph under forbidden fault set models. This article addresses the connectivity of 0MQ(n)/1MQ(n) under two typical forbidden fault set models. We first prove that the connectivity of 0MQ(n)/1MQ(n) is 2n - 2 when the fault set does not contain the neighborhood of any vertex as a subset. We then prove that the connectivity of 0MQ(n)/1MQ(n) is 3n - 5 provided that the neighborhood of any vertex, as well as that of any edge, cannot fail simultaneously. These results demonstrate that 0MQ(n)/1MQ(n) has the same connectivity as Q(n) under either of the previous assumptions.
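The first forbidden fault set model can be checked by brute force on small graphs. The sketch below does so for the ordinary cube Q(n), not for the Möbius cube itself, and the helper names (`hypercube`, `min_cut`, `no_full_neighbourhood`) are hypothetical. It searches for the smallest vertex set whose removal disconnects the graph, with and without the restriction that no vertex may lose its whole neighborhood.

```python
from itertools import combinations


def hypercube(n):
    """Adjacency sets of Q(n): vertices are ints, edges flip one bit."""
    return {v: {v ^ (1 << i) for i in range(n)} for v in range(2 ** n)}


def is_connected(adj, removed):
    """True if the graph stays connected after deleting `removed`."""
    alive = set(adj) - removed
    if not alive:
        return True
    start = next(iter(alive))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()] - removed:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == alive


def min_cut(adj, allowed=lambda faults: True):
    """Size of the smallest allowed vertex set whose removal disconnects."""
    for k in range(1, len(adj)):
        for faults in combinations(adj, k):
            faults = set(faults)
            if allowed(faults) and not is_connected(adj, faults):
                return k
    return None


def no_full_neighbourhood(adj):
    """Forbidden fault set model: no vertex loses all of its neighbours."""
    return lambda faults: not any(nbrs <= faults for nbrs in adj.values())


q3 = hypercube(3)
classic = min_cut(q3)                                # unrestricted connectivity
restricted = min_cut(q3, no_full_neighbourhood(q3))  # forbidden fault set model
```

For Q(3) the two values are 3 and 4, that is, n and 2n - 2. The exhaustive search is exponential, so it is only practical for very small n.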

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Processor virtualization for process migration in distributed parallel computing systems has formed a significant component of research on load balancing. In contrast, the potential of processor virtualization for fault tolerance has received little attention. The work reported in this paper aims to extend concepts of processor virtualization towards ‘intelligent cores’ as a means to achieve fault tolerance in distributed parallel computing systems. Intelligent cores are an abstraction of the hardware processing cores, with the incorporation of cognitive capabilities, on which parallel tasks can be executed and migrated. When a processing core executing a task is predicted to fail, the task being executed is proactively transferred onto another core. A parallel reduction algorithm incorporating concepts of intelligent cores is implemented on a computer cluster using Adaptive MPI and Charm++. Preliminary results confirm the feasibility of the approach.
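The proactive migration step can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the Adaptive MPI and Charm++ implementation the abstract refers to: the `Core` class, the `predicted_to_fail` flag (standing in for whatever failure predictor is used), and the "least-loaded healthy core" policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical stand-in for a processing core holding sub-tasks."""
    ident: int
    chunks: list = field(default_factory=list)   # data still to be summed
    predicted_to_fail: bool = False              # set by some failure predictor


def migrate_from_failing(cores):
    """Move work off cores predicted to fail onto the least-loaded healthy core."""
    healthy = [c for c in cores if not c.predicted_to_fail]
    if not healthy:
        raise RuntimeError("no healthy core available")
    for core in cores:
        if core.predicted_to_fail and core.chunks:
            target = min(healthy, key=lambda c: len(c.chunks))
            target.chunks.extend(core.chunks)
            core.chunks = []
    return healthy


def reduce_sum(cores):
    """Sum all data, using only cores that are not predicted to fail."""
    healthy = migrate_from_failing(cores)
    partials = [sum(sum(chunk) for chunk in core.chunks) for core in healthy]
    return sum(partials)


cores = [Core(i, [[i, i + 1]]) for i in range(4)]   # chunks [0,1], [1,2], [2,3], [3,4]
cores[2].predicted_to_fail = True                   # predictor flags core 2
total = reduce_sum(cores)
```

The reduction result is the same as it would be without the predicted failure, because the flagged core's chunk is summed on a healthy core instead of being lost.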

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Recent research in multi-agent systems incorporates fault tolerance concepts, but does not explore the extension and implementation of such ideas for large-scale parallel computing systems. The work reported in this paper investigates a swarm array computing approach, namely 'Intelligent Agents'. A task to be executed on a parallel computing system is decomposed into sub-tasks and mapped onto agents that traverse an abstracted hardware layer. The agents intercommunicate across processors to share information in the event of a predicted core/processor failure and to complete the task successfully. The feasibility of the approach is validated by simulations on an FPGA using a multi-agent simulator, and by the implementation of a parallel reduction algorithm on a computer cluster using the Message Passing Interface.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Service-based architectures enable the development of new classes of Grid and distributed applications. One of the main capabilities provided by such systems is the dynamic and flexible integration of services, according to which services are allowed to be a part of more than one distributed system and simultaneously serve different applications. This increased flexibility in system composition makes it difficult to address classical distributed system issues such as fault-tolerance. While it is relatively easy to make an individual service fault-tolerant, improving fault-tolerance of services collaborating in multiple application scenarios is a challenging task. In this paper, we look at the issue of developing fault-tolerant service-based distributed systems, and propose an infrastructure to implement fault tolerance capabilities transparent to services.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This thesis presents and discusses the challenges of achieving reliable, fault-tolerant swarm robotics systems, as well as some methods for detecting anomalies in them so that hypothetical recovery procedures can be undertaken. It also underlines the importance of a qualitative analysis of faults.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper presents an analysis of the fault tolerance achieved by an autonomous, fully embedded evolvable hardware system, which uses a combination of partial dynamic reconfiguration and an evolutionary algorithm (EA). It demonstrates that the system may self-recover from both transient and cumulative permanent faults. This self-adaptive system, based on a 2D array of 16 (4×4) Processing Elements (PEs), is tested with an image filtering application. Results show that it may properly recover from faults in up to 3 PEs, that is, more than 18% cumulative permanent faults. Two fault models are used for testing purposes, at the PE and CLB levels. Two self-healing strategies are also introduced, depending on whether fault diagnosis is available or not. They are based on scrubbing, fitness evaluation, dynamic partial reconfiguration and in-system evolutionary adaptation. Since most of these adaptability features are already available on the system for its normal operation, the resource cost for self-healing is very low (only some code additions in the internal microprocessor core).

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This work presents a graph-theoretic method for determining the degree of fault tolerance of computer network interconnections and nodes. Experimental results obtained from simulations of this method over a distributed computing network environment are also presented.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Electric vehicles (EVs) and hybrid electric vehicles (HEVs) can reduce greenhouse gas emissions, and the switched reluctance motor (SRM) is one of the promising motors for such applications. This paper presents a novel SRM fault-diagnosis and fault-tolerance operation solution. Based on the traditional asymmetric half-bridge topology for SRM driving, central-tapped windings of the SRM in a modular half-bridge configuration are introduced to provide fault-diagnosis and fault-tolerance functions, which are set idle in normal conditions. Fault diagnosis can be achieved by detecting the characteristics of the excitation and demagnetization currents. An SRM fault-tolerance operation strategy is also realized by the proposed topology, which compensates for the missing phase torque under an open-circuit fault, and reduces the unbalanced phase current caused by the uncontrolled faulty phase under a short-circuit fault. Furthermore, the current sensor placement strategy is discussed, giving two placement methods for low cost or modular structure. Simulation results in MATLAB/Simulink and experiments on a 750-W SRM validate the effectiveness of the proposed strategy, which may have significant implications for, and improve the reliability of, EVs/HEVs.