821 results for Fault detection, fail-safety, fault tolerance, UAV
Abstract:
This thesis addresses one of the problems that arise today in the use of RPAs (Remotely Piloted Aircraft): safety management. In other words, it can no longer be denied that such vehicles are an integral part of civil airspace. On this very topic, airspace regulators have recently been directing their efforts toward establishing a set of regulations that govern, on the one hand, how these vehicles interface with other categories of aircraft and, on the other, the suitability criteria under which they too can operate safely in the airspace. It is therefore necessary to give the vehicles themselves a sufficient degree of safety to avoid catastrophic events when a fault occurs in the system; this is the definition of a fail-safe system. The study and development of this type of system can help the manufacturer overcome the barrier currently posed by regulation, which very often is the only non-physical obstacle between the ground and the sky for the category of unmanned aircraft. On the other hand, to ensure that those operating these vehicles remotely have, for the entire duration of the mission, a clear perception of the current operating state of the system and of how it can (or could) interact with the surrounding environment (situational awareness), the aircraft must be equipped with devices that can detect a malfunction when it occurs: this is the case of fault detection systems. These two fundamental aspects form the basis of the present work, which concerns the design of a small but crucial subsystem of the UAV: the control surface actuation system.
The control surfaces are, in fact, the only means available to the operator to steer the vehicle under normal operating conditions, but also the last chance to try to avoid a catastrophic event when other subsystems are clearly outside their normal operating conditions.
Abstract:
Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken for reinstating the job and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology both at the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application. Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which the fault tolerance is studied, centralised and decentralised checkpointing approaches on average add 90% to the actual time for executing the job. On the other hand, in the same experiment the multi-agent approaches add only 10% to the overall execution time.
Abstract:
An n-dimensional Möbius cube, 0MQ(n) or 1MQ(n), is a variation of the n-dimensional cube Q(n) which possesses many attractive properties such as significantly smaller communication delay and stronger graph-embedding capabilities. In some practical situations, the fault tolerance of a distributed memory multiprocessor system can be measured more precisely by the connectivity of the underlying graph under forbidden fault set models. This article addresses the connectivity of 0MQ(n)/1MQ(n) under two typical forbidden fault set models. We first prove that the connectivity of 0MQ(n)/1MQ(n) is 2n - 2 when the fault set does not contain the neighborhood of any vertex as a subset. We then prove that the connectivity of 0MQ(n)/1MQ(n) is 3n - 5 provided that the neighborhood of any vertex as well as that of any edge cannot fail simultaneously. These results demonstrate that 0MQ(n)/1MQ(n) has the same connectivity as Q(n) under either of the previous assumptions.
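The first forbidden fault set model can be checked by brute force on a small graph. The sketch below is only an illustration and makes a loud assumption: it uses the ordinary hypercube Q(n), not a Möbius cube, since the abstract states that both have the same connectivity under this model. The function names (`hypercube`, `allowed`, `restricted_connectivity`) are hypothetical and not taken from the article.

```python
from itertools import combinations

def hypercube(n):
    """Adjacency map of Q(n): vertices are ints 0..2^n-1, edges flip one bit."""
    return {v: {v ^ (1 << i) for i in range(n)} for v in range(1 << n)}

def is_connected(adj, faulty):
    """Depth-first search over the non-faulty vertices only."""
    alive = [v for v in adj if v not in faulty]
    if not alive:
        return True
    seen, stack = {alive[0]}, [alive[0]]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in faulty and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(alive)

def allowed(adj, faulty):
    """Forbidden fault set rule: no vertex has its whole neighborhood faulty."""
    return not any(adj[v] <= faulty for v in adj)

def restricted_connectivity(n):
    """Size of the smallest allowed fault set that disconnects Q(n)."""
    adj = hypercube(n)
    for k in range(1, 1 << n):
        for combo in combinations(adj, k):
            faulty = set(combo)
            if allowed(adj, faulty) and not is_connected(adj, faulty):
                return k
    return None

print(restricted_connectivity(3))  # 4, which equals 2n - 2 for n = 3
```

The search is exponential in the number of vertices, so it is only practical for very small n; it is meant to make the definition concrete, not to replace the proof.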
Abstract:
Processor virtualization for process migration in distributed parallel computing systems has formed a significant component of research on load balancing. In contrast, the potential of processor virtualization for fault tolerance has been addressed minimally. The work reported in this paper is motivated towards extending concepts of processor virtualization towards ‘intelligent cores’ as a means to achieve fault tolerance in distributed parallel computing systems. Intelligent cores are an abstraction of the hardware processing cores, with the incorporation of cognitive capabilities, on which parallel tasks can be executed and migrated. When a processing core executing a task is predicted to fail, the task being executed is proactively transferred onto another core. A parallel reduction algorithm incorporating concepts of intelligent cores is implemented on a computer cluster using Adaptive MPI and Charm++. Preliminary results confirm the feasibility of the approach.
Abstract:
This thesis presents and discusses the challenges of building reliable, fault-tolerant swarm robotics systems, and then some methods for detecting anomalies in them so that hypothetical recovery procedures can be carried out. The importance of a qualitative analysis of faults is also emphasized.
Abstract:
Fault tolerance allows a system to remain operational to some degree when some of its components fail. One of the most common fault tolerance mechanisms consists of logging the system state periodically, and recovering the system to a consistent state in the event of a failure. This paper describes a general fault tolerance logging-based mechanism, which can be layered over deterministic systems. Our proposal describes how a logging mechanism can recover the underlying system to a consistent state, even if an action or set of actions was interrupted midway due to a server crash. We also propose different methods of storing the logging information, and describe how to deploy a fault tolerant master-slave cluster for information replication. We adapt our model to a previously proposed framework, which provided common relational features, like transactions with atomic, consistent, isolated and durable properties, to NoSQL database management systems.
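The idea of recovering to a consistent state after an interrupted set of actions can be sketched with a toy append-only log. This is a minimal illustration under stated assumptions, not the mechanism proposed in the paper: the record layout `(kind, action_id, key, value)` and the function `recover` are hypothetical, and the underlying system is assumed to be a deterministic key-value store.

```python
def recover(log):
    """Rebuild a consistent state by replaying only committed actions.

    Each log record is a tuple (kind, action_id, key, value), where kind is
    "begin", "write" or "commit". Writes of actions that never reached
    "commit" (for example because the server crashed) are ignored.
    """
    committed = {aid for kind, aid, _key, _value in log if kind == "commit"}
    state = {}
    for kind, aid, key, value in log:
        if kind == "write" and aid in committed:
            state[key] = value
    return state

# Action 1 completes; action 2 is interrupted midway by a simulated crash.
log = [
    ("begin", 1, None, None),
    ("write", 1, "a", 10),
    ("commit", 1, None, None),
    ("begin", 2, None, None),
    ("write", 2, "a", 99),
    ("write", 2, "b", 5),
    # crash here: no commit record for action 2
]

print(recover(log))  # {'a': 10}
```

Because replay is deterministic, running `recover` on the same log always yields the same state, which is the property a logging layer over a deterministic system relies on.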
Abstract:
Multiprocessor systems which afford a high degree of parallelism are used in a variety of applications. The extremely stringent reliability requirement has made the provision of fault tolerance an important aspect in the design of such systems. This paper presents a review of the various approaches towards tolerating hardware faults in multiprocessor systems. It emphasizes the basic concepts of fault-tolerant design and the various problems to be taken care of by the designer. An in-depth survey of the various models, techniques and methods for fault diagnosis is given. Further, we consider the strategies for fault tolerance in specialized multiprocessor architectures which have the ability of dynamic reconfiguration and are suited to VLSI implementation. An analysis of the state of the art is given which points out the major aspects of fault tolerance in such architectures.
Abstract:
Relentless CMOS scaling coupled with lower design tolerances is making ICs increasingly susceptible to wear-out related permanent faults and transient faults, necessitating on-chip fault tolerance in future chip microprocessors (CMPs). In this paper we introduce a new energy-efficient fault-tolerant CMP architecture known as Redundant Execution using Critical Value Forwarding (RECVF). RECVF is based on two observations: (i) forwarding critical instruction results from the leading to the trailing core enables the latter to execute faster, and (ii) this speedup can be exploited to reduce energy consumption by operating the trailing core at a lower voltage-frequency level. Our evaluation shows that RECVF consumes 37% less energy than conventional dual modular redundant (DMR) execution of a program. It consumes only 1.26 times the energy of a non-fault-tolerant baseline and has a performance overhead of just 1.2%.
Abstract:
Exascale systems of the future are predicted to have mean time between failures (MTBF) of less than one hour. Malleable applications, where the number of processors on which the applications execute can be changed during execution, can make use of their malleability to better tolerate high failure rates. We present AdFT, an adaptive fault tolerance framework for long-running malleable applications to maximize application performance in the presence of failures. The AdFT framework includes cost models for evaluating the benefits of various fault tolerance actions including checkpointing, live migration and rescheduling, and runtime decisions for dynamically selecting the fault tolerance actions at different points of application execution to maximize performance. Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in application performance, and is effective even for petascale systems and beyond.
Abstract:
A new hybrid multilevel power converter topology is presented in this paper. The proposed topology uses only one DC source; floating capacitors charged to asymmetrical voltage levels are used for generating the different voltage levels. The SVPWM-based control strategy used in this converter maintains the capacitor voltages at the required levels in the entire modulation range, including the over-modulation region. For nine voltage levels and above, the number of components required in the proposed topology is significantly lower than in conventional multilevel inverter topologies. The number of capacitors required in this topology falls drastically compared to the conventional flying capacitor topology as the number of levels in the inverter output increases. This topology has better fault tolerance, as it is capable of operating with a reduced number of levels, in the entire modulation range, in the event of any failure in the H-bridges. The transient as well as the steady-state performance of the nine-level version of the proposed topology is experimentally verified in the entire modulation range, including the over-modulation region.
Abstract:
Exascale systems of the future are predicted to have mean time between failures (MTBF) of less than one hour. At such low MTBFs, employing periodic checkpointing alone will result in low efficiency because of the high number of application failures, resulting in a large amount of lost work due to rollbacks. In such scenarios, it is highly necessary to have proactive fault tolerance mechanisms that can help avoid a significant number of failures. In this work, we have developed a mechanism for proactive fault tolerance using partial replication of a set of application processes. Our fault tolerance framework adaptively changes the set of replicated processes periodically based on failure predictions to avoid failures. We have developed an MPI prototype implementation, PAREP-MPI, that allows changing the replica set. We have shown that our strategy involving adaptive process replication significantly outperforms existing mechanisms, providing up to 20 percent improvement in application efficiency even for exascale systems.
Abstract:
This paper introduces the notion of M-step robust fault tolerance for discrete-time systems where finite-time completion of a control manoeuvre is desired. It considers a scenario with two distinct objectives; a primary and secondary target are specified as sets to be reached in finite-time, whilst satisfying operating constraints on the states and inputs. The primary target is switched to the secondary target when a fault affects the system. As it is unknown when or if the fault will occur, the trajectory to the primary target is constrained to ensure reachability of the secondary target within M steps. A variable-horizon linear MPC formulation is developed to illustrate the concept. The formulation is then extended to provide robustness to bounded disturbances by use of tightened constraints. Simulations demonstrate the efficacy of the controller formulation on a double-integrator model. © 2011 IFAC.
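The constraint that the secondary target must stay reachable within M steps can be illustrated on a discrete-time double integrator. The sketch below is a brute-force stand-in, not the variable-horizon MPC formulation of the paper: it assumes a quantized input set {-1, 0, 1}, a unit time step, integer states (position, velocity), no disturbances and no state constraints. All names (`INPUTS`, `step`, `reachable_within`, `safe_inputs`) are hypothetical.

```python
INPUTS = (-1, 0, 1)  # assumed quantized input set with |u| <= 1

def step(state, u):
    """One step of a discrete-time double integrator (unit time step)."""
    p, v = state
    return (p + v, v + u)

def reachable_within(state, target, m):
    """True if some input sequence of length <= m drives `state` into `target`."""
    frontier = {state}
    for _ in range(m + 1):
        if frontier & target:
            return True
        frontier = {step(s, u) for s in frontier for u in INPUTS}
    return False

def safe_inputs(state, secondary, m):
    """Inputs after which the secondary target is still reachable in m steps."""
    return [u for u in INPUTS if reachable_within(step(state, u), secondary, m)]

secondary = {(0, 0)}  # assumed secondary target: at rest at the origin

print(safe_inputs((0, 0), secondary, 2))  # [0]
print(safe_inputs((0, 0), secondary, 3))  # [-1, 0, 1]
```

With M = 2 the only admissible move from rest is to stay put, because any acceleration needs at least three steps to undo; with M = 3 every input is admissible. A controller heading for the primary target would pick among `safe_inputs` at each step, which is the role the reachability constraint plays in the formulation described above.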