846 results for fault-tolerant scheduling
Abstract:
The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited to distributed-memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or by high latency in communication paths. The lack of scalable and fault-tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault-tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against state-of-the-art sampling methods and shows that the proposed method overcomes the limitations of sampling-based approaches for skewed cluster distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault-tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.
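The idea of replacing global communication with purely local exchanges can be illustrated with push-pull gossip averaging, a common building block of epidemic protocols. The sketch below is not the Epidemic K-Means algorithm itself: the function name, the pairing rule and the parameters are assumptions chosen for illustration. Each node repeatedly averages its estimate with that of a random peer, and all estimates drift toward the global mean without any node ever seeing all of the data.

```python
import random

def gossip_average(values, cycles=30, seed=0):
    """Push-pull gossip averaging (illustrative sketch, hypothetical helper).

    In every cycle each node picks one random peer; both nodes then replace
    their local estimates with the mean of the pair. The sum of all estimates
    is preserved, so the estimates converge toward the global mean.
    """
    rng = random.Random(seed)
    est = list(values)
    n = len(est)
    if n < 2:
        return est
    for _ in range(cycles):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # make sure the peer differs from node i
            mean = (est[i] + est[j]) / 2.0
            est[i] = est[j] = mean
    return est

if __name__ == "__main__":
    local_values = [float(i) for i in range(100)]
    estimates = gossip_average(local_values)
    print(min(estimates), max(estimates))
```

In a distributed K-Means setting, the same kind of exchange could in principle be applied to per-cluster sums and counts so that every node approximates the global centroids; that mapping is an assumption here, not a description of the paper's method.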
Abstract:
Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, that lets the surviving processes agree on the list of failed processes, so that applications can continue to execute in the presence of failures by adopting algorithm-based fault tolerance techniques. The MPI_Comm_shrink operation requires a fault-tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. They were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles needed to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage, and perfect synchronization in achieving global consensus.
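A toy simulation can make the gossip-based agreement on a failed-process list more concrete. The sketch below is not either of the paper's algorithms and does not use MPI or the Extreme-scale Simulator. It assumes that each failure is first noticed by one randomly chosen surviving process, and that in every cycle each surviving process pushes its current list to one random surviving peer. The function name and all parameters are hypothetical.

```python
import random

def cycles_to_consensus(n, failed, seed=0, max_cycles=1000):
    """Count gossip cycles until all surviving processes hold the same
    failed-process list (illustrative push-gossip sketch).

    n      -- total number of processes, identified as 0..n-1
    failed -- set of failed process ids
    Returns the number of cycles, or None if max_cycles is exceeded.
    """
    rng = random.Random(seed)
    alive = [p for p in range(n) if p not in failed]
    known = {p: set() for p in alive}
    # Assumption: each failure is initially detected by a single survivor.
    for f in failed:
        known[rng.choice(alive)].add(f)
    target = set(failed)
    for cycle in range(1, max_cycles + 1):
        for p in alive:
            peer = rng.choice(alive)
            known[peer] |= known[p]  # push the local view to the peer
        if all(known[p] == target for p in alive):
            return cycle
    return None

if __name__ == "__main__":
    for size in (64, 256, 1024):
        print(size, cycles_to_consensus(size, {1, 2, 3}))
```

Running the loop for increasing sizes gives a rough feel for how the cycle count grows with system size in this simplified model.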
Fault detection, diagnosis and active fault tolerant control for a satellite attitude control system
Abstract:
Modern control systems are becoming more and more complex and control algorithms more and more sophisticated. Consequently, Fault Detection and Diagnosis (FDD) and Fault Tolerant Control (FTC) have gained central importance over the past decades, due to the increasing requirements of availability, cost efficiency, reliability and operating safety. This thesis deals with the FDD and FTC problems in a spacecraft Attitude Determination and Control System (ADCS). Firstly, the detailed nonlinear models of the spacecraft attitude dynamics and kinematics are described, along with the dynamic models of the actuators and main external disturbance sources. The considered ADCS is composed of an array of four redundant reaction wheels. A set of sensors provides satellite angular velocity, attitude and flywheel spin rate information. Then, general overviews of the Fault Detection and Isolation (FDI), Fault Estimation (FE) and Fault Tolerant Control (FTC) problems are presented, and the design and implementation of a novel diagnosis system is described. The system consists of an FDI module composed of properly organized model-based residual filters, which exploit the available input and output information to detect and localize a fault that has occurred. A proper fault mapping procedure and the nonlinear geometric approach are exploited to design residual filters explicitly decoupled from the external aerodynamic disturbance and sensitive to specific sets of faults. The subsequent use of suitable adaptive FE algorithms, based on radial basis function neural networks, makes it possible to obtain accurate fault estimates. Finally, this estimate is actively exploited in an FTC scheme to achieve suitable fault accommodation and guarantee the desired control performance. A standard sliding mode controller is implemented for attitude stabilization and control.
Several simulation results are given to highlight the performance of the overall system under different types of faults affecting the ADCS actuators and sensors.
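The principle behind a model-based residual filter can be shown with a heavily simplified, noise-free example. The sketch below is not the thesis's nonlinear geometric design: it assumes a scalar linear discrete-time model y[k+1] = a*y[k] + b*u for a single actuator, with an additive input bias as the fault, and all names and numeric values are hypothetical. The residual is the difference between the measured output and the one-step model prediction; it stays at zero while the model holds and jumps once the fault appears.

```python
def simulate_wheel(steps, a=0.9, b=0.1, u=1.0, fault_step=None, bias=0.0):
    """Simulate y[k+1] = a*y[k] + b*(u + fault), where an additive input
    bias becomes active from step fault_step onward (hypothetical model)."""
    y = [0.0]
    for k in range(steps - 1):
        fault = bias if fault_step is not None and k >= fault_step else 0.0
        y.append(a * y[k] + b * (u + fault))
    return y

def first_alarm(y, a=0.9, b=0.1, u=1.0, threshold=0.01):
    """Return the first sample index at which the residual between the
    measurement and the fault-free one-step prediction exceeds the threshold."""
    for k in range(1, len(y)):
        residual = y[k] - (a * y[k - 1] + b * u)
        if abs(residual) > threshold:
            return k
    return None

if __name__ == "__main__":
    healthy = simulate_wheel(100)
    faulty = simulate_wheel(100, fault_step=50, bias=0.5)
    print(first_alarm(healthy), first_alarm(faulty))
```

In this toy model the residual after the fault equals b times the bias, which is also the simplest form of a fault estimate. A real design would additionally have to cope with noise, disturbances and nonlinear dynamics.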
Abstract:
The design of fault-tolerant systems is gaining importance in a wide range of embedded applications where design constraints are as important as reliability. New software techniques, based on selective application of redundancy, have shown remarkable fault coverage with reduced costs and overheads. However, the large number of different solutions provided by these techniques, and the costly process of assessing their reliability, make design space exploration a very difficult and time-consuming task. This paper proposes the integration of a multi-objective optimization tool with a software hardening environment to perform an automatic design space exploration in search of the best trade-offs between reliability, cost, and performance. The first tool is driven by a genetic algorithm which can pursue several design goals simultaneously thanks to the use of the NSGA-II multi-objective algorithm. The second is a compiler-based infrastructure that automatically produces selectively protected (hardened) versions of the software and generates accurate overhead reports and fault coverage estimations. The advantages of our proposal are illustrated by means of a complex and detailed case study involving a typical embedded application, the AES (Advanced Encryption Standard).
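The notion of a "best trade-off" in multi-objective search rests on Pareto dominance, which NSGA-II uses to rank candidate solutions. The sketch below shows only that dominance test and the extraction of the non-dominated set. It is not NSGA-II as a whole and not the paper's tool. The objective tuples are made-up illustrative values, read as (fraction of uncovered faults, code-size overhead, execution-time overhead), all to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

if __name__ == "__main__":
    # Hypothetical hardened variants, not measured data.
    variants = [
        (0.10, 1.5, 1.20),
        (0.05, 2.0, 1.40),
        (0.10, 2.0, 1.50),  # dominated by both variants above
        (0.20, 1.1, 1.05),
    ]
    for v in pareto_front(variants):
        print(v)
```

A designer would then choose among the remaining variants according to which constraint matters most for the target application.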
Abstract:
"This project is funded in part by NASA grant NSG 1471."
Abstract:
"UILU-ENG 80 1742"--Cover.
Abstract:
"June 1980."
Abstract:
"August 1980."