7 resultados para fault tolerant systems

em Massachusetts Institute of Technology


Relevância:

100.00% 100.00%

Publicador:

Resumo:

As the size of digital systems increases, the mean time between single component failures diminishes. To avoid component related failures, large computers must be fault-tolerant. In this paper, we focus on methods for achieving a high degree of fault-tolerance in multistage routing networks. We describe a multipath scheme for providing end-to-end fault-tolerance on large networks. The scheme improves routing performance while keeping network latency low. We also describe the novel routing component, RN1, which implements this scheme, showing how it can be the basic building block for fault-tolerant multistage routing networks.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Parallel shared-memory machines with hundreds or thousands of processor-memory nodes have been built; in the future we will see machines with millions or even billions of nodes. Associated with such large systems is a new set of design challenges. Many problems must be addressed by an architecture in order for it to be successful; of these, we focus on three in particular. First, a scalable memory system is required. Second, the network messaging protocol must be fault-tolerant. Third, the overheads of thread creation, thread management and synchronization must be extremely low. This thesis presents the complete system design for Hamal, a shared-memory architecture which addresses these concerns and is directly scalable to one million nodes. Virtual memory and distributed objects are implemented in a manner that requires neither inter-node synchronization nor the storage of globally coherent translations at each node. We develop a lightweight fault-tolerant messaging protocol that guarantees message delivery and idempotence across a discarding network. A number of hardware mechanisms provide efficient support for massive multithreading and fine-grained synchronization. Experiments are conducted in simulation, using a trace-driven network simulator to investigate the messaging protocol and a cycle-accurate simulator to evaluate the Hamal architecture. We determine implementation parameters for the messaging protocol which optimize performance. A discarding network is easier to design and can be clocked at a higher rate, and we find that with this protocol its performance can approach that of a non-discarding network. Our simulations of Hamal demonstrate the effectiveness of its thread management and synchronization primitives. In particular, we find register-based synchronization to be an extremely efficient mechanism which can be used to implement a software barrier with a latency of only 523 cycles on a 512 node machine.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The Transit network provides high-speed, low-latency, fault-tolerant interconnect for high-performance, multiprocessor computers. The basic connection scheme for Transit uses bidelta style, multistage networks to support up to 256 processors. Scaling to larger machines by simply extending the bidelta network topology will result in a uniform degradation of network latency between all processors. By employing a fat-tree network structure in larger systems, the network provides locality and universality properties which can help minimize the impact of scaling on network latency. This report details the topology and construction issues associated with integrating Transit routing technology into fat-tree interconnect topologies.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

The blocking probability of a network is a common measure of its performance. There exist means of quickly calculating the blocking probabilities of Banyan networks; however, because Banyan networks have no redundant paths, they are not inherently fault-tolerant, and so their use in large-scale multiprocessors is problematic. Unfortunately, the addition of multiple paths between message sources and sinks in a network complicates the calculation of blocking probabilities. A methodology for exact calculation of blocking probabilities for small networks with redundant paths is presented here, with some discussion of its potential use in approximating blocking probabilities for large networks with redundant paths.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

For a very large network deployed in space with only nearby nodes able to talk to each other, we want to do tasks like robust routing and data storage. One way to organize the network is via a hierarchy, but hierarchies often have a few critical nodes whose death can disrupt organization over long distances. I address this with a system of distributed aggregates called Persistent Nodes, such that spatially local failures disrupt the hierarchy in an area proportional to the diameter of the failure. I describe and analyze this system, which has been implemented in simulation.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This thesis describes the design and implementation of an integrated circuit and associated packaging to be used as the building block for the data routing network of a large scale shared memory multiprocessor system. A general purpose multiprocessor depends on high-bandwidth, low-latency communications between computing elements. This thesis describes the design and construction of RN1, a novel self-routing, enhanced crossbar switch as a CMOS VLSI chip. This chip provides the basic building block for a scalable pipelined routing network with byte-wide data channels. A series of RN1 chips can be cascaded with no additional internal network components to form a multistage fault-tolerant routing switch. The chip is designed to operate at clock frequencies up to 100Mhz using Hewlett-Packard's HP34 $1.2\\mu$ process. This aggressive performance goal demands that special attention be paid to optimization of the logic architecture and circuit design.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This report addresses the problem of achieving cooperation within small- to medium- sized teams of heterogeneous mobile robots. I describe a software architecture I have developed, called ALLIANCE, that facilitates robust, fault tolerant, reliable, and adaptive cooperative control. In addition, an extended version of ALLIANCE, called L-ALLIANCE, is described, which incorporates a dynamic parameter update mechanism that allows teams of mobile robots to improve the efficiency of their mission performance through learning. A number of experimental results of implementing these architectures on both physical and simulated mobile robot teams are described. In addition, this report presents the results of studies of a number of issues in mobile robot cooperation, including fault tolerant cooperative control, adaptive action selection, distributed control, robot awareness of team member actions, improving efficiency through learning, inter-robot communication, action recognition, and local versus global control.