8 resultados para Fault tolerance
em Boston University Digital Common
Resumo:
The proliferation of mobile computers and wireless networks requires the design of future distributed real-time applications to recognize and deal with the significant asymmetry between downstream and upstream communication capacities, and the significant disparity between server and client storage capacities. Recent research work proposed the use of Broadcast Disks as a scalable mechanism to deal with this problem. In this paper, we propose a new broadcast disks protocol, based on our Adaptive Information Dispersal Algorithm (AIDA). Our protocol is different from previous broadcast disks protocols in that it improves communication timeliness, fault-tolerance, and security, while allowing for a finer control of multiplexing of prioritized data (broadcast frequencies). We start with a general introduction of broadcast disks. Next, we propose broadcast disk organizations that are suitable for real-time applications. Next, we present AIDA and show its fault-tolerance and security properties. We conclude the paper with the description and analysis of AIDA-based broadcast disks organizations that achieve both timeliness and fault-tolerance, while preserving downstream communication capacity.
Resumo:
The design of programs for broadcast disks which incorporate real-time and fault-tolerance requirements is considered. A generalized model for real-time fault-tolerant broadcast disks is defined. It is shown that designing programs for broadcast disks specified in this model is closely related to the scheduling of pinwheel task systems. Some new results in pinwheel scheduling theory are derived, which facilitate the efficient generation of real-time fault-tolerant broadcast disk programs.
Resumo:
The exploding demand for services like the World Wide Web reflects the potential that is presented by globally distributed information systems. The number of WWW servers world-wide has doubled every 3 to 5 months since 1993, outstripping even the growth of the Internet. At each of these self-managed sites, the Common Gateway Interface (CGI) and Hypertext Transfer Protocol (HTTP) already constitute a rudimentary basis for contributing local resources to remote collaborations. However, the Web has serious deficiencies that make it unsuited for use as a true medium for metacomputing --- the process of bringing hardware, software, and expertise from many geographically dispersed sources to bear on large scale problems. These deficiencies are, paradoxically, the direct result of the very simple design principles that enabled its exponential growth. There are many symptoms of the problems exhibited by the Web: disk and network resources are consumed extravagantly; information search and discovery are difficult; protocols are aimed at data movement rather than task migration, and ignore the potential for distributing computation. However, all of these can be seen as aspects of a single problem: as a distributed system for metacomputing, the Web offers unpredictable performance and unreliable results. The goal of our project is to use the Web as a medium (within either the global Internet or an enterprise intranet) for metacomputing in a reliable way with performance guarantees. We attack this problem one four levels: (1) Resource Management Services: Globally distributed computing allows novel approaches to the old problems of performance guarantees and reliability. Our first set of ideas involve setting up a family of real-time resource management models organized by the Web Computing Framework with a standard Resource Management Interface (RMI), a Resource Registry, a Task Registry, and resource management protocols to allow resource needs and availability information be collected and disseminated so that a family of algorithms with varying computational precision and accuracy of representations can be chosen to meet realtime and reliability constraints. (2) Middleware Services: Complementary to techniques for allocating and scheduling available resources to serve application needs under realtime and reliability constraints, the second set of ideas aim at reduce communication latency, traffic congestion, server work load, etc. We develop customizable middleware services to exploit application characteristics in traffic analysis to drive new server/browser design strategies (e.g., exploit self-similarity of Web traffic), derive document access patterns via multiserver cooperation, and use them in speculative prefetching, document caching, and aggressive replication to reduce server load and bandwidth requirements. (3) Communication Infrastructure: Finally, to achieve any guarantee of quality of service or performance, one must get at the network layer that can provide the basic guarantees of bandwidth, latency, and reliability. Therefore, the third area is a set of new techniques in network service and protocol designs. (4) Object-Oriented Web Computing Framework A useful resource management system must deal with job priority, fault-tolerance, quality of service, complex resources such as ATM channels, probabilistic models, etc., and models must be tailored to represent the best tradeoff for a particular setting. This requires a family of models, organized within an object-oriented framework, because no one-size-fits-all approach is appropriate. This presents a software engineering challenge requiring integration of solutions at all levels: algorithms, models, protocols, and profiling and monitoring tools. The framework captures the abstract class interfaces of the collection of cooperating components, but allows the concretization of each component to be driven by the requirements of a specific approach and environment.
Resumo:
One-and two-dimensional cellular automata which are known to be fault-tolerant are very complex. On the other hand, only very simple cellular automata have actually been proven to lack fault-tolerance, i.e., to be mixing. The latter either have large noise probability ε or belong to the small family of two-state nearest-neighbor monotonic rules which includes local majority voting. For a certain simple automaton L called the soldiers rule, this problem has intrigued researchers for the last two decades since L is clearly more robust than local voting: in the absence of noise, L eliminates any finite island of perturbation from an initial configuration of all 0's or all 1's. The same holds for a 4-state monotonic variant of L, K, called two-line voting. We will prove that the probabilistic cellular automata Kε and Lε asymptotically lose all information about their initial state when subject to small, strongly biased noise. The mixing property trivially implies that the systems are ergodic. The finite-time information-retaining quality of a mixing system can be represented by its relaxation time Relax(⋅), which measures the time before the onset of significant information loss. This is known to grow as (1/ε)^c for noisy local voting. The impressive error-correction ability of L has prompted some researchers to conjecture that Relax(Lε) = 2^(c/ε). We prove the tight bound 2^(c1log^21/ε) < Relax(Lε) < 2^(c2log^21/ε) for a biased error model. The same holds for Kε. Moreover, the lower bound is independent of the bias assumption. The strong bias assumption makes it possible to apply sparsity/renormalization techniques, the main tools of our investigation, used earlier in the opposite context of proving fault-tolerance.
Resumo:
Distributed hash tables have recently become a useful building block for a variety of distributed applications. However, current schemes based upon consistent hashing require both considerable implementation complexity and substantial storage overhead to achieve desired load balancing goals. We argue in this paper that these goals can b e achieved more simply and more cost-effectively. First, we suggest the direct application of the "power of two choices" paradigm, whereby an item is stored at the less loaded of two (or more) random alternatives. We then consider how associating a small constant number of hash values with a key can naturally b e extended to support other load balancing methods, including load-stealing or load-shedding schemes, as well as providing natural fault-tolerance mechanisms.
Resumo:
Understanding and modeling the factors that underlie the growth and evolution of network topologies are basic questions that impact capacity planning, forecasting, and protocol research. Early topology generation work focused on generating network-wide connectivity maps, either at the AS-level or the router-level, typically with an eye towards reproducing abstract properties of observed topologies. But recently, advocates of an alternative "first-principles" approach question the feasibility of realizing representative topologies with simple generative models that do not explicitly incorporate real-world constraints, such as the relative costs of router configurations, into the model. Our work synthesizes these two lines by designing a topology generation mechanism that incorporates first-principles constraints. Our goal is more modest than that of constructing an Internet-wide topology: we aim to generate representative topologies for single ISPs. However, our methods also go well beyond previous work, as we annotate these topologies with representative capacity and latency information. Taking only demand for network services over a given region as input, we propose a natural cost model for building and interconnecting PoPs and formulate the resulting optimization problem faced by an ISP. We devise hill-climbing heuristics for this problem and demonstrate that the solutions we obtain are quantitatively similar to those in measured router-level ISP topologies, with respect to both topological properties and fault-tolerance.
Resumo:
In a probabilistic cellular automaton in which all local transitions have positive probability, the problem of keeping a bit of information for more than a constant number of steps is nontrivial, even in an infinite automaton. Still, there is a solution in 2 dimensions, and this solution can be used to construct a simple 3-dimensional discrete-time universal fault-tolerant cellular automaton. This technique does not help much to solve the following problems: remembering a bit of information in 1 dimension; computing in dimensions lower than 3; computing in any dimension with non-synchronized transitions. Our more complex technique organizes the cells in blocks that perform a reliable simulation of a second (generalized) cellular automaton. The cells of the latter automaton are also organized in blocks, simulating even more reliably a third automaton, etc. Since all this (a possibly infinite hierarchy) is organized in "software", it must be under repair all the time from damage caused by errors. A large part of the problem is essentially self-stabilization recovering from a mess of arbitrary-size and content caused by the faults. The present paper constructs an asynchronous one-dimensional fault-tolerant cellular automaton, with the further feature of "self-organization". The latter means that unless a large amount of input information must be given, the initial configuration can be chosen to be periodical with a small period.
Resumo:
To construct high performance Web servers, system builders are increasingly turning to distributed designs. An important challenge that arises in distributed Web servers is the need to direct incoming connections to individual hosts. Previous methods for connection routing have employed a centralized node which handles all incoming requests. In contrast, we propose a distributed approach, called Distributed Packet Rewriting (DPR), in which all hosts of the distributed system participate in connection routing. We argue that this approach promises better scalability and fault-tolerance than the centralized approach. We describe our implementation of four variants of DPR and compare their performance. We show that DPR provides performance comparable to centralized alternatives, measured in terms of throughput and delay under the SPECweb96 benchmark. Finally, we argue that DPR is particularly attractive both for small scale systems and for systems following the emerging trend toward increasingly intelligent I/O subsystems.