137 resultados para Distributed virtualization
Resumo:
In this paper, we describe an efficient coordinated-checkpointing and recovery algorithm which can work even when the channels are assumed to be non-FIFO, and messages may be lost. Nodes are assumed to be autonomous, and they do not block while taking checkpoints. Based on the local conditions, any process can request the previous coordinator for the 'permission' to initiate a new checkpoint. Allowing multiple initiators of checkpoints avoids the bottleneck associated with a single initiator, but the algorithm permits only a single instance of checkpointing process at any given time, thus reducing much of the overhead associated with multiple initiators of distributed algorithms.
Resumo:
The move towards IT outsourcing is the first step towards an environment where compute infrastructure is treated as a service. In utility computing this IT service has to honor Service Level Agreements (SLA) in order to meet the desired Quality of Service (QoS) guarantees. Such an environment requires reliable services in order to maximize the utilization of the resources and to decrease the Total Cost of Ownership (TCO). Such reliability cannot come at the cost of resource duplication, since it increases the TCO of the data center and hence the cost per compute unit. We, in this paper, look into aspects of projecting impact of hardware failures on the SLAs and techniques required to take proactive recovery steps in case of a predicted failure. By maintaining health vectors of all hardware and system resources, we predict the failure probability of resources based on observed hardware errors/failure events, at runtime. This inturn influences an availability aware middleware to take proactive action (even before the application is affected in case the system and the application have low recoverability). The proposed framework has been prototyped on a system running HP-UX. Our offline analysis of the prediction system on hardware error logs indicate no more than 10% false positives. This work to the best of our knowledge is the first of its kind to perform an end-to-end analysis of the impact of a hardware fault on application SLAs, in a live system.
Resumo:
The author presents adaptive control techniques for controlling the flow of real-time jobs from the peripheral processors (PPs) to the central processor (CP) of a distributed system with a star topology. He considers two classes of flow control mechanisms: (1) proportional control, where a certain proportion of the load offered to each PP is sent to the CP, and (2) threshold control, where there is a maximum rate at which each PP can send jobs to the CP. The problem is to obtain good algorithms for dynamically adjusting the control level at each PP in order to prevent overload of the CP, when the load offered by the PPs is unknown and varying. The author formulates the problem approximately as a standard system control problem in which the system has unknown parameters that are subject to change. Using well-known techniques (e.g., naive-feedback-controller and stochastic approximation techniques), he derives adaptive controls for the system control problem. He demonstrates the efficacy of these controls in the original problem by using the control algorithms in simulations of a queuing model of the CP and the load controls.
Resumo:
Erasure coding techniques are used to increase the reliability of distributed storage systems while minimizing storage overhead. Also of interest is minimization of the bandwidth required to repair the system following a node failure. In a recent paper, Wu et al. characterize the tradeoff between the repair bandwidth and the amount of data stored per node. They also prove the existence of regenerating codes that achieve this tradeoff. In this paper, we introduce Exact Regenerating Codes, which are regenerating codes possessing the additional property of being able to duplicate the data stored at a failed node. Such codes require low processing and communication overheads, making the system practical and easy to maintain. Explicit construction of exact regenerating codes is provided for the minimum bandwidth point on the storage-repair bandwidth tradeoff, relevant to distributed-mail-server applications. A sub-space based approach is provided and shown to yield necessary and sufficient conditions on a linear code to possess the exact regeneration property as well as prove the uniqueness of our construction. Also included in the paper, is an explicit construction of regenerating codes for the minimum storage point for parameters relevant to storage in peer-to-peer systems. This construction supports a variable number of nodes and can handle multiple, simultaneous node failures. All constructions given in the paper are of low complexity, requiring low field size in particular.
Resumo:
In this paper we address the problem of distributed transmission of functions of correlated sources over a fast fading multiple access channel (MAC). This is a basic building block in a hierarchical sensor network used in estimating a random field where the cluster head is interested only in estimating a function of the observations. The observations are transmitted to the cluster head through a fast fading MAC. We provide sufficient conditions for lossy transmission when the encoders and decoders are provided with partial information about the channel state. Furthermore signal side information maybe available at the encoders and the decoder. Various previous studies are shown as special cases. Efficient joint-source channel coding schemes are discussed for transmission of discrete and continuous alphabet sources to recover function values.
Resumo:
An important issue in the design of a distributed computing system (DCS) is the development of a suitable protocol. This paper presents an effort to systematize the protocol design procedure for a DCS. Protocol design and development can be divided into six phases: specification of the DCS, specification of protocol requirements, protocol design, specification and validation of the designed protocol, performance evaluation, and hardware/software implementation. This paper describes techniques for the second and third phases, while the first phase has been considered by the authors in their earlier work. Matrix and set theoretic based approaches are used for specification of a DCS and for specification of the protocol requirements. These two formal specification techniques form the basis of the development of a simple and straightforward procedure for the design of the protocol. The applicability of the above design procedure has been illustrated by considering an example of a computing system encountered on board a spacecraft. A Petri-net based approach has been adopted to model the protocol. The methodology developed in this paper can be used in other DCS applications.
Resumo:
A detailed characterization of interference power statistics in CDMA systems is of considerable practical and theoretical interest. Such a characterization for uplink inter-cell interference has been difficult because of transmit power control, randomness in the number of interfering mobile stations, and randomness in their locations. We develop a new method to model the uplink inter-cell interference power as a lognormal distribution, and show that it is an order of magnitude more accurate than the conventional Gaussian approximation even when the average number of mobile stations per cell is relatively large and even outperforms the moment-matched lognormal approximation considered in the literature. The proposed method determines the lognormal parameters by matching its moment generating function with a new approximation of the moment generating function for the inter-cell interference. The method is tractable and exploits the elegant spatial Poisson process theory. Using several numerical examples, the accuracy of the proposed method in modeling the probability distribution of inter-cell interference is verified for both small and large values of interference.
Resumo:
There are a number of large networks which occur in many problems dealing with the flow of power, communication signals, water, gas, transportable goods, etc. Both design and planning of these networks involve optimization problems. The first part of this paper introduces the common characteristics of a nonlinear network (the network may be linear, the objective function may be non linear, or both may be nonlinear). The second part develops a mathematical model trying to put together some important constraints based on the abstraction for a general network. The third part deals with solution procedures; it converts the network to a matrix based system of equations, gives the characteristics of the matrix and suggests two solution procedures, one of them being a new one. The fourth part handles spatially distributed networks and evolves a number of decomposition techniques so that we can solve the problem with the help of a distributed computer system. Algorithms for parallel processors and spatially distributed systems have been described.There are a number of common features that pertain to networks. A network consists of a set of nodes and arcs. In addition at every node, there is a possibility of an input (like power, water, message, goods etc) or an output or none. Normally, the network equations describe the flows amoungst nodes through the arcs. These network equations couple variables associated with nodes. Invariably, variables pertaining to arcs are constants; the result required will be flows through the arcs. To solve the normal base problem, we are given input flows at nodes, output flows at nodes and certain physical constraints on other variables at nodes and we should find out the flows through the network (variables at nodes will be referred to as across variables).The optimization problem involves in selecting inputs at nodes so as to optimise an objective function; the objective may be a cost function based on the inputs to be minimised or a loss function or an efficiency function. The above mathematical model can be solved using Lagrange Multiplier technique since the equalities are strong compared to inequalities. The Lagrange multiplier technique divides the solution procedure into two stages per iteration. Stage one calculates the problem variables % and stage two the multipliers lambda. It is shown that the Jacobian matrix used in stage one (for solving a nonlinear system of necessary conditions) occurs in the stage two also.A second solution procedure has also been imbedded into the first one. This is called total residue approach. It changes the equality constraints so that we can get faster convergence of the iterations.Both solution procedures are found to coverge in 3 to 7 iterations for a sample network.The availability of distributed computer systems — both LAN and WAN — suggest the need for algorithms to solve the optimization problems. Two types of algorithms have been proposed — one based on the physics of the network and the other on the property of the Jacobian matrix. Three algorithms have been deviced, one of them for the local area case. These algorithms are called as regional distributed algorithm, hierarchical regional distributed algorithm (both using the physics properties of the network), and locally distributed algorithm (a multiprocessor based approach with a local area network configuration). The approach used was to define an algorithm that is faster and uses minimum communications. These algorithms are found to converge at the same rate as the non distributed (unitary) case.
Resumo:
In a storage system where individual storage nodes are prone to failure, the redundant storage of data in a distributed manner across multiple nodes is a must to ensure reliability. Reed-Solomon codes possess the reconstruction property under which the stored data can be recovered by connecting to any k of the n nodes in the network across which data is dispersed. This property can be shown to lead to vastly improved network reliability over simple replication schemes. Also of interest in such storage systems is the minimization of the repair bandwidth, i.e., the amount of data needed to be downloaded from the network in order to repair a single failed node. Reed-Solomon codes perform poorly here as they require the entire data to be downloaded. Regenerating codes are a new class of codes which minimize the repair bandwidth while retaining the reconstruction property. This paper provides an overview of regenerating codes including a discussion on the explicit construction of optimum codes.
Resumo:
A new language concept for high-level distributed programming is proposed. Programs are organised as a collection of concurrently executing processes. Some of these processes, referred to as liaison processes, have a monitor-like structure and contain ports which may be invoked by other processes for the purposes of synchronisation and communication. Synchronisation is achieved by conditional activation of ports and also through port control constructs which may directly specify the execution ordering of ports. These constructs implement a path-expression-like mechanism for synchronisation and are also equipped with options to provide conditional, non-deterministic and priority ordering of ports. The usefulness and expressive power of the proposed concepts are illustrated through solutions of several representative programming problems. Some implementation issues are also considered.
Resumo:
Distributed computing systems can be modeled adequately by Petri nets. The computation of invariants of Petri nets becomes necessary for proving the properties of modeled systems. This paper presents a two-phase, bottom-up approach for invariant computation and analysis of Petri nets. In the first phase, a newly defined subnet, called the RP-subnet, with an invariant is chosen. In the second phase, the selected RP-subnet is analyzed. Our methodology is illustrated with two examples viz., the dining philosophers' problem and the connection-disconnection phase of a transport protocol. We believe that this new method, which is computationally no worse than the existing techniques, would simplify the analysis of many practical distributed systems.
Resumo:
In a typical sensor network scenario a goal is to monitor a spatio-temporal process through a number of inexpensive sensing nodes, the key parameter being the fidelity at which the process has to be estimated at distant locations. We study such a scenario in which multiple encoders transmit their correlated data at finite rates to a distant, common decoder over a discrete time multiple access channel under various side information assumptions. In particular, we derive an achievable rate region for this communication problem.
Resumo:
Formal specification is vital to the development of distributed real-time systems as these systems are inherently complex and safety-critical. It is widely acknowledged that formal specification and automatic analysis of specifications can significantly increase system reliability. Although a number of specification techniques for real-time systems have been reported in the literature, most of these formalisms do not adequately address to the constraints that the aspects of 'distribution' and 'real-time' impose on specifications. Further, an automatic verification tool is necessary to reduce human errors in the reasoning process. In this regard, this paper is an attempt towards the development of a novel executable specification language for distributed real-time systems. First, we give a precise characterization of the syntax and semantics of DL. Subsequently, we discuss the problems of model checking, automatic verification of satisfiability of DL specifications, and testing conformance of event traces with DL specifications. Effective solutions to these problems are presented as extensions to the classical first-order tableau algorithm. The use of the proposed framework is illustrated by specifying a sample problem.
Explicit and Optimal Exact-Regenerating Codes for the Minimum-Bandwidth Point in Distributed Storage
Resumo:
In the distributed storage setting that we consider, data is stored across n nodes in the network such that the data can be recovered by connecting to any subset of k nodes. Additionally, one can repair a failed node by connecting to any d nodes while downloading beta units of data from each. Dimakis et al. show that the repair bandwidth d beta can be considerably reduced if each node stores slightly more than the minimum required and characterize the tradeoff between the amount of storage per node and the repair bandwidth. In the exact regeneration variation, unlike the functional regeneration, the replacement for a failed node is required to store data identical to that in the failed node. This greatly reduces the complexity of system maintenance. The main result of this paper is an explicit construction of codes for all values of the system parameters at one of the two most important and extreme points of the tradeoff - the Minimum Bandwidth Regenerating point, which performs optimal exact regeneration of any failed node. A second result is a non-existence proof showing that with one possible exception, no other point on the tradeoff can be achieved for exact regeneration.