726 resultados para Fault tolerant computing
Resumo:
Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit. This thesis studies and presents fault tolerance methods for network-onchip (NoC) which is a design paradigm targeted for very large systems-onchip. In a NoC resources, such as processors and memories, are connected to a communication network; comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels. The thesis studies the origin of faults in modern technologies and explains the classification to transient, intermittent and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model. The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method especially against transient faults at this abstraction level. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults. Therefore, other solutions against them are presented. The introduction of spare wires and split transmissions are shown to provide good tolerance against intermittent and permanent errors and their combination to error control coding is illustrated. At the network layer positioned above the data link layer, fault tolerance can be achieved with the design of fault tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis together with realizations in the both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levels
Resumo:
Sharing of information with those in need of it has always been an idealistic goal of networked environments. With the proliferation of computer networks, information is so widely distributed among systems, that it is imperative to have well-organized schemes for retrieval and also discovery. This thesis attempts to investigate the problems associated with such schemes and suggests a software architecture, which is aimed towards achieving a meaningful discovery. Usage of information elements as a modelling base for efficient information discovery in distributed systems is demonstrated with the aid of a novel conceptual entity called infotron.The investigations are focused on distributed systems and their associated problems. The study was directed towards identifying suitable software architecture and incorporating the same in an environment where information growth is phenomenal and a proper mechanism for carrying out information discovery becomes feasible. An empirical study undertaken with the aid of an election database of constituencies distributed geographically, provided the insights required. This is manifested in the Election Counting and Reporting Software (ECRS) System. ECRS system is a software system, which is essentially distributed in nature designed to prepare reports to district administrators about the election counting process and to generate other miscellaneous statutory reports.Most of the distributed systems of the nature of ECRS normally will possess a "fragile architecture" which would make them amenable to collapse, with the occurrence of minor faults. This is resolved with the help of the penta-tier architecture proposed, that contained five different technologies at different tiers of the architecture.The results of experiment conducted and its analysis show that such an architecture would help to maintain different components of the software intact in an impermeable manner from any internal or external faults. The architecture thus evolved needed a mechanism to support information processing and discovery. This necessitated the introduction of the noveI concept of infotrons. Further, when a computing machine has to perform any meaningful extraction of information, it is guided by what is termed an infotron dictionary.The other empirical study was to find out which of the two prominent markup languages namely HTML and XML, is best suited for the incorporation of infotrons. A comparative study of 200 documents in HTML and XML was undertaken. The result was in favor ofXML.The concept of infotron and that of infotron dictionary, which were developed, was applied to implement an Information Discovery System (IDS). IDS is essentially, a system, that starts with the infotron(s) supplied as clue(s), and results in brewing the information required to satisfy the need of the information discoverer by utilizing the documents available at its disposal (as information space). The various components of the system and their interaction follows the penta-tier architectural model and therefore can be considered fault-tolerant. IDS is generic in nature and therefore the characteristics and the specifications were drawn up accordingly. Many subsystems interacted with multiple infotron dictionaries that were maintained in the system.In order to demonstrate the working of the IDS and to discover the information without modification of a typical Library Information System (LIS), an Information Discovery in Library Information System (lDLIS) application was developed. IDLIS is essentially a wrapper for the LIS, which maintains all the databases of the library. The purpose was to demonstrate that the functionality of a legacy system could be enhanced with the augmentation of IDS leading to information discovery service. IDLIS demonstrates IDS in action. IDLIS proves that any legacy system could be augmented with IDS effectively to provide the additional functionality of information discovery service.Possible applications of IDS and scope for further research in the field are covered.
Resumo:
The proliferation of wireless sensor networks in a large spectrum of applications had been spurered by the rapid advances in MEMS(micro-electro mechanical systems )based sensor technology coupled with low power,Low cost digital signal processors and radio frequency circuits.A sensor network is composed of thousands of low cost and portable devices bearing large sensing computing and wireless communication capabilities. This large collection of tiny sensors can form a robust data computing and communication distributed system for automated information gathering and distributed sensing.The main attractive feature is that such a sensor network can be deployed in remote areas.Since the sensor node is battery powered,all the sensor nodes should collaborate together to form a fault tolerant network so as toprovide an efficient utilization of precious network resources like wireless channel,memory and battery capacity.The most crucial constraint is the energy consumption which has become the prime challenge for the design of long lived sensor nodes.
Resumo:
In recent years, reversible logic has emerged as one of the most important approaches for power optimization with its application in low power CMOS, nanotechnology and quantum computing. This research proposes quick addition of decimals (QAD) suitable for multi-digit BCD addition, using reversible conservative logic. The design makes use of reversible fault tolerant Fredkin gates only. The implementation strategy is to reduce the number of levels of delay there by increasing the speed, which is the most important factor for high speed circuits.
Resumo:
The speed of fault isolation is crucial for the design and reconfiguration of fault tolerant control (FTC). In this paper the fault isolation problem is stated as a constraint satisfaction problem (CSP) and solved using constraint propagation techniques. The proposed method is based on constraint satisfaction techniques and uncertainty space refining of interval parameters. In comparison with other approaches based on adaptive observers, the major advantage of the presented method is that the isolation speed is fast even taking into account uncertainty in parameters, measurements and model errors and without the monotonicity assumption. In order to illustrate the proposed approach, a case study of a nonlinear dynamic system is presented
Resumo:
The paper presents how workflow-oriented, single-user Grid portals could be extended to meet the requirements of users with collaborative needs. Through collaborative Grid portals different research and engineering teams would be able to share knowledge and resources. At the same time the workflow concept assures that the shared knowledge and computational capacity is aggregated to achieve the high-level goals of the group. The paper discusses the different issues collaborative support requires from Grid portal environments during the different phases of the workflow-oriented development work. While in the design period the most important task of the portal is to provide consistent and fault tolerant data management, during the workflow execution it must act upon the security framework its back-end Grids are built on.
Resumo:
An interconnection network with n nodes is four-pancyclic if it contains a cycle of length l for each integer l with 4 <= l <= n. An interconnection network is fault-tolerant four-pancyclic if the surviving network is four-pancyclic in the presence of faults. The fault-tolerant four-pancyclicity of interconnection networks is a desired property because many classical parallel algorithms can be mapped onto such networks in a communication-efficient fashion, even in the presence of failing nodes or edges. Due to some attractive properties as compared with its hypercube counterpart of the same size, the Mobius cube has been proposed as a promising candidate for interconnection topology. Hsieh and Chen [S.Y. Hsieh, C.H. Chen, Pancyclicity on Mobius cubes with maximal edge faults, Parallel Computing, 30(3) (2004) 407-421.] showed that an n-dimensional Mobius cube is four-pancyclic in the presence of up to n-2 faulty edges. In this paper, we show that an n-dimensional Mobius cube is four-pancyclic in the presence of up to n-2 faulty nodes. The obtained result is optimal in that, if n-1 nodes are removed, the surviving network may not be four-pancyclic. (C) 2005 Elsevier B.V. All rights reserved.
Resumo:
Service-based architectures enable the development of new classes of Grid and distributed applications. One of the main capabilities provided by such systems is the dynamic and flexible integration of services, according to which services are allowed to be a part of more than one distributed system and simultaneously serve different applications. This increased flexibility in system composition makes it difficult to address classical distributed system issues such as fault-tolerance. While it is relatively easy to make an individual service fault-tolerant, improving fault-tolerance of services collaborating in multiple application scenarios is a challenging task. In this paper, we look at the issue of developing fault-tolerant service-based distributed systems, and propose an infrastructure to implement fault tolerance capabilities transparent to services.
Resumo:
Establishing a fault-tolerant connection in a network involves computation of diverse working and protection paths. The Shared Risk Link Group (SRLG) [1] concept is used to model several types of failure conditions such as link, node, fiber conduit, etc. In this work we focus on the problem of computing optimal SRLG/link diverse paths under shared protection. Shared protection technique improves network resource utilization by allowing protection paths of multiple connections to share resources. In this work we propose an iterative heuristic for computing SRLG/link diverse paths. We present a method to calculate a quantitative measure that provides a bounded guarantee on the optimality of the diverse paths computed by the heuristic. The experimental results on computing link diverse paths show that our proposed heuristic is efficient in terms of number of iterations required (time taken) to compute diverse paths when compared to other previously proposed heuristics.
Resumo:
One of the important issues in establishing a fault tolerant connection in a wavelength division multiplexing optical network is computing a pair of disjoint working and protection paths and a free wavelength along the paths. While most of the earlier research focused only on computing disjoint paths, in this work we consider computing both disjoint paths and a free wavelength along the paths. The concept of dependent cost structure (DCS) of protection paths to enhance their resource sharing ability was proposed in our earlier work. In this work we extend the concept of DCS of protection paths to wavelength continuous networks. We formalize the problem of computing disjoint paths with DCS in wavelength continuous networks and prove that it is NP-complete. We present an iterative heuristic that uses a layered graph model to compute disjoint paths with DCS and identify a free wavelength.
Resumo:
A new control scheme has been presented in this thesis. Based on the NonLinear Geometric Approach, the proposed Active Control System represents a new way to see the reconfigurable controllers for aerospace applications. The presence of the Diagnosis module (providing the estimation of generic signals which, based on the case, can be faults, disturbances or system parameters), mean feature of the depicted Active Control System, is a characteristic shared by three well known control systems: the Active Fault Tolerant Controls, the Indirect Adaptive Controls and the Active Disturbance Rejection Controls. The standard NonLinear Geometric Approach (NLGA) has been accurately investigated and than improved to extend its applicability to more complex models. The standard NLGA procedure has been modified to take account of feasible and estimable sets of unknown signals. Furthermore the application of the Singular Perturbations approximation has led to the solution of Detection and Isolation problems in scenarios too complex to be solved by the standard NLGA. Also the estimation process has been improved, where multiple redundant measuremtent are available, by the introduction of a new algorithm, here called "Least Squares - Sliding Mode". It guarantees optimality, in the sense of the least squares, and finite estimation time, in the sense of the sliding mode. The Active Control System concept has been formalized in two controller: a nonlinear backstepping controller and a nonlinear composite controller. Particularly interesting is the integration, in the controller design, of the estimations coming from the Diagnosis module. Stability proofs are provided for both the control schemes. Finally, different applications in aerospace have been provided to show the applicability and the effectiveness of the proposed NLGA-based Active Control System.
Resumo:
Los sistemas técnicos son cada vez más complejos, incorporan funciones más avanzadas, están más integrados con otros sistemas y trabajan en entornos menos controlados. Todo esto supone unas condiciones más exigentes y con mayor incertidumbre para los sistemas de control, a los que además se demanda un comportamiento más autónomo y fiable. La adaptabilidad de manera autónoma es un reto para tecnologías de control actualmente. El proyecto de investigación ASys propone abordarlo trasladando la responsabilidad de la capacidad de adaptación del sistema de los ingenieros en tiempo de diseño al propio sistema en operación. Esta tesis pretende avanzar en la formulación y materialización técnica de los principios de ASys de cognición y auto-consciencia basadas en modelos y autogestión de los sistemas en tiempo de operación para una autonomía robusta. Para ello el trabajo se ha centrado en la capacidad de auto-conciencia, inspirada en los sistemas biológicos, y se ha explorado la posibilidad de integrarla en la arquitectura de los sistemas de control. Además de la auto-consciencia, se han explorado otros temas relevantes: modelado funcional, modelado de software, tecnología de los patrones, tecnología de componentes, tolerancia a fallos. Se ha analizado el estado de la técnica en los ámbitos pertinentes para las cuestiones de la auto-consciencia y la adaptabilidad en sistemas técnicos: arquitecturas cognitivas, control tolerante a fallos, y arquitecturas software dinámicas y computación autonómica. El marco teórico de ASys existente de sistemas autónomos cognitivos ha sido adaptado para servir de base para este análisis de autoconsciencia y adaptación y para dar sustento conceptual al posterior desarrollo de la solución. La tesis propone una solución general de diseño para la construcción de sistemas autónomos auto-conscientes. La idea central es la integración de un meta-controlador en la arquitectura de control del sistema autónomo, capaz de percibir la estado funcional del sistema de control y, si es necesario, reconfigurarlo en tiempo de operación. Esta solución de metacontrol se ha formalizado en cuatro patrones de diseño: i) el Patrón Metacontrol, que define la integración de un subsistema de metacontrol, responsable de controlar al propio sistema de control a través de la interfaz proporcionada por su plataforma de componentes, ii) el patrón Bucle de Control Epistémico, que define un bucle de control cognitivo basado en el modelos y que se puede aplicar al diseño del metacontrol, iii) el patrón de Reflexión basada en Modelo Profundo propone una solución para construir el modelo ejecutable utilizado por el meta-controlador mediante una transformación de modelo a modelo a partir del modelo de ingeniería del sistema, y, finalmente, iv) el Patrón Metacontrol Funcional, que estructura el meta-controlador en dos bucles, uno para el control de la configuración de los componentes del sistema de control, y otro sobre éste, controlando las funciones que realiza dicha configuración de componentes; de esta manera las consideraciones funcionales y estructurales se desacoplan. La Arquitectura OM y el metamodelo TOMASys son las piezas centrales del marco arquitectónico desarrollado para materializar la solución compuesta de los patrones anteriores. El metamodelo TOMASys ha sido desarrollado para la representación de la estructura y su relación con los requisitos funcionales de cualquier sistema autónomo. La Arquitectura OM es un patrón de referencia para la construcción de una metacontrolador integrando los patrones de diseño propuestos. Este meta-controlador se puede integrar en la arquitectura de cualquier sistema control basado en componentes. El elemento clave de su funcionamiento es un modelo TOMASys del sistema decontrol, que el meta-controlador usa para monitorizarlo y calcular las acciones de reconfiguración necesarias para adaptarlo a las circunstancias en cada momento. Un proceso de ingeniería, complementado con otros recursos, ha sido elaborado para guiar la aplicación del marco arquitectónico OM. Dicho Proceso de Ingeniería OM define la metodología a seguir para construir el subsistema de metacontrol para un sistema autónomo a partir del modelo funcional del mismo. La librería OMJava proporciona una implementación del meta-controlador OM que se puede integrar en el control de cualquier sistema autónomo, independientemente del dominio de la aplicación o de su tecnología de implementación. Para concluir, la solución completa ha sido validada con el desarrollo de un robot móvil autónomo que incorpora un meta-controlador con la Arquitectura OM. Las propiedades de auto-consciencia y adaptación proporcionadas por el meta-controlador han sido validadas en diferentes escenarios de operación del robot, en los que el sistema era capaz de sobreponerse a fallos en el sistema de control mediante reconfiguraciones orquestadas por el metacontrolador. ABSTRACT Technical systems are becoming more complex, they incorporate more advanced functionalities, they are more integrated with other systems and they are deployed in less controlled environments. All this supposes a more demanding and uncertain scenario for control systems, which are also required to be more autonomous and dependable. Autonomous adaptivity is a current challenge for extant control technologies. The ASys research project proposes to address it by moving the responsibility for adaptivity from the engineers at design time to the system at run-time. This thesis has intended to advance in the formulation and technical reification of ASys principles of model-based self-cognition and having systems self-handle at runtime for robust autonomy. For that it has focused on the biologically inspired capability of self-awareness, and explored the possibilities to embed it into the very architecture of control systems. Besides self-awareness, other themes related to the envisioned solution have been explored: functional modeling, software modeling, patterns technology, components technology, fault tolerance. The state of the art in fields relevant for the issues of self-awareness and adaptivity has been analysed: cognitive architectures, fault-tolerant control, and software architectural reflection and autonomic computing. The extant and evolving ASys Theoretical Framework for cognitive autonomous systems has been adapted to provide a basement for this selfhood-centred analysis and to conceptually support the subsequent development of our solution. The thesis proposes a general design solution for building self-aware autonomous systems. Its central idea is the integration of a metacontroller in the control architecture of the autonomous system, capable of perceiving the functional state of the control system and reconfiguring it if necessary at run-time. This metacontrol solution has been formalised into four design patterns: i) the Metacontrol Pattern, which defines the integration of a metacontrol subsystem, controlling the domain control system through an interface provided by its implementation component platform, ii) the Epistemic Control Loop pattern, which defines a modelbased cognitive control loop that can be applied to the design of such a metacontroller, iii) the Deep Model Reflection pattern proposes a solution to produce the online executable model used by the metacontroller by model-to-model transformation from the engineering model, and, finally, iv) the Functional Metacontrol pattern, which proposes to structure the metacontroller in two loops, one for controlling the configuration of components of the controller, and another one on top of the former, controlling the functions being realised by that configuration; this way the functional and structural concerns become decoupled. The OM Architecture and the TOMASys metamodel are the core pieces of the architectural framework developed to reify this patterned solution. The TOMASys metamodel has been developed for representing the structure and its relation to the functional requirements of any autonomous system. The OM architecture is a blueprint for building a metacontroller according to the patterns. This metacontroller can be integrated on top of any component-based control architecture. At the core of its operation lies a TOMASys model of the control system. An engineering process and accompanying assets have been constructed to complete and exploit the architectural framework. The OM Engineering Process defines the process to follow to develop the metacontrol subsystem from the functional model of the controller of the autonomous system. The OMJava library provides a domain and application-independent implementation of an OM Metacontroller than can be used in the implementation phase of OMEP. Finally, the complete solution has been validated in the development of an autonomous mobile robot that incorporates an OM metacontroller. The functional selfawareness and adaptivity properties achieved thanks to the metacontrol system have been validated in different scenarios. In these scenarios the robot was able to overcome failures in the control system thanks to reconfigurations performed by the metacontroller.
Resumo:
The distributed computing models typically assume every process in the system has a distinct identifier (ID) or each process is programmed differently, which is named as eponymous system. In such kind of distributed systems, the unique ID is helpful to solve problems: it can be incorporated into messages to make them trackable (i.e., to or from which process they are sent) to facilitate the message transmission; several problems (leader election, consensus, etc.) can be solved without the information of network property in priori if processes have unique IDs; messages in the register of one process will not be overwritten by others process if this process announces; it is useful to break the symmetry. Hence, eponymous systems have influenced the distributed computing community significantly either in theory or in practice. However, every thing in the world has its own two sides. The unique ID also has disadvantages: it can leak information of the network(size); processes in the system have no privacy; assign unique ID is costly in bulk-production(e.g, sensors). Hence, homonymous system is appeared. If some processes share the same ID and programmed identically is called homonymous system. Furthermore, if all processes shared the same ID or have no ID is named as anonymous system. In homonymous or anonymous distributed systems, the symmetry problem (i.e., how to distinguish messages sent from which process) is the main obstacle in the design of algorithms. This thesis is aimed to propose different symmetry break methods (e.g., random function, counting technique, etc.) to solve agreement problem. Agreement is a fundamental problem in distributed computing including a family of abstractions. In this thesis, we mainly focus on the design of consensus, set agreement, broadcast algorithms in anonymous and homonymous distributed systems. Firstly, the fault-tolerant broadcast abstraction is studied in anonymous systems with reliable or fair lossy communication channels separately. Two classes of anonymous failure detectors AΘ and AP∗ are proposed, and both of them together with a already proposed failure detector ψ are implemented and used to enrich the system model to implement broadcast abstraction. Then, in the study of the consensus abstraction, it is proved the AΩ′ failure detector class is strictly weaker than AΩ and AΩ′ is implementable. The first implementation of consensus in anonymous asynchronous distributed systems augmented with AΩ′ and where a majority of processes does not crash. Finally, a general consensus problem– k-set agreement is researched and the weakest failure detector L used to solve it, in asynchronous message passing systems where processes may crash and recover, with homonyms (i.e., processes may have equal identities), and without a complete initial knowledge of the membership.
Resumo:
Los ataques a redes de información son cada vez más sofisticados y exigen una constante evolución y mejora de las técnicas de detección. Para ello, en este proyecto se ha diseñado e implementado una plataforma cooperativa para la detección de intrusiones basada en red. En primer lugar, se ha realizado un estudio teórico previo del marco tecnológico relacionado con este ámbito, en el que se describe y caracteriza el software que se utiliza para realizar ataques a sistemas (malware) así como los métodos que se utilizan para llegar a transmitir ese software (vectores de ataque). En el documento también se describen los llamados APT, que son ataques dirigidos con una gran inversión económica y temporal. Estos pueden englobar todos los malware y vectores de ataque existentes. Para poder evitar estos ataques, se estudiarán los sistemas de detección y prevención de intrusiones, describiendo brevemente los algoritmos que se tienden a utilizar en la actualidad. En segundo lugar, se ha planteado y desarrollado una plataforma en red dedicada al análisis de paquetes y conexiones para detectar posibles intrusiones. Este sistema está orientado a sistemas SCADA (Supervisory Control And Data Adquisition) aunque funciona sobre cualquier red IPv4/IPv6, para ello se definirá previamente lo que es un sistema SCADA, así como sus partes principales. Para implementar el sistema se han utilizado dispositivos de bajo consumo llamados Raspberry PI, estos se ubican entre la red y el equipo final que se quiera analizar. En ellos se ejecutan 2 aplicaciones desarrolladas de tipo cliente-servidor (la Raspberry central ejecutará la aplicación servidora y las esclavas la aplicación cliente) que funcionan de forma cooperativa utilizando la tecnología distribuida de Hadoop, la cual se explica previamente. Mediante esta tecnología se consigue desarrollar un sistema completamente escalable. La aplicación servidora muestra una interfaz gráfica que permite administrar la plataforma de análisis de forma centralizada, pudiendo ver así las alarmas de cada dispositivo y calificando cada paquete según su peligrosidad. El algoritmo desarrollado en la aplicación calcula el ratio de paquetes/tiempo que entran/salen del equipo final, procesando los paquetes y analizándolos teniendo en cuenta la información de señalización, creando diferentes bases de datos que irán mejorando la robustez del sistema, reduciendo así la posibilidad de ataques externos. Para concluir, el proyecto inicial incluía el procesamiento en la nube de la aplicación principal, pudiendo administrar así varias infraestructuras concurrentemente, aunque debido al trabajo extra necesario se ha dejado preparado el sistema para poder implementar esta funcionalidad. En el caso experimental actual el procesamiento de la aplicación servidora se realiza en la Raspberry principal, creando un sistema escalable, rápido y tolerante a fallos. ABSTRACT. The attacks to networks of information are increasingly sophisticated and demand a constant evolution and improvement of the technologies of detection. For this project it is developed and implemented a cooperative platform for detect intrusions based on networking. First, there has been a previous theoretical study of technological framework related to this area, which describes the software used for attacks on systems (malware) as well as the methods used in order to transmit this software (attack vectors). In this document it is described the APT, which are attacks directed with a big economic and time inversion. These can contain all existing malware and attack vectors. To prevent these attacks, intrusion detection systems and prevention intrusion systems will be discussed, describing previously the algorithms tend to use today. Secondly, a platform for analyzing network packets has been proposed and developed to detect possible intrusions in SCADA (Supervisory Control And Data Adquisition) systems. This platform is designed for SCADA systems (Supervisory Control And Data Acquisition) but works on any IPv4 / IPv6 network. Previously, it is defined what a SCADA system is and the main parts of it. To implement it, we used low-power devices called Raspberry PI, these are located between the network and the final device to analyze it. In these Raspberry run two applications client-server developed (the central Raspberry runs the server application and the slaves the client application) that work cooperatively using Hadoop distributed technology, which is previously explained. Using this technology is achieved develop a fully scalable system. The server application displays a graphical interface to manage analytics platform centrally, thereby we can see each device alarms and qualifying each packet by dangerousness. The algorithm developed in the application calculates the ratio of packets/time entering/leaving the terminal device, processing the packets and analyzing the signaling information of each packet, reating different databases that will improve the system, thereby reducing the possibility of external attacks. In conclusion, the initial project included cloud computing of the main application, being able to manage multiple concurrent infrastructure, but due to the extra work required has been made ready the system to implement this funcionality. In the current test case the server application processing is made on the main Raspberry, creating a scalable, fast and fault-tolerant system.
Resumo:
Debido al gran incremento de datos digitales que ha tenido lugar en los últimos años, ha surgido un nuevo paradigma de computación paralela para el procesamiento eficiente de grandes volúmenes de datos. Muchos de los sistemas basados en este paradigma, también llamados sistemas de computación intensiva de datos, siguen el modelo de programación de Google MapReduce. La principal ventaja de los sistemas MapReduce es que se basan en la idea de enviar la computación donde residen los datos, tratando de proporcionar escalabilidad y eficiencia. En escenarios libres de fallo, estos sistemas generalmente logran buenos resultados. Sin embargo, la mayoría de escenarios donde se utilizan, se caracterizan por la existencia de fallos. Por tanto, estas plataformas suelen incorporar características de tolerancia a fallos y fiabilidad. Por otro lado, es reconocido que las mejoras en confiabilidad vienen asociadas a costes adicionales en recursos. Esto es razonable y los proveedores que ofrecen este tipo de infraestructuras son conscientes de ello. No obstante, no todos los enfoques proporcionan la misma solución de compromiso entre las capacidades de tolerancia a fallo (o de manera general, las capacidades de fiabilidad) y su coste. Esta tesis ha tratado la problemática de la coexistencia entre fiabilidad y eficiencia de los recursos en los sistemas basados en el paradigma MapReduce, a través de metodologías que introducen el mínimo coste, garantizando un nivel adecuado de fiabilidad. Para lograr esto, se ha propuesto: (i) la formalización de una abstracción de detección de fallos; (ii) una solución alternativa a los puntos únicos de fallo de estas plataformas, y, finalmente, (iii) un nuevo sistema de asignación de recursos basado en retroalimentación a nivel de contenedores. Estas contribuciones genéricas han sido evaluadas tomando como referencia la arquitectura Hadoop YARN, que, hoy en día, es la plataforma de referencia en la comunidad de los sistemas de computación intensiva de datos. En la tesis se demuestra cómo todas las contribuciones de la misma superan a Hadoop YARN tanto en fiabilidad como en eficiencia de los recursos utilizados. ABSTRACT Due to the increase of huge data volumes, a new parallel computing paradigm to process big data in an efficient way has arisen. Many of these systems, called dataintensive computing systems, follow the Google MapReduce programming model. The main advantage of these systems is based on the idea of sending the computation where the data resides, trying to provide scalability and efficiency. In failure-free scenarios, these frameworks usually achieve good results. However, these ones are not realistic scenarios. Consequently, these frameworks exhibit some fault tolerance and dependability techniques as built-in features. On the other hand, dependability improvements are known to imply additional resource costs. This is reasonable and providers offering these infrastructures are aware of this. Nevertheless, not all the approaches provide the same tradeoff between fault tolerant capabilities (or more generally, reliability capabilities) and cost. In this thesis, we have addressed the coexistence between reliability and resource efficiency in MapReduce-based systems, looking for methodologies that introduce the minimal cost and guarantee an appropriate level of reliability. In order to achieve this, we have proposed: (i) a formalization of a failure detector abstraction; (ii) an alternative solution to single points of failure of these frameworks, and finally (iii) a novel feedback-based resource allocation system at the container level. Finally, our generic contributions have been instantiated for the Hadoop YARN architecture, which is the state-of-the-art framework in the data-intensive computing systems community nowadays. The thesis demonstrates how all our approaches outperform Hadoop YARN in terms of reliability and resource efficiency.