884 resultados para Hadoop distributed file system (HDFS)
Resumo:
Hundreds of Terabytes of CMS (Compact Muon Solenoid) data are being accumulated for storage day by day at the University of Nebraska-Lincoln, which is one of the eight US CMS Tier-2 sites. Managing this data includes retaining useful CMS data sets and clearing storage space for newly arriving data by deleting less useful data sets. This is an important task that is currently being done manually and it requires a large amount of time. The overall objective of this study was to develop a methodology to help identify the data sets to be deleted when there is a requirement for storage space. CMS data is stored using HDFS (Hadoop Distributed File System). HDFS logs give information regarding file access operations. Hadoop MapReduce was used to feed information in these logs to Support Vector Machines (SVMs), a machine learning algorithm applicable to classification and regression which is used in this Thesis to develop a classifier. Time elapsed in data set classification by this method is dependent on the size of the input HDFS log file since the algorithmic complexities of Hadoop MapReduce algorithms here are O(n). The SVM methodology produces a list of data sets for deletion along with their respective sizes. This methodology was also compared with a heuristic called Retention Cost which was calculated using size of the data set and the time since its last access to help decide how useful a data set is. Accuracies of both were compared by calculating the percentage of data sets predicted for deletion which were accessed at a later instance of time. Our methodology using SVMs proved to be more accurate than using the Retention Cost heuristic. This methodology could be used to solve similar problems involving other large data sets.
Resumo:
In order to simplify computer management, several system administrators are adopting advanced techniques to manage software configuration of enterprise computer networks, but the tight coupling between hardware and software makes every PC an individual managed entity, lowering the scalability and increasing the costs to manage hundreds or thousands of PCs. Virtualization is an established technology, however its use is been more focused on server consolidation and virtual desktop infrastructure, not for managing distributed computers over a network. This paper discusses the feasibility of the Distributed Virtual Machine Environment, a new approach for enterprise computer management that combines virtualization and distributed system architecture as the basis of the management architecture. © 2008 IEEE.
Resumo:
During the last decades, we assisted to what is called “information explosion”. With the advent of the new technologies and new contexts, the volume, velocity and variety of data has increased exponentially, becoming what is known today as big data. Among them, we emphasize telecommunications operators, which gather, using network monitoring equipment, millions of network event records, the Call Detail Records (CDRs) and the Event Detail Records (EDRs), commonly known as xDRs. These records are stored and later processed to compute network performance and quality of service metrics. With the ever increasing number of collected xDRs, its generated volume needing to be stored has increased exponentially, making the current solutions based on relational databases not suited anymore. To tackle this problem, the relational data store can be replaced by Hadoop File System (HDFS). However, HDFS is simply a distributed file system, this way not supporting any aspect of the relational paradigm. To overcome this difficulty, this paper presents a framework that enables the current systems inserting data into relational databases, to keep doing it transparently when migrating to Hadoop. As proof of concept, the developed platform was integrated with the Altaia - a performance and QoS management of telecommunications networks and services.
Resumo:
Dissertação apresentada para obtenção do Grau de Doutor em Informática Pela Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia
Resumo:
Trabalho apresentado no âmbito do Mestrado em Engenharia Informática, como requisito parcial para obtenção do grau de Mestre em Engenharia Informática
Resumo:
In this paper we present BitWorker, a platform for community distributed computing based on BitTorrent. Any splittable task can be easily specified by a user in a meta-information task file, such that it can be downloaded and performed by other volunteers. Peers find each other using Distributed Hash Tables, download existing results, and compute missing ones. Unlike existing distributed computing schemes relying on centralized coordination point(s), our scheme is totally distributed, therefore, highly robust. We evaluate the performance of BitWorker using mathematical models and real tests, showing processing and robustness gains. BitWorker is available for download and use by the community.
Resumo:
This research is aimed to find a solution for a distributed storage system adapted for CoDeS. By studying how DSSs work and how they are implemented, we can conclude how we can implement a DSS compatible with CoDeS requirements.
Resumo:
In this thesis the bifurcational behavior of the solutions of Langford system is analysed. The equilibriums of the Langford system are found, and the stability of equilibriums is discussed. The conditions of loss of stability are found. The periodic solution of the system is approximated. We consider three types of boundary condition for Langford spatially distributed system: Neumann conditions, Dirichlet conditions and Neumann conditions with additional requirement of zero average. We apply the Lyapunov-Schmidt method to Langford spatially distributed system for asymptotic approximation of the periodic mode. We analyse the influence of the diffusion on the behavior of self-oscillations. As well in the present work we perform numerical experiments and compare it with the analytical results.
Resumo:
In this paper, we have evolved a generic software architecture for a domain specific distributed embedded system. The system under consideration belongs to the Command, Control and Communication systems domain. The systems in such domain have very long operational lifetime. The quality attributes of these systems are equally important as the functional requirements. The main guiding principle followed in this paper for evolving the software architecture has been functional independence of the modules. The quality attributes considered most important for the system are maintainability and modifiability. Architectural styles best suited for the functionally independent modules are proposed with focus on these quality attributes. The software architecture for the system is envisioned as a collection of architecture styles of the functionally independent modules identified
Resumo:
The number of electronic devices connected to agricultural machinery is increasing to support new agricultural practices tasks related to the Precision Agriculture such as spatial variability mapping and Variable Rate Technology (VRT). The Distributed Control System (DCS) is a suitable solution for decentralization of the data acquisition system and the Controller Area Network (CAN) is the major trend among the embedded communications protocols for agricultural machinery and vehicles. The application of soil correctives is a typical problem in Brazil. The efficiency of this correction process is highly dependent of the inputs way at soil and the occurrence of errors affects directly the agricultural yield. To handle this problem, this paper presents the development of a CAN-based distributed control system for a VRT system of soil corrective in agricultural machinery. The VRT system is composed by a tractor-implement that applies a desired rate of inputs according to the georeferenced prescription map of the farm field to support PA (Precision Agriculture). The performance evaluation of the CAN-based VRT system was done by experimental tests and analyzing the CAN messages transmitted in the operation of the entire system. The results of the control error according to the necessity of agricultural application allow conclude that the developed VRT system is suitable for the agricultural productions reaching an acceptable response time and application error. The CAN-Based DCS solution applied in the VRT system reduced the complexity of the control system, easing the installation and maintenance. The use of VRT system allowed applying only the required inputs, increasing the efficiency operation and minimizing the environmental impact.
Resumo:
Current scientific applications have been producing large amounts of data. The processing, handling and analysis of such data require large-scale computing infrastructures such as clusters and grids. In this area, studies aim at improving the performance of data-intensive applications by optimizing data accesses. In order to achieve this goal, distributed storage systems have been considering techniques of data replication, migration, distribution, and access parallelism. However, the main drawback of those studies is that they do not take into account application behavior to perform data access optimization. This limitation motivated this paper which applies strategies to support the online prediction of application behavior in order to optimize data access operations on distributed systems, without requiring any information on past executions. In order to accomplish such a goal, this approach organizes application behaviors as time series and, then, analyzes and classifies those series according to their properties. By knowing properties, the approach selects modeling techniques to represent series and perform predictions, which are, later on, used to optimize data access operations. This new approach was implemented and evaluated using the OptorSim simulator, sponsored by the LHC-CERN project and widely employed by the scientific community. Experiments confirm this new approach reduces application execution time in about 50 percent, specially when handling large amounts of data.
Resumo:
The research activity carried out during the PhD course in Electrical Engineering belongs to the branch of electric and electronic measurements. The main subject of the present thesis is a distributed measurement system to be installed in Medium Voltage power networks, as well as the method developed to analyze data acquired by the measurement system itself and to monitor power quality. In chapter 2 the increasing interest towards power quality in electrical systems is illustrated, by reporting the international research activity inherent to the problem and the relevant standards and guidelines emitted. The aspect of the quality of voltage provided by utilities and influenced by customers in the various points of a network came out only in recent years, in particular as a consequence of the energy market liberalization. Usually, the concept of quality of the delivered energy has been associated mostly to its continuity. Hence the reliability was the main characteristic to be ensured for power systems. Nowadays, the number and duration of interruptions are the “quality indicators” commonly perceived by most customers; for this reason, a short section is dedicated also to network reliability and its regulation. In this contest it should be noted that although the measurement system developed during the research activity belongs to the field of power quality evaluation systems, the information registered in real time by its remote stations can be used to improve the system reliability too. Given the vast scenario of power quality degrading phenomena that usually can occur in distribution networks, the study has been focused on electromagnetic transients affecting line voltages. The outcome of such a study has been the design and realization of a distributed measurement system which continuously monitor the phase signals in different points of a network, detect the occurrence of transients superposed to the fundamental steady state component and register the time of occurrence of such events. The data set is finally used to locate the source of the transient disturbance propagating along the network lines. Most of the oscillatory transients affecting line voltages are due to faults occurring in any point of the distribution system and have to be seen before protection equipment intervention. An important conclusion is that the method can improve the monitored network reliability, since the knowledge of the location of a fault allows the energy manager to reduce as much as possible both the area of the network to be disconnected for protection purposes and the time spent by technical staff to recover the abnormal condition and/or the damage. The part of the thesis presenting the results of such a study and activity is structured as follows: chapter 3 deals with the propagation of electromagnetic transients in power systems by defining characteristics and causes of the phenomena and briefly reporting the theory and approaches used to study transients propagation. Then the state of the art concerning methods to detect and locate faults in distribution networks is presented. Finally the attention is paid on the particular technique adopted for the same purpose during the thesis, and the methods developed on the basis of such approach. Chapter 4 reports the configuration of the distribution networks on which the fault location method has been applied by means of simulations as well as the results obtained case by case. In this way the performance featured by the location procedure firstly in ideal then in realistic operating conditions are tested. In chapter 5 the measurement system designed to implement the transients detection and fault location method is presented. The hardware belonging to the measurement chain of every acquisition channel in remote stations is described. Then, the global measurement system is characterized by considering the non ideal aspects of each device that can concur to the final combined uncertainty on the estimated position of the fault in the network under test. Finally, such parameter is computed according to the Guide to the Expression of Uncertainty in Measurements, by means of a numeric procedure. In the last chapter a device is described that has been designed and realized during the PhD activity aiming at substituting the commercial capacitive voltage divider belonging to the conditioning block of the measurement chain. Such a study has been carried out aiming at providing an alternative to the used transducer that could feature equivalent performance and lower cost. In this way, the economical impact of the investment associated to the whole measurement system would be significantly reduced, making the method application much more feasible.
Resumo:
La necessità di sincronizzare i propri dati si presenta in una moltitudine di situazioni, infatti il numero di dispositivi informatici a nostra disposizione è in continua crescita e, all' aumentare del loro numero, cresce l' esigenza di mantenere aggiornate le multiple copie dei dati in essi memorizzati. Vi sono diversi fattori che complicano tale situazione, tra questi la varietà sempre maggiore dei sistemi operativi utilizzati nei diversi dispositivi, si parla di Microsoft Windows, delle tante distribuzioni Linux, di Mac OS X, di Solaris o di altri sistemi operativi UNIX, senza contare i sistemi operativi più orientati al settore mobile come Android. Ogni sistema operativo ha inoltre un modo particolare di gestire i dati, si pensi alla differente gestione dei permessi dei file o alla sensibilità alle maiuscole. Bisogna anche considerare che se gli aggiornamenti dei dati avvenissero soltanto su di uno di questi dispositivi sarebbe richiesta una semplice copia dei dati aggiornati sugli altri dispositivi, ma che non è sempre possibile utilizzare tale approccio. Infatti i dati vengono spesso aggiornati in maniera indipendente in più di un dispositivo, magari nello stesso momento, è pertanto necessario che le applicazioni che si occupano di sincronizzare tali dati riconoscano le situazioni di conflitto, nelle quali gli stessi dati sono stati aggiornati in più di una copia ed in maniera differente, e permettano di risolverle, uniformando lo stato delle repliche. Considerando l' importanza e il valore che possono avere i dati, sia a livello lavorativo che personale, è necessario che tali applicazioni possano garantirne la sicurezza, evitando in ogni caso un loro danneggiamento, perchè sempre più spesso il valore di un dispositivo dipende più dai dati in esso contenuti che dal costo dello hardware. In questa tesi verranno illustrate alcune idee alternative su come possa aver luogo la condivisione e la sincronizzazione di dati tra sistemi operativi diversi, sia nel caso in cui siano installati nello stesso dispositivo che tra dispositivi differenti. La prima parte della tesi descriverà nel dettaglio l' applicativo Unison. Tale applicazione, consente di mantenere sincronizzate tra di loro repliche dei dati, memorizzate in diversi dispositivi che possono anche eseguire sistemi operativi differenti. Unison funziona a livello utente, analizzando separatamente lo stato delle repliche al momento dell' esecuzione, senza cioè mantenere traccia delle operazioni che sono state effettuate sui dati per modificarli dal loro stato precedente a quello attuale. Unison permette la sincronizzazione anche quando i dati siano stati modificati in maniera indipendente su più di un dispositivo, occupandosi di risolvere gli eventuali conflitti che possono verificarsi rispettando la volontà dell' utente. Verranno messe in evidenza le strategie utilizzate dai suoi ideatori per garantire la sicurezza dei dati ad esso affidati e come queste abbiano effetto nelle più diverse condizioni. Verrà poi fornita un' analisi dettagiata di come possa essere utilizzata l' applicazione, fornendo una descrizione accurata delle funzionalità e vari esempi per renderne più chiaro il funzionamento. Nella seconda parte della tesi si descriverà invece come condividere file system tra sistemi operativi diversi all' interno della stessa macchina, si tratta di un approccio diametralmente opposto al precedente, in cui al posto di avere una singola copia dei dati, si manteneva una replica per ogni dispositivo coinvolto. Concentrando l' attenzione sui sistemi operativi Linux e Microsoft Windows verranno descritti approfonditamente gli strumenti utilizzati e illustrate le caratteristiche tecniche sottostanti.
Resumo:
L'obiettivo di questa tesi è studiare la fattibilità dello studio della produzione associata ttH del bosone di Higgs con due quark top nell'esperimento CMS, e valutare le funzionalità e le caratteristiche della prossima generazione di toolkit per l'analisi distribuita a CMS (CRAB versione 3) per effettuare tale analisi. Nel settore della fisica del quark top, la produzione ttH è particolarmente interessante, soprattutto perchè rappresenta l'unica opportunità di studiare direttamente il vertice t-H senza dover fare assunzioni riguardanti possibili contributi dalla fisica oltre il Modello Standard. La preparazione per questa analisi è cruciale in questo momento, prima dell'inizio del Run-2 dell'LHC nel 2015. Per essere preparati a tale studio, le implicazioni tecniche di effettuare un'analisi completa in un ambito di calcolo distribuito come la Grid non dovrebbero essere sottovalutate. Per questo motivo, vengono presentati e discussi un'analisi dello stesso strumento CRAB3 (disponibile adesso in versione di pre-produzione) e un confronto diretto di prestazioni con CRAB2. Saranno raccolti e documentati inoltre suggerimenti e consigli per un team di analisi che sarà eventualmente coinvolto in questo studio. Nel Capitolo 1 è introdotta la fisica delle alte energie a LHC nell'esperimento CMS. Il Capitolo 2 discute il modello di calcolo di CMS e il sistema di analisi distribuita della Grid. Nel Capitolo 3 viene brevemente presentata la fisica del quark top e del bosone di Higgs. Il Capitolo 4 è dedicato alla preparazione dell'analisi dal punto di vista degli strumenti della Grid (CRAB3 vs CRAB2). Nel capitolo 5 è presentato e discusso uno studio di fattibilità per un'analisi del canale ttH in termini di efficienza di selezione.