876 resultados para Distributed data
Resumo:
This project was a step forward in developing intrusion detection systems in distributed environments such as web services. It investigates a new approach of detection based on so-called "taint-marking" techniques and introduces a theoretical framework along with its implementation in the Linux kernel.
Resumo:
This research studied distributed computing of all-to-all comparison problems with big data sets. The thesis formalised the problem, and developed a high-performance and scalable computing framework with a programming model, data distribution strategies and task scheduling policies to solve the problem. The study considered storage usage, data locality and load balancing for performance improvement in solving the problem. The research outcomes can be applied in bioinformatics, biometrics and data mining and other domains in which all-to-all comparisons are a typical computing pattern.
Resumo:
Solving large-scale all-to-all comparison problems using distributed computing is increasingly significant for various applications. Previous efforts to implement distributed all-to-all comparison frameworks have treated the two phases of data distribution and comparison task scheduling separately. This leads to high storage demands as well as poor data locality for the comparison tasks, thus creating a need to redistribute the data at runtime. Furthermore, most previous methods have been developed for homogeneous computing environments, so their overall performance is degraded even further when they are used in heterogeneous distributed systems. To tackle these challenges, this paper presents a data-aware task scheduling approach for solving all-to-all comparison problems in heterogeneous distributed systems. The approach formulates the requirements for data distribution and comparison task scheduling simultaneously as a constrained optimization problem. Then, metaheuristic data pre-scheduling and dynamic task scheduling strategies are developed along with an algorithmic implementation to solve the problem. The approach provides perfect data locality for all comparison tasks, avoiding rearrangement of data at runtime. It achieves load balancing among heterogeneous computing nodes, thus enhancing the overall computation time. It also reduces data storage requirements across the network. The effectiveness of the approach is demonstrated through experimental studies.
Resumo:
A key trait of Free and Open Source Software (FOSS) development is its distributed nature. Nevertheless, two project-level operations, the fork and the merge of program code, are among the least well understood events in the lifespan of a FOSS project. Some projects have explicitly adopted these operations as the primary means of concurrent development. In this study, we examine the effect of highly distributed software development, is found in the Linux kernel project, on collection and modelling of software development data. We find that distributed development calls for sophisticated temporal modelling techniques where several versions of the source code tree can exist at once. Attention must be turned towards the methods of quality assurance and peer review that projects employ to manage these parallel source trees. Our analysis indicates that two new metrics, fork rate and merge rate, could be useful for determining the role of distributed version control systems in FOSS projects. The study presents a preliminary data set consisting of version control and mailing list data.
Resumo:
A key trait of Free and Open Source Software (FOSS) development is its distributed nature. Nevertheless, two project-level operations, the fork and the merge of program code, are among the least well understood events in the lifespan of a FOSS project. Some projects have explicitly adopted these operations as the primary means of concurrent development. In this study, we examine the effect of highly distributed software development, is found in the Linux kernel project, on collection and modelling of software development data. We find that distributed development calls for sophisticated temporal modelling techniques where several versions of the source code tree can exist at once. Attention must be turned towards the methods of quality assurance and peer review that projects employ to manage these parallel source trees. Our analysis indicates that two new metrics, fork rate and merge rate, could be useful for determining the role of distributed version control systems in FOSS projects. The study presents a preliminary data set consisting of version control and mailing list data.
Resumo:
The lifetime calculation of large dense sensor networks with fixed energy resources and the remaining residual energy have shown that for a constant energy resource in a sensor network the fault rate at the cluster head is network size invariant when using the network layer with no MAC losses.Even after increasing the battery capacities in the nodes the total lifetime does not increase after a max limit of 8 times. As this is a serious limitation lots of research has been done at the MAC layer which allows to adapt to the specific connectivity, traffic and channel polling needs for sensor networks. There have been lots of MAC protocols which allow to control the channel polling of new radios which are available to sensor nodes to communicate. This further reduces the communication overhead by idling and sleep scheduling thus extending the lifetime of the monitoring application. We address the two issues which effects the distributed characteristics and performance of connected MAC nodes. (1) To determine the theoretical minimum rate based on joint coding for a correlated data source at the singlehop, (2a) to estimate cluster head errors using Bayesian rule for routing using persistence clustering when node densities are the same and stored using prior probability at the network layer, (2b) to estimate the upper bound of routing errors when using passive clustering were the node densities at the multi-hop MACS are unknown and not stored at the multi-hop nodes a priori. In this paper we evaluate many MAC based sensor network protocols and study the effects on sensor network lifetime. A renewable energy MAC routing protocol is designed when the probabilities of active nodes are not known a priori. From theoretical derivations we show that for a Bayesian rule with known class densities of omega1, omega2 with expected error P* is bounded by max error rate of P=2P* for single-hop. We study the effects of energy losses using cross-layer simulation of - large sensor network MACS setup, the error rate which effect finding sufficient node densities to have reliable multi-hop communications due to unknown node densities. The simulation results show that even though the lifetime is comparable the expected Bayesian posterior probability error bound is close or higher than Pges2P*.
Resumo:
This paper primarily intends to develop a GIS (geographical information system)-based data mining approach for optimally selecting the locations and determining installed capacities for setting up distributed biomass power generation systems in the context of decentralized energy planning for rural regions. The optimal locations within a cluster of villages are obtained by matching the installed capacity needed with the demand for power, minimizing the cost of transportation of biomass from dispersed sources to power generation system, and cost of distribution of electricity from the power generation system to demand centers or villages. The methodology was validated by using it for developing an optimal plan for implementing distributed biomass-based power systems for meeting the rural electricity needs of Tumkur district in India consisting of 2700 villages. The approach uses a k-medoid clustering algorithm to divide the total region into clusters of villages and locate biomass power generation systems at the medoids. The optimal value of k is determined iteratively by running the algorithm for the entire search space for different values of k along with demand-supply matching constraints. The optimal value of the k is chosen such that it minimizes the total cost of system installation, costs of transportation of biomass, and transmission and distribution. A smaller region, consisting of 293 villages was selected to study the sensitivity of the results to varying demand and supply parameters. The results of clustering are represented on a GIS map for the region.
Resumo:
Programming for parallel architectures that do not have a shared address space is extremely difficult due to the need for explicit communication between memories of different compute devices. A heterogeneous system with CPUs and multiple GPUs, or a distributed-memory cluster are examples of such systems. Past works that try to automate data movement for distributed-memory architectures can lead to excessive redundant communication. In this paper, we propose an automatic data movement scheme that minimizes the volume of communication between compute devices in heterogeneous and distributed-memory systems. We show that by partitioning data dependences in a particular non-trivial way, one can generate data movement code that results in the minimum volume for a vast majority of cases. The techniques are applicable to any sequence of affine loop nests and works on top of any choice of loop transformations, parallelization, and computation placement. The data movement code generated minimizes the volume of communication for a particular configuration of these. We use a combination of powerful static analyses relying on the polyhedral compiler framework and lightweight runtime routines they generate, to build a source-to-source transformation tool that automatically generates communication code. We demonstrate that the tool is scalable and leads to substantial gains in efficiency. On a heterogeneous system, the communication volume is reduced by a factor of 11X to 83X over state-of-the-art, translating into a mean execution time speedup of 1.53X. On a distributed-memory cluster, our scheme reduces the communication volume by a factor of 1.4X to 63.5X over state-of-the-art, resulting in a mean speedup of 1.55X. In addition, our scheme yields a mean speedup of 2.19X over hand-optimized UPC codes.
Resumo:
This paper studies a pilot-assisted physical layer data fusion technique known as Distributed Co-Phasing (DCP). In this two-phase scheme, the sensors first estimate the channel to the fusion center (FC) using pilots sent by the latter; and then they simultaneously transmit their common data by pre-rotating them by the estimated channel phase, thereby achieving physical layer data fusion. First, by analyzing the symmetric mutual information of the system, it is shown that the use of higher order constellations (HOC) can improve the throughput of DCP compared to the binary signaling considered heretofore. Using an HOC in the DCP setting requires the estimation of the composite DCP channel at the FC for data decoding. To this end, two blind algorithms are proposed: 1) power method, and 2) modified K-means algorithm. The latter algorithm is shown to be computationally efficient and converges significantly faster than the conventional K-means algorithm. Analytical expressions for the probability of error are derived, and it is found that even at moderate to low SNRs, the modified K-means algorithm achieves a probability of error comparable to that achievable with a perfect channel estimate at the FC, while requiring no pilot symbols to be transmitted from the sensor nodes. Also, the problem of signal corruption due to imperfect DCP is investigated, and constellation shaping to minimize the probability of signal corruption is proposed and analyzed. The analysis is validated, and the promising performance of DCP for energy-efficient physical layer data fusion is illustrated, using Monte Carlo simulations.
Resumo:
It is estimated that the quantity of digital data being transferred, processed or stored at any one time currently stands at 4.4 zettabytes (4.4 × 2 70 bytes) and this figure is expected to have grown by a factor of 10 to 44 zettabytes by 2020. Exploiting this data is, and will remain, a significant challenge. At present there is the capacity to store 33% of digital data in existence at any one time; by 2020 this capacity is expected to fall to 15%. These statistics suggest that, in the era of Big Data, the identification of important, exploitable data will need to be done in a timely manner. Systems for the monitoring and analysis of data, e.g. stock markets, smart grids and sensor networks, can be made up of massive numbers of individual components. These components can be geographically distributed yet may interact with one another via continuous data streams, which in turn may affect the state of the sender or receiver. This introduces a dynamic causality, which further complicates the overall system by introducing a temporal constraint that is difficult to accommodate. Practical approaches to realising the system described above have led to a multiplicity of analysis techniques, each of which concentrates on specific characteristics of the system being analysed and treats these characteristics as the dominant component affecting the results being sought. The multiplicity of analysis techniques introduces another layer of heterogeneity, that is heterogeneity of approach, partitioning the field to the extent that results from one domain are difficult to exploit in another. The question is asked can a generic solution for the monitoring and analysis of data that: accommodates temporal constraints; bridges the gap between expert knowledge and raw data; and enables data to be effectively interpreted and exploited in a transparent manner, be identified? The approach proposed in this dissertation acquires, analyses and processes data in a manner that is free of the constraints of any particular analysis technique, while at the same time facilitating these techniques where appropriate. Constraints are applied by defining a workflow based on the production, interpretation and consumption of data. This supports the application of different analysis techniques on the same raw data without the danger of incorporating hidden bias that may exist. To illustrate and to realise this approach a software platform has been created that allows for the transparent analysis of data, combining analysis techniques with a maintainable record of provenance so that independent third party analysis can be applied to verify any derived conclusions. In order to demonstrate these concepts, a complex real world example involving the near real-time capturing and analysis of neurophysiological data from a neonatal intensive care unit (NICU) was chosen. A system was engineered to gather raw data, analyse that data using different analysis techniques, uncover information, incorporate that information into the system and curate the evolution of the discovered knowledge. The application domain was chosen for three reasons: firstly because it is complex and no comprehensive solution exists; secondly, it requires tight interaction with domain experts, thus requiring the handling of subjective knowledge and inference; and thirdly, given the dearth of neurophysiologists, there is a real world need to provide a solution for this domain
Resumo:
Hydrologic research is a very demanding application of fiber-optic distributed temperature sensing (DTS) in terms of precision, accuracy and calibration. The physics behind the most frequently used DTS instruments are considered as they apply to four calibration methods for single-ended DTS installations. The new methods presented are more accurate than the instrument-calibrated data, achieving accuracies on the order of tenths of a degree root mean square error (RMSE) and mean bias. Effects of localized non-uniformities that violate the assumptions of single-ended calibration data are explored and quantified. Experimental design considerations such as selection of integration times or selection of the length of the reference sections are discussed, and the impacts of these considerations on calibrated temperatures are explored in two case studies.
Resumo:
We examined how marine plankton interaction networks, as inferred by multivariate autoregressive (MAR) analysis of time-series, differ based on data collected at a fixed sampling location (L4 station in the Western English Channel) and four similar time-series prepared by averaging Continuous Plankton Recorder (CPR) datapoints in the region surrounding the fixed station. None of the plankton community structures suggested by the MAR models generated from the CPR datasets were well correlated with the MAR model for L4, but of the four CPR models, the one most closely resembling the L4 model was that for the CPR region nearest to L4. We infer that observation error and spatial variation in plankton community dynamics influenced the model performance for the CPR datasets. A modified MAR framework in which observation error and spatial variation are explicitly incorporated could allow the analysis to better handle the diverse time-series data collected in marine environments.
Resumo:
Researchers want to analyse Health Care data which may requires large pools of compute and data resources. To have them they need access to Distributed Computing Infrastructures (DCI). To use them it requires expertise which researchers may not have. Workflows can hide infrastructures. There are many workflow systems but they are not interoperable. To learn a workflow system and create workflows in a workflow system may require significant effort. Considering these efforts it is not reasonable to expect that researchers will learn new workflow systems if they want to run workflows of other workflow systems. As a result, the lack of interoperability prevents workflow sharing and a vast amount of research efforts is wasted. The FP7 Sharing Interoperable Workflow for Large-Scale Scientific Simulation on Available DCIs (SHIWA) project developed the Coarse-Grained Interoperability (CGI) to enable workflow sharing. The project created the SHIWA Simulation Platform (SSP) to support CGI as a production-level service. The paper describes how the CGI approach can be used for analysis and simulation in Health Care.