54 resultados para Distributed data
em CentAUR: Central Archive University of Reading - UK
Resumo:
We present a general Multi-Agent System framework for distributed data mining based on a Peer-to-Peer model. Agent protocols are implemented through message-based asynchronous communication. The framework adopts a dynamic load balancing policy that is particularly suitable for irregular search algorithms. A modular design allows a separation of the general-purpose system protocols and software components from the specific data mining algorithm. The experimental evaluation has been carried out on a parallel frequent subgraph mining algorithm, which has shown good scalability performances.
Resumo:
Climate-G is a large scale distributed testbed devoted to climate change research. It is an unfunded effort started in 2008 and involving a wide community both in Europe and US. The testbed is an interdisciplinary effort involving partners from several institutions and joining expertise in the field of climate change and computational science. Its main goal is to allow scientists carrying out geographical and cross-institutional data discovery, access, analysis, visualization and sharing of climate data. It represents an attempt to address, in a real environment, challenging data and metadata management issues. This paper presents a complete overview about the Climate-G testbed highlighting the most important results that have been achieved since the beginning of this project.
Resumo:
Collaborative mining of distributed data streams in a mobile computing environment is referred to as Pocket Data Mining PDM. Hoeffding trees techniques have been experimentally and analytically validated for data stream classification. In this paper, we have proposed, developed and evaluated the adoption of distributed Hoeffding trees for classifying streaming data in PDM applications. We have identified a realistic scenario in which different users equipped with smart mobile devices run a local Hoeffding tree classifier on a subset of the attributes. Thus, we have investigated the mining of vertically partitioned datasets with possible overlap of attributes, which is the more likely case. Our experimental results have validated the efficiency of our proposed model achieving promising accuracy for real deployment.
Resumo:
The P-found protein folding and unfolding simulation repository is designed to allow scientists to perform data mining and other analyses across large, distributed simulation data sets. There are two storage components in P-found: a primary repository of simulation data that is used to populate the second component, and a data warehouse that contains important molecular properties. These properties may be used for data mining studies. Here we demonstrate how grid technologies can support multiple, distributed P-found installations. In particular, we look at two aspects: firstly, how grid data management technologies can be used to access the distributed data warehouses; and secondly, how the grid can be used to transfer analysis programs to the primary repositories — this is an important and challenging aspect of P-found, due to the large data volumes involved and the desire of scientists to maintain control of their own data. The grid technologies we are developing with the P-found system will allow new large data sets of protein folding simulations to be accessed and analysed in novel ways, with significant potential for enabling scientific discovery.
Resumo:
Advances in hardware and software technology enable us to collect, store and distribute large quantities of data on a very large scale. Automatically discovering and extracting hidden knowledge in the form of patterns from these large data volumes is known as data mining. Data mining technology is not only a part of business intelligence, but is also used in many other application areas such as research, marketing and financial analytics. For example medical scientists can use patterns extracted from historic patient data in order to determine if a new patient is likely to respond positively to a particular treatment or not; marketing analysts can use extracted patterns from customer data for future advertisement campaigns; finance experts have an interest in patterns that forecast the development of certain stock market shares for investment recommendations. However, extracting knowledge in the form of patterns from massive data volumes imposes a number of computational challenges in terms of processing time, memory, bandwidth and power consumption. These challenges have led to the development of parallel and distributed data analysis approaches and the utilisation of Grid and Cloud computing. This chapter gives an overview of parallel and distributed computing approaches and how they can be used to scale up data mining to large datasets.
Resumo:
In real world applications sequential algorithms of data mining and data exploration are often unsuitable for datasets with enormous size, high-dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context there is the necessity to develop high performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well known National Cancer Institute’s HIV-screening dataset. We present experimental results on a small-scale computing environment.
Resumo:
Existing research on synchronous remote working in CSCW has highlighted the troubles that can arise because actions at one site are (partially) unavailable to remote colleagues. Such ‘local action’ is routinely characterised as a nuisance, a distraction, subordinate and the like. This paper explores interconnections between ‘local action’ and ‘distributed work’ in the case of a research team virtually collocated through ‘MiMeG’. MiMeG is an e-Social Science tool that facilitates ‘distributed data sessions’ in which social scientists are able to remotely collaborate on the real-time analysis of video data. The data are visible and controllable in a shared workspace and participants are additionally connected via audio conferencing. The findings reveal that whilst the (partial) unavailability of local action is at times problematic, it is also used as a resource for coordinating work. The paper considers how local action is interactionally managed in distributed data sessions and concludes by outlining implications of the analysis for the design and study of technologies to support group-to-group collaboration.
Resumo:
The P-found protein folding and unfolding simulation repository is designed to allow scientists to perform analyses across large, distributed simulation data sets. There are two storage components in P-found: a primary repository of simulation data and a data warehouse. Here we demonstrate how grid technologies can support multiple, distributed P-found installations. In particular we look at two aspects, first how grid data management technologies can be used to access the distributed data warehouses; and secondly, how the grid can be used to transfer analysis programs to the primary repositories --- this is an important and challenging aspect of P-found because the data volumes involved are too large to be centralised. The grid technologies we are developing with the P-found system will allow new large data sets of protein folding simulations to be accessed and analysed in novel ways, with significant potential for enabling new scientific discoveries.
Resumo:
In this article, we review the state-of-the-art techniques in mining data streams for mobile and ubiquitous environments. We start the review with a concise background of data stream processing, presenting the building blocks for mining data streams. In a wide range of applications, data streams are required to be processed on small ubiquitous devices like smartphones and sensor devices. Mobile and ubiquitous data mining target these applications with tailored techniques and approaches addressing scarcity of resources and mobility issues. Two categories can be identified for mobile and ubiquitous mining of streaming data: single-node and distributed. This survey will cover both categories. Mining mobile and ubiquitous data require algorithms with the ability to monitor and adapt the working conditions to the available computational resources. We identify the key characteristics of these algorithms and present illustrative applications. Distributed data stream mining in the mobile environment is then discussed, presenting the Pocket Data Mining framework. Mobility of users stimulates the adoption of context-awareness in this area of research. Context-awareness and collaboration are discussed in the Collaborative Data Stream Mining, where agents share knowledge to learn adaptive accurate models.
Resumo:
Owing to continuous advances in the computational power of handheld devices like smartphones and tablet computers, it has become possible to perform Big Data operations including modern data mining processes onboard these small devices. A decade of research has proved the feasibility of what has been termed as Mobile Data Mining, with a focus on one mobile device running data mining processes. However, it is not before 2010 until the authors of this book initiated the Pocket Data Mining (PDM) project exploiting the seamless communication among handheld devices performing data analysis tasks that were infeasible until recently. PDM is the process of collaboratively extracting knowledge from distributed data streams in a mobile computing environment. This book provides the reader with an in-depth treatment on this emerging area of research. Details of techniques used and thorough experimental studies are given. More importantly and exclusive to this book, the authors provide detailed practical guide on the deployment of PDM in the mobile environment. An important extension to the basic implementation of PDM dealing with concept drift is also reported. In the era of Big Data, potential applications of paramount importance offered by PDM in a variety of domains including security, business and telemedicine are discussed.
Resumo:
Two ongoing projects at ESSC that involve the development of new techniques for extracting information from airborne LiDAR data and combining this information with environmental models will be discussed. The first project in conjunction with Bristol University is aiming to improve 2-D river flood flow models by using remote sensing to provide distributed data for model calibration and validation. Airborne LiDAR can provide such models with a dense and accurate floodplain topography together with vegetation heights for parameterisation of model friction. The vegetation height data can be used to specify a friction factor at each node of a model’s finite element mesh. A LiDAR range image segmenter has been developed which converts a LiDAR image into separate raster maps of surface topography and vegetation height for use in the model. Satellite and airborne SAR data have been used to measure flood extent remotely in order to validate the modelled flood extent. Methods have also been developed for improving the models by decomposing the model’s finite element mesh to reflect floodplain features such as hedges and trees having different frictional properties to their surroundings. Originally developed for rural floodplains, the segmenter is currently being extended to provide DEMs and friction parameter maps for urban floods, by fusing the LiDAR data with digital map data. The second project is concerned with the extraction of tidal channel networks from LiDAR. These networks are important features of the inter-tidal zone, and play a key role in tidal propagation and in the evolution of salt-marshes and tidal flats. The study of their morphology is currently an active area of research, and a number of theories related to networks have been developed which require validation using dense and extensive observations of network forms and cross-sections. The conventional method of measuring networks is cumbersome and subjective, involving manual digitisation of aerial photographs in conjunction with field measurement of channel depths and widths for selected parts of the network. A semi-automatic technique has been developed to extract networks from LiDAR data of the inter-tidal zone. A multi-level knowledge-based approach has been implemented, whereby low level algorithms first extract channel fragments based mainly on image properties then a high level processing stage improves the network using domain knowledge. The approach adopted at low level uses multi-scale edge detection to detect channel edges, then associates adjacent anti-parallel edges together to form channels. The higher level processing includes a channel repair mechanism.
Resumo:
In clinical trials, situations often arise where more than one response from each patient is of interest; and it is required that any decision to stop the study be based upon some or all of these measures simultaneously. Theory for the design of sequential experiments with simultaneous bivariate responses is described by Jennison and Turnbull (Jennison, C., Turnbull, B. W. (1993). Group sequential tests for bivariate response: interim analyses of clinical trials with both efficacy and safety endpoints. Biometrics 49:741-752) and Cook and Farewell (Cook, R. J., Farewell, V. T. (1994). Guidelines for monitoring efficacy and toxicity responses in clinical trials. Biometrics 50:1146-1152) in the context of one efficacy and one safety response. These expositions are in terms of normally distributed data with known covariance. The methods proposed require specification of the correlation, ρ between test statistics monitored as part of the sequential test. It can be difficult to quantify ρ and previous authors have suggested simply taking the lowest plausible value, as this will guarantee power. This paper begins with an illustration of the effect that inappropriate specification of ρ can have on the preservation of trial error rates. It is shown that both the type I error and the power can be adversely affected. As a possible solution to this problem, formulas are provided for the calculation of correlation from data collected as part of the trial. An adaptive approach is proposed and evaluated that makes use of these formulas and an example is provided to illustrate the method. Attention is restricted to the bivariate case for ease of computation, although the formulas derived are applicable in the general multivariate case.
Resumo:
This article reviews current technological developments, particularly Peer-to-Peer technologies and Distributed Data Systems, and their value to community memory projects, particularly those concerned with the preservation of the cultural, literary and administrative data of cultures which have suffered genocide or are at risk of genocide. It draws attention to the comparatively good representation online of genocide denial groups and changes in the technological strategies of holocaust denial and other far-right groups. It draws on the author's work in providing IT support for a UK-based Non-Governmental Organization providing support for survivors of genocide in Rwanda.
Resumo:
The Distributed Rule Induction (DRI) project at the University of Portsmouth is concerned with distributed data mining algorithms for automatically generating rules of all kinds. In this paper we present a system architecture and its implementation for inducing modular classification rules in parallel in a local area network using a distributed blackboard system. We present initial results of a prototype implementation based on the Prism algorithm.