68 results for Data stream mining


Relevance: 40.00%

Abstract:

Exascale systems are the next frontier in high-performance computing and are expected to deliver a performance of the order of 10^18 operations per second using massive multicore processors. Very large- and extreme-scale parallel systems pose critical algorithmic challenges, especially related to concurrency, locality and the need to avoid global communication patterns. This work investigates a novel protocol for dynamic group communication that can be used to remove the global communication requirement and to reduce the communication cost in parallel formulations of iterative data mining algorithms. The protocol is used to provide a communication-efficient parallel formulation of the k-means algorithm for cluster analysis. The approach is based on a collective communication operation for dynamic groups of processes and exploits non-uniform data distributions. Non-uniform data distributions can be either found in real-world distributed applications or induced by means of multidimensional binary search trees. The analysis of the proposed dynamic group communication protocol has shown that it does not introduce significant communication overhead. The parallel clustering algorithm has also been extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.
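
For orientation, the global communication pattern that the abstract aims to remove can be sketched as follows. This is a minimal single-process illustration of the conventional parallel k-means step, in which every partition contributes per-cluster sums and counts to one global reduction. It is not the dynamic group protocol described above, and the function names (`local_stats`, `kmeans_step`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])),
    )


def local_stats(points: Sequence[Point], centroids: Sequence[Point]):
    """Per-cluster coordinate sums and counts for one partition (one simulated process)."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        j = nearest(p, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def kmeans_step(partitions: Sequence[Sequence[Point]], centroids: Sequence[Point]) -> List[Point]:
    """One k-means iteration; the loop over partitions stands in for a global all-reduce."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for part in partitions:
        sums, counts = local_stats(part, centroids)
        for j in range(len(centroids)):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    new_centroids = []
    for j, old in enumerate(centroids):
        if total_counts[j] == 0:
            new_centroids.append(tuple(old))  # keep an empty cluster's centroid unchanged
        else:
            new_centroids.append(tuple(s / total_counts[j] for s in total_sums[j]))
    return new_centroids
```

In this baseline every process takes part in the reduction for every cluster at every iteration. When the data distribution is non-uniform, many partitions contribute nothing to most clusters, which is the redundancy a group-based collective operation could exploit.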

Relevance: 40.00%

Abstract:

Owing to continuous advances in the computational power of handheld devices like smartphones and tablet computers, it has become possible to perform Big Data operations including modern data mining processes onboard these small devices. A decade of research has proved the feasibility of what has been termed Mobile Data Mining, with a focus on one mobile device running data mining processes. However, it was not until 2010 that the authors of this book initiated the Pocket Data Mining (PDM) project, exploiting the seamless communication among handheld devices performing data analysis tasks that were infeasible until recently. PDM is the process of collaboratively extracting knowledge from distributed data streams in a mobile computing environment. This book provides the reader with an in-depth treatment of this emerging area of research. Details of the techniques used and thorough experimental studies are given. More importantly, and exclusive to this book, the authors provide a detailed practical guide to the deployment of PDM in the mobile environment. An important extension to the basic implementation of PDM dealing with concept drift is also reported. In the era of Big Data, potential applications of paramount importance offered by PDM in a variety of domains including security, business and telemedicine are discussed.

Relevance: 40.00%

Abstract:

Guest Editorial

Relevance: 40.00%

Abstract:

Social networks have gained remarkable attention in the last decade. Accessing social network sites such as Twitter, Facebook, LinkedIn and Google+ through the internet and Web 2.0 technologies has become more affordable. People are becoming more interested in, and reliant on, social networks for information, news and the opinions of other users on diverse subject matters. The heavy reliance on social network sites causes them to generate massive data characterised by three computational issues, namely size, noise and dynamism. These issues often make social network data very complex to analyse manually, resulting in the pertinent use of computational means of analysing them. Data mining provides a wide range of techniques for detecting useful knowledge, such as trends, patterns and rules, from massive datasets [44]. Data mining techniques are used for information retrieval, statistical modelling and machine learning. These techniques employ data pre-processing, data analysis, and data interpretation processes in the course of data analysis. This survey discusses different data mining techniques used in mining diverse aspects of social networks over the decades, from the historical techniques to the up-to-date models, including our novel technique named TRCM. All the techniques covered in this survey are listed in Table 1, including the tools employed as well as the names of their authors.

Relevance: 40.00%

Abstract:

The General Election for the 56th United Kingdom Parliament was held on 7 May 2015. Tweets related to UK politics, not only those with the specific hashtag "#GE2015", were collected in the period between March 1 and May 31, 2015. The resulting dataset contains over 28 million tweets for a total of 118 GB in uncompressed format or 15 GB in compressed format. This study describes the method that was used to collect the tweets, presents some analysis, including a political sentiment index, and outlines interesting research directions on Big Social Data based on Twitter microblogging.

Relevance: 30.00%

Abstract:

We present a new, process-based model of soil and stream water dissolved organic carbon (DOC): the Integrated Catchments Model for Carbon (INCA-C). INCA-C is the first model of DOC cycling to explicitly include effects of different land cover types, hydrological flow paths, in-soil carbon biogeochemistry, and surface water processes on in-stream DOC concentrations. It can be calibrated using only routinely available monitoring data. INCA-C simulates daily DOC concentrations over a period of years to decades. Sources, sinks, and transformation of solid and dissolved organic carbon in peat and forest soils, wetlands, and streams as well as organic carbon mineralization in stream waters are modeled. INCA-C is designed to be applied to natural and seminatural forested and peat-dominated catchments in boreal and temperate regions. Simulations at two forested catchments showed that seasonal and interannual patterns of DOC concentration could be modeled using climate-related parameters alone. A sensitivity analysis showed that model predictions were dependent on the mass of organic carbon in the soil and that in-soil process rates were dependent on soil moisture status. Sensitive rate coefficients in the model included those for organic carbon sorption and desorption and DOC mineralization in the soil. The model was also sensitive to the amount of litter fall. Our results show the importance of climate variability in controlling surface water DOC concentrations and suggest the need for further research on the mechanisms controlling production and consumption of DOC in soils.

Relevance: 30.00%

Abstract:

A regional overview of the water quality and ecology of the River Lee catchment is presented. Specifically, data describing the chemical, microbiological and macrobiological water quality and fisheries communities have been analysed, based on a division into river, sewage treatment works, fish-farm, lake and industrial samples. Nutrient enrichment and the highest concentrations of metals and micro-organics were found in the urbanised, lower reaches of the Lee and in the Lee Navigation. Average annual concentrations of metals were generally within environmental quality standards although, on many occasions, concentrations of cadmium, copper, lead, mercury and zinc were in excess of the standards. Various organic substances (used as herbicides, fungicides, insecticides, chlorination by-products and industrial solvents) were widely detected in the Lee system. Concentrations of ten micro-organic substances were observed in excess of their environmental quality standards, though not in terms of annual averages. Sewage treatment works were the principal point source input of nutrients, metals and micro-organic determinands to the catchment. Diffuse nitrogen sources contributed approximately 60% and 27% of the in-stream load in the upper and lower Lee respectively, whereas approximately 60% and 20% of the in-stream phosphorus load was derived from diffuse sources in the upper and lower Lee. For metals, the most significant source was the urban runoff from North London. In reaches less affected by effluent discharges, diffuse runoff from urban and agricultural areas dominated trends. High microbiological content, observed in the River Lee particularly in urbanised reaches, was far in excess of the EC Bathing Water Directive standards.
Water quality issues and degraded habitat in the lower reaches of the Lee have led to impoverished aquatic fauna but, within the mid-catchment reaches and upper agricultural tributaries, less nutrient enrichment and channel alteration has permitted more diverse aquatic fauna.

Relevance: 30.00%

Abstract:

The beds of active ice streams in Greenland and Antarctica are largely inaccessible, hindering a full understanding of the processes that initiate, sustain and inhibit fast ice flow in ice sheets. Detailed mapping of the glacial geomorphology of palaeo-ice stream tracks is, therefore, a valuable tool for exploring the basal processes that control their behaviour. In this paper we present a map that shows detailed glacial geomorphology from a part of the Dubawnt Lake Palaeo-Ice Stream bed on the north-western Canadian Shield (Northwest Territories), which operated at the end of the last glacial cycle. The map (centred on 63°55′42″N, 102°29′11″W, approximate scale 1:90,000) was compiled from digital Landsat Enhanced Thematic Mapper Plus satellite imagery and digital and hard-copy stereo-aerial photographs. The ice stream bed is dominated by parallel mega-scale glacial lineations (MSGL), whose lengths exceed several kilometres, but the map also reveals that they have, in places, been superimposed with transverse ridges known as ribbed moraines. The ribbed moraines lie on top of the MSGL and appear to have segmented the individual lineaments. This indicates that formation of the ribbed moraines post-dates the formation of the MSGL. The presence of ribbed moraine in the onset zone of another palaeo-ice stream has been linked to oscillations between cold and warm-based ice and/or a patchwork of cold-based areas which led to acceleration and deceleration of ice velocity. Our hypothesis is that the ribbed moraines on the Dubawnt Lake Ice Stream bed are a manifestation of the process that led to ice stream shut-down and may be associated with the process of basal freeze-on. The precise formation of ribbed moraines, however, remains open to debate and field observation of their structure will provide valuable data for formal testing of models of their formation.

Relevance: 30.00%

Abstract:

Ascertaining the location of palaeo-ice streams is crucial in order to produce accurate reconstructions of palaeo-ice sheets and examine interactions with the ocean-climate system. This paper reports evidence for a major ice stream in Amundsen Gulf, Canadian Arctic Archipelago. Mapping from satellite imagery (Landsat ETM+) and digital elevation models, including bathymetric data, is used to reconstruct flow-patterns on southwestern Victoria Island and the adjacent mainland (Nunavut and Northwest Territories). Several flow-sets indicative of ice streaming are found feeding into the marine trough and cross-cutting relationships between these flow-sets (and utilising previously published radiocarbon dates) reveal several phases of ice stream activity centred in Amundsen Gulf and Dolphin and Union Strait. A large erosional footprint on the continental shelf indicates that the ice stream (ca. 1000 km long and ca. 150 km wide) filled Amundsen Gulf, probably at the Last Glacial Maximum. Subsequent to this, the ice stream reorganised as the margin retreated back along the marine trough, eventually splitting into two separate low-gradient lobes in Prince Albert Sound and Dolphin and Union Strait. The location of this major ice stream holds important implications for ice sheet-ocean interactions and specifically, the development of Arctic Ocean ice shelves and the delivery of icebergs into the western Arctic Ocean during the late Pleistocene. Copyright (C) 2006 John Wiley & Sons, Ltd.

Relevance: 30.00%

Abstract:

In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids.
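
The general idea of receiver-initiated load balancing, in which an idle worker asks a peer for work instead of a busy worker pushing it out, can be illustrated with a toy single-threaded simulation. This is only a generic sketch under simple assumptions (opaque unit-cost tasks, random choice of donor, donor gives away half of its queue). The abstract does not specify the authors' algorithm, and the name `simulate` is hypothetical.

```python
import random
from collections import deque
from typing import List, Sequence


def simulate(workloads: Sequence[Sequence[int]], seed: int = 0) -> List[int]:
    """Simulate receiver-initiated balancing; return the number of tasks each worker processed."""
    rng = random.Random(seed)
    queues = [deque(w) for w in workloads]
    done = [0] * len(queues)
    while any(queues):
        for i, q in enumerate(queues):
            if q:
                q.pop()  # process one local task, newest first
                done[i] += 1
            else:
                # Idle worker: the receiver initiates a request to a random peer.
                donor = rng.randrange(len(queues))
                victim = queues[donor]
                if donor != i and len(victim) > 1:
                    # The donor hands over half of its oldest pending tasks.
                    for _ in range(len(victim) // 2):
                        q.append(victim.popleft())
    return done
```

Calling `simulate([list(range(100)), [], [], []])` starts with all work on one worker, the extreme case of an unpredictable workload. Since no estimate of task size is needed, this style of balancing suits irregular search trees.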

Relevance: 30.00%

Abstract:

Recently, two approaches have been introduced that distribute the molecular fragment mining problem. The first approach applies a master/worker topology; the second approach, a completely distributed peer-to-peer system, solves the scalability problem caused by the bottleneck at the master node. However, in many real-world scenarios the participating computing nodes cannot communicate directly due to administrative policies such as security restrictions. Thus, potential computing power is not accessible to accelerate the mining run. To solve this shortcoming, this work introduces a hierarchical topology of computing resources, which distributes the management over several levels and adapts to the natural structure of those multi-domain architectures. The most important aspect is the load balancing scheme, which has been designed and optimized for the hierarchical structure. The approach allows dynamic aggregation of heterogeneous computing resources and is applied to wide area network scenarios.

Relevance: 30.00%

Abstract:

In real world applications sequential algorithms of data mining and data exploration are often unsuitable for datasets with enormous size, high-dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context there is the necessity to develop high performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well known National Cancer Institute’s HIV-screening dataset. We present experimental results on a small-scale computing environment.

Relevance: 30.00%

Abstract:

Structured data represented in the form of graphs arises in several fields of science, and the growing amount of available data makes distributed graph mining techniques particularly relevant. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening dataset, where the approach attains close-to-linear speedup in a network of workstations.

Relevance: 30.00%

Abstract:

Frequent pattern discovery in structured data is receiving increasing attention in many application areas of science. However, the computational complexity and the large amount of data to be explored often make sequential algorithms unsuitable. In this context, high performance distributed computing becomes a very interesting and promising approach. In this paper we present a parallel formulation of the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The application is characterized by a highly irregular tree-structured computation. No estimation is available for task workloads, which show a power-law distribution over a wide range. The proposed approach allows dynamic resource aggregation and provides fault and latency tolerance. These features make the distributed application suitable for multi-domain heterogeneous environments, such as computational Grids. The distributed application has been evaluated on the well-known National Cancer Institute’s HIV-screening dataset.

Relevance: 30.00%

Abstract:

This paper presents a simple Bayesian approach to sample size determination in clinical trials. It is required that the trial should be large enough to ensure that the data collected will provide convincing evidence either that an experimental treatment is better than a control or that it fails to improve upon control by some clinically relevant difference. The method resembles standard frequentist formulations of the problem, and indeed in certain circumstances involving 'non-informative' prior information it leads to identical answers. In particular, unlike many Bayesian approaches to sample size determination, use is made of an alternative hypothesis that an experimental treatment is better than a control treatment by some specified magnitude. The approach is introduced in the context of testing whether a single stream of binary observations is consistent with a given success rate p(0). Next the case of comparing two independent streams of normally distributed responses is considered, first under the assumption that their common variance is known and then for unknown variance. Finally, the more general situation in which a large sample is to be collected and analysed according to the asymptotic properties of the score statistic is explored. Copyright (C) 2007 John Wiley & Sons, Ltd.
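
As a point of reference for the known-variance case, the standard frequentist calculation that the abstract says the method resembles can be sketched as below. This is the textbook formula n = 2σ²(z₁₋α + z₁₋β)²/δ² per arm for a one-sided test, not the paper's Bayesian derivation. The function name and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(delta: float, sigma: float, alpha: float = 0.025, power: float = 0.9) -> int:
    """Per-arm sample size for comparing two normal means with known common variance.

    delta: clinically relevant difference to detect
    sigma: known common standard deviation
    alpha: one-sided type I error rate
    power: required probability of detecting a true difference of size delta
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # critical value of the one-sided test
    z_beta = NormalDist().inv_cdf(power)         # quantile corresponding to the required power
    n = 2.0 * (sigma ** 2) * (z_alpha + z_beta) ** 2 / (delta ** 2)
    return math.ceil(n)  # round up to a whole number of observations
```

For example, `sample_size_per_arm(delta=0.5, sigma=1.0)` gives 85 observations per arm. Halving the difference to be detected roughly quadruples the required sample size, because n scales with 1/δ².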