124 resultados para binary search
em CentAUR: Central Archive University of Reading - UK
Resumo:
Clustering is defined as the grouping of similar items in a set, and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of its size and dimensionality, it is necessary to have efficient clustering methods. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K-clusters with associated centres of mass, and uses a squared error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree) to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.
Resumo:
One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques allow to reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy.
Resumo:
K-Means is a popular clustering algorithm which adopts an iterative refinement procedure to determine data partitions and to compute their associated centres of mass, called centroids. The straightforward implementation of the algorithm is often referred to as `brute force' since it computes a proximity measure from each data point to each centroid at every iteration of the K-Means process. Efficient implementations of the K-Means algorithm have been predominantly based on multi-dimensional binary search trees (KD-Trees). A combination of an efficient data structure and geometrical constraints allow to reduce the number of distance computations required at each iteration. In this work we present a general space partitioning approach for improving the efficiency and the scalability of the K-Means algorithm. We propose to adopt approximate hierarchical clustering methods to generate binary space partitioning trees in contrast to KD-Trees. In the experimental analysis, we have tested the performance of the proposed Binary Space Partitioning K-Means (BSP-KM) when a divisive clustering algorithm is used. We have carried out extensive experimental tests to compare the proposed approach to the one based on KD-Trees (KD-KM) in a wide range of the parameters space. BSP-KM is more scalable than KDKM, while keeping the deterministic nature of the `brute force' algorithm. In particular, the proposed space partitioning approach has shown to overcome the well-known limitation of KD-Trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which KD-Trees have been used.
Resumo:
Exascale systems are the next frontier in high-performance computing and are expected to deliver a performance of the order of 10^18 operations per second using massive multicore processors. Very large- and extreme-scale parallel systems pose critical algorithmic challenges, especially related to concurrency, locality and the need to avoid global communication patterns. This work investigates a novel protocol for dynamic group communication that can be used to remove the global communication requirement and to reduce the communication cost in parallel formulations of iterative data mining algorithms. The protocol is used to provide a communication-efficient parallel formulation of the k-means algorithm for cluster analysis. The approach is based on a collective communication operation for dynamic groups of processes and exploits non-uniform data distributions. Non-uniform data distributions can be either found in real-world distributed applications or induced by means of multidimensional binary search trees. The analysis of the proposed dynamic group communication protocol has shown that it does not introduce significant communication overhead. The parallel clustering algorithm has also been extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.
Resumo:
Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploit the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in iterative parallel data mining algorithms. In particular, the analysis focuses on one of the most influential and popular data mining methods, the k-means algorithm for cluster analysis. The straightforward parallel formulation of the k-means algorithm requires a global reduction operation at each iteration step, which hinders its scalability. This work studies a different parallel formulation of the algorithm where the requirement of global communication can be relaxed while still providing the exact solution of the centralised k-means algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real world distributed applications or can be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error which allows a further reduction of the communication costs.
Resumo:
Considerable specification choice confronts countable adoption investigations and there is need to measure, formally, the evidence in favor of competing formulations. This article presents alternative countable adoption specifications—hitherto neglected in the agricultural-economics literature—and assesses formally their usefulness to practitioners. Reference to the left side of de Finetti's (1937) famous representation theorem motivates Bayesian unification of agricultural adoption studies and facilitates comparisons with conventional binary-choice specifications. Such comparisons have not previously been considered. The various formulations and the specific techniques are highlighted in an application to crossbred cow adoption in Sri Lanka's small-holder dairy sector.
Resumo:
Holocene tidal palaoechannels, Severn Estuary Levels, UK: a search for granulometric and foraminiferal criteria. Proceedings of the Geologists' Association, 117, 329-344. Grain-size characteristics (by laser granulometry) and foraminiferal assemblages have been established for silts accumulated in five, dissimilar tidal palaeochannels of mid or late Holocene age in the Severn Estuary Levels, representative of muddy tidal systems. For purposes of general comparison, similar data were obtained from a representative active tidal inlet in the area, but all of these channels have been subject to human interference and are not relied upon as a model for environmental interpretation. Although the palaeochannel deposits differ substantially in their bedding characteristics and stratigraphical relationships from the level-bedded salt-marsh platform and mudflat deposits with which they are associated, and although the channel environment is distinctive morphologically and hydraulically, no critical textural differences could be found between the channel deposits and the associated facies. Similarly, no foraminiferal assemblages distinctive of a tidal channel were encountered. Instead, the assemblages compare with those from mudflats and salt-marsh platforms. It is concluded that the sides of the subfossil channels carried some vegetation, as was observed to be the case in the modern inlet. An alternative approach is necessary if concealed palaeochannel deposits are to be recognized in muddy systems from limited numbers of subsurface samples. Although the palaeochannels afforded no characteristic textural signature, they yield transverse grain-size patterns pointing to coastal movements during their evolution. Concave-up trends suggest outward coastal building, whereas convex-up ones point to marsh-edge retreat.
Resumo:
Ecological risk assessments must increasingly consider the effects of chemical mixtures on the environment as anthropogenic pollution continues to grow in complexity. Yet testing every possible mixture combination is impractical and unfeasible; thus, there is an urgent need for models that can accurately predict mixture toxicity from single-compound data. Currently, two models are frequently used to predict mixture toxicity from single-compound data: Concentration addition and independent action (IA). The accuracy of the predictions generated by these models is currently debated and needs to be resolved before their use in risk assessments can be fully justified. The present study addresses this issue by determining whether the IA model adequately described the toxicity of binary mixtures of five pesticides and other environmental contaminants (cadmium, chlorpyrifos, diuron, nickel, and prochloraz) each with dissimilar modes of action on the reproduction of the nematode Caenorhabditis elegans. In three out of 10 cases, the IA model failed to describe mixture toxicity adequately with significant or antagonism being observed. In a further three cases, there was an indication of synergy, antagonism, and effect-level-dependent deviations, respectively, but these were not statistically significant. The extent of the significant deviations that were found varied, but all were such that the predicted percentage effect seen on reproductive output would have been wrong by 18 to 35% (i.e., the effect concentration expected to cause a 50% effect led to an 85% effect). The presence of such a high number and variety of deviations has important implications for the use of existing mixture toxicity models for risk assessments, especially where all or part of the deviation is synergistic.
Resumo:
Enhanced phytoextraction proposes the use of soil amendments to increase the heavy-metal content of above-ground harvestable plant tissues. This study compares the effect of synthetic aminopolycarboxylic acids [ethylenediamine tetraacetatic acid (EDTA), nitriloacetic acid (NTA), and diethylenetriamine pentaacetic acid (DTPA)] with a number of biodegradable, low-molecular weight, organic acids (citric acid, ascorbic acid, oxalic acid, salicylic acid, and NH4 acetate) as potential soil amendments for enhancing phytoextraction of heavy metals (Cu, Zn, Cd, Pb, and Ni) by Zea mays. The treatments in this study were applied at a dose of 2 mmol/kg(-1) 1 d before sowing. To compare possible effects between presow and postgermination treatments, a second smaller experiment was conducted in which EDTA, citric acid, and NH4 acetate were added 10 d after germination as opposed to 1 d before sowing. The soil used in this screening was a moderately contaminated topsoil derived from a dredged sediment disposal site. This site has been in an oxidized state for more than 8 years before being used in this research. The high carbonate, high organic matter, and high clay content characteristic to this type of sediment are thought to suppress heavy-metal phytoavailability. Both EDTA and DTPA resulted in increased levels of heavy metals in the above-ground biomass. However, the observed increases in uptake were not as large as reported in the literature. Neither the NTA nor organic acid treatments had any significant effect on uptake when applied prior to sowing. This was attributed to the rapid mineralization of these substances and the relatively low doses applied. The generally low extraction observed in this experiment restricts the use of phytoextraction as an effective remediation alternative under the current conditions, with regard to amendments used, applied dose (2 mmol/kg(-1) soil), application time (presow), plant species (Zea mays), and sediment (calcareous clayey soil) under study.
Resumo:
In this paper, we present a distributed computing framework for problems characterized by a highly irregular search tree, whereby no reliable workload prediction is available. The framework is based on a peer-to-peer computing environment and dynamic load balancing. The system allows for dynamic resource aggregation, does not depend on any specific meta-computing middleware and is suitable for large-scale, multi-domain, heterogeneous environments, such as computational Grids. Dynamic load balancing policies based on global statistics are known to provide optimal load balancing performance, while randomized techniques provide high scalability. The proposed method combines both advantages and adopts distributed job-pools and a randomized polling technique. The framework has been successfully adopted in a parallel search algorithm for subgraph mining and evaluated on a molecular compounds dataset. The parallel application has shown good calability and close-to linear speedup in a distributed network of workstations.
Resumo:
Empirical orthogonal functions (EOFs) are widely used in climate research to identify dominant patterns of variability and to reduce the dimensionality of climate data. EOFs, however, can be difficult to interpret. Rotated empirical orthogonal functions (REOFs) have been proposed as more physical entities with simpler patterns than EOFs. This study presents a new approach for finding climate patterns with simple structures that overcomes the problems encountered with rotation. The method achieves simplicity of the patterns by using the main properties of EOFs and REOFs simultaneously. Orthogonal patterns that maximise variance subject to a constraint that induces a form of simplicity are found. The simplified empirical orthogonal function (SEOF) patterns, being more 'local'. are constrained to have zero loadings outside the main centre of action. The method is applied to winter Northern Hemisphere (NH) monthly mean sea level pressure (SLP) reanalyses over the period 1948-2000. The 'simplified' leading patterns of variability are identified and compared to the leading patterns obtained from EOFs and REOFs. Copyright (C) 2005 Royal Meteorological Society.