22 resultados para Elaborazione d’immagini, Microscopia, Istopatologia, Classificazione, K-means

em CentAUR: Central Archive University of Reading - UK


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Clustering is defined as the grouping of similar items in a set, and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of its size and dimensionality, it is necessary to have efficient clustering methods. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K-clusters with associated centres of mass, and uses a squared error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree) to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques allow to reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

K-Means is a popular clustering algorithm which adopts an iterative refinement procedure to determine data partitions and to compute their associated centres of mass, called centroids. The straightforward implementation of the algorithm is often referred to as `brute force' since it computes a proximity measure from each data point to each centroid at every iteration of the K-Means process. Efficient implementations of the K-Means algorithm have been predominantly based on multi-dimensional binary search trees (KD-Trees). A combination of an efficient data structure and geometrical constraints allow to reduce the number of distance computations required at each iteration. In this work we present a general space partitioning approach for improving the efficiency and the scalability of the K-Means algorithm. We propose to adopt approximate hierarchical clustering methods to generate binary space partitioning trees in contrast to KD-Trees. In the experimental analysis, we have tested the performance of the proposed Binary Space Partitioning K-Means (BSP-KM) when a divisive clustering algorithm is used. We have carried out extensive experimental tests to compare the proposed approach to the one based on KD-Trees (KD-KM) in a wide range of the parameters space. BSP-KM is more scalable than KDKM, while keeping the deterministic nature of the `brute force' algorithm. In particular, the proposed space partitioning approach has shown to overcome the well-known limitation of KD-Trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which KD-Trees have been used.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Radial basis function networks can be trained quickly using linear optimisation once centres and other associated parameters have been initialised. The authors propose a small adjustment to a well accepted initialisation algorithm which improves the network accuracy over a range of problems. The algorithm is described and results are presented.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. This work proposes a fully decentralised algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against the state of the art distributed K-Means algorithms based on sampling methods. The experimental analysis confirms that the proposed algorithm is a practical and accurate distributed K-Means implementation for networked systems of very large and extreme scale.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. The lack of scalable and fault tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (EpidemicK-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against the state of the art sampling methods and shows that the proposed method overcomes the limitations of the sampling-based approaches for skewed clusters distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The k-means cluster technique is used to examine 43 yr of daily winter Northern Hemisphere (NH) polar stratospheric data from the 40-yr ECMWF Re-Analysis (ERA-40). The results show that the NH winter stratosphere exists in two natural well-separated states. In total, 10% of the analyzed days exhibit a warm disturbed state that is typical of sudden stratospheric warming events. The remaining 90% of the days are in a state typical of a colder undisturbed vortex. These states are determined objectively, with no preconceived notion of the groups. The two stratospheric states are described and compared with alternative indicators of the polar winter flow, such as the northern annular mode. It is shown that the zonally averaged zonal winds in the polar upper stratosphere at 7 hPa can best distinguish between the two states, using a threshold value of 4 m s−1, which is remarkably close to the standard WMO criterion for major warming events. The analysis also determines that there are no further divisions within the warm state, indicating that there is no well-designated threshold between major and minor warmings, nor between split and displaced vortex events. These different manifestations are simply members of a continuum of warming events.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Radial basis functions can be combined into a network structure that has several advantages over conventional neural network solutions. However, to operate effectively the number and positions of the basis function centres must be carefully selected. Although no rigorous algorithm exists for this purpose, several heuristic methods have been suggested. In this paper a new method is proposed in which radial basis function centres are selected by the mean-tracking clustering algorithm. The mean-tracking algorithm is compared with k means clustering and it is shown that it achieves significantly better results in terms of radial basis function performance. As well as being computationally simpler, the mean-tracking algorithm in general selects better centre positions, thus providing the radial basis functions with better modelling accuracy

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A fast backward elimination algorithm is introduced based on a QR decomposition and Givens transformations to prune radial-basis-function networks. Nodes are sequentially removed using an increment of error variance criterion. The procedure is terminated by using a prediction risk criterion so as to obtain a model structure with good generalisation properties. The algorithm can be used to postprocess radial basis centres selected using a k-means routine and, in this mode, it provides a hybrid supervised centre selection approach.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper deals with the selection of centres for radial basis function (RBF) networks. A novel mean-tracking clustering algorithm is described as a way in which centers can be chosen based on a batch of collected data. A direct comparison is made between the mean-tracking algorithm and k-means clustering and it is shown how mean-tracking clustering is significantly better in terms of achieving an RBF network which performs accurate function modelling.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Global communicationrequirements andloadimbalanceof someparalleldataminingalgorithms arethe major obstacles to exploitthe computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication costin parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operationwhichhinders thescalabilityoftheapproach.Thisworkstudiesadifferentparallelformulation of the algorithm where the requirement of global communication is removed, while maintaining the same deterministic nature ofthe centralised algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real-world distributed applications or can be induced by means ofmulti-dimensional binary searchtrees. The approachcanalso be extended to accommodate an approximation error which allows a further reduction ofthe communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing element

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A statistical–dynamical regionalization approach is developed to assess possible changes in wind storm impacts. The method is applied to North Rhine-Westphalia (Western Germany) using the FOOT3DK mesoscale model for dynamical downscaling and ECHAM5/OM1 global circulation model climate projections. The method first classifies typical weather developments within the reanalysis period using K-means cluster algorithm. Most historical wind storms are associated with four weather developments (primary storm-clusters). Mesoscale simulations are performed for representative elements for all clusters to derive regional wind climatology. Additionally, 28 historical storms affecting Western Germany are simulated. Empirical functions are estimated to relate wind gust fields and insured losses. Transient ECHAM5/OM1 simulations show an enhanced frequency of primary storm-clusters and storms for 2060–2100 compared to 1960–2000. Accordingly, wind gusts increase over Western Germany, reaching locally +5% for 98th wind gust percentiles (A2-scenario). Consequently, storm losses are expected to increase substantially (+8% for A1B-scenario, +19% for A2-scenario). Regional patterns show larger changes over north-eastern parts of North Rhine-Westphalia than for western parts. For storms with return periods above 20 yr, loss expectations for Germany may increase by a factor of 2. These results document the method's functionality to assess future changes in loss potentials in regional terms.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Boreal winter wind storm situations over Central Europe are investigated by means of an objective cluster analysis. Surface data from the NCEP-Reanalysis and ECHAM4/OPYC3-climate change GHG simulation (IS92a) are considered. To achieve an optimum separation of clusters of extreme storm conditions, 55 clusters of weather patterns are differentiated. To reduce the computational effort, a PCA is initially performed, leading to a data reduction of about 98 %. The clustering itself was computed on 3-day periods constructed with the first six PCs using "k-means" clustering algorithm. The applied method enables an evaluation of the time evolution of the synoptic developments. The climate change signal is constructed by a projection of the GCM simulation on the EOFs attained from the NCEP-Reanalysis. Consequently, the same clusters are obtained and frequency distributions can be compared. For Central Europe, four primary storm clusters are identified. These clusters feature almost 72 % of the historical extreme storms events and add only to 5 % of the total relative frequency. Moreover, they show a statistically significant signature in the associated wind fields over Europe. An increased frequency of Central European storm clusters is detected with enhanced GHG conditions, associated with an enhancement of the pressure gradient over Central Europe. Consequently, more intense wind events over Central Europe are expected. The presented algorithm will be highly valuable for the analysis of huge data amounts as is required for e.g. multi-model ensemble analysis, particularly because of the enormous data reduction.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Exascale systems are the next frontier in high-performance computing and are expected to deliver a performance of the order of 10^18 operations per second using massive multicore processors. Very large- and extreme-scale parallel systems pose critical algorithmic challenges, especially related to concurrency, locality and the need to avoid global communication patterns. This work investigates a novel protocol for dynamic group communication that can be used to remove the global communication requirement and to reduce the communication cost in parallel formulations of iterative data mining algorithms. The protocol is used to provide a communication-efficient parallel formulation of the k-means algorithm for cluster analysis. The approach is based on a collective communication operation for dynamic groups of processes and exploits non-uniform data distributions. Non-uniform data distributions can be either found in real-world distributed applications or induced by means of multidimensional binary search trees. The analysis of the proposed dynamic group communication protocol has shown that it does not introduce significant communication overhead. The parallel clustering algorithm has also been extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploit the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in iterative parallel data mining algorithms. In particular, the analysis focuses on one of the most influential and popular data mining methods, the k-means algorithm for cluster analysis. The straightforward parallel formulation of the k-means algorithm requires a global reduction operation at each iteration step, which hinders its scalability. This work studies a different parallel formulation of the algorithm where the requirement of global communication can be relaxed while still providing the exact solution of the centralised k-means algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real world distributed applications or can be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error which allows a further reduction of the communication costs.