925 results for K-Means Cluster


Relevance: 100.00%

Abstract:

This study subdivides Potter Cove, King George Island, Antarctica, into seafloor regions using multivariate statistical methods. These regions are categories used for comparing, contrasting and quantifying biogeochemical processes and biodiversity between ocean regions geographically but also regions under development within the scope of global change. The division obtained is characterized by the dominating components and interpreted in terms of ruling environmental conditions. The analysis includes in total 42 different environmental variables, interpolated based on samples taken during the austral summer seasons 2010/2011 and 2011/2012. The statistical errors of several interpolation methods (e.g. IDW, Indicator, Ordinary and Co-Kriging) with changing settings have been compared and the most reasonable method has been applied. The multivariate mathematical procedures used are regionalized classification via k-means cluster analysis, canonical-correlation analysis and multidimensional scaling. Canonical-correlation analysis identifies the influencing factors in the different parts of the cove. Several methods for the identification of the optimum number of clusters have been tested and 4, 7, 10 as well as 12 were identified as reasonable numbers for clustering Potter Cove. Especially the results of 10 and 12 clusters identify marine-influenced regions which can be clearly separated from those determined by the geological catchment area and the ones dominated by river discharge.

Relevance: 100.00%

Abstract:

Wireless Sensor Networks (WSNs) are a special kind of ad hoc network usually deployed in a monitoring field in order to detect some physical phenomenon. Due to the low dependability of individual nodes, small radio coverage and large areas to be monitored, the organization of nodes in small clusters is generally used. Moreover, a large number of WSN nodes is usually deployed in the monitoring area to increase WSN dependability. Therefore, the best cluster head positioning is a desirable characteristic in a WSN. In this paper, we propose a hybrid clustering algorithm based on community detection in complex networks and the traditional K-means clustering technique: the QK-Means algorithm. Simulation results show that QK-Means detects communities and sub-communities, so that the lost message rate is decreased and WSN coverage is increased. © 2012 IEEE.

Relevance: 100.00%

Abstract:

Partitional clustering algorithms, which partition the dataset into a pre-defined number of clusters, can be broadly classified into two types: algorithms which explicitly take the number of clusters as input and algorithms that take the expected size of a cluster as input. In this paper, we propose a variant of the k-means algorithm and prove that it is more efficient than standard k-means algorithms. An important contribution of this paper is the establishment of a relation between the number of clusters and the size of the clusters in a dataset through the analysis of our algorithm. We also demonstrate that the integration of this algorithm as a pre-processing step in classification algorithms reduces their running-time complexity.

Relevance: 100.00%

Abstract:

The k-means algorithm is an extremely popular technique for clustering data. One of the major limitations of the k-means is that the time to cluster a given dataset D is linear in the number of clusters, k. In this paper, we employ height balanced trees to address this issue. Specifically, we make two major contributions, (a) we propose an algorithm, RACK (acronym for RApid Clustering using k-means), which takes time favorably comparable with the fastest known existing techniques, and (b) we prove an expected bound on the quality of clustering achieved using RACK. Our experimental results on large datasets strongly suggest that RACK is competitive with the k-means algorithm in terms of quality of clustering, while taking significantly less time.
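The linear dependence on k mentioned in this abstract comes from the assignment step of the standard (Lloyd) iteration, where every point is compared with every centroid. The following is a minimal sketch of that baseline only, not of RACK, whose details the abstract does not give; the function names and the `evaluations` counter are illustrative assumptions.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd_kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd iteration; returns (centroids, labels, distance_evaluations)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct data points (a simple, common choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    evaluations = 0
    for _ in range(iterations):
        # Assignment step: each point is compared with all k centroids,
        # so one pass costs n * k distance evaluations.
        for i, p in enumerate(points):
            dists = [squared_distance(p, c) for c in centroids]
            evaluations += k
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(x) / len(members) for x in zip(*members)]
    return centroids, labels, evaluations
```

With n points and a fixed number of iterations, the counter grows as n × k × iterations, which is the cost that tree-based variants try to reduce.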

Relevance: 100.00%

Abstract:

Tianjin University of Technology

Relevance: 100.00%

Abstract:

One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy.

Relevance: 100.00%

Abstract:

The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. This work proposes a fully decentralised algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against the state of the art distributed K-Means algorithms based on sampling methods. The experimental analysis confirms that the proposed algorithm is a practical and accurate distributed K-Means implementation for networked systems of very large and extreme scale.

Relevance: 100.00%

Abstract:

The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. The lack of scalable and fault tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against the state-of-the-art sampling methods and shows that the proposed method overcomes the limitations of the sampling-based approaches for skewed cluster distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.

Relevance: 100.00%

Abstract:

One of the top ten most influential data mining algorithms, k-means, is known for being simple and scalable. However, it is sensitive to initialization of prototypes and requires that the number of clusters be specified in advance. This paper shows that evolutionary techniques conceived to guide the application of k-means can be more computationally efficient than systematic (i.e., repetitive) approaches that try to get around the above-mentioned drawbacks by repeatedly running the algorithm from different configurations for the number of clusters and initial positions of prototypes. To do so, a modified version of a (k-means based) fast evolutionary algorithm for clustering is employed. Theoretical complexity analyses for the systematic and evolutionary algorithms of interest are provided. Computational experiments and statistical analyses of the results are presented for artificial and text mining data sets. © 2010 Elsevier B.V. All rights reserved.
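The "systematic" baseline described in this abstract, rerunning k-means for several values of k and several initial prototype positions, can be sketched as follows. This is an illustrative sketch on one-dimensional data, not the paper's evolutionary algorithm; the function names, the restart count and the sample data are assumptions made for the example.

```python
import random

def kmeans_1d(values, k, rng, iterations=20):
    """Lloyd's k-means on 1-D data; returns (centroids, sse)."""
    centroids = rng.sample(values, k)  # random initial prototypes
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    sse = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, sse

def systematic_search(values, k_values, restarts=10, seed=0):
    """For every k, rerun k-means from several random starts and keep the lowest SSE."""
    rng = random.Random(seed)
    best = {}
    for k in k_values:
        runs = [kmeans_1d(values, k, rng) for _ in range(restarts)]
        best[k] = min(runs, key=lambda run: run[1])
    return best

# Hypothetical sample data with three visible groups.
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
best = systematic_search(data, k_values=[1, 2, 3, 4])
for k, (centroids, sse) in best.items():
    print(k, sorted(centroids), round(sse, 3))
```

The total cost is the number of candidate k values times the number of restarts times the cost of one k-means run, which is the overhead the evolutionary approach aims to reduce.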

Relevance: 100.00%

Abstract:

The k-means algorithm is a partitional clustering method. Over 60 years old, it has been successfully used for a variety of problems. The popularity of k-means is in large part a consequence of its simplicity and efficiency. In this paper we are inspired by these appealing properties of k-means in the development of a clustering algorithm which accepts the notion of "positively" and "negatively" labelled data. The goal is to discover the cluster structure of both positive and negative data in a manner which allows for the discrimination between the two sets. The usefulness of this idea is demonstrated practically on the problem of face recognition, where the task of learning the scope of a person's appearance should be done in a manner which allows this face to be differentiated from others.

Relevance: 100.00%

Abstract:

In order to alleviate traffic congestion and reduce the complexity of traffic control and management, it is necessary to divide traffic into sub-areas in a way that is effective for traffic planning. Some researchers have applied the K-Means algorithm to divide traffic sub-areas based on taxi trajectories. However, traditional K-Means algorithms face difficulties in processing large-scale Global Positioning System (GPS) trajectories of taxicabs because of restrictions on memory, I/O and computing performance. This paper proposes a Parallel Traffic Sub-Areas Division (PTSD) method, which consists of two stages and is based on the Parallel K-Means (PKM) algorithm. In the first stage, we develop a process to cluster traffic sub-areas based on the PKM algorithm. In the second stage, we identify the boundaries of the traffic sub-areas on the basis of the clustering result. Using this method, we divide the traffic sub-areas of Beijing based on real-world GPS trajectories of taxicabs. The experiment and discussion show that the method is effective in dividing traffic sub-areas.

Relevance: 100.00%

Abstract:

The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means, with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.

Relevance: 100.00%

Abstract:

Purpose: Web search engines are frequently used by people to locate information on the Internet. However, not all queries have an informational goal. Instead of information, some people may be looking for specific web sites or may wish to conduct transactions with web services. This paper aims to focus on automatically classifying the different user intents behind web queries. Design/methodology/approach: For the research reported in this paper, 130,000 web search engine queries are categorized as informational, navigational, or transactional using a k-means clustering approach based on a variety of query traits. Findings: The research findings show that more than 75 percent of web queries (clustered into eight classifications) are informational in nature, with about 12 percent each for navigational and transactional. Results also show that web queries fall into eight clusters, six primarily informational, and one each of primarily transactional and navigational. Research limitations/implications: This study provides an important contribution to web search literature because it provides information about the goals of searchers and a method for automatically classifying the intents of the user queries. Automatic classification of user intent can lead to improved web search engines by tailoring results to specific user needs. Practical implications: The paper discusses how web search engines can use automatically classified user queries to provide more targeted and relevant results in web searching by implementing a real time classification method as presented in this research. Originality/value: This research investigates a new application of a method for automatically classifying the intent of user queries. There has been limited research to date on automatically classifying the user intent of web queries, even though the pay-off for web search engines can be quite beneficial. © Emerald Group Publishing Limited.

Relevance: 100.00%

Abstract:

The K-means algorithm for clustering is very much dependent on the initial seed values. We use a genetic algorithm to find a near-optimal partitioning of the given data set by selecting proper initial seed values in the K-means algorithm. Results obtained are very encouraging and in most of the cases, on data sets having well separated clusters, the proposed scheme reached a global minimum.

Relevance: 100.00%

Abstract:

Non-negative matrix factorization (NMF) [5] is a well-known tool for unsupervised machine learning. It can be viewed as a generalization of K-means clustering, Expectation Maximization based clustering and aspect modeling by Probabilistic Latent Semantic Analysis (PLSA). Specifically, PLSA is related to NMF with a KL-divergence objective function. Further, it is shown that K-means clustering is a special case of NMF with a matrix L2 norm based error function. In this paper our objective is to analyze the relation between K-means clustering and PLSA by examining the KL-divergence function and the matrix L2 norm based error function.
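The abstract does not state the two objective functions it compares. One common way to write them is sketched below; the notation is an assumption made for illustration: $X \in \mathbb{R}^{d \times n}$ holds the data points as columns, $W \in \mathbb{R}^{d \times k}$ and $H \in \mathbb{R}^{k \times n}$ are the factors.

```latex
% Squared-error (matrix L2 / Frobenius norm) objective.
% K-means corresponds to restricting H to cluster indicators:
% each column of H has exactly one entry equal to 1, so the
% columns of W play the role of the k centroids.
\min_{W,\,H}\; \lVert X - WH \rVert_F^2
  \;=\; \sum_{i=1}^{n} \lVert x_i - W h_i \rVert_2^2,
\qquad H \in \{0,1\}^{k \times n},\quad \sum_{j=1}^{k} H_{ji} = 1 .

% Generalized KL-divergence objective, the form usually linked to PLSA,
% with X, W, H entrywise non-negative:
D(X \,\Vert\, WH) \;=\; \sum_{i,j}
  \left( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij} \right).
```

Under the indicator constraint the first objective reduces to the usual k-means sum of squared distances to the assigned centroid. Standard NMF additionally requires $W \ge 0$ and relaxes $H$ to non-negative real values, so the k-means correspondence is in terms of the shared error function rather than identical constraints.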