96 resultados para height partition clustering
em Indian Institute of Science - Bangalore - Índia
Resumo:
A computationally efficient agglomerative clustering algorithm based on multilevel theory is presented. Here, the data set is divided randomly into a number of partitions. The samples of each such partition are clustered separately using hierarchical agglomerative clustering algorithm to form sub-clusters. These are merged at higher levels to get the final classification. This algorithm leads to the same classification as that of hierarchical agglomerative clustering algorithm when the clusters are well separated. The advantages of this algorithm are short run time and small storage requirement. It is observed that the savings, in storage space and computation time, increase nonlinearly with the sample size.
Resumo:
Partitional clustering algorithms, which partition the dataset into a pre-defined number of clusters, can be broadly classified into two types: algorithms which explicitly take the number of clusters as input and algorithms that take the expected size of a cluster as input. In this paper, we propose a variant of the k-means algorithm and prove that it is more efficient than standard k-means algorithms. An important contribution of this paper is the establishment of a relation between the number of clusters and the size of the clusters in a dataset through the analysis of our algorithm. We also demonstrate that the integration of this algorithm as a pre-processing step in classification algorithms reduces their running-time complexity.
Resumo:
The k-means algorithm is an extremely popular technique for clustering data. One of the major limitations of the k-means is that the time to cluster a given dataset D is linear in the number of clusters, k. In this paper, we employ height balanced trees to address this issue. Specifically, we make two major contributions, (a) we propose an algorithm, RACK (acronym for RApid Clustering using k-means), which takes time favorably comparable with the fastest known existing techniques, and (b) we prove an expected bound on the quality of clustering achieved using RACK. Our experimental results on large datasets strongly suggest that RACK is competitive with the k-means algorithm in terms of quality of clustering, while taking significantly less time.
Resumo:
Clustering techniques are used in regional flood frequency analysis (RFFA) to partition watersheds into natural groups or regions with similar hydrologic responses. The linear Kohonen's self‐organizing feature map (SOFM) has been applied as a clustering technique for RFFA in several recent studies. However, it is seldom possible to interpret clusters from the output of an SOFM, irrespective of its size and dimensionality. In this study, we demonstrate that SOFMs may, however, serve as a useful precursor to clustering algorithms. We present a two‐level. SOFM‐based clustering approach to form regions for FFA. In the first level, the SOFM is used to form a two‐dimensional feature map. In the second level, the output nodes of SOFM are clustered using Fuzzy c‐means algorithm to form regions. The optimal number of regions is based on fuzzy cluster validation measures. Effectiveness of the proposed approach in forming homogeneous regions for FFA is illustrated through application to data from watersheds in Indiana, USA. Results show that the performance of the proposed approach to form regions is better than that based on classical SOFM.
Resumo:
In this paper, we develop a game theoretic approach for clustering features in a learning problem. Feature clustering can serve as an important preprocessing step in many problems such as feature selection, dimensionality reduction, etc. In this approach, we view features as rational players of a coalitional game where they form coalitions (or clusters) among themselves in order to maximize their individual payoffs. We show how Nash Stable Partition (NSP), a well known concept in the coalitional game theory, provides a natural way of clustering features. Through this approach, one can obtain some desirable properties of the clusters by choosing appropriate payoff functions. For a small number of features, the NSP based clustering can be found by solving an integer linear program (ILP). However, for large number of features, the ILP based approach does not scale well and hence we propose a hierarchical approach. Interestingly, a key result that we prove on the equivalence between a k-size NSP of a coalitional game and minimum k-cut of an appropriately constructed graph comes in handy for large scale problems. In this paper, we use feature selection problem (in a classification setting) as a running example to illustrate our approach. We conduct experiments to illustrate the efficacy of our approach.
Resumo:
Clustering has been the most popular method for data exploration. Clustering is partitioning the data set into sub-partitions based on some measures say the distance measure, each partition has its own significant information. There are a number of algorithms explored for this purpose, one such algorithm is the Particle Swarm Optimization(PSO) which is a population based heuristic search technique derived from swarm intelligence. In this paper we present an improved version of the Particle Swarm Optimization where, each feature of the data set is given significance accordingly by adding some random weights, which also minimizes the distortions in the dataset if any. The performance of the above proposed algorithm is evaluated using some benchmark datasets from Machine Learning Repository. The experimental results shows that our proposed methodology performs significantly better than the previously performed experiments.
Resumo:
Homogeneous temperature regions are necessary for use in hydrometeorological studies. The regions are often delineated by analysing statistics derived from time series of maximum, minimum or mean temperature, rather than attributes influencing temperature. This practice cannot yield meaningful regions in data-sparse areas. Further, independent validation of the delineated regions for homogeneity in temperature is not possible, as temperature records form the basis to arrive at the regions. To address these issues, a two-stage clustering approach is proposed in this study to delineate homogeneous temperature regions. First stage of the approach involves (1) determining correlation structure between observed temperature over the study area and possible predictors (large-scale atmospheric variables) influencing the temperature and (2) using the correlation structure as the basis to delineate sites in the study area into clusters. Second stage of the approach involves analysis on each of the clusters to (1) identify potential predictors (large-scale atmospheric variables) influencing temperature at sites in the cluster and (2) partition the cluster into homogeneous fuzzy temperature regions using the identified potential predictors. Application of the proposed approach to India yielded 28 homogeneous regions that were demonstrated to be effective when compared to an alternate set of 6 regions that were previously delineated over the study area. Intersite cross-correlations of monthly maximum and minimum temperatures in the existing regions were found to be weak and negative for several months, which is undesirable. This problem was not found in the case of regions delineated using the proposed approach. Utility of the proposed regions in arriving at estimates of potential evapotranspiration for ungauged locations in the study area is demonstrated.
Resumo:
The work reported hen was motivated by a desire to verify the existence of structure - specifically MP-rich clusters induced by sodium bromide (NaBr) in the ternary liquid mixture 3-methylpyridine (Mf) + water(W) + NaBr. We present small-angle X-ray scattering (SAXS) measurements in this mixture. These measurements were obtained at room temperature (similar to 298 K) in the one-phase region (below the relevant lower consolute points, T(L)s) at different values of X (i.e., X = 0.02 - 0.17), where X is the weight fraction of NaBr in the mixture. Cluster-size distribution, estimated on the assumption that the clusters are spherical, shows systematic behaviour in that the peak of the distribution shifts rewards larger values of cluster radius as X increases. The largest spatial extent of the clusters (similar to 4.5 nm) is seen at X = 0.17. Data analysis assuming arbitrary shapes and sizes of clusters gives a limiting value of cluster size (- 4.5 nm) that is not very sensitive to X. It is suggested that the cluster size determined may not be the same as the usual critical-point fluctuations far removed from the critical point (T-L). The influence of the additional length scale due to clustering is discussed from the standpoint of crossover from Ising to mean-field critical behaviour, when moving away from the T-L.
Resumo:
We demonstrate that the hyper-Rayleigh scattering technique can be employed to measure the partition coefficient (k(p)) of a solute in a mixture of two immiscible solvents. Specifically, partition coefficients of six substituted benzoic acids in water/toluene (1:1 v/v) and water/chloroform (1:1 v/v) systems have been measured. Our values compare well with the k(p) values measured earlier by other techniques, The advantages offered by this technique are also discussed.
Resumo:
n this paper, a multistage evolutionary scheme is proposed for clustering in a large data base, like speech data. This is achieved by clustering a small subset of the entire sample set in each stage and treating the cluster centroids so obtained as samples, together with another subset of samples not considered previously, as input data to the next stage. This is continued till the whole sample set is exhausted. The clustering is accomplished by constructing a fuzzy similarity matrix and using the fuzzy techniques proposed here. The technique is illustrated by an efficient scheme for voiced-unvoiced-silence classification of speech.
Resumo:
Need to analyze particles in a flow? This system takes electrical pulses from acoustical or optical sensors and groups them into bands representing ranges of particle sizes.
Resumo:
In this paper the notion of conceptual cohesiveness is precised and used to group objects semantically, based on a knowledge structure called ‘cohesion forest’. A set of axioms is proposed which should be satisfied to make the generated clusters meaningful.
Resumo:
K-means algorithm is a well known nonhierarchical method for clustering data. The most important limitations of this algorithm are that: (1) it gives final clusters on the basis of the cluster centroids or the seed points chosen initially, and (2) it is appropriate for data sets having fairly isotropic clusters. But this algorithm has the advantage of low computation and storage requirements. On the other hand, hierarchical agglomerative clustering algorithm, which can cluster nonisotropic (chain-like and concentric) clusters, requires high storage and computation requirements. This paper suggests a new method for selecting the initial seed points, so that theK-means algorithm gives the same results for any input data order. This paper also describes a hybrid clustering algorithm, based on the concepts of multilevel theory, which is nonhierarchical at the first level and hierarchical from second level onwards, to cluster data sets having (i) chain-like clusters and (ii) concentric clusters. It is observed that this hybrid clustering algorithm gives the same results as the hierarchical clustering algorithm, with less computation and storage requirements.
Resumo:
Need to analyze particles in a flow? This system takes electrical pulses from acoustical or optical sensors and groups them into bands representing ranges of particle sizes.
Resumo:
The “partition method” or “sub-domain method” consists of expressing the solution of a governing differential equation, partial or ordinary, in terms of functions which satisfy the boundary conditions and setting to zero the error in the differential equation integrated over each of the sub-domains into which the given domain is partitioned. In this paper, the use of this method in eigenvalue problems with particular reference to vibration of plates is investigated. The deflection of the plate is expressed in terms of polynomials satisfying the boundary conditions completely. Setting the integrated error in each of the subdomains to zero results in a set of simultaneous, linear, homogeneous, algebraic equations in the undetermined coefficients of the deflection series. The algebraic eigenvalue problem is then solved for eigenvalues and eigenvectors. Convergence is examined in a few typical cases and is found to be satisfactory. The results obtained are compared with existing results based on other methods and are found to be in very good agreement.