783 results for grid, clustering, statistical, clustering
Abstract:
Here we rederive the hierarchy of equations for the evolution of distribution functions of various orders using a convenient parameterization. We use this to obtain equations for two- and three-point correlation functions in powers of a small parameter, viz., the initial density contrast. The correspondence of the lowest order solutions of these equations to the results from the linear theory of density perturbations is shown for an Ω = 1 universe. These equations are then used to calculate, to the lowest order, the induced three-point correlation function that arises from Gaussian initial conditions in an Ω = 1 universe. We obtain an expression which explicitly exhibits the spatial structure of the induced three-point correlation function. It is seen that the spatial structure of this quantity is independent of the value of Ω. We also calculate the triplet momentum. We find that the induced three-point correlation function does not have the "hierarchical" form often assumed. We discuss possibilities of using the induced three-point correlation to interpret observational data. The formalism developed here can also be used to test the validity of different schemes to close the
Abstract:
In the knowledge-based clustering approaches reported in the literature, explicit knowledge, typically in the form of a set of concepts, is used in computing similarity or conceptual cohesiveness between objects and in grouping them. We propose a knowledge-based clustering approach in which the domain knowledge is also used in the pattern representation phase of clustering. We argue that such a knowledge-based pattern representation scheme reduces the complexity of the similarity computation and grouping phases. We present a knowledge-based clustering algorithm for grouping books in a library.
Abstract:
We use the BBGKY hierarchy equations to calculate, perturbatively, the lowest order nonlinear correction to the two-point correlation and the pair velocity for Gaussian initial conditions in a critical density matter-dominated cosmological model. We compare our results with the results obtained using the hydrodynamic equations that neglect pressure and find that the two match, indicating that there are no effects of multistreaming at this order of perturbation. We analytically study the effect of small scales on the large scales by calculating the nonlinear correction for a Dirac delta function initial two-point correlation. We find that the induced two-point correlation has an x^(-6) behavior at large separations. We have considered a class of initial conditions where the initial power spectrum at small k has the form k^n with 0 < n ≤ 3 and have numerically calculated the nonlinear correction to the two-point correlation, its average over a sphere and the pair velocity over a large dynamical range. We find that at small separations the effect of the nonlinear term is to enhance the clustering, whereas at intermediate scales it can act to either increase or decrease the clustering. At large scales we find a simple formula that gives a very good fit for the nonlinear correction in terms of the initial function. This formula explicitly exhibits the influence of small scales on large scales, and because of this coupling the perturbative treatment breaks down at large scales much before one would expect it to if the nonlinearity were local in real space. We physically interpret this formula in terms of a simple diffusion process. We have also investigated the case n = 0, and we find that it differs from the other cases in certain respects.
We investigate a recently proposed scaling property of gravitational clustering, and we find that the lowest order nonlinear terms cause deviations from the scaling relations that are strictly valid in the linear regime. The approximate validity of these relations in the nonlinear regime in N-body simulations cannot be understood at this order of evolution.
Abstract:
In this article, we present a novel application of a quantum clustering (QC) technique to objectively cluster the conformations, sampled by molecular dynamics simulations performed on different ligand bound structures of the protein. We further portray each conformational population in terms of dynamically stable network parameters which beautifully capture the ligand induced variations in the ensemble in atomistic detail. The conformational populations thus identified by the QC method and verified by network parameters are evaluated for different ligand bound states of the protein pyrrolysyl-tRNA synthetase (DhPylRS) from D. hafniense. The ligand/environment induced re-distribution of protein conformational ensembles forms the basis for understanding several important biological phenomena such as allostery and enzyme catalysis. The atomistic level characterization of each population in the conformational ensemble in terms of the re-orchestrated networks of amino acids is a challenging problem, especially when the changes are minimal at the backbone level. Here we demonstrate that the QC method is sensitive to such subtle changes and is able to cluster MD snapshots which are similar at the side-chain interaction level. Although we have applied these methods on simulation trajectories of a modest time scale (20 ns each), we emphasize that our methodology provides a general approach towards an objective clustering of large-scale MD simulation data and may be applied to probe multistate equilibria at higher time scales, and to problems related to protein folding for any protein or protein-protein/RNA/DNA complex of interest with a known structure.
Abstract:
Emerging high-dimensional data mining applications need to find interesting clusters embedded in arbitrarily aligned subspaces of lower dimensionality. It is difficult to cluster high-dimensional data objects when they are sparse and skewed. Updates are quite common in dynamic databases and are usually processed in batch mode. In very large dynamic databases, it is necessary to perform incremental cluster analysis only on the updates. We present an incremental clustering algorithm for subspace clustering in very high dimensions, which handles both insertions and deletions of data points in the backend databases.
Abstract:
Delineation of homogeneous precipitation regions (regionalization) is necessary for investigating frequency and spatial distribution of meteorological droughts. The conventional methods of regionalization use statistics of precipitation as attributes to establish homogeneous regions. Therefore they cannot be used to form regions in ungauged areas, and they may not be useful to form meaningful regions in areas having sparse rain gauge density. Further, validation of the regions for homogeneity in precipitation is not possible, since the use of the precipitation statistics to form regions and subsequently to test the regional homogeneity is not appropriate. To alleviate this problem, an approach based on fuzzy cluster analysis is presented. It allows delineation of homogeneous precipitation regions in data sparse areas using large scale atmospheric variables (LSAV), which influence precipitation in the study area, as attributes. The LSAV, location parameters (latitude, longitude and altitude) and seasonality of precipitation are suggested as features for regionalization. The approach allows independent validation of the identified regions for homogeneity using statistics computed from the observed precipitation. Further it has the ability to form regions even in ungauged areas, owing to the use of attributes that can be reliably estimated even when no at-site precipitation data are available. The approach was applied to delineate homogeneous annual rainfall regions in India, and its effectiveness is illustrated by comparing the results with those obtained using rainfall statistics, regionalization based on hard cluster analysis, and meteorological sub-divisions in India. (C) 2011 Elsevier B.V. All rights reserved.
Abstract:
Advertisements (ads) are the main revenue earner for television (TV) broadcasters. As TV reaches a large audience, it acts as the best medium for advertising products and services. With the emergence of digital TV, it is important for broadcasters to provide an intelligent service according to various dimensions such as program features, ad features, viewers’ interest and sponsors’ preference. We present an automatic ad recommendation algorithm that selects a set of ads by considering these dimensions and semantically matches them with programs. Features of the ad video are captured in terms of annotations, and they are grouped into a number of predefined semantic categories using a categorization technique. A fuzzy categorical data clustering technique is applied to the categorized data to select better suited ads for a particular program. Since the same ad can be recommended for more than one program depending upon multiple parameters, fuzzy clustering is well suited to ad recommendation. The relative fuzzy score called “degree of membership” calculated for each ad indicates the membership of a particular ad in different program clusters. The algorithm was evaluated subjectively by 10 different people and received a high success score.
Abstract:
Applications in various domains often lead to very large and frequently high-dimensional data. Successful algorithms must avoid the curse of dimensionality but at the same time should be computationally efficient. Finding useful patterns in large datasets has attracted considerable interest recently. The primary goal of the paper is to implement an efficient Hybrid Tree based clustering method based on CF-Tree and KD-Tree, and combine the clustering methods with KNN-Classification. The implementation of the algorithm involves many issues like good accuracy, less space and less time. We will evaluate the time and space efficiency, data input order sensitivity, and clustering quality through several experiments.
Abstract:
Clustering techniques are used in regional flood frequency analysis (RFFA) to partition watersheds into natural groups or regions with similar hydrologic responses. The linear Kohonen's self-organizing feature map (SOFM) has been applied as a clustering technique for RFFA in several recent studies. However, it is seldom possible to interpret clusters from the output of an SOFM, irrespective of its size and dimensionality. In this study, we demonstrate that SOFMs may, however, serve as a useful precursor to clustering algorithms. We present a two-level, SOFM-based clustering approach to form regions for FFA. In the first level, the SOFM is used to form a two-dimensional feature map. In the second level, the output nodes of the SOFM are clustered using the fuzzy c-means algorithm to form regions. The optimal number of regions is based on fuzzy cluster validation measures. Effectiveness of the proposed approach in forming homogeneous regions for FFA is illustrated through application to data from watersheds in Indiana, USA. Results show that the performance of the proposed approach to form regions is better than that based on classical SOFM.
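The second-level step described above can be illustrated with a minimal sketch of the standard fuzzy c-means updates. This is not the authors' implementation: the function names, the fuzzifier default `m = 2.0`, the fixed iteration count and the toy input are all assumptions made for illustration, and the input vectors merely stand in for SOFM output-node weights.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Degree of membership u[i][k] of point i in cluster k (rows sum to 1)."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point sits exactly on one or more centers: share full
            # membership among those centers to avoid dividing by zero.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
            continue
        memberships.append(
            [1.0 / sum((dk / dj) ** exponent for dj in dists) for dk in dists]
        )
    return memberships

def fcm_centers(points, memberships, m=2.0):
    """Centers as means of the points weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for k in range(len(memberships[0])):
        weights = [u[k] ** m for u in memberships]
        total = sum(weights)
        centers.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ))
    return centers

def fuzzy_c_means(points, init_centers, m=2.0, iterations=50):
    """Alternate the two updates for a fixed number of iterations."""
    centers = list(init_centers)
    for _ in range(iterations):
        centers = fcm_centers(points, fcm_memberships(points, centers, m), m)
    return centers, fcm_memberships(points, centers, m)

# Hypothetical stand-in for SOFM node weight vectors.
nodes = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
centers, u = fuzzy_c_means(nodes, [(0.0, 0.0), (5.0, 5.0)])
```

Each row of `u` gives the graded membership of one node in every region, which is what allows a node to belong partly to more than one cluster.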
Abstract:
This paper presents a novel Second Order Cone Programming (SOCP) formulation for large scale binary classification tasks. Assuming that the class conditional densities are mixture distributions, where each component of the mixture has a spherical covariance, the second order statistics of the components can be estimated efficiently using clustering algorithms like BIRCH. For each cluster, the second order moments are used to derive a second order cone constraint via a Chebyshev-Cantelli inequality. This constraint ensures that any data point in the cluster is classified correctly with a high probability. This leads to a large margin SOCP formulation whose size depends on the number of clusters rather than the number of training data points. Hence, the proposed formulation scales well for large datasets when compared to the state-of-the-art classifiers, Support Vector Machines (SVMs). Experiments on real world and synthetic datasets show that the proposed algorithm outperforms SVM solvers in terms of training time and achieves similar accuracies.
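The step from a cluster's second-order moments to a cone constraint can be sketched as follows. The notation is assumed for illustration and does not appear in the abstract: $w, b$ define the separating hyperplane, cluster $j$ has mean $\mu_j$, spherical covariance $\sigma_j^2 I$ and label $y_j \in \{-1, +1\}$, and $\eta$ is the required probability of correct classification. The paper's exact formulation may differ, for example in how margin and slack terms enter.

```latex
% One-sided Chebyshev (Cantelli) inequality for a random variable Z
% with mean \bar{z} > 0 and variance s^2:
\Pr(Z \le 0) \;\le\; \frac{s^2}{s^2 + \bar{z}^2}

% Apply it to Z = y_j (w^\top x + b) for x drawn from cluster j:
\bar{z} = y_j (w^\top \mu_j + b), \qquad s^2 = \sigma_j^2 \lVert w \rVert_2^2

% Requiring the misclassification bound to be at most 1 - \eta:
\frac{s^2}{s^2 + \bar{z}^2} \le 1 - \eta
\;\Longleftrightarrow\;
\bar{z} \ge \sqrt{\frac{\eta}{1 - \eta}}\; s

% which is a second order cone constraint in (w, b):
y_j (w^\top \mu_j + b) \;\ge\; \kappa\, \sigma_j \lVert w \rVert_2,
\qquad \kappa = \sqrt{\frac{\eta}{1 - \eta}}
```

There is one such constraint per cluster, which is why the size of the problem grows with the number of clusters rather than with the number of training points.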
Abstract:
This paper presents hierarchical clustering algorithms for the land cover mapping problem using multi-spectral satellite images. In unsupervised techniques, the automatic generation of the number of clusters and their centers for a huge database has not been exploited to its full potential. Hence, a hierarchical clustering algorithm that uses splitting and merging techniques is proposed. Initially, the splitting method is used to search for the best possible number of clusters and their centers using Mean Shift Clustering (MSC), Niche Particle Swarm Optimization (NPSO) and Glowworm Swarm Optimization (GSO). Using these clusters and their centers, the merging method is used to group the data points based on a parametric method (k-means algorithm). A performance comparison of the proposed hierarchical clustering algorithms (MSC, NPSO and GSO) is presented using two typical multi-spectral satellite images: Landsat 7 thematic mapper and QuickBird. From the results obtained, we conclude that the proposed GSO based hierarchical clustering algorithm is more accurate and robust.
Abstract:
Lack of supervision in clustering algorithms often leads to clusters that are not useful or interesting to human reviewers. We investigate if supervision can be automatically transferred for clustering a target task, by providing a relevant supervised partitioning of a dataset from a different source task. The target clustering is made more meaningful for the human user by trading-off intrinsic clustering goodness on the target task for alignment with relevant supervised partitions in the source task, wherever possible. We propose a cross-guided clustering algorithm that builds on traditional k-means by aligning the target clusters with source partitions. The alignment process makes use of a cross-task similarity measure that discovers hidden relationships across tasks. When the source and target tasks correspond to different domains with potentially different vocabularies, we propose a projection approach using pivot vocabularies for the cross-domain similarity measure. Using multiple real-world and synthetic datasets, we show that our approach improves clustering accuracy significantly over traditional k-means and state-of-the-art semi-supervised clustering baselines, over a wide range of data characteristics and parameter settings.
Abstract:
In this paper, we develop a game theoretic approach for clustering features in a learning problem. Feature clustering can serve as an important preprocessing step in many problems such as feature selection, dimensionality reduction, etc. In this approach, we view features as rational players of a coalitional game where they form coalitions (or clusters) among themselves in order to maximize their individual payoffs. We show how Nash Stable Partition (NSP), a well known concept in coalitional game theory, provides a natural way of clustering features. Through this approach, one can obtain some desirable properties of the clusters by choosing appropriate payoff functions. For a small number of features, the NSP based clustering can be found by solving an integer linear program (ILP). However, for a large number of features, the ILP based approach does not scale well and hence we propose a hierarchical approach. Interestingly, a key result that we prove on the equivalence between a k-size NSP of a coalitional game and the minimum k-cut of an appropriately constructed graph comes in handy for large scale problems. In this paper, we use the feature selection problem (in a classification setting) as a running example to illustrate our approach. We conduct experiments to illustrate the efficacy of our approach.
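To make the notion of a Nash stable partition concrete, here is a minimal sketch of a stability check. It assumes an additively separable payoff, in which a feature's payoff in a coalition is the sum of its pairwise weights to the other members. The abstract does not specify the payoff function, so this choice, the function names and the toy weights are all hypothetical.

```python
def payoff(player, coalition, weights):
    """Assumed payoff: sum of pairwise weights to the other coalition members."""
    return sum(weights[player][other] for other in coalition if other != player)

def is_nash_stable(partition, weights):
    """True if no player gains by moving alone to another coalition or to a singleton."""
    for coalition in partition:
        for player in coalition:
            current = payoff(player, coalition, weights)
            # Deviation 1: leave and form a singleton coalition (payoff 0).
            alternatives = [0.0]
            # Deviation 2: join any other existing coalition.
            for other in partition:
                if other is not coalition:
                    alternatives.append(payoff(player, other | {player}, weights))
            if max(alternatives) > current:
                return False
    return True

# Hypothetical symmetric weights: features 0,1 are alike, as are 2,3.
weights = [
    [0.0, 1.0, -0.5, -0.5],
    [1.0, 0.0, -0.5, -0.5],
    [-0.5, -0.5, 0.0, 1.0],
    [-0.5, -0.5, 1.0, 0.0],
]
stable = is_nash_stable([{0, 1}, {2, 3}], weights)
unstable = is_nash_stable([{0, 2}, {1, 3}], weights)
```

With these weights the partition `[{0, 1}, {2, 3}]` passes the check, while `[{0, 2}, {1, 3}]` fails because every feature would do better on its own.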
Abstract:
In this paper, we approach the classical problem of clustering using solution concepts from cooperative game theory such as the Nucleolus and the Shapley value. We formulate the problem of clustering as a characteristic form game and develop a novel algorithm, DRAC (Density-Restricted Agglomerative Clustering), for clustering. With extensive experimentation on standard data sets, we compare the performance of DRAC with that of well known algorithms. We show an interesting result that four prominent solution concepts, the Nucleolus, Shapley value, Gately point and τ-value, coincide for the defined characteristic form game. This vindicates the choice of the characteristic function of the clustering game and also provides a strong intuitive foundation for our approach.
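As an illustration of one of the solution concepts named above, the following sketch computes exact Shapley values for a small characteristic form game by enumerating all player orderings. The characteristic function used here (the sum of pairwise similarities inside a coalition) is an assumption for the example, since the abstract does not define the one used in the paper. This is not the DRAC algorithm, and the enumeration is only practical for a handful of players.

```python
from itertools import permutations

def shapley_values(players, value):
    """Average marginal contribution of each player over all orderings."""
    totals = {p: 0.0 for p in players}
    count = 0
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            grown = coalition | {p}
            totals[p] += value(grown) - value(coalition)
            coalition = grown
        count += 1
    return {p: total / count for p, total in totals.items()}

def pairwise_value(similarity):
    """Assumed characteristic function: total pairwise similarity in the coalition."""
    def value(coalition):
        members = sorted(coalition)
        return sum(
            similarity[a][b]
            for i, a in enumerate(members)
            for b in members[i + 1:]
        )
    return value

# Hypothetical symmetric similarity matrix for three data points.
similarity = [
    [0.0, 0.8, 0.2],
    [0.8, 0.0, 0.4],
    [0.2, 0.4, 0.0],
]
phi = shapley_values([0, 1, 2], pairwise_value(similarity))
```

For this assumed characteristic function, each point's Shapley value equals half the sum of its similarities to all other points, so points in dense neighbourhoods receive larger values.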