875 resultados para document clustering


Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper, we approach the classical problem of clustering using solution concepts from cooperative game theory such as Nucleolus and Shapley value. We formulate the problem of clustering as a characteristic form game and develop a novel algorithm DRAC (Density-Restricted Agglomerative Clustering) for clustering. With extensive experimentation on standard data sets, we compare the performance of DRAC with that of well known algorithms. We show an interesting result that four prominent solution concepts, Nucleolus, Shapley value, Gately point and \tau-value coincide for the defined characteristic form game. This vindicates the choice of the characteristic function of the clustering game and also provides strong intuitive foundation for our approach.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper presents an improved hierarchical clustering algorithm for land cover mapping problem using quasi-random distribution. Initially, Niche Particle Swarm Optimization (NPSO) with pseudo/quasi-random distribution is used for splitting the data into number of cluster centers by satisfying Bayesian Information Criteria (BIC). Themain objective is to search and locate the best possible number of cluster and its centers. NPSO which highly depends on the initial distribution of particles in search space is not been exploited to its full potential. In this study, we have compared more uniformly distributed quasi-random with pseudo-random distribution with NPSO for splitting data set. Here to generate quasi-random distribution, Faure method has been used. Performance of previously proposed methods namely K-means, Mean Shift Clustering (MSC) and NPSO with pseudo-random is compared with the proposed approach - NPSO with quasi distribution(Faure). These algorithms are used on synthetic data set and multi-spectral satellite image (Landsat 7 thematic mapper). From the result obtained we conclude that use of quasi-random sequence with NPSO for hierarchical clustering algorithm results in a more accurate data classification.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper, a comparative study is carried using three nature-inspired algorithms namely Genetic Algorithm (GA), Particle Swarm Optimization (PSO) and Cuckoo Search (CS) on clustering problem. Cuckoo search is used with levy flight. The heavy-tail property of levy flight is exploited here. These algorithms are used on three standard benchmark datasets and one real-time multi-spectral satellite dataset. The results are tabulated and analysed using various techniques. Finally we conclude that under the given set of parameters, cuckoo search works efficiently for majority of the dataset and levy flight plays an important role.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper illustrates the application of a new technique, based on Support Vector Clustering (SVC) for the direct identification of coherent synchronous generators in a large interconnected Multi-Machine Power Systems. The clustering is based on coherency measures, obtained from the time domain responses of the generators following system disturbances. The proposed clustering algorithm could be integrated into a wide-area measurement system that enables fast identification of coherent clusters of generators for the construction of dynamic equivalent models. An application of the proposed method is demonstrated on a practical 15 generators 72-bus system, an equivalent of Indian Southern grid in an attempt to show the effectiveness of this clustering approach. The effects of short circuit fault locations on coherency are also investigated.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We address the problem of detecting cells in biological images. The problem is important in many automated image analysis applications. We identify the problem as one of clustering and formulate it within the framework of robust estimation using loss functions. We show how suitable loss functions may be chosen based on a priori knowledge of the noise distribution. Specifically, in the context of biological images, since the measurement noise is not Gaussian, quadratic loss functions yield suboptimal results. We show that by incorporating the Huber loss function, cells can be detected robustly and accurately. To initialize the algorithm, we also propose a seed selection approach. Simulation results show that Huber loss exhibits better performance compared with some standard loss functions. We also provide experimental results on confocal images of yeast cells. The proposed technique exhibits good detection performance even when the signal-to-noise ratio is low.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper, we present a novel approach that makes use of topic models based on Latent Dirichlet allocation(LDA) for generating single document summaries. Our approach is distinguished from other LDA based approaches in that we identify the summary topics which best describe a given document and only extract sentences from those paragraphs within the document which are highly correlated given the summary topics. This ensures that our summaries always highlight the crux of the document without paying any attention to the grammar and the structure of the documents. Finally, we evaluate our summaries on the DUC 2002 Single document summarization data corpus using ROUGE measures. Our summaries had higher ROUGE values and better semantic similarity with the documents than the DUC summaries.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Classification of a large document collection involves dealing with a huge feature space where each distinct word is a feature. In such an environment, classification is a costly task both in terms of running time and computing resources. Further it will not guarantee optimal results because it is likely to overfit by considering every feature for classification. In such a context, feature selection is inevitable. This work analyses the feature selection methods, explores the relations among them and attempts to find a minimal subset of features which are discriminative for document classification.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The role of crystallite size and clustering in influencing the stability of the structures of a large tetragonality ferroelectric system 0.6BiFeO(3)-0.4PbTiO(3) was investigated. The system exhibits cubic phase for a crystallite size similar to 25 nm, three times larger than the critical size reported for one of its end member PbTiO3. With increased degree of clustering for the same average crystallite size, partial stabilization of the ferroelectric tetragonal phase takes place. The results suggest that clustering helps in reducing the depolarization energy without the need for increasing the crystallite size of free particles.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Clustering has been the most popular method for data exploration. Clustering is partitioning the data set into sub-partitions based on some measures say the distance measure, each partition has its own significant information. There are a number of algorithms explored for this purpose, one such algorithm is the Particle Swarm Optimization(PSO) which is a population based heuristic search technique derived from swarm intelligence. In this paper we present an improved version of the Particle Swarm Optimization where, each feature of the data set is given significance accordingly by adding some random weights, which also minimizes the distortions in the dataset if any. The performance of the above proposed algorithm is evaluated using some benchmark datasets from Machine Learning Repository. The experimental results shows that our proposed methodology performs significantly better than the previously performed experiments.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Chebyshev-inequality-based convex relaxations of Chance-Constrained Programs (CCPs) are shown to be useful for learning classifiers on massive datasets. In particular, an algorithm that integrates efficient clustering procedures and CCP approaches for computing classifiers on large datasets is proposed. The key idea is to identify high density regions or clusters from individual class conditional densities and then use a CCP formulation to learn a classifier on the clusters. The CCP formulation ensures that most of the data points in a cluster are correctly classified by employing a Chebyshev-inequality-based convex relaxation. This relaxation is heavily dependent on the second-order statistics. However, this formulation and in general such relaxations that depend on the second-order moments are susceptible to moment estimation errors. One of the contributions of the paper is to propose several formulations that are robust to such errors. In particular a generic way of making such formulations robust to moment estimation errors is illustrated using two novel confidence sets. An important contribution is to show that when either of the confidence sets is employed, for the special case of a spherical normal distribution of clusters, the robust variant of the formulation can be posed as a second-order cone program. Empirical results show that the robust formulations achieve accuracies comparable to that with true moments, even when moment estimates are erroneous. Results also illustrate the benefits of employing the proposed methodology for robust classification of large-scale datasets.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Learning from Positive and Unlabelled examples (LPU) has emerged as an important problem in data mining and information retrieval applications. Existing techniques are not ideally suited for real world scenarios where the datasets are linearly inseparable, as they either build linear classifiers or the non-linear classifiers fail to achieve the desired performance. In this work, we propose to extend maximum margin clustering ideas and present an iterative procedure to design a non-linear classifier for LPU. In particular, we build a least squares support vector classifier, suitable for handling this problem due to symmetry of its loss function. Further, we present techniques for appropriately initializing the labels of unlabelled examples and for enforcing the ratio of positive to negative examples while obtaining these labels. Experiments on real-world datasets demonstrate that the non-linear classifier designed using the proposed approach gives significantly better generalization performance than the existing relevant approaches for LPU.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Scatter/Gather systems are increasingly becoming useful in browsing document corpora. Usability of the present-day systems are restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries/machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, but using comparable corpora. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the entire overlapping Wikipedia of English, Hindi and Bengali articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning and data mining. Clustering is grouping of a data set or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait according to some defined distance measure. In this paper we present the genetically improved version of particle swarm optimization algorithm which is a population based heuristic search technique derived from the analysis of the particle swarm intelligence and the concepts of genetic algorithms (GA). The algorithm combines the concepts of PSO such as velocity and position update rules together with the concepts of GA such as selection, crossover and mutation. The performance of the above proposed algorithm is evaluated using some benchmark datasets from Machine Learning Repository. The performance of our method is better than k-means and PSO algorithm.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Data clustering groups data so that data which are similar to each other are in the same group and data which are dissimilar to each other are in different groups. Since generally clustering is a subjective activity, it is possible to get different clusterings of the same data depending on the need. This paper attempts to find the best clustering of the data by first carrying out feature selection and using only the selected features, for clustering. A PSO (Particle Swarm Optimization)has been used for clustering but feature selection has also been carried out simultaneously. The performance of the above proposed algorithm is evaluated on some benchmark data sets. The experimental results shows the proposed methodology outperforms the previous approaches such as basic PSO and Kmeans for the clustering problem.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This study investigates the application of support vector clustering (SVC) for the direct identification of coherent synchronous generators in large interconnected multi-machine power systems. The clustering is based on coherency measure, which indicates the degree of coherency between any pair of generators. The proposed SVC algorithm processes the coherency measure matrix that is formulated using the generator rotor measurements to cluster the coherent generators. The proposed approach is demonstrated on IEEE 10 generator 39-bus system and an equivalent 35 generators, 246-bus system of practical Indian southern grid. The effect of number of data samples and fault locations are also examined for determining the accuracy of the proposed approach. An extended comparison with other clustering techniques is also included, to show the effectiveness of the proposed approach in grouping the data into coherent groups of generators. This effectiveness of the coherent clusters obtained with the proposed approach is compared in terms of a set of clustering validity indicators and in terms of statistical assessment that is based on the coherency degree of a generator pair.