10 resultados para K-Means clustering

em Aston University Research Archive


Relevância:

100.00% 100.00%

Publicador:

Resumo:

The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Descriptions of vegetation communities are often based on vague semantic terms describing species presence and dominance. For this reason, some researchers advocate the use of fuzzy sets in the statistical classification of plant species data into communities. In this study, spatially referenced vegetation abundance values collected from Greek phrygana were analysed by ordination (DECORANA), and classified on the resulting axes using fuzzy c-means to yield a point data-set representing local memberships in characteristic plant communities. The fuzzy clusters matched vegetation communities noted in the field, which tended to grade into one another, rather than occupying discrete patches. The fuzzy set representation of the community exploited the strengths of detrended correspondence analysis while retaining richer information than a TWINSPAN classification of the same data. Thus, in the absence of phytosociological benchmarks, meaningful and manageable habitat information could be derived from complex, multivariate species data. We also analysed the influence of the reliability of different surveyors' field observations by multiple sampling at a selected sample location. We show that the impact of surveyor error was more severe in the Boolean than the fuzzy classification. © 2007 Springer.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We propose a hybrid generative/discriminative framework for semantic parsing which combines the hidden vector state (HVS) model and the hidden Markov support vector machines (HM-SVMs). The HVS model is an extension of the basic discrete Markov model in which context is encoded as a stack-oriented state vector. The HM-SVMs combine the advantages of the hidden Markov models and the support vector machines. By employing a modified K-means clustering method, a small set of most representative sentences can be automatically selected from an un-annotated corpus. These sentences together with their abstract annotations are used to train an HVS model which could be subsequently applied on the whole corpus to generate semantic parsing results. The most confident semantic parsing results are selected to generate a fully-annotated corpus which is used to train the HM-SVMs. The proposed framework has been tested on the DARPA Communicator Data. Experimental results show that an improvement over the baseline HVS parser has been observed using the hybrid framework. When compared with the HM-SVMs trained from the fully-annotated corpus, the hybrid framework gave a comparable performance with only a small set of lightly annotated sentences. © 2008. Licensed under the Creative Commons.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Clustering techniques such as k-means and hierarchical clustering are commonly used to analyze DNA microarray derived gene expression data. However, the interactions between processes underlying the cell activity suggest that the complexity of the microarray data structure may not be fully represented with discrete clustering methods.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Objective: Recently, much research has been proposed using nature inspired algorithms to perform complex machine learning tasks. Ant colony optimization (ACO) is one such algorithm based on swarm intelligence and is derived from a model inspired by the collective foraging behavior of ants. Taking advantage of the ACO in traits such as self-organization and robustness, this paper investigates ant-based algorithms for gene expression data clustering and associative classification. Methods and material: An ant-based clustering (Ant-C) and an ant-based association rule mining (Ant-ARM) algorithms are proposed for gene expression data analysis. The proposed algorithms make use of the natural behavior of ants such as cooperation and adaptation to allow for a flexible robust search for a good candidate solution. Results: Ant-C has been tested on the three datasets selected from the Stanford Genomic Resource Database and achieved relatively high accuracy compared to other classical clustering methods. Ant-ARM has been tested on the acute lymphoblastic leukemia (ALL)/acute myeloid leukemia (AML) dataset and generated about 30 classification rules with high accuracy. Conclusions: Ant-C can generate optimal number of clusters without incorporating any other algorithms such as K-means or agglomerative hierarchical clustering. For associative classification, while a few of the well-known algorithms such as Apriori, FP-growth and Magnum Opus are unable to mine any association rules from the ALL/AML dataset within a reasonable period of time, Ant-ARM is able to extract associative classification rules.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Projection of a high-dimensional dataset onto a two-dimensional space is a useful tool to visualise structures and relationships in the dataset. However, a single two-dimensional visualisation may not display all the intrinsic structure. Therefore, hierarchical/multi-level visualisation methods have been used to extract more detailed understanding of the data. Here we propose a multi-level Gaussian process latent variable model (MLGPLVM). MLGPLVM works by segmenting data (with e.g. K-means, Gaussian mixture model or interactive clustering) in the visualisation space and then fitting a visualisation model to each subset. To measure the quality of multi-level visualisation (with respect to parent and child models), metrics such as trustworthiness, continuity, mean relative rank errors, visualisation distance distortion and the negative log-likelihood per point are used. We evaluate the MLGPLVM approach on the ‘Oil Flow’ dataset and a dataset of protein electrostatic potentials for the ‘Major Histocompatibility Complex (MHC) class I’ of humans. In both cases, visual observation and the quantitative quality measures have shown better visualisation at lower levels.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

An overview of neural networks, covering multilayer perceptrons, radial basis functions, constructive algorithms, Kohonen and K-means unupervised algorithms, RAMnets, first and second order training methods, and Bayesian regularisation methods.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

The Dirichlet process mixture model (DPMM) is a ubiquitous, flexible Bayesian nonparametric statistical model. However, full probabilistic inference in this model is analytically intractable, so that computationally intensive techniques such as Gibbs sampling are required. As a result, DPMM-based methods, which have considerable potential, are restricted to applications in which computational resources and time for inference is plentiful. For example, they would not be practical for digital signal processing on embedded hardware, where computational resources are at a serious premium. Here, we develop a simplified yet statistically rigorous approximate maximum a-posteriori (MAP) inference algorithm for DPMMs. This algorithm is as simple as DP-means clustering, solves the MAP problem as well as Gibbs sampling, while requiring only a fraction of the computational effort. (For freely available code that implements the MAP-DP algorithm for Gaussian mixtures see http://www.maxlittle.net/.) Unlike related small variance asymptotics (SVA), our method is non-degenerate and so inherits the “rich get richer” property of the Dirichlet process. It also retains a non-degenerate closed-form likelihood which enables out-of-sample calculations and the use of standard tools such as cross-validation. We illustrate the benefits of our algorithm on a range of examples and contrast it to variational, SVA and sampling approaches from both a computational complexity perspective as well as in terms of clustering performance. We demonstrate the wide applicabiity of our approach by presenting an approximate MAP inference method for the infinite hidden Markov model whose performance contrasts favorably with a recently proposed hybrid SVA approach. Similarly, we show how our algorithm can applied to a semiparametric mixed-effects regression model where the random effects distribution is modelled using an infinite mixture model, as used in longitudinal progression modelling in population health science. Finally, we propose directions for future research on approximate MAP inference in Bayesian nonparametrics.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper examines the potential for the development of patient services that could arise from the co-location of pharmacies with medical practices in the new "one-stop" centres. A review of the pharmacy-specific literature shows limited understanding of influence of location upon service development and highlights a tension between the professional and commercial drives. The aim of the survey of health centre pharmacists was to describe the current patterns of integration in the primary health care team. The study demonstrates that co-location offers opportunities but that there are barriers linked to the loss of traditional commercial activity. © 2003 Elsevier Science Ltd. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Segmentation is an important step in many medical imaging applications and a variety of image segmentation techniques exist. One group of segmentation algorithms is based on clustering concepts. In this article we investigate several fuzzy c-means based clustering algorithms and their application to medical image segmentation. In particular we evaluate the conventional hard c-means (HCM) and fuzzy c-means (FCM) approaches as well as three computationally more efficient derivatives of fuzzy c-means: fast FCM with random sampling, fast generalised FCM, and a new anisotropic mean shift based FCM. © 2010 by IJTS, ISDER.