21 results for k-Means algorithm


Relevance:

100.00%

Abstract:

The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. It can also efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means, with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
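
For orientation, the baseline that this abstract contrasts MAP-DP against can be sketched as below. This is a minimal illustration of standard Lloyd-style K-means, not MAP-DP and not the authors' code; the function name `kmeans`, its parameters and the toy points are assumptions made for this example. Note that `k` has to be supplied by the caller, which is the restriction the abstract refers to.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd-style K-means on a list of equal-length numeric tuples.

    Illustrative sketch only: k is fixed in advance and distances are
    squared Euclidean, so the data are assumed to be continuous.
    """
    rng = random.Random(seed)
    # Initialise the centroids from k randomly chosen data points.
    centroids = rng.sample(points, k)
    assignments = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centroids, toy_labels = kmeans(toy_points, k=2)
```

On the toy data the two groups end up with different labels; which group receives label 0 depends on the random initialisation.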

Relevance:

90.00%

Abstract:

Objective: Recently, much research has proposed using nature-inspired algorithms to perform complex machine learning tasks. Ant colony optimization (ACO) is one such algorithm based on swarm intelligence and is derived from a model inspired by the collective foraging behavior of ants. Taking advantage of ACO traits such as self-organization and robustness, this paper investigates ant-based algorithms for gene expression data clustering and associative classification. Methods and material: An ant-based clustering (Ant-C) algorithm and an ant-based association rule mining (Ant-ARM) algorithm are proposed for gene expression data analysis. The proposed algorithms make use of the natural behavior of ants, such as cooperation and adaptation, to allow for a flexible, robust search for a good candidate solution. Results: Ant-C has been tested on three datasets selected from the Stanford Genomic Resource Database and achieved relatively high accuracy compared to other classical clustering methods. Ant-ARM has been tested on the acute lymphoblastic leukemia (ALL)/acute myeloid leukemia (AML) dataset and generated about 30 classification rules with high accuracy. Conclusions: Ant-C can generate an optimal number of clusters without incorporating any other algorithms such as K-means or agglomerative hierarchical clustering. For associative classification, while a few of the well-known algorithms such as Apriori, FP-growth and Magnum Opus are unable to mine any association rules from the ALL/AML dataset within a reasonable period of time, Ant-ARM is able to extract associative classification rules.

Relevance:

80.00%

Abstract:

An overview of neural networks, covering multilayer perceptrons, radial basis functions, constructive algorithms, Kohonen and K-means unsupervised algorithms, RAMnets, first and second order training methods, and Bayesian regularisation methods.

Relevance:

80.00%

Abstract:

Clustering techniques such as k-means and hierarchical clustering are commonly used to analyze DNA microarray derived gene expression data. However, the interactions between processes underlying the cell activity suggest that the complexity of the microarray data structure may not be fully represented with discrete clustering methods.

Relevance:

80.00%

Abstract:

We propose a hybrid generative/discriminative framework for semantic parsing which combines the hidden vector state (HVS) model and the hidden Markov support vector machines (HM-SVMs). The HVS model is an extension of the basic discrete Markov model in which context is encoded as a stack-oriented state vector. The HM-SVMs combine the advantages of the hidden Markov models and the support vector machines. By employing a modified K-means clustering method, a small set of most representative sentences can be automatically selected from an un-annotated corpus. These sentences together with their abstract annotations are used to train an HVS model which could be subsequently applied on the whole corpus to generate semantic parsing results. The most confident semantic parsing results are selected to generate a fully-annotated corpus which is used to train the HM-SVMs. The proposed framework has been tested on the DARPA Communicator Data. Experimental results show that an improvement over the baseline HVS parser has been observed using the hybrid framework. When compared with the HM-SVMs trained from the fully-annotated corpus, the hybrid framework gave a comparable performance with only a small set of lightly annotated sentences. © 2008. Licensed under the Creative Commons.

Relevance:

80.00%

Abstract:

Projection of a high-dimensional dataset onto a two-dimensional space is a useful tool to visualise structures and relationships in the dataset. However, a single two-dimensional visualisation may not display all the intrinsic structure. Therefore, hierarchical/multi-level visualisation methods have been used to extract more detailed understanding of the data. Here we propose a multi-level Gaussian process latent variable model (MLGPLVM). MLGPLVM works by segmenting data (with e.g. K-means, Gaussian mixture model or interactive clustering) in the visualisation space and then fitting a visualisation model to each subset. To measure the quality of multi-level visualisation (with respect to parent and child models), metrics such as trustworthiness, continuity, mean relative rank errors, visualisation distance distortion and the negative log-likelihood per point are used. We evaluate the MLGPLVM approach on the ‘Oil Flow’ dataset and a dataset of protein electrostatic potentials for the ‘Major Histocompatibility Complex (MHC) class I’ of humans. In both cases, visual observation and the quantitative quality measures have shown better visualisation at lower levels.

Relevance:

30.00%

Abstract:

The storage capacity of multilayer networks with overlapping receptive fields is investigated for a constructive algorithm within a one-step replica symmetry breaking (RSB) treatment. We find that the storage capacity increases logarithmically with the number of hidden units K without saturating the Mitchison-Durbin bound. The slope of the logarithmic increase decays exponentially with the stability with which the patterns have been stored.

Relevance:

30.00%

Abstract:

The Generative Topographic Mapping (GTM) algorithm of Bishop et al. (1997) has been introduced as a principled alternative to the Self-Organizing Map (SOM). As well as avoiding a number of deficiencies in the SOM, the GTM algorithm has the key property that the smoothness properties of the model are decoupled from the reference vectors, and are described by a continuous mapping from a lower-dimensional latent space into the data space. Magnification factors, which are approximated by the difference between code-book vectors in SOMs, can therefore be evaluated for the GTM model as continuous functions of the latent variables using the techniques of differential geometry. They play an important role in data visualization by highlighting the boundaries between data clusters, and are illustrated here for both a toy data set, and a problem involving the identification of crab species from morphological data.

Relevance:

30.00%

Abstract:

The distribution of finished products from depots to customers is a practical and challenging problem in logistics management. Better routing and scheduling decisions can result in a higher level of customer satisfaction because more customers can be served in a shorter time. The distribution problem is generally formulated as the vehicle routing problem (VRP). Nevertheless, the VRP rests on the rigid assumption that there is only one depot. In cases where, for instance, a logistics company has more than one depot, the VRP is not suitable. To resolve this limitation, this paper focuses on the VRP with multiple depots, or multi-depot VRP (MDVRP). The MDVRP is NP-hard, which means that an efficient algorithm for solving the problem to optimality is unavailable. To deal with the problem efficiently, two hybrid genetic algorithms (HGAs) are developed in this paper. The major difference between the HGAs lies in the initialization procedure: the initial solutions are generated randomly in HGA1, whereas the Clarke and Wright saving method and the nearest neighbor heuristic are incorporated into HGA2. A computational study is carried out to compare the algorithms with different problem sizes. The study shows that the performance of HGA2 is superior to that of HGA1 in terms of the total delivery time.
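
The nearest neighbor heuristic named in this abstract can be illustrated roughly as follows. This is a simplified single-depot, single-vehicle sketch with Euclidean distances and no capacity or time constraints; it is not the paper's HGA2 initialization, and the names `nearest_neighbour_route` and `route_length` as well as the coordinates are assumptions made for illustration.

```python
import math


def nearest_neighbour_route(depot, customers):
    """Build one route greedily: always drive to the closest unvisited customer.

    Illustrative sketch: one depot, one vehicle, no capacity limits.
    """
    route = [depot]
    remaining = list(customers)
    current = depot
    while remaining:
        # Pick the unvisited customer closest to the current position.
        nxt = min(remaining, key=lambda c: math.dist(current, c))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    route.append(depot)  # the vehicle returns to the depot at the end
    return route


def route_length(route):
    """Total Euclidean length of a route given as a list of (x, y) stops."""
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))


# Hypothetical coordinates: a depot at the origin and three customers on a line.
example_route = nearest_neighbour_route((0, 0), [(5, 0), (1, 0), (2, 0)])
print(example_route)                # [(0, 0), (1, 0), (2, 0), (5, 0), (0, 0)]
print(route_length(example_route))  # 10.0
```

A greedy route like this is typically used only as a starting solution; a genetic algorithm would then try to improve on it.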

Relevance:

30.00%

Abstract:

The generalised transportation problem (GTP) is an extension of the linear Hitchcock transportation problem. However, it does not have the unimodularity property, which means that a linear programming solution (such as one obtained with the simplex method) is not guaranteed to be integer. This is a major difference between the GTP and the Hitchcock transportation problem. Although some special algorithms, such as the generalised stepping-stone method, have been developed, they are based on the linear programming model and relax the integer solution requirement of the GTP. This paper proposes a genetic algorithm (GA) to solve the GTP, and a numerical example is presented to show the algorithm and its efficiency.

Relevance:

30.00%

Abstract:

This paper examines the potential for the development of patient services that could arise from the co-location of pharmacies with medical practices in the new "one-stop" centres. A review of the pharmacy-specific literature shows limited understanding of the influence of location upon service development and highlights a tension between the professional and commercial drives. The aim of the survey of health centre pharmacists was to describe the current patterns of integration in the primary health care team. The study demonstrates that co-location offers opportunities but that there are barriers linked to the loss of traditional commercial activity. © 2003 Elsevier Science Ltd. All rights reserved.

Relevance:

30.00%

Abstract:

Nowadays, road safety and traffic congestion are major concerns worldwide, which is why research on vehicular communication is vital. In static scenarios, vehicles typically behave as in an office network, where nodes transmit without moving and have no defined position. This paper analyses the impact of context information on existing popular rate adaptation algorithms. Our simulation was done in MATLAB by observing the impact of context information on these algorithms. Simulation was performed for both static and mobile cases. Our simulations are based on the IEEE 802.11p wireless standard. In the static case, vehicles do not move and have no defined positions, while in the mobile case, vehicles move with uniformly selected speeds and randomized positions. Network performance is analysed using context information. Our results show that, under mobility, using context information improves system performance for all three rate adaptation algorithms. This can be explained by the fact that, with range checking, when many vehicles are out of communication range, fewer vehicles contend for network resources, thereby increasing network performance. © 2013 IEEE.

Relevance:

30.00%

Abstract:

This paper presents a simulated genetic algorithm (GA) model for scheduling the flow shop problem with re-entrant jobs. The objective of this research is to minimize the weighted tardiness and makespan. The proposed model assumes that jobs with non-identical due dates are processed on the machines in the same order. Furthermore, the re-entrant jobs are stochastic, as only some jobs are required to re-enter the flow shop. The tardiness weight is adjusted once the jobs re-enter the shop. The performance of the proposed GA model is verified by a number of numerical experiments using data from the case company. The results show that the proposed method has a higher order satisfaction rate than current industrial practices.
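
The two objectives named in this abstract, makespan and weighted tardiness, can be evaluated for a given job sequence roughly as follows. This sketch assumes a plain permutation flow shop without re-entrant jobs and is not the paper's GA model; the function names, job labels, processing times, due dates and weights are all hypothetical.

```python
def completion_times(sequence, proc):
    """Completion time of each job on the last machine.

    proc[job][m] is the processing time of `job` on machine m. Every job
    visits the machines in the same order (permutation flow shop).
    """
    n_machines = len(next(iter(proc.values())))
    machine_free = [0] * n_machines  # time at which each machine becomes idle
    done = {}
    for job in sequence:
        t = 0  # time at which this job leaves the previous machine
        for m in range(n_machines):
            start = max(t, machine_free[m])  # wait for both job and machine
            t = start + proc[job][m]
            machine_free[m] = t
        done[job] = t
    return done


def makespan(done):
    """Time at which the last job leaves the last machine."""
    return max(done.values())


def weighted_tardiness(done, due, weight):
    """Sum over jobs of weight * max(0, completion time - due date)."""
    return sum(weight[j] * max(0, done[j] - due[j]) for j in done)


# Hypothetical two-job, two-machine instance.
proc = {"A": [3, 2], "B": [1, 4]}
due = {"A": 4, "B": 10}
weight = {"A": 2, "B": 1}

done = completion_times(["A", "B"], proc)
print(makespan(done))                         # 9
print(weighted_tardiness(done, due, weight))  # 2
```

In this toy instance the reverse sequence `["B", "A"]` gives a shorter makespan (7) but a larger weighted tardiness (6), which shows why the two objectives may need to be traded off against each other.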