153 results for Incremental Clustering


Relevance:

20.00%

Abstract:

Clustering Web documents efficiently yet accurately is of great interest to Web users and is a key component of the search accuracy of a Web search engine. To achieve this, this paper introduces a new approach to clustering Web documents, called the maximal frequent itemset (MFI) approach. Iterative clustering algorithms, such as K-means and expectation-maximization (EM), are sensitive to their initial conditions. The MFI approach first locates the center points of high-density clusters precisely. These center points are then used as initial points for the K-means algorithm. Experimental results on three Web document sets show that the MFI approach outperforms the other methods we compared in most cases, particularly when the Web document sets contain a large number of categories.
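The seeding idea can be illustrated with a minimal sketch of K-means that starts from caller-supplied centers rather than random ones. The MFI step itself is not reproduced here: the `seeds` list is a hypothetical stand-in for the density centers that the MFI approach would produce, and the function name `kmeans` and the toy data are assumptions made for illustration only.

```python
import math

def kmeans(points, init_centers, max_iter=100):
    """Lloyd's K-means starting from caller-supplied initial centers."""
    centers = [list(c) for c in init_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
# Hypothetical seeds standing in for MFI-derived high-density centers.
seeds = [(0.1, 0.1), (5.0, 5.0)]
centers, labels = kmeans(points, seeds)
```

Because the seeds already sit inside the dense regions, the loop stabilizes after a single update here; with poorly placed random seeds the same code may need more iterations or settle in a worse local optimum.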

Relevance:

20.00%

Abstract:

The human immune system provides inspiration for solving a wide range of innovative problems. In this paper, we propose an immune-network-based approach for web document clustering. All the immune cells in the network competitively recognize the antigens (web documents), which are presented to the network one by one. The interaction between immune cells and an antigen leads to an augmentation of the network through the clonal selection and somatic mutation of the stimulated immune cells, while the interaction among immune cells results in a network compression. The structure of the immune network is well maintained by learning and self-regularity. We use a public web document data set to test the effectiveness of our method and compare it with other approaches. The experimental results demonstrate that the most striking advantages of immune-based data clustering are its adaptation to dynamic environments and its ability to find new clusters automatically.

Relevance:

20.00%

Abstract:

We propose a new data-induced metric to perform unsupervised data classification (clustering). Our goal is to automatically recognize clusters of non-convex shape. We present a new version of the fuzzy c-means algorithm, based on the data-induced metric, which is capable of identifying non-convex d-dimensional clusters.
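For orientation, the standard fuzzy c-means membership update is sketched below with the distance function passed in as a parameter. The abstract does not define the data-induced metric, so the Euclidean default is an assumption, as are the function name `fcm_memberships` and the toy inputs; a custom metric would be supplied through the `dist` argument.

```python
import math

def fcm_memberships(points, centers, m=2.0, dist=math.dist):
    """Return u[i][j], the degree to which point i belongs to cluster j.

    Standard fuzzy c-means update with fuzzifier m > 1. `dist` defaults to
    Euclidean distance (an assumption); any other metric can be plugged in.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        d = [dist(p, c) for c in centers]
        if any(x == 0.0 for x in d):
            # The point coincides with one or more centers: share full
            # membership among those centers and give zero to the rest.
            hits = [1.0 if x == 0.0 else 0.0 for x in d]
            row = [h / sum(hits) for h in hits]
        else:
            row = [
                1.0 / sum((d[j] / d[k]) ** exponent for k in range(len(centers)))
                for j in range(len(centers))
            ]
        memberships.append(row)
    return memberships

u = fcm_memberships([(1.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (4.0, 0.0)])
```

With m = 2, the point at distance 1 from the first center and 3 from the second receives memberships 0.9 and 0.1, and each row sums to 1.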

Relevance:

20.00%

Abstract:

For many clustering algorithms, such as K-Means, EM, and CLOPE, there is usually a requirement to set some parameters. Often, these parameters directly or indirectly control the number of clusters, that is, k, to return. In the presence of different data characteristics and analysis contexts, it is often difficult for the user to estimate the number of clusters in the data set. This is especially true in text collections such as Web documents, images, or biological data. In an effort to improve the effectiveness of clustering, we seek the answer to a fundamental question: How can we effectively estimate the number of clusters in a given data set? We propose an efficient method based on spectra analysis of eigenvalues (not eigenvectors) of the data set as the solution to the above. We first present the relationship between a data set and its underlying spectra with theoretical and experimental results. We then show how our method is capable of suggesting a range of k that is well suited to different analysis contexts. Finally, we conclude with further empirical results to show how the answer to this fundamental question enhances the clustering process for large text collections.
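One well-known way to read a cluster count off eigenvalues alone is the eigengap heuristic, sketched below. It is a swapped-in illustration and not necessarily the procedure of the paper: the idealized block similarity matrix, the power-iteration solver, and the names `top_eigenvalues`, `estimate_k`, and `block_similarity` are all assumptions.

```python
import math
import random

def top_eigenvalues(matrix, count, iters=500, seed=0):
    """Largest eigenvalues of a symmetric PSD matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    a = [row[:] for row in matrix]
    n = len(a)
    values = []
    for _ in range(count):
        v = [rng.uniform(0.5, 1.5) for _ in range(n)]
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # the remaining matrix is numerically zero
            v = [x / norm for x in w]
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(vi * wi for vi, wi in zip(v, w))  # Rayleigh quotient
        values.append(lam)
        # Deflation: remove the component just found before the next round.
        for i in range(n):
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j]
    return values

def estimate_k(eigenvalues):
    """Eigengap heuristic: k is where the sorted spectrum drops the most."""
    vals = sorted(eigenvalues, reverse=True)
    gaps = [vals[i] - vals[i + 1] for i in range(len(vals) - 1)]
    return gaps.index(max(gaps)) + 1

def block_similarity(sizes):
    """Idealized similarity matrix: 1 inside each group, 0 across groups."""
    n = sum(sizes)
    s = [[0.0] * n for _ in range(n)]
    start = 0
    for size in sizes:
        for i in range(start, start + size):
            for j in range(start, start + size):
                s[i][j] = 1.0
        start += size
    return s

spectrum = top_eigenvalues(block_similarity([4, 3, 2]), count=6)
k = estimate_k(spectrum)
```

For this idealized matrix the nonzero eigenvalues equal the group sizes (4, 3, 2) and the rest are zero, so the largest gap falls after the third eigenvalue. Real similarity matrices are noisy, which is one reason a method might suggest a range of k rather than a single value.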

Relevance:

20.00%

Abstract:

Clustering of multivariate data is a commonly used technique in ecology, and many approaches to clustering are available. The results from a clustering algorithm are uncertain, but few clustering approaches explicitly acknowledge this uncertainty. One exception is Bayesian mixture modelling, which treats all results probabilistically, and allows comparison of multiple plausible classifications of the same data set. We used this method, implemented in the AutoClass program, to classify catchments (watersheds) in the Murray Darling Basin (MDB), Australia, based on their physiographic characteristics (e.g. slope, rainfall, lithology). The most likely classification found nine classes of catchments. Members of each class were aggregated geographically within the MDB. Rainfall and slope were the two most important variables that defined classes. The second-most likely classification was very similar to the first, but had one fewer class. Increasing the nominal uncertainty of continuous data resulted in a most likely classification with five classes, which were again aggregated geographically. Membership probabilities suggested that a small number of cases could be members of either of two classes. Such cases were located on the edges of groups of catchments that belonged to one class, with a group belonging to the second-most likely class adjacent. A comparison of the Bayesian approach to a distance-based deterministic method showed that the Bayesian mixture model produced solutions that were more spatially cohesive and intuitively appealing. The probabilistic presentation of results from the Bayesian classification allows richer interpretation, including decisions on how to treat cases that are intermediate between two or more classes, and whether to consider more than one classification. The explicit consideration and presentation of uncertainty makes this approach useful for ecological investigations, where both data and expectations are often highly uncertain.
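The notion of membership probabilities can be made concrete with a small sketch that computes posterior class probabilities under a one-dimensional Gaussian mixture. This is not AutoClass, which also searches over the number of classes and handles many variables; the two classes, their parameters, and the function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def membership_probabilities(x, components):
    """Posterior P(class | x) for components given as (weight, mean, std)."""
    joint = [w * gaussian_pdf(x, mu, sd) for w, mu, sd in components]
    total = sum(joint)
    return [j / total for j in joint]

# Two hypothetical classes described by a single variable.
classes = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
core = membership_probabilities(0.0, classes)  # deep inside the first class
edge = membership_probabilities(2.0, classes)  # midway between the two classes
```

A case in the core of a class gets a membership probability close to 1 for that class, while a case midway between two classes is split evenly, which mirrors the intermediate cases on the edges of groups described in the abstract.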

Relevance:

20.00%

Abstract:

Clustering is widely used in bioinformatics to find gene correlation patterns. Although many algorithms have been proposed, these are usually confronted with difficulties in meeting the requirements of both automation and high quality. In this paper, we propose a novel algorithm for clustering genes from their expression profiles. The unique features of the proposed algorithm are twofold: it takes into consideration global, rather than local, gene correlation information in clustering processes; and it incorporates clustering quality measurement into the clustering processes to implement non-parametric, automatic and global optimal gene clustering. The evaluation on simulated and real gene data sets demonstrates the effectiveness of the algorithm.

Relevance:

20.00%

Abstract:

Email overload is a recent problem: people find it increasingly difficult to process the large number of emails they receive daily. The problem is becoming more and more serious and has already affected the normal use of email as a knowledge management tool. It has been recognized that categorizing emails into meaningful groups can greatly reduce the cognitive load of processing them, and is thus an effective way to manage the email overload problem. However, most current approaches still require significant human input when categorizing emails. In this paper we develop an automatic email clustering system, underpinned by a new nonparametric text clustering algorithm. This system does not require any predefined input parameters and can automatically generate meaningful email clusters. Experiments show that our new algorithm outperforms existing text clustering algorithms in both computational time and clustering quality, as measured by different gauges.

Relevance:

20.00%

Abstract:

This paper describes a methodology for identifying moving obstacles by obtaining a reliable, sparse optical flow from image sequences. Given a sequence of images, we can detect two types of on-road vehicles: vehicles traveling in the opposite direction and vehicles traveling in the same direction. For both types, distinct feature points can be detected by the Shi and Tomasi corner detector algorithm. The pyramidal Lucas-Kanade method for optical flow calculation is then used to match the sparse feature set of one frame to the consecutive frame. By applying k-means clustering to four-component feature vectors, consisting of the coordinates of the feature point and the two components of the optical flow, we can calculate the centroids of the clusters and track the objects. The vehicles traveling in the opposite direction produce a diverging vector field, while vehicles traveling in the same direction produce a converging vector field.
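The distinction between diverging and converging flow can be illustrated with a simple score over the feature points of one cluster. This check is hypothetical and is not described in the abstract: the function `radial_flow_score` and the toy points are assumptions, and each point is taken to carry a position (x, y) and a flow vector (u, v), matching the four components mentioned above.

```python
def radial_flow_score(points, flows):
    """Mean of (p - centroid) . flow over the feature points of one cluster.

    A positive score suggests flow pointing away from the cluster centroid
    (diverging); a negative score suggests flow pointing toward it (converging).
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    return sum(
        (p[0] - cx) * f[0] + (p[1] - cy) * f[1] for p, f in zip(points, flows)
    ) / n

# Toy cluster: four feature points on the corners of a square.
corners = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
outward = [(x, y) for x, y in corners]    # flow pointing away from the center
inward = [(-x, -y) for x, y in corners]   # flow pointing toward the center

diverging_score = radial_flow_score(corners, outward)
converging_score = radial_flow_score(corners, inward)
```

In practice the positions and flow vectors would come from a corner detector and an optical flow tracker; they are hard-coded here to keep the sketch self-contained.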

Relevance:

20.00%

Abstract:

For many clustering algorithms, such as k-means, EM, and CLOPE, there is usually a requirement to set some parameters. Often, these parameters directly or indirectly control the number of clusters to return. In the presence of different data characteristics and analysis contexts, it is often difficult for the user to estimate the number of clusters in the data set. This is especially true in text collections such as Web documents, images or biological data. The fundamental question this paper addresses is: "How can we effectively estimate the natural number of clusters in a given text collection?" We propose to use spectral analysis, which analyzes the eigenvalues (not eigenvectors) of the collection, as the solution to the above. We first present the relationship between a text collection and its underlying spectra. We then show how the answer to this question enhances the clustering process. Finally, we conclude with empirical results and related work.

Relevance:

20.00%

Abstract:

Athletes commonly attempt to enhance performance by training in normoxia but sleeping in hypoxia [live high and train low (LHTL)]. However, chronic hypoxia reduces muscle Na+-K+-ATPase content, whereas fatiguing contractions reduce Na+-K+-ATPase activity, which each may impair performance. We examined whether LHTL and intense exercise would decrease muscle Na+-K+-ATPase activity and whether these effects would be additive and sufficient to impair performance or plasma K+ regulation. Thirteen subjects were randomly assigned to two fitness-matched groups, LHTL (n = 6) or control (Con, n = 7). LHTL slept at simulated moderate altitude (3,000 m, inspired O2 fraction = 15.48%) for 23 nights and lived and trained by day under normoxic conditions in Canberra (altitude ~600 m). Con lived, trained, and slept in normoxia. A standardized incremental exercise test was conducted before and after LHTL. A vastus lateralis muscle biopsy was taken at rest and after exercise, before and after LHTL or Con, and analyzed for maximal Na+-K+-ATPase activity [K+-stimulated 3-O-methylfluorescein phosphatase (3-O-MFPase)] and Na+-K+-ATPase content ([3H]ouabain binding sites). 3-O-MFPase activity was decreased by –2.9 ± 2.6% in LHTL (P < 0.05) and was depressed immediately after exercise (P < 0.05) similarly in Con and LHTL (–13.0 ± 3.2 and –11.8 ± 1.5%, respectively). Plasma K+ concentration during exercise was unchanged by LHTL; [3H]ouabain binding was unchanged with LHTL or exercise. Peak oxygen consumption was reduced in LHTL (P < 0.05) but not in Con, whereas exercise work was unchanged in either group. Thus LHTL had a minor effect on, and incremental exercise reduced, Na+-K+-ATPase activity. However, the small LHTL-induced depression of 3-O-MFPase activity was insufficient to adversely affect either K+ regulation or total work performed.

Relevance:

20.00%

Abstract:

This paper presents an algorithm based on the Growing Self Organizing Map (GSOM), called the High Dimensional Growing Self Organizing Map with Randomness (HDGSOMr), that can cluster massive high-dimensional data efficiently. The original GSOM algorithm is altered to address the issues related to massive high-dimensional data. These modifications are presented in detail, with experimental results on a massive real-world dataset.

Relevance:

20.00%

Abstract:

Recently, much attention has been given to mass spectrometry (MS) based disease classification, diagnosis, and protein-based biomarker identification. As in microarray-based investigation, proteomic data generated by this kind of high-throughput experiment often have a high feature-to-sample ratio. Moreover, biological information and patterns are compounded with data noise, redundancy and outliers. Thus, the development of algorithms and procedures for the analysis and interpretation of this kind of data is of paramount importance. In this paper, we propose a hybrid system for analyzing such high-dimensional data. The proposed method uses a feature extraction and selection procedure based on the k-means clustering algorithm to bridge the filter selection and wrapper selection methods. The potentially informative mass/charge (m/z) markers selected by filters are subjected to the k-means clustering algorithm for correlation and redundancy reduction, and a multi-objective Genetic Algorithm selector is then employed to identify discriminative m/z markers among those generated by the k-means clustering algorithm. Experimental results obtained with the proposed method indicate that it is suitable for m/z biomarker selection and MS-based sample classification.

Relevance:

20.00%

Abstract:

The high-throughput experimental data from the new gene microarray technology has spurred numerous efforts to find effective ways of processing microarray data for revealing real biological relationships among genes. This work proposes an innovative data pre-processing approach to identify noise data in the data sets and eliminate or reduce the impact of the noise data on gene clustering. With the proposed algorithm, the pre-processed data sets make the clustering results stable across clustering algorithms with different similarity metrics, the important information of genes and features is kept, and the clustering quality is improved. The primary evaluation on real microarray data sets has shown the effectiveness of the proposed algorithm.