63 resultados para Unsupervised clustering
Resumo:
Data mining is the process to identify valid, implicit, previously unknown, potentially useful and understandable information from large databases. It is an important step in the process of knowledge discovery in databases, (Olaru & Wehenkel, 1999). In a data mining process, input data can be structured, seme-structured, or unstructured. Data can be in text, categorical or numerical values. One of the important characteristics of data mining is its ability to deal data with large volume, distributed, time variant, noisy, and high dimensionality. A large number of data mining algorithms have been developed for different applications. For example, association rules mining can be useful for market basket problems, clustering algorithms can be used to discover trends in unsupervised learning problems, classification algorithms can be applied in decision-making problems, and sequential and time series mining algorithms can be used in predicting events, fault detection, and other supervised learning problems (Vapnik, 1999). Classification is among the most important tasks in the data mining, particularly for data mining applications into engineering fields. Together with regression, classification is mainly for predictive modelling. So far, there have been a number of classification algorithms in practice. According to (Sebastiani, 2002), the main classification algorithms can be categorized as: decision tree and rule based approach such as C4.5 (Quinlan, 1996); probability methods such as Bayesian classifier (Lewis, 1998); on-line methods such as Winnow (Littlestone, 1988) and CVFDT (Hulten 2001), neural networks methods (Rumelhart, Hinton & Wiliams, 1986); example-based methods such as k-nearest neighbors (Duda & Hart, 1973), and SVM (Cortes & Vapnik, 1995). Other important techniques for classification tasks include Associative Classification (Liu et al, 1998) and Ensemble Classification (Tumer, 1996).
Resumo:
Recent efforts in the characterization of air-water flows properties have included some clustering process analysis. A cluster of bubbles is defined as a group of two or more bubbles, with a distinct separation from other bubbles before and after the cluster. The present paper compares the results of clustering processes two hydraulic structures. That is, a large-size dropshaft and a hydraulic jump in a rectangular horizontal channel. The comparison highlighted some significant differences in clustering production and structures. Both dropshaft and hydraulic jump flows are complex turbulent shear flows, and some clustering index may provide some measure of the bubble-turbulence interactions and associated energy dissipation.
Resumo:
The high-affinity receptors for human granulocyte-macrophage colony-stimulating factor (GM-CSF), interleukin-1 (IL-3), and IL-5 are heterodimeric complexes consisting of cytokine-specific alpha subunits and a common signal-transducing beta subunit (h beta c). We have previously demonstrated the oncogenic potential of this group of receptors by identifying constitutively activating point mutations in the extracellular and transmembrane domains of h beta c. We report here a comprehensive screen of the entire h beta c molecule that has led to the identification of additional constitutive point mutations by virtue of their ability to confer factor independence on murine FDC-P1 cells. These mutations were clustered exclusively in a central region of h beta c that encompasses the extracellular membrane-proximal domain, transmembrane domain, and membrane-proximal region of the cytoplasmic domain. Interestingly, most h beta c mutants exhibited cell type-specific constitutive activity, with only two transmembrane domain mutants able to confer factor independence on both murine FDC-P1 and BAF-B03 cells. Examination of the biochemical properties of these mutants in FDC-P1 cells indicated that MAP kinase (ERK1/2), STAT, and JAK2 signaling molecules were constitutively activated. In contrast, only some of the mutant beta subunits were constitutively tyrosine phosphorylated. Taken together; these results highlight key regions involved in h beta c activation, dissociate h beta c tyrosine phosphorylation from MAP kinase and STAT activation, and suggest the involvement of distinct mechanisms by which proliferative signals can be generated by h beta c. (C) 1998 by The American Society of Hematology.
Resumo:
The task of segmenting cell nuclei from cytoplasm in conventional Papanicolaou (Pap) stained cervical cell images is a classical image analysis problem which may prove to be crucial to the development of successful systems which automate the analysis of Pap smears for detection of cancer of the cervix. Although simple thresholding techniques will extract the nucleus in some cases, accurate unsupervised segmentation of very large image databases is elusive. Conventional active contour models as introduced by Kass, Witkin and Terzopoulos (1988) offer a number of advantages in this application, but suffer from the well-known drawbacks of initialisation and minimisation. Here we show that a Viterbi search-based dual active contour algorithm is able to overcome many of these problems and achieve over 99% accurate segmentation on a database of 20 130 Pap stained cell images. (C) 1998 Elsevier Science B.V. All rights reserved.
Resumo:
Geospatial clustering must be designed in such a way that it takes into account the special features of geoinformation and the peculiar nature of geographical environments in order to successfully derive geospatially interesting global concentrations and localized excesses. This paper examines families of geospaital clustering recently proposed in the data mining community and identifies several features and issues especially important to geospatial clustering in data-rich environments.
Resumo:
In this paper a methodology for integrated multivariate monitoring and control of biological wastewater treatment plants during extreme events is presented. To monitor the process, on-line dynamic principal component analysis (PCA) is performed on the process data to extract the principal components that represent the underlying mechanisms of the process. Fuzzy c-means (FCM) clustering is used to classify the operational state. Performing clustering on scores from PCA solves computational problems as well as increases robustness due to noise attenuation. The class-membership information from FCM is used to derive adequate control set points for the local control loops. The methodology is illustrated by a simulation study of a biological wastewater treatment plant, on which disturbances of various types are imposed. The results show that the methodology can be used to determine and co-ordinate control actions in order to shift the control objective and improve the effluent quality.
Resumo:
Motivation: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.
Resumo:
In microarray studies, the application of clustering techniques is often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these. clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based -approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes which is a non-standard problem because the number of genes greatly exceed the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.
Resumo:
We consider the problem of assessing the number of clusters in a limited number of tissue samples containing gene expressions for possibly several thousands of genes. It is proposed to use a normal mixture model-based approach to the clustering of the tissue samples. One advantage of this approach is that the question on the number of clusters in the data can be formulated in terms of a test on the smallest number of components in the mixture model compatible with the data. This test can be carried out on the basis of the likelihood ratio test statistic, using resampling to assess its null distribution. The effectiveness of this approach is demonstrated on simulated data and on some microarray datasets, as considered previously in the bioinformatics literature. (C) 2004 Elsevier Inc. All rights reserved.