Biblioteca Digital

948 resultados para Clustering Analysis

The full Bayesian significance test for mixture models: results in gene expression clustering

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Gene clustering is a useful exploratory technique to group together genes with similar expression levels under distinct cell cycle phases or distinct conditions. It helps the biologist to identify potentially meaningful relationships between genes. In this study, we propose a clustering method based on multivariate normal mixture models, where the number of clusters is predicted via sequential hypothesis tests: at each step, the method considers a mixture model of m components (m = 2 in the first step) and tests if in fact it should be m - 1. If the hypothesis is rejected, m is increased and a new test is carried out. The method continues (increasing m) until the hypothesis is accepted. The theoretical core of the method is the full Bayesian significance test, an intuitive Bayesian approach, which needs no model complexity penalization nor positive probabilities for sharp hypotheses. Numerical experiments were based on a cDNA microarray dataset consisting of expression levels of 205 genes belonging to four functional categories, for 10 distinct strains of Saccharomyces cerevisiae. To analyze the method's sensitivity to data dimension, we performed principal components analysis on the original dataset and predicted the number of classes using 2 to 10 principal components. Compared to Mclust (model-based clustering), our method shows more consistent results.

Multidimensional cluster stability analysis from a Brazilian Bradyrhizobium sp RFLP/PCR data set

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The taxonomy of the N(2)-fixing bacteria belonging to the genus Bradyrhizobium is still poorly refined, mainly due to conflicting results obtained by the analysis of the phenotypic and genotypic properties. This paper presents an application of a method aiming at the identification of possible new clusters within a Brazilian collection of 119 Bradryrhizobium strains showing phenotypic characteristics of B. japonicum and B. elkanii. The stability was studied as a function of the number of restriction enzymes used in the RFLP-PCR analysis of three ribosomal regions with three restriction enzymes per region. The method proposed here uses Clustering algorithms with distances calculated by average-linkage clustering. Introducing perturbations using sub-sampling techniques makes the stability analysis. The method showed efficacy in the grouping of the species B. japonicum and B. elkanii. Furthermore, two new clusters were clearly defined, indicating possible new species, and sub-clusters within each detected cluster. (C) 2008 Elsevier B.V. All rights reserved.

Equational Reasoning as a Tool for Data Analysis

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A combination of deductive reasoning, clustering, and inductive learning is given as an example of a hybrid system for exploratory data analysis. Visualization is replaced by a dialogue with the data.

Sequence analysis of rabbit haemorrhagic disease virus (RHDV) in Australia: alterations after its release

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Liver samples from rabbits killed by RHDV, collected from five States in Australia in 1996 and 1997 were analysed by RT-PCR. A 398 bp fragment of the capsid protein (VP60) gene was amplified by PCR and directly sequenced. The alignment of the nucleotide and amino acid sequences and their comparison with the original strain of the virus released in Australia indicated genetic changes after two years have been small with 98.2% to 100% identity. The constructed phylogenetic tree suggests slight differences in nucleotide substitutions in various States but there is no clear evidence of clustering of sequences according to their geographic origin. In practical terms, sequencing of viral RNA provides a means of testing the efficacy of further releases and subsequent spread of the virus if such a strategy is employed as a means of enhancing RHD as a biological control of the wild rabbit in Australia.

Classification System of Raman Spectra using Cluster Analysis to Diagnose Coronary Artery Lesions

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The traditional methods employed to detect atherosclerotic lesions allow for the identification of lesions; however, they do not provide specific characterization of the lesion`s biochemistry. Currently, Raman spectroscopy techniques are widely used as a characterization method for unknown substances, which makes this technique very important for detecting atherosclerotic lesions. The spectral interpretation is based on the analysis of frequency peaks present in the signal; however, spectra obtained from the same substance can show peaks slightly different and these differences make difficult the creation of an automatic method for spectral signal analysis. This paper presents a signal analysis method based on a clustering technique that allows for the classification of spectra as well as the inference of a diagnosis about the arterial wall condition. The objective is to develop a computational tool that is able to create clusters of spectra according to the arterial wall state and, after data collection, to allow for the classification of a specific spectrum into its correct cluster.

A mixture model-based approach to the clustering of microarray expression data

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Motivation: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.

Model-based clustering in gene expression microarrays: An application to breast cancer data

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In microarray studies, the application of clustering techniques is often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these. clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based -approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes which is a non-standard problem because the number of genes greatly exceed the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.

Document retrieval for e-mail search and discovery using Formal Concept Analysis

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper discusses a document discovery tool based on Conceptual Clustering by Formal Concept Analysis. The program allows users to navigate e-mail using a visual lattice metaphor rather than a tree. It implements a virtual. le structure over e-mail where files and entire directories can appear in multiple positions. The content and shape of the lattice formed by the conceptual ontology can assist in e-mail discovery. The system described provides more flexibility in retrieving stored e-mails than what is normally available in e-mail clients. The paper discusses how conceptual ontologies can leverage traditional document retrieval systems and aid knowledge discovery in document collections.

Residents’ perceptions of tourism impacts in Guimarães (Portugal): a cluster analysis

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this investigation, a cluster analysis was used to separate Guimara˜es (Portugal) residents into clusters according to their perceptions of the impacts of tourism development. This approach is uncommonly applied to Portugal data and is even rarer for world heritage sites. The world heritage designation is believed to make an area more attractive to tourists. The clustering procedure analysed 400 data observations from a Guimara˜es resident survey and revealed the existence of three clusters: the Sceptics, the Moderately Optimistic and the Enthusiasts. The results were consistent with the empirical literature’s results, with the emergent nature of the destination found to be relevant. The fact that tourism is relatively recent in this destination has its major reflex in the devaluation by most of the residents of the negative impacts of tourism development.

Automated image analysis of lung branching morphogenesis from microscopic images of fetal rat explants

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Background: Regulating mechanisms of branching morphogenesis of fetal lung rat explants have been an essential tool for molecular research. This work presents a new methodology to accurately quantify the epithelial, outer contour and peripheral airway buds of lung explants during cellular development from microscopic images. Methods: The outer contour was defined using an adaptive and multi-scale threshold algorithm whose level was automatically calculated based on an entropy maximization criterion. The inner lung epithelial was defined by a clustering procedure that groups small image regions according to the minimum description length principle and local statistical properties. Finally, the number of peripheral buds were counted as the skeleton branched ends from a skeletonized image of the lung inner epithelial. Results: The time for lung branching morphometric analysis was reduced in 98% in contrast to the manual method. Best results were obtained in the first two days of cellular development, with lesser standard deviations. Non-significant differences were found between the automatic and manual results in all culture days. Conclusions: The proposed method introduces a series of advantages related to its intuitive use and accuracy, making the technique suitable to images with different lightning characteristics and allowing a reliable comparison between different researchers.

Automated image analysis of lung branching morphogenesis

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Regulating mechanisms of branchingmorphogenesis of fetal lung rat explants have been an essential tool formolecular research.This work presents a new methodology to accurately quantify the epithelial, outer contour, and peripheral airway buds of lung explants during cellular development frommicroscopic images. Methods.Theouter contour was defined using an adaptive and multiscale threshold algorithm whose level was automatically calculated based on an entropy maximization criterion. The inner lung epithelium was defined by a clustering procedure that groups small image regions according to the minimum description length principle and local statistical properties. Finally, the number of peripheral buds was counted as the skeleton branched ends from a skeletonized image of the lung inner epithelia. Results. The time for lung branching morphometric analysis was reduced in 98% in contrast to themanualmethod. Best results were obtained in the first two days of cellular development, with lesser standard deviations. Nonsignificant differences were found between the automatic and manual results in all culture days. Conclusions. The proposed method introduces a series of advantages related to its intuitive use and accuracy, making the technique suitable to images with different lighting characteristics and allowing a reliable comparison between different researchers.

Incidence rate and spatio-temporal clustering of type 1 diabetes in Santiago, Chile, from 1997 to 1998

Relevância:

30.00% 30.00%

Publicador:

Resumo:

OBJECTIVE: To estimate the incidence rate of type 1 diabetes in the urban area of Santiago, Chile, from March 21, 1997 to March 20, 1998, and to assess the spatio-temporal clustering of cases during that period. METHODS: All sixty-one incident cases were located temporally (day of diagnosis) and spatially (place of residence) in the area of study. Knox's method was used to assess spatio-temporal clustering of incident cases. RESULTS: The overall incidence rate of type 1 diabetes was 4.11 cases per 100,000 children aged less than 15 years per year (95% confidence interval: 3.06--5.14). The incidence rate seems to have increased since the last estimate of the incidence calculated for the years 1986--1992 in the metropolitan region of Santiago. Different combinations of space-time intervals have been evaluated to assess spatio-temporal clustering. The smallest p-value was found for the combination of critical distances of 750 meters and 60 days (uncorrected p-value = 0.048). CONCLUSIONS: Although these are preliminary results regarding space-time clustering in Santiago, exploratory analysis of the data method would suggest a possible aggregation of incident cases in space-time coordinates.

Typical load profiles in the smart grid context – a clustering methods comparison

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The present research paper presents five different clustering methods to identify typical load profiles of medium voltage (MV) electricity consumers. These methods are intended to be used in a smart grid environment to extract useful knowledge about customer’s behaviour. The obtained knowledge can be used to support a decision tool, not only for utilities but also for consumers. Load profiles can be used by the utilities to identify the aspects that cause system load peaks and enable the development of specific contracts with their customers. The framework presented throughout the paper consists in several steps, namely the pre-processing data phase, clustering algorithms application and the evaluation of the quality of the partition, which is supported by cluster validity indices. The process ends with the analysis of the discovered knowledge. To validate the proposed framework, a case study with a real database of 208 MV consumers is used.

Zonal prices analysis supported by a data mining based methodology

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A methodology based on data mining techniques to support the analysis of zonal prices in real transmission networks is proposed in this paper. The mentioned methodology uses clustering algorithms to group the buses in typical classes that include a set of buses with similar LMP values. Two different clustering algorithms have been used to determine the LMP clusters: the two-step and K-means algorithms. In order to evaluate the quality of the partition as well as the best performance algorithm adequacy measurements indices are used. The paper includes a case study using a Locational Marginal Prices (LMP) data base from the California ISO (CAISO) in order to identify zonal prices.

Multi-dimensional scaling applied to histogram-based DNA analysis

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper aims to study the relationships between chromosomal DNA sequences of twenty species. We propose a methodology combining DNA-based word frequency histograms, correlation methods, and an MDS technique to visualize structural information underlying chromosomes (CRs) and species. Four statistical measures are tested (Minkowski, Cosine, Pearson product-moment, and Kendall τ rank correlations) to analyze the information content of 421 nuclear CRs from twenty species. The proposed methodology is built on mathematical tools and allows the analysis and visualization of very large amounts of stream data, like DNA sequences, with almost no assumptions other than the predefined DNA “word length.” This methodology is able to produce comprehensible three-dimensional visualizations of CR clustering and related spatial and structural patterns. The results of the four test correlation scenarios show that the high-level information clusterings produced by the MDS tool are qualitatively similar, with small variations due to each correlation method characteristics, and that the clusterings are a consequence of the input data and not method’s artifacts.

«
1
2
...
7
8
9
10
11
12
13
...
63
64
»