919 results for Model-based Categorical Sequence Clustering
Abstract:
The golden nematode, Globodera rostochiensis, is a plant-parasitic nematode that can infect agricultural crops such as potato, tomato, and eggplant. Because of the considerable yield losses associated with this organism, it is subject to quarantine regulations in several countries, including Canada. Golden nematode cysts protect the eggs they contain, allowing them to survive (in a dormant state) for up to 20 years in the soil. The eggs hatch only in the presence of root exudates from a compatible host plant nearby. Unfortunately, very little is known about the molecular mechanisms underlying this key step in the golden nematode's life cycle. In this work, we used RNA-seq to sequence all the mRNAs of a sample of golden nematode cysts, in order to assemble a de novo (reference-free) transcriptome and to identify genes involved in survival and hatching mechanisms. This approach revealed that the hatching and parasitism processes are closely linked. Several effectors involved in movement toward the host plant and in root penetration are induced as soon as the cyst is hydrated (even before hatching is triggered). Using the golden nematode reference genome, we found that the majority of the transcripts in the transcriptome did not originate from the golden nematode. Indeed, cysts sampled in the field can carry contaminants (bacteria, fungi, etc.) on their wall and even inside the cyst. These contaminants are therefore sequenced and assembled along with the de novo transcriptome. The resulting transcripts inflate the size of the transcriptome and introduce errors in post-assembly analyses. Current decontamination methods use alignments against databases of known organisms to identify sequences originating from contaminants. These methods are effective when the contaminant(s) are known (i.e., have a reference genome), as is the case for human contamination. However, when the contaminant(s) are unknown, these methods are insufficient to produce a high-quality decontaminated transcriptome. We therefore designed a method that uses a hierarchical sequence-clustering algorithm. This method recursively produces subgroups of homogeneous sequences based on the frequent patterns present in the sequences. Once the groups are created, they are labeled as contaminant or not according to the alignment results of the subgroup. Ambiguous sequences with no alignment, or with several conflicting alignments, are thus easily classified according to the label of their group. Our method was effective in decontaminating the golden nematode transcriptome as well as other cases of contamination. The method works for decontaminating a transcriptome, but we also showed that it has the potential to decontaminate short raw reads. Decontaminating raw reads directly would be the optimal decontamination strategy, as it would minimize assembly errors.
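As an illustration of the clustering-based decontamination idea described above, the sketch below (not the thesis' actual pipeline) groups transcripts by k-mer composition, a simple stand-in for the frequent sequence patterns mentioned in the abstract, and then propagates the majority alignment label of each group to its ambiguous members. The sequences, the k-mer length, and the choice of Ward hierarchical clustering are illustrative assumptions.

```python
# Hedged sketch (not the authors' implementation): cluster assembled transcripts by
# k-mer composition, then label whole clusters from whatever alignment hits exist.
from collections import Counter
from itertools import product

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

K = 4  # k-mer length; a stand-in for the "frequent patterns" in the abstract
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def kmer_profile(seq: str) -> np.ndarray:
    """Normalised k-mer frequency vector of one transcript."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    vec = np.zeros(len(KMERS))
    for kmer, c in counts.items():
        if kmer in KMER_INDEX:
            vec[KMER_INDEX[kmer]] = c
    total = vec.sum()
    return vec / total if total > 0 else vec

def cluster_and_label(seqs, hits, n_groups=2):
    """hits[i] is 'target', 'contaminant' or None (no or ambiguous alignment).
    Each sequence inherits the majority alignment label of its cluster."""
    X = np.vstack([kmer_profile(s) for s in seqs])
    groups = fcluster(linkage(X, method="ward"), t=n_groups, criterion="maxclust")
    labels = {}
    for g in np.unique(groups):
        members = np.where(groups == g)[0]
        votes = [hits[i] for i in members if hits[i] is not None]
        majority = max(set(votes), key=votes.count) if votes else "unknown"
        for i in members:
            labels[i] = majority
    return labels

seqs = ["ATGCGTACGTTAGC" * 5, "ATGCGTACGTAAGC" * 5, "GGGCCCGGGCCCTT" * 5, "GGCCCCGGGGCCTT" * 5]
hits = ["target", None, "contaminant", None]  # toy alignment results
print(cluster_and_label(seqs, hits))
```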
Abstract:
In cluster analysis, it can be useful to interpret the partition built from the data in the light of external categorical variables which are not directly involved in clustering the data. An approach is proposed in the model-based clustering context to select a number of clusters that both fits the data well and takes advantage of the potential illustrative ability of the external variables. This approach makes use of the integrated joint likelihood of the data and the partitions at hand, namely the model-based partition and the partitions associated with the external variables. It is noteworthy that each mixture model is fitted by maximum likelihood to the data alone; the external variables are used only to select a relevant mixture model. Numerical experiments illustrate the promising behaviour of the derived criterion. © 2014 Springer-Verlag Berlin Heidelberg.
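The abstract does not give the closed form of the integrated joint likelihood, so the sketch below only illustrates the general idea under stated assumptions: each candidate mixture model is fitted by maximum likelihood on the data alone, and the number of clusters is then scored by a data-fit term (here negative BIC, as a stand-in) plus a term rewarding agreement between the fitted partition and the external categorical variable. The weighting and smoothing constants are arbitrary.

```python
# Illustrative sketch only: the paper's criterion is an integrated joint likelihood of the
# data and the partitions; we approximate the idea with BIC for the data plus the
# log-likelihood of the external categorical variable given the fitted partition.
import numpy as np
from sklearn.mixture import GaussianMixture

def external_loglik(z, u, alpha=0.5):
    """log p(u | z) with a smoothed per-cluster multinomial over the external variable u."""
    ll = 0.0
    for k in np.unique(z):
        counts = np.bincount(u[z == k], minlength=u.max() + 1)
        probs = (counts + alpha) / (counts + alpha).sum()
        ll += np.sum(counts * np.log(probs))
    return ll

def select_k(X, u, k_range=range(1, 7)):
    scores = {}
    for k in k_range:
        gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
        z = gm.predict(X)
        # Higher is better: data fit (negative BIC) plus illustrative external-agreement term.
        scores[k] = -gm.bic(X) + 2 * external_loglik(z, u)
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in (0, 3, 6)])
u = np.repeat([0, 1, 1], 100)          # external categorical variable, never used for fitting
print(select_k(X, u))
```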
Abstract:
Motivation: This paper introduces the software EMMIX-GENE, developed specifically for a model-based approach to the clustering of microarray expression data, in particular of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples: mixtures of t distributions are fitted to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. Imposing a threshold on the likelihood ratio statistic, used in conjunction with a threshold on the size of a cluster, allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so mixtures of factor analyzers are used to effectively reduce the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes can be selected that reveal interesting clusterings of the tissues, either consistent with the external classification of the tissues or with background biological knowledge of these sets.
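A minimal sketch of the gene-screening step described above, assuming Gaussian mixtures in place of the t mixtures fitted by EMMIX-GENE and an arbitrary threshold on the likelihood ratio statistic; it is not the EMMIX-GENE software itself.

```python
# Sketch of the screening step only: rank genes by the one- versus two-component
# likelihood ratio statistic and keep those above a (toy) threshold.
import numpy as np
from sklearn.mixture import GaussianMixture

def lrt_one_vs_two(y):
    """-2 log LR for one vs two mixture components fitted to a single gene's profile."""
    y = y.reshape(-1, 1)
    ll1 = GaussianMixture(1, random_state=0).fit(y).score(y) * len(y)
    ll2 = GaussianMixture(2, n_init=5, random_state=0).fit(y).score(y) * len(y)
    return 2 * (ll2 - ll1)

def select_genes(expr, threshold=8.0):
    """expr: tissues x genes matrix. Keep genes whose LRT statistic exceeds the threshold."""
    stats = np.array([lrt_one_vs_two(expr[:, j]) for j in range(expr.shape[1])])
    order = np.argsort(stats)[::-1]                 # genes ranked by decreasing LRT
    return order[stats[order] > threshold], stats

rng = np.random.default_rng(1)
informative = np.concatenate([rng.normal(0, 1, 20), rng.normal(4, 1, 20)])
noise = rng.normal(0, 1, (40, 50))
expr = np.column_stack([informative, noise])        # gene 0 is bimodal, the rest are noise
kept, stats = select_genes(expr)
print("selected genes:", kept[:5])
```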
Abstract:
In microarray studies, clustering techniques are often applied to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task, and they have mainly been applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. There is thus a need for a model-based approach to these clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes, which is a non-standard problem because the number of genes greatly exceeds the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results of this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.
Abstract:
Research on cluster analysis for categorical data continues to develop, with new clustering algorithms being proposed regularly. However, in this context, the determination of the number of clusters is rarely addressed. We propose a new approach in which clustering and the estimation of the number of clusters are done simultaneously for categorical data. We assume that the data originate from a finite mixture of multinomial distributions and use a minimum message length (MML) criterion to select the number of clusters (Wallace and Bolton, 1986). For this purpose, we implement an EM-type algorithm (Silvestre et al., 2008) based on the approach of Figueiredo and Jain (2002). The novelty of the approach rests on the integration of model estimation and selection of the number of clusters in a single algorithm, rather than selecting this number from a set of pre-estimated candidate models. The performance of our approach is compared with the Bayesian Information Criterion (BIC) (Schwarz, 1978) and the Integrated Completed Likelihood (ICL) (Biernacki et al., 2000) on synthetic data. The results illustrate the capacity of the proposed algorithm to recover the true number of clusters while outperforming BIC and ICL in speed, which is especially relevant when dealing with large data sets.
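For concreteness, the sketch below fits a mixture of independent categorical variables (a finite mixture of multinomials) by plain EM and scores a few candidate numbers of clusters with BIC. This is only a stand-in: the paper's contribution is precisely to fold an MML criterion and the selection of the number of clusters into a single EM-type algorithm, rather than scanning pre-estimated models as done here.

```python
import numpy as np

def lca_em(X, K, n_iter=200, seed=0, eps=1e-12):
    """Plain EM for a mixture of independent categorical variables (latent class model).
    X is an (n, d) array of integer-coded categories."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    levels = [int(X[:, j].max()) + 1 for j in range(d)]
    pi = np.full(K, 1.0 / K)
    theta = [rng.dirichlet(np.ones(L), size=K) for L in levels]   # theta[j][k, c] = P(x_j = c | cluster k)
    loglik = -np.inf
    for _ in range(n_iter):
        log_joint = np.tile(np.log(pi), (n, 1))                   # E-step: log p(x_i, z_i = k)
        for j in range(d):
            log_joint += np.log(theta[j][:, X[:, j]] + eps).T
        m = log_joint.max(axis=1, keepdims=True)
        loglik = float(np.sum(m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))))
        r = np.exp(log_joint - m)
        r /= r.sum(axis=1, keepdims=True)
        Nk = r.sum(axis=0) + eps                                  # M-step
        pi = Nk / n
        for j in range(d):
            for c in range(levels[j]):
                theta[j][:, c] = r[X[:, j] == c].sum(axis=0) / Nk
    return pi, theta, r, loglik

# Toy data with two well-separated latent classes; scan K and score with BIC (lower is better).
rng = np.random.default_rng(2)
X = np.vstack([rng.integers(0, 2, (150, 6)),        # class 1 uses categories {0, 1}
               2 + rng.integers(0, 2, (150, 6))])   # class 2 uses categories {2, 3}
for K in (1, 2, 3):
    pi, theta, r, ll = lca_em(X, K)
    n_params = (K - 1) + K * sum(int(X[:, j].max()) for j in range(X.shape[1]))
    print(K, round(ll, 1), round(-2 * ll + n_params * np.log(len(X)), 1))
```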
Abstract:
Objectives: A recently introduced pragmatic scheme promises to be a useful catalog of interneuron names. We sought to automatically classify digitally reconstructed interneuronal morphologies according to this scheme. Simultaneously, we sought to discover possible subtypes of these types that might emerge during automatic classification (clustering). We also investigated which morphometric properties were most relevant for this classification. Materials and methods: A set of 118 digitally reconstructed interneuronal morphologies was classified into the common basket (CB), horse-tail (HT), large basket (LB), and Martinotti (MA) interneuron types by 42 of the world's leading neuroscientists, and quantified by five simple morphometric properties of the axon and four of the dendrites. We labeled each neuron with the type most commonly assigned to it by the experts. We then removed this class information for each type separately, and applied semi-supervised clustering to those cells (keeping the others' cluster membership fixed), to assess separation from other types and look for the formation of new groups (subtypes). We performed this same experiment unlabeling the cells of two types at a time, and of half the cells of a single type at a time. The clustering model is a finite mixture of Gaussians which we adapted for the estimation of local (per-cluster) feature relevance. We performed the described experiments on three different subsets of the data, formed according to how many experts agreed on type membership: at least 18 experts (the full data set), at least 21 (73 neurons), and at least 26 (47 neurons). Results: Interneurons with more reliable type labels were classified more accurately. We classified HT cells with 100% accuracy, MA cells with 73% accuracy, and CB and LB cells with 56% and 58% accuracy, respectively. We identified three subtypes of the MA type, one subtype each of the CB and LB types, and no subtypes of HT (it was a single, homogeneous type). We obtained maximum (adapted) Silhouette width and ARI values of 1, 0.83, 0.79, and 0.42 when unlabeling the HT, CB, LB, and MA types, respectively, confirming the quality of the formed cluster solutions. The subtypes identified when unlabeling a single type also emerged when unlabeling two types at a time, confirming their validity. Axonal morphometric properties were more relevant than dendritic ones, with the axonal polar histogram length in the [pi, 2pi) angle interval being particularly useful. Conclusions: The applied semi-supervised clustering method can accurately discriminate among CB, HT, LB, and MA interneuron types while discovering potential subtypes, and is therefore useful for neuronal classification. The discovery of potential subtypes suggests that some of these types are more heterogeneous than previously thought. Finally, axonal variables seem to be more relevant than dendritic ones for distinguishing among the CB, HT, LB, and MA interneuron types.
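A hedged sketch of the semi-supervised mixture-of-Gaussians idea: cells whose type label is retained keep hard cluster memberships, while unlabeled cells are assigned by EM. The per-cluster feature-relevance estimation described in the abstract is not reproduced, and the data, dimensionality, and number of clusters below are toy assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_gmm(X, y, K, n_iter=100, reg=1e-3):
    """EM for a Gaussian mixture where y[i] in {0..K-1} fixes a cell's cluster and
    y[i] == -1 marks a cell whose type label was removed (to be assigned by the model)."""
    n, d = X.shape
    r = np.full((n, K), 1.0 / K)
    r[y >= 0] = np.eye(K)[y[y >= 0]]                      # clamp responsibilities of labeled cells
    for _ in range(n_iter):
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        cov = []
        for k in range(K):
            diff = X - mu[k]
            cov.append((r[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d))
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], cov[k]) for k in range(K)])
        unl = y < 0                                       # E-step updates only unlabeled cells
        r[unl] = dens[unl] / dens[unl].sum(axis=1, keepdims=True)
    return r.argmax(axis=1), mu

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 1.0, (60, 5)) for m in (0, 4, 8)])   # stand-in morphometric features
true = np.repeat([0, 1, 2], 60)
y = true.copy()
y[rng.random(len(y)) < 0.5] = -1                                   # "unlabel" roughly half the cells
z, _ = semi_supervised_gmm(X, y, K=3)
print("agreement on unlabeled cells:", np.mean(z[y < 0] == true[y < 0]))
```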
Abstract:
We consider the problem of assessing the number of clusters in a limited number of tissue samples containing gene expressions for possibly several thousands of genes. It is proposed to use a normal mixture model-based approach to the clustering of the tissue samples. One advantage of this approach is that the question of the number of clusters in the data can be formulated in terms of a test on the smallest number of components in the mixture model compatible with the data. This test can be carried out on the basis of the likelihood ratio test statistic, using resampling to assess its null distribution. The effectiveness of this approach is demonstrated on simulated data and on some microarray datasets considered previously in the bioinformatics literature. (C) 2004 Elsevier Inc. All rights reserved.
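The resampling-based likelihood ratio test can be sketched as a parametric bootstrap, assuming Gaussian mixtures and a small number of bootstrap replicates for brevity; the exact resampling scheme and mixture family of the paper may differ.

```python
# Sketch: test g components against g+1 by simulating the null distribution of
# -2 log LR from the fitted g-component model (parametric bootstrap).
import numpy as np
from sklearn.mixture import GaussianMixture

def lrt_stat(X, g):
    """-2 log LR for g versus g+1 normal mixture components, plus the fitted g-component model."""
    m_g = GaussianMixture(g, n_init=5, random_state=0).fit(X)
    m_g1 = GaussianMixture(g + 1, n_init=5, random_state=0).fit(X)
    return 2 * (m_g1.score(X) - m_g.score(X)) * len(X), m_g

def bootstrap_pvalue(X, g, n_boot=50):
    observed, null_model = lrt_stat(X, g)
    null_stats = []
    for _ in range(n_boot):
        Xb, _ = null_model.sample(len(X))            # simulate under H0: g components fit the data
        null_stats.append(lrt_stat(Xb, g)[0])
    return observed, float(np.mean(np.array(null_stats) >= observed))

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(3, 1, (40, 3))])
stat, p = bootstrap_pvalue(X, g=1)
print(f"LRT = {stat:.1f}, bootstrap p = {p:.2f}")    # a small p rejects g=1 in favour of g=2
```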
Abstract:
Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors.
Abstract:
BACKGROUND: Left atrial (LA) dilatation is associated with a large variety of cardiac diseases. Current cardiovascular magnetic resonance (CMR) strategies to measure LA volumes are based on multi-breath-hold multi-slice acquisitions, which are time-consuming and susceptible to misregistration. AIM: To develop a time-efficient single breath-hold 3D CMR acquisition and reconstruction method to precisely measure LA volumes and function. METHODS: A highly accelerated compressed-sensing multi-slice cine sequence (CS-cineCMR) was combined with a non-model-based 3D reconstruction method to measure LA volumes with high temporal and spatial resolution during a single breath-hold. This approach was validated in LA phantoms of different shapes and applied in 3 patients. In addition, the influence of slice orientation on accuracy was evaluated in the LA phantoms for the new approach in comparison with a conventional model-based biplane area-length reconstruction. As a reference in patients, a self-navigated high-resolution whole-heart 3D dataset (3D-HR-CMR) was acquired during mid-diastole to yield accurate LA volumes. RESULTS: Phantom studies: LA volumes were accurately measured by CS-cineCMR with a mean difference of -4.73 ± 1.75 ml (-8.67 ± 3.54%, r2 = 0.94). For the new method the calculated volumes were not significantly different when different orientations of the CS-cineCMR slices were applied to cover the LA phantoms. Long-axis "aligned" vs "not aligned" with the phantom long-axis yielded similar differences vs the reference volume (-4.87 ± 1.73 ml vs. -4.45 ± 1.97 ml, p = 0.67), and short-axis "perpendicular" vs. "not-perpendicular" with the LA long-axis (-4.72 ± 1.66 ml vs. -4.75 ± 2.13 ml; p = 0.98). The conventional biplane area-length method was susceptible to slice orientation (p = 0.0085 for the interaction of "slice orientation" and "reconstruction technique", 2-way ANOVA for repeated measures). To use the 3D-HR-CMR as the reference for LA volumes in patients, it was validated in the LA phantoms (mean difference: -1.37 ± 1.35 ml, -2.38 ± 2.44%, r2 = 0.97). Patient study: The CS-cineCMR LA volumes of the mid-diastolic frame matched closely with the reference LA volume (measured by 3D-HR-CMR) with a difference of -2.66 ± 6.5 ml (3.0% underestimation; true LA volumes: 63 ml, 62 ml, and 395 ml). Finally, high intra- and inter-observer agreement for maximal and minimal LA volume measurement was also shown. CONCLUSIONS: The proposed method combines a highly accelerated single breath-hold compressed-sensing multi-slice CMR technique with a non-model-based 3D reconstruction to accurately and reproducibly measure LA volumes and function.
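For orientation only, the snippet below contrasts the two reconstruction philosophies mentioned above: the model-based biplane area-length formula, V = 8·A1·A2/(3·π·L), and a shape-assumption-free summation of slice areas. The numbers are toy values, not the paper's measurements, and this is not the CS-cineCMR reconstruction code.

```python
# Two textbook ways to turn CMR slice measurements into a chamber volume (toy values only).
import math

def biplane_area_length(a1_cm2, a2_cm2, length_cm):
    """Model-based biplane area-length volume in ml: V = 8*A1*A2 / (3*pi*L)."""
    return 8 * a1_cm2 * a2_cm2 / (3 * math.pi * length_cm)

def disc_summation(slice_areas_cm2, slice_thickness_cm):
    """Shape-assumption-free volume in ml: sum of slice area times slice thickness."""
    return sum(a * slice_thickness_cm for a in slice_areas_cm2)

print(biplane_area_length(20.0, 18.0, 5.0))                 # ~61 ml for a mid-sized LA (toy numbers)
print(disc_summation([4, 10, 15, 17, 15, 10, 4], 0.8))      # stacked short-axis slices (toy numbers)
```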
Abstract:
PURPOSE: According to estimates, around 230 people die as a result of radon exposure in Switzerland. This public health concern makes reliable indoor radon prediction and mapping methods necessary in order to improve risk communication to the public. The aim of this study was to develop an automated method to classify lithological units according to their radon characteristics and to develop mapping and predictive tools in order to improve local radon prediction. METHOD: About 240 000 indoor radon concentration (IRC) measurements in about 150 000 buildings were available for our analysis. The automated classification of lithological units was based on k-medoids clustering via pairwise Kolmogorov distances between the IRC distributions of lithological units. For IRC mapping and prediction we used random forests and Bayesian additive regression trees (BART). RESULTS: The automated classification groups lithological units well in terms of their IRC characteristics. In particular, the IRC differences in metamorphic rocks such as gneiss are well revealed by this method. The maps produced by random forests soundly represent the regional differences of IRCs in Switzerland and improve the spatial detail compared to existing approaches. We could explain 33% of the variation in IRC data with random forests. Additionally, variable importance as evaluated by random forests shows that building characteristics are less important predictors for IRCs than spatial/geological influences. BART could explain 29% of IRC variability and produced maps that indicate the prediction uncertainty. CONCLUSION: Ensemble regression trees are a powerful tool to model and understand the multidimensional influences on IRCs. Automatic clustering of lithological units complements this method by facilitating the interpretation of the radon properties of rock types. This study provides an important element for radon risk communication. Future approaches should consider further variables such as soil gas radon measurements, as well as more detailed geological information.
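A hedged sketch of the classification step only: pairwise two-sample Kolmogorov-Smirnov distances between the IRC samples of lithological units, clustered with a small k-medoids routine. The lognormal toy samples, the number of clusters, and the naive medoid update are assumptions; the random forest and BART mapping steps are omitted.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_distance_matrix(samples):
    """Pairwise Kolmogorov-Smirnov distances between the IRC samples of lithological units."""
    n = len(samples)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = ks_2samp(samples[i], samples[j]).statistic
    return D

def k_medoids(D, k, n_iter=100, seed=0):
    """Tiny alternating k-medoids on a precomputed distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)            # assign each unit to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]  # most central member becomes medoid
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return labels, medoids

rng = np.random.default_rng(5)
units = ([rng.lognormal(4.0, 0.5, 300) for _ in range(3)] +      # toy "low radon" lithologies
         [rng.lognormal(5.5, 0.7, 300) for _ in range(3)])       # toy "high radon" lithologies
D = ks_distance_matrix(units)
labels, medoids = k_medoids(D, k=2)
print(labels)
```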
Abstract:
Building robust recognition systems requires a careful understanding of the effects of error in sensed features. Error in these image features results in a region of uncertainty in the possible image location of each additional model feature. We present an accurate, analytic approximation for this uncertainty region when model poses are based on matching three image and model points, for both Gaussian and bounded error in the detection of image points, and for both scaled-orthographic and perspective projection models. This result applies to objects that are fully three-dimensional, where past results considered only two-dimensional objects. Further, we introduce a linear programming algorithm to compute the uncertainty region when poses are based on any number of initial matches. Finally, we use these results to extend, from two-dimensional to three-dimensional objects, robust implementations of alignment, interpretation-tree search, and transformation clustering.
Abstract:
We have developed a model of the local field potential (LFP) based on the conservation of charge, the independence principle of ionic flows, and the classical Hodgkin–Huxley (HH) type intracellular model of synaptic activity. Simulations of the HH intracellular model provided insights into the nonlinear relationship between the balance of synaptic conductances and that of post-synaptic currents: the latter depends not only on the former, but also on the temporal lag between the excitatory and inhibitory conductances, as well as the strength of the afferent signal. The proposed LFP model provides a method for decomposing LFP recordings near the soma of layer IV pyramidal neurons in the barrel cortex of anaesthetised rats into two highly correlated components with opposite polarity. The temporal dynamics and the proportional balance of the two components are comparable to the excitatory and inhibitory post-synaptic currents computed from the HH model. This suggests that the two components of the LFP reflect the underlying excitatory and inhibitory post-synaptic currents of the local neural population. We further used the model to decompose a sequence of evoked LFP responses under repetitive electrical stimulation (5 Hz) of the whisker pad. We found that as neural responses adapted, the excitatory and inhibitory components also adapted proportionately, while the temporal lag between the onsets of the two components increased during frequency adaptation. Our results demonstrate that the balance between neural excitation and inhibition can be investigated using extracellular recordings. Extension of the model to incorporate multiple compartments should allow more quantitative interpretation of surface electroencephalography (EEG) recordings into components reflecting the excitatory, inhibitory and passive ionic current flows generated by local neural populations.
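The decomposition rests on the HH-type synaptic current relation I = g(t)(V - E); the toy script below, with arbitrary conductance time courses, reversal potentials, and lag, only illustrates how an excitatory (negative) and an inhibitory (positive) component of opposite polarity sum into an LFP-like signal. It is not the paper's fitted model.

```python
# Toy illustration only: excitatory and inhibitory post-synaptic currents of opposite polarity.
import numpy as np

t = np.arange(0, 0.1, 1e-4)                      # 100 ms at 0.1 ms resolution
alpha = lambda t0, tau: np.where(t > t0, (t - t0) / tau * np.exp(1 - (t - t0) / tau), 0.0)

V, E_e, E_i = -55e-3, 0.0, -75e-3                # holding and reversal potentials (V), assumed values
g_e = 10e-9 * alpha(0.010, 0.003)                # excitatory conductance, onset at 10 ms
g_i = 15e-9 * alpha(0.013, 0.006)                # inhibitory conductance, lagged by 3 ms

I_e = g_e * (V - E_e)                            # inward (negative) excitatory current
I_i = g_i * (V - E_i)                            # outward (positive) inhibitory current
lfp_proxy = I_e + I_i                            # the two opposite-polarity components sum up

print(f"peak I_e = {I_e.min() * 1e12:.1f} pA, peak I_i = {I_i.max() * 1e12:.1f} pA")
print(f"LFP proxy range: {lfp_proxy.min() * 1e12:.1f} to {lfp_proxy.max() * 1e12:.1f} pA")
```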
Abstract:
The behavior of composed Web services depends on the results of the invoked services; unexpected behavior of one of the invoked services can threaten the correct execution of an entire composition. This paper proposes an event-based approach to black-box testing of Web service compositions based on event sequence graphs, which are extended by facilities to deal not only with service behavior under regular circumstances (i.e., where cooperating services are working as expected) but also with their behavior in undesirable situations (i.e., where cooperating services are not working as expected). Furthermore, the approach can be used independently of artifacts (e.g., Business Process Execution Language) or type of composition (orchestration/choreography). A large case study, based on a commercial Web application, demonstrates the feasibility of the approach and analyzes its characteristics. Test generation and execution are supported by dedicated tools. In particular, the use of an enterprise service bus for test execution is noteworthy and differs from other approaches. The results of the case study suggest that the new approach has the power to detect faults systematically, performing properly even with complex and large compositions. Copyright © 2012 John Wiley & Sons, Ltd.
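To make the event-sequence-graph idea concrete, the sketch below encodes a small hypothetical composition (all event and service names are invented) as a directed graph and greedily generates test sequences that cover every edge, including edges marked as undesirable situations. The published approach is considerably richer; this is only an illustration.

```python
# Hypothetical event sequence graph: event -> list of (next_event, undesirable_situation?)
ESG = {
    "[": [("searchFlights", False)],
    "searchFlights": [("selectFlight", False), ("serviceTimeout", True)],
    "selectFlight": [("payOrder", False)],
    "payOrder": [("confirm", False), ("paymentRejected", True)],
    "serviceTimeout": [("]", True)],
    "paymentRejected": [("]", True)],
    "confirm": [("]", False)],
}

def edge_covering_sequences(esg, start="[", end="]"):
    """Greedily walk from start to end until every edge has been used at least once."""
    uncovered = {(u, v) for u, nbrs in esg.items() for v, _ in nbrs}
    sequences = []
    while uncovered:
        node, seq = start, [start]
        while node != end:
            nbrs = esg[node]
            nxt = next((v for v, _ in nbrs if (node, v) in uncovered), nbrs[0][0])
            uncovered.discard((node, nxt))
            seq.append(nxt)
            node = nxt
        sequences.append(seq)
    return sequences

for s in edge_covering_sequences(ESG):
    print(" -> ".join(s))
```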