80 results for Labeling hierarchical clustering
at Université de Lausanne, Switzerland
Abstract:
MOTIVATION: Analysis of millions of pyro-sequences is currently playing a crucial role in the advance of environmental microbiology. Taxonomy-independent, i.e. unsupervised, clustering of these sequences is essential for the definition of Operational Taxonomic Units. For this application, reproducibility and robustness should be the most sought after qualities, but have thus far largely been overlooked. RESULTS: More than 1 million hyper-variable internal transcribed spacer 1 (ITS1) sequences of fungal origin have been analyzed. The ITS1 sequences were first properly extracted from 454 reads using generalized profiles. Then, otupipe, cd-hit-454, ESPRIT-Tree and DBC454, a new algorithm presented here, were used to analyze the sequences. A numerical assay was developed to measure the reproducibility and robustness of these algorithms. DBC454 was the most robust, closely followed by ESPRIT-Tree. DBC454 features density-based hierarchical clustering, which complements the other methods by providing insights into the structure of the data. AVAILABILITY: An executable is freely available for non-commercial users at ftp://ftp.vital-it.ch/tools/dbc454. It is designed to run under MPI on a cluster of 64-bit Linux machines running Red Hat 4.x, or on a multi-core OSX system. CONTACT: dbc454@vital-it.ch or nicolas.guex@isb-sib.ch.
Abstract:
Although Leontopodium alpinum is considered to be threatened in many countries, only limited scientific information about its autecology is available. In this study, we aim to define the most important ecological factors which influence the distribution of L. alpinum in the Swiss Alps. These were assessed at the national scale using species distribution models based on topoclimatic predictors and at the community scale using exhaustive plant inventories. The latter were analysed using hierarchical clustering and principal component analysis, and the results were interpreted using ecological indicator values. L. alpinum was found almost exclusively on base-rich bedrocks (limestone and ultramafic rocks). The species distribution models showed that the available moisture (dry regions, mostly in the Inner Alps), elevation (mostly above 2000 m a.s.l.) and slope (mostly >30°) were the most important predictors. The relevés showed that L. alpinum is present in a wide range of plant communities, all subalpine-alpine open grasslands with a low grass cover. As a light-demanding and short species, L. alpinum requires light at ground level; hence, it can only grow in open, nutrient-poor grasslands. These requirements are met under dry conditions (dry, summer-warm climate, rocky and draining soil, south-facing aspect and/or steep slope), at high elevations, on oligotrophic soils and/or on windy ridges. Base-rich soils also appear to be essential, although it is still unclear whether this corresponds to physiological or ecological (lower competition) requirements.
Abstract:
Previous microarray studies on breast cancer identified multiple tumour classes, of which the most prominent, named luminal and basal, differ in expression of the oestrogen receptor alpha gene (ER). We report here the identification of a group of breast tumours with increased androgen signalling and a 'molecular apocrine' gene expression profile. Tumour samples from 49 patients with large operable or locally advanced breast cancers were tested on Affymetrix U133A gene expression microarrays. Principal components analysis and hierarchical clustering split the tumours into three groups: basal, luminal and a group we call molecular apocrine. All of the molecular apocrine tumours have strong apocrine features on histological examination (P=0.0002). The molecular apocrine group is androgen receptor (AR) positive and contains all of the ER-negative tumours outside the basal group. Kolmogorov-Smirnov testing indicates that oestrogen signalling is most active in the luminal group, and androgen signalling is most active in the molecular apocrine group. ERBB2 amplification is more common in the molecular apocrine group than in the other groups. Genes that best split the three groups were identified by Wilcoxon test. Correlation of the average expression profile of these genes in our data with the expression profile of individual tumours in four published breast cancer studies suggests that molecular apocrine tumours represent 8-14% of tumours in these studies. Our data show that it is possible with microarray data to divide mammary tumour cells into three groups based on steroid receptor activity: luminal (ER+ AR+), basal (ER- AR-) and molecular apocrine (ER- AR+).
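As a rough illustration of how hierarchical clustering can split expression profiles into three groups, the sketch below runs average-linkage agglomerative clustering with a correlation distance (1 minus the Pearson coefficient). The choice of distance and linkage, the function names and the six toy profiles are all assumptions made for this example; the abstract does not state which settings the authors used, and the numbers are not data from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    names = list(profiles)
    # Correlation distance between every ordered pair of samples.
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical expression values (one list per sample, six genes each).
toy_profiles = {
    "s1": [5, 4, 1, 1, 1, 1], "s2": [6, 5, 1, 2, 1, 1],
    "s3": [1, 1, 5, 4, 1, 1], "s4": [1, 2, 6, 5, 1, 1],
    "s5": [1, 1, 1, 1, 5, 4], "s6": [2, 1, 1, 1, 6, 5],
}
groups = average_linkage(toy_profiles, n_clusters=3)
print(groups)
```

Stopping at a fixed number of clusters is the simplest way to cut the tree; in practice the cut is often chosen by inspecting the dendrogram.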
Abstract:
Microarray gene expression profiles of fresh clinical samples of chronic myeloid leukaemia in chronic phase, acute promyelocytic leukaemia and acute monocytic leukaemia were compared with profiles from cell lines representing the corresponding types of leukaemia (K562, NB4, HL60). In a hierarchical clustering analysis, all clinical samples clustered separately from the cell lines, regardless of leukaemic subtype. Gene ontology analysis showed that cell lines chiefly overexpressed genes related to macromolecular metabolism, whereas in clinical samples genes related to the immune response were abundantly expressed. These findings must be taken into consideration when conclusions from cell line-based studies are extrapolated to patients.
Abstract:
The ability to obtain gene expression profiles from human disease specimens provides an opportunity to identify relevant gene pathways, but is limited by the absence of data sets spanning a broad range of conditions. Here, we analyzed publicly available microarray data from 16 diverse skin conditions in order to gain insight into disease pathogenesis. Unsupervised hierarchical clustering separated samples by disease as well as common cellular and molecular pathways. Disease-specific signatures were leveraged to build a multi-disease classifier, which predicted the diagnosis of publicly and prospectively collected expression profiles with 93% accuracy. In one sample, the molecular classifier differed from the initial clinical diagnosis and correctly predicted the eventual diagnosis as the clinical presentation evolved. Finally, integration of IFN-regulated gene programs with the skin database revealed a significant inverse correlation between IFN-β and IFN-γ programs across all conditions. Our study provides an integrative approach to the study of gene signatures from multiple skin conditions, elucidating mechanisms of disease pathogenesis. In addition, these studies provide a framework for developing tools for personalized medicine toward the precise prediction, prevention, and treatment of disease on an individual level.
Abstract:
In this paper, we consider active sampling to label pixels grouped with hierarchical clustering. The objective of the method is to match the data relationships discovered by the clustering algorithm with the user's desired class semantics. The former are represented as a complete tree to be pruned, and the latter are iteratively provided by the user. The proposed active learning algorithm searches for the pruning of the tree that best matches the labels of the sampled points. By choosing the part of the tree to sample from according to the current pruning's uncertainty, sampling is focused on the most uncertain clusters. This way, large clusters for which the class membership is already fixed are no longer queried, and sampling is focused on dividing clusters that show mixed labels. The model is tested on a VHR image in a multiclass classification setting. The method clearly outperforms random sampling in a transductive setting, but cannot generalize to unseen data, since it aims at optimizing the classification of a given cluster structure.
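The idea of sampling from the most uncertain part of a pruned cluster tree can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' algorithm: the tree builder on 1-D values, the smoothed-purity uncertainty score, the rule of querying the middle unqueried sample (a deterministic stand-in for random sampling) and all names are hypothetical.

```python
from collections import Counter


class Node:
    """A node of the cluster tree; `points` are the sample indices below it."""

    def __init__(self, points, left=None, right=None):
        self.points = points
        self.left = left
        self.right = right
        self.labels = {}  # sample index -> label obtained from the user


def build_tree(values):
    """Toy agglomerative tree on 1-D values: merge the adjacent pair of
    clusters whose means are closest until a single root remains."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    nodes = [Node([i]) for i in order]

    def mean(node):
        return sum(values[i] for i in node.points) / len(node.points)

    while len(nodes) > 1:
        k = min(range(len(nodes) - 1),
                key=lambda j: mean(nodes[j + 1]) - mean(nodes[j]))
        nodes[k:k + 2] = [Node(nodes[k].points + nodes[k + 1].points,
                               nodes[k], nodes[k + 1])]
    return nodes[0]


def uncertainty(node):
    """Unqueried samples weighted by a smoothed estimate of label impurity."""
    unqueried = len(node.points) - len(node.labels)
    top = max(Counter(node.labels.values()).values(), default=0)
    purity = (top + 1) / (len(node.labels) + 2)
    return unqueried * (1 - purity)


def split_mixed(node):
    """Replace a node showing mixed labels by its children, recursively."""
    if node.left is None or len(set(node.labels.values())) < 2:
        return [node]
    parts = []
    for child in (node.left, node.right):
        child.labels = {i: lab for i, lab in node.labels.items()
                        if i in child.points}
        parts.extend(split_mixed(child))
    return parts


def active_label(root, oracle, budget):
    """Query at most `budget` labels; return (predictions, queries used)."""
    pruning, used = [root], 0
    for _ in range(budget):
        node = max(pruning, key=uncertainty)
        if uncertainty(node) == 0:
            break  # every cluster in the pruning is fully labelled
        unqueried = [i for i in node.points if i not in node.labels]
        pick = unqueried[len(unqueried) // 2]  # stand-in for random sampling
        node.labels[pick] = oracle(pick)
        used += 1
        pruning = [part for n in pruning for part in split_mixed(n)]
    predicted = {}
    for node in pruning:
        majority = Counter(node.labels.values()).most_common(1)
        fallback = majority[0][0] if majority else None
        for i in node.points:
            predicted[i] = node.labels.get(i, fallback)
    return predicted, used


values = [0, 1, 2, 50, 51, 52, 90, 91]            # hypothetical 1-D pixel features
truth = ["a", "a", "a", "b", "b", "b", "c", "c"]  # stands in for the user
predicted, used = active_label(build_tree(values), lambda i: truth[i], budget=6)
print(used, predicted)
```

Each cluster in the final pruning passes its majority label to its unqueried members, which is why the result applies only to the samples already in the tree (the transductive setting mentioned in the abstract).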
Abstract:
A new issue, once again a bouquet of attractive papers. First of all the paper by Droit-Dupré et al. (10.1007/s00428-015-1724-9). The group studied colonic adenocarcinomas, not otherwise specified, by immunohistochemistry for the expression of markers of intestinal epithelial cell differentiation. Hierarchical clustering analysis identified a major cluster of two thirds of the case series, expressing cytokeratin 20, CDX2 and MUC2 and invariably mismatch repair competent, which they called crypt-like. In stage III colon cancer, the crypt-like cluster had a better prognosis. The paper is a relatively simple example of what is happening in cancer classification beyond morphology: multiparameter differentiation and (epi)genomic markers defining new subtypes of cancer with potential clinical significance in clinical decision making.
Abstract:
Specific properties emerge from the structure of large networks, such as that of worldwide air traffic, including a highly hierarchical node structure and multi-level small world sub-groups that strongly influence future dynamics. We have developed clustering methods to understand the form of these structures, to identify structural properties, and to evaluate the effects of these properties. Graph clustering methods are often constructed from different components: a metric, a clustering index, and a modularity measure to assess the quality of a clustering method. To understand the impact of each of these components on the clustering method, we explore and compare different combinations. These different combinations are used to compare multilevel clustering methods to delineate the effects of geographical distance, hubs, network densities, and bridges on worldwide air passenger traffic. The ultimate goal of this methodological research is to demonstrate evidence of combined effects in the development of an air traffic network. In fact, the network can be divided into different levels of "cohesion", which can be qualified and measured by comparative studies (Newman, 2002; Guimera et al., 2005; Sales-Pardo et al., 2007).
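To make the role of a modularity measure concrete, the sketch below computes the widely used Newman-Girvan modularity, Q = sum over communities c of [e_c/m - (d_c/2m)^2], where m is the number of edges, e_c the number of edges inside c and d_c the summed degree of its nodes. The abstract does not say which modularity variant the authors used, so this formula is an assumption for the example, and the small "route" graph is made up.

```python
def modularity(edges, communities):
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    community_of = {node: c
                    for c, members in enumerate(communities)
                    for node in members}
    internal = [0] * len(communities)  # edges with both ends inside community c
    degree = [0] * len(communities)    # summed degree of the nodes in community c
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))


# Hypothetical network: two triangles of airports joined by one bridge route.
routes = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(routes, [[0, 1, 2], [3, 4, 5]])
q_single = modularity(routes, [[0, 1, 2, 3, 4, 5]])
print(q_split, q_single)
```

A partition that follows the two triangles scores higher than lumping every node into one community, which is the sense in which modularity ranks candidate clusterings.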
Abstract:
The coverage and volume of geo-referenced datasets are extensive and incessantly growing. The systematic capture of geo-referenced information generates large volumes of spatio-temporal data to be analyzed. Clustering and visualization play a key role in the exploratory data analysis and the extraction of knowledge embedded in these data. However, new challenges in visualization and clustering are posed when dealing with the special characteristics of these data: complex structures, a large quantity of samples, variables involved in a temporal context, high dimensionality and large variability in cluster shapes. The central aim of my thesis is to propose new algorithms and methodologies for clustering and visualization, in order to assist knowledge extraction from spatio-temporal geo-referenced data, thus improving decision-making processes. I present two original algorithms, one for clustering, the Fuzzy Growing Hierarchical Self-Organizing Networks (FGHSON), and one for exploratory visual data analysis, the Tree-structured Self-organizing Maps Component Planes. In addition, I present methodologies that, combined with FGHSON and the Tree-structured SOM Component Planes, allow the integration of space and time seamlessly and simultaneously in order to extract knowledge embedded in a temporal context. The originality of the FGHSON lies in its capability to reflect the underlying structure of a dataset in a hierarchical fuzzy way. A hierarchical fuzzy representation of clusters is crucial when data include complex structures with large variability of cluster shapes, variances, densities and number of clusters. The most important characteristics of the FGHSON include: (1) It does not require an a priori setup of the number of clusters. (2) The algorithm executes several self-organizing processes in parallel. Hence, when dealing with large datasets the processes can be distributed, reducing the computational cost.
(3) Only three parameters are necessary to set up the algorithm. In the case of the Tree-structured SOM Component Planes, the novelty of this algorithm lies in its ability to create a structure that allows the visual exploratory data analysis of large high-dimensional datasets. This algorithm creates a hierarchical structure of Self-Organizing Map Component Planes, arranging similar variables' projections in the same branches of the tree. Hence, similarities in variables' behavior can be easily detected (e.g. local correlations, maximal and minimal values and outliers). Both FGHSON and the Tree-structured SOM Component Planes were applied to several agroecological problems, proving to be very efficient in the exploratory analysis and clustering of spatio-temporal datasets. In this thesis I also tested three soft competitive learning algorithms: two well-known unsupervised soft competitive algorithms, namely the Self-Organizing Maps (SOMs) and the Growing Hierarchical Self-Organizing Maps (GHSOMs), and our original contribution, the FGHSON. Although the algorithms presented here have been used in several areas, to my knowledge no previous work applies and compares the performance of these techniques on spatio-temporal geospatial data, as is done in this thesis. I propose original methodologies to explore spatio-temporal geo-referenced datasets through time. Our approach uses time windows to capture temporal similarities and variations by using the FGHSON clustering algorithm. The developed methodologies are used in two case studies.
In the first, the objective was to find similar agroecozones through time, and in the second it was to find similar environmental patterns shifted in time. Several results presented in this thesis have led to new contributions to agroecological knowledge, for instance in sugar cane and blackberry production. Finally, in the framework of this thesis we developed several software tools: (1) a Matlab toolbox that implements the FGHSON algorithm, and (2) a program called BIS (Bio-inspired Identification of Similar agroecozones), an interactive graphical user interface tool that integrates the FGHSON algorithm with Google Earth in order to show zones with similar agroecological characteristics.
Abstract:
Résumé (translated from French): Technical advances in mass spectrometry (MS) have contributed to the recent development of proteomics. This technique can currently detect, identify and quantify thousands of proteins. However, it is not yet powerful enough to provide a complete analysis of the proteome changes correlated with biological phenomena. Our objective was to develop a new strategy for the specific detection and quantification of proteome variations, based on measuring protein synthesis rather than total protein amount. To this end, we wanted to combine pulsed labeling of proteins with stable isotopes with an MS acquisition method based on precursor ion scanning (PIS), in order to specifically detect the proteins that had incorporated the isotopes and to estimate their abundance relative to unlabeled proteins. Such an approach can identify the proteins with the highest synthesis rates in a given period of time, including proteins whose expression increases specifically following a particular event. We first tested different labeled amino acids in combination with specific PIS methods. These tests allowed the specific detection of labeled proteins. However, because of the instrumental limitations of the mass spectrometer used for the PIS methods, the sensitivity of this approach proved to be lower than that of an untargeted analysis performed on a more recent instrument (Chapter 2.1). Nevertheless, for the differential analysis of two culture media conditioned by human cancer cells, we used metabolic labeling to distinguish the proteins of cellular origin from the unlabeled serum proteins present in the culture media (Chapter 2.2). In parallel, we developed a new quantification method named ISIS, which uses pairs of stable-isotope-labeled amino acids capable of producing specific ions that can be used for relative quantification. The ISIS method was applied to the analysis of two cancer cell lines that were fully, but differentially, labeled with pairs of amino acids (Chapter 2.3). Then, in line with the initial objective of this thesis, we used a pulsed variant of ISIS to detect proteome changes in HeLa cells infected with human Herpes Simplex Virus-1 (Chapter 2.4). This virus represses the synthesis of host cell proteins in order to exploit their translation machinery for the massive production of virions. As expected, high synthesis rates were measured for the viral proteins detected, attesting to their high level of expression. We also identified a number of human proteins whose synthesis-to-degradation ratio (S/D) was altered by the viral infection, which may give indications of the strategies used by viruses to hijack the cellular machinery. In conclusion, we have shown in this work that metabolic labeling can be used in unconventional ways to study little-explored dimensions of proteomics.
Summary: In recent years, major technical advancements have greatly supported the development of mass spectrometry (MS)-based proteomics. Currently, this technique can efficiently detect, identify and quantify thousands of proteins. However, it is not yet sufficiently powerful to provide a comprehensive analysis of the proteome changes correlated with biological phenomena. The aim of our project was the development of a new strategy for the specific detection and quantification of proteome variations based on measurements of protein synthesis rather than total protein amounts.
The rationale for this approach was that changes in protein synthesis more closely reflect dynamic cellular responses than changes in total protein concentrations. Our starting idea was to couple "pulsed" stable-isotope labeling of proteins with a specific MS acquisition method based on precursor ion scan (PIS), to specifically detect proteins that incorporated the label and to simultaneously estimate their abundance relative to the unlabeled protein isoform. Such an approach could highlight proteins with the highest synthesis rate in a given time frame, including proteins specifically up-regulated by a given biological stimulus. As a first step, we tested different isotope-labeled amino acids in combination with dedicated PIS methods and showed that this leads to specific detection of labeled proteins. Sensitivity, however, turned out to be lower than that of an untargeted analysis run on a more recent instrument, due to MS hardware limitations (Chapter 2.1). We next used metabolic labeling to distinguish the proteins of cellular origin from a high background of unlabeled (serum) proteins, for the differential analysis of two serum-containing culture media conditioned by labeled human cancer cells (Chapter 2.2). As a parallel project we developed a new quantification method (named ISIS), which uses pairs of stable-isotope labeled amino acids able to produce specific reporter ions, which can be used for relative quantification. The ISIS method was applied to the analysis of two fully, yet differentially labeled cancer cell lines, as described in Chapter 2.3. Next, in line with the original purpose of this thesis, we used a "pulsed" variant of ISIS to detect proteome changes in HeLa cells after infection with human Herpes Simplex Virus-1 (Chapter 2.4). This virus is known to repress the synthesis of host cell proteins to exploit the translation machinery for the massive production of virions.
As expected, high synthesis rates were measured for the detected viral proteins, confirming their up-regulation. Moreover, we identified a number of human proteins whose synthesis/degradation ratio (S/D) was affected by the viral infection and which could provide clues on the strategies used by the virus to hijack the cellular machinery. Overall, in this work, we showed that metabolic labeling can be employed in alternative ways to investigate poorly explored dimensions in proteomics.
Abstract:
The long-term goal of this research is to develop a program able to produce an automatic segmentation and categorization of textual sequences into discourse types. In this preliminary contribution, we present the construction of an algorithm which takes a segmented text as input and attempts to categorize its sequences as narrative, argumentative, descriptive and so on. This work also aims at investigating a possible convergence between unsupervised statistical learning and the typological approach developed in particular in the field of text and discourse analysis in French by Adam (2008) and Bronckart (1997).
Abstract:
Recent evidence suggests that lactate could be a preferential energy substrate transferred from astrocytes to neurons. This would imply the presence of specific transporters for lactate on both cell types. We have investigated the immunohistochemical localization of two monocarboxylate transporters, MCT1 and MCT2, in the adult mouse brain. Using specific antibodies raised against MCT1 and MCT2, we found strong immunoreactivity for each transporter in glia limitans, ependymocytes and several microvessel-like elements. In addition, small processes distributed throughout the cerebral parenchyma were immunolabeled for monocarboxylate transporters. Double immunofluorescent labeling and confocal microscopy examination of these small processes revealed no co-localization between glial fibrillary acidic protein and monocarboxylate transporters, although many glial fibrillary acidic protein-positive processes were often in close apposition to elements labeled for monocarboxylate transporters. In contrast, several elements expressing the S100beta protein, another astrocytic marker found to be located in distinct parts of the same cell when compared with glial fibrillary acidic protein, were also strongly immunoreactive for MCT1, suggesting expression of this transporter by astrocytes. In contrast, MCT2 was expressed in a small subset of microtubule-associated protein-2-positive elements, indicating a neuronal localization. In conclusion, these observations are consistent with the possibility that lactate, produced and released by astrocytes (via MCT1), could be taken up (via MCT2) and used by neurons as an energy substrate.
Abstract:
Over the past decades, several sensitive post-electrophoretic stains have been developed for the identification of proteins in general, or for the specific detection of post-translational modifications such as phosphorylation, glycosylation or oxidation. Yet, for the visualization and quantification of protein differences, differential two-dimensional gel electrophoresis, termed DIGE, has become the method of choice for detecting differences between two sets of proteomes. The goal of this review is to evaluate the use of the most common non-covalent and covalent staining techniques in 2D electrophoresis gels, in order to obtain maximal information per electrophoresis gel and to identify potential biomarkers. We will also discuss the use of detergents during covalent labeling and the identification of oxidative modifications, and review the influence of detergents on fingerprint analysis and MS/MS identification in relation to 2D electrophoresis.