29 resultados para Hierarchical clustering model
em Université de Lausanne, Switzerland
Resumo:
MOTIVATION: Analysis of millions of pyro-sequences is currently playing a crucial role in the advance of environmental microbiology. Taxonomy-independent, i.e. unsupervised, clustering of these sequences is essential for the definition of Operational Taxonomic Units. For this application, reproducibility and robustness should be the most sought after qualities, but have thus far largely been overlooked. RESULTS: More than 1 million hyper-variable internal transcribed spacer 1 (ITS1) sequences of fungal origin have been analyzed. The ITS1 sequences were first properly extracted from 454 reads using generalized profiles. Then, otupipe, cd-hit-454, ESPRIT-Tree and DBC454, a new algorithm presented here, were used to analyze the sequences. A numerical assay was developed to measure the reproducibility and robustness of these algorithms. DBC454 was the most robust, closely followed by ESPRIT-Tree. DBC454 features density-based hierarchical clustering, which complements the other methods by providing insights into the structure of the data. AVAILABILITY: An executable is freely available for non-commercial users at ftp://ftp.vital-it.ch/tools/dbc454. It is designed to run under MPI on a cluster of 64-bit Linux machines running Red Hat 4.x, or on a multi-core OSX system. CONTACT: dbc454@vital-it.ch or nicolas.guex@isb-sib.ch.
Resumo:
Microarray gene expression profiles of fresh clinical samples of chronic myeloid leukaemia in chronic phase, acute promyelocytic leukaemia and acute monocytic leukaemia were compared with profiles from cell lines representing the corresponding types of leukaemia (K562, NB4, HL60). In a hierarchical clustering analysis, all clinical samples clustered separately from the cell lines, regardless of leukaemic subtype. Gene ontology analysis showed that cell lines chiefly overexpressed genes related to macromolecular metabolism, whereas in clinical samples genes related to the immune response were abundantly expressed. These findings must be taken into consideration when conclusions from cell line-based studies are extrapolated to patients.
Resumo:
In occupational exposure assessment of airborne contaminants, exposure levels can either be estimated through repeated measurements of the pollutant concentration in air, expert judgment or through exposure models that use information on the conditions of exposure as input. In this report, we propose an empirical hierarchical Bayesian model to unify these approaches. Prior to any measurement, the hygienist conducts an assessment to generate prior distributions of exposure determinants. Monte-Carlo samples from these distributions feed two level-2 models: a physical, two-compartment model, and a non-parametric, neural network model trained with existing exposure data. The outputs of these two models are weighted according to the expert's assessment of their relevance to yield predictive distributions of the long-term geometric mean and geometric standard deviation of the worker's exposure profile (level-1 model). Bayesian inferences are then drawn iteratively from subsequent measurements of worker exposure. Any traditional decision strategy based on a comparison with occupational exposure limits (e.g. mean exposure, exceedance strategies) can then be applied. Data on 82 workers exposed to 18 contaminants in 14 companies were used to validate the model with cross-validation techniques. A user-friendly program running the model is available upon request.
Resumo:
General clustering deals with weighted objects and fuzzy memberships. We investigate the group- or object-aggregation-invariance properties possessed by the relevant functionals (effective number of groups or objects, centroids, dispersion, mutual object-group information, etc.). The classical squared Euclidean case can be generalized to non-Euclidean distances, as well as to non-linear transformations of the memberships, yielding the c-means clustering algorithm as well as two presumably new procedures, the convex and pairwise convex clustering. Cluster stability and aggregation-invariance of the optimal memberships associated to the various clustering schemes are examined as well.
Resumo:
In this paper, we consider active sampling to label pixels grouped with hierarchical clustering. The objective of the method is to match the data relationships discovered by the clustering algorithm with the user's desired class semantics. The first is represented as a complete tree to be pruned and the second is iteratively provided by the user. The active learning algorithm proposed searches the pruning of the tree that best matches the labels of the sampled points. By choosing the part of the tree to sample from according to current pruning's uncertainty, sampling is focused on most uncertain clusters. This way, large clusters for which the class membership is already fixed are no longer queried and sampling is focused on division of clusters showing mixed labels. The model is tested on a VHR image in a multiclass classification setting. The method clearly outperforms random sampling in a transductive setting, but cannot generalize to unseen data, since it aims at optimizing the classification of a given cluster structure.
Resumo:
Although Leontopodium alpinum is considered to be threatened in many countries, only limited scientific information about its autecology is available. In this study, we aim to define the most important ecological factors which influence the distribution of L. alpinum in the Swiss Alps. These were assessed at the national scale using species distribution models based on topoclimatic predictors and at the community scale using exhaustive plant inventories. The latter were analysed using hierarchical clustering and principal component analysis, and the results were interpreted using ecological indicator values. L. alpinum was found almost exclusively on base-rich bedrocks (limestone and ultramaphic rocks). The species distribution models showed that the available moisture (dry regions, mostly in the Inner Alps), elevation (mostly above 2000 m.a.s.l.) and slope (mostly >30°) were the most important predictors. The relevés showed that L. alpinum is present in a wide range of plant communities, all subalpine-alpine open grasslands, with a low grass cover. As a light-demanding and short species, L. alpinum requires light at ground level; hence, it can only grow in open, nutrient-poor grasslands. These conditions are met in dry conditions (dry, summer-warm climate, rocky and draining soil, south-facing aspect and/or steep slope), at high elevations, on oligotrophic soils and/or on windy ridges. Base-rich soils appear to also be essential, although it is still unclear if this corresponds to physiological or ecological (lower competition) requirements.
Resumo:
Previous microarray studies on breast cancer identified multiple tumour classes, of which the most prominent, named luminal and basal, differ in expression of the oestrogen receptor alpha gene (ER). We report here the identification of a group of breast tumours with increased androgen signalling and a 'molecular apocrine' gene expression profile. Tumour samples from 49 patients with large operable or locally advanced breast cancers were tested on Affymetrix U133A gene expression microarrays. Principal components analysis and hierarchical clustering split the tumours into three groups: basal, luminal and a group we call molecular apocrine. All of the molecular apocrine tumours have strong apocrine features on histological examination (P=0.0002). The molecular apocrine group is androgen receptor (AR) positive and contains all of the ER-negative tumours outside the basal group. Kolmogorov-Smirnov testing indicates that oestrogen signalling is most active in the luminal group, and androgen signalling is most active in the molecular apocrine group. ERBB2 amplification is commoner in the molecular apocrine than the other groups. Genes that best split the three groups were identified by Wilcoxon test. Correlation of the average expression profile of these genes in our data with the expression profile of individual tumours in four published breast cancer studies suggest that molecular apocrine tumours represent 8-14% of tumours in these studies. Our data show that it is possible with microarray data to divide mammary tumour cells into three groups based on steroid receptor activity: luminal (ER+ AR+), basal (ER- AR-) and molecular apocrine (ER- AR+).
Resumo:
The interpretation of the Wechsler Intelligence Scale for Children-Fourth Edition (WISC-IV) is based on a 4-factor model, which is only partially compatible with the mainstream Cattell-Horn-Carroll (CHC) model of intelligence measurement. The structure of cognitive batteries is frequently analyzed via exploratory factor analysis and/or confirmatory factor analysis. With classical confirmatory factor analysis, almost all crossloadings between latent variables and measures are fixed to zero in order to allow the model to be identified. However, inappropriate zero cross-loadings can contribute to poor model fit, distorted factors, and biased factor correlations; most important, they do not necessarily faithfully reflect theory. To deal with these methodological and theoretical limitations, we used a new statistical approach, Bayesian structural equation modeling (BSEM), among a sample of 249 French-speaking Swiss children (8-12 years). With BSEM, zero-fixed cross-loadings between latent variables and measures are replaced by approximate zeros, based on informative, small-variance priors. Results indicated that a direct hierarchical CHC-based model with 5 factors plus a general intelligence factor better represented the structure of the WISC-IV than did the 4-factor structure and the higher order models. Because a direct hierarchical CHC model was more adequate, it was concluded that the general factor should be considered as a breadth rather than a superordinate factor. Because it was possible for us to estimate the influence of each of the latent variables on the 15 subtest scores, BSEM allowed improvement of the understanding of the structure of intelligence tests and the clinical interpretation of the subtest scores.
Resumo:
We investigated sex specificities in the evolutionary processes shaping Y chromosome, autosomes, and mitochondrial DNA patterns of genetic structure in the Valais shrew (Sorex antinorii), a mountain dwelling species with a hierarchical distribution. Both hierarchical analyses of variance and isolation-by-distance analyses revealed patterns of population structure that were not consistent across maternal, paternal, and biparentally inherited markers. Differentiation on a Y microsatellite was lower than expected from the comparison with autosomal microsatellites and mtDNA, and it was mostly due to genetic variance among populations within valleys, whereas the opposite was observed on other markers. In addition, there was no pattern of isolation by distance for the Y, whereas there was strong isolation by distance on mtDNA and autosomes. We use a hierarchical island model of coancestry dynamics to discuss the relative roles of the microevolutionary forces that may induce such patterns. We conclude that sex-biased dispersal is the most important driver of the observed genetic structure, but with an intriguing twist: it seems that dispersal is strongly male biased at large spatial scale, whereas it is mildly biased in favor of females at local scale. These results add to recent reports of scale-specific sex-biased dispersal patterns, and emphasize the usefulness of the Y chromosome in conjunction with mtDNA and autosomes to infer sex specificities.
Resumo:
The ability to obtain gene expression profiles from human disease specimens provides an opportunity to identify relevant gene pathways, but is limited by the absence of data sets spanning a broad range of conditions. Here, we analyzed publicly available microarray data from 16 diverse skin conditions in order to gain insight into disease pathogenesis. Unsupervised hierarchical clustering separated samples by disease as well as common cellular and molecular pathways. Disease-specific signatures were leveraged to build a multi-disease classifier, which predicted the diagnosis of publicly and prospectively collected expression profiles with 93% accuracy. In one sample, the molecular classifier differed from the initial clinical diagnosis and correctly predicted the eventual diagnosis as the clinical presentation evolved. Finally, integration of IFN-regulated gene programs with the skin database revealed a significant inverse correlation between IFN-β and IFN-γ programs across all conditions. Our study provides an integrative approach to the study of gene signatures from multiple skin conditions, elucidating mechanisms of disease pathogenesis. In addition, these studies provide a framework for developing tools for personalized medicine toward the precise prediction, prevention, and treatment of disease on an individual level.
Resumo:
A new issue, once again a bouquet of attractive papers. First of all the paper by Droit-Dupré et al. (10.1007/s00428-015-1724-9). The group studied colonic adenocarcinomas, not otherwise specified, by immunohistochemistry for the expression of markers of intestinal epithelial cell differentiation. Hierarchical clustering analysis identified a major cluster of two thirds of the case series, expressing cytokeratin 20, CDX2 and MUC2 and invariably mismatch repair competent, which they called crypt-like. In stage III colon cancer, the crypt-like cluster had a better prognosis. The paper is a relatively simple example of what is happening in cancer classification beyond morphology: multiparameter differentiation and (epi)genomic markers defining new subtypes of cancer with potential clinical significance in clinical decision making.
Resumo:
Wastewater-based epidemiology consists in acquiring relevant information about the lifestyle and health status of the population through the analysis of wastewater samples collected at the influent of a wastewater treatment plant. Whilst being a very young discipline, it has experienced an astonishing development since its firs application in 2005. The possibility to gather community-wide information about drug use has been among the major field of application. The wide resonance of the first results sparked the interest of scientists from various disciplines. Since then, research has broadened in innumerable directions. Although being praised as a revolutionary approach, there was a need to critically assess its added value, with regard to the existing indicators used to monitor illicit drug use. The main, and explicit, objective of this research was to evaluate the added value of wastewater-based epidemiology with regards to two particular, although interconnected, dimensions of illicit drug use. The first is related to trying to understand the added value of the discipline from an epidemiological, or societal, perspective. In other terms, to evaluate if and how it completes our current vision about the extent of illicit drug use at the population level, and if it can guide the planning of future prevention measures and drug policies. The second dimension is the criminal one, with a particular focus on the networks which develop around the large demand in illicit drugs. The goal here was to assess if wastewater-based epidemiology, combined to indicators stemming from the epidemiological dimension, could provide additional clues about the structure of drug distribution networks and the size of their market. This research had also an implicit objective, which focused on initiating the path of wastewater- based epidemiology at the Ecole des Sciences Criminelles of the University of Lausanne. This consisted in gathering the necessary knowledge about the collection, preparation, and analysis of wastewater samples and, most importantly, to understand how to interpret the acquired data and produce useful information. In the first phase of this research, it was possible to determine that ammonium loads, measured directly in the wastewater stream, could be used to monitor the dynamics of the population served by the wastewater treatment plant. Furthermore, it was shown that on the long term, the population did not have a substantial impact on consumption patterns measured through wastewater analysis. Focussing on methadone, for which precise prescription data was available, it was possible to show that reliable consumption estimates could be obtained via wastewater analysis. This allowed to validate the selected sampling strategy, which was then used to monitor the consumption of heroin, through the measurement of morphine. The latter, in combination to prescription and sales data, provided estimates of heroin consumption in line with other indicators. These results, combined to epidemiological data, highlighted the good correspondence between measurements and expectations and, furthermore, suggested that the dark figure of heroin users evading harm-reduction programs, which would thus not be measured by conventional indicators, is likely limited. In the third part, which consisted in a collaborative study aiming at extensively investigating geographical differences in drug use, wastewater analysis was shown to be a useful complement to existing indicators. In particular for stigmatised drugs, such as cocaine and heroin, it allowed to decipher the complex picture derived from surveys and crime statistics. Globally, it provided relevant information to better understand the drug market, both from an epidemiological and repressive perspective. The fourth part focused on cannabis and on the potential of combining wastewater and survey data to overcome some of their respective limitations. Using a hierarchical inference model, it was possible to refine current estimates of cannabis prevalence in the metropolitan area of Lausanne. Wastewater results suggested that the actual prevalence is substantially higher compared to existing figures, thus supporting the common belief that surveys tend to underestimate cannabis use. Whilst being affected by several biases, the information collected through surveys allowed to overcome some of the limitations linked to the analysis of cannabis markers in wastewater (i.e., stability and limited excretion data). These findings highlighted the importance and utility of combining wastewater-based epidemiology to existing indicators about drug use. Similarly, the fifth part of the research was centred on assessing the potential uses of wastewater-based epidemiology from a law enforcement perspective. Through three concrete examples, it was shown that results from wastewater analysis can be used to produce highly relevant intelligence, allowing drug enforcement to assess the structure and operations of drug distribution networks and, ultimately, guide their decisions at the tactical and/or operational level. Finally, the potential to implement wastewater-based epidemiology to monitor the use of harmful, prohibited and counterfeit pharmaceuticals was illustrated through the analysis of sibutramine, and its urinary metabolite, in wastewater samples. The results of this research have highlighted that wastewater-based epidemiology is a useful and powerful approach with numerous scopes. Faced with the complexity of measuring a hidden phenomenon like illicit drug use, it is a major addition to the panoply of existing indicators. -- L'épidémiologie basée sur l'analyse des eaux usées (ou, selon sa définition anglaise, « wastewater-based epidemiology ») consiste en l'acquisition d'informations portant sur le mode de vie et l'état de santé d'une population via l'analyse d'échantillons d'eaux usées récoltés à l'entrée des stations d'épuration. Bien qu'il s'agisse d'une discipline récente, elle a vécu des développements importants depuis sa première mise en oeuvre en 2005, notamment dans le domaine de l'analyse des résidus de stupéfiants. Suite aux retombées médiatiques des premiers résultats de ces analyses de métabolites dans les eaux usées, de nombreux scientifiques provenant de différentes disciplines ont rejoint les rangs de cette nouvelle discipline en développant plusieurs axes de recherche distincts. Bien que reconnu pour son coté objectif et révolutionnaire, il était nécessaire d'évaluer sa valeur ajoutée en regard des indicateurs couramment utilisés pour mesurer la consommation de stupéfiants. En se focalisant sur deux dimensions spécifiques de la consommation de stupéfiants, l'objectif principal de cette recherche était focalisé sur l'évaluation de la valeur ajoutée de l'épidémiologie basée sur l'analyse des eaux usées. La première dimension abordée était celle épidémiologique ou sociétale. En d'autres termes, il s'agissait de comprendre si et comment l'analyse des eaux usées permettait de compléter la vision actuelle sur la problématique, ainsi que déterminer son utilité dans la planification des mesures préventives et des politiques en matière de stupéfiants actuelles et futures. La seconde dimension abordée était celle criminelle, en particulier, l'étude des réseaux qui se développent autour du trafic de produits stupéfiants. L'objectif était de déterminer si cette nouvelle approche combinée aux indicateurs conventionnels, fournissait de nouveaux indices quant à la structure et l'organisation des réseaux de distribution ainsi que sur les dimensions du marché. Cette recherche avait aussi un objectif implicite, développer et d'évaluer la mise en place de l'épidémiologie basée sur l'analyse des eaux usées. En particulier, il s'agissait d'acquérir les connaissances nécessaires quant à la manière de collecter, traiter et analyser des échantillons d'eaux usées, mais surtout, de comprendre comment interpréter les données afin d'en extraire les informations les plus pertinentes. Dans la première phase de cette recherche, il y pu être mis en évidence que les charges en ammonium, mesurées directement dans les eaux usées permettait de suivre la dynamique des mouvements de la population contributrice aux eaux usées de la station d'épuration de la zone étudiée. De plus, il a pu être démontré que, sur le long terme, les mouvements de la population n'avaient pas d'influence substantielle sur le pattern de consommation mesuré dans les eaux usées. En se focalisant sur la méthadone, une substance pour laquelle des données précises sur le nombre de prescriptions étaient disponibles, il a pu être démontré que des estimations exactes sur la consommation pouvaient être tirées de l'analyse des eaux usées. Ceci a permis de valider la stratégie d'échantillonnage adoptée, qui, par le bais de la morphine, a ensuite été utilisée pour suivre la consommation d'héroïne. Combinée aux données de vente et de prescription, l'analyse de la morphine a permis d'obtenir des estimations sur la consommation d'héroïne en accord avec des indicateurs conventionnels. Ces résultats, combinés aux données épidémiologiques ont permis de montrer une bonne adéquation entre les projections des deux approches et ainsi démontrer que le chiffre noir des consommateurs qui échappent aux mesures de réduction de risque, et qui ne seraient donc pas mesurés par ces indicateurs, est vraisemblablement limité. La troisième partie du travail a été réalisée dans le cadre d'une étude collaborative qui avait pour but d'investiguer la valeur ajoutée de l'analyse des eaux usées à mettre en évidence des différences géographiques dans la consommation de stupéfiants. En particulier pour des substances stigmatisées, telles la cocaïne et l'héroïne, l'approche a permis d'objectiver et de préciser la vision obtenue avec les indicateurs traditionnels du type sondages ou les statistiques policières. Globalement, l'analyse des eaux usées s'est montrée être un outil très utile pour mieux comprendre le marché des stupéfiants, à la fois sous l'angle épidémiologique et répressif. La quatrième partie du travail était focalisée sur la problématique du cannabis ainsi que sur le potentiel de combiner l'analyse des eaux usées aux données de sondage afin de surmonter, en partie, leurs limitations. En utilisant un modèle d'inférence hiérarchique, il a été possible d'affiner les actuelles estimations sur la prévalence de l'utilisation de cannabis dans la zone métropolitaine de la ville de Lausanne. Les résultats ont démontré que celle-ci est plus haute que ce que l'on s'attendait, confirmant ainsi l'hypothèse que les sondages ont tendance à sous-estimer la consommation de cannabis. Bien que biaisés, les données récoltées par les sondages ont permis de surmonter certaines des limitations liées à l'analyse des marqueurs du cannabis dans les eaux usées (i.e., stabilité et manque de données sur l'excrétion). Ces résultats mettent en évidence l'importance et l'utilité de combiner les résultats de l'analyse des eaux usées aux indicateurs existants. De la même façon, la cinquième partie du travail était centrée sur l'apport de l'analyse des eaux usées du point de vue de la police. Au travers de trois exemples, l'utilisation de l'indicateur pour produire du renseignement concernant la structure et les activités des réseaux de distribution de stupéfiants, ainsi que pour guider les choix stratégiques et opérationnels de la police, a été mise en évidence. Dans la dernière partie, la possibilité d'utiliser cette approche pour suivre la consommation de produits pharmaceutiques dangereux, interdits ou contrefaits, a été démontrée par l'analyse dans les eaux usées de la sibutramine et ses métabolites. Les résultats de cette recherche ont mis en évidence que l'épidémiologie par l'analyse des eaux usées est une approche pertinente et puissante, ayant de nombreux domaines d'application. Face à la complexité de mesurer un phénomène caché comme la consommation de stupéfiants, la valeur ajoutée de cette approche a ainsi pu être démontrée.
Resumo:
In the context of Systems Biology, computer simulations of gene regulatory networks provide a powerful tool to validate hypotheses and to explore possible system behaviors. Nevertheless, modeling a system poses some challenges of its own: especially the step of model calibration is often difficult due to insufficient data. For example when considering developmental systems, mostly qualitative data describing the developmental trajectory is available while common calibration techniques rely on high-resolution quantitative data. Focusing on the calibration of differential equation models for developmental systems, this study investigates different approaches to utilize the available data to overcome these difficulties. More specifically, the fact that developmental processes are hierarchically organized is exploited to increase convergence rates of the calibration process as well as to save computation time. Using a gene regulatory network model for stem cell homeostasis in Arabidopsis thaliana the performance of the different investigated approaches is evaluated, documenting considerable gains provided by the proposed hierarchical approach.
Resumo:
Specific properties emerge from the structure of large networks, such as that of worldwide air traffic, including a highly hierarchical node structure and multi-level small world sub-groups that strongly influence future dynamics. We have developed clustering methods to understand the form of these structures, to identify structural properties, and to evaluate the effects of these properties. Graph clustering methods are often constructed from different components: a metric, a clustering index, and a modularity measure to assess the quality of a clustering method. To understand the impact of each of these components on the clustering method, we explore and compare different combinations. These different combinations are used to compare multilevel clustering methods to delineate the effects of geographical distance, hubs, network densities, and bridges on worldwide air passenger traffic. The ultimate goal of this methodological research is to demonstrate evidence of combined effects in the development of an air traffic network. In fact, the network can be divided into different levels of âeurooecohesionâeuro, which can be qualified and measured by comparative studies (Newman, 2002; Guimera et al., 2005; Sales-Pardo et al., 2007).
Resumo:
The study was designed to investigate the psychometric properties of the French version and the cross-language replicability of the Hierarchical Personality Inventory for Children (HiPIC). The HiPIC is an instrument aimed at assessing the five dimensions of the Five-Factor Model for Children. Subjects were 552 children aged between 8 and 12 years, rated by one or both parents. At the domain level, reliability ranged from .83 to .93 and at the facet level, reliability ranged from .69 to .89. Differences between genders were congruent with those found in the Dutch sample. Girls scored higher on Benevolence and Conscientiousness. Age was negatively correlated with Extraversion and Imagination. For girls, we also observed a decrease of Emotional Stability. A series of exploratory factor analyses confirmed the overall five-factor structure for girls and boys. Targeted factor analyses and congruence coefficients revealed high cross-language replicability at the domain and at the facet levels. The results showed that the French version of the HiPIC is a reliable and valid instrument for assessing personality with children and has a particularly high cross-language replicability.