196 resultados para Clustering methods
em Université de Lausanne, Switzerland
Resumo:
Specific properties emerge from the structure of large networks, such as that of worldwide air traffic, including a highly hierarchical node structure and multi-level small world sub-groups that strongly influence future dynamics. We have developed clustering methods to understand the form of these structures, to identify structural properties, and to evaluate the effects of these properties. Graph clustering methods are often constructed from different components: a metric, a clustering index, and a modularity measure to assess the quality of a clustering method. To understand the impact of each of these components on the clustering method, we explore and compare different combinations. These different combinations are used to compare multilevel clustering methods to delineate the effects of geographical distance, hubs, network densities, and bridges on worldwide air passenger traffic. The ultimate goal of this methodological research is to demonstrate evidence of combined effects in the development of an air traffic network. In fact, the network can be divided into different levels of âeurooecohesionâeuro, which can be qualified and measured by comparative studies (Newman, 2002; Guimera et al., 2005; Sales-Pardo et al., 2007).
Resumo:
The use of the Internet now has a specific purpose: to find information. Unfortunately, the amount of data available on the Internet is growing exponentially, creating what can be considered a nearly infinite and ever-evolving network with no discernable structure. This rapid growth has raised the question of how to find the most relevant information. Many different techniques have been introduced to address the information overload, including search engines, Semantic Web, and recommender systems, among others. Recommender systems are computer-based techniques that are used to reduce information overload and recommend products likely to interest a user when given some information about the user's profile. This technique is mainly used in e-Commerce to suggest items that fit a customer's purchasing tendencies. The use of recommender systems for e-Government is a research topic that is intended to improve the interaction among public administrations, citizens, and the private sector through reducing information overload on e-Government services. More specifically, e-Democracy aims to increase citizens' participation in democratic processes through the use of information and communication technologies. In this chapter, an architecture of a recommender system that uses fuzzy clustering methods for e-Elections is introduced. In addition, a comparison with the smartvote system, a Web-based Voting Assistance Application (VAA) used to aid voters in finding the party or candidate that is most in line with their preferences, is presented.
Resumo:
Tire traces can be observed on several crime scenes as vehicles are often used by criminals. The tread abrasion on the road, while braking or skidding, leads to the production of small rubber particles which can be collected for comparison purposes. This research focused on the statistical comparison of Py-GC/MS profiles of tire traces and tire treads. The optimisation of the analytical method was carried out using experimental designs. The aim was to determine the best pyrolysis parameters regarding the repeatability of the results. Thus, the pyrolysis factor effect could also be calculated. The pyrolysis temperature was found to be five time more important than time. Finally, a pyrolysis at 650 °C during 15 s was selected. Ten tires of different manufacturers and models were used for this study. Several samples were collected on each tire, and several replicates were carried out to study the variability within each tire (intravariability). More than eighty compounds were integrated for each analysis and the variability study showed that more than 75% presented a relative standard deviation (RSD) below 5% for the ten tires, thus supporting a low intravariability. The variability between the ten tires (intervariability) presented higher values and the ten most variant compounds had a RSD value above 13%, supporting their high potential of discrimination between the tires tested. Principal Component Analysis (PCA) was able to fully discriminate the ten tires with the help of the first three principal components. The ten tires were finally used to perform braking tests on a racetrack with a vehicle equipped with an anti-lock braking system. The resulting tire traces were adequately collected using sheets of white gelatine. As for tires, the intravariability for the traces was found to be lower than the intervariability. Clustering methods were carried out and the Ward's method based on the squared Euclidean distance was able to correctly group all of the tire traces replicates in the same cluster than the replicates of their corresponding tire. Blind tests on traces were performed and were correctly assigned to their tire source. These results support the hypothesis that the tested tires, of different manufacturers and models, can be discriminated by a statistical comparison of their chemical profiles. The traces were found to be not differentiable from their source but differentiable from all the other tires present in the subset. The results are promising and will be extended on a larger sample set.
Resumo:
The use by police services and inquiring agencies of forensic data in an intelligence perspective is still fragmentary and to some extent ignored. In order to increase the efficiency of criminal investigation to target illegal drug trafficking organisations and to provide valuable information about their methods, it is necessary to include and interpret objective drug analysis results already during the investigation phase. The value of visual, physical and chemical data of seized ecstasy tablets, as a support for criminal investigation on a strategic and tactical level has been investigated. In a first phase different characteristics of ecstasy tablets have been studied in order to define their relevance, variation, correlation and discriminating power in an intelligence perspective. During 5 years, over 1200 cases of ecstasy seizures (concerning about 150000 seized tablets) coming from different regions of Switzerland (City and Canton of Zurich, Cantons Ticino, Neuchâtel and Geneva) have been systematically recorded. This turned out to be a statistically representative database including large and small cases. During the second phase various comparison and clustering methods have been tested and evaluated, on the type and relevance of tablet characteristics, thus increasing knowledge about synthetic drugs, their manufacturing and trafficking. Finally analytical methodologies have been investigated and formalised, applying traditional intelligence methods. In this context classical tools, which are used in criminal analysis (like the I2 Analyst Notebook, I2 Ibase, ?) have been tested and adapted to address the specific need of forensic drug intelligence. The interpretation of these links provides valuable information about criminal organisations and their trafficking methods. In the final part of this thesis practical examples illustrate the use and value of such information.
Resumo:
This thesis develops a comprehensive and a flexible statistical framework for the analysis and detection of space, time and space-time clusters of environmental point data. The developed clustering methods were applied in both simulated datasets and real-world environmental phenomena; however, only the cases of forest fires in Canton of Ticino (Switzerland) and in Portugal are expounded in this document. Normally, environmental phenomena can be modelled as stochastic point processes where each event, e.g. the forest fire ignition point, is characterised by its spatial location and occurrence in time. Additionally, information such as burned area, ignition causes, landuse, topographic, climatic and meteorological features, etc., can also be used to characterise the studied phenomenon. Thereby, the space-time pattern characterisa- tion represents a powerful tool to understand the distribution and behaviour of the events and their correlation with underlying processes, for instance, socio-economic, environmental and meteorological factors. Consequently, we propose a methodology based on the adaptation and application of statistical and fractal point process measures for both global (e.g. the Morisita Index, the Box-counting fractal method, the multifractal formalism and the Ripley's K-function) and local (e.g. Scan Statistics) analysis. Many measures describing the space-time distribution of environmental phenomena have been proposed in a wide variety of disciplines; nevertheless, most of these measures are of global character and do not consider complex spatial constraints, high variability and multivariate nature of the events. Therefore, we proposed an statistical framework that takes into account the complexities of the geographical space, where phenomena take place, by introducing the Validity Domain concept and carrying out clustering analyses in data with different constrained geographical spaces, hence, assessing the relative degree of clustering of the real distribution. Moreover, exclusively to the forest fire case, this research proposes two new methodologies to defining and mapping both the Wildland-Urban Interface (WUI) described as the interaction zone between burnable vegetation and anthropogenic infrastructures, and the prediction of fire ignition susceptibility. In this regard, the main objective of this Thesis was to carry out a basic statistical/- geospatial research with a strong application part to analyse and to describe complex phenomena as well as to overcome unsolved methodological problems in the characterisation of space-time patterns, in particular, the forest fire occurrences. Thus, this Thesis provides a response to the increasing demand for both environmental monitoring and management tools for the assessment of natural and anthropogenic hazards and risks, sustainable development, retrospective success analysis, etc. The major contributions of this work were presented at national and international conferences and published in 5 scientific journals. National and international collaborations were also established and successfully accomplished. -- Cette thèse développe une méthodologie statistique complète et flexible pour l'analyse et la détection des structures spatiales, temporelles et spatio-temporelles de données environnementales représentées comme de semis de points. Les méthodes ici développées ont été appliquées aux jeux de données simulées autant qu'A des phénomènes environnementaux réels; nonobstant, seulement le cas des feux forestiers dans le Canton du Tessin (la Suisse) et celui de Portugal sont expliqués dans ce document. Normalement, les phénomènes environnementaux peuvent être modélisés comme des processus ponctuels stochastiques ou chaque événement, par ex. les point d'ignition des feux forestiers, est déterminé par son emplacement spatial et son occurrence dans le temps. De plus, des informations tels que la surface bru^lée, les causes d'ignition, l'utilisation du sol, les caractéristiques topographiques, climatiques et météorologiques, etc., peuvent aussi être utilisées pour caractériser le phénomène étudié. Par conséquent, la définition de la structure spatio-temporelle représente un outil puissant pour compren- dre la distribution du phénomène et sa corrélation avec des processus sous-jacents tels que les facteurs socio-économiques, environnementaux et météorologiques. De ce fait, nous proposons une méthodologie basée sur l'adaptation et l'application de mesures statistiques et fractales des processus ponctuels d'analyse global (par ex. l'indice de Morisita, la dimension fractale par comptage de boîtes, le formalisme multifractal et la fonction K de Ripley) et local (par ex. la statistique de scan). Des nombreuses mesures décrivant les structures spatio-temporelles de phénomènes environnementaux peuvent être trouvées dans la littérature. Néanmoins, la plupart de ces mesures sont de caractère global et ne considèrent pas de contraintes spatiales com- plexes, ainsi que la haute variabilité et la nature multivariée des événements. A cet effet, la méthodologie ici proposée prend en compte les complexités de l'espace géographique ou le phénomène a lieu, à travers de l'introduction du concept de Domaine de Validité et l'application des mesures d'analyse spatiale dans des données en présentant différentes contraintes géographiques. Cela permet l'évaluation du degré relatif d'agrégation spatiale/temporelle des structures du phénomène observé. En plus, exclusif au cas de feux forestiers, cette recherche propose aussi deux nouvelles méthodologies pour la définition et la cartographie des zones périurbaines, décrites comme des espaces anthropogéniques à proximité de la végétation sauvage ou de la forêt, et de la prédiction de la susceptibilité à l'ignition de feu. A cet égard, l'objectif principal de cette Thèse a été d'effectuer une recherche statistique/géospatiale avec une forte application dans des cas réels, pour analyser et décrire des phénomènes environnementaux complexes aussi bien que surmonter des problèmes méthodologiques non résolus relatifs à la caractérisation des structures spatio-temporelles, particulièrement, celles des occurrences de feux forestières. Ainsi, cette Thèse fournit une réponse à la demande croissante de la gestion et du monitoring environnemental pour le déploiement d'outils d'évaluation des risques et des dangers naturels et anthro- pogéniques. Les majeures contributions de ce travail ont été présentées aux conférences nationales et internationales, et ont été aussi publiées dans 5 revues internationales avec comité de lecture. Des collaborations nationales et internationales ont été aussi établies et accomplies avec succès.
Resumo:
Abstract : This work is concerned with the development and application of novel unsupervised learning methods, having in mind two target applications: the analysis of forensic case data and the classification of remote sensing images. First, a method based on a symbolic optimization of the inter-sample distance measure is proposed to improve the flexibility of spectral clustering algorithms, and applied to the problem of forensic case data. This distance is optimized using a loss function related to the preservation of neighborhood structure between the input space and the space of principal components, and solutions are found using genetic programming. Results are compared to a variety of state-of--the-art clustering algorithms. Subsequently, a new large-scale clustering method based on a joint optimization of feature extraction and classification is proposed and applied to various databases, including two hyperspectral remote sensing images. The algorithm makes uses of a functional model (e.g., a neural network) for clustering which is trained by stochastic gradient descent. Results indicate that such a technique can easily scale to huge databases, can avoid the so-called out-of-sample problem, and can compete with or even outperform existing clustering algorithms on both artificial data and real remote sensing images. This is verified on small databases as well as very large problems. Résumé : Ce travail de recherche porte sur le développement et l'application de méthodes d'apprentissage dites non supervisées. Les applications visées par ces méthodes sont l'analyse de données forensiques et la classification d'images hyperspectrales en télédétection. Dans un premier temps, une méthodologie de classification non supervisée fondée sur l'optimisation symbolique d'une mesure de distance inter-échantillons est proposée. Cette mesure est obtenue en optimisant une fonction de coût reliée à la préservation de la structure de voisinage d'un point entre l'espace des variables initiales et l'espace des composantes principales. Cette méthode est appliquée à l'analyse de données forensiques et comparée à un éventail de méthodes déjà existantes. En second lieu, une méthode fondée sur une optimisation conjointe des tâches de sélection de variables et de classification est implémentée dans un réseau de neurones et appliquée à diverses bases de données, dont deux images hyperspectrales. Le réseau de neurones est entraîné à l'aide d'un algorithme de gradient stochastique, ce qui rend cette technique applicable à des images de très haute résolution. Les résultats de l'application de cette dernière montrent que l'utilisation d'une telle technique permet de classifier de très grandes bases de données sans difficulté et donne des résultats avantageusement comparables aux méthodes existantes.
Resumo:
Distribution of socio-economic features in urban space is an important source of information for land and transportation planning. The metropolization phenomenon has changed the distribution of types of professions in space and has given birth to different spatial patterns that the urban planner must know in order to plan a sustainable city. Such distributions can be discovered by statistical and learning algorithms through different methods. In this paper, an unsupervised classification method and a cluster detection method are discussed and applied to analyze the socio-economic structure of Switzerland. The unsupervised classification method, based on Ward's classification and self-organized maps, is used to classify the municipalities of the country and allows to reduce a highly-dimensional input information to interpret the socio-economic landscape. The cluster detection method, the spatial scan statistics, is used in a more specific manner in order to detect hot spots of certain types of service activities. The method is applied to the distribution services in the agglomeration of Lausanne. Results show the emergence of new centralities and can be analyzed in both transportation and social terms.
Resumo:
OBJECTIVE: This study assessed clustering of multiple risk behaviors (i.e., low leisure-time physical activity, low fruits/vegetables intake, and high alcohol consumption) with level of cigarette consumption. METHODS: Data from the 2002 Swiss Health Survey, a population-based cross-sectional telephone survey assessing health and self-reported risk behaviors, were used. 18,005 subjects (8052 men and 9953 women) aged 25 years old or more participated. RESULTS: Smokers more frequently had low leisure time physical activity, low fruits/vegetables intake, and high alcohol consumption than non- and ex-smokers. Frequency of each risk behavior increased steadily with cigarette consumption. Clustering of risk behaviors increased with cigarette consumption in both men and women. For men, the odds ratios of multiple (> or =2) risk behaviors other than smoking, adjusted for age, nationality, and educational level, were 1.14 (95% confidence interval: 0.97, 1.33) for ex-smokers, 1.24 (0.93, 1.64) for light smokers (1-9 cigarettes/day), 1.72 (1.36, 2.17) for moderate smokers (10-19 cigarettes/day), and 3.07 (2.59, 3.64) for heavy smokers (> or =20 cigarettes/day) versus non-smokers. Similar odds ratios were found for women for corresponding groups, i.e., 1.01 (0.86, 1.19), 1.26 (1.00, 1.58), 1.62 (1.33, 1.98), and 2.75 (2.30, 3.29). CONCLUSIONS: Counseling and intervention with smokers should take into account the strong clustering of risk behaviors with level of cigarette consumption.
Resumo:
AIMS/HYPOTHESIS: The metabolic syndrome comprises a clustering of cardiovascular risk factors but the underlying mechanism is not known. Mice with targeted disruption of endothelial nitric oxide synthase (eNOS) are hypertensive and insulin resistant. We wondered, whether eNOS deficiency in mice is associated with a phenotype mimicking the human metabolic syndrome. METHODS AND RESULTS: In addition to arterial pressure and insulin sensitivity (euglycaemic hyperinsulinaemic clamp), we measured the plasma concentration of leptin, insulin, cholesterol, triglycerides, free fatty acids, fibrinogen and uric acid in 10 to 12 week old eNOS-/- and wild type mice. We also assessed glucose tolerance under basal conditions and following a metabolic stress with a high fat diet. As expected eNOS-/- mice were hypertensive and insulin resistant, as evidenced by fasting hyperinsulinaemia and a roughly 30 percent lower steady state glucose infusion rate during the clamp. eNOS-/- mice had a 1.5 to 2-fold elevation of the cholesterol, triglyceride and free fatty acid plasma concentration. Even though body weight was comparable, the leptin plasma level was 30% higher in eNOS-/- than in wild type mice. Finally, uric acid and fibrinogen were elevated in the eNOS-/- mice. Whereas under basal conditions, glucose tolerance was comparable in knock out and control mice, on a high fat diet, knock out mice became significantly more glucose intolerant than control mice. CONCLUSIONS: A single gene defect, eNOS deficiency, causes a clustering of cardiovascular risk factors in young mice. We speculate that defective nitric oxide synthesis could trigger many of the abnormalities making up the metabolic syndrome in humans.
Resumo:
BACKGROUND: Little is known about engagement in multiple health behaviours in childhood cancer survivors. METHODS: Using latent class analysis, we identified health behaviour patterns in 835 adult survivors of childhood cancer (age 20-35 years) and 1670 age- and sex-matched controls from the general population. Behaviour groups were determined from replies to questions on smoking, drinking, cannabis use, sporting activities, diet, sun protection and skin examination. RESULTS: The model identified four health behaviour patterns: 'risk-avoidance', with a generally healthy behaviour; 'moderate drinking', with higher levels of sporting activities, but moderate alcohol-consumption; 'risk-taking', engaging in several risk behaviours; and 'smoking', smoking but not drinking. Similar proportions of survivors and controls fell into the 'risk-avoiding' (42% vs 44%) and the 'risk-taking' cluster (14% vs 12%), but more survivors were in the 'moderate drinking' (39% vs 28%) and fewer in the 'smoking' cluster (5% vs 16%). Determinants of health behaviour clusters were gender, migration background, income and therapy. CONCLUSION: A comparable proportion of childhood cancer survivors as in the general population engage in multiple health-compromising behaviours. Because of increased vulnerability of survivors, multiple risk behaviours should be addressed in targeted health interventions.
Resumo:
A recurring task in the analysis of mass genome annotation data from high-throughput technologies is the identification of peaks or clusters in a noisy signal profile. Examples of such applications are the definition of promoters on the basis of transcription start site profiles, the mapping of transcription factor binding sites based on ChIP-chip data and the identification of quantitative trait loci (QTL) from whole genome SNP profiles. Input to such an analysis is a set of genome coordinates associated with counts or intensities. The output consists of a discrete number of peaks with respective volumes, extensions and center positions. We have developed for this purpose a flexible one-dimensional clustering tool, called MADAP, which we make available as a web server and as standalone program. A set of parameters enables the user to customize the procedure to a specific problem. The web server, which returns results in textual and graphical form, is useful for small to medium-scale applications, as well as for evaluation and parameter tuning in view of large-scale applications, requiring a local installation. The program written in C++ can be freely downloaded from ftp://ftp.epd.unil.ch/pub/software/unix/madap. The MADAP web server can be accessed at http://www.isrec.isb-sib.ch/madap/.
Resumo:
PURPOSE: To objectively characterize different heart tissues from functional and viability images provided by composite-strain-encoding (C-SENC) MRI. MATERIALS AND METHODS: C-SENC is a new MRI technique for simultaneously acquiring cardiac functional and viability images. In this work, an unsupervised multi-stage fuzzy clustering method is proposed to identify different heart tissues in the C-SENC images. The method is based on sequential application of the fuzzy c-means (FCM) and iterative self-organizing data (ISODATA) clustering algorithms. The proposed method is tested on simulated heart images and on images from nine patients with and without myocardial infarction (MI). The resulting clustered images are compared with MRI delayed-enhancement (DE) viability images for determining MI. Also, Bland-Altman analysis is conducted between the two methods. RESULTS: Normal myocardium, infarcted myocardium, and blood are correctly identified using the proposed method. The clustered images correctly identified 90 +/- 4% of the pixels defined as infarct in the DE images. In addition, 89 +/- 5% of the pixels defined as infarct in the clustered images were also defined as infarct in DE images. The Bland-Altman results show no bias between the two methods in identifying MI. CONCLUSION: The proposed technique allows for objectively identifying divergent heart tissues, which would be potentially important for clinical decision-making in patients with MI.
Resumo:
MOTIVATION: Analysis of millions of pyro-sequences is currently playing a crucial role in the advance of environmental microbiology. Taxonomy-independent, i.e. unsupervised, clustering of these sequences is essential for the definition of Operational Taxonomic Units. For this application, reproducibility and robustness should be the most sought after qualities, but have thus far largely been overlooked. RESULTS: More than 1 million hyper-variable internal transcribed spacer 1 (ITS1) sequences of fungal origin have been analyzed. The ITS1 sequences were first properly extracted from 454 reads using generalized profiles. Then, otupipe, cd-hit-454, ESPRIT-Tree and DBC454, a new algorithm presented here, were used to analyze the sequences. A numerical assay was developed to measure the reproducibility and robustness of these algorithms. DBC454 was the most robust, closely followed by ESPRIT-Tree. DBC454 features density-based hierarchical clustering, which complements the other methods by providing insights into the structure of the data. AVAILABILITY: An executable is freely available for non-commercial users at ftp://ftp.vital-it.ch/tools/dbc454. It is designed to run under MPI on a cluster of 64-bit Linux machines running Red Hat 4.x, or on a multi-core OSX system. CONTACT: dbc454@vital-it.ch or nicolas.guex@isb-sib.ch.
Resumo:
We present a new framework for large-scale data clustering. The main idea is to modify functional dimensionality reduction techniques to directly optimize over discrete labels using stochastic gradient descent. Compared to methods like spectral clustering our approach solves a single optimization problem, rather than an ad-hoc two-stage optimization approach, does not require a matrix inversion, can easily encode prior knowledge in the set of implementable functions, and does not have an ?out-of-sample? problem. Experimental results on both artificial and real-world datasets show the usefulness of our approach.
Resumo:
Multicentric carpotarsal osteolysis (MCTO) is a rare skeletal dysplasia characterized by aggressive osteolysis, particularly affecting the carpal and tarsal bones, and is frequently associated with progressive renal failure. Using exome capture and next-generation sequencing in five unrelated simplex cases of MCTO, we identified previously unreported missense mutations clustering within a 51 base pair region of the single exon of MAFB, validated by Sanger sequencing. A further six unrelated simplex cases with MCTO were also heterozygous for previously unreported mutations within this same region, as were affected members of two families with autosomal-dominant MCTO. MAFB encodes a transcription factor that negatively regulates RANKL-induced osteoclastogenesis and is essential for normal renal development. Identification of this gene paves the way for development of novel therapeutic approaches for this crippling disease and provides insight into normal bone and kidney development.