175 results for spatial clustering algorithms
at Université de Lausanne, Switzerland
Abstract:
In many fields, the spatial clustering of sampled data points has many consequences. Several indices have therefore been proposed to assess the level of clustering affecting datasets (e.g. the Morisita index, Ripley's K-function and Rényi's generalized entropy). The classical Morisita index measures how many times more likely it is to select two measurement points from the same quadrat (the dataset is covered by a regular grid of changing size) than it would be under a random distribution generated by a Poisson process. The multipoint version (k-Morisita) takes into account k points, with k >= 2. The present research deals with a new development of the k-Morisita index for (1) monitoring network characterization and (2) detection of patterns in monitored phenomena. From a theoretical perspective, a connection between the k-Morisita index and multifractality has also been found and highlighted on a mathematical multifractal set.
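As an illustration of the quadrat-based index described in this abstract, the following is a minimal sketch of the multipoint Morisita index in its commonly cited form, I_k = Q^(k-1) * sum_i n_i(n_i-1)...(n_i-k+1) / (N(N-1)...(N-k+1)), where Q is the number of quadrats, n_i the count in quadrat i and N the total number of points. The function name `k_morisita`, the unit-square domain and the square grid are assumptions made for the example, not details taken from the paper.

```python
from collections import Counter


def k_morisita(points, cells_per_side, k=2):
    """Multipoint Morisita index for 2-D points in the unit square.

    Assumes (for this sketch only) that every coordinate lies in [0, 1]
    and that the grid has cells_per_side x cells_per_side quadrats.
    """
    n_total = len(points)
    n_quadrats = cells_per_side ** 2  # Q in the formula

    def cell(coord):
        # Map a coordinate to its grid index; clamp 1.0 into the last cell.
        return min(int(coord * cells_per_side), cells_per_side - 1)

    # n_i: number of points falling in each occupied quadrat.
    counts = Counter((cell(x), cell(y)) for x, y in points)

    def falling_factorial(n, order):
        # n * (n - 1) * ... * (n - order + 1); zero whenever n < order.
        result = 1
        for j in range(order):
            result *= n - j
        return result

    numerator = sum(falling_factorial(n, k) for n in counts.values())
    denominator = falling_factorial(n_total, k)
    return n_quadrats ** (k - 1) * numerator / denominator


# Four points in a single quadrat of a 2 x 2 grid: strongly clustered.
clustered = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (0.4, 0.4)]
# One point per quadrat: perfectly dispersed at this grid size.
dispersed = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
print(k_morisita(clustered, 2), k_morisita(dispersed, 2))
```

Values well above 1 suggest clustering at the chosen quadrat size, while values near 1 are what a Poisson pattern would give on average. Repeating the computation over a range of `cells_per_side` values gives the scale dependence mentioned in the abstract.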
Abstract:
PURPOSE: To objectively characterize different heart tissues from functional and viability images provided by composite-strain-encoding (C-SENC) MRI. MATERIALS AND METHODS: C-SENC is a new MRI technique for simultaneously acquiring cardiac functional and viability images. In this work, an unsupervised multi-stage fuzzy clustering method is proposed to identify different heart tissues in the C-SENC images. The method is based on sequential application of the fuzzy c-means (FCM) and iterative self-organizing data (ISODATA) clustering algorithms. The proposed method is tested on simulated heart images and on images from nine patients with and without myocardial infarction (MI). The resulting clustered images are compared with MRI delayed-enhancement (DE) viability images for determining MI. Also, Bland-Altman analysis is conducted between the two methods. RESULTS: Normal myocardium, infarcted myocardium, and blood are correctly identified using the proposed method. The clustered images correctly identified 90 +/- 4% of the pixels defined as infarct in the DE images. In addition, 89 +/- 5% of the pixels defined as infarct in the clustered images were also defined as infarct in DE images. The Bland-Altman results show no bias between the two methods in identifying MI. CONCLUSION: The proposed technique allows for objectively identifying divergent heart tissues, which would be potentially important for clinical decision-making in patients with MI.
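The fuzzy c-means stage named in this abstract can be sketched as follows. This is a generic textbook FCM on one-dimensional intensities, not the authors' multi-stage method: the ISODATA stage is not shown, and the function name `fuzzy_c_means`, the fuzzifier `m = 2` and the toy intensities are assumptions for illustration.

```python
def fuzzy_c_means(values, centers, m=2.0, iterations=50):
    """Plain fuzzy c-means on 1-D data.

    Returns the final centers and the memberships computed in the last
    iteration (one row per value, one column per cluster).
    """
    centers = list(centers)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for x in values:
            dists = [abs(x - c) for c in centers]
            if 0.0 in dists:
                # The value sits exactly on a center: full membership there.
                hit = dists.index(0.0)
                row = [1.0 if j == hit else 0.0 for j in range(len(centers))]
            else:
                # u_j is proportional to d_j^(-2 / (m - 1)), normalized to sum to 1.
                weights = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(weights)
                row = [w / total for w in weights]
            memberships.append(row)
        # Each center becomes the mean of the data weighted by u^m.
        centers = [
            sum(row[j] ** m * x for row, x in zip(memberships, values))
            / sum(row[j] ** m for row in memberships)
            for j in range(len(centers))
        ]
    return centers, memberships


# Hypothetical pixel intensities forming a dark and a bright group.
intensities = [0.10, 0.12, 0.09, 0.90, 0.88, 0.93]
final_centers, final_memberships = fuzzy_c_means(intensities, [0.0, 1.0])
print(final_centers)
```

Unlike hard clustering, each pixel keeps a graded membership in every cluster, which is one reason fuzzy methods are used where tissue boundaries are blurred.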
Abstract:
This work is concerned with the development and application of novel unsupervised learning methods, with two target applications in mind: the analysis of forensic case data and the classification of remote sensing images. First, a method based on a symbolic optimization of the inter-sample distance measure is proposed to improve the flexibility of spectral clustering algorithms, and is applied to the problem of forensic case data. This distance is optimized using a loss function related to the preservation of neighborhood structure between the input space and the space of principal components, and solutions are found using genetic programming. Results are compared to a variety of state-of-the-art clustering algorithms. Subsequently, a new large-scale clustering method based on a joint optimization of feature extraction and classification is proposed and applied to various databases, including two hyperspectral remote sensing images. The algorithm makes use of a functional model (e.g., a neural network) for clustering, which is trained by stochastic gradient descent. Results indicate that such a technique can easily scale to huge databases, can avoid the so-called out-of-sample problem, and can compete with or even outperform existing clustering algorithms on both artificial data and real remote sensing images. This is verified on small databases as well as very large problems.
Résumé: This research concerns the development and application of so-called unsupervised learning methods. The target applications of these methods are the analysis of forensic data and the classification of hyperspectral remote sensing images. First, an unsupervised classification methodology based on the symbolic optimization of an inter-sample distance measure is proposed. This measure is obtained by optimizing a cost function related to the preservation of a point's neighborhood structure between the space of the original variables and the space of principal components. The method is applied to the analysis of forensic data and compared with a range of existing methods. Second, a method based on a joint optimization of the variable selection and classification tasks is implemented in a neural network and applied to various databases, including two hyperspectral images. The neural network is trained with a stochastic gradient algorithm, which makes the technique applicable to very high resolution images. The results of applying this method show that such a technique can classify very large databases without difficulty and gives results that compare favorably with existing methods.
Abstract:
Aim: We test for the congruence between allele-based range boundaries (break zones) in silicicolous alpine plants and species-based break zones in the silicicolous flora of the European Alps. We also ask whether such break zones coincide with areas of large elevational variation. Location: The European Alps. Methods: On a regular grid laid across the entire Alps, we determined areas of allele- and species-based break zones using respective clustering algorithms, identifying discontinuities in cluster distributions (breaks), and quantifying integrated break densities (break zones). Discontinuities were identified based on the intra-specific genetic variation of 12 species and on the floristic distribution data from 239 species, respectively. Coincidence between the two types of break zones was tested using Spearman's correlation. Break zone densities were also regressed on topographical complexity to test for the effect of elevational variation. Results: We found that two main break zones in the distribution of alleles and species were significantly correlated. Furthermore, we show that these break zones are in topographically complex regions, characterized by massive elevational ranges owing to high mountains and deep glacial valleys. We detected a third break zone in the distribution of species in the eastern Alps, which is not correlated with topographic complexity, and which is also not evident from allelic distribution patterns. Species with the potential for long-distance dispersal tended to show larger distribution ranges than short-distance dispersers. Main conclusions: We suggest that the history of Pleistocene glaciations is the main driver of the congruence between allele-based and species-based distribution patterns, because occurrences of both species and alleles were subject to the same processes (such as extinction, migration and drift) that shaped the distributions of species and genetic lineages. Large elevational ranges have had a profound effect as a dispersal barrier for alleles during post-glacial immigration. Because plant species, unlike alleles, cannot spread via pollen but only via seed, and thus disperse less effectively, we conclude that species break zones are maintained over longer time spans and reflect more ancient patterns than allele break zones.
Conny Thiel-Egenter and Nadir Alvarez contributed equally to this paper and are considered joint first authors.
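The coincidence test mentioned in the Methods relies on Spearman's rank correlation. A minimal sketch of that statistic is given below, using average ranks for ties. The function names and the per-cell density values are hypothetical and serve only to show the computation; they are not data from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    return covariance / (var_x * var_y) ** 0.5


# Hypothetical break densities per grid cell for alleles and for species.
allele_density = [0.1, 0.4, 0.2, 0.9, 0.7]
species_density = [0.2, 0.5, 0.1, 0.8, 0.9]
print(spearman(allele_density, species_density))
```

Because only ranks enter the computation, the statistic captures any monotonic association between the two density maps, not just a linear one.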
Abstract:
Studying patterns of species distributions along elevation gradients is frequently used to identify the primary factors that determine the distribution, diversity and assembly of species. However, despite their crucial role in ecosystem functioning, our understanding of the distribution of below-ground fungi is still limited, calling for more comprehensive studies of fungal biogeography along environmental gradients at various scales (from regional to global). Here, we investigated the richness of taxa of soil fungi and their phylogenetic diversity across a wide range of grassland types along a 2800 m elevation gradient at a large number of sites (213), stratified across a region of the Western Swiss Alps (700 km²). We used 454 pyrosequencing to obtain fungal sequences that were clustered into operational taxonomic units (OTUs). The OTU diversity-area relationship revealed an uneven distribution of fungal taxa across the study area (i.e. not all taxa are everywhere) and fine-scale spatial clustering. Fungal richness and phylogenetic diversity were found to be higher at lower temperatures and under higher moisture conditions. Climatic and soil characteristics as well as plant community composition were related to OTU alpha, beta and phylogenetic diversity, with distinct fungal lineages suggesting distinct ecological tolerances. Soil fungi thus show lineage-specific biogeographic patterns, even at a regional scale, and follow environmental determinism, mediated by interactions with plants.
Abstract:
The quality of environmental data analysis and the propagation of errors are heavily affected by the representativity of the initial sampling design [CRE 93, DEU 97, KAN 04a, LEN 06, MUL 07]. Geostatistical methods such as kriging depend on field samples, whose spatial distribution is crucial for the correct detection of the phenomena. The literature on the design of environmental monitoring networks (MN) is extensive, and several interesting books have recently been published [GRU 06, LEN 06, MUL 07] to clarify the basic principles of spatial sampling design. An approach to spatial sampling design (monitoring network optimization) based on Support Vector Machines has also been proposed. Nonetheless, modelers often receive real data from environmental monitoring networks that suffer from non-homogeneity (clustering). Clustering can be related to preferential sampling or to the impossibility of reaching certain regions.
Abstract:
We present a new global method for the identification of hotspots in conservation and ecology. The method is based on the identification of spatial structure properties through cumulative relative frequency distribution curves, and is tested with two case studies: the identification of fish density hotspots and of terrestrial vertebrate species diversity hotspots. Results from the frequency distribution method are compared with those from standard techniques among local, partially local and global methods. The main advantage of our approach is that it is independent of the selection of any threshold, neighborhood, or other parameter that affects most of the currently available methods for hotspot analysis. The two case studies show how such elements of arbitrariness in the traditional methods influence both the size and the location of the identified hotspots, and how this new global method can be used for a more objective selection of hotspots.
Abstract:
PURPOSE: The natural history of prostate cancer might be driven by the index lesion. We determined the percent of men in whom the index lesion could be defined using transperineal template prostate mapping biopsies. MATERIALS AND METHODS: Included in study were consecutive men undergoing transperineal template prostate mapping biopsies with biopsies grouped into 20 zones. Men with clinically significant disease in only 1 prostate area were considered to have an identifiable index lesion. We evaluated the impact of using 2 definitions of clinically significant disease (Gleason grade pattern 4 and/or lesion volume 0.5 cc or greater) and 2 clustering rules (stringent and tolerant) to define the index lesion. RESULTS: Included in study were 391 men with a median age of 62 years (IQR 58-67) and a median prostate specific antigen of 6.9 ng/ml (IQR 4.8-10.0). Of the men 269 (69%) were previously diagnosed with prostate cancer. By deploying a median of 1.2 cores per ml (IQR 0.9-1.7) cancer was diagnosed in 82.9% of the men (324 of 391) with a median of 6 positive cores (IQR 2-9), a median maximum cancer core length of 5 mm (IQR 3-8) and a total cancer core length per zone of 7 mm (IQR 3-13). Insignificant disease was found in 26.3% to 42.9% of cases. When a stringent spatial relationship was used to define individual lesions, 44.4% to 54.6% of patients had 1 index lesion and 12.7% to 19.1% had more than 1 area with clinically significant disease. These proportions changed to 46.6% to 59.2% and 10.5% to 14.5%, respectively, when less stringent spatial clustering was applied. CONCLUSIONS: Transperineal template prostate mapping biopsies enable the index lesion to be localized in most men with clinically significant disease. This information may be important to select appropriate candidates for targeted therapy and to plan a tailored treatment strategy in men undergoing radical therapy.
Abstract:
This paper presents general problems and approaches for spatial data analysis using machine learning algorithms. Machine learning is a very powerful approach to adaptive data analysis, modelling and visualisation. The key feature of machine learning algorithms is that they learn from empirical data and can be used in cases when the modelled environmental phenomena are hidden, nonlinear, noisy and highly variable in space and in time. Most machine learning algorithms are universal and adaptive modelling tools developed to solve basic problems of learning from data: classification/pattern recognition, regression/mapping and probability density modelling. In the present report, some of the widely used machine learning algorithms, namely artificial neural networks (ANN) of different architectures and Support Vector Machines (SVM), are adapted to the problems of the analysis and modelling of geo-spatial data. Machine learning algorithms have an important advantage over traditional models of spatial statistics when problems are considered in high-dimensional geo-feature spaces, i.e. when the dimension of the space exceeds 5. Such features are usually generated, for example, from digital elevation models, remote sensing images, etc. An important extension of the models concerns taking into account real-space constraints such as geomorphology, networks, and other natural structures. Recent developments in semi-supervised learning can improve the modelling of environmental phenomena by taking geo-manifolds into account. An important part of the study deals with the analysis of relevant variables and model inputs. This problem is approached using different nonlinear feature selection/feature extraction tools.
To demonstrate the application of machine learning algorithms, several interesting case studies are considered: digital soil mapping using SVM, automatic mapping of soil and water system pollution using ANN; natural hazards risk analysis (avalanches, landslides), assessments of renewable resources (wind fields) with SVM and ANN models, etc. The dimensionality of the spaces considered varies from 2 to more than 30. Figures 1, 2, 3 demonstrate some results of the studies and their outputs. Finally, the results of environmental mapping are discussed and compared with traditional models of geostatistics.
Abstract:
The distribution of socio-economic features in urban space is an important source of information for land and transportation planning. The metropolization phenomenon has changed the distribution of types of professions in space and has given rise to different spatial patterns that the urban planner must know in order to plan a sustainable city. Such distributions can be discovered by statistical and learning algorithms through different methods. In this paper, an unsupervised classification method and a cluster detection method are discussed and applied to analyze the socio-economic structure of Switzerland. The unsupervised classification method, based on Ward's classification and self-organized maps, is used to classify the municipalities of the country and makes it possible to reduce high-dimensional input information in order to interpret the socio-economic landscape. The cluster detection method, the spatial scan statistics, is used in a more specific manner in order to detect hot spots of certain types of service activities. The method is applied to the distribution services in the agglomeration of Lausanne. Results show the emergence of new centralities and can be analyzed in both transportation and social terms.
Abstract:
The analysis of rockfall characteristics and spatial distribution is fundamental to understanding and modelling the main factors that predispose to failure. In our study we analysed LiDAR point clouds with the aim of (1) detecting and characterising single rockfalls and (2) investigating their spatial distribution. To this end, different clustering algorithms were applied: 1a) Nearest Neighbour Clutter Removal (NNCR) in combination with the Expectation-Maximization (EM) algorithm, in order to separate feature points from clutter; 1b) a density-based algorithm (DBSCAN), to isolate the single clusters (i.e. the rockfall events); 2) finally, we computed Ripley's K-function to investigate the global spatial pattern of the extracted rockfalls. The method allowed proper identification and characterization of more than 600 rockfalls that occurred on a cliff located in Puigcercos (Catalonia, Spain) during a time span of six months. The spatial distribution of these events showed that rockfalls were clustered within a well-defined distance range. Computations were carried out using R, a free software environment for statistical computing and graphics. Understanding the spatial distribution of precursory rockfalls may shed light on the forecasting of future failures.
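Step 2 of this abstract relies on Ripley's K-function. The sketch below shows the naive estimator K(r) = A / (n(n-1)) * (number of ordered pairs within distance r), with no edge correction, and compares it to the value pi * r^2 expected under complete spatial randomness. The study itself used R. The Python function names, the toy coordinates and the omission of edge correction are assumptions made for this illustration only.

```python
from math import hypot, pi


def ripley_k(points, radius, area):
    """Naive Ripley's K estimate for 2-D points (no edge correction)."""
    n = len(points)
    close_pairs = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            # Count ordered pairs (i, j), i != j, lying within the radius.
            if i != j and hypot(xi - xj, yi - yj) <= radius:
                close_pairs += 1
    return area * close_pairs / (n * (n - 1))


def csr_k(radius):
    """Theoretical K under complete spatial randomness in the plane."""
    return pi * radius ** 2


# Hypothetical event centroids: two tight pairs far apart in a 10 x 10 window.
events = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
observed = ripley_k(events, radius=0.5, area=100.0)
print(observed, csr_k(0.5))
```

An estimate well above `csr_k(radius)` at a given radius points to clustering at that distance. A real analysis would add an edge correction and simulation envelopes before drawing conclusions.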
Abstract:
The algorithmic approach to data modelling has developed rapidly in recent years; in particular, methods based on data mining and machine learning have been used in a growing number of applications. These methods follow a data-driven methodology, aiming at providing the best possible generalization and predictive abilities instead of concentrating on the properties of the data model. One of the most successful groups of such methods is known as Support Vector algorithms. Following the fruitful developments in applying Support Vector algorithms to spatial data, this paper introduces a new extension of the traditional support vector regression (SVR) algorithm. This extension allows for the simultaneous modelling of environmental data at several spatial scales. The joint influence of environmental processes presenting different patterns at different scales is learned automatically from data, providing the optimum mixture of short- and large-scale models. The method is adaptive to the spatial scale of the data. With this advantage, it can provide an efficient means of modelling local anomalies that may typically arise in the early phase of an environmental emergency. However, the proposed approach still requires some prior knowledge of the possible existence of such short-scale patterns. This is a possible limitation of the method for its implementation in early warning systems. The purpose of this paper is to present the multi-scale SVR model and to illustrate its use with an application to the mapping of Cs137 activity given the measurements taken in the region of Briansk following the Chernobyl accident.
Abstract:
In recent years, multi-atlas fusion methods have gained significant attention in medical image segmentation. In this paper, we propose a general Markov Random Field (MRF) based framework that can perform edge-preserving smoothing of the labels at the time of fusing the labels themselves. More specifically, we formulate the label fusion problem with MRF-based neighborhood priors as an energy minimization problem containing a unary data term and a pairwise smoothness term. We present how existing fusion methods such as majority voting, global weighted voting and local weighted voting can be reframed to profit from the proposed framework, generating more accurate as well as more contiguous segmentations by getting rid of holes and islands. The proposed framework is evaluated for segmenting lymph nodes in 3D head and neck CT images. A comparison of various fusion algorithms is also presented.
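To make the contrast between plain voting and an energy with a smoothness term concrete, here is a small sketch on a one-dimensional row of voxels. The majority vote is the standard rule. The energy function is only an assumed stand-in, with a unary term counting atlas disagreements and a Potts-style pairwise penalty weighted by a hypothetical `beta`; the paper's actual terms are not given in the abstract.

```python
from collections import Counter


def majority_vote(atlas_labels):
    """Fuse labels voxel by voxel; ties go to the label seen first."""
    fused = []
    for votes in zip(*atlas_labels):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


def mrf_energy(labels, atlas_labels, beta=1.0):
    """Assumed energy: atlas disagreements plus beta per label change.

    The unary term counts atlases that disagree with the chosen label at
    each voxel; the pairwise term counts neighbouring voxels whose labels
    differ (a Potts-style penalty, chosen here for illustration).
    """
    unary = sum(
        sum(1 for vote in votes if vote != label)
        for label, votes in zip(labels, zip(*atlas_labels))
    )
    pairwise = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return unary + beta * pairwise


# Three hypothetical atlases; two of them mark an isolated voxel as label 1.
atlases = [[0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
voted = majority_vote(atlases)          # keeps the one-voxel island
smoothed = [0, 0, 0, 0]                 # candidate without the island
print(voted, mrf_energy(voted, atlases), mrf_energy(smoothed, atlases))
```

With `beta = 1.0` the island-free labelling has the lower energy in this toy case. This shows how a pairwise term can remove isolated labels that voxel-wise voting keeps.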
Abstract:
Defining an efficient training set is one of the most delicate phases for the success of remote sensing image classification routines. The complexity of the problem, the limited temporal and financial resources, as well as the high intraclass variance can make an algorithm fail if it is trained with a suboptimal dataset. Active learning aims at building efficient training sets by iteratively improving the model performance through sampling. A user-defined heuristic ranks the unlabeled pixels according to a function of the uncertainty of their class membership, and the user is then asked to provide labels for the most uncertain pixels. This paper reviews and tests the main families of active learning algorithms: committee, large margin, and posterior probability-based. For each of them, the most recent advances in the remote sensing community are discussed and some heuristics are detailed and tested. Several challenging remote sensing scenarios are considered, including very high spatial resolution and hyperspectral image classification. Finally, guidelines for choosing a suitable architecture are provided for new and/or inexperienced users.
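The ranking step described in this abstract can be illustrated with one simple posterior probability-based heuristic: sort the unlabeled pixels by the gap between their two largest class posteriors, smallest gap first. This is a sketch of one possible heuristic, not a summary of the methods reviewed in the paper. The function names and the posterior values are hypothetical.

```python
def posterior_margin(posterior):
    """Gap between the two highest class probabilities of one pixel."""
    top, second = sorted(posterior, reverse=True)[:2]
    return top - second


def rank_by_uncertainty(posteriors):
    """Return pixel indices ordered from most to least uncertain."""
    return sorted(range(len(posteriors)), key=lambda i: posterior_margin(posteriors[i]))


def select_for_labeling(posteriors, batch_size):
    """Pick the batch of pixels the user would be asked to label next."""
    return rank_by_uncertainty(posteriors)[:batch_size]


# Hypothetical class posteriors for three unlabeled pixels, three classes.
pixel_posteriors = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # nearly tied between two classes
    [0.60, 0.30, 0.10],  # moderately confident
]
print(rank_by_uncertainty(pixel_posteriors))
print(select_for_labeling(pixel_posteriors, batch_size=1))
```

In an active learning loop, the selected pixels would be labeled by the user and added to the training set, after which the model is retrained and the ranking repeated.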