959 results for Clustering methods
Abstract:
In recent years, new analytical tools have allowed researchers to extract historical information contained in molecular data, which has fundamentally transformed our understanding of processes ruling biological invasions. However, the use of these new analytical tools has been largely restricted to studies of terrestrial organisms despite the growing recognition that the sea contains ecosystems that are amongst the most heavily affected by biological invasions, and that marine invasion histories are often remarkably complex. Here, we studied the routes of invasion and colonisation histories of an invasive marine invertebrate Microcosmus squamiger (Ascidiacea) using microsatellite loci, mitochondrial DNA sequence data and 11 worldwide populations. Discriminant analysis of principal components, clustering methods and approximate Bayesian computation (ABC) methods showed that the most likely source of the introduced populations was a single admixture event that involved populations from two genetically differentiated ancestral regions - the western and eastern coasts of Australia. The ABC analyses revealed that colonisation of the introduced range of M. squamiger consisted of a series of non-independent introductions along the coastlines of Africa, North America and Europe. Furthermore, we inferred that the sequence of colonisation across continents was in line with historical taxonomic records - first the Mediterranean Sea and South Africa from an unsampled ancestral population, followed by sequential introductions in California and, more recently, the NE Atlantic Ocean. We revealed the most likely invasion history for world populations of M. squamiger, which is broadly characterized by the presence of multiple ancestral sources and non-independent introductions within the introduced range. 
The results presented here illustrate the complexity of marine invasion routes and identify a cause-effect relationship between human-mediated transport and the success of widespread marine non-indigenous species, which benefit from stepping-stone invasions and admixture processes involving different sources for the spread and expansion of their range.
Abstract:
This thesis develops a comprehensive and flexible statistical framework for the analysis and detection of space, time and space-time clusters of environmental point data. The developed clustering methods were applied to both simulated datasets and real-world environmental phenomena; however, only the cases of forest fires in the Canton of Ticino (Switzerland) and in Portugal are expounded in this document. Normally, environmental phenomena can be modelled as stochastic point processes where each event, e.g. the forest fire ignition point, is characterised by its spatial location and occurrence in time. Additionally, information such as burned area, ignition causes, land use, and topographic, climatic and meteorological features can also be used to characterise the studied phenomenon. The space-time pattern characterisation therefore represents a powerful tool to understand the distribution and behaviour of the events and their correlation with underlying processes, for instance, socio-economic, environmental and meteorological factors. Consequently, we propose a methodology based on the adaptation and application of statistical and fractal point process measures for both global (e.g. the Morisita Index, the Box-counting fractal method, the multifractal formalism and Ripley's K-function) and local (e.g. Scan Statistics) analysis. Many measures describing the space-time distribution of environmental phenomena have been proposed in a wide variety of disciplines; nevertheless, most of these measures are of global character and do not consider complex spatial constraints, high variability and the multivariate nature of the events.
Therefore, we propose a statistical framework that takes into account the complexities of the geographical space where phenomena take place, by introducing the Validity Domain concept and carrying out clustering analyses on data with different constrained geographical spaces, hence assessing the relative degree of clustering of the real distribution. Moreover, exclusively for the forest fire case, this research proposes two new methodologies: one for defining and mapping the Wildland-Urban Interface (WUI), described as the interaction zone between burnable vegetation and anthropogenic infrastructures, and one for predicting fire ignition susceptibility. In this regard, the main objective of this Thesis was to carry out basic statistical/geospatial research with a strong application part, to analyse and describe complex phenomena as well as to overcome unsolved methodological problems in the characterisation of space-time patterns, in particular forest fire occurrences. Thus, this Thesis provides a response to the increasing demand for both environmental monitoring and management tools for the assessment of natural and anthropogenic hazards and risks, sustainable development, retrospective success analysis, etc. The major contributions of this work were presented at national and international conferences and published in 5 scientific journals. National and international collaborations were also established and successfully accomplished. -- This thesis develops a complete and flexible statistical methodology for the analysis and detection of spatial, temporal and spatio-temporal structures in environmental data represented as point patterns. The methods developed here were applied to simulated datasets as well as to real environmental phenomena; however, only the cases of forest fires in the Canton of Ticino (Switzerland) and in Portugal are presented in this document.
Normally, environmental phenomena can be modelled as stochastic point processes in which each event, e.g. a forest fire ignition point, is determined by its spatial location and its occurrence in time. In addition, information such as burned area, ignition causes, land use, and topographic, climatic and meteorological characteristics can also be used to characterise the phenomenon studied. Consequently, defining the space-time structure is a powerful tool for understanding the distribution of the phenomenon and its correlation with underlying processes such as socio-economic, environmental and meteorological factors. We therefore propose a methodology based on the adaptation and application of statistical and fractal point process measures for global analysis (e.g. the Morisita index, the box-counting fractal dimension, the multifractal formalism and Ripley's K-function) and local analysis (e.g. the scan statistic). Many measures describing the space-time structures of environmental phenomena can be found in the literature. Nevertheless, most of these measures are global in character and do not consider complex spatial constraints, or the high variability and multivariate nature of the events. To this end, the methodology proposed here takes into account the complexities of the geographical space in which the phenomenon takes place, by introducing the Validity Domain concept and applying spatial analysis measures to data with different geographical constraints. This allows the relative degree of spatial/temporal aggregation of the structures of the observed phenomenon to be assessed.
In addition, exclusively for the forest fire case, this research also proposes two new methodologies: one for defining and mapping peri-urban zones, described as anthropogenic spaces close to wild vegetation or forest, and one for predicting fire ignition susceptibility. In this regard, the main objective of this Thesis was to carry out statistical/geospatial research with a strong application to real cases, in order to analyse and describe complex environmental phenomena as well as to overcome unsolved methodological problems in the characterisation of space-time structures, particularly those of forest fire occurrences. Thus, this Thesis provides a response to the growing demand from environmental management and monitoring for tools to assess natural and anthropogenic risks and hazards. The major contributions of this work were presented at national and international conferences and were also published in 5 peer-reviewed international journals. National and international collaborations were also established and successfully accomplished.
Abstract:
Discrete and continuous zero-inflated models have a wide range of applications and their properties are well known. Although there is work on discrete zero-deflated and zero-modified models, the usual formulation of continuous zero-inflated models (a mixture of a continuous density and a Dirac mass) prevents them from being generalised to cover the zero-deflated case. An alternative formulation of continuous zero-inflated models, which can easily be generalised to the zero-deflated case, is presented here. Estimation is first addressed under the classical paradigm, and several methods for obtaining the maximum likelihood estimators are proposed. The point estimation problem is also considered from the Bayesian point of view. Classical and Bayesian hypothesis tests for determining whether data are zero-inflated or zero-deflated are presented. The estimation and testing methods are also evaluated through simulation studies and applied to aggregated precipitation data. The various methods agree that the data are zero-deflated, demonstrating the relevance of the proposed model. We then consider the clustering of samples of zero-deflated data. Since such data are strongly non-normal, it is reasonable to expect that common methods for determining the number of clusters perform poorly. We argue that Bayesian clustering, based on the marginal distribution of the observations, would take the particularities of the model into account, which would translate into better performance. Several clustering methods are compared through a simulation study, and the proposed method is applied to aggregated precipitation data from 28 measurement stations in British Columbia.
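For reference, the "usual formulation" mentioned in this abstract, a mixture of a Dirac mass at zero and a continuous density, is commonly written as below. The symbols $\pi$ (mixing weight), $\delta_0$ (Dirac mass at zero) and $g$ (continuous density) are notation assumed here for illustration; the abstract does not give its own notation or its alternative formulation.

```latex
f(y) \;=\; \pi\,\delta_0(y) \;+\; (1-\pi)\,g(y), \qquad 0 \le \pi \le 1 .
```

Because $\pi$ is a mixing weight, it can only add probability at zero on top of what $g$ provides, which is why this form covers zero inflation but cannot represent fewer zeros than the base model.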
Abstract:
A cluster is understood by most people as a large conglomerate of companies organised around a common objective, which in most cases is economic. Its purpose is to compete with other conglomerates on prices and quantities, something the companies could not do individually. Consequently, this union is used in the first instance to create both competitive and comparative advantages over the competition, which gives the union value, with the aim of producing customer loyalty and recall of all the products that the union offers. According to studies by various authors, clusters are often created not for an economic purpose but to develop a community profile that helps society and the organisations that compose it. The basis of the relationships lies in communication and in the various techniques that exist in that field to ensure the sustainability of the organisation. Within these relationships, recognition is given to the education and culture of the place where the cluster is located, since the strategies implemented relate directly to the needs of the customers, fostering in the community's thinking the ideas of longevity and sustainability as an effect of social development.
Abstract:
This thesis is structured in six chapters with the final objective of grounding and developing the mathematical tools needed for the classification of sets of fuzzy subsets. The theoretical core of the work consists of chapters 3, 4 and 5; the first two chapters are of a more general nature, and the last is an application of the previous ones to the classification of the countries of the European Union according to certain fuzzy characteristics. Chapter 1 analyses the different fuzzy connectives, paying special attention to those aspects that will have a specific application in other chapters. For this reason, orderings of families of t-norms are studied, given their importance for the transitivity of fuzzy relations. Verification of the law of the excluded middle is necessary to ensure that a significant set of generalised fuzzy measures, introduced in chapter 3, are reflexive. We study for which t-norms this property holds and introduce a new set of t-norms that satisfy this principle. Chapter 2 gives a general overview of fuzzy relations, focusing on the study of the transitive closure for any t-norm, the calculation of which is in many cases fundamental to carrying out the classification process. At the end of the chapter, a practical procedure is set out for calculating a fuzzy relation with the help of experts and statistical series. Chapter 3 is a monograph on fuzzy measures. The first objective is to relate the measures (or distances) commonly used in fuzzy applications to crisp set-theoretic measures. This is a different approach from the traditional geometric one. The main result is the introduction of a parametrised family of measures that satisfy sufficiently satisfactory set-theoretic properties.
The study of the verification of the law of the excluded middle finds its application here in the reflexivity of these measures, which are studied in some depth in certain particular cases. Chapter 4 is, to begin with, a review of the main fuzzy results and methods for classifying the elements of a single set of fuzzy subsets. It is here that the results on the orderings of the families of t-norms and t-conorms studied in chapter 1 are applied. A new clustering method is introduced, in which the matrix of the fuzzy relation is changed each time a new cluster is obtained. This method makes it possible to harmonise the methodology for calculating the fuzzy relation with the clustering method. Chapter 5 deals with the grouping of objects of different natures; that is, fuzzy subsets that belong to different sets. This theory has already been developed for the binary case; what is presented here is its generalisation to the n-ary case. Certain aspects of the projections of the relation onto a given space are then studied, together with the converse, the study of cylinders of predetermined relations. An application to the grouping of the counties (comarques) of Girona according to certain fuzzy variables is presented at the end of the chapter. The last chapter is eminently practical, since it applies what was studied mainly in chapters 3 and 4 to the classification of the countries of the European Union according to certain fuzzy characteristics. Time series and neural networks were used to make forecasts for future years. Several measures and clustering methods were used in order to compare the various dendrograms resulting from the clustering process. Finally, the appendices contain the statistical series used, their extrapolation, the calculations for constructing the matrices of the fuzzy relations, the measure matrices and their closures.
Abstract:
Clustering is defined as the grouping of similar items in a set, and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of its size and dimensionality, it is necessary to have efficient clustering methods. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K-clusters with associated centres of mass, and uses a squared error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree) to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.
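The greedy K-Means procedure described in this abstract can be illustrated with a minimal sketch of the standard (Lloyd-style) iteration. This is not the parallel KD-Tree formulation the authors propose; the function names, the random initialisation and the tolerance are assumptions chosen for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, tol=1e-9, seed=0):
    """Plain K-Means sketch; stops when the squared-error distortion stops improving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: initialise from k random data points
    previous_error = float("inf")
    error = previous_error
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        error = 0.0
        for p in points:
            distances = [squared_distance(p, c) for c in centroids]
            j = distances.index(min(distances))
            clusters[j].append(p)
            error += distances[j]  # distortion measured against the current centroids
        # Update step: move each centroid to the centre of mass of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
        # Convergence test on the squared-error distortion.
        if previous_error - error <= tol:
            break
        previous_error = error
    return centroids, error
```

Every iteration computes a distance from each point to each centroid, which is the cost that KD-Tree based variants aim to reduce.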
Abstract:
Cedrus atlantica (Pinaceae) is a large and exceptionally long-lived conifer native to the Rif and Atlas Mountains of North Africa. To assess levels and patterns of genetic diversity of this species, samples were obtained throughout the natural range in Morocco and from a forest plantation in Arbucies, Girona (Spain) and analyzed using RAPD markers. Within-population genetic diversity was high and comparable to that revealed by isozymes. Managed populations harbored levels of genetic variation similar to those found in their natural counterparts. Genotypic analyses of molecular variance (AMOVA) found that most variation was within populations, but significant differentiation was also found between populations, particularly in Morocco. Bayesian estimates of F_ST corroborated the AMOVA partitioning and provided evidence for population differentiation in C. atlantica. Both distance- and Bayesian-based clustering methods revealed that Moroccan populations comprise two genetically distinct groups. Within each group, estimates of population differentiation were close to those previously reported in other gymnosperms. These results are interpreted in the context of the postglacial history of the species and human impact. The high degree of among-group differentiation recorded here highlights the need for additional conservation measures for some Moroccan populations of C. atlantica.
Abstract:
K-Means is a popular clustering algorithm which adopts an iterative refinement procedure to determine data partitions and to compute their associated centres of mass, called centroids. The straightforward implementation of the algorithm is often referred to as `brute force' since it computes a proximity measure from each data point to each centroid at every iteration of the K-Means process. Efficient implementations of the K-Means algorithm have been predominantly based on multi-dimensional binary search trees (KD-Trees). A combination of an efficient data structure and geometrical constraints reduces the number of distance computations required at each iteration. In this work we present a general space partitioning approach for improving the efficiency and the scalability of the K-Means algorithm. We propose to adopt approximate hierarchical clustering methods to generate binary space partitioning trees, in contrast to KD-Trees. In the experimental analysis, we have tested the performance of the proposed Binary Space Partitioning K-Means (BSP-KM) when a divisive clustering algorithm is used. We have carried out extensive experimental tests to compare the proposed approach to the one based on KD-Trees (KD-KM) over a wide range of the parameter space. BSP-KM is more scalable than KD-KM, while keeping the deterministic nature of the `brute force' algorithm. In particular, the proposed space partitioning approach has been shown to overcome the well-known limitation of KD-Trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which KD-Trees have been used.
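To make the KD-Tree idea concrete, the sketch below stores a set of centroids in a KD-Tree and finds the nearest centroid to a query point while pruning subtrees that cannot contain a closer one. It illustrates the general KD-Tree technique only, not the BSP-KM or KD-KM implementations evaluated by the authors; the dictionary-based node layout and the function names are assumptions.

```python
def build_kd_tree(points, depth=0):
    """Build a KD-Tree by splitting on the median along a cycling axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (squared distance, point) of the stored point closest to query."""
    if node is None:
        return best
    d = sum((a - b) ** 2 for a, b in zip(node["point"], query))
    if best is None or d < best[0]:
        best = (d, node["point"])
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    # Search the side of the splitting plane that contains the query first.
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the other side only if the splitting plane is closer than the best match so far.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best
```

The pruning test relies on distances along a single axis, which becomes less selective as dimensionality grows; this is the high-dimensional limitation of KD-Trees that the abstract refers to.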
Abstract:
Unlike theoretical scale-free networks, most real networks exhibit multi-scale behavior, with nodes structured in different types of functional groups and communities. While the majority of approaches for classifying nodes in a complex network have relied on local measurements of the topology/connectivity around each node, valuable information about node functionality can be obtained from concentric (or hierarchical) measurements. This paper extends previous methodologies based on concentric measurements by studying the possibility of using agglomerative clustering methods to obtain a set of functional groups of nodes, considering the nodes of a particular institutional collaboration network that includes various known communities (departments of the University of Sao Paulo). Among the findings, we emphasize the scale-free nature of the network obtained, as well as the identification of different patterns of authorship emerging from different areas (e.g. human and exact sciences). Another interesting result concerns the relatively uniform distribution of hubs along concentric levels, in contrast to the non-uniform pattern found in theoretical scale-free networks such as the BA model. (C) 2008 Elsevier B.V. All rights reserved.
Abstract:
Andryala (Asteraceae: Cichorieae) is a little-known Mediterranean-Macaronesian genus whose taxonomy is much in need of revision. The aim of the present biosystematic study was to elucidate species relationships within this genus based on morphological and molecular data. In this study several taxa are recognised: 17 species, 14 subspecies, and 3 hybrids. Among these, 5 species are Macaronesian endemics (A. glandulosa, A. sparsiflora, A. crithmifolia Aiton, A. pinnatifida, and A. perezii), 4 species are Northwest African endemics (A. mogadorensis, A. maroccana, A. chevallieri, and A. nigricans) and one species is endemic to Romania (A. laevitomentosa). The historical background of taxonomic delimitation in the genus is addressed, from Linnaean to present-day concepts, as is the origin of the name Andryala. The origin of Asteraceae and the systematic position of Andryala are briefly summarised. The morphological study was based on a bibliographic review and the revision of 1066 specimens from 13 herbaria, as well as additional material collected during fieldwork. The variability of the morphological characters of the genus, including both vegetative taxonomic characters (root, stem, leaf and indumentum characters) and reproductive ones (inflorescence, floret, fruit and pappus characters), is assessed. Numerical analysis of the morphological data was performed using different similarity or dissimilarity measures and coefficients, as well as ordination and clustering methods. Results support the segregation of the recognised taxa, and the several analyses (using quantitative, binary or multi-state characters) are congruent in separating them. The proposed taxonomy for Andryala includes a new infra-generic classification, new taxa and new combinations and ranks, typifications and diagnostic keys (one for the species and several for subspecies).
For each taxon a list of synonyms, typification comments and a detailed description are provided, as well as comments on taxonomy and nomenclature and a brief discussion of karyology. Additionally, information on ecology, conservation status and distribution, and a list of studied material, are presented. Phylogenetic analyses based on different nuclear and chloroplast DNA markers, using Bayesian and maximum parsimony methods of inference, were performed. Results support three main lineages: separate ones for the relict species A. agardhii and A. laevitomentosa and a third including the majority of the Andryala species, which underwent a relatively rapid and recent speciation. They also suggest a single colonization event of Madeira and the Canary Islands from the Mediterranean region, followed by insular speciation. Biogeography and speciation within the genus are briefly discussed, including a proposal for the centre of origin of the genus and possible dispersal routes.
Abstract:
The use of clustering methods for the discovery of cancer subtypes has drawn a great deal of attention in the scientific community. While bioinformaticians have proposed new clustering methods that take advantage of characteristics of the gene expression data, the medical community has a preference for using classic clustering methods. There have been no studies thus far performing a large-scale evaluation of different clustering methods in this context. This work presents the first large-scale analysis of seven different clustering methods and four proximity measures for the analysis of 35 cancer gene expression data sets. Results reveal that the finite mixture of Gaussians, followed closely by k-means, exhibited the best performance in terms of recovering the true structure of the data sets. These methods also exhibited, on average, the smallest difference between the actual number of classes in the data sets and the best number of clusters as indicated by our validation criteria. Furthermore, hierarchical methods, which have been widely used by the medical community, exhibited a poorer recovery performance than that of the other methods evaluated. Moreover, as a stable basis for the assessment and comparison of different clustering methods for cancer gene expression data, this study provides a common group of data sets (benchmark data sets) to be shared among researchers and used for comparisons with new methods.
Abstract:
Symbolic Data Analysis (SDA) aims to provide tools for reducing large databases in order to extract knowledge, and techniques for describing the units of such data as complex units such as intervals or histograms. The objective of this work is to extend classical clustering methods to symbolic interval data using interval-based distances. The main advantage of using an interval-based distance for interval data lies in the fact that it preserves the underlying imprecision of the intervals, which is usually lost when real-valued distances are applied. This work also includes an approach that allows existing indices to be adapted to the interval context. The proposed methods with interval-based distances are compared with the point-valued distances existing in the literature through experiments with simulated and real interval data.
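The contrast between a real-valued and an interval-valued distance can be sketched as follows. The abstract does not define its interval-based distance, so both functions are assumptions for illustration: `hausdorff_distance` is a common real-valued distance between intervals, and `interval_distance` is one natural interval-valued alternative, the range of |x - y| over all x in the first interval and y in the second.

```python
def hausdorff_distance(a, b):
    """Real-valued distance between intervals a = (lo, hi) and b = (lo, hi)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def interval_distance(a, b):
    """Interval-valued distance: (smallest, largest) possible |x - y| for x in a, y in b."""
    # The smallest gap is zero when the intervals overlap, else the space between them.
    smallest = max(0.0, b[0] - a[1], a[0] - b[1])
    # The largest gap is attained between opposite end points.
    largest = max(abs(b[1] - a[0]), abs(a[1] - b[0]))
    return (smallest, largest)
```

For the overlapping intervals (1, 3) and (2, 5), the real-valued distance collapses the comparison to a single number, whereas the interval-valued result keeps both the possibility that the underlying values coincide and the largest separation they could have.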
Abstract:
Sixty-nine Psidium accessions, collected in six Brazilian states, were analysed by two non-hierarchical clustering methods and by principal components (PC) analysis, with the aim of guiding breeding programmes. The variables analysed were ascorbic acid, β-carotene, lycopene, total phenols, total flavonoids, antioxidant activity, titratable acidity, soluble solids, total soluble sugars, moisture content, lateral and transverse fruit diameter, pulp weight and seed weight per fruit, and number and yield of fruits per plant. Specific groupings of the araçá (araçazeiro) accessions were observed with the Tocher and k-means methods and in the three-dimensional scatter of the four PCs. The araçá accessions were separated from the guava accessions. No specific grouping by state of collection was observed, indicating the absence of barriers to the propagation of guava accessions. The analyses suggest prospecting a larger number of germplasm samples in a smaller number of regions, as well as divergent accessions with a high content of nutritional compounds.
Abstract:
The objective of this work was to compare different multivariate techniques in the characterisation of 35 sesame genotypes using 769 RAPD markers. Genetic distances were obtained as the arithmetic complement of the Jaccard coefficient, and the genotypes were grouped by the hierarchical nearest-neighbour, farthest-neighbour and unweighted arithmetic average (UPGMA) methods, by Tocher's optimisation method and by principal coordinates analysis. The grouping of the genotypes changed depending on the method used. With the same genetic distance (0.36) adopted as the cut-off value, four groups were distinguished by the nearest-neighbour method, 13 by the farthest-neighbour method, 11 by UPGMA and four by Tocher. Among the hierarchical methods, UPGMA showed the best fit between the original and estimated distances (CCC = 0.89). The principal coordinates analyses confirmed the low diversity among the genotypes. The greatest divergence occurred between the cultivars Seridó 1 and Arawaca 4, and the smallest between the genotypes VCR-101 and GP-3314. The first three principal coordinates accounted for 35.13% of the total variability, and 18 eigenvalues were needed to explain 81% of the genetic variation. The UPGMA and Tocher optimisation methods and the principal coordinates analyses are complementary in forming the groups.
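The distance and cut-off steps described in this abstract can be sketched in a few lines. The example computes the Jaccard complement distance between binary marker profiles and forms nearest-neighbour (single-linkage) groups at a cut-off, using the fact that cutting a single-linkage tree at height h yields the connected components of the graph that links every pair at distance h or less. The marker profiles and genotype labels in the usage note are hypothetical; only the 0.36 cut-off comes from the abstract.

```python
def jaccard_distance(a, b):
    """Arithmetic complement of the Jaccard coefficient for two 0/1 marker profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - both / either if either else 0.0


def single_linkage_groups(profiles, cutoff):
    """Group labels whose profiles are chained together by distances <= cutoff."""
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if jaccard_distance(profiles[u], profiles[v]) <= cutoff:
                parent[find(u)] = find(v)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())
```

With hypothetical profiles `{"g1": [1, 1, 1, 0, 0], "g2": [1, 1, 0, 0, 0], "g3": [0, 0, 0, 1, 1]}` and a cut-off of 0.36, g1 and g2 (distance 1/3) fall into one group and g3 forms its own. Farthest-neighbour and UPGMA merge on the largest and the average pairwise distance respectively, which is why the same cut-off produces different numbers of groups.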
Abstract:
Flavonoid compounds were analyzed in ripe fruit pulp of ten species of Coffea, including two cultivars of C. arabica and two of C. canephora. Three coefficients of similarity (Simple-Matching, Jaccard and Ochiai) and three different clustering methods (Single Linkage, Complete Linkage and Unweighted Pair Group Using Arithmetic Averages, UPGMA) were used to analyze the data. The Jaccard and Ochiai coefficients of association gave a more coherent result when compared with taxonomic and hybridization studies. Inclusion of Psilanthopsis kapakata in the genus Coffea, as C. kapakata, is justified by the similarity of this species with other studied species, and the clusters clearly place the species C. arabica and C. eugenioides close together. The latter is one of the possible parents of the allotetraploid species C. arabica. C. congensis is the only species whose position remains ambiguous, probably because the plants of this species that were introduced into the Campinas collections were hybrids and not typical of C. congensis.
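The three similarity coefficients named in this abstract differ in how they treat shared absences. A minimal sketch, using the standard definitions on presence/absence vectors, is given below; the function names and the example vectors in the usage note are hypothetical and are not taken from the study's flavonoid data.

```python
import math


def match_counts(x, y):
    """Count a (both present), b (only x), c (only y) and d (both absent)."""
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if not p and q)
    d = sum(1 for p, q in zip(x, y) if not p and not q)
    return a, b, c, d


def simple_matching(x, y):
    """Share of characters on which the two taxa agree, counting joint absences."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)


def jaccard(x, y):
    """Shared presences over characters present in at least one taxon."""
    a, b, c, _ = match_counts(x, y)
    return a / (a + b + c) if (a + b + c) else 0.0


def ochiai(x, y):
    """Shared presences over the geometric mean of the two presence counts."""
    a, b, c, _ = match_counts(x, y)
    denominator = math.sqrt((a + b) * (a + c))
    return a / denominator if denominator else 0.0
```

For hypothetical profiles `[1, 1, 0, 1, 0, 0]` and `[1, 0, 0, 1, 1, 0]`, Simple-Matching gives 4/6, Jaccard 2/4 and Ochiai 2/3. Only Simple-Matching rewards compounds that are absent from both taxa, which may help explain why the Jaccard and Ochiai coefficients behave alike in the study.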