795 resultados para label hierarchical clustering
Resumo:
Spin systems in the presence of disorder are described by two sets of degrees of freedom, associated with orientational (spin) and disorder variables, which may be characterized by two distinct relaxation times. Disordered spin models have been mostly investigated in the quenched regime, which is the usual situation in solid state physics, and in which the relaxation time of the disorder variables is much larger than the typical measurement times. In this quenched regime, disorder variables are fixed, and only the orientational variables are duly thermalized. Recent studies in the context of lattice statistical models for the phase diagrams of nematic liquid-crystalline systems have stimulated the interest of going beyond the quenched regime. The phase diagrams predicted by these calculations for a simple Maier-Saupe model turn out to be qualitative different from the quenched case if the two sets of degrees of freedom are allowed to reach thermal equilibrium during the experimental time, which is known as the fully annealed regime. In this work, we develop a transfer matrix formalism to investigate annealed disordered Ising models on two hierarchical structures, the diamond hierarchical lattice (DHL) and the Apollonian network (AN). The calculations follow the same steps used for the analysis of simple uniform systems, which amounts to deriving proper recurrence maps for the thermodynamic and magnetic variables in terms of the generations of the construction of the hierarchical structures. In this context, we may consider different kinds of disorder, and different types of ferromagnetic and anti-ferromagnetic interactions. In the present work, we analyze the effects of dilution, which are produced by the removal of some magnetic ions. The system is treated in a “grand canonical" ensemble. The introduction of two extra fields, related to the concentration of two different types of particles, leads to higher-rank transfer matrices as compared with the formalism for the usual uniform models. Preliminary calculations on a DHL indicate that there is a phase transition for a wide range of dilution concentrations. Ising spin systems on the AN are known to be ferromagnetically ordered at all temperatures; in the presence of dilution, however, there are indications of a disordered (paramagnetic) phase at low concentrations of magnetic ions.
Resumo:
Given a large image set, in which very few images have labels, how to guess labels for the remaining majority? How to spot images that need brand new labels different from the predefined ones? How to summarize these data to route the user’s attention to what really matters? Here we answer all these questions. Specifically, we propose QuMinS, a fast, scalable solution to two problems: (i) Low-labor labeling (LLL) – given an image set, very few images have labels, find the most appropriate labels for the rest; and (ii) Mining and attention routing – in the same setting, find clusters, the top-'N IND.O' outlier images, and the 'N IND.R' images that best represent the data. Experiments on satellite images spanning up to 2.25 GB show that, contrasting to the state-of-the-art labeling techniques, QuMinS scales linearly on the data size, being up to 40 times faster than top competitors (GCap), still achieving better or equal accuracy, it spots images that potentially require unpredicted labels, and it works even with tiny initial label sets, i.e., nearly five examples. We also report a case study of our method’s practical usage to show that QuMinS is a viable tool for automatic coffee crop detection from remote sensing images.
Resumo:
The present work proposes a method based on CLV (Clustering around Latent Variables) for identifying groups of consumers in L-shape data. This kind of datastructure is very common in consumer studies where a panel of consumers is asked to assess the global liking of a certain number of products and then, preference scores are arranged in a two-way table Y. External information on both products (physicalchemical description or sensory attributes) and consumers (socio-demographic background, purchase behaviours or consumption habits) may be available in a row descriptor matrix X and in a column descriptor matrix Z respectively. The aim of this method is to automatically provide a consumer segmentation where all the three matrices play an active role in the classification, getting homogeneous groups from all points of view: preference, products and consumer characteristics. The proposed clustering method is illustrated on data from preference studies on food products: juices based on berry fruits and traditional cheeses from Trentino. The hedonic ratings given by the consumer panel on the products under study were explained with respect to the product chemical compounds, sensory evaluation and consumer socio-demographic information, purchase behaviour and consumption habits.
Resumo:
The main aim of this Ph.D. dissertation is the study of clustering dependent data by means of copula functions with particular emphasis on microarray data. Copula functions are a popular multivariate modeling tool in each field where the multivariate dependence is of great interest and their use in clustering has not been still investigated. The first part of this work contains the review of the literature of clustering methods, copula functions and microarray experiments. The attention focuses on the K–means (Hartigan, 1975; Hartigan and Wong, 1979), the hierarchical (Everitt, 1974) and the model–based (Fraley and Raftery, 1998, 1999, 2000, 2007) clustering techniques because their performance is compared. Then, the probabilistic interpretation of the Sklar’s theorem (Sklar’s, 1959), the estimation methods for copulas like the Inference for Margins (Joe and Xu, 1996) and the Archimedean and Elliptical copula families are presented. In the end, applications of clustering methods and copulas to the genetic and microarray experiments are highlighted. The second part contains the original contribution proposed. A simulation study is performed in order to evaluate the performance of the K–means and the hierarchical bottom–up clustering methods in identifying clusters according to the dependence structure of the data generating process. Different simulations are performed by varying different conditions (e.g., the kind of margins (distinct, overlapping and nested) and the value of the dependence parameter ) and the results are evaluated by means of different measures of performance. In light of the simulation results and of the limits of the two investigated clustering methods, a new clustering algorithm based on copula functions (‘CoClust’ in brief) is proposed. The basic idea, the iterative procedure of the CoClust and the description of the written R functions with their output are given. The CoClust algorithm is tested on simulated data (by varying the number of clusters, the copula models, the dependence parameter value and the degree of overlap of margins) and is compared with the performance of model–based clustering by using different measures of performance, like the percentage of well–identified number of clusters and the not rejection percentage of H0 on . It is shown that the CoClust algorithm allows to overcome all observed limits of the other investigated clustering techniques and is able to identify clusters according to the dependence structure of the data independently of the degree of overlap of margins and the strength of the dependence. The CoClust uses a criterion based on the maximized log–likelihood function of the copula and can virtually account for any possible dependence relationship between observations. Many peculiar characteristics are shown for the CoClust, e.g. its capability of identifying the true number of clusters and the fact that it does not require a starting classification. Finally, the CoClust algorithm is applied to the real microarray data of Hedenfalk et al. (2001) both to the gene expressions observed in three different cancer samples and to the columns (tumor samples) of the whole data matrix.
Resumo:
[EN]We present a new strategy for constructing spline spaces over hierarchical T-meshes with quad- and octree subdivision scheme. The proposed technique includes some simple rules for inferring local knot vectors to define C 2 -continuous cubic tensor product spline blending functions. Our conjecture is that these rules allow to obtain, for a given T-mesh, a set of linearly independent spline functions with the property that spaces spanned by nested T-meshes are also nested, and therefore, the functions can reproduce cubic polynomials. In order to span spaces with these properties applying the proposed rules, the T-mesh should fulfill the only requirement of being a 0- balanced mesh...
Resumo:
The intensity of regional specialization in specific activities, and conversely, the level of industrial concentration in specific locations, has been used as a complementary evidence for the existence and significance of externalities. Additionally, economists have mainly focused the debate on disentangling the sources of specialization and concentration processes according to three vectors: natural advantages, internal, and external scale economies. The arbitrariness of partitions plays a key role in capturing these effects, while the selection of the partition would have to reflect the actual characteristics of the economy. Thus, the identification of spatial boundaries to measure specialization becomes critical, since most likely the model will be adapted to different scales of distance, and be influenced by different types of externalities or economies of agglomeration, which are based on the mechanisms of interaction with particular requirements of spatial proximity. This work is based on the analysis of the spatial aspect of economic specialization supported by the manufacturing industry case. The main objective is to propose, for discrete and continuous space: i) a measure of global specialization; ii) a local disaggregation of the global measure; and iii) a spatial clustering method for the identification of specialized agglomerations.
Resumo:
This PhD Thesis is part of a long-term wide research project, carried out by the "Osservatorio Astronomico di Bologna (INAF-OABO)", that has as primary goal the comprehension and reconstruction of formation mechanism of galaxies and their evolution history. There is now substantial evidence, both from theoretical and observational point of view, in favor of the hypothesis that the halo of our Galaxy has been at least partially, built up by the progressive accretion of small fragments, similar in nature to the present day dwarf galaxies of the Local Group. In this context, the photometric and spectroscopic study of systems which populate the halo of our Galaxy (i.e. dwarf spheroidal galaxy, tidal streams, massive globular cluster, etc) permits to discover, not only the origin and behaviour of these systems, but also the structure of our Galactic halo, combined with its formation history. In fact, the study of the population of these objects and also of their chemical compositions, age, metallicities and velocity dispersion, permit us not only an improvement in the understanding of the mechanisms that govern the Galactic formation, but also a valid indirect test for cosmological model itself. Specifically, in this Thesis we provided a complete characterization of the tidal Stream of the Sagittarius dwarf spheroidal galaxy, that is the most striking example of the process of tidal disruption and accretion of a dwarf satellite in to our Galaxy. Using Red Clump stars, extracted from the catalogue of the Sloan Digital Sky Survey (SDSS) we obtained an estimate of the distance, the depth along the line of sight and of the number density for each detected portion of the Stream (and more in general for each detected structure along our line of sight). Moreover comparing the relative number (i.e. the ratio) of Blue Horizontal Branch stars and Red Clump stars (the two features are tracers of different age/different metallicity populations) in the main body of the galaxy and in the Stream, in order to verify the presence of an age-metallicity gradient along the Stream. We also report the detection of a population of Red Clump stars probably associated with the recently discovered Bootes III stellar system. Finally, we also present the results of a survey of radial velocities over a wide region, extending from r ~ 10' out to r ~ 80' within the massive star cluster Omega Centauri. The survey was performed with FLAMES@VLT, to study the velocity dispersion profile in the outer regions of this stellar system. All the results presented in this Thesis, have already been published in refeered journals.
Resumo:
There are different ways to do cluster analysis of categorical data in the literature and the choice among them is strongly related to the aim of the researcher, if we do not take into account time and economical constraints. Main approaches for clustering are usually distinguished into model-based and distance-based methods: the former assume that objects belonging to the same class are similar in the sense that their observed values come from the same probability distribution, whose parameters are unknown and need to be estimated; the latter evaluate distances among objects by a defined dissimilarity measure and, basing on it, allocate units to the closest group. In clustering, one may be interested in the classification of similar objects into groups, and one may be interested in finding observations that come from the same true homogeneous distribution. But do both of these aims lead to the same clustering? And how good are clustering methods designed to fulfil one of these aims in terms of the other? In order to answer, two approaches, namely a latent class model (mixture of multinomial distributions) and a partition around medoids one, are evaluated and compared by Adjusted Rand Index, Average Silhouette Width and Pearson-Gamma indexes in a fairly wide simulation study. Simulation outcomes are plotted in bi-dimensional graphs via Multidimensional Scaling; size of points is proportional to the number of points that overlap and different colours are used according to the cluster membership.
Resumo:
Il task del data mining si pone come obiettivo l'estrazione automatica di schemi significativi da grandi quantità di dati. Un esempio di schemi che possono essere cercati sono raggruppamenti significativi dei dati, si parla in questo caso di clustering. Gli algoritmi di clustering tradizionali mostrano grossi limiti in caso di dataset ad alta dimensionalità, composti cioè da oggetti descritti da un numero consistente di attributi. Di fronte a queste tipologie di dataset è necessario quindi adottare una diversa metodologia di analisi: il subspace clustering. Il subspace clustering consiste nella visita del reticolo di tutti i possibili sottospazi alla ricerca di gruppi signicativi (cluster). Una ricerca di questo tipo è un'operazione particolarmente costosa dal punto di vista computazionale. Diverse ottimizzazioni sono state proposte al fine di rendere gli algoritmi di subspace clustering più efficienti. In questo lavoro di tesi si è affrontato il problema da un punto di vista diverso: l'utilizzo della parallelizzazione al fine di ridurre il costo computazionale di un algoritmo di subspace clustering.
Resumo:
In questo lavoro di tesi si è studiato il clustering degli ammassi di galassie e la determinazione della posizione del picco BAO per ottenere vincoli sui parametri cosmologici. A tale scopo si è implementato un codice per la stima dell'errore tramite i metodi di jackknife e bootstrap. La misura del picco BAO confrontata con i modelli cosmologici, grazie all'errore stimato molto piccolo, è risultato in accordo con il modelli LambdaCDM, e permette di ottenere vincoli su alcuni parametri dei modelli cosmologici.