808 resultados para Agglomerative Hierarchical Clustering


Relevância:

20.00% 20.00%

Publicador:

Resumo:

Hierarchical multi-label classification is a complex classification task where the classes involved in the problem are hierarchically structured and each example may simultaneously belong to more than one class in each hierarchical level. In this paper, we extend our previous works, where we investigated a new local-based classification method that incrementally trains a multi-layer perceptron for each level of the classification hierarchy. Predictions made by a neural network in a given level are used as inputs to the neural network responsible for the prediction in the next level. We compare the proposed method with one state-of-the-art decision-tree induction method and two decision-tree induction methods, using several hierarchical multi-label classification datasets. We perform a thorough experimental analysis, showing that our method obtains competitive results to a robust global method regarding both precision and recall evaluation measures.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The present work proposes a method based on CLV (Clustering around Latent Variables) for identifying groups of consumers in L-shape data. This kind of datastructure is very common in consumer studies where a panel of consumers is asked to assess the global liking of a certain number of products and then, preference scores are arranged in a two-way table Y. External information on both products (physicalchemical description or sensory attributes) and consumers (socio-demographic background, purchase behaviours or consumption habits) may be available in a row descriptor matrix X and in a column descriptor matrix Z respectively. The aim of this method is to automatically provide a consumer segmentation where all the three matrices play an active role in the classification, getting homogeneous groups from all points of view: preference, products and consumer characteristics. The proposed clustering method is illustrated on data from preference studies on food products: juices based on berry fruits and traditional cheeses from Trentino. The hedonic ratings given by the consumer panel on the products under study were explained with respect to the product chemical compounds, sensory evaluation and consumer socio-demographic information, purchase behaviour and consumption habits.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The main aim of this Ph.D. dissertation is the study of clustering dependent data by means of copula functions with particular emphasis on microarray data. Copula functions are a popular multivariate modeling tool in each field where the multivariate dependence is of great interest and their use in clustering has not been still investigated. The first part of this work contains the review of the literature of clustering methods, copula functions and microarray experiments. The attention focuses on the K–means (Hartigan, 1975; Hartigan and Wong, 1979), the hierarchical (Everitt, 1974) and the model–based (Fraley and Raftery, 1998, 1999, 2000, 2007) clustering techniques because their performance is compared. Then, the probabilistic interpretation of the Sklar’s theorem (Sklar’s, 1959), the estimation methods for copulas like the Inference for Margins (Joe and Xu, 1996) and the Archimedean and Elliptical copula families are presented. In the end, applications of clustering methods and copulas to the genetic and microarray experiments are highlighted. The second part contains the original contribution proposed. A simulation study is performed in order to evaluate the performance of the K–means and the hierarchical bottom–up clustering methods in identifying clusters according to the dependence structure of the data generating process. Different simulations are performed by varying different conditions (e.g., the kind of margins (distinct, overlapping and nested) and the value of the dependence parameter ) and the results are evaluated by means of different measures of performance. In light of the simulation results and of the limits of the two investigated clustering methods, a new clustering algorithm based on copula functions (‘CoClust’ in brief) is proposed. The basic idea, the iterative procedure of the CoClust and the description of the written R functions with their output are given. The CoClust algorithm is tested on simulated data (by varying the number of clusters, the copula models, the dependence parameter value and the degree of overlap of margins) and is compared with the performance of model–based clustering by using different measures of performance, like the percentage of well–identified number of clusters and the not rejection percentage of H0 on . It is shown that the CoClust algorithm allows to overcome all observed limits of the other investigated clustering techniques and is able to identify clusters according to the dependence structure of the data independently of the degree of overlap of margins and the strength of the dependence. The CoClust uses a criterion based on the maximized log–likelihood function of the copula and can virtually account for any possible dependence relationship between observations. Many peculiar characteristics are shown for the CoClust, e.g. its capability of identifying the true number of clusters and the fact that it does not require a starting classification. Finally, the CoClust algorithm is applied to the real microarray data of Hedenfalk et al. (2001) both to the gene expressions observed in three different cancer samples and to the columns (tumor samples) of the whole data matrix.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

[EN]We present a new strategy for constructing spline spaces over hierarchical T-meshes with quad- and octree subdivision scheme. The proposed technique includes some simple rules for inferring local knot vectors to define C 2 -continuous cubic tensor product spline blending functions. Our conjecture is that these rules allow to obtain, for a given T-mesh, a set of linearly independent spline functions with the property that spaces spanned by nested T-meshes are also nested, and therefore, the functions can reproduce cubic polynomials. In order to span spaces with these properties applying the proposed rules, the T-mesh should fulfill the only requirement of being a 0- balanced mesh...

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The intensity of regional specialization in specific activities, and conversely, the level of industrial concentration in specific locations, has been used as a complementary evidence for the existence and significance of externalities. Additionally, economists have mainly focused the debate on disentangling the sources of specialization and concentration processes according to three vectors: natural advantages, internal, and external scale economies. The arbitrariness of partitions plays a key role in capturing these effects, while the selection of the partition would have to reflect the actual characteristics of the economy. Thus, the identification of spatial boundaries to measure specialization becomes critical, since most likely the model will be adapted to different scales of distance, and be influenced by different types of externalities or economies of agglomeration, which are based on the mechanisms of interaction with particular requirements of spatial proximity. This work is based on the analysis of the spatial aspect of economic specialization supported by the manufacturing industry case. The main objective is to propose, for discrete and continuous space: i) a measure of global specialization; ii) a local disaggregation of the global measure; and iii) a spatial clustering method for the identification of specialized agglomerations.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This PhD Thesis is part of a long-term wide research project, carried out by the "Osservatorio Astronomico di Bologna (INAF-OABO)", that has as primary goal the comprehension and reconstruction of formation mechanism of galaxies and their evolution history. There is now substantial evidence, both from theoretical and observational point of view, in favor of the hypothesis that the halo of our Galaxy has been at least partially, built up by the progressive accretion of small fragments, similar in nature to the present day dwarf galaxies of the Local Group. In this context, the photometric and spectroscopic study of systems which populate the halo of our Galaxy (i.e. dwarf spheroidal galaxy, tidal streams, massive globular cluster, etc) permits to discover, not only the origin and behaviour of these systems, but also the structure of our Galactic halo, combined with its formation history. In fact, the study of the population of these objects and also of their chemical compositions, age, metallicities and velocity dispersion, permit us not only an improvement in the understanding of the mechanisms that govern the Galactic formation, but also a valid indirect test for cosmological model itself. Specifically, in this Thesis we provided a complete characterization of the tidal Stream of the Sagittarius dwarf spheroidal galaxy, that is the most striking example of the process of tidal disruption and accretion of a dwarf satellite in to our Galaxy. Using Red Clump stars, extracted from the catalogue of the Sloan Digital Sky Survey (SDSS) we obtained an estimate of the distance, the depth along the line of sight and of the number density for each detected portion of the Stream (and more in general for each detected structure along our line of sight). Moreover comparing the relative number (i.e. the ratio) of Blue Horizontal Branch stars and Red Clump stars (the two features are tracers of different age/different metallicity populations) in the main body of the galaxy and in the Stream, in order to verify the presence of an age-metallicity gradient along the Stream. We also report the detection of a population of Red Clump stars probably associated with the recently discovered Bootes III stellar system. Finally, we also present the results of a survey of radial velocities over a wide region, extending from r ~ 10' out to r ~ 80' within the massive star cluster Omega Centauri. The survey was performed with FLAMES@VLT, to study the velocity dispersion profile in the outer regions of this stellar system. All the results presented in this Thesis, have already been published in refeered journals.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

There are different ways to do cluster analysis of categorical data in the literature and the choice among them is strongly related to the aim of the researcher, if we do not take into account time and economical constraints. Main approaches for clustering are usually distinguished into model-based and distance-based methods: the former assume that objects belonging to the same class are similar in the sense that their observed values come from the same probability distribution, whose parameters are unknown and need to be estimated; the latter evaluate distances among objects by a defined dissimilarity measure and, basing on it, allocate units to the closest group. In clustering, one may be interested in the classification of similar objects into groups, and one may be interested in finding observations that come from the same true homogeneous distribution. But do both of these aims lead to the same clustering? And how good are clustering methods designed to fulfil one of these aims in terms of the other? In order to answer, two approaches, namely a latent class model (mixture of multinomial distributions) and a partition around medoids one, are evaluated and compared by Adjusted Rand Index, Average Silhouette Width and Pearson-Gamma indexes in a fairly wide simulation study. Simulation outcomes are plotted in bi-dimensional graphs via Multidimensional Scaling; size of points is proportional to the number of points that overlap and different colours are used according to the cluster membership.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Il task del data mining si pone come obiettivo l'estrazione automatica di schemi significativi da grandi quantità di dati. Un esempio di schemi che possono essere cercati sono raggruppamenti significativi dei dati, si parla in questo caso di clustering. Gli algoritmi di clustering tradizionali mostrano grossi limiti in caso di dataset ad alta dimensionalità, composti cioè da oggetti descritti da un numero consistente di attributi. Di fronte a queste tipologie di dataset è necessario quindi adottare una diversa metodologia di analisi: il subspace clustering. Il subspace clustering consiste nella visita del reticolo di tutti i possibili sottospazi alla ricerca di gruppi signicativi (cluster). Una ricerca di questo tipo è un'operazione particolarmente costosa dal punto di vista computazionale. Diverse ottimizzazioni sono state proposte al fine di rendere gli algoritmi di subspace clustering più efficienti. In questo lavoro di tesi si è affrontato il problema da un punto di vista diverso: l'utilizzo della parallelizzazione al fine di ridurre il costo computazionale di un algoritmo di subspace clustering.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In questo lavoro di tesi si è studiato il clustering degli ammassi di galassie e la determinazione della posizione del picco BAO per ottenere vincoli sui parametri cosmologici. A tale scopo si è implementato un codice per la stima dell'errore tramite i metodi di jackknife e bootstrap. La misura del picco BAO confrontata con i modelli cosmologici, grazie all'errore stimato molto piccolo, è risultato in accordo con il modelli LambdaCDM, e permette di ottenere vincoli su alcuni parametri dei modelli cosmologici.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Lo scopo del clustering è quindi quello di individuare strutture nei dati significative, ed è proprio dalla seguente definizione che è iniziata questa attività di tesi , fornendo un approccio innovativo ed inesplorato al cluster, ovvero non ricercando la relazione ma ragionando su cosa non lo sia. Osservando un insieme di dati ,cosa rappresenta la non relazione? Una domanda difficile da porsi , che ha intrinsecamente la sua risposta, ovvero l’indipendenza di ogni singolo dato da tutti gli altri. La ricerca quindi dell’indipendenza tra i dati ha portato il nostro pensiero all’approccio statistico ai dati , in quanto essa è ben descritta e dimostrata in statistica. Ogni punto in un dataset, per essere considerato “privo di collegamenti/relazioni” , significa che la stessa probabilità di essere presente in ogni elemento spaziale dell’intero dataset. Matematicamente parlando , ogni punto P in uno spazio S ha la stessa probabilità di cadere in una regione R ; il che vuol dire che tale punto può CASUALMENTE essere all’interno di una qualsiasi regione del dataset. Da questa assunzione inizia il lavoro di tesi, diviso in più parti. Il secondo capitolo analizza lo stato dell’arte del clustering, raffrontato alla crescente problematica della mole di dati, che con l’avvento della diffusione della rete ha visto incrementare esponenzialmente la grandezza delle basi di conoscenza sia in termini di attributi (dimensioni) che in termini di quantità di dati (Big Data). Il terzo capitolo richiama i concetti teorico-statistici utilizzati dagli algoritimi statistici implementati. Nel quarto capitolo vi sono i dettagli relativi all’implementazione degli algoritmi , ove sono descritte le varie fasi di investigazione ,le motivazioni sulle scelte architetturali e le considerazioni che hanno portato all’esclusione di una delle 3 versioni implementate. Nel quinto capitolo gli algoritmi 2 e 3 sono confrontati con alcuni algoritmi presenti in letteratura, per dimostrare le potenzialità e le problematiche dell’algoritmo sviluppato , tali test sono a livello qualitativo , in quanto l’obbiettivo del lavoro di tesi è dimostrare come un approccio statistico può rivelarsi un’arma vincente e non quello di fornire un nuovo algoritmo utilizzabile nelle varie problematiche di clustering. Nel sesto capitolo saranno tratte le conclusioni sul lavoro svolto e saranno elencati i possibili interventi futuri dai quali la ricerca appena iniziata del clustering statistico potrebbe crescere.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Atmosphärische Aerosolpartikel wirken in vielerlei Hinsicht auf die Menschen und die Umwelt ein. Eine genaue Charakterisierung der Partikel hilft deren Wirken zu verstehen und dessen Folgen einzuschätzen. Partikel können hinsichtlich ihrer Größe, ihrer Form und ihrer chemischen Zusammensetzung charakterisiert werden. Mit der Laserablationsmassenspektrometrie ist es möglich die Größe und die chemische Zusammensetzung einzelner Aerosolpartikel zu bestimmen. Im Rahmen dieser Arbeit wurde das SPLAT (Single Particle Laser Ablation Time-of-flight mass spectrometer) zur besseren Analyse insbesondere von atmosphärischen Aerosolpartikeln weiterentwickelt. Der Aerosoleinlass wurde dahingehend optimiert, einen möglichst weiten Partikelgrößenbereich (80 nm - 3 µm) in das SPLAT zu transferieren und zu einem feinen Strahl zu bündeln. Eine neue Beschreibung für die Beziehung der Partikelgröße zu ihrer Geschwindigkeit im Vakuum wurde gefunden. Die Justage des Einlasses wurde mithilfe von Schrittmotoren automatisiert. Die optische Detektion der Partikel wurde so verbessert, dass Partikel mit einer Größe < 100 nm erfasst werden können. Aufbauend auf der optischen Detektion und der automatischen Verkippung des Einlasses wurde eine neue Methode zur Charakterisierung des Partikelstrahls entwickelt. Die Steuerelektronik des SPLAT wurde verbessert, so dass die maximale Analysefrequenz nur durch den Ablationslaser begrenzt wird, der höchsten mit etwa 10 Hz ablatieren kann. Durch eine Optimierung des Vakuumsystems wurde der Ionenverlust im Massenspektrometer um den Faktor 4 verringert.rnrnNeben den hardwareseitigen Weiterentwicklungen des SPLAT bestand ein Großteil dieser Arbeit in der Konzipierung und Implementierung einer Softwarelösung zur Analyse der mit dem SPLAT gewonnenen Rohdaten. CRISP (Concise Retrieval of Information from Single Particles) ist ein auf IGOR PRO (Wavemetrics, USA) aufbauendes Softwarepaket, das die effiziente Auswertung der Einzelpartikel Rohdaten erlaubt. CRISP enthält einen neu entwickelten Algorithmus zur automatischen Massenkalibration jedes einzelnen Massenspektrums, inklusive der Unterdrückung von Rauschen und von Problemen mit Signalen die ein intensives Tailing aufweisen. CRISP stellt Methoden zur automatischen Klassifizierung der Partikel zur Verfügung. Implementiert sind k-means, fuzzy-c-means und eine Form der hierarchischen Einteilung auf Basis eines minimal aufspannenden Baumes. CRISP bietet die Möglichkeit die Daten vorzubehandeln, damit die automatische Einteilung der Partikel schneller abläuft und die Ergebnisse eine höhere Qualität aufweisen. Daneben kann CRISP auf einfache Art und Weise Partikel anhand vorgebener Kriterien sortieren. Die CRISP zugrundeliegende Daten- und Infrastruktur wurde in Hinblick auf Wartung und Erweiterbarkeit erstellt. rnrnIm Rahmen der Arbeit wurde das SPLAT in mehreren Kampagnen erfolgreich eingesetzt und die Fähigkeiten von CRISP konnten anhand der gewonnen Datensätze gezeigt werden.rnrnDas SPLAT ist nun in der Lage effizient im Feldeinsatz zur Charakterisierung des atmosphärischen Aerosols betrieben zu werden, während CRISP eine schnelle und gezielte Auswertung der Daten ermöglicht.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In recent years, Deep Learning techniques have shown to perform well on a large variety of problems both in Computer Vision and Natural Language Processing, reaching and often surpassing the state of the art on many tasks. The rise of deep learning is also revolutionizing the entire field of Machine Learning and Pattern Recognition pushing forward the concepts of automatic feature extraction and unsupervised learning in general. However, despite the strong success both in science and business, deep learning has its own limitations. It is often questioned if such techniques are only some kind of brute-force statistical approaches and if they can only work in the context of High Performance Computing with tons of data. Another important question is whether they are really biologically inspired, as claimed in certain cases, and if they can scale well in terms of "intelligence". The dissertation is focused on trying to answer these key questions in the context of Computer Vision and, in particular, Object Recognition, a task that has been heavily revolutionized by recent advances in the field. Practically speaking, these answers are based on an exhaustive comparison between two, very different, deep learning techniques on the aforementioned task: Convolutional Neural Network (CNN) and Hierarchical Temporal memory (HTM). They stand for two different approaches and points of view within the big hat of deep learning and are the best choices to understand and point out strengths and weaknesses of each of them. CNN is considered one of the most classic and powerful supervised methods used today in machine learning and pattern recognition, especially in object recognition. CNNs are well received and accepted by the scientific community and are already deployed in large corporation like Google and Facebook for solving face recognition and image auto-tagging problems. HTM, on the other hand, is known as a new emerging paradigm and a new meanly-unsupervised method, that is more biologically inspired. It tries to gain more insights from the computational neuroscience community in order to incorporate concepts like time, context and attention during the learning process which are typical of the human brain. In the end, the thesis is supposed to prove that in certain cases, with a lower quantity of data, HTM can outperform CNN.