17 resultados para means clustering
em AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Resumo:
Although the debate of what data science is has a long history and has not reached a complete consensus yet, Data Science can be summarized as the process of learning from data. Guided by the above vision, this thesis presents two independent data science projects developed in the scope of multidisciplinary applied research. The first part analyzes fluorescence microscopy images typically produced in life science experiments, where the objective is to count how many marked neuronal cells are present in each image. Aiming to automate the task for supporting research in the area, we propose a neural network architecture tuned specifically for this use case, cell ResUnet (c-ResUnet), and discuss the impact of alternative training strategies in overcoming particular challenges of our data. The approach provides good results in terms of both detection and counting, showing performance comparable to the interpretation of human operators. As a meaningful addition, we release the pre-trained model and the Fluorescent Neuronal Cells dataset collecting pixel-level annotations of where neuronal cells are located. In this way, we hope to help future research in the area and foster innovative methodologies for tackling similar problems. The second part deals with the problem of distributed data management in the context of LHC experiments, with a focus on supporting ATLAS operations concerning data transfer failures. In particular, we analyze error messages produced by failed transfers and propose a Machine Learning pipeline that leverages the word2vec language model and K-means clustering. This provides groups of similar errors that are presented to human operators as suggestions of potential issues to investigate. The approach is demonstrated on one full day of data, showing promising ability in understanding the message content and providing meaningful groupings, in line with previously reported incidents by human operators.
Resumo:
Long-term monitoring of acoustical environments is gaining popularity thanks to the relevant amount of scientific and engineering insights that it provides. The increasing interest is due to the constant growth of storage capacity and computational power to process large amounts of data. In this perspective, machine learning (ML) provides a broad family of data-driven statistical techniques to deal with large databases. Nowadays, the conventional praxis of sound level meter measurements limits the global description of a sound scene to an energetic point of view. The equivalent continuous level Leq represents the main metric to define an acoustic environment, indeed. Finer analyses involve the use of statistical levels. However, acoustic percentiles are based on temporal assumptions, which are not always reliable. A statistical approach, based on the study of the occurrences of sound pressure levels, would bring a different perspective to the analysis of long-term monitoring. Depicting a sound scene through the most probable sound pressure level, rather than portions of energy, brought more specific information about the activity carried out during the measurements. The statistical mode of the occurrences can capture typical behaviors of specific kinds of sound sources. The present work aims to propose an ML-based method to identify, separate and measure coexisting sound sources in real-world scenarios. It is based on long-term monitoring and is addressed to acousticians focused on the analysis of environmental noise in manifold contexts. The presented method is based on clustering analysis. Two algorithms, Gaussian Mixture Model and K-means clustering, represent the main core of a process to investigate different active spaces monitored through sound level meters. The procedure has been applied in two different contexts: university lecture halls and offices. The proposed method shows robust and reliable results in describing the acoustic scenario and it could represent an important analytical tool for acousticians.
Resumo:
Bioinformatics, in the last few decades, has played a fundamental role to give sense to the huge amount of data produced. Obtained the complete sequence of a genome, the major problem of knowing as much as possible of its coding regions, is crucial. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As it has been recently pointed out by the Critical Assessment of Function Annotations (CAFA), most accurate methods are those based on the transfer-by-homology approach and the most incisive contribution is given by cross-genome comparisons. In the present thesis it is described a non-hierarchical sequence clustering method for protein automatic large-scale annotation, called “The Bologna Annotation Resource Plus” (BAR+). The method is based on an all-against-all alignment of more than 13 millions protein sequences characterized by a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) inside clusters by means of a statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three dimensional structure (when a template is available). This is possible by the way of cluster-specific HMM profiles that can be used to calculate reliable template-to-target alignments even in the case of distantly related proteins (sequence identity < 30%). Other BAR+ based applications have been developed during my doctorate including the prediction of Magnesium binding sites in human proteins, the ABC transporters superfamily classification and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. At present, as a web server for the functional and structural protein sequence annotation, BAR+ is freely available at http://bar.biocomp.unibo.it/bar2.0.
Resumo:
The present work proposes a method based on CLV (Clustering around Latent Variables) for identifying groups of consumers in L-shape data. This kind of datastructure is very common in consumer studies where a panel of consumers is asked to assess the global liking of a certain number of products and then, preference scores are arranged in a two-way table Y. External information on both products (physicalchemical description or sensory attributes) and consumers (socio-demographic background, purchase behaviours or consumption habits) may be available in a row descriptor matrix X and in a column descriptor matrix Z respectively. The aim of this method is to automatically provide a consumer segmentation where all the three matrices play an active role in the classification, getting homogeneous groups from all points of view: preference, products and consumer characteristics. The proposed clustering method is illustrated on data from preference studies on food products: juices based on berry fruits and traditional cheeses from Trentino. The hedonic ratings given by the consumer panel on the products under study were explained with respect to the product chemical compounds, sensory evaluation and consumer socio-demographic information, purchase behaviour and consumption habits.
Resumo:
The main aim of this Ph.D. dissertation is the study of clustering dependent data by means of copula functions with particular emphasis on microarray data. Copula functions are a popular multivariate modeling tool in each field where the multivariate dependence is of great interest and their use in clustering has not been still investigated. The first part of this work contains the review of the literature of clustering methods, copula functions and microarray experiments. The attention focuses on the K–means (Hartigan, 1975; Hartigan and Wong, 1979), the hierarchical (Everitt, 1974) and the model–based (Fraley and Raftery, 1998, 1999, 2000, 2007) clustering techniques because their performance is compared. Then, the probabilistic interpretation of the Sklar’s theorem (Sklar’s, 1959), the estimation methods for copulas like the Inference for Margins (Joe and Xu, 1996) and the Archimedean and Elliptical copula families are presented. In the end, applications of clustering methods and copulas to the genetic and microarray experiments are highlighted. The second part contains the original contribution proposed. A simulation study is performed in order to evaluate the performance of the K–means and the hierarchical bottom–up clustering methods in identifying clusters according to the dependence structure of the data generating process. Different simulations are performed by varying different conditions (e.g., the kind of margins (distinct, overlapping and nested) and the value of the dependence parameter ) and the results are evaluated by means of different measures of performance. In light of the simulation results and of the limits of the two investigated clustering methods, a new clustering algorithm based on copula functions (‘CoClust’ in brief) is proposed. The basic idea, the iterative procedure of the CoClust and the description of the written R functions with their output are given. The CoClust algorithm is tested on simulated data (by varying the number of clusters, the copula models, the dependence parameter value and the degree of overlap of margins) and is compared with the performance of model–based clustering by using different measures of performance, like the percentage of well–identified number of clusters and the not rejection percentage of H0 on . It is shown that the CoClust algorithm allows to overcome all observed limits of the other investigated clustering techniques and is able to identify clusters according to the dependence structure of the data independently of the degree of overlap of margins and the strength of the dependence. The CoClust uses a criterion based on the maximized log–likelihood function of the copula and can virtually account for any possible dependence relationship between observations. Many peculiar characteristics are shown for the CoClust, e.g. its capability of identifying the true number of clusters and the fact that it does not require a starting classification. Finally, the CoClust algorithm is applied to the real microarray data of Hedenfalk et al. (2001) both to the gene expressions observed in three different cancer samples and to the columns (tumor samples) of the whole data matrix.
Resumo:
In the past decade, the advent of efficient genome sequencing tools and high-throughput experimental biotechnology has lead to enormous progress in the life science. Among the most important innovations is the microarray tecnology. It allows to quantify the expression for thousands of genes simultaneously by measurin the hybridization from a tissue of interest to probes on a small glass or plastic slide. The characteristics of these data include a fair amount of random noise, a predictor dimension in the thousand, and a sample noise in the dozens. One of the most exciting areas to which microarray technology has been applied is the challenge of deciphering complex disease such as cancer. In these studies, samples are taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. these samples are hybridized to microarrays in an effort to find a small number of genes which are strongly correlated with the group of individuals. Eventhough today methods to analyse the data are welle developed and close to reach a standard organization (through the effort of preposed International project like Microarray Gene Expression Data -MGED- Society [1]) it is not unfrequant to stumble in a clinician's question that do not have a compelling statistical method that could permit to answer it.The contribution of this dissertation in deciphering disease regards the development of new approaches aiming at handle open problems posed by clinicians in handle specific experimental designs. In Chapter 1 starting from a biological necessary introduction, we revise the microarray tecnologies and all the important steps that involve an experiment from the production of the array, to the quality controls ending with preprocessing steps that will be used into the data analysis in the rest of the dissertation. While in Chapter 2 a critical review of standard analysis methods are provided stressing most of problems that In Chapter 3 is introduced a method to adress the issue of unbalanced design of miacroarray experiments. In microarray experiments, experimental design is a crucial starting-point for obtaining reasonable results. In a two-class problem, an equal or similar number of samples it should be collected between the two classes. However in some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose to address this issue by applying a modified version of SAM [2]. MultiSAM consists in a reiterated application of a SAM analysis, comparing the less populated class (LPC) with 1,000 random samplings of the same size from the more populated class (MPC) A list of the differentially expressed genes is generated for each SAM application. After 1,000 reiterations, each single probe given a "score" ranging from 0 to 1,000 based on its recurrence in the 1,000 lists as differentially expressed. The performance of MultiSAM was compared to the performance of SAM and LIMMA [3] over two simulated data sets via beta and exponential distribution. The results of all three algorithms over low- noise data sets seems acceptable However, on a real unbalanced two-channel data set reagardin Chronic Lymphocitic Leukemia, LIMMA finds no significant probe, SAM finds 23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical clustering. We also report extra-assay validation in terms of differentially expressed genes Although standard algorithms perform well over low-noise simulated data sets, multi-SAM seems to be the only one able to reveal subtle differences in gene expression profiles on real unbalanced data. In Chapter 4 a method to adress similarities evaluation in a three-class prblem by means of Relevance Vector Machine [4] is described. In fact, looking at microarray data in a prognostic and diagnostic clinical framework, not only differences could have a crucial role. In some cases similarities can give useful and, sometimes even more, important information. The goal, given three classes, could be to establish, with a certain level of confidence, if the third one is similar to the first or the second one. In this work we show that Relevance Vector Machine (RVM) [2] could be a possible solutions to the limitation of standard supervised classification. In fact, RVM offers many advantages compared, for example, with his well-known precursor (Support Vector Machine - SVM [3]). Among these advantages, the estimate of posterior probability of class membership represents a key feature to address the similarity issue. This is a highly important, but often overlooked, option of any practical pattern recognition system. We focused on Tumor-Grade-three-class problem, so we have 67 samples of grade I (G1), 54 samples of grade 3 (G3) and 100 samples of grade 2 (G2). The goal is to find a model able to separate G1 from G3, then evaluate the third class G2 as test-set to obtain the probability for samples of G2 to be member of class G1 or class G3. The analysis showed that breast cancer samples of grade II have a molecular profile more similar to breast cancer samples of grade I. Looking at the literature this result have been guessed, but no measure of significance was gived before.
Resumo:
The intensity of regional specialization in specific activities, and conversely, the level of industrial concentration in specific locations, has been used as a complementary evidence for the existence and significance of externalities. Additionally, economists have mainly focused the debate on disentangling the sources of specialization and concentration processes according to three vectors: natural advantages, internal, and external scale economies. The arbitrariness of partitions plays a key role in capturing these effects, while the selection of the partition would have to reflect the actual characteristics of the economy. Thus, the identification of spatial boundaries to measure specialization becomes critical, since most likely the model will be adapted to different scales of distance, and be influenced by different types of externalities or economies of agglomeration, which are based on the mechanisms of interaction with particular requirements of spatial proximity. This work is based on the analysis of the spatial aspect of economic specialization supported by the manufacturing industry case. The main objective is to propose, for discrete and continuous space: i) a measure of global specialization; ii) a local disaggregation of the global measure; and iii) a spatial clustering method for the identification of specialized agglomerations.
Resumo:
Satellite remote sensing has proved to be an effective support in timely detection and monitoring of marine oil pollution, mainly due to illegal ship discharges. In this context, we have developed a new methodology and technique for optical oil spill detection, which make use of MODIS L2 and MERIS L1B satellite top of atmosphere (TOA) reflectance imagery, for the first time in a highly automated way. The main idea was combining wide swaths and short revisit times of optical sensors with SAR observations, generally used in oil spill monitoring. This arises from the necessity to overcome the SAR reduced coverage and long revisit time of the monitoring area. This can be done now, given the MODIS and MERIS higher spatial resolution with respect to older sensors (250-300 m vs. 1 km), which consents the identification of smaller spills deriving from illicit discharge at sea. The procedure to obtain identifiable spills in optical reflectance images involves removal of oceanic and atmospheric natural variability, in order to enhance oil-water contrast; image clustering, which purpose is to segment the oil spill eventually presents in the image; finally, the application of a set of criteria for the elimination of those features which look like spills (look-alikes). The final result is a classification of oil spill candidate regions by means of a score based on the above criteria.
Resumo:
There are different ways to do cluster analysis of categorical data in the literature and the choice among them is strongly related to the aim of the researcher, if we do not take into account time and economical constraints. Main approaches for clustering are usually distinguished into model-based and distance-based methods: the former assume that objects belonging to the same class are similar in the sense that their observed values come from the same probability distribution, whose parameters are unknown and need to be estimated; the latter evaluate distances among objects by a defined dissimilarity measure and, basing on it, allocate units to the closest group. In clustering, one may be interested in the classification of similar objects into groups, and one may be interested in finding observations that come from the same true homogeneous distribution. But do both of these aims lead to the same clustering? And how good are clustering methods designed to fulfil one of these aims in terms of the other? In order to answer, two approaches, namely a latent class model (mixture of multinomial distributions) and a partition around medoids one, are evaluated and compared by Adjusted Rand Index, Average Silhouette Width and Pearson-Gamma indexes in a fairly wide simulation study. Simulation outcomes are plotted in bi-dimensional graphs via Multidimensional Scaling; size of points is proportional to the number of points that overlap and different colours are used according to the cluster membership.
Resumo:
The evaluation of structural performance of existing concrete buildings, built according to standards and materials quite different to those available today, requires procedures and methods able to cover lack of data about mechanical material properties and reinforcement detailing. To this end detailed inspections and test on materials are required. As a consequence tests on drilled cores are required; on the other end, it is stated that non-destructive testing (NDT) cannot be used as the only mean to get structural information, but can be used in conjunction with destructive testing (DT) by a representative correlation between DT and NDT. The aim of this study is to verify the accuracy of some formulas of correlation available in literature between measured parameters, i.e. rebound index, ultrasonic pulse velocity and compressive strength (SonReb Method). To this end a relevant number of DT and NDT tests has been performed on many school buildings located in Cesena (Italy). The above relationships have been assessed on site correlating NDT results to strength of core drilled in adjacent locations. Nevertheless, concrete compressive strength assessed by means of NDT methods and evaluated with correlation formulas has the advantage of being able to be implemented and used for future applications in a much more simple way than other methods, even if its accuracy is strictly limited to the analysis of concretes having the same characteristics as those used for their calibration. This limitation warranted a search for a different evaluation method for the non-destructive parameters obtained on site. To this aim, the methodology of neural identification of compressive strength is presented. Artificial Neural Network (ANN) suitable for the specific analysis were chosen taking into account the development presented in the literature in this field. The networks were trained and tested in order to detect a more reliable strength identification methodology.
Resumo:
La ricerca proposta si pone l’obiettivo di definire e sperimentare un metodo per un’articolata e sistematica lettura del territorio rurale, che, oltre ad ampliare la conoscenza del territorio, sia di supporto ai processi di pianificazione paesaggistici ed urbanistici e all’attuazione delle politiche agricole e di sviluppo rurale. Un’approfondita disamina dello stato dell’arte riguardante l’evoluzione del processo di urbanizzazione e le conseguenze dello stesso in Italia e in Europa, oltre che del quadro delle politiche territoriali locali nell’ambito del tema specifico dello spazio rurale e periurbano, hanno reso possibile, insieme a una dettagliata analisi delle principali metodologie di analisi territoriale presenti in letteratura, la determinazione del concept alla base della ricerca condotta. E’ stata sviluppata e testata una metodologia multicriteriale e multilivello per la lettura del territorio rurale sviluppata in ambiente GIS, che si avvale di algoritmi di clustering (quale l’algoritmo IsoCluster) e classificazione a massima verosimiglianza, focalizzando l’attenzione sugli spazi agricoli periurbani. Tale metodo si incentra sulla descrizione del territorio attraverso la lettura di diverse componenti dello stesso, quali quelle agro-ambientali e socio-economiche, ed opera una sintesi avvalendosi di una chiave interpretativa messa a punto allo scopo, l’Impronta Agroambientale (Agro-environmental Footprint - AEF), che si propone di quantificare il potenziale impatto degli spazi rurali sul sistema urbano. In particolare obiettivo di tale strumento è l’identificazione nel territorio extra-urbano di ambiti omogenei per caratteristiche attraverso una lettura del territorio a differenti scale (da quella territoriale a quella aziendale) al fine di giungere ad una sua classificazione e quindi alla definizione delle aree classificabili come “agricole periurbane”. La tesi propone la presentazione dell’architettura complessiva della metodologia e la descrizione dei livelli di analisi che la compongono oltre che la successiva sperimentazione e validazione della stessa attraverso un caso studio rappresentativo posto nella Pianura Padana (Italia).
Resumo:
in the everyday clinical practice. Having this in mind, the choice of a simple setup would not be enough because, even if the setup is quick and simple, the instrumental assessment would still be in addition to the daily routine. The will to overcome this limit has led to the idea of instrumenting already existing and widely used functional tests. In this way the sensor based assessment becomes an integral part of the clinical assessment. Reliable and validated signal processing methods have been successfully implemented in Personal Health Systems based on smartphone technology. At the end of this research project there is evidence that such solution can really and easily used in clinical practice in both supervised and unsupervised settings. Smartphone based solution, together or in place of dedicated wearable sensing units, can truly become a pervasive and low-cost means for providing suitable testing solutions for quantitative movement analysis with a clear clinical value, ultimately providing enhanced balance and mobility support to an aging population.
Resumo:
The spectroscopic investigation of the gas-phase molecules relevant for the chemistry of the atmosphere and of the interstellar medium has been performed. Two types of molecules have been studied, linear and symmetric top. Several experimental high-resolution techniques have been adopted, exploiting the spectrometers available in Bologna, Venezia, Brussels and Wuppertal: Fourier-Transform-Infrared Spectroscopy, Cavity-Ring-Down Spectroscopy, Cavity-Enhanced-Absorption Spectroscopy, Tunable-Diode-Laser Spectroscopy. Concerning linear molecules, the spectra of a number of isotopologues of acetylene, 12C2D2, H12C13CD, H13C12CD, 13C12CD2, of DCCF and monodeuterodiacetylene DC4H, have been studied, from 320 to 6800 cm-1. This interval covers bending, stretching, overtone and combination bands, the focus on specific ranges depending on the molecule. In particular, the analysis of the bending modes has been performed for 12C2D2 (450-2200 cm-1), 13C12CD2 (450-1700 cm-1), DCCF (320-850cm-1) and DC4H (450-1100 cm-1), of the stretching-bending system for 12C2D2 (450-5500 cm-1) and of the 2nu1 and combination bands up to four quanta of excitation for H12C13CD, H13C12CD and 13C12CD2 (6130-6800 cm-1). In case of symmetric top molecules, CH3CCH has been investigated in the 2nu1 region (6200-6700 cm-1), which is particularly congested due to the huge network of states affected by Coriolis and anharmonic interactions. The bending fundamentals of 15ND3 (450-2700 cm-1) have been studied for the first time, characterizing completely the bending states, v2 = 1 and v4 = 1, whereas the analysis of the stretching modes, which evidenced the presence of several perturbations, has been started. Finally, the fundamental band nu4 of CF3Br in the 1190-1220 cm-1 region has been investigated. Transitions belonging to the CF379Br and CF381Br molecules have been identified since the spectra were recorded using a sample containing the two isotopologues in natural abundance. This allowed the characterization of the v4 = 1 state for both isotopologues and the evaluation of the bromine isotopic splitting.
Resumo:
The work investigates the feasibility of a new process aimed at the production of hydrogen with inherent separation of carbon oxides. The process consists in a cycle in which, in the first step, a mixed metal oxide is reduced by ethanol (obtained from biomasses). The reduced metal is then contacted with steam in order to split the water and sequestrating the oxygen into the looping material’s structure. The oxides used to run this thermochemical cycle, also called “steam-iron process” are mixed ferrites in the spinel structure MeFe2O4 (Me = Fe, Co, Ni or Cu). To understand the reactions involved in the anaerobic reforming of ethanol, diffuse reflectance spectroscopy (DRIFTS) was used, coupled with the mass analysis of the effluent, to study the surface composition of the ferrites during the adsorption of ethanol and its transformations during the temperature program. This study was paired with the tests on a laboratory scale plant and the characterization through various techniques such as XRD, Mössbauer spectroscopy, elemental analysis... on the materials as synthesized and at different reduction degrees In the first step it was found that besides the generation of the expected CO, CO2 and H2O, the products of ethanol anaerobic oxidation, also a large amount of H2 and coke were produced. The latter is highly undesired, since it affects the second step, during which water is fed over the pre-reduced spinel at high temperature. The behavior of the different spinels was affected by the nature of the divalent metal cation; magnetite was the oxide showing the slower rate of reduction by ethanol, but on the other hand it was that one which could perform the entire cycle of the process more efficiently. Still the problem of coke formation remains the greater challenge to solve.