10 resultados para multivariate binary data
em AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Resumo:
Model misspecification affects the classical test statistics used to assess the fit of the Item Response Theory (IRT) models. Robust tests have been derived under model misspecification, as the Generalized Lagrange Multiplier and Hausman tests, but their use has not been largely explored in the IRT framework. In the first part of the thesis, we introduce the Generalized Lagrange Multiplier test to detect differential item response functioning in IRT models for binary data under model misspecification. By means of a simulation study and a real data analysis, we compare its performance with the classical Lagrange Multiplier test, computed using the Hessian and the cross-product matrix, and the Generalized Jackknife Score test. The power of these tests is computed empirically and asymptotically. The misspecifications considered are local dependence among items and non-normal distribution of the latent variable. The results highlight that, under mild model misspecification, all tests have good performance while, under strong model misspecification, the performance of the tests deteriorates. None of the tests considered show an overall superior performance than the others. In the second part of the thesis, we extend the Generalized Hausman test to detect non-normality of the latent variable distribution. To build the test, we consider a seminonparametric-IRT model, that assumes a more flexible latent variable distribution. By means of a simulation study and two real applications, we compare the performance of the Generalized Hausman test with the M2 limited information goodness-of-fit test and the Likelihood-Ratio test. Additionally, the information criteria are computed. The Generalized Hausman test has a better performance than the Likelihood-Ratio test in terms of Type I error rates and the M2 test in terms of power. The performance of the Generalized Hausman test and the information criteria deteriorates when the sample size is small and with a few items.
Resumo:
Precipitation retrieval over high latitudes, particularly snowfall retrieval over ice and snow, using satellite-based passive microwave spectrometers, is currently an unsolved problem. The challenge results from the large variability of microwave emissivity spectra for snow and ice surfaces, which can mimic, to some degree, the spectral characteristics of snowfall. This work focuses on the investigation of a new snowfall detection algorithm specific for high latitude regions, based on a combination of active and passive sensors able to discriminate between snowing and non snowing areas. The space-borne Cloud Profiling Radar (on CloudSat), the Advanced Microwave Sensor units A and B (on NOAA-16) and the infrared spectrometer MODIS (on AQUA) have been co-located for 365 days, from October 1st 2006 to September 30th, 2007. CloudSat products have been used as truth to calibrate and validate all the proposed algorithms. The methodological approach followed can be summarised into two different steps. In a first step, an empirical search for a threshold, aimed at discriminating the case of no snow, was performed, following Kongoli et al. [2003]. This single-channel approach has not produced appropriate results, a more statistically sound approach was attempted. Two different techniques, which allow to compute the probability above and below a Brightness Temperature (BT) threshold, have been used on the available data. The first technique is based upon a Logistic Distribution to represent the probability of Snow given the predictors. The second technique, defined Bayesian Multivariate Binary Predictor (BMBP), is a fully Bayesian technique not requiring any hypothesis on the shape of the probabilistic model (such as for instance the Logistic), which only requires the estimation of the BT thresholds. The results obtained show that both methods proposed are able to discriminate snowing and non snowing condition over the Polar regions with a probability of correct detection larger than 0.5, highlighting the importance of a multispectral approach.
Resumo:
The aim of this work is to carry out an applicative, comparative and exhaustive study between several entropy based indicators of independence and correlation. We considered some indicators characterized by a wide and consolidate literature, like mutual information, joint entropy, relative entropy or Kullback Leibler distance, and others, more recently introduced, like Granger, Maasoumi and racine entropy, also called Sρ, or utilized in more restricted domains, like Pincus approximate entropy or ApEn. We studied the behaviour of such indicators applying them to binary series. The series was designed to simulate a wide range of situations in order to characterize indicators limit and capability and to identify, case by case, the more useful and trustworthy ones. Our target was not only to study if such indicators were able to discriminate between dependence and independence because, especially for mutual information and Granger, Maasoumi and Racine, that was already demonstrated and reported in literature, but also to verify if and how they were able to provide information about structure, complexity and disorder of the series they were applied to. Special attention was paid on Pincus approximate entropy, that is said by the author to be able to provide information regarding the level of randomness, regularity and complexity of a series. By means of a focused and extensive research, we furthermore tried to clear the meaning of ApEn applied to a couple of different series. In such situation the indicator is named in literature as cross-ApEn. The cross-ApEn meaning and the interpretation of its results is often not simple nor univocal and the matter is scarcely delved into by literature, thereby users can easily leaded up to a misleading conclusion, especially if the indicator is employed, as often unfortunately it happens, in uncritical manner. In order to plug some cross-ApEn gaps and limits clearly brought out during the experimentation, we developed and applied to the already considered cases a further indicator we called “correspondence index”. The correspondence index is perfectly integrated into the cross-ApEn computational algorithm and it is able to provide, at least for binary data, accurate information about the intensity and the direction of an eventual correlation, even not linear, existing between two different series allowing, in the meanwhile, to detect an eventual condition of independence between the series themselves.
Resumo:
The present Dissertation shows how recent statistical analysis tools and open datasets can be exploited to improve modelling accuracy in two distinct yet interconnected domains of flood hazard (FH) assessment. In the first Part, unsupervised artificial neural networks are employed as regional models for sub-daily rainfall extremes. The models aim to learn a robust relation to estimate locally the parameters of Gumbel distributions of extreme rainfall depths for any sub-daily duration (1-24h). The predictions depend on twenty morphoclimatic descriptors. A large study area in north-central Italy is adopted, where 2238 annual maximum series are available. Validation is performed over an independent set of 100 gauges. Our results show that multivariate ANNs may remarkably improve the estimation of percentiles relative to the benchmark approach from the literature, where Gumbel parameters depend on mean annual precipitation. Finally, we show that the very nature of the proposed ANN models makes them suitable for interpolating predicted sub-daily rainfall quantiles across space and time-aggregation intervals. In the second Part, decision trees are used to combine a selected blend of input geomorphic descriptors for predicting FH. Relative to existing DEM-based approaches, this method is innovative, as it relies on the combination of three characteristics: (1) simple multivariate models, (2) a set of exclusively DEM-based descriptors as input, and (3) an existing FH map as reference information. First, the methods are applied to northern Italy, represented with the MERIT DEM (∼90m resolution), and second, to the whole of Italy, represented with the EU-DEM (25m resolution). The results show that multivariate approaches may (a) significantly enhance flood-prone areas delineation relative to a selected univariate one, (b) provide accurate predictions of expected inundation depths, (c) produce encouraging results in extrapolation, (d) complete the information of imperfect reference maps, and (e) conveniently convert binary maps into continuous representation of FH.
Resumo:
The main aim of this Ph.D. dissertation is the study of clustering dependent data by means of copula functions with particular emphasis on microarray data. Copula functions are a popular multivariate modeling tool in each field where the multivariate dependence is of great interest and their use in clustering has not been still investigated. The first part of this work contains the review of the literature of clustering methods, copula functions and microarray experiments. The attention focuses on the K–means (Hartigan, 1975; Hartigan and Wong, 1979), the hierarchical (Everitt, 1974) and the model–based (Fraley and Raftery, 1998, 1999, 2000, 2007) clustering techniques because their performance is compared. Then, the probabilistic interpretation of the Sklar’s theorem (Sklar’s, 1959), the estimation methods for copulas like the Inference for Margins (Joe and Xu, 1996) and the Archimedean and Elliptical copula families are presented. In the end, applications of clustering methods and copulas to the genetic and microarray experiments are highlighted. The second part contains the original contribution proposed. A simulation study is performed in order to evaluate the performance of the K–means and the hierarchical bottom–up clustering methods in identifying clusters according to the dependence structure of the data generating process. Different simulations are performed by varying different conditions (e.g., the kind of margins (distinct, overlapping and nested) and the value of the dependence parameter ) and the results are evaluated by means of different measures of performance. In light of the simulation results and of the limits of the two investigated clustering methods, a new clustering algorithm based on copula functions (‘CoClust’ in brief) is proposed. The basic idea, the iterative procedure of the CoClust and the description of the written R functions with their output are given. The CoClust algorithm is tested on simulated data (by varying the number of clusters, the copula models, the dependence parameter value and the degree of overlap of margins) and is compared with the performance of model–based clustering by using different measures of performance, like the percentage of well–identified number of clusters and the not rejection percentage of H0 on . It is shown that the CoClust algorithm allows to overcome all observed limits of the other investigated clustering techniques and is able to identify clusters according to the dependence structure of the data independently of the degree of overlap of margins and the strength of the dependence. The CoClust uses a criterion based on the maximized log–likelihood function of the copula and can virtually account for any possible dependence relationship between observations. Many peculiar characteristics are shown for the CoClust, e.g. its capability of identifying the true number of clusters and the fact that it does not require a starting classification. Finally, the CoClust algorithm is applied to the real microarray data of Hedenfalk et al. (2001) both to the gene expressions observed in three different cancer samples and to the columns (tumor samples) of the whole data matrix.
Resumo:
Questa tesi descrive alcuni studi di messa a punto di metodi di analisi fisici accoppiati con tecniche statistiche multivariate per valutare la qualità e l’autenticità di oli vegetali e prodotti caseari. L’applicazione di strumenti fisici permette di abbattere i costi ed i tempi necessari per le analisi classiche ed allo stesso tempo può fornire un insieme diverso di informazioni che possono riguardare tanto la qualità come l’autenticità di prodotti. Per il buon funzionamento di tali metodi è necessaria la costruzione di modelli statistici robusti che utilizzino set di dati correttamente raccolti e rappresentativi del campo di applicazione. In questo lavoro di tesi sono stati analizzati oli vegetali e alcune tipologie di formaggi (in particolare pecorini per due lavori di ricerca e Parmigiano-Reggiano per un altro). Sono stati utilizzati diversi strumenti di analisi (metodi fisici), in particolare la spettroscopia, l’analisi termica differenziale, il naso elettronico, oltre a metodiche separative tradizionali. I dati ottenuti dalle analisi sono stati trattati mediante diverse tecniche statistiche, soprattutto: minimi quadrati parziali; regressione lineare multipla ed analisi discriminante lineare.
Resumo:
Millisecond Pulsars (MSPs) are fast rotating, highly magnetized neutron stars. According to the "canonical recycling scenario", MSPs form in binary systems containing a neutron star which is spun up through mass accretion from the evolving companion. Therefore, the final stage consists of a binary made of a MSP and the core of the deeply peeled companion. In the last years, however an increasing number of systems deviating from these expectations has been discovered, thus strongly indicating that our understanding of MSPs is far to be complete. The identification of the optical companions to binary MSPs is crucial to constrain the formation and evolution of these objects. In dense environments such as Globular Clusters (GCs), it also allows us to get insights on the cluster internal dynamics. By using deep photometric data, acquired both from space and ground-based telescopes, we identified 5 new companions to MSPs. Three of them being located in GCs and two in the Galactic Field. The three new identifications in GCs increased by 50% the number of such objects known before this Thesis. They all are non-degenerate stars, at odds with the expectations of the "canonical recycling scenario". These results therefore suggest either that transitory phases should also be taken into account, or that dynamical processes, as exchange interactions, play a crucial role in the evolution of MSPs. We also performed a spectroscopic follow-up of the companion to PSRJ1740-5340A in the GC NGC 6397, confirming that it is a deeply peeled star descending from a ~0.8Msun progenitor. This nicely confirms the theoretical expectations about the formation and evolution of MSPs.
Resumo:
The thesis deals with the problem of Model Selection (MS) motivated by information and prediction theory, focusing on parametric time series (TS) models. The main contribution of the thesis is the extension to the multivariate case of the Misspecification-Resistant Information Criterion (MRIC), a criterion introduced recently that solves Akaike’s original research problem posed 50 years ago, which led to the definition of the AIC. The importance of MS is witnessed by the huge amount of literature devoted to it and published in scientific journals of many different disciplines. Despite such a widespread treatment, the contributions that adopt a mathematically rigorous approach are not so numerous and one of the aims of this project is to review and assess them. Chapter 2 discusses methodological aspects of MS from information theory. Information criteria (IC) for the i.i.d. setting are surveyed along with their asymptotic properties; and the cases of small samples, misspecification, further estimators. Chapter 3 surveys criteria for TS. IC and prediction criteria are considered for: univariate models (AR, ARMA) in the time and frequency domain, parametric multivariate (VARMA, VAR); nonparametric nonlinear (NAR); and high-dimensional models. The MRIC answers Akaike’s original question on efficient criteria, for possibly-misspecified (PM) univariate TS models in multi-step prediction with high-dimensional data and nonlinear models. Chapter 4 extends the MRIC to PM multivariate TS models for multi-step prediction introducing the Vectorial MRIC (VMRIC). We show that the VMRIC is asymptotically efficient by proving the decomposition of the MSPE matrix and the consistency of its Method-of-Moments Estimator (MoME), for Least Squares multi-step prediction with univariate regressor. Chapter 5 extends the VMRIC to the general multiple regressor case, by showing that the MSPE matrix decomposition holds, obtaining consistency for its MoME, and proving its efficiency. The chapter concludes with a digression on the conditions for PM VARX models.
Resumo:
The coastal ocean is a complex environment with extremely dynamic processes that require a high-resolution and cross-scale modeling approach in which all hydrodynamic fields and scales are considered integral parts of the overall system. In the last decade, unstructured-grid models have been used to advance in seamless modeling between scales. On the other hand, the data assimilation methodologies to improve the unstructured-grid models in the coastal seas have been developed only recently and need significant advancements. Here, we link the unstructured-grid ocean modeling to the variational data assimilation methods. In particular, we show results from the modeling system SANIFS based on SHYFEM fully-baroclinic unstructured-grid model interfaced with OceanVar, a state-of-art variational data assimilation scheme adopted for several systems based on a structured grid. OceanVar implements a 3DVar DA scheme. The combination of three linear operators models the background error covariance matrix. The vertical part is represented using multivariate EOFs for temperature, salinity, and sea level anomaly. The horizontal part is assumed to be Gaussian isotropic and is modeled using a first-order recursive filter algorithm designed for structured and regular grids. Here we introduced a novel recursive filter algorithm for unstructured grids. A local hydrostatic adjustment scheme models the rapidly evolving part of the background error covariance. We designed two data assimilation experiments using SANIFS implementation interfaced with OceanVar over the period 2017-2018, one with only temperature and salinity assimilation by Argo profiles and the second also including sea level anomaly. The results showed a successful implementation of the approach and the added value of the assimilation for the active tracer fields. While looking at the broad basin, no significant improvements are highlighted for the sea level, requiring future investigations. Furthermore, a Machine Learning methodology based on an LSTM network has been used to predict the model SST increments.
Resumo:
In this thesis, new classes of models for multivariate linear regression defined by finite mixtures of seemingly unrelated contaminated normal regression models and seemingly unrelated contaminated normal cluster-weighted models are illustrated. The main difference between such families is that the covariates are treated as fixed in the former class of models and as random in the latter. Thus, in cluster-weighted models the assignment of the data points to the unknown groups of observations depends also by the covariates. These classes provide an extension to mixture-based regression analysis for modelling multivariate and correlated responses in the presence of mild outliers that allows to specify a different vector of regressors for the prediction of each response. Expectation-conditional maximisation algorithms for the calculation of the maximum likelihood estimate of the model parameters have been derived. As the number of free parameters incresases quadratically with the number of responses and the covariates, analyses based on the proposed models can become unfeasible in practical applications. These problems have been overcome by introducing constraints on the elements of the covariance matrices according to an approach based on the eigen-decomposition of the covariance matrices. The performances of the new models have been studied by simulations and using real datasets in comparison with other models. In order to gain additional flexibility, mixtures of seemingly unrelated contaminated normal regressions models have also been specified so as to allow mixing proportions to be expressed as functions of concomitant covariates. An illustration of the new models with concomitant variables and a study on housing tension in the municipalities of the Emilia-Romagna region based on different types of multivariate linear regression models have been performed.