14 results for Count data models
in AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Abstract:
The recent advent of next-generation sequencing technologies has revolutionized the way the genome is analyzed. This innovation yields deeper information at a lower cost and in less time, and provides data that are discrete measurements. One of the most important applications of these data is differential analysis, that is, investigating whether a gene exhibits a different expression level under two (or more) biological conditions (such as disease states or treatments received). From a statistical standpoint, the final aim is hypothesis testing, and the Negative Binomial distribution is considered the most adequate model for these data, especially because it allows for overdispersion. However, estimating the dispersion parameter is a very delicate issue because little information is usually available for it. Many strategies have been proposed, but they often result in procedures based on plug-in estimates, and in this thesis we show that this discrepancy between the estimation and the testing framework can lead to uncontrolled type I errors. We propose a mixture model that allows each gene to share information with other genes that exhibit similar variability. Three consistent statistical tests are then developed for differential expression analysis. We show that the proposed method improves the sensitivity of detecting differentially expressed genes with respect to the common procedures, since it comes closest to the nominal type I error rate while maintaining high power. The method is finally illustrated on prostate cancer RNA-seq data.
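The overdispersion that motivates the Negative Binomial model can be illustrated with a minimal sketch. It assumes the common mean-dispersion parameterization Var(Y) = mu + phi * mu^2, which the abstract does not spell out; the function names, the made-up counts and the moment-based estimator are hypothetical illustrations and are not the mixture-model method proposed in the thesis.

```python
from statistics import mean, variance


def nb_variance(mu: float, phi: float) -> float:
    """Variance of a Negative Binomial count under the assumed
    mean-dispersion parameterization: Var(Y) = mu + phi * mu**2."""
    return mu + phi * mu ** 2


def moment_dispersion(counts: list[int]) -> float:
    """Method-of-moments dispersion estimate: phi = (s^2 - m) / m^2.

    The estimate is clipped at 0, because a sample variance below the
    mean gives no evidence of overdispersion (phi = 0 is the Poisson limit).
    """
    m = mean(counts)
    s2 = variance(counts)  # unbiased sample variance
    if m == 0:
        return 0.0
    return max(0.0, (s2 - m) / m ** 2)


# Made-up read counts for one gene across five replicates.
counts = [2, 10, 4, 20, 4]
phi_hat = moment_dispersion(counts)
```

With only a handful of replicates per gene, a per-gene estimate like `phi_hat` is very noisy, which is one way to read the abstract's remark that little information is available for estimating the dispersion.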
Abstract:
The thesis studies the economic and financial conditions of Italian households using microeconomic data from the Survey on Household Income and Wealth (SHIW) over the period 1998-2006. It develops along two lines of enquiry. First, it studies the determinants of households' holdings of assets and liabilities and estimates their degree of correlation. After a review of the literature, it estimates two non-linear multivariate models of the interactions between assets and liabilities using repeated cross-sections. Second, it analyses households' financial difficulties. It defines a quantitative measure of financial distress and tests, by means of non-linear dynamic probit models, whether the probability of experiencing financial difficulties is persistent over time. Chapter 1 provides a critical review of the theoretical and empirical literature on the estimation of asset and liability holdings, on their interactions and on households' net wealth. The review stresses that a large part of the literature explains households' debt holdings as a function of, among other variables, net wealth, an assumption that runs into possible endogeneity problems. Chapter 2 defines two non-linear multivariate models to study the interactions between assets and liabilities held by Italian households. Estimation is based on pooled cross-sections of the SHIW. The first model is a bivariate tobit that estimates the factors affecting assets and liabilities and their degree of correlation, with results consistent with theoretical expectations. To address non-normality and heteroskedasticity in the error term, which make the tobit estimators inconsistent, semi-parametric estimates are provided that confirm the results of the tobit model. The second model is a quadrivariate probit on three different assets (safe, risky and real) and total liabilities; the results show the expected patterns of interdependence suggested by theoretical considerations.
Chapter 3 reviews the methodologies for estimating non-linear dynamic panel data models, drawing attention to the problems that must be dealt with to obtain consistent estimators. Specific attention is given to the initial conditions problem raised by the inclusion of the lagged dependent variable in the set of explanatory variables. The advantage of dynamic panel data models is that they simultaneously account for true state dependence, via the lagged variable, and for unobserved heterogeneity, via the specification of individual effects. Chapter 4 applies the models reviewed in Chapter 3 to analyse the financial difficulties of Italian households, using the information on net wealth provided in the panel component of the SHIW. The aim is to test whether households persistently experience financial difficulties over time. A thorough discussion is provided of the alternative approaches proposed in the literature (subjective/qualitative indicators versus quantitative indexes) to identify households in financial distress. Households in financial difficulties are identified as those holding net wealth below the first quartile of the net wealth distribution. Estimation is conducted via four different methods: the pooled probit model, the random effects probit model with exogenous initial conditions, the Heckman model and the recently developed Wooldridge model. Results from all estimators support the hypothesis of true state dependence and show that, in accordance with the literature, the less sophisticated models, namely the pooled and exogenous models, over-estimate such persistence.
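The quantitative distress indicator described above (net wealth below the first quartile) can be sketched as follows. This is only an illustration: the function name and the sample values are hypothetical, survey weights are ignored, and the quartile interpolation rule is the Python standard library default rather than anything stated in the abstract.

```python
from statistics import quantiles


def flag_financial_distress(net_wealth: list[float]) -> list[bool]:
    """Flag households whose net wealth lies below the sample first quartile."""
    # quantiles(..., n=4) returns the three quartile cut points; the
    # default "exclusive" interpolation is an assumption of this sketch.
    q1 = quantiles(net_wealth, n=4)[0]
    return [w < q1 for w in net_wealth]


# Made-up net wealth values for eight households.
wealth = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
flags = flag_financial_distress(wealth)
```

In a dynamic probit, a binary indicator of this kind, lagged by one period, would enter the set of explanatory variables to capture state dependence.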
Abstract:
This doctoral thesis aims to contribute to the literature on transition economies, focusing on the Russian Federation and in particular on regional income convergence and fertility patterns. The first two chapters deal with the issue of income convergence across regions. Chapter 1 provides a historical-institutional analysis of the period between the late years of the Soviet Union and the last decade of economic growth, and presents the sample with a description of gross regional product composition, agrarian or industrial vocation, and labor. Chapter 2 contributes to the literature on exploratory spatial data analysis with an application to a panel of 77 regions over the period 1994-2008. It provides an analysis of spatial patterns and extends the theoretical framework of growth regressions by controlling for spatial correlation and heterogeneity. Chapter 3 analyses the national demographic patterns since 1960 and provides a review of the policies on maternity leave and family benefits. Data sources are the Statistical Yearbooks of the USSR, the Statistical Yearbooks of the Russian Soviet Federative Socialist Republic and the Demographic Yearbooks of Russia. Chapter 4 analyses the demographic patterns in light of the theoretical framework of the Becker model, the Second Demographic Transition and an economic-crisis argument. With national data from 1960, the theoretical issue of the pro- or countercyclical relation between income and fertility is graphically analyzed and discussed, together with female employment and education. With regional data after 1994, different panel data models are tested. Individual-level data from the Russian Longitudinal Monitoring Survey are analyzed using the logit model. Chapter 5 employs data from the Generations and Gender Survey by UNECE to focus on postponement and second-birth intentions.
Postponement is studied through cohort analysis of mean maternal age at first birth, while the methodology used for second birth intentions is the ordered logit model.
Abstract:
In Sub-Saharan Africa, non-democratic events such as civil wars and coups d'état undermine economic development. This study investigates both domestic and spatial effects on the likelihood of civil wars and coups d'état. For civil wars, a common conclusion in the literature is that higher income growth helps to stop wars. This study adds a concern for ethnic fractionalization. IV-2SLS is applied to address the causality problem. The findings document that income growth significantly reduces both the number and the degree of violence of wars in highly ethnically fractionalized countries; elsewhere the two trade off against each other: in countries with a few large ethnic groups, income growth reduces the number of wars but increases their level of violence. Policies that promote growth should therefore take ethnic composition into account. This study also investigates the clustering and contagion of civil wars using spatial panel data models. The onset, incidence and end of civil conflicts spread across the network of neighboring countries, while peace, the end of conflicts, diffuses only to the nearest neighbor. There is evidence of an indirect link whereby income growth in neighboring countries, when not accompanied by too much inequality, reduces the likelihood of civil wars. For coups d'état, this study revisits their diffusion, both for all types of coups and for successful ones only. The results show that both domestic and spatial determinants exist, in different periods. Domestic income growth plays a major role in reducing the likelihood of a coup before the end of the Cold War, while spatial effects have a negative effect afterward. Results on the probability that a coup succeeds are similar. After the end of the Cold War, international organisations have seriously promoted democracy and exerted pressure against coups d'état, and this seems to be effective. In sum, this study highlights the role of domestic ethnic fractionalization and of spillovers from neighboring countries in the likelihood of non-democratic events in a country. Policy implementation should take these factors into account.
Abstract:
The first paper sheds light on the informational content of high-frequency data and daily data. I assess the economic value of the two families of models by comparing their performance in forecasting asset volatility through the Value at Risk metric. In running the comparison, this paper introduces two key assumptions: jumps in prices and a leverage effect in volatility dynamics. Findings suggest that high-frequency data models do not exhibit superior performance over daily data models. In the second paper, building on Majewski et al. (2015), I propose an affine discrete-time model, labeled VARG-J, which is characterized by a multifactor volatility specification. In the VARG-J model, volatility experiences periods of extreme movements through a jump factor modeled as an Autoregressive Gamma Zero process. The estimation under the historical measure is done by quasi-maximum likelihood and the Extended Kalman Filter. This strategy makes it possible to filter out both volatility factors by introducing a measurement equation that relates the Realized Volatility to latent volatility. The risk premia parameters are calibrated using call options written on the S&P 500 Index. The results clearly illustrate the important contribution of the jump factor to the pricing performance of options and the economic significance of the volatility jump risk premia. In the third paper, I analyze whether there is empirical evidence of contagion at the bank level, measuring the direction and the size of contagion transmission between European markets. In order to understand and quantify contagion transmission in the banking market, I estimate the econometric model by Aït-Sahalia et al. (2015), in which contagion is defined as the within- and between-country transmission of shocks and asset returns are directly modeled as a Hawkes jump diffusion process. The empirical analysis indicates that there is clear evidence of contagion from Greece to European countries as well as self-contagion in all countries.
Abstract:
The Assimilation in the Unstable Subspace (AUS) was introduced by Trevisan and Uboldi in 2004, and developed by Trevisan, Uboldi and Carrassi, to minimize the analysis and forecast errors by exploiting the flow-dependent instabilities of the forecast-analysis cycle system, which may be thought of as a system forced by observations. In the AUS scheme the assimilation is obtained by confining the analysis increment to the unstable subspace of the forecast-analysis cycle system, so that it has the same structure as the dominant instabilities of the system. The unstable subspace is estimated by Breeding on the Data Assimilation System (BDAS). AUS-BDAS has already been tested in realistic models and observational configurations, including a Quasi-Geostrophic model and a high-dimensional, primitive equation ocean model; the experiments include both fixed and “adaptive” observations. In these contexts, the AUS-BDAS approach greatly reduces the analysis error, with reasonable computational costs for data assimilation compared, for example, with a prohibitive full Extended Kalman Filter. This is a follow-up study in which we revisit the AUS-BDAS approach in the more basic, highly nonlinear Lorenz 1963 convective model. We run observation system simulation experiments in a perfect model setting, and also with two types of model error: random and systematic. In the different configurations examined, and in a perfect model setting, AUS once again shows better efficiency than other advanced data assimilation schemes. In the present study, we develop an iterative scheme that leads to a significant improvement of the overall assimilation performance, also with respect to standard AUS. In particular, it makes the tracking of regime changes more efficient, at a low computational cost. Other data assimilation schemes need estimates of ad hoc parameters, which have to be tuned for the specific model at hand.
In Numerical Weather Prediction models, tuning of parameters — and in particular an estimate of the model error covariance matrix — may turn out to be quite difficult. Our proposed approach, instead, may be easier to implement in operational models.
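For readers unfamiliar with the Lorenz 1963 model used as the test bed above, the following minimal sketch integrates it and shows the sensitivity to initial conditions that makes the unstable directions matter for assimilation. It is not the AUS or BDAS scheme. The parameter values (sigma = 10, rho = 28, beta = 8/3) are the classic chaotic setting and are an assumption here, as the abstract does not state them; the fourth-order Runge-Kutta integrator, step size and function names are likewise illustrative choices.

```python
def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz 1963 equations (assumed classic parameters)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    k1 = lorenz63(state)
    k2 = lorenz63(shifted(state, k1, dt / 2))
    k3 = lorenz63(shifted(state, k2, dt / 2))
    k4 = lorenz63(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def integrate(state, dt, n_steps):
    """Integrate the model for n_steps steps of size dt."""
    for _ in range(n_steps):
        state = rk4_step(state, dt)
    return state


# Two trajectories that start 1e-8 apart in x and run for 25 time units.
a = integrate((1.0, 1.0, 1.0), 0.01, 2500)
b = integrate((1.0 + 1e-8, 1.0, 1.0), 0.01, 2500)
separation = max(abs(p - q) for p, q in zip(a, b))
```

The growth of `separation` far beyond the initial perturbation reflects error amplification along the unstable direction, which is the kind of instability that, according to the abstract, AUS exploits by confining the analysis increment to the unstable subspace.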
Abstract:
This thesis is a collection of works focused on the topic of Earthquake Early Warning, with special attention to large magnitude events. The topic is addressed from different points of view, and the structure of the thesis reflects the variety of the aspects that have been analyzed. The first part is dedicated to the giant 2011 Tohoku-Oki earthquake. The main features of the rupture process are first discussed. The earthquake is then used as a case study to test the feasibility of Early Warning methodologies for very large events. Limitations of the standard approaches for large events emerge in this chapter. The difficulties are related to the real-time magnitude estimate from the first few seconds of recorded signal. An evolutionary strategy for the real-time magnitude estimate is proposed and applied to the single Tohoku-Oki earthquake. In the second part of the thesis a larger number of earthquakes is analyzed, including small, moderate and large events. Starting from the measurement of two Early Warning parameters, the behavior of small and large earthquakes in the initial portion of recorded signals is investigated. The aim is to understand whether small and large earthquakes can be distinguished from the initial stage of their rupture process. A physical model and a plausible interpretation to justify the observations are proposed. The third part of the thesis is focused on practical, real-time approaches for the rapid identification of the potentially damaged zone during a seismic event. Two different approaches for the rapid prediction of the damage area are proposed and tested. The first one is a threshold-based method that uses traditional seismic data. Then an innovative approach using continuous GPS data is explored. Both strategies improve the prediction of large-scale effects of strong earthquakes.
Abstract:
The aim of the thesis is to propose a Bayesian estimation through Markov chain Monte Carlo of multidimensional item response theory models for graded responses with complex structures and correlated traits. In particular, this work focuses on the multiunidimensional and the additive underlying latent structures, considering that the first one is widely used and represents a classical approach in multidimensional item response analysis, while the second one is able to reflect the complexity of real interactions between items and respondents. A simulation study is conducted to evaluate the parameter recovery for the proposed models under different conditions (sample size, test and subtest length, number of response categories, and correlation structure). The results show that the parameter recovery is particularly sensitive to the sample size, due to the model complexity and the high number of parameters to be estimated. For a sufficiently large sample size the parameters of the multiunidimensional and additive graded response models are well reproduced. The results are also affected by the trade-off between the number of items constituting the test and the number of item categories. An application of the proposed models on response data collected to investigate Romagna and San Marino residents' perceptions and attitudes towards the tourism industry is also presented.
Abstract:
Gastrointestinal stromal tumors (GIST) are the most common mesenchymal tumors of the gastrointestinal tract, arising from the interstitial cells of Cajal (ICCs) or their precursors. The vast majority of GISTs (75–85%) harbor KIT or PDGFRA mutations. A small percentage of GISTs (about 10–15%) do not harbor any of these driver mutations and have historically been called wild-type (WT). Among them, from 20% to 40% show loss of function of the succinate dehydrogenase (SDH) complex and are therefore also defined as SDH-deficient GISTs. SDH-deficient GISTs display distinctive clinical and pathological features, and can be sporadic or associated with Carney triad or Carney-Stratakis syndrome. These tumors arise most frequently in the stomach, with a predilection for the distal stomach and antrum, have a multi-nodular growth pattern, display a histological epithelioid phenotype, and present frequent lympho-vascular invasion. Occurrence of lymph node metastases and an indolent course are representative features of SDH-deficient GISTs. This subset of GIST is known for the immunohistochemical loss of succinate dehydrogenase subunit B (SDHB), which signals the loss of function of the entire SDH complex. The overall aim of my PhD project is the comprehensive characterization of SDH-deficient GIST. Throughout the project, clinical, molecular and cellular characterizations were performed using next-generation sequencing (NGS) technologies, which have the potential to identify molecular patterns useful for diagnosis and for the development of novel treatments. Moreover, while there are many different cell lines and preclinical models of KIT/PDGFRA-mutant GIST, no reliable cell model of SDH-deficient GIST has yet been developed that could be used for studies on tumor evolution and in vitro assessments of drug response. Therefore, another aim of this project was to develop a preclinical model of SDH-deficient GIST using the novel technology of induced pluripotent stem cells (iPSC).
Abstract:
Model misspecification affects the classical test statistics used to assess the fit of Item Response Theory (IRT) models. Robust tests have been derived under model misspecification, such as the Generalized Lagrange Multiplier and Hausman tests, but their use has not been widely explored in the IRT framework. In the first part of the thesis, we introduce the Generalized Lagrange Multiplier test to detect differential item response functioning in IRT models for binary data under model misspecification. By means of a simulation study and a real data analysis, we compare its performance with the classical Lagrange Multiplier test, computed using the Hessian and the cross-product matrix, and the Generalized Jackknife Score test. The power of these tests is computed empirically and asymptotically. The misspecifications considered are local dependence among items and a non-normal distribution of the latent variable. The results highlight that, under mild model misspecification, all tests perform well while, under strong model misspecification, the performance of the tests deteriorates. None of the tests considered shows an overall performance superior to the others. In the second part of the thesis, we extend the Generalized Hausman test to detect non-normality of the latent variable distribution. To build the test, we consider a seminonparametric IRT model, which assumes a more flexible latent variable distribution. By means of a simulation study and two real applications, we compare the performance of the Generalized Hausman test with the M2 limited-information goodness-of-fit test and the Likelihood-Ratio test. Additionally, the information criteria are computed. The Generalized Hausman test performs better than the Likelihood-Ratio test in terms of Type I error rates and better than the M2 test in terms of power. The performance of the Generalized Hausman test and the information criteria deteriorates when the sample size is small and there are only a few items.
Abstract:
Imaging technologies are widely used in application fields such as natural sciences, engineering, medicine, and life sciences. A broad class of imaging problems reduces to solving ill-posed inverse problems (IPs). Traditional strategies to solve these ill-posed IPs rely on variational regularization methods, which are based on the minimization of suitable energies and make use of knowledge about the image formation model (forward operator) and prior knowledge on the solution, but fail to incorporate knowledge directly from data. On the other hand, the more recent learned approaches can easily learn the intricate statistics of images from a large set of data, but do not have a systematic method for incorporating prior knowledge about the image formation model. The main purpose of this thesis is to discuss data-driven image reconstruction methods that combine the benefits of these two different reconstruction strategies for the solution of highly nonlinear ill-posed inverse problems. The mathematical formulation and numerical approaches for image IPs, including linear as well as strongly nonlinear problems, are described. More specifically, we address the Electrical Impedance Tomography (EIT) reconstruction problem by unrolling the regularized Gauss-Newton method and integrating the regularization learned by a data-adaptive neural network. Furthermore, we investigate the solution of nonlinear ill-posed IPs by introducing a deep-PnP framework that integrates a graph convolutional denoiser into the proximal Gauss-Newton method, with a practical application to EIT, a recently introduced and promising imaging technique. Efficient algorithms are then applied to the solution of the limited electrodes problem in EIT, combining compressive sensing techniques and deep learning strategies. Finally, a transformer-based neural network architecture is adapted to restore the noisy solution of the Computed Tomography problem recovered using the filtered back-projection method.
Abstract:
In this thesis, we investigate the role of applied physics in epidemiological surveillance through the application of mathematical models, network science and machine learning. The spread of a communicable disease depends on many biological, social, and health factors. The large masses of data available make it possible, on the one hand, to monitor the evolution and spread of pathogenic organisms; on the other hand, to study the behavior of people, their opinions and habits. Presented here are three lines of research in which an attempt was made to solve real epidemiological problems through data analysis and the use of statistical and mathematical models. In Chapter 1, we applied language-inspired Deep Learning models to transform influenza protein sequences into vectors encoding their information content. We then attempted to reconstruct the antigenic properties of different viral strains using regression models and to identify the mutations responsible for vaccine escape. In Chapter 2, we constructed a compartmental model to describe the spread of a bacterium within a hospital ward. The model was informed and validated on time series of clinical measurements, and a sensitivity analysis was used to assess the impact of different control measures. Finally (Chapter 3) we reconstructed the network of retweets among COVID-19 themed Twitter users in the early months of the SARS-CoV-2 pandemic. By means of community detection algorithms and centrality measures, we characterized users’ attention shifts in the network, showing that scientific communities, initially the most retweeted, lost influence over time to national political communities. In the Conclusion, we highlighted the importance of the work done in light of the main contemporary challenges for epidemiological surveillance. 
In particular, we present reflections on the importance of nowcasting and forecasting, the relationship between data and scientific research, and the need to unite the different scales of epidemiological surveillance.
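The abstract does not give the structure of the compartmental model from Chapter 2, so the following is only a generic sketch of the idea: a two-compartment (uncolonized/colonized) ward model with made-up rates, integrated with a forward Euler scheme. All names, parameter values and the choice of compartments are hypothetical.

```python
def simulate_ward(s0, c0, beta, gamma, dt, n_steps):
    """Simulate a hypothetical two-compartment ward model.

    s: uncolonized patients, c: colonized patients.
    beta: transmission rate, gamma: clearance rate (both per unit time).
    The total ward population n = s + c is kept constant.
    """
    n = s0 + c0
    s, c = float(s0), float(c0)
    history = [(s, c)]
    for _ in range(n_steps):
        new_colonizations = beta * s * c / n * dt  # uncolonized -> colonized
        clearances = gamma * c * dt                # colonized -> uncolonized
        s += clearances - new_colonizations
        c += new_colonizations - clearances
        history.append((s, c))
    return history


# Made-up scenario: 20 patients, one initially colonized.
history = simulate_ward(s0=19, c0=1, beta=0.5, gamma=0.1, dt=0.1, n_steps=2000)
final_s, final_c = history[-1]
```

In this toy setting, a control measure can be represented as a change in a parameter, for example a lower `beta` for improved hygiene. Rerunning the simulation over a range of such values is a crude stand-in for the kind of sensitivity analysis the abstract mentions.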
Abstract:
Artificial Intelligence (AI) and Machine Learning (ML) are novel data analysis techniques providing very accurate prediction results. They are widely adopted in a variety of industries to improve efficiency and decision-making, but they are also being used to develop intelligent systems. Their success rests on complex mathematical models, whose decisions and rationale are usually so difficult for human users to comprehend that they are dubbed black boxes. This is particularly relevant in sensitive and highly regulated domains. To mitigate and possibly solve this issue, the Explainable AI (XAI) field has become prominent in recent years. XAI consists of models and techniques that enable understanding of the intricate patterns discovered by black-box models. In this thesis, we consider model-agnostic XAI techniques that can be applied to tabular data, with a particular focus on the Credit Scoring domain. Special attention is dedicated to the LIME framework, for which we propose several modifications to the vanilla algorithm, in particular: a pair of complementary Stability Indices that accurately measure LIME stability, and the OptiLIME policy, which helps the practitioner find the proper balance between the stability and the reliability of explanations. We subsequently put forward GLEAMS, a model-agnostic interpretable surrogate model that needs to be trained only once while providing both Local and Global explanations of the black-box model. GLEAMS produces feature attributions and what-if scenarios, from both the dataset and the model perspective. Finally, we argue that synthetic data are an emerging trend in AI, being used more and more to train complex models instead of original data. To be able to explain the outcomes of such models, we must guarantee that synthetic data are reliable enough for their explanations to carry over to real-world individuals. To this end we propose DAISYnt, a suite of tests to measure synthetic tabular data quality and privacy.