3 resultados para Instrumental-variable Methods
em Duke University
Resumo:
My dissertation has three chapters which develop and apply microeconometric tech- niques to empirically relevant problems. All the chapters examines the robustness issues (e.g., measurement error and model misspecification) in the econometric anal- ysis. The first chapter studies the identifying power of an instrumental variable in the nonparametric heterogeneous treatment effect framework when a binary treat- ment variable is mismeasured and endogenous. I characterize the sharp identified set for the local average treatment effect under the following two assumptions: (1) the exclusion restriction of an instrument and (2) deterministic monotonicity of the true treatment variable in the instrument. The identification strategy allows for general measurement error. Notably, (i) the measurement error is nonclassical, (ii) it can be endogenous, and (iii) no assumptions are imposed on the marginal distribution of the measurement error, so that I do not need to assume the accuracy of the measure- ment. Based on the partial identification result, I provide a consistent confidence interval for the local average treatment effect with uniformly valid size control. I also show that the identification strategy can incorporate repeated measurements to narrow the identified set, even if the repeated measurements themselves are endoge- nous. Using the the National Longitudinal Study of the High School Class of 1972, I demonstrate that my new methodology can produce nontrivial bounds for the return to college attendance when attendance is mismeasured and endogenous.
The second chapter, which is a part of a coauthored project with Federico Bugni, considers the problem of inference in dynamic discrete choice problems when the structural model is locally misspecified. We consider two popular classes of estimators for dynamic discrete choice models: K-step maximum likelihood estimators (K-ML) and K-step minimum distance estimators (K-MD), where K denotes the number of policy iterations employed in the estimation problem. These estimator classes include popular estimators such as Rust (1987)’s nested fixed point estimator, Hotz and Miller (1993)’s conditional choice probability estimator, Aguirregabiria and Mira (2002)’s nested algorithm estimator, and Pesendorfer and Schmidt-Dengler (2008)’s least squares estimator. We derive and compare the asymptotic distributions of K- ML and K-MD estimators when the model is arbitrarily locally misspecified and we obtain three main results. In the absence of misspecification, Aguirregabiria and Mira (2002) show that all K-ML estimators are asymptotically equivalent regardless of the choice of K. Our first result shows that this finding extends to a locally misspecified model, regardless of the degree of local misspecification. As a second result, we show that an analogous result holds for all K-MD estimators, i.e., all K- MD estimator are asymptotically equivalent regardless of the choice of K. Our third and final result is to compare K-MD and K-ML estimators in terms of asymptotic mean squared error. Under local misspecification, the optimally weighted K-MD estimator depends on the unknown asymptotic bias and is no longer feasible. In turn, feasible K-MD estimators could have an asymptotic mean squared error that is higher or lower than that of the K-ML estimators. To demonstrate the relevance of our asymptotic analysis, we illustrate our findings using in a simulation exercise based on a misspecified version of Rust (1987) bus engine problem.
The last chapter investigates the causal effect of the Omnibus Budget Reconcil- iation Act of 1993, which caused the biggest change to the EITC in its history, on unemployment and labor force participation among single mothers. Unemployment and labor force participation are difficult to define for a few reasons, for example, be- cause of marginally attached workers. Instead of searching for the unique definition for each of these two concepts, this chapter bounds unemployment and labor force participation by observable variables and, as a result, considers various competing definitions of these two concepts simultaneously. This bounding strategy leads to partial identification of the treatment effect. The inference results depend on the construction of the bounds, but they imply positive effect on labor force participa- tion and negligible effect on unemployment. The results imply that the difference- in-difference result based on the BLS definition of unemployment can be misleading
due to misclassification of unemployment.
Resumo:
Constant technology advances have caused data explosion in recent years. Accord- ingly modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This phenomenon is particularly true for an- alyzing biological data. For example DNA sequence data can be viewed as categorical variables with each nucleotide taking four different categories. The gene expression data, depending on the quantitative technology, could be continuous numbers or counts. With the advancement of high-throughput technology, the abundance of such data becomes unprecedentedly rich. Therefore efficient statistical approaches are crucial in this big data era.
Previous statistical methods for big data often aim to find low dimensional struc- tures in the observed data. For example in a factor analysis model a latent Gaussian distributed multivariate vector is assumed. With this assumption a factor model produces a low rank estimation of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents. The mixture pro- portions of topics, represented by a Dirichlet distributed variable, is assumed. This dissertation proposes several novel extensions to the previous statistical methods that are developed to address challenges in big data. Those novel methods are applied in multiple real world applications including construction of condition specific gene co-expression networks, estimating shared topics among newsgroups, analysis of pro- moter sequences, analysis of political-economics risk data and estimating population structure from genotype data.
Resumo:
Abstract
Continuous variable is one of the major data types collected by the survey organizations. It can be incomplete such that the data collectors need to fill in the missingness. Or, it can contain sensitive information which needs protection from re-identification. One of the approaches to protect continuous microdata is to sum them up according to different cells of features. In this thesis, I represents novel methods of multiple imputation (MI) that can be applied to impute missing values and synthesize confidential values for continuous and magnitude data.
The first method is for limiting the disclosure risk of the continuous microdata whose marginal sums are fixed. The motivation for developing such a method comes from the magnitude tables of non-negative integer values in economic surveys. I present approaches based on a mixture of Poisson distributions to describe the multivariate distribution so that the marginals of the synthetic data are guaranteed to sum to the original totals. At the same time, I present methods for assessing disclosure risks in releasing such synthetic magnitude microdata. The illustration on a survey of manufacturing establishments shows that the disclosure risks are low while the information loss is acceptable.
The second method is for releasing synthetic continuous micro data by a nonstandard MI method. Traditionally, MI fits a model on the confidential values and then generates multiple synthetic datasets from this model. Its disclosure risk tends to be high, especially when the original data contain extreme values. I present a nonstandard MI approach conditioned on the protective intervals. Its basic idea is to estimate the model parameters from these intervals rather than the confidential values. The encouraging results of simple simulation studies suggest the potential of this new approach in limiting the posterior disclosure risk.
The third method is for imputing missing values in continuous and categorical variables. It is extended from a hierarchically coupled mixture model with local dependence. However, the new method separates the variables into non-focused (e.g., almost-fully-observed) and focused (e.g., missing-a-lot) ones. The sub-model structure of focused variables is more complex than that of non-focused ones. At the same time, their cluster indicators are linked together by tensor factorization and the focused continuous variables depend locally on non-focused values. The model properties suggest that moving the strongly associated non-focused variables to the side of focused ones can help to improve estimation accuracy, which is examined by several simulation studies. And this method is applied to data from the American Community Survey.