973 results for Small samples
Abstract:
Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, even though uncertainty quantification remains essential in the sciences, where the number of parameters to estimate often exceeds the sample size despite the huge increases in n seen in many fields. The tendency in some areas of industry to dispense with traditional statistical analysis on the grounds that "n = all" is therefore of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and it is the primary motivation for the work presented here.
Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation; the second is the design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference beyond the standard setting of multivariate continuous data that has been the major focus of previous theoretical work in this area. In the second, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms and for characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.
One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
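To make the latent structure side concrete, a PARAFAC (latent class) model represents the probability tensor as a mixture of product kernels. The minimal numpy sketch below evaluates such a factorized pmf; the dimensions, names, and random parameters are illustrative only and do not reproduce the notation or the collapsed Tucker decompositions of Chapter 2:

```python
import numpy as np

rng = np.random.default_rng(0)
k, p, d = 3, 4, 5                          # latent classes, variables, categories

nu = rng.dirichlet(np.ones(k))             # mixture weights over latent classes
lam = rng.dirichlet(np.ones(d), (k, p))    # lam[h, j] = pmf of variable j in class h

def parafac_pmf(x, nu, lam):
    """P(X1 = x1, ..., Xp = xp) under the latent class (PARAFAC) factorization:
    a weighted sum over classes of products of per-variable probabilities."""
    per_class = np.prod(lam[:, np.arange(len(x)), x], axis=1)
    return np.dot(nu, per_class)

print(parafac_pmf(np.array([0, 1, 2, 3]), nu, lam))
```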
Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.
In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis–Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis–Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback–Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
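For intuition, a Gaussian approximation of this general flavor can be built by matching a mode and a curvature. The sketch below computes a Laplace approximation for a Poisson log-linear model with a simple Gaussian prior; this is a cruder stand-in for exposition, not the KL-optimal approximation or the Diaconis–Ylvisaker prior treated in Chapter 4:

```python
import numpy as np

def laplace_gaussian_approx(X, y, alpha=1.0, n_iter=25):
    """Laplace (mode plus inverse-Hessian) Gaussian approximation to the
    posterior of a Poisson log-linear model y_i ~ Poisson(exp(x_i' theta))
    with a N(0, (1/alpha) I) prior standing in for the conjugate prior."""
    p = X.shape[1]
    theta = np.zeros(p)
    for _ in range(n_iter):                       # Newton iterations to the mode
        mu = np.exp(X @ theta)
        grad = X.T @ (y - mu) - alpha * theta     # gradient of the log posterior
        hess = X.T @ (mu[:, None] * X) + alpha * np.eye(p)
        theta += np.linalg.solve(hess, grad)
    return theta, np.linalg.inv(hess)             # Gaussian mean and covariance
```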
Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
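As a minimal illustration of the basic empirical quantity involved, the sketch below extracts waiting times between exceedances of a high marginal quantile for a single series; the chapter's framework concerns dependence across multiple locations and is considerably richer:

```python
import numpy as np

def exceedance_waiting_times(x, q=0.98):
    """Waiting times (in index units) between exceedances of a high quantile
    threshold; the building block of the waiting-time view of tail dependence."""
    u = np.quantile(x, q)                 # high threshold
    times = np.flatnonzero(x > u)         # indices of threshold exceedances
    return np.diff(times)                 # gaps between consecutive exceedances

rng = np.random.default_rng(1)
x = rng.standard_t(df=3, size=10_000)     # heavy-tailed toy series
print(exceedance_waiting_times(x)[:10])
```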
The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC), the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel, yet comparatively little attention has been paid to convergence and estimation error in the resulting approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
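As one concrete instance of an approximating kernel of the first kind, the following sketch replaces the full-data log-likelihood in a random walk Metropolis step with an unbiased subsample estimate. It is a generic illustration under these assumptions, not the specific constructions analyzed in Chapter 6:

```python
import numpy as np

def approx_mh_step(theta, loglik_i, logprior, data, m, scale, rng):
    """One Metropolis-Hastings step in which the log-likelihood is estimated
    from a random subset of m observations, giving an approximate transition
    kernel (loglik_i evaluates the log-likelihood of a single observation)."""
    n = len(data)
    idx = rng.choice(n, size=m, replace=False)        # random minibatch
    def approx_logpost(t):
        return logprior(t) + (n / m) * sum(loglik_i(t, data[i]) for i in idx)
    prop = theta + scale * rng.standard_normal(theta.shape)
    log_accept = approx_logpost(prop) - approx_logpost(theta)
    return prop if np.log(rng.uniform()) < log_accept else theta
```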
Data augmentation Gibbs samplers are arguably the most popular class of algorithms for approximately sampling from the posterior distribution of the parameters of generalized linear models. The truncated normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size, up to a log factor. In simulation studies, moderate sample sizes already result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly, whereas Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
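For reference, one sweep of the Polya-Gamma scheme of Polson, Scott, and Windle alternates the two conditional updates sketched below. The `draw_pg` sampler for PG(b, c) variates is assumed to be supplied externally (for example via the pypolyagamma package), and a zero prior mean is assumed for brevity:

```python
import numpy as np

def pg_gibbs_step(beta, X, y, B_inv, draw_pg, rng):
    """One Polya-Gamma data augmentation update for Bayesian logistic
    regression with prior beta ~ N(0, B). `draw_pg(b, c)` is an assumed
    interface returning PG(b_i, c_i) draws for vectors b and c."""
    omega = draw_pg(np.ones(len(y)), X @ beta)       # omega_i ~ PG(1, x_i' beta)
    kappa = y - 0.5                                  # centered responses
    V = np.linalg.inv(X.T @ (omega[:, None] * X) + B_inv)
    m = V @ (X.T @ kappa)                            # posterior conditional mean
    return rng.multivariate_normal(m, V)             # beta | omega, y ~ N(m, V)
```

The slow mixing result concerns exactly this kernel in the rare-event regime: when few successes are observed, omega concentrates in a way that makes successive beta draws highly autocorrelated.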
Abstract:
The presence of harmful algal blooms (HABs) is a growing concern in aquatic environments. Among HAB organisms, cyanobacteria are of special concern because they have been reported worldwide to cause environmental and human health problems through contamination of drinking water. Although several analytical approaches have been applied to monitoring cyanobacterial toxins, conventional methods are costly and time-consuming, with analyses taking weeks between field sampling and the subsequent laboratory work. Capillary electrophoresis (CE) is a particularly suitable analytical separation method because it can couple very small samples and rapid separations to a wide range of selective and sensitive detection techniques. This paper demonstrates a method for rapid separation and identification of four microcystin variants commonly found in aquatic environments. CE procedures coupled to UV detection and electrospray ionization time-of-flight mass spectrometry (ESI-TOF) were developed. All four analytes were separated within 6 minutes. The ESI-TOF experiment provides accurate molecular mass information, which further identifies the analytes.
Abstract:
Since 1997, there has been increasing research focused on Muscle Dysmorphia, a condition underpinned by people's beliefs that they have insufficient muscularity, in both the Western and non-Western medical and scientific communities. Much of this empirical interest has surveyed nonclinical samples, and there is limited understanding of people with the condition beyond knowledge about their characteristics. Much existing knowledge about people with the condition is unsurprising and inherent in the definition of the disorder, such as dissatisfaction with muscularity and adherence to muscle-building activities. Only recently have investigators started to explore questions beyond these limited tautological findings that may give rise to substantial knowledge advances, such as the examination of masculine and feminine norms. There is limited understanding of additional topics such as etiology, prevalence, nosology, prognosis, and treatment. Further, the evidence is largely based on a small number of unstandardized case reports and descriptive studies (involving small samples), largely confined to Western (North American, British, and Australian) males. Although much research has been undertaken since the term Muscle Dysmorphia entered the psychiatric lexicon in 1997, there remains tremendous scope for knowledge advancement. A primary task in the short term is for investigators to examine the extent to which the condition exists among well-defined populations, to help determine the justification for research funding relative to other public health issues. A greater variety of research questions and designs may contribute to a broader and more robust knowledge base than currently exists. Future work will help clinicians assist a group of people whose quality of life and health are placed at risk by their muscular preoccupation.
Abstract:
Biochemical agents, including bacteria and toxins, are potentially dangerous and responsible for a wide variety of diseases. Reliable detection and characterization of small samples is necessary in order to reduce and eliminate their harmful consequences. Microcantilever sensors offer a potential alternative to the state of the art due to their small size, fast response time, and ability to operate in air and liquid environments. At present, several technology limitations inhibit the application of microcantilevers to biochemical detection and analysis, including difficulties in conducting temperature-sensitive experiments, material inadequacies resulting in insufficient cell capture, and poor selectivity among multiple analytes. This work aims to address several of these issues by introducing microcantilevers with integrated thermal functionality and by introducing nanocrystalline diamond as a new material for microcantilevers. Microcantilevers are designed, fabricated, characterized, and used for the capture and detection of cells and bacteria. The first microcantilever type described in this work is a silicon cantilever with a highly uniform in-plane temperature distribution. The goal is a 100 μm square uniformly heated area that can be used for thermal characterization of films as well as for conducting chemical reactions with small amounts of material. Fabricated cantilevers can reach above 300 °C while maintaining temperature uniformity of 2–4%, an improvement of over one order of magnitude over currently available cantilevers. The second microcantilever type is a doped single crystal silicon cantilever with a thin coating of ultrananocrystalline diamond (UNCD). The primary application of such a device is in biological testing, where the diamond acts as a stable, electrically isolated reaction surface while the silicon layer provides controlled heating with minimal variations in temperature. This work shows that composite cantilevers of this kind are an effective platform for temperature-sensitive biological experiments, such as heat lysing and the polymerase chain reaction. The rapid heat transfer of the Si-UNCD cantilever compromised the membrane of NIH 3T3 fibroblasts and lysed the cell nucleus within 30 seconds. Bacterial cells, Listeria monocytogenes V7, were shown to be captured with biotinylated heat-shock protein on the UNCD surface, and 90% of all viable cells exhibited membrane porosity due to high heat within 15 seconds. Lastly, a sensor made solely from UNCD is fabricated with the intention of detecting the presence of biological species by means of an integrated piezoresistor or through monitoring of frequency changes. Since UNCD has not previously been used in piezoresistive applications, temperature-dependent piezoresistive coefficients and gauge factors are determined first. The doped UNCD exhibits a significant piezoresistive effect, with a gauge factor of 7.53 ± 0.32 and a piezoresistive coefficient of 8.12 × 10^−12 Pa^−1 at room temperature. The piezoresistive properties of UNCD are constant over the temperature range of 25–200 °C. The 300 μm long cantilevers have the highest sensitivity, 0.186 mΩ/Ω per μm of cantilever end deflection, approximately half that of similarly sized silicon cantilevers. UNCD cantilever arrays were fabricated consisting of four sixteen-cantilever arrays with lengths of 20–90 μm, in addition to an eight-cantilever array of length 120 μm.
Laser Doppler vibrometry (LDV) measured the cantilever resonant frequencies, which ranged from 218 kHz to 5.14 MHz in air and from 73 kHz to 3.68 MHz in water. The quality factor of the cantilevers was 47–151 in air and 18–45 in water. The ability to measure the frequencies of the cantilever arrays opens the possibility of detecting individual bacteria by monitoring the frequency shift after cell capture.
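As an aside on that last point, the textbook point-mass approximation relates an added tip mass to the observed resonant frequency shift by modeling the cantilever as a harmonic oscillator. The sketch below uses this standard relation with illustrative numbers only, not parameters reported in this work:

```python
import math

def added_mass(k, f0, f1):
    """Mass (kg) added at the cantilever tip when the resonance shifts from
    f0 to f1 (Hz), from f = (1/2*pi) * sqrt(k / m) with spring constant k."""
    return (k / (4 * math.pi ** 2)) * (1 / f1 ** 2 - 1 / f0 ** 2)

# Illustrative numbers: k = 0.1 N/m, a 1 MHz resonance dropping by 100 Hz.
print(added_mass(0.1, 1.00e6, 0.9999e6))
```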
Abstract:
Doctorate in Economics
Abstract:
Validation study of schoolchildren attending official (public) educational institutions in the city of Bogotá, Colombia. The CCC-FUPRECOL questionnaire was designed and administered as a self-completed structured form, assessing the stages of change for physical activity/exercise, consumption of fruits and vegetables, drug use, tobacco use, and intake of alcoholic beverages.
Abstract:
The thesis deals with the problem of Model Selection (MS) motivated by information and prediction theory, focusing on parametric time series (TS) models. The main contribution of the thesis is the extension to the multivariate case of the Misspecification-Resistant Information Criterion (MRIC), a recently introduced criterion that solves the original research problem Akaike posed 50 years ago, which led to the definition of the AIC. The importance of MS is witnessed by the huge amount of literature devoted to it in scientific journals of many different disciplines. Despite such widespread treatment, the contributions that adopt a mathematically rigorous approach are not numerous, and one of the aims of this project is to review and assess them. Chapter 2 discusses methodological aspects of MS from the standpoint of information theory. Information criteria (IC) for the i.i.d. setting are surveyed along with their asymptotic properties, together with the cases of small samples, misspecification, and alternative estimators. Chapter 3 surveys criteria for TS. IC and prediction criteria are considered for: univariate models (AR, ARMA) in the time and frequency domains; parametric multivariate models (VARMA, VAR); nonparametric nonlinear models (NAR); and high-dimensional models. The MRIC answers Akaike's original question on efficient criteria for possibly misspecified (PM) univariate TS models in multi-step prediction, including high-dimensional data and nonlinear models. Chapter 4 extends the MRIC to PM multivariate TS models for multi-step prediction, introducing the Vectorial MRIC (VMRIC). We show that the VMRIC is asymptotically efficient by proving the decomposition of the MSPE matrix and the consistency of its Method-of-Moments Estimator (MoME), for least squares multi-step prediction with a univariate regressor. Chapter 5 extends the VMRIC to the general multiple-regressor case by showing that the MSPE matrix decomposition holds, obtaining consistency for its MoME, and proving its efficiency. The chapter concludes with a digression on the conditions for PM VARX models.
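As a baseline for the criteria surveyed in Chapters 2 and 3, the sketch below computes the Gaussian AIC of least squares AR(p) fits over candidate orders; it illustrates the generic information-criterion recipe only and does not implement the MRIC or VMRIC:

```python
import numpy as np

def ar_aic(y, p):
    """Gaussian AIC of an AR(p) model with intercept, fit by least squares."""
    Y = y[p:]
    X = np.column_stack([np.ones(len(Y))] +
                        [y[p - j:-j] for j in range(1, p + 1)])  # lagged regressors
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    sigma2 = np.mean((Y - X @ coef) ** 2)
    return len(Y) * np.log(sigma2) + 2 * (p + 2)   # +2: intercept and variance

rng = np.random.default_rng(2)
e = rng.standard_normal(500)
y = np.zeros(500)
for t in range(2, 500):                            # simulate a stationary AR(2)
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + e[t]
print(min(range(1, 7), key=lambda p: ar_aic(y, p)))   # typically selects order 2
```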
Abstract:
Topological order has proven a useful concept to describe quantum phase transitions which are not captured by the Ginzburg-Landau type of symmetry-breaking order. However, lacking a local order parameter, topological order is hard to detect. One way to detect it is via direct observation of the anyonic properties of excitations, which are usually discussed in the thermodynamic limit but have so far not been observed in macroscopic quantum Hall samples. Here we consider a system of few interacting bosons confined to the lowest Landau level by a gauge potential, and theoretically investigate vortex excitations in order to identify topological properties of different ground states. Our investigation demonstrates that even in surprisingly small systems anyonic properties are able to characterize the topological order. In addition, focusing on a system in the Laughlin state, we study the robustness of its anyonic behavior in the presence of tunable finite-range interactions acting as a perturbation. A clear signal of a transition to a different state is reflected by the system's anyonic properties.
Abstract:
The concentrations of dissolved noble gases in water are widely used as a climate proxy to determine noble gas temperatures (NGTs), i.e., the temperature of the water when gas exchange last occurred. In this paper we take a step forward in applying this principle to fluid inclusions in stalagmites, in order to reconstruct the cave temperature prevailing at the time an inclusion was formed. We present an analytical protocol that allows us to accurately determine noble gas concentrations and isotope ratios in stalagmites, and which includes a precise manometric determination of the mass of water liberated from the fluid inclusions. Most important for NGT determination is to reduce the amount of noble gases liberated from air inclusions, as they mask the temperature-dependent noble gas signal from the water inclusions. We demonstrate that offline pre-crushing in air, followed by extraction of noble gases and water from the samples by heating, is appropriate to separate gases released from air and water inclusions. Although a large fraction of the recent samples analysed with this technique yields NGTs close to present-day cave temperatures, the interpretation of measured noble gas concentrations in terms of NGTs is not yet feasible using the available least squares fitting models, because the noble gas concentrations in stalagmites are not composed solely of the two components, air and air saturated water (ASW), that these models are able to account for. The observed enrichments in the heavy noble gases are interpreted as being due to adsorption during sample preparation in air, whereas the excess in He and Ne is interpreted as an additional noble gas component bound in voids in the crystallographic structure of the calcite crystals. As a consequence of our findings, NGTs will in the future have to be determined using the concentrations of Ar, Kr, and Xe only. This needs to be achieved by further optimizing the sample preparation to minimize atmospheric contamination and to further reduce the amount of noble gases released from air inclusions.
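Schematically, the least squares models referred to here fit a two-component air/ASW mixture to the measured concentrations. The sketch below shows the structure of such a fit restricted to the heavy gases, as the study recommends; the `asw_conc(T)` solubility function is a user-supplied assumption (built from published solubility data), not something defined in this paper:

```python
import numpy as np
from scipy.optimize import least_squares

GASES = ["He", "Ne", "Ar", "Kr", "Xe"]

def fit_ngt(c_meas, asw_conc, c_air, use=("Ar", "Kr", "Xe")):
    """Fit the two-component model  C_g = ASW_g(T) + a * AIR_g  for the noble
    gas temperature T and the air admixture a. `asw_conc(T)` must return the
    air saturated water concentrations of all five gases at temperature T."""
    idx = [GASES.index(g) for g in use]           # heavy gases only
    def resid(params):
        T, a = params
        model = asw_conc(T) + a * c_air
        return (c_meas[idx] - model[idx]) / c_meas[idx]   # relative residuals
    return least_squares(resid, x0=[10.0, 1e-3]).x        # (T in deg C, a)
```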
Abstract:
In developing countries such as Brazil, where canine rabies is still a considerable problem, samples from wildlife species are infrequently collected and submitted for rabies screening. A collaborative study involving environmental biologists and veterinarians was established for rabies epidemiological research in a specific ecological area of São Paulo State, Brazil. The brains of wild animals must be collected without damaging the skull, because skull measurements are important for identifying the captured animal species. For this purpose, samples from bats and small mammals were collected by aspiration, inserting a plastic pipette into the brain through the foramen magnum. Beyond the progressive increase in the use of the plastic pipette technique across various studies, this method could foster collaborative research between wildlife scientists and rabies epidemiologists, thus improving rabies surveillance.