83 results for Columbia University. Dept. of Psychology.


Relevance: 100.00%

Abstract:

In this paper, we study panel count data with informative observation times. We assume nonparametric and semiparametric proportional rate models for the underlying recurrent event process, where the form of the baseline rate function is left unspecified and a subject-specific frailty variable inflates or deflates the rate function multiplicatively. The proposed models allow the recurrent event processes and observation times to be correlated through their connections with the unobserved frailty; moreover, the distributions of both the frailty variable and observation times are considered as nuisance parameters. The baseline rate function and the regression parameters are estimated by maximizing a conditional likelihood function of observed event counts and solving estimating equations. Large sample properties of the proposed estimators are studied. Numerical studies demonstrate that the proposed estimation procedures perform well for moderate sample sizes. An application to a bladder tumor study is presented to illustrate the use of the proposed methods.
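The key mechanism described here, a shared frailty that multiplies both the event rate and the observation-time rate, can be made concrete with a small simulation. This is an illustrative sketch, not the paper's estimation procedure: the gamma frailty, the rates, and the Poisson counts below are all assumed for the example.

```python
import math
import random

def rpois(lam, rng):
    """Draw a Poisson variate (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k - 1

rng = random.Random(42)
n = 2000
events, visits = [], []
for _ in range(n):
    z = rng.gammavariate(2.0, 0.5)      # subject-specific frailty, mean 1
    events.append(rpois(5.0 * z, rng))  # recurrent event count, rate inflated by z
    visits.append(rpois(3.0 * z, rng))  # number of observation times, same frailty

# The shared frailty induces correlation between counts and observation times
me, mv = sum(events) / n, sum(visits) / n
cov = sum((e - me) * (v - mv) for e, v in zip(events, visits)) / n
sde = math.sqrt(sum((e - me) ** 2 for e in events) / n)
sdv = math.sqrt(sum((v - mv) ** 2 for v in visits) / n)
corr = cov / (sde * sdv)
```

Subjects with a large frailty both experience more events and are observed more often, which is exactly the dependence a naive analysis that treats observation times as non-informative would miss.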

Relevance: 100.00%

Abstract:

The ability to make scientific findings reproducible is increasingly important in areas where substantive results are the product of complex statistical computations. Reproducibility can allow others to verify the published findings and conduct alternate analyses of the same data. A natural question is: how can one conduct and distribute reproducible research? This question is relevant from the point of view of both the authors who want to make their research reproducible and readers who want to reproduce relevant findings reported in the scientific literature. We present a framework in which reproducible research can be conducted and distributed via cached computations and describe specific tools for both authors and readers. As a prototype implementation we introduce three software packages written in the R language. The cacheSweave and stashR packages together provide tools for caching computational results in a key-value style database which can be published to a public repository for readers to download. The SRPM package provides tools for generating and interacting with "shared reproducibility packages" (SRPs) which can facilitate the distribution of the data and code. As a case study we demonstrate the use of the toolkit on a national study of air pollution exposure and mortality.
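The caching idea, hash the code of an analysis block, store the objects it creates in a key-value database, and let readers fetch results instead of recomputing, can be sketched in a few lines. This is a conceptual illustration in Python, not the API of cacheSweave or stashR; the class and method names are hypothetical.

```python
import hashlib
import pickle

class ComputationCache:
    """Hypothetical key-value cache of computed results, keyed by code text."""

    def __init__(self):
        self.store = {}  # in practice: an on-disk database published to a repository

    def eval(self, code, env):
        key = hashlib.sha256(code.encode()).hexdigest()
        if key in self.store:
            env.update(pickle.loads(self.store[key]))  # cache hit: restore objects
            return
        before = set(env)
        exec(code, env)  # run the (possibly expensive) computation
        created = {k: v for k, v in env.items()
                   if k not in before and not k.startswith("__")}
        self.store[key] = pickle.dumps(created)  # cache the objects it created

cache = ComputationCache()
author_env, reader_env = {}, {}
cache.eval("fit = sum(i * i for i in range(10 ** 5))", author_env)  # computed
cache.eval("fit = sum(i * i for i in range(10 ** 5))", reader_env)  # from cache
```

The reader's second call never re-runs the computation; it restores `fit` from the serialized store, which is the role the published key-value database plays in the framework described above.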

Relevance: 100.00%

Abstract:

Geostatistics involves the fitting of spatially continuous models to spatially discrete data (Chilès and Delfiner, 1999). Preferential sampling arises when the process that determines the data-locations and the process being modelled are stochastically dependent. Conventional geostatistical methods assume, if only implicitly, that sampling is non-preferential. However, these methods are often used in situations where sampling is likely to be preferential. For example, in mineral exploration samples may be concentrated in areas thought likely to yield high-grade ore. We give a general expression for the likelihood function of preferentially sampled geostatistical data and describe how this can be evaluated approximately using Monte Carlo methods. We present a model for preferential sampling, and demonstrate through simulated examples that ignoring preferential sampling can lead to seriously misleading inferences. We describe an application of the model to a set of bio-monitoring data from Galicia, northern Spain, in which making allowance for preferential sampling materially changes the inferences.
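A toy simulation makes the danger concrete: if sampling locations are chosen where the field is high, as in the mineral-exploration example, the naive sample mean is badly biased. The one-dimensional "field" and the exponential weighting below are illustrative assumptions, not the paper's model.

```python
import math
import random

rng = random.Random(1)
# A smooth one-dimensional "field" standing in for a spatial surface
grid = [i / 200 for i in range(201)]
field = [math.sin(2 * math.pi * x) + 0.5 * x for x in grid]
true_mean = sum(field) / len(field)

# Preferential design: probability of sampling a location grows with its value
weights = [math.exp(2 * v) for v in field]
pref = rng.choices(field, weights=weights, k=200)
pref_mean = sum(pref) / len(pref)

# Non-preferential design: locations chosen uniformly at random
unif = rng.choices(field, k=200)
unif_mean = sum(unif) / len(unif)
```

The uniform design recovers the true mean up to sampling noise, while the preferential design overestimates it substantially, which is the kind of misleading inference the likelihood-based correction is designed to repair.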

Relevance: 100.00%

Abstract:

Visualization and exploratory analysis are an important part of any data analysis and are made more challenging when the data are voluminous and high-dimensional. One such example is environmental monitoring data, which are often collected over time and at multiple locations, resulting in a geographically indexed multivariate time series. Financial data, although not necessarily containing a geographic component, present another source of high-volume multivariate time series data. We present the mvtsplot function, which provides a method for visualizing multivariate time series data. We outline the basic design concepts and provide some examples of its usage by applying it to a database of ambient air pollution measurements in the United States and to a hypothetical portfolio of stocks.
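The core display idea behind this kind of plot, collapse each series to a coarse categorical scale so that many heterogeneous series can be compared side by side as an image, can be sketched as follows. The within-series tertile scheme is a simplified stand-in for what mvtsplot actually draws, and the function name is hypothetical.

```python
def discretize(series, n_levels=3):
    """Map each value to its within-series level (1 = low .. n_levels = high),
    so series on very different scales share one categorical colour scale."""
    ranks = sorted(series)
    cuts = [ranks[len(ranks) * k // n_levels] for k in range(1, n_levels)]
    return [sum(x > c for c in cuts) + 1 for x in series]

# Three monitors with very different measurement scales become comparable
monitors = {
    "site_a": [12, 15, 11, 30, 28, 9],
    "site_b": [0.2, 0.9, 0.4, 0.3, 1.5, 1.1],
    "site_c": [100, 400, 250, 120, 380, 90],
}
image_rows = {name: discretize(vals) for name, vals in monitors.items()}
```

Each row of `image_rows` would become one horizontal band of an image plot, with colour encoding the level, so that common temporal patterns across monitors stand out at a glance.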

Relevance: 100.00%

Abstract:

Medical errors originating in health care facilities are a significant source of preventable morbidity, mortality, and healthcare costs. Voluntary error report systems that collect information on the causes and contributing factors of medical errors regardless of the resulting harm may be useful for developing effective harm prevention strategies. Some patient safety experts question the utility of data from errors that did not lead to harm to the patient, also called near misses. A near miss (a.k.a. close call) is an unplanned event that did not result in injury to the patient. Only a fortunate break in the chain of events prevented injury. We use data from a large voluntary reporting system of 836,174 medication errors from 1999 to 2005 to provide evidence that the causes and contributing factors of errors that result in harm are similar to the causes and contributing factors of near misses. We develop Bayesian hierarchical models for estimating the log odds of selecting a given cause (or contributing factor) of error given harm has occurred and the log odds of selecting the same cause given that harm did not occur. The posterior distribution of the correlation between these two vectors of log-odds is used as a measure of the evidence supporting the use of data from near misses and their causes and contributing factors to prevent medical errors. In addition, we identify the causes and contributing factors that have the highest or lowest log-odds ratio of harm versus no harm. These causes and contributing factors should also be a focus in the design of prevention strategies. This paper provides important evidence on the utility of data from near misses, which constitute the vast majority of errors in our data.
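The central quantity, the correlation between cause-specific log odds under harm and under no harm, can be sketched with a simple plug-in (non-Bayesian) version. All counts below are made up for illustration; the actual analysis works with posterior distributions from hierarchical models.

```python
import math

# Hypothetical report counts per cause: (resulted in harm, near miss)
causes = {
    "distraction":   (120, 5100),
    "communication": (80, 3400),
    "labeling":      (15, 900),
    "dosing":        (200, 7800),
    "storage":       (9, 610),
}

def log_odds(count, total):
    p = count / total
    return math.log(p / (1 - p))

harm_total = sum(h for h, _ in causes.values())
miss_total = sum(m for _, m in causes.values())
lo_harm = [log_odds(h, harm_total) for h, _ in causes.values()]
lo_miss = [log_odds(m, miss_total) for _, m in causes.values()]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den

# High correlation: near misses share the cause profile of harmful errors
r = pearson(lo_harm, lo_miss)
```

With these hypothetical counts the two log-odds vectors are nearly collinear, which is the pattern that would support using the far more numerous near-miss reports to design prevention strategies.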

Relevance: 100.00%

Abstract:

Genome-wide association studies (GWAS) are used to discover genes underlying complex, heritable disorders for which less powerful study designs have failed in the past. The number of GWAS has skyrocketed recently, with findings reported in top journals and the mainstream media. Microarrays are the genotype calling technology of choice in GWAS as they permit exploration of more than a million single nucleotide polymorphisms (SNPs) simultaneously. The starting point for the statistical analyses used by GWAS, to determine association between loci and disease, is genotype calls (AA, AB, or BB). However, the raw data, microarray probe intensities, are heavily processed before arriving at these calls. Various sophisticated statistical procedures have been proposed for transforming raw data into genotype calls. We find that variability in microarray output quality across different SNPs, different arrays, and different sample batches has substantial influence on the accuracy of genotype calls made by existing algorithms. By failing to account for these sources of variability, GWAS run the risk of adversely affecting the quality of reported findings. In this paper we present solutions based on a multi-level mixed model. A software implementation of the method described in this paper is available as free and open source code in the crlmm R/Bioconductor package.

Relevance: 100.00%

Abstract:

We consider inference in randomized studies, in which repeatedly measured outcomes may be informatively missing due to drop out. In this setting, it is well known that full data estimands are not identified unless unverified assumptions are imposed. We assume a non-future dependence model for the drop-out mechanism and posit an exponential tilt model that links non-identifiable and identifiable distributions. This model is indexed by non-identified parameters, which are assumed to have an informative prior distribution, elicited from subject-matter experts. Under this model, full data estimands are shown to be expressed as functionals of the distribution of the observed data. To avoid the curse of dimensionality, we model the distribution of the observed data using a Bayesian shrinkage model. In a simulation study, we compare our approach to a fully parametric and a fully saturated model for the distribution of the observed data. Our methodology is motivated and applied to data from the Breast Cancer Prevention Trial.
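The exponential tilt that links the observed-data distribution to the unidentified dropout distribution is easy to make concrete. The outcome support, probabilities, and tilt parameter below are hypothetical; in the setting described, the analogous sensitivity parameter carries an informative prior elicited from experts.

```python
import math

support = [0, 1, 2, 3, 4]               # hypothetical outcome values
p_obs = [0.10, 0.25, 0.30, 0.25, 0.10]  # distribution among non-dropouts

def tilt(p, support, alpha):
    """Exponential tilt: p_alpha(y) proportional to exp(alpha * y) * p(y)."""
    w = [math.exp(alpha * y) * q for y, q in zip(support, p)]
    z = sum(w)
    return [x / z for x in w]

def mean(p):
    return sum(y * q for y, q in zip(support, p))

p_mar = tilt(p_obs, support, 0.0)    # alpha = 0 recovers missing at random
p_mnar = tilt(p_obs, support, -0.5)  # alpha < 0: dropouts tend to do worse
```

Because `alpha` is not identified by the observed data, every value yields a coherent dropout distribution; the Bayesian machinery averages full-data estimands over the elicited prior for such parameters.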

Relevance: 100.00%

Abstract:

Among the many applications of microarray technology, one of the most popular is the identification of genes that are differentially expressed in two conditions. A common statistical approach is to quantify the interest of each gene with a p-value, adjust these p-values for multiple comparisons, choose an appropriate cut-off, and create a list of candidate genes. This approach has been criticized for ignoring biological knowledge regarding how genes work together. Recently a series of methods that do incorporate biological knowledge have been proposed. However, many of these methods seem overly complicated. Furthermore, the most popular method, Gene Set Enrichment Analysis (GSEA), is based on a statistical test known for its lack of sensitivity. In this paper we compare the performance of a simple alternative to GSEA. We find that this simple solution clearly outperforms GSEA. We demonstrate this with eight different microarray datasets.
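A "simple alternative" of the kind described, score a gene set by the average of its per-gene statistics and calibrate against randomly drawn sets of the same size, can be sketched as below. The scores and the spiked set are simulated, and this generic summary-statistic approach is an illustration rather than necessarily the exact test the paper evaluates.

```python
import random
import statistics

rng = random.Random(7)
# Simulated per-gene differential-expression statistics (e.g. t-like scores)
scores = {f"g{i}": rng.gauss(0, 1) for i in range(1000)}
gene_set = [f"g{i}" for i in range(25)]
for g in gene_set:                 # give the set a modest coordinated shift
    scores[g] += 0.8

def set_score(genes):
    """Average per-gene statistic within a gene set."""
    return statistics.mean(scores[g] for g in genes)

observed = set_score(gene_set)
# Null reference: scores of random gene sets of the same size
names = list(scores)
null = [set_score(rng.sample(names, len(gene_set))) for _ in range(2000)]
p_value = sum(s >= observed for s in null) / len(null)
```

A modest shift shared by all 25 genes, invisible to gene-by-gene cut-offs, is picked up by the set-level average, which is the motivation for moving beyond per-gene p-value lists.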

Relevance: 100.00%

Abstract:

Submicroscopic changes in chromosomal DNA copy number dosage are common and have been implicated in many heritable diseases and cancers. Recent high-throughput technologies have a resolution that permits the detection of segmental changes in DNA copy number that span thousands of basepairs across the genome. Genome-wide association studies (GWAS) may simultaneously screen for copy number-phenotype and SNP-phenotype associations as part of the analytic strategy. However, genome-wide array analyses are particularly susceptible to batch effects as the logistics of preparing DNA and processing thousands of arrays often involves multiple laboratories and technicians, or changes over calendar time to the reagents and laboratory equipment. Failure to adjust for batch effects can lead to incorrect inference and requires inefficient post-hoc quality control procedures that exclude regions that are associated with batch. Our work extends previous model-based approaches for copy number estimation by explicitly modeling batch effects and using shrinkage to improve locus-specific estimates of copy number uncertainty. Key features of this approach include the use of diallelic genotype calls from experimental data to estimate batch- and locus-specific parameters of background and signal without the requirement of training data. We illustrate these ideas using a study of bipolar disease and a study of chromosome 21 trisomy. The former has batch effects that dominate much of the observed variation in quantile-normalized intensities, while the latter illustrates the robustness of our approach to datasets where as many as 25% of the samples have altered copy number. Locus-specific estimates of copy number can be plotted on the copy-number scale to investigate mosaicism and guide the choice of appropriate downstream approaches for smoothing the copy number as a function of physical position. 
The software is open source and implemented in the R package CRLMM available at Bioconductor (http://www.bioconductor.org).

Relevance: 100.00%

Abstract:

We propose a novel class of models for functional data exhibiting skewness or other shape characteristics that vary with spatial or temporal location. We use copulas so that the marginal distributions and the dependence structure can be modeled independently. Dependence is modeled with a Gaussian or t-copula, so that there is an underlying latent Gaussian process. We model the marginal distributions using the skew t family. The mean, variance, and shape parameters are modeled nonparametrically as functions of location. A computationally tractable inferential framework for estimating heterogeneous asymmetric or heavy-tailed marginal distributions is introduced. This framework provides a new set of tools for increasingly complex data collected in medical and public health studies. Our methods were motivated by and are illustrated with a state-of-the-art study of neuronal tracts in multiple sclerosis patients and healthy controls. Using the tools we have developed, we were able to find those locations along the tract most affected by the disease. However, our methods are general and highly relevant to many functional data sets. In addition to the application to one-dimensional tract profiles illustrated here, higher-dimensional extensions of the methodology could have direct applications to other biological data including functional and structural MRI.
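The separation the copula buys, choosing the dependence structure and the marginal distributions independently, can be sketched with a Gaussian copula paired with a deliberately skewed marginal. The exponential marginal here is an illustrative stand-in for the skew t family used in the work described.

```python
import math
import random

rng = random.Random(3)

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def exp_quantile(u, rate=1.0):
    """Inverse CDF of an exponential marginal (a stand-in skewed margin)."""
    return -math.log(1 - u) / rate

rho = 0.8  # dependence lives entirely in the latent Gaussian layer
xs, ys = [], []
for _ in range(5000):
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    u1, u2 = norm_cdf(z1), norm_cdf(z2)  # copula (uniform) scale
    xs.append(exp_quantile(u1))          # skewed marginal at location 1
    ys.append(exp_quantile(u2))          # skewed marginal at location 2

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
sx = math.sqrt(sum((a - mx) ** 2 for a in xs) / n)
sy = math.sqrt(sum((b - my) ** 2 for b in ys) / n)
corr = cov / (sx * sy)
```

Swapping in a different quantile function changes the marginal shape at each location without touching the latent Gaussian dependence, which is precisely how shape parameters can vary smoothly along a tract while the process remains coherent.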

Relevance: 100.00%

Abstract:

We establish a fundamental equivalence between singular value decomposition (SVD) and functional principal components analysis (FPCA) models. The constructive relationship allows one to deploy the numerical efficiency of SVD to fully estimate the components of FPCA, even for extremely high-dimensional functional objects, such as brain images. As an example, a functional mixed effect model is fitted to high-resolution morphometric (RAVENS) images. The main directions of morphometric variation in brain volumes are identified and discussed.
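The computational point, that the leading FPCA component can be obtained by working with the data matrix itself (the SVD route) instead of forming the huge empirical covariance, can be illustrated with power iteration. This is a sketch on simulated one-component data, not the RAVENS analysis.

```python
import math
import random

rng = random.Random(11)
n, p = 40, 120  # 40 sampled "functions" observed on a 120-point grid

# One dominant smooth mode of variation plus noise
f = [math.sin(2 * math.pi * j / p) for j in range(p)]
subj = [rng.gauss(0, 1) for _ in range(n)]
X = [[3 * subj[i] * f[j] + rng.gauss(0, 0.5) for j in range(p)]
     for i in range(n)]
col_means = [sum(X[i][j] for i in range(n)) / n for j in range(p)]
Xc = [[X[i][j] - col_means[j] for j in range(p)] for i in range(n)]

def normalize(v):
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

v0 = normalize([rng.gauss(0, 1) for _ in range(p)])

# (a) Covariance route: form the p x p matrix X'X, then power-iterate
C = [[sum(Xc[i][j] * Xc[i][k] for i in range(n)) for k in range(p)]
     for j in range(p)]
va = v0[:]
for _ in range(50):
    va = normalize([sum(C[j][k] * va[k] for k in range(p)) for j in range(p)])

# (b) SVD route: apply X and X' in turn, never forming the p x p matrix
vb = v0[:]
for _ in range(50):
    w = [sum(Xc[i][j] * vb[j] for j in range(p)) for i in range(n)]       # X v
    vb = normalize([sum(Xc[i][j] * w[i] for i in range(n)) for j in range(p)])

agreement = abs(sum(a * b for a, b in zip(va, vb)))  # |cosine| between them
```

For brain images, p can be in the millions, so route (a) is infeasible while route (b) costs only matrix-vector products with the data matrix; both converge to the same leading component.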

Relevance: 100.00%

Abstract:

Latent class regression models are useful tools for assessing associations between covariates and latent variables. However, evaluation of key model assumptions cannot be performed using methods from standard regression models due to the unobserved nature of latent outcome variables. This paper presents graphical diagnostic tools to evaluate whether or not latent class regression models adhere to standard assumptions of the model: conditional independence and non-differential measurement. An integral part of these methods is the use of a Markov Chain Monte Carlo estimation procedure. Unlike standard maximum likelihood implementations for latent class regression model estimation, the MCMC approach allows us to calculate posterior distributions and point estimates of any functions of parameters. It is this convenience that allows us to provide the diagnostic methods that we introduce. As a motivating example we present an analysis focusing on the association between depression and socioeconomic status, using data from the Epidemiologic Catchment Area study. We consider a latent class regression analysis investigating the association between depression and socioeconomic status measures, where the latent variable depression is regressed on education and income indicators, in addition to age, gender, and marital status variables. While the fitted latent class regression model yields interesting results, the model parameters are found to be invalid due to the violation of model assumptions. The violation of these assumptions is clearly identified by the presented diagnostic plots. These methods can be applied to standard latent class and latent class regression models, and the general principle can be extended to evaluate model assumptions in other types of models.

Relevance: 100.00%

Abstract:

This paper describes the use of model-based geostatistics for choosing the optimal set of sampling locations, collectively called the design, for a geostatistical analysis. Two types of design situations are considered. These are retrospective design, which concerns the addition of sampling locations to, or deletion of locations from, an existing design, and prospective design, which consists of choosing optimal positions for a new set of sampling locations. We propose a Bayesian design criterion which focuses on the goal of efficient spatial prediction whilst allowing for the fact that model parameter values are unknown. The results show that in this situation a wide range of inter-point distances should be included in the design, and the widely used regular design is therefore not the optimal choice.

Relevance: 100.00%

Abstract:

In this paper, we develop Bayesian hierarchical distributed lag models for estimating associations between daily variations in summer ozone levels and daily variations in cardiovascular and respiratory (CVDRESP) mortality counts for 19 large U.S. cities included in the National Morbidity Mortality Air Pollution Study (NMMAPS) for the period 1987-1994. At the first stage, we define a semi-parametric distributed lag Poisson regression model to estimate city-specific relative rates of CVDRESP mortality associated with short-term exposure to summer ozone. At the second stage, we specify a class of distributions for the true city-specific relative rates to estimate an overall effect by taking into account the variability within and across cities. We perform the calculations with respect to several random effects distributions (normal, Student-t, and mixture of normals), thus relaxing the common assumption of a two-stage normal-normal hierarchical model. We assess the sensitivity of the results to: 1) lag structure for ozone exposure; 2) degree of adjustment for long-term trends; 3) inclusion of other pollutants in the model; 4) heat waves; 5) random effects distributions; and 6) prior hyperparameters. On average across cities, we found that a 10 ppb increase in summer ozone level for every day in the previous week is associated with a 1.25 percent increase in CVDRESP mortality (95% posterior region: 0.47, 2.03). The relative rate estimates are also positive and statistically significant at lags 0, 1, and 2. We found that associations between summer ozone and CVDRESP mortality are sensitive to the confounding adjustment for PM_10, but are robust to: 1) the adjustment for long-term trends and other gaseous pollutants (NO_2, SO_2, and CO); 2) the distributional assumptions at the second stage of the hierarchical model; and 3) the prior distributions on all unknown parameters.
Bayesian hierarchical distributed lag models and their application to the NMMAPS data allow us to estimate the acute health effects associated with exposure to ambient air pollution over the previous few days, on average across several locations. The application of these methods and the systematic assessment of the sensitivity of findings to model assumptions provide important epidemiological evidence for future air quality regulations.
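The way a distributed lag model turns lag-specific coefficients into a single weekly summary can be shown with simple arithmetic. The coefficients below are hypothetical values chosen only so the result lands near the reported 1.25 percent; they are not the fitted NMMAPS estimates.

```python
import math

# Hypothetical log relative rates per 1 ppb ozone at lags 0..6 days
beta = [4.0e-4, 3.0e-4, 2.5e-4, 1.5e-4, 1.0e-4, 0.5e-4, 0.2e-4]

# A 10 ppb increase sustained over the previous week adds across all lags
log_rr = 10 * sum(beta)
pct_increase = 100 * (math.exp(log_rr) - 1)
```

Because the exposure increase applies at every lag simultaneously, the weekly effect is the sum of the lag coefficients on the log-rate scale, exponentiated to a percent increase in mortality.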

Relevance: 100.00%

Abstract:

The ability to evaluate effects of factors on outcomes is increasingly important for a class of studies that control some but not all of the factors. Although important advances have been made in methods of analysis for such partially controlled studies, work on designs for such studies has been relatively limited. To help understand why, we review the main designs that have been used for partially controlled studies. Based on the review, we give two complementary reasons that explain the limited work on such designs, and suggest a new direction in this area.