14 resultados para High-Dimensional Space Geometrical Informatics (HDSGI)
em Collection Of Biostatistics Research Archive
Resumo:
With recent advances in mass spectrometry techniques, it is now possible to investigate proteins over a wide range of molecular weights in small biological specimens. This advance has generated data-analytic challenges in proteomics, similar to those created by microarray technologies in genetics, namely, discovery of "signature" protein profiles specific to each pathologic state (e.g., normal vs. cancer) or differential profiles between experimental conditions (e.g., treated by a drug of interest vs. untreated) from high-dimensional data. We propose a data analytic strategy for discovering protein biomarkers based on such high-dimensional mass-spectrometry data. A real biomarker-discovery project on prostate cancer is taken as a concrete example throughout the paper: the project aims to identify proteins in serum that distinguish cancer, benign hyperplasia, and normal states of prostate using the Surface Enhanced Laser Desorption/Ionization (SELDI) technology, a recently developed mass spectrometry technique. Our data analytic strategy takes properties of the SELDI mass-spectrometer into account: the SELDI output of a specimen contains about 48,000 (x, y) points where x is the protein mass divided by the number of charges introduced by ionization and y is the protein intensity of the corresponding mass per charge value, x, in that specimen. Given high coefficients of variation and other characteristics of protein intensity measures (y values), we reduce the measures of protein intensities to a set of binary variables that indicate peaks in the y-axis direction in the nearest neighborhoods of each mass per charge point in the x-axis direction. We then account for a shifting (measurement error) problem of the x-axis in SELDI output. After these pre-analysis processing of data, we combine the binary predictors to generate classification rules for cancer, benign hyperplasia, and normal states of prostate. Our approach is to apply the boosting algorithm to select binary predictors and construct a summary classifier. We empirically evaluate sensitivity and specificity of the resulting summary classifiers with a test dataset that is independent from the training dataset used to construct the summary classifiers. The proposed method performed nearly perfectly in distinguishing cancer and benign hyperplasia from normal. In the classification of cancer vs. benign hyperplasia, however, an appreciable proportion of the benign specimens were classified incorrectly as cancer. We discuss practical issues associated with our proposed approach to the analysis of SELDI output and its application in cancer biomarker discovery.
Resumo:
In biostatistical applications, interest often focuses on the estimation of the distribution of time T between two consecutive events. If the initial event time is observed and the subsequent event time is only known to be larger or smaller than an observed monitoring time, then the data is described by the well known singly-censored current status model, also known as interval censored data, case I. We extend this current status model by allowing the presence of a time-dependent process, which is partly observed and allowing C to depend on T through the observed part of this time-dependent process. Because of the high dimension of the covariate process, no globally efficient estimators exist with a good practical performance at moderate sample sizes. We follow the approach of Robins and Rotnitzky (1992) by modeling the censoring variable, given the time-variable and the covariate-process, i.e., the missingness process, under the restriction that it satisfied coarsening at random. We propose a generalization of the simple current status estimator of the distribution of T and of smooth functionals of the distribution of T, which is based on an estimate of the missingness. In this estimator the covariates enter only through the estimate of the missingness process. Due to the coarsening at random assumption, the estimator has the interesting property that if we estimate the missingness process more nonparametrically, then we improve its efficiency. We show that by local estimation of an optimal model or optimal function of the covariates for the missingness process, the generalized current status estimator for smooth functionals become locally efficient; meaning it is efficient if the right model or covariate is consistently estimated and it is consistent and asymptotically normal in general. Estimation of the optimal model requires estimation of the conditional distribution of T, given the covariates. Any (prior) knowledge of this conditional distribution can be used at this stage without any risk of losing root-n consistency. We also propose locally efficient one step estimators. Finally, we show some simulation results.
Resumo:
We establish a fundamental equivalence between singular value decomposition (SVD) and functional principal components analysis (FPCA) models. The constructive relationship allows to deploy the numerical efficiency of SVD to fully estimate the components of FPCA, even for extremely high-dimensional functional objects, such as brain images. As an example, a functional mixed effect model is fitted to high-resolution morphometric (RAVENS) images. The main directions of morphometric variation in brain volumes are identified and discussed.
Resumo:
Use of microarray technology often leads to high-dimensional and low- sample size data settings. Over the past several years, a variety of novel approaches have been proposed for variable selection in this context. However, only a small number of these have been adapted for time-to-event data where censoring is present. Among standard variable selection methods shown both to have good predictive accuracy and to be computationally efficient is the elastic net penalization approach. In this paper, adaptation of the elastic net approach is presented for variable selection both under the Cox proportional hazards model and under an accelerated failure time (AFT) model. Assessment of the two methods is conducted through simulation studies and through analysis of microarray data obtained from a set of patients with diffuse large B-cell lymphoma where time to survival is of interest. The approaches are shown to match or exceed the predictive performance of a Cox-based and an AFT-based variable selection method. The methods are moreover shown to be much more computationally efficient than their respective Cox- and AFT- based counterparts.
Resumo:
Generalized linear mixed models (GLMM) are generalized linear models with normally distributed random effects in the linear predictor. Penalized quasi-likelihood (PQL), an approximate method of inference in GLMMs, involves repeated fitting of linear mixed models with “working” dependent variables and iterative weights that depend on parameter estimates from the previous cycle of iteration. The generality of PQL, and its implementation in commercially available software, has encouraged the application of GLMMs in many scientific fields. Caution is needed, however, since PQL may sometimes yield badly biased estimates of variance components, especially with binary outcomes. Recent developments in numerical integration, including adaptive Gaussian quadrature, higher order Laplace expansions, stochastic integration and Markov chain Monte Carlo (MCMC) algorithms, provide attractive alternatives to PQL for approximate likelihood inference in GLMMs. Analyses of some well known datasets, and simulations based on these analyses, suggest that PQL still performs remarkably well in comparison with more elaborate procedures in many practical situations. Adaptive Gaussian quadrature is a viable alternative for nested designs where the numerical integration is limited to a small number of dimensions. Higher order Laplace approximations hold the promise of accurate inference more generally. MCMC is likely the method of choice for the most complex problems that involve high dimensional integrals.
Resumo:
In many applications the observed data can be viewed as a censored high dimensional full data random variable X. By the curve of dimensionality it is typically not possible to construct estimators that are asymptotically efficient at every probability distribution in a semiparametric censored data model of such a high dimensional censored data structure. We provide a general method for construction of one-step estimators that are efficient at a chosen submodel of the full-data model, are still well behaved off this submodel and can be chosen to always improve on a given initial estimator. These one-step estimators rely on good estimators of the censoring mechanism and thus will require a parametric or semiparametric model for the censoring mechanism. We present a general theorem that provides a template for proving the desired asymptotic results. We illustrate the general one-step estimation methods by constructing locally efficient one-step estimators of marginal distributions and regression parameters with right-censored data, current status data and bivariate right-censored data, in all models allowing the presence of time-dependent covariates. The conditions of the asymptotics theorem are rigorously verified in one of the examples and the key condition of the general theorem is verified for all examples.
Resumo:
In recent years, researchers in the health and social sciences have become increasingly interested in mediation analysis. Specifically, upon establishing a non-null total effect of an exposure, investigators routinely wish to make inferences about the direct (indirect) pathway of the effect of the exposure not through (through) a mediator variable that occurs subsequently to the exposure and prior to the outcome. Natural direct and indirect effects are of particular interest as they generally combine to produce the total effect of the exposure and therefore provide insight on the mechanism by which it operates to produce the outcome. A semiparametric theory has recently been proposed to make inferences about marginal mean natural direct and indirect effects in observational studies (Tchetgen Tchetgen and Shpitser, 2011), which delivers multiply robust locally efficient estimators of the marginal direct and indirect effects, and thus generalizes previous results for total effects to the mediation setting. In this paper we extend the new theory to handle a setting in which a parametric model for the natural direct (indirect) effect within levels of pre-exposure variables is specified and the model for the observed data likelihood is otherwise unrestricted. We show that estimation is generally not feasible in this model because of the curse of dimensionality associated with the required estimation of auxiliary conditional densities or expectations, given high-dimensional covariates. We thus consider multiply robust estimation and propose a more general model which assumes a subset but not all of several working models holds.
Resumo:
There is an emerging interest in modeling spatially correlated survival data in biomedical and epidemiological studies. In this paper, we propose a new class of semiparametric normal transformation models for right censored spatially correlated survival data. This class of models assumes that survival outcomes marginally follow a Cox proportional hazard model with unspecified baseline hazard, and their joint distribution is obtained by transforming survival outcomes to normal random variables, whose joint distribution is assumed to be multivariate normal with a spatial correlation structure. A key feature of the class of semiparametric normal transformation models is that it provides a rich class of spatial survival models where regression coefficients have population average interpretation and the spatial dependence of survival times is conveniently modeled using the transformed variables by flexible normal random fields. We study the relationship of the spatial correlation structure of the transformed normal variables and the dependence measures of the original survival times. Direct nonparametric maximum likelihood estimation in such models is practically prohibited due to the high dimensional intractable integration of the likelihood function and the infinite dimensional nuisance baseline hazard parameter. We hence develop a class of spatial semiparametric estimating equations, which conveniently estimate the population-level regression coefficients and the dependence parameters simultaneously. We study the asymptotic properties of the proposed estimators, and show that they are consistent and asymptotically normal. The proposed method is illustrated with an analysis of data from the East Boston Ashma Study and its performance is evaluated using simulations.
Resumo:
Visualization and exploratory analysis is an important part of any data analysis and is made more challenging when the data are voluminous and high-dimensional. One such example is environmental monitoring data, which are often collected over time and at multiple locations, resulting in a geographically indexed multivariate time series. Financial data, although not necessarily containing a geographic component, present another source of high-volume multivariate time series data. We present the mvtsplot function which provides a method for visualizing multivariate time series data. We outline the basic design concepts and provide some examples of its usage by applying it to a database of ambient air pollution measurements in the United States and to a hypothetical portfolio of stocks.
Resumo:
Quantifying the health effects associated with simultaneous exposure to many air pollutants is now a research priority of the US EPA. Bayesian hierarchical models (BHM) have been extensively used in multisite time series studies of air pollution and health to estimate health effects of a single pollutant adjusted for potential confounding of other pollutants and other time-varying factors. However, when the scientific goal is to estimate the impacts of many pollutants jointly, a straightforward application of BHM is challenged by the need to specify a random-effect distribution on a high-dimensional vector of nuisance parameters, which often do not have an easy interpretation. In this paper we introduce a new BHM formulation, which we call "reduced BHM", aimed at analyzing clustered data sets in the presence of a large number of random effects that are not of primary scientific interest. At the first stage of the reduced BHM, we calculate the integrated likelihood of the parameter of interest (e.g. excess number of deaths attributed to simultaneous exposure to high levels of many pollutants). At the second stage, we specify a flexible random-effect distribution directly on the parameter of interest. The reduced BHM overcomes many of the challenges in the specification and implementation of full BHM in the context of a large number of nuisance parameters. In simulation studies we show that the reduced BHM performs comparably to the full BHM in many scenarios, and even performs better in some cases. Methods are applied to estimate location-specific and overall relative risks of cardiovascular hospital admissions associated with simultaneous exposure to elevated levels of particulate matter and ozone in 51 US counties during the period 1999-2005.