13 resultados para Lanczos, Linear systems, Generalized cross validation

em Collection Of Biostatistics Research Archive


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Estimation of the number of mixture components (k) is an unsolved problem. Available methods for estimation of k include bootstrapping the likelihood ratio test statistics and optimizing a variety of validity functionals such as AIC, BIC/MDL, and ICOMP. We investigate the minimization of distance between fitted mixture model and the true density as a method for estimating k. The distances considered are Kullback-Leibler (KL) and “L sub 2”. We estimate these distances using cross validation. A reliable estimate of k is obtained by voting of B estimates of k corresponding to B cross validation estimates of distance. This estimation methods with KL distance is very similar to Monte Carlo cross validated likelihood methods discussed by Smyth (2000). With focus on univariate normal mixtures, we present simulation studies that compare the cross validated distance method with AIC, BIC/MDL, and ICOMP. We also apply the cross validation estimate of distance approach along with AIC, BIC/MDL and ICOMP approach, to data from an osteoporosis drug trial in order to find groups that differentially respond to treatment.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Smoothing splines are a popular approach for non-parametric regression problems. We use periodic smoothing splines to fit a periodic signal plus noise model to data for which we assume there are underlying circadian patterns. In the smoothing spline methodology, choosing an appropriate smoothness parameter is an important step in practice. In this paper, we draw a connection between smoothing splines and REACT estimators that provides motivation for the creation of criteria for choosing the smoothness parameter. The new criteria are compared to three existing methods, namely cross-validation, generalized cross-validation, and generalization of maximum likelihood criteria, by a Monte Carlo simulation and by an application to the study of circadian patterns. For most of the situations presented in the simulations, including the practical example, the new criteria out-perform the three existing criteria.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The advances in computational biology have made simultaneous monitoring of thousands of features possible. The high throughput technologies not only bring about a much richer information context in which to study various aspects of gene functions but they also present challenge of analyzing data with large number of covariates and few samples. As an integral part of machine learning, classification of samples into two or more categories is almost always of interest to scientists. In this paper, we address the question of classification in this setting by extending partial least squares (PLS), a popular dimension reduction tool in chemometrics, in the context of generalized linear regression based on a previous approach, Iteratively ReWeighted Partial Least Squares, i.e. IRWPLS (Marx, 1996). We compare our results with two-stage PLS (Nguyen and Rocke, 2002A; Nguyen and Rocke, 2002B) and other classifiers. We show that by phrasing the problem in a generalized linear model setting and by applying bias correction to the likelihood to avoid (quasi)separation, we often get lower classification error rates.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The construction of a reliable, practically useful prediction rule for future response is heavily dependent on the "adequacy" of the fitted regression model. In this article, we consider the absolute prediction error, the expected value of the absolute difference between the future and predicted responses, as the model evaluation criterion. This prediction error is easier to interpret than the average squared error and is equivalent to the mis-classification error for the binary outcome. We show that the distributions of the apparent error and its cross-validation counterparts are approximately normal even under a misspecified fitted model. When the prediction rule is "unsmooth", the variance of the above normal distribution can be estimated well via a perturbation-resampling method. We also show how to approximate the distribution of the difference of the estimated prediction errors from two competing models. With two real examples, we demonstrate that the resulting interval estimates for prediction errors provide much more information about model adequacy than the point estimates alone.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Suppose that we are interested in establishing simple, but reliable rules for predicting future t-year survivors via censored regression models. In this article, we present inference procedures for evaluating such binary classification rules based on various prediction precision measures quantified by the overall misclassification rate, sensitivity and specificity, and positive and negative predictive values. Specifically, under various working models we derive consistent estimators for the above measures via substitution and cross validation estimation procedures. Furthermore, we provide large sample approximations to the distributions of these nonsmooth estimators without assuming that the working model is correctly specified. Confidence intervals, for example, for the difference of the precision measures between two competing rules can then be constructed. All the proposals are illustrated with two real examples and their finite sample properties are evaluated via a simulation study.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Marginal generalized linear models can be used for clustered and longitudinal data by fitting a model as if the data were independent and using an empirical estimator of parameter standard errors. We extend this approach to data where the number of observations correlated with a given one grows with sample size and show that parameter estimates are consistent and asymptotically Normal with a slower convergence rate than for independent data, and that an information sandwich variance estimator is consistent. We present two problems that motivated this work, the modelling of patterns of HIV genetic variation and the behavior of clustered data estimators when clusters are large.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Generalized linear mixed models with semiparametric random effects are useful in a wide variety of Bayesian applications. When the random effects arise from a mixture of Dirichlet process (MDP) model, normal base measures and Gibbs sampling procedures based on the Pólya urn scheme are often used to simulate posterior draws. These algorithms are applicable in the conjugate case when (for a normal base measure) the likelihood is normal. In the non-conjugate case, the algorithms proposed by MacEachern and Müller (1998) and Neal (2000) are often applied to generate posterior samples. Some common problems associated with simulation algorithms for non-conjugate MDP models include convergence and mixing difficulties. This paper proposes an algorithm based on the Pólya urn scheme that extends the Gibbs sampling algorithms to non-conjugate models with normal base measures and exponential family likelihoods. The algorithm proceeds by making Laplace approximations to the likelihood function, thereby reducing the procedure to that of conjugate normal MDP models. To ensure the validity of the stationary distribution in the non-conjugate case, the proposals are accepted or rejected by a Metropolis-Hastings step. In the special case where the data are normally distributed, the algorithm is identical to the Gibbs sampler.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Generalized linear mixed models (GLMMs) provide an elegant framework for the analysis of correlated data. Due to the non-closed form of the likelihood, GLMMs are often fit by computational procedures like penalized quasi-likelihood (PQL). Special cases of these models are generalized linear models (GLMs), which are often fit using algorithms like iterative weighted least squares (IWLS). High computational costs and memory space constraints often make it difficult to apply these iterative procedures to data sets with very large number of cases. This paper proposes a computationally efficient strategy based on the Gauss-Seidel algorithm that iteratively fits sub-models of the GLMM to subsetted versions of the data. Additional gains in efficiency are achieved for Poisson models, commonly used in disease mapping problems, because of their special collapsibility property which allows data reduction through summaries. Convergence of the proposed iterative procedure is guaranteed for canonical link functions. The strategy is applied to investigate the relationship between ischemic heart disease, socioeconomic status and age/gender category in New South Wales, Australia, based on outcome data consisting of approximately 33 million records. A simulation study demonstrates the algorithm's reliability in analyzing a data set with 12 million records for a (non-collapsible) logistic regression model.