835 results for Ranked Regression
Abstract:
Strategies are compared for the development of a linear regression model with stochastic (multivariate normal) regressor variables and the subsequent assessment of its predictive ability. Bias and mean squared error of four estimators of predictive performance are evaluated in samples simulated from 32 population correlation matrices. Models including all of the available predictors are compared with those obtained using selected subsets. The subset selection procedures investigated include two stopping rules, $C_p$ and $S_p$, each combined with an 'all possible subsets' or 'forward selection' of variables. The estimators of performance utilized include parametric (MSEP$_m$) and non-parametric (PRESS) assessments in the entire sample, and two data splitting estimates restricted to a random or balanced (Snee's DUPLEX) 'validation' half sample. The simulations were performed as a designed experiment, with population correlation matrices representing a broad range of data structures. The techniques examined for subset selection do not generally result in improved predictions relative to the full model. Approaches using 'forward selection' result in slightly smaller prediction errors and less biased estimators of predictive accuracy than 'all possible subsets' approaches, but no differences are detected between the performances of $C_p$ and $S_p$. In every case, prediction errors of models obtained by subset selection in either of the half splits exceed those obtained using all predictors and the entire sample. Only the random split estimator is conditionally (on $\beta$) unbiased; however, MSEP$_m$ is unbiased on average and PRESS is nearly so in unselected (fixed form) models. When subset selection techniques are used, MSEP$_m$ and PRESS always underestimate prediction errors, by as much as 27 percent (on average) in small samples.
Despite their bias, the mean squared errors (MSE) of these estimators are at least 30 percent less than that of the unbiased random split estimator. The DUPLEX split estimator suffers from large MSE as well as bias, and seems of little value within the context of stochastic regressor variables. To maximize predictive accuracy while retaining a reliable estimate of that accuracy, it is recommended that the entire sample be used for model development, and a leave-one-out statistic (e.g., PRESS) be used for assessment.
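The leave-one-out statistic recommended above can be sketched as follows. This is a minimal illustration, not the dissertation's code: it assumes an ordinary least squares fit with an intercept, and the function name `press_statistic` is hypothetical. It uses the standard identity that the leave-one-out residual equals the ordinary residual divided by one minus the leverage, so the model is fitted only once.

```python
import numpy as np

def press_statistic(X, y):
    """PRESS (sum of squared leave-one-out prediction errors) for OLS.

    X: (n, p) array of predictors without an intercept column.
    y: (n,) response vector.
    """
    X1 = np.column_stack([np.ones(len(y)), X])       # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)    # full-sample OLS fit
    resid = y - X1 @ beta
    # Leverages h_i: diagonal of the hat matrix X (X'X)^{-1} X'
    xtx_inv = np.linalg.pinv(X1.T @ X1)
    h = np.einsum("ij,jk,ik->i", X1, xtx_inv, X1)
    # Leave-one-out residual is e_i / (1 - h_i), so no refitting is needed
    loo_resid = resid / (1.0 - h)
    return float(np.sum(loo_resid ** 2))
```

Dividing the result by the sample size gives a per-observation estimate of squared prediction error that can be compared across candidate models.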
Abstract:
This dissertation develops and explores the methodology for the use of cubic spline functions in assessing time-by-covariate interactions in Cox proportional hazards regression models. These interactions indicate violations of the proportional hazards assumption of the Cox model. Use of cubic spline functions allows for the investigation of the shape of a possible covariate time-dependence without having to specify a particular functional form. Cubic spline functions yield both a graphical method and a formal test for the proportional hazards assumption as well as a test of the nonlinearity of the time-by-covariate interaction. Five existing methods for assessing violations of the proportional hazards assumption are reviewed and applied along with cubic splines to three well-known two-sample datasets. An additional dataset with three covariates is used to explore the use of cubic spline functions in a more general setting.
Abstract:
A Bayesian approach to estimation of the regression coefficients of a multinomial logit model with ordinal scale response categories is presented. A Monte Carlo method is used to construct the posterior distribution of the link function. The link function is treated as an arbitrary scalar function. Then the Gauss-Markov theorem is used to determine a function of the link which produces a random vector of coefficients. The posterior distribution of the random vector of coefficients is used to estimate the regression coefficients. The method described is referred to as a Bayesian generalized least squares (BGLS) analysis. Two cases involving multinomial logit models are described. Case I involves a cumulative logit model and Case II involves a proportional-odds model. All inferences about the coefficients for both cases are described in terms of the posterior distribution of the regression coefficients. The results from the BGLS method are compared to maximum likelihood estimates of the regression coefficients. The BGLS method avoids the nonlinear problems encountered when estimating the regression coefficients of a generalized linear model. The method is not complex or computationally intensive. The BGLS method offers several advantages over other Bayesian approaches.
Abstract:
Logistic regression is one of the most important tools in the analysis of epidemiological and clinical data. Such data often contain missing values for one or more variables. Common practice is to eliminate all individuals for whom any information is missing. This deletion approach does not make efficient use of available information and often introduces bias. Two methods were developed to estimate logistic regression coefficients for mixed dichotomous and continuous covariates including partially observed binary covariates. The data were assumed missing at random (MAR). One method (PD) used the predictive distribution as a weight to average the logistic regressions fitted over all possible values of the missing observations, and the second method (RS) used a variant of a resampling technique. Seven additional methods were compared with these two approaches in a simulation study: (1) analysis based on only the complete cases, (2) substituting the mean of the observed values for the missing value, (3) an imputation technique based on the proportions of observed data, (4) regressing the partially observed covariates on the remaining continuous covariates, (5) regressing the partially observed covariates on the remaining continuous covariates conditional on the response variable, (6) regressing the partially observed covariates on the remaining continuous covariates and the response variable, and (7) the EM algorithm. Both proposed methods showed smaller standard errors (s.e.) for the coefficient involving the partially observed covariate and for the other coefficients as well. However, both methods, especially PD, are computationally demanding; thus, for analysis of large data sets with partially observed covariates, further refinement of these approaches is needed.
Abstract:
A large number of ridge regression estimators have been proposed and used with little knowledge of their true distributions. Because of this lack of knowledge, these estimators cannot be used to test hypotheses or to form confidence intervals. This paper presents a basic technique for deriving the exact distribution functions for a class of generalized ridge estimators. The technique is applied to five prominent generalized ridge estimators. Graphs of the resulting distribution functions are presented. The actual behavior of these estimators is found to be considerably different from the behavior which is generally assumed for ridge estimators. This paper also uses the derived distributions to examine the mean squared error properties of the estimators. A technique for developing confidence intervals based on the generalized ridge estimators is also presented.
Abstract:
Few studies, if any, have attempted to identify the specific environmental factors associated with the incidence of diarrheal disease and to rank these by their contribution to the total incidence of diarrheal illness. Potentially, the factors with the greatest contribution are the variables on which intervention could be expected to have the greatest impact on the incidence of diarrhea. In 317 rural Egyptian households participating in a longitudinal study of diarrheal disease, selected environmental characteristics were observed and recorded on a questionnaire. Characteristics of the environment were classified into seven categories: water usage, proximity of animals to the house, waste management, food preparation area, toilet area, the household structure, and hygiene. The variables from each of the seven major groupings most associated with the incidence of diarrhea in infants were selected through the application of stepwise multiple regression. Each area was then ranked by the portion of the incidence of diarrhea in infants that each composite group of area-specific variables alone would explain. The groups of household structure and water usage variables were found to be more associated with the incidence of diarrhea in infants than variables describing the toilet area, proximity to animals, or the other categories. It was also found that 24.7% of the total variance in incidence of diarrheal illness was explained by environmental variables.
Abstract:
The history of the logistic function since its introduction in 1838 is reviewed, and the logistic model for a polychotomous response variable is presented with a discussion of the assumptions involved in its derivation and use. Following this, the maximum likelihood estimators for the model parameters are derived along with a Newton-Raphson iterative procedure for evaluation. A rigorous mathematical derivation of the limiting distribution of the maximum likelihood estimators is then presented using a characteristic function approach. An appendix with theorems on the asymptotic normality of sample sums when the observations are not identically distributed, with proofs, supports the presentation on asymptotic properties of the maximum likelihood estimators. Finally, two applications of the model are presented using data from the Hypertension Detection and Follow-up Program, a prospective, population-based, randomized trial of treatment for hypertension. The first application compares the risk of five-year mortality from cardiovascular causes with that from noncardiovascular causes; the second application compares risk factors for fatal or nonfatal coronary heart disease with those for fatal or nonfatal stroke.
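A Newton-Raphson iteration for logistic maximum likelihood can be sketched as below. As a simplifying assumption, the sketch covers only the binary (two-category) case rather than the polychotomous model of the abstract, and the function name and tolerance values are illustrative choices, not taken from the source.

```python
import numpy as np

def logistic_newton_raphson(X, y, tol=1e-10, max_iter=50):
    """Maximum likelihood estimates for binary logistic regression.

    X: (n, p) predictors without an intercept column; y: (n,) array of 0/1.
    Returns the coefficient vector with the intercept first.
    """
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X1 @ beta)))        # fitted probabilities
        score = X1.T @ (y - p)                         # gradient of log-likelihood
        info = X1.T @ (X1 * (p * (1.0 - p))[:, None])  # Fisher information matrix
        step = np.linalg.solve(info, score)            # Newton-Raphson update
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

The inverse of the information matrix at convergence gives the usual large-sample covariance estimate for the coefficients. The iteration can fail to converge when the data are perfectly separated.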
Abstract:
Traditional comparison of standardized mortality ratios (SMRs) can be misleading if the age-specific mortality ratios are not homogeneous. For this reason, a regression model has been developed which incorporates the mortality ratio as a function of age. This model is then applied to mortality data from an occupational cohort study. The nature of the occupational data necessitates the investigation of mortality ratios which increase with age. These occupational data are used primarily to illustrate and develop the statistical methodology. The age-specific mortality ratio (MR) for the covariates of interest can be written as $\mathrm{MR}_{ij\ldots m} = \mu_{ij\ldots m}/\theta_{ij\ldots m} = r \cdot \exp(Z'_{ij\ldots m}\beta)$, where $\mu_{ij\ldots m}$ and $\theta_{ij\ldots m}$ denote the force of mortality in the study and chosen standard populations in the $ij\ldots m$th stratum, respectively, $r$ is the intercept, $Z_{ij\ldots m}$ is the vector of covariables associated with the $i$th age interval, and $\beta$ is a vector of regression coefficients associated with these covariables. A Newton-Raphson iterative procedure has been used for determining the maximum likelihood estimates of the regression coefficients. This model provides a statistical method for a logical and easily interpretable explanation of an occupational cohort mortality experience. Since it gives a reasonable fit to the mortality data, it can also be concluded that the model is fairly realistic. The traditional statistical method for the analysis of occupational cohort mortality data is to present a summary index such as the SMR under the assumption of constant (homogeneous) age-specific mortality ratios. Since the mortality ratios for occupational groups usually increase with age, the homogeneity assumption of the age-specific mortality ratios is often untenable.
The traditional method of comparing SMRs under the homogeneity assumption is a special case of this model, without age as a covariate. This model also provides a statistical technique to evaluate the relative risk between two SMRs or a dose-response relationship among several SMRs. The model presented has application in the medical, demographic and epidemiologic areas. The methods developed in this thesis are suitable for future analyses of mortality or morbidity data when the age-specific mortality/morbidity experience is a function of age or when an interaction effect between confounding variables needs to be evaluated.
Abstract:
One of the difficulties in the practical application of ridge regression is that, for a given data set, it is unknown whether a selected ridge estimator has a smaller squared error than the least squares estimator. The concept of the improvement region is defined, and a technique is developed which obtains approximate confidence intervals for the value of ridge k which produces the maximum reduction in mean squared error. Two simulation experiments were conducted to investigate how accurate these approximate confidence intervals might be.
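The comparison described above is only possible when the true coefficients are known, as in a simulation. The sketch below illustrates that setting with the ordinary ridge estimator $(X'X + kI)^{-1}X'y$; the function names are hypothetical, and the code does not implement the abstract's improvement region or its confidence intervals.

```python
import numpy as np

def ridge_estimate(X, y, k):
    """Ordinary ridge estimator (X'X + k I)^{-1} X'y; k = 0 gives least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

def squared_error_by_k(X, y, beta_true, k_values):
    """Squared estimation error ||b(k) - beta_true||^2 for each ridge constant k.

    Requires the true coefficient vector, so it applies to simulated data only.
    """
    return [float(np.sum((ridge_estimate(X, y, k) - beta_true) ** 2))
            for k in k_values]
```

Values of k whose squared error falls below the k = 0 entry are the ones for which ridge improves on least squares in that particular simulated sample.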
Abstract:
Atherosclerosis is widely accepted as a complex genetic phenotype and is the usual cause of cardiovascular disease, the world’s leading killer. Genetic factors have been proven to be important risk contributors for atherosclerosis and much work has been done to identify promising candidates that might play a role in the development of atherosclerosis. It is well known that many independent replications are needed to unequivocally establish a valid genotype-phenotype association across different populations before the findings are extended to clinical settings and to the expensive follow-up studies designed to identify causal genetic variants. Aiming to replicate the association with atherosclerosis in the Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study, we assessed the relationship of 32 atherosclerosis candidate SNPs to atherosclerosis in the PDAY cohort, consisting of AA and EA young people aged 15-34 years who died of non-medical causes. Two association studies, a whole sample study and a 1:1 matched case-control study, were performed using multiple linear regression and logistic regression analyses, respectively. For the whole sample association study, 32 SNPs among 2,650 individuals (1,369 AA and 1,281 EA) were tested for association with six early atherosclerosis phenotypes: abdominal aorta fatty streaks, abdominal aorta raised lesions, right coronary artery fatty streaks, right coronary artery raised lesions, thoracic aorta fatty streaks, and thoracic aorta raised lesions. For the matched case-control association study, 337 case-control paired samples were included; cases were chosen with the highest total raised lesion scores from the studied population, while controls were randomly selected from individuals who had no raised lesions and matched to cases by age, gender and race. Sixteen SNPs in 13 genes were found to be significantly associated with atherosclerosis in at least one of the PDAY association studies.
Among these 16 findings, eight SNPs (rs9579646, rs6053733, rs3849150, rs10499903, rs2148079, rs5073691, rs10116277, and rs17228212) successfully replicated previous results; six SNPs (rs17222814, rs10811661, rs7028570, rs7291467, rs16996148 and rs10401969) were reported as new findings exclusive to our study; and the last two of the 16 SNPs, rs501120 and rs6922269, showed either intriguing or conflicting results. SNP rs17222814 in ALOX5AP and SNP rs3849150 in LRRC18 were consistently associated with atherosclerosis in both prior studies and the two PDAY association studies. SNP rs3849150 was also found to be highly correlated with a non-synonymous coding SNP, rs17772611, which may damage the protein (PolyPhen score = 0.996), suggesting that SNP rs17772611 may be the causal functional variant. In conclusion, our study added more support for the association of these candidate genes with atherosclerosis. SNPs rs3849150 and rs17772611 of LRRC18, as well as SNP rs17222814 of ALOX5AP, were the most significant findings from our study, and may be ranked among the best for further study.
Abstract:
The tobacco-specific nitrosamine 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK) is a well-established lung carcinogen. Since the cytokinesis-blocked micronucleus (CBMN) assay has been found to be extremely sensitive to NNK-induced genetic damage, it is a potentially important predictor of lung cancer risk. However, the association between lung cancer and NNK-induced genetic damage measured by the CBMN assay has not been rigorously examined. This research develops a methodology to model the chromosomal changes under NNK-induced genetic damage in a logistic regression framework in order to predict the occurrence of lung cancer. Since these chromosomal changes were usually not observed for very long because of laboratory cost and time constraints, a resampling technique was applied to generate the Markov chain of the normal and the damaged cell for each individual. A joint likelihood between the resampled Markov chains and the logistic regression model, including transition probabilities of this chain as covariates, was established. Maximum likelihood estimation was applied to carry out the statistical tests for comparison. The ability of this approach to increase discriminating power to predict lung cancer was compared to a baseline "non-genetic" model. Our method offered an option to understand the association between the dynamic cell information and lung cancer. Our study indicated that the extent of DNA damage/non-damage measured using the CBMN assay provides critical information that impacts public health studies of lung cancer risk. This novel statistical method could simultaneously estimate the process of DNA damage/non-damage and its relationship with lung cancer for each individual.
Abstract:
Hepatitis B virus (HBV) is a significant cause of liver diseases and related complications worldwide. Both injecting and non-injecting drug users are at increased risk of contracting HBV infection. Scientific evidence suggests that drug users have a subnormal response to HBV vaccination and that their seroprotection rates are lower than those in the general population, potentially due to vaccine factors, host factors, or both. The purpose of this systematic review is to examine the rates of seroprotection following HBV vaccination in drug-using populations and to conduct a meta-analysis to identify the factors associated with varying seroprotection rates. Seroprotection is defined as developing an anti-HBs antibody level of ≥ 10 mIU/ml after receiving the HBV vaccine. Original research articles were searched using online databases and reference lists of shortlisted articles. HBV vaccine intervention studies reporting seroprotection rates in drug users and published in English during or after 1989 were eligible. Out of 235 citations reviewed, 11 studies were included in this review. The reported seroprotection rates ranged from 54.5% to 97.1%. Combination vaccine (HAV and HBV) (risk ratio 12.91, 95% CI 2.98-55.86, p = 0.003), measurement of anti-HBs with microparticle immunoassay (risk ratio 3.46, 95% CI 1.11-10.81, p = 0.035) and anti-HBs antibody measurement at 2 months after the last HBV vaccine dose (risk ratio 4.11, 95% CI 1.55-10.89, p = 0.009) were significantly associated with higher seroprotection rates. Although statistically nonsignificant, the variables mean age > 30 years, higher prevalence of anti-HBc antibody and anti-HIV antibody in the sample population, and current drug use (not in drug rehabilitation treatment) were strongly associated with decreased seroprotection rates. Proportion of injecting drug users, vaccine dose and accelerated vaccine schedule were not predictors of heterogeneity across studies.
Studies examined in this review were significantly heterogeneous (Q = 180.850, p < 0.001), and the factors identified should be considered when comparing immune response across studies. The combination vaccine showed promising results; however, its effectiveness compared to the standard HBV vaccine needs to be examined systematically. Immune response in drug users can possibly be improved by the use of bivalent vaccines, booster doses, and improving vaccine completion rates through integrated public programs and incentives.
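The heterogeneity statistic Q reported above is conventionally Cochran's Q, which can be sketched as follows. This is an illustration under that assumption, not the review's analysis code: the function name is hypothetical, the inputs are per-study effect estimates with their variances on whatever scale the meta-analysis uses, and the I² value is a commonly reported companion that the abstract itself does not mention.

```python
def cochran_q(effects, variances):
    """Cochran's Q for heterogeneity, its degrees of freedom, and I^2.

    effects: per-study effect estimates (e.g., log risk ratios).
    variances: the corresponding within-study variances.
    """
    weights = [1.0 / v for v in variances]               # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation attributed to between-study heterogeneity
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i_squared
```

Under the null hypothesis of homogeneity, Q is referred to a chi-squared distribution with the returned degrees of freedom.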
Abstract:
Bisphosphonates represent a unique class of drugs that effectively treat and prevent a variety of bone-related disorders including metastatic bone disease and osteoporosis. High tolerance and high efficacy rates quickly ranked bisphosphonates as the standard of care for bone-related diseases. However, in the early 2000s, case reports began to surface that linked bisphosphonates with osteonecrosis of the jaw (ONJ). Since then, studies have corroborated the link. However, as with most disease states, many factors can contribute to the onset of disease. The aim of this study was to determine which comorbid factors presented an increased risk for developing ONJ in cancer patients. Using a case-control study design, investigators used a combination of ICD-9 codes and chart review to identify confirmed cases of ONJ at The University of Texas M. D. Anderson Cancer Center (MDACC). Each case was then matched to five controls based on age, gender, race/ethnicity, and primary cancer diagnosis. Data querying and chart review provided information on variables of interest. These variables included bisphosphonate exposure, glucocorticoid exposure, smoking history, obesity, and diabetes. Statistical analysis was conducted using PASW (Predictive Analytics Software) Statistics, Version 18 (SPSS Inc., Chicago, Illinois). One hundred twelve (112) cases were identified as confirmed cases of ONJ. Variables were screened using univariate logistic regression to determine significance (p < .05); significant variables were included in the final conditional logistic regression model. Concurrent use of bisphosphonates and glucocorticoids (OR, 18.60; CI, 8.85 to 39.12; p < .001), current smoking (OR, 2.52; CI, 1.21 to 5.25; p = .014), and presence of diabetes (OR, 1.84; CI, 1.06 to 3.20; p = .030) were found to increase the risk for developing ONJ.
Obesity was not significantly associated with ONJ development. In this study, cancer patients who received bisphosphonates as part of their therapeutic regimen were found to have an 18-fold increase in their risk of developing ONJ. Other factors included smoking and diabetes. More studies examining the concurrent use of glucocorticoids and bisphosphonates may be able to strengthen any correlations.
Abstract:
The infant mortality rate (IMR) is considered to be one of the most important indices of a country's well-being. Countries around the world and health organizations such as the World Health Organization are dedicating their resources, knowledge and energy to reducing infant mortality rates. The well-known Millennium Development Goal 4 (MDG 4), whose aim is to achieve a two-thirds reduction of the under-five mortality rate between 1990 and 2015, is an example of this commitment. In this study our goal is to model the trends of IMR from the 1950s to the 2010s for selected countries. We would like to know how the IMR is changing over time and how it differs across countries. IMR data collected over time form a time series. The repeated observations of an IMR time series are not statistically independent, so in modeling the trend of IMR it is necessary to account for these correlations. We propose to use the generalized least squares method in a general linear model setting to deal with the variance-covariance structure in our model. To estimate the variance-covariance matrix, we draw on time-series models, especially the autoregressive and moving average models. Furthermore, we compare the results from the general linear model with a correlation structure to those from the ordinary least squares method, which ignores the correlation structure, to check how much the estimates change.
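The comparison between generalized and ordinary least squares described above can be sketched as follows. This is a minimal illustration under stated assumptions: the errors follow a first-order autoregressive, AR(1), correlation structure with a known parameter `rho`, whereas the study would estimate the structure from the data; all function names are hypothetical.

```python
import numpy as np

def ar1_correlation(n, rho):
    """AR(1) correlation matrix with entries rho^|i - j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def gls_fit(X, y, V):
    """Generalized least squares: (X' V^{-1} X)^{-1} X' V^{-1} y."""
    V_inv = np.linalg.inv(V)
    XtVi = X.T @ V_inv
    return np.linalg.solve(XtVi @ X, XtVi @ y)

def ols_fit(X, y):
    """Ordinary least squares, which ignores correlation between observations."""
    return np.linalg.lstsq(X, y, rcond=None)[0]
```

For a linear trend, `X` would hold a column of ones and a column of time points. Comparing `gls_fit(X, y, ar1_correlation(len(y), rho))` with `ols_fit(X, y)` shows how far the trend estimates move once serial correlation is taken into account. Only the correlation pattern matters for the coefficient estimates, since a common variance factor cancels out.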