7 results for Large Data Sets
in DigitalCommons@The Texas Medical Center
Abstract:
Logistic regression is one of the most important tools in the analysis of epidemiological and clinical data. Such data often contain missing values for one or more variables. Common practice is to eliminate all individuals for whom any information is missing. This deletion approach does not make efficient use of available information and often introduces bias. Two methods were developed to estimate logistic regression coefficients for mixed dichotomous and continuous covariates, including partially observed binary covariates. The data were assumed to be missing at random (MAR). The first method (PD) used the predictive distribution as a weight to average the logistic regressions fitted over all possible values of the missing observations, and the second method (RS) used a variant of a resampling technique. Seven additional methods were compared with these two approaches in a simulation study: (1) analysis based on only the complete cases; (2) substituting the mean of the observed values for the missing value; (3) an imputation technique based on the proportions of observed data; (4) regressing the partially observed covariates on the remaining continuous covariates; (5) regressing the partially observed covariates on the remaining continuous covariates conditional on the response variable; (6) regressing the partially observed covariates on the remaining continuous covariates and the response variable; and (7) the EM algorithm. Both proposed methods showed smaller standard errors (s.e.) for the coefficient involving the partially observed covariate and for the other coefficients as well. However, both methods, especially PD, are computationally demanding; thus, for analysis of large data sets with partially observed covariates, further refinement of these approaches is needed.
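As a rough illustration of the two simplest comparison methods listed above, (1) complete-case analysis and (2) mean substitution, the following sketch applies both to a toy data set. The records, column layout, and function names are hypothetical and are not taken from the thesis; the sketch does not implement the proposed PD or RS methods.

```python
from statistics import mean

# Hypothetical toy records: (age, exposed). None marks a missing binary covariate.
records = [(54, 1), (61, None), (47, 0), (70, 1), (58, None), (66, 0)]


def complete_cases(rows):
    """Method (1): keep only rows in which every value is observed."""
    return [row for row in rows if all(value is not None for value in row)]


def mean_substitute(rows, column):
    """Method (2): fill missing entries of one column with the observed mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        tuple(fill if (i == column and value is None) else value
              for i, value in enumerate(row))
        for row in rows
    ]


kept = complete_cases(records)                # discards two of the six individuals
imputed = mean_substitute(records, column=1)  # keeps all six, fills with 0.5
```

Deletion shrinks the sample, which is the inefficiency the abstract describes, while mean substitution keeps every individual at the cost of inserting a value (0.5) that a binary covariate can never actually take.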
Abstract:
Information overload is a significant problem for modern medicine. Searching MEDLINE for common topics often retrieves more relevant documents than users can review. Therefore, we must identify documents that are not only relevant, but also important. Our system ranks articles using citation counts and the PageRank algorithm, incorporating data from the Science Citation Index. However, citation data is usually incomplete. Therefore, we explore the relationship between the quantity of citation information available to the system and the quality of the result ranking. Specifically, we test the ability of citation count and PageRank to identify "important articles" as defined by experts from large result sets with decreasing citation information. We found that PageRank performs better than simple citation counts, but both algorithms are surprisingly robust to information loss. We conclude that even an incomplete citation database is likely to be effective for importance ranking.
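The two ranking signals compared above can be sketched on a toy citation graph. This is a minimal illustration under stated assumptions, not the authors' system: the graph, the damping factor of 0.85, the fixed iteration count, and the function names are all hypothetical, and every cited article is assumed to appear as a key in the dictionary.

```python
def citation_counts(citations):
    """Count incoming citations for each article."""
    counts = {article: 0 for article in citations}
    for cited_list in citations.values():
        for cited in cited_list:
            counts[cited] += 1
    return counts


def pagerank(citations, damping=0.85, iterations=100):
    """Power-iteration PageRank; articles citing nothing spread rank evenly."""
    n = len(citations)
    rank = {article: 1.0 / n for article in citations}
    for _ in range(iterations):
        # Rank held by articles with no outgoing citations is redistributed.
        dangling = sum(rank[a] for a, out in citations.items() if not out)
        new_rank = {a: (1 - damping) / n + damping * dangling / n
                    for a in citations}
        for article, out in citations.items():
            for cited in out:
                new_rank[cited] += damping * rank[article] / len(out)
        rank = new_rank
    return rank


# Hypothetical graph: each key cites the articles in its list.
graph = {"A": [], "B": ["A"], "C": ["A"], "D": ["A", "B"]}
counts = citation_counts(graph)
ranks = pagerank(graph)
top_by_rank = max(ranks, key=ranks.get)
```

Deleting entries from the lists in `graph` and re-running both functions gives a simple way to mimic the "decreasing citation information" setting described in the abstract.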
Abstract:
Random Forests™ is reported to be one of the most accurate classification algorithms in complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed. A synthetic positive control was also used to validate the Random Forests method. Throughout this study, we showed that Random Forests can deal with a large number of weak input variables without overfitting. It can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities. Random Forests can deal with large-scale data sets without rigorous data preprocessing. It has a robust variable importance ranking measure. We propose a novel variable selection method in the context of Random Forests that uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors for complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments. When the data set had a high variables-to-observations ratio, Random Forests complemented the established logistic regression. This study suggests that Random Forests is well suited to such high-dimensional data. One can use Random Forests to select the important variables and then use logistic regression or Random Forests itself to estimate the effect size of the predictors and to classify new observations. We also found that the mean decrease in accuracy is a more reliable variable ranking measure than the mean decrease in Gini.
Abstract:
The motion of lung tumors during respiration makes the accurate delivery of radiation therapy to the thorax difficult because it increases the uncertainty of target position. The adoption of four-dimensional computed tomography (4D-CT) has allowed us to determine how a tumor moves with respiration for each individual patient. Using information acquired during a 4D-CT scan, we can define the target, visualize motion, and calculate dose during the planning phase of the radiotherapy process. One image data set that can be created from the 4D-CT acquisition is the maximum-intensity projection (MIP). The MIP can be used as a starting point to define the volume that encompasses the motion envelope of the moving gross target volume (GTV). Because of the close relationship that exists between the MIP and the final target volume, we investigated four MIP data sets created with different methodologies (three using various 4D-CT sorting implementations, and one using all available cine CT images) to compare target delineation. It has been observed that changing the 4D-CT sorting method will lead to the selection of a different collection of images; however, the clinical implications of changing the constituent images on the resultant MIP data set are not clear. There has not been a comprehensive study that compares target delineation based on different 4D-CT sorting methodologies in a patient population. We selected a collection of patients who had previously undergone thoracic 4D-CT scans at our institution, and who had lung tumors that moved at least 1 cm. We then generated the four MIP data sets and automatically contoured the target volumes. In doing so, we identified cases in which the MIP generated from a 4D-CT sorting process under-represented the motion envelope of the target volume by more than 10% relative to the MIP generated from all of the cine CT images. The 4D-CT methods suffered from duplicate image selection and might not choose maximum extent images.
Based on our results, we suggest utilization of a MIP generated from the full cine CT data set to ensure a representative inclusive tumor extent, and to avoid geometric miss.
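The effect described above can be sketched with a toy example in which the MIP is the voxel-wise maximum over a set of images. The one-row "images", the threshold, and the function names below are hypothetical and chosen only for illustration; real cine CT data and the study's contouring method are far more involved.

```python
def maximum_intensity_projection(images):
    """Voxel-wise maximum over a list of equally shaped 2D images."""
    rows, cols = len(images[0]), len(images[0][0])
    return [[max(img[r][c] for img in images) for c in range(cols)]
            for r in range(rows)]


def target_extent(image, threshold=1):
    """Number of voxels at or above the threshold (a crude motion envelope)."""
    return sum(1 for row in image for value in row if value >= threshold)


# Hypothetical one-row images of a two-voxel "tumor" moving to the right.
cine_images = [
    [[0, 1, 1, 0, 0, 0]],
    [[0, 0, 1, 1, 0, 0]],
    [[0, 0, 0, 1, 1, 0]],
    [[0, 0, 0, 0, 1, 1]],
]

# A sorted subset that happens to miss the image at maximum excursion.
sorted_subset = cine_images[:3]

full_mip = maximum_intensity_projection(cine_images)
subset_mip = maximum_intensity_projection(sorted_subset)
shortfall = 1 - target_extent(subset_mip) / target_extent(full_mip)
```

Because the subset omits the image at the extreme position, its MIP covers 4 voxels instead of 5, a 20% shortfall in this toy case.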
Abstract:
Linkage and association studies are major analytical tools to search for susceptibility genes for complex diseases. With the availability of large collections of single nucleotide polymorphisms (SNPs) and the rapid progress in high-throughput genotyping technologies, together with the ambitious goals of the International HapMap Project, genetic markers covering the whole genome will be available for genome-wide linkage and association studies. In order not to inflate the type I error rate in genome-wide linkage and association studies, the significance level for each independent linkage and/or association test must be adjusted for multiple testing, and this has led to the suggestion of a genome-wide significance cut-off as low as 5 × 10⁻⁷. Almost no linkage and/or association study can meet such a stringent threshold using standard statistical methods. New statistics with high power are urgently needed to tackle this problem. This dissertation proposes and explores a class of novel test statistics that can be used with both population-based and family-based genetic data by employing a completely new strategy, which uses a nonlinear transformation of the sample means to construct test statistics for linkage and association studies. Extensive simulation studies are used to illustrate the properties of the nonlinear test statistics. Power calculations are performed using both analytical and empirical methods. Finally, real data sets are analyzed with the nonlinear test statistics. Results show that the nonlinear test statistics have correct type I error rates, and most of the studied nonlinear test statistics have higher power than the standard chi-square test. This dissertation introduces a new idea for designing novel test statistics with high power and might open new ways of mapping susceptibility genes for complex diseases.
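To see why such a cut-off is hard to meet, the sketch below compares a standard chi-square test on a 2 × 2 table with a per-test threshold. It assumes a Bonferroni-style adjustment and 100,000 independent tests, neither of which is stated in the abstract; the allele counts and function names are likewise hypothetical. It illustrates only the standard chi-square baseline, not the proposed nonlinear statistics.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni adjustment (assumed here)."""
    return alpha / n_tests


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Assumed family-wise level 0.05 spread over 100,000 independent tests.
threshold = bonferroni_threshold(0.05, 100_000)

# Hypothetical allele counts: cases (60 vs 40) and controls (40 vs 60).
statistic = chi_square_2x2(60, 40, 40, 60)
p_value = p_value_1df(statistic)
meets_genome_wide = p_value < threshold
```

In this toy table the association would pass a conventional 0.05 level comfortably, yet it falls well short of the genome-wide cut-off, which is the power problem the dissertation sets out to address.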
Abstract:
The development of the Alcohol Treatment Profile System (ATPS) was described and an evaluation of its perceived value by various States was undertaken. The ATPS is a treatment needs assessment tool based on the unification of several large national epidemiologic and treatment data sets. It was developed by the National Institute on Alcohol Abuse and Alcoholism (NIAAA), and responsibility for its creation was given to the NIAAA's Alcohol Epidemiologic Data System (AEDS). The ATPS merges county-level measures of alcohol problem prevalence (the specially constructed AEDS Alcohol Problem Indicators), indicating "need" for treatment, and treatment utilization measures (the National Drug and Alcohol Treatment Utilization Survey), indicating treatment "demand." The capabilities of the ATPS in the unique planning and policy-making settings of several States were evaluated.