9 resultados para Classification Methods
em DigitalCommons@The Texas Medical Center
Resumo:
Cervical cancer is the leading cause of death and disease from malignant neoplasms among women in developing countries. Even though the Pap smear has significantly decreased the number of deaths from cervical cancer in the past years, it has its limitations. Researchers have developed an automated screening machine which can potentially detect abnormal cases that are overlooked by conventional screening. The goal of quantitative cytology is to classify the patient's tissue sample based on quantitative measurements of the individual cells. It is also much cheaper and potentially can take less time. One of the major challenges of collecting cells with a cytobrush is the possibility of not sampling any existing dysplastic cells on the cervix. Being able to correctly classify patients who have disease without the presence of dysplastic cells could improve the accuracy of quantitative cytology algorithms. Subtle morphologic changes in normal-appearing tissues adjacent to or distant from malignant tumors have been shown to exist, but a comparison of various statistical methods, including many recent advances in the statistical learning field, has not previously been done. The objective of this thesis is to use different classification methods applied to quantitative cytology data for the detection of malignancy associated changes (MACs). In this thesis, Elastic Net is the best algorithm. When we applied the Elastic Net algorithm to the test set, we combined the training set and validation set as "training" set and used 5-fold cross validation to choose the parameter for Elastic Net. It has a sensitivity of 47% at 80% specificity, an AUC 0.52, and a partial AUC 0.10 (95% CI 0.09-0.11).^
Resumo:
In population studies, most current methods focus on identifying one outcome-related SNP at a time by testing for differences of genotype frequencies between disease and healthy groups or among different population groups. However, testing a great number of SNPs simultaneously has a problem of multiple testing and will give false-positive results. Although, this problem can be effectively dealt with through several approaches such as Bonferroni correction, permutation testing and false discovery rates, patterns of the joint effects by several genes, each with weak effect, might not be able to be determined. With the availability of high-throughput genotyping technology, searching for multiple scattered SNPs over the whole genome and modeling their joint effect on the target variable has become possible. Exhaustive search of all SNP subsets is computationally infeasible for millions of SNPs in a genome-wide study. Several effective feature selection methods combined with classification functions have been proposed to search for an optimal SNP subset among big data sets where the number of feature SNPs far exceeds the number of observations. ^ In this study, we take two steps to achieve the goal. First we selected 1000 SNPs through an effective filter method and then we performed a feature selection wrapped around a classifier to identify an optimal SNP subset for predicting disease. And also we developed a novel classification method-sequential information bottleneck method wrapped inside different search algorithms to identify an optimal subset of SNPs for classifying the outcome variable. This new method was compared with the classical linear discriminant analysis in terms of classification performance. Finally, we performed chi-square test to look at the relationship between each SNP and disease from another point of view. ^ In general, our results show that filtering features using harmononic mean of sensitivity and specificity(HMSS) through linear discriminant analysis (LDA) is better than using LDA training accuracy or mutual information in our study. Our results also demonstrate that exhaustive search of a small subset with one SNP, two SNPs or 3 SNP subset based on best 100 composite 2-SNPs can find an optimal subset and further inclusion of more SNPs through heuristic algorithm doesn't always increase the performance of SNP subsets. Although sequential forward floating selection can be applied to prevent from the nesting effect of forward selection, it does not always out-perform the latter due to overfitting from observing more complex subset states. ^ Our results also indicate that HMSS as a criterion to evaluate the classification ability of a function can be used in imbalanced data without modifying the original dataset as against classification accuracy. Our four studies suggest that Sequential Information Bottleneck(sIB), a new unsupervised technique, can be adopted to predict the outcome and its ability to detect the target status is superior to the traditional LDA in the study. ^ From our results we can see that the best test probability-HMSS for predicting CVD, stroke,CAD and psoriasis through sIB is 0.59406, 0.641815, 0.645315 and 0.678658, respectively. In terms of group prediction accuracy, the highest test accuracy of sIB for diagnosing a normal status among controls can reach 0.708999, 0.863216, 0.639918 and 0.850275 respectively in the four studies if the test accuracy among cases is required to be not less than 0.4. On the other hand, the highest test accuracy of sIB for diagnosing a disease among cases can reach 0.748644, 0.789916, 0.705701 and 0.749436 respectively in the four studies if the test accuracy among controls is required to be at least 0.4. ^ A further genome-wide association study through Chi square test shows that there are no significant SNPs detected at the cut-off level 9.09451E-08 in the Framingham heart study of CVD. Study results in WTCCC can only detect two significant SNPs that are associated with CAD. In the genome-wide study of psoriasis most of top 20 SNP markers with impressive classification accuracy are also significantly associated with the disease through chi-square test at the cut-off value 1.11E-07. ^ Although our classification methods can achieve high accuracy in the study, complete descriptions of those classification results(95% confidence interval or statistical test of differences) require more cost-effective methods or efficient computing system, both of which can't be accomplished currently in our genome-wide study. We should also note that the purpose of this study is to identify subsets of SNPs with high prediction ability and those SNPs with good discriminant power are not necessary to be causal markers for the disease.^
Resumo:
Background and purpose: Breast cancer continues to be a health problem for women, representing 28 percent of all female cancers and remaining one of the leading causes of death for women. Breast cancer incidence rates become substantial before the age of 50. After menopause, breast cancer incidence rates continue to increase with age creating a long-lasting source of concern (Harris et al., 1992). Mammography, a technique for the detection of breast tumors in their nonpalpable stage when they are most curable, has taken on considerable importance as a public health measure. The lifetime risk of breast cancer is approximately 1 in 9 and occurs over many decades. Recommendations are that screening be periodic in order to detect cancer at early stages. These recommendations, largely, are not followed. Not only are most women not getting regular mammograms, but this circumstance is particularly the case among older women where regular mammography has been proven to reduce mortality by approximately 30 percent. The purpose of this project was to increase our understanding of factors that are associated with stage of readiness to obtain subsequent mammograms. A secondary purpose of this research was to suggest further conceptual considerations toward the extension of the Transtheoretical Model (TTM) of behavior change to repeat screening mammography. ^ Methods. A sample (n = 1,222) of women 50 years and older in a large multi-specialty clinic in Houston, Texas was surveyed by mail questionnaire regarding their previous screening experience and stage of readiness to obtain repeat screening. A computerized database, maintained on all women who undergo mammography at the clinic, was used to identify women who are eligible for the project. The major statistical technique employed to select the significant variables and to examine the man and interaction effects of independent variables on dependent variables was polychotomous stepwise, logistic regression. A prediction model for each stage of readiness definition was estimated. The expected probabilities for stage of readiness were calculated to assess the magnitude and direction of significant predictors. ^ Results. Analysis showed that both ways of defining stage of readiness for obtaining a screening mammogram were associated with specific constructs, including decisional balance and processes of the change. ^ Conclusions. The results of the present study demonstrate that the TTM appears to translate to repeat mammography screening. Findings in the current study also support finding of previous studies that suggest that stage of readiness is associated with respondent decisional balance and the processes of change. ^
Resumo:
BACKGROUND: Gray matter lesions are known to be common in multiple sclerosis (MS) and are suspected to play an important role in disease progression and clinical disability. A combination of magnetic resonance imaging (MRI) techniques, double-inversion recovery (DIR), and phase-sensitive inversion recovery (PSIR), has been used for detection and classification of cortical lesions. This study shows that high-resolution three-dimensional (3D) magnetization-prepared rapid acquisition with gradient echo (MPRAGE) improves the classification of cortical lesions by allowing more accurate anatomic localization of lesion morphology. METHODS: 11 patients with MS with previously identified cortical lesions were scanned using DIR, PSIR, and 3D MPRAGE. Lesions were identified on DIR and PSIR and classified as purely intracortical or mixed. MPRAGE images were then examined, and lesions were re-classified based on the new information. RESULTS: The high signal-to-noise ratio, fine anatomic detail, and clear gray-white matter tissue contrast seen in the MPRAGE images provided superior delineation of lesion borders and surrounding gray-white matter junction, improving classification accuracy. 119 lesions were identified as either intracortical or mixed on DIR/PSIR. In 89 cases, MPRAGE confirmed the classification by DIR/PSIR. In 30 cases, MPRAGE overturned the original classification. CONCLUSION: Improved classification of cortical lesions was realized by inclusion of high-spatial resolution 3D MPRAGE. This sequence provides unique detail on lesion morphology that is necessary for accurate classification.
Resumo:
PURPOSE: To develop and implement a method for improved cerebellar tissue classification on the MRI of brain by automatically isolating the cerebellum prior to segmentation. MATERIALS AND METHODS: Dual fast spin echo (FSE) and fluid attenuation inversion recovery (FLAIR) images were acquired on 18 normal volunteers on a 3 T Philips scanner. The cerebellum was isolated from the rest of the brain using a symmetric inverse consistent nonlinear registration of individual brain with the parcellated template. The cerebellum was then separated by masking the anatomical image with individual FLAIR images. Tissues in both the cerebellum and rest of the brain were separately classified using hidden Markov random field (HMRF), a parametric method, and then combined to obtain tissue classification of the whole brain. The proposed method for tissue classification on real MR brain images was evaluated subjectively by two experts. The segmentation results on Brainweb images with varying noise and intensity nonuniformity levels were quantitatively compared with the ground truth by computing the Dice similarity indices. RESULTS: The proposed method significantly improved the cerebellar tissue classification on all normal volunteers included in this study without compromising the classification in remaining part of the brain. The average similarity indices for gray matter (GM) and white matter (WM) in the cerebellum are 89.81 (+/-2.34) and 93.04 (+/-2.41), demonstrating excellent performance of the proposed methodology. CONCLUSION: The proposed method significantly improved tissue classification in the cerebellum. The GM was overestimated when segmentation was performed on the whole brain as a single object.
Resumo:
Two studies among college students were conducted to evaluate appropriate measurement methods for etiological research on computing-related upper extremity musculoskeletal disorders (UEMSDs). ^ A cross-sectional study among 100 graduate students evaluated the utility of symptoms surveys (a VAS scale and 5-point Likert scale) compared with two UEMSD clinical classification systems (Gerr and Moore protocols). The two symptom measures were highly concordant (Lin's rho = 0.54; Spearman's r = 0.72); the two clinical protocols were moderately concordant (Cohen's kappa = 0.50). Sensitivity and specificity, endorsed by Youden's J statistic, did not reveal much agreement between the symptoms surveys and clinical examinations. It cannot be concluded self-report symptoms surveys can be used as surrogate for clinical examinations. ^ A pilot repeated measures study conducted among 30 undergraduate students evaluated computing exposure measurement methods. Key findings are: temporal variations in symptoms, the odds of experiencing symptoms increased with every hour of computer use (adjOR = 1.1, p < .10) and every stretch break taken (adjOR = 1.3, p < .10). When measuring posture using the Computer Use Checklist, a positive association with symptoms was observed (adjOR = 1.3, p < 0.10), while measuring posture using a modified Rapid Upper Limb Assessment produced unexpected and inconsistent associations. The findings were inconclusive in identifying an appropriate posture assessment or superior conceptualization of computer use exposure. ^ A cross-sectional study of 166 graduate students evaluated the comparability of graduate students to College Computing & Health surveys administered to undergraduate students. Fifty-five percent reported computing-related pain and functional limitations. Years of computer use in graduate school and number of years in school where weekly computer use was ≥ 10 hours were associated with pain within an hour of computing in logistic regression analyses. The findings are consistent with current literature on both undergraduate and graduate students. ^
Resumo:
This dissertation develops and tests a comparative effectiveness methodology utilizing a novel approach to the application of Data Envelopment Analysis (DEA) in health studies. The concept of performance tiers (PerT) is introduced as terminology to express a relative risk class for individuals within a peer group and the PerT calculation is implemented with operations research (DEA) and spatial algorithms. The analysis results in the discrimination of the individual data observations into a relative risk classification by the DEA-PerT methodology. The performance of two distance measures, kNN (k-nearest neighbor) and Mahalanobis, was subsequently tested to classify new entrants into the appropriate tier. The methods were applied to subject data for the 14 year old cohort in the Project HeartBeat! study.^ The concepts presented herein represent a paradigm shift in the potential for public health applications to identify and respond to individual health status. The resultant classification scheme provides descriptive, and potentially prescriptive, guidance to assess and implement treatments and strategies to improve the delivery and performance of health systems. ^
Resumo:
Most studies of differential gene-expressions have been conducted between two given conditions. The two-condition experimental (TCE) approach is simple in that all genes detected display a common differential expression pattern responsive to a common two-condition difference. Therefore, the genes that are differentially expressed under the other conditions other than the given two conditions are undetectable with the TCE approach. In order to address the problem, we propose a new approach called multiple-condition experiment (MCE) without replication and develop corresponding statistical methods including inference of pairs of conditions for genes, new t-statistics, and a generalized multiple-testing method for any multiple-testing procedure via a control parameter C. We applied these statistical methods to analyze our real MCE data from breast cancer cell lines and found that 85 percent of gene-expression variations were caused by genotypic effects and genotype-ANAX1 overexpression interactions, which agrees well with our expected results. We also applied our methods to the adenoma dataset of Notterman et al. and identified 93 differentially expressed genes that could not be found in TCE. The MCE approach is a conceptual breakthrough in many aspects: (a) many conditions of interests can be conducted simultaneously; (b) study of association between differential expressions of genes and conditions becomes easy; (c) it can provide more precise information for molecular classification and diagnosis of tumors; (d) it can save lot of experimental resources and time for investigators.^
Resumo:
It is well accepted that tumorigenesis is a multi-step procedure involving aberrant functioning of genes regulating cell proliferation, differentiation, apoptosis, genome stability, angiogenesis and motility. To obtain a full understanding of tumorigenesis, it is necessary to collect information on all aspects of cell activity. Recent advances in high throughput technologies allow biologists to generate massive amounts of data, more than might have been imagined decades ago. These advances have made it possible to launch comprehensive projects such as (TCGA) and (ICGC) which systematically characterize the molecular fingerprints of cancer cells using gene expression, methylation, copy number, microRNA and SNP microarrays as well as next generation sequencing assays interrogating somatic mutation, insertion, deletion, translocation and structural rearrangements. Given the massive amount of data, a major challenge is to integrate information from multiple sources and formulate testable hypotheses. This thesis focuses on developing methodologies for integrative analyses of genomic assays profiled on the same set of samples. We have developed several novel methods for integrative biomarker identification and cancer classification. We introduce a regression-based approach to identify biomarkers predictive to therapy response or survival by integrating multiple assays including gene expression, methylation and copy number data through penalized regression. To identify key cancer-specific genes accounting for multiple mechanisms of regulation, we have developed the integIRTy software that provides robust and reliable inferences about gene alteration by automatically adjusting for sample heterogeneity as well as technical artifacts using Item Response Theory. To cope with the increasing need for accurate cancer diagnosis and individualized therapy, we have developed a robust and powerful algorithm called SIBER to systematically identify bimodally expressed genes using next generation RNAseq data. We have shown that prediction models built from these bimodal genes have the same accuracy as models built from all genes. Further, prediction models with dichotomized gene expression measurements based on their bimodal shapes still perform well. The effectiveness of outcome prediction using discretized signals paves the road for more accurate and interpretable cancer classification by integrating signals from multiple sources.