15 results for microarray data classification
in DigitalCommons@The Texas Medical Center
Abstract:
Microarray technology is a high-throughput method for genotyping and gene expression profiling. Limited sensitivity and specificity are among the essential problems of this technology. Most existing methods of microarray data analysis have an apparent limitation: they deal only with the numerical part of microarray data and make little use of gene sequence information. Because the gene sequences precisely define the physical objects being measured by a microarray, it is natural to make them an essential part of the data analysis. This dissertation focused on the development of free energy models to integrate sequence information in microarray data analysis. The models were used to characterize the mechanism of hybridization on microarrays and to enhance the sensitivity and specificity of microarray measurements. Cross-hybridization is a major obstacle to the sensitivity and specificity of microarray measurements. In this dissertation, we evaluated the scope of the cross-hybridization problem on short-oligo microarrays. The results showed that cross-hybridization on arrays is mostly caused by oligo fragments with a run of 10 to 16 nucleotides complementary to the probes. Furthermore, a free-energy-based model was proposed to quantify the amount of cross-hybridization signal on each probe. This model treats cross-hybridization as an integral effect of the interactions between a probe and various off-target oligo fragments. Using public spike-in datasets, the model showed high accuracy in predicting the cross-hybridization signals on those probes whose intended targets are absent in the sample. Several prospective models were proposed to improve the Positional Dependent Nearest-Neighbor (PDNN) model for better quantification of gene expression and cross-hybridization. The problem addressed in this dissertation is fundamental to microarray technology.
We expect that this study will help us understand the detailed mechanism that determines sensitivity and specificity on microarrays. Consequently, this research will have a wide impact on how microarrays are designed and how the data are interpreted.
Abstract:
The difficulty of detecting differential gene expression in microarray data has existed for many years. Several correction procedures address the inflated error rate that arises in multiple comparisons: the Bonferroni and Sidak single-step p-value adjustments and Holm's step-down correction method control the family-wise error rate, while Benjamini and Hochberg's correction procedure controls the false discovery rate (FDR). Each multiple comparison technique has its advantages and weaknesses. We studied each multiple comparison method through numerical studies (simulations) and applied the methods to real exploratory DNA microarray data aimed at detecting molecular signatures in papillary thyroid cancer (PTC) patients. According to our simulation results, the Benjamini and Hochberg step-up FDR-controlling procedure performed best among these multiple comparison methods, and we discovered 1277 potential biomarkers among 54675 probe sets after applying it to the PTC microarray data.
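The four adjustments named in this abstract can be sketched in a few lines of plain Python. This is an illustrative sketch of the textbook definitions, not the dissertation's code; the function names and the example p-values are assumptions chosen for illustration.

```python
def bonferroni(pvals):
    """Single-step Bonferroni: multiply each p-value by m, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def sidak(pvals):
    """Single-step Sidak: 1 - (1 - p)^m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]


def holm(pvals):
    """Holm step-down adjustment (controls the family-wise error rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    # Walk from the smallest p-value up, enforcing monotonicity with a running max.
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity with a running min.
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four probe sets.
example = [0.01, 0.04, 0.03, 0.005]
bh = benjamini_hochberg(example)
significant = [i for i, q in enumerate(bh) if q <= 0.05]
```

On this toy input every probe set passes a 0.05 threshold under Benjamini-Hochberg, whereas Bonferroni and Holm each reject fewer hypotheses, which illustrates the trade-off between the stricter family-wise control and the less conservative FDR control.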
Abstract:
Brain tumor is one of the most aggressive types of cancer in humans, with an estimated median survival time of 12 months and only 4% of the patients surviving more than 5 years after disease diagnosis. Until recently, brain tumor prognosis has been based only on clinical information such as tumor grade and patient age, but there are reports indicating that molecular profiling of gliomas can reveal subgroups of patients with distinct survival rates. We hypothesize that coupling molecular profiling of brain tumors with clinical information might improve predictions of patient survival time and, consequently, better guide future treatment decisions. To evaluate this hypothesis, the general goal of this research is to build models for survival prediction of glioma patients using DNA molecular profiles (U133 Affymetrix gene expression microarrays) along with clinical information. First, a predictive Random Forest model is built for binary outcomes (i.e., short- vs. long-term survival) and a small subset of genes whose expression values can be used to predict survival time is selected. Next, a new statistical methodology is developed for predicting time-to-death outcomes using Bayesian ensemble trees. Because of the large heterogeneity observed within prognostic classes obtained by the Random Forest model, prediction can be improved by relating time-to-death to the gene expression profile directly. We propose a Bayesian ensemble model for survival prediction that is appropriate for high-dimensional data such as gene expression data. Our approach is based on the ensemble "sum-of-trees" model, which is flexible enough to incorporate additive and interaction effects between genes. We specify a fully Bayesian hierarchical approach and illustrate our methodology for the CPH, Weibull, and AFT survival models. We overcome the lack of conjugacy using a latent variable formulation to model the covariate effects, which decreases computation time for model fitting.
Also, our proposed models provide a model-free way to select important predictive prognostic markers based on controlling false discovery rates. We compare the performance of our methods with baseline reference survival methods and apply our methodology to an unpublished data set of brain tumor survival times and gene expression data, selecting genes potentially related to the development of the disease under study. A closing discussion compares results obtained by Random Forest and Bayesian ensemble methods from biological and clinical perspectives and highlights the statistical advantages and disadvantages of the new methodology in the context of DNA microarray data analysis.
The mechanism of tumorigenesis in the immortalized human pancreatic cell lines: cell culture models of human pancreatic cancer
Abstract:
Pancreatic ductal adenocarcinoma (PDAC) is the most lethal cancer in the world. The most common genetic lesions identified in PDAC include activation of K-ras (90%) and Her2 (70%), loss of p16 (95%) and p14 (40%), and inactivation of p53 (50-75%) and Smad4 (55%). However, the role of these signature gene alterations in PDAC is still not well understood; in particular, how these genetic lesions, individually or in combination, contribute mechanistically to human pancreatic oncogenesis remains elusive. Moreover, a cell culture transformation model with sequential accumulation of signature genetic alterations in human pancreatic ductal cells that resembles multi-step human pancreatic carcinogenesis has not yet been established. In the present study, through the stepwise introduction of the signature genetic alterations in PDAC into the HPV16-E6E7 immortalized human pancreatic duct epithelial (HPDE) cell line and the hTERT immortalized human pancreatic ductal HPNE cell line, we developed novel experimental cell culture transformation models with the most frequent gene alterations in PDAC and further dissected the molecular mechanism of transformation. We demonstrated that the combination of activation of K-ras and Her2, inactivation of p16/p14 and Smad4, or K-ras mutation plus p16 inactivation, was sufficient for the tumorigenic transformation of HPDE or HPNE cells, respectively. We found that these transformed cells exhibited enhanced cell proliferation and anchorage-independent growth in soft agar, and formed tumors with PDAC histopathological features in an orthotopic mouse model. Molecular analysis showed that the activation of K-ras and Her2 downstream effector pathways (MAPK, RalA, FAK), together with upregulation of cyclins and c-myc, was involved in the malignant transformation.
We discovered that MDM2, BMP7 and Bmi-1 were overexpressed in the tumorigenic HPDE cells, and that Smad4 played important roles in the regulation of BMP7 and Bmi-1 gene expression and in the tumorigenic transformation of HPDE cells. IPA signaling pathway analysis of microarray data revealed that abnormal signaling pathways are involved in transformation. This study presents the first complete transformation model of human pancreatic ductal cells with the most common gene alterations in PDAC. Altogether, these novel transformation models more closely recapitulate human pancreatic carcinogenesis in terms of cell origin, gene lesions, activation of specific signaling pathways, and histopathological features.
Abstract:
Lung cancer is a devastating disease with very poor prognosis. The design of better treatments for patients would be greatly aided by mouse models that closely resemble the human disease. The most common type of human lung cancer is adenocarcinoma with frequent metastasis. Unfortunately, current models for this tumor are inadequate due to the absence of metastasis. Based on the molecular findings in human lung cancer and the metastatic potential of osteosarcomas in mutant p53 mouse models, I hypothesized that mice with both K-ras and p53 missense mutations might develop metastatic lung adenocarcinomas. Therefore, I incorporated both K-rasLA1 and p53R172HΔg alleles into mouse lung cells to establish a more faithful model for human lung adenocarcinoma and for translational and mechanistic studies. Mice with both mutations (K-rasLA1/+ p53R172HΔg/+) developed advanced lung adenocarcinomas with histopathology similar to that of human tumors. These lung adenocarcinomas were highly aggressive and metastasized to multiple intrathoracic and extrathoracic sites in a pattern similar to that seen in lung cancer patients. This mouse model also showed gender differences in cancer-related death, and 23.2% of the mice developed pleural mesotheliomas. In a preclinical study, the new drug Erlotinib (Tarceva) decreased the number and size of lung lesions in this model. These data demonstrate that this mouse model closely mimics human metastatic lung adenocarcinoma and provides an invaluable system for translational studies. To screen for genes important for metastasis, gene expression profiles of primary lung adenocarcinomas and metastases were analyzed. Microarray data showed that these two groups were segregated in gene expression and had 79 highly differentially expressed genes (more than 2.5-fold change and p<0.001). Microarray data for Bub1b, Vimentin and CCAM1 were validated in tumors by quantitative real-time PCR (QPCR).
Bub1b, a mitotic checkpoint gene, was overexpressed in metastases, and this correlated with more chromosomal abnormalities in metastatic cells. Vimentin, a marker of epithelial-mesenchymal transition (EMT), was also highly expressed in metastases. Interestingly, Twist, a key EMT inducer, was also highly upregulated in metastases by QPCR, and this significantly correlated with the overexpression of Vimentin in the same tumors. These data suggest that EMT occurs in lung adenocarcinomas and is a key mechanism for the development of metastasis in K-rasLA1/+ p53R172HΔg/+ mice. Thus, this mouse model provides a unique system to further probe the molecular basis of metastatic lung cancer.
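The selection rule quoted in this abstract (more than 2.5-fold change and p<0.001) can be written as a small filter. The sketch below assumes the fold change is symmetric, so that a 2.5-fold decrease counts as well as a 2.5-fold increase; the abstract does not say so, and the function and parameter names are hypothetical.

```python
import math


def is_differentially_expressed(mean_primary, mean_metastasis, p_value,
                                min_fold=2.5, max_p=0.001):
    """Return True when a gene passes both the fold-change and p-value cutoffs.

    Assumption: expression means are positive, linear-scale intensities, and
    the fold change is treated symmetrically (up- or down-regulation).
    """
    if mean_primary <= 0 or mean_metastasis <= 0:
        raise ValueError("expression means must be positive")
    log_fold = abs(math.log2(mean_metastasis / mean_primary))
    return log_fold > math.log2(min_fold) and p_value < max_p
```

Working on the log scale keeps the test symmetric: a ratio of 3 and a ratio of 1/3 both exceed the 2.5-fold cutoff.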
Abstract:
Insulin-like growth factor binding protein 2 (IGFBP2) is a protein known to be overexpressed in a majority of glioblastoma multiforme (GBM) tumors. While it is known that IGFBP2 is involved in promoting GBM tumor cell invasion, no mechanism has been established for how the protein is involved in signal transduction pathways leading to enhanced cell invasion. We follow up on preliminary microarray data on IGFBP2-overexpressing GBM cells and protein sequence analysis of IGFBP2 to generate the hypothesis that IGFBP2 interacts with integrin α5 in regulating cell mobility. Microarray data showing upregulation of integrin α5 by IGFBP2 are validated, and evidence of protein-protein interaction between IGFBP2 and integrin α5 is found. The exact binding domain on IGFBP2 responsible for its interaction with integrin α5 is also determined, confirming our initial findings and reaffirming that the IGFBP2/integrin α5 interaction is specific. Disruption of this interaction resulted in attenuation of IGFBP2-enhanced cell mobility. Further, we found that cell mobility is only enhanced when IGFBP2 and integrin α5 are both overexpressed and able to interact with each other. We also determined fibronectin to be a critical player in the activation of the IGFBP2/integrin α5 pathway. The activation of this pathway appears to be progressive and initiates once GBM cells have sufficiently established anchorage.
Abstract:
Human lipocalin 2 is also known as neutrophil gelatinase-associated lipocalin (NGAL). The lipocalin 2 gene encodes a small, secreted glycoprotein that possesses a variety of functions, of which the best characterized is organic iron binding activity. Elevated NGAL expression has been observed in many human cancers, including breast, colorectal, pancreatic and ovarian cancers. I focused on the characterization of NGAL function in chronic myelogenous leukemia (CML) and breast cancer. Using the leukemic xenograft mouse model, we demonstrated that over-expression of NGAL in K562 cells, a leukemic cell line, led to a higher apoptotic rate and an atrophy phenotype in the spleen of inoculated mice compared to K562 cells alone. These results indicate that NGAL plays a primary role in suppressing hematopoiesis by inducing apoptosis within normal hematopoietic cells. In the breast cancer project, we analyzed two microarray data sets of breast cancer cell lines (n = 54) and primary breast cancer samples (n = 318), and demonstrated that high NGAL expression is significantly correlated with several tumor characteristics, including negative estrogen receptor (ER) status, positive HER2 status, high tumor grade, and lymph node metastasis. Ectopic NGAL expression in non-aggressive (ZR75.1 and MCF7) cells led to aggressive tumor phenotypes in vitro and in vivo. Conversely, knockdown of NGAL expression in various breast cancer cell lines by shRNA lentiviral infection significantly decreased the migration, invasion, and metastasis activities of tumor cells both in vitro and in vivo. It has been previously reported that transgenic mice with a mutation in the trans-membrane domain region (V664E) of HER2 develop mammary tumors that progress to lung metastasis.
However, we observed that genetic deletion of the 24p3 gene, a mouse homolog of NGAL, in HER2 transgenic mice by breeding with 24p3-null mice resulted in a significant delay of mammary tumor formation and a reduction of lung metastasis. Strikingly, we also found that treatment with affinity-purified 24p3 antibodies in the 4T1 breast cancer mice strongly reduced lung metastasis. Our studies provide evidence that NGAL plays a critical role in breast cancer development and progression, and thus NGAL has potential as a new therapeutic target in breast cancer.
Abstract:
Cytochromes P450 catalyze a monooxygenase reaction in which molecular oxygen is split and one oxygen atom is incorporated into the substrate. As a whole, P450 researchers have focused most of their attention on substrate metabolism and relatively little on how these enzymes are regulated. This study focuses on the regulation of two P450 isoforms, CYP2D6 and CYP4F11. The human CYP2D gene locus contains two pseudogenes and one functional gene known as CYP2D6. This locus is highly polymorphic and produces several alternatively spliced transcripts from the pseudogene CYP2D7. My objective was to understand the role of SV5-in (splice variant 5), one of several alternative splice variants transcribed from the CYP2D7 pseudogene. My results indicate that SV5-in mRNA causes an increase in CYP2D6 protein levels and suggest that there is a role for SV5-in in the regulation of CYP2D6 expression. Second, CYP4F11 is a recently discovered and uncharacterized isoform derived from the CYP4F subfamily. It metabolizes several clinically relevant drugs (e.g., erythromycin and benzphetamine) and some endogenous inflammatory mediators (e.g., LTB4). After evaluation of microarray data, I observed an increase in CYP4F11 mRNA levels in wild-type HCT116 cells compared with p53-null cells. Our objectives were to explore and understand this connection between p53 and CYP4F11. Microarray data were confirmed by Q-PCR, after which this effect was again observed at the protein level via Western blot and at the promoter level via luciferase assay and chromatin immunoprecipitation. Our results indicate that p53 protein regulates expression of CYP4F11 mRNA and protein through CYP4F11 promoter binding (note that p53 binding to CYP4F11 DNA was not shown to be direct). These results signify a whole new level of regulation of drug metabolizing enzymes by p53.
An understanding of CYP4F11 regulation by p53 could help us understand another pathway leading to apoptosis or cell growth arrest. This could aid future drug studies and help uncover new drug metabolism pathways under the control of a tumor suppressor protein. An understanding of the CYP2D6 regulation pathway could illuminate the role of non-coding RNAs in the P450 field and potentially explain several inter-individual drug response variations observed in clinical medicine that are not yet completely explained by genotyping analysis.
Abstract:
Inflammatory breast cancer (IBC) is the most insidious form of locally advanced disease. Although rare, accounting for less than 2% of all breast cancers, IBC is responsible for up to 10% of all breast cancer deaths. Despite the name, very little is known about the role of inflammation or immune mediators in IBC. Therefore, we analyzed blood samples from IBC patients and non-IBC patients, as well as healthy donor controls, to establish an IBC-specific profile of peripheral blood leukocyte phenotype, function of T cells and dendritic cells, and serum inflammatory cytokines. Emerging evidence suggests that host factors in the microenvironment may interact with underlying IBC genetics to promote the aggressive nature of the tumor. An integral part of the metastatic process involves epithelial to mesenchymal transition (EMT), in which primary breast cancer cells gain motility and stem cell-like features that allow distant seeding. Interestingly, the IBC consortium microarray data found no clear evidence for EMT in IBC tumor tissues. It is becoming increasingly evident that inflammatory factors can induce EMT. However, it is unknown whether EMT-inducing soluble factors secreted by activated immune cells in the IBC microenvironment can account for the absence of EMT in studies of the tumor cells themselves. We hypothesized that soluble factors from immune cells are capable of inducing EMT in IBC. We tested the ability of immune conditioned media to induce EMT in IBC cells. We found that soluble factors from activated immune cells are able to induce the expression of EMT-related factors in IBC cells, along with increased migration and invasion. Specifically, the pro-inflammatory cytokines TNF-α, IL-6 and TGF-β were able to induce EMT, and blocking these factors in conditioned media abated the induction of EMT.
Surprisingly, and uniquely in IBC cells, this process was related to increased levels of E-cadherin expression and adhesion, reminiscent of the characteristic tightly packed tumor emboli seen in IBC samples. These data offer insight into the unique pathology of IBC by suggesting that tumor-immune interactions in the tumor microenvironment contribute to the aggressive nature of IBC, implying that immune-induced inflammation can be a novel therapeutic target. Specifically, we showed that soluble factors secreted by activated immune cells are capable of inducing EMT in IBC cells and may mediate the persistent E-cadherin expression observed in IBC. These data suggest that immune-mediated inflammation may contribute to the highly aggressive nature of IBC and represents a potential therapeutic target that warrants further investigation.
Abstract:
Most studies of differential gene expression have been conducted between two given conditions. The two-condition experimental (TCE) approach is simple in that all genes detected display a common differential expression pattern responsive to a common two-condition difference. Consequently, genes that are differentially expressed under conditions other than the given two are undetectable with the TCE approach. To address this problem, we propose a new approach called multiple-condition experiment (MCE) without replication and develop corresponding statistical methods, including inference of pairs of conditions for genes, new t-statistics, and a generalized multiple-testing method for any multiple-testing procedure via a control parameter C. We applied these statistical methods to analyze our real MCE data from breast cancer cell lines and found that 85 percent of gene-expression variations were caused by genotypic effects and genotype-ANAX1 overexpression interactions, which agrees well with our expected results. We also applied our methods to the adenoma dataset of Notterman et al. and identified 93 differentially expressed genes that could not be found in TCE. The MCE approach is a conceptual breakthrough in many aspects: (a) many conditions of interest can be studied simultaneously; (b) study of the association between differential expression of genes and conditions becomes easy; (c) it can provide more precise information for molecular classification and diagnosis of tumors; (d) it can save a lot of experimental resources and time for investigators.
Abstract:
It is well accepted that tumorigenesis is a multi-step procedure involving aberrant functioning of genes regulating cell proliferation, differentiation, apoptosis, genome stability, angiogenesis and motility. To obtain a full understanding of tumorigenesis, it is necessary to collect information on all aspects of cell activity. Recent advances in high-throughput technologies allow biologists to generate massive amounts of data, more than might have been imagined decades ago. These advances have made it possible to launch comprehensive projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), which systematically characterize the molecular fingerprints of cancer cells using gene expression, methylation, copy number, microRNA and SNP microarrays as well as next generation sequencing assays interrogating somatic mutations, insertions, deletions, translocations and structural rearrangements. Given the massive amount of data, a major challenge is to integrate information from multiple sources and formulate testable hypotheses. This thesis focuses on developing methodologies for integrative analyses of genomic assays profiled on the same set of samples. We have developed several novel methods for integrative biomarker identification and cancer classification. We introduce a regression-based approach to identify biomarkers predictive of therapy response or survival by integrating multiple assays, including gene expression, methylation and copy number data, through penalized regression. To identify key cancer-specific genes accounting for multiple mechanisms of regulation, we have developed the integIRTy software, which provides robust and reliable inferences about gene alteration by automatically adjusting for sample heterogeneity as well as technical artifacts using Item Response Theory.
To cope with the increasing need for accurate cancer diagnosis and individualized therapy, we have developed a robust and powerful algorithm called SIBER to systematically identify bimodally expressed genes using next generation RNAseq data. We have shown that prediction models built from these bimodal genes have the same accuracy as models built from all genes. Further, prediction models with dichotomized gene expression measurements based on their bimodal shapes still perform well. The effectiveness of outcome prediction using discretized signals paves the road for more accurate and interpretable cancer classification by integrating signals from multiple sources.
Abstract:
Material Safety Data Sheets (MSDSs) are an integral component of occupational hazard communication systems. These documents are used to disseminate hazard information on chemical substances to workers. The primary purpose of this study was to investigate the comprehensibility of MSDSs by workers at an international level. A total of 117 employees of a multi-national petrochemical company participated, thirty-nine (39) each in the United States, Canada and the United Kingdom. The overall participation rate of those approached was 82%. These countries were selected because they each utilize one of the three major existing hazard communication systems for fixed workplaces. The systems comprise the Occupational Safety and Health Administration's Hazard Communication Standard in the United States, the Workplace Hazardous Materials Information System (WHMIS) in Canada, and the compilation of several European Union directives addressing classification, labeling of substances and preparations, and MSDSs in Europe. A pretest-posttest randomized study design was used, with the posttest being comparable to an open book test. The results of this research indicated that only about two-thirds of the information on the MSDSs was comprehended by the workers, with a significant difference identified among study participants based on country comparisons. These data were fairly consistent with the results of previous MSDS comprehensibility studies conducted in the United States. There was no significant difference in the comprehension level among study participants when taking into account the international hazard communication standard that the MSDS complied with. Marginally, age, education level and experience level did not have a significant impact on the comprehension level. Participants did find MSDSs to be satisfactory in providing the information needed to protect them, regardless of their views on the readability and formatting of MSDSs.
The health-related information was the least comprehended: less than half of it was comprehended on the basis of the responses. The findings from this research suggest that much work is still needed to make MSDSs more comprehensible on a global basis, particularly regarding health-related information.
High-resolution microarray analysis of chromosome 20q in human colon cancer metastasis model systems
Abstract:
Amplification of human chromosome 20q DNA is the most frequently occurring chromosomal abnormality detected in sporadic colorectal carcinomas and shows significant correlation with liver metastases. Through comprehensive high-resolution microarray comparative genomic hybridization and microarray gene expression profiling, we have characterized chromosome 20q amplicon genes associated with human colorectal cancer metastasis in two in vitro metastasis model systems. The results revealed increasing complexity of the 20q genomic profile from the primary tumor-derived cell lines to the lymph node and liver metastasis-derived cell lines. Expression analysis of chromosome 20q revealed a subset of overexpressed genes residing within the regions of genomic copy number gain in all the tumor cell lines, suggesting that these are chromosome 20q copy number responsive genes. Based on their preferential expression levels in the model system cell lines and known biological function, four of the overexpressed genes mapping to the common intervals of genomic copy gain were considered the most promising candidate colorectal metastasis-associated genes. Validation of genomic copy number and expression array data was carried out on these genes, with one gene, DNMT3B, standing out as expressed at relatively higher levels in the metastasis-derived cell lines compared with their primary-derived counterparts in both model systems analyzed. The data provide evidence for the role of chromosome 20q genes with low copy gain and elevated expression in the clonal evolution of metastatic cells and suggest that such genes may serve as early biomarkers of metastatic potential. The data also support the utility of combined microarray comparative genomic hybridization and expression array analysis for identifying copy number responsive genes in areas of low DNA copy gain in cancer cells.
Abstract:
Maximizing data quality may be especially difficult in trauma-related clinical research. Strategies are needed to improve data quality and assess the impact of data quality on clinical predictive models. This study had two objectives. The first was to compare missing data between two multi-center trauma transfusion studies: a retrospective study (RS) using medical chart data with minimal data quality review and the PRospective Observational Multi-center Major Trauma Transfusion (PROMMTT) study with standardized quality assurance. The second objective was to assess the impact of missing data on clinical prediction algorithms by evaluating blood transfusion prediction models using PROMMTT data. RS (2005-06) and PROMMTT (2009-10) investigated trauma patients receiving ≥ 1 unit of red blood cells (RBC) from ten Level I trauma centers. Missing data were compared for 33 variables collected in both studies using mixed effects logistic regression (including random intercepts for study site). Massive transfusion (MT) patients received ≥ 10 RBC units within 24h of admission. Correct classification percentages for three MT prediction models were evaluated using complete case analysis and multiple imputation based on the multivariate normal distribution. A sensitivity analysis for missing data was conducted to estimate the upper and lower bounds of correct classification using assumptions about missing data under best and worst case scenarios. Most variables (17/33=52%) had <1% missing data in RS and PROMMTT. Of the remaining variables, 50% demonstrated less missingness in PROMMTT, 25% had less missingness in RS, and 25% were similar between studies. Missing percentages for MT prediction variables in PROMMTT ranged from 2.2% (heart rate) to 45% (respiratory rate). For variables missing >1%, study site was associated with missingness (all p≤0.021). Survival time predicted missingness for 50% of RS and 60% of PROMMTT variables. 
Complete case proportions for the MT models ranged from 41% to 88%. Complete case analysis and multiple imputation demonstrated similar correct classification results. Sensitivity analysis upper-lower bound ranges for the three MT models were 59-63%, 36-46%, and 46-58%. Prospective collection of ten-fold more variables with data quality assurance reduced overall missing data. Study site and patient survival were associated with missingness, suggesting that data were not missing completely at random and that complete case analysis may lead to biased results. Evaluating clinical prediction model accuracy may be misleading in the presence of missing data, especially with many predictor variables. The proposed sensitivity analysis estimating correct classification under upper (best case scenario) and lower (worst case scenario) bounds may be more informative than multiple imputation, which provided results similar to complete case analysis.
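One simple reading of the best-case and worst-case sensitivity analysis is sketched below. The abstract does not spell out the exact procedure, so this is an assumption: patients whose prediction cannot be computed because of missing predictors are all counted as correctly classified for the upper bound and all as misclassified for the lower bound. The function name and the counts in the usage note are hypothetical.

```python
def classification_bounds(n_correct, n_incorrect, n_missing):
    """Return (lower, complete_case, upper) correct-classification proportions.

    n_correct / n_incorrect: complete cases the model classified right / wrong.
    n_missing: patients who could not be classified because of missing data.
    """
    n_complete = n_correct + n_incorrect
    if n_complete == 0:
        raise ValueError("need at least one complete case")
    n_total = n_complete + n_missing
    complete_case = n_correct / n_complete           # ignores missing patients
    lower = n_correct / n_total                      # worst case: all missing wrong
    upper = (n_correct + n_missing) / n_total        # best case: all missing right
    return lower, complete_case, upper
```

For example, with 60 correct, 20 incorrect and 20 unclassifiable patients, the complete case estimate is 75%, while the bounds run from 60% to 80%. The width of that interval grows with the share of missing data, which is the point the abstract makes about complete case analysis.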
Abstract:
Cervical cancer is the leading cause of death and disease from malignant neoplasms among women in developing countries. Even though the Pap smear has significantly decreased the number of deaths from cervical cancer in the past years, it has its limitations. Researchers have developed an automated screening machine that can potentially detect abnormal cases that are overlooked by conventional screening. The goal of quantitative cytology is to classify the patient's tissue sample based on quantitative measurements of the individual cells. It is also much cheaper and can potentially take less time. One of the major challenges of collecting cells with a cytobrush is the possibility of not sampling any existing dysplastic cells on the cervix. Being able to correctly classify patients who have disease without the presence of dysplastic cells could improve the accuracy of quantitative cytology algorithms. Subtle morphologic changes in normal-appearing tissues adjacent to or distant from malignant tumors have been shown to exist, but a comparison of various statistical methods, including many recent advances in the statistical learning field, has not previously been done. The objective of this thesis is to apply different classification methods to quantitative cytology data for the detection of malignancy-associated changes (MACs). In this thesis, Elastic Net was the best-performing algorithm. When we applied the Elastic Net algorithm to the test set, we combined the training and validation sets into a single "training" set and used 5-fold cross-validation to choose the parameter for Elastic Net. It has a sensitivity of 47% at 80% specificity, an AUC of 0.52, and a partial AUC of 0.10 (95% CI 0.09-0.11).
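The "sensitivity at 80% specificity" figure reported here can be computed from any classifier's scores by fixing the decision threshold on the disease-free samples first. The sketch below is a generic illustration under stated assumptions, not the thesis's code: higher scores are assumed to indicate disease, labels are coded 1 for disease and 0 for no disease, and the function name is hypothetical.

```python
import math


def sensitivity_at_specificity(scores, labels, target_specificity=0.80):
    """Return (threshold, specificity, sensitivity) at the target specificity.

    A sample is called positive when its score is strictly above the threshold.
    """
    negatives = sorted(s for s, y in zip(scores, labels) if y == 0)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one sample of each class")
    # Number of negatives that must fall at or below the threshold.
    k = math.ceil(target_specificity * len(negatives) - 1e-9)
    threshold = negatives[k - 1] if k > 0 else -math.inf
    specificity = sum(s <= threshold for s in negatives) / len(negatives)
    sensitivity = sum(s > threshold for s in positives) / len(positives)
    return threshold, specificity, sensitivity
```

Tied scores at the threshold can push the achieved specificity above the target, which is why the function returns the realized specificity alongside the sensitivity.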