Digital Library

22 results for High-Throughput Nucleotide Sequencing

in DigitalCommons@The Texas Medical Center

INTEGRATIVE BIOMARKER IDENTIFICATION AND CLASSIFICATION USING HIGH THROUGHPUT ASSAYS

Relevance:

100.00%

Abstract:

It is well accepted that tumorigenesis is a multi-step process involving aberrant functioning of genes regulating cell proliferation, differentiation, apoptosis, genome stability, angiogenesis and motility. To obtain a full understanding of tumorigenesis, it is necessary to collect information on all aspects of cell activity. Recent advances in high throughput technologies allow biologists to generate massive amounts of data, more than might have been imagined decades ago. These advances have made it possible to launch comprehensive projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), which systematically characterize the molecular fingerprints of cancer cells using gene expression, methylation, copy number, microRNA and SNP microarrays as well as next generation sequencing assays interrogating somatic mutations, insertions, deletions, translocations and structural rearrangements. Given the massive amount of data, a major challenge is to integrate information from multiple sources and formulate testable hypotheses. This thesis focuses on developing methodologies for integrative analyses of genomic assays profiled on the same set of samples. We have developed several novel methods for integrative biomarker identification and cancer classification. We introduce a regression-based approach to identify biomarkers predictive of therapy response or survival by integrating multiple assays, including gene expression, methylation and copy number data, through penalized regression. To identify key cancer-specific genes accounting for multiple mechanisms of regulation, we have developed the integIRTy software, which provides robust and reliable inferences about gene alteration by automatically adjusting for sample heterogeneity as well as technical artifacts using Item Response Theory.
To cope with the increasing need for accurate cancer diagnosis and individualized therapy, we have developed a robust and powerful algorithm called SIBER to systematically identify bimodally expressed genes using next generation RNAseq data. We have shown that prediction models built from these bimodal genes have the same accuracy as models built from all genes. Further, prediction models with dichotomized gene expression measurements based on their bimodal shapes still perform well. The effectiveness of outcome prediction using discretized signals paves the road for more accurate and interpretable cancer classification by integrating signals from multiple sources.
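The idea of dichotomizing a bimodally expressed gene can be illustrated with a small sketch. The code below is not SIBER and does not fit a mixture model; as a stand-in, it assumes a simple one-dimensional two-means split to find a threshold between the low and high modes, then converts expression values to 0/1 calls. The function names and the example values are hypothetical.

```python
from statistics import mean


def two_means_threshold(values, max_iterations=50):
    """Find a cut point between two modes with 1-D k-means (k = 2).

    This is a simple stand-in for a mixture-model fit, used here only
    for illustration.
    """
    low_center, high_center = min(values), max(values)
    if low_center == high_center:
        raise ValueError("all values are identical; nothing to split")
    for _ in range(max_iterations):
        threshold = (low_center + high_center) / 2
        low = [v for v in values if v <= threshold]
        high = [v for v in values if v > threshold]
        new_low, new_high = mean(low), mean(high)
        if (new_low, new_high) == (low_center, high_center):
            break  # centers stopped moving
        low_center, high_center = new_low, new_high
    return (low_center + high_center) / 2


def dichotomize(values, threshold):
    """Map each expression value to 0 (low mode) or 1 (high mode)."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical log-expression values for one gene across eight samples.
expression = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
cut = two_means_threshold(expression)
calls = dichotomize(expression, cut)
```

The resulting 0/1 calls could then be fed to any classifier in place of the continuous measurements, which is the kind of discretized signal the abstract describes.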

NOVEL THERAPEUTIC TARGETS IDENTIFIED BY HIGH-THROUGHPUT TECHNOLOGIES FOR TRIPLE-NEGATIVE BREAST CANCER

Relevance:

100.00%

Abstract:

Triple-negative breast cancers (TNBC) are characterized by the lack of or reduced expression of the estrogen and progesterone receptors, and normal expression of the human epidermal growth factor receptor 2. The lack of a well-characterized target for treatment leaves only systemic chemotherapy as the mainstay of treatment. Approximately 60-70% of patients are chemosensitive, while the remainder do not respond. Targeted therapies that take advantage of the unique molecular perturbations found in triple-negative breast cancer are needed. The genes that are frequently amplified or overexpressed represent potential therapeutic targets for triple-negative breast cancer. The purpose of this study was to identify and validate novel therapeutic targets for triple-negative breast cancers. 681 genes showed consistent and highly significant overexpression in TNBC compared to receptor-positive cancers in 2 data sets. For two genes, 3 of the 4 siRNAs showed preferential growth inhibition in TNBC cells. These two genes were the low density lipoprotein receptor-related protein 8 (LRP8) and very low-density lipoprotein receptor (VLDLR). Exposure to their cognate ligands, reelin and apolipoprotein E isoform 4 (ApoE4), stimulated the growth of TNBC cells in vitro. Suppression of the expression of either LRP8 or VLDLR or exposure to RAP (an inhibitor of ligand binding to LRP8 and VLDLR) abolished this ligand-induced proliferation. High-throughput protein and metabolic arrays revealed that ApoE4 stimulation rescued TNBC cells from serum-starvation induced up-regulation of genes involved in lipid biosynthesis, increased protein expression of oncogenes involved in the MAPK/ERK and DNA repair pathways, and reduced the serum-starvation induction of biochemicals involved in oxidative stress response and glycolytic metabolism. shLRP8 MDA-MB-231 xenografts had reduced tumor volume in comparison to parental and shCON xenografts.
These results indicate that LRP8-APOE signaling confers survival advantages to TNBC tumors under reduced nutrient conditions and during cellular environmental stress. We revealed that the LRP8-APOE receptor-ligand system is overexpressed in human TNBC. We also demonstrated that this receptor system mediates a strong growth promoting and survival function in TNBC cells in vitro and helps to sustain the growth of MDA-MB-231 xenografts. We propose that inhibitors of LRP8-APOE signaling may be clinically useful therapeutic agents for triple-negative breast cancer.

Development of HIF-1α/HIF-1β heterodimerization inhibitors using a novel bioluminescence reporter assay system for in vitro high throughput screening and in vivo imaging

Relevance:

100.00%

Abstract:

Tumor growth often outpaces its vascularization, leading to development of a hypoxic tumor microenvironment. In response, an intracellular hypoxia survival pathway is initiated by heterodimerization of hypoxia-inducible factor (HIF)-1α and HIF-1β, which subsequently upregulates the expression of several hypoxia-inducible genes, promotes cell survival and stimulates angiogenesis in the oxygen-deprived environment. Hypoxic tumor regions are often associated with resistance to various classes of radio- or chemotherapeutic agents. Therefore, development of HIF-1α/β heterodimerization inhibitors may provide a novel approach to anti-cancer therapy. To this end, a novel approach for imaging HIF-1α/β heterodimerization in vitro and in vivo was developed in this study. Using this screening platform, we identified a promising lead candidate and further chemically derivatized the lead candidate to assess the structure-activity relationship (SAR). The most effective first generation drug inhibitors were selected and their pharmacodynamics and anti-tumor efficacy in vivo were verified by bioluminescence imaging (BLI) of HIF-1α/β heterodimerization in the xenograft tumor model. Furthermore, the first generation drug inhibitors, M-TMCP and D-TMCP, demonstrated efficacy as monotherapies, resulting in tumor growth inhibition via disruption of HIF-1 signaling-mediated tumor stromal neoangiogenesis.

A METAGENOMIC STUDY OF THE TICK MIDGUT

Relevance:

100.00%

Abstract:

Daniel Yuan, B.S. Supervisory Professor: Steven J. Norris, Ph.D.
Southern tick-associated rash illness (STARI), or Masters disease, is a Lyme-like illness that occurs following bites by Amblyomma americanum, the lone-star tick. Clinical symptoms include a bull's eye rash similar to the erythema migrans lesions of Lyme disease, as well as fever and joint pains. Lyme disease is caused by Borrelia burgdorferi and related spirochetes. However, B. burgdorferi has not been detected in STARI patients, or in ticks in the South Central U.S. The causative agent of STARI has not been identified, although it was once thought to be caused by another Borrelia species, Borrelia lonestari. Furthermore, while adult A. americanum have up to a 5.6% Borrelia lonestari infection rate, the prevalence of all Borrelia species in Texas ticks as a whole is not known. Previous studies indicate that 6%-30% of Northern Ixodes scapularis ticks are infected by Borrelia burgdorferi while only 10% of Northern A. americanum and I. scapularis ticks are infected by Borrelia species. The first specific aim of this project was to determine the bacterial community that inhabits the midgut of Texas and Northeastern ticks by using high throughput metagenomic sequencing to sequence bacterial 16S rDNA. Through the use of massively parallel 454 sequencing, we were able to individually sequence hundreds of thousands of 16S rDNA regions of the bacterial flora from 133 ticks from New York, Missouri and Texas. We confirmed the presence of known endosymbionts commonly found in ticks, specifically Rickettsia spp. and Coxiella spp., and detected some highly prevalent genera that were previously undocumented. Furthermore, sequences from multiple pathogenic genera were often found in the same tick, suggesting the possibility of co-infection by multiple pathogenic species.
The second specific aim was to use Borrelia-specific primers to screen 344 individual ticks from Missouri, Texas and the Northeast to determine the prevalence of Borrelia species in ticks. To screen for Borrelia species, two housekeeping genes, uvrA and recG, were selected as well as the 16S-23S rDNA intergenic spacer. Ticks from Missouri, Texas and New York were screened. None of the Missouri or Texas ticks tested positive for Borrelia spp. The rate of I. scapularis infection by B. burgdorferi is dependent on tick feeding activity as well as reservoir availability. B. burgdorferi is endemic in the Northeast, sometimes reported as present in over 50% of all I. scapularis ticks. 11.6% of all New York ticks were positive for a species of Borrelia; however, only 6.9% of all New York ticks were positive for B. burgdorferi. Despite being significantly lower than 50%, the results still fall in line with previous reports of the prevalence of B. burgdorferi. 1.5% of all Texas ticks were positive for a Borrelia species, specifically B. lonestari. While this study was unable to identify the causative agent for STARI, 454 sequencing was able to provide tremendous insight into the bacterial flora and possible pathogenic species of both the I. scapularis and the A. americanum tick.

Complete Genome Sequence of Treponema paraluiscuniculi, Strain Cuniculi A: The Loss of Infectivity to Humans Is Associated with Genome Decay.

Relevance:

100.00%

Abstract:

Treponema paraluiscuniculi is the causative agent of rabbit venereal spirochetosis. It is not infectious to humans, although its genome structure is very closely related to other pathogenic Treponema species including Treponema pallidum subspecies pallidum, the etiological agent of syphilis. In this study, the genome sequence of Treponema paraluiscuniculi, strain Cuniculi A, was determined by a combination of several high-throughput sequencing strategies. Whereas the overall size (1,133,390 bp), arrangement, and gene content of the Cuniculi A genome closely resembled those of the T. pallidum genome, the T. paraluiscuniculi genome contained a markedly higher number of pseudogenes and gene fragments (51). In addition to pseudogenes, 33 divergent genes were also found in the T. paraluiscuniculi genome. A set of 32 (out of 84) affected genes encoded proteins of known or predicted function in the Nichols genome. These proteins included virulence factors, gene regulators and components of DNA repair and recombination. The majority (52 or 61.9%) of the Cuniculi A pseudogenes and divergent genes were of unknown function. Our results indicate that T. paraluiscuniculi has evolved from a T. pallidum-like ancestor and adapted to a specialized host-associated niche (rabbits) during loss of infectivity to humans. The genes that are inactivated or altered in T. paraluiscuniculi are candidates for virulence factors important in the infectivity and pathogenesis of T. pallidum subspecies.

Genomic screening identifies novel linkages and provides further evidence for a role of MYH9 in nonsyndromic cleft lip and palate.

Relevance:

100.00%

Abstract:

Nonsyndromic cleft lip with or without cleft palate (NSCLP) is a common birth anomaly that requires prolonged multidisciplinary rehabilitation. Although variation in several genes has been identified as contributing to NSCLP, most of the genetic susceptibility loci have yet to be defined. To identify additional contributory genes, a high-throughput genomic scan was performed using the Illumina Linkage IVb Panel platform. We genotyped 6008 SNPs in nine non-Hispanic white NSCLP multiplex families and a single large African-American NSCLP multiplex family. Fourteen chromosomal regions were identified with LOD>1.5, including six regions not previously reported. Analysis of the data from the African-American and non-Hispanic white families revealed two likely chromosomal regions: 8q21.3-24.12 and 22q12.2-12.3 with LOD scores of 2.98 and 2.66, respectively. On the basis of biological function, syndecan 2 (SDC2) and growth differentiation factor 6 (GDF6) in 8q21.3-24.12 and myosin heavy-chain 9, non-muscle (MYH9) in 22q12.2-12.3 were selected as candidate genes. Association analyses from these genes yielded marginally significant P-values for SNPs in SDC2 and GDF6 (0.01

Theoretical and experimental studies of linkage disequilibrium in human populations

Relevance:

100.00%

Abstract:

Linkage disequilibrium (LD) is defined as the nonrandom association of alleles at two or more loci in a population and may be a useful tool in a diverse array of applications including disease gene mapping, elucidating the demographic history of populations, and testing hypotheses of human evolution. However, the successful application of LD-based approaches to pertinent genetic questions is hampered by a lack of understanding about the forces that mediate the genome-wide distribution of LD within and between human populations. Delineating the genomic patterns of LD is a complex task that will require interdisciplinary research that transcends traditional scientific boundaries. The research presented in this dissertation is predicated upon the need for interdisciplinary studies, and both theoretical and experimental projects were pursued. In the theoretical studies, I have investigated the effect of genotyping errors and SNP identification strategies on estimates of LD. The primary importance of these two chapters is that they provide important insights and guidance for the design of future empirical LD studies. Furthermore, I analyzed the allele frequency distribution of 26,530 single nucleotide polymorphisms (SNPs) in three populations and generated the first-generation natural selection map of the human genome, which will be an important resource for explaining and understanding genomic patterns of LD. Finally, in the experimental study, I describe a novel, simple, low-cost, high-throughput SNP genotyping method. The theoretical analyses and experimental tools developed in this dissertation will facilitate a more complete understanding of patterns of LD in human populations.
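The definition of LD given in this abstract can be made concrete for the simplest case of two biallelic loci. The sketch below computes the standard pairwise coefficients D, D' and r² from one haplotype frequency and two allele frequencies. The function name and the example frequencies are illustrative assumptions and are not taken from the dissertation.

```python
def ld_coefficients(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1
          and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    # D measures the departure of the haplotype frequency from the
    # value expected under random association of alleles.
    d = p_ab - p_a * p_b

    # D' rescales D by its largest attainable magnitude given the
    # allele frequencies; the sign of D is kept.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between alleles at the two loci.
    variance_product = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / variance_product if variance_product > 0 else 0.0
    return d, d_prime, r_squared


# Hypothetical example: allele A always travels with allele B.
d, d_prime, r_squared = ld_coefficients(p_ab=0.5, p_a=0.5, p_b=0.5)
```

When the haplotype frequency equals the product of the allele frequencies, D is zero and the loci are in linkage equilibrium.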

Nonlinear tests for genetic studies of complex disease

Relevance:

100.00%

Abstract:

Linkage and association studies are major analytical tools to search for susceptibility genes for complex diseases. With the availability of large collections of single nucleotide polymorphisms (SNPs) and rapid progress in high throughput genotyping technologies, together with the ambitious goals of the International HapMap Project, genetic markers covering the whole genome will be available for genome-wide linkage and association studies. In order not to inflate the type I error rate in genome-wide linkage and association studies, the significance level of each independent linkage and/or association test must be adjusted for multiple testing, and this has led to suggested genome-wide significance cut-offs as low as 5 × 10⁻⁷. Almost no linkage and/or association study can meet such a stringent threshold using standard statistical methods. New statistics with high power are urgently needed to tackle this problem. This dissertation proposes and explores a class of novel test statistics that can be used with both population-based and family-based genetic data by employing a completely new strategy, which uses nonlinear transformation of the sample means to construct test statistics for linkage and association studies. Extensive simulation studies are used to illustrate the properties of the nonlinear test statistics. Power calculations are performed using both analytical and empirical methods. Finally, real data sets are analyzed with the nonlinear test statistics. Results show that the nonlinear test statistics have correct type I error rates, and most of the studied nonlinear test statistics have higher power than the standard chi-square test. This dissertation introduces a new idea for designing novel test statistics with high power and might open new ways of mapping susceptibility genes for complex diseases.
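The multiple-testing adjustment mentioned in this abstract can be illustrated with the Bonferroni rule, which divides the family-wise error rate by the number of tests. The sketch below is a minimal illustration, not the dissertation's method; the figure of 100,000 independent tests is an assumption chosen so that a family-wise rate of 0.05 reproduces a cut-off of about 5 × 10⁻⁷.

```python
def bonferroni_threshold(family_alpha, n_tests):
    """Per-test significance level under the Bonferroni rule."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


def bonferroni_significant(p_values, family_alpha=0.05):
    """Flag each p-value that passes the Bonferroni-adjusted cut-off."""
    cutoff = bonferroni_threshold(family_alpha, len(p_values))
    return [p <= cutoff for p in p_values]


# Assumed number of independent tests in a genome-wide scan.
genome_wide_cutoff = bonferroni_threshold(0.05, 100_000)
```

The stringency of such a cut-off is what motivates the search for test statistics with higher power.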

Absolute quantitative real-time polymerase chain reaction for the measurement of human papillomavirus E7 mRNA in cervical cytobrush specimens.

Relevance:

100.00%

Abstract:

BACKGROUND: Few reports of the utilization of an accurate, cost-effective means for measuring HPV oncogene transcripts have been published. Several papers have reported the use of relative quantitation or more expensive Taqman methods. Here, we report a method of absolute quantitative real-time PCR utilizing SYBR-green fluorescence for the measurement of HPV E7 expression in cervical cytobrush specimens. RESULTS: The construction of a standard curve based on the serial dilution of an E7-containing plasmid was the key for being able to accurately compare measurements between cervical samples. The assay was highly reproducible with an overall coefficient of variation of 10.4%. CONCLUSION: The use of highly reproducible and accurate SYBR-based real-time polymerase chain reaction (PCR) assays instead of performing Taqman-type assays allows low-cost, high-throughput analysis of viral mRNA expression. The development of such assays will help in refining the current screening programs for HPV-related carcinomas.
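The standard-curve step described in the RESULTS can be sketched as follows. In absolute quantitation, the threshold cycle (Ct) of each plasmid dilution is regressed on the base-10 logarithm of its known copy number, and the fitted line is inverted to estimate copy numbers in unknown samples. The dilution series, Ct values and function names below are hypothetical and are not data from the study.

```python
import math


def fit_standard_curve(copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the starting copy number."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope):
    """Per-cycle efficiency implied by the slope (1.0 means perfect doubling)."""
    return 10 ** (-1.0 / slope) - 1.0


# Hypothetical ten-fold serial dilution of an E7-containing plasmid.
standard_copies = [1e2, 1e3, 1e4, 1e5, 1e6]
standard_ct = [33.3, 30.0, 26.7, 23.4, 20.1]

slope, intercept = fit_standard_curve(standard_copies, standard_ct)
estimated_copies = copies_from_ct(26.7, slope, intercept)
efficiency = amplification_efficiency(slope)
```

Because every sample is read against the same curve, measurements from different cervical specimens become directly comparable, which is the point the abstract makes.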

NETWORK TOPOLOGY IN HUMAN PROTEIN INTERACTION DATA PREDICTS FUNCTIONAL ASSOCIATION

Relevance:

100.00%

Abstract:

High-throughput assays, such as the yeast two-hybrid system, have generated a huge amount of protein-protein interaction (PPI) data in the past decade. This tremendously increases the need for reliable methods to systematically and automatically suggest protein functions and the relationships between them. With the available PPI data, it is now possible to study these functions and relationships in the context of a large-scale network. To date, several network-based schemes have been proposed to annotate protein functions on a large scale. However, because of the noise inherent in high-throughput data generation, new methods and algorithms are needed to increase the reliability of functional annotations. Previous work in a yeast PPI network (Samanta and Liang, 2003) has shown that the local connection topology, particularly for two proteins sharing an unusually large number of neighbors, can predict functional associations between proteins, and hence suggest their functions. One advantage of that work is that the algorithm is not sensitive to noise (false positives) in high-throughput PPI data. In this study, we improved their prediction scheme by developing a new algorithm and new methods, which we applied to a human PPI network to make a genome-wide functional inference. We used the new algorithm to measure and reduce the influence of hub proteins on detecting functionally associated proteins. We used the annotations of the Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) as independent and unbiased benchmarks to evaluate our algorithms and methods within the human PPI network. We showed that, compared with the previous work from Samanta and Liang, the algorithm and methods developed in this study improved the overall quality of functional inferences for human proteins. By applying the algorithms to the human PPI network, we obtained 4,233 significant functional associations among 1,754 proteins.
Further comparisons of their KEGG and GO annotations allowed us to assign 466 KEGG pathway annotations to 274 proteins and 123 GO annotations to 114 proteins with estimated false discovery rates of <21% for KEGG and <30% for GO. We clustered 1,729 proteins by their functional associations and performed pathway analysis to identify several subclusters that are highly enriched in certain signaling pathways. In particular, we performed a detailed analysis of a subcluster enriched in the transforming growth factor β signaling pathway (P < 10⁻⁵⁰), which is important in cell proliferation and tumorigenesis. Analysis of another four subclusters also suggested potential new players in six signaling pathways worthy of further experimental investigation. Our study gives clear insight into the common neighbor-based prediction scheme and provides a reliable method for large-scale functional annotation in this post-genomic era.
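The notion of two proteins sharing "an unusually large number of neighbors" can be quantified with a hypergeometric tail probability. The sketch below is one common way to score shared neighbors and is offered as an assumption; it is not necessarily the exact statistic used by Samanta and Liang or the hub-corrected algorithm developed in this study. The function name and the example network sizes are hypothetical.

```python
from math import comb


def shared_neighbor_pvalue(n_total, n1, n2, m):
    """Probability that two proteins share at least m interaction partners
    by chance.

    n_total: number of proteins that could be partners
    n1, n2:  number of partners of the first and second protein
    m:       observed number of shared partners

    Assumes the partners of the second protein are drawn at random
    without replacement (hypergeometric model).
    """
    upper = min(n1, n2)
    denominator = comb(n_total, n2)
    tail = sum(
        comb(n1, k) * comb(n_total - n1, n2 - k)
        for k in range(m, upper + 1)
    )
    return tail / denominator


# Hypothetical example: a 5,000-protein network in which two proteins
# with 20 and 25 partners share 8 of them.
p_shared = shared_neighbor_pvalue(5000, 20, 25, 8)
```

A small p-value indicates that the overlap is unlikely under random wiring. Highly connected hub proteins share neighbors with many others simply because of their degree, which is why the abstract describes reducing their influence.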

Angiotensin receptor agonistic autoantibody is highly prevalent in preeclampsia: correlation with disease severity.

Relevance:

100.00%

Abstract:

Preeclampsia (PE), a syndrome affecting 5% of pregnancies, characterized by hypertension and proteinuria, is a leading cause of maternal and fetal morbidity and mortality. The condition is often accompanied by the presence of a circulating maternal autoantibody, the angiotensin II type I receptor agonistic autoantibody (AT(1)-AA). However, the prevalence of AT(1)-AA in PE remains unknown, and the correlation of AT(1)-AA titers with the severity of the disease remains undetermined. We used a sensitive and high-throughput luciferase bioassay to detect AT(1)-AA levels in the serum of 30 normal, 37 preeclamptic (10 mild and 27 severe), and 23 gestational hypertensive individuals. Here we report that AT(1)-AA is highly prevalent in PE (approximately 95%). Next, by comparing the levels of AT(1)-AA among women with mild and severe PE, we found that the titer of AT(1)-AA is proportional to the severity of the disease. Intriguingly, among severe preeclamptic patients, we discovered that the titer of AT(1)-AA is significantly correlated with the clinical features of PE: systolic blood pressure (r=0.56), proteinuria (r=0.70), and soluble fms-like tyrosine kinase-1 level (r=0.71). Notably, only AT(1)-AA levels, and not soluble fms-like tyrosine kinase-1 levels, are elevated in gestational hypertensive patients. These data serve as compelling clinical evidence that AT(1)-AA is highly prevalent in PE, and that its titer is strongly correlated with the severity of the disease.

Intact flagellar motor of Borrelia burgdorferi revealed by cryo-electron tomography: evidence for stator ring curvature and rotor/C-ring assembly flexion.

Relevance:

100.00%

Abstract:

The bacterial flagellar motor is a remarkable nanomachine that provides motility through flagellar rotation. Prior structural studies have revealed the stunning complexity of the purified rotor and C-ring assemblies from flagellar motors. In this study, we used high-throughput cryo-electron tomography and image analysis of intact Borrelia burgdorferi to produce a three-dimensional (3-D) model of the in situ flagellar motor without imposing rotational symmetry. Structural details of B. burgdorferi, including a layer of outer surface proteins, were clearly visible in the resulting 3-D reconstructions. By averaging the 3-D images of approximately 1,280 flagellar motors, an approximately 3.5-nm-resolution model of the stator and rotor structures was obtained. flgI transposon mutants lacked a torus-shaped structure attached to the flagellar rod, establishing the structural location of the spirochetal P ring. Treatment of intact organisms with the nonionic detergent NP-40 resulted in dissolution of the outermost portion of the motor structure and the C ring, providing insight into the in situ arrangement of the stator and rotor structures. Structural elements associated with the stator followed the curvature of the cytoplasmic membrane. The rotor and the C ring also exhibited angular flexion, resulting in a slight narrowing of both structures in the direction perpendicular to the cell axis. These results indicate an inherent flexibility in the rotor-stator interaction. The FliG switching and energizing component likely provides much of the flexibility needed to maintain the interaction between the curved stator and the relatively symmetrical rotor/C-ring assembly during flagellar rotation.

Expression and regulation of a hematopoietic proteoglycan core protein gene during hematopoiesis

Relevance:

100.00%

Abstract:

Factors involved in regulating tissue specific gene expression play a major role in cell differentiation. In order to further understand the differentiation events occurring during hematopoiesis, a myeloid specific gene was characterized, the expression pattern during hematopoiesis was analyzed, and the mechanisms governing its regulation were assessed. Previously, our laboratory isolated an anonymous cDNA clone, pD-D1, which displayed preferential expression in myeloid cells. From nucleotide sequencing of overlapping cDNA clones I determined that the D-D1 message encodes a hematopoietic proteoglycan core protein (HpPG). The expression pattern of the gene was assessed by in situ hybridization of bone marrow and peripheral blood samples. The gene was shown to be expressed, at variable levels, in all leukocytes analyzed, including cells from every stage of neutrophil development. In an attempt to ascertain the differentiation time point in which the HpPG gene is initially expressed, more immature populations of leukemic myeloblasts were assessed by northern blot analysis. Though the initial point of expression was not obtained, an up-regulatory event was discovered corresponding to a time point in which granule genesis occurs. This finding is consistent with prior observations of extensive packaging of proteoglycans into the secretory granules of granule producing hematopoietic cells. The HpPG gene was also found to be expressed at low levels in all stages of lymphocyte development analyzed, suggesting that the HpPG gene is initially expressed before the decision for myeloid-lymphoid differentiation. To assess the mechanism for the up-regulatory event, a K562 in vitro megakaryocytic differentiation system was used. Nuclear run-off analyses in this system demonstrated the up-regulation to be under transcriptional control. 
In addition, the HpPG gene was found to be down-regulated during macrophage differentiation of HL60 cells, and this down-regulation was also shown to be transcriptionally controlled. These results indicate that there are multiple points of transcriptional regulation of the HpPG gene during differentiation. Furthermore, the factors regulating the gene at these time points are likely to play an important role in the differentiation of granule producing cells and macrophages.

Delineating the molecular basis of human genetic diseases: Epigenetic and functional studies

Relevance:

100.00%

Abstract:

Identifying and characterizing the genes responsible for inherited human diseases will ultimately lead to a more holistic understanding of disease pathogenesis, catalyze new diagnostic and treatment modalities, and provide insights into basic biological processes. This dissertation presents research aimed at delineating the genetic and molecular basis of human diseases through epigenetic and functional studies and can be divided into two independent areas of research. The first area of research describes the development of two high-throughput melting curve based methods to assay DNA methylation, referred to as McMSP and McCOBRA. The goal of this project was to develop DNA methylation methods that can be used to rapidly determine the DNA methylation status at a specific locus in a large number of samples. McMSP and McCOBRA provide several advantages over existing methods, as they are simple, accurate, robust, and high-throughput, making them applicable to large-scale DNA methylation studies. McMSP and McCOBRA were then used in an epigenetic study of the complex disease ankylosing spondylitis (AS). Specifically, I tested the hypothesis that aberrant patterns of DNA methylation in five AS candidate genes contribute to disease susceptibility. While no statistically significant methylation differences were observed between cases and controls, this is the first study to investigate the hypothesis that epigenetic variation contributes to AS susceptibility and therefore provides the conceptual framework for future studies.
In the second area of research, I performed experiments to better delimit the function of aryl hydrocarbon receptor-interacting protein-like 1 (AIPL1), which when mutated causes various forms of inherited blindness such as Leber congenital amaurosis. A yeast two-hybrid screen was performed to identify putative AIPL1-interacting proteins. After screening 2 × 10⁶ bovine retinal cDNA library clones, 6 unique putative AIPL1-interacting proteins were identified. While these 6 AIPL1 protein-protein interactions must be confirmed, their identification is an important step in understanding the functional role of AIPL1 within the retina and will provide insight into the molecular mechanisms underlying inherited blindness.

The information bottleneck method for genome-wide association studies

Relevance:

100.00%

Abstract:

In population studies, most current methods focus on identifying one outcome-related SNP at a time by testing for differences in genotype frequencies between disease and healthy groups or among different population groups. However, testing a great number of SNPs simultaneously raises a multiple-testing problem and will give false-positive results. Although this problem can be effectively dealt with through several approaches such as Bonferroni correction, permutation testing and false discovery rates, patterns of joint effects of several genes, each with a weak effect, might not be detectable. With the availability of high-throughput genotyping technology, searching for multiple scattered SNPs over the whole genome and modeling their joint effect on the target variable has become possible. Exhaustive search of all SNP subsets is computationally infeasible for millions of SNPs in a genome-wide study. Several effective feature selection methods combined with classification functions have been proposed to search for an optimal SNP subset in large data sets where the number of feature SNPs far exceeds the number of observations.
In this study, we take two steps to achieve the goal. First, we selected 1000 SNPs through an effective filter method, and then we performed feature selection wrapped around a classifier to identify an optimal SNP subset for predicting disease. We also developed a novel classification method, the sequential information bottleneck method, wrapped inside different search algorithms to identify an optimal subset of SNPs for classifying the outcome variable. This new method was compared with classical linear discriminant analysis in terms of classification performance. Finally, we performed chi-square tests to look at the relationship between each SNP and disease from another point of view.
In general, our results show that filtering features using the harmonic mean of sensitivity and specificity (HMSS) through linear discriminant analysis (LDA) is better than using LDA training accuracy or mutual information in our study. Our results also demonstrate that exhaustive search of small subsets (one, two or three SNPs) based on the best 100 composite 2-SNPs can find an optimal subset, and that further inclusion of more SNPs through a heuristic algorithm does not always increase the performance of SNP subsets. Although sequential forward floating selection can be applied to avoid the nesting effect of forward selection, it does not always outperform the latter because of overfitting from observing more complex subset states.
Our results also indicate that HMSS, as a criterion for evaluating the classification ability of a function, can be used with imbalanced data without modifying the original dataset, unlike classification accuracy. Our four studies suggest that the sequential information bottleneck (sIB), a new unsupervised technique, can be adopted to predict the outcome, and that its ability to detect the target status is superior to that of traditional LDA in this study.
From our results we can see that the best test probability-HMSS for predicting CVD, stroke, CAD and psoriasis through sIB is 0.59406, 0.641815, 0.645315 and 0.678658, respectively. In terms of group prediction accuracy, the highest test accuracy of sIB for diagnosing a normal status among controls can reach 0.708999, 0.863216, 0.639918 and 0.850275, respectively, in the four studies if the test accuracy among cases is required to be not less than 0.4. On the other hand, the highest test accuracy of sIB for diagnosing a disease among cases can reach 0.748644, 0.789916, 0.705701 and 0.749436, respectively, in the four studies if the test accuracy among controls is required to be at least 0.4.
A further genome-wide association study using the chi-square test shows that no significant SNPs are detected at the cut-off level 9.09451E-08 in the Framingham heart study of CVD. In the WTCCC study, only two significant SNPs associated with CAD were detected. In the genome-wide study of psoriasis, most of the top 20 SNP markers with impressive classification accuracy are also significantly associated with the disease by the chi-square test at the cut-off value 1.11E-07.
Although our classification methods can achieve high accuracy in the study, complete descriptions of those classification results (95% confidence intervals or statistical tests of differences) require more cost-effective methods or a more efficient computing system, neither of which is currently available for our genome-wide study. We should also note that the purpose of this study is to identify subsets of SNPs with high prediction ability, and that SNPs with good discriminant power are not necessarily causal markers for the disease.
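The HMSS criterion used throughout this abstract is the harmonic mean of sensitivity and specificity. The sketch below shows how it can be computed from confusion-matrix counts and why it behaves differently from plain accuracy on imbalanced data. The function names and the counts are hypothetical and are not results from the study.

```python
def hmss(tp, fn, tn, fp):
    """Harmonic mean of sensitivity and specificity.

    tp, fn: cases classified correctly / incorrectly
    tn, fp: controls classified correctly / incorrectly
    Requires at least one case and one control.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)


def accuracy(tp, fn, tn, fp):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical imbalanced data set: 10 cases and 90 controls.
# A classifier that labels every sample as a control:
trivial_accuracy = accuracy(tp=0, fn=10, tn=90, fp=0)
trivial_hmss = hmss(tp=0, fn=10, tn=90, fp=0)

# A classifier that is right 80% of the time in both groups:
balanced_accuracy = accuracy(tp=8, fn=2, tn=72, fp=18)
balanced_hmss = hmss(tp=8, fn=2, tn=72, fp=18)
```

In this example the trivial classifier reaches an accuracy of 0.9 but an HMSS of 0, while the second classifier scores 0.8 on both measures, which illustrates why the abstract prefers HMSS when cases and controls are unequal in number.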
