931 results for Bioinformatics
Abstract:
Bio-systems are inherently complex information processing systems. Furthermore, the physiological complexity of biological systems limits both the formation of hypotheses about their behavior and the ability to test those hypotheses. More importantly, the identification and classification of mutations in patients are central topics in today's cancer research. Next generation sequencing (NGS) technologies can provide genome-wide coverage at single-nucleotide resolution and at reasonable speed and cost. The unprecedented molecular characterization provided by NGS offers the potential for an individualized approach to treatment. These advances in cancer genomics have enabled scientists to interrogate cancer-specific genomic variants and compare them with the normal variants in the same patient. Analysis of these data provides a catalog of somatic variants, present in the tumor genome but not in the normal tissue DNA. In this dissertation, we present a new computational framework for the problem of predicting the number of mutations on a chromosome for a given patient, which is a fundamental problem in clinical and research fields. We begin this dissertation with the development of a framework that is capable of utilizing published data from a longitudinal study of patients with acute myeloid leukemia (AML), whose DNA from both normal and malignant tissues was subjected to NGS analysis at various points in time. By processing the sequencing data from the time of cancer diagnosis using the components of our framework, we tested it by predicting the genomic regions to be mutated at the time of relapse and, later, by comparing our results with the actual regions that showed mutations (discovered at relapse). We demonstrate that this coupling of the algorithm pipeline can drastically improve the predictive ability of the search for a reliable molecular signature.
Arguably, the most important result of our research is its superior performance compared with other methods such as Radial Basis Function Network, Sequential Minimal Optimization, and Gaussian Process. In the final part of this dissertation, we present a detailed significance, stability, and statistical analysis of our model. A performance comparison of the results is presented. This work lays a good foundation for future research on other types of cancer.
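As a rough illustration of what "present in the tumor genome but not in the normal tissue DNA" means computationally, the sketch below treats variant calls as (chromosome, position, reference, alternate) tuples and takes a set difference. The function name, the tuple layout, and the example calls are hypothetical and are not taken from the dissertation; real somatic callers also weigh read depth, allele frequency, and quality.

```python
def somatic_variants(tumor_calls, normal_calls):
    """Return variants seen in the tumor sample but absent from the matched normal.

    Each call is a (chromosome, position, ref, alt) tuple. This is a naive
    set difference, an assumption made for illustration only.
    """
    return sorted(set(tumor_calls) - set(normal_calls))


# Hypothetical calls for one patient.
tumor = [("chr1", 1000, "A", "T"), ("chr2", 500, "G", "C"), ("chr13", 4200, "C", "T")]
normal = [("chr1", 1000, "A", "T")]

somatic = somatic_variants(tumor, normal)
```

The germline variant shared by both samples drops out, leaving only the tumor-specific calls.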
Abstract:
Metagenomics is the culture-independent study of genetic material obtained directly from environmental samples. It has become a realistic approach to understanding microbial communities thanks to advances in high-throughput DNA sequencing technologies over the past decade. Current research has shown that different sites of the human body house varied bacterial communities. There is a strong correlation between an individual’s microbial community profile at a given site and disease. Metagenomics is being applied more often as a means of comparing microbial profiles in biomedical studies. The analysis of the data collected using metagenomics can be quite challenging, and there exists a plethora of tools for interpreting the results. An automatic analytical workflow for metagenomic analyses has been implemented and tested using synthetic datasets of varying quality. It is able to accurately classify bacteria by taxa and correctly estimate the richness and diversity of each set. The workflow was then applied to the study of the airway microbiome in Chronic Obstructive Pulmonary Disease (COPD). COPD is a progressive lung disease resulting in narrowing of the airways and restricted airflow. Despite COPD being the third leading cause of death in the United States, little is known about the differences in the lung microbial community profiles of healthy individuals and COPD patients. Bronchoalveolar lavage (BAL) samples were collected from COPD patients, active or ex-smokers, and never smokers and sequenced by 454 pyrosequencing. A total of 56 individuals were recruited for the study. Substantial colonization of the lungs was found in all subjects, and differentially abundant genera in each group were identified. These discoveries are promising and may further our understanding of how the structure of the lung microbiome is modified as COPD progresses. It is also anticipated that the results will eventually lead to improved treatments for COPD.
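The richness and diversity estimates mentioned above can be illustrated with a small sketch. It assumes richness means the number of distinct taxa observed and diversity means the Shannon index; the abstract does not say which estimators the workflow uses, and the genus labels below are hypothetical.

```python
import math
from collections import Counter


def richness_and_shannon(taxon_labels):
    """Return (richness, Shannon index) for a list of per-read taxon labels.

    Richness is the count of distinct taxa; the Shannon index is
    -sum(p * ln(p)) over the relative abundance p of each taxon.
    """
    counts = Counter(taxon_labels)
    total = sum(counts.values())
    richness = len(counts)
    shannon = -sum((n / total) * math.log(n / total) for n in counts.values())
    return richness, shannon


# Hypothetical genus assignments for eight reads from one sample.
reads = ["Prevotella", "Prevotella", "Streptococcus", "Veillonella",
         "Prevotella", "Streptococcus", "Haemophilus", "Prevotella"]

richness, shannon = richness_and_shannon(reads)
```

A sample dominated by one genus yields a Shannon index near zero, while an even community of the same richness yields a higher value.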
Abstract:
Microarray platforms have been around for many years, and although new technologies are on the rise in laboratories, microarrays remain prevalent. Many methods have been proposed and refined for analyzing microarray data to identify differentially expressed (DE) genes. However, the most popular methods, such as Significance Analysis of Microarrays (SAM), samroc, fold change, and rank product, are far from perfect. Which method is most powerful depends on the characteristics of the sample and the distribution of the gene expression values. The most commonly used method is SAM or samroc, but when the data tend to be skewed, the power of these methods decreases. Based on the idea that the median is a better measure of central tendency than the mean when the data are skewed, the test statistics of the SAM and fold change methods are modified in this thesis. This study shows that the median-modified fold change method improves power in many cases when identifying DE genes if the data follow a lognormal distribution.
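The idea of replacing the mean with the median in a fold change statistic can be sketched as follows. This is a generic illustration, assuming a log2 ratio of group centers; the exact statistics used in the thesis are not given in the abstract, and the expression values are hypothetical.

```python
import math
from statistics import mean, median


def fold_change(treated, control, center=mean):
    """Log2 ratio of the two group centers.

    Pass center=median for the median-based variant, which is less
    sensitive to skewed data and outliers than the mean-based one.
    """
    return math.log2(center(treated) / center(control))


# Hypothetical expression values for one gene; the last treated value is an outlier.
control = [100.0, 110.0, 95.0, 105.0, 90.0]
treated = [102.0, 98.0, 108.0, 96.0, 900.0]

mean_fc = fold_change(treated, control, center=mean)
median_fc = fold_change(treated, control, center=median)
```

With these values the single outlier pushes the mean-based fold change above 1 (more than a doubling), while the median-based value stays close to 0.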
Abstract:
I proposed the study of two distinct aspects of the Ten-Eleven Translocation 2 (TET2) protein to understand its specific functions in different body systems. In Part I, I characterized the molecular mechanisms of Tet2 in the hematological system. As the second member of the Ten-Eleven Translocation protein family, TET2 is frequently mutated in leukemic patients. Previous studies have shown that TET2 mutations occur in 20% of myelodysplastic syndrome/myeloproliferative neoplasm (MDS/MPN), 10% of T-cell lymphoma leukemia, and 2% of B-cell lymphoma leukemia cases. Genetic mouse models also display distinct phenotypes of various types of hematological malignancies. I performed 5-hydroxymethylcytosine (5hmC) chromatin immunoprecipitation sequencing (ChIP-Seq) and RNA sequencing (RNA-Seq) of hematopoietic stem/progenitor cells to determine whether the deletion of Tet2 can affect the abundance of 5hmC at myeloid, T-cell and B-cell specific gene transcription start sites, ultimately resulting in various hematological malignancies. Subsequent exome sequencing (Exome-Seq) showed that disease-specific genes are mutated in different types of tumors, which suggests that TET2 may protect the genome from being mutated. The direct interaction between TET2 and the Mutator S Homolog 6 (MSH6) protein suggests that TET2 is involved in DNA mismatch repair. Finally, in vivo mismatch repair studies show that the loss of Tet2 causes a mutator phenotype. Taken together, my data indicate that TET2 binds to MSH6 to protect genome integrity. In Part II, I intended to better understand the role of Tet2 in the nervous system. 5-hydroxymethylcytosine regulates epigenetic modification during neurodevelopment and aging. Thus, Tet2 may play a critical role in regulating adult neurogenesis. To examine the physiological significance of Tet2 in the nervous system, I first showed that the deletion of Tet2 reduces the 5hmC levels in neural stem cells.
Mice lacking Tet2 show abnormal hippocampal neurogenesis along with 5hmC alterations at different gene promoters and corresponding downregulation of gene expression. In luciferase reporter assays, two neural factors, Neurogenic differentiation 1 (NeuroD1) and Glial fibrillary acidic protein (Gfap), were down-regulated in Tet2 knockout cells. My results suggest that Tet2 regulates neural stem/progenitor cell proliferation and differentiation in the adult brain.
Abstract:
Males and children aged 1 to 5 years show a much higher risk for childhood acute lymphoblastic leukemia (ALL). We performed a case-only genome-wide association study (GWAS), using the Illumina Infinium HumanCoreExome Chip, to unmask gender- and age-specific risk variants in 240 non-Hispanic white children with ALL recruited at Texas Children’s Cancer Center, Houston, Texas. Besides the statistically most significant results, we also considered results that yielded the highest effect sizes. Existing experimental data and bioinformatic predictions were used to complement the results and to examine the biological significance of the statistical results. Our study identified novel risk variants for childhood ALL. The SNP rs4813720 (RASSF2) showed the statistically most significant gender-specific associations (P < 2 x 10^-6). Likewise, rs10505918 (SOX5) yielded the lowest P value (P < 1 x 10^-5) for age-specific associations, and also showed the statistically most significant association with age at onset (P < 1 x 10^-4). Two SNPs, rs12722042 and rs12722039, from the HLA-DQA1 region yielded the highest effect sizes (odds ratio (OR) = 15.7; P = 0.002) for gender-specific results, and the SNP rs17109582 (OR = 12.5; P = 0.006) showed the highest effect size for age-specific results. Sex chromosome variants did not appear to be involved in gender-specific associations. The HLA-DQA1 SNPs belong to DQA1*01:07 and confirmed the previously reported male-specific association with DQA1*01:07. Twenty-one of the SNPs identified as risk markers for gender- or age-specific associations were located in transcription factor binding sites, and 56 SNPs were non-synonymous variants, likely to alter protein function. Although bioinformatic analysis did not implicate a particular mechanism for gender- and age-specific associations, RASSF2 has an estrogen receptor-alpha binding site in its promoter. The mechanisms may remain unknown because of a lack of interest in gender- and age-specificity in association studies.
These results provide a foundation for further studies to examine the gender and age differentials in childhood ALL risk. Following replication and mechanistic studies, risk factors for one gender or age group have the potential to be used as biomarkers for targeted intervention for prevention and possibly also for treatment.
Abstract:
The overall objective of the research presented in this dissertation was to assess exposure to endocrine disrupting chemicals (EDCs) such as polychlorinated biphenyls (PCBs), phthalates, and bisphenol A (BPA) in the general population and to evaluate their associations with adverse reproductive health effects, including cancers, in women. Given the proven contribution of unopposed estrogens to the risk of endometrial neoplasia or breast cancer, renewed health concerns have arisen about estrogen-mimicking EDCs found in food, personal care products, or as environmental contaminants. Our meta-analysis showed that exposure to estrogen-mimicking PCBs increased the summary risk of breast cancer and endometriosis. We further evaluated the relationship between endometriosis, breast cancer, and EDCs using a bioinformatics method. Our bioinformatics approach was able to identify genes with the potential to be involved in interactions with PCBs, phthalates, and BPA that may be important to the development of breast cancer and endometriosis. Therefore, we hypothesized that exposure to EDCs such as PCBs, phthalates, and BPA results in adverse reproductive health effects in women. Using subject data and biomarkers available from the Centers for Disease Control and Prevention's National Health and Nutrition Examination Survey database, we conducted a cross-sectional study of EDCs in relation to self-reported history of endometriosis, uterine leiomyomas, breast cancer, cervical cancer, ovarian cancer, and uterine cancer. Significantly higher body burdens of PCBs were found in women diagnosed with breast cancer, ovarian cancer, and uterine cancer compared to women without cancer. PCB 138 was significantly associated with breast cancer, cervical cancer, and uterine cancer, while PCBs 74 and 118 were significantly associated with ovarian cancer.
The sum of dioxin-like PCBs was significantly associated with ovarian cancer (OR of 2.02, 95% CI: 1.06-3.85) and the sum of non-dioxin-like PCBs was significantly associated with uterine cancer (OR of 1.12, 95% CI: 1.03-1.23). Significantly higher body burdens of PCBs were also found in women diagnosed with endometriosis and uterine leiomyomas. Documenting the exposure to EDCs among the general U.S. population and identifying agents associated with reproductive toxicity have the potential to fill research gaps and facilitate our understanding of the complex role environmental chemicals play in producing toxicity in reproductive organs.
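To help read figures such as "OR of 2.02, 95% CI: 1.06-3.85", the sketch below computes an odds ratio and a Wald 95% confidence interval from a 2x2 table. The abstract does not state how its estimates were obtained (they may come from adjusted regression models), so the method, the function name, and the counts here are assumptions for illustration only.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: exposed cases,   b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    The interval is built on the log scale and exponentiated back.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not data from the study.
estimate, lower, upper = odds_ratio_wald_ci(a=30, b=70, c=15, d=85)
```

An interval that excludes 1, as in both results quoted above, is what is conventionally reported as a significant association at the 5% level.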
Abstract:
Many species have specialized to live in the most varied environments, showing the remarkable adaptability of the microbial world to the most diverse physicochemical conditions. Environments exposed to natural radiation and metals are scarce around the world and harbor a still unknown microbiota. With a total number estimated at between 4 and 6 x 10^30 microorganisms on Earth, they constitute an enormous biological and genetic pool to be explored. The metagenomic approach, which is independent of cultivation, provides a new way to access the genomic potential of environmental samples, becoming a powerful tool for elucidating ecological functions and metabolic profiles, as well as for identifying new biomolecules. In this context, the genetic material of environmental soil and water samples from Açude Boqueirao Parelhas-RN, under the influence of natural radiation and the presence of metals, was extracted and pyrosequenced, and the generated sequences were analyzed with bioinformatics programs (MG-RAST and STAMP). Comparative taxonomic profiles of both samples showed a high abundance of the domain Bacteria, followed by a small portion attributable to the domains Eukaryota and Archaea and to viruses. The phyla Proteobacteria, Actinobacteria, and Bacteroidetes showed the greatest dominance in both samples. Important genera and species associated with resistance to various stressors found in the region were observed. Sequences related to oxidative and heat stress, DNA replication and repair, and resistance to toxic compounds were observed, suggesting a significant relationship between the microbiota and their metabolic profile, influenced by regional environmental variables. The results of this study add valuable and previously unpublished data on the composition of microbial communities in these regions.
Abstract:
Sugarcane is a monocot plant grown in tropical and subtropical regions, with Brazil being the largest producer. Despite its economic importance, little is known about the molecular basis of flowering in sugarcane. This physiological process can cause losses of up to 60% in sugar or bioethanol production. Thus, this work aimed to characterize a HINT1 homologous gene previously identified in subtractive libraries of flowering. Genomic analysis of the gene and promoter region structure showed that there are at least two distinct genes homologous to HINT in sugarcane. Bioinformatics analyses showed conservation of the characteristic protein domain of the HIT superfamily and indicated a phylogenetic relationship associated with cellular localization. Moreover, a possible relationship with the SUBTILISIN-like protein family was observed through information available in interactomes. This suggests that the sugarcane HINT gene may be related to plant development. There are several possible interactions in the regulation of the floral induction process, because the sequences present in regulatory regions indicate that differential expression of HINT was related to climatic factors in the Northeast region of Brazil as well as to biotic stress and phytohormones. Furthermore, the sugarcane phenotypes indicate that the influence of HINT may be due to accumulation of the product of its enzymatic activity. Because of these characteristics, this gene can be used as a marker in the selection of new varieties.
Abstract:
Background: Light microscopic analysis of diatom frustules is widely used in both basic and applied research, notably taxonomy, morphometrics, water quality monitoring and paleo-environmental studies. In these applications, large numbers of frustules usually need to be identified and/or measured. Although there is a need for automation in these applications, and image processing and analysis methods supporting these tasks have previously been developed, they have not become widespread in diatom analysis. While methodological reports for a wide variety of methods for image segmentation, diatom identification and feature extraction are available, no single implementation exists that combines a subset of these into a readily applicable workflow accessible to diatomists. Results: The newly developed tool SHERPA offers a versatile image processing workflow focused on the identification and measurement of object outlines, handling all steps from image segmentation through object identification to feature extraction, and providing interactive functions for reviewing and revising results. Special attention was given to ease of use, applicability to a broad range of data and problems, and support for high-throughput analyses with minimal manual intervention. Conclusions: Tested with several diatom datasets from different sources and of various compositions, SHERPA proved its ability to successfully analyze large amounts of diatom micrographs depicting a broad range of species.
SHERPA is unique in combining the following features: application of multiple segmentation methods and selection of the one giving the best result for each individual object; identification of shapes of interest based on outline matching against a template library; quality scoring and ranking of resulting outlines supporting quick quality checking; extraction of a wide range of outline shape descriptors widely used in diatom studies and elsewhere; minimizing the need for, but enabling manual quality control and corrections. Although primarily developed for analyzing images of diatom valves originating from automated microscopy, SHERPA can also be useful for other object detection, segmentation and outline-based identification problems.
Representing clinical documents to support automatic retrieval of evidence from the Cochrane Library
Abstract:
The overall aim of our research is to develop a clinical information retrieval system that retrieves systematic reviews and underlying clinical studies from the Cochrane Library to support physician decision making. We believe that in order to accomplish this goal we need to develop a mechanism for effectively representing documents that will be retrieved by the application. Therefore, as a first step in developing the retrieval application we have developed a methodology that semi-automatically generates high quality indices and applies them as descriptors to documents from The Cochrane Library. In this paper we present a description and implementation of the automatic indexing methodology and an evaluation that demonstrates that enhanced document representation results in the retrieval of relevant documents for clinical queries. We argue that the evaluation of information retrieval applications should also include an evaluation of the quality of the representation of documents that may be retrieved. ©2010 IEEE.
Abstract:
This thesis focuses on the development of algorithms that will allow protein design calculations to incorporate more realistic modeling assumptions. Protein design algorithms search large sequence spaces for protein sequences that are biologically and medically useful. Better modeling could improve the chance of success in designs and expand the range of problems to which these algorithms are applied. I have developed algorithms to improve modeling of backbone flexibility (DEEPer) and of more extensive continuous flexibility in general (EPIC and LUTE). I’ve also developed algorithms to perform multistate designs, which account for effects like specificity, with provable guarantees of accuracy (COMETS), and to accommodate a wider range of energy functions in design (EPIC and LUTE).
Abstract:
The ABL family of non-receptor tyrosine kinases, ABL1 (also known as c-ABL) and ABL2 (also known as Arg), links diverse extracellular stimuli to signaling pathways that control cell growth, survival, adhesion, migration and invasion. ABL tyrosine kinases play an oncogenic role in human leukemias. However, the role of ABL kinases in solid tumors including breast cancer progression and metastasis is just emerging.
To evaluate whether ABL family kinases are involved in breast cancer development and metastasis, we first analyzed genomic data from a large-scale screen of breast cancer patients. We found that ABL kinases are up-regulated in invasive breast cancer patients and that high expression of ABL kinases correlates with poor prognosis and early metastasis. Using xenograft mouse models combined with genetic and pharmacological approaches, we demonstrated that ABL kinases are required for regulating breast cancer progression and metastasis to the bone. Using next generation sequencing and bioinformatics analysis, we uncovered a critical role for ABL kinases in promoting multiple oncogenic pathways, including the TAZ and STAT5 signaling networks and the epithelial to mesenchymal transition (EMT). These findings revealed a role for ABL kinases in regulating breast cancer tumorigenesis and bone metastasis and provide a rationale for targeting breast tumors with ABL-specific inhibitors.
Abstract:
Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for the analysis of biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantification technology, could be continuous numbers or counts. With the advancement of high-throughput technology, such data have become abundant to an unprecedented degree. Therefore, efficient statistical approaches are crucial in this big data era.
Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent multivariate vector with a Gaussian distribution. With this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are assumed to be represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions to the previous statistical methods that are developed to address challenges in big data. Those novel methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimation of shared topics among newsgroups, analysis of promoter sequences, analysis of political-economic risk data and estimation of population structure from genotype data.
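The view of DNA sequence data as categorical variables with four categories can be made concrete with a one-hot encoding sketch. This is a generic illustration and not a method from the dissertation; the function name, the A/C/G/T column order, and the all-zero handling of ambiguous bases are assumptions.

```python
NUCLEOTIDES = "ACGT"  # assumed column order for the encoding


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element indicator rows.

    Each position becomes a row with a single 1 in the column of its
    nucleotide. Bases outside A/C/G/T (for example N) are left as an
    all-zero row, which is an assumption made for this sketch.
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


encoded = one_hot_encode("ACGTN")
```

Count-valued or continuous expression data would need a different representation, which is the heterogeneity the abstract refers to.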
Abstract:
Transcription factors (TFs) control the temporal and spatial expression of target genes by interacting with DNA in a sequence-specific manner. Recent advances in high throughput experiments that measure TF-DNA interactions in vitro and in vivo have facilitated the identification of DNA binding sites for thousands of TFs. However, it remains unclear how each individual TF achieves its specificity, especially in the case of paralogous TFs that recognize distinct target genomic sites despite sharing very similar DNA binding motifs. In my work, I used a combination of high throughput in vitro protein-DNA binding assays and machine-learning algorithms to characterize and model the binding specificity of 11 paralogous TFs from 4 distinct structural families. My work proves that even very closely related paralogous TFs, with indistinguishable DNA binding motifs, oftentimes exhibit differential binding specificity for their genomic target sites, especially for sites with moderate binding affinity. Importantly, the differences I identify in vitro and through computational modeling help explain, at least in part, the differential in vivo genomic targeting by paralogous TFs. Future work will focus on in vivo factors that might also be important for specificity differences between paralogous TFs, such as DNA methylation, interactions with protein cofactors, or the chromatin environment. In this larger context, my work emphasizes the importance of intrinsic DNA binding specificity in targeting of paralogous TFs to the genome.
Abstract:
Autism spectrum disorders (ASD) are complex heterogeneous neurodevelopmental disorders of unclear etiology, and no cure currently exists. Prior studies have demonstrated that the black and tan, brachyury (BTBR) T+ Itpr3tf/J mouse strain displays a behavioral phenotype with ASD-like features. BTBR T+ Itpr3tf/J mice (referred to simply as BTBR) display deficits in social functioning, lack of communication ability, and engagement in stereotyped behavior. Despite extensive behavioral phenotypic characterization, little is known about the genes and proteins responsible for the presentation of the ASD-like phenotype in the BTBR mouse model. In this study, we employed bioinformatics techniques to gain a wide-scale understanding of the transcriptomic and proteomic changes associated with the ASD-like phenotype in BTBR mice. We found a number of genes and proteins to be significantly altered in BTBR mice compared to C57BL/6J (B6) control mice, such as BDNF, Shank3, and ERK1, which are highly relevant to prior investigations of ASD. Furthermore, we identified distinct functional pathways altered in BTBR mice compared to B6 controls, including "axon guidance" and "regulation of actin cytoskeleton," that have previously been shown to be altered in both mouse models of ASD and some human clinical populations and have been suggested as a possible etiological mechanism of ASD. In addition, our wide-scale bioinformatics approach discovered several previously unidentified genes and proteins associated with the ASD phenotype in BTBR mice, such as Caskin1, suggesting that bioinformatics could be an avenue by which novel therapeutic targets for ASD are uncovered. As a result, we believe that informed use of synergistic bioinformatics applications represents an invaluable tool for elucidating the etiology of complex disorders like ASD.