954 resultados para Genomic data
Resumo:
Cancer and cardio-vascular diseases are the leading causes of death world-wide. Caused by systemic genetic and molecular disruptions in cells, these disorders are the manifestation of profound disturbance of normal cellular homeostasis. People suffering or at high risk for these disorders need early diagnosis and personalized therapeutic intervention. Successful implementation of such clinical measures can significantly improve global health. However, development of effective therapies is hindered by the challenges in identifying genetic and molecular determinants of the onset of diseases; and in cases where therapies already exist, the main challenge is to identify molecular determinants that drive resistance to the therapies. Due to the progress in sequencing technologies, the access to a large genome-wide biological data is now extended far beyond few experimental labs to the global research community. The unprecedented availability of the data has revolutionized the capabilities of computational researchers, enabling them to collaboratively address the long standing problems from many different perspectives. Likewise, this thesis tackles the two main public health related challenges using data driven approaches. Numerous association studies have been proposed to identify genomic variants that determine disease. However, their clinical utility remains limited due to their inability to distinguish causal variants from associated variants. In the presented thesis, we first propose a simple scheme that improves association studies in supervised fashion and has shown its applicability in identifying genomic regulatory variants associated with hypertension. Next, we propose a coupled Bayesian regression approach -- eQTeL, which leverages epigenetic data to estimate regulatory and gene interaction potential, and identifies combinations of regulatory genomic variants that explain the gene expression variance. On human heart data, eQTeL not only explains a significantly greater proportion of expression variance in samples, but also predicts gene expression more accurately than other methods. We demonstrate that eQTeL accurately detects causal regulatory SNPs by simulation, particularly those with small effect sizes. Using various functional data, we show that SNPs detected by eQTeL are enriched for allele-specific protein binding and histone modifications, which potentially disrupt binding of core cardiac transcription factors and are spatially proximal to their target. eQTeL SNPs capture a substantial proportion of genetic determinants of expression variance and we estimate that 58% of these SNPs are putatively causal. The challenge of identifying molecular determinants of cancer resistance so far could only be dealt with labor intensive and costly experimental studies, and in case of experimental drugs such studies are infeasible. Here we take a fundamentally different data driven approach to understand the evolving landscape of emerging resistance. We introduce a novel class of genetic interactions termed synthetic rescues (SR) in cancer, which denotes a functional interaction between two genes where a change in the activity of one vulnerable gene (which may be a target of a cancer drug) is lethal, but subsequently altered activity of its partner rescuer gene restores cell viability. Next we describe a comprehensive computational framework --termed INCISOR-- for identifying SR underlying cancer resistance. Applying INCISOR to mine The Cancer Genome Atlas (TCGA), a large collection of cancer patient data, we identified the first pan-cancer SR networks, composed of interactions common to many cancer types. We experimentally test and validate a subset of these interactions involving the master regulator gene mTOR. We find that rescuer genes become increasingly activated as breast cancer progresses, testifying to pervasive ongoing rescue processes. We show that SRs can be utilized to successfully predict patients' survival and response to the majority of current cancer drugs, and importantly, for predicting the emergence of drug resistance from the initial tumor biopsy. Our analysis suggests a potential new strategy for enhancing the effectiveness of existing cancer therapies by targeting their rescuer genes to counteract resistance. The thesis provides statistical frameworks that can harness ever increasing high throughput genomic data to address challenges in determining the molecular underpinnings of hypertension, cardiovascular disease and cancer resistance. We discover novel molecular mechanistic insights that will advance the progress in early disease prevention and personalized therapeutics. Our analyses sheds light on the fundamental biological understanding of gene regulation and interaction, and opens up exciting avenues of translational applications in risk prediction and therapeutics.
Resumo:
The recent advent of new technologies has led to huge amounts of genomic data. With these data come new opportunities to understand biological cellular processes underlying hidden regulation mechanisms and to identify disease related biomarkers for informative diagnostics. However, extracting biological insights from the immense amounts of genomic data is a challenging task. Therefore, effective and efficient computational techniques are needed to analyze and interpret genomic data. In this thesis, novel computational methods are proposed to address such challenges: a Bayesian mixture model, an extended Bayesian mixture model, and an Eigen-brain approach. The Bayesian mixture framework involves integration of the Bayesian network and the Gaussian mixture model. Based on the proposed framework and its conjunction with K-means clustering and principal component analysis (PCA), biological insights are derived such as context specific/dependent relationships and nested structures within microarray where biological replicates are encapsulated. The Bayesian mixture framework is then extended to explore posterior distributions of network space by incorporating a Markov chain Monte Carlo (MCMC) model. The extended Bayesian mixture model summarizes the sampled network structures by extracting biologically meaningful features. Finally, an Eigen-brain approach is proposed to analyze in situ hybridization data for the identification of the cell-type specific genes, which can be useful for informative blood diagnostics. Computational results with region-based clustering reveals the critical evidence for the consistency with brain anatomical structure.
Resumo:
Following the completion of the draft Human Genome in 2001, genomic sequence data is becoming available at an accelerating rate, fueled by advances in sequencing and computational technology. Meanwhile, large collections of astronomical and geospatial data have allowed the creation of virtual observatories, accessible throughout the world and requiring only commodity hardware. Through a combination of advances in data management, data mining and visualization, this infrastructure enables the development of new scientific and educational applications as diverse as galaxy classification and real-time tracking of earthquakes and volcanic plumes. In the present paper, we describe steps taken along a similar path towards a virtual observatory for genomes – an immersive three-dimensional visual navigation and query system for comparative genomic data.
Resumo:
Exponential growth of genomic data in the last two decades has made manual analyses impractical for all but trial studies. As genomic analyses have become more sophisticated, and move toward comparisons across large datasets, computational approaches have become essential. One of the most important biological questions is to understand the mechanisms underlying gene regulation. Genetic regulation is commonly investigated and modelled through the use of transcriptional regulatory network (TRN) structures. These model the regulatory interactions between two key components: transcription factors (TFs) and the target genes (TGs) they regulate. Transcriptional regulatory networks have proven to be invaluable scientific tools in Bioinformatics. When used in conjunction with comparative genomics, they have provided substantial insights into the evolution of regulatory interactions. Current approaches to regulatory network inference, however, omit two additional key entities: promoters and transcription factor binding sites (TFBSs). In this study, we attempted to explore the relationships among these regulatory components in bacteria. Our primary goal was to identify relationships that can assist in reducing the high false positive rates associated with transcription factor binding site predictions and thereupon enhance the reliability of the inferred transcription regulatory networks. In our preliminary exploration of relationships between the key regulatory components in Escherichia coli transcription, we discovered a number of potentially useful features. The combination of location score and sequence dissimilarity scores increased de novo binding site prediction accuracy by 13.6%. Another important observation made was with regards to the relationship between transcription factors grouped by their regulatory role and corresponding promoter strength. Our study of E.coli ��70 promoters, found support at the 0.1 significance level for our hypothesis | that weak promoters are preferentially associated with activator binding sites to enhance gene expression, whilst strong promoters have more repressor binding sites to repress or inhibit gene transcription. Although the observations were specific to �70, they nevertheless strongly encourage additional investigations when more experimentally confirmed data are available. In our preliminary exploration of relationships between the key regulatory components in E.coli transcription, we discovered a number of potentially useful features { some of which proved successful in reducing the number of false positives when applied to re-evaluate binding site predictions. Of chief interest was the relationship observed between promoter strength and TFs with respect to their regulatory role. Based on the common assumption, where promoter homology positively correlates with transcription rate, we hypothesised that weak promoters would have more transcription factors that enhance gene expression, whilst strong promoters would have more repressor binding sites. The t-tests assessed for E.coli �70 promoters returned a p-value of 0.072, which at 0.1 significance level suggested support for our (alternative) hypothesis; albeit this trend may only be present for promoters where corresponding TFBSs are either all repressors or all activators. Nevertheless, such suggestive results strongly encourage additional investigations when more experimentally confirmed data will become available. Much of the remainder of the thesis concerns a machine learning study of binding site prediction, using the SVM and kernel methods, principally the spectrum kernel. Spectrum kernels have been successfully applied in previous studies of protein classification [91, 92], as well as the related problem of promoter predictions [59], and we have here successfully applied the technique to refining TFBS predictions. The advantages provided by the SVM classifier were best seen in `moderately'-conserved transcription factor binding sites as represented by our E.coli CRP case study. Inclusion of additional position feature attributes further increased accuracy by 9.1% but more notable was the considerable decrease in false positive rate from 0.8 to 0.5 while retaining 0.9 sensitivity. Improved prediction of transcription factor binding sites is in turn extremely valuable in improving inference of regulatory relationships, a problem notoriously prone to false positive predictions. Here, the number of false regulatory interactions inferred using the conventional two-component model was substantially reduced when we integrated de novo transcription factor binding site predictions as an additional criterion for acceptance in a case study of inference in the Fur regulon. This initial work was extended to a comparative study of the iron regulatory system across 20 Yersinia strains. This work revealed interesting, strain-specific difierences, especially between pathogenic and non-pathogenic strains. Such difierences were made clear through interactive visualisations using the TRNDifi software developed as part of this work, and would have remained undetected using conventional methods. This approach led to the nomination of the Yfe iron-uptake system as a candidate for further wet-lab experimentation due to its potential active functionality in non-pathogens and its known participation in full virulence of the bubonic plague strain. Building on this work, we introduced novel structures we have labelled as `regulatory trees', inspired by the phylogenetic tree concept. Instead of using gene or protein sequence similarity, the regulatory trees were constructed based on the number of similar regulatory interactions. While the common phylogentic trees convey information regarding changes in gene repertoire, which we might regard being analogous to `hardware', the regulatory tree informs us of the changes in regulatory circuitry, in some respects analogous to `software'. In this context, we explored the `pan-regulatory network' for the Fur system, the entire set of regulatory interactions found for the Fur transcription factor across a group of genomes. In the pan-regulatory network, emphasis is placed on how the regulatory network for each target genome is inferred from multiple sources instead of a single source, as is the common approach. The benefit of using multiple reference networks, is a more comprehensive survey of the relationships, and increased confidence in the regulatory interactions predicted. In the present study, we distinguish between relationships found across the full set of genomes as the `core-regulatory-set', and interactions found only in a subset of genomes explored as the `sub-regulatory-set'. We found nine Fur target gene clusters present across the four genomes studied, this core set potentially identifying basic regulatory processes essential for survival. Species level difierences are seen at the sub-regulatory-set level; for example the known virulence factors, YbtA and PchR were found in Y.pestis and P.aerguinosa respectively, but were not present in both E.coli and B.subtilis. Such factors and the iron-uptake systems they regulate, are ideal candidates for wet-lab investigation to determine whether or not they are pathogenic specific. In this study, we employed a broad range of approaches to address our goals and assessed these methods using the Fur regulon as our initial case study. We identified a set of promising feature attributes; demonstrated their success in increasing transcription factor binding site prediction specificity while retaining sensitivity, and showed the importance of binding site predictions in enhancing the reliability of regulatory interaction inferences. Most importantly, these outcomes led to the introduction of a range of visualisations and techniques, which are applicable across the entire bacterial spectrum and can be utilised in studies beyond the understanding of transcriptional regulatory networks.
Resumo:
MapReduce frameworks such as Hadoop are well suited to handling large sets of data which can be processed separately and independently, with canonical applications in information retrieval and sales record analysis. Rapid advances in sequencing technology have ensured an explosion in the availability of genomic data, with a consequent rise in the importance of large scale comparative genomics, often involving operations and data relationships which deviate from the classical Map Reduce structure. This work examines the application of Hadoop to patterns of this nature, using as our focus a wellestablished workflow for identifying promoters - binding sites for regulatory proteins - Across multiple gene regions and organisms, coupled with the unifying step of assembling these results into a consensus sequence. Our approach demonstrates the utility of Hadoop for problems of this nature, showing how the tyranny of the "dominant decomposition" can be at least partially overcome. It also demonstrates how load balance and the granularity of parallelism can be optimized by pre-processing that splits and reorganizes input files, allowing a wide range of related problems to be brought under the same computational umbrella.
Resumo:
Background Chlamydia pecorum is an important pathogen of domesticated livestock including sheep, cattle and pigs. This pathogen is also a key factor in the decline of the koala in Australia. We sequenced the genomes of three koala C. pecorum strains, isolated from the urogenital tracts and conjunctiva of diseased koalas. The genome of the C. pecorum VR629 (IPA) strain, isolated from a sheep with polyarthritis, was also sequenced. Results Comparisons of the draft C. pecorum genomes against the complete genomes of livestock C. pecorum isolates revealed that these strains have a conserved gene content and order, sharing a nucleotide sequence similarity > 98%. Single nucleotide polymorphisms (SNPs) appear to be key factors in understanding the adaptive process. Two regions of the chromosome were found to be accumulating a large number of SNPs within the koala strains. These regions include the Chlamydia plasticity zone, which contains two cytotoxin genes (toxA and toxB), and a 77 kbp region that codes for putative type III effector proteins. In one koala strain (MC/MarsBar), the toxB gene was truncated by a premature stop codon but is full-length in IPTaLE and DBDeUG. Another five pseudogenes were also identified, two unique to the urogenital strains C. pecorum MC/MarsBar and C. pecorum DBDeUG, respectively, while three were unique to the koala C. pecorum conjunctival isolate IPTaLE. An examination of the distribution of these pseudogenes in C. pecorum strains from a variety of koala populations, alongside a number of sheep and cattle C. pecorum positive samples from Australian livestock, confirmed the presence of four predicted pseudogenes in koala C. pecorum clinical samples. Consistent with our genomics analyses, none of these pseudogenes were observed in the livestock C. pecorum samples examined. Interestingly, three SNPs resulting in pseudogenes identified in the IPTaLE isolate were not found in any other C. pecorum strain analysed, raising questions over the origin of these point mutations. Conclusions The genomic data revealed that variation between C. pecorum strains were mainly due to the accumulation of SNPs, some of which cause gene inactivation. The identification of these genetic differences will provide the basis for further studies to understand the biology and evolution of this important animal pathogen. Keywords: Chlamydia pecorum; Single nucleotide polymorphism; Pseudogene; Cytotoxin
Resumo:
A candidate gene approach using type I single nucleotide polymorphism (SNP) markers can provide an effective method for detecting genes and gene regions that underlie phenotypic variation in adaptively significant traits. In the absence of available genomic data resources, transcriptomes were recently generated in Macrobrachium rosenbergii to identify candidate genes and markers potentially associated with growth. The characterisation of 47 candidate loci by ABI re-sequencing of four cultured and eight wild samples revealed 342 putative SNPs. Among these, 28 SNPs were selected in 23 growth-related candidate genes to genotype in 200 animals selected for improved growth performance in an experimental GFP culture line in Vietnam. The associations between SNP markers and individual growth performance were then examined. For additive and dominant effects, a total of three exonic SNPs in glycogen phosphorylase (additive), heat shock protein 90 (additive and dominant) and peroxidasin (additive), and a total of six intronic SNPs in ankyrin repeats-like protein (additive and dominant), rolling pebbles (dominant), transforming growth factor-β induced precursor (dominant), and UTP-glucose-1-phosphate uridylyltransferase 2 (dominant) genes showed significant associations with the estimated breeding values in the experimental animals (P =0.001−0.031). Individually, they explained 2.6−4.8 % of the genetic variance (R2=0.026−0.048). This is the first large set of SNP markers reported for M. rosenbergii and will be useful for confirmation of associations in other samples or culture lines as well as having applications in marker-assisted selection in future breeding programs.
Resumo:
Background Genomic data are lacking for many allergen sources. To circumvent this limitation, we implemented a strategy to reveal the repertoire of pollen allergens of a grass with clinical importance in subtropical regions, where an increasing proportion of the world's population resides. Objective We sought to identify and immunologically characterize the allergenic components of the Panicoideae Johnson grass pollen (JGP; Sorghum halepense). Methods The total pollen transcriptome, proteome, and allergome of JGP were documented. Serum IgE reactivities with pollen and purified allergens were assessed in 64 patients with grass pollen allergy from a subtropical region. Results Purified Sor h 1 and Sor h 13 were identified as clinically important allergen components of JGP with serum IgE reactivity in 49 (76%) and 28 (43.8%), respectively, of patients with grass pollen allergy. Within whole JGP, multiple cDNA transcripts and peptide spectra belonging to grass pollen allergen families 1, 2, 4, 7, 11, 12, 13, and 25 were identified. Pollen allergens restricted to subtropical grasses (groups 22-24) were also present within the JGP transcriptome and proteome. Mass spectrometry confirmed the IgE-reactive components of JGP included isoforms of Sor h 1, Sor h 2, Sor h 13, and Sor h 23. Conclusion Our integrated molecular approach revealed qualitative differences between the allergenic components of JGP and temperate grass pollens. Knowledge of these newly identified allergens has the potential to improve specific diagnosis and allergen immunotherapy treatment for patients with grass pollen allergy in subtropical regions and reduce the burden of allergic respiratory disease globally.
Resumo:
Evolutionary history of biological entities is recorded within their nucleic acid sequences and can (sometimes) be deciphered by thorough genomic analysis. In this study we sought to gain insights into the diversity and evolution of bacterial and archaeal viruses. Our primary interest was pointed towards those virus groups/families for which comprehensive genomic analysis was not previously possible due to the lack of sufficient amount of genomic data. During the course of this work twenty-five putative proviruses integrated into various prokaryotic genomes were identified, enabling us to undertake a comparative genomics approach. This analysis allowed us to test the previously formulated evolutionary hypotheses and also provided valuable information on the molecular mechanisms behind the genome evolution of the studied virus groups.
Resumo:
Background: Protein kinases are involved in diverse spectrum of cellular processes. Availability of draft version of the human genomic data in the year 2001 enabled recognition of repertoire of protein kinases. However, over the years the human genomic data is being refined and the current release of human genomic data has helped us to recognize a larger repertoire of over 900 human protein kinases represented mainly by splice variants. Results: Many of these identified protein kinases are alternatively spliced products. Interestingly, some of the human kinase splice variants appear to be significantly diverged in terms of their functional properties as represented by incorporation or absence of one or more domains. Many sets of protein kinase splice variants have substantially different domain organization and in a few sets of splice variants kinase domains belong to different subfamilies of kinases suggesting potential participation in different signal transduction pathways. Conclusions: Addition or deletion of a domain between splice variants of multi-domain kinases appears to be a means of generating differences in the functional features of otherwise similar kinases. It is intriguing that marked sequence diversity within the catalytic regions of some of the splice variant kinases result in kinases belonging to different subfamilies. These human kinase splice variants with different functions might contribute to diversity of eukaryotic cellular signaling.
Resumo:
Homozygosity has long been associated with rare, often devastating, Mendelian disorders1, and Darwin was one of the first to recognize that inbreeding reduces evolutionary fitness2. However, the effect of the more distant parental relatedness that is common in modern human populations is less well understood. Genomic data now allow us to investigate the effects of homozygosity on traits of public health importance by observing contiguous homozygous segments (runs of homozygosity), which are inferred to be homozygous along their complete length. Given the low levels of genome-wide homozygosity prevalent in most human populations, information is required on very large numbers of people to provide sufficient power3, 4. Here we use runs of homozygosity to study 16 health-related quantitative traits in 354,224 individuals from 102 cohorts, and find statistically significant associations between summed runs of homozygosity and four complex traits: height, forced expiratory lung volume in one second, general cognitive ability and educational attainment (P < 1 × 10−300, 2.1 × 10−6, 2.5 × 10−10 and 1.8 × 10−10, respectively). In each case, increased homozygosity was associated with decreased trait value, equivalent to the offspring of first cousins being 1.2 cm shorter and having 10 months’ less education. Similar effect sizes were found across four continental groups and populations with different degrees of genome-wide homozygosity, providing evidence that homozygosity, rather than confounding, directly contributes to phenotypic variance. Contrary to earlier reports in substantially smaller samples5, 6, no evidence was seen of an influence of genome-wide homozygosity on blood pressure and low density lipoprotein cholesterol, or ten other cardio-metabolic traits. Since directional dominance is predicted for traits under directional evolutionary selection7, this study provides evidence that increased stature and cognitive function have been positively selected in human evolution, whereas many important risk factors for late-onset complex diseases may not have been.
Resumo:
Genomic data of several organisms have revealed the presence of a vast repertoire of multi-domain proteins. The role played by individual domains in a multi-domain protein has a profound influence on the overall function of the protein. In the present analysis an attempt has been made to better understand the tethering preferences of domain families that occur in multi-domain proteins. The analysis has been carried out on an exhaustive dataset of 2 961 898 sequences of proteins from 930 organisms, where 741 274 proteins are comprised of at least two domain families. For every domain family, the number of other domain families with which it co-occurs within a protein in this dataset has been enumerated and is referred to as the tethering number of the domain family. It was found that, in the general dataset, the AAA ATPase family and the family of Ser/Thr kinases have the highest tethering numbers of 450 and 444 respectively. Further analysis reveals significant correlation between the number of members in a family and its tethering number. Positive correlation was also observed for the extent of a sequence and functional diversity within a family and the tethering numbers of domain families. Domain families that are present ubiquitously in diverse organisms tend to have large tethering numbers, while organism/kingdom-specific families have low tethering numbers. Thus, the analysis uncovers how domain families recombine and evolve to give rise to multi-domain proteins.
Resumo:
The Afrotheria, a supraordinal grouping of mammals whose radiation is rooted in Africa, is strongly supported by DNA sequence data but not by their disparate anatomical features. We have used flow-sorted human, aardvark, and African elephant chromosome painting probes and applied reciprocal painting schemes to representatives of two of the Afrotherian orders, the Tubulidentata (aardvark) and Proboscidea (elephants), in an attempt to shed additional light on the evolutionary affinities of this enigmatic group of mammals. Although we have not yet found any unique cytogenetic signatures that support the monophyly of the Afrotheria, embedded within the aardvark genome we find the strongest evidence yet of a mammalian ancestral karyotype comprising 2n = 44. This karyotype includes nine chromosomes that show complete conserved synteny to those of man, six that show conservation as single chromosome arms or blocks in the human karyotype but that occur on two different chromosomes in the ancestor, and seven neighbor-joining combinations (i.e., the synteny is maintained in the majority of species of the orders studied so far, but which corresponds to two chromosomes in humans). The comparative chromosome maps presented between human and these Afrotherian species provide further insight into mammalian genome organization and comparative genomic data for the Afrotheria, one of the four major evolutionary clades postulated for the Eutheria.
Resumo:
Chris L. Organ, Andrew M. Shedlock, Andrew Meade, Mark Pagel and Scott V. Edwards (2007). Origin of avian genome size and structure in non-avian dinosaurs. Nature, 46(7132), 180-184. RAE2008
Resumo:
Phytochromes are red/far-red photoreceptors that play essential roles in diverse plant morphogenetic and physiological responses to light. Despite their functional significance, phytochrome diversity and evolution across photosynthetic eukaryotes remain poorly understood. Using newly available transcriptomic and genomic data we show that canonical plant phytochromes originated in a common ancestor of streptophytes (charophyte algae and land plants). Phytochromes in charophyte algae are structurally diverse, including canonical and non-canonical forms, whereas in land plants, phytochrome structure is highly conserved. Liverworts, hornworts and Selaginella apparently possess a single phytochrome, whereas independent gene duplications occurred within mosses, lycopods, ferns and seed plants, leading to diverse phytochrome families in these clades. Surprisingly, the phytochrome portions of algal and land plant neochromes, a chimera of phytochrome and phototropin, appear to share a common origin. Our results reveal novel phytochrome clades and establish the basis for understanding phytochrome functional evolution in land plants and their algal relatives.