60 results for Genome annotation
in CentAUR: Central Archive University of Reading - UK
Abstract:
Protein structure prediction methods aim to predict the structures of proteins from their amino acid sequences, utilizing various computational algorithms. Structural genome annotation is the process of attaching biological information to every protein encoded within a genome via the production of three-dimensional protein models.
Abstract:
Motivation: In order to enhance genome annotation, the fully automatic fold recognition method GenTHREADER has been improved and benchmarked. The previous version of GenTHREADER consisted of a simple neural network which was trained to combine sequence alignment score, length information and energy potentials derived from threading into a single score representing the relationship between two proteins, as designated by CATH. The improved version incorporates PSI-BLAST searches, which have been jumpstarted with structural alignment profiles from FSSP, and now also makes use of PSIPRED predicted secondary structure and bi-directional scoring in order to calculate the final alignment score. Pairwise potentials and solvation potentials are calculated from the given sequence alignment which are then used as inputs to a multi-layer, feed-forward neural network, along with the alignment score, alignment length and sequence length. The neural network has also been expanded to accommodate the secondary structure element alignment (SSEA) score as an extra input and it is now trained to learn the FSSP Z-score as a measurement of similarity between two proteins. Results: The improvements made to GenTHREADER increase the number of remote homologues that can be detected with a low error rate, implying higher reliability of score, whilst also increasing the quality of the models produced. We find that up to five times as many true positives can be detected with low error rate per query. Total MaxSub score is doubled at low false positive rates using the improved method.
Abstract:
Blumeria graminis is an economically important obligate plant-pathogenic fungus, whose entire genome was recently sequenced and manually annotated using ab initio in silico predictions [7]. Employing large-scale proteogenomic analysis, we are now able to verify independently the existence of proteins predicted by 24% of open reading frame models. We compared the haustoria and sporulating hyphae proteomes and identified 71 proteins exclusively in haustoria, the feeding and effector-delivery organs of the pathogen. These proteins are significantly smaller than the rest of the protein pool and predicted to be secreted. Most do not share any similarities with Swiss-Prot or TrEMBL entries, nor do they possess any identifiable Pfam domains. We used a novel automated prediction pipeline to model the 3D structures of the proteins, identify putative ligand binding sites and predict regions of intrinsic disorder. This revealed that the protein set found exclusively in haustoria is significantly less disordered than the rest of the identified Blumeria proteins or random (and representative) protein sets generated from the yeast proteome. For most of the haustorial proteins with unknown functions, no good templates could be found from which to generate high-quality models. Thus, these unknown proteins present potentially new protein folds that can be specific to the interaction of the pathogen with its host.
Abstract:
The past years have shown an enormous advancement in sequencing and array-based technologies, producing supplementary or alternative views of the genome stored in various formats and databases. Their sheer volume and differing scope make it challenging to visualize and integrate diverse data types jointly. We present AmalgamScope, a new interactive software tool that assists scientists with the annotation of the human genome and, in particular, with the integration of annotation files from multiple data types using gene identifiers and genomic coordinates. Supported platforms include next-generation sequencing and microarray technologies. The available features of AmalgamScope range from the annotation of diverse data types across the human genome to integration of the data based on the annotation information and visualization of the merged files within chromosomal regions or the whole genome. Additionally, users can define custom transcriptome library files for any species and use the tool's remote-server file exchange options.
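As a rough illustration of integrating annotations by genomic coordinates, the sketch below pairs features from two sources when they overlap on the same chromosome. It is not AmalgamScope's implementation; the `Feature` record, the `integrate` function and the sample data are all hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical annotation record; coordinates are 1-based and inclusive."""
    chrom: str
    start: int
    end: int
    label: str


def overlaps(a: Feature, b: Feature) -> bool:
    """True when two features share a chromosome and at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def integrate(left, right):
    """Pair every feature in `left` with each feature in `right` it overlaps."""
    return [(a.label, b.label) for a in left for b in right if overlaps(a, b)]


# Made-up records standing in for a sequencing track and a microarray track.
peaks = [Feature("chr1", 100, 200, "peak_A"), Feature("chr2", 50, 80, "peak_B")]
probes = [Feature("chr1", 150, 160, "probe_1"), Feature("chr2", 90, 120, "probe_2")]

print(integrate(peaks, probes))
```

The nested loop is quadratic in the number of features; a real tool handling whole-genome tracks would more likely sort the features or use an interval index.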
Abstract:
BACKGROUND: Serial Analysis of Gene Expression (SAGE) is a powerful tool for genome-wide transcription studies. Unlike microarrays, it has the ability to detect novel forms of RNA such as alternatively spliced and antisense transcripts, without the need for prior knowledge of their existence. One limitation of using SAGE on an organism with a complex genome and lacking detailed sequence information, such as the hexaploid bread wheat Triticum aestivum, is accurate annotation of the tags generated. Without accurate annotation it is impossible to fully understand the dynamic processes involved in such complex polyploid organisms. Hence we have developed and utilised novel procedures to characterise, in detail, SAGE tags generated from the whole grain transcriptome of hexaploid wheat. RESULTS: Examination of 71,930 Long SAGE tags generated from six libraries derived from two wheat genotypes grown under two different conditions suggested that SAGE is a reliable and reproducible technique for use in studying the hexaploid wheat transcriptome. However, our results also showed that in poorly annotated and/or poorly sequenced genomes, such as hexaploid wheat, considerably more information can be extracted from SAGE data by carrying out a systematic analysis of both perfect and "fuzzy" (partially matched) tags. This detailed analysis of the SAGE data shows first that while there is evidence of alternative polyadenylation this appears to occur exclusively within the 3' untranslated regions. Secondly, we found no strong evidence for widespread alternative splicing in the developing wheat grain transcriptome. However, analysis of our SAGE data shows that antisense transcripts are probably widespread within the transcriptome and appear to be derived from numerous locations within the genome. 
Examination of antisense transcripts showing sequence similarity to the Puroindoline a and Puroindoline b genes suggests that such antisense transcripts might have a role in the regulation of gene expression. CONCLUSION: Our results indicate that the detailed analysis of transcriptome data, such as SAGE tags, is essential to understand fully the factors that regulate gene expression, and that such analysis of the wheat grain transcriptome reveals that antisense transcripts may be widespread and hence probably play a significant role in the regulation of gene expression during grain development.
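The distinction between perfect and "fuzzy" tags can be sketched as follows. The abstract does not say how partial matches were defined, so this example assumes a fuzzy match means at most one mismatched base (Hamming distance of 1 or less); the function names and the short toy sequences are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))


def classify_tag(tag, references, max_mismatches=1):
    """Return (category, matching reference ids) for one tag.

    `references` maps a reference id to its tag sequence. The one-mismatch
    threshold is an assumption made for illustration only.
    """
    perfect = [rid for rid, seq in references.items() if seq == tag]
    if perfect:
        return "perfect", perfect
    fuzzy = [rid for rid, seq in references.items()
             if hamming(tag, seq) <= max_mismatches]
    if fuzzy:
        return "fuzzy", fuzzy
    return "unmatched", []


# Toy 10-base tags; these are not real wheat sequences.
references = {"gene_x": "CATGAAACCC", "gene_y": "CATGTTTGGG"}

for tag in ("CATGAAACCC", "CATGAAACCG", "CATGACGTAC"):
    print(tag, classify_tag(tag, references))
```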
Abstract:
Pharmacovigilance, the monitoring of adverse events (AEs), is an integral part of the clinical evaluation of a new drug. Until recently, attempts to relate the incidence of AEs to putative causes have been restricted to the evaluation of simple demographic and environmental factors. The advent of large-scale genotyping, however, provides an opportunity to look for associations between AEs and genetic markers, such as single nucleotide polymorphisms (SNPs). It is envisaged that a very large number of SNPs, possibly over 500 000, will be used in pharmacovigilance in an attempt to identify any genetic difference between patients who have experienced an AE and those who have not. We propose a sequential genome-wide association test for analysing AEs as they arise, allowing evidence-based decision-making at the earliest opportunity. This gives us the capability of quickly establishing whether there is a group of patients at high risk of an AE based upon their DNA. Our method provides a valid test which takes account of linkage disequilibrium and allows for the sequential nature of the procedure. The method is more powerful than using a correction, such as Šidák, that assumes that the tests are independent. Copyright © 2006 John Wiley & Sons, Ltd.
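For orientation, the Šidák correction sets the per-test significance level to 1 - (1 - alpha)^(1/m) for m independent tests. The sketch below evaluates it for 500 000 SNPs and compares it with Bonferroni. The family-wise level of 0.05 is an assumed value, and the code does not implement the sequential test proposed in the abstract.

```python
import math


def sidak_threshold(alpha: float, m: int) -> float:
    """Per-test level giving family-wise level alpha over m independent tests."""
    # Same quantity as 1 - (1 - alpha) ** (1 / m), written with log1p/expm1
    # so that it stays accurate when m is very large.
    return -math.expm1(math.log1p(-alpha) / m)


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test level under the Bonferroni correction."""
    return alpha / m


def family_wise_error(per_test: float, m: int) -> float:
    """Probability of at least one false positive among m independent tests."""
    return -math.expm1(m * math.log1p(-per_test))


alpha, m = 0.05, 500_000  # alpha is an assumed family-wise level
s = sidak_threshold(alpha, m)
b = bonferroni_threshold(alpha, m)
print(f"Sidak: {s:.4e}  Bonferroni: {b:.4e}")
```

Both thresholds treat the SNPs as independent, which linkage disequilibrium violates; that is the source of the lost power the abstract refers to.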
Abstract:
To further our understanding of powdery mildew biology during infection, we undertook a systematic shotgun proteomics analysis of the obligate biotroph Blumeria graminis f. sp. hordei at different stages of development in the host. Moreover we used a proteogenomics approach to feed information into the annotation of the newly sequenced genome. We analyzed and compared the proteomes from three stages of development representing different functions during the plant-dependent vegetative life cycle of this fungus. We identified 441 proteins in ungerminated spores, 775 proteins in epiphytic sporulating hyphae, and 47 proteins from haustoria inside barley leaf epidermal cells and used the data to aid annotation of the B. graminis f. sp. hordei genome. We also compared the differences in the protein complement of these key stages. Although confirming some of the previously reported findings and models derived from the analysis of transcriptome dynamics, our results also suggest that the intracellular haustoria are subject to stress possibly as a result of the plant defense strategy, including the production of reactive oxygen species. In addition, a number of small haustorial proteins with a predicted N-terminal signal peptide for secretion were identified in infected tissues: these represent candidate effector proteins that may play a role in controlling host metabolism and immunity. Molecular & Cellular Proteomics 8: 2368-2381, 2009.
Abstract:
Observation of adverse drug reactions during drug development can cause closure of the whole programme. However, if an association between the genotype and the risk of an adverse event is discovered, then it might suffice to exclude patients of certain genotypes from future recruitment. Various sequential and non-sequential procedures are available to identify an association between the whole genome, or at least a portion of it, and the incidence of adverse events. In this paper we start with a suspected association between the genotype and the risk of an adverse event and suppose that the genetic subgroups with elevated risk can be identified. Our focus is determination of whether the patients identified as being at risk should be excluded from further studies of the drug. We propose using a utility function to determine the appropriate action, taking into account the relative costs of suffering an adverse reaction and of failing to alleviate the patient's disease. Two illustrative examples are presented, one comparing patients who suffer from an adverse event with contemporary patients who do not, and the other making use of a reference control group. We also illustrate two classification methods, LASSO and CART, for identifying patients at risk, but we stress that any appropriate classification method could be used in conjunction with the proposed utility function. Our emphasis is on determining the action to take rather than on providing definitive evidence of an association. Copyright (C) 2008 John Wiley & Sons, Ltd.
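The trade-off between the two costs can be illustrated with a deliberately simple expected-utility comparison. This is not the utility function from the paper, whose form the abstract does not give; the parameters `p_ae`, `p_benefit`, `cost_ae` and `cost_untreated`, and the way they are combined, are assumptions made only to show the shape of such a decision rule.

```python
def expected_utilities(p_ae, p_benefit, cost_ae, cost_untreated):
    """Expected utility of including or excluding an at-risk subgroup.

    Hypothetical model: an included patient risks an adverse event with
    probability p_ae and stays unrelieved with probability 1 - p_benefit;
    an excluded patient always bears the cost of the untreated disease.
    """
    include = -(p_ae * cost_ae) - (1.0 - p_benefit) * cost_untreated
    exclude = -cost_untreated
    return {"include": include, "exclude": exclude}


def choose_action(p_ae, p_benefit, cost_ae, cost_untreated):
    """Pick the action with the higher expected utility (ties go to include)."""
    utilities = expected_utilities(p_ae, p_benefit, cost_ae, cost_untreated)
    return max(utilities, key=utilities.get)


# Made-up numbers: a high-risk and a low-risk subgroup with identical costs.
print(choose_action(p_ae=0.30, p_benefit=0.5, cost_ae=10.0, cost_untreated=2.0))
print(choose_action(p_ae=0.01, p_benefit=0.5, cost_ae=10.0, cost_untreated=2.0))
```

Under this toy model, exclusion is preferred exactly when `p_ae * cost_ae` exceeds `p_benefit * cost_untreated`.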
Abstract:
Increasingly, we regard the genome as a site and source of genetic conflict. This fascinating 'bottom-up' view brings up appealing connections between genome biology and whole-organism ecology, in which populations of elements compete with one another in their genomic habitat. Unlike other habitats, though, a host genome has its own evolutionary interests and is often able to defend itself against molecular parasites. Most well-studied organisms employ strategies to protect their genomes against the harmful effects of genomic parasites, including methylation, various pathways of RNA interference, and more unusual tricks such as repeat-induced point mutation (RIP). These genome defence systems are not obscure biological curiosities but are fundamentally important to the integrity and cohesion of the genome, and they exert a powerful influence on genome evolution.
Abstract:
Avian genomes are small and streamlined compared with those of other amniotes by virtue of having fewer repetitive elements and less non-coding DNA(1,2). This condition has been suggested to represent a key adaptation for flight in birds, by reducing the metabolic costs associated with having large genome and cell sizes(3,4). However, the evolution of genome architecture in birds, or any other lineage, is difficult to study because genomic information is often absent for long-extinct relatives. Here we use a novel Bayesian comparative method to show that bone-cell size correlates well with genome size in extant vertebrates, and hence use this relationship to estimate the genome sizes of 31 species of extinct dinosaur, including several species of extinct birds. Our results indicate that the small genomes typically associated with avian flight evolved in the saurischian dinosaur lineage between 230 and 250 million years ago, long before this lineage gave rise to the first birds. By comparison, ornithischian dinosaurs are inferred to have had much larger genomes, which were probably typical for ancestral Dinosauria. Using comparative genomic data, we estimate that genome-wide interspersed mobile elements, a class of repetitive DNA, comprised 5–12% of the total genome size in the saurischian dinosaur lineage but 7–19% of total genome size in ornithischian dinosaurs, suggesting that repetitive elements became less active in the saurischian lineage. These genomic characteristics should be added to the list of attributes previously considered avian but now thought to have arisen in non-avian dinosaurs, such as feathers(5), pulmonary innovations(6), and parental care and nesting
Abstract:
Motivation: There is a frequent need to apply a large range of local or remote prediction and annotation tools to one or more sequences. We have created a tool able to dispatch one or more sequences to assorted services by defining a consistent XML format for data and annotations. Results: By analyzing annotation tools, we have determined that annotations can be described using one or more of the six forms of data: numeric or textual annotation of residues, domains (residue ranges) or whole sequences. With this in mind, XML DTDs have been designed to store the input and output of any server. Plug-in wrappers to a number of services have been written which are called from a master script. The resulting APATML is then formatted for display in HTML. Alternatively further tools may be written to perform post-analysis.
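The six forms of annotation named above (numeric or textual, attached to a residue, a domain or a whole sequence) can be modelled with a small data structure. The sketch below is a Python illustration, not the APATML DTD itself; the class, field and enum names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class Scope(Enum):
    RESIDUE = "residue"
    DOMAIN = "domain"      # a residue range
    SEQUENCE = "sequence"  # the whole sequence


@dataclass(frozen=True)
class Annotation:
    """One annotation; the field names are illustrative, not taken from APATML."""
    scope: Scope
    value: Union[float, str]     # numeric or textual payload
    start: Optional[int] = None  # 1-based residue index
    end: Optional[int] = None    # set only for domains

    def __post_init__(self):
        # Check that the coordinates supplied fit the declared scope.
        if self.scope is Scope.SEQUENCE:
            ok = self.start is None and self.end is None
        elif self.scope is Scope.RESIDUE:
            ok = self.start is not None and self.end is None
        else:
            ok = (self.start is not None and self.end is not None
                  and self.start <= self.end)
        if not ok:
            raise ValueError(f"coordinates do not fit scope {self.scope.value}")

    @property
    def form(self):
        """Which of the six forms this annotation takes, as (kind, scope)."""
        kind = "textual" if isinstance(self.value, str) else "numeric"
        return (kind, self.scope.value)


# One made-up example of each form.
examples = [
    Annotation(Scope.RESIDUE, 0.87, start=12),
    Annotation(Scope.RESIDUE, "H", start=12),
    Annotation(Scope.DOMAIN, 42.0, start=5, end=80),
    Annotation(Scope.DOMAIN, "kinase-like", start=5, end=80),
    Annotation(Scope.SEQUENCE, 310.5),
    Annotation(Scope.SEQUENCE, "putative transporter"),
]
for a in examples:
    print(a.form, a.value)
```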
Abstract:
Phylogenetic methods hold great promise for the reconstruction of the transition from precursor to modern flora and the identification of underlying factors which drive the process. The phylogenetic methods presently used to address the question of the origin of the Cape flora of South Africa are considered here. The sampling requirements of each of these methods, which include dating of diversifications using calibrated molecular trees, sister pair comparisons, lineage-through-time plots and biogeographical optimizations, are reviewed. Sampling of genes, genomes and species is considered. Although increased higher-level studies and increased sampling are required for robust interpretation, it is clear that much progress has already been made. It is argued that despite the remarkable richness of the flora, the Cape flora is a valuable model system to demonstrate the utility of phylogenetic methods in determining the history of a modern flora.
Abstract:
The eukaryotic genome is a mosaic of eubacterial and archaeal genes in addition to those unique to itself. The mosaic may have arisen as the result of two prokaryotes merging their genomes, or from genes acquired from an endosymbiont of eubacterial origin. A third possibility is that the eukaryotic genome arose from successive events of lateral gene transfer over long periods of time. This theory does not exclude the endosymbiont, but questions whether it is necessary to explain the peculiar set of eukaryotic genes. We use phylogenetic studies and reconstructions of ancestral first appearances of genes on the prokaryotic phylogeny to assess evidence for the lateral gene transfer scenario. We find that phylogenies advanced to support fusion can also arise from a succession of lateral gene transfer events. Our reconstructions of ancestral first appearances of genes reveal that the various genes that make up the eukaryotic mosaic arose at different times and in diverse lineages on the prokaryotic tree, and were not available in a single lineage. Successive events of lateral gene transfer can explain the unusual mosaic structure of the eukaryotic genome, with its content linked to the immediate adaptive value of the genes it acquired. Progress in understanding eukaryotes may come from identifying ancestral features such as the eukaryotic spliceosome that could explain why this lineage invaded, or created, the eukaryotic niche.
Abstract:
Background: Rhizobium leguminosarum is an alpha-proteobacterial N2-fixing symbiont of legumes that has been the subject of more than a thousand publications. Genes for the symbiotic interaction with plants are well studied, but the adaptations that allow survival and growth in the soil environment are poorly understood. We have sequenced the genome of R. leguminosarum biovar viciae strain 3841. Results: The 7.75 Mb genome comprises a circular chromosome and six circular plasmids, with 61% G+C overall. All three rRNA operons and 52 tRNA genes are on the chromosome; essential protein-encoding genes are largely chromosomal, but most functional classes occur on plasmids as well. Of the 7,263 protein-encoding genes, 2,056 had orthologs in each of three related genomes (Agrobacterium tumefaciens, Sinorhizobium meliloti, and Mesorhizobium loti), and these genes were overrepresented in the chromosome and had above-average G+C. Most supported the rRNA-based phylogeny, confirming A. tumefaciens to be the closest among these relatives, but 347 genes were incompatible with this phylogeny; these were scattered throughout the genome but were overrepresented on the plasmids. An unexpectedly large number of genes were shared by all three rhizobia but were missing from A. tumefaciens. Conclusion: Overall, the genome can be considered to have two main components: a 'core', which is higher in G+C, is mostly chromosomal, is shared with related organisms, and has a consistent phylogeny; and an 'accessory' component, which is sporadic in distribution, lower in G+C, and located on the plasmids and chromosomal islands. The accessory genome has a different nucleotide composition from the core despite a long history of coexistence.
Abstract:
The identification of signatures of natural selection in genomic surveys has become an area of intense research, stimulated by the increasing ease with which genetic markers can be typed. Loci identified as subject to selection may be functionally important, and hence (weak) candidates for involvement in disease causation. They can also be useful in determining the adaptive differentiation of populations, and exploring hypotheses about speciation. Adaptive differentiation has traditionally been identified from differences in allele frequencies among different populations, summarised by an estimate of FST. Low outliers relative to an appropriate neutral population-genetics model indicate loci subject to balancing selection, whereas high outliers suggest adaptive (directional) selection. However, the problem of identifying statistically significant departures from neutrality is complicated by confounding effects on the distribution of FST estimates, and current methods have not yet been tested in large-scale simulation experiments. Here, we simulate data from a structured population at many unlinked, diallelic loci that are predominantly neutral but with some loci subject to adaptive or balancing selection. We develop a hierarchical-Bayesian method, implemented via Markov chain Monte Carlo (MCMC), and assess its performance in distinguishing the loci simulated under selection from the neutral loci. We also compare this performance with that of a frequentist method, based on moment-based estimates of FST. We find that both methods can identify loci subject to adaptive selection when the selection coefficient is at least five times the migration rate. Neither method could reliably distinguish loci under balancing selection in our simulations, even when the selection coefficient is twenty times the migration rate.
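To show how allele-frequency differences among populations are summarised by a fixation index, the sketch below computes Wright's classical definition for a single diallelic locus: the variance of the allele frequency across populations divided by p(1 - p), where p is the mean frequency. This is a textbook simplification with equal population weights and no sampling correction; it is neither the moment-based estimator nor the hierarchical-Bayesian method evaluated in the abstract, and the function name is hypothetical.

```python
from statistics import fmean, pvariance


def wright_fst(freqs):
    """Wright's fixation index for one diallelic locus.

    `freqs` holds the frequency of one allele in each population. Populations
    are weighted equally and sampling error is ignored (a simplifying
    assumption made for illustration).
    """
    if len(freqs) < 2:
        raise ValueError("need allele frequencies from at least two populations")
    p_bar = fmean(freqs)
    if not 0.0 < p_bar < 1.0:
        raise ValueError("locus is fixed across all populations")
    return pvariance(freqs) / (p_bar * (1.0 - p_bar))


# Made-up loci: identical frequencies everywhere versus strong divergence.
print(wright_fst([0.5, 0.5, 0.5]))
print(wright_fst([0.1, 0.9]))
```

In an outlier scan, values like these would be compared against the distribution expected under a neutral model, with unusually high values pointing to directional selection and unusually low ones to balancing selection.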