984 resultados para Gene annotation
Resumo:
Tese (doutorado)—Universidade de Brasília, Instituto de Ciências Biológicas, Programa de Pós-Graduação em Biologia Molecular, 2016.
Resumo:
Background: Copy number variations (CNVs) have been shown to account for substantial portions of observed genomic variation and have been associated with qualitative and quantitative traits and the onset of disease in a number of species. Information from high-resolution studies to detect, characterize and estimate population-specific variant frequencies will facilitate the incorporation of CNVs in genomic studies to identify genes affecting traits of importance. Results: Genome-wide CNVs were detected in high-density single nucleotide polymorphism (SNP) genotyping data from 1,717 Nelore (Bos indicus) cattle, and in NGS data from eight key ancestral bulls. A total of 68,007 and 12,786 distinct CNVs were observed, respectively. Cross-comparisons of results obtained for the eight resequenced animals revealed that 92 % of the CNVs were observed in both datasets, while 62 % of all detected CNVs were observed to overlap with previously validated cattle copy number variant regions (CNVRs). Observed CNVs were used for obtaining breed-specific CNV frequencies and identification of CNVRs, which were subsequently used for gene annotation. A total of 688 of the detected CNVRs were observed to overlap with 286 non-redundant QTLs associated with important production traits in cattle. All of 34 CNVs previously reported to be associated with milk production traits in Holsteins were also observed in Nelore cattle. Comparisons of estimated frequencies of these CNVs in the two breeds revealed 14, 13, 6 and 14 regions in high (>20 %), low (<20 %) and divergent (NEL > HOL, NEL < HOL) frequencies, respectively. Conclusions: Obtained results significantly enriched the bovine CNV map and enabled the identification of variants that are potentially associated with traits under selection in Nelore cattle, particularly in genome regions harboring QTLs affecting production traits.
Resumo:
T he international FANTOM consortium aims to produce a comprehensive picture of the mammalian transcriptome, based upon an extensive cDNA collection and functional annotation of full-length enriched cDNAs. The previous dataset, FANTOM(2), comprised 60,770 full- length enriched cDNAs. Functional annotation revealed that this cDNA dataset contained only about half of the estimated number of mouse protein- coding genes, indicating that a number of cDNAs still remained to be collected and identified. To pursue the complete gene catalog that covers all predicted mouse genes, cloning and sequencing of full- length enriched cDNAs has been continued since FANTOM2. In FANTOM3, 42,031 newly isolated cDNAs were subjected to functional annotation, and the annotation of 4,347 FANTOM2 cDNAs was updated. To accomplish accurate functional annotation, we improved our automated annotation pipeline by introducing new coding sequence prediction programs and developed a Web- based annotation interface for simplifying the annotation procedures to reduce manual annotation errors. Automated coding sequence and function prediction was followed with manual curation and review by expert curators. A total of 102,801 full- length enriched mouse cDNAs were annotated. Out of 102,801 transcripts, 56,722 were functionally annotated as protein coding ( including partial or truncated transcripts), providing to our knowledge the greatest current coverage of the mouse proteome by full- length cDNAs. The total number of distinct non- protein- coding transcripts increased to 34,030. The FANTOM3 annotation system, consisting of automated computational prediction, manual curation, and. nal expert curation, facilitated the comprehensive characterization of the mouse transcriptome, and could be applied to the transcriptomes of other species.
Resumo:
Staphylococcus aureus (S. aureus) is a prominent human and livestock pathogen investigated widely using omic technologies. Critically, due to availability, low visibility or scattered resources, robust network and statistical contextualisation of the resulting data is generally under-represented. Here, we present novel meta-analyses of freely-accessible molecular network and gene ontology annotation information resources for S. aureus omics data interpretation. Furthermore, through the application of the gene ontology annotation resources we demonstrate their value and ability (or lack-there-of) to summarise and statistically interpret the emergent properties of gene expression and protein abundance changes using publically available data. This analysis provides simple metrics for network selection and demonstrates the availability and impact that gene ontology annotation selection can have on the contextualisation of bacterial omics data.
Resumo:
The time of the large sequencing projects has enabled unprecedented possibilities of investigating more complex aspects of living organisms. Among the high-throughput technologies based on the genomic sequences, the DNA microarrays are widely used for many purposes, including the measurement of the relative quantity of the messenger RNAs. However, the reliability of microarrays has been strongly doubted as robust analysis of the complex microarray output data has been developed only after the technology had already been spread in the community. An objective of this study consisted of increasing the performance of microarrays, and was measured by the successful validation of the results by independent techniques. To this end, emphasis has been given to the possibility of selecting candidate genes with remarkable biological significance within specific experimental design. Along with literature evidence, the re-annotation of the probes and model-based normalization algorithms were found to be beneficial when analyzing Affymetrix GeneChip data. Typically, the analysis of microarrays aims at selecting genes whose expression is significantly different in different conditions followed by grouping them in functional categories, enabling a biological interpretation of the results. Another approach investigates the global differences in the expression of functionally related groups of genes. Here, this technique has been effective in discovering patterns related to temporal changes during infection of human cells. Another aspect explored in this thesis is related to the possibility of combining independent gene expression data for creating a catalog of genes that are selectively expressed in healthy human tissues. Not all the genes present in human cells are active; some involved in basic activities (named housekeeping genes) are expressed ubiquitously. Other genes (named tissue-selective genes) provide more specific functions and they are expressed preferably in certain cell types or tissues. Defining the tissue-selective genes is also important as these genes can cause disease with phenotype in the tissues where they are expressed. The hypothesis that gene expression could be used as a measure of the relatedness of the tissues has been also proved. Microarray experiments provide long lists of candidate genes that are often difficult to interpret and prioritize. Extending the power of microarray results is possible by inferring the relationships of genes under certain conditions. Gene transcription is constantly regulated by the coordinated binding of proteins, named transcription factors, to specific portions of the its promoter sequence. In this study, the analysis of promoters from groups of candidate genes has been utilized for predicting gene networks and highlighting modules of transcription factors playing a central role in the regulation of their transcription. Specific modules have been found regulating the expression of genes selectively expressed in the hippocampus, an area of the brain having a central role in the Major Depression Disorder. Similarly, gene networks derived from microarray results have elucidated aspects of the development of the mesencephalon, another region of the brain involved in Parkinson Disease.
Resumo:
The rapid increase in genome sequence information has necessitated the annotation of their functional elements, particularly those occurring in the non-coding regions, in the genomic context. Promoter region is the key regulatory region, which enables the gene to be transcribed or repressed, but it is difficult to determine experimentally. Hence an in silico identification of promoters is crucial in order to guide experimental work and to pin point the key region that controls the transcription initiation of a gene. In this analysis, we demonstrate that while the promoter regions are in general less stable than the flanking regions, their average free energy varies depending on the GC composition of the flanking genomic sequence. We have therefore obtained a set of free energy threshold values, for genomic DNA with varying GC content and used them as generic criteria for predicting promoter regions in several microbial genomes, using an in-house developed tool `PromPredict'. On applying it to predict promoter regions corresponding to the 1144 and 612 experimentally validated TSSs in E. coli (50.8% GC) and B. subtilis (43.5% GC) sensitivity of 99% and 95% and precision values of 58% and 60%, respectively, were achieved. For the limited data set of 81 TSSs available for M. tuberculosis (65.6% GC) a sensitivity of 100% and precision of 49% was obtained.
Resumo:
Motivation: The number of bacterial genomes being sequenced is increasing very rapidly and hence, it is crucial to have procedures for rapid and reliable annotation of their functional elements such as promoter regions, which control the expression of each gene or each transcription unit of the genome. The present work addresses this requirement and presents a generic method applicable across organisms. Results: Relative stability of the DNA double helical sequences has been used to discriminate promoter regions from non-promoter regions. Based on the difference in stability between neighboring regions, an algorithm has been implemented to predict promoter regions on a large scale over 913 microbial genome sequences. The average free energy values for the promoter regions as well as their downstream regions are found to differ, depending on their GC content. Threshold values to identify promoter regions have been derived using sequences flanking a subset of translation start sites from all microbial genomes and then used to predict promoters over the complete genome sequences. An average recall value of 72% (which indicates the percentage of protein and RNA coding genes with predicted promoter regions assigned to them) and precision of 56% is achieved over the 913 microbial genome dataset.
Resumo:
Background: The number of genome-wide association studies (GWAS) has increased rapidly in the past couple of years, resulting in the identification of genes associated with different diseases. The next step in translating these findings into biomedically useful information is to find out the mechanism of the action of these genes. However, GWAS studies often implicate genes whose functions are currently unknown; for example, MYEOV, ANKLE1, TMEM45B and ORAOV1 are found to be associated with breast cancer, but their molecular function is unknown. Results: We carried out Bayesian inference of Gene Ontology (GO) term annotations of genes by employing the directed acyclic graph structure of GO and the network of protein-protein interactions (PPIs). The approach is designed based on the fact that two proteins that interact biophysically would be in physical proximity of each other, would possess complementary molecular function, and play role in related biological processes. Predicted GO terms were ranked according to their relative association scores and the approach was evaluated quantitatively by plotting the precision versus recall values and F-scores (the harmonic mean of precision and recall) versus varying thresholds. Precisions of similar to 58% and similar to 40% for localization and functions respectively of proteins were determined at a threshold of similar to 30 (top 30 GO terms in the ranked list). Comparison with function prediction based on semantic similarity among nodes in an ontology and incorporation of those similarities in a k nearest neighbor classifier confirmed that our results compared favorably. Conclusions: This approach was applied to predict the cellular component and molecular function GO terms of all human proteins that have interacting partners possessing at least one known GO annotation. The list of predictions is available at http://severus.dbmi.pitt.edu/engo/GOPRED.html. We present the algorithm, evaluations and the results of the computational predictions, especially for genes identified in GWAS studies to be associated with diseases, which are of translational interest.
Resumo:
Cis-peptide embedded segments are rare in proteins but often highlight their important role in molecular function when they do occur. The high evolutionary conservation of these segments illustrates this observation almost universally, although no attempt has been made to systematically use this information for the purpose of function annotation. In the present study, we demonstrate how geometric clustering and level-specific Gene Ontology molecular-function terms (also known as annotations) can be used in a statistically significant manner to identify cis-embedded segments in a protein linked to its molecular function. The present study identifies novel cis-peptide fragments, which are subsequently used for fragment-based function annotation. Annotation recall benchmarks interpreted using the receiver-operator characteristic plot returned an area-under-curve >0.9, corroborating the utility of the annotation method. In addition, we identified cis-peptide fragments occurring in conjunction with functionally important trans-peptide fragments, providing additional insights into molecular function. We further illustrate the applicability of our method in function annotation where homology-based annotation transfer is not possible. The findings of the present study add to the repertoire of function annotation approaches and also facilitate engineering, design and allied studies around the cis-peptide neighborhood of proteins.
Resumo:
The availability of the genome sequence of Mycobacterium tuberculosis H37Rv has encouraged determination of large numbers of protein structures and detailed definition of the biological information encoded therein; yet, the functions of many proteins in M. tuberculosis remain unknown. The emergence of multidrug resistant strains makes it a priority to exploit recent advances in homology recognition and structure prediction to re-analyse its gene products. Here we report the structural and functional characterization of gene products encoded in the M. tuberculosis genome, with the help of sensitive profile-based remote homology search and fold recognition algorithms resulting in an enhanced annotation of the proteome where 95% of the M. tuberculosis proteins were identified wholly or partly with information on structure or function. New information includes association of 244 proteins with 205 domain families and a separate set of new association of folds to 64 proteins. Extending structural information across uncharacterized protein families represented in the M. tuberculosis proteome, by determining superfamily relationships between families of known and unknown structures, has contributed to an enhancement in the knowledge of structural content. In retrospect, such superfamily relationships have facilitated recognition of probable structure and/or function for several uncharacterized protein families, eventually aiding recognition of probable functions for homologous proteins corresponding to such families. Gene products unique to mycobacteria for which no functions could be identified are 183. Of these 18 were determined to be M. tuberculosis specific. Such pathogen-specific proteins are speculated to harbour virulence factors required for pathogenesis. A re-annotated proteome of M. tuberculosis, with greater completeness of annotated proteins and domain assigned regions, provides a valuable basis for experimental endeavours designed to obtain a better understanding of pathogenesis and to accelerate the process of drug target discovery. (C) 2014 Elsevier Ltd. All rights reserved.
Resumo:
MOTIVATION: Synthetic lethal interactions represent pairs of genes whose individual mutations are not lethal, while the double mutation of both genes does incur lethality. Several studies have shown a correlation between functional similarity of genes and their distances in networks based on synthetic lethal interactions. However, there is a lack of algorithms for predicting gene function from synthetic lethality interaction networks. RESULTS: In this article, we present a novel technique called kernelROD for gene function prediction from synthetic lethal interaction networks based on kernel machines. We apply our novel algorithm to Gene Ontology functional annotation prediction in yeast. Our experiments show that our method leads to improved gene function prediction compared with state-of-the-art competitors and that combining genetic and congruence networks leads to a further improvement in prediction accuracy.
Sequencing, annotation and comparative analysis of nine BACs of giant panda (Ailuropoda melanoleuca)
Resumo:
A 10-fold BAC library for giant panda was constructed and nine BACs were selected to generate finish sequences. These BACs could be used as a validation resource for the de novo assembly accuracy of the whole genome shotgun sequencing reads of giant panda newly generated by the Illumina GA sequencing technology. Complete sanger sequencing, assembly, annotation and comparative analysis were carried out on the selected BACs of a joint length 878 kb. Homologue search and de novo prediction methods were used to annotate genes and repeats. Twelve protein coding genes were predicted, seven of which could be functionally annotated. The seven genes have an average gene size of about 41 kb, an average coding size of about 1.2 kb and an average exon number of 6 per gene. Besides, seven tRNA genes were found. About 27 percent of the BAC sequence is composed of repeats. A phylogenetic tree was constructed using neighbor-join algorithm across five species, including giant panda, human, dog, cat and mouse, which reconfirms dog as the most related species to giant panda. Our results provide detailed sequence and structure information for new genes and repeats of giant panda, which will be helpful for further studies on the giant panda.
Resumo:
Background: Giardia are a group of widespread intestinal protozoan parasites in a number of vertebrates. Much evidence from G. lamblia indicated they might be the most primitive extant eukaryotes. When and how such a group of the earliest branching unicellular eukaryotes developed the ability to successfully parasitize the latest branching higher eukaryotes (vertebrates) is an intriguing question. Gene duplication has long been thought to be the most common mechanism in the production of primary resources for the origin of evolutionary novelties. In order to parse the evolutionary trajectory of Giardia parasitic lifestyle, here we carried out a genome-wide analysis about gene duplication patterns in G. lamblia. Results: Although genomic comparison showed that in G. lamblia the contents of many fundamental biologic pathways are simplified and the whole genome is very compact, in our study 40% of its genes were identified as duplicated genes. Evolutionary distance analyses of these duplicated genes indicated two rounds of large scale duplication events had occurred in G. lamblia genome. Functional annotation of them further showed that the majority of recent duplicated genes are VSPs (Variant-specific Surface Proteins), which are essential for the successful parasitic life of Giardia in hosts. Based on evolutionary comparison with their hosts, it was found that the rapid expansion of VSPs in G. lamblia is consistent with the evolutionary radiation of placental mammals. Conclusions: Based on the genome-wide analysis of duplicated genes in G. lamblia, we found that gene duplication was essential for the origin and evolution of Giardia parasitic lifestyle. The recent expansion of VSPs uniquely occurring in G. lamblia is consistent with the increment of its hosts. Therefore we proposed a hypothesis that the increment of Giradia hosts might be the driving force for the rapid expansion of VSPs.
Resumo:
A high-quality cDNA library was constructed from whole body tissues of the zhikong scallop, Chlamys farreri, challenged by Listonella anguillarum. A total of 5720 clones were sequenced, yielding 5123 expressed sequence tags (ESTs). Among the 3326 unique genes identified, 2289 (69%) genes had no significant (E-value < 1e-5) matches to known sequences in public databases and 194 (6%) matched proteins of unknown functions. The remaining 843 (25%) genes that exhibited homology with genes of known functions, showed broad involvement in metabolic processes (31%), cell structure and motility (20%), gene and protein expression (12%), cell signaling and cell communication (8%), cell division (4%), and notably, 25% of those genes were related to immune function. They included stress response genes, complement-like genes, proteinase and proteinase inhibitors, immune recognition receptors and immune effectors. The EST collection obtained in this study provides a useful resource for gene discovery and especially for the identification of host-defense genes and systems in scallops and other molluscs. (C) 2009 Elsevier Ltd. All rights reserved.
Resumo:
Crustacean haemocytes play important roles in the host immune response including recognition, phagocytosis, melanization, cytotoxicity and inter-cellular signal communication. Expressed sequence tags (ESTs) analysis is proved to be an efficient approach not only for gene discovery, but also for gene expression profiles performance. In order to further understand the innate immune system and defense mechanisms of Chinese shrimp at molecular level, complementary DNA library is constructed from the haemocyte tissue of Fenneropenaeus chinensis. A total of 2371 cDNA clones are successfully sequenced and the average sequence length is 460 bp. About 50% are identified as orthologs of known genes from other organisms by BLASTx and BLASTn program. By sequences comparability and analysis, 34 important genes including 177 ESTs are identified that may be involved in defense or immune functions in shrimp based on the known knowledge. These genes are categorized into five categories according to their putative functions in shrimp immune system: 13 genes are different types of antimicrobial peptides (AMP, penaeidin, antilipopolysaccharide factor, etc.), and their proportion is about 3 8%; 11 genes belong to prophenoloxidase system (prophenoloxidase, serine proteinase, serine proteinase inhibitor, etc.), and their proportion is about 32%; five genes have high homology with clotting protein (lectin, transglutaminase, etc), and their proportion is about 15%; three genes may be involved in inter-cell signal communication (peroxinectin, integrin), and their proportion is about 9%; two genes have been identified to be chaperone proteins (Hsc70, thioredoxin peroxidase), and their proportion is about 6%. These EST sequences enrich our understanding of the immune genes of F chinensis and will help farther experimental research into immune factors and improve our knowledge of the immune mechanisms of shrimp. (c) 2007 Elsevier B.V. All rights reserved.