994 resultados para Functional Annotation
Resumo:
Predecir la función biológica de secuencias de Ácido Desoxirribonucleico (ADN) es unos de los mayores desafíos a los que se enfrenta la Bioinformática. Esta tarea se denomina anotación funcional y es un proceso complejo, laborioso y que requiere mucho tiempo. Dado su impacto en investigaciones y anotaciones futuras, la anotación debe ser lo más able y precisa posible. Idealmente, las secuencias deberían ser estudiadas y anotadas manualmente por un experto, garantizando así resultados precisos y de calidad. Sin embargo, la anotación manual solo es factible para pequeños conjuntos de datos o genomas de referencia. Con la llegada de las nuevas tecnologías de secuenciación, el volumen de datos ha crecido signi cativamente, haciendo aún más crítica la necesidad de implementaciones automáticas del proceso. Por su parte, la anotación automática es capaz de manejar grandes cantidades de datos y producir un análisis consistente. Otra ventaja de esta aproximación es su rapidez y bajo coste en relación a la manual. Sin embargo, sus resultados son menos precisos que los manuales y, en general, deben ser revisados ( curados ) por un experto. Aunque los procesos colaborativos de la anotación en comunidad pueden ser utilizados para reducir este cuello de botella, los esfuerzos en esta línea no han tenido hasta ahora el éxito esperado. Además, el problema de la anotación, como muchos otros en el dominio de la Bioinformática, abarca información heterogénea, distribuida y en constante evolución. Una posible aproximación para superar estos problemas consiste en cambiar el foco del proceso de los expertos individuales a su comunidad, y diseñar las herramientas de manera que faciliten la gestión del conocimiento y los recursos. Este trabajo adopta esta línea y propone MASSA (Multi-Agent System to Support functional Annotation), una arquitectura de Sistema Multi-Agente (SMA) para Soportar la Anotación funcional...
Resumo:
The chromodomain is 40-50 amino acids in length and is conserved in a wide range of chromatic and regulatory proteins involved in chromatin remodeling. Chromodomain-containing proteins can be classified into families based on their broader characteristics, in particular the presence of other types of domains, and which correlate with different subclasses of the chromodomains themselves. Hidden Markov model (HMM)-generated profiles of different subclasses of chromodomains were used here to identify sequences encoding chromodomain-containing proteins in the mouse transcriptome and genome. A total of 36 different loci encoding proteins containing chromodomains, including 17 novel loci, were identified. Six of these loci (including three apparent pseudogenes, a novel HP1 ortholog, and two novel Msl-3 transcription factor-like proteins) are not present in the human genome, whereas the human genome contains four loci (two CDY orthologs and two apparent CDY pseuclogenes) that are not present in mouse. A number of these loci exhibit alternative splicing to produce different isoforms, including 43 novel variants, some of which lack the chromodomain. The likely functions of these proteins are discussed in relation to the known functions of other chromodomain-containing proteins within the same family.
Resumo:
With the completion of the human and mouse genome sequences, the task now turns to identifying their encoded transcripts and assigning gene function. In this study, we have undertaken a computational approach to identify and classify all of the protein kinases and phosphatases present in the mouse gene complement. A nonredundant set of these sequences was produced by mining Ensembl gene predictions and publicly available cDNA sequences with a panel of InterPro domains. This approach identified 561 candidate protein kinases and 162 candidate protein phosphatases. This cohort was then analyzed using TribeMCL protein sequence similarity clustering followed by CLUSTALV alignment and hierarchical tree generation. This approach allowed us to (1) distinguish between true members of the protein kinase and phosphatase families and enzymes of related biochemistry, (2) determine the structure of the families, and (3) suggest functions for previously uncharacterized members. The classifications obtained by this approach were in good agreement with previous schemes and allowed us to demonstrate domain associations with a number of clusters. Finally, we comment on the complementary nature of cDNA and genome-based gene detection and the impact of the FANTOM2 transcriptome project.
Resumo:
The number of known mRNA transcripts in the mouse has been greatly expanded by the RIKEN Mouse Gene Encyclopedia project. Validation of their reproducible expression in a tissue is an important contribution to the study of functional genomics. In this report, we determine the expression profile of 57,931 clones on 20 mouse tissues using cDNA microarrays. Of these 57,931 clones, 22,928 clones correspond to the FANTOM2 clone set. The set represents 20,234 transcriptional units (TUs) out of 33,409 TUs in the FANTOM2 set. We identified 7206 separate clones that satisfied stringent criteria for tissue-specific expression. Gene Ontology terms were assigned for these 7206 clones, and the proportion of 'molecular function' ontology for each tissue-specific clone was examined. These data will provide insights into the function of each tissue. Tissue-specific gene expression profiles obtained using our cDNA microarrays were also compared with the data extracted from the GNF Expression Atlas based on Affymetrix microarrays. One major outcome of the RIKEN transcriptome analysis is the identification of numerous nonprotein-coding mRNAs. The expression profile was also used to obtain evidence of expression for putative noncoding RNAs. In addition, 1926 clones (70%) of 2768 clones that were categorized as unknown EST, and 1969 (58%) clones of 3388 clones that were categorized as unclassifiable were also shown to be reproducibly expressed.
Resumo:
We report the construction of the mouse full-length cDNA encyclopedia, the most extensive view of a complex transcriptome, on the basis of preparing and sequencing 246 libraries. Before cloning, cDNAs were enriched in full-length by Cap-Trapper, and in most cases, aggressively subtracted/normalized. We have produced 1,442,236 successful 3'-end sequences clustered into 171,144 groups, from which 60,770 clones were fully sequenced cDNAs annotated in the FANTOM-2 annotation. We have also produced 547,149 5' end reads, which clustered into 124,258 groups. Altogether, these cDNAs were further grouped in 70,000 transcriptional units (TU), which represent the best coverage of a transcriptome so far. By monitoring the extent of normalization/subtraction, we define the tentative equivalent coverage (TEC), which was estimated to be equivalent to >12,000,000 ESTs derived from standard libraries. High coverage explains discrepancies between the very large. numbers of clusters (and TUs) of this project, which also include non-protein-coding RNAs, and the lower gene number estimation of genome annotations. Altogether, S'-end clusters identify regions that are potential promoters for 8637 known genes and S'-end clusters suggest the presence of almost 63,000 transcriptional starting points. An estimate of the frequency of polyadenylation signals suggests that at least half of the singletons in the EST set represent real mRNAs. Clones accounting for about half of the predicted TUs await further sequencing. The continued high-discovery rate suggests that the task of transcriptome discovery is not yet complete.
Resumo:
Zinc-finger-containing proteins can be classified into evolutionary and functionally divergent protein families that share one or more domains in which a zinc ion is tetrahedrally coordinated by cysteines and histidines. The zinc finger domain defines one of the largest protein superfamilies in mammalian genomes; 46 different conserved zinc finger domains are listed in InterPro (http://www.ebi.ac.uk/InterPro). Zinc finger proteins can bind to DNA, RNA, other proteins, or lipids as a modular domain in combination with other conserved structures. Owing to this combinatorial diversity, different members of zinc finger superfamilies contribute to many distinct cellular processes, including transcriptional regulation, mRNA stability and processing, and protein turnover. Accordingly, mutations of zinc finger genes lead to aberrations in a broad spectrum of biological processes such as development, differentiation, apoptosis, and immunological responses. This study provides the first comprehensive classification of zinc finger proteins in a mammalian transcriptome. Specific detailed analysis of the SP/Kruppel-like factors and the E3 ubiquitin-ligase RING-H2 families illustrates the importance of such an analysis for a more comprehensive functional classification of large protein families. We describe the characterization of a new family of C2H2 zinc-finger-containing proteins and a new conserved domain characteristic of this family, the identification and characterization of Sp8, a new member of the Sp family of transcriptional regulators, and the identification of five new RING-H2 proteins.
Resumo:
The current RIKEN transcript set represents a significant proportion of the mouse transcriptome but transcripts expressed in the innate and acquired immune systems are poorly represented. In the present study we have assessed the complexity of the transcriptome expressed in mouse macrophages before and after treatment with lipopolysaccharide, a global regulator of macrophage gene expression, using existing RIKEN 19K arrays. By comparison to array profiles of other cells and tissues, we identify a large set of macrophage-enriched genes, many of which have obvious functions in endocytosis and phagocytosis. In addition, a significant number of LPS-inducible genes were identified. The data suggest that macrophages are a complex source of mRNA for transcriptome studies. To assess complexity and identify additional macrophage expressed genes, cDNA libraries were created from purified populations of macrophage and dendritic cells, a functionally related cell type. Sequence analysis revealed a high incidence of novel mRNAs within these cDNA libraries. These studies provide insights into the depths of transcriptional complexity still untapped amongst products of inducible genes, and identify macrophage and dendritic cell populations as a starting point for sampling the inducible mammalian transcriptome.
Resumo:
Cdca4 (Hepp) was originally identified as a gene expressed specifically in hematopoietic progenitor cells as opposed to hematopoietic stem cells. More recently, it has been shown to stimulate p53 activity and also lead to p53-independent growth inhibition when overexpressed. We independently isolated the murine Cdca4 gene in a genomic expression-based screen for genes involved in mammalian craniofacial development, and show that Cdca4 is expressed in a spatio-temporally restricted pattern during mouse embryogenesis. In addition to expression in the facial primordia including the pharyngeal arches, Cdca4 is expressed in the developing limb buds, brain, spinal cord, dorsal root ganglia, teeth, eye and hair follicles. Along with a small number of proteins from a range of species, the predicted CDCA4 protein contains a novel SERTA motif in addition to cyclin A-binding and PHD bromodomain-binding regions of homology. While the function of the SERTA domain is unknown, proteins containing this domain have previously been linked to cell cycle progression and chromatin remodelling. Using in silico database mining we have extended the number of evolutionarily conserved orthologues of known SERTA domain proteins and identified an uncharacterised member of the SERTA domain family, SERTAD4, with orthologues to date in human, mouse, rat, dog, cow, Tetraodon and chicken. Immunolocalisation of transiently and stably transfected epitope-tagged CDCA4 protein in mammalian cells suggests that it resides predominantly in the nucleus throughout all stages of the cell cycle. (c) 2006 Elsevier B.V. All rights reserved.
Resumo:
Craniofacial anomalies are a common feature of human congenital dysmorphology syndromes, suggesting that genes expressed in the developing face are likely to play a wider role in embryonic development. To facilitate the identification of genes involved in embryogenesis, we previously constructed an enriched cDNA library by subtracting adult mouse liver cDNA from that of embryonic day (E)10.5 mouse pharyngeal arch cDNA. From this library, 273 unique clones were sequenced and known proteins binned into functional categories in order to assess enrichment of the library (1). We have now selected 31 novel and poorly characterised genes from this library and present bioinformatic analysis to predict proteins encoded by these genes, and to detect evolutionary conservation. Of these genes 61% (19/31) showed restricted expression in the developing embryo, and a subset of these was chosen for further in silico characterisation as well as experimental determination of subcellular localisation based on transient transfection of predicted full-length coding sequences into mammalian cell lines. Where a human orthologue of these genes was detected, chromosomal localisation was determined relative to known loci for human congenital disease.
Resumo:
Scorpion toxins are important experimental tools for characterization of vast array of ion channels and serve as scaffolds for drug design. General public database entries contain limited annotation whereby rich structure-function information from mutation studies is typically not available. SCORPION2 contains more than 800 records of native and mutant toxin sequences enriched with binding affinity and toxicity information, 624 three-dimensional structures and some 500 references. SCORPION2 has a set of search and prediction tools that allow users to extract and perform specific queries: text searches of scorpion toxin records, sequence similarity search, extraction of sequences, visualization of scorpion toxin structures, analysis of toxic activity, and functional annotation of previously uncharacterized scorpion toxins. The SCORPION2 database is available at http://sdmc.i2r.a-star.edu.sg/scorpion/. (c) 2006 Elsevier Ltd. All rights reserved.
Resumo:
To carry out their specific roles in the cell, genes and gene products often work together in groups, forming many relationships among themselves and with other molecules. Such relationships include physical protein-protein interaction relationships, regulatory relationships, metabolic relationships, genetic relationships, and much more. With advances in science and technology, some high throughput technologies have been developed to simultaneously detect tens of thousands of pairwise protein-protein interactions and protein-DNA interactions. However, the data generated by high throughput methods are prone to noise. Furthermore, the technology itself has its limitations, and cannot detect all kinds of relationships between genes and their products. Thus there is a pressing need to investigate all kinds of relationships and their roles in a living system using bioinformatic approaches, and is a central challenge in Computational Biology and Systems Biology. This dissertation focuses on exploring relationships between genes and gene products using bioinformatic approaches. Specifically, we consider problems related to regulatory relationships, protein-protein interactions, and semantic relationships between genes. A regulatory element is an important pattern or "signal", often located in the promoter of a gene, which is used in the process of turning a gene "on" or "off". Predicting regulatory elements is a key step in exploring the regulatory relationships between genes and gene products. In this dissertation, we consider the problem of improving the prediction of regulatory elements by using comparative genomics data. With regard to protein-protein interactions, we have developed bioinformatics techniques to estimate support for the data on these interactions. While protein-protein interactions and regulatory relationships can be detected by high throughput biological techniques, there is another type of relationship called semantic relationship that cannot be detected by a single technique, but can be inferred using multiple sources of biological data. The contributions of this thesis involved the development and application of a set of bioinformatic approaches that address the challenges mentioned above. These included (i) an EM-based algorithm that improves the prediction of regulatory elements using comparative genomics data, (ii) an approach for estimating the support of protein-protein interaction data, with application to functional annotation of genes, (iii) a novel method for inferring functional network of genes, and (iv) techniques for clustering genes using multi-source data.
Resumo:
Background: Esophageal adenocarcinoma (EA) is one of the fastest rising cancers in western countries. Barrett’s Esophagus (BE) is the premalignant precursor of EA. However, only a subset of BE patients develop EA, which complicates the clinical management in the absence of valid predictors. Genetic risk factors for BE and EA are incompletely understood. This study aimed to identify novel genetic risk factors for BE and EA.Methods: Within an international consortium of groups involved in the genetics of BE/EA, we performed the first meta-analysis of all genome-wide association studies (GWAS) available, involving 6,167 BE patients, 4,112 EA patients, and 17,159 representative controls, all of European ancestry, genotyped on Illumina high-density SNP-arrays, collected from four separate studies within North America, Europe, and Australia. Meta-analysis was conducted using the fixed-effects inverse variance-weighting approach. We used the standard genome-wide significant threshold of 5×10-8 for this study. We also conducted an association analysis following reweighting of loci using an approach that investigates annotation enrichment among the genome-wide significant loci. The entire GWAS-data set was also analyzed using bioinformatics approaches including functional annotation databases as well as gene-based and pathway-based methods in order to identify pathophysiologically relevant cellular pathways.Findings: We identified eight new associated risk loci for BE and EA, within or near the CFTR (rs17451754, P=4·8×10-10), MSRA (rs17749155, P=5·2×10-10), BLK (rs10108511, P=2·1×10-9), KHDRBS2 (rs62423175, P=3·0×10-9), TPPP/CEP72 (rs9918259, P=3·2×10-9), TMOD1 (rs7852462, P=1·5×10-8), SATB2 (rs139606545, P=2·0×10-8), and HTR3C/ABCC5 genes (rs9823696, P=1·6×10-8). A further novel risk locus at LPA (rs12207195, posteriori probability=0·925) was identified after re-weighting using significantly enriched annotations. This study thereby doubled the number of known risk loci. The strongest disease pathways identified (P<10-6) belong to muscle cell differentiation and to mesenchyme development/differentiation, which fit with current pathophysiological BE/EA concepts. To our knowledge, this study identified for the first time an EA-specific association (rs9823696, P=1·6×10-8) near HTR3C/ABCC5 which is independent of BE development (P=0·45).Interpretation: The identified disease loci and pathways reveal new insights into the etiology of BE and EA. Furthermore, the EA-specific association at HTR3C/ABCC5 may constitute a novel genetic marker for the prediction of transition from BE to EA. Mutations in CFTR, one of the new risk loci identified in this study, cause cystic fibrosis (CF), the most common recessive disorder in Europeans. Gastroesophageal reflux (GER) belongs to the phenotypic CF-spectrum and represents the main risk factor for BE/EA. Thus, the CFTR locus may trigger a common GER-mediated pathophysiology.
Resumo:
Background: Statistical analysis of DNA microarray data provides a valuable diagnostic tool for the investigation of genetic components of diseases. To take advantage of the multitude of available data sets and analysis methods, it is desirable to combine both different algorithms and data from different studies. Applying ensemble learning, consensus clustering and cross-study normalization methods for this purpose in an almost fully automated process and linking different analysis modules together under a single interface would simplify many microarray analysis tasks. Results: We present ArrayMining.net, a web-application for microarray analysis that provides easy access to a wide choice of feature selection, clustering, prediction, gene set analysis and cross-study normalization methods. In contrast to other microarray-related web-tools, multiple algorithms and data sets for an analysis task can be combined using ensemble feature selection, ensemble prediction, consensus clustering and cross-platform data integration. By interlinking different analysis tools in a modular fashion, new exploratory routes become available, e.g. ensemble sample classification using features obtained from a gene set analysis and data from multiple studies. The analysis is further simplified by automatic parameter selection mechanisms and linkage to web tools and databases for functional annotation and literature mining. Conclusion: ArrayMining.net is a free web-application for microarray analysis combining a broad choice of algorithms based on ensemble and consensus methods, using automatic parameter selection and integration with annotation databases.
Resumo:
The development of Next Generation Sequencing promotes Biology in the Big Data era. The ever-increasing gap between proteins with known sequences and those with a complete functional annotation requires computational methods for automatic structure and functional annotation. My research has been focusing on proteins and led so far to the development of three novel tools, DeepREx, E-SNPs&GO and ISPRED-SEQ, based on Machine and Deep Learning approaches. DeepREx computes the solvent exposure of residues in a protein chain. This problem is relevant for the definition of structural constraints regarding the possible folding of the protein. DeepREx exploits Long Short-Term Memory layers to capture residue-level interactions between positions distant in the sequence, achieving state-of-the-art performances. With DeepRex, I conducted a large-scale analysis investigating the relationship between solvent exposure of a residue and its probability to be pathogenic upon mutation. E-SNPs&GO predicts the pathogenicity of a Single Residue Variation. Variations occurring on a protein sequence can have different effects, possibly leading to the onset of diseases. E-SNPs&GO exploits protein embeddings generated by two novel Protein Language Models (PLMs), as well as a new way of representing functional information coming from the Gene Ontology. The method achieves state-of-the-art performances and is extremely time-efficient when compared to traditional approaches. ISPRED-SEQ predicts the presence of Protein-Protein Interaction sites in a protein sequence. Knowing how a protein interacts with other molecules is crucial for accurate functional characterization. ISPRED-SEQ exploits a convolutional layer to parse local context after embedding the protein sequence with two novel PLMs, greatly surpassing the current state-of-the-art. All methods are published in international journals and are available as user-friendly web servers. They have been developed keeping in mind standard guidelines for FAIRness (FAIR: Findable, Accessible, Interoperable, Reusable) and are integrated into the public collection of tools provided by ELIXIR, the European infrastructure for Bioinformatics.
Resumo:
Bioinformatics, in the last few decades, has played a fundamental role to give sense to the huge amount of data produced. Obtained the complete sequence of a genome, the major problem of knowing as much as possible of its coding regions, is crucial. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As it has been recently pointed out by the Critical Assessment of Function Annotations (CAFA), most accurate methods are those based on the transfer-by-homology approach and the most incisive contribution is given by cross-genome comparisons. In the present thesis it is described a non-hierarchical sequence clustering method for protein automatic large-scale annotation, called “The Bologna Annotation Resource Plus” (BAR+). The method is based on an all-against-all alignment of more than 13 millions protein sequences characterized by a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) inside clusters by means of a statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three dimensional structure (when a template is available). This is possible by the way of cluster-specific HMM profiles that can be used to calculate reliable template-to-target alignments even in the case of distantly related proteins (sequence identity < 30%). Other BAR+ based applications have been developed during my doctorate including the prediction of Magnesium binding sites in human proteins, the ABC transporters superfamily classification and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. At present, as a web server for the functional and structural protein sequence annotation, BAR+ is freely available at http://bar.biocomp.unibo.it/bar2.0.