992 resultados para Protein annotation
Resumo:
Motivation: There is a frequent need to apply a large range of local or remote prediction and annotation tools to one or more sequences. We have created a tool able to dispatch one or more sequences to assorted services by defining a consistent XML format for data and annotations. Results: By analyzing annotation tools, we have determined that annotations can be described using one or more of the six forms of data: numeric or textual annotation of residues, domains (residue ranges) or whole sequences. With this in mind, XML DTDs have been designed to store the input and output of any server. Plug-in wrappers to a number of services have been written which are called from a master script. The resulting APATML is then formatted for display in HTML. Alternatively further tools may be written to perform post-analysis.
Resumo:
Most of the tasks in genome annotation can be at least partially automated. Since this annotation is time-consuming, facilitating some parts of the process - thus freeing the specialist to carry out more valuable tasks - has been the motivation of many tools and annotation environments. In particular, annotation of protein function can benefit from knowledge about enzymatic processes. The use of sequence homology alone is not a good approach to derive this knowledge when there are only a few homologues of the sequence to be annotated. The alternative is to use motifs. This paper uses a symbolic machine learning approach to derive rules for the classification of enzymes according to the Enzyme Commission (EC). Our results show that, for the top class, the average global classification error is 3.13%. Our technique also produces a set of rules relating structural to functional information, which is important to understand the protein tridimensional structure and determine its biological function. © 2009 Springer Berlin Heidelberg.
Resumo:
BACKGROUND:In the current climate of high-throughput computational biology, the inference of a protein's function from related measurements, such as protein-protein interaction relations, has become a canonical task. Most existing technologies pursue this task as a classification problem, on a term-by-term basis, for each term in a database, such as the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functions. However, ontology structures are essentially hierarchies, with certain top to bottom annotation rules which protein function predictions should in principle follow. Currently, the most common approach to imposing these hierarchical constraints on network-based classifiers is through the use of transitive closure to predictions.RESULTS:We propose a probabilistic framework to integrate information in relational data, in the form of a protein-protein interaction network, and a hierarchically structured database of terms, in the form of the GO database, for the purpose of protein function prediction. At the heart of our framework is a factorization of local neighborhood information in the protein-protein interaction network across successive ancestral terms in the GO hierarchy. We introduce a classifier within this framework, with computationally efficient implementation, that produces GO-term predictions that naturally obey a hierarchical 'true-path' consistency from root to leaves, without the need for further post-processing.CONCLUSION:A cross-validation study, using data from the yeast Saccharomyces cerevisiae, shows our method offers substantial improvements over both standard 'guilt-by-association' (i.e., Nearest-Neighbor) and more refined Markov random field methods, whether in their original form or when post-processed to artificially impose 'true-path' consistency. Further analysis of the results indicates that these improvements are associated with increased predictive capabilities (i.e., increased positive predictive value), and that this increase is consistent uniformly with GO-term depth. Additional in silico validation on a collection of new annotations recently added to GO confirms the advantages suggested by the cross-validation study. Taken as a whole, our results show that a hierarchical approach to network-based protein function prediction, that exploits the ontological structure of protein annotation databases in a principled manner, can offer substantial advantages over the successive application of 'flat' network-based methods.
Resumo:
Tese de doutoramento, Informática (Bioinformática), Universidade de Lisboa, Faculdade de Ciências, 2015
Resumo:
In the current era of high-throughput sequencing and structure determination, functional annotation has become a bottleneck in biomedical science. Here, we show that automated inference of molecular function using functional linkages among genes increases the accuracy of functional assignments by >= 8% and enriches functional descriptions in >= 34% of top assignments. Furthermore, biochemical literature supports >80% of automated inferences for previously unannotated proteins. These results emphasize the benefit of incorporating functional linkages in protein annotation.
Resumo:
Réalisé en cotutelle avec l'Université de Cergy-Pontoise
Resumo:
Il progresso tecnologico nel campo della biologia molecolare, pone la comunità scientifica di fronte all’esigenza di dare un’interpretazione all’enormità di sequenze biologiche che a mano a mano vanno a costituire le banche dati, siano esse proteine o acidi nucleici. In questo contesto la bioinformatica gioca un ruolo di primaria importanza. Un nuovo livello di possibilità conoscitive è stato introdotto con le tecnologie di Next Generation Sequencing (NGS), per mezzo delle quali è possibile ottenere interi genomi o trascrittomi in poco tempo e con bassi costi. Tra le applicazioni del NGS più rilevanti ci sono senza dubbio quelle oncologiche che prevedono la caratterizzazione genomica di tessuti tumorali e lo sviluppo di nuovi approcci diagnostici e terapeutici per il trattamento del cancro. Con l’analisi NGS è possibile individuare il set completo di variazioni che esistono nel genoma tumorale come varianti a singolo nucleotide, riarrangiamenti cromosomici, inserzioni e delezioni. Va però sottolineato che le variazioni trovate nei geni vanno in ultima battuta osservate dal punto di vista degli effetti a livello delle proteine in quanto esse sono le responsabili più dirette dei fenotipi alterati riscontrabili nella cellula tumorale. L’expertise bioinformatica va quindi collocata sia a livello dell’analisi del dato prodotto per mezzo di NGS ma anche nelle fasi successive ove è necessario effettuare l’annotazione dei geni contenuti nel genoma sequenziato e delle relative strutture proteiche che da esso sono espresse, o, come nel caso dello studio mutazionale, la valutazione dell’effetto della variazione genomica. È in questo contesto che si colloca il lavoro presentato: da un lato lo sviluppo di metodologie computazionali per l’annotazione di sequenze proteiche e dall’altro la messa a punto di una pipeline di analisi di dati prodotti con tecnologie NGS in applicazioni oncologiche avente come scopo finale quello della individuazione e caratterizzazione delle mutazioni genetiche tumorali a livello proteico.
Resumo:
Background: The tight junction (TJ) is one of the most important structures established during merozoite invasion of host cells and a large amount of proteins stored in Toxoplasma and Plasmodium parasites’ apical organelles are involved in forming the TJ. Plasmodium falciparum and Toxoplasma gondii apical membrane antigen 1 (AMA-1) and rhoptry neck proteins (RONs) are the two main TJ components. It has been shown that RON4 plays an essential role during merozoite and sporozoite invasion to target cells. This study has focused on characterizing a novel Plasmodium vivax rhoptry protein, RON4, which is homologous to PfRON4 and PkRON4. Methods: The ron4 gene was re-annotated in the P. vivax genome using various bioinformatics tools and taking PfRON4 and PkRON4 amino acid sequences as templates. Gene synteny, as well as identity and similarity values between open reading frames (ORFs) belonging to the three species were assessed. The gene transcription of pvron4, and the expression and localization of the encoded protein were also determined in the VCG-1 strain by molecular and immunological studies. Nucleotide and amino acid sequences obtained for pvron4 in VCG-1 were compared to those from strains coming from different geographical areas. Results: PvRON4 is a 733 amino acid long protein, which is encoded by three exons, having similar transcription and translation patterns to those reported for its homologue, PfRON4. Sequencing PvRON4 from the VCG-1 strain and comparing it to P. vivax strains from different geographical locations has shown two conserved regions separated by a low complexity variable region, possibly acting as a “smokescreen”. PvRON4 contains a predicted signal sequence, a coiled-coil α-helical motif, two tandem repeats and six conserved cysteines towards the carboxyterminus and is a soluble protein lacking predicted transmembranal domains or a GPI anchor. Indirect immunofluorescence assays have shown that PvRON4 is expressed at the apical end of schizonts and co-localizes at the rhoptry neck with PvRON2.
Resumo:
Protein structure prediction methods aim to predict the structures of proteins from their amino acid sequences, utilizing various computational algorithms. Structural genome annotation is the process of attaching biological information to every protein encoded within a genome via the production of three-dimensional protein models.
Resumo:
Bioinformatics, in the last few decades, has played a fundamental role to give sense to the huge amount of data produced. Obtained the complete sequence of a genome, the major problem of knowing as much as possible of its coding regions, is crucial. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As it has been recently pointed out by the Critical Assessment of Function Annotations (CAFA), most accurate methods are those based on the transfer-by-homology approach and the most incisive contribution is given by cross-genome comparisons. In the present thesis it is described a non-hierarchical sequence clustering method for protein automatic large-scale annotation, called “The Bologna Annotation Resource Plus” (BAR+). The method is based on an all-against-all alignment of more than 13 millions protein sequences characterized by a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) inside clusters by means of a statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three dimensional structure (when a template is available). This is possible by the way of cluster-specific HMM profiles that can be used to calculate reliable template-to-target alignments even in the case of distantly related proteins (sequence identity < 30%). Other BAR+ based applications have been developed during my doctorate including the prediction of Magnesium binding sites in human proteins, the ABC transporters superfamily classification and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. At present, as a web server for the functional and structural protein sequence annotation, BAR+ is freely available at http://bar.biocomp.unibo.it/bar2.0.
Resumo:
The Protein Information Resource, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the most comprehensive and expertly annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database. To provide timely and high quality annotation and promote database interoperability, the PIR-International employs rule-based and classification-driven procedures based on controlled vocabulary and standard nomenclature and includes status tags to distinguish experimentally determined from predicted protein features. The database contains about 200 000 non-redundant protein sequences, which are classified into families and superfamilies and their domains and motifs identified. Entries are extensively cross-referenced to other sequence, classification, genome, structure and activity databases. The PIR web site features search engines that use sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. The PIR-International databases and search tools are accessible on the PIR web site at http://pir.georgetown.edu/ and at the MIPS web site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP.
Resumo:
G protein-coupled receptors (GPCR) are amongst the best studied and most functionally diverse types of cell-surface protein. The importance of GPCRs as mediates or cell function and organismal developmental underlies their involvement in key physiological roles and their prominence as targets for pharmacological therapeutics. In this review, we highlight the requirement for integrated protocols which underline the different perspectives offered by different sequence analysis methods. BLAST and FastA offer broad brush strokes. Motif-based search methods add the fine detail. Structural modelling offers another perspective which allows us to elucidate the physicochemical properties that underlie ligand binding. Together, these different views provide a more informative and a more detailed picture of GPCR structure and function. Many GPCRs remain orphan receptors with no identified ligand, yet as computer-driven functional genomics starts to elaborate their functions, a new understanding of their roles in cell and developmental biology will follow.
Resumo:
Background The sequencing, de novo assembly and annotation of transcriptome datasets generated with next generation sequencing (NGS) has enabled biologists to answer genomic questions in non-model species with unprecedented ease. Reliable and accurate de novo assembly and annotation of transcriptomes, however, is a critically important step for transcriptome assemblies generated from short read sequences. Typical benchmarks for assembly and annotation reliability have been performed with model species. To address the reliability and accuracy of de novo transcriptome assembly in non-model species, we generated an RNAseq dataset for an intertidal gastropod mollusc species, Nerita melanotragus, and compared the assembly produced by four different de novo transcriptome assemblers; Velvet, Oases, Geneious and Trinity, for a number of quality metrics and redundancy. Results Transcriptome sequencing on the Ion Torrent PGM™ produced 1,883,624 raw reads with a mean length of 133 base pairs (bp). Both the Trinity and Oases de novo assemblers produced the best assemblies based on all quality metrics including fewer contigs, increased N50 and average contig length and contigs of greater length. Overall the BLAST and annotation success of our assemblies was not high with only 15-19% of contigs assigned a putative function. Conclusions We believe that any improvement in annotation success of gastropod species will require more gastropod genome sequences, but in particular an increase in mollusc protein sequences in public databases. Overall, this paper demonstrates that reliable and accurate de novo transcriptome assemblies can be generated from short read sequencers with the right assembly algorithms. Keywords: Nerita melanotragus; De novo assembly; Transcriptome; Heat shock protein; Ion torrent
Resumo:
Determination of sequence similarity is a central issue in computational biology, a problem addressed primarily through BLAST, an alignment based heuristic which has underpinned much of the analysis and annotation of the genomic era. Despite their success, alignment-based approaches scale poorly with increasing data set size, and are not robust under structural sequence rearrangements. Successive waves of innovation in sequencing technologies – so-called Next Generation Sequencing (NGS) approaches – have led to an explosion in data availability, challenging existing methods and motivating novel approaches to sequence representation and similarity scoring, including adaptation of existing methods from other domains such as information retrieval. In this work, we investigate locality-sensitive hashing of sequences through binary document signatures, applying the method to a bacterial protein classification task. Here, the goal is to predict the gene family to which a given query protein belongs. Experiments carried out on a pair of small but biologically realistic datasets (the full protein repertoires of families of Chlamydia and Staphylococcus aureus genomes respectively) show that a measure of similarity obtained by locality sensitive hashing gives highly accurate results while offering a number of avenues which will lead to substantial performance improvements over BLAST..