6 results for Protein sequence

in AMS Tesi di Dottorato - Alm@DL - Università di Bologna


Relevance:

100.00%

Publisher:

Abstract:

Over the last few decades, bioinformatics has played a fundamental role in making sense of the huge amount of data produced. Once the complete sequence of a genome has been obtained, the crucial problem is to learn as much as possible about its coding regions. Protein sequence annotation is challenging and, given the size of the problem, only computational approaches can provide a feasible solution. As recently pointed out by the Critical Assessment of Function Annotations (CAFA), the most accurate methods are those based on the transfer-by-homology approach, and the most incisive contribution comes from cross-genome comparisons. This thesis describes a non-hierarchical sequence clustering method for automatic large-scale protein annotation, called “The Bologna Annotation Resource Plus” (BAR+). The method is based on an all-against-all alignment of more than 13 million protein sequences, characterized by a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) within clusters by means of statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three-dimensional structure (when a template is available). This is made possible by cluster-specific HMM profiles that can be used to compute reliable template-to-target alignments even for distantly related proteins (sequence identity < 30%). Other BAR+-based applications were developed during my doctorate, including the prediction of magnesium binding sites in human proteins, the classification of the ABC transporter superfamily and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. BAR+ is currently freely available as a web server for the functional and structural annotation of protein sequences at http://bar.biocomp.unibo.it/bar2.0.

Relevance:

70.00%

Publisher:

Abstract:

The vast majority of known proteins have not yet been experimentally characterized and little is known about their function. The design and implementation of computational tools can provide insight into the function of proteins based on their sequence, their structure, their evolutionary history and their association with other proteins. Knowledge of the three-dimensional (3D) structure of a protein can lead to a deep understanding of its mode of action and interaction, but currently the structures of <1% of sequences have been experimentally solved. For this reason, it has become urgent to develop new methods that are able to computationally extract relevant information from protein sequence and structure. The starting point of my work was the study of the properties of contacts between protein residues, since they constrain protein folding and characterize different protein structures. Prediction of residue contacts in proteins is an interesting problem whose solution may be useful in protein fold recognition and de novo design. The prediction of these contacts requires the study of the protein inter-residue distances related to the specific type of amino acid pair that are encoded in the so-called contact map. An interesting new way of analyzing those structures emerged when network studies were introduced, with pivotal papers demonstrating that protein contact networks also exhibit small-world behavior. In order to highlight constraints for the prediction of protein contact maps and for applications in the field of protein structure prediction and/or reconstruction from experimentally determined contact maps, I studied the extent to which the characteristic path length and clustering coefficient of the protein contact network reveal characteristic features of protein contact maps.
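The two network measures named in this abstract can be illustrated with a small sketch. It is not the thesis's implementation: it assumes one representative point per residue (for example a C-alpha coordinate), and the 8 Å cutoff, the `min_separation` parameter and all function names are illustrative choices.

```python
import math
from collections import deque
from itertools import combinations


def contact_graph(coords, cutoff=8.0, min_separation=1):
    """Build an undirected contact graph from one (x, y, z) point per residue.

    cutoff and min_separation are illustrative assumptions, not thesis values.
    """
    adj = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if j - i >= min_separation and math.dist(coords[i], coords[j]) <= cutoff:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def clustering_coefficient(adj):
    """Average local clustering coefficient; nodes of degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist between pairs of neighbours of this node.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0


def characteristic_path_length(adj):
    """Mean shortest-path length over connected node pairs (BFS per node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    # Toy input: four pseudo-residues on a line, 5 Å apart (not real data).
    toy = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
    graph = contact_graph(toy)
    print(clustering_coefficient(graph), characteristic_path_length(graph))
```

A small-world network would show a high clustering coefficient together with a short characteristic path length; the toy chain above has neither, since it contains no triangles.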
Provided that residue contacts are known for a protein sequence, the major features of its 3D structure could be deduced by combining this knowledge with correctly predicted motifs of secondary structure. In the second part of my work I focused on a particular protein structural motif, the coiled-coil, known to mediate a variety of fundamental biological interactions. Coiled-coils are found in a variety of structural forms and in a wide range of proteins including, for example, small units such as leucine zippers that drive the dimerization of many transcription factors, or more complex structures such as the family of viral proteins responsible for virus-host membrane fusion. The coiled-coil structural motif is estimated to account for 5-10% of the protein sequences in the various genomes. Given their biological importance, in my work I introduced a Hidden Markov Model (HMM) that exploits the evolutionary information derived from multiple sequence alignments to predict coiled-coil regions and to discriminate coiled-coil sequences. The results indicate that the new HMM outperforms all the existing programs and can be adopted for coiled-coil prediction and for large-scale genome annotation. Genome annotation is a key issue in modern computational biology, being the starting point towards the understanding of the complex processes involved in biological networks. The rapid growth in the number of available protein sequences and structures poses new fundamental problems, since these data still await interpretation. Nevertheless, these data are the basis for the design of new strategies for tackling problems such as the prediction of protein structure and function. Experimental determination of the functions of all these proteins would be a hugely time-consuming and costly task and, in most instances, has not been carried out. As an example, only approximately 20% of annotated proteins in the Homo sapiens genome have currently been experimentally characterized.
A commonly adopted procedure for annotating protein sequences relies on "inheritance through homology", based on the notion that similar sequences share similar functions and structures. This procedure consists of assigning sequences to a specific group of functionally related sequences that have been grouped through clustering techniques. The clustering procedure is based on suitable similarity rules, since predicting protein structure and function from sequence largely depends on the value of sequence identity. However, additional levels of complexity are due to multi-domain proteins, to proteins that share common domains but do not necessarily share the same function, and to the finding that different combinations of shared domains can lead to different biological roles. In the last part of this study I developed and validated a system that contributes to sequence annotation by taking advantage of a validated transfer-through-inheritance procedure for molecular functions and structural templates. After a cross-genome comparison with the BLAST program, clusters were built on the basis of two stringent constraints on sequence identity and coverage of the alignment. The adopted measure explicitly addresses the problem of multi-domain protein annotation and allows a fine-grained division of the whole set of proteomes used, which ensures cluster homogeneity in terms of sequence length. A high level of coverage of structural templates over the length of protein sequences within clusters ensures that multi-domain proteins, when present, can be templates for sequences of similar length. This annotation procedure makes it possible to reliably transfer statistically validated functions and structures to sequences, considering the information available in the present databases of molecular functions and structures.
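The clustering step described in this abstract, linking sequences whose pairwise alignment satisfies both an identity and a coverage constraint, can be sketched as connected components over a graph of accepted hits. This is only an illustration under stated assumptions: the abstract does not give the threshold values, so `min_identity` and `min_coverage` below are placeholders, and the tuple layout of `hits` is hypothetical rather than an actual BLAST output format.

```python
def cluster_sequences(ids, hits, min_identity=40.0, min_coverage=90.0):
    """Group sequences into clusters via union-find over accepted alignments.

    ids  : iterable of sequence identifiers
    hits : iterable of (id_a, id_b, percent_identity, percent_coverage)
           tuples (hypothetical layout, e.g. parsed from all-against-all
           alignment results)
    The default thresholds are illustrative placeholders, not thesis values.
    """
    parent = {s: s for s in ids}

    def find(x):
        # Path-halving lookup of the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        # Both constraints must hold for the pair to be linked.
        if identity >= min_identity and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for s in parent:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Toy identifiers and scores, invented for illustration only.
    toy_hits = [
        ("A", "B", 55.0, 95.0),  # accepted
        ("B", "C", 45.0, 92.0),  # accepted, so A, B and C end up together
        ("C", "D", 80.0, 50.0),  # rejected: coverage too low
    ]
    print(cluster_sequences(["A", "B", "C", "D"], toy_hits))
```

Requiring high alignment coverage in addition to identity is what keeps a short single-domain sequence from being linked to a long multi-domain protein that merely contains the same domain.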

Relevance:

60.00%

Publisher:

Abstract:

Motivation: A current issue of great interest, from both a theoretical and an applied perspective, is the analysis of biological sequences to disclose the information that they encode. The development of new technologies for genome sequencing in recent years has opened new fundamental problems, since huge amounts of biological data still await interpretation. Indeed, sequencing is only the first step of the genome annotation process, which consists of assigning biological information to each sequence. Hence, given the large amount of available data, in silico methods have become useful and necessary in order to extract relevant information from sequences. The availability of data from Genome Projects has given rise to new strategies for tackling the basic problems of computational biology, such as the determination of the three-dimensional structures of proteins, their biological function and their reciprocal interactions. Results: The aim of this work was the implementation of predictive methods that allow the extraction of information on the properties of genomes and proteins starting from the nucleotide and amino acid sequences, by taking advantage of the information provided by the comparison of the genome sequences of different species. In the first part of the work, a comprehensive large-scale genome comparison of 599 organisms is described. 2.6 million sequences from 551 prokaryotic and 48 eukaryotic genomes were aligned and clustered on the basis of their sequence identity. This procedure led to the identification of classes of proteins that are peculiar to the different groups of organisms. Moreover, the adopted similarity threshold produced clusters that are homogeneous from a structural point of view and that can be used for structural annotation of uncharacterized sequences.
The second part of the work focuses on the characterization of thermostable proteins and on the development of tools able to predict the thermostability of a protein starting from its sequence. By means of Principal Component Analysis, the codon composition of a non-redundant database comprising 116 prokaryotic genomes was analyzed, and it was shown that a cross-genomic approach allows the extraction of common determinants of thermostability at the genome level, leading to an overall accuracy of 95% in discriminating thermophilic coding sequences. This result outperforms those obtained in previous studies. Moreover, we investigated the effect of multiple mutations on protein thermostability. This issue is of great importance in the field of protein engineering, since thermostable proteins are generally more suitable than their mesostable counterparts in technological applications. A Support Vector Machine based method was trained to predict whether a set of mutations can enhance the thermostability of a given protein sequence. The developed predictor achieves 88% accuracy.
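As a rough illustration of what "codon composition" means as an input to an analysis such as PCA, the sketch below turns a coding sequence into a 64-dimensional vector of relative codon frequencies. It is an assumption-laden example, not the procedure used in the thesis: the function name, the handling of ambiguous codons and the toy sequence are all hypothetical.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so every sequence maps to the same axes.
CODONS = ["".join(triplet) for triplet in product("ACGT", repeat=3)]


def codon_frequencies(cds):
    """Return relative codon frequencies of a coding sequence as a 64-item list.

    Assumes the sequence starts in frame; trailing bases that do not fill a
    codon and codons with non-ACGT symbols are ignored (illustrative choice).
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in CODONS)
    return [counts[c] / total if total else 0.0 for c in CODONS]


if __name__ == "__main__":
    # Toy sequence (ATG AAA AAA TAA), invented for illustration.
    vector = codon_frequencies("ATGAAAAAATAA")
    print(vector[CODONS.index("AAA")])  # 0.5
```

Stacking one such vector per genome or per coding sequence gives the kind of matrix on which a principal component analysis can then be run.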

Relevance:

60.00%

Publisher:

Abstract:

Hepatitis B x protein (HBx) is a non-structural, multifunctional protein of hepatitis B virus (HBV) that modulates a variety of host processes. Due to its transcriptional activity, which can alter the expression of growth-control genes, it has been implicated in hepatocarcinogenesis. Increased expression of HBx has been reported in liver tissue samples of hepatocellular carcinoma (HCC), and a specific anti-HBx immune response can be detected in the peripheral blood of patients with chronic HBV. However, its role and extent have not yet been clarified. Thus, we performed a cross-sectional analysis of the anti-HBx specific T cell response in HBV-infected patients at different stages of disease. A total of 70 HBV-infected subjects were evaluated: 15 affected by chronic hepatitis (CH, median age 45 yrs), 14 by cirrhosis (median age 55 yrs), 11 with dysplastic nodules (median age 64 yrs), 15 with HCC (median age 60 yrs) and 15 with IC (median age 53 yrs). All patients were infected by virus genotype D with different levels of HBV viremia and most of them (91%) were HBeAb positive. The HBx-specific T cell response was evaluated by anti-Interferon (IFN)-gamma Elispot assay after in vitro stimulation of peripheral blood mononuclear cells, using 20 overlapping synthetic peptides covering the entire HBx protein sequence. HBx-specific IFN-gamma-secreting T cells were found in 6 out of 15 patients with chronic hepatitis (40%), in 3 out of 14 with cirrhosis (21%), in 5 out of 11 with cirrhosis with macronodules (54%), and in 10 out of 15 HCC patients (67%). The number of responding patients was significantly higher in HCC than in IC (p=0.02) and cirrhosis (p=0.02). A specific central region of protein x was preferentially recognized, between 86-88 peptides. 
The HBx response does not correlate with clinical features of the disease (AFP, MELD). The HBx-specific T-cell response seems to increase with progression of the disease, being higher in subjects with dysplastic or neoplastic lesions, and may represent an additional tool to monitor patients at high risk of developing HCC.

Relevance:

60.00%

Publisher:

Abstract:

The development of Next Generation Sequencing has brought Biology into the Big Data era. The ever-increasing gap between proteins with known sequences and those with a complete functional annotation requires computational methods for automatic structural and functional annotation. My research has focused on proteins and has so far led to the development of three novel tools, DeepREx, E-SNPs&GO and ISPRED-SEQ, based on Machine and Deep Learning approaches. DeepREx computes the solvent exposure of residues in a protein chain. This problem is relevant for the definition of structural constraints regarding the possible folding of the protein. DeepREx exploits Long Short-Term Memory layers to capture residue-level interactions between positions distant in the sequence, achieving state-of-the-art performance. With DeepREx, I conducted a large-scale analysis investigating the relationship between the solvent exposure of a residue and its probability of being pathogenic upon mutation. E-SNPs&GO predicts the pathogenicity of a Single Residue Variation. Variations occurring in a protein sequence can have different effects, possibly leading to the onset of diseases. E-SNPs&GO exploits protein embeddings generated by two novel Protein Language Models (PLMs), as well as a new way of representing functional information coming from the Gene Ontology. The method achieves state-of-the-art performance and is extremely time-efficient when compared to traditional approaches. ISPRED-SEQ predicts the presence of Protein-Protein Interaction sites in a protein sequence. Knowing how a protein interacts with other molecules is crucial for accurate functional characterization. ISPRED-SEQ exploits a convolutional layer to parse local context after embedding the protein sequence with two novel PLMs, greatly surpassing the current state-of-the-art. All methods are published in international journals and are available as user-friendly web servers. 
They have been developed keeping in mind standard guidelines for FAIRness (FAIR: Findable, Accessible, Interoperable, Reusable) and are integrated into the public collection of tools provided by ELIXIR, the European infrastructure for Bioinformatics.

Relevance:

30.00%

Publisher:

Abstract:

Beet necrotic yellow vein virus (BNYVV), the leading infectious agent affecting sugar beet, is among the soil-borne viruses transmitted by plasmodiophorids such as Polymyxa betae. BNYVV is the causal agent of Rhizomania, which induces abnormal rootlet proliferation and is widespread in the sugar beet growing areas of Europe, Asia and America; for a review see (Peltier et al., 2008). On the latter continent, Beet soil-borne mosaic virus (BSBMV) has been identified (Lee et al., 2001); it belongs to the genus Benyvirus together with BNYVV, and both are vectored by P. betae. BSBMV is widely distributed only in the United States and has not yet been reported in other countries. It was first identified in Texas as a sugar beet virus morphologically similar to but serologically distinct from BNYVV. Subsequent sequence analysis of BSBMV RNAs revealed a genomic organization similar to that of BNYVV but sufficient molecular differences to distinguish BSBMV and BNYVV as two different species (Rush et al., 2003). Benyvirus field isolates usually consist of four RNA species, but some BNYVV isolates contain a fifth RNA. RNA-1 contains a single long ORF encoding a polypeptide that shares amino acid homology with known viral RNA-dependent RNA polymerases (RdRp) and helicases. RNA-2 contains six ORFs: the capsid protein (CP), one readthrough protein, the triple gene block proteins (TGB) that are required for cell-to-cell virus movement, and a sixth 14 kDa ORF that is a post-transcriptional gene silencing suppressor. RNA-3 is involved in disease symptoms and is essential for virus systemic movement. BSBMV RNA-3 can be trans-replicated and trans-encapsidated by the BNYVV helper strain (RNA-1 and -2) (Ratti et al., 2009). BNYVV RNA-4 encodes a 31 kDa protein and is essential for vector interactions and virus transmission by P. betae (Rahim et al., 2007). BNYVV RNA-5 encodes a 26 kDa protein that enhances virus infection and accumulation in the hosts.
We are interested in the effect of BSBMV on Rhizomania, which we study using powerful tools such as full-length infectious cDNA clones. B-type full-length infectious cDNA clones are available (Quillet et al., 1989), as well as A/P-type RNA-3, -4 and -5 from BNYVV (unpublished). A-type BNYVV full-length clones are also available, but the RNA-1 cDNA clone still needs to be modified. During the PhD program, we started production of BSBMV full-length cDNA clones and investigated molecular interactions between plant and Benyviruses, exploiting biological, epidemiological and molecular similarities/divergences between BSBMV and BNYVV. During my PhD research we obtained full-length infectious cDNA clones of BSBMV RNA-1 and -2 and demonstrated that their transcripts are replicated and packaged in planta and are able to substitute for BNYVV RNA-1 or RNA-2 in a chimeric viral progeny (BSBMV RNA-1 + BNYVV RNA-2 or BNYVV RNA-1 + BSBMV RNA-2). During production of the BSBMV full-length cDNA clones, an unexpected 1,730 nt long form of BSBMV RNA-4 was detected in sugar beet roots grown on BSBMV-infected soil. Sequence analysis of the new BSBMV RNA-4 form revealed high identity (~100%) with the published version of the BSBMV RNA-4 sequence (NC_003508) between nucleotides 1-608 and 1,138-1,730; however, the new form shows 528 additional nucleotides between positions 608-1,138 (FJ424610). Two putative ORFs have been identified: the first one (nucleotides 383 to 1,234) encodes a protein with a predicted mass of 32 kDa (p32) and the second one (nucleotides 885 to 1,244) expresses an expected product of 13 kDa (p13). As for BSBMV RNA-3 (Ratti et al., 2009), the full-length BSBMV RNA-4 cDNA clone made it possible to obtain infectious transcripts that the BNYVV viral machinery (Stras12) is able to replicate and encapsidate in planta. Moreover, we demonstrated that BSBMV RNA-4 can substitute for BNYVV RNA-4 for efficient transmission through the vector P. betae in Beta vulgaris plants, demonstrating a very high correlation between BNYVV and BSBMV.
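The relation between the ORF coordinates and the predicted protein masses quoted above can be checked with a back-of-the-envelope calculation. The sketch below is only a rough estimate under stated assumptions: coordinates are taken as inclusive, the last codon of each ORF is assumed to be a stop codon, and an average residue mass of about 110 Da is used as a rule of thumb. The function name is hypothetical, and the abstract does not say how its predicted masses were obtained.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # rule-of-thumb average, an assumption


def orf_summary(start, end):
    """Return (length in nt, number of residues, approximate mass in kDa).

    Assumes inclusive 1-based coordinates and that the final codon is a
    stop codon, which is not translated.
    """
    length_nt = end - start + 1
    if length_nt % 3 != 0:
        raise ValueError("ORF length is not a multiple of three")
    residues = length_nt // 3 - 1
    mass_kda = residues * AVERAGE_RESIDUE_MASS_DA / 1000.0
    return length_nt, residues, mass_kda


if __name__ == "__main__":
    for name, start, end in [("p32", 383, 1234), ("p13", 885, 1244)]:
        nt, aa, kda = orf_summary(start, end)
        print(f"{name}: {nt} nt, {aa} residues, ~{kda:.1f} kDa")
    # p32: 852 nt, 283 residues, ~31.1 kDa
    # p13: 360 nt, 119 residues, ~13.1 kDa
```

Under these assumptions both estimates fall close to the 32 kDa and 13 kDa values given in the abstract; an exact figure would require the actual amino acid composition.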
At the same time, using the BNYVV helper strain, we studied the expression of the BSBMV RNA-4 proteins in planta. We associated a local necrotic lesion phenotype with p32 protein expression in mechanically inoculated C. quinoa. Flag- or GFP-tagged sequences of p32 and p13 were expressed in a viral context, using Rep3 replicons based on BNYVV RNA-3. Western blot analyses of local lesion contents, using a FLAG-specific antibody, revealed a high molecular weight protein, which suggests either a strong interaction of the BSBMV RNA-4 protein with host protein(s) or post-translational modifications. GFP-fusion sequences permitted the subcellular localization of the BSBMV RNA-4 proteins. Moreover, we demonstrated the absence of self-activation domains on p32 by yeast two-hybrid approaches. We also confirmed that the p32 protein is essential for virus transmission by P. betae using the BNYVV helper strain and BNYVV RNA-3, and we investigated its role by using different deleted forms of the p32 protein. Serial mechanical inoculations of wild-type BSBMV on C. quinoa plants were performed every 7 days. A deleted form of BSBMV RNA-4 (1298 bp) appeared after 14 passages, and its sequence analysis shows a deletion of 433 nucleotides between positions 611 and 1044 of the new RNA-4 form. We demonstrated that this deleted form cannot support transmission by P. betae using the BNYVV helper strain and BNYVV RNA-3; moreover, we confirmed our hypothesis that the BSBMV RNA-4 described by Lee et al. (2001) is a deleted form. Interestingly, after 21 passages we identified one chimeric form of BSBMV RNA-4 and BSBMV RNA-3 (1146 bp). Two putative ORFs have been identified in its sequence: the first one (nucleotides 383 to 562) encodes a protein with a predicted mass of 7 kDa (p7), corresponding to the N-terminus of the p32 protein encoded by BSBMV RNA-4; the second one (nucleotides 562 to 789) expresses an expected product of 9 kDa (p9), corresponding to the C-terminus of p29 encoded by BSBMV RNA-3.
The results of our research on this topic have opened new research lines that our laboratories will develop in the near future. In particular, BSBMV p32 and its mutated forms will be used to identify factors, such as host or vector protein(s), involved in virus transmission through P. betae. The new results could allow the selection or production of sugar beet plants able to prevent virus transmission and thus to reduce the viral inoculum in the soil.