22 resultados para protein sequence classification

em University of Queensland eSpace - Australia


Relevância:

100.00% 100.00%

Publicador:

Resumo:

CyBase is a curated database and information source for backbone-cyclized proteins. The database incorporates naturally occurring cyclic proteins as well as synthetic derivatives, grafted analogues and acyclic permutants. The database provides a centralized repository of information on all aspects of cyclic protein biology and addresses issues pertaining to the management and searching of topologically circular sequences. The database is freely available at http://research.imb.uq.edu.au/cybase.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this study, we propose a novel method to predict the solvent accessible surface areas of transmembrane residues. For both transmembrane alpha-helix and beta-barrel residues, the correlation coefficients between the predicted and observed accessible surface areas are around 0.65. On the basis of predicted accessible surface areas, residues exposed to the lipid environment or buried inside a protein can be identified by using certain cutoff thresholds. We have extensively examined our approach based on different definitions of accessible surface areas and a variety of sets of control parameters. Given that experimentally determining the structures of membrane proteins is very difficult and membrane proteins are actually abundant in nature, our approach is useful for theoretically modeling membrane protein tertiary structures, particularly for modeling the assembly of transmembrane domains. This approach can be used to annotate the membrane proteins in proteomes to provide extra structural and functional information.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Background: Protein tertiary structure can be partly characterized via each amino acid's contact number measuring how residues are spatially arranged. The contact number of a residue in a folded protein is a measure of its exposure to the local environment, and is defined as the number of C-beta atoms in other residues within a sphere around the C-beta atom of the residue of interest. Contact number is partly conserved between protein folds and thus is useful for protein fold and structure prediction. In turn, each residue's contact number can be partially predicted from primary amino acid sequence, assisting tertiary fold analysis from sequence data. In this study, we provide a more accurate contact number prediction method from protein primary sequence. Results: We predict contact number from protein sequence using a novel support vector regression algorithm. Using protein local sequences with multiple sequence alignments (PSI-BLAST profiles), we demonstrate a correlation coefficient between predicted and observed contact numbers of 0.70, which outperforms previously achieved accuracies. Including additional information about sequence weight and amino acid composition further improves prediction accuracies significantly with the correlation coefficient reaching 0.73. If residues are classified as being either contacted or non-contacted, the prediction accuracies are all greater than 77%, regardless of the choice of classification thresholds. Conclusion: The successful application of support vector regression to the prediction of protein contact number reported here, together with previous applications of this approach to the prediction of protein accessible surface area and B-factor profile, suggests that a support vector regression approach may be very useful for determining the structure-function relation between primary sequence and higher order consecutive protein structural and functional properties.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

With the completion of the human and mouse genome sequences, the task now turns to identifying their encoded transcripts and assigning gene function. In this study, we have undertaken a computational approach to identify and classify all of the protein kinases and phosphatases present in the mouse gene complement. A nonredundant set of these sequences was produced by mining Ensembl gene predictions and publicly available cDNA sequences with a panel of InterPro domains. This approach identified 561 candidate protein kinases and 162 candidate protein phosphatases. This cohort was then analyzed using TribeMCL protein sequence similarity clustering followed by CLUSTALV alignment and hierarchical tree generation. This approach allowed us to (1) distinguish between true members of the protein kinase and phosphatase families and enzymes of related biochemistry, (2) determine the structure of the families, and (3) suggest functions for previously uncharacterized members. The classifications obtained by this approach were in good agreement with previous schemes and allowed us to demonstrate domain associations with a number of clusters. Finally, we comment on the complementary nature of cDNA and genome-based gene detection and the impact of the FANTOM2 transcriptome project.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The polypeptide backbones and side chains of proteins are constantly moving due to thermal motion and the kinetic energy of the atoms. The B-factors of protein crystal structures reflect the fluctuation of atoms about their average positions and provide important information about protein dynamics. Computational approaches to predict thermal motion are useful for analyzing the dynamic properties of proteins with unknown structures. In this article, we utilize a novel support vector regression (SVR) approach to predict the B-factor distribution (B-factor profile) of a protein from its sequence. We explore schemes for encoding sequences and various settings for the parameters used in SVR. Based on a large dataset of high-resolution proteins, our method predicts the B-factor distribution with a Pearson correlation coefficient (CC) of 0.53. In addition, our method predicts the B-factor profile with a CC of at least 0.56 for more than half of the proteins. Our method also performs well for classifying residues (rigid vs. flexible). For almost all predicted B-factor thresholds, prediction accuracies (percent of correctly predicted residues) are greater than 70%. These results exceed the best results of other sequence-based prediction methods. (C) 2005 Wiley-Liss, Inc.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Structural similarity among proteins is reflected in the distribution of hydropathicity along the amino acids in the protein sequence. Similarities in the hydropathy distributions are obvious for homologous proteins within a protein family. They also were observed for proteins with related structures, even when sequence similarities were undetectable. Here we present a novel method that employs the hydropathy distribution in proteins for identification of (sub)families in a set of (homologous) proteins. We represent proteins as points in a generalized hydropathy space, represented by vectors of specifically defined features. The features are derived from hydropathy of the individual amino acids. Projection of this space onto principal axes reveals groups of proteins with related hydropathy distributions. The groups identified correspond well to families of structurally and functionally related proteins. We found that this method accurately identifies protein families in a set of proteins, or subfamilies in a set of homologous proteins. Our results show that protein families can be identified by the analysis of hydropathy distribution, without the need for sequence alignment. (C) 2005 Wiley-Liss, Inc.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Membrane organization describes the orientation of a protein with respect to the membrane and can be determined by the presence, or absence, and organization within the protein sequence of two features: endoplasmic reticulum signal peptides and alpha-helical transmembrane domains. These features allow protein sequences to be classified into one of five membrane organization categories: soluble intracellular proteins, soluble secreted proteins, type I membrane proteins, type II membrane proteins, and multi- spanning membrane proteins. Generation of protein isoforms with variable membrane organizations can change a protein's subcellular localization or association with the membrane. Application of MemO, a membrane organization annotation pipeline, to the FANTOM3 Isoform Protein Sequence mouse protein set revealed that within the 8,032 transcriptional units ( TUs) with multiple protein isoforms, 573 had variation in their use of signal peptides, 1,527 had variation in their use of transmembrane domains, and 615 generated protein isoforms from distinct membrane organization classes. The mechanisms underlying these transcript variations were analyzed. While TUs were identified encoding all pairwise combinations of membrane organization categories, the most common was conversion of membrane proteins to soluble proteins. Observed within our highconfidence set were 156 TUs predicted to generate both extracellular soluble and membrane proteins, and 217 TUs generating both intracellular soluble and membrane proteins. The differential use of endoplasmic reticulum signal peptides and transmembrane domains is a common occurrence within the variable protein output of TUs. The generation of protein isoforms that are targeted to multiple subcellular locations represents a major functional consequence of transcript variation within the mouse transcriptome.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

We completed the genome sequence of Lettuce necrotic yellows virus (LNYV) by determining the nucleotide sequences of the 4a (putative phosphoprotein), 4b, M (matrix protein), G (glycoprotein) and L (polymerase) genes. The genome consists of 12,807 nucleotides and encodes six genes in the order 3' leader-N-4a(P)-4b-M-G-L-5' trailer. Sequences were derived from clones of a cDNA library from LNYV genomic RNA and from fragments amplified using reverse transcription-polymerase chain reaction. The 4a protein has a low isoelectric point characteristic for rhabdovirus phosphoproteins. The 4b protein has significant sequence similarities with the movement proteins of capillo- and trichoviruses and may be involved in cell-to-cell movement. The putative G protein sequence contains a predicted 25 amino acids signal peptide and endopeptidase cleavage site, three predicted glycosylation sites and a putative transmembrane domain. The deduced L protein sequence shows similarities with the L proteins of other plant rhabdoviruses and contains polymerase module motifs characteristic for RNA-dependent RNA polymerases of negative-strand RNA viruses. Phylogenetic analysis of this motif among rhabdoviruses placed LNYV in a group with other sequenced cytorhabdoviruses, most closely related to Strawberry crinkle virus. (c) 2005 Elsevier B.V. All rights reserved.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Background: The residue-wise contact order (RWCO) describes the sequence separations between the residues of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered as a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable important information to reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. Results: We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences. Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance with the CC to 0.57 and an RMSE of 0.79. In addition, combining the predicted secondary structure by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and RMSE of 0.78, which provided at least comparable performance compared with the other existing methods. Conclusion: The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

With the sequencing and annotation of genomes and transcriptomes of several eukaryotes, the importance of noncoding RNA (ncRNA)-RNA molecules that are not translated to protein products-has become more evident. A subclass of ncRNA transcripts are encoded by highly regulated, multi-exon, transcriptional units, are processed like typical protein-coding mRNAs and are increasingly implicated in regulation of many cellular functions in eukaryotes. This study describes the identification of candidate functional ncRNAs from among the RIKEN mouse full-length cDNA collection, which contains 60,770 sequences, by using a systematic computational filtering approach. We initially searched for previously reported ncRNAs and found nine murine ncRNAs and homologs of several previously described nonmouse ncRNAs. Through our computational approach to filter artifact-free clones that lack protein coding potential, we extracted 4280 transcripts as the largest-candidate set. Many clones in the set had EST hits, potential CpG islands surrounding the transcription start sites, and homologies with the human genome. This implies that many candidates are indeed transcribed in a regulated manner. Our results demonstrate that ncRNAs are a major functional subclass of processed transcripts in mammals.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Sulfadoxine is predominantly used in combination with pyrimethamine, commonly known as Fansidar, for the treatment of Plasmodium falciparum. This combination is usually less effective against Plasmodium vivax, probably due to the innate refractoriness of parasites to the sulfadoxine component. To investigate this mechanism of resistance by P. vivax to sulfadoxine, we cloned and sequenced the P. vivax dhps (pvdhps) gene. The protein sequence was determined, and three-dimensional homology models of dihydropteroate synthase (DHPS) from P. vivax as well as P. falciparum were created. The docking of sulfadoxine to the two DHPS models allowed us to compare contact residues in the putative sulfadoxine-binding site in both species. The predicted sulfadoxine-binding sites between the species differ by one residue, V585 in P. vivax, equivalent to A613 in P. falciparum. V585 in P. vivax is predicted by energy minimization to cause a reduction in binding of sulfadoxine to DHPS in P. vivax compared to P. falciparum. Sequencing dhps genes from a limited set of geographically different P. vivax isolates revealed that V585 was present in all of the samples, suggesting that V585 may be responsible for innate resistance of P. vivax to sulfadoxine. Additionally, amino acid mutations were observed in some P. vivax isolates in positions known to cause resistance in P. falciparum, suggesting that, as in P. falciparum, these mutations are responsible for acquired increases in resistance of P. vivax to sulfadoxine.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

A key component of the venom of many Australian snakes belonging to the elapid family is a toxin that is structurally and functionally similar to that of the mammalian prothrombinase complex. In mammals, this complex is responsible for the cleavage of prothrombin to thrombin and is composed of factor Xa in association with its cofactors calcium, phospholipids, and factor Va. The snake prothrombin activators have been classified on the basis of their requirement for cofactors for activity. The two major subgroups described in Australian elapid snakes, groups C and D, are differentiated by their requirement for mammalian coagulation factor Va. In this study, we describe the cloning, characterization, and comparative analysis of the factor X- and factor V-like components of the prothrombin activators from the venom glands of snakes possessing either group C or D prothrombin activators. The overall domain arrangement in these proteins was highly conserved between all elapids and with the corresponding mammalian clotting factors. The deduced protein sequence for the factor X-like protease precursor, identified in elapids containing either group C or D prothrombin activators, demonstrated a remarkable degree of relatedness to each other (80%-97%). The factor V-like component of the prothrombin activator, present only in snakes containing group C complexes, also showed a very high degree of homology (96%-98%). Expression of both the factor X- and factor V-like proteins determined by immunoblotting provided an additional means of separating these two groups at the molecular level. The molecular phylogenetic analysis described here represents a new approach for distinguishing group C and D snake prothrombin activators and correlates well with previous classifications.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Orphan nuclear receptors: therapeutic opportunities in skeletal muscle. Am J Physiol Cell Physiol 291: C203-C217, 2006; doi: 10.1152/ajpcell. 00476.2005.-Nuclear hormone receptors (NRs) are ligand-dependent transcription factors that bind DNA and translate physiological signals into gene regulation. The therapeutic utility of NRs is underscored by the diversity of drugs created to manage dysfunctional hormone signaling in the context of reproductive biology, inflammation, dermatology, cancer, and metabolic disease. For example, drugs that target nuclear receptors generate over $10 billion in annual sales. Almost two decades ago, gene products were identified that belonged to the NR superfamily on the basis of DNA and protein sequence identity. However, the endogenous and synthetic small molecules that modulate their action were not known, and they were denoted orphan NRs. Many of the remaining orphan NRs are highly enriched in energy-demanding major mass tissues, including skeletal muscle, brown and white adipose, brain, liver, and kidney. This review focuses on recently adopted and orphan NR function in skeletal muscle, a tissue that accounts for similar to 35% of the total body mass and energy expenditure, and is a major site of fatty acid and glucose utilization. Moreover, this lean tissue is involved in cholesterol efflux and secretes that control energy expenditure and adiposity. Consequently, muscle has a significant role in insulin sensitivity, the blood lipid profile, and energy balance. Accordingly, skeletal muscle plays a considerable role in the progression of dyslipidemia, diabetes, and obesity. These are risk factors for cardiovascular disease, which is the the foremost cause of global mortality (> 16.7 million deaths in 2003). Therefore, it is not surprising that orphan NRs and skeletal muscle are emerging as therapeutic candidates in the battle against dyslipidemia, diabetes, obesity, and cardiovascular disease.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Large-scale gene discovery has been performed for the grass fungal endophytes Neotyphodium coenophialum, Neotyphodium lolii, and Epichloe festucae. The resulting sequences have been annotated by comparison with public DNA and protein sequence databases and using intermediate gene ontology annotation tools. Endophyte sequences have also been analysed for the presence of simple sequence repeat and single nucleotide polymorphism molecular genetic markers. Sequences and annotation are maintained within a MySQL database that may be queried using a custom web interface. Two cDNA-based microarrays have been generated from this genome resource, They permit the interrogation of 3806 Neotyphodium genes (Nchip (TM) rnicroarray), and 4195 Neotyphodium and 920 Epichloe genes (EndoChip (TM) microarray), respectively. These microarrays provide tools for high-throughput transcriptome analysis, including genome-specific gene expression studies, profiling of novel endophyte genes, and investigation of the host grass-symbiont interaction. Comparative transcriptome analysis in Neotyphodium and Epichloe was performed. (c) 2006 Elsevier