353 results for Banach Sequence Space
in Indian Institute of Science - Bangalore - India
Abstract:
Over the past two decades, many ingenious efforts have been made in protein remote homology detection. Because homologous proteins often diversify extensively in sequence, it is challenging to demonstrate such relatedness through entirely sequence-driven searches. Here, we describe a computational method for the generation of 'protein-like' sequences that serves to bridge gaps in protein sequence space. Sequence profile information, as embodied in a position-specific scoring matrix of multiply aligned sequences of bona fide family members, serves as the starting point in this algorithm. The observed amino acid propensity and the selection of a random number dictate the selection of a residue for each position in the sequence. In a systematic manner, and by applying a 'roulette-wheel' selection approach at each position, we generate parent family-like sequences and thus facilitate an enlargement of sequence space around the family. When generated for a large number of families, we demonstrate that they expand the utility of natural intermediately related sequences in linking distant proteins. In 91% of the assessed examples, inclusion of designed sequences improved fold coverage by 5-10% over searches made in their absence. Furthermore, with several examples from proteins adopting folds such as TIM, globin, lipocalin and others, we demonstrate that including designed sequences in a database positively sensitized methods such as PSI-BLAST and Cascade PSI-BLAST, and that it is a promising opportunity for enormously improved remote homology recognition using sequence information alone.
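The per-position roulette-wheel step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary representation of a profile column, and the three-position toy profile are all assumptions, and a real position-specific scoring matrix would first have to be converted to non-negative propensities.

```python
import random


def roulette_pick(propensities, rng):
    """Pick one residue with probability proportional to its propensity.

    `propensities` maps residue letters to non-negative weights
    (hypothetical stand-in for one profile column).
    """
    total = sum(propensities.values())
    threshold = rng.random() * total  # where the "wheel" stops
    running = 0.0
    for residue, weight in propensities.items():
        running += weight
        if threshold < running:
            return residue
    return residue  # guard against floating-point round-off


def design_sequence(profile, rng):
    """Build one sequence by spinning the wheel once per position."""
    return "".join(roulette_pick(column, rng) for column in profile)


# Hypothetical three-position profile, for illustration only.
toy_profile = [
    {"A": 0.7, "G": 0.3},
    {"L": 1.0},
    {"K": 0.5, "R": 0.5},
]
designed = design_sequence(toy_profile, random.Random(0))
```

Calling `design_sequence` repeatedly with the same profile yields a family of sequences whose composition at each position follows the observed propensities, which is the sense in which sequence space around a family is enlarged.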
Abstract:
Protein functional annotation relies on the identification of accurate relationships, sequence divergence being a key factor. This is especially evident when distant protein relationships are demonstrated only with three-dimensional structures. To address this challenge, we describe a computational approach to purposefully bridge gaps between related protein families through directed design of protein-like "linker" sequences. For this, we represented SCOP domain families, integrated with sequence homologues, as multiple profiles and performed HMM-HMM alignments between related domain families. Where convincing alignments were achieved, we applied a roulette wheel-based method to design 3,611,010 protein-like sequences corresponding to 374 SCOP folds. To analyze their ability to link proteins in homology searches, we used 3024 queries to search two databases, one containing only natural sequences and another additionally containing designed sequences. Searches in the augmented database showed up to 30% improvement in fold coverage for over 74% of the folds, with 52 folds achieving all theoretically possible connections. Although sequences could not be designed between some families, the availability of designed sequences between other families within the fold established the sequence continuum to demonstrate 373 difficult relationships. Ultimately, as a practical and realistic extension, we demonstrate that such protein-like sequences can be "plugged into" routine and generic sequence database searches to empower not only remote homology detection but also fold recognition. Our findings, which are well supported statistically, show that complementary searches in both databases will increase the effectiveness of sequence-based searches in recognizing all homologues sharing a common fold. (C) 2013 Elsevier Ltd. All rights reserved.
Abstract:
The notion of optimization is inherent in protein design. A long linear chain of twenty types of amino acid residues is known to fold to a 3-D conformation that minimizes the combined inter-residue energy interactions. There are two distinct protein design problems, viz. predicting the folded structure from a given sequence of amino acid monomers (folding problem) and determining a sequence for a given folded structure (inverse folding problem). These two problems closely resemble engineering structural analysis and structural optimization problems, respectively. In the folding problem, a protein chain with a given sequence folds to a conformation, called a native state, which has a unique global minimum energy value when compared to all other unfolded conformations. This involves a search in the conformation space. This is somewhat akin to the principle of minimum potential energy that determines the deformed static equilibrium configuration of an elastic structure of given topology, shape, and size that is subjected to certain boundary conditions. In the inverse-folding problem, one has to design a sequence with some objectives (having a specific feature of the folded structure, docking with another protein, etc.) and constraints (sequence being fixed in some portion, a particular composition of amino acid types, etc.) while obtaining a sequence that would fold to the desired conformation satisfying the criteria of folding. This requires a search in the sequence space. This is similar to structural optimization in the design-variable space, wherein a certain feature of structural response is optimized subject to some constraints while satisfying the governing static or dynamic equilibrium equations. Based on this similarity, in this work we apply topology optimization methods to protein design, discuss modeling issues and present some initial results.
Abstract:
Determining the sequence of amino acid residues in a heteropolymer chain of a protein with a given conformation is a discrete combinatorial problem that is not generally amenable to gradient-based continuous optimization algorithms. In this paper we present a new approach to this problem using continuous models. In this modeling, continuous "state functions" are proposed to designate the type of each residue in the chain. Such a continuous model helps define a continuous sequence space in which a chosen criterion is optimized to find the most appropriate sequence. Searching a continuous sequence space using a deterministic optimization algorithm makes it possible to find the optimal sequences with much less computation than many other approaches. The computational efficiency of this method is further improved by combining it with a graph spectral method, which explicitly takes into account the topology of the desired conformation and also helps make the combined method more robust. The continuous modeling used here appears to have additional advantages in mimicking the folding pathways and in creating the energy landscapes that help find sequences with high stability and kinetic accessibility. To illustrate the new approach, a widely used simplifying assumption is made by considering only two types of residues: hydrophobic (H) and polar (P). Self-avoiding compact lattice models are used to validate the method with known results in the literature and data that can be practically obtained by exhaustive enumeration on a desktop computer. We also present examples of sequence design for the HP models of some real proteins, which are solved in less than five minutes on a single-processor desktop computer. Some open issues and future extensions are noted.
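The exhaustive-enumeration baseline mentioned in this abstract can be illustrated on a tiny HP lattice model. The sketch below is not the authors' continuous method: it assumes the common textbook energy of -1 per non-bonded H-H lattice contact, a fixed number of H residues as the composition constraint, and a hypothetical six-residue conformation on a 2x3 square lattice.

```python
from itertools import combinations


def contacts(coords):
    """Return index pairs that are lattice neighbours but not chain neighbours."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):  # skip bonded pairs (j == i + 1)
            (xi, yi), (xj, yj) = coords[i], coords[j]
            if abs(xi - xj) + abs(yi - yj) == 1:
                pairs.append((i, j))
    return pairs


def hp_energy(sequence, coords):
    """Assumed HP energy: -1 for every non-bonded H-H contact."""
    return -sum(
        1 for i, j in contacts(coords)
        if sequence[i] == "H" and sequence[j] == "H"
    )


def best_sequences(coords, n_hydrophobic):
    """Enumerate every sequence with the given H count; keep the lowest energy."""
    n = len(coords)
    scored = []
    for h_sites in combinations(range(n), n_hydrophobic):
        seq = "".join("H" if k in h_sites else "P" for k in range(n))
        scored.append((hp_energy(seq, coords), seq))
    lowest = min(energy for energy, _ in scored)
    return lowest, sorted(seq for energy, seq in scored if energy == lowest)


# Hypothetical compact conformation: a U-shaped walk on a 2x3 lattice.
toy_conformation = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
lowest_energy, winners = best_sequences(toy_conformation, n_hydrophobic=2)
```

The number of candidate sequences grows combinatorially with chain length, which is why enumeration serves only as a validation reference for short chains.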
Abstract:
NrichD
Abstract:
In this paper, we present numerical evidence that supports the notion of minimization in the sequence space of proteins for a target conformation. We use the conformations of real proteins in the Protein Data Bank (PDB) and present computationally efficient methods to identify the sequences with minimum energy. We use an edge-weighted connectivity graph for ranking the residue sites with a reduced amino acid alphabet and then use continuous optimization to obtain the energy-minimizing sequences. Our methods enable the computation of a lower bound as well as a tight upper bound for the energy of a given conformation. We validate our results by using three different inter-residue energy matrices for five proteins from the PDB, and by comparing our energy-minimizing sequences with 80 million diverse sequences that are generated based on different considerations in each case. When we submitted some of our chosen energy-minimizing sequences to the Basic Local Alignment Search Tool (BLAST), we obtained some sequences from the non-redundant protein sequence database that are similar to ours with an E-value of the order of 10^-7. In summary, we conclude that proteins show a trend towards minimizing energy in the sequence space but do not seem to adopt the global energy-minimizing sequence. The reason for this could be either that the existing energy matrices are not able to accurately represent the inter-residue interactions in the context of the protein environment or that Nature does not push the optimization in the sequence space once the protein is able to perform its function.
Abstract:
Background: Signal transduction events often involve transient, yet specific, interactions between structurally conserved protein domains and polypeptide sequences in target proteins. The identification and validation of these associating domains is crucial to understand signal transduction pathways that modulate different cellular or developmental processes. Bioinformatics strategies to extract and integrate information from diverse sources have been shown to facilitate the experimental design to understand complex biological events. These methods, primarily based on information from high-throughput experiments, have also led to the identification of new connections, thus providing hypothetical models for cellular events. Such models, in turn, provide a framework for directing experimental efforts for validating the predicted molecular rationale for complex cellular processes. In this context, it is envisaged that the rational design of peptides for protein-peptide binding studies could substantially facilitate the experimental strategies to evaluate a predicted interaction. This rational design procedure involves the integration of protein-protein interaction data, gene ontology, physico-chemical calculations, domain-domain interaction data and information on functional sites or critical residues. Results: Here we describe an integrated approach called "PeptideMine" for the identification of peptides based on specific functional patterns present in the sequence of an interacting protein. This approach, based on sequence searches in the interacting sequence space, has been developed into a webserver, which can be used for the identification and analysis of peptides, peptide homologues or functional patterns from the interacting sequence space of a protein.
To further facilitate experimental validation, the PeptideMine webserver also provides a list of physico-chemical parameters corresponding to the peptide to determine the feasibility of using the peptide for in vitro biochemical or biophysical studies. Conclusions: The strategy described here involves the integration of data and tools to identify potential interacting partners for a protein and design criteria for peptides based on desired biochemical properties. Alongside the search for interacting protein sequences using three different search programs, the server also provides the biochemical characteristics of candidate peptides to prune peptide sequences based on features that are most suited for a given experiment. The PeptideMine server is available at the URL: http://caps.ncbs.res.in/peptidemine
Abstract:
Convergence of the vast sequence space of proteins into a highly restricted fold/conformational space suggests a simple yet unique underlying mechanism of protein folding that has been the subject of much debate in the last several decades. One of the major challenges related to the understanding of protein folding or in silico protein structure prediction is the discrimination of non-native structures/decoys from the native structure. Applications of knowledge-based potentials to attain this goal have been extensively reported in the literature. Scoring functions based on accessible surface area and amino acid neighbourhood considerations have also been used to discriminate decoys from native structures. In this article, we have explored the potential of protein structure network (PSN) parameters to validate the native proteins against a large number of decoy structures generated by diverse methods. We are guided by two principles: (a) the PSNs capture the local properties from a global perspective and (b) inclusion of non-covalent interactions, at all-atom level, including the side-chain atoms, in the network construction accommodates the sequence-dependent features. Several network parameters, such as the size of the largest cluster, community size and clustering coefficient, are evaluated and scored on the basis of the rank of the native structures and the Z-scores. The network analysis of decoy structures highlights the importance of the global properties contributing to the uniqueness of native structures. The analysis also shows that the network parameters can be used as metrics to identify the native structures and filter out non-native structures/decoys in a large number of data-sets; they thus also have the potential to be used in the protein 'structure prediction' problem.
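Two of the quantities named in this abstract, the clustering coefficient and the rank and Z-score of the native structure, can be sketched as below. This is an illustration under stated assumptions rather than the authors' protocol: the graph is a hypothetical four-node adjacency map instead of an all-atom PSN, a higher score is assumed to be better, and the Z-score is taken relative to the decoy mean using the population standard deviation.

```python
from statistics import mean, pstdev


def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = sorted(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if neighbours[b] in graph[neighbours[a]]
    )
    return links / (k * (k - 1) / 2)


def native_rank_and_z(native_score, decoy_scores):
    """Rank of the native among decoys (1 = best) and its Z-score."""
    rank = 1 + sum(1 for score in decoy_scores if score > native_score)
    z = (native_score - mean(decoy_scores)) / pstdev(decoy_scores)
    return rank, z


# Hypothetical network: a triangle a-b-c with a pendant node d on a.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
average_cc = mean(clustering_coefficient(toy_graph, n) for n in toy_graph)

# Hypothetical scores for one native structure and four decoys.
rank, z_score = native_rank_and_z(0.9, [0.2, 0.4, 0.6, 0.8])
```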
Abstract:
The use of mutagenic drugs to drive HIV-1 past its error threshold presents a novel intervention strategy, as suggested by the quasispecies theory, that may be less susceptible to failure via viral mutation-induced emergence of drug resistance than current strategies. The error threshold of HIV-1, μ_c, however, is not known. Application of the quasispecies theory to determine μ_c poses significant challenges: whereas the quasispecies theory considers the asexual reproduction of an infinitely large population of haploid individuals, HIV-1 is diploid, undergoes recombination, and is estimated to have a small effective population size in vivo. We performed population genetics-based stochastic simulations of the within-host evolution of HIV-1 and estimated the structure of the HIV-1 quasispecies and μ_c. We found that with small mutation rates, the quasispecies was dominated by genomes with few mutations. Upon increasing the mutation rate, a sharp error catastrophe occurred where the quasispecies became delocalized in sequence space. Using parameter values that quantitatively captured data of viral diversification in HIV-1 patients, we estimated μ_c to be 7 × 10^-5 to 1 × 10^-4 substitutions/site/replication, approximately 2-6 fold higher than the natural mutation rate of HIV-1, suggesting that HIV-1 survives close to its error threshold and may be readily susceptible to mutagenic drugs. The latter estimate was weakly dependent on the within-host effective population size of HIV-1. With large population sizes and in the absence of recombination, our simulations converged to the quasispecies theory, bridging the gap between quasispecies theory and population genetics-based approaches to describing HIV-1 evolution. Further, μ_c increased with the recombination rate, rendering HIV-1 less susceptible to error catastrophe, thus elucidating an added benefit of recombination to HIV-1. Our estimate of μ_c may serve as a quantitative guideline for the use of mutagenic drugs against HIV-1.
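For orientation, the error threshold in the simplest quasispecies setting can be computed in closed form. The sketch below uses the textbook single-peak landscape (one master genome with relative fitness `sigma`, all mutants with fitness 1, no back mutation), which is an assumption made here for illustration; it is not the stochastic, recombination-aware simulation described in the abstract, and the parameter values are hypothetical rather than HIV-1 estimates.

```python
import math


def error_threshold(sigma, length):
    """Per-site mutation rate at which sigma * (1 - mu)**length equals 1."""
    return 1.0 - sigma ** (-1.0 / length)


def master_survives(mu, sigma, length):
    """True when the master genome out-replicates its own mutational loss."""
    return sigma * (1.0 - mu) ** length > 1.0


# Hypothetical parameters: fitness advantage e, genome length 1000 sites.
mu_c = error_threshold(math.e, 1000)
```

For long genomes the threshold is close to ln(sigma) / length, so it shrinks as the genome grows and rises only logarithmically with the fitness advantage of the master sequence.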
Abstract:
Protein structure space is believed to consist of a finite set of discrete folds, unlike the protein sequence space, which is astronomically large, indicating that proteins from the available sequence space are likely to adopt one of the many folds already observed. In spite of extensive sequence-structure correlation data, protein structure prediction still remains an open question, with researchers having tried different approaches (experimental as well as computational). One of the challenges of protein structure prediction is to identify the native protein structures from a milieu of decoys/models. In this work, a rigorous investigation of Protein Structure Networks (PSNs) has been performed to detect native structures from decoys/models. Ninety-four parameters obtained from network studies have been optimally combined with Support Vector Machines (SVM) to derive a general metric to distinguish decoys/models from the native protein structures with an accuracy of 94.11%. We recently showed, for the first time in the literature, that PSNs can distinguish native proteins from decoys. A major difference between the present work and the previous study is the exploration of transition profiles at different strengths of non-covalent interactions, which the SVM has indeed identified as an important parameter. Additionally, the SVM-trained algorithm is also applied to the recent CASP10 predicted models. The novelty of the network approach is that it is based on general network properties of native protein structures and that a given model can be assessed independent of any reference structure. Thus, the approach presented in this paper can be valuable in validating the predicted structures. A web-server has been developed for this purpose and is freely available at http://vishgraph.mbu.iisc.ernet.in/GraProStr/PSN-QA.html.
Abstract:
We have analyzed the set of inter and intra base pair parameters for each dinucleotide step in single crystal structures of dodecamers, solved at high and medium resolution and all crystallized in the P2₁2₁2₁ space group. The objective was to identify whether all the structures which have either the Drew-Dickerson (DD) sequence d[CGCGAATTCGCG] with some base modification or a related sequence (non-DD) would display the same sequence-dependent structural variability about the palindromic sequence, despite the molecule being bent at one end because of a similar crystal lattice packing effect. Most of the local doublet parameters for the base pair steps at the G2-C3 and G10-C11 positions, symmetrically situated about the lateral twofold, were significantly correlated between themselves. In non-DD sequences, significant correlations between these positional parameters were absent. The different range of local step parameter values at each sequence position contributed to the gross feature of smooth helix axis bending in all structures. The base pair parameters in some of the positions, for the medium-resolution DD sequence, were quite unlike the high-resolution set and encompassed a higher range of values. Twist and slide are the two main parameters that show a wider conformational range for the middle region of non-DD sequence structures in comparison to DD sequence structures. In contrast, the minor and major groove features bear good resemblance between DD and non-DD sequence crystal structure datasets. The sugar-phosphate backbone torsion angles are similar in all structures, in sharp contrast to base pair parameter variation for high and low resolution DD and non-DD sequence structures, consisting of the unusual (ε = g−, ξ = t) B-II conformation at the 10th position of the dodecamer sequence.
Thus, examining DD and non-DD sequence structures packed in the same crystal lattice arrangement, we infer that inter and intra base pair parameters are as symmetrically equivalent in value as the symmetry-related step for the palindromic DD sequence about the lateral two-fold axis. This feature would lead us to agree with the conclusion that DNA conformation is not substantially affected by end-to-end or lateral inter-molecular interaction due to the crystal lattice packing effect. Non-DD sequence structures acquire step parameter values which reflect the altered sequence at each dodecamer sequence position in the orthorhombic lattice while showing gross features similar to those of DD sequence structures.
Abstract:
Sesbania mosaic virus (SMV) is a plant virus infecting Sesbania grandiflora plants in Andhra Pradesh, India. Amino acid sequences of the tryptic peptides of SMV coat protein were determined using a gas phase sequenator. These sequences showed identical amino acids at 69% of the positions when aligned with the corresponding residues of southern bean mosaic virus (SBMV). Crystals diffracting to better than 3 Å resolution were obtained by precipitating the virus with ammonium sulphate. The crystals belonged to rhombohedral space group R3 with a = 291·4 Å and α = 61·9°. Three-dimensional X-ray diffraction data on these crystals were collected to a resolution of 4·7 Å, using a Siemens-Nicolet area detector system. Self-rotation function studies revealed the icosahedral symmetry of the virus particles, as well as their precise orientation in the unit cell. Cross-rotation function and modelling studies with SBMV showed that it is a valid starting model for SMV structure determination. Low resolution phases computed using a polyalanine model of SBMV were subjected to refinement and extension by real-space electron density averaging and solvent flattening. The final electron density map revealed a polypeptide fold similar to SBMV. The single disulphide bridge of SBMV coat protein is retained in SMV. Four icosahedrally independent cation binding sites have been tentatively identified. Three of these sites, related by a quasi threefold axis, are also found in SBMV. The fourth site is situated on the quasi threefold axis. Aspartic acid residues, which replace Ile218 of SBMV from the quasi threefold-related subunits, are suitable ligands to the cation at this site.
Abstract:
In this paper the question of the extent to which truncated heavy tailed random vectors, taking values in a Banach space, retain the characteristic features of heavy tailed random vectors, is answered from the point of view of the central limit theorem.