926 results for Protein Sequence
Abstract:
Background: The residue-wise contact order (RWCO) describes the sequence separations between a residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive information that is indispensable for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. Results: We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance: local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55 and a root mean square error (RMSE) of 0.82, based on a well-defined dataset of 680 protein sequences.
Moreover, by incorporating global features such as molecular weight and amino acid composition, we could further improve the prediction performance, raising the CC to 0.57 and lowering the RMSE to 0.79. In addition, adding the secondary structure predicted by PSIPRED was found to significantly improve the prediction performance and yielded the best prediction accuracy, with a CC of 0.60 and an RMSE of 0.78, which is at least comparable to the performance of other existing methods. Conclusion: The SVR method shows a prediction performance competitive with, or at least comparable to, the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the view that support vector regression is a powerful tool for extracting the protein sequence-structure relationship and for estimating protein structural profiles from amino acid sequences.
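The quantities named in this abstract can be illustrated with a short sketch. It computes a simplified RWCO (the sum of sequence separations between a residue and its contacts) and the two evaluation measures, CC and RMSE. The hard 8 Å distance cutoff, the minimum separation of 3, the lack of normalization and all function names are assumptions made for illustration, not the authors' exact definition.

```python
import math

def rwco(coords, cutoff=8.0, min_sep=3):
    """Simplified RWCO: for each residue i, sum |i - j| over residues j
    within `cutoff` of i and at least `min_sep` positions away.
    The cutoff and min_sep defaults are illustrative assumptions."""
    n = len(coords)
    values = [0] * n
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                sep = j - i
                values[i] += sep
                values[j] += sep
    return values

def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rmse(xs, ys):
    """Root mean square error between predicted and observed values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))
```

In a study like the one described, the per-residue values from a function such as `rwco` would be the regression targets, and `pearson_cc` and `rmse` would compare predictions against them.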
Abstract:
Exponential growth of genomic data in the last two decades has made manual analyses impractical for all but trial studies. As genomic analyses have become more sophisticated, and move toward comparisons across large datasets, computational approaches have become essential. One of the most important biological questions is to understand the mechanisms underlying gene regulation. Genetic regulation is commonly investigated and modelled through the use of transcriptional regulatory network (TRN) structures. These model the regulatory interactions between two key components: transcription factors (TFs) and the target genes (TGs) they regulate. Transcriptional regulatory networks have proven to be invaluable scientific tools in Bioinformatics. When used in conjunction with comparative genomics, they have provided substantial insights into the evolution of regulatory interactions. Current approaches to regulatory network inference, however, omit two additional key entities: promoters and transcription factor binding sites (TFBSs). In this study, we attempted to explore the relationships among these regulatory components in bacteria. Our primary goal was to identify relationships that can assist in reducing the high false positive rates associated with transcription factor binding site predictions and thereupon enhance the reliability of the inferred transcription regulatory networks. In our preliminary exploration of relationships between the key regulatory components in Escherichia coli transcription, we discovered a number of potentially useful features. The combination of location score and sequence dissimilarity scores increased de novo binding site prediction accuracy by 13.6%. Another important observation made was with regards to the relationship between transcription factors grouped by their regulatory role and corresponding promoter strength. 
Our study of E. coli σ70 promoters found support at the 0.1 significance level for our hypothesis that weak promoters are preferentially associated with activator binding sites to enhance gene expression, whilst strong promoters have more repressor binding sites to repress or inhibit gene transcription. Although the observations were specific to σ70, they nevertheless strongly encourage additional investigations when more experimentally confirmed data are available. In our preliminary exploration of relationships between the key regulatory components in E. coli transcription, we discovered a number of potentially useful features, some of which proved successful in reducing the number of false positives when applied to re-evaluate binding site predictions. Of chief interest was the relationship observed between promoter strength and TFs with respect to their regulatory role. Based on the common assumption that promoter homology positively correlates with transcription rate, we hypothesised that weak promoters would have more transcription factors that enhance gene expression, whilst strong promoters would have more repressor binding sites. The t-tests for E. coli σ70 promoters returned a p-value of 0.072, which at the 0.1 significance level suggested support for our (alternative) hypothesis, although this trend may only be present for promoters where the corresponding TFBSs are either all repressors or all activators. Nevertheless, such suggestive results strongly encourage additional investigations when more experimentally confirmed data become available. Much of the remainder of the thesis concerns a machine learning study of binding site prediction, using the SVM and kernel methods, principally the spectrum kernel.
Spectrum kernels have been successfully applied in previous studies of protein classification [91, 92], as well as the related problem of promoter prediction [59], and we have here successfully applied the technique to refining TFBS predictions. The advantages provided by the SVM classifier were best seen in 'moderately' conserved transcription factor binding sites, as represented by our E. coli CRP case study. Inclusion of additional position feature attributes further increased accuracy by 9.1%, but more notable was the considerable decrease in false positive rate from 0.8 to 0.5 while retaining 0.9 sensitivity. Improved prediction of transcription factor binding sites is in turn extremely valuable in improving inference of regulatory relationships, a problem notoriously prone to false positive predictions. Here, the number of false regulatory interactions inferred using the conventional two-component model was substantially reduced when we integrated de novo transcription factor binding site predictions as an additional criterion for acceptance in a case study of inference in the Fur regulon. This initial work was extended to a comparative study of the iron regulatory system across 20 Yersinia strains. This work revealed interesting, strain-specific differences, especially between pathogenic and non-pathogenic strains. Such differences were made clear through interactive visualisations using the TRNDiff software developed as part of this work, and would have remained undetected using conventional methods. This approach led to the nomination of the Yfe iron-uptake system as a candidate for further wet-lab experimentation due to its potential active functionality in non-pathogens and its known participation in full virulence of the bubonic plague strain. Building on this work, we introduced novel structures we have labelled 'regulatory trees', inspired by the phylogenetic tree concept.
Instead of using gene or protein sequence similarity, the regulatory trees were constructed based on the number of similar regulatory interactions. While the common phylogenetic trees convey information regarding changes in gene repertoire, which we might regard as analogous to 'hardware', the regulatory tree informs us of the changes in regulatory circuitry, in some respects analogous to 'software'. In this context, we explored the 'pan-regulatory network' for the Fur system, the entire set of regulatory interactions found for the Fur transcription factor across a group of genomes. In the pan-regulatory network, emphasis is placed on how the regulatory network for each target genome is inferred from multiple sources instead of a single source, as is the common approach. The benefit of using multiple reference networks is a more comprehensive survey of the relationships and increased confidence in the regulatory interactions predicted. In the present study, we distinguish between relationships found across the full set of genomes, the 'core-regulatory-set', and interactions found only in a subset of the genomes explored, the 'sub-regulatory-set'. We found nine Fur target gene clusters present across the four genomes studied; this core set potentially identifies basic regulatory processes essential for survival. Species-level differences are seen at the sub-regulatory-set level; for example, the known virulence factors YbtA and PchR were found in Y. pestis and P. aeruginosa respectively, but were absent from both E. coli and B. subtilis. Such factors, and the iron-uptake systems they regulate, are ideal candidates for wet-lab investigation to determine whether or not they are pathogen-specific. In this study, we employed a broad range of approaches to address our goals and assessed these methods using the Fur regulon as our initial case study.
We identified a set of promising feature attributes, demonstrated their success in increasing transcription factor binding site prediction specificity while retaining sensitivity, and showed the importance of binding site predictions in enhancing the reliability of regulatory interaction inferences. Most importantly, these outcomes led to the introduction of a range of visualisations and techniques, which are applicable across the entire bacterial spectrum and can be utilised in studies beyond the understanding of transcriptional regulatory networks.
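The spectrum kernel mentioned in this abstract scores two sequences by the k-mers they share: each sequence is mapped to its k-mer counts, and the kernel value is the dot product of the two count vectors. The following is a minimal sketch, assuming k = 3 by default and raw (unnormalised) counts. The function names are illustrative and not taken from the thesis.

```python
from collections import Counter

def spectrum(seq, k):
    """Count every overlapping k-mer in seq (the 'k-spectrum')."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k=3):
    """Dot product of the k-mer count vectors of sequences a and b."""
    counts_a, counts_b = spectrum(a, k), spectrum(b, k)
    # Counter returns 0 for k-mers absent from b, so only shared k-mers contribute.
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())
```

A kernel function of this shape can be supplied to an SVM implementation that accepts precomputed kernel matrices. Whether the thesis normalised the kernel or combined it with the position features it describes is not stated here.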
Abstract:
In the last decade we have come to understand that the growth of cancer cells in general, and of breast cancer in particular, depends in many cases upon growth factors that bind to and activate their receptors. One of these growth factor receptors is the erbB-2 protein, which plays an important role in the prognosis of breast cancer and is overexpressed in nearly 30% of human breast cancer patients. While evidence accumulates to support the relationship between erbB-2 overexpression and poor overall survival in breast cancer, understanding of the biological consequence(s) of erbB-2 overexpression remains elusive. Our recent discovery of gp30 has allowed us to identify a number of related but distinct biological endpoints which appear responsive to signal transduction through the erbB-2 receptor. These endpoints of growth, invasiveness, and differentiation have clear implications for the emergence, maintenance and/or control of malignancy, and represent established endpoints in the assessment of malignant progression in breast cancer. We have shown that gp30 induces a biphasic growth effect on cells with erbB-2 overexpression. We have recently determined the protein sequence of gp30 and cloned its full-length cDNA sequence. We have also cloned two additional forms of the ligand that are believed to be different isoforms. We are currently expressing the different forms in order to determine their biological effects. To elucidate the cellular mechanisms underlying cell growth inhibition by gp30, we tested the effect of this ligand on cell growth and differentiation of human breast cancer cells which overexpress erbB-2 and cells which express low levels of this protooncogene. High concentrations of ligand induced differentiation of cells overexpressing erbB-2, as measured by inhibition of cell growth, increased synthesis of milk components, modulation of E-cadherin, and up-regulation of c-jun and c-fos.
These findings indicate that ligand-induced growth inhibition in cells overexpressing erbB-2 is associated with an apparent induction of differentiation. The availability of gp30-derived synthetic peptides and its full-length cDNAs provides the tools necessary to acquire a better understanding of the mechanism of action of this ligand and the erbB-2 receptor in breast cancer.
Abstract:
Bahia grass, Paspalum notatum, is an important pollen allergen source with a long season of pollination and wide distribution in subtropical and temperate regions. We aimed to characterize the 55 kDa allergen of Bahia grass pollen (BaGP) and ascertain its clinical importance. BaGP extract was separated by 2D-PAGE and immunoblotted with serum IgE of a grass pollen-allergic patient. The amino-terminal protein sequence of the predominant allergen isoform at 55 kDa had similarity with the group 13 allergens of Timothy grass and maize pollen, Phl p 13 and Zea m 13. Four sequences obtained by rapid amplification of the allergen cDNA ends represented multiple isoforms of Pas n 13. The predicted full-length cDNA for Pas n 13 encoded a 423 amino acid glycoprotein including a signal peptide of 28 residues and with a predicted pI of 7.0. Tandem mass spectrometry of tryptic peptides of 2D gel spots identified peptides specific to the deduced amino acid sequence for each of the four Pas n 13 cDNAs, representing 47% of the predicted mature protein sequence of Pas n 13. There was 80.6% and 72.6% amino acid identity with Zea m 13 and Phl p 13, respectively. Reactivity with a Phl p 13-specific monoclonal antibody, AF6, supported designation of this allergen as Pas n 13. The allergen was purified from BaGP extract by ammonium sulphate precipitation, hydrophobic interaction and size exclusion chromatography. Purified Pas n 13 reacted with serum IgE of 34 of 71 (48%) grass pollen-allergic patients and specifically inhibited IgE reactivity with the 55 kDa band of BaGP for two grass pollen-allergic donors. Four isoforms of Pas n 13 from pI 6.3-7.8 had IgE reactivity with grass pollen-allergic sera. The allergenic activity of purified Pas n 13 was demonstrated by activation of basophils from whole blood of three grass pollen-allergic donors tested but not control donors.
Pas n 13 is thus a clinically relevant pollen allergen of the subtropical Bahia grass likely to be important in eliciting seasonal allergic rhinitis and asthma in grass pollen-allergic patients.
Abstract:
Sequence-structure correlation studies are important in deciphering the relationships between various structural aspects, which may shed light on the protein-folding problem. The first step of this process is the prediction of secondary structure for a protein sequence of unknown three-dimensional structure. To this end, a web server has been created to predict the consensus secondary structure using well known algorithms from the literature. Furthermore, the server allows users to see the occurrence of predicted secondary structural elements in other structure and sequence databases and to visualize predicted helices as a helical wheel plot. The web server is accessible at http://bioserver1.physics.iisc.ernet.in/cssp/.
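Two ideas in this abstract lend themselves to a short sketch: combining several per-residue predictions into a consensus, and laying out a predicted helix on a helical wheel. The sketch below assumes a simple majority vote with ties assigned to coil ('C') and an ideal α-helix with 100° of rotation per residue (3.6 residues per turn). The server's actual consensus rule is not described here, and all names are illustrative.

```python
from collections import Counter

def consensus(predictions):
    """Majority vote per residue over equal-length secondary-structure strings
    (H = helix, E = strand, C = coil). Ties fall back to 'C' (an assumption)."""
    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append("C")
        else:
            result.append(ranked[0][0])
    return "".join(result)

def helical_wheel_angles(helix, degrees_per_residue=100):
    """Angular position of each residue on a helical wheel for an ideal alpha-helix."""
    return [(res, (i * degrees_per_residue) % 360) for i, res in enumerate(helix)]
```

Plotting each residue at its angle on a circle gives the familiar wheel view, in which hydrophobic residues clustering on one side suggest an amphipathic helix.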
Abstract:
Large-scale gene discovery has been performed for the grass fungal endophytes Neotyphodium coenophialum, Neotyphodium lolii, and Epichloë festucae. The resulting sequences have been annotated by comparison with public DNA and protein sequence databases and using intermediate gene ontology annotation tools. Endophyte sequences have also been analysed for the presence of simple sequence repeat and single nucleotide polymorphism molecular genetic markers. Sequences and annotation are maintained within a MySQL database that may be queried using a custom web interface. Two cDNA-based microarrays have been generated from this genome resource. They permit the interrogation of 3806 Neotyphodium genes (Nchip™ microarray), and 4195 Neotyphodium and 920 Epichloë genes (EndoChip™ microarray), respectively. These microarrays provide tools for high-throughput transcriptome analysis, including genome-specific gene expression studies, profiling of novel endophyte genes, and investigation of the host grass–symbiont interaction. Comparative transcriptome analysis in Neotyphodium and Epichloë was performed
Abstract:
Aims: To examine the prevalence of bacteriocin production in Streptococcus bovis isolates from Australian ruminants and the feasibility of industrial production of bacteriocin. Methods and Results: Streptococcus bovis strains were tested for production of bacteriocin-like inhibitory substances (BLIS) by antagonism assay against Lactococcus lactis. BLIS production was associated with source animal location (i.e. proximity of other bacteriocin-positive source animals) rather than ruminant species/breed or diet. One bacteriocin showing strong inhibitory activity (Sb15) was isolated and examined. Protein sequence, stability and activity spectrum of this bovicin were very similar to bovicin HC5. Production could be increased through serial culturing, and increased productivity could be partially maintained during cold storage of cultures. Conclusions: BLIS production is geographically widely distributed in Eastern Australia, and it appears that the bacteriocin+ trait is maintained in animals at the same location. The HC5-like bacteriocin, originally identified in North America, is also found in Australia. Production of bacteriocin can be increased through serial culturing. Significance and Impact of the Study: The HC5-like bacteriocins appear to have a broad global distribution. Serial culturing may provide a route towards commercial manufacturing for use in industrial applications, and purified bacteriocin from S. bovis Sb15 could potentially be used to prevent food spoilage or as a feed additive to promote growth in ruminant species.
Abstract:
The work covered in this thesis is focused on the development of technology for bioconversion of glucose into D-erythorbic acid (D-EA) and 5-ketogluconic acid (5-KGA). The task was to show on proof-of-concept level the functionality of the enzymatic conversion or one-step bioconversion of glucose to these acids. The feasibility of both studies to be further developed for production processes was also evaluated. The glucose - D-EA bioconversion study was based on the use of a cloned gene encoding a D-EA forming soluble flavoprotein, D-gluconolactone oxidase (GLO). GLO was purified from Penicillium cyaneo-fulvum and partially sequenced. The peptide sequences obtained were used to isolate a cDNA clone encoding the enzyme. The cloned gene (GenBank accession no. AY576053) is homologous to the other known eukaryotic lactone oxidases and also to some putative prokaryotic lactone oxidases. Analysis of the deduced protein sequence of GLO indicated the presence of a typical secretion signal sequence at the N-terminus of the enzyme. No other targeting/anchoring signals were found, suggesting that GLO is the first known lactone oxidase that is secreted rather than targeted to the membranes of the endoplasmic reticulum or mitochondria. Experimental evidence supports this analysis, as near complete secretion of GLO was observed in two different yeast expression systems. Highest expression levels of GLO were obtained using Pichia pastoris as an expression host. Recombinant GLO was characterised and the suitability of purified GLO for the production of D-EA was studied. Immobilised GLO was found to be rapidly inactivated during D-EA production. The feasibility of in vivo glucose - D-EA conversion using a P. pastoris strain co-expressing the genes of GLO and glucose oxidase (GOD, E.C. 1.1.3.4) of A. niger was demonstrated. The glucose - 5-KGA bioconversion study followed a similar strategy to that used in the D-EA production research. 
The rationale was based on the use of a cloned gene encoding a membrane-bound pyrroloquinoline quinone (PQQ)-dependent gluconate 5-dehydrogenase (GA 5-DH). GA 5-DH was purified to homogeneity from the only source of this enzyme known in the literature, Gluconobacter suboxydans, and partially sequenced. Using the amino acid sequence information, the GA 5-DH gene was cloned from a genomic library of G. suboxydans. The cloned gene was sequenced (GenBank accession no. AJ577472) and found to be an operon of two adjacent genes encoding two subunits of GA 5-DH. It turned out that GA 5-DH is a rather close homologue of a sorbitol dehydrogenase from another G. suboxydans strain. It was also found that GA 5-DH has significant polyol dehydrogenase activity. The G. suboxydans GA 5-DH gene was poorly expressed in E. coli. Under optimised conditions, maximum expression levels of GA 5-DH did not exceed the levels found in wild-type G. suboxydans. Attempts to increase expression levels resulted in repression of growth and extensive cell lysis. However, the expression levels were sufficient to demonstrate the possibility of bioconversion of glucose and gluconate into 5-KGA using recombinant strains of E. coli. An uncharacterised homologue of GA 5-DH was identified in Xanthomonas campestris using in silico screening. This enzyme, encoded by chromosomal locus NP_636946, was found by a sequencing project of X. campestris and named as a hypothetical glucose dehydrogenase. The gene encoding this uncharacterised enzyme was cloned, expressed in E. coli and found to encode a gluconate/polyol dehydrogenase without glucose dehydrogenase activity. Moreover, the X. campestris GA 5-DH gene was expressed in E. coli at nearly 30 times higher levels than the G. suboxydans GA 5-DH gene. Good expressibility of the X. campestris GA 5-DH gene makes it a valuable tool not only for 5-KGA production in the tartaric acid (TA) bioprocess, but possibly also for other bioprocesses (e.g.
oxidation of sorbitol into L-sorbose). In addition to glucose - 5-KGA bioconversion, a preliminary study of the feasibility of enzymatic conversion of 5-KGA into TA was carried out. Here, the efficacy of the first step of a prospective two-step conversion route including a transketolase and a dehydrogenase was confirmed. It was found that transketolase converts 5-KGA into TA semialdehyde. Succinic dehydrogenase was suggested as a candidate for the second step, but this was not tested. The analysis of the two subprojects indicated that bioconversion of glucose to TA using X. campestris GA 5-DH should be prioritised, and that future process development efforts should focus on developing more efficient GA 5-DH production strains by screening for a more suitable production host and by protein engineering.
Abstract:
In this paper, we present numerical evidence that supports the notion of minimization in the sequence space of proteins for a target conformation. We use the conformations of real proteins in the Protein Data Bank (PDB) and present computationally efficient methods to identify the sequences with minimum energy. We use an edge-weighted connectivity graph for ranking the residue sites with a reduced amino acid alphabet and then use continuous optimization to obtain the energy-minimizing sequences. Our methods enable the computation of a lower bound as well as a tight upper bound for the energy of a given conformation. We validate our results by using three different inter-residue energy matrices for five proteins from the PDB, and by comparing our energy-minimizing sequences with 80 million diverse sequences that are generated based on different considerations in each case. When we submitted some of our chosen energy-minimizing sequences to the Basic Local Alignment Search Tool (BLAST), we obtained some sequences from the non-redundant protein sequence database that are similar to ours, with an E-value of the order of 10^-7. In summary, we conclude that proteins show a trend towards minimizing energy in the sequence space but do not seem to adopt the global energy-minimizing sequence. The reason for this could be either that the existing energy matrices are not able to accurately represent the inter-residue interactions in the context of the protein environment, or that Nature does not push the optimization in the sequence space once the protein is able to perform its function.
Abstract:
The non-oxidative decarboxylation of aromatic acids is a poorly understood reaction. The transformation of 2,3-dihydroxybenzoic acid to catechol in the fungal metabolism of indole is a prototype of such a reaction. 2,3-Dihydroxybenzoic acid decarboxylase (EC 4.1.1.46), which catalyzes this reaction, was purified to homogeneity from anthranilate-induced cultures of Aspergillus oryzae using affinity chromatography. The enzyme did not require cofactors like NAD+, PLP, TPP or metal ions for its activity. There was no spectral evidence for the presence of enzyme-bound cofactors. The preparation, which was adjudged homogeneous by the criteria of SDS-PAGE, sedimentation analysis and N-terminal analysis, was characterized for its physicochemical and kinetic parameters. The enzyme was inactivated by group-specific modifiers like diethyl pyrocarbonate (DEPC) and N-ethylmaleimide (NEM). The kinetics of inactivation by DEPC suggested the presence of a single class of essential histidine residues, the second-order rate constant of inactivation for which was 12.5 M^-1 min^-1. A single class of cysteine residues was modified by NEM with a second-order rate constant of 33 M^-1 min^-1. Substrate analogues protected the enzyme against inactivation by both DEPC and NEM, suggesting that the essential histidine and cysteine are located at the active site of the enzyme. The incorporation of radiolabelled NEM in a differential labelling experiment was 0.73 mol per mol subunit, confirming the presence of a single essential cysteine per active site. Differentially labelled enzyme was enzymatically cleaved and the peptide bearing the label was purified and sequenced. The active-site peptide LLGLAETCK and the N-terminal sequence MLGKIALEEAFALPRFEEKT did not bear any similarity to sequences reported in the Swiss-Prot Protein Sequence Databank, probably a reflection of the unique primary structure of this novel enzyme.
The sequences reported in this study will appear in the Swiss-Prot Protein Sequence Databank under the accession number P80402.
Abstract:
The Basic Local Alignment Search Tool (BLAST) is one of the most widely used sequence alignment programs with which similarity searches, for both protein and nucleic acid sequences, can be performed against large databases at high speed. A large number of tools exist for processing BLAST output, but none of them provide three-dimensional structure visualization. This shortcoming has been addressed in the proposed tool BLAST Server for Structural Biologists (BSSB), which maps a BLAST output onto the three-dimensional structure of the subject protein. The three-dimensional structure of the subject protein is represented using a three-color coding scheme (identical: red; similar: yellow; and mismatch: white) based on the pairwise alignment obtained. Thus, the user will be able to visualize a possible three-dimensional structure for the query protein sequence. This information can be used to gain a deeper insight into the sequence-structure correlation. Furthermore, the additional structure-level information enables the user to make coherent and logical decisions regarding the type of input model structure or fragment that can be used for molecular replacement calculations. This tool is freely available to all users at http://bioserver1.physics.iisc.ernet.in/bssb/.
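The three-colour scheme described above can be sketched from the midline of a BLAST pairwise alignment, where a letter marks an identical pair, '+' marks a similar pair and a space marks a mismatch. The function below maps each aligned subject residue to a colour. How BSSB itself parses the alignment is not stated in the abstract, so the function name, the numbering argument and the handling of gaps are assumptions.

```python
def color_subject_residues(midline, subject, subject_start=1):
    """Map subject residue numbers to colours using the BLAST midline.
    red = identical, yellow = similar ('+'), white = mismatch (space).
    Columns where the subject has a gap ('-') carry no residue and are skipped."""
    colors = {}
    position = subject_start
    for mid, residue in zip(midline, subject):
        if residue == "-":
            continue  # gap in the subject: nothing to colour on the structure
        if mid == "+":
            colors[position] = "yellow"
        elif mid == " ":
            colors[position] = "white"
        else:
            colors[position] = "red"
        position += 1
    return colors
```

The resulting dictionary could then drive a molecular viewer's colouring commands, keyed by the subject's residue numbers in its structure file.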
Abstract:
A palindrome is a string of characters that reads the same forwards and backwards. Since the discovery of palindromic peptide sequences two decades ago, little effort has been made to understand their structural, functional and evolutionary significance. Therefore, an algorithm has been developed to identify all perfect palindromes (excluding palindromic subsets and tandem repeats) in a single protein sequence. The proposed algorithm does not impose any restriction on the number of residues in the input sequence. The algorithm will aid in the identification of palindromic peptide sequences of varying lengths in a single protein sequence.
Abstract:
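A palindrome search of this kind can be sketched with the classic expand-around-centre technique, which is a stand-in here and not necessarily the published algorithm. Two interpretations are assumed: "excluding palindromic subsets" is read as reporting only the longest palindrome at each centre, and "excluding tandem repeats" is read as skipping runs of a single repeated residue. The minimum length of 3 is also an assumption.

```python
def maximal_palindromes(seq, min_len=3):
    """Return (start_index, palindrome) for the longest perfect palindrome at
    each centre of seq, skipping short hits and single-residue runs."""
    found = []
    n = len(seq)
    for center in range(2 * n - 1):      # odd- and even-length centres
        left = center // 2
        right = left + center % 2
        while left >= 0 and right < n and seq[left] == seq[right]:
            left -= 1
            right += 1
        start, end = left + 1, right     # half-open span of the palindrome
        palindrome = seq[start:end]
        if len(palindrome) >= min_len and len(set(palindrome)) > 1:
            found.append((start, palindrome))
    return found
```

This runs in quadratic time in the worst case and places no limit on the input length, which matches the abstract's claim of no restriction on the number of residues.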
Genomic sequences are far from random; they are made up of systematically ordered and information-rich patterns. These repeated sequence patterns have been widely used because of their fundamental importance in understanding genome function and organization. To this end, a comprehensive toolkit, RepEx, has been developed which extracts repeat (inverted, everted and mirror) patterns from the given genome sequence(s) without any constraints. The toolkit can also be used to fetch the inverted repeats present in protein sequence(s). Further, it is capable of extracting exact and degenerate repeats with user-defined spacer intervals. It is reported to be more precise and sensitive than the existing tools. An example with comprehensive case studies and a performance evaluation of the proposed toolkit has been presented to demonstrate its efficiency and accuracy.
Abstract:
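As an illustration of two of the repeat types named above, the sketch below scans a DNA sequence for exact inverted repeats (an arm followed downstream by its reverse complement) and exact mirror repeats (an arm followed by its plain reverse), allowing a spacer between the arms. It is a naive scan with a fixed arm length, not RepEx's method. The function names and parameters are hypothetical, and everted and degenerate repeats are not covered.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def find_repeats(seq, arm, max_spacer, kind="inverted"):
    """Return (left_start, right_start, spacer) for every exact repeat whose
    arms are `arm` bases long and at most `max_spacer` bases apart."""
    hits = []
    for i in range(len(seq) - 2 * arm + 1):
        left = seq[i:i + arm]
        target = reverse_complement(left) if kind == "inverted" else left[::-1]
        for spacer in range(max_spacer + 1):
            j = i + arm + spacer
            if j + arm > len(seq):
                break  # right arm would run past the end of the sequence
            if seq[j:j + arm] == target:
                hits.append((i, j, spacer))
    return hits
```

For protein sequences only the mirror case applies directly, since amino acids have no complement.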
Establishing functional relationships between multi-domain protein sequences is a non-trivial task. Traditionally, delineating functional assignment and relationships of proteins requires domain assignments as a prerequisite. This process is sensitive to alignment quality and domain definitions. In multi-domain proteins, alignment quality is often poor for several reasons. We report the correspondence between the classification of proteins represented as full-length gene products and their functions. Our approach differs fundamentally from traditional methods in not performing the classification at the level of domains. Our method is based on an alignment-free computation of local matching scores (LMS) at the amino acid sequence level, followed by hierarchical clustering. As there are no gold standards for full-length protein sequence classification, we resorted to Gene Ontology and domain architecture-based similarity measures to assess our classification. The final clusters obtained using LMS show high functional and domain architectural similarities. Comparison of the current method with alignment-based approaches at both the domain and full-length protein levels showed the superiority of LMS. Using this method, we have recreated objective relationships among different protein kinase sub-families and also classified immunoglobulin-containing proteins, for which sub-family definitions do not currently exist. This method can be applied to any set of protein sequences and hence will be instrumental in the analysis of large numbers of full-length protein sequences.
Abstract:
Background: The function of a protein can be deciphered with higher accuracy from its structure than from its amino acid sequence. Because of the huge gap between the available protein sequence and structural space, tools that can generate functionally homogeneous clusters using only sequence information are of great importance. For this, traditional alignment-based tools work well in most cases, and clustering is performed on the basis of sequence similarity. However, in the case of multi-domain proteins, the alignment quality might be poor due to varied lengths of the proteins, domain shuffling or circular permutations. Multi-domain proteins are ubiquitous in nature; hence alignment-free tools, which overcome the shortcomings of alignment-based protein comparison methods, are required. Further, existing tools classify proteins using only domain-level information and hence miss the information encoded in the tethered regions or accessory domains. Our method, on the other hand, takes into account the full-length sequence of a protein, consolidating the complete sequence information to understand a given protein better. Results: Our web server, CLAP (Classification of Proteins), is one such alignment-free software for automatic classification of protein sequences. It utilizes a pattern-matching algorithm that assigns local matching scores (LMS) to residues that are part of the matched patterns between two sequences being compared. CLAP works on full-length sequences and does not require prior domain definitions. Pilot studies undertaken previously on protein kinases and immunoglobulins have shown that CLAP yields clusters that have high functional and domain architectural similarity. Moreover, parsing at a statistically determined cut-off resulted in clusters that agreed with the sub-family level classification of that particular domain family.
Conclusions: CLAP is a useful protein-clustering tool, independent of domain assignment, domain order, sequence length and domain diversity. Our method can be used for any set of protein sequences, yielding functionally relevant clusters with high domain architectural homogeneity. The CLAP web server is freely available for academic use at http://nslab.mbu.iisc.ernet.in/clap/.