70 results for protein sequence classification

in Queensland University of Technology - ePrints Archive


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Genomic and proteomic analyses have attracted a great deal of interest in biological research in recent years. Many methods have been applied to discover useful information contained in the enormous databases of genomic sequences and amino acid sequences. The results of these investigations in turn inspire further research in biological fields. These biological sequences, which may be considered as multiscale sequences, have some specific features that require more refined methods to characterise. This project aims to study some of these biological challenges using multiscale analysis methods and a stochastic modelling approach. The first part of the thesis aims to cluster some unknown proteins, and classify their families as well as their structural classes. A development in proteomic analysis is concerned with the determination of protein functions. The first step in this development is to classify proteins and predict their families. This motivates us to study some unknown proteins from specific families, and to cluster them into families and structural classes. We select a large number of proteins from the same families or superfamilies, and link them to simulate some unknown large proteins from these families. We use multifractal analysis and the wavelet method to capture the characteristics of these linked proteins. The simulation results show that the method is valid for the classification of large proteins. The second part of the thesis aims to explore the relationship of proteins based on a layered comparison with their components. Many methods are based on homology of proteins because resemblance at the protein sequence level normally indicates similarity of functions and structures. However, some proteins may have similar functions with low sequence identity. We consider protein sequences at a detailed level to investigate the problem of comparison of proteins.
The comparison is based on the empirical mode decomposition (EMD), and protein sequences are analysed through their intrinsic mode functions. A measure of similarity is introduced with a new cross-correlation formula. The similarity results show that the EMD is useful for detecting functional relationships between proteins. The third part of the thesis aims to investigate the transcriptional regulatory network of the yeast cell cycle via stochastic differential equations. As the investigation of genome-wide gene expression has become a focus in genomic analysis, researchers have tried to understand the mechanisms of the yeast genome for many years. How cells control gene expression still needs further investigation. We use a stochastic differential equation to model the expression profile of a target gene. We modify the model with a Gaussian membership function. For each target gene, a transcriptional rate is obtained, and the estimated transcriptional rate is also calculated with the information from five possible transcriptional regulators. Some regulators of these target genes are verified with the related references. With these results, we construct a transcriptional regulatory network for the genes from the yeast Saccharomyces cerevisiae. Constructing the transcriptional regulatory network is useful for uncovering further mechanisms of the yeast cell cycle.
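The abstract does not give the form of the stochastic differential equation, so the following is only a minimal sketch of how an expression profile could be simulated numerically. It assumes a hypothetical linear model, dX = (production - decay * X) dt + noise dW, integrated with the Euler-Maruyama scheme; the function name, the parameters and the model itself are illustrative assumptions, not the thesis's model.

```python
import math
import random


def simulate_expression(x0, production, decay, noise, dt, steps, seed=0):
    """Euler-Maruyama path of the ASSUMED model
    dX = (production - decay * X) dt + noise dW.
    The model form is hypothetical; the abstract does not specify it."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        # Brownian increment over one step: normal with variance dt.
        dw = rng.gauss(0.0, math.sqrt(dt))
        x = x + (production - decay * x) * dt + noise * dw
        path.append(x)
    return path


# With no noise the path relaxes towards production / decay.
deterministic = simulate_expression(0.0, 2.0, 0.5, 0.0, dt=0.01, steps=5000)
# With noise the path fluctuates around the same level.
noisy = simulate_expression(0.0, 2.0, 0.5, 0.3, dt=0.01, steps=5000)
```

Setting `noise` to zero recovers the ordinary differential equation, which is a convenient sanity check before fitting any rate parameters to measured expression data.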

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Background The residue-wise contact order (RWCO) describes the sequence separations between the residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. Results We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and a root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences.
Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance with the CC to 0.57 and an RMSE of 0.79. In addition, combining the predicted secondary structure by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and RMSE of 0.78, which provided at least comparable performance compared with the other existing methods. Conclusion The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.
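To make the quantities in this abstract concrete, here is a minimal sketch of how RWCO and the two reported metrics (CC and RMSE) could be computed. The contact cutoff of 8.0, the minimum sequence separation of 3 and the normalisation by chain length are assumptions chosen for illustration; the abstract does not state the exact definition used in the paper.

```python
import math


def rwco(coords, cutoff=8.0, min_sep=3):
    """Residue-wise contact order for a list of (x, y, z) residue positions.
    For residue i, sum the sequence separations |i - j| over residues j that
    lie within `cutoff`, then divide by chain length.
    cutoff, min_sep and the normalisation are ASSUMED values."""
    n = len(coords)
    values = []
    for i in range(n):
        total = 0
        for j in range(n):
            sep = abs(i - j)
            if sep >= min_sep and math.dist(coords[i], coords[j]) < cutoff:
                total += sep
        values.append(total / n)
    return values


def pearson_cc(pred, obs):
    """Pearson correlation coefficient between predicted and observed values."""
    mp = sum(pred) / len(pred)
    mo = sum(obs) / len(obs)
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov / math.sqrt(var_p * var_o)


def rmse(pred, obs):
    """Root mean square error between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred))


# Four residues on a straight line, one unit apart (toy coordinates).
toy_chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
toy_rwco = rwco(toy_chain)
```

In the toy chain only the two end residues are separated by at least three positions, so only they receive a non-zero value.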

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Hepatitis C virus (HCV) core (C) protein is thought to bind to viral RNA before it undergoes oligomerization leading to RNA encapsidation. Details of these events are so far unknown. The 5ʹ-terminal C protein coding sequence that includes an adenine (A)-rich tract is a part of an internal ribosome entry site (IRES). This nucleotide sequence but not the corresponding protein sequence is needed for proper initiation of translation of viral RNA by an IRES-dependent mechanism. In this study, we examined the importance of this sequence for the ability of the C protein to bind to viral RNA. Serially truncated C proteins with deletions from 10 up to 45 N-terminal amino acids were expressed in Escherichia coli, purified and tested for binding to viral RNA by a gel shift assay. The results showed that truncation of the C protein from its N-terminus by more than 10 amino acids abolished almost completely its expression in E. coli. The latter could be restored by adding a tag to the N-terminus of the protein. The tagged proteins truncated by 15 or more amino acids showed an anomalous migration in SDS-PAGE. Truncation by more than 20 amino acids resulted in a complete loss of ability of tagged C protein to bind to viral RNA. These results provide clues to the early events in the C protein-RNA interactions leading to C protein oligomerization, RNA encapsidation and virion assembly.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Exponential growth of genomic data in the last two decades has made manual analyses impractical for all but trial studies. As genomic analyses have become more sophisticated, and move toward comparisons across large datasets, computational approaches have become essential. One of the most important biological questions is to understand the mechanisms underlying gene regulation. Genetic regulation is commonly investigated and modelled through the use of transcriptional regulatory network (TRN) structures. These model the regulatory interactions between two key components: transcription factors (TFs) and the target genes (TGs) they regulate. Transcriptional regulatory networks have proven to be invaluable scientific tools in Bioinformatics. When used in conjunction with comparative genomics, they have provided substantial insights into the evolution of regulatory interactions. Current approaches to regulatory network inference, however, omit two additional key entities: promoters and transcription factor binding sites (TFBSs). In this study, we attempted to explore the relationships among these regulatory components in bacteria. Our primary goal was to identify relationships that can assist in reducing the high false positive rates associated with transcription factor binding site predictions and thereupon enhance the reliability of the inferred transcription regulatory networks. In our preliminary exploration of relationships between the key regulatory components in Escherichia coli transcription, we discovered a number of potentially useful features. The combination of location score and sequence dissimilarity scores increased de novo binding site prediction accuracy by 13.6%. Another important observation made was with regards to the relationship between transcription factors grouped by their regulatory role and corresponding promoter strength. 
Our study of E. coli σ70 promoters found support at the 0.1 significance level for our hypothesis that weak promoters are preferentially associated with activator binding sites to enhance gene expression, whilst strong promoters have more repressor binding sites to repress or inhibit gene transcription. Although the observations were specific to σ70, they nevertheless strongly encourage additional investigations when more experimentally confirmed data are available. In our preliminary exploration of relationships between the key regulatory components in E. coli transcription, we discovered a number of potentially useful features, some of which proved successful in reducing the number of false positives when applied to re-evaluate binding site predictions. Of chief interest was the relationship observed between promoter strength and TFs with respect to their regulatory role. Based on the common assumption that promoter homology positively correlates with transcription rate, we hypothesised that weak promoters would have more transcription factors that enhance gene expression, whilst strong promoters would have more repressor binding sites. The t-tests performed on E. coli σ70 promoters returned a p-value of 0.072, which at the 0.1 significance level suggested support for our (alternative) hypothesis, although this trend may only be present for promoters where the corresponding TFBSs are either all repressors or all activators. Nevertheless, such suggestive results strongly encourage additional investigations when more experimentally confirmed data become available. Much of the remainder of the thesis concerns a machine learning study of binding site prediction, using the SVM and kernel methods, principally the spectrum kernel.
Spectrum kernels have been successfully applied in previous studies of protein classification [91, 92], as well as the related problem of promoter predictions [59], and we have here successfully applied the technique to refining TFBS predictions. The advantages provided by the SVM classifier were best seen in 'moderately' conserved transcription factor binding sites as represented by our E. coli CRP case study. Inclusion of additional position feature attributes further increased accuracy by 9.1%, but more notable was the considerable decrease in false positive rate from 0.8 to 0.5 while retaining 0.9 sensitivity. Improved prediction of transcription factor binding sites is in turn extremely valuable in improving inference of regulatory relationships, a problem notoriously prone to false positive predictions. Here, the number of false regulatory interactions inferred using the conventional two-component model was substantially reduced when we integrated de novo transcription factor binding site predictions as an additional criterion for acceptance in a case study of inference in the Fur regulon. This initial work was extended to a comparative study of the iron regulatory system across 20 Yersinia strains. This work revealed interesting, strain-specific differences, especially between pathogenic and non-pathogenic strains. Such differences were made clear through interactive visualisations using the TRNDiff software developed as part of this work, and would have remained undetected using conventional methods. This approach led to the nomination of the Yfe iron-uptake system as a candidate for further wet-lab experimentation due to its potential active functionality in non-pathogens and its known participation in full virulence of the bubonic plague strain. Building on this work, we introduced novel structures we have labelled as 'regulatory trees', inspired by the phylogenetic tree concept.
Instead of using gene or protein sequence similarity, the regulatory trees were constructed based on the number of similar regulatory interactions. While the common phylogenetic trees convey information regarding changes in gene repertoire, which we might regard as analogous to 'hardware', the regulatory tree informs us of the changes in regulatory circuitry, in some respects analogous to 'software'. In this context, we explored the 'pan-regulatory network' for the Fur system, the entire set of regulatory interactions found for the Fur transcription factor across a group of genomes. In the pan-regulatory network, emphasis is placed on how the regulatory network for each target genome is inferred from multiple sources instead of a single source, as is the common approach. The benefit of using multiple reference networks is a more comprehensive survey of the relationships, and increased confidence in the regulatory interactions predicted. In the present study, we distinguish between relationships found across the full set of genomes as the 'core-regulatory-set', and interactions found only in a subset of genomes explored as the 'sub-regulatory-set'. We found nine Fur target gene clusters present across the four genomes studied, this core set potentially identifying basic regulatory processes essential for survival. Species-level differences are seen at the sub-regulatory-set level; for example, the known virulence factors YbtA and PchR were found in Y. pestis and P. aeruginosa respectively, but were not present in either E. coli or B. subtilis. Such factors, and the iron-uptake systems they regulate, are ideal candidates for wet-lab investigation to determine whether or not they are pathogen-specific. In this study, we employed a broad range of approaches to address our goals and assessed these methods using the Fur regulon as our initial case study.
We identified a set of promising feature attributes; demonstrated their success in increasing transcription factor binding site prediction specificity while retaining sensitivity, and showed the importance of binding site predictions in enhancing the reliability of regulatory interaction inferences. Most importantly, these outcomes led to the introduction of a range of visualisations and techniques, which are applicable across the entire bacterial spectrum and can be utilised in studies beyond the understanding of transcriptional regulatory networks.
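The spectrum kernel named in this abstract compares two sequences by counting the k-mers they share. The sketch below shows the standard k-spectrum kernel as an inner product of k-mer count vectors; the function names and the choice of k are illustrative assumptions, and the thesis's own feature set (for example its position attributes) is not reproduced here.

```python
import math
from collections import Counter


def spectrum(seq, k):
    """Count every contiguous k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """k-spectrum kernel: inner product of the two k-mer count vectors."""
    sa, sb = spectrum(a, k), spectrum(b, k)
    return sum(count * sb[kmer] for kmer, count in sa.items())


def normalized_spectrum_kernel(a, b, k=3):
    """Cosine-normalised kernel, so identical sequences score 1.0."""
    denom = math.sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


shared = spectrum_kernel("ACGT", "ACGT", k=2)    # AC, CG, GT each match once
repeats = spectrum_kernel("AAAA", "AAA", k=2)    # AA occurs 3 times and 2 times
disjoint = spectrum_kernel("ACGT", "TTTT", k=2)  # no 2-mer in common
```

A matrix of such kernel values over a set of candidate binding sites could then be passed to any SVM implementation that accepts a precomputed kernel.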

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Background Flower development in kiwifruit (Actinidia spp.) is initiated in the first growing season, when undifferentiated primordia are established in latent shoot buds. These primordia can differentiate into flowers in the second growing season, after the winter dormancy period and upon accumulation of adequate winter chilling. Kiwifruit is an important horticultural crop, yet little is known about the molecular regulation of flower development. Results To study kiwifruit flower development, nine MADS-box genes were identified and functionally characterized. Protein sequence alignment, phenotypes obtained upon overexpression in Arabidopsis and expression patterns suggest that the identified genes are required for floral meristem and floral organ specification. Their role during budbreak and flower development was studied. A spontaneous kiwifruit mutant was utilized to correlate the extended expression domains of these flowering genes with abnormal floral development. Conclusions This study provides a description of flower development in kiwifruit at the molecular level. It has identified markers for flower development, and candidates for manipulation of kiwifruit growth, phase change and time of flowering. The expression in normal and aberrant flowers provided a model for kiwifruit flower development.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

This paper addresses the problem of predicting the outcome of an ongoing case of a business process based on event logs. In this setting, the outcome of a case may refer, for example, to the achievement of a performance objective or the fulfillment of a compliance rule upon completion of the case. Given a log consisting of traces of completed cases, a trace of an ongoing case, and two or more possible outcomes (e.g., a positive and a negative outcome), the paper addresses the problem of determining the most likely outcome for the case in question. Previous approaches to this problem are largely based on simple symbolic sequence classification, meaning that they extract features from traces seen as sequences of event labels, and use these features to construct a classifier for runtime prediction. In doing so, these approaches ignore the data payload associated with each event. This paper approaches the problem from a different angle by treating traces as complex symbolic sequences, that is, sequences of events each carrying a data payload. In this context, the paper outlines different feature encodings of complex symbolic sequences and compares their predictive accuracy on real-life business process event logs.
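The abstract does not list the encodings it compares, so the following is just one plausible sketch of turning a complex symbolic sequence into a fixed-length feature vector: activity counts over the prefix of the ongoing case, followed by the payload values of the most recent event. The event representation, the function name and the default value for missing attributes are all assumptions made for illustration.

```python
def encode_prefix(prefix, activities, payload_keys, missing=0.0):
    """Encode a trace prefix given as a list of (activity, payload_dict) events.
    Returns activity counts followed by the latest event's payload values.
    This encoding is an ILLUSTRATIVE assumption, not the paper's definition."""
    counts = [sum(1 for act, _ in prefix if act == a) for a in activities]
    last_payload = prefix[-1][1] if prefix else {}
    payload = [last_payload.get(key, missing) for key in payload_keys]
    return counts + payload


# Hypothetical ongoing case with three events so far.
ongoing_case = [
    ("register", {"amount": 100.0}),
    ("check", {"amount": 120.0}),
    ("check", {"amount": 90.0}),
]
features = encode_prefix(ongoing_case, ["register", "check", "approve"], ["amount"])
```

Vectors like `features`, built from completed cases with known outcomes, could then be used to train any standard classifier for runtime prediction.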

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Interferon-induced transmembrane protein 5, or bone-restricted ifitm-like gene (Bril), was first identified as a bone gene in 2008, although no in vivo role was identified at that time. A role in human bone has now been demonstrated, with a number of recent studies identifying a single point mutation in Bril as the causative mutation in osteogenesis imperfecta type V (OI type V). Such a discovery suggests a key role for Bril in skeletal regulation, and the completely novel nature of the gene raises the possibility of a new regulatory pathway in bone. Furthermore, the phenotype of OI type V has unique and quite divergent features compared with other forms of OI involving defects in collagen biology. Currently it appears that the underlying genetic defect in OI type V may be unrelated to collagen regulation, which also raises interesting questions about the classification of this form of OI. This review will discuss current knowledge of OI type V, the function of Bril, and the implications of this recent discovery.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

In the last decade we have come to understand that the growth of cancer cells in general, and of breast cancer in particular, depends in many cases upon growth factors that bind to and activate their receptors. One of these growth factor receptors is the erbB-2 protein, which plays an important role in the prognosis of breast cancer and is overexpressed in nearly 30% of human breast cancer patients. While evidence accumulates to support the relationship between erbB-2 overexpression and poor overall survival in breast cancer, understanding of the biological consequence(s) of erbB-2 overexpression remains elusive. Our recent discovery of gp30 has allowed us to identify a number of related but distinct biological endpoints which appear responsive to signal transduction through the erbB-2 receptor. These endpoints of growth, invasiveness, and differentiation have clear implications for the emergence, maintenance and/or control of malignancy, and represent established endpoints in the assessment of malignant progression in breast cancer. We have shown that gp30 induces a biphasic growth effect on cells with erbB-2 overexpression. We have recently determined the protein sequence of gp30 and cloned its full-length cDNA sequence. We have also cloned two additional forms of the ligand that are believed to be different isoforms. We are currently expressing the different forms in order to determine their biological effects. To elucidate the cellular mechanisms underlying cell growth inhibition by gp30, we tested the effect of this ligand on cell growth and differentiation of human breast cancer cells which overexpress erbB-2 and cells which express low levels of this protooncogene. High concentrations of ligand induced differentiation of cells overexpressing erbB-2, as measured by inhibition of cell growth, increased synthesis of milk components, modulation of E-cadherin, and up-regulation of c-jun and c-fos.
These findings indicate that ligand-induced growth inhibition in cells overexpressing erbB-2 is associated with an apparent induction of differentiation. The availability of gp30-derived synthetic peptides and its full-length cDNAs provides the tools necessary to acquire a better understanding of the mechanism of action of this ligand and the erbB-2 receptor in breast cancer.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

This paper evaluates the suitability of sequence classification techniques for analyzing deviant business process executions based on event logs. Deviant process executions are those that deviate in a negative or positive way with respect to normative or desirable outcomes, such as non-compliant executions or executions that undershoot or exceed performance targets. We evaluate a range of feature types and classification methods in terms of their ability to accurately discriminate between normal and deviant executions both when deviances are infrequent (unbalanced) and when deviances are as frequent as normal executions (balanced). We also analyze the ability of the discovered rules to explain potential causes and contributing factors of observed deviances. The evaluation results show that feature types extracted using pattern mining techniques only slightly outperform those based on individual activity frequency. The results also suggest that more complex feature types ought to be explored to achieve higher levels of accuracy.
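As a rough illustration of the two families of feature types contrasted above, the sketch below computes individual activity frequencies and, as a very simple stand-in for mined sequential patterns, frequencies of directly-following activity pairs. The paper's actual pattern mining techniques are not described in the abstract; the bigram feature and the function names are assumptions.

```python
from collections import Counter


def activity_frequencies(trace):
    """Individual activity frequency features for one trace."""
    return Counter(trace)


def bigram_frequencies(trace):
    """Counts of directly-following activity pairs.
    Used here only as a simple ASSUMED stand-in for mined patterns."""
    return Counter(zip(trace, trace[1:]))


# Hypothetical trace of activity labels.
trace = ["a", "b", "a", "b", "c"]
single = activity_frequencies(trace)
pairs = bigram_frequencies(trace)
```

Either kind of count can be laid out as a column per activity or per pattern, giving one feature row per process execution labelled normal or deviant.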

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Bahia grass, Paspalum notatum, is an important pollen allergen source with a long season of pollination and wide distribution in subtropical and temperate regions. We aimed to characterize the 55 kDa allergen of Bahia grass pollen (BaGP) and ascertain its clinical importance. BaGP extract was separated by 2D-PAGE and immunoblotted with serum IgE of a grass pollen-allergic patient. The amino-terminal protein sequence of the predominant allergen isoform at 55 kDa had similarity with the group 13 allergens of Timothy grass and maize pollen, Phl p 13 and Zea m 13. Four sequences obtained by rapid amplification of the allergen cDNA ends represented multiple isoforms of Pas n 13. The predicted full length cDNA for Pas n 13 encoded a 423 amino acid glycoprotein including a signal peptide of 28 residues and with a predicted pI of 7.0. Tandem mass spectrometry of tryptic peptides of 2D gel spots identified peptides specific to the deduced amino acid sequence for each of the four Pas n 13 cDNA, representing 47% of the predicted mature protein sequence of Pas n 13. There was 80.6% and 72.6% amino acid identity with Zea m 13 and Phl p 13, respectively. Reactivity with a Phl p 13-specific monoclonal antibody AF6 supported designation of this allergen as Pas n 13. The allergen was purified from BaGP extract by ammonium sulphate precipitation, hydrophobic interaction and size exclusion chromatography. Purified Pas n 13 reacted with serum IgE of 34 of 71 (48%) grass pollen-allergic patients and specifically inhibited IgE reactivity with the 55 kDa band of BaGP for two grass pollen-allergic donors. Four isoforms of Pas n 13 from pI 6.3-7.8 had IgE-reactivity with grass pollen allergic sera. The allergenic activity of purified Pas n 13 was demonstrated by activation of basophils from whole blood of three grass pollen-allergic donors tested but not control donors.
Pas n 13 is thus a clinically relevant pollen allergen of the subtropical Bahia grass likely to be important in eliciting seasonal allergic rhinitis and asthma in grass pollen-allergic patients.

Relevance:

50.00% 50.00%

Publisher:

Abstract:

Determination of sequence similarity is a central issue in computational biology, a problem addressed primarily through BLAST, an alignment-based heuristic which has underpinned much of the analysis and annotation of the genomic era. Despite their success, alignment-based approaches scale poorly with increasing data set size, and are not robust under structural sequence rearrangements. Successive waves of innovation in sequencing technologies, the so-called Next Generation Sequencing (NGS) approaches, have led to an explosion in data availability, challenging existing methods and motivating novel approaches to sequence representation and similarity scoring, including adaptation of existing methods from other domains such as information retrieval. In this work, we investigate locality-sensitive hashing of sequences through binary document signatures, applying the method to a bacterial protein classification task. Here, the goal is to predict the gene family to which a given query protein belongs. Experiments carried out on a pair of small but biologically realistic datasets (the full protein repertoires of families of Chlamydia and Staphylococcus aureus genomes respectively) show that a measure of similarity obtained by locality-sensitive hashing gives highly accurate results while offering a number of avenues which will lead to substantial performance improvements over BLAST.
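The abstract does not specify how its binary signatures are built, so the sketch below uses a simhash-style scheme as a stand-in: each k-mer votes on every bit of a fixed-width signature, and similar sequences end up with signatures that differ in few bits. The hash function, the signature width, the k-mer length and the nearest-signature classifier are all assumptions for illustration, and the example sequences and family labels are invented.

```python
import hashlib


def kmers(seq, k=3):
    """All contiguous k-mers of the sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def signature(seq, k=3, bits=64):
    """Simhash-style binary signature (an ASSUMED stand-in scheme).
    `bits` must be a multiple of 8 and at most 256 (SHA-256 digest size)."""
    weights = [0] * bits
    for kmer in kmers(seq, k):
        digest = hashlib.sha256(kmer.encode()).digest()[:bits // 8]
        value = int.from_bytes(digest, "big")
        for b in range(bits):
            # Each k-mer votes +1 or -1 on every signature bit.
            weights[b] += 1 if (value >> b) & 1 else -1
    sig = 0
    for b, w in enumerate(weights):
        if w > 0:
            sig |= 1 << b
    return sig


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def classify(query, labelled, k=3, bits=64):
    """Return the family label of the nearest signature in `labelled`,
    a list of (sequence, family) pairs."""
    q = signature(query, k, bits)
    nearest = min(labelled, key=lambda item: hamming(q, signature(item[0], k, bits)))
    return nearest[1]


# Invented toy sequences and labels, for illustration only.
reference = [("MKTAYIAKQR", "family_A"), ("GGGGGGGGGG", "family_B")]
predicted = classify("MKTAYIAKQR", reference)
```

Because signatures are compared by Hamming distance rather than by alignment, the comparison cost per pair is independent of sequence length once the signatures have been computed.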