47 resultados para symbolic machine learning
Resumo:
Background: The structure of proteins may change as a result of the inherent flexibility of some protein regions. We develop and explore probabilistic machine learning methods for predicting a continuum secondary structure, i.e. assigning probabilities to the conformational states of a residue. We train our methods using data derived from high-quality NMR models. Results: Several probabilistic models not only successfully estimate the continuum secondary structure, but also provide a categorical output on par with models directly trained on categorical data. Importantly, models trained on the continuum secondary structure are also better than their categorical counterparts at identifying the conformational state for structurally ambivalent residues. Conclusion: Cascaded probabilistic neural networks trained on the continuum secondary structure exhibit better accuracy in structurally ambivalent regions of proteins, while sustaining an overall classification accuracy on par with standard, categorical prediction methods.
Resumo:
Background: Designing novel proteins with site-directed recombination has enormous prospects. By locating effective recombination sites for swapping sequence parts, the probability that hybrid sequences have the desired properties is increased dramatically. The prohibitive requirements for applying current tools led us to investigate machine learning to assist in finding useful recombination sites from amino acid sequence alone. Results: We present STAR, Site Targeted Amino acid Recombination predictor, which produces a score indicating the structural disruption caused by recombination, for each position in an amino acid sequence. Example predictions contrasted with those of alternative tools, illustrate STAR'S utility to assist in determining useful recombination sites. Overall, the correlation coefficient between the output of the experimentally validated protein design algorithm SCHEMA and the prediction of STAR is very high (0.89). Conclusion: STAR allows the user to explore useful recombination sites in amino acid sequences with unknown structure and unknown evolutionary origin. The predictor service is available from http://pprowler.itee.uq.edu.au/star.
Resumo:
We present an application of Mathematical Morphology (MM) for the classification of astronomical objects, both for star/galaxy differentiation and galaxy morphology classification. We demonstrate that, for CCD images, 99.3 +/- 3.8% of galaxies can be separated from stars using MM, with 19.4 +/- 7.9% of the stars being misclassified. We demonstrate that, for photographic plate images, the number of galaxies correctly separated from the stars can be increased using our MM diffraction spike tool, which allows 51.0 +/- 6.0% of the high-brightness galaxies that are inseparable in current techniques to be correctly classified, with only 1.4 +/- 0.5% of the high-brightness stars contaminating the population. We demonstrate that elliptical (E) and late-type spiral (Sc-Sd) galaxies can be classified using MM with an accuracy of 91.4 +/- 7.8%. It is a method involving fewer 'free parameters' than current techniques, especially automated machine learning algorithms. The limitation of MM galaxy morphology classification based on seeing and distance is also presented. We examine various star/galaxy differentiation and galaxy morphology classification techniques commonly used today, and show that our MM techniques compare very favourably.
Resumo:
This paper presents a composite multi-layer classifier system for predicting the subcellular localization of proteins based on their amino acid sequence. The work is an extension of our previous predictor PProwler v1.1 which is itself built upon the series of predictors SignalP and TargetP. In this study we outline experiments conducted to improve the classifier design. The major improvement came from using Support Vector machines as a "smart gate" sorting the outputs of several different targeting peptide detection networks. Our final model (PProwler v1.2) gives MCC values of 0.873 for non-plant and 0.849 for plant proteins. The model improves upon the accuracy of our previous subcellular localization predictor (PProwler v1.1) by 2% for plant data (which represents 7.5% improvement upon TargetP).
Resumo:
We are developing a telemedicine application which offers automated diagnosis of facial (Bell's) palsy through a Web service. We used a test data set of 43 images of facial palsy patients and 44 normal people to develop the automatic recognition algorithm. Three different image pre-processing methods were used. Machine learning techniques (support vector machine, SVM) were used to examine the difference between the two halves of the face. If there was a sufficient difference, then the SVM recognized facial palsy. Otherwise, if the halves were roughly symmetrical, the SVM classified the image as normal. It was found that the facial palsy images had a greater Hamming Distance than the normal images, indicating greater asymmetry. The median distance in the normal group was 331 (interquartile range 277-435) and the median distance in the facial palsy group was 509 (interquartile range 334-703). This difference was significant (P
Resumo:
This paper presents a new approach to improving the effectiveness of autonomous systems that deal with dynamic environments. The basis of the approach is to find repeating patterns of behavior in the dynamic elements of the system, and then to use predictions of the repeating elements to better plan goal directed behavior. It is a layered approach involving classifying, modeling, predicting and exploiting. Classifying involves using observations to place the moving elements into previously defined classes. Modeling involves recording features of the behavior on a coarse grained grid. Exploitation is achieved by integrating predictions from the model into the behavior selection module to improve the utility of the robot's actions. This is in contrast to typical approaches that use the model to select between different strategies or plays. Three methods of adaptation to the dynamic features of the environment are explored. The effectiveness of each method is determined using statistical tests over a number of repeated experiments. The work is presented in the context of predicting opponent behavior in the highly dynamic and multi-agent robot soccer domain (RoboCup)
Resumo:
Quantitative databases are limited to information identified as important by their creators, while databases containing natural language are limited by our ability to analyze large unstructured bodies of text. Leximancer is a tool that uses semantic mapping to develop concept maps from natural language. We have applied Leximancer to educational based pathology case notes to demonstrate how real patient records or databases of case studies could be analyzed to identify unique relationships. We then discuss how such analysis could be used to conduct quantitative analysis from databases such as the Coronary Heart Disease Database.
Resumo:
We present a machine learning model that predicts a structural disruption score from a protein’s primary structure. SCHEMA was introduced by Frances Arnold and colleagues as a method for determining putative recombination sites of a protein on the basis of the full (PDB) description of its structure. The present method provides an alternative to SCHEMA that is able to determine the same score from sequence data only. Circumventing the need for resolving the full structure enables the exploration of yet unresolved and even hypothetical sequences for protein design efforts. Deriving the SCHEMA score from a primary structure is achieved using a two step approach: first predicting a secondary structure from the sequence and then predicting the SCHEMA score from the predicted secondary structure. The correlation coefficient for the prediction is 0.88 and indicates the feasibility of replacing SCHEMA with little loss of precision. ©2005 IEEE
Resumo:
Prediction of peroxisomal matrix proteins generally depends on the presence of one of two distinct motifs at the end of the amino acid sequence. PTS1 peroxisomal proteins have a well conserved tripeptide at the C-terminal end. However, the preceding residues in the sequence arguably play a crucial role in targeting the protein to the peroxisome. Previous work in applying machine learning to the prediction of peroxisomal matrix proteins has failed W capitalize on the full extent of these dependencies. We benchmark a range of machine learning algorithms, and show that a classifier - based on the Support Vector Machine - produces more accurate results when dependencies between the conserved motif and the preceding section are exploited. We publish an updated and rigorously curated data set that results in increased prediction accuracy of most tested models.
Resumo:
Traditionally, machine learning algorithms have been evaluated in applications where assumptions can be reliably made about class priors and/or misclassification costs. In this paper, we consider the case of imprecise environments, where little may be known about these factors and they may well vary significantly when the system is applied. Specifically, the use of precision-recall analysis is investigated and compared to the more well known performance measures such as error-rate and the receiver operating characteristic (ROC). We argue that while ROC analysis is invariant to variations in class priors, this invariance in fact hides an important factor of the evaluation in imprecise environments. Therefore, we develop a generalised precision-recall analysis methodology in which variation due to prior class probabilities is incorporated into a multi-way analysis of variance (ANOVA). The increased sensitivity and reliability of this approach is demonstrated in a remote sensing application.
Resumo:
Machine learning techniques for prediction and rule extraction from artificial neural network methods are used. The hypothesis that market sentiment and IPO specific attributes are equally responsible for first-day IPO returns in the US stock market is tested. Machine learning methods used are Bayesian classifications, support vector machines, decision tree techniques, rule learners and artificial neural networks. The outcomes of the research are predictions and rules associated With first-day returns of technology IPOs. The hypothesis that first-day returns of technology IPOs are equally determined by IPO specific and market sentiment is rejected. Instead lower yielding IPOs are determined by IPO specific and market sentiment attributes, while higher yielding IPOs are largely dependent on IPO specific attributes.