819 resultados para symbolic machine learning
Resumo:
Thesis (Ph.D.)--University of Washington, 2016-06
Resumo:
Thesis (Ph.D.)--University of Washington, 2016-06
Resumo:
Thesis (Ph.D.)--University of Washington, 2016-06
Resumo:
We have used microarray gene expression pro. ling and machine learning to predict the presence of BRAF mutations in a panel of 61 melanoma cell lines. The BRAF gene was found to be mutated in 42 samples (69%) and intragenic mutations of the NRAS gene were detected in seven samples (11%). No cell line carried mutations of both genes. Using support vector machines, we have built a classifier that differentiates between melanoma cell lines based on BRAF mutation status. As few as 83 genes are able to discriminate between BRAF mutant and BRAF wild-type samples with clear separation observed using hierarchical clustering. Multidimensional scaling was used to visualize the relationship between a BRAF mutation signature and that of a generalized mitogen-activated protein kinase ( MAPK) activation ( either BRAF or NRAS mutation) in the context of the discriminating gene list. We observed that samples carrying NRAS mutations lie somewhere between those with or without BRAF mutations. These observations suggest that there are gene-specific mutation signals in addition to a common MAPK activation that result from the pleiotropic effects of either BRAF or NRAS on other signaling pathways, leading to measurably different transcriptional changes.
Resumo:
The cross-entropy (CE) method is a new generic approach to combinatorial and multi-extremal optimization and rare event simulation. The purpose of this tutorial is to give a gentle introduction to the CE method. We present the CE methodology, the basic algorithm and its modifications, and discuss applications in combinatorial optimization and machine learning. combinatorial optimization
Resumo:
Background: The multitude of motif detection algorithms developed to date have largely focused on the detection of patterns in primary sequence. Since sequence-dependent DNA structure and flexibility may also play a role in protein-DNA interactions, the simultaneous exploration of sequence-and structure-based hypotheses about the composition of binding sites and the ordering of features in a regulatory region should be considered as well. The consideration of structural features requires the development of new detection tools that can deal with data types other than primary sequence. Results: GANN ( available at http://bioinformatics.org.au/gann) is a machine learning tool for the detection of conserved features in DNA. The software suite contains programs to extract different regions of genomic DNA from flat files and convert these sequences to indices that reflect sequence and structural composition or the presence of specific protein binding sites. The machine learning component allows the classification of different types of sequences based on subsamples of these indices, and can identify the best combinations of indices and machine learning architecture for sequence discrimination. Another key feature of GANN is the replicated splitting of data into training and test sets, and the implementation of negative controls. In validation experiments, GANN successfully merged important sequence and structural features to yield good predictive models for synthetic and real regulatory regions. Conclusion: GANN is a flexible tool that can search through large sets of sequence and structural feature combinations to identify those that best characterize a set of sequences.
Resumo:
Selection of machine learning techniques requires a certain sensitivity to the requirements of the problem. In particular, the problem can be made more tractable by deliberately using algorithms that are biased toward solutions of the requisite kind. In this paper, we argue that recurrent neural networks have a natural bias toward a problem domain of which biological sequence analysis tasks are a subset. We use experiments with synthetic data to illustrate this bias. We then demonstrate that this bias can be exploitable using a data set of protein sequences containing several classes of subcellular localization targeting peptides. The results show that, compared with feed forward, recurrent neural networks will generally perform better on sequence analysis tasks. Furthermore, as the patterns within the sequence become more ambiguous, the choice of specific recurrent architecture becomes more critical.
Resumo:
The Leximancer system is a relatively new method for transforming lexical co-occurrence information from natural language into semantic patterns in an unsupervised manner. It employs two stages of co-occurrence information extraction-semantic and relational-using a different algorithm for each stage. The algorithms used are statistical, but they employ nonlinear dynamics and machine learning. This article is an attempt to validate the output of Leximancer, using a set of evaluation criteria taken from content analysis that are appropriate for knowledge discovery tasks.
Resumo:
Document classification is a supervised machine learning process, where predefined category labels are assigned to documents based on the hypothesis derived from training set of labelled documents. Documents cannot be directly interpreted by a computer system unless they have been modelled as a collection of computable features. Rogati and Yang [M. Rogati and Y. Yang, Resource selection for domain-specific cross-lingual IR, in SIGIR 2004: Proceedings of the 27th annual international conference on Research and Development in Information Retrieval, ACM Press, Sheffied: United Kingdom, pp. 154-161.] pointed out that the effectiveness of document classification system may vary in different domains. This implies that the quality of document model contributes to the effectiveness of document classification. Conventionally, model evaluation is accomplished by comparing the effectiveness scores of classifiers on model candidates. However, this kind of evaluation methods may encounter either under-fitting or over-fitting problems, because the effectiveness scores are restricted by the learning capacities of classifiers. We propose a model fitness evaluation method to determine whether a model is sufficient to distinguish positive and negative instances while still competent to provide satisfactory effectiveness with a small feature subset. Our experiments demonstrated how the fitness of models are assessed. The results of our work contribute to the researches of feature selection, dimensionality reduction and document classification.
Resumo:
Background. The factors behind the reemergence of severe, invasive group A streptococcal (GAS) diseases are unclear, but it could be caused by altered genetic endowment in these organisms. However, data from previous studies assessing the association between single genetic factors and invasive disease are often conflicting, suggesting that other, as-yet unidentified factors are necessary for the development of this class of disease. Methods. In this study, we used a targeted GAS virulence microarray containing 226 GAS genes to determine the virulence gene repertoires of 68 GAS isolates (42 associated with invasive disease and 28 associated with noninvasive disease) collected in a defined geographic location during a contiguous time period. We then employed 3 advanced machine learning methods (genetic algorithm neural network, support vector machines, and classification trees) to identify genes with an increased association with invasive disease. Results. Virulence gene profiles of individual GAS isolates varied extensively among these geographically and temporally related strains. Using genetic algorithm neural network analysis, we identified 3 genes with a marginal overrepresentation in invasive disease isolates. Significantly, 2 of these genes, ssa and mf4, encoded superantigens but were only present in a restricted set of GAS M-types. The third gene, spa, was found in variable distributions in all M-types in the study. Conclusions. Our comprehensive analysis of GAS virulence profiles provides strong evidence for the incongruent relationships among any of the 226 genes represented on the array and the overall propensity of GAS to cause invasive disease, underscoring the pathogenic complexity of these diseases, as well as the importance of multiple bacteria and/ or host factors.
Resumo:
We consider the statistical problem of catalogue matching from a machine learning perspective with the goal of producing probabilistic outputs, and using all available information. A framework is provided that unifies two existing approaches to producing probabilistic outputs in the literature, one based on combining distribution estimates and the other based on combining probabilistic classifiers. We apply both of these to the problem of matching the HI Parkes All Sky Survey radio catalogue with large positional uncertainties to the much denser SuperCOSMOS catalogue with much smaller positional uncertainties. We demonstrate the utility of probabilistic outputs by a controllable completeness and efficiency trade-off and by identifying objects that have high probability of being rare. Finally, possible biasing effects in the output of these classifiers are also highlighted and discussed.
Resumo:
Background: The structure of proteins may change as a result of the inherent flexibility of some protein regions. We develop and explore probabilistic machine learning methods for predicting a continuum secondary structure, i.e. assigning probabilities to the conformational states of a residue. We train our methods using data derived from high-quality NMR models. Results: Several probabilistic models not only successfully estimate the continuum secondary structure, but also provide a categorical output on par with models directly trained on categorical data. Importantly, models trained on the continuum secondary structure are also better than their categorical counterparts at identifying the conformational state for structurally ambivalent residues. Conclusion: Cascaded probabilistic neural networks trained on the continuum secondary structure exhibit better accuracy in structurally ambivalent regions of proteins, while sustaining an overall classification accuracy on par with standard, categorical prediction methods.
Resumo:
Background: Designing novel proteins with site-directed recombination has enormous prospects. By locating effective recombination sites for swapping sequence parts, the probability that hybrid sequences have the desired properties is increased dramatically. The prohibitive requirements for applying current tools led us to investigate machine learning to assist in finding useful recombination sites from amino acid sequence alone. Results: We present STAR, Site Targeted Amino acid Recombination predictor, which produces a score indicating the structural disruption caused by recombination, for each position in an amino acid sequence. Example predictions contrasted with those of alternative tools, illustrate STAR'S utility to assist in determining useful recombination sites. Overall, the correlation coefficient between the output of the experimentally validated protein design algorithm SCHEMA and the prediction of STAR is very high (0.89). Conclusion: STAR allows the user to explore useful recombination sites in amino acid sequences with unknown structure and unknown evolutionary origin. The predictor service is available from http://pprowler.itee.uq.edu.au/star.
Resumo:
We present an application of Mathematical Morphology (MM) for the classification of astronomical objects, both for star/galaxy differentiation and galaxy morphology classification. We demonstrate that, for CCD images, 99.3 +/- 3.8% of galaxies can be separated from stars using MM, with 19.4 +/- 7.9% of the stars being misclassified. We demonstrate that, for photographic plate images, the number of galaxies correctly separated from the stars can be increased using our MM diffraction spike tool, which allows 51.0 +/- 6.0% of the high-brightness galaxies that are inseparable in current techniques to be correctly classified, with only 1.4 +/- 0.5% of the high-brightness stars contaminating the population. We demonstrate that elliptical (E) and late-type spiral (Sc-Sd) galaxies can be classified using MM with an accuracy of 91.4 +/- 7.8%. It is a method involving fewer 'free parameters' than current techniques, especially automated machine learning algorithms. The limitation of MM galaxy morphology classification based on seeing and distance is also presented. We examine various star/galaxy differentiation and galaxy morphology classification techniques commonly used today, and show that our MM techniques compare very favourably.
Resumo:
This paper presents a composite multi-layer classifier system for predicting the subcellular localization of proteins based on their amino acid sequence. The work is an extension of our previous predictor PProwler v1.1 which is itself built upon the series of predictors SignalP and TargetP. In this study we outline experiments conducted to improve the classifier design. The major improvement came from using Support Vector machines as a "smart gate" sorting the outputs of several different targeting peptide detection networks. Our final model (PProwler v1.2) gives MCC values of 0.873 for non-plant and 0.849 for plant proteins. The model improves upon the accuracy of our previous subcellular localization predictor (PProwler v1.1) by 2% for plant data (which represents 7.5% improvement upon TargetP).