821 results for Learning methods
Abstract:
Thesis (Master's)--University of Washington, 2016-08
Abstract:
The aim of this thesis project is to automatically localize HCC tumors in the human liver and then predict whether a tumor will undergo microvascular infiltration (MVI), the initial stage of metastasis development. The input data were partly supplied by Sant'Orsola Hospital and partly downloaded from online medical databases. Two U-Net models were implemented for the automatic segmentation of the liver and of the HCC malignancies within it. The segmentation models were evaluated with the Intersection-over-Union (IOU) and Dice Coefficient (DC) metrics. The outcomes obtained for automatic liver segmentation are quite good (IOU = 0.82; DC = 0.35); those obtained for automatic tumor segmentation (IOU = 0.35; DC = 0.46) are instead affected by some limitations: the algorithm is almost always able to detect the location of the tumor, but it tends to underestimate its dimensions. The purpose of the segmentation is to obtain the CT images of the HCC tumors, which are needed for feature extraction. The 14 Haralick features calculated from the 3D-GLCM, the 120 radiomic features and the patients' clinical information are collected to build a dataset of 153 features. The goal is then to build a model able to discriminate, based on the given features, between the tumors that will undergo MVI and those that will not. This task can be seen as a classification problem: each tumor needs to be classified as either "MVI positive" or "MVI negative". Feature selection techniques are implemented to identify the most descriptive features for the problem at hand, and a set of classification models is then trained and compared. The models with the best performance (around 80-84% ± 8-15%) turn out to be the XGBoost Classifier, the SGD Classifier and the Logistic Regression models (without penalization and with Lasso, Ridge or Elastic Net penalization).
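The two segmentation metrics named in this abstract can be illustrated with a minimal sketch. It assumes flat binary masks stored as Python lists; the function name `iou_and_dice` and the toy masks are hypothetical and are not taken from the thesis.

```python
def iou_and_dice(pred, truth):
    """Return (IoU, Dice) for two equally sized flat binary masks (sequences of 0/1)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_area = sum(1 for p in pred if p)
    truth_area = sum(1 for t in truth if t)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Assumed convention: two empty masks are treated as a perfect match.
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + truth_area)
    return iou, dice


# Hypothetical 1-D "masks": the prediction covers only part of the true region,
# which mimics a tumor whose size is underestimated.
truth = [0, 1, 1, 1, 1, 0]
pred = [0, 1, 1, 0, 0, 0]
iou, dice = iou_and_dice(pred, truth)
print(f"IoU = {iou:.3f}, Dice = {dice:.3f}")
```

For a single pair of masks the two metrics are linked by Dice = 2·IoU / (1 + IoU), so Dice is never smaller than IoU.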
Abstract:
Machine Learning makes computers capable of performing tasks that typically require human intelligence. One domain where it is having a considerable impact is the life sciences, where it makes it possible to devise new biological analysis protocols, develop patient treatments faster and more efficiently, and reduce healthcare costs. This Thesis presents new Machine Learning methods and pipelines for the life sciences, focusing on the unsupervised field. At the methodological level, two methods are presented. The first is an "Ab Initio Local Principal Path", a revised and improved version of a pre-existing algorithm in the manifold learning realm. The second contribution is an improvement of the Import Vector Domain Description (one-class learning) through the Kullback-Leibler divergence. It hybridizes kernel methods with Deep Learning, obtaining a scalable solution, an improved probabilistic model, and state-of-the-art performance. Both methods are tested through several experiments, with a central focus on their relevance in the life sciences. Results show that they improve on the performance achieved by their previous versions. At the application level, two pipelines are presented. The first is for the analysis of RNA-Seq datasets, both transcriptomic and single-cell data, and is aimed at identifying genes that may be involved in biological processes (e.g., the transition of tissues from normal to cancer). In this project, an R package is released on CRAN to make the pipeline accessible to the bioinformatics community through high-level APIs. The second pipeline is in the drug discovery domain and is useful for identifying druggable pockets, namely regions of a protein with a high probability of accepting a small molecule (a drug). Both pipelines achieve remarkable results. Lastly, a detour application is developed to identify the strengths and limitations of the "Principal Path" algorithm by analyzing the vector spaces induced by Convolutional Neural Networks.
This application is conducted in the music and visual arts domains.
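The Kullback-Leibler divergence mentioned in this abstract can be sketched for the simplest case of two discrete distributions. This shows only the divergence itself and is not the thesis's modified Import Vector Domain Description; the function name `kl_divergence` is hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing by convention
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
```

The divergence is zero when the two distributions coincide and is not symmetric in its arguments.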
Abstract:
In medicine, innovation depends on a better knowledge of the mechanisms of the human body, which is a complex system of multi-scale constituents. Unraveling the complexity underlying diseases proves to be challenging. A deep understanding of the inner workings requires dealing with a large amount of heterogeneous information. Exploring the molecular status and the organization of genes, proteins and metabolites provides insights into what is driving a disease, from aggressiveness to curability. Molecular constituents, however, are only the building blocks of the human body and cannot currently tell the whole story of diseases. This is why attention is growing towards the simultaneous exploitation of multi-scale information. Holistic methods are therefore drawing interest as a way to address the problem of integrating heterogeneous data. The heterogeneity may derive from the diversity across data types and from the diversity within diseases. Here, four studies conducted data integration using custom-designed workflows that implement novel methods and views to tackle the heterogeneous characterization of diseases. The first study was devoted to determining shared gene regulatory signatures for onco-hematology, and it showed partial co-regulation across blood-related diseases. The second study focused on Acute Myeloid Leukemia and refined the unsupervised integration of genomic alterations, which turned out to better resemble clinical practice. In the third study, network integration for atherosclerosis demonstrated, as a proof of concept, the impact of network intelligibility when modeling heterogeneous data, which was shown to accelerate the identification of new potential pharmaceutical targets. Lastly, the fourth study introduced a new method to integrate multiple data types into a single latent heterogeneous representation that facilitated the selection of important data types to predict the tumour stage of invasive ductal carcinoma.
The results of these four studies laid the groundwork to ease the detection of new biomarkers ultimately beneficial to medical practice and to the ever-growing field of Personalized Medicine.
Abstract:
Learning of preference relations has recently received significant attention in the machine learning community. It is closely related to classification and regression analysis and can be reduced to these tasks. However, preference learning involves predicting an ordering of the data points rather than a single numerical value, as in regression, or a class label, as in classification. Therefore, studying preference relations within a separate framework not only facilitates a better theoretical understanding of the problem but also motivates the development of efficient algorithms for the task. Preference learning has many applications in domains such as information retrieval, bioinformatics and natural language processing. For example, algorithms that learn to rank are frequently used in search engines for ordering the documents retrieved by a query. Preference learning methods have also been applied to collaborative filtering problems for predicting individual customer choices from vast amounts of user-generated feedback. In this thesis we propose several algorithms for learning preference relations. These algorithms stem from the well-founded and robust class of regularized least-squares methods and have many attractive computational properties. In order to improve the performance of our methods, we introduce several non-linear kernel functions. Thus, the contribution of this thesis is twofold: kernel functions for structured data, which are used to take advantage of various non-vectorial data representations, and preference learning algorithms that are suitable for different tasks, namely efficient learning of preference relations, learning with large amounts of training data, and semi-supervised preference learning. The proposed kernel-based algorithms and kernels are applied to the parse ranking task in natural language processing, document ranking in information retrieval, and remote homology detection in the bioinformatics domain.
Training of kernel-based ranking algorithms can be infeasible when the training set is large. This problem is addressed by proposing a preference learning algorithm whose computational complexity scales linearly with the number of training data points. We also introduce a sparse approximation of the algorithm that can be trained efficiently with large amounts of data. For situations in which only a small amount of labeled data but a large amount of unlabeled data is available, we propose a co-regularized preference learning algorithm. To conclude, the methods presented in this thesis address not only the problem of efficient training of the algorithms but also fast regularization parameter selection, multiple output prediction, and cross-validation. Furthermore, the proposed algorithms lead to notably better performance in many of the preference learning tasks considered.
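The idea of reducing preference learning to regularized least squares can be illustrated with a minimal linear sketch. It assumes a squared loss on pairwise score differences with an L2 penalty, fitted by plain gradient descent. This is one common formulation and not necessarily the thesis's algorithm; in particular it enumerates all pairs and therefore does not reproduce the linear-time training described above. The function names, the toy data and the hyperparameters are hypothetical.

```python
def fit_pairwise_rls(features, scores, lam=0.01, lr=0.01, steps=2000):
    """Fit linear weights w by gradient descent on
    sum_{i<j} (w.(x_i - x_j) - (y_i - y_j))^2 + lam * |w|^2."""
    dim = len(features[0])
    pairs = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            diff = [a - b for a, b in zip(features[i], features[j])]
            pairs.append((diff, scores[i] - scores[j]))
    w = [0.0] * dim
    for _ in range(steps):
        grad = [2 * lam * wk for wk in w]  # gradient of the L2 penalty
        for diff, target in pairs:
            residual = sum(wk * dk for wk, dk in zip(w, diff)) - target
            for k in range(dim):
                grad[k] += 2 * residual * diff[k]
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


def rank(features, w):
    """Return item indices ordered from most to least preferred."""
    predicted = [sum(wk * xk for wk, xk in zip(w, x)) for x in features]
    return sorted(range(len(features)), key=lambda i: -predicted[i])


# Hypothetical toy data: the utility of each item is 2*x0 - x1.
X = [[1, 0], [0, 1], [1, 1], [2, 1], [0, 2]]
y = [2, -1, 1, 3, -2]
w = fit_pairwise_rls(X, y)
ranking = rank(X, w)
print(w, ranking)
```

Only the differences between scores enter the loss, so the model learns an ordering rather than absolute values.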
Abstract:
Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantification technology, could be continuous numbers or counts. With the advancement of high-throughput technology, the abundance of such data has become unprecedentedly rich. Therefore, efficient statistical approaches are crucial in this big data era.
Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent multivariate vector with a Gaussian distribution. With this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are assumed to be represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions to these statistical methods, developed to address challenges in big data. The novel methods are applied in multiple real-world applications, including the construction of condition-specific gene co-expression networks, estimating shared topics among newsgroups, the analysis of promoter sequences, the analysis of political-economic risk data, and estimating population structure from genotype data.
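The low-rank covariance produced by a factor model can be written out explicitly. The notation below is an assumption, since the abstract introduces no symbols: x is a centered observed vector of dimension p, z is the latent vector of dimension k < p, Λ is the p × k loading matrix, and Ψ is a diagonal noise covariance.

```latex
x = \Lambda z + \varepsilon, \qquad
z \sim \mathcal{N}(0, I_k), \qquad
\varepsilon \sim \mathcal{N}(0, \Psi)
\quad\Longrightarrow\quad
\operatorname{Cov}(x) = \Lambda \Lambda^{\top} + \Psi .
```

The term ΛΛᵀ has rank at most k, which is the low-rank structure referred to above.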
Abstract:
The dissertation starts by describing the phenomena related to the increasing importance recently acquired by satellite applications. The spread of this technology has implications, such as an increase in maintenance cost, which drive the interest in developing advanced techniques that give spacecraft greater autonomy in health monitoring. Machine learning techniques are widely employed to lay a foundation for effective systems specialized in fault detection by examining telemetry data. Telemetry consists of a considerable amount of information; therefore, the adopted algorithms must be able to handle multivariate data while facing the limitations imposed by on-board hardware. In the framework of outlier detection, the dissertation addresses unsupervised machine learning methods. In the unsupervised scenario, no prior knowledge of the data behavior is assumed. Specifically, two models are examined, namely Local Outlier Factor and One-Class Support Vector Machines. Their performance is compared in terms of both the prediction accuracy achieved and the corresponding computational cost. Both models are trained and tested on the same sets of time series data in a variety of settings, aimed at gaining insights into the effect of increasing dimensionality. The results support the claim that both models, combined with proper tuning of their characteristic parameters, successfully act as outlier detectors in multivariate time series data. Nevertheless, in this specific context, Local Outlier Factor outperforms One-Class SVM, in that it proves to be more stable over a wider range of input parameter values. This property is especially valuable in unsupervised learning, since it suggests that the model is well suited to adapting to unforeseen patterns.
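The Local Outlier Factor score can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the dissertation's implementation: it keeps exactly k neighbours and ignores distance ties, it assumes no duplicate points, and the toy data are hypothetical. One-Class SVM is not sketched here.

```python
import math


def knn(points, i, k):
    """Return the k nearest neighbours of point i as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return dists[:k]


def local_outlier_factor(points, k=2):
    """Return one LOF score per point; values well above 1 suggest outliers."""
    n = len(points)
    neighbours = [knn(points, i, k) for i in range(n)]
    k_distance = [nb[-1][0] for nb in neighbours]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: average density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical data: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
lof_scores = local_outlier_factor(points, k=2)
print(lof_scores)
```

The neighbourhood size k is the main tuning parameter of the method, and the scores should be checked over a range of values.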
Abstract:
The purpose of this investigation was to evaluate three learning methods for teaching basic oral surgical skills. Thirty predoctoral dental students without any surgical knowledge or previous surgical experience were divided into three groups (n=10 each) according to instructional strategy: Group 1, active learning; Group 2, text reading only; and Group 3, text reading and video demonstration. After instruction, the apprentices were allowed to practice incision, dissection and suture maneuvers in a bench learning model. During the students' performance, a structured practice evaluation test to account for correct or incorrect maneuvers was applied by trained observers. Evaluation tests were repeated after thirty and sixty days. Data from resulting scores between groups and periods were considered for statistical analysis (ANOVA and Tukey-Kramer) with a significance level of α = 0.05. Results showed that the active learning group presented significantly better learning outcomes related to immediate assimilation of surgical procedures compared to the other groups. All groups' results were similar sixty days after the first practice. Assessment tests were fundamental to evaluate teaching strategies and allowed theoretical and proficiency learning feedback. Repetition and interactive practice promoted retention of knowledge on basic oral surgical skills.
Abstract:
Learning organizations are a special form of organization where enhancing learning is a strategy to increase intellectual capital. Developing learning organizations has become an imperative for many managers, since an organization's learning methods and rate may be the only source of sustainable competitive advantage. However, learning organization theory tends to be prescriptive and rhetorical, with empirical research still relatively new. This paper contributes to the literature by reporting case-study research in progress based on four Australian organizations. In the organizations studied, use of the learning organization metaphor was coupled with an emergent metaphor: organization as "family". By employing structure mapping of metaphor within analytical induction, both established methods but not combined before, this paper shows how theory might be developed from metaphor.
Abstract:
Reinforcement Learning is an area of Machine Learning that deals with how an agent should take actions in an environment so as to maximize some notion of accumulated reward. This type of learning is inspired by the way humans learn and has led to the creation of various reinforcement learning algorithms. These algorithms focus on how an agent's behaviour can be improved, assuming independence from its surroundings. The current work studies the application of reinforcement learning methods to the inverted pendulum problem. The importance of the variability of the environment (factors that are external to the agent) for the execution of reinforcement learning agents is studied using a model that seeks to obtain equilibrium (stability) through dynamism: a Cart-Pole system, or inverted pendulum. We sought to improve the behaviour of the autonomous agents by changing the information passed to them while keeping the agent's internal parameters constant (learning rate, discount factors, decay rate, etc.), instead of the classical approach of tuning the agent's internal parameters. The influence of changes to the state set and the action set on an agent's capability to solve the Cart-Pole problem was studied. We studied the typical behaviour of reinforcement learning agents applied to the classic BOXES model, and a new form of characterizing the environment was proposed using the notion of convergence towards a reference value. We demonstrate the gain in performance of this new method applied to a Q-Learning agent.
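The two ingredients discussed in this abstract, a discretized state set and a Q-Learning agent, can be sketched as follows. This is a generic tabular Q-Learning update with a BOXES-style binning helper, not the new environment characterization proposed in the work. The names `box_index` and `q_update`, the thresholds, and the default values of `alpha` and `gamma` are hypothetical.

```python
import bisect
from collections import defaultdict


def box_index(value, thresholds):
    """Map a continuous reading (e.g. pole angle) to the index of its box."""
    return bisect.bisect_right(thresholds, value)


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-Learning update and return the new Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


actions = ["left", "right"]
q = defaultdict(float)  # unseen state-action pairs start at 0.0

# Hypothetical pole-angle thresholds (radians) giving three boxes.
state = box_index(0.0, [-0.1, 0.1])
first = q_update(q, state, "left", 1.0, state + 1, actions)
q[(state + 1, "right")] = 2.0
second = q_update(q, state, "left", 1.0, state + 1, actions)
print(state, first, second)
```

Changing the thresholds changes the state set seen by the agent while `alpha` and `gamma` stay fixed, which mirrors the kind of comparison described in the abstract.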
Abstract:
Personal memories composed of digital pictures are very popular at the moment. To retrieve these media items, annotation is required. In recent years, several approaches have been proposed to overcome the image annotation problem. This paper presents our proposals to address this problem. Automatic and semi-automatic learning methods for semantic concepts are presented. The automatic method is based on semantic concepts estimated using visual content, context metadata and audio information. The semi-automatic method is based on results provided by a computer game. The paper describes our proposals and presents their evaluations.
Abstract:
This Thesis describes the application of automatic learning methods for a) the classification of organic and metabolic reactions, and b) the mapping of Potential Energy Surfaces (PES). The classification of reactions was approached with two distinct methodologies: a representation of chemical reactions based on NMR data, and a representation of chemical reactions from the reaction equation based on the physico-chemical and topological features of chemical bonds. NMR-based classification of photochemical and enzymatic reactions. Photochemical and metabolic reactions were classified by Kohonen Self-Organizing Maps (Kohonen SOMs) and Random Forests (RFs) taking as input the difference between the 1H NMR spectra of the products and the reactants. Such a representation can be applied in the automatic analysis of changes in the 1H NMR spectrum of a mixture and their interpretation in terms of the chemical reactions taking place. Examples of possible applications are the monitoring of reaction processes, evaluation of the stability of chemicals, or even the interpretation of metabonomic data. A Kohonen SOM trained with a data set of metabolic reactions catalysed by transferases was able to correctly classify 75% of an independent test set in terms of the EC number subclass. Random Forests improved the correct predictions to 79%. With photochemical reactions classified into 7 groups, an independent test set was classified with 86-93% accuracy. The data set of photochemical reactions was also used to simulate mixtures with two reactions occurring simultaneously. Kohonen SOMs and Feed-Forward Neural Networks (FFNNs) were trained to classify the reactions occurring in a mixture based on the 1H NMR spectra of the products and reactants. Kohonen SOMs allowed the correct assignment of 53-63% of the mixtures (in a test set). Counter-Propagation Neural Networks (CPNNs) gave similar results.
The use of supervised learning techniques allowed an improvement in the results, to 77% of correct assignments when an ensemble of ten FFNNs was used and to 80% when Random Forests were used. This study was performed with NMR data simulated from the molecular structure by the SPINUS program. In the design of one test set, simulated data was combined with experimental data. The results support the proposal of linking databases of chemical reactions to experimental or simulated NMR data for automatic classification of reactions and mixtures of reactions. Genome-scale classification of enzymatic reactions from their reaction equation. The MOLMAP descriptor relies on a Kohonen SOM that defines types of bonds on the basis of their physico-chemical and topological properties. The MOLMAP descriptor of a molecule represents the types of bonds available in that molecule. The MOLMAP descriptor of a reaction is defined as the difference between the MOLMAPs of the products and the reactants, and numerically encodes the pattern of bonds that are broken, changed, and made during a chemical reaction. The automatic perception of chemical similarities between metabolic reactions is required for a variety of applications, ranging from the computer validation of classification systems and the genome-scale reconstruction (or comparison) of metabolic pathways to the classification of enzymatic mechanisms. Catalytic functions of proteins are generally described by the EC numbers, which are simultaneously employed as identifiers of reactions, enzymes, and enzyme genes, thus linking metabolic and genomic information. Different methods should be available to automatically compare metabolic reactions and to automatically assign EC numbers to reactions not yet officially classified.
In this study, the genome-scale data set of enzymatic reactions available in the KEGG database was encoded by the MOLMAP descriptors and submitted to Kohonen SOMs to compare the resulting map with the official EC number classification, to explore the possibility of predicting EC numbers from the reaction equation, and to assess the internal consistency of the EC classification at the class level. A general agreement with the EC classification was observed, i.e. a relationship between the similarity of MOLMAPs and the similarity of EC numbers. At the same time, MOLMAPs were able to discriminate between EC sub-subclasses. EC numbers could be assigned at the class, subclass, and sub-subclass levels with accuracies up to 92%, 80%, and 70% for independent test sets. The correspondence between chemical similarity of metabolic reactions and their MOLMAP descriptors was applied to the identification of a number of reactions mapped into the same neuron but belonging to different EC classes, which demonstrated the ability of the MOLMAP/SOM approach to verify the internal consistency of classifications in databases of metabolic reactions. RFs were also used to assign the four levels of the EC hierarchy from the reaction equation. EC numbers were correctly assigned in 95%, 90%, 85% and 86% of the cases (for independent test sets) at the class, subclass, sub-subclass and full EC number level, respectively. Experiments for the classification of reactions from the main reactants and products were performed with RFs: EC numbers were assigned at the class, subclass and sub-subclass level with accuracies of 78%, 74% and 63%, respectively. In the course of the experiments with metabolic reactions we suggested that the MOLMAP/SOM concept could be extended to the representation of other levels of metabolic information such as metabolic pathways.
Following the MOLMAP idea, the pattern of neurons activated by the reactions of a metabolic pathway is a representation of the reactions involved in that pathway, that is, a descriptor of the metabolic pathway. This reasoning enabled the comparison of different pathways, the automatic classification of pathways, and a classification of organisms based on their biochemical machinery. The three levels of classification (from bonds to metabolic pathways) made it possible to map and perceive chemical similarities between metabolic pathways, even for pathways of different types of metabolism and pathways that do not share similarities in terms of EC numbers. Mapping of PES by neural networks (NNs). In a first series of experiments, ensembles of Feed-Forward NNs (EnsFFNNs) and Associative Neural Networks (ASNNs) were trained to reproduce PES represented by the Lennard-Jones (LJ) analytical potential function. The accuracy of the method was assessed by comparing the results of molecular dynamics simulations (thermal, structural, and dynamic properties) obtained from the NNs-PES and from the LJ function. The results indicated that for LJ-type potentials, NNs can be trained to generate accurate PES to be used in molecular simulations. EnsFFNNs and ASNNs gave better results than single FFNNs. The NN models show a remarkable ability to interpolate between distant curves and accurately reproduce potentials to be used in molecular simulations. The purpose of the first study was to systematically analyse the accuracy of different NNs. Our main motivation, however, is reflected in the next study: the mapping of multidimensional PES by NNs to simulate, by Molecular Dynamics or Monte Carlo, the adsorption and self-assembly of solvated organic molecules on noble-metal electrodes. Indeed, for such complex and heterogeneous systems the development of suitable analytical functions that fit quantum mechanical interaction energies is a non-trivial or even impossible task.
The data consisted of energy values, from Density Functional Theory (DFT) calculations, at different distances, for several molecular orientations and three electrode adsorption sites. The results indicate that NNs require a data set large enough to cover well the diversity of possible interaction sites, distances, and orientations. NNs trained with such data sets can perform as well as or even better than analytical functions. Therefore, they can be used in molecular simulations, particularly for the ethanol/Au(111) interface, which is the case studied in the present Thesis. Once properly trained, the networks are able to produce, as output, any required number of energy points for accurate interpolations.
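The Lennard-Jones analytical potential used as the reference PES in the first series of experiments has a standard closed form, V(r) = 4ε[(σ/r)^12 − (σ/r)^6]. The sketch below evaluates it and builds a small table of (distance, energy) pairs of the kind a network could be trained on. The reduced units (ε = σ = 1), the sampling grid and the function name are assumptions; no network is trained here.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair potential V(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    if r <= 0:
        raise ValueError("the interparticle distance must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# The minimum lies at r = 2**(1/6) * sigma, where V = -epsilon.
r_min = 2 ** (1 / 6)

# Hypothetical training table: distances from 0.9 to 2.5 sigma in steps of 0.1.
samples = [(0.9 + 0.1 * i, lennard_jones(0.9 + 0.1 * i)) for i in range(17)]
print(lennard_jones(1.0), lennard_jones(r_min), len(samples))
```

Because the reference function is cheap to evaluate, any number of training and test points can be generated to check how well a fitted model interpolates between them.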
Abstract:
Recently, kernel-based Machine Learning methods have gained great popularity in many data analysis and data mining fields: pattern recognition, biocomputing, speech and vision, engineering, remote sensing, etc. The paper describes the use of kernel methods to approach the processing of large datasets from environmental monitoring networks. Several typical problems of the environmental sciences and their solutions provided by kernel-based methods are considered: classification of categorical data (soil type classification), mapping of continuous environmental and pollution information (pollution of soil by radionuclides), and mapping with auxiliary information (climatic data from the Aral Sea region). Promising developments, such as automatic emergency hot spot detection and monitoring network optimization, are discussed as well.