797 results for Machine Learning Algorithms
Abstract:
The paper presents a novel method for monitoring network optimisation, based on a recent machine learning technique known as the support vector machine. It is problem-oriented in the sense that it directly answers the question of whether the advised spatial location is important for the classification model. The method can be used to increase the accuracy of classification models by taking a small number of additional measurements. Traditionally, network optimisation is performed by analysing the kriging variances. The method is compared with the traditional approach on a real case study with climate data.
Abstract:
We present a new framework for large-scale data clustering. The main idea is to modify functional dimensionality reduction techniques to directly optimize over discrete labels using stochastic gradient descent. Compared to methods like spectral clustering, our approach solves a single optimization problem rather than an ad hoc two-stage optimization, does not require a matrix inversion, can easily encode prior knowledge in the set of implementable functions, and does not have an "out-of-sample" problem. Experimental results on both artificial and real-world datasets show the usefulness of our approach.
Abstract:
This paper investigates the use of ensembles of predictors to improve the performance of spatial prediction methods. Support vector regression (SVR), a popular method from the field of statistical machine learning, is used. Several instances of SVR are combined using different data sampling schemes (bagging and boosting). Bagging shows good performance, and proves to be more computationally efficient than training a single SVR model while reducing error. Boosting, however, does not improve results on this specific problem.
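The bagging scheme described above can be illustrated with a minimal sketch. It is an assumption-laden toy, not the paper's implementation: a least-squares line stands in for SVR so that the example needs only the standard library, and the names `fit_line`, `bagged_fit` and `bagged_predict` are hypothetical.

```python
import random
import statistics


def fit_line(points):
    """Least-squares fit y = slope * x + intercept for a list of (x, y) pairs.

    Stand-in base learner (assumption): the paper uses SVR, not a line.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        # Degenerate bootstrap sample (all x equal): fall back to a constant.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bagged_fit(points, n_models=25, seed=0):
    """Train one base learner per bootstrap resample of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) samples with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_line(sample))
    return models


def bagged_predict(models, x):
    """Average the predictions of all base learners."""
    return statistics.fmean([slope * x + intercept for slope, intercept in models])
```

Boosting differs in that each new learner is trained with more weight on the samples the previous learners predicted poorly, rather than on independent resamples.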
Abstract:
Remote sensing image processing is nowadays a mature research area. The techniques developed in the field enable many real-life applications with great societal value. For instance, urban monitoring, fire detection or flood prediction can have a great impact on economic and environmental issues. To attain such objectives, the remote sensing community has evolved into a multidisciplinary field of science that embraces physics, signal theory, computer science, electronics, and communications. From a machine learning and signal/image processing point of view, all the applications are tackled under specific formalisms, such as classification and clustering, regression and function approximation, image coding, restoration and enhancement, source unmixing, data fusion or feature selection and extraction. This paper serves as a survey of methods and applications, and reviews the latest methodological advances in remote sensing image processing.
Abstract:
Computational anatomy with magnetic resonance imaging (MRI) is well established as a noninvasive biomarker of Alzheimer's disease (AD); however, there is less certainty about its dependency on the staging of AD. We use classical group analyses and automated machine learning classification of standard structural MRI scans to investigate AD diagnostic accuracy from the preclinical phase to clinical dementia. Longitudinal data from the Alzheimer's Disease Neuroimaging Initiative were stratified into 4 groups according to clinical status: (1) AD patients; (2) mild cognitive impairment (MCI) converters; (3) MCI nonconverters; and (4) healthy controls. These groups were submitted to a support vector machine. The obtained classifier was significantly above the chance level (62%) for detecting AD already 4 years before conversion from MCI. Voxel-based univariate tests confirmed the plausibility of our findings, detecting a distributed network of hippocampal-temporoparietal atrophy in AD patients. We also identified a subgroup of control subjects with brain structure and cognitive changes highly similar to those observed in AD. Our results indicate that computational anatomy can detect AD substantially earlier than suggested by current models. The demonstrated differential spatial pattern of atrophy between correctly and incorrectly classified AD patients challenges the assumption of a uniform pathophysiological process underlying clinically identified AD.
Abstract:
We present a machine learning approach to modeling bowing control parameter contours in violin performance. Using accurate sensing techniques we obtain relevant timbre-related bowing control parameters such as bow transversal velocity, bow pressing force, and bow-bridge distance of each performed note. Each performed note is represented by a curve parameter vector and a number of note classes are defined. The principal components of the data represented by the set of curve parameter vectors are obtained for each class. Once curve parameter vectors are expressed in the new space defined by the principal components, we train a model based on inductive logic programming, able to predict curve parameter vectors used for rendering bowing controls. We evaluate the prediction results and show the potential of the model by predicting bowing control parameter contours from an annotated input score.
Abstract:
The relationship between inflammation and cancer is well established in several tumor types, including bladder cancer. We performed an association study between 886 inflammatory-gene variants and bladder cancer risk in 1,047 cases and 988 controls from the Spanish Bladder Cancer (SBC)/EPICURO Study. A preliminary exploration with the widely used univariate logistic regression approach did not identify any significant SNP after correcting for multiple testing. We further applied two more comprehensive methods to capture the complexity of bladder cancer genetic susceptibility: Bayesian Threshold LASSO (BTL), a regularized regression method, and AUC-Random Forest (AUC-RF), a machine-learning algorithm. Both approaches explore the joint effect of markers. BTL analysis identified a signature of 37 SNPs in 34 genes showing an association with bladder cancer. AUC-RF detected an optimal predictive subset of 56 SNPs. Thirteen SNPs were identified by both methods in the total population. Using resources from the Texas Bladder Cancer study we were able to replicate 30% of the SNPs assessed. The associations between inflammatory SNPs and bladder cancer were reexamined among non-smokers to eliminate the effect of tobacco, one of the strongest and most prevalent environmental risk factors for this tumor. A 9-SNP signature was detected by BTL. Here we report, for the first time, a set of SNPs in inflammatory genes jointly associated with bladder cancer risk. These results highlight the importance of the complex structure of genetic susceptibility associated with cancer risk.
Abstract:
Modeling the mechanisms that determine how humans and other agents choose among different behavioral and cognitive processes (be they strategies, routines, actions, or operators) represents a paramount theoretical stumbling block across disciplines, ranging from the cognitive and decision sciences to economics, biology, and machine learning. By using the cognitive and decision sciences as a case study, we provide an introduction to what is also known as the strategy selection problem. First, we explain why many researchers assume humans and other animals to come equipped with a repertoire of behavioral and cognitive processes. Second, we expose three descriptive, predictive, and prescriptive challenges that are common to all disciplines which aim to model the choice among these processes. Third, we give an overview of different approaches to strategy selection. These include cost-benefit, ecological, learning, memory, unified, connectionist, sequential sampling, and maximization approaches. We conclude by pointing to opportunities for future research and by stressing that the selection problem is far from being resolved.
Abstract:
Plants are essential to human societies. Our daily food, construction materials and energy sources are derived from plant biomass. Yet the many developmental aspects of plants remain little explored and represent a major research topic for science. The emergence of high-throughput technologies for large-scale genome sequencing and high-resolution imaging now makes it possible to produce enormous quantities of information. Computational analysis is one way to integrate these data and to reduce their apparent complexity to an appropriate scale of abstraction, with the ultimate aim of providing targeted research perspectives. This is the primary motivation for this thesis. In other words, we apply descriptive and predictive methods combined with numerical simulations to provide original solutions to problems relating to morphogenesis at the scale of the cell and the organ. One of the main objectives of this thesis is to elucidate how the crosstalk between the phytohormones auxin and brassinosteroids (BRs) determines cell growth in the root apical meristem of Arabidopsis thaliana, the reference model organism for molecular studies in plants. To reconstruct the cellular signaling network, we extracted from the literature the relevant information on the relationships between the proteins involved in hormonal signal transduction. The network was then modeled using a logical, qualitative formalism to compensate for the absence of quantitative data.
First, the results confirmed that auxin and BRs act synergistically to control cell growth; they then explained paradoxical phenotypic observations and, finally, revealed a key interaction between two proteins in the maintenance of the root meristem. A subsequent study in the model plant Brachypodium distachyon (Brachypodium) revealed an adjustment of the auxin-ethylene crosstalk network relative to Arabidopsis. In the latter, interfering with auxin biosynthesis leads to the formation of a short root. Nevertheless, we isolated in Brachypodium a hypomorphic mutant in auxin biosynthesis that displays a longer root. We then conducted a morphometric analysis, which confirmed that more anisotropic cells (thinner and longer) underlie this root phenotype. Further analyses showed that the phenotypic difference between Brachypodium and Arabidopsis is explained by an inversion of the regulatory function in the relationship between the ethylene signaling network and auxin biosynthesis. The morphometric analysis used in the previous study relies on the image-processing pipeline of our quantitative histology method. During secondary growth, the bilateral symmetry of the hypocotyl is replaced by a radial symmetry and a concentric organization of the constituent tissues. These tissues initially comprise about a dozen cells but can easily reach tens of thousands in the final stages of development. This scale far exceeds what can be investigated by so-called 'traditional' means such as direct imaging of deep tissues.
The study of this system during this developmental phase can only be carried out by cutting thin sections of the organ, which prevents an understanding of the underlying dynamic cellular phenomena. We addressed this by proposing an original strategy called quantitative histology. Specifically, we extracted the information contained in very high-resolution images of transverse hypocotyl sections using a large-scale image analysis and segmentation pipeline. We then combined it with an automatic cell recognition algorithm. This tool allowed us to produce a quantitative description of the progression of secondary growth, revealing developmental patterns that are not apparent on conventional visual inspection. The formation of phloem poles in a repeated structure, spaced a constant distance apart, illustrates the benefits of our approach. Moreover, deeper analysis of these results showed a change in the anisotropic growth of cambium and phloem cells that appears to be in phase with xylem expansion. Combining genetic tools and biomechanical modeling, we demonstrated that only faster growth of the inner tissues can produce a reorientation of the anisotropic growth axis of the peripheral tissues. This prediction was confirmed by calculating the ratio of xylem to phloem growth rates during secondary development; high ratios are indeed observed and are concomitant with the progressive, tangential establishment of the cambium. These results suggest a self-organization mechanism established by a gradient of meristematic division that generates a distribution of mechanical constraints. This reorients the anisotropic growth of the peripheral tissues to support secondary growth.
- Plants are essential for human society, because our daily food, construction materials and sustainable energy are derived from plant biomass. Yet, despite this importance, the multiple developmental aspects of plants are still poorly understood and represent a major challenge for science. With the emergence of high-throughput devices for genome sequencing and high-resolution imaging, data have never been so easy to collect, generating huge amounts of information. Computational analysis is one way to integrate those data and to decrease the apparent complexity towards an appropriate scale of abstraction, with the aim of eventually providing new answers and directing further research perspectives. This is the motivation behind this thesis work, i.e. the application of descriptive and predictive analytics combined with computational modeling to answer problems that revolve around morphogenesis at the subcellular and organ scale. One of the goals of this thesis is to elucidate how the auxin-brassinosteroid phytohormone interaction determines cell growth in the root apical meristem of Arabidopsis thaliana (Arabidopsis), the plant model of reference for molecular studies. The pertinent information about signaling protein relationships was obtained from the literature to reconstruct the entire hormonal crosstalk. Due to a lack of quantitative information, we employed a qualitative modeling formalism. This work allowed us to confirm the synergistic effect of the hormonal crosstalk on cell elongation, to explain some of our paradoxical mutant phenotypes and to predict a novel interaction between the BREVIS RADIX (BRX) protein and the transcription factor MONOPTEROS (MP), which turned out to be critical for the maintenance of the root meristem. On the same subcellular scale, another study in the monocot model Brachypodium distachyon (Brachypodium) revealed an alternative wiring of auxin-ethylene crosstalk as compared to Arabidopsis.
In the latter, increasing interference with auxin biosynthesis results in progressively shorter roots. By contrast, a hypomorphic Brachypodium mutant isolated in this study, affected in an enzyme of the auxin biosynthesis pathway, displayed a dramatically longer seminal root. Our morphometric analysis confirmed that more anisotropic cells (thinner and longer) are principally responsible for the mutant root phenotype. Further characterization pointed towards an inverted regulatory logic in the relation between ethylene signaling and auxin biosynthesis in Brachypodium as compared to Arabidopsis, which explains the phenotypic discrepancy. Finally, the morphometric analysis of hypocotyl secondary growth that we applied in this study was performed with the image-processing pipeline of our quantitative histology method. During its secondary growth, the hypocotyl reorganizes its primary bilateral symmetry into a radial symmetry of highly specialized tissues comprising several thousand cells, starting from a few dozen. However, such a scale only permits observations in thin cross-sections, severely hampering a comprehensive analysis of the morphodynamics involved. Our quantitative histology strategy overcomes this limitation. We acquired hypocotyl cross-sections from tiled high-resolution images and extracted their information content using custom high-throughput image processing and segmentation. Coupled with an automated cell type recognition algorithm, it allows precise quantitative characterization of vascular development and reveals developmental patterns that were not evident from visual inspection, for example the steady interspace distance of the phloem poles. Further analyses indicated a change in growth anisotropy of cambial and phloem cells, which appeared in phase with the expansion of xylem.
Combining genetic tools and computational modeling, we showed that the reorientation of the growth anisotropy axis of peripheral tissue layers only occurs when the growth rate of the central tissue is higher than that of the peripheral one. This was confirmed by calculating the ratio of xylem to phloem growth rates throughout secondary growth. High ratios are indeed observed and are concomitant with the homogenization of cambium anisotropy. These results suggest a self-organization mechanism, promoted by a gradient of division in the cambium that generates a pattern of mechanical constraints. This, in turn, reorients the growth anisotropy of peripheral tissues to sustain the secondary growth.
Abstract:
The quality of environmental data analysis and propagation of errors are heavily affected by the representativity of the initial sampling design [CRE 93, DEU 97, KAN 04a, LEN 06, MUL 07]. Geostatistical methods such as kriging are related to field samples, whose spatial distribution is crucial for the correct detection of the phenomena. Literature about the design of environmental monitoring networks (MN) is widespread and several interesting books have recently been published [GRU 06, LEN 06, MUL 07] in order to clarify the basic principles of spatial sampling design. An approach to spatial sampling design (monitoring network optimization) based on Support Vector Machines has also been proposed. Nonetheless, modelers often receive real data coming from environmental monitoring networks that suffer from problems of non-homogeneity (clustering). Clustering can be related to preferential sampling or to the impossibility of reaching certain regions.
Abstract:
Many classification systems rely on clustering techniques in which a collection of training examples is provided as input, and a number of clusters c1, ..., cm modelling some concept C results as output, such that every cluster ci is labelled as positive or negative. Given a new, unlabelled instance enew, the above classification is used to determine to which particular cluster ci this new instance belongs. In such a setting clusters can overlap, and a new unlabelled instance can be assigned to more than one cluster with conflicting labels. In the literature, such a case is usually resolved non-deterministically by making a random choice. This paper presents a novel, hybrid approach to resolving this situation by combining a neural network for classification with a defeasible argumentation framework which models preference criteria for performing clustering.
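The conflict described above, where an instance falls into overlapping clusters with opposite labels, can be sketched as follows. This is only an illustration under stated assumptions: clusters are modelled as spheres (centroid, radius, label), and a single hand-written preference criterion (smallest distance relative to radius) replaces both the neural network and the defeasible argumentation framework of the paper. The function name `resolve_label` is hypothetical.

```python
import math


def resolve_label(instance, clusters):
    """Return a label for `instance`, or None if no cluster covers it.

    `clusters` is a list of (centroid, radius, label) tuples with radius > 0.
    Spherical clusters and the preference rule below are assumptions made
    for illustration only.
    """
    covering = [c for c in clusters if math.dist(instance, c[0]) <= c[1]]
    if not covering:
        return None
    labels = {label for _, _, label in covering}
    if len(labels) == 1:
        # No conflict: every covering cluster agrees.
        return labels.pop()
    # Conflict: instead of a random choice, prefer the cluster in which the
    # instance lies deepest (smallest distance relative to the radius).
    best = min(covering, key=lambda c: math.dist(instance, c[0]) / c[1])
    return best[2]
```

The point of the sketch is that the outcome is deterministic and traceable to an explicit criterion, which is the property the random tie-break lacks.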
Abstract:
Transmission of drug-resistant pathogens presents an almost-universal challenge for fighting infectious diseases. Transmitted drug resistance mutations (TDRM) can persist in the absence of drugs for considerable time. It is generally believed that differential TDRM-persistence is caused, at least partially, by variations in TDRM-fitness-costs. However, in vivo epidemiological evidence for the impact of fitness costs on TDRM-persistence is rare. Here, we studied the persistence of TDRM in HIV-1 using longitudinally-sampled nucleotide sequences from the Swiss-HIV-Cohort-Study (SHCS). All treatment-naïve individuals with TDRM at baseline were included. Persistence of TDRM was quantified via reversion rates (RR) determined with interval-censored survival models. Fitness costs of TDRM were estimated in the genetic background in which they occurred using a previously published and validated machine-learning algorithm (based on in vitro replicative capacities) and were included in the survival models as explanatory variables. In 857 sequential samples from 168 treatment-naïve patients, 17 TDRM were analyzed. RR varied substantially and ranged from 174.0 per 100 person-years (CI = [51.4, 588.8]) for 184V to 2.7 per 100 person-years (CI = [0.7, 10.9]) for 215D. RR increased significantly with fitness cost (increase by 1.6 [1.3, 2.0] per standard deviation of fitness costs). When subdividing fitness costs into the average fitness cost of a given mutation and the deviation from the average fitness cost of a mutation in a given genetic background, we found that both components were significantly associated with reversion-rates. Our results show that the substantial variations of TDRM persistence in the absence of drugs are associated with fitness-cost differences both among mutations and among different genetic backgrounds for the same mutation.
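For readers unfamiliar with the unit, a rate "per 100 person-years" can be sketched as a crude event rate. This is a simplification and not the study's method, which relied on interval-censored survival models; the confidence interval below uses the common log-normal approximation for a Poisson rate, and the function names are hypothetical.

```python
import math


def rate_per_100_person_years(events, person_years):
    """Crude rate: number of events per 100 person-years of follow-up."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return 100.0 * events / person_years


def approx_rate_ci(events, person_years, z=1.96):
    """Approximate confidence interval for the crude rate.

    Assumes events are Poisson distributed and uses the standard error of
    the log rate, 1 / sqrt(events); z = 1.96 gives roughly a 95% interval.
    """
    if events <= 0:
        raise ValueError("at least one event is needed for this approximation")
    rate = rate_per_100_person_years(events, person_years)
    factor = math.exp(z / math.sqrt(events))
    return rate / factor, rate * factor
```

Because the interval is symmetric on the log scale, it is asymmetric around the rate itself, which matches the shape of the intervals reported in the abstract.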
Abstract:
Approximate models (proxies) can be employed to reduce the computational costs of estimating uncertainty. The price to pay is that the approximations introduced by the proxy model can lead to a biased estimation. To avoid this problem and ensure a reliable uncertainty quantification, we propose to combine functional data analysis and machine learning to build error models that allow us to obtain an accurate prediction of the exact response without solving the exact model for all realizations. We build the relationship between proxy and exact model on a learning set of geostatistical realizations for which both exact and approximate solvers are run. Functional principal components analysis (FPCA) is used to investigate the variability in the two sets of curves and reduce the dimensionality of the problem while maximizing the retained information. Once obtained, the error model can be used to predict the exact response of any realization on the basis of the sole proxy response. This methodology is purpose-oriented as the error model is constructed directly for the quantity of interest, rather than for the state of the system. Also, the dimensionality reduction performed by FPCA allows a diagnostic of the quality of the error model to assess the informativeness of the learning set and the fidelity of the proxy to the exact model. The possibility of obtaining a prediction of the exact response for any newly generated realization suggests that the methodology can be effectively used beyond the context of uncertainty quantification, in particular for Bayesian inference and optimization.
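The idea of learning an error model on a small learning set and then predicting the exact response from the proxy alone can be sketched in a heavily reduced form. The assumptions are substantial: responses are single numbers rather than curves, the FPCA step is omitted entirely, and the error model is a plain least-squares line. The names `fit_error_model` and `predict_exact` are hypothetical.

```python
def fit_error_model(proxy_responses, exact_responses):
    """Fit exact ~ slope * proxy + intercept on the learning set.

    The learning set is the subset of realizations for which both the proxy
    and the exact solver were run. Scalar responses and a linear map are
    simplifying assumptions for illustration.
    """
    n = len(proxy_responses)
    if n != len(exact_responses) or n < 2:
        raise ValueError("need two or more paired proxy/exact responses")
    mean_p = sum(proxy_responses) / n
    mean_e = sum(exact_responses) / n
    spp = sum((p - mean_p) ** 2 for p in proxy_responses)
    if spp == 0.0:
        raise ValueError("proxy responses must not all be equal")
    spe = sum(
        (p - mean_p) * (e - mean_e)
        for p, e in zip(proxy_responses, exact_responses)
    )
    slope = spe / spp
    return slope, mean_e - slope * mean_p


def predict_exact(model, proxy_response):
    """Predict the exact response of a new realization from its proxy response."""
    slope, intercept = model
    return slope * proxy_response + intercept
```

Once fitted, the model is applied to every remaining realization, so the exact solver runs only on the learning set.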
Abstract:
In this thesis, the author approaches the problem of automated text classification, which is one of the basic tasks in building an Intelligent Internet Search Agent. The work discusses various approaches to solving sub-problems of automated text classification, such as feature extraction and machine learning on text sources. The author also describes her own multiword approach to feature extraction and presents the results of testing this approach using two classifiers: one based on linear discriminant analysis, and one combining unsupervised learning for etalon extraction with supervised learning using the common backpropagation algorithm for a multilayer perceptron.
Abstract:
Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken is one using full parsing (syntactic analysis of the entire structure of sentences) and machine learning, aiming to develop reliable methods that can further be generalized to apply also to other domains. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time.
To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unification of diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships.
Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.