920 resultados para Semi-Supervised Learning
Resumo:
We perform a review of Web Mining techniques and we describe a Bootstrap Statistics methodology applied to pattern model classifier optimization and verification for Supervised Learning for Tour-Guide Robot knowledge repository management. It is virtually impossible to test thoroughly Web Page Classifiers and many other Internet Applications with pure empirical data, due to the need for human intervention to generate training sets and test sets. We propose using the computer-based Bootstrap paradigm to design a test environment where they are checked with better reliability.
Resumo:
Hoy en día, con la evolución continua y rápida de las tecnologías de la información y los dispositivos de computación, se recogen y almacenan continuamente grandes volúmenes de datos en distintos dominios y a través de diversas aplicaciones del mundo real. La extracción de conocimiento útil de una cantidad tan enorme de datos no se puede realizar habitualmente de forma manual, y requiere el uso de técnicas adecuadas de aprendizaje automático y de minería de datos. La clasificación es una de las técnicas más importantes que ha sido aplicada con éxito a varias áreas. En general, la clasificación se compone de dos pasos principales: en primer lugar, aprender un modelo de clasificación o clasificador a partir de un conjunto de datos de entrenamiento, y en segundo lugar, clasificar las nuevas instancias de datos utilizando el clasificador aprendido. La clasificación es supervisada cuando todas las etiquetas están presentes en los datos de entrenamiento (es decir, datos completamente etiquetados), semi-supervisada cuando sólo algunas etiquetas son conocidas (es decir, datos parcialmente etiquetados), y no supervisada cuando todas las etiquetas están ausentes en los datos de entrenamiento (es decir, datos no etiquetados). Además, aparte de esta taxonomía, el problema de clasificación se puede categorizar en unidimensional o multidimensional en función del número de variables clase, una o más, respectivamente; o también puede ser categorizado en estacionario o cambiante con el tiempo en función de las características de los datos y de la tasa de cambio subyacente. A lo largo de esta tesis, tratamos el problema de clasificación desde tres perspectivas diferentes, a saber, clasificación supervisada multidimensional estacionaria, clasificación semisupervisada unidimensional cambiante con el tiempo, y clasificación supervisada multidimensional cambiante con el tiempo. Para llevar a cabo esta tarea, hemos usado básicamente los clasificadores Bayesianos como modelos. La primera contribución, dirigiéndose al problema de clasificación supervisada multidimensional estacionaria, se compone de dos nuevos métodos de aprendizaje de clasificadores Bayesianos multidimensionales a partir de datos estacionarios. Los métodos se proponen desde dos puntos de vista diferentes. El primer método, denominado CB-MBC, se basa en una estrategia de envoltura de selección de variables que es voraz y hacia delante, mientras que el segundo, denominado MB-MBC, es una estrategia de filtrado de variables con una aproximación basada en restricciones y en el manto de Markov. Ambos métodos han sido aplicados a dos problemas reales importantes, a saber, la predicción de los inhibidores de la transcriptasa inversa y de la proteasa para el problema de infección por el virus de la inmunodeficiencia humana tipo 1 (HIV-1), y la predicción del European Quality of Life-5 Dimensions (EQ-5D) a partir de los cuestionarios de la enfermedad de Parkinson con 39 ítems (PDQ-39). El estudio experimental incluye comparaciones de CB-MBC y MB-MBC con los métodos del estado del arte de la clasificación multidimensional, así como con métodos comúnmente utilizados para resolver el problema de predicción de la enfermedad de Parkinson, a saber, la regresión logística multinomial, mínimos cuadrados ordinarios, y mínimas desviaciones absolutas censuradas. En ambas aplicaciones, los resultados han sido prometedores con respecto a la precisión de la clasificación, así como en relación al análisis de las estructuras gráficas que identifican interacciones conocidas y novedosas entre las variables. La segunda contribución, referida al problema de clasificación semi-supervisada unidimensional cambiante con el tiempo, consiste en un método nuevo (CPL-DS) para clasificar flujos de datos parcialmente etiquetados. Los flujos de datos difieren de los conjuntos de datos estacionarios en su proceso de generación muy rápido y en su aspecto de cambio de concepto. Es decir, los conceptos aprendidos y/o la distribución subyacente están probablemente cambiando y evolucionando en el tiempo, lo que hace que el modelo de clasificación actual sea obsoleto y deba ser actualizado. CPL-DS utiliza la divergencia de Kullback-Leibler y el método de bootstrapping para cuantificar y detectar tres tipos posibles de cambio: en las predictoras, en la a posteriori de la clase o en ambas. Después, si se detecta cualquier cambio, un nuevo modelo de clasificación se aprende usando el algoritmo EM; si no, el modelo de clasificación actual se mantiene sin modificaciones. CPL-DS es general, ya que puede ser aplicado a varios modelos de clasificación. Usando dos modelos diferentes, el clasificador naive Bayes y la regresión logística, CPL-DS se ha probado con flujos de datos sintéticos y también se ha aplicado al problema real de la detección de código malware, en el cual los nuevos ficheros recibidos deben ser continuamente clasificados en malware o goodware. Los resultados experimentales muestran que nuestro método es efectivo para la detección de diferentes tipos de cambio a partir de los flujos de datos parcialmente etiquetados y también tiene una buena precisión de la clasificación. Finalmente, la tercera contribución, sobre el problema de clasificación supervisada multidimensional cambiante con el tiempo, consiste en dos métodos adaptativos, a saber, Locally Adpative-MB-MBC (LA-MB-MBC) y Globally Adpative-MB-MBC (GA-MB-MBC). Ambos métodos monitorizan el cambio de concepto a lo largo del tiempo utilizando la log-verosimilitud media como métrica y el test de Page-Hinkley. Luego, si se detecta un cambio de concepto, LA-MB-MBC adapta el actual clasificador Bayesiano multidimensional localmente alrededor de cada nodo cambiado, mientras que GA-MB-MBC aprende un nuevo clasificador Bayesiano multidimensional. El estudio experimental realizado usando flujos de datos sintéticos multidimensionales indica los méritos de los métodos adaptativos propuestos. ABSTRACT Nowadays, with the ongoing and rapid evolution of information technology and computing devices, large volumes of data are continuously collected and stored in different domains and through various real-world applications. Extracting useful knowledge from such a huge amount of data usually cannot be performed manually, and requires the use of adequate machine learning and data mining techniques. Classification is one of the most important techniques that has been successfully applied to several areas. Roughly speaking, classification consists of two main steps: first, learn a classification model or classifier from an available training data, and secondly, classify the new incoming unseen data instances using the learned classifier. Classification is supervised when the whole class values are present in the training data (i.e., fully labeled data), semi-supervised when only some class values are known (i.e., partially labeled data), and unsupervised when the whole class values are missing in the training data (i.e., unlabeled data). In addition, besides this taxonomy, the classification problem can be categorized into uni-dimensional or multi-dimensional depending on the number of class variables, one or more, respectively; or can be also categorized into stationary or streaming depending on the characteristics of the data and the rate of change underlying it. Through this thesis, we deal with the classification problem under three different settings, namely, supervised multi-dimensional stationary classification, semi-supervised unidimensional streaming classification, and supervised multi-dimensional streaming classification. To accomplish this task, we basically used Bayesian network classifiers as models. The first contribution, addressing the supervised multi-dimensional stationary classification problem, consists of two new methods for learning multi-dimensional Bayesian network classifiers from stationary data. They are proposed from two different points of view. The first method, named CB-MBC, is based on a wrapper greedy forward selection approach, while the second one, named MB-MBC, is a filter constraint-based approach based on Markov blankets. Both methods are applied to two important real-world problems, namely, the prediction of the human immunodeficiency virus type 1 (HIV-1) reverse transcriptase and protease inhibitors, and the prediction of the European Quality of Life-5 Dimensions (EQ-5D) from 39-item Parkinson’s Disease Questionnaire (PDQ-39). The experimental study includes comparisons of CB-MBC and MB-MBC against state-of-the-art multi-dimensional classification methods, as well as against commonly used methods for solving the Parkinson’s disease prediction problem, namely, multinomial logistic regression, ordinary least squares, and censored least absolute deviations. For both considered case studies, results are promising in terms of classification accuracy as well as regarding the analysis of the learned MBC graphical structures identifying known and novel interactions among variables. The second contribution, addressing the semi-supervised uni-dimensional streaming classification problem, consists of a novel method (CPL-DS) for classifying partially labeled data streams. Data streams differ from the stationary data sets by their highly rapid generation process and their concept-drifting aspect. That is, the learned concepts and/or the underlying distribution are likely changing and evolving over time, which makes the current classification model out-of-date requiring to be updated. CPL-DS uses the Kullback-Leibler divergence and bootstrapping method to quantify and detect three possible kinds of drift: feature, conditional or dual. Then, if any occurs, a new classification model is learned using the expectation-maximization algorithm; otherwise, the current classification model is kept unchanged. CPL-DS is general as it can be applied to several classification models. Using two different models, namely, naive Bayes classifier and logistic regression, CPL-DS is tested with synthetic data streams and applied to the real-world problem of malware detection, where the new received files should be continuously classified into malware or goodware. Experimental results show that our approach is effective for detecting different kinds of drift from partially labeled data streams, as well as having a good classification performance. Finally, the third contribution, addressing the supervised multi-dimensional streaming classification problem, consists of two adaptive methods, namely, Locally Adaptive-MB-MBC (LA-MB-MBC) and Globally Adaptive-MB-MBC (GA-MB-MBC). Both methods monitor the concept drift over time using the average log-likelihood score and the Page-Hinkley test. Then, if a drift is detected, LA-MB-MBC adapts the current multi-dimensional Bayesian network classifier locally around each changed node, whereas GA-MB-MBC learns a new multi-dimensional Bayesian network classifier from scratch. Experimental study carried out using synthetic multi-dimensional data streams shows the merits of both proposed adaptive methods.
Resumo:
BACKGROUND: Clinical Trials (CTs) are essential for bridging the gap between experimental research on new drugs and their clinical application. Just like CTs for traditional drugs and biologics have helped accelerate the translation of biomedical findings into medical practice, CTs for nanodrugs and nanodevices could advance novel nanomaterials as agents for diagnosis and therapy. Although there is publicly available information about nanomedicine-related CTs, the online archiving of this information is carried out without adhering to criteria that discriminate between studies involving nanomaterials or nanotechnology-based processes (nano), and CTs that do not involve nanotechnology (non-nano). Finding out whether nanodrugs and nanodevices were involved in a study from CT summaries alone is a challenging task. At the time of writing, CTs archived in the well-known online registry ClinicalTrials.gov are not easily told apart as to whether they are nano or non-nano CTs-even when performed by domain experts, due to the lack of both a common definition for nanotechnology and of standards for reporting nanomedical experiments and results. METHODS: We propose a supervised learning approach for classifying CT summaries from ClinicalTrials.gov according to whether they fall into the nano or the non-nano categories. Our method involves several stages: i) extraction and manual annotation of CTs as nano vs. non-nano, ii) pre-processing and automatic classification, and iii) performance evaluation using several state-of-the-art classifiers under different transformations of the original dataset. RESULTS AND CONCLUSIONS: The performance of the best automated classifier closely matches that of experts (AUC over 0.95), suggesting that it is feasible to automatically detect the presence of nanotechnology products in CT summaries with a high degree of accuracy. This can significantly speed up the process of finding whether reports on ClinicalTrials.gov might be relevant to a particular nanoparticle or nanodevice, which is essential to discover any precedents for nanotoxicity events or advantages for targeted drug therapy.
Resumo:
Multi-dimensional classification (MDC) is the supervised learning problem where an instance is associated with multiple classes, rather than with a single class, as in traditional classification problems. Since these classes are often strongly correlated, modeling the dependencies between them allows MDC methods to improve their performance – at the expense of an increased computational cost. In this paper we focus on the classifier chains (CC) approach for modeling dependencies, one of the most popular and highest-performing methods for multi-label classification (MLC), a particular case of MDC which involves only binary classes (i.e., labels). The original CC algorithm makes a greedy approximation, and is fast but tends to propagate errors along the chain. Here we present novel Monte Carlo schemes, both for finding a good chain sequence and performing efficient inference. Our algorithms remain tractable for high-dimensional data sets and obtain the best predictive performance across several real data sets.
Resumo:
We present the results of applying automated machine learning techniques to the problem of matching different object catalogues in astrophysics. In this study, we take two partially matched catalogues where one of the two catalogues has a large positional uncertainty. The two catalogues we used here were taken from the H I Parkes All Sky Survey (HIPASS) and SuperCOSMOS optical survey. Previous work had matched 44 per cent (1887 objects) of HIPASS to the SuperCOSMOS catalogue. A supervised learning algorithm was then applied to construct a model of the matched portion of our catalogue. Validation of the model shows that we achieved a good classification performance (99.12 per cent correct). Applying this model to the unmatched portion of the catalogue found 1209 new matches. This increases the catalogue size from 1887 matched objects to 3096. The combination of these procedures yields a catalogue that is 72 per cent matched.
Resumo:
The dynamics of supervised learning in layered neural networks were studied in the regime where the size of the training set is proportional to the number of inputs. The evolution of macroscopic observables, including the two relevant performance measures can be predicted by using the dynamical replica theory. Three approximation schemes aimed at eliminating the need to solve a functional saddle-point equation at each time step have been derived.
Resumo:
Hemispheric differences in the learning and generalization of pattern categories were explored in two experiments involving sixteen patients with unilateral posterior, cerebral lesions in the left (LH) or right (RH) hemisphere. In each experiment participants were first trained to criterion in a supervised learning paradigm to categorize a set of patterns that either consisted of simple geometric forms (Experiment 1) or unfamiliar grey-level images (Experiment 2). They were then tested for their ability to generalize acquired categorical knowledge to contrast-reversed versions of the learning patterns. The results showed that RH lesions impeded category learning of unfamiliar grey-level images more severely than LH lesions, whereas this relationship appeared reversed for categories defined by simple geometric forms. With regard to generalization to contrast reversal, categorization performance of LH and RH patients was unaffected in the case of simple geometric forms. However, generalization to of contrast-reversed grey-level images distinctly deteriorated for patients with LH lesions relative to those with RH lesions, with the latter (but not the former) being consistently unable to identify the pattern manipulation. These findings suggest a differential use of contrast information in the representation of pattern categories in the two hemispheres. Such specialization appears in line with previous distinctions between a predominantly lefthemispheric, abstract-analytical and a righthemispheric, specific-holistic representation of object categories, and their prediction of a mandatory representation of contrast polarity in the RH. Some implications for the well-established dissociation of visual disorders for the recognition of faces and letters are discussed.
Resumo:
Sentiment analysis or opinion mining aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text. This paper proposes a novel probabilistic modeling framework called joint sentiment-topic (JST) model based on latent Dirichlet allocation (LDA), which detects sentiment and topic simultaneously from text. A reparameterized version of the JST model called Reverse-JST, obtained by reversing the sequence of sentiment and topic generation in the modeling process, is also studied. Although JST is equivalent to Reverse-JST without a hierarchical prior, extensive experiments show that when sentiment priors are added, JST performs consistently better than Reverse-JST. Besides, unlike supervised approaches to sentiment classification which often fail to produce satisfactory performance when shifting to other domains, the weakly supervised nature of JST makes it highly portable to other domains. This is verified by the experimental results on data sets from five different domains where the JST model even outperforms existing semi-supervised approaches in some of the data sets despite using no labeled documents. Moreover, the topics and topic sentiment detected by JST are indeed coherent and informative. We hypothesize that the JST model can readily meet the demand of large-scale sentiment analysis from the web in an open-ended fashion.
Resumo:
This dissertation establishes a novel system for human face learning and recognition based on incremental multilinear Principal Component Analysis (PCA). Most of the existing face recognition systems need training data during the learning process. The system as proposed in this dissertation utilizes an unsupervised or weakly supervised learning approach, in which the learning phase requires a minimal amount of training data. It also overcomes the inability of traditional systems to adapt to the testing phase as the decision process for the newly acquired images continues to rely on that same old training data set. Consequently when a new training set is to be used, the traditional approach will require that the entire eigensystem will have to be generated again. However, as a means to speed up this computational process, the proposed method uses the eigensystem generated from the old training set together with the new images to generate more effectively the new eigensystem in a so-called incremental learning process. In the empirical evaluation phase, there are two key factors that are essential in evaluating the performance of the proposed method: (1) recognition accuracy and (2) computational complexity. In order to establish the most suitable algorithm for this research, a comparative analysis of the best performing methods has been carried out first. The results of the comparative analysis advocated for the initial utilization of the multilinear PCA in our research. As for the consideration of the issue of computational complexity for the subspace update procedure, a novel incremental algorithm, which combines the traditional sequential Karhunen-Loeve (SKL) algorithm with the newly developed incremental modified fast PCA algorithm, was established. In order to utilize the multilinear PCA in the incremental process, a new unfolding method was developed to affix the newly added data at the end of the previous data. The results of the incremental process based on these two methods were obtained to bear out these new theoretical improvements. Some object tracking results using video images are also provided as another challenging task to prove the soundness of this incremental multilinear learning method.
Resumo:
We present and evaluate a novel supervised recurrent neural network architecture, the SARASOM, based on the associative self-organizing map. The performance of the SARASOM is evaluated and compared with the Elman network as well as with a hidden Markov model (HMM) in a number of prediction tasks using sequences of letters, including some experiments with a reduced lexicon of 15 words. The results were very encouraging with the SARASOM learning better and performing with better accuracy than both the Elman network and the HMM.
Resumo:
Dopo lo sviluppo dei primi casi di Covid-19 in Cina nell’autunno del 2019, ad inizio 2020 l’intero pianeta è precipitato in una pandemia globale che ha stravolto le nostre vite con conseguenze che non si vivevano dall’influenza spagnola. La grandissima quantità di paper scientifici in continua pubblicazione sul coronavirus e virus ad esso affini ha portato alla creazione di un unico dataset dinamico chiamato CORD19 e distribuito gratuitamente. Poter reperire informazioni utili in questa mole di dati ha ulteriormente acceso i riflettori sugli information retrieval systems, capaci di recuperare in maniera rapida ed efficace informazioni preziose rispetto a una domanda dell'utente detta query. Di particolare rilievo è stata la TREC-COVID Challenge, competizione per lo sviluppo di un sistema di IR addestrato e testato sul dataset CORD19. Il problema principale è dato dal fatto che la grande mole di documenti è totalmente non etichettata e risulta dunque impossibile addestrare modelli di reti neurali direttamente su di essi. Per aggirare il problema abbiamo messo a punto nuove soluzioni self-supervised, a cui abbiamo applicato lo stato dell'arte del deep metric learning e dell'NLP. Il deep metric learning, che sta avendo un enorme successo soprattuto nella computer vision, addestra il modello ad "avvicinare" tra loro immagini simili e "allontanare" immagini differenti. Dato che sia le immagini che il testo vengono rappresentati attraverso vettori di numeri reali (embeddings) si possano utilizzare le stesse tecniche per "avvicinare" tra loro elementi testuali pertinenti (e.g. una query e un paragrafo) e "allontanare" elementi non pertinenti. Abbiamo dunque addestrato un modello SciBERT con varie loss, che ad oggi rappresentano lo stato dell'arte del deep metric learning, in maniera completamente self-supervised direttamente e unicamente sul dataset CORD19, valutandolo poi sul set formale TREC-COVID attraverso un sistema di IR e ottenendo risultati interessanti.
Resumo:
This thesis focuses on automating the time-consuming task of manually counting activated neurons in fluorescent microscopy images, which is used to study the mechanisms underlying torpor. The traditional method of manual annotation can introduce bias and delay the outcome of experiments, so the author investigates a deep-learning-based procedure to automatize this task. The author explores two of the main convolutional-neural-network (CNNs) state-of-the-art architectures: UNet and ResUnet family model, and uses a counting-by-segmentation strategy to provide a justification of the objects considered during the counting process. The author also explores a weakly-supervised learning strategy that exploits only dot annotations. The author quantifies the advantages in terms of data reduction and counting performance boost obtainable with a transfer-learning approach and, specifically, a fine-tuning procedure. The author released the dataset used for the supervised use case and all the pre-training models, and designed a web application to share both the counting process pipeline developed in this work and the models pre-trained on the dataset analyzed in this work.
Resumo:
Depth estimation from images has long been regarded as a preferable alternative compared to expensive and intrusive active sensors, such as LiDAR and ToF. The topic has attracted the attention of an increasingly wide audience thanks to the great amount of application domains, such as autonomous driving, robotic navigation and 3D reconstruction. Among the various techniques employed for depth estimation, stereo matching is one of the most widespread, owing to its robustness, speed and simplicity in setup. Recent developments has been aided by the abundance of annotated stereo images, which granted to deep learning the opportunity to thrive in a research area where deep networks can reach state-of-the-art sub-pixel precision in most cases. Despite the recent findings, stereo matching still begets many open challenges, two among them being finding pixel correspondences in presence of objects that exhibits a non-Lambertian behaviour and processing high-resolution images. Recently, a novel dataset named Booster, which contains high-resolution stereo pairs featuring a large collection of labeled non-Lambertian objects, has been released. The work shown that training state-of-the-art deep neural network on such data improves the generalization capabilities of these networks also in presence of non-Lambertian surfaces. Regardless being a further step to tackle the aforementioned challenge, Booster includes a rather small number of annotated images, and thus cannot satisfy the intensive training requirements of deep learning. This thesis work aims to investigate novel view synthesis techniques to augment the Booster dataset, with ultimate goal of improving stereo matching reliability in presence of high-resolution images that displays non-Lambertian surfaces.
Resumo:
This work presents a method for predicting resource availability in opportunistic grids by means of use pattern analysis (UPA), a technique based on non-supervised learning methods. This prediction method is based on the assumption of the existence of several classes of computational resource use patterns, which can be used to predict the resource availability. Trace-driven simulations validate this basic assumptions, which also provide the parameter settings for the accurate learning of resource use patterns. Experiments made with an implementation of the UPA method show the feasibility of its use in the scheduling of grid tasks with very little overhead. The experiments also demonstrate the method`s superiority over other predictive and non-predictive methods. An adaptative prediction method is suggested to deal with the lack of training data at initialization. Further adaptative behaviour is motivated by experiments which show that, in some special environments, reliable resource use patterns may not always be detected. Copyright (C) 2009 John Wiley & Sons, Ltd.
Resumo:
Host responses following exposure to Mycobacterium tuberculosis (TB) are complex and can significantly affect clinical outcome. These responses, which are largely mediated by complex immune mechanisms involving peripheral blood cells (PBCs) such as T-lymphocytes, NK cells and monocyte-derived macrophages, have not been fully characterized. We hypothesize that different clinical outcome following TB exposure will be uniquely reflected in host gene expression profiles, and expression profiling of PBCs can be used to discriminate between different TB infectious outcomes. In this study, microarray analysis was performed on PBCs from three TB groups (BCG-vaccinated, latent TB infection, and active TB infection) and a control healthy group. Supervised learning algorithms were used to identify signature genomic responses that differentiate among group samples. Gene Set Enrichment Analysis was used to determine sets of genes that were co-regulated. Multivariate permutation analysis (p < 0.01) gave 645 genes differentially expressed among the four groups, with both distinct and common patterns of gene expression observed for each group. A 127-probeset, representing 77 known genes, capable of accurately classifying samples into their respective groups was identified. In addition, 13 insulin-sensitive genes were found to be differentially regulated in all three TB infected groups, underscoring the functional association between insulin signaling pathway and TB infection. Published by Elsevier Ltd.