834 resultados para Semi-supervised machine learning
Resumo:
Semi-supervised learning is applied to classification problems where only a small portion of the data items is labeled. In these cases, the reliability of the labels is a crucial factor, because mislabeled items may propagate wrong labels to a large portion or even the entire data set. This paper aims to address this problem by presenting a graph-based (network-based) semi-supervised learning method, specifically designed to handle data sets with mislabeled samples. The method uses teams of walking particles, with competitive and cooperative behavior, for label propagation in the network constructed from the input data set. The proposed model is nature-inspired and it incorporates some features to make it robust to a considerable amount of mislabeled data items. Computer simulations show the performance of the method in the presence of different percentage of mislabeled data, in networks of different sizes and average node degree. Importantly, these simulations reveals the existence of the critical points of the mislabeled subset size, below which the network is free of wrong label contamination, but above which the mislabeled samples start to propagate their labels to the rest of the network. Moreover, numerical comparisons have been made among the proposed method and other representative graph-based semi-supervised learning methods using both artificial and real-world data sets. Interestingly, the proposed method has increasing better performance than the others as the percentage of mislabeled samples is getting larger. © 2012 IEEE.
Resumo:
Both Semi-Supervised Leaning and Active Learning are techniques used when unlabeled data is abundant, but the process of labeling them is expensive and/or time consuming. In this paper, those two machine learning techniques are combined into a single nature-inspired method. It features particles walking on a network built from the data set, using a unique random-greedy rule to select neighbors to visit. The particles, which have both competitive and cooperative behavior, are created on the network as the result of label queries. They may be created as the algorithm executes and only nodes affected by the new particles have to be updated. Therefore, it saves execution time compared to traditional active learning frameworks, in which the learning algorithm has to be executed several times. The data items to be queried are select based on information extracted from the nodes and particles temporal dynamics. Two different rules for queries are explored in this paper, one of them is based on querying by uncertainty approaches and the other is based on data and labeled nodes distribution. Each of them may perform better than the other according to some data sets peculiarities. Experimental results on some real-world data sets are provided, and the proposed method outperforms the semi-supervised learning method, from which it is derived, in all of them.
Resumo:
Concept drift, which refers to non stationary learning problems over time, has increasing importance in machine learning and data mining. Many concept drift applications require fast response, which means an algorithm must always be (re)trained with the latest available data. But the process of data labeling is usually expensive and/or time consuming when compared to acquisition of unlabeled data, thus usually only a small fraction of the incoming data may be effectively labeled. Semi-supervised learning methods may help in this scenario, as they use both labeled and unlabeled data in the training process. However, most of them are based on assumptions that the data is static. Therefore, semi-supervised learning with concept drifts is still an open challenging task in machine learning. Recently, a particle competition and cooperation approach has been developed to realize graph-based semi-supervised learning from static data. We have extend that approach to handle data streams and concept drift. The result is a passive algorithm which uses a single classifier approach, naturally adapted to concept changes without any explicit drift detection mechanism. It has built-in mechanisms that provide a natural way of learning from new data, gradually "forgetting" older knowledge as older data items are no longer useful for the classification of newer data items. The proposed algorithm is applied to the KDD Cup 1999 Data of network intrusion, showing its effectiveness.
Resumo:
Semi-supervised learning is one of the important topics in machine learning, concerning with pattern classification where only a small subset of data is labeled. In this paper, a new network-based (or graph-based) semi-supervised classification model is proposed. It employs a combined random-greedy walk of particles, with competition and cooperation mechanisms, to propagate class labels to the whole network. Due to the competition mechanism, the proposed model has a local label spreading fashion, i.e., each particle only visits a portion of nodes potentially belonging to it, while it is not allowed to visit those nodes definitely occupied by particles of other classes. In this way, a "divide-and-conquer" effect is naturally embedded in the model. As a result, the proposed model can achieve a good classification rate while exhibiting low computational complexity order in comparison to other network-based semi-supervised algorithms. Computer simulations carried out for synthetic and real-world data sets provide a numeric quantification of the performance of the method.
Resumo:
Semi-supervised learning techniques have gained increasing attention in the machine learning community, as a result of two main factors: (1) the available data is exponentially increasing; (2) the task of data labeling is cumbersome and expensive, involving human experts in the process. In this paper, we propose a network-based semi-supervised learning method inspired by the modularity greedy algorithm, which was originally applied for unsupervised learning. Changes have been made in the process of modularity maximization in a way to adapt the model to propagate labels throughout the network. Furthermore, a network reduction technique is introduced, as well as an extensive analysis of its impact on the network. Computer simulations are performed for artificial and real-world databases, providing a numerical quantitative basis for the performance of the proposed method.
Resumo:
Thesis (Ph.D.)--University of Washington, 2016-06
Resumo:
Since manually constructing domain-specific sentiment lexicons is extremely time consuming and it may not even be feasible for domains where linguistic expertise is not available. Research on the automatic construction of domain-specific sentiment lexicons has become a hot topic in recent years. The main contribution of this paper is the illustration of a novel semi-supervised learning method which exploits both term-to-term and document-to-term relations hidden in a corpus for the construction of domain specific sentiment lexicons. More specifically, the proposed two-pass pseudo labeling method combines shallow linguistic parsing and corpusbase statistical learning to make domain-specific sentiment extraction scalable with respect to the sheer volume of opinionated documents archived on the Internet these days. Another novelty of the proposed method is that it can utilize the readily available user-contributed labels of opinionated documents (e.g., the user ratings of product reviews) to bootstrap the performance of sentiment lexicon construction. Our experiments show that the proposed method can generate high quality domain-specific sentiment lexicons as directly assessed by human experts. Moreover, the system generated domain-specific sentiment lexicons can improve polarity prediction tasks at the document level by 2:18% when compared to other well-known baseline methods. Our research opens the door to the development of practical and scalable methods for domain-specific sentiment analysis.
Resumo:
In this thesis a manifold learning method is applied to the problem of WLAN positioning and automatic radio map creation. Due to the nature of WLAN signal strength measurements, a signal map created from raw measurements results in non-linear distance relations between measurement points. These signal strength vectors reside in a high-dimensioned coordinate system. With the help of the so called Isomap-algorithm the dimensionality of this map can be reduced, and thus more easily processed. By embedding position-labeled strategic key points, we can automatically adjust the mapping to match the surveyed environment. The environment is thus learned in a semi-supervised way; gathering training points and embedding them in a two-dimensional manifold gives us a rough mapping of the measured environment. After a calibration phase, where the labeled key points in the training data are used to associate coordinates in the manifold representation with geographical locations, we can perform positioning using the adjusted map. This can be achieved through a traditional supervised learning process, which in our case is a simple nearest neighbors matching of a sampled signal strength vector. We deployed this system in two locations in the Kumpula campus in Helsinki, Finland. Results indicate that positioning based on the learned radio map can achieve good accuracy, especially in hallways or other areas in the environment where the WLAN signal is constrained by obstacles such as walls.
Resumo:
Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses off-the-shelf machine learning tools and treats them as black-box classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features. Copyright © 2010 ACM.
Resumo:
T.Boongoen and Q. Shen. Semi-Supervised OWA Aggregation for Link-Based Similarity Evaluation and Alias Detection. Proceedings of the 18th International Conference on Fuzzy Systems (FUZZ-IEEE'09), pp. 288-293, 2009. Sponsorship: EPSRC
Resumo:
In this paper, a new methodology for the prediction of scoliosis curve types from non invasive acquisitions of the back surface of the trunk is proposed. One hundred and fifty-nine scoliosis patients had their back surface acquired in 3D using an optical digitizer. Each surface is then characterized by 45 local measurements of the back surface rotation. Using a semi-supervised algorithm, the classifier is trained with only 32 labeled and 58 unlabeled data. Tested on 69 new samples, the classifier succeeded in classifying correctly 87.0% of the data. After reducing the number of labeled training samples to 12, the behavior of the resulting classifier tends to be similar to the reference case where the classifier is trained only with the maximum number of available labeled data. Moreover, the addition of unlabeled data guided the classifier towards more generalizable boundaries between the classes. Those results provide a proof of feasibility for using a semi-supervised learning algorithm to train a classifier for the prediction of a scoliosis curve type, when only a few training data are labeled. This constitutes a promising clinical finding since it will allow the diagnosis and the follow-up of scoliotic deformities without exposing the patient to X-ray radiations.