965 resultados para Semi-supervised classification


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Gaussian Processes (GPs) are promising Bayesian methods for classification and regression problems. They have also been used for semi-supervised learning tasks. In this paper, we propose a new algorithm for solving semi-supervised binary classification problem using sparse GP regression (GPR) models. It is closely related to semi-supervised learning based on support vector regression (SVR) and maximum margin clustering. The proposed algorithm is simple and easy to implement. It gives a sparse solution directly unlike the SVR based algorithm. Also, the hyperparameters are estimated easily without resorting to expensive cross-validation technique. Use of sparse GPR model helps in making the proposed algorithm scalable. Preliminary results on synthetic and real-world data sets demonstrate the efficacy of the new algorithm.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Graph plays an important role in graph-based semi-supervised classification. However, due to noisy and redundant features in high-dimensional data, it is not a trivial job to construct a well-structured graph on high-dimensional samples. In this paper, we take advantage of sparse representation in random subspaces for graph construction and propose a method called Semi-Supervised Classification based on Subspace Sparse Representation, SSC-SSR in short. SSC-SSR first generates several random subspaces from the original space and then seeks sparse representation coefficients in these subspaces. Next, it trains semi-supervised linear classifiers on graphs that are constructed by these coefficients. Finally, it combines these classifiers into an ensemble classifier by minimizing a linear regression problem. Unlike traditional graph-based semi-supervised classification methods, the graphs of SSC-SSR are data-driven instead of man-made in advance. Empirical study on face images classification tasks demonstrates that SSC-SSR not only has superior recognition performance with respect to competitive methods, but also has wide ranges of effective input parameters.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Semi-supervised learning is a classification paradigm in which just a few labeled instances are available for the training process. To overcome this small amount of initial label information, the information provided by the unlabeled instances is also considered. In this paper, we propose a nature-inspired semi-supervised learning technique based on attraction forces. Instances are represented as points in a k-dimensional space, and the movement of data points is modeled as a dynamical system. As the system runs, data items with the same label cooperate with each other, and data items with different labels compete among them to attract unlabeled points by applying a specific force function. In this way, all unlabeled data items can be classified when the system reaches its stable state. Stability analysis for the proposed dynamical system is performed and some heuristics are proposed for parameter setting. Simulation results show that the proposed technique achieves good classification results on artificial data sets and is comparable to well-known semi-supervised techniques using benchmark data sets.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Semi-supervised learning is one of the important topics in machine learning, concerning with pattern classification where only a small subset of data is labeled. In this paper, a new network-based (or graph-based) semi-supervised classification model is proposed. It employs a combined random-greedy walk of particles, with competition and cooperation mechanisms, to propagate class labels to the whole network. Due to the competition mechanism, the proposed model has a local label spreading fashion, i.e., each particle only visits a portion of nodes potentially belonging to it, while it is not allowed to visit those nodes definitely occupied by particles of other classes. In this way, a "divide-and-conquer" effect is naturally embedded in the model. As a result, the proposed model can achieve a good classification rate while exhibiting low computational complexity order in comparison to other network-based semi-supervised algorithms. Computer simulations carried out for synthetic and real-world data sets provide a numeric quantification of the performance of the method.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In the design of practical web page classification systems one often encounters a situation in which the labeled training set is created by choosing some examples from each class; but, the class proportions in this set are not the same as those in the test distribution to which the classifier will be actually applied. The problem is made worse when the amount of training data is also small. In this paper we explore and adapt binary SVM methods that make use of unlabeled data from the test distribution, viz., Transductive SVMs (TSVMs) and expectation regularization/constraint (ER/EC) methods to deal with this situation. We empirically show that when the labeled training data is small, TSVM designed using the class ratio tuned by minimizing the loss on the labeled set yields the best performance; its performance is good even when the deviation between the class ratios of the labeled training set and the test set is quite large. When the labeled training data is sufficiently large, an unsupervised Gaussian mixture model can be used to get a very good estimate of the class ratio in the test set; also, when this estimate is used, both TSVM and EC/ER give their best possible performance, with TSVM coming out superior. The ideas in the paper can be easily extended to multi-class SVMs and MaxEnt models.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses off-the-shelf machine learning tools and treats them as black-box classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features. Copyright © 2010 ACM.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In recent years, the performance of semi-supervised learning has been theoretically investigated. However, most of this theoretical development has focussed on binary classification problems. In this paper, we take it a step further by extending the work of Castelli and Cover [1] [2] to the multi-class paradigm. Particularly, we consider the key problem in semi-supervised learning of classifying an unseen instance x into one of K different classes, using a training dataset sampled from a mixture density distribution and composed of l labelled records and u unlabelled examples. Even under the assumption of identifiability of the mixture and having infinite unlabelled examples, labelled records are needed to determine the K decision regions. Therefore, in this paper, we first investigate the minimum number of labelled examples needed to accomplish that task. Then, we propose an optimal multi-class learning algorithm which is a generalisation of the optimal procedure proposed in the literature for binary problems. Finally, we make use of this generalisation to study the probability of error when the binary class constraint is relaxed.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper presents a new semi-supervised method to effectively improve traffic classification performance when few supervised training data are available. Existing semi supervised methods label a large proportion of testing flows as unknown flows due to limited supervised information, which severely affects the classification performance. To address this problem, we propose to incorporate flow correlation into both training and testing stages. At the training stage, we make use of flow correlation to extend the supervised data set by automatically labeling unlabeled flows according to their correlation to the pre-labeled flows. Consequently, the traffic classifier has better performance due to the extended size and quality of the supervised data sets. At the testing stage, the correlated flows are identified and classified jointly by combining their individual predictions, so as to further boost the classification accuracy. The empirical study on the real-world network traffic shows that the proposed method outperforms the state-of-the-art flow statistical feature based classification methods.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The goal of email classification is to classify user emails into spam and legitimate ones. Many supervised learning algorithms have been invented in this domain to accomplish the task, and these algorithms require a large number of labeled training data. However, data labeling is a labor intensive task and requires in-depth domain knowledge. Thus, only a very small proportion of the data can be labeled in practice. This bottleneck greatly degrades the effectiveness of supervised email classification systems. In order to address this problem, in this work, we first identify some critical issues regarding supervised machine learning-based email classification. Then we propose an effective classification model based on multi-view disagreement-based semi-supervised learning. The motivation behind the attempt of using multi-view and semi-supervised learning is that multi-view can provide richer information for classification, which is often ignored by literature, and semi-supervised learning supplies with the capability of coping with labeled and unlabeled data. In the evaluation, we demonstrate that the multi-view data can improve the email classification than using a single view data, and that the proposed model working with our algorithm can achieve better performance as compared to the existing similar algorithms.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper presents a new semi-supervised method to effectively improve traffic classification performance when very few supervised training data are available. Existing semisupervised methods label a large proportion of testing flows as unknown flows due to limited supervised information, which severely affects the classification performance. To address this problem, we propose to incorporate flow correlation into both training and testing stages. At the training stage, we make use of flow correlation to extend the supervised data set by automatically labelling unlabelled flows according to their correlation to the pre-labelled flows. Consequently, a traffic classifier achieves excellent performance because of the enhanced training data set. At the testing stage, the correlated flows are identified and classified jointly by combining their individual predictions, so as to further boost the classification accuracy. The empirical study on the real-world network traffic shows that the proposed method significantly outperforms the state-of-the-art flow statistical feature based classification methods. Copyright © 2012 Inderscience Enterprises Ltd.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Due to its wide applicability, semi-supervised learning is an attractive method for using unlabeled data in classification. In this work, we present a semi-supervised support vector classifier that is designed using quasi-Newton method for nonsmooth convex functions. The proposed algorithm is suitable in dealing with very large number of examples and features. Numerical experiments on various benchmark datasets showed that the proposed algorithm is fast and gives improved generalization performance over the existing methods. Further, a non-linear semi-supervised SVM has been proposed based on a multiple label switching scheme. This non-linear semi-supervised SVM is found to converge faster and it is found to improve generalization performance on several benchmark datasets. (C) 2010 Elsevier Ltd. All rights reserved.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Network traffic classification is an essential component for network management and security systems. To address the limitations of traditional port-based and payload-based methods, recent studies have been focusing on alternative approaches. One promising direction is applying machine learning techniques to classify traffic flows based on packet and flow level statistics. In particular, previous papers have illustrated that clustering can achieve high accuracy and discover unknown application classes. In this work, we present a novel semi-supervised learning method using constrained clustering algorithms. The motivation is that in network domain a lot of background information is available in addition to the data instances themselves. For example, we might know that flow ƒ1 and ƒ2 are using the same application protocol because they are visiting the same host address at the same port simultaneously. In this case, ƒ1 and ƒ2 shall be grouped into the same cluster ideally. Therefore, we describe these correlations in the form of pair-wise must-link constraints and incorporate them in the process of clustering. We have applied three constrained variants of the K-Means algorithm, which perform hard or soft constraint satisfaction and metric learning from constraints. A number of real-world traffic traces have been used to show the availability of constraints and to test the proposed approach. The experimental results indicate that by incorporating constraints in the course of clustering, the overall accuracy and cluster purity can be significantly improved.