968 resultados para Unsupervised classification
Resumo:
Outlier detection in high dimensional categorical data has been a problem of much interest due to the extensive use of qualitative features for describing the data across various application areas. Though there exist various established methods for dealing with the dimensionality aspect through feature selection on numerical data, the categorical domain is actively being explored. As outlier detection is generally considered as an unsupervised learning problem due to lack of knowledge about the nature of various types of outliers, the related feature selection task also needs to be handled in a similar manner. This motivates the need to develop an unsupervised feature selection algorithm for efficient detection of outliers in categorical data. Addressing this aspect, we propose a novel feature selection algorithm based on the mutual information measure and the entropy computation. The redundancy among the features is characterized using the mutual information measure for identifying a suitable feature subset with less redundancy. The performance of the proposed algorithm in comparison with the information gain based feature selection shows its effectiveness for outlier detection. The efficacy of the proposed algorithm is demonstrated on various high-dimensional benchmark data sets employing two existing outlier detection methods.
Resumo:
Crop type classification using remote sensing data plays a vital role in planning cultivation activities and for optimal usage of the available fertile land. Thus a reliable and precise classification of agricultural crops can help improve agricultural productivity. Hence in this paper a gene expression programming based fuzzy logic approach for multiclass crop classification using Multispectral satellite image is proposed. The purpose of this work is to utilize the optimization capabilities of GEP for tuning the fuzzy membership functions. The capabilities of GEP as a classifier is also studied. The proposed method is compared to Bayesian and Maximum likelihood classifier in terms of performance evaluation. From the results we can conclude that the proposed method is effective for classification.
Resumo:
Chebyshev-inequality-based convex relaxations of Chance-Constrained Programs (CCPs) are shown to be useful for learning classifiers on massive datasets. In particular, an algorithm that integrates efficient clustering procedures and CCP approaches for computing classifiers on large datasets is proposed. The key idea is to identify high density regions or clusters from individual class conditional densities and then use a CCP formulation to learn a classifier on the clusters. The CCP formulation ensures that most of the data points in a cluster are correctly classified by employing a Chebyshev-inequality-based convex relaxation. This relaxation is heavily dependent on the second-order statistics. However, this formulation and in general such relaxations that depend on the second-order moments are susceptible to moment estimation errors. One of the contributions of the paper is to propose several formulations that are robust to such errors. In particular a generic way of making such formulations robust to moment estimation errors is illustrated using two novel confidence sets. An important contribution is to show that when either of the confidence sets is employed, for the special case of a spherical normal distribution of clusters, the robust variant of the formulation can be posed as a second-order cone program. Empirical results show that the robust formulations achieve accuracies comparable to that with true moments, even when moment estimates are erroneous. Results also illustrate the benefits of employing the proposed methodology for robust classification of large-scale datasets.
Resumo:
Moving shadow detection and removal from the extracted foreground regions of video frames, aim to limit the risk of misconsideration of moving shadows as a part of moving objects. This operation thus enhances the rate of accuracy in detection and classification of moving objects. With a similar reasoning, the present paper proposes an efficient method for the discrimination of moving object and moving shadow regions in a video sequence, with no human intervention. Also, it requires less computational burden and works effectively under dynamic traffic road conditions on highways (with and without marking lines), street ways (with and without marking lines). Further, we have used scale-invariant feature transform-based features for the classification of moving vehicles (with and without shadow regions), which enhances the effectiveness of the proposed method. The potentiality of the method is tested with various data sets collected from different road traffic scenarios, and its superiority is compared with the existing methods. (C) 2013 Elsevier GmbH. All rights reserved.
Resumo:
This paper presents an efficient approach to the modeling and classification of vehicles using the magnetic signature of the vehicle. A database was created using the magnetic signature collected over a wide range of vehicles(cars). A sensor dependent approach called as Magnetic Field Angle Model is proposed for modeling the obtained magnetic signature. Based on the data model, we present a novel method to extract the feature vector from the magnetic signature. In the classification of vehicles, a linear support vector machine configuration is used to classify the vehicles based on the obtained feature vectors.
Resumo:
This paper presents classification, representation and extraction of deformation features in sheet-metal parts. The thickness is constant for these shape features and hence these are also referred to as constant thickness features. The deformation feature is represented as a set of faces with a characteristic arrangement among the faces. Deformation of the base-sheet or forming of material creates Bends and Walls with respect to a base-sheet or a reference plane. These are referred to as Basic Deformation Features (BDFs). Compound deformation features having two or more BDFs are defined as characteristic combinations of Bends and Walls and represented as a graph called Basic Deformation Features Graph (BDFG). The graph, therefore, represents a compound deformation feature uniquely. The characteristic arrangement of the faces and type of bends belonging to the feature decide the type and nature of the deformation feature. Algorithms have been developed to extract and identify deformation features from a CAD model of sheet-metal parts. The proposed algorithm does not require folding and unfolding of the part as intermediate steps to recognize deformation features. Representations of typical features are illustrated and results of extracting these deformation features from typical sheet metal parts are presented and discussed. (C) 2013 Elsevier Ltd. All rights reserved.
Resumo:
Myopathies are muscular diseases in which muscle fibers degenerate due to many factors such as nutrient deficiency, infection and mutations in myofibrillar etc. The objective of this study is to identify the bio-markers to distinguish various muscle mutants in Drosophila (fruit fly) using Raman Spectroscopy. Principal Components based Linear Discriminant Analysis (PC-LDA) classification model yielding >95% accuracy was developed to classify such different mutants representing various myopathies according to their physiopathology.
Resumo:
We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of \emph{classification calibration dimension} of a multiclass loss matrix, which measures the smallest `size' of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al.\ (2010) for analyzing the difficulty of designing `low-dimensional' convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems.
Resumo:
We consider the problem of developing privacy-preserving machine learning algorithms in a dis-tributed multiparty setting. Here different parties own different parts of a data set, and the goal is to learn a classifier from the entire data set with-out any party revealing any information about the individual data points it owns. Pathak et al [7]recently proposed a solution to this problem in which each party learns a local classifier from its own data, and a third party then aggregates these classifiers in a privacy-preserving manner using a cryptographic scheme. The generaliza-tion performance of their algorithm is sensitive to the number of parties and the relative frac-tions of data owned by the different parties. In this paper, we describe a new differentially pri-vate algorithm for the multiparty setting that uses a stochastic gradient descent based procedure to directly optimize the overall multiparty ob-jective rather than combining classifiers learned from optimizing local objectives. The algorithm achieves a slightly weaker form of differential privacy than that of [7], but provides improved generalization guarantees that do not depend on the number of parties or the relative sizes of the individual data sets. Experimental results corrob-orate our theoretical findings.
Resumo:
Transductive SVM (TSVM) is a well known semi-supervised large margin learning method for binary text classification. In this paper we extend this method to multi-class and hierarchical classification problems. We point out that the determination of labels of unlabeled examples with fixed classifier weights is a linear programming problem. We devise an efficient technique for solving it. The method is applicable to general loss functions. We demonstrate the value of the new method using large margin loss on a number of multi-class and hierarchical classification datasets. For maxent loss we show empirically that our method is better than expectation regularization/constraint and posterior regularization methods, and competitive with the version of entropy regularization method which uses label constraints.
Resumo:
Myopathies are muscular diseases in which muscle fibers degenerate due to many factors such as nutrient deficiency, infection and mutations in myofibrillar etc. The objective of this study is to identify the bio-markers to distinguish various muscle mutants in Drosophila (fruit fly) using Raman Spectroscopy. Principal Components based Linear Discriminant Analysis (PC-LDA) classification model yielding >95% accuracy was developed to classify such different mutants representing various myopathies according to their physiopathology.
Resumo:
In this paper, we have proposed a simple and effective approach to classify H.264 compressed videos, by capturing orientation information from the motion vectors. Our major contribution involves computing Histogram of Oriented Motion Vectors (HOMV) for overlapping hierarchical Space-Time cubes. The Space-Time cubes selected are partially overlapped. HOMV is found to be very effective to define the motion characteristics of these cubes. We then use Bag of Features (B OF) approach to define the video as histogram of HOMV keywords, obtained using k-means clustering. The video feature, thus computed, is found to be very effective in classifying videos. We demonstrate our results with experiments on two large publicly available video database.
Resumo:
Sparse representation based classification (SRC) is one of the most successful methods that has been developed in recent times for face recognition. Optimal projection for Sparse representation based classification (OPSRC)1] provides a dimensionality reduction map that is supposed to give optimum performance for SRC framework. However, the computational complexity involved in this method is too high. Here, we propose a new projection technique using the data scatter matrix which is computationally superior to the optimal projection method with comparable classification accuracy with respect OPSRC. The performance of the proposed approach is benchmarked with various publicly available face database.
Resumo:
Maximum entropy approach to classification is very well studied in applied statistics and machine learning and almost all the methods that exists in literature are discriminative in nature. In this paper, we introduce a maximum entropy classification method with feature selection for large dimensional data such as text datasets that is generative in nature. To tackle the curse of dimensionality of large data sets, we employ conditional independence assumption (Naive Bayes) and we perform feature selection simultaneously, by enforcing a `maximum discrimination' between estimated class conditional densities. For two class problems, in the proposed method, we use Jeffreys (J) divergence to discriminate the class conditional densities. To extend our method to the multi-class case, we propose a completely new approach by considering a multi-distribution divergence: we replace Jeffreys divergence by Jensen-Shannon (JS) divergence to discriminate conditional densities of multiple classes. In order to reduce computational complexity, we employ a modified Jensen-Shannon divergence (JS(GM)), based on AM-GM inequality. We show that the resulting divergence is a natural generalization of Jeffreys divergence to a multiple distributions case. As far as the theoretical justifications are concerned we show that when one intends to select the best features in a generative maximum entropy approach, maximum discrimination using J-divergence emerges naturally in binary classification. Performance and comparative study of the proposed algorithms have been demonstrated on large dimensional text and gene expression datasets that show our methods scale up very well with large dimensional datasets.
Resumo:
Elastic Net Regularizers have shown much promise in designing sparse classifiers for linear classification. In this work, we propose an alternating optimization approach to solve the dual problems of elastic net regularized linear classification Support Vector Machines (SVMs) and logistic regression (LR). One of the sub-problems turns out to be a simple projection. The other sub-problem can be solved using dual coordinate descent methods developed for non-sparse L2-regularized linear SVMs and LR, without altering their iteration complexity and convergence properties. Experiments on very large datasets indicate that the proposed dual coordinate descent - projection (DCD-P) methods are fast and achieve comparable generalization performance after the first pass through the data, with extremely sparse models.