133 resultados para online classification
Resumo:
In this paper we study the problem of designing SVM classifiers when the kernel matrix, K, is affected by uncertainty. Specifically K is modeled as a positive affine combination of given positive semi definite kernels, with the coefficients ranging in a norm-bounded uncertainty set. We treat the problem using the Robust Optimization methodology. This reduces the uncertain SVM problem into a deterministic conic quadratic problem which can be solved in principle by a polynomial time Interior Point (IP) algorithm. However, for large-scale classification problems, IP methods become intractable and one has to resort to first-order gradient type methods. The strategy we use here is to reformulate the robust counterpart of the uncertain SVM problem as a saddle point problem and employ a special gradient scheme which works directly on the convex-concave saddle function. The algorithm is a simplified version of a general scheme due to Juditski and Nemirovski (2011). It achieves an O(1/T-2) reduction of the initial error after T iterations. A comprehensive empirical study on both synthetic data and real-world protein structure data sets show that the proposed formulations achieve the desired robustness, and the saddle point based algorithm outperforms the IP method significantly.
Resumo:
In the design of practical web page classification systems one often encounters a situation in which the labeled training set is created by choosing some examples from each class; but, the class proportions in this set are not the same as those in the test distribution to which the classifier will be actually applied. The problem is made worse when the amount of training data is also small. In this paper we explore and adapt binary SVM methods that make use of unlabeled data from the test distribution, viz., Transductive SVMs (TSVMs) and expectation regularization/constraint (ER/EC) methods to deal with this situation. We empirically show that when the labeled training data is small, TSVM designed using the class ratio tuned by minimizing the loss on the labeled set yields the best performance; its performance is good even when the deviation between the class ratios of the labeled training set and the test set is quite large. When the labeled training data is sufficiently large, an unsupervised Gaussian mixture model can be used to get a very good estimate of the class ratio in the test set; also, when this estimate is used, both TSVM and EC/ER give their best possible performance, with TSVM coming out superior. The ideas in the paper can be easily extended to multi-class SVMs and MaxEnt models.
Resumo:
The present approach uses stopwords and the gaps that oc- cur between successive stopwords –formed by contentwords– as features for sentiment classification.
Resumo:
Time series classification deals with the problem of classification of data that is multivariate in nature. This means that one or more of the attributes is in the form of a sequence. The notion of similarity or distance, used in time series data, is significant and affects the accuracy, time, and space complexity of the classification algorithm. There exist numerous similarity measures for time series data, but each of them has its own disadvantages. Instead of relying upon a single similarity measure, our aim is to find the near optimal solution to the classification problem by combining different similarity measures. In this work, we use genetic algorithms to combine the similarity measures so as to get the best performance. The weightage given to different similarity measures evolves over a number of generations so as to get the best combination. We test our approach on a number of benchmark time series datasets and present promising results.
Resumo:
Research in the field of recognizing unlimited vocabulary, online handwritten Indic words is still in its infancy. Most of the focus so far has been in the area of isolated character recognition. In the context of lexicon-free recognition of words, one of the primary issues to be addressed is that of segmentation. As a preliminary attempt, this paper proposes a novel script-independent, lexicon-free method for segmenting online handwritten words to their constituent symbols. Feedback strategies, inspired from neuroscience studies, are proposed for improving the segmentation. The segmentation strategy has been tested on an exhaustive set of 10000 Tamil words collected from a large number of writers. The results show that better segmentation improves the overall recognition performance of the handwriting system.
Resumo:
This paper presents a new hierarchical clustering algorithm for crop stage classification using hyperspectral satellite image. Amongst the multiple benefits and uses of remote sensing, one of the important application is to solve the problem of crop stage classification. Modern commercial imaging satellites, owing to their large volume of satellite imagery, offer greater opportunities for automated image analysis. Hence, we propose a unsupervised algorithm namely Hierarchical Artificial Immune System (HAIS) of two steps: splitting the cluster centers and merging them. The high dimensionality of the data has been reduced with the help of Principal Component Analysis (PCA). The classification results have been compared with K-means and Artificial Immune System algorithms. From the results obtained, we conclude that the proposed hierarchical clustering algorithm is accurate.
Resumo:
Subsurface lithology and seismic site classification of Lucknow urban center located in the central part of the Indo-Gangetic Basin (IGB) are presented based on detailed shallow subsurface investigations and borehole analysis. These are done by carrying out 47 seismic surface wave tests using multichannel analysis of surface waves (MASW) and 23 boreholes drilled up to 30 m with standard penetration test (SPT) N values. Subsurface lithology profiles drawn from the drilled boreholes show low- to medium-compressibility clay and silty to poorly graded sand available till depth of 30 m. In addition, deeper boreholes (depth >150 m) were collected from the Lucknow Jal Nigam (Water Corporation), Government of Uttar Pradesh to understand deeper subsoil stratification. Deeper boreholes in this paper refer to those with depth over 150 m. These reports show the presence of clay mix with sand and Kankar at some locations till a depth of 150 m, followed by layers of sand, clay, and Kankar up to 400 m. Based on the available details, shallow and deeper cross-sections through Lucknow are presented. Shear wave velocity (SWV) and N-SPT values were measured for the study area using MASW and SPT testing. Measured SWV and N-SPT values for the same locations were found to be comparable. These values were used to estimate 30 m average values of N-SPT (N-30) and SWV (V-s(30)) for seismic site classification of the study area as per the National Earthquake Hazards Reduction Program (NEHRP) soil classification system. Based on the NEHRP classification, the entire study area is classified into site class C and D based on V-s(30) and site class D and E based on N-30. The issue of larger amplification during future seismic events is highlighted for a major part of the study area which comes under site class D and E. Also, the mismatch of site classes based on N-30 and V-s(30) raises the question of the suitability of the NEHRP classification system for the study region. Further, 17 sets of SPT and SWV data are used to develop a correlation between N-SPT and SWV. This represents a first attempt of seismic site classification and correlation between N-SPT and SWV in the Indo-Gangetic Basin.
Resumo:
This paper presents an efficient approach to the modeling and classification of vehicles using the magnetic signature of the vehicle. A database was created using the magnetic signature collected over a wide range of vehicles(cars). A vehicle is modeled as an array of magnetic dipoles. The strength of the magnetic dipole and the separation between the magnetic dipoles varies for different vehicles and is dependent on the metallic composition and configuration of the vehicle. Based on the magnetic dipole data model, we present a novel method to extract a feature vector from the magnetic signature. In the classification of vehicles, a linear support vector machine configuration is used to classify the vehicles based on the obtained feature vectors.
Resumo:
Effective conservation and management of natural resources requires up-to-date information of the land cover (LC) types and their dynamics. The LC dynamics are being captured using multi-resolution remote sensing (RS) data with appropriate classification strategies. RS data with important environmental layers (either remotely acquired or derived from ground measurements) would however be more effective in addressing LC dynamics and associated changes. These ancillary layers provide additional information for delineating LC classes' decision boundaries compared to the conventional classification techniques. This communication ascertains the possibility of improved classification accuracy of RS data with ancillary and derived geographical layers such as vegetation index, temperature, digital elevation model (DEM), aspect, slope and texture. This has been implemented in three terrains of varying topography. The study would help in the selection of appropriate ancillary data depending on the terrain for better classified information.
Resumo:
N-gram language models and lexicon-based word-recognition are popular methods in the literature to improve recognition accuracies of online and offline handwritten data. However, there are very few works that deal with application of these techniques on online Tamil handwritten data. In this paper, we explore methods of developing symbol-level language models and a lexicon from a large Tamil text corpus and their application to improving symbol and word recognition accuracies. On a test database of around 2000 words, we find that bigram language models improve symbol (3%) and word recognition (8%) accuracies and while lexicon methods offer much greater improvements (30%) in terms of word recognition, there is a large dependency on choosing the right lexicon. For comparison to lexicon and language model based methods, we have also explored re-evaluation techniques which involve the use of expert classifiers to improve symbol and word recognition accuracies.
Resumo:
This work proposes a boosting-based transfer learning approach for head-pose classification from multiple, low-resolution views. Head-pose classification performance is adversely affected when the source (training) and target (test) data arise from different distributions (due to change in face appearance, lighting, etc). Under such conditions, we employ Xferboost, a Logitboost-based transfer learning framework that integrates knowledge from a few labeled target samples with the source model to effectively minimize misclassifications on the target data. Experiments confirm that the Xferboost framework can improve classification performance by up to 6%, when knowledge is transferred between the CLEAR and FBK four-view headpose datasets.
Resumo:
Multi-view head-pose estimation in low-resolution, dynamic scenes is difficult due to blurred facial appearance and perspective changes as targets move around freely in the environment. Under these conditions, acquiring sufficient training examples to learn the dynamic relationship between position, face appearance and head-pose can be very expensive. Instead, a transfer learning approach is proposed in this work. Upon learning a weighted-distance function from many examples where the target position is fixed, we adapt these weights to the scenario where target positions are varying. The adaptation framework incorporates reliability of the different face regions for pose estimation under positional variation, by transforming the target appearance to a canonical appearance corresponding to a reference scene location. Experimental results confirm effectiveness of the proposed approach, which outperforms state-of-the-art by 9.5% under relevant conditions. To aid further research on this topic, we also make DPOSE- a dynamic, multi-view head-pose dataset with ground-truth publicly available with this paper.
Resumo:
In document community support vector machines and naïve bayes classifier are known for their simplistic yet excellent performance. Normally the feature subsets used by these two approaches complement each other, however a little has been done to combine them. The essence of this paper is a linear classifier, very similar to these two. We propose a novel way of combining these two approaches, which synthesizes best of them into a hybrid model. We evaluate the proposed approach using 20ng dataset, and compare it with its counterparts. The efficacy of our results strongly corroborate the effectiveness of our approach.
Resumo:
Classification of a large document collection involves dealing with a huge feature space where each distinct word is a feature. In such an environment, classification is a costly task both in terms of running time and computing resources. Further it will not guarantee optimal results because it is likely to overfit by considering every feature for classification. In such a context, feature selection is inevitable. This work analyses the feature selection methods, explores the relations among them and attempts to find a minimal subset of features which are discriminative for document classification.
Resumo:
Seismic site classifications are used to represent site effects for estimating hazard parameters (response spectral ordinates) at the soil surface. Seismic site classifications have generally been carried out using average shear wave velocity and/or standard penetration test n-values of top 30-m soil layers, according to the recommendations of the National Earthquake Hazards Reduction Program (NEHRP) or the International Building Code (IBC). The site classification system in the NEHRP and the IBC is based on the studies carried out in the United States where soil layers extend up to several hundred meters before reaching any distinct soil-bedrock interface and may not be directly applicable to other regions, especially in regions having shallow geological deposits. This paper investigates the influence of rock depth on site classes based on the recommendations of the NEHRP and the IBC. For this study, soil sites having a wide range of average shear wave velocities (or standard penetration test n-values) have been collected from different parts of Australia, China, and India. Shear wave velocities of rock layers underneath soil layers have also been collected at depths from a few meters to 180 m. It is shown that a site classification system based on the top 30-m soil layers often represents stiffer site classes for soil sites having shallow rock depths (rock depths less than 25 m from the soil surface). A new site classification system based on average soil thickness up to engineering bedrock has been proposed herein, which is considered more representative for soil sites in shallow bedrock regions. It has been observed that response spectral ordinates, amplification factors, and site periods estimated using one-dimensional shear wave analysis considering the depth of engineering bedrock are different from those obtained considering top 30-m soil layers.