104 results for Word Classification


Relevance:

20.00%

Publisher:

Abstract:

Myopathies are muscular diseases in which muscle fibers degenerate due to factors such as nutrient deficiency, infection, and mutations in myofibrillar proteins. The objective of this study is to identify biomarkers that distinguish various muscle mutants in Drosophila (fruit fly) using Raman spectroscopy. A Principal Components based Linear Discriminant Analysis (PC-LDA) classification model yielding >95% accuracy was developed to classify the different mutants, which represent various myopathies, according to their physiopathology.

Relevance:

20.00%

Publisher:

Abstract:

We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of the classification calibration dimension of a multiclass loss matrix, which measures the smallest 'size' of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing 'low-dimensional' convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate that the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems.

Relevance:

20.00%

Publisher:

Abstract:

We consider the problem of developing privacy-preserving machine learning algorithms in a distributed multiparty setting. Here different parties own different parts of a data set, and the goal is to learn a classifier from the entire data set without any party revealing any information about the individual data points it owns. Pathak et al. [7] recently proposed a solution to this problem in which each party learns a local classifier from its own data, and a third party then aggregates these classifiers in a privacy-preserving manner using a cryptographic scheme. The generalization performance of their algorithm is sensitive to the number of parties and the relative fractions of data owned by the different parties. In this paper, we describe a new differentially private algorithm for the multiparty setting that uses a stochastic gradient descent based procedure to directly optimize the overall multiparty objective rather than combining classifiers learned from optimizing local objectives. The algorithm achieves a slightly weaker form of differential privacy than that of [7], but provides improved generalization guarantees that do not depend on the number of parties or the relative sizes of the individual data sets. Experimental results corroborate our theoretical findings.

Relevance:

20.00%

Publisher:

Abstract:

Transductive SVM (TSVM) is a well-known semi-supervised large-margin learning method for binary text classification. In this paper we extend this method to multi-class and hierarchical classification problems. We point out that determining the labels of unlabeled examples with fixed classifier weights is a linear programming problem, and we devise an efficient technique for solving it. The method is applicable to general loss functions. We demonstrate the value of the new method using large-margin loss on a number of multi-class and hierarchical classification datasets. For maxent loss we show empirically that our method is better than the expectation regularization/constraint and posterior regularization methods, and competitive with the version of the entropy regularization method that uses label constraints.

Relevance:

20.00%

Publisher:

Abstract:

In this paper, we report a breakthrough result on the difficult task of segmentation and recognition of coloured text from the word image dataset of ICDAR robust reading competition challenge 2: reading text in scene images. We split the word image into individual colour, gray and lightness planes and enhance the contrast of each of these planes independently by a power-law transform. The discrimination factor of each plane is computed as the maximum between-class variance used in Otsu thresholding. The plane that has the maximum discrimination factor is selected for segmentation. The trial version of Omnipage OCR is then used on the binarized words for recognition. Our recognition results on the ICDAR 2011 and ICDAR 2003 word datasets are compared with those reported in the literature. As a baseline, the images binarized by simple global and local thresholding techniques were also recognized. The word recognition rate obtained by our non-linear enhancement and plane selection method is 72.8% and 66.2% for the ICDAR 2011 and 2003 word datasets, respectively. We have created ground truth for each image at the pixel level to benchmark these datasets using a toolkit developed by us. The recognition rate of the benchmarked images is 86.7% and 83.9% for the ICDAR 2011 and 2003 datasets, respectively.
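The plane selection step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `gamma` parameter and the flat-list representation of an 8-bit plane are assumptions made for illustration, and the abstract does not say which exponent is used in the power-law transform.

```python
def power_law(plane, gamma):
    """Power-law (gamma) contrast transform on 8-bit intensities.

    `gamma` is a hypothetical parameter; the abstract does not give its value.
    """
    return [round(255 * (v / 255) ** gamma) for v in plane]


def otsu_discrimination(plane):
    """Return (threshold, maximum between-class variance) for 8-bit values."""
    hist = [0] * 256
    for v in plane:
        hist[v] += 1
    total = len(plane)
    sum_all = sum(i * h for i, h in enumerate(hist))

    w0 = 0          # number of pixels at or below the candidate threshold
    sum0 = 0        # intensity sum of those pixels
    best_t, best_var = 0, 0.0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var = (w0 / total) * (w1 / total) * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t, best_var


def select_plane(planes):
    """Pick the plane name whose Otsu between-class variance is largest."""
    return max(planes, key=lambda name: otsu_discrimination(planes[name])[1])


planes = {
    "gray": [100] * 4 + [110] * 4,       # low contrast
    "lightness": [10] * 4 + [240] * 4,   # high contrast
}
chosen = select_plane(planes)
```

In this toy input the `lightness` plane separates text from background far more strongly than the `gray` plane, so it is the one chosen for binarization.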

Relevance:

20.00%

Publisher:

Abstract:

In this paper, we propose a simple and effective approach to classifying H.264 compressed videos by capturing orientation information from the motion vectors. Our major contribution involves computing a Histogram of Oriented Motion Vectors (HOMV) for partially overlapping hierarchical space-time cubes. HOMV is found to be very effective in defining the motion characteristics of these cubes. We then use a Bag of Features (BoF) approach to define the video as a histogram of HOMV keywords, obtained using k-means clustering. The video feature thus computed is found to be very effective in classifying videos. We demonstrate our results with experiments on two large publicly available video databases.
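As a rough illustration of what an orientation histogram over motion vectors looks like, here is a minimal Python sketch for a single space-time cube. The bin count, the magnitude weighting and the normalization are assumptions, not details taken from the abstract, and the function name `homv` is hypothetical.

```python
import math


def homv(motion_vectors, n_bins=8):
    """Orientation histogram of (dx, dy) motion vectors for one cube.

    Assumptions for illustration: each vector votes into its angular bin
    with a weight equal to its magnitude, and the histogram is normalized
    to sum to 1. Zero vectors carry no orientation and are skipped.
    """
    hist = [0.0] * n_bins
    for dx, dy in motion_vectors:
        magnitude = math.hypot(dx, dy)
        if magnitude == 0:
            continue
        angle = math.atan2(dy, dx) % (2 * math.pi)   # map to [0, 2*pi)
        bin_index = int(angle / (2 * math.pi) * n_bins) % n_bins
        hist[bin_index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Four diagonal vectors, one per quadrant, with four bins.
cube_hist = homv([(1, 1), (-1, 1), (-1, -1), (1, -1)], n_bins=4)
```

In a Bag of Features pipeline such per-cube histograms would then be clustered (for example with k-means) and each video described by the counts of its cluster assignments.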

Relevance:

20.00%

Publisher:

Abstract:

Sparse representation based classification (SRC) is one of the most successful methods developed in recent times for face recognition. Optimal projection for sparse representation based classification (OPSRC) [1] provides a dimensionality reduction map that is supposed to give optimum performance for the SRC framework. However, the computational complexity involved in this method is too high. Here, we propose a new projection technique using the data scatter matrix, which is computationally superior to the optimal projection method and achieves classification accuracy comparable to that of OPSRC. The performance of the proposed approach is benchmarked on various publicly available face databases.

Relevance:

20.00%

Publisher:

Abstract:

The maximum entropy approach to classification is very well studied in applied statistics and machine learning, and almost all the methods that exist in the literature are discriminative in nature. In this paper, we introduce a maximum entropy classification method with feature selection for large dimensional data, such as text datasets, that is generative in nature. To tackle the curse of dimensionality of large data sets, we employ the conditional independence assumption (Naive Bayes) and perform feature selection simultaneously, by enforcing a 'maximum discrimination' between estimated class conditional densities. For two-class problems, the proposed method uses Jeffreys (J) divergence to discriminate the class conditional densities. To extend our method to the multi-class case, we propose a completely new approach by considering a multi-distribution divergence: we replace Jeffreys divergence by Jensen-Shannon (JS) divergence to discriminate the conditional densities of multiple classes. In order to reduce computational complexity, we employ a modified Jensen-Shannon divergence (JS(GM)), based on the AM-GM inequality. We show that the resulting divergence is a natural generalization of Jeffreys divergence to the case of multiple distributions. As for theoretical justification, we show that when one intends to select the best features in a generative maximum entropy approach, maximum discrimination using J-divergence emerges naturally in binary classification. The performance of the proposed algorithms and a comparative study are demonstrated on large dimensional text and gene expression datasets, which show that our methods scale up very well with large dimensional datasets.
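For readers unfamiliar with the two divergences named in this abstract, the following Python sketch computes the standard Jeffreys divergence and the standard (weighted) Jensen-Shannon divergence for discrete distributions. It does not reproduce the paper's modified JS(GM) divergence, whose exact form is not given here; the function names and the uniform default weights are assumptions.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jeffreys(p, q):
    """Jeffreys (J) divergence: the symmetrized KL divergence."""
    return kl(p, q) + kl(q, p)


def jensen_shannon(dists, weights=None):
    """Jensen-Shannon divergence among several distributions.

    Computed as the weighted average KL divergence of each distribution
    from the weighted mixture. Uniform weights are an assumed default.
    """
    k = len(dists)
    if weights is None:
        weights = [1.0 / k] * k
    n = len(dists[0])
    mixture = [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(n)]
    return sum(w * kl(d, mixture) for w, d in zip(weights, dists))


p = [0.5, 0.5]
q = [0.9, 0.1]
j_pq = jeffreys(p, q)
js_disjoint = jensen_shannon([[1.0, 0.0], [0.0, 1.0]])
```

Jeffreys divergence compares exactly two distributions, which is why a multi-distribution measure such as Jensen-Shannon is the natural candidate once more than two classes are involved.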

Relevance:

20.00%

Publisher:

Abstract:

Elastic net regularizers have shown much promise in designing sparse classifiers for linear classification. In this work, we propose an alternating optimization approach to solve the dual problems of elastic net regularized linear Support Vector Machines (SVMs) and logistic regression (LR) for classification. One of the sub-problems turns out to be a simple projection. The other sub-problem can be solved using dual coordinate descent methods developed for non-sparse L2-regularized linear SVMs and LR, without altering their iteration complexity and convergence properties. Experiments on very large datasets indicate that the proposed dual coordinate descent-projection (DCD-P) methods are fast and achieve comparable generalization performance after the first pass through the data, with extremely sparse models.

Relevance:

20.00%

Publisher:

Abstract:

Establishing functional relationships between multi-domain protein sequences is a non-trivial task. Traditionally, delineating the functional assignment and relationships of proteins requires domain assignments as a prerequisite. This process is sensitive to alignment quality and domain definitions. In multi-domain proteins, the quality of alignments is poor for multiple reasons. We report the correspondence between the classification of proteins represented as full-length gene products and their functions. Our approach differs fundamentally from traditional methods in not performing the classification at the level of domains. Our method is based on an alignment-free local matching score (LMS) computation at the amino-acid sequence level, followed by hierarchical clustering. As there are no gold standards for full-length protein sequence classification, we resorted to Gene Ontology and domain-architecture based similarity measures to assess our classification. The final clusters obtained using LMS show high functional and domain architectural similarities. Comparison of the current method with alignment-based approaches at both the domain and full-length protein levels showed the superiority of the LMS scores. Using this method we have recreated objective relationships among different protein kinase sub-families and also classified immunoglobulin-containing proteins, for which sub-family definitions do not currently exist. This method can be applied to any set of protein sequences and hence will be instrumental in the analysis of large numbers of full-length protein sequences.

Relevance:

20.00%

Publisher:

Abstract:

Classification of the pharmacologic activity of a chemical compound is an essential step in any drug discovery process. We develop two new atom-centered fragment descriptors (vertex indices): one based solely on topological considerations without discriminating atom or bond types, and another based on topological and electronic features. We also assess their usefulness by devising a method to rank and classify molecules with regard to their antibacterial activity. The classification performance of our method is found to be superior to that of two previous studies on large heterogeneous data sets for hit finding and hit-to-lead studies, even though we use far fewer parameters. It is found that for hit finding studies topological features (simple graph) alone provide significant discriminating power, and that for the hit-to-lead process a small but consistent improvement can be made by additionally including electronic features (colored graph). Our approach is simple, interpretable, and suitable for the design of molecules, as we do not use any physicochemical properties. The singular use of the vertex index as descriptor, novel range-based feature extraction, and rigorous statistical validation are the key elements of this study.

Relevance:

20.00%

Publisher:

Abstract:

The paper describes an algorithm for multi-label classification. Since a pattern can belong to more than one class, the task of classifying a test pattern is a challenging one. We propose a new algorithm for multi-label classification that works for discrete data. We have implemented the algorithm and present the results for different multi-label data sets. The results were compared with those of the multi-label KNN (ML-KNN) algorithm and found to be good.

Relevance:

20.00%

Publisher:

Abstract:

The classification of time series data is an interesting problem in the field of data mining. Even though several algorithms have been proposed for time series classification, we have developed an innovative algorithm that is computationally fast and, in several cases, accurate when compared with the 1NN classifier. Our method calculates the fuzzy membership of each test pattern in each class. We have experimented with 6 benchmark datasets and compared our method with the 1NN classifier.
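The 1NN baseline that this abstract compares against can be sketched in a few lines of Python. This shows only the baseline, not the proposed fuzzy-membership algorithm, whose details are not given here. The use of Euclidean distance on equal-length series is an assumption, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train, query):
    """Return the label of the training series closest to `query`.

    `train` is a list of (series, label) pairs.
    """
    _, label = min(
        ((euclidean(series, query), lab) for series, lab in train),
        key=lambda pair: pair[0],
    )
    return label


train = [
    ([0.0, 0.1, 0.2, 0.3], "rising"),
    ([0.3, 0.2, 0.1, 0.0], "falling"),
    ([0.1, 0.1, 0.1, 0.1], "flat"),
]
prediction = predict_1nn(train, [0.0, 0.2, 0.3, 0.5])
```

Each prediction scans the whole training set, so the cost of 1NN grows with the number of training series, which is the kind of overhead a faster method would aim to avoid.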

Relevance:

20.00%

Publisher:

Abstract:

Head pose classification from surveillance images acquired with distant, large field-of-view cameras is difficult, as faces are captured at low resolution and have a blurred appearance. Domain adaptation approaches are useful for transferring knowledge from the training (source) data to the test (target) data when they have different attributes, minimizing target data labeling efforts in the process. This paper examines the use of transfer learning for efficient multi-view head pose classification with minimal target training data under three challenging situations: (i) where the range of head poses in the source and target images is different, (ii) where source images capture a stationary person while target images capture a moving person whose facial appearance varies under motion due to changing perspective and scale, and (iii) a combination of (i) and (ii). On the whole, the presented methods represent novel transfer learning solutions employed in the context of multi-view head pose classification. We demonstrate through extensive experimental validation that the proposed solutions considerably outperform the state of the art. Finally, this work presents the DPOSE dataset, compiled for benchmarking head pose classification performance with moving persons and to aid behavioral understanding applications.