150 resultados para feature selection

em Indian Institute of Science - Bangalore - Índia


Relevância:

100.00% 100.00%

Publicador:

Resumo:

The concept of feature selection in a nonparametric unsupervised learning environment is practically undeveloped because no true measure for the effectiveness of a feature exists in such an environment. The lack of a feature selection phase preceding the clustering process seriously affects the reliability of such learning. New concepts such as significant features, level of significance of features, and immediate neighborhood are introduced which result in meeting implicitly the need for feature slection in the context of clustering techniques.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The concept of feature selection in a nonparametric unsupervised learning environment is practically undeveloped because no true measure for the effectiveness of a feature exists in such an environment. The lack of a feature selection phase preceding the clustering process seriously affects the reliability of such learning. New concepts such as significant features, level of significance of features, and immediate neighborhood are introduced which result in meeting implicitly the need for feature slection in the context of clustering techniques.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper, we develop a game theoretic approach for clustering features in a learning problem. Feature clustering can serve as an important preprocessing step in many problems such as feature selection, dimensionality reduction, etc. In this approach, we view features as rational players of a coalitional game where they form coalitions (or clusters) among themselves in order to maximize their individual payoffs. We show how Nash Stable Partition (NSP), a well known concept in the coalitional game theory, provides a natural way of clustering features. Through this approach, one can obtain some desirable properties of the clusters by choosing appropriate payoff functions. For a small number of features, the NSP based clustering can be found by solving an integer linear program (ILP). However, for large number of features, the ILP based approach does not scale well and hence we propose a hierarchical approach. Interestingly, a key result that we prove on the equivalence between a k-size NSP of a coalitional game and minimum k-cut of an appropriately constructed graph comes in handy for large scale problems. In this paper, we use feature selection problem (in a classification setting) as a running example to illustrate our approach. We conduct experiments to illustrate the efficacy of our approach.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper, we present a methodology for identifying best features from a large feature space. In high dimensional feature space nearest neighbor search is meaningless. In this feature space we see quality and performance issue with nearest neighbor search. Many data mining algorithms use nearest neighbor search. So instead of doing nearest neighbor search using all the features we need to select relevant features. We propose feature selection using Non-negative Matrix Factorization(NMF) and its application to nearest neighbor search. Recent clustering algorithm based on Locally Consistent Concept Factorization(LCCF) shows better quality of document clustering by using local geometrical and discriminating structure of the data. By using our feature selection method we have shown further improvement of performance in the clustering.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Outlier detection in high dimensional categorical data has been a problem of much interest due to the extensive use of qualitative features for describing the data across various application areas. Though there exist various established methods for dealing with the dimensionality aspect through feature selection on numerical data, the categorical domain is actively being explored. As outlier detection is generally considered as an unsupervised learning problem due to lack of knowledge about the nature of various types of outliers, the related feature selection task also needs to be handled in a similar manner. This motivates the need to develop an unsupervised feature selection algorithm for efficient detection of outliers in categorical data. Addressing this aspect, we propose a novel feature selection algorithm based on the mutual information measure and the entropy computation. The redundancy among the features is characterized using the mutual information measure for identifying a suitable feature subset with less redundancy. The performance of the proposed algorithm in comparison with the information gain based feature selection shows its effectiveness for outlier detection. The efficacy of the proposed algorithm is demonstrated on various high-dimensional benchmark data sets employing two existing outlier detection methods.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Data clustering groups data so that data which are similar to each other are in the same group and data which are dissimilar to each other are in different groups. Since generally clustering is a subjective activity, it is possible to get different clusterings of the same data depending on the need. This paper attempts to find the best clustering of the data by first carrying out feature selection and using only the selected features, for clustering. A PSO (Particle Swarm Optimization)has been used for clustering but feature selection has also been carried out simultaneously. The performance of the above proposed algorithm is evaluated on some benchmark data sets. The experimental results shows the proposed methodology outperforms the previous approaches such as basic PSO and Kmeans for the clustering problem.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Selection of relevant features is an open problem in Brain-computer interfacing (BCI) research. Sometimes, features extracted from brain signals are high dimensional which in turn affects the accuracy of the classifier. Selection of the most relevant features improves the performance of the classifier and reduces the computational cost of the system. In this study, we have used a combination of Bacterial Foraging Optimization and Learning Automata to determine the best subset of features from a given motor imagery electroencephalography (EEG) based BCI dataset. Here, we have employed Discrete Wavelet Transform to obtain a high dimensional feature set and classified it by Distance Likelihood Ratio Test. Our proposed feature selector produced an accuracy of 80.291% in 216 seconds.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Guo and Nixon proposed a feature selection method based on maximizing I(x; Y),the multidimensional mutual information between feature vector x and class variable Y. Because computing I(x; Y) can be difficult in practice, Guo and Nixon proposed an approximation of I(x; Y) as the criterion for feature selection. We show that Guo and Nixon's criterion originates from approximating the joint probability distributions in I(x; Y) by second-order product distributions. We remark on the limitations of the approximation and discuss computationally attractive alternatives to compute I(x; Y).

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Classification of a large document collection involves dealing with a huge feature space where each distinct word is a feature. In such an environment, classification is a costly task both in terms of running time and computing resources. Further it will not guarantee optimal results because it is likely to overfit by considering every feature for classification. In such a context, feature selection is inevitable. This work analyses the feature selection methods, explores the relations among them and attempts to find a minimal subset of features which are discriminative for document classification.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Feature selection is an important first step in regional hydrologic studies (RHYS). Over the past few decades, advances in data collection facilities have resulted in development of data archives on a variety of hydro-meteorological variables that may be used as features in RHYS. Currently there are no established procedures for selecting features from such archives. Therefore, hydrologists often use subjective methods to arrive at a set of features. This may lead to misleading results. To alleviate this problem, a probabilistic clustering method for regionalization is presented to determine appropriate features from the available dataset. The effectiveness of the method is demonstrated by application to regionalization of watersheds in conterminous United States for low flow frequency analysis. Plausible homogeneous regions that are formed by using the proposed clustering method are compared with those from conventional methods of regionalization using L-moment based homogeneity tests. Results show that the proposed methodology is promising for RHYS.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

A new automata model Mr,k, with a conceptually significant innovation in the form of multi-state alternatives at each instance, is proposed in this study. Computer simulations of the Mr,k, model in the context of feature selection in an unsupervised environment has demonstrated the superiority of the model over similar models without this multi-state-choice innovation.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Separation of printed text blocks from the non-text areas, containing signatures, handwritten text, logos and other such symbols, is a necessary first step for an OCR involving printed text recognition. In the present work, we compare the efficacy of some feature-classifier combinations to carry out this separation task. We have selected length-nomalized horizontal projection profile (HPP) as the starting point of such a separation task. This is with the assumption that the printed text blocks contain lines of text which generate HPP's with some regularity. Such an assumption is demonstrated to be valid. Our features are the HPP and its two transformed versions, namely, eigen and Fisher profiles. Four well known classifiers, namely, Nearest neighbor, Linear discriminant function, SVM's and artificial neural networks have been considered and efficiency of the combination of these classifiers with the above features is compared. A sequential floating feature selection technique has been adopted to enhance the efficiency of this separation task. The results give an average accuracy of about 96.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Design of speaker identification schemes for a small number of speakers (around 10) with a high degree of accuracy in controlled environment is a practical proposition today. When the number of speakers is large (say 50–100), many of these schemes cannot be directly extended, as both recognition error and computation time increase monotonically with population size. The feature selection problem is also complex for such schemes. Though there were earlier attempts to rank order features based on statistical distance measures, it has been observed only recently that the best two independent measurements are not the same as the combination in two's for pattern classification. We propose here a systematic approach to the problem using the decision tree or hierarchical classifier with the following objectives: (1) Design of optimal policy at each node of the tree given the tree structure i.e., the tree skeleton and the features to be used at each node. (2) Determination of the optimal feature measurement and decision policy given only the tree skeleton. Applicability of optimization procedures such as dynamic programming in the design of such trees is studied. The experimental results deal with the design of a 50 speaker identification scheme based on this approach.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Homomorphic analysis and pole-zero modeling of electrocardiogram (ECG) signals are presented in this paper. Four typical ECG signals are considered and deconvolved into their minimum and maximum phase components through cepstral filtering, with a view to study the possibility of more efficient feature selection from the component signals for diagnostic purposes. The complex cepstra of the signals are linearly filtered to extract the basic wavelet and the excitation function. The ECG signals are, in general, mixed phase and hence, exponential weighting is done to aid deconvolution of the signals. The basic wavelet for normal ECG approximates the action potential of the muscle fiber of the heart and the excitation function corresponds to the excitation pattern of the heart muscles during a cardiac cycle. The ECG signals and their components are pole-zero modeled and the pole-zero pattern of the models can give a clue to classify the normal and abnormal signals. Besides, storing only the parameters of the model can result in a data reduction of more than 3:1 for normal signals sampled at a moderate 128 samples/s

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Clustering is a process of partitioning a given set of patterns into meaningful groups. The clustering process can be viewed as consisting of the following three phases: (i) feature selection phase, (ii) classification phase, and (iii) description generation phase. Conventional clustering algorithms implicitly use knowledge about the clustering environment to a large extent in the feature selection phase. This reduces the need for the environmental knowledge in the remaining two phases, permitting the usage of simple numerical measure of similarity in the classification phase. Conceptual clustering algorithms proposed by Michalski and Stepp [IEEE Trans. PAMI, PAMI-5, 396–410 (1983)] and Stepp and Michalski [Artif. Intell., pp. 43–69 (1986)] make use of the knowledge about the clustering environment in the form of a set of predefined concepts to compute the conceptual cohesiveness during the classification phase. Michalski and Stepp [IEEE Trans. PAMI, PAMI-5, 396–410 (1983)] have argued that the results obtained with the conceptual clustering algorithms are superior to conventional methods of numerical classification. However, this claim was not supported by the experimental results obtained by Dale [IEEE Trans. PAMI, PAMI-7, 241–244 (1985)]. In this paper a theoretical framework, based on an intuitively appealing set of axioms, is developed to characterize the equivalence between the conceptual clustering and conventional clustering. In other words, it is shown that any classification obtained using conceptual clustering can also be obtained using conventional clustering and vice versa.