907 resultados para optimal feature selection
Resumo:
Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicate the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation-maximization method to recover ground truth. An application to real data, referred to official statistics, shows its usefulness.
Resumo:
In research on Silent Speech Interfaces (SSI), different sources of information (modalities) have been combined, aiming at obtaining better performance than the individual modalities. However, when combining these modalities, the dimensionality of the feature space rapidly increases, yielding the well-known "curse of dimensionality". As a consequence, in order to extract useful information from this data, one has to resort to feature selection (FS) techniques to lower the dimensionality of the learning space. In this paper, we assess the impact of FS techniques for silent speech data, in a dataset with 4 non-invasive and promising modalities, namely: video, depth, ultrasonic Doppler sensing, and surface electromyography. We consider two supervised (mutual information and Fisher's ratio) and two unsupervised (meanmedian and arithmetic mean geometric mean) FS filters. The evaluation was made by assessing the classification accuracy (word recognition error) of three well-known classifiers (knearest neighbors, support vector machines, and dynamic time warping). The key results of this study show that both unsupervised and supervised FS techniques improve on the classification accuracy on both individual and combined modalities. For instance, on the video component, we attain relative performance gains of 36.2% in error rates. FS is also useful as pre-processing for feature fusion. Copyright © 2014 ISCA.
Resumo:
Feature selection is a central problem in machine learning and pattern recognition. On large datasets (in terms of dimension and/or number of instances), using search-based or wrapper techniques can be cornputationally prohibitive. Moreover, many filter methods based on relevance/redundancy assessment also take a prohibitively long time on high-dimensional. datasets. In this paper, we propose efficient unsupervised and supervised feature selection/ranking filters for high-dimensional datasets. These methods use low-complexity relevance and redundancy criteria, applicable to supervised, semi-supervised, and unsupervised learning, being able to act as pre-processors for computationally intensive methods to focus their attention on smaller subsets of promising features. The experimental results, with up to 10(5) features, show the time efficiency of our methods, with lower generalization error than state-of-the-art techniques, while being dramatically simpler and faster.
Resumo:
In machine learning and pattern recognition tasks, the use of feature discretization techniques may have several advantages. The discretized features may hold enough information for the learning task at hand, while ignoring minor fluctuations that are irrelevant or harmful for that task. The discretized features have more compact representations that may yield both better accuracy and lower training time, as compared to the use of the original features. However, in many cases, mainly with medium and high-dimensional data, the large number of features usually implies that there is some redundancy among them. Thus, we may further apply feature selection (FS) techniques on the discrete data, keeping the most relevant features, while discarding the irrelevant and redundant ones. In this paper, we propose relevance and redundancy criteria for supervised feature selection techniques on discrete data. These criteria are applied to the bin-class histograms of the discrete features. The experimental results, on public benchmark data, show that the proposed criteria can achieve better accuracy than widely used relevance and redundancy criteria, such as mutual information and the Fisher ratio.
Resumo:
We are going to implement the "GA-SEFS" by Tsymbal and analyse experimentally its performance depending on the classifier algorithms used in the fitness function (NB, MNge, SMO). We are also going to study the effect of adding to the fitness function a measure to control complexity of the base classifiers.
Resumo:
Diagnosis of community acquired legionella pneumonia (CALP) is currently performed by means of laboratory techniques which may delay diagnosis several hours. To determine whether ANN can categorize CALP and non-legionella community-acquired pneumonia (NLCAP) and be standard for use by clinicians, we prospectively studied 203 patients with community-acquired pneumonia (CAP) diagnosed by laboratory tests. Twenty one clinical and analytical variables were recorded to train a neural net with two classes (LCAP or NLCAP class). In this paper we deal with the problem of diagnosis, feature selection, and ranking of the features as a function of their classification importance, and the design of a classifier the criteria of maximizing the ROC (Receiving operating characteristics) area, which gives a good trade-off between true positives and false negatives. In order to guarantee the validity of the statistics; the train-validation-test databases were rotated by the jackknife technique, and a multistarting procedure was done in order to make the system insensitive to local maxima.
Resumo:
Gene filtering is a useful preprocessing technique often applied to microarray datasets. However, it is no common practice because clear guidelines are lacking and it bears the risk of excluding some potentially relevant genes. In this work, we propose to model microarray data as a mixture of two Gaussian distributions that will allow us to obtain an optimal filter threshold in terms of the gene expression level.
Resumo:
Peer-reviewed
Resumo:
Alzheimer׳s disease (AD) is the most common type of dementia among the elderly. This work is part of a larger study that aims to identify novel technologies and biomarkers or features for the early detection of AD and its degree of severity. The diagnosis is made by analyzing several biomarkers and conducting a variety of tests (although only a post-mortem examination of the patients’ brain tissue is considered to provide definitive confirmation). Non-invasive intelligent diagnosis techniques would be a very valuable diagnostic aid. This paper concerns the Automatic Analysis of Emotional Response (AAER) in spontaneous speech based on classical and new emotional speech features: Emotional Temperature (ET) and fractal dimension (FD). This is a pre-clinical study aiming to validate tests and biomarkers for future diagnostic use. The method has the great advantage of being non-invasive, low cost, and without any side effects. The AAER shows very promising results for the definition of features useful in the early diagnosis of AD.
Resumo:
Objective. Recently, significant advances have been made in the early diagnosis of Alzheimer’s disease from EEG. However, choosing suitable measures is a challenging task. Among other measures, frequency Relative Power and loss of complexity have been used with promising results. In the present study we investigate the early diagnosis of AD using synchrony measures and frequency Relative Power on EEG signals, examining the changes found in different frequency ranges. Approach. We first explore the use of a single feature for computing the classification rate, looking for the best frequency range. Then, we present a multiple feature classification system that outperforms all previous results using a feature selection strategy. These two approaches are tested in two different databases, one containing MCI and healthy subjects (patients age: 71.9 ± 10.2, healthy subjects age: 71.7 ± 8.3), and the other containing Mild AD and healthy subjects (patients age: 77.6 ± 10.0; healthy subjects age: 69.4± 11.5). Main Results. Using a single feature to compute classification rates we achieve a performance of 78.33% for the MCI data set and of 97.56 % for Mild AD. Results are clearly improved using the multiple feature classification, where a classification rate of 95% is found for the MCI data set using 11 features, and 100% for the Mild AD data set using 4 features. Significance. The new features selection method described in this work may be a reliable tool that could help to design a realistic system that does not require prior knowledge of a patient's status. With that aim, we explore the standardization of features for MCI and Mild AD data sets with promising results.
Resumo:
In this study, feature selection in classification based problems is highlighted. The role of feature selection methods is to select important features by discarding redundant and irrelevant features in the data set, we investigated this case by using fuzzy entropy measures. We developed fuzzy entropy based feature selection method using Yu's similarity and test this using similarity classifier. As the similarity classifier we used Yu's similarity, we tested our similarity on the real world data set which is dermatological data set. By performing feature selection based on fuzzy entropy measures before classification on our data set the empirical results were very promising, the highest classification accuracy of 98.83% was achieved when testing our similarity measure to the data set. The achieved results were then compared with some other results previously obtained using different similarity classifiers, the obtained results show better accuracy than the one achieved before. The used methods helped to reduce the dimensionality of the used data set, to speed up the computation time of a learning algorithm and therefore have simplified the classification task
Resumo:
Treating e-mail filtering as a binary text classification problem, researchers have applied several statistical learning algorithms to email corpora with promising results. This paper examines the performance of a Naive Bayes classifier using different approaches to feature selection and tokenization on different email corpora
Resumo:
We present a new method to select features for a face detection system using Support Vector Machines (SVMs). In the first step we reduce the dimensionality of the input space by projecting the data into a subset of eigenvectors. The dimension of the subset is determined by a classification criterion based on minimizing a bound on the expected error probability of an SVM. In the second step we select features from the SVM feature space by removing those that have low contributions to the decision function of the SVM.
Resumo:
A new method for the automated selection of colour features is described. The algorithm consists of two stages of processing. In the first, a complete set of colour features is calculated for every object of interest in an image. In the second stage, each object is mapped into several n-dimensional feature spaces in order to select the feature set with the smallest variables able to discriminate the remaining objects. The evaluation of the discrimination power for each concrete subset of features is performed by means of decision trees composed of linear discrimination functions. This method can provide valuable help in outdoor scene analysis where no colour space has been demonstrated as being the most suitable. Experiment results recognizing objects in outdoor scenes are reported