864 resultados para Feature selection algorithm
Resumo:
A combined strategy based on the computation of absorption energies, using the ZINDO/S semiempirical method, for a statistically relevant number of thermally sampled configurations extracted from QM/MM trajectories is used to establish a one-to-one correspondence between the structures of the different early intermediates (dark, batho, BSI, lumi) involved in the initial steps of the rhodopsin photoactivation mechanism and their optical spectra. A systematic analysis of the results based on a correlation-based feature selection algorithm shows that the origin of the color shifts among these intermediates can be mainly ascribed to alterations in intrinsic properties of the chromophore structure, which are tuned by several residues located in the protein binding pocket. In addition to the expected electrostatic and dipolar effects caused by the charged residues (Glu113, Glu181) and to strong hydrogen bonding with Glu113, other interactions such as π-stacking with Ala117 and Thr118 backbone atoms, van der Waals contacts with Gly114 and Ala292, and CH/π weak interactions with Tyr268, Ala117, Thr118, and Ser186 side chains are found to make non-negligible contributions to the modulation of the color tuning among the different rhodopsin photointermediates.
Resumo:
We investigate the relevance of morphological operators for the classification of land use in urban scenes using submetric panchromatic imagery. A support vector machine is used for the classification. Six types of filters have been employed: opening and closing, opening and closing by reconstruction, and opening and closing top hat. The type and scale of the filters are discussed, and a feature selection algorithm called recursive feature elimination is applied to decrease the dimensionality of the input data. The analysis performed on two QuickBird panchromatic images showed that simple opening and closing operators are the most relevant for classification at such a high spatial resolution. Moreover, mixed sets combining simple and reconstruction filters provided the best performance. Tests performed on both images, having areas characterized by different architectural styles, yielded similar results for both feature selection and classification accuracy, suggesting the generalization of the feature sets highlighted.
Resumo:
There are numerous text documents available in electronic form. More and more are becoming available every day. Such documents represent a massive amount of information that is easily accessible. Seeking value in this huge collection requires organization; much of the work of organizing documents can be automated through text classification. The accuracy and our understanding of such systems greatly influences their usefulness. In this paper, we seek 1) to advance the understanding of commonly used text classification techniques, and 2) through that understanding, improve the tools that are available for text classification. We begin by clarifying the assumptions made in the derivation of Naive Bayes, noting basic properties and proposing ways for its extension and improvement. Next, we investigate the quality of Naive Bayes parameter estimates and their impact on classification. Our analysis leads to a theorem which gives an explanation for the improvements that can be found in multiclass classification with Naive Bayes using Error-Correcting Output Codes. We use experimental evidence on two commonly-used data sets to exhibit an application of the theorem. Finally, we show fundamental flaws in a commonly-used feature selection algorithm and develop a statistics-based framework for text feature selection. Greater understanding of Naive Bayes and the properties of text allows us to make better use of it in text classification.
Resumo:
Accurate single trial P300 classification lends itself to fast and accurate control of Brain Computer Interfaces (BCIs). Highly accurate classification of single trial P300 ERPs is achieved by characterizing the EEG via corresponding stationary and time-varying Wackermann parameters. Subsets of maximally discriminating parameters are then selected using the Network Clustering feature selection algorithm and classified with Naive-Bayes and Linear Discriminant Analysis classifiers. Hence the method is assessed on two different data-sets from BCI competitions and is shown to produce accuracies of between approximately 70% and 85%. This is promising for the use of Wackermann parameters as features in the classification of single-trial ERP responses.
Resumo:
Although non-technical losses automatic identification has been massively studied, the problem of selecting the most representative features in order to boost the identification accuracy has not attracted much attention in this context. In this paper, we focus on this problem applying a novel feature selection algorithm based on Particle Swarm Optimization and Optimum-Path Forest. The results demonstrated that this method can improve the classification accuracy of possible frauds up to 49% in some datasets composed by industrial and commercial profiles. © 2011 IEEE.
Resumo:
Recent experimental evidence has suggested a neuromodulatory deficit in Alzheimer's disease (AD). In this paper, we present a new electroencephalogram (EEG) based metric to quantitatively characterize neuromodulatory activity. More specifically, the short-term EEG amplitude modulation rate-of-change (i.e., modulation frequency) is computed for five EEG subband signals. To test the performance of the proposed metric, a classification task was performed on a database of 32 participants partitioned into three groups of approximately equal size: healthy controls, patients diagnosed with mild AD, and those with moderate-to-severe AD. To gauge the benefits of the proposed metric, performance results were compared with those obtained using EEG spectral peak parameters which were recently shown to outperform other conventional EEG measures. Using a simple feature selection algorithm based on area-under-the-curve maximization and a support vector machine classifier, the proposed parameters resulted in accuracy gains, relative to spectral peak parameters, of 21.3% when discriminating between the three groups and by 50% when mild and moderate-to-severe groups were merged into one. The preliminary findings reported herein provide promising insights that automated tools may be developed to assist physicians in very early diagnosis of AD as well as provide researchers with a tool to automatically characterize cross-frequency interactions and their changes with disease.
Resumo:
This work proposes a system for classification of industrial steel pieces by means of magnetic nondestructive device. The proposed classification system presents two main stages, online system stage and off-line system stage. In online stage, the system classifies inputs and saves misclassification information in order to perform posterior analyses. In the off-line optimization stage, the topology of a Probabilistic Neural Network is optimized by a Feature Selection algorithm combined with the Probabilistic Neural Network to increase the classification rate. The proposed Feature Selection algorithm searches for the signal spectrogram by combining three basic elements: a Sequential Forward Selection algorithm, a Feature Cluster Grow algorithm with classification rate gradient analysis and a Sequential Backward Selection. Also, a trash-data recycling algorithm is proposed to obtain the optimal feedback samples selected from the misclassified ones.
Antarctic cloud spectral emission from ground-based measurements, a focus on far infrared signatures
Resumo:
The present work belongs to the PRANA project, the first extensive field campaign of observation of atmospheric emission spectra covering the Far InfraRed spectral region, for more than two years. The principal deployed instrument is REFIR-PAD, a Fourier transform spectrometer used by us to study Antarctic cloud properties. A dataset covering the whole 2013 has been analyzed and, firstly, a selection of good quality spectra is performed, using, as thresholds, radiance values in few chosen spectral regions. These spectra are described in a synthetic way averaging radiances in selected intervals, converting them into BTs and finally considering the differences between each pair of them. A supervised feature selection algorithm is implemented with the purpose to select the features really informative about the presence, the phase and the type of cloud. Hence, training and test sets are collected, by means of Lidar quick-looks. The supervised classification step of the overall monthly datasets is performed using a SVM. On the base of this classification and with the help of Lidar observations, 29 non-precipitating ice cloud case studies are selected. A single spectrum, or at most an average over two or three spectra, is processed by means of the retrieval algorithm RT-RET, exploiting some main IR window channels, in order to extract cloud properties. Retrieved effective radii and optical depths are analyzed, to compare them with literature studies and to evaluate possible seasonal trends. Finally, retrieval output atmospheric profiles are used as inputs for simulations, assuming two different crystal habits, with the aim to examine our ability to reproduce radiances in the FIR. Substantial mis-estimations are found for FIR micro-windows: a high variability is observed in the spectral pattern of simulation deviations from measured spectra and an effort to link these deviations to cloud parameters has been performed.
Resumo:
With the rapid growth of the Internet, computer attacks are increasing at a fast pace and can easily cause millions of dollar in damage to an organization. Detecting these attacks is an important issue of computer security. There are many types of attacks and they fall into four main categories, Denial of Service (DoS) attacks, Probe, User to Root (U2R) attacks, and Remote to Local (R2L) attacks. Within these categories, DoS and Probe attacks continuously show up with greater frequency in a short period of time when they attack systems. They are different from the normal traffic data and can be easily separated from normal activities. On the contrary, U2R and R2L attacks are embedded in the data portions of the packets and normally involve only a single connection. It becomes difficult to achieve satisfactory detection accuracy for detecting these two attacks. Therefore, we focus on studying the ambiguity problem between normal activities and U2R/R2L attacks. The goal is to build a detection system that can accurately and quickly detect these two attacks. In this dissertation, we design a two-phase intrusion detection approach. In the first phase, a correlation-based feature selection algorithm is proposed to advance the speed of detection. Features with poor prediction ability for the signatures of attacks and features inter-correlated with one or more other features are considered redundant. Such features are removed and only indispensable information about the original feature space remains. In the second phase, we develop an ensemble intrusion detection system to achieve accurate detection performance. The proposed method includes multiple feature selecting intrusion detectors and a data mining intrusion detector. The former ones consist of a set of detectors, and each of them uses a fuzzy clustering technique and belief theory to solve the ambiguity problem. The latter one applies data mining technique to automatically extract computer users’ normal behavior from training network traffic data. The final decision is a combination of the outputs of feature selecting and data mining detectors. The experimental results indicate that our ensemble approach not only significantly reduces the detection time but also effectively detect U2R and R2L attacks that contain degrees of ambiguous information.
Resumo:
With the rapid growth of the Internet, computer attacks are increasing at a fast pace and can easily cause millions of dollar in damage to an organization. Detecting these attacks is an important issue of computer security. There are many types of attacks and they fall into four main categories, Denial of Service (DoS) attacks, Probe, User to Root (U2R) attacks, and Remote to Local (R2L) attacks. Within these categories, DoS and Probe attacks continuously show up with greater frequency in a short period of time when they attack systems. They are different from the normal traffic data and can be easily separated from normal activities. On the contrary, U2R and R2L attacks are embedded in the data portions of the packets and normally involve only a single connection. It becomes difficult to achieve satisfactory detection accuracy for detecting these two attacks. Therefore, we focus on studying the ambiguity problem between normal activities and U2R/R2L attacks. The goal is to build a detection system that can accurately and quickly detect these two attacks. In this dissertation, we design a two-phase intrusion detection approach. In the first phase, a correlation-based feature selection algorithm is proposed to advance the speed of detection. Features with poor prediction ability for the signatures of attacks and features inter-correlated with one or more other features are considered redundant. Such features are removed and only indispensable information about the original feature space remains. In the second phase, we develop an ensemble intrusion detection system to achieve accurate detection performance. The proposed method includes multiple feature selecting intrusion detectors and a data mining intrusion detector. The former ones consist of a set of detectors, and each of them uses a fuzzy clustering technique and belief theory to solve the ambiguity problem. The latter one applies data mining technique to automatically extract computer users’ normal behavior from training network traffic data. The final decision is a combination of the outputs of feature selecting and data mining detectors. The experimental results indicate that our ensemble approach not only significantly reduces the detection time but also effectively detect U2R and R2L attacks that contain degrees of ambiguous information.
Resumo:
A number of studies in the areas of Biomedical Engineering and Health Sciences have employed machine learning tools to develop methods capable of identifying patterns in different sets of data. Despite its extinction in many countries of the developed world, Hansen’s disease is still a disease that affects a huge part of the population in countries such as India and Brazil. In this context, this research proposes to develop a method that makes it possible to understand in the future how Hansen’s disease affects facial muscles. By using surface electromyography, a system was adapted so as to capture the signals from the largest possible number of facial muscles. We have first looked upon the literature to learn about the way researchers around the globe have been working with diseases that affect the peripheral neural system and how electromyography has acted to contribute to the understanding of these diseases. From these data, a protocol was proposed to collect facial surface electromyographic (sEMG) signals so that these signals presented a high signal to noise ratio. After collecting the signals, we looked for a method that would enable the visualization of this information in a way to make it possible to guarantee that the method used presented satisfactory results. After identifying the method's efficiency, we tried to understand which information could be extracted from the electromyographic signal representing the collected data. Once studies demonstrating which information could contribute to a better understanding of this pathology were not to be found in literature, parameters of amplitude, frequency and entropy were extracted from the signal and a feature selection was made in order to look for the features that better distinguish a healthy individual from a pathological one. After, we tried to identify the classifier that best discriminates distinct individuals from different groups, and also the set of parameters of this classifier that would bring the best outcome. It was identified that the protocol proposed in this study and the adaptation with disposable electrodes available in market proved their effectiveness and capability of being used in different studies whose intention is to collect data from facial electromyography. The feature selection algorithm also showed that not all of the features extracted from the signal are significant for data classification, with some more relevant than others. The classifier Support Vector Machine (SVM) proved itself efficient when the adequate Kernel function was used with the muscle from which information was to be extracted. Each investigated muscle presented different results when the classifier used linear, radial and polynomial kernel functions. Even though we have focused on Hansen’s disease, the method applied here can be used to study facial electromyography in other pathologies.
Resumo:
A number of studies in the areas of Biomedical Engineering and Health Sciences have employed machine learning tools to develop methods capable of identifying patterns in different sets of data. Despite its extinction in many countries of the developed world, Hansen’s disease is still a disease that affects a huge part of the population in countries such as India and Brazil. In this context, this research proposes to develop a method that makes it possible to understand in the future how Hansen’s disease affects facial muscles. By using surface electromyography, a system was adapted so as to capture the signals from the largest possible number of facial muscles. We have first looked upon the literature to learn about the way researchers around the globe have been working with diseases that affect the peripheral neural system and how electromyography has acted to contribute to the understanding of these diseases. From these data, a protocol was proposed to collect facial surface electromyographic (sEMG) signals so that these signals presented a high signal to noise ratio. After collecting the signals, we looked for a method that would enable the visualization of this information in a way to make it possible to guarantee that the method used presented satisfactory results. After identifying the method's efficiency, we tried to understand which information could be extracted from the electromyographic signal representing the collected data. Once studies demonstrating which information could contribute to a better understanding of this pathology were not to be found in literature, parameters of amplitude, frequency and entropy were extracted from the signal and a feature selection was made in order to look for the features that better distinguish a healthy individual from a pathological one. After, we tried to identify the classifier that best discriminates distinct individuals from different groups, and also the set of parameters of this classifier that would bring the best outcome. It was identified that the protocol proposed in this study and the adaptation with disposable electrodes available in market proved their effectiveness and capability of being used in different studies whose intention is to collect data from facial electromyography. The feature selection algorithm also showed that not all of the features extracted from the signal are significant for data classification, with some more relevant than others. The classifier Support Vector Machine (SVM) proved itself efficient when the adequate Kernel function was used with the muscle from which information was to be extracted. Each investigated muscle presented different results when the classifier used linear, radial and polynomial kernel functions. Even though we have focused on Hansen’s disease, the method applied here can be used to study facial electromyography in other pathologies.
Resumo:
This paper presents a methodology for short-term load forecasting based on genetic algorithm feature selection and artificial neural network modeling. A feed forward artificial neural network is used to model the 24-h ahead load based on past consumption, weather and stock index data. A genetic algorithm is used in order to find the best subset of variables for modeling. Three data sets of different geographical locations, encompassing areas of different dimensions with distinct load profiles are used in order to evaluate the methodology. The developed approach was found to generate models achieving a minimum mean average percentage error under 2 %. The feature selection algorithm was able to significantly reduce the number of used features and increase the accuracy of the models.
Resumo:
Many learning problems require handling high dimensional datasets with a relatively small number of instances. Learning algorithms are thus confronted with the curse of dimensionality, and need to address it in order to be effective. Examples of these types of data include the bag-of-words representation in text classification problems and gene expression data for tumor detection/classification. Usually, among the high number of features characterizing the instances, many may be irrelevant (or even detrimental) for the learning tasks. It is thus clear that there is a need for adequate techniques for feature representation, reduction, and selection, to improve both the classification accuracy and the memory requirements. In this paper, we propose combined unsupervised feature discretization and feature selection techniques, suitable for medium and high-dimensional datasets. The experimental results on several standard datasets, with both sparse and dense features, show the efficiency of the proposed techniques as well as improvements over previous related techniques.
Resumo:
In this paper, we present a feature selection approach based on Gabor wavelet feature and boosting for face verification. By convolution with a group of Gabor wavelets, the original images are transformed into vectors of Gabor wavelet features. Then for individual person, a small set of significant features are selected by the boosting algorithm from a large set of Gabor wavelet features. The experiment results have shown that the approach successfully selects meaningful and explainable features for face verification. The experiments also suggest that for the common characteristics such as eyes, noses, mouths may not be as important as some unique characteristic when training set is small. When training set is large, the unique characteristics and the common characteristics are both important.