803 resultados para correlation-based feature selection
Resumo:
This paper deals with the problem of using the data mining models in a real-world situation where the user can not provide all the inputs with which the predictive model is built. A learning system framework, Query Based Learning System (QBLS), is developed for improving the performance of the predictive models in practice where not all inputs are available for querying to the system. The automatic feature selection algorithm called Query Based Feature Selection (QBFS) is developed for selecting features to obtain a balance between the relative minimum subset of features and the relative maximum classification accuracy. Performance of the QBLS system and the QBFS algorithm is successfully demonstrated with a real-world application
Resumo:
Most current computer systems authorise the user at the start of a session and do not detect whether the current user is still the initial authorised user, a substitute user, or an intruder pretending to be a valid user. Therefore, a system that continuously checks the identity of the user throughout the session is necessary without being intrusive to end-user and/or effectively doing this. Such a system is called a continuous authentication system (CAS). Researchers have applied several approaches for CAS and most of these techniques are based on biometrics. These continuous biometric authentication systems (CBAS) are supplied by user traits and characteristics. One of the main types of biometric is keystroke dynamics which has been widely tried and accepted for providing continuous user authentication. Keystroke dynamics is appealing for many reasons. First, it is less obtrusive, since users will be typing on the computer keyboard anyway. Second, it does not require extra hardware. Finally, keystroke dynamics will be available after the authentication step at the start of the computer session. Currently, there is insufficient research in the CBAS with keystroke dynamics field. To date, most of the existing schemes ignore the continuous authentication scenarios which might affect their practicality in different real world applications. Also, the contemporary CBAS with keystroke dynamics approaches use characters sequences as features that are representative of user typing behavior but their selected features criteria do not guarantee features with strong statistical significance which may cause less accurate statistical user-representation. Furthermore, their selected features do not inherently incorporate user typing behavior. Finally, the existing CBAS that are based on keystroke dynamics are typically dependent on pre-defined user-typing models for continuous authentication. This dependency restricts the systems to authenticate only known users whose typing samples are modelled. This research addresses the previous limitations associated with the existing CBAS schemes by developing a generic model to better identify and understand the characteristics and requirements of each type of CBAS and continuous authentication scenario. Also, the research proposes four statistical-based feature selection techniques that have highest statistical significance and encompasses different user typing behaviors which represent user typing patterns effectively. Finally, the research proposes the user-independent threshold approach that is able to authenticate a user accurately without needing any predefined user typing model a-priori. Also, we enhance the technique to detect the impostor or intruder who may take over during the entire computer session.
Resumo:
Background Cancer outlier profile analysis (COPA) has proven to be an effective approach to analyzing cancer expression data, leading to the discovery of the TMPRSS2 and ETS family gene fusion events in prostate cancer. However, the original COPA algorithm did not identify down-regulated outliers, and the currently available R package implementing the method is similarly restricted to the analysis of over-expressed outliers. Here we present a modified outlier detection method, mCOPA, which contains refinements to the outlier-detection algorithm, identifies both over- and under-expressed outliers, is freely available, and can be applied to any expression dataset. Results We compare our method to other feature-selection approaches, and demonstrate that mCOPA frequently selects more-informative features than do differential expression or variance-based feature selection approaches, and is able to recover observed clinical subtypes more consistently. We demonstrate the application of mCOPA to prostate cancer expression data, and explore the use of outliers in clustering, pathway analysis, and the identification of tumour suppressors. We analyse the under-expressed outliers to identify known and novel prostate cancer tumour suppressor genes, validating these against data in Oncomine and the Cancer Gene Index. We also demonstrate how a combination of outlier analysis and pathway analysis can identify molecular mechanisms disrupted in individual tumours. Conclusions We demonstrate that mCOPA offers advantages, compared to differential expression or variance, in selecting outlier features, and that the features so selected are better able to assign samples to clinically annotated subtypes. Further, we show that the biology explored by outlier analysis differs from that uncovered in differential expression or variance analysis. mCOPA is an important new tool for the exploration of cancer datasets and the discovery of new cancer subtypes, and can be combined with pathway and functional analysis approaches to discover mechanisms underpinning heterogeneity in cancers
Resumo:
One of the major aims of BCI research is devoted to achieving faster and more efficient control of external devices. The identification of individual tap events in a motor imagery BCI is therefore a desirable goal. EEG is recorded from subjects performing and imagining finger taps with their left and right hands. A Differential Evolution based feature selection wrapper is used in order to identify optimal features in the spatial and frequency domains for tap identification. Channel-frequency band combinations are found which allow differentiation of tap vs. no-tap control conditions for executed and imagined taps. Left vs. right hand taps may also be differentiated with features found in this manner. A sliding time window is then used to accurately identify individual taps in the executed tap and imagined tap conditions. Highly statistically significant classification accuracies are achieved with time windows of 0.5 s and more allowing taps to be identified on a single trial basis.
Resumo:
Aim of this paper is to evaluate the diagnostic contribution of various types of texture features in discrimination of hepatic tissue in abdominal non-enhanced Computed Tomography (CT) images. Regions of Interest (ROIs) corresponding to the classes: normal liver, cyst, hemangioma, and hepatocellular carcinoma were drawn by an experienced radiologist. For each ROI, five distinct sets of texture features are extracted using First Order Statistics (FOS), Spatial Gray Level Dependence Matrix (SGLDM), Gray Level Difference Method (GLDM), Laws' Texture Energy Measures (TEM), and Fractal Dimension Measurements (FDM). In order to evaluate the ability of the texture features to discriminate the various types of hepatic tissue, each set of texture features, or its reduced version after genetic algorithm based feature selection, was fed to a feed-forward Neural Network (NN) classifier. For each NN, the area under Receiver Operating Characteristic (ROC) curves (Az) was calculated for all one-vs-all discriminations of hepatic tissue. Additionally, the total Az for the multi-class discrimination task was estimated. The results show that features derived from FOS perform better than other texture features (total Az: 0.802+/-0.083) in the discrimination of hepatic tissue.
Resumo:
Piotr Omenzetter and Simon Hoell’s work within the Lloyd’s Register Foundation Centre for Safety and Reliability Engineering at the University of Aberdeen is supported by Lloyd’s Register Foundation. The Foundation helps to protect life and property by supporting engineering-related education, public engagement and the application of research.
Resumo:
Piotr Omenzetter and Simon Hoell’s work within the Lloyd’s Register Foundation Centre for Safety and Reliability Engineering at the University of Aberdeen is supported by Lloyd’s Register Foundation. The Foundation helps to protect life and property by supporting engineering-related education, public engagement and the application of research.
Resumo:
Guo and Nixon proposed a feature selection method based on maximizing I(x; Y),the multidimensional mutual information between feature vector x and class variable Y. Because computing I(x; Y) can be difficult in practice, Guo and Nixon proposed an approximation of I(x; Y) as the criterion for feature selection. We show that Guo and Nixon's criterion originates from approximating the joint probability distributions in I(x; Y) by second-order product distributions. We remark on the limitations of the approximation and discuss computationally attractive alternatives to compute I(x; Y).
Resumo:
The work presented here is part of a larger study to identify novel technologies and biomarkers for early Alzheimer disease (AD) detection and it focuses on evaluating the suitability of a new approach for early AD diagnosis by non-invasive methods. The purpose is to examine in a pilot study the potential of applying intelligent algorithms to speech features obtained from suspected patients in order to contribute to the improvement of diagnosis of AD and its degree of severity. In this sense, Artificial Neural Networks (ANN) have been used for the automatic classification of the two classes (AD and control subjects). Two human issues have been analyzed for feature selection: Spontaneous Speech and Emotional Response. Not only linear features but also non-linear ones, such as Fractal Dimension, have been explored. The approach is non invasive, low cost and without any side effects. Obtained experimental results were very satisfactory and promising for early diagnosis and classification of AD patients.
Resumo:
Although many feature selection methods for classification have been developed, there is a need to identify genes in high-dimensional data with censored survival outcomes. Traditional methods for gene selection in classification problems have several drawbacks. First, the majority of the gene selection approaches for classification are single-gene based. Second, many of the gene selection procedures are not embedded within the algorithm itself. The technique of random forests has been found to perform well in high-dimensional data settings with survival outcomes. It also has an embedded feature to identify variables of importance. Therefore, it is an ideal candidate for gene selection in high-dimensional data with survival outcomes. In this paper, we develop a novel method based on the random forests to identify a set of prognostic genes. We compare our method with several machine learning methods and various node split criteria using several real data sets. Our method performed well in both simulations and real data analysis.Additionally, we have shown the advantages of our approach over single-gene-based approaches. Our method incorporates multivariate correlations in microarray data for survival outcomes. The described method allows us to better utilize the information available from microarray data with survival outcomes.
Resumo:
We present a new wrapper feature selection algorithm for human detection. This algorithm is a hybrid featureselection approach combining the benefits of filter and wrapper methods. It allows the selection of an optimalfeature vector that well represents the shapes of the subjects in the images. In detail, the proposed featureselection algorithm adopts the k-fold subsampling and sequential backward elimination approach, while thestandard linear support vector machine (SVM) is used as the classifier for human detection. We apply theproposed algorithm to the publicly accessible INRIA and ETH pedestrian full image datasets with the PASCALVOC evaluation criteria. Compared to other state of the arts algorithms, our feature selection based approachcan improve the detection speed of the SVM classifier by over 50% with up to 2% better detection accuracy.Our algorithm also outperforms the equivalent systems introduced in the deformable part model approach witharound 9% improvement in the detection accuracy
Resumo:
In this paper, we present a feature selection approach based on Gabor wavelet feature and boosting for face verification. By convolution with a group of Gabor wavelets, the original images are transformed into vectors of Gabor wavelet features. Then for individual person, a small set of significant features are selected by the boosting algorithm from a large set of Gabor wavelet features. The experiment results have shown that the approach successfully selects meaningful and explainable features for face verification. The experiments also suggest that for the common characteristics such as eyes, noses, mouths may not be as important as some unique characteristic when training set is small. When training set is large, the unique characteristics and the common characteristics are both important.
Resumo:
Nonlinear analysis tools for studying and characterizing the dynamics of physiological signals have gained popularity, mainly because tracking sudden alterations of the inherent complexity of biological processes might be an indicator of altered physiological states. Typically, in order to perform an analysis with such tools, the physiological variables that describe the biological process under study are used to reconstruct the underlying dynamics of the biological processes. For that goal, a procedure called time-delay or uniform embedding is usually employed. Nonetheless, there is evidence of its inability for dealing with non-stationary signals, as those recorded from many physiological processes. To handle with such a drawback, this paper evaluates the utility of non-conventional time series reconstruction procedures based on non uniform embedding, applying them to automatic pattern recognition tasks. The paper compares a state of the art non uniform approach with a novel scheme which fuses embedding and feature selection at once, searching for better reconstructions of the dynamics of the system. Moreover, results are also compared with two classic uniform embedding techniques. Thus, the goal is comparing uniform and non uniform reconstruction techniques, including the one proposed in this work, for pattern recognition in biomedical signal processing tasks. Once the state space is reconstructed, the scheme followed characterizes with three classic nonlinear dynamic features (Largest Lyapunov Exponent, Correlation Dimension and Recurrence Period Density Entropy), while classification is carried out by means of a simple k-nn classifier. In order to test its generalization capabilities, the approach was tested with three different physiological databases (Speech Pathologies, Epilepsy and Heart Murmurs). In terms of the accuracy obtained to automatically detect the presence of pathologies, and for the three types of biosignals analyzed, the non uniform techniques used in this work lightly outperformed the results obtained using the uniform methods, suggesting their usefulness to characterize non-stationary biomedical signals in pattern recognition applications. On the other hand, in view of the results obtained and its low computational load, the proposed technique suggests its applicability for the applications under study.
Resumo:
This paper presents a fault diagnosis method based on adaptive neuro-fuzzy inference system (ANFIS) in combination with decision trees. Classification and regression tree (CART) which is one of the decision tree methods is used as a feature selection procedure to select pertinent features from data set. The crisp rules obtained from the decision tree are then converted to fuzzy if-then rules that are employed to identify the structure of ANFIS classifier. The hybrid of back-propagation and least squares algorithm are utilized to tune the parameters of the membership functions. In order to evaluate the proposed algorithm, the data sets obtained from vibration signals and current signals of the induction motors are used. The results indicate that the CART–ANFIS model has potential for fault diagnosis of induction motors.
Resumo:
Data preprocessing is widely recognized as an important stage in anomaly detection. This paper reviews the data preprocessing techniques used by anomaly-based network intrusion detection systems (NIDS), concentrating on which aspects of the network traffic are analyzed, and what feature construction and selection methods have been used. Motivation for the paper comes from the large impact data preprocessing has on the accuracy and capability of anomaly-based NIDS. The review finds that many NIDS limit their view of network traffic to the TCP/IP packet headers. Time-based statistics can be derived from these headers to detect network scans, network worm behavior, and denial of service attacks. A number of other NIDS perform deeper inspection of request packets to detect attacks against network services and network applications. More recent approaches analyze full service responses to detect attacks targeting clients. The review covers a wide range of NIDS, highlighting which classes of attack are detectable by each of these approaches. Data preprocessing is found to predominantly rely on expert domain knowledge for identifying the most relevant parts of network traffic and for constructing the initial candidate set of traffic features. On the other hand, automated methods have been widely used for feature extraction to reduce data dimensionality, and feature selection to find the most relevant subset of features from this candidate set. The review shows a trend toward deeper packet inspection to construct more relevant features through targeted content parsing. These context sensitive features are required to detect current attacks.