47 results for Fuzzy K Nearest Neighbor


Relevance: 100.00%

Abstract:

Zero-day or unknown malware is created using code obfuscation techniques that can modify the parent code to produce offspring copies that have the same functionality but different signatures. Current techniques reported in the literature lack the capability of detecting zero-day malware with the required accuracy and efficiency. In this paper, we have proposed and evaluated a novel method of employing several data mining techniques to detect and classify zero-day malware with high levels of accuracy and efficiency based on the frequency of Windows API calls. This paper describes the methodology employed for the collection of large data sets to train the classifiers, and analyses the performance results of the various data mining algorithms adopted for the study, using a fully automated tool developed in this research to conduct the experimental investigations and evaluation. Through the performance results of these algorithms from our experimental analysis, we are able to evaluate and discuss the advantages of one data mining algorithm over another for accurately detecting zero-day malware. The data mining framework employed in this research learns through analysing the behaviour of existing malicious and benign code in large datasets. We have employed robust classifiers, namely the Naïve Bayes (NB) algorithm, the k-Nearest Neighbor (kNN) algorithm, the Sequential Minimal Optimization (SMO) algorithm with four different kernels (Normalized PolyKernel, PolyKernel, Puk, and Radial Basis Function (RBF)), the Backpropagation Neural Network algorithm, and the J48 decision tree, and have evaluated their performance. Overall, the automated data mining system implemented for this study achieved a high true positive (TP) rate of more than 98.5% and a low false positive (FP) rate of less than 0.025, which has not been achieved in the literature so far. This is much higher than the required commercial acceptance level, indicating that our novel technique is a major leap forward in detecting zero-day malware. This paper also offers future directions for researchers exploring the different aspects of obfuscation that are affecting the IT world today.
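
A minimal sketch of this kind of experiment, using scikit-learn stand-ins for the classifiers named above and random placeholder data in place of the real API-call-frequency corpus:

```python
# Hypothetical sketch: classify executables as malicious/benign from
# Windows API call frequency vectors, using scikit-learn stand-ins for
# the classifiers named above (NB, kNN, SVM with several kernels).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# X[i, j] = frequency of API call j in sample i; y = 1 (malware) / 0 (benign).
# Random placeholders stand in for the real training corpus.
X = rng.integers(0, 50, size=(200, 30)).astype(float)
y = rng.integers(0, 2, size=200)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM (poly kernel)": SVC(kernel="poly", degree=2),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```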

Relevance: 100.00%

Abstract:

A new online neural-network-based regression model for noisy data is proposed in this paper. It is a hybrid system combining the Fuzzy ART (FA) and General Regression Neural Network (GRNN) models. Both the FA and GRNN models are fast incremental learning systems. The proposed hybrid model, denoted GRNNFA-online, retains the online learning properties of both models. The kernel centers of the GRNN are obtained by compressing the training samples using the FA model. The width of each kernel is then estimated by the K-nearest-neighbours (kNN) method. A heuristic is proposed to tune the value of K of the kNN dynamically, based on the concept of gradient descent. The performance of the GRNNFA-online model was evaluated using two benchmark datasets, i.e., OZONE and Friedman#1. The experimental results demonstrated the convergence of the prediction errors. Bootstrapping was employed to assess the performance statistically. The final prediction errors are analysed and compared with those from other systems.
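
A rough illustration of the prediction step only, assuming a plain GRNN whose per-kernel widths come from the K nearest centres; the Fuzzy ART compression and the gradient-descent tuning of K are not reproduced here:

```python
# Illustrative sketch (not the GRNNFA-online system itself): a GRNN
# prediction where each kernel width is the mean distance from that
# kernel centre to its K nearest neighbours among the centres.
import numpy as np

def grnn_predict(centres, targets, x, k=3):
    # Per-centre width sigma_i: mean distance to the k nearest other centres.
    d = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    sigmas = np.sort(d, axis=1)[:, :k].mean(axis=1)
    # Nadaraya-Watson style weighted average of the stored targets.
    dist2 = ((centres - x) ** 2).sum(axis=1)
    w = np.exp(-dist2 / (2.0 * sigmas ** 2))
    return (w @ targets) / w.sum()

# Toy usage: noisy samples of y = sin(x).
rng = np.random.default_rng(1)
cx = rng.uniform(0, 2 * np.pi, size=(40, 1))
cy = np.sin(cx[:, 0]) + rng.normal(0, 0.1, size=40)
print(grnn_predict(cx, cy, np.array([np.pi / 2])))  # close to sin(pi/2) = 1
```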

Relevance: 100.00%

Abstract:

One of the issues associated with pattern classification using data-based machine learning systems is the "curse of dimensionality". In this paper, the circle-segments method is proposed as a feature selection method to identify important input features before the entire data set is provided for learning with machine learning systems. Specifically, four machine learning systems are deployed for classification, viz. Multilayer Perceptron (MLP), Support Vector Machine (SVM), Fuzzy ARTMAP (FAM), and k-Nearest Neighbour (kNN). The integration of the circle-segments method and the machine learning systems has been applied to two case studies comprising one benchmark data set and one real data set. Overall, the results after feature selection using the circle-segments method demonstrate improvements in performance even with more than 50% of the input features eliminated from the original data sets.
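
A pipeline sketch under strong simplifications: a univariate ranking stands in for the circle-segments selection step (which is a visual method and not implemented here), Fuzzy ARTMAP is omitted, and a public benchmark replaces the case-study data:

```python
# Pipeline sketch only: the circle-segments step is replaced here by a
# simple univariate ranking (a placeholder, not the method in the paper),
# keeping the "select features, then compare classifiers" structure.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
k = X.shape[1] // 2          # keep only half of the original input features

models = {
    "MLP": MLPClassifier(max_iter=2000, random_state=0),
    "SVM": SVC(),
    "kNN": KNeighborsClassifier(),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=k), model)
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name} with {k}/{X.shape[1]} features: {acc:.3f}")
```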

Relevance: 100.00%

Abstract:

This paper introduces a method to classify EEG signals using features extracted by an integration of the wavelet transform and the nonparametric Wilcoxon test. Orthogonal Haar wavelet coefficients are ranked based on the Wilcoxon test statistics. The most discriminative wavelet coefficients are assembled to form a feature set that serves as input to the naïve Bayes classifier. Two benchmark datasets, named Ia and Ib, downloaded from the brain–computer interface (BCI) competition II, are employed for the experiments. Classification performance is evaluated using accuracy, mutual information, Gini coefficient and F-measure. Widely used classifiers, including the feedforward neural network, support vector machine, k-nearest neighbours, AdaBoost ensemble learning and the adaptive neuro-fuzzy inference system, are also implemented for comparison. The proposed combination of Haar wavelet features and the naïve Bayes classifier considerably outperforms the competing classification approaches and surpasses the best performance on the Ia and Ib datasets reported in BCI competition II. The use of naïve Bayes also keeps the computational cost low, which supports the implementation of a potential real-time BCI system.
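
A hedged sketch of the same feature pipeline on synthetic signals (loading the BCI competition data is not shown): Haar wavelet coefficients are ranked by a Wilcoxon rank-sum statistic and the top-ranked ones feed a naïve Bayes classifier:

```python
# Sketch of the feature pipeline on synthetic signals: Haar wavelet
# coefficients are ranked by the Wilcoxon rank-sum statistic and the top
# ones feed a naive Bayes classifier (BCI data loading not shown).
import numpy as np
import pywt
from scipy.stats import ranksums
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n, length = 120, 256
y = rng.integers(0, 2, size=n)
signals = rng.normal(size=(n, length))
signals[y == 1, :64] += 0.8          # class-dependent component (toy data)

def haar_features(sig):
    coeffs = pywt.wavedec(sig, "haar", level=4)
    return np.concatenate(coeffs)

F = np.array([haar_features(s) for s in signals])

# Rank each coefficient by |Wilcoxon rank-sum statistic| between classes.
stats = np.array([abs(ranksums(F[y == 0, j], F[y == 1, j]).statistic)
                  for j in range(F.shape[1])])
top = np.argsort(stats)[::-1][:30]    # keep the 30 most discriminative

print("CV accuracy:", cross_val_score(GaussianNB(), F[:, top], y, cv=5).mean())
```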

Relevance: 100.00%

Abstract:

Localization of RFIDs in an indoor environment entails determining both the position and the orientation of the user. This paper develops an estimator that uses RSSI measurements to predict the position and orientation of a transmitter in an indoor environment. The best estimator tried was a k-nearest neighbours model, which gave an accuracy of approximately 83% for position prediction and 93% for orientation prediction. It was also found that the RSSI values change throughout the day, meaning that an adaptive estimator is necessary for localization.
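
A minimal sketch of the estimation idea, with made-up RSSI readings and labels standing in for the measured data; since the abstract notes that RSSI drifts through the day, such a model would need to be refit or adapted periodically:

```python
# Hypothetical sketch: predict a position label and an orientation label
# from a vector of RSSI readings using k-nearest-neighbours classifiers.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_readers, n_samples = 6, 300
rssi = rng.normal(-60, 8, size=(n_samples, n_readers))   # dBm readings (toy)
position = rng.integers(0, 10, size=n_samples)           # 10 grid cells
orientation = rng.integers(0, 4, size=n_samples)         # N / E / S / W

knn = KNeighborsClassifier(n_neighbors=5)
print("position    CV accuracy:",
      cross_val_score(knn, rssi, position, cv=5).mean())
print("orientation CV accuracy:",
      cross_val_score(knn, rssi, orientation, cv=5).mean())
```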

Relevance: 100.00%

Abstract:

In this paper, a new variant of Bagging named DepenBag is proposed. This algorithm first obtains bootstrap samples. Then, it employs a causal discoverer to induce from each sample a dependency model expressed as a Directed Acyclic Graph (DAG). The attributes without connections to the class attribute in all the DAGs are then removed. Finally, a component learner is trained from each of the resulting samples to constitute the ensemble. An empirical study shows that DepenBag is effective in building ensembles of nearest neighbor classifiers.
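
A simplified sketch of the DepenBag structure; a univariate relevance filter is used here as a stand-in for the causal discoverer and the DAG-based attribute removal, which are beyond a short example:

```python
# Simplified sketch of the DepenBag idea: per-bootstrap feature removal
# followed by a nearest-neighbour component learner and majority voting.
# A univariate relevance filter stands in for the causal discoverer / DAG.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
votes = np.zeros((len(yte), 2))
for _ in range(25):                                  # ensemble size
    idx = rng.integers(0, len(ytr), size=len(ytr))   # bootstrap sample
    member = make_pipeline(SelectKBest(f_classif, k=10),
                           KNeighborsClassifier(n_neighbors=3))
    member.fit(Xtr[idx], ytr[idx])
    votes[np.arange(len(yte)), member.predict(Xte)] += 1

print("ensemble accuracy:", (votes.argmax(axis=1) == yte).mean())
```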

Relevance: 100.00%

Abstract:

Spam, or unwanted email, is one of the key issues in Internet security, and correctly separating legitimate user emails from spam is an important research problem for anti-spam researchers. In this paper we present an effective and efficient spam classification technique that uses a clustering approach to categorize the features. In our clustering technique we apply VAT (Visual Assessment of cluster Tendency) within our training model to categorize the extracted features and then pass the information to the classification engine. We have used the WEKA (www.cs.waikato.ac.nz/ml/weka/) interface to classify the data using different classification algorithms, including tree-based classifiers, nearest neighbor algorithms, statistical algorithms and AdaBoost. Our empirical evaluation shows that we can achieve a detection rate of over 97%.
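
The VAT reordering itself is compact enough to sketch; the feature grouping and the WEKA classification stages are not reproduced here, and the toy dissimilarities are illustrative only:

```python
# A compact VAT-style reordering: the dissimilarity matrix is reordered
# along a Prim-style minimum-spanning-tree visit order so that cluster
# structure appears as dark blocks when the matrix is displayed.
import numpy as np

def vat_order(D):
    n = len(D)
    order = [int(np.argmax(D.max(axis=1)))]   # endpoint of the largest dissimilarity
    rest = set(range(n)) - set(order)
    while rest:
        cols = sorted(rest)
        sub = D[np.ix_(order, cols)]
        j = cols[int(np.argmin(sub.min(axis=0)))]  # closest outsider to the ordered set
        order.append(j)
        rest.remove(j)
    return order

# Toy dissimilarities between six features forming two interleaved groups.
rng = np.random.default_rng(0)
group = np.array([0, 1, 0, 1, 0, 1])
F = rng.normal(size=(6, 50)) + 3.0 * group[:, None]
D = np.abs(F[:, None, :] - F[None, :, :]).mean(axis=2)

idx = vat_order(D)
print(idx)                                 # same-group features end up adjacent
print(np.round(D[np.ix_(idx, idx)], 2))    # reordered matrix for visual assessment
```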

Relevance: 100.00%

Abstract:

Learning a robust projection with a small number of training samples is still a challenging problem in face recognition, especially when the unseen faces show extreme variation in pose, illumination, and facial expression. To address this problem, we propose a framework, formulated under statistical learning theory, that facilitates robust learning of a discriminative projection. Dimensionality reduction using the projection matrix is combined with a linear classifier in the regularized framework of lasso regression. The projection matrix and the classifier parameters are then found by solving an optimization problem over the Stiefel manifold. Experimental results on standard face databases suggest that the proposed method outperforms some recent regularized techniques when the number of training samples is small.
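
One plausible reading of the formulation, written out as a sketch (the notation is assumed, not taken from the paper): a squared-loss linear classifier with an l1 (lasso) penalty, jointly optimized with a projection whose columns are constrained to the Stiefel manifold:

```latex
% Assumed notation: x_i training samples, y_i labels, W the projection with
% orthonormal columns, (theta, b) the linear classifier parameters.
\min_{W \in \mathrm{St}(d,p),\; \theta,\, b}\;
  \sum_{i=1}^{n} \bigl( y_i - \theta^{\top} W^{\top} x_i - b \bigr)^{2}
  + \lambda \lVert \theta \rVert_{1},
\qquad
\mathrm{St}(d,p) = \{\, W \in \mathbb{R}^{d \times p} : W^{\top} W = I_p \,\}.
```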

Relevance: 100.00%

Abstract:

In this paper, we investigate the use of a wavelet transform-based analysis of the audio tracks accompanying videos for the problem of automatic program genre detection. We compare the classification performance of wavelet-based audio features with that of conventional features derived from Fourier and time-domain analysis for the task of discriminating TV programs such as news, commercials, music shows, concerts, motor racing games, and animated cartoons. Three different classifiers, namely Decision Trees, SVMs, and k-Nearest Neighbours, are studied to analyse the reliability of our wavelet-feature-based approach. Further, we investigate the issue of an appropriate duration of the audio clip to be analysed for this automatic genre determination. Our experimental results show that features derived from the wavelet transform of the audio signal separate the six video genres studied very well. It is also found that there is no significant difference in performance across the classifiers as the audio clip duration varies.
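
A small sketch of one common form of wavelet-based audio feature, per-subband energies from a multi-level DWT, computed on synthetic clips (audio decoding and the full study are omitted; the feature definition is an assumption, not taken from the paper):

```python
# Sketch of wavelet-based audio features on synthetic clips: per-subband
# energies from a multi-level DWT serve as the feature vector for a
# conventional classifier.
import numpy as np
import pywt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wavelet_energy_features(clip, wavelet="db4", level=5):
    coeffs = pywt.wavedec(clip, wavelet, level=level)
    return np.array([np.sum(c ** 2) / len(c) for c in coeffs])

rng = np.random.default_rng(0)
n, sr = 120, 8000
labels = rng.integers(0, 3, size=n)              # toy stand-ins for genres
clips = rng.normal(size=(n, sr))
clips[labels == 1] *= 2.0                        # crude spectral differences
clips[labels == 2] = np.cumsum(clips[labels == 2], axis=1) * 0.05

X = np.array([wavelet_energy_features(c) for c in clips])
print("CV accuracy:",
      cross_val_score(KNeighborsClassifier(), X, labels, cv=5).mean())
```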

Relevance: 100.00%

Abstract:

Protein mass spectrometry (MS) pattern recognition has recently emerged as a new method for cancer diagnosis. Unfortunately, classification performance may degrade owing to the enormously high dimensionality of the data. This paper investigates the use of Random Projection (RP) for dimensionality reduction of protein MS data. The effectiveness of RP is analysed and compared against Principal Component Analysis (PCA) using three classification algorithms, namely Support Vector Machine, Feed-forward Neural Networks and k-Nearest Neighbour. Three real-world cancer data sets are employed to evaluate the performance of RP and PCA. Through these investigations, the RP method demonstrated classification performance better than, or at least comparable to, that of PCA when the dimensionality of the projection matrix is sufficiently large. This paper also explores the use of RP as a pre-processing step prior to PCA. The results show that, without sacrificing classification accuracy, performing RP prior to PCA significantly improves the computational time.
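
A compact sketch of the RP-versus-PCA comparison on a synthetic high-dimensional stand-in for the protein MS data (the neural network classifier is omitted for brevity):

```python
# Sketch comparing Gaussian random projection and PCA as pre-processing
# for a high-dimensional classification task, with SVM and kNN downstream.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2000, n_informative=30,
                           random_state=0)
for dim in (20, 100):
    reducers = [("RP ", GaussianRandomProjection(n_components=dim, random_state=0)),
                ("PCA", PCA(n_components=dim))]
    for red_name, reducer in reducers:
        for clf_name, clf in [("SVM", SVC()), ("kNN", KNeighborsClassifier())]:
            acc = cross_val_score(make_pipeline(reducer, clf), X, y, cv=3).mean()
            print(f"{red_name} dim={dim:3d} {clf_name}: {acc:.3f}")
```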

Relevance: 100.00%

Abstract:

Traffic classification has wide applications in network management, from security monitoring to quality of service measurement. Recent research tends to apply machine learning techniques to flow statistical feature based classification methods. The nearest neighbor (NN)-based method has exhibited superior classification performance. It also has several important advantages, such as no requirement for a training procedure, no risk of overfitting parameters, and a natural ability to handle a huge number of classes. However, the performance of the NN classifier can be severely affected if the size of the training data is small. In this paper, we propose a novel nonparametric approach for traffic classification, which can improve classification performance effectively by incorporating correlated information into the classification process. We analyse the new classification approach and its performance benefit from both theoretical and empirical perspectives. A large number of experiments are carried out on two real-world traffic data sets to validate the proposed approach. The results show that the traffic classification performance can be improved significantly even under the extremely difficult circumstance of very few training samples.
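
A much-simplified illustration of using correlated information: flows sharing a (made-up) correlation key are classified jointly by majority vote over their individual nearest-neighbour labels. This conveys only the spirit of correlation-aware classification, not the proposed method itself:

```python
# Simplified illustration: correlated flows (same key) receive a joint
# label obtained by majority vote over their 1-NN predictions.
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
Xtr = rng.normal(size=(60, 8))               # tiny training set of flow features
ytr = rng.integers(0, 3, size=60)
Xte = rng.normal(size=(90, 8))
corr_key = rng.integers(0, 30, size=90)      # flows with equal key are correlated

nn = KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr)
flow_pred = nn.predict(Xte)

bag_pred = flow_pred.copy()
for key in np.unique(corr_key):
    members = np.where(corr_key == key)[0]
    majority = Counter(flow_pred[members]).most_common(1)[0][0]
    bag_pred[members] = majority             # every flow in the bag gets the vote
print(bag_pred[:10])
```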

Relevance: 100.00%

Abstract:

In the context of collaborative filtering (CF), the well-known data sparsity issue can leave two like-minded users with little measurable similarity, and consequently renders the k nearest neighbour rule inapplicable. In this paper, we address the data sparsity problem in neighbourhood-based CF methods by proposing an Adaptive-Maximum imputation method (AdaM). The basic idea is to identify an imputation area that maximizes the imputation benefit for recommendation purposes, while minimizing the imputation error brought in. To achieve the maximum imputation benefit, the imputation area is determined from both the user and the item perspectives; to minimize the imputation error, at least one real rating is preserved for each item in the identified imputation area. A theoretical analysis is provided to prove that the proposed imputation method outperforms the conventional neighbourhood-based CF methods through more accurate neighbour identification. Experimental results on benchmark datasets show that the proposed method significantly outperforms other related state-of-the-art imputation-based methods in terms of accuracy.
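
For context, a baseline sketch of the setting AdaM operates in: plain user-based k-nearest-neighbour CF on a sparse rating matrix. The imputation-area selection itself is not reproduced, and the rating matrix is a toy example:

```python
# Baseline sketch only (not AdaM itself): user-based kNN collaborative
# filtering on a sparse rating matrix, where 0 denotes a missing rating.
import numpy as np

R = np.array([[5, 4, 0, 0, 1],
              [4, 0, 0, 2, 1],
              [1, 1, 0, 5, 4],
              [0, 1, 5, 4, 0],
              [5, 4, 1, 0, 0]], dtype=float)

def predict(R, user, item, k=2):
    mask = (R[:, item] > 0) & (np.arange(len(R)) != user)
    cands = np.where(mask)[0]
    sims = []
    for v in cands:
        both = (R[user] > 0) & (R[v] > 0)        # co-rated items only
        a, b = R[user, both], R[v, both]
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    order = np.argsort(sims)[::-1][:k]           # k most similar raters
    top, w = cands[order], np.array(sims)[order]
    return (w @ R[top, item]) / (w.sum() + 1e-12)

print(predict(R, user=0, item=2))   # predicted rating for a missing entry
```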

Relevance: 100.00%

Abstract:

Computational efficiency, and hence the scale of agent-based swarm simulations, is bounded by the nearest neighbour computation for each agent. This article proposes the use of GPU texture memory to implement lookup tables for a spatial partitioning based k-Nearest Neighbours algorithm. These improvements allow simulation of swarms of 2^20 agents at higher rates than the current best alternative algorithms. This approach is incorporated into an existing framework for simulating steering behaviours, allowing for a complete implementation of massive agent swarm simulations, with per-agent behaviour preferences, on a Graphics Processing Unit. These simulations have enabled an investigation of the emergent dynamics that occur when massive swarms interact with a choke point in their environment. Various modes of sustained dynamics with temporal and spatial coherence are identified when a critical mass of agents is simulated, and some elementary properties are presented. The algorithms presented in this article enable researchers and content designers in games and movies to implement truly massive agent swarms in real time, and thus provide a basis for further identification and analysis of the emergent dynamics in these swarms. This will improve not only the scale of swarms used in commercial games and movies but also the reliability of swarm behaviour with respect to content design goals.
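
A CPU-side sketch of the underlying spatial partitioning idea: a uniform grid restricts each agent's neighbour search to its own and adjacent cells. The GPU texture-memory lookup tables are not modelled, and the search is approximate if true neighbours lie beyond the adjacent cells:

```python
# Uniform-grid (spatial partitioning) k-nearest-neighbour search: each
# agent only inspects candidates in its own cell and the 8 adjacent cells.
import numpy as np
from collections import defaultdict

def grid_knn(pos, k=5, cell=0.1):
    cells = defaultdict(list)
    keys = np.floor(pos / cell).astype(int)
    for i, key in enumerate(map(tuple, keys)):
        cells[key].append(i)

    neighbours = []
    for i, (cx, cy) in enumerate(keys):
        cand = [j for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                for j in cells[(cx + dx, cy + dy)] if j != i]
        d = np.linalg.norm(pos[cand] - pos[i], axis=1)
        neighbours.append([cand[j] for j in np.argsort(d)[:k]])
    return neighbours

agents = np.random.default_rng(0).random((5000, 2))
nbrs = grid_knn(agents)
print(nbrs[0])   # indices of the first agent's (up to) 5 nearest neighbours
```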

Relevance: 100.00%

Abstract:

Texture classification is one of the most important tasks in the computer vision field and has been extensively investigated over the last several decades. Previous texture classification methods mainly used template-matching-based methods such as Support Vector Machine and k-Nearest-Neighbour for classification. Given enough training images, the state-of-the-art texture classification methods can achieve very high classification accuracies on some benchmark databases. However, when the number of training images is limited, which usually happens in real-world applications because of the high cost of obtaining labelled data, the classification accuracies of those state-of-the-art methods deteriorate due to overfitting. In this paper we aim to develop a novel framework that can correctly classify textural images with only a small number of training images. By taking into account the repetition and sparsity properties of textures, we propose a sparse-representation-based multi-manifold analysis framework for texture classification from few training images. A set of new training samples is generated from each training image by a scale and spatial pyramid, and the training samples belonging to each class are then modelled by a manifold based on sparse representation. We learn a sparse-representation dictionary and a projection matrix for each class and classify the test images based on the projected reconstruction errors. The framework provides a more compact model than the template-matching-based texture classification methods and mitigates the overfitting effect. Experimental results show that the proposed method achieves reasonably high generalization capability even with as few as 3 training images, and significantly outperforms the state-of-the-art texture classification approaches on three benchmark datasets.
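
A reduced sketch of the classification rule, per-class dictionaries and minimum reconstruction error, on toy data; the scale/spatial pyramid and the learned projection matrices from the paper are omitted:

```python
# Reduced sketch: learn a small dictionary per class, sparse-code a test
# sample against each dictionary, and assign the class with the lowest
# reconstruction error.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
dim, n_per_class = 64, 80
# Toy "texture" patches: each class mixes a different random basis.
bases = [rng.normal(size=(8, dim)) for _ in range(3)]
def sample(c, n):
    return rng.normal(size=(n, 8)) @ bases[c] + 0.05 * rng.normal(size=(n, dim))

dicts = []
for c in range(3):
    dl = MiniBatchDictionaryLearning(n_components=16, alpha=1.0, random_state=0)
    dicts.append(dl.fit(sample(c, n_per_class)).components_)

def classify(x):
    errs = []
    for D in dicts:
        code = sparse_encode(x.reshape(1, -1), D, alpha=1.0)
        errs.append(np.linalg.norm(x - code @ D))
    return int(np.argmin(errs))

test = sample(1, 1)[0]                      # a sample drawn from class 1
print("predicted class:", classify(test))
```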

Relevance: 100.00%

Abstract:

With the arrival of the big data era, Internet traffic is growing exponentially. A wide variety of applications run on the Internet, and traffic classification has been introduced to help manage these applications for security monitoring and quality-of-service purposes. A large number of Machine Learning (ML) algorithms have been introduced to deal with traffic classification. A significant challenge to classification performance comes from the imbalanced distribution of data in traffic classification systems. In this paper, we propose an Optimised Distance-based Nearest Neighbor (ODNN) classifier, which can improve the classification performance on imbalanced traffic data. We analyse the proposed ODNN approach and its performance benefit from both theoretical and empirical perspectives. A large number of experiments were carried out on a real-world traffic dataset. The results show that the performance on "small classes" can be improved significantly even with only a small amount of training data, while the performance on "large classes" remains stable.
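
A generic illustration of the imbalance problem itself, not the ODNN method: per-class recall of plain versus distance-weighted kNN on an imbalanced synthetic dataset, where the recall of the small classes is the quantity that tends to suffer:

```python
# Generic illustration only (not ODNN): on imbalanced synthetic "flow"
# features, compare plain and distance-weighted kNN by per-class recall.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.8, 0.15, 0.05],
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

for weights in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=7, weights=weights).fit(Xtr, ytr)
    rec = recall_score(yte, knn.predict(Xte), average=None)
    print(weights, "per-class recall:", np.round(rec, 3))
```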