919 results for Naive Bayes classifier
Abstract:
N-gram analysis is an approach that investigates the structure of a program using bytes, characters or text strings. This research uses dynamic analysis to investigate malware detection using a classification approach based on N-gram analysis. A key issue with dynamic analysis is the length of time a program has to be run to ensure a correct classification. The motivation for this research is to find the optimum subset of operational codes (opcodes) that make the best indicators of malware and to determine how long a program has to be monitored to ensure an accurate support vector machine (SVM) classification of benign and malicious software. The experiments within this study represent programs as opcode density histograms gained through dynamic analysis for different program run periods. An SVM is used as the program classifier to determine the ability of different program run lengths to correctly determine the presence of malicious software. The findings show that malware can be detected with different program run lengths using a small number of opcodes.
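As a hedged illustration of the representation described above (the opcode vocabulary, traces and labels below are invented, not the study's data), each traced program can be reduced to a density histogram over a fixed opcode set and fed to an SVM:

```python
from collections import Counter
import numpy as np
from sklearn.svm import SVC

# Toy opcode vocabulary; the study searches for the optimal subset.
VOCAB = ["mov", "push", "pop", "call", "jmp", "add", "cmp", "ret"]

def opcode_density(trace):
    """Turn a dynamic opcode trace into a density histogram over VOCAB."""
    counts = Counter(trace)
    total = max(len(trace), 1)
    return np.array([counts[op] / total for op in VOCAB])

# Invented example traces: one 'benign', one 'malicious'.
traces = [["mov", "push", "call", "mov", "ret"],
          ["jmp", "jmp", "cmp", "jmp", "call"]]
labels = [0, 1]

X = np.vstack([opcode_density(t) for t in traces])
clf = SVC(kernel="rbf").fit(X, labels)  # SVM as the program classifier
```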
Abstract:
N-gram analysis is an approach that investigates the structure of a program using bytes, characters or text strings. This research uses dynamic analysis to investigate malware detection using a classification approach based on N-gram analysis. The motivation for this research is to find a subset of N-gram features that makes a robust indicator of malware. The experiments within this paper represent programs as N-gram density histograms, gained through dynamic analysis. A Support Vector Machine (SVM) is used as the program classifier to determine the ability of N-grams to correctly determine the presence of malicious software. The preliminary findings show that N-gram sizes N=3 and N=4 present the best avenues for further analysis.
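A minimal sketch of the N-gram density features for the N=3 case (the example trace and the normalisation details are assumptions, not taken from the paper):

```python
from collections import Counter
import numpy as np

def ngram_density(trace, n=3):
    """Count sliding n-gram windows over a trace and normalise to densities."""
    grams = [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]
    counts = Counter(grams)
    total = max(len(grams), 1)
    vocab = sorted(counts)  # in practice, a fixed pre-selected vocabulary
    return vocab, np.array([counts[g] / total for g in vocab])

# Character-level example; the paper's traces come from dynamic analysis.
vocab, hist = ngram_density(list("pushmovcallmovret"), n=3)
```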
Abstract:
Efficient identification and follow-up of astronomical transients is hindered by the need for humans to manually select promising candidates from data streams that contain many false positives. These artefacts arise in the difference images that are produced by most major ground-based time-domain surveys with large format CCD cameras. This dependence on humans to reject bogus detections is unsustainable for next generation all-sky surveys and significant effort is now being invested to solve the problem computationally. In this paper, we explore a simple machine learning approach to real-bogus classification by constructing a training set from the image data of approximately 32 000 real astrophysical transients and bogus detections from the Pan-STARRS1 Medium Deep Survey. We derive our feature representation from the pixel intensity values of a 20 × 20 pixel stamp around the centre of the candidates. This differs from previous work in that it works directly on the pixels rather than catalogued domain knowledge for feature design or selection. Three machine learning algorithms are trained (artificial neural networks, support vector machines and random forests) and their performances are tested on a held-out subset of 25 per cent of the training data. We find the best results from the random forest classifier and demonstrate that by accepting a false positive rate of 1 per cent, the classifier initially suggests a missed detection rate of around 10 per cent. However, we also find that a combination of bright star variability, nuclear transients and uncertainty in human labelling means that our best estimate of the missed detection rate is approximately 6 per cent.
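A sketch of the pixel-based setup on synthetic stand-in data (the real study uses the labelled Pan-STARRS1 stamps): each 20 × 20 difference-image stamp is flattened into 400 intensity features, a random forest is trained, and 25 per cent is held out:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
stamps = rng.normal(size=(2000, 20, 20))   # placeholder for difference-image stamps
labels = rng.integers(0, 2, size=2000)     # 1 = real transient, 0 = bogus
X = stamps.reshape(len(stamps), -1)        # 400 raw pixel features per candidate

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                          random_state=0)   # 25% held out
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
scores = rf.predict_proba(X_te)[:, 1]      # thresholding these scores trades
                                           # false positives against missed detections
```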
Abstract:
People usually perform economic interactions within the social setting of a small group, while they obtain relevant information from a broader source. We capture this feature with a dynamic interaction model based on two separate social networks. Individuals play a coordination game in an interaction network, while updating their strategies using information from a separate influence network through which information is disseminated. In each time period, the interaction and influence networks co-evolve, and the individuals’ strategies are updated through a modified naive learning process. We show that both network structures and players’ strategies always reach a steady state, in which players form fully connected groups and converge to local conventions. We also analyze the influence exerted by a minority group of strongly opinionated players on these outcomes.
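A minimal numeric sketch of plain DeGroot-style naive learning, which the paper's modified process builds on (the weights, initial opinions and threshold rule below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.random((n, n))
W /= W.sum(axis=1, keepdims=True)  # row-stochastic influence-network weights
x = rng.random(n)                  # initial opinions in [0, 1]

for _ in range(100):
    x = W @ x                      # naive updating: average neighbours' opinions

actions = (x > 0.5).astype(int)    # threshold response in the coordination game
print(x.round(3), actions)
```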
Abstract:
Masked implementations of cryptographic algorithms are often used in commercial embedded cryptographic devices to increase their resistance to side channel attacks. In this work we show how neural networks can be used to both identify the mask value, and to subsequently identify the secret key value with a single attack trace with high probability. We propose the use of a pre-processing step using principal component analysis (PCA) to significantly increase the success of the attack. We have developed a classifier that can correctly identify the mask for each trace, hence removing the security provided by that mask and reducing the attack to being equivalent to an attack against an unprotected implementation. The attack is performed on the freely available differential power analysis (DPA) contest data set to allow our work to be easily reproducible. We show that neural networks allow for a robust and efficient classification in the context of side-channel attacks.
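A sketch of the pipeline's shape on synthetic traces (trace length, number of components and mask alphabet are assumptions): PCA compresses each power trace before a neural network predicts the mask value.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
traces = rng.normal(size=(5000, 1000))  # placeholder power traces
masks = rng.integers(0, 16, size=5000)  # assumed 16 possible mask values

clf = make_pipeline(PCA(n_components=50),               # pre-processing step
                    MLPClassifier(hidden_layer_sizes=(100,), max_iter=300))
clf.fit(traces, masks)
predicted_mask = clf.predict(traces[:1])  # once the mask is recovered, the attack
                                          # reduces to the unprotected case
```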
Abstract:
This research presents a fast algorithm for projected support vector machines (PSVM): a basis vector set (BVS) is selected for the kernel-induced feature space, and the training points are projected onto the subspace spanned by the selected BVS. A standard linear support vector machine (SVM) is then produced in the subspace with the projected training points. As the dimension of the subspace is determined by the size of the selected basis vector set, the size of the produced SVM expansion can be specified. A two-stage algorithm is derived which selects and refines the basis vector set, achieving a locally optimal model. The model expansion coefficients and bias are updated recursively as the basis set and support vector set grow and shrink. The condition for a point to lie outside the span of the current basis vector set, and hence be selected as a new basis vector, is derived and embedded in the recursive procedure. This guarantees the linear independence of the produced basis set. The proposed algorithm is tested and compared with an existing sparse primal SVM (SpSVM) and a standard SVM (LibSVM) on seven public benchmark classification problems. Our new algorithm is designed for the application area of human activity recognition using smart devices and embedded sensors, where sometimes limited memory and processing resources must be exploited to the full and where more robust and accurate classification means a more satisfied user. Experimental results demonstrate the effectiveness and efficiency of the proposed algorithm. This work builds upon a previously published algorithm specifically created for activity recognition within mobile applications for the EU Haptimap project [1]. The algorithms detailed in this paper are more memory and resource efficient, making them suitable for use with bigger data sets and more easily trained SVMs.
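A rough sketch of the projection idea under assumptions (a greedy residual-based basis selection stands in for the paper's two-stage selection-and-refinement algorithm): points whose kernel-space component orthogonal to the current basis exceeds a tolerance join the basis, all points are projected onto the basis subspace, and a linear SVM is trained there.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import LinearSVC

def select_basis(X, gamma=0.5, tol=1e-3, max_basis=50):
    """Greedily keep points far (in feature space) from the span of the basis."""
    idx = [0]
    for i in range(1, len(X)):
        K_bb = rbf_kernel(X[idx], X[idx], gamma=gamma)
        k_ib = rbf_kernel(X[i:i + 1], X[idx], gamma=gamma)
        # squared residual of phi(x_i) after projection onto span(basis)
        resid = rbf_kernel(X[i:i + 1], X[i:i + 1], gamma=gamma) \
                - k_ib @ np.linalg.solve(K_bb, k_ib.T)
        if resid.item() > tol and len(idx) < max_basis:
            idx.append(i)     # residual test keeps the basis linearly independent
    return idx

def project(X, X_basis, gamma=0.5):
    """Coordinates whose inner products equal kernel inner products on the subspace."""
    K_bb = rbf_kernel(X_basis, X_basis, gamma=gamma)
    L = np.linalg.cholesky(K_bb + 1e-10 * np.eye(len(X_basis)))
    return np.linalg.solve(L, rbf_kernel(X_basis, X, gamma=gamma)).T

X = np.random.randn(200, 6)
y = (X[:, 0] * X[:, 1] > 0).astype(int)
basis = select_basis(X)
clf = LinearSVC(max_iter=5000).fit(project(X, X[basis]), y)  # standard linear SVM
```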
Abstract:
A practically viable multi-biometric recognition system should not only be stable, robust and accurate but should also adhere to real-time processing speed and memory constraints. This study proposes a cascaded classifier-based framework for use in biometric recognition systems. The proposed framework utilises a set of weak classifiers to reduce the enrolled users' dataset to a small list of candidate users. This list is then used by a strong classifier set, as the final stage of the cascade, to formulate the decision. At each stage, the candidate list is generated by a Mahalanobis distance-based match score quality measure. One of the key features of the authors' framework is that each classifier in the ensemble can be designed to use a different modality, thus providing the advantages of a truly multimodal biometric recognition system. In addition, it is one of the first truly multimodal cascaded classifier-based approaches for biometric recognition. The performance of the proposed system is evaluated for both single and multiple modalities to demonstrate the effectiveness of the approach.
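An illustrative two-stage cascade in the spirit of the framework (all names and the top-k shortlist rule are hypothetical; the strong classifier is assumed pre-trained with one class per enrolled user):

```python
import numpy as np

def mahalanobis_scores(probe, gallery_means, cov_inv):
    """One Mahalanobis match score per enrolled user."""
    d = gallery_means - probe
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

def cascade_identify(probe, gallery_means, cov_inv, strong_clf, shortlist=5):
    scores = mahalanobis_scores(probe, gallery_means, cov_inv)
    candidates = np.argsort(scores)[:shortlist]       # weak stage: shortlist users
    probs = strong_clf.predict_proba(probe[None, :])[0]
    return candidates[np.argmax(probs[candidates])]   # strong stage: final decision
```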
Abstract:
An outlier removal based data cleaning technique is proposed to clean manually pre-segmented human skin data in colour images. The 3-dimensional colour data is projected onto three 2-dimensional planes, from which outliers are removed. The cleaned 2-dimensional data projections are merged to yield clean 3D RGB data. This data is finally used to build a look-up table and a single Gaussian classifier for the purpose of human skin detection in colour images.
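A minimal sketch of the cleaning pipeline under assumptions (the abstract does not specify the outlier rule; a per-plane z-score test stands in for it):

```python
import numpy as np

def clean_skin_pixels(rgb, z_thresh=2.5):
    """Drop pixels that are outliers in any of the RG, GB or RB projections."""
    keep = np.ones(len(rgb), dtype=bool)
    for i, j in [(0, 1), (1, 2), (0, 2)]:       # the three 2-D colour planes
        plane = rgb[:, [i, j]].astype(float)
        z = np.abs(plane - plane.mean(0)) / plane.std(0)
        keep &= (z < z_thresh).all(axis=1)
    return rgb[keep]

def fit_gaussian(rgb):
    """Single Gaussian skin model; classify pixels by Mahalanobis distance to it."""
    return rgb.mean(axis=0), np.cov(rgb.astype(float).T)
```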
Abstract:
An RkNN query returns all objects whose k nearest neighbors contain the query object. In this paper, we consider RkNN query processing in the case where the distances between attribute values are not necessarily metric. Dissimilarities between objects could then be a monotonic aggregate of dissimilarities between their values, such aggregation functions being specified at query time. We outline real-world cases that motivate RkNN processing in such scenarios. We consider the AL-Tree index and its applicability in RkNN query processing. We develop an approach that exploits the group-level reasoning enabled by the AL-Tree in RkNN processing. We evaluate our approach against a Naive approach that performs sequential scans on contiguous data, and against an improved block-based approach that we provide. We use real-world datasets and synthetic data with varying characteristics for our experiments. This extensive empirical evaluation shows that our approach is better than existing methods in terms of computational and disk access costs, leading to significantly better response times.
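For reference, the Naive sequential-scan baseline is straightforward to state (the data, query and aggregate dissimilarity below are illustrative): an object is in the RkNN answer exactly when the query is no farther than that object's k-th nearest neighbour.

```python
import numpy as np

def rknn_naive(data, query, k, dissim):
    """Sequential-scan RkNN: keep objects whose k nearest neighbours include the query."""
    dq = np.array([dissim(o, query) for o in data])
    answer = []
    for i, o in enumerate(data):
        d = np.array([dissim(o, p) for p in data])
        d[i] = np.inf                           # exclude the object itself
        kth = np.partition(d, k - 1)[k - 1]     # distance to o's k-th neighbour
        if dq[i] <= kth:
            answer.append(i)
    return answer

# A monotonic aggregate of per-attribute dissimilarities, chosen at query time.
dissim = lambda a, b: np.abs(np.asarray(a) - np.asarray(b)).sum()
data = np.random.rand(100, 4)
print(rknn_naive(data, np.random.rand(4), k=3, dissim=dissim))
```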
Abstract:
One of the most popular techniques of generating classifier ensembles is known as stacking, which is based on a meta-learning approach. In this paper, we introduce an alternative method to stacking which is based on cluster analysis. Similar to stacking, instances from a validation set are initially classified by all base classifiers. The output of each classifier is subsequently considered as a new attribute of the instance. Following this, the validation set is divided into clusters according to the new attributes and a small subset of the original attributes of the instances. For each cluster, we find its centroid and calculate its class label. The collection of centroids is considered as a meta-classifier. Experimental results show that the new method outperformed all benchmark methods, namely Majority Voting, Stacking J48, Stacking LR, AdaBoost J48, and Random Forest, in 12 out of 22 data sets. The proposed method has two advantageous properties: it is very robust to relatively small training sets and it can be applied in semi-supervised learning problems. We provide a theoretical investigation regarding the proposed method. This demonstrates that for the method to be successful, the base classifiers applied in the ensemble should have accuracy levels greater than 50%.
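A sketch of the centroid meta-classifier on synthetic data (the base learners, cluster count and choice of retained original attributes are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
bases = [GaussianNB().fit(X_tr, y_tr),
         LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
         RandomForestClassifier(random_state=0).fit(X_tr, y_tr)]

meta = np.column_stack([b.predict(X_val) for b in bases])  # outputs as new attributes
Z = np.hstack([meta, X_val[:, :2]])      # plus a small subset of original attributes
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(Z)
centroid_label = np.array([np.bincount(y_val[km.labels_ == c]).argmax()
                           for c in range(8)])             # majority class per cluster

def predict(x):
    """Classify via the nearest centroid of the augmented representation."""
    z = np.hstack([[b.predict(x[None])[0] for b in bases], x[:2]]).astype(float)
    return centroid_label[km.predict(z[None])[0]]
```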
Abstract:
An algorithm for approximate credal network updating is presented. The problem in its general formulation is a multilinear optimization task, which can be linearized by an appropriate rule for fixing all the local models apart from those of a single variable. This simple idea can be iterated and quickly leads to very accurate inferences. The approach can also be specialized to classification with credal networks based on the maximality criterion. A complexity analysis for both the problem and the algorithm is reported together with numerical experiments, which confirm the good performance of the method. While the inner approximation produced by the algorithm gives rise to a classifier which might return a subset of the optimal class set, preliminary empirical results suggest that the accuracy of the optimal class set is seldom affected by the approximate probabilities.
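The maximality criterion itself is easy to state concretely. In this toy sketch (the credal set, given by invented extreme points, is an assumption), a class is dominated only if some other class has strictly higher probability under every extreme point:

```python
import numpy as np

# Invented extreme points of a credal set over three classes (rows sum to 1).
V = np.array([[0.50, 0.30, 0.20],
              [0.40, 0.40, 0.20],
              [0.45, 0.25, 0.30]])

def maximal_classes(V):
    """Classes not dominated under maximality: c' dominates c iff
    p(c') > p(c) at every extreme point of the credal set."""
    k = V.shape[1]
    return [c for c in range(k)
            if not any(np.all(V[:, c2] > V[:, c]) for c2 in range(k) if c2 != c)]

print(maximal_classes(V))  # [0, 1]: class 2 is dominated by class 0
```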
Abstract:
This paper implements momentum among a host of market anomalies. Our investment universe consists of the 15 top (long-leg) and 15 bottom (short-leg) anomaly portfolios. The proposed active strategy buys (sells short) a subset of the top (bottom) anomaly portfolios based on past one-month return. The evidence shows statistically strong and economically meaningful persistence in anomaly payoffs. Our strategy consistently outperforms a naive benchmark that equal weights anomalies and yields an abnormal monthly return ranging between 1.27% and 1.47%. The persistence is robust to the post-2000 period and various other considerations, and is stronger following episodes of high investor sentiment.
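A stylised sketch of the rotation rule on simulated anomaly returns (the return series, subset size, and equal weighting inside the legs are all assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = pd.date_range("2000-01-01", periods=120, freq="MS")
long_rets = pd.DataFrame(rng.normal(0.01, 0.04, (120, 15)), index=months)   # 15 long legs
short_rets = pd.DataFrame(rng.normal(0.00, 0.04, (120, 15)), index=months)  # 15 short legs

n_pick = 5                                                         # illustrative subset size
buy = long_rets.shift(1).rank(axis=1, ascending=False) <= n_pick   # past winners
sell = short_rets.shift(1).rank(axis=1) <= n_pick                  # past losers

strategy = long_rets[buy].mean(axis=1) - short_rets[sell].mean(axis=1)
naive = long_rets.mean(axis=1) - short_rets.mean(axis=1)    # equal-weight benchmark
print(strategy.mean(), naive.mean())
```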
Abstract:
Importance: The natural history of patients with newly diagnosed high-risk nonmetastatic (M0) prostate cancer receiving hormone therapy (HT) either alone or with standard-of-care radiotherapy (RT) is not well documented. Furthermore, no clinical trial has assessed the role of RT in patients with node-positive (N+) M0 disease. The STAMPEDE Trial includes such individuals, allowing an exploratory multivariate analysis of the impact of radical RT.
Objective: To describe survival and the impact on failure-free survival of RT by nodal involvement in these patients.
Design, Setting, and Participants: Cohort study using data collected for patients allocated to the control arm (standard-of-care only) of the STAMPEDE Trial between October 5, 2005, and May 1, 2014. Outcomes are presented as hazard ratios (HRs) with 95% CIs derived from adjusted Cox models; survival estimates are reported at 2 and 5 years. Participants were high-risk, hormone-naive patients with newly diagnosed M0 prostate cancer starting long-term HT for the first time. Radiotherapy is encouraged in this group, but mandated for patients with node-negative (N0) M0 disease only since November 2011.
Exposures: Long-term HT either alone or with RT, as per local standard. Planned RT use was recorded at entry.
Main Outcomes and Measures: Failure-free survival (FFS) and overall survival.
Results: A total of 721 men with newly diagnosed M0 disease were included: median age at entry, 66 (interquartile range [IQR], 61-72) years, median (IQR) prostate-specific antigen level of 43 (18-88) ng/mL. There were 40 deaths (31 owing to prostate cancer) with 17 months' median follow-up. Two-year survival was 96% (95% CI, 93%-97%) and 2-year FFS, 77% (95% CI, 73%-81%). Median (IQR) FFS was 63 (26 to not reached) months. Time to FFS was worse in patients with N+ disease (HR, 2.02 [95% CI, 1.46-2.81]) than in those with N0 disease. Failure-free survival outcomes favored planned use of RT for patients with both N0M0 (HR, 0.33 [95% CI, 0.18-0.61]) and N+M0 disease (HR, 0.48 [95% CI, 0.29-0.79]).
Conclusions and Relevance: Survival for men entering the cohort with high-risk M0 disease was higher than anticipated at study inception. These nonrandomized data were consistent with previous trials that support routine use of RT with HT in patients with N0M0 disease. Additionally, the data suggest that the benefits of RT extend to men with N+M0 disease.
Trial Registration: clinicaltrials.gov Identifier: NCT00268476; ISRCTN78818544.
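For readers wanting the shape of the adjusted analysis, here is a hedged sketch with lifelines (the column names and rows are invented, not trial data):

```python
import pandas as pd
from lifelines import CoxPHFitter

# Invented toy data: failure-free survival versus planned RT and nodal status.
df = pd.DataFrame({
    "ffs_months": [12, 40, 63, 7, 55, 30, 26, 48],
    "failed":     [1, 0, 0, 1, 0, 1, 1, 0],
    "planned_rt": [1, 1, 0, 0, 1, 0, 1, 0],
    "node_pos":   [0, 0, 1, 1, 0, 1, 1, 0],
})
cph = CoxPHFitter().fit(df, duration_col="ffs_months", event_col="failed")
cph.print_summary()  # hazard ratios (exp(coef)) with 95% CIs, as reported
```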
Abstract:
In this paper, a novel and effective lip-based biometric identification approach with the Discrete Hidden Markov Model Kernel (DHMMK) is developed. Lips are described by shape features (both geometrical and sequential) on two different grid layouts: rectangular and polar. These features are then specifically modeled by a DHMMK, and learnt by a support vector machine classifier. Our experiments are carried out in a ten-fold cross-validation fashion on three different datasets: the GPDS-ULPGC Face Dataset, the PIE Face Dataset and the RaFD Face Dataset. Results show that our approach has achieved an average classification accuracy of 99.8%, 97.13%, and 98.10%, using only two training images per class, on these three datasets, respectively. Our comparative studies further show that the DHMMK achieved a 53% improvement against the baseline HMM approach. The comparative ROC curves also confirm the efficacy of the proposed lip-contour-based biometrics learned by the DHMMK. We also show that the performance of linear and RBF SVMs is comparable under the framework of the DHMMK.
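As a simplified stand-in (this is not the DHMMK itself, and every parameter is a toy value): score each quantised lip-contour sequence under one discrete HMM per class and let an SVM classify the log-likelihood vectors.

```python
import numpy as np
from sklearn.svm import SVC

def forward_loglik(obs, pi, A, B):
    """Scaled forward algorithm for a discrete HMM (B is states x symbols)."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum(); logp = np.log(c); alpha = alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum(); logp += np.log(c); alpha = alpha / c
    return logp

rng = np.random.default_rng(0)
hmms = []                                  # one toy HMM per enrolled identity
for _ in range(3):
    A = rng.dirichlet(np.ones(4), size=4)  # 4 hidden states, row-stochastic
    B = rng.dirichlet(np.ones(8), size=4)  # 8 quantised contour symbols
    hmms.append((np.full(4, 0.25), A, B))

seqs = [rng.integers(0, 8, size=30) for _ in range(60)]
y = rng.integers(0, 3, size=60)            # placeholder identity labels
X = np.array([[forward_loglik(s, *h) for h in hmms] for s in seqs])
svm = SVC(kernel="rbf").fit(X, y)
```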
Abstract:
We present a new wrapper feature selection algorithm for human detection. This algorithm is a hybrid feature selection approach combining the benefits of filter and wrapper methods. It allows the selection of an optimal feature vector that well represents the shapes of the subjects in the images. In detail, the proposed feature selection algorithm adopts the k-fold subsampling and sequential backward elimination approach, while the standard linear support vector machine (SVM) is used as the classifier for human detection. We apply the proposed algorithm to the publicly accessible INRIA and ETH pedestrian full-image datasets with the PASCAL VOC evaluation criteria. Compared to other state-of-the-art algorithms, our feature selection based approach can improve the detection speed of the SVM classifier by over 50% with up to 2% better detection accuracy. Our algorithm also outperforms the equivalent systems introduced in the deformable part model approach with around 9% improvement in the detection accuracy.
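A sketch of the wrapper stage under assumptions (synthetic features stand in for the shape descriptors; the target size and fold count are illustrative): sequential backward elimination scored by k-fold cross-validated linear-SVM accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
features = list(range(X.shape[1]))
svm = LinearSVC(max_iter=5000)

while len(features) > 5:                          # illustrative target size
    scores = {f: cross_val_score(svm, X[:, [g for g in features if g != f]],
                                 y, cv=5).mean()
              for f in features}
    worst = max(scores, key=scores.get)           # feature whose removal scores best
    features.remove(worst)

print("selected features:", features)
```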