947 results for naive bayes classifier


Abstract:

Top Down Induction of Decision Trees (TDIDT) is the most commonly used method of constructing a model from a dataset in the form of classification rules to classify previously unseen data. Alternative algorithms have been developed, such as the Prism algorithm. Prism constructs modular rules that are qualitatively better than the rules induced by TDIDT. However, with the increasing size of databases, many existing rule learning algorithms have proved to be computationally expensive on large datasets. To tackle the problem of scalability, parallel classification rule induction algorithms have been introduced. As TDIDT is the most popular classifier, most parallel approaches to inducing classification rules are based on TDIDT, even though there are strongly competitive alternative algorithms. In this paper we describe work on a distributed classifier that induces classification rules in a parallel manner based on Prism.
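The abstract does not spell out how Prism builds its modular rules. The following is a minimal, sequential sketch of the basic covering idea usually attributed to Prism: for each class, grow one rule at a time by adding the attribute-value test with the highest probability of the target class, then remove the instances the rule covers. The function names, the tie-breaking order and the toy data are assumptions for illustration only; it is not the distributed classifier the paper describes.

```python
def matches(row, conditions):
    """True if the row satisfies every attribute-value test in the rule."""
    return all(row.get(attr) == value for attr, value in conditions.items())


def induce_rule(examples, target):
    """Grow one rule for `target` by greedily adding the best attribute-value test."""
    covered = examples
    conditions = {}
    # Keep specialising while the covered instances are not all of the target class.
    while any(label != target for _, label in covered):
        best = None
        for attr in sorted(covered[0][0]):
            if attr in conditions:
                continue
            for value in sorted({row[attr] for row, _ in covered}):
                labels = [label for row, label in covered if row[attr] == value]
                hits = sum(label == target for label in labels)
                # Rank by P(target | attr = value); break ties by coverage.
                score = (hits / len(labels), hits)
                if best is None or score > best[0]:
                    best = (score, attr, value)
        if best is None:  # no attributes left: accept an impure rule
            break
        _, attr, value = best
        conditions[attr] = value
        covered = [(row, label) for row, label in covered if row[attr] == value]
    return conditions


def prism(examples):
    """Return a list of (conditions, class) rules; each class restarts from the full data."""
    rules = []
    for target in sorted({label for _, label in examples}):
        remaining = list(examples)
        while any(label == target for _, label in remaining):
            conditions = induce_rule(remaining, target)
            rules.append((conditions, target))
            remaining = [(r, l) for r, l in remaining if not matches(r, conditions)]
    return rules


def classify(rules, row):
    """Return the class of the first matching rule, or None if no rule fires."""
    for conditions, label in rules:
        if matches(row, conditions):
            return label
    return None


# Hypothetical toy data, not taken from the paper.
TOY = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rain", "windy": "no"}, "play"),
    ({"outlook": "rain", "windy": "yes"}, "stay"),
]
RULES = prism(TOY)
```

Because each rule is induced independently of the others, an unseen instance may match no rule at all, in which case `classify` returns `None` instead of forcing a prediction.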

Abstract:

Collaborative mining of distributed data streams in a mobile computing environment is referred to as Pocket Data Mining (PDM). Hoeffding tree techniques have been experimentally and analytically validated for data stream classification. In this paper, we propose, develop and evaluate the adoption of distributed Hoeffding trees for classifying streaming data in PDM applications. We identify a realistic scenario in which different users equipped with smart mobile devices run a local Hoeffding tree classifier on a subset of the attributes. Thus, we investigate the mining of vertically partitioned datasets with possible overlap of attributes, which is the more likely case. Our experimental results validate the efficiency of the proposed model, which achieves promising accuracy for real deployment.
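For readers unfamiliar with Hoeffding trees, the sketch below shows the standard Hoeffding bound that such trees use to decide when enough stream instances have been seen to split a leaf. The function names and the example parameter values are assumptions for illustration; the abstract does not state which settings the authors used.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a variable
    with the given range lies within epsilon of the mean observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split a leaf once the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical values: gain range 1.0, confidence parameter 1e-7, 1000 instances seen.
EPSILON = hoeffding_bound(1.0, 1e-7, 1000)
```

The bound shrinks as more instances arrive, so a leaf that cannot be split confidently now may be split later without ever storing the stream.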

Abstract:

Advances in hardware and software in the past decade make it possible to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence of these advances in order to cope with the real-time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data from continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, some applications require the data to be analysed in real time as soon as it is captured, for example if the data stream is infinite, fast changing, or simply too large to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drifts. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm. An algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule-based adaptive classifier for data streams, based on an evolving set of Rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new rules and removing old ones. It differs from the more popular decision tree based classifiers in that it tends to leave data instances unclassified rather than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting, which will also address the problem of changing feature domain values.

Abstract:

Generally, classifiers tend to overfit if there is noise in the training data or there are missing values. Ensemble learning methods are often used to improve a classifier's classification accuracy. Most ensemble learning approaches aim to improve the classification accuracy of decision trees. However, alternative classifiers to decision trees exist. The recently developed Random Prism ensemble learner for classification aims to improve an alternative classification rule induction approach, the Prism family of algorithms, which addresses some of the limitations of decision trees. However, like any ensemble learner, Random Prism suffers from a high computational overhead due to replication of the data and the induction of multiple base classifiers. Hence even modest-sized datasets may impose a computational challenge on ensemble learners such as Random Prism. Parallelism is often used to scale up algorithms to deal with large datasets. This paper investigates parallelisation for Random Prism, implements a prototype and evaluates it empirically using a Hadoop computing cluster.

Abstract:

A two-stage linear-in-the-parameter model construction algorithm is proposed aimed at noisy two-class classification problems. The purpose of the first stage is to produce a prefiltered signal that is used as the desired output for the second stage which constructs a sparse linear-in-the-parameter classifier. The prefiltering stage is a two-level process aimed at maximizing a model's generalization capability, in which a new elastic-net model identification algorithm using singular value decomposition is employed at the lower level, and then, two regularization parameters are optimized using a particle-swarm-optimization algorithm at the upper level by minimizing the leave-one-out (LOO) misclassification rate. It is shown that the LOO misclassification rate based on the resultant prefiltered signal can be analytically computed without splitting the data set, and the associated computational cost is minimal due to orthogonality. The second stage of sparse classifier construction is based on orthogonal forward regression with the D-optimality algorithm. Extensive simulations of this approach for noisy data sets illustrate the competitiveness of this approach to classification of noisy data problems.

Abstract:

This paper explores the development of multi-feature classification techniques used to identify tremor-related characteristics in the Parkinsonian patient. Local field potentials were recorded from the subthalamic nucleus and the globus pallidus internus of eight Parkinsonian patients through the implanted electrodes of a deep brain stimulation (DBS) device prior to device internalization. A range of signal processing techniques were evaluated with respect to their tremor detection capability and used as inputs to a multi-feature neural network classifier to identify the activity of Parkinsonian tremor. The results of this study show that a trained multi-feature neural network is able, under certain conditions, to achieve excellent detection accuracy on patients unseen during training. Overall the tremor detection accuracy was mixed, although an accuracy of over 86% was achieved in four of the eight patients.

Abstract:

This letter presents an effective approach for selection of appropriate terrain modeling methods in forming a digital elevation model (DEM). This approach achieves a balance between modeling accuracy and modeling speed. A terrain complexity index is defined to represent a terrain's complexity. A support vector machine (SVM) classifies terrain surfaces into either complex or moderate based on this index associated with the terrain elevation range. The classification result recommends a terrain modeling method for a given data set in accordance with its required modeling accuracy. Sample terrain data from the lunar surface are used in constructing an experimental data set. The results have shown that the terrain complexity index properly reflects the terrain complexity, and the SVM classifier derived from both the terrain complexity index and the terrain elevation range is more effective and generic than that designed from either the terrain complexity index or the terrain elevation range only. The statistical results have shown that the average classification accuracy of SVMs is about 84.3% ± 0.9% for terrain types (complex or moderate). For various ratios of complex and moderate terrain types in a selected data set, the DEM modeling speed increases up to 19.5% with given DEM accuracy.

Abstract:

Ensemble learning can be used to increase the overall classification accuracy of a classifier by generating multiple base classifiers and combining their classification results. A frequently used family of base classifiers for ensemble learning are decision trees. However, alternative approaches can potentially be used, such as the Prism family of algorithms that also induces classification rules. Compared with decision trees, Prism algorithms generate modular classification rules that cannot necessarily be represented in the form of a decision tree. Prism algorithms produce a similar classification accuracy compared with decision trees. However, in some cases, for example, if there is noise in the training and test data, Prism algorithms can outperform decision trees by achieving a higher classification accuracy. However, Prism still tends to overfit on noisy data; hence, ensemble learners have been adopted in this work to reduce the overfitting. This paper describes the development of an ensemble learner using a member of the Prism family as the base classifier to reduce the overfitting of Prism algorithms on noisy datasets. The developed ensemble classifier is compared with a stand-alone Prism classifier in terms of classification accuracy and resistance to noise.
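The combination step mentioned in this abstract can be illustrated with a small sketch: each base classifier is trained on a bootstrap sample and the ensemble returns the most common vote. This is plain bagging with unweighted majority voting, used here as an assumption for illustration; the abstract does not say which sampling or voting scheme the authors' ensemble uses, and the function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(votes):
    """Return the most common non-None vote, or None if every base classifier abstained."""
    counted = Counter(v for v in votes if v is not None)
    if not counted:
        return None
    return counted.most_common(1)[0][0]


# Hypothetical usage: one bootstrap sample of ten instance indices.
SAMPLE = bootstrap_sample(list(range(10)), random.Random(0))
```

Ignoring `None` votes lets rule-based base classifiers abstain on instances they do not cover without dragging the ensemble decision.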

Abstract:

Recent research shows that speakers of languages with obligatory plural marking (English) preferentially categorize objects based on common shape, whereas speakers of nonplural-marking classifier languages (Yucatec and Japanese) preferentially categorize objects based on common material. The current study extends that investigation to the domain of bilingualism. Japanese and English monolinguals, and Japanese–English bilinguals were asked to match novel objects based on either common shape or color. Results showed that English monolinguals selected shape significantly more than Japanese monolinguals, whereas the bilinguals shifted their cognitive preferences as a function of their second language proficiency. The implications of these findings for conceptual representation and cognitive processing in bilinguals are discussed.

Abstract:

Numerical Weather Prediction (NWP) fields are used to assist the detection of cloud in satellite imagery. Simulated observations based on NWP are used within a framework based on Bayes' theorem to calculate a physically-based probability of each pixel within an imaged scene being clear or cloudy. Different thresholds can be set on the probabilities to create application-specific cloud masks. Here, this is done over both land and ocean using night-time (infrared) imagery. We use a validation dataset of difficult cloud detection targets for the Spinning Enhanced Visible and Infrared Imager (SEVIRI), achieving true skill scores of 87% and 48% for ocean and land, respectively, using the Bayesian technique, compared to 74% and 39%, respectively, for the threshold-based techniques associated with the validation dataset.
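The abstract reports true skill scores without defining them. The sketch below uses the common definition (hit rate minus false-alarm rate, also known as the Hanssen-Kuipers discriminant), which is assumed here to be the one intended. Treating "cloudy" as the positive class and the counts in the example are likewise assumptions, not figures from the paper.

```python
def true_skill_score(tp, fn, fp, tn):
    """True skill score from a 2x2 contingency table.

    tp: cloudy pixels flagged cloudy    fn: cloudy pixels flagged clear
    fp: clear pixels flagged cloudy     tn: clear pixels flagged clear
    """
    hit_rate = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)
    return hit_rate - false_alarm_rate


# Hypothetical counts for illustration only.
EXAMPLE_TSS = true_skill_score(tp=90, fn=10, fp=20, tn=80)
```

A score of 1 means perfect separation, 0 means no better than chance, and the measure is insensitive to the ratio of cloudy to clear pixels in the validation set.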

Abstract:

Numerical Weather Prediction (NWP) fields are used to assist the detection of cloud in satellite imagery. Simulated observations based on NWP are used within a framework based on Bayes' theorem to calculate a physically-based probability of each pixel within an imaged scene being clear or cloudy. Different thresholds can be set on the probabilities to create application-specific cloud masks. Here, the technique is shown to be suitable for daytime applications over land and sea, using visible and near-infrared imagery, in addition to thermal infrared. We use a validation dataset of difficult cloud detection targets for the Spinning Enhanced Visible and Infrared Imager (SEVIRI), achieving true skill scores of 89% and 73% for ocean and land, respectively, using the Bayesian technique, compared to 90% and 70%, respectively, for the threshold-based techniques associated with the validation dataset.

Abstract:

We propose and demonstrate a fully probabilistic (Bayesian) approach to the detection of cloudy pixels in thermal infrared (TIR) imagery observed from satellite over oceans. Using this approach, we show how to exploit the prior information and the fast forward modelling capability that are typically available in the operational context to obtain improved cloud detection. The probability of clear sky for each pixel is estimated by applying Bayes' theorem, and we describe how to apply Bayes' theorem to this problem in general terms. Joint probability density functions (PDFs) of the observations in the TIR channels are needed; the PDFs for clear conditions are calculable from forward modelling and those for cloudy conditions have been obtained empirically. Using analysis fields from numerical weather prediction as prior information, we apply the approach to imagery representative of imagers on polar-orbiting platforms. In comparison with the established cloud-screening scheme, the new technique decreases both the rate of failure to detect cloud contamination and the false-alarm rate by one quarter. The rate of occurrence of cloud-screening-related errors of >1 K in area-averaged SSTs is reduced by 83%. Copyright © 2005 Royal Meteorological Society.
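The per-pixel application of Bayes' theorem described in this abstract can be sketched for the two-class case as follows. The inputs are the prior probability of clear sky and the values of the clear and cloudy PDFs at the observed radiances. How those PDFs are obtained (forward modelling, empirical estimation) is outside this sketch, and the function names, the threshold default and the numbers are assumptions for illustration.

```python
def clear_sky_probability(prior_clear, pdf_clear, pdf_cloudy):
    """Posterior P(clear | observation) for a two-class (clear/cloudy) problem.

    prior_clear: prior probability that the pixel is clear
    pdf_clear:   P(observation | clear) evaluated at the observed values
    pdf_cloudy:  P(observation | cloudy) evaluated at the observed values
    """
    prior_cloudy = 1.0 - prior_clear
    evidence = pdf_clear * prior_clear + pdf_cloudy * prior_cloudy
    if evidence == 0.0:
        raise ValueError("observation has zero probability under both classes")
    return pdf_clear * prior_clear / evidence


def cloud_mask(clear_probabilities, threshold=0.9):
    """Flag a pixel as cloudy when its clear-sky probability falls below the threshold."""
    return [p < threshold for p in clear_probabilities]


# Hypothetical values: prior of 0.3 for clear sky, PDF values 0.6 (clear) and 0.1 (cloudy).
POSTERIOR = clear_sky_probability(0.3, 0.6, 0.1)
```

Keeping the posterior as a probability and thresholding it afterwards is what allows different applications to trade missed cloud against false alarms with the same underlying calculation.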

Abstract:

This paper forecasts daily sterling exchange rate returns using various naive, linear and non-linear univariate time-series models. The accuracy of the forecasts is evaluated using mean squared error and sign prediction criteria. These show only a very modest improvement over forecasts generated by a random walk model. The Pesaran–Timmermann test and a comparison with artificially generated forecasts show that even the best models show no evidence of market timing ability.

Abstract:

We present a Bayesian image classification scheme for discriminating cloud, clear and sea-ice observations at high latitudes to improve identification of areas of clear sky over ice-free ocean for SST retrieval. We validate the image classification against a manually classified dataset using Advanced Along Track Scanning Radiometer (AATSR) data. A three-way classification scheme using a near-infrared textural feature improves classifier accuracy by 9.9% over the nadir-only version of the cloud clearing used in the ATSR Reprocessing for Climate (ARC) project in high-latitude regions. The three-way classification gives similar numbers of cloud and ice scenes misclassified as clear, but significantly more clear-sky cases are correctly identified (89.9% compared with 65% for ARC). We also demonstrate the potential of a Bayesian image classifier including information from the 0.6 micron channel to be used in sea-ice extent and ice surface temperature retrieval, with 77.7% of ice scenes correctly identified and an overall classifier accuracy of 96%.

Abstract:

This contribution proposes a novel probability density function (PDF) estimation based over-sampling (PDFOS) approach for two-class imbalanced classification problems. The classical Parzen-window kernel function is adopted to estimate the PDF of the positive class. Then, according to the estimated PDF, synthetic instances are generated as additional training data. The essential concept is to re-balance the class distribution of the original imbalanced data set under the principle that the synthetic data samples follow the same statistical properties. Based on the over-sampled training data, the radial basis function (RBF) classifier is constructed by applying the orthogonal forward selection procedure, in which the classifier’s structure and the parameters of the RBF kernels are determined using a particle swarm optimisation algorithm based on the criterion of minimising the leave-one-out misclassification rate. The effectiveness of the proposed PDFOS approach is demonstrated by an empirical study on several imbalanced data sets.
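Drawing synthetic instances from a Parzen-window estimate has a simple form when the kernel is Gaussian: pick one of the existing positive-class samples at random and perturb it with kernel-shaped noise. The sketch below assumes an isotropic Gaussian kernel with a fixed, user-supplied bandwidth. The abstract does not specify the kernel shape or how the bandwidth is chosen, so both, along with the function name and the toy data, are assumptions for illustration.

```python
import random


def pdf_oversample(positives, n_new, bandwidth, seed=0):
    """Draw n_new synthetic points from a Gaussian Parzen-window estimate.

    Sampling from the kernel density estimate is equivalent to choosing a stored
    sample uniformly at random and adding zero-mean Gaussian noise whose standard
    deviation equals the kernel bandwidth.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        centre = rng.choice(positives)
        synthetic.append([x + rng.gauss(0.0, bandwidth) for x in centre])
    return synthetic


# Hypothetical minority-class samples in two dimensions.
POSITIVES = [[1.0, 2.0], [1.5, 1.8], [0.9, 2.2]]
SYNTHETIC = pdf_oversample(POSITIVES, n_new=5, bandwidth=0.1)
```

The synthetic points would then be appended to the positive class before the classifier is trained, so that the two classes are closer to balanced.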