44 results for machine learning algorithms
Abstract:
Recently there has been increasing interest in the development of new methods that use Pareto optimality to deal with multi-objective criteria (for example, accuracy and architectural complexity). Once a model has been learned with such a method, the problem is how to compare it with the state of the art. In machine learning, algorithms are typically evaluated by comparing their performance on different data sets by means of statistical tests. Unfortunately, the standard tests used for this purpose cannot jointly consider several performance measures. The aim of this paper is to resolve this issue by developing statistical procedures that account for multiple competing measures at the same time. In particular, we develop two tests: a frequentist procedure based on the generalized likelihood-ratio test and a Bayesian procedure based on a multinomial-Dirichlet conjugate model. We further extend both by discovering conditional independences among measures to reduce the number of parameters of the models, since the number of cases studied in such comparisons is usually small. Real data from a comparison among general-purpose classifiers is used to show a practical application of our tests.
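A minimal sketch of the Bayesian side of such a test, assuming two classifiers compared jointly on two measures: each data set falls into one of four joint outcomes, a Dirichlet prior is placed over the outcome probabilities, and posterior samples estimate the probability that one classifier dominates. The outcome encoding, the counts, and the uniform prior are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

# Joint outcomes for classifiers A vs. B on two measures:
# 0: A better on both, 1: A better on accuracy only,
# 2: A better on complexity only, 3: B better on both.
counts = np.array([11, 4, 3, 6])         # hypothetical counts over 24 data sets

prior = np.ones(4)                        # uniform Dirichlet prior (assumption)
rng = np.random.default_rng(0)
theta = rng.dirichlet(prior + counts, size=100_000)   # posterior samples

# Posterior probability that A's "better on both" outcome is the more
# probable one, i.e. that A dominates B jointly on both measures.
p_dominates = np.mean(theta[:, 0] > theta[:, 3])
print(f"P(A dominates B on both measures) ~ {p_dominates:.3f}")
```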
Abstract:
In this paper, we propose a malware categorization method that models malware behavior in terms of instructions using PageRank. PageRank computes the ranks of web pages from their link structure; applied to the instruction graphs used in malware analysis, it likewise computes ranks of instructions that capture their structural information. Our malware categorization method uses the computed ranks as features in machine learning algorithms. In the evaluation, we compare the effectiveness of different PageRank algorithms and also investigate bagging and boosting algorithms to improve categorization accuracy.
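A minimal sketch of the feature-extraction idea, assuming malware instructions have already been arranged into a directed transition graph: power iteration computes the PageRank vector, which then serves as the feature vector for a classifier. The damping factor and the toy opcode graph are illustrative:

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-9):
    """Power-iteration PageRank on a dense adjacency matrix."""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    # Row-normalise into transition probabilities; dangling nodes (no
    # outgoing edges) jump uniformly. Transpose to get column-stochastic P.
    P = np.where(out > 0, adj / np.maximum(out, 1), 1.0 / n).T
    r = np.full(n, 1.0 / n)
    while True:
        r_new = (1 - d) / n + d * P @ r
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

# Toy instruction transition graph (e.g. mov -> push, mov -> call, ...).
opcodes = ["mov", "push", "call", "ret"]
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 0, 0]], dtype=float)
features = pagerank(adj)   # per-instruction ranks used as ML features
print(dict(zip(opcodes, features.round(3))))
```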
Abstract:
This paper introduces a logical model of inductive generalization, and specifically of the machine learning task of inductive concept learning (ICL). We argue that some inductive processes, like ICL, can be seen as a form of defeasible reasoning. We define a consequence relation characterizing which hypotheses can be induced from given sets of examples and study its properties, showing that they correspond to a rather well-behaved non-monotonic logic. We also show that, with the addition of a preference relation on inductive theories, we can characterize the inductive bias of ICL algorithms. The second part of the paper shows how this logical characterization of inductive generalization can be integrated with another form of non-monotonic reasoning (argumentation) to define a model of multiagent ICL. This integration allows two or more agents to learn, in a consistent way, both from induction and from the arguments exchanged in their communication. We show that the inductive theories reached by multiagent induction plus argumentation are sound, i.e. they are precisely the same as the inductive theories built by a single agent with all the data.
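A toy illustration of the non-monotonic flavour of such an inductive consequence relation (not the paper's formalism): here a hypothesis is inducible if it covers every positive example and no negative one, and adding new evidence can defeat a previously induced hypothesis:

```python
# Hypotheses and examples as attribute sets; h covers an example if the
# example contains all of h's attributes. This is an illustrative stand-in
# for the paper's consequence relation, not its actual definition.
def covers(h, example):
    return h <= example

def inducible(h, pos, neg):
    """h is an inductive consequence of (pos, neg) if it covers all
    positive examples and no negative example."""
    return all(covers(h, e) for e in pos) and not any(covers(h, e) for e in neg)

pos = [{"wings", "feathers"}, {"wings", "feathers", "small"}]
neg = [{"wings", "metal"}]
h = {"feathers"}
print(inducible(h, pos, neg))         # True: h is currently induced

neg.append({"feathers", "extinct"})   # new evidence arrives
print(inducible(h, pos, neg))         # False: h is defeated (non-monotonicity)
```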
Abstract:
The problem of learning from imbalanced data is of critical importance in a large number of application domains and can be a bottleneck for many conventional learning methods that assume a balanced data distribution. The class imbalance problem corresponds to the situation where one class massively outnumbers the other. If imbalanced data is used directly, the imbalance between the majority and minority classes biases machine learning methods and produces unreliable outcomes. There has been increasing interest in this research area and a number of algorithms have been developed. However, independent evaluation of these algorithms remains limited. This paper evaluates the performance of five representative data sampling methods that deal with class imbalance, namely SMOTE, ADASYN, BorderlineSMOTE, SMOTETomek and RUSBoost. A comparative study is conducted and the performance of each method is critically analysed in terms of assessment metrics.
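A minimal sketch of such a comparison, assuming the imbalanced-learn package (whose class names match the methods studied; there, RUSBoost is an ensemble classifier rather than a standalone sampler). The synthetic data set and the F1 metric stand in for the paper's data sets and assessment metrics:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.combine import SMOTETomek
from imblearn.ensemble import RUSBoostClassifier

# Synthetic ~10:1 imbalanced data set (illustrative stand-in).
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = {"SMOTE": SMOTE(), "ADASYN": ADASYN(),
            "BorderlineSMOTE": BorderlineSMOTE(), "SMOTETomek": SMOTETomek()}
for name, sampler in samplers.items():
    X_rs, y_rs = sampler.fit_resample(X_tr, y_tr)   # rebalance training data
    clf = RandomForestClassifier(random_state=0).fit(X_rs, y_rs)
    print(name, round(f1_score(y_te, clf.predict(X_te)), 3))

# RUSBoost couples random undersampling with boosting internally.
rus = RUSBoostClassifier(random_state=0).fit(X_tr, y_tr)
print("RUSBoost", round(f1_score(y_te, rus.predict(X_te)), 3))
```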
Abstract:
The identification of non-linear systems using only observed finite data sets has matured into an established research area over the last two decades. A class of linear-in-the-parameter models with universal approximation capabilities has been intensively studied and widely used, owing to the availability of many linear learning algorithms and their inherent convergence conditions. This article presents a systematic overview of basic research on model selection approaches for linear-in-the-parameter models. One of the fundamental problems in non-linear system identification is to find the minimal model with the best generalisation performance from observational data alone. The important concepts used to achieve good model generalisation in various non-linear system-identification algorithms are first reviewed, including Bayesian parameter regularisation and model selection criteria based on cross-validation and experimental design. A significant advance in machine learning has been the development of the support vector machine as a means of identifying kernel models based on the structural risk minimisation principle. Developments in convex-optimisation-based model construction algorithms, including support vector regression, are then outlined. Input selection algorithms and online system identification algorithms are also included in this review. Finally, some industrial applications of non-linear models are discussed.
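A minimal sketch of a linear-in-the-parameter model of the kind reviewed: a Gaussian RBF design matrix whose weights come from regularised least squares, a simple instance of the parameter regularisation discussed above. Centre placement, basis width, and the regularisation constant are illustrative choices:

```python
import numpy as np

def rbf_design(X, centres, width):
    """Design matrix of Gaussian basis functions: linear in the weights."""
    d2 = (X[:, None, :] - centres[None, :, :]) ** 2
    return np.exp(-d2.sum(axis=2) / (2 * width ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

centres = X[rng.choice(len(X), size=20, replace=False)]  # centres from data
Phi = rbf_design(X, centres, width=0.8)
lam = 1e-3                               # ridge regulariser (assumption)
# Regularised least squares: w = (Phi'Phi + lam*I)^(-1) Phi'y
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
y_hat = Phi @ w
print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)).round(3))
```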
Abstract:
This paper investigates the learning of a wide class of single-hidden-layer feedforward neural networks (SLFNs) with two sets of adjustable parameters, i.e., the nonlinear parameters in the hidden nodes and the linear output weights. The main objective is both to speed up the convergence of second-order learning algorithms such as Levenberg-Marquardt (LM) and to improve network performance. This is achieved by reducing the dimension of the solution space and by introducing a new Jacobian matrix. Unlike conventional supervised learning methods, which optimize these two sets of parameters simultaneously, the linear output weights are first converted into dependent parameters, removing the need for their explicit computation. Consequently, the neural network (NN) learning is performed over a solution space of reduced dimension. A new Jacobian matrix is then proposed for use with the popular second-order learning methods in order to achieve a more accurate approximation of the cost function. The efficacy of the proposed method is shown through an analysis of the computational complexity and through simulation results from four different examples.
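A minimal sketch of the dimension-reduction idea: inside the residual function, the linear output weights are recovered by least squares given the current hidden-node parameters, so the optimiser searches only the reduced nonlinear space. SciPy's generic Levenberg-Marquardt driver stands in for the paper's method and its specialised Jacobian:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.tanh(3 * X[:, 0]) + 0.05 * rng.standard_normal(100)

n_hidden = 5

def residuals(p):
    """Residuals over the reduced space: only hidden parameters are free."""
    W = p[:n_hidden].reshape(1, n_hidden)          # input-to-hidden weights
    b = p[n_hidden:]                               # hidden biases
    H = np.tanh(X @ W + b)                         # hidden-layer outputs
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # dependent output weights
    return H @ beta - y

p0 = rng.standard_normal(2 * n_hidden)
sol = least_squares(residuals, p0, method="lm")    # Levenberg-Marquardt
print("final SSE:", round(float(np.sum(sol.fun ** 2)), 5))
```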
Abstract:
Recent years have witnessed rapidly increasing interest in the topic of incremental learning. Unlike conventional machine learning settings, the data targeted by incremental learning become available continuously over time. Accordingly, it is desirable to abandon the traditional assumption that representative training data are available during a training period in which decision boundaries are developed. Under continuous data flow, the challenge is how to transform the vast amount of raw stream data into information and knowledge representations, and to accumulate experience over time to support future decision-making. In this paper, we propose a general adaptive incremental learning framework named ADAIN that is capable of learning from continuous raw data, accumulating experience over time, and using such knowledge to improve future learning and prediction performance. Detailed system-level architecture and design strategies are presented. Simulation results on several real-world data sets validate the effectiveness of this method.
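A generic incremental-learning skeleton (not ADAIN itself, whose mapping function and architecture the paper details): a model is updated chunk by chunk with scikit-learn's partial_fit and evaluated prequentially on each incoming chunk, so no past raw data needs to be revisited:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=5000, random_state=0)
chunks = np.array_split(np.arange(len(X)), 10)   # simulated data stream

model = SGDClassifier(loss="log_loss", random_state=0)
for t, idx in enumerate(chunks):
    if t == 0:
        model.partial_fit(X[idx], y[idx], classes=np.unique(y))
        continue
    # Prequential evaluation: test on the chunk before learning from it.
    print(f"chunk {t}: accuracy {model.score(X[idx], y[idx]):.3f}")
    model.partial_fit(X[idx], y[idx])
```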
Abstract:
The optimization of full-scale biogas plant operation is of great importance in making biomass a competitive source of renewable energy. The implementation of innovative control and optimization algorithms, such as Nonlinear Model Predictive Control, requires online estimation of the operating state of the biogas plant. This state estimation allows optimal control and operating decisions to be made according to the actual state of the plant. In this paper such a state estimator is developed using a calibrated simulation model of a full-scale biogas plant, based on the Anaerobic Digestion Model No. 1. The use of advanced pattern recognition methods shows that model states can be predicted from basic online measurements such as biogas production, CH4 and CO2 content in the biogas, pH value, and substrate feed volume of known substrates. The machine learning methods used are trained and evaluated on synthetic data created with the biogas plant model, simulated over a wide range of possible plant operating regions. Results show that the operating state vector of the modelled anaerobic digestion process can be predicted with an overall accuracy of about 90%. This facilitates the application of state-based optimization and control algorithms to full-scale biogas plants and thereby fosters the production of eco-friendly energy from biomass.
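A minimal sketch of the pattern-recognition step, with random numbers standing in for the synthetic ADM1 simulation data: a multi-output random forest maps the basic online measurements to the state vector. The input ordering, output dimension, and data are placeholder assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder for synthetic ADM1 simulation data:
# inputs  = [gas production, CH4 %, CO2 %, pH, substrate feed volume]
# outputs = operating state vector (3 illustrative state variables here)
X = rng.uniform(size=(1000, 5))
Y = np.column_stack([X @ rng.uniform(size=5) + 0.05 * rng.standard_normal(1000)
                     for _ in range(3)])

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
est = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, Y_tr)
print("R^2 on held-out simulations:", round(est.score(X_te, Y_te), 3))
```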
Abstract:
This paper explores the performance of sliding-window based training, termed semi-batch, using a multilayer perceptron (MLP) neural network in the presence of correlated data. Sliding-window training is a form of higher-order instantaneous learning strategy that does not require a covariance matrix and is usually employed for modeling and tracking purposes. The sliding-window framework is implemented to combine the robustness of offline learning algorithms with the ability to track the underlying process of a function online. This paper adopts sliding-window training with recent advances in conjugate gradient directions and applies data store management techniques, e.g. a simple distance measure, angle evaluation, and a novel prediction error test. The simulation results show that the best convergence performance is obtained by using store management techniques.
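A minimal sketch of sliding-window (semi-batch) training with a simple distance-based store management rule: a new sample enters the window only if it lies sufficiently far from the stored ones, and the MLP is updated on the current window. The thresholds and scikit-learn optimiser are illustrative; the conjugate-gradient variant and the angle and prediction-error tests from the paper are not reproduced:

```python
import numpy as np
from collections import deque
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
model = MLPRegressor(hidden_layer_sizes=(10,), random_state=0)
window = deque(maxlen=50)                      # sliding data store

def admit(x, window, min_dist=0.05):
    """Distance-based store management: reject near-duplicate samples."""
    return all(abs(x - xs) > min_dist for xs, _ in window)

for t in range(2000):                          # correlated input stream
    x = np.sin(0.01 * t) + 0.01 * rng.standard_normal()
    y = x ** 2
    if admit(x, window):
        window.append((x, y))
        X = np.array([xs for xs, _ in window]).reshape(-1, 1)
        Y = np.array([ys for _, ys in window])
        model.partial_fit(X, Y)                # semi-batch update on window
print("final window size:", len(window))
```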
Abstract:
Real-world graphs or networks tend to exhibit a well-known set of properties, such as heavy-tailed degree distributions, clustering and community formation. Much effort has been directed toward creating realistic and tractable models for unlabelled graphs, which has yielded insights into graph structure and evolution. Recently, attention has moved to creating models for labelled graphs: many real-world graphs are labelled with both discrete and numeric attributes. In this paper, we present Agwan (Attribute Graphs: Weighted and Numeric), a generative model for random graphs with discrete labels and weighted edges. The model is easily generalised to edges labelled with an arbitrary number of numeric attributes. We include algorithms for fitting the parameters of the Agwan model to real-world graphs and for generating random graphs from the model. Using real-world directed and undirected graphs as input, we compare our approach to state-of-the-art random labelled graph generators and draw conclusions about the contribution of discrete vertex labels and edge weights to graph structure.
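A minimal sketch of the generative idea (not the fitted Agwan model itself, which learns its parameters from real graphs): vertex labels are drawn from a discrete distribution, an edge appears with a probability depending on the label pair, and its weight is sampled from a per-label-pair Gaussian. All parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
label_probs = [0.6, 0.4]                    # P(vertex label) - illustrative
edge_prob = {(0, 0): 0.08, (0, 1): 0.02, (1, 0): 0.02, (1, 1): 0.10}
weight_mu = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 8.0}

labels = rng.choice(2, size=n, p=label_probs)
edges = []
for u in range(n):
    for v in range(u + 1, n):               # undirected variant
        pair = (labels[u], labels[v])
        if rng.random() < edge_prob[pair]:
            w = max(0.0, rng.normal(weight_mu[pair], 1.0))  # Gaussian weight
            edges.append((u, v, w))
print(len(edges), "edges; label counts:", np.bincount(labels))
```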
Abstract:
This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing the guarantee of global optimality. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. A branch-and-bound algorithm is then presented that integrates structural constraints with data in a way that guarantees global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using these properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before.
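A minimal sketch of the decomposable BIC/MDL score that such properties are derived for: the score of a network is the sum over families (a child and its parent set) of a count-based log-likelihood term minus a complexity penalty, so families can be scored independently. The toy data are illustrative and the branch-and-bound search itself is not reproduced:

```python
import numpy as np
from itertools import product

def bic_family(data, child, parents, card):
    """BIC score of one family: log-likelihood minus complexity penalty."""
    n = len(data)
    ll = 0.0
    for pa_vals in product(*(range(card[p]) for p in parents)):
        mask = np.all(data[:, parents] == pa_vals, axis=1) if parents else \
               np.ones(n, dtype=bool)
        counts = np.bincount(data[mask, child], minlength=card[child])
        nij = counts.sum()
        ll += sum(c * np.log(c / nij) for c in counts if c > 0)
    n_params = (card[child] - 1) * int(np.prod([card[p] for p in parents]))
    return ll - 0.5 * np.log(n) * n_params

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(500, 3))      # toy binary data set
card = [2, 2, 2]
# Decomposability: the total network score is the sum of family scores,
# so each candidate parent set can be scored independently.
print("score(X2 | X0,X1):", round(bic_family(data, 2, [0, 1], card), 2))
print("score(X2 | none): ", round(bic_family(data, 2, [], card), 2))
```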
Abstract:
With over 50 billion downloads and more than 1.3 million apps in Google’s official market, Android has continued to gain popularity amongst smartphone users worldwide. At the same time there has been a rise in malware targeting the platform, with more recent strains employing highly sophisticated detection-avoidance techniques. As traditional signature-based methods become less potent in detecting unknown malware, alternatives are needed for timely zero-day discovery. This paper therefore proposes an approach that utilizes ensemble learning for Android malware detection. It combines the advantages of static analysis with the efficiency and performance of ensemble machine learning to improve Android malware detection accuracy. The machine learning models are built using a large repository of malware samples and benign apps from a leading antivirus vendor. The experimental results and analysis presented show that the proposed method, which uses a large feature space to leverage the power of ensemble learning, achieves 97.3% to 99% detection accuracy with very low false positive rates.
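A minimal sketch of the ensemble step, assuming binary static-analysis features (permissions, API calls, intents) have already been extracted from the apps: several base learners are combined by voting. The synthetic feature matrix and the particular base learners are assumptions; the paper's exact model composition may differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in for binary static features (permissions, API calls, intents);
# label 1 = malware, 0 = benign.
X, y = make_classification(n_samples=3000, n_features=200, n_informative=30,
                           random_state=0)

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft")                      # average predicted probabilities
scores = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy")
print("cross-validated accuracy:", round(scores.mean(), 3))
```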
Abstract:
Algorithms for handling concept drift are important for various applications, including video analysis and smart grids. In this paper we present a decision tree ensemble classification method for concept drift based on the Random Forest algorithm. A weighted majority voting ensemble aggregation rule is employed, based on the ideas of the Accuracy Weighted Ensemble (AWE) method. The base learner weight in our case is computed for each evaluated sample using the base learner's accuracy and the intrinsic proximity measure of Random Forest. Our algorithm exploits both temporal weighting of samples and ensemble pruning as a forgetting strategy. We present the results of an empirical comparison of our method with the original Random Forest with incorporated replace-the-loser forgetting and with other state-of-the-art concept-drift classifiers such as AWE.
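A minimal sketch of weighted majority voting over a tree ensemble: each tree's vote is scaled by a weight reflecting its recent accuracy (the paper additionally uses per-sample weights from Random Forest's intrinsic proximity measure, temporal sample weighting, and ensemble pruning, none of which are reproduced here). The recent-data evaluation window is an illustrative assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)
# Per-tree weights from accuracy on a recent window (assumption: the last
# 200 training samples stand in for the most recent data under drift).
weights = np.array([t.score(X_tr[-200:], y_tr[-200:])
                    for t in forest.estimators_])

votes = np.stack([t.predict(X_te) for t in forest.estimators_])  # (trees, n)
weighted = np.array([np.bincount(votes[:, i].astype(int), weights=weights,
                                 minlength=2).argmax()
                     for i in range(votes.shape[1])])
print("weighted-vote accuracy:", round(float((weighted == y_te).mean()), 3))
```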