928 resultados para Bayesian nonparametric
Resumo:
Mobile malware has been growing in scale and complexity spurred by the unabated uptake of smartphones worldwide. Android is fast becoming the most popular mobile platform resulting in sharp increase in malware targeting the platform. Additionally, Android malware is evolving rapidly to evade detection by traditional signature-based scanning. Despite current detection measures in place, timely discovery of new malware is still a critical issue. This calls for novel approaches to mitigate the growing threat of zero-day Android malware. Hence, the authors develop and analyse proactive machine-learning approaches based on Bayesian classification aimed at uncovering unknown Android malware via static analysis. The study, which is based on a large malware sample set of majority of the existing families, demonstrates detection capabilities with high accuracy. Empirical results and comparative analysis are presented offering useful insight towards development of effective static-analytic Bayesian classification-based solutions for detecting unknown Android malware.
Resumo:
Mineral exploration programmes around the world use data from remote sensing, geophysics and direct sampling. On a regional scale, the combination of airborne geophysics and ground-based geochemical sampling can aid geological mapping and economic minerals exploration. The fact that airborne geophysical and traditional soil-sampling data are generated at different spatial resolutions means that they are not immediately comparable due to their different sampling density. Several geostatistical techniques, including indicator cokriging and collocated cokriging, can be used to integrate different types of data into a geostatistical model. With increasing numbers of variables the inference of the cross-covariance model required for cokriging can be demanding in terms of effort and computational time. In this paper a Gaussian-based Bayesian updating approach is applied to integrate airborne radiometric data and ground-sampled geochemical soil data to maximise information generated from the soil survey, to enable more accurate geological interpretation for the exploration and development of natural resources. The Bayesian updating technique decomposes the collocated estimate into a production of two models: prior and likelihood models. The prior model is built from primary information and the likelihood model is built from secondary information. The prior model is then updated with the likelihood model to build the final model. The approach allows multiple secondary variables to be simultaneously integrated into the mapping of the primary variable. The Bayesian updating approach is demonstrated using a case study from Northern Ireland where the history of mineral prospecting for precious and base metals dates from the 18th century. Vein-hosted, strata-bound and volcanogenic occurrences of mineralisation are found. The geostatistical technique was used to improve the resolution of soil geochemistry, collected one sample per 2 km2, by integrating more closely measured airborne geophysical data from the GSNI Tellus Survey, measured over a footprint of 65 x 200 m. The directly measured geochemistry data were considered as primary data in the Bayesian approach and the airborne radiometric data were used as secondary data. The approach produced more detailed updated maps and in particular maximized information on mapped estimates of zinc, copper and lead. Greater delineation of an elongated northwest/southeast trending zone in the updated maps strengthened the potential to investigate stratabound base metal deposits.
Resumo:
We present a Bayesian-odds-ratio-based algorithm for detecting stellar flares in light-curve data. We assume flares are described by a model in which there is a rapid rise with a half-Gaussian profile, followed by an exponential decay. Our signal model also contains a polynomial background model required to fit underlying light-curve variations in the data, which could otherwise partially mimic a flare. We characterize the false alarm probability and efficiency of this method under the assumption that any unmodelled noise in the data is Gaussian, and compare it with a simpler thresholding method based on that used in Walkowicz et al. We find our method has a significant increase in detection efficiency for low signal-to-noise ratio (S/N) flares. For a conservative false alarm probability our method can detect 95 per cent of flares with S/N less than 20, as compared to S/N of 25 for the simpler method. We also test how well the assumption of Gaussian noise holds by applying the method to a selection of 'quiet' Kepler stars. As an example we have applied our method to a selection of stars in Kepler Quarter 1 data. The method finds 687 flaring stars with a total of 1873 flares after vetos have been applied. For these flares we have made preliminary characterizations of their durations and and S/N.
Resumo:
We consider the local order estimation of nonlinear autoregressive systems with exogenous inputs (NARX), which may have different local dimensions at different points. By minimizing the kernel-based local information criterion introduced in this paper, the strongly consistent estimates for the local orders of the NARX system at points of interest are obtained. The modification of the criterion and a simple procedure of searching the minimum of the criterion, are also discussed. The theoretical results derived here are tested by simulation examples.
Resumo:
This work presents two new score functions based on the Bayesian Dirichlet equivalent uniform (BDeu) score for learning Bayesian network structures. They consider the sensitivity of BDeu to varying parameters of the Dirichlet prior. The scores take on the most adversary and the most beneficial priors among those within a contamination set around the symmetric one. We build these scores in such way that they are decomposable and can be computed efficiently. Because of that, they can be integrated into any state-of-the-art structure learning method that explores the space of directed acyclic graphs and allows decomposable scores. Empirical results suggest that our scores outperform the standard BDeu score in terms of the likelihood of unseen data and in terms of edge discovery with respect to the true network, at least when the training sample size is small. We discuss the relation between these new scores and the accuracy of inferred models. Moreover, our new criteria can be used to identify the amount of data after which learning is saturated, that is, additional data are of little help to improve the resulting model.
Resumo:
This work presents novel algorithms for learning Bayesian networks of bounded treewidth. Both exact and approximate methods are developed. The exact method combines mixed integer linear programming formulations for structure learning and treewidth computation. The approximate method consists in sampling k-trees (maximal graphs of treewidth k), and subsequently selecting, exactly or approximately, the best structure whose moral graph is a subgraph of that k-tree. The approaches are empirically compared to each other and to state-of-the-art methods on a collection of public data sets with up to 100 variables.
Resumo:
This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-and-bound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before.
Resumo:
Retrospective clinical datasets are often characterized by a relatively small sample size and many missing data. In this case, a common way for handling the missingness consists in discarding from the analysis patients with missing covariates, further reducing the sample size. Alternatively, if the mechanism that generated the missing allows, incomplete data can be imputed on the basis of the observed data, avoiding the reduction of the sample size and allowing methods to deal with complete data later on. Moreover, methodologies for data imputation might depend on the particular purpose and might achieve better results by considering specific characteristics of the domain. The problem of missing data treatment is studied in the context of survival tree analysis for the estimation of a prognostic patient stratification. Survival tree methods usually address this problem by using surrogate splits, that is, splitting rules that use other variables yielding similar results to the original ones. Instead, our methodology consists in modeling the dependencies among the clinical variables with a Bayesian network, which is then used to perform data imputation, thus allowing the survival tree to be applied on the completed dataset. The Bayesian network is directly learned from the incomplete data using a structural expectation–maximization (EM) procedure in which the maximization step is performed with an exact anytime method, so that the only source of approximation is due to the EM formulation itself. On both simulated and real data, our proposed methodology usually outperformed several existing methods for data imputation and the imputation so obtained improved the stratification estimated by the survival tree (especially with respect to using surrogate splits).
Resumo:
This paper presents new results for the (partial) maximum a posteriori (MAP) problem in Bayesian networks, which is the problem of querying the most probable state configuration of some of the network variables given evidence. It is demonstrated that the problem remains hard even in networks with very simple topology, such as binary polytrees and simple trees (including the Naive Bayes structure), which extends previous complexity results. Furthermore, a Fully Polynomial Time Approximation Scheme for MAP in networks with bounded treewidth and bounded number of states per variable is developed. Approximation schemes were thought to be impossible, but here it is shown otherwise under the assumptions just mentioned, which are adopted in most applications.
Resumo:
This paper presents new results for the (partial) maximum a posteriori (MAP) problem in Bayesian networks, which is the problem of querying the most probable state configuration of some of the network variables given evidence. First, it is demonstrated that the problem remains hard even in networks with very simple topology, such as binary polytrees and simple trees (including the Naive Bayes structure). Such proofs extend previous complexity results for the problem. Inapproximability results are also derived in the case of trees if the number of states per variable is not bounded. Although the problem is shown to be hard and inapproximable even in very simple scenarios, a new exact algorithm is described that is empirically fast in networks of bounded treewidth and bounded number of states per variable. The same algorithm is used as basis of a Fully Polynomial Time Approximation Scheme for MAP under such assumptions. Approximation schemes were generally thought to be impossible for this problem, but we show otherwise for classes of networks that are important in practice. The algorithms are extensively tested using some well-known networks as well as random generated cases to show their effectiveness.
Resumo:
This paper addresses the estimation of parameters of a Bayesian network from incomplete data. The task is usually tackled by running the Expectation-Maximization (EM) algorithm several times in order to obtain a high log-likelihood estimate. We argue that choosing the maximum log-likelihood estimate (as well as the maximum penalized log-likelihood and the maximum a posteriori estimate) has severe drawbacks, being affected both by overfitting and model uncertainty. Two ideas are discussed to overcome these issues: a maximum entropy approach and a Bayesian model averaging approach. Both ideas can be easily applied on top of EM, while the entropy idea can be also implemented in a more sophisticated way, through a dedicated non-linear solver. A vast set of experiments shows that these ideas produce significantly better estimates and inferences than the traditional and widely used maximum (penalized) log-likelihood and maximum a posteriori estimates. In particular, if EM is adopted as optimization engine, the model averaging approach is the best performing one; its performance is matched by the entropy approach when implemented using the non-linear solver. The results suggest that the applicability of these ideas is immediate (they are easy to implement and to integrate in currently available inference engines) and that they constitute a better way to learn Bayesian network parameters.
Resumo:
This paper strengthens the NP-hardness result for the (partial) maximum a posteriori (MAP) problem in Bayesian networks with topology of trees (every variable has at most one parent) and variable cardinality at most three. MAP is the problem of querying the most probable state configuration of some (not necessarily all) of the network variables given evidence. It is demonstrated that the problem remains hard even in such simplistic networks.
Resumo:
This paper presents new results on the complexity of graph-theoretical models that represent probabilities (Bayesian networks) and that represent interval and set valued probabilities (credal networks). We define a new class of networks with bounded width, and introduce a new decision problem for Bayesian networks, the maximin a posteriori. We present new links between the Bayesian and credal networks, and present new results both for Bayesian networks (most probable explanation with observations, maximin a posteriori) and for credal networks (bounds on probabilities a posteriori, most probable explanation with and without observations, maximum a posteriori).
Resumo:
This paper investigates a representation language with flexibility inspired by probabilistic logic and compactness inspired by relational Bayesian networks. The goal is to handle propositional and first-order constructs together with precise, imprecise, indeterminate and qualitative probabilistic assessments. The paper shows how this can be achieved through the theory of credal networks. New exact and approximate inference algorithms based on multilinear programming and iterated/loopy propagation of interval probabilities are presented; their superior performance, compared to existing ones, is shown empirically.