940 resultados para data-driven modelling
Resumo:
This paper presents a review of methodology for semi-supervised modeling with kernel methods, when the manifold assumption is guaranteed to be satisfied. It concerns environmental data modeling on natural manifolds, such as complex topographies of the mountainous regions, where environmental processes are highly influenced by the relief. These relations, possibly regionalized and nonlinear, can be modeled from data with machine learning using the digital elevation models in semi-supervised kernel methods. The range of the tools and methodological issues discussed in the study includes feature selection and semisupervised Support Vector algorithms. The real case study devoted to data-driven modeling of meteorological fields illustrates the discussed approach.
Resumo:
In recent years there has been an explosive growth in the development of adaptive and data driven methods. One of the efficient and data-driven approaches is based on statistical learning theory (Vapnik 1998). The theory is based on Structural Risk Minimisation (SRM) principle and has a solid statistical background. When applying SRM we are trying not only to reduce training error ? to fit the available data with a model, but also to reduce the complexity of the model and to reduce generalisation error. Many nonlinear learning procedures recently developed in neural networks and statistics can be understood and interpreted in terms of the structural risk minimisation inductive principle. A recent methodology based on SRM is called Support Vector Machines (SVM). At present SLT is still under intensive development and SVM find new areas of application (www.kernel-machines.org). SVM develop robust and non linear data models with excellent generalisation abilities that is very important both for monitoring and forecasting. SVM are extremely good when input space is high dimensional and training data set i not big enough to develop corresponding nonlinear model. Moreover, SVM use only support vectors to derive decision boundaries. It opens a way to sampling optimization, estimation of noise in data, quantification of data redundancy etc. Presentation of SVM for spatially distributed data is given in (Kanevski and Maignan 2004).
Resumo:
Uncertainty quantification of petroleum reservoir models is one of the present challenges, which is usually approached with a wide range of geostatistical tools linked with statistical optimisation or/and inference algorithms. The paper considers a data driven approach in modelling uncertainty in spatial predictions. Proposed semi-supervised Support Vector Regression (SVR) model has demonstrated its capability to represent realistic features and describe stochastic variability and non-uniqueness of spatial properties. It is able to capture and preserve key spatial dependencies such as connectivity, which is often difficult to achieve with two-point geostatistical models. Semi-supervised SVR is designed to integrate various kinds of conditioning data and learn dependences from them. A stochastic semi-supervised SVR model is integrated into a Bayesian framework to quantify uncertainty with multiple models fitted to dynamic observations. The developed approach is illustrated with a reservoir case study. The resulting probabilistic production forecasts are described by uncertainty envelopes.
Resumo:
Intensification of agricultural production without a sound management and regulations can lead to severe environmental problems, as in Western Santa Catarina State, Brazil, where intensive swine production has caused large accumulations of manure and consequently water pollution. Natural resource scientists are asked by decision-makers for advice on management and regulatory decisions. Distributed environmental models are useful tools, since they can be used to explore consequences of various management practices. However, in many areas of the world, quantitative data for model calibration and validation are lacking. The data-intensive distributed environmental model AgNPS was applied in a data-poor environment, the upper catchment (2,520 ha) of the Ariranhazinho River, near the city of Seara, in Santa Catarina State. Steps included data preparation, cell size selection, sensitivity analysis, model calibration and application to different management scenarios. The model was calibrated based on a best guess for model parameters and on a pragmatic sensitivity analysis. The parameters were adjusted to match model outputs (runoff volume, peak runoff rate and sediment concentration) closely with the sparse observed data. A modelling grid cell resolution of 150 m adduced appropriate and computer-fit results. The rainfall runoff response of the AgNPS model was calibrated using three separate rainfall ranges (< 25, 25-60, > 60 mm). Predicted sediment concentrations were consistently six to ten times higher than observed, probably due to sediment trapping along vegetated channel banks. Predicted N and P concentrations in stream water ranged from just below to well above regulatory norms. Expert knowledge of the area, in addition to experience reported in the literature, was able to compensate in part for limited calibration data. Several scenarios (actual, recommended and excessive manure applications, and point source pollution from swine operations) could be compared by the model, using a relative ranking rather than quantitative predictions.
Resumo:
This paper presents multiple kernel learning (MKL) regression as an exploratory spatial data analysis and modelling tool. The MKL approach is introduced as an extension of support vector regression, where MKL uses dedicated kernels to divide a given task into sub-problems and to treat them separately in an effective way. It provides better interpretability to non-linear robust kernel regression at the cost of a more complex numerical optimization. In particular, we investigate the use of MKL as a tool that allows us to avoid using ad-hoc topographic indices as covariables in statistical models in complex terrains. Instead, MKL learns these relationships from the data in a non-parametric fashion. A study on data simulated from real terrain features confirms the ability of MKL to enhance the interpretability of data-driven models and to aid feature selection without degrading predictive performances. Here we examine the stability of the MKL algorithm with respect to the number of training data samples and to the presence of noise. The results of a real case study are also presented, where MKL is able to exploit a large set of terrain features computed at multiple spatial scales, when predicting mean wind speed in an Alpine region.
Resumo:
Radioactive soil-contamination mapping and risk assessment is a vital issue for decision makers. Traditional approaches for mapping the spatial concentration of radionuclides employ various regression-based models, which usually provide a single-value prediction realization accompanied (in some cases) by estimation error. Such approaches do not provide the capability for rigorous uncertainty quantification or probabilistic mapping. Machine learning is a recent and fast-developing approach based on learning patterns and information from data. Artificial neural networks for prediction mapping have been especially powerful in combination with spatial statistics. A data-driven approach provides the opportunity to integrate additional relevant information about spatial phenomena into a prediction model for more accurate spatial estimates and associated uncertainty. Machine-learning algorithms can also be used for a wider spectrum of problems than before: classification, probability density estimation, and so forth. Stochastic simulations are used to model spatial variability and uncertainty. Unlike regression models, they provide multiple realizations of a particular spatial pattern that allow uncertainty and risk quantification. This paper reviews the most recent methods of spatial data analysis, prediction, and risk mapping, based on machine learning and stochastic simulations in comparison with more traditional regression models. The radioactive fallout from the Chernobyl Nuclear Power Plant accident is used to illustrate the application of the models for prediction and classification problems. This fallout is a unique case study that provides the challenging task of analyzing huge amounts of data ('hard' direct measurements, as well as supplementary information and expert estimates) and solving particular decision-oriented problems.
Resumo:
The neural mechanisms determining the timing of even simple actions, such as when to walk or rest, are largely mysterious. One intriguing, but untested, hypothesis posits a role for ongoing activity fluctuations in neurons of central action selection circuits that drive animal behavior from moment to moment. To examine how fluctuating activity can contribute to action timing, we paired high-resolution measurements of freely walking Drosophila melanogaster with data-driven neural network modeling and dynamical systems analysis. We generated fluctuation-driven network models whose outputs-locomotor bouts-matched those measured from sensory-deprived Drosophila. From these models, we identified those that could also reproduce a second, unrelated dataset: the complex time-course of odor-evoked walking for genetically diverse Drosophila strains. Dynamical models that best reproduced both Drosophila basal and odor-evoked locomotor patterns exhibited specific characteristics. First, ongoing fluctuations were required. In a stochastic resonance-like manner, these fluctuations allowed neural activity to escape stable equilibria and to exceed a threshold for locomotion. Second, odor-induced shifts of equilibria in these models caused a depression in locomotor frequency following olfactory stimulation. Our models predict that activity fluctuations in action selection circuits cause behavioral output to more closely match sensory drive and may therefore enhance navigation in complex sensory environments. Together these data reveal how simple neural dynamics, when coupled with activity fluctuations, can give rise to complex patterns of animal behavior.
Resumo:
The attached file is created with Scientific Workplace Latex
Resumo:
A wide variety of exposure models are currently employed for health risk assessments. Individual models have been developed to meet the chemical exposure assessment needs of Government, industry and academia. These existing exposure models can be broadly categorised according to the following types of exposure source: environmental, dietary, consumer product, occupational, and aggregate and cumulative. Aggregate exposure models consider multiple exposure pathways, while cumulative models consider multiple chemicals. In this paper each of these basic types of exposure model are briefly described, along with any inherent strengths or weaknesses, with the UK as a case study. Examples are given of specific exposure models that are currently used, or that have the potential for future use, and key differences in modelling approaches adopted are discussed. The use of exposure models is currently fragmentary in nature. Specific organisations with exposure assessment responsibilities tend to use a limited range of models. The modelling techniques adopted in current exposure models have evolved along distinct lines for the various types of source. In fact different organisations may be using different models for very similar exposure assessment situations. This lack of consistency between exposure modelling practices can make understanding the exposure assessment process more complex, can lead to inconsistency between organisations in how critical modelling issues are addressed (e.g. variability and uncertainty), and has the potential to communicate mixed messages to the general public. Further work should be conducted to integrate the various approaches and models, where possible and regulatory remits allow, to get a coherent and consistent exposure modelling process. We recommend the development of an overall framework for exposure and risk assessment with common approaches and methodology, a screening tool for exposure assessment, collection of better input data, probabilistic modelling, validation of model input and output and a closer working relationship between scientists and policy makers and staff from different Government departments. A much increased effort is required is required in the UK to address these issues. The result will be a more robust, transparent, valid and more comparable exposure and risk assessment process. (C) 2006 Elsevier Ltd. All rights reserved.
Resumo:
A common problem in many data based modelling algorithms such as associative memory networks is the problem of the curse of dimensionality. In this paper, a new two-stage neurofuzzy system design and construction algorithm (NeuDeC) for nonlinear dynamical processes is introduced to effectively tackle this problem. A new simple preprocessing method is initially derived and applied to reduce the rule base, followed by a fine model detection process based on the reduced rule set by using forward orthogonal least squares model structure detection. In both stages, new A-optimality experimental design-based criteria we used. In the preprocessing stage, a lower bound of the A-optimality design criterion is derived and applied as a subset selection metric, but in the later stage, the A-optimality design criterion is incorporated into a new composite cost function that minimises model prediction error as well as penalises the model parameter variance. The utilisation of NeuDeC leads to unbiased model parameters with low parameter variance and the additional benefit of a parsimonious model structure. Numerical examples are included to demonstrate the effectiveness of this new modelling approach for high dimensional inputs.
Resumo:
The modelling of a nonlinear stochastic dynamical processes from data involves solving the problems of data gathering, preprocessing, model architecture selection, learning or adaptation, parametric evaluation and model validation. For a given model architecture such as associative memory networks, a common problem in non-linear modelling is the problem of "the curse of dimensionality". A series of complementary data based constructive identification schemes, mainly based on but not limited to an operating point dependent fuzzy models, are introduced in this paper with the aim to overcome the curse of dimensionality. These include (i) a mixture of experts algorithm based on a forward constrained regression algorithm; (ii) an inherent parsimonious delaunay input space partition based piecewise local lineal modelling concept; (iii) a neurofuzzy model constructive approach based on forward orthogonal least squares and optimal experimental design and finally (iv) the neurofuzzy model construction algorithm based on basis functions that are Bézier Bernstein polynomial functions and the additive decomposition. Illustrative examples demonstrate their applicability, showing that the final major hurdle in data based modelling has almost been removed.
Resumo:
We describe ncWMS, an implementation of the Open Geospatial Consortium’s Web Map Service (WMS) specification for multidimensional gridded environmental data. ncWMS can read data in a large number of common scientific data formats – notably the NetCDF format with the Climate and Forecast conventions – then efficiently generate map imagery in thousands of different coordinate reference systems. It is designed to require minimal configuration from the system administrator and, when used in conjunction with a suitable client tool, provides end users with an interactive means for visualizing data without the need to download large files or interpret complex metadata. It is also used as a “bridging” tool providing interoperability between the environmental science community and users of geographic information systems. ncWMS implements a number of extensions to the WMS standard in order to fulfil some common scientific requirements, including the ability to generate plots representing timeseries and vertical sections. We discuss these extensions and their impact upon present and future interoperability. We discuss the conceptual mapping between the WMS data model and the data models used by gridded data formats, highlighting areas in which the mapping is incomplete or ambiguous. We discuss the architecture of the system and particular technical innovations of note, including the algorithms used for fast data reading and image generation. ncWMS has been widely adopted within the environmental data community and we discuss some of the ways in which the software is integrated within data infrastructures and portals.
Resumo:
Background: In many experimental pipelines, clustering of multidimensional biological datasets is used to detect hidden structures in unlabelled input data. Taverna is a popular workflow management system that is used to design and execute scientific workflows and aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important to support a data-driven scientific discovery in complex and explorative bioinformatics applications. Results: This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology preserving projection for the visualization of the input data and their similarities. The core algorithm in the BioDICE plugin is Fast Learning Self Organizing Map (FLSOM), which is an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical compounds. Conclusions: The number and variety of available tools and its extensibility have made Taverna a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool, which can be adopted for the explorative analysis of biological datasets.
Resumo:
Instrumentation and automation plays a vital role to managing the water industry. These systems generate vast amounts of data that must be effectively managed in order to enable intelligent decision making. Time series data management software, commonly known as data historians are used for collecting and managing real-time (time series) information. More advanced software solutions provide a data infrastructure or utility wide Operations Data Management System (ODMS) that stores, manages, calculates, displays, shares, and integrates data from multiple disparate automation and business systems that are used daily in water utilities. These ODMS solutions are proven and have the ability to manage data from smart water meters to the collaboration of data across third party corporations. This paper focuses on practical, utility successes in the water industry where utility managers are leveraging instantaneous access to data from proven, commercial off-the-shelf ODMS solutions to enable better real-time decision making. Successes include saving $650,000 / year in water loss control, safeguarding water quality, saving millions of dollars in energy management and asset management. Immediate opportunities exist to integrate the research being done in academia with these ODMS solutions in the field and to leverage these successes to utilities around the world.
Resumo:
Este trabalho apresenta resultados práticos de uma atenção sistemática dada ao processamento e à interpretação sísmica de algumas linhas terrestres do conjunto de dados do gráben do Tacutu (Brasil), sobre os quais foram aplicadas etapas fundamentais do sistema WIT de imageamento do empilhamento CRS (Superfície de Reflexão Comum) vinculado a dados. Como resultado, esperamos estabelecer um fluxograma para a reavaliação sísmica de bacias sedimentares. Fundamentado nos atributos de frente de onda resultantes do empilhamento CRS, um macro-modelo suave de velocidades foi obtido através de inversão tomográfica. Usando este macro-modelo, foi realizado uma migração à profundidade pré- e pós-empilhamento. Além disso, outras técnicas baseadas no empilhamento CRS foram realizadas em paralelo como correção estática residual e migração de abertura-limitada baseada na zona de Fresnel projetada. Uma interpretação geológica sobre as seções empilhadas e migradas foi esboçada. A partir dos detalhes visuais dos painéis é possível interpretar desconformidades, afinamentos, um anticlinal principal falhado com conjuntos de horstes e grábens. Também, uma parte da linha selecionada precisa de processamento mais detalhado para evidenciar melhor qualquer estrutura presente na subsuperfície.