20 results for "Data sets storage"

in Aston University Research Archive


Relevance: 100.00%

Abstract:

The principled statistical application of Gaussian random field models used in geostatistics has historically been limited to data sets of a small size. This limitation is imposed by the requirement to store and invert the covariance matrix of all the samples to obtain a predictive distribution at unsampled locations, or to use likelihood-based covariance estimation. Various ad hoc approaches to solve this problem have been adopted, such as selecting a neighborhood region and/or a small number of observations to use in the kriging process, but these have no sound theoretical basis and it is unclear what information is being lost. In this article, we present a Bayesian method for estimating the posterior mean and covariance structures of a Gaussian random field using a sequential estimation algorithm. By imposing sparsity in a well-defined framework, the algorithm retains a subset of “basis vectors” that best represent the “true” posterior Gaussian random field model in the relative entropy sense. This allows a principled treatment of Gaussian random field models on very large data sets. The method is particularly appropriate when the Gaussian random field model is regarded as a latent variable model, which may be nonlinearly related to the observations. We show the application of the sequential, sparse Bayesian estimation in Gaussian random field models and discuss its merits and drawbacks.
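The reduced-rank idea behind this approach can be sketched in a few lines. The following is a generic projected-process (subset-of-regressors) prediction with a randomly chosen basis set, not the paper's relative-entropy selection criterion; the kernel, lengthscale, noise level and data are invented for the example.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential covariance between row vectors in a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
n, m = 500, 30                             # n observations, m basis vectors
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

Xb = X[rng.choice(n, m, replace=False)]    # basis-vector locations (random here)
Kbb = rbf(Xb, Xb)                          # m x m (small)
Kxb = rbf(X, Xb)                           # n x m
noise = 0.1**2

# Projected-process posterior over basis weights: only m x m systems are
# solved, so the cost is O(n m^2) rather than the O(n^3) of full kriging.
A = noise * Kbb + Kxb.T @ Kxb
w = np.linalg.solve(A, Kxb.T @ y)

Xs = np.linspace(-3, 3, 7)[:, None]
mean = rbf(Xs, Xb) @ w                     # predictive mean at test locations
```

With only 30 basis vectors the predictive mean closely follows the underlying sine function, illustrating how a small retained subset can represent the full posterior.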

Relevance: 100.00%

Abstract:

Recently within the machine learning and spatial statistics communities many papers have explored the potential of reduced rank representations of the covariance matrix, often referred to as projected or fixed rank approaches. In such methods the covariance function of the posterior process is represented by a reduced rank approximation which is chosen such that there is minimal information loss. In this paper a sequential framework for inference in such projected processes is presented, where the observations are considered one at a time. We introduce a C++ library for carrying out such projected, sequential estimation which adds several novel features. In particular we have incorporated the ability to use a generic observation operator, or sensor model, to permit data fusion. We can also cope with a range of observation error characteristics, including non-Gaussian observation errors. Inference for the variogram parameters is based on maximum likelihood estimation. We illustrate the projected sequential method in application to synthetic and real data sets. We discuss the software implementation and suggest possible future extensions.
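The one-observation-at-a-time style of inference described above can be illustrated with a rank-one Bayesian update, here reduced to a fixed set of basis functions rather than the library's actual projected-process machinery; the feature map, prior and noise level are all assumptions for illustration.

```python
import numpy as np

def features(x):
    # three fixed radial basis functions play the role of the projection
    centres = np.array([-1.0, 0.0, 1.0])
    return np.exp(-0.5 * (x - centres) ** 2)

rng = np.random.default_rng(1)
noise = 0.1**2
mu = np.zeros(3)                 # posterior mean of the basis weights
S = np.eye(3)                    # posterior covariance of the basis weights

w_true = np.array([0.5, -1.0, 2.0])
xs = rng.uniform(-1.5, 1.5, 200)
ys = np.array([features(x) @ w_true for x in xs])
ys += np.sqrt(noise) * rng.standard_normal(200)

# Sequential (recursive) Bayesian update: each observation triggers a
# rank-one correction to the mean and covariance, so no batch solve is needed.
for x, y in zip(xs, ys):
    phi = features(x)
    Sphi = S @ phi
    denom = noise + phi @ Sphi
    mu = mu + Sphi * (y - phi @ mu) / denom
    S = S - np.outer(Sphi, Sphi) / denom
```

Because the updates are exact conjugate steps, processing the observations one at a time recovers the same posterior as a batch fit, which is what makes the sequential framework attractive for streaming data.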

Relevance: 100.00%

Abstract:

Signal integration determines cell fate at the cellular level, affects cognitive processes and affective responses at the behavioural level, and is likely to be involved in the psychoneurobiological processes underlying mood disorders. Interactions between stimuli may be subject to time effects, and such time-dependencies typically lead to complex responses at both the cellular and the behavioural level. We show that both three-factor models and time series models can be used to uncover such time-dependencies. However, we argue that for short longitudinal data the three-factor modelling approach is more suitable. To illustrate both approaches, we re-analysed previously published short longitudinal data sets. We found that in human embryonic kidney 293 (HEK293) cells the interaction effect in the regulation of extracellular signal-regulated kinase (ERK) 1 signalling activation by insulin and epidermal growth factor is subject to a time effect and decays dramatically at peak values of ERK activation. In contrast, we found that the interaction effect induced by hypoxia and tumour necrosis factor-alpha on the transcriptional activity of the human cyclo-oxygenase-2 promoter in HEK293 cells is time-invariant, at least in the first 12-h window after stimulation. Furthermore, we applied the three-factor model to previously reported animal studies in which memory storage was found to be subject to an interaction effect of the beta-adrenoceptor agonist clenbuterol and certain antagonists acting on the alpha-1-adrenoceptor/glucocorticoid-receptor system. Our model-based analysis suggests that the interaction effect is relevant only if the antagonist drug is administered within a critical time window.
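A time-dependent interaction of the kind analysed above can be made concrete with a toy factorial design. The response model below is invented for the example: two binary stimuli with constant main effects and an interaction term that decays over time, recovered at each time point by the classical interaction contrast.

```python
import numpy as np

t = np.linspace(0, 12, 13)               # hours after stimulation

def response(a, b, t):
    # invented model: main effects of stimuli a and b, plus an
    # interaction whose strength decays exponentially with time
    return 1.0 * a + 0.8 * b + 1.5 * a * b * np.exp(-t / 3.0)

# 2x2 factorial design: neither stimulus, A only, B only, both
y00, y10 = response(0, 0, t), response(1, 0, t)
y01, y11 = response(0, 1, t), response(1, 1, t)

# interaction contrast at each time point: non-additivity of the two stimuli
interaction = y11 - y10 - y01 + y00
```

The contrast starts at 1.5 and decays towards zero, mimicking an interaction effect that is only relevant within an early critical time window.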

Relevance: 90.00%

Abstract:

A simple technique is presented for improving the robustness of the n-tuple recognition method against inauspicious choices of architectural parameters, guarding against the saturation problem, and improving the utilisation of small data sets. Experiments are reported which confirm that the method significantly improves performance and reduces saturation in character recognition problems.
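To make the n-tuple method concrete, here is a bare-bones RAMnet-style classifier on invented binary data. The robustness technique the abstract describes is not reproduced; the tuple size, memory layout and noise model are all assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_tuples, n = 64, 16, 4          # input length, tuple count, tuple size
tuples = [rng.choice(n_bits, n, replace=False) for _ in range(n_tuples)]

def addresses(x):
    # each tuple reads n bits of the input and forms an integer address
    return [int("".join(map(str, x[idx])), 2) for idx in tuples]

def train(samples):
    # one set-based "RAM" per tuple records every address seen in training
    mem = [set() for _ in range(n_tuples)]
    for x in samples:
        for j, a in enumerate(addresses(x)):
            mem[j].add(a)
    return mem

def score(mem, x):
    # score = number of tuples whose address was seen during training
    return sum(a in mem[j] for j, a in enumerate(addresses(x)))

# two classes built from noisy copies of random binary prototypes
proto = [rng.integers(0, 2, n_bits) for _ in range(2)]

def noisy(p):
    x = p.copy()
    flips = rng.choice(n_bits, 4, replace=False)
    x[flips] ^= 1                        # flip a few bits
    return x

mems = [train([noisy(p) for _ in range(50)]) for p in proto]
test_x = noisy(proto[0])
scores = [score(m, test_x) for m in mems]
```

Saturation, which the paper's technique guards against, corresponds to the per-tuple memories filling up so that both classes score near the maximum.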

Relevance: 90.00%

Abstract:

Relationships between clustering, description length, and regularisation are pointed out, motivating the introduction of a cost function with a description length interpretation and the unusual and useful property of having its minimum approximated by the densest mode of a distribution. A simple inverse kinematics example is used to demonstrate that this property can be used to select and learn one branch of a multi-valued mapping. This property is also used to develop a method for setting regularisation parameters according to the scale on which structure is exhibited in the training data. The regularisation technique is demonstrated on two real data sets, a classification problem and a regression problem.

Relevance: 90.00%

Abstract:

Visualization has proven to be a powerful and widely applicable tool for the analysis and interpretation of data. Most visualization algorithms aim to find a projection from the data space down to a two-dimensional visualization space. However, for complex data sets living in a high-dimensional space, it is unlikely that a single two-dimensional projection can reveal all of the interesting structure. We therefore introduce a hierarchical visualization algorithm that allows the complete data set to be visualized at the top level, with clusters and sub-clusters of data points visualized at deeper levels. The algorithm is based on a hierarchical mixture of latent variable models, whose parameters are estimated using the expectation-maximization (EM) algorithm. We demonstrate the principle of the approach first on a toy data set, and then apply the algorithm to the visualization of a synthetic 12-dimensional data set obtained from a simulation of multi-phase flows in oil pipelines, and to 36-dimensional data derived from satellite images.
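The two-level structure can be sketched schematically: a top-level projection of the whole data set, then separate projections for each cluster. In this sketch plain PCA stands in for the latent variable models and a crude split stands in for the EM-fitted mixture; both are simplifications of the method described above, and the data are invented.

```python
import numpy as np

def pca2(X):
    """Project rows of X onto their top two principal components."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
# two well-separated clusters in a 5-dimensional space
A = rng.standard_normal((100, 5)) + np.array([5, 0, 0, 0, 0])
B = rng.standard_normal((100, 5)) - np.array([5, 0, 0, 0, 0])
X = np.vstack([A, B])

top = pca2(X)                              # top-level plot: whole data set
labels = (top[:, 0] < 0).astype(int)       # crude split along the first component
subplots = [pca2(X[labels == k]) for k in (0, 1)]   # one deeper plot per cluster
```

In the real system the split is soft (a probabilistic mixture) and each level uses its own latent variable model, but the pattern of "whole data set at the top, refined views below" is the same.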

Relevance: 90.00%

Abstract:

Multidimensional compound optimization is a new paradigm in the drug discovery process, yielding efficiencies during early stages and reducing attrition in the later stages of drug development. The success of this strategy relies heavily on understanding the resulting multidimensional data and extracting useful information from it. This paper demonstrates how principled visualization algorithms can be used to understand and explore a large data set created in the early stages of drug discovery. The experiments presented are performed on a real-world data set comprising biological activity data and some whole-molecule physicochemical properties. Data visualization is a popular way of presenting complex data in a simpler form. We have applied powerful principled visualization methods, such as generative topographic mapping (GTM) and hierarchical GTM (HGTM), to help domain experts (screening scientists, chemists, biologists, etc.) understand the data and draw meaningful conclusions. We also benchmark these principled methods against better-known visualization approaches, namely principal component analysis (PCA), Sammon's mapping, and self-organizing maps (SOMs), to demonstrate their power to help the user visualize the large multidimensional data sets encountered in the early stages of the drug discovery process. The results reported clearly show that the GTM and HGTM algorithms allow the user to cluster active compounds for different targets and to understand them better than the benchmark methods do. An interactive software tool supporting these visualization algorithms was provided to the domain experts. The tool lets them explore the projections produced by the visualization algorithms, providing facilities such as parallel coordinate plots, magnification factors, directional curvatures, and integration with industry-standard software. © 2006 American Chemical Society.

Relevance: 90.00%

Abstract:

Hierarchical visualization systems are desirable because a single two-dimensional visualization plot may not be sufficient to capture all of the interesting aspects of complex high-dimensional data sets. We extend an existing locally linear hierarchical visualization system, PhiVis [1], in several directions: (1) we allow for non-linear projection manifolds (the basic building block is the Generative Topographic Mapping, GTM); (2) we introduce a general formulation of hierarchical probabilistic models consisting of local probabilistic models organized in a hierarchical tree; (3) we describe folding patterns of the low-dimensional projection manifold in the high-dimensional data space by computing and visualizing the manifold's local directional curvatures. Quantities such as magnification factors [3] and directional curvatures are helpful for understanding the layout of the nonlinear projection manifold in the data space and for further refinement of the hierarchical visualization plot. Like PhiVis, our system is statistically principled and is built interactively in a top-down fashion using the EM algorithm. We demonstrate the principle of the approach on a complex 12-dimensional data set and mention possible applications in the pharmaceutical industry.

Relevance: 90.00%

Abstract:

An interactive hierarchical Generative Topographic Mapping (HGTM) visualisation system [HGTM] has been developed to visualise complex data sets. In this paper, we build a more general visualisation system by extending the HGTM system in three directions: (1) we generalise HGTM to noise models from the exponential family of distributions, with the Latent Trait Model (LTM) of [Kabanpami] as the basic building block; (2) we give the user a choice of initialising the child plots of the current plot in either interactive or automatic mode, where in the interactive mode the user selects "regions of interest" as in [HGTM], whereas in the automatic mode an unsupervised minimum message length (MML)-driven construction of a mixture of LTMs is employed; (3) we derive general formulas for magnification factors in latent trait models. Magnification factors are a useful tool for improving our understanding of the visualisation plots, since they can highlight the boundaries between data clusters. The unsupervised construction is particularly useful when high-level plots are covered with dense clusters of highly overlapping data projections, making the interactive mode difficult to use; such a situation often arises when visualising large data sets. We illustrate our approach on a toy example and apply our system to three more complex real data sets.
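The latent-trait-specific formulas derived in the paper are not reproduced here, but the general shape of a magnification factor for any smooth latent-to-data mapping is standard: it is the local volume expansion of the mapping, given by its Jacobian.

```latex
% General form of a local magnification factor for a smooth mapping
% f : latent space -> data space with Jacobian J. This is the standard
% differential-geometric expression, not the paper's LTM-specific result.
\[
  \mathrm{MF}(\mathbf{x}) \;=\; \sqrt{\det\!\left( J^{\mathsf{T}} J \right)},
  \qquad
  J_{ij} \;=\; \frac{\partial f_i(\mathbf{x})}{\partial x_j} .
\]
```

Regions where the magnification factor is large are being stretched by the projection manifold, which is why plotting it highlights boundaries between data clusters.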

Relevance: 90.00%

Abstract:

Very large spatially-referenced datasets, for example those derived from satellite-based sensors which sample across the globe or from large monitoring networks of individual sensors, are becoming increasingly common and more widely available for use in environmental decision making. In large or dense sensor networks, huge quantities of data can be collected over short time periods. In many applications the generation of maps, or predictions at specific locations, from the data in (near) real time is crucial. Geostatistical operations such as interpolation are vital in this map-generation process, and in emergency situations the resulting predictions need to be available almost instantly, so that decision makers can make informed decisions and define risk and evacuation zones. It is also helpful in less time-critical applications, for example when interacting directly with the data for exploratory analysis, for the algorithms to be responsive within a reasonable time frame. Performing geostatistical analysis on such large spatial datasets can present a number of problems, particularly in the case where maximum likelihood estimation is used. Although the storage requirements of the raw data only scale linearly with the number of observations, the computational complexity in terms of memory and speed scales quadratically and cubically, respectively. Most modern commodity hardware has at least two processor cores, and other mechanisms for parallel computation, such as Grid-based systems, are also becoming increasingly available. However, there currently seems to be little interest in exploiting this extra processing power within the context of geostatistics. In this paper we review the existing parallel approaches for geostatistics.
By recognising that different natural parallelisms exist and can be exploited depending on whether the dataset is sparsely or densely sampled with respect to the range of variation, we introduce two contrasting novel implementations of parallel algorithms based on approximating the data likelihood, extending the methods of Vecchia [1988] and Tresp [2000]. Using parallel maximum likelihood variogram estimation and parallel prediction algorithms, we show that computational time can be significantly reduced. We demonstrate this with both sparsely and densely sampled data on a variety of architectures, ranging from the common dual-core processor found in many modern desktop computers to large multi-node supercomputers. To highlight the strengths and weaknesses of the different methods we employ synthetic data sets, and go on to show how the methods allow maximum likelihood based inference on the exhaustive Walker Lake data set.
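The likelihood approximation that makes parallelism natural can be sketched as follows: partition the observations into blocks, treat the blocks as independent (a Vecchia-style simplification; the paper's actual methods are richer), and evaluate the per-block Gaussian log-likelihoods concurrently before summing. The kernel, its parameters and the use of a thread pool are assumptions for illustration; CPU-bound work would normally go to a process pool.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def cov(x, ell=1.0, s2=1.0, nugget=0.0025):
    # squared-exponential covariance; the nugget plays the role of
    # the observation noise variance
    d2 = (x[:, None] - x[None, :]) ** 2
    return s2 * np.exp(-0.5 * d2 / ell**2) + nugget * np.eye(len(x))

def block_loglik(args):
    """Gaussian log-likelihood of one block, evaluated independently."""
    x, y = args
    K = cov(x)
    _, logdet = np.linalg.slogdet(K)
    quad = y @ np.linalg.solve(K, y)
    return -0.5 * (logdet + quad + len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 400))
y = np.sin(x) + 0.05 * rng.standard_normal(400)

# four blocks of 100 points; each O(m^3) solve is independent of the others
blocks = [(x[i:i + 100], y[i:i + 100]) for i in range(0, 400, 100)]
with ThreadPoolExecutor() as pool:
    approx_ll = sum(pool.map(block_loglik, blocks))
```

Because each block requires only an m x m factorisation, the cubic cost applies per block rather than to the full dataset, and the blocks map directly onto processor cores.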

Relevance: 90.00%

Abstract:

Experimental investigations and computer modelling studies have been made on the refrigerant-water counterflow condenser section of a small air-to-water heat pump. The main object of the investigation was a comparative study between the computer modelling predictions and the experimental observations for a range of operating conditions, but other characteristics of a counterflow heat exchanger are also discussed. The counterflow condenser consisted of 15 metres of a thermally coupled pair of copper pipes, one containing the R12 working fluid and the other water flowing in the opposite direction. This condenser was mounted horizontally and folded into 0.5 metre straight sections. Thermocouples were inserted in both pipes at one metre intervals, and transducers for pressure and flow measurement were also included. Data acquisition, storage and analysis were carried out by a micro-computer suitably interfaced with the transducers and thermocouples. Many sets of readings were taken under a variety of conditions, with air temperature ranging from 18 to 26 degrees Celsius, water inlet temperature from 13.5 to 21.7 degrees, R12 inlet temperature from 61.2 to 81.7 degrees, and water mass flow rate from 6.7 to 32.9 grammes per second. A Fortran computer model of the condenser (originally prepared by Carrington [1]) has been modified to match the information available from the experimental work. This program uses iterative segmental integration over the desuperheating, mixed-phase and subcooled regions for the R12 working fluid, the water always being in the liquid phase. Methods of estimating the inlet and exit fluid conditions from the available experimental data have been developed for application to the model.
Temperature profiles and other parameters have been predicted and compared with experimental values for the condenser for a range of evaporator conditions and have shown that the model gives a satisfactory prediction of the physical behaviour of a simple counterflow heat exchanger in both single phase and two phase regions.

Relevance: 90.00%

Abstract:

Principal components analysis/factor analysis (PCA/FA) is a method of analyzing complex data sets in which there are no clearly defined X or Y variables. It has multiple uses, including the study of the pattern of variation between individual entities, such as patients with particular disorders, and the detailed study of descriptive variables. In most applications, variables are related to a smaller number of ‘factors’ or PCs that account for the maximum variance in the data and hence may explain important trends among the variables. An increasingly important application of the method is in the ‘validation’ of questionnaires that attempt to relate subjective aspects of a patient's experience to more objective measures of vision.
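The idea of relating many variables to a smaller number of components can be shown numerically. In this invented example, five observed variables all load on a single underlying factor plus independent noise, and the first principal component duly accounts for most of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
factor = rng.standard_normal(n)            # one underlying 'factor'
# five observed variables: all load on the factor, plus independent noise
loadings = np.array([1.0, 0.9, 0.8, 0.7, 0.6])
X = np.outer(factor, loadings) + 0.3 * rng.standard_normal((n, 5))

Xc = X - X.mean(0)                         # centre the variables
covmat = Xc.T @ Xc / (n - 1)               # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(covmat)  # eigvals in ascending order
explained = eigvals[::-1] / eigvals.sum()  # variance share per PC, descending
```

Inspecting `explained` shows one dominant component, exactly the situation in which a handful of PCs can summarise the trends among many correlated variables.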

Relevance: 90.00%

Abstract:

In previous sea-surface variability studies, researchers have failed to utilise the full ERS-1 mission due to the varying orbital characteristics in each mission phase, and most have simply ignored the Ice and Geodetic phases. This project aims to introduce a technique which allows the straightforward use of all orbital phases, regardless of orbit type. This technique is based upon single-satellite crossovers. Unfortunately, the ERS-1 orbital height is still poorly resolved (due to higher air drag and stronger gravitational effects) when compared with that of TOPEX/Poseidon (T/P), so, to make best use of the ERS-1 crossover data, corrections to the ERS-1 orbital heights are calculated by fitting a cubic spline to dual-crossover residuals with T/P. This correction is validated by comparison of dual-satellite crossovers with tide gauge data. The crossover processing technique is validated by comparing the extracted sea-surface variability information with that from T/P repeat-pass data. The two data sets are then combined into a single consistent data set for analysis of sea-surface variability patterns. These patterns are simplified by the use of an empirical orthogonal function decomposition, which breaks the signals into spatial modes that are then discussed separately. Further studies carried out on these data include an analysis of the characteristics of the annual signal, discussion of evidence for Rossby wave propagation on a global basis, and finally analysis of the evidence for global mean sea level rise.
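An empirical orthogonal function (EOF) decomposition of the kind used above reduces, in practice, to a singular value decomposition of the space-time anomaly matrix. The "field" below is a toy space-time matrix with one dominant standing mode plus noise; the sizes and the annual-like cycle are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
nt, nx = 120, 50                               # time steps, spatial points
space = np.sin(np.linspace(0, np.pi, nx))      # one spatial pattern
time = np.cos(2 * np.pi * np.arange(nt) / 12)  # annual-like cycle
field = np.outer(time, space) + 0.1 * rng.standard_normal((nt, nx))

anom = field - field.mean(0)                   # anomalies about the time mean
U, s, Vt = np.linalg.svd(anom, full_matrices=False)

var_frac = s**2 / (s**2).sum()                 # variance explained per mode
eof1 = Vt[0]                                   # leading spatial mode
pc1 = U[:, 0] * s[0]                           # its time-varying amplitude
```

The leading mode recovers the planted spatial pattern and its annual amplitude, and `var_frac` shows how strongly the variability is concentrated in the first few modes, which is what justifies discussing each spatial mode separately.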

Relevance: 90.00%

Abstract:

Purpose – The purpose of this editorial is to stimulate debate and discussion amongst marketing scholars regarding the implications for scientific research of increasingly large amounts of data and sophisticated data analytic techniques. Design/methodology/approach – The authors respond to a recent editorial in WIRED magazine which heralds the demise of the scientific method in the face of the vast data sets now available. Findings – The authors propose that more data makes theory more important, not less. They differentiate between raw prediction and scientific knowledge, which is aimed at explanation. Research limitations/implications – These thoughts are preliminary and intended to spark thinking and debate, not to represent editorial policy. Due to space constraints, the coverage of many issues is necessarily brief. Practical implications – Marketing researchers should find these thoughts at the very least stimulating, and may wish to investigate these issues further. Originality/value – This piece should provide some interesting food for thought for marketing researchers.

Relevance: 90.00%

Abstract:

In this paper, we discuss some practical implications for implementing adaptable network algorithms applied to non-stationary time series problems. Two real world data sets, containing electricity load demands and foreign exchange market prices, are used to test several different methods, ranging from linear models with fixed parameters, to non-linear models which adapt both parameters and model order on-line. Training with the extended Kalman filter, we demonstrate that the dynamic model-order increment procedure of the resource allocating RBF network (RAN) is highly sensitive to the parameters of the novelty criterion. We investigate the use of system noise for increasing the plasticity of the Kalman filter training algorithm, and discuss the consequences for on-line model order selection. The results of our experiments show that there are advantages to be gained in tracking real world non-stationary data through the use of more complex adaptive models.
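The role of system noise in keeping a Kalman-filter-trained model plastic can be shown on a minimal non-stationary problem. The model here is a single drifting weight rather than an RBF network, and all values are invented: with zero process noise the filter's gain collapses and the estimate freezes, while a small non-zero process noise lets it keep tracking.

```python
import numpy as np

rng = np.random.default_rng(0)
T, R, Q = 2000, 0.1**2, 1e-4               # steps, observation noise, system noise
w_true = 1.0 + np.cumsum(0.01 * rng.standard_normal(T))   # drifting weight
x = rng.standard_normal(T)
y = w_true * x + np.sqrt(R) * rng.standard_normal(T)

def kalman_track(q):
    """Scalar Kalman filter for y_t = w_t * x_t + noise, with process noise q."""
    w, P, est = 0.0, 1.0, np.empty(T)
    for t in range(T):
        P += q                              # system noise inflates uncertainty
        k = P * x[t] / (x[t] ** 2 * P + R)  # Kalman gain
        w += k * (y[t] - w * x[t])          # innovation-driven update
        P *= 1 - k * x[t]
        est[t] = w
    return est

adaptive = kalman_track(Q)                  # keeps tracking the drift
frozen = kalman_track(0.0)                  # converges once, then stops adapting
err_adaptive = np.mean((adaptive[200:] - w_true[200:]) ** 2)
err_frozen = np.mean((frozen[200:] - w_true[200:]) ** 2)
```

Comparing the two tracking errors shows the advantage of injecting system noise when the underlying process is non-stationary, which is the mechanism the paper investigates for on-line model adaptation.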