Digital Library

971 results for Aggregated data

An empirical examination of the consequences of national pride: Analyses of survey and experimental data

Relevance:

30.00% 30.00%

Publisher:

Abstract:

National pride is both an important and understudied topic with respect to economic behaviour, hence this thesis investigates whether: 1) there is a "light" side of national pride through increased compliance, and a "dark" side linked to exclusion; 2) successful priming of national pride is linked to increased tax compliance; and 3) East German post-reunification outmigration is related to loyalty. The project comprises three related empirical studies, analysing evidence from a large, aggregated, international survey dataset; a tax compliance laboratory experiment combining psychological priming with measurement of heart rate variability; and data collected after the fall of the Berlin Wall (a situation approximating a natural experiment).

Information accountability and health big data analytics: A consent-based model

Relevance:

30.00% 30.00%

Publisher:

Abstract:

With the ever-increasing amount of eHealth data available from various eHealth systems and sources, Health Big Data Analytics promises enticing benefits such as enabling the discovery of new treatment options and improved decision making. However, concerns over the privacy of information have hindered the aggregation of this information. To address these concerns, we propose the use of Information Accountability protocols to give patients the ability to decide how and when their data can be shared and aggregated for use in big data research. In this paper, we discuss the issues surrounding Health Big Data Analytics and propose a consent-based model that addresses privacy concerns and so helps achieve the promised benefits of Big Data in eHealth.

Region Based Structure Layout Optimization by Selective Data Copying

Relevance:

30.00% 30.00%

Publisher:

Abstract:

As the gap between processor and memory continues to grow, memory performance becomes a key bottleneck for many applications. Compilers therefore increasingly seek to modify an application's data layout to improve cache locality and cache reuse. Whole Program Structure Layout (WPSL) transformations can significantly increase the spatial locality of data and reduce the runtime of programs that use link-based data structures by increasing cache line utilization. However, in production compilers WPSL transformations do not realize their full performance potential, owing to a number of factors. Structure layout decisions made on the basis of whole-program aggregated affinity/hotness of structure fields can be suboptimal for local code regions. WPSL is also restricted in applicability in production compilers for type-unsafe languages like C/C++ because of the extensive legality checks and field-sensitive pointer analysis required over the entire application. To overcome the issues associated with WPSL, we propose a Region Based Structure Layout (RBSL) optimization framework that uses selective data copying. We describe our RBSL framework, implemented in the production compiler for C/C++ on HP-UX IA-64. We show that, acting as a complement to the existing and mature WPSL transformation framework in our compiler, RBSL improves application performance on pointer-intensive SPEC benchmarks by 3% to 28% over WPSL.

Robust Sketching and Aggregation of Distributed Data Streams

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The data streaming model provides an attractive framework for one-pass summarization of massive data sets at a single observation point. However, in an environment where multiple data streams arrive at a set of distributed observation points, sketches must be computed remotely and then must be aggregated through a hierarchy before queries may be conducted. As a result, many sketch-based methods for the single stream case do not apply directly, as either the error introduced becomes large, or because the methods assume that the streams are non-overlapping. These limitations hinder the application of these techniques to practical problems in network traffic monitoring and aggregation in sensor networks. To address this, we develop a general framework for evaluating and enabling robust computation of duplicate-sensitive aggregate functions (e.g., SUM and QUANTILE), over data produced by distributed sources. We instantiate our approach by augmenting the Count-Min and Quantile-Digest sketches to apply in this distributed setting, and analyze their performance. We conclude with experimental evaluation to validate our analysis.
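To make the aggregation step concrete, here is a minimal sketch of a plain Count-Min sketch whose per-site tables are merged by element-wise addition. It is not the augmented, duplicate-robust variant the abstract proposes. The class name, the default `width` and `depth`, and the BLAKE2b-based hashing are all illustrative assumptions. Note that a plain merge double-counts any item observed at more than one site, which is the overlap problem the abstract describes.

```python
import hashlib


class CountMinSketch:
    """Plain Count-Min sketch (illustrative; not the paper's augmented variant)."""

    def __init__(self, width=272, depth=5):
        # width and depth are hypothetical defaults chosen for the example.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # The minimum over rows never underestimates the true count.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )

    def merge(self, other):
        # Sketches built with the same dimensions and hashes merge by addition.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        merged = CountMinSketch(self.width, self.depth)
        for row in range(self.depth):
            merged.table[row] = [
                a + b for a, b in zip(self.table[row], other.table[row])
            ]
        return merged


# Two observation points summarize their own streams, then a parent merges them.
site_a = CountMinSketch()
site_b = CountMinSketch()
site_a.add("10.0.0.1", 3)
site_b.add("10.0.0.1", 2)
combined = site_a.merge(site_b)
print(combined.estimate("10.0.0.1"))
```

If both sites had observed the same three packets, the merged estimate would still add them up, so a SUM over overlapping streams is inflated unless the sketch is made robust to duplicates.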

Real-Time and Data-Driven Operation Optimization and Knowledge Discovery for an Enterprise Information System

Relevance:

30.00% 30.00%

Publisher:

Abstract:

An enterprise information system (EIS) is an integrated data-applications platform characterized by diverse, heterogeneous, and distributed data sources. For many enterprises, a number of business processes still depend heavily on static rule-based methods and extensive human expertise. Enterprises are faced with the need for optimizing operation scheduling, improving resource utilization, discovering useful knowledge, and making data-driven decisions.

This thesis research is focused on real-time optimization and knowledge discovery that addresses workflow optimization, resource allocation, as well as data-driven predictions of process-execution times, order fulfillment, and enterprise service-level performance. In contrast to prior work on data analytics techniques for enterprise performance optimization, the emphasis here is on realizing scalable and real-time enterprise intelligence based on a combination of heterogeneous system simulation, combinatorial optimization, machine-learning algorithms, and statistical methods.

On-demand digital-print service is a representative enterprise requiring a powerful EIS. We use real-life data from Reischling Press, Inc. (RPI), a digital-print-service provider (PSP), to evaluate our optimization algorithms.

In order to handle the increase in volume and diversity of demands, we first present a high-performance, scalable, and real-time production scheduling algorithm for production automation based on an incremental genetic algorithm (IGA). The objective of this algorithm is to optimize the order dispatching sequence and balance resource utilization. Compared to prior work, this solution is scalable for a high volume of orders and it provides fast scheduling solutions for orders that require complex fulfillment procedures. Experimental results highlight its potential benefit in reducing production inefficiencies and enhancing the productivity of an enterprise.

We next discuss analysis and prediction of different attributes involved in hierarchical components of an enterprise. We start from a study of the fundamental processes related to real-time prediction. Our process-execution time and process status prediction models integrate statistical methods with machine-learning algorithms. In addition to improving prediction accuracy compared to stand-alone machine-learning algorithms, they also provide a probabilistic estimate of the predicted status. An order generally consists of multiple series and parallel processes. We next introduce an order-fulfillment prediction model that combines the advantages of multiple classification models by incorporating flexible decision-integration mechanisms. Experimental results show that adopting due dates recommended by the model can significantly reduce the enterprise late-delivery ratio. Finally, we investigate service-level attributes that reflect the overall performance of an enterprise. We analyze and decompose time-series data into different components according to their hierarchical periodic nature, perform correlation analysis, and develop univariate prediction models for each component as well as multivariate models for correlated components. Predictions for the original time series are aggregated from the predictions of its components. In addition to a significant increase in mid-term prediction accuracy, this distributed modeling strategy also improves short-term time-series prediction accuracy.

In summary, this thesis research has led to a set of characterization, optimization, and prediction tools for an EIS to derive insightful knowledge from data and use them as guidance for production management. It is expected to provide solutions for enterprises to increase reconfigurability, accomplish more automated procedures, and obtain data-driven recommendations or effective decisions.

SERS enhancement by aggregated Au colloids: effect of particle size

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Aggregated Au colloids have been widely used as SERS enhancing media for many years but to date there has been no systematic investigation of the effect of the particle size on the enhancements given by simple aggregated Au colloid solutions. Previous systematic studies on isolated particles in solution or multiple particles deposited onto surfaces reported widely different optimum particle sizes for the same excitation wavelength and also disagreed on the extent to which surface plasmon absorption spectra were a good predictor of enhancement factors. In this work the spectroscopic properties of a range of samples of monodisperse Au colloids with diameters ranging from 21 to 146 nm have been investigated in solution. The UV/visible absorption spectra of the colloids show complex changes as a function of aggregating salt (MgSO4) concentration which diminish when the colloid is fully aggregated. Under these conditions, the relative SERS enhancements provided by the variously sized colloids vary very significantly across the size range. The largest signals in the raw data are observed for 46 nm colloids but correction for the total surface area available to generate enhancement shows that particles with 74 nm diameter give the largest enhancement per unit surface area. The observed enhancements do not correlate with absorbance at the excitation wavelength but the large differences between differently sized colloids demonstrate that even in the randomly aggregated particle assemblies studied here, inhomogeneous broadening does not mask the underlying changes due to differences in particle diameter.

A Lightweight Tool for Anomaly Detection in Cloud Data Centres

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Cloud data centres are critical business infrastructures and the fastest growing service providers. Detecting anomalies in Cloud data centre operation is vital. Given the vast complexity of the data centre system software stack, applications and workloads, anomaly detection is a challenging endeavour. Current tools for detecting anomalies often use machine learning techniques, application instance behaviours or system metrics distribution, which are complex to implement in Cloud computing environments as they require training, access to application-level data and complex processing. This paper presents LADT, a lightweight anomaly detection tool for Cloud data centres that uses rigorous correlation of system metrics, implemented by an efficient correlation algorithm without need for training or complex infrastructure set up. LADT is based on the hypothesis that, in an anomaly-free system, metrics from data centre host nodes and virtual machines (VMs) are strongly correlated. An anomaly is detected whenever correlation drops below a threshold value. We demonstrate and evaluate LADT using a Cloud environment, where it shows that the hosting node I/O operations per second (IOPS) are strongly correlated with the aggregated virtual machine IOPS, but this correlation vanishes when an application stresses the disk, indicating a node-level anomaly.
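The correlation test described in this abstract can be illustrated with a short sketch. This is not LADT's actual implementation: the Pearson coefficient is assumed as the correlation measure, and the function names, the sample window and the threshold of 0.8 are hypothetical values chosen for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def is_anomalous(host_iops, per_vm_iops, threshold=0.8):
    """Flag a window when host IOPS stop tracking the summed VM IOPS.

    threshold=0.8 is an assumed value, not one taken from the paper.
    """
    # Aggregate the VM metric by summing across VMs at each time step.
    aggregated = [sum(sample) for sample in zip(*per_vm_iops)]
    return pearson(host_iops, aggregated) < threshold


# Healthy window: the host metric is exactly the sum of the two VMs.
host = [100, 200, 150, 300, 250]
vms = [[40, 90, 70, 140, 120], [60, 110, 80, 160, 130]]
print(is_anomalous(host, vms))

# Stressed window: the host metric no longer follows the VM aggregate.
vms_stressed = [[120, 40, 130, 30, 125], [80, 60, 80, 60, 80]]
print(is_anomalous(host, vms_stressed))
```

Because the check needs only a sliding window of two metrics, it requires no training phase, which matches the lightweight design goal stated in the abstract.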

Supervised Aggregative Feature Extraction for Big Data Time Series Regression

Relevance:

30.00% 30.00%

Publisher:

Abstract:

In many applications, and especially those where batch processes are involved, a target scalar output of interest is often dependent on one or more time series of data. With the exponential growth in data logging in modern industries such time series are increasingly available for statistical modeling in soft sensing applications. In order to exploit time series data for predictive modelling, it is necessary to summarise the information they contain as a set of features to use as model regressors. Typically this is done in an unsupervised fashion using simple techniques such as computing statistical moments, principal components or wavelet decompositions, often leading to significant information loss and hence suboptimal predictive models. In this paper, a functional learning paradigm is exploited in a supervised fashion to derive continuous, smooth estimates of time series data (yielding aggregated local information), while simultaneously estimating a continuous shape function yielding optimal predictions. The proposed Supervised Aggregative Feature Extraction (SAFE) methodology can be extended to support nonlinear predictive models by embedding the functional learning framework in a Reproducing Kernel Hilbert Spaces setting. SAFE has a number of attractive features including closed form solution and the ability to explicitly incorporate first and second order derivative information. Using simulation studies and a practical semiconductor manufacturing case study we highlight the strengths of the new methodology with respect to standard unsupervised feature extraction approaches.

Guest reputation indexes to analyze hotel’s online reputation using data extracted from OTAs

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Nowadays many travelers use online travel agencies (OTAs) to book flights, hotel rooms, rental cars, cruises or entire vacation packages. OTAs usually allow their users to give scores and write reviews about the services they used. Each OTA defines its own terms and conditions for guest ratings or review scores, and hoteliers are giving increasing importance to the scores and reviews their guests leave on OTAs. This paper proposes two guest reputation indexes to help hoteliers monitor their presence on OTAs. The first, the Aggregated Guest Reputation Index (AGRI), shows the positioning of a hotel across different OTAs and is calculated from the scores obtained by the hotel on those OTAs. The second, the Semantic Guest Reputation Index (SGRI), incorporates the social reputation of a hotel and can be visualized through word clouds or tag clouds. Examples of the use of these indexes are given with data extracted from 5-star hotels in the Algarve, the southern region of Portugal, that are available on Booking and Expedia.

Comuns: An Open-Data Provider, Explorer and Analytic Toolbox Based on FOSS

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Contains abstract

Efficient estimation using the characteristic function : theory and applications with high frequency data

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The attached file is created with Scientific Workplace Latex

Stochastic modelling of rainfall from satellite data

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Satellite-based rainfall monitoring is widely used for climatological studies because of its full global coverage, but it is also of great importance for operational purposes, especially in areas such as Africa where there is a lack of ground-based rainfall data. Satellite rainfall estimates have enormous potential benefits as input to hydrological and agricultural models because of their real-time availability, low cost and full spatial coverage. One issue that needs to be addressed is the uncertainty on these estimates. This is particularly important in assessing the likely errors on the output from non-linear models (rainfall-runoff or crop yield) which make use of the rainfall estimates, aggregated over an area, as input. Correct assessment of the uncertainty on the rainfall is non-trivial as it must take account of:
• the difference in spatial support of the satellite information and independent data used for calibration
• uncertainties on the independent calibration data
• the non-Gaussian distribution of rainfall amount
• the spatial intermittency of rainfall
• the spatial correlation of the rainfall field
This paper describes a method for estimating the uncertainty on satellite-based rainfall values taking account of these factors. The method involves firstly a stochastic calibration which completely describes the probability of rainfall occurrence and the pdf of rainfall amount for a given satellite value, and secondly the generation of an ensemble of rainfall fields based on the stochastic calibration but with the correct spatial correlation structure within each ensemble member. This is achieved by the use of geostatistical sequential simulation. The ensemble generated in this way may be used to estimate uncertainty at larger spatial scales. A case study of daily rainfall monitoring in The Gambia, West Africa, for the purpose of crop yield forecasting is presented to illustrate the method.

Determining the effect of asymmetric data on the variogram. II. Outliers

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Asymmetry in a distribution can arise from a long tail of values in the underlying process or from outliers that belong to another population that contaminate the primary process. The first paper of this series examined the effects of the former on the variogram and this paper examines the effects of asymmetry arising from outliers. Simulated annealing was used to create normally distributed random fields of different size that are realizations of known processes described by variograms with different nugget:sill ratios. These primary data sets were then contaminated with randomly located and spatially aggregated outliers from a secondary process to produce different degrees of asymmetry. Experimental variograms were computed from these data by Matheron's estimator and by three robust estimators. The effects of standard data transformations on the coefficient of skewness and on the variogram were also investigated. Cross-validation was used to assess the performance of models fitted to experimental variograms computed from a range of data contaminated by outliers for kriging. The results showed that where skewness was caused by outliers the variograms retained their general shape, but showed an increase in the nugget and sill variances and nugget:sill ratios. This effect was only slightly more for the smallest data set than for the two larger data sets and there was little difference between the results for the latter. Overall, the effect of size of data set was small for all analyses. The nugget:sill ratio showed a consistent decrease after transformation to both square roots and logarithms; the decrease was generally larger for the latter, however. Aggregated outliers had different effects on the variogram shape from those that were randomly located, and this also depended on whether they were aggregated near to the edge or the centre of the field. 
The results of cross-validation showed that the robust estimators and the removal of outliers were the most effective ways of dealing with outliers for variogram estimation and kriging. (C) 2007 Elsevier Ltd. All rights reserved.
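As an illustration of why outliers inflate the experimental variogram, the sketch below computes Matheron's method-of-moments estimator, gamma(h) = (1 / (2 N(h))) * sum of (z(x_i) - z(x_i + h))^2 over the N(h) pairs separated by lag h. It assumes a regularly spaced one-dimensional transect rather than the two-dimensional random fields used in the study. The function name and the sample values are hypothetical.

```python
def matheron_variogram(values, max_lag):
    """Matheron's estimator on a regularly spaced 1-D transect.

    Returns a dict mapping each integer lag h to the semivariance gamma(h).
    """
    gamma = {}
    for lag in range(1, max_lag + 1):
        # Squared differences between all pairs separated by this lag.
        squared_diffs = [
            (values[i] - values[i + lag]) ** 2
            for i in range(len(values) - lag)
        ]
        if squared_diffs:
            gamma[lag] = sum(squared_diffs) / (2 * len(squared_diffs))
    return gamma


# A smooth transect, and the same transect with one contaminating outlier.
clean = [1, 2, 3, 4, 5]
contaminated = [1, 2, 3, 40, 5]

print(matheron_variogram(clean, 2))
print(matheron_variogram(contaminated, 2))
```

Because the differences are squared, a single extreme value dominates every pair it takes part in, which is why the abstract compares Matheron's estimator with robust alternatives.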

Using reanalysis data to quantify extreme wind power generation statistics : a 33 year case study in Great Britain

Relevance:

30.00% 30.00%

Publisher:

Abstract:

With a rapidly increasing fraction of electricity generation being sourced from wind, extreme wind power generation events such as prolonged periods of low (or high) generation and ramps in generation, are a growing concern for the efficient and secure operation of national power systems. As extreme events occur infrequently, long and reliable meteorological records are required to accurately estimate their characteristics. Recent publications have begun to investigate the use of global meteorological “reanalysis” data sets for power system applications, many of which focus on long-term average statistics such as monthly-mean generation. Here we demonstrate that reanalysis data can also be used to estimate the frequency of relatively short-lived extreme events (including ramping on sub-daily time scales). Verification against 328 surface observation stations across the United Kingdom suggests that near-surface wind variability over spatiotemporal scales greater than around 300 km and 6 h can be faithfully reproduced using reanalysis, with no need for costly dynamical downscaling. A case study is presented in which a state-of-the-art, 33 year reanalysis data set (MERRA, from NASA-GMAO), is used to construct an hourly time series of nationally-aggregated wind power generation in Great Britain (GB), assuming a fixed, modern distribution of wind farms. The resultant generation estimates are highly correlated with recorded data from National Grid in the recent period, both for instantaneous hourly values and for variability over time intervals greater than around 6 h. This 33 year time series is then used to quantify the frequency with which different extreme GB-wide wind power generation events occur, as well as their seasonal and inter-annual variability. 
Several novel insights into the nature of extreme wind power generation events are described, including (i) that the number of prolonged low or high generation events is well approximated by a Poisson-like random process, and (ii) that, whilst in general there is large seasonal variability, the magnitude of the most extreme ramps is similar in both summer and winter. An up-to-date version of the GB case study data as well as the underlying model are freely available for download from our website: http://www.met.reading.ac.uk/~energymet/data/Cannon2014/.

Validation of Canopy Height Profile methodology for small-footprint full-waveform airborne LiDAR data in a discontinuous canopy environment

Relevance:

30.00% 30.00%

Publisher:

Abstract:

A Canopy Height Profile (CHP) procedure presented in Harding et al. (2001) for large-footprint LiDAR data was tested in a closed canopy environment as a way of extracting vertical foliage profiles from LiDAR raw waveforms. In this study, an adaptation of this method to small-footprint data has been shown, tested and validated in an Australian sparse canopy forest at plot- and site-level. Further, the methodology itself has been enhanced by implementing a dataset-adjusted reflectance ratio calculation according to Armston et al. (2013) in the processing chain, and tested against a fixed ratio of 0.5 estimated for the laser wavelength of 1550 nm. As a by-product of the methodology, effective leaf area index (LAIe) estimates were derived and compared to hemispherical photography-derived values. To assess the influence of LiDAR aggregation area size on the estimates in a sparse canopy environment, LiDAR CHPs and LAIes were generated by aggregating waveforms to plot- and site-level footprints (plot/site-aggregated) as well as in 5 m grids (grid-processed). LiDAR profiles were then compared to leaf biomass field profiles generated from field tree measurements. The correlation between field and LiDAR profiles was very high, with a mean R² of 0.75 at plot-level and 0.86 at site-level for 55 plots and the corresponding 11 sites. Gridding had almost no impact on the correlation between LiDAR and field profiles (only a marginal improvement), nor did the dataset-adjusted reflectance ratio. However, gridding and the dataset-adjusted reflectance ratio were found to improve the correlation between raw-waveform LiDAR and hemispherical photography LAIe estimates, yielding the highest correlations of 0.61 at plot-level and 0.83 at site-level. This proved the validity of the approach and the superiority of the dataset-adjusted reflectance ratio of Armston et al. (2013) over a fixed ratio of 0.5 for LAIe estimation, and showed the adequacy of small-footprint LiDAR data for LAIe estimation in discontinuous canopy forests.
