980 results for "Data errors"
Abstract:
Analytical curves are normally obtained from discrete data by least squares regression. The least squares regression of data involving significant error in both the x and y values should not be carried out by ordinary least squares (OLS). In this work, the use of orthogonal distance regression (ODR) is discussed as an alternative approach that takes the error in the x variable into account. Four examples are presented to illustrate the deviations between the results of the two regression methods. The examples studied show that, in some situations, the ODR coefficients must replace those of OLS, while in others the difference is not significant.
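As a rough, hedged illustration of the OLS/ODR comparison (not the paper's own code), the sketch below fits the same simulated straight-line data both ways with SciPy; the data, noise levels, and linear model are assumptions.

```python
import numpy as np
from scipy import odr
from scipy.stats import linregress

# Hypothetical calibration data with errors in both x and y.
rng = np.random.default_rng(0)
x_true = np.linspace(0.0, 10.0, 12)
y_true = 2.0 * x_true + 1.0
sx, sy = 0.3, 0.5                      # assumed standard errors in x and y
x = x_true + rng.normal(0.0, sx, x_true.size)
y = y_true + rng.normal(0.0, sy, y_true.size)

# Ordinary least squares (errors assumed only in y).
ols = linregress(x, y)

# Orthogonal distance regression (errors in both variables).
model = odr.Model(lambda beta, x: beta[0] * x + beta[1])
data = odr.RealData(x, y, sx=sx, sy=sy)
fit = odr.ODR(data, model, beta0=[ols.slope, ols.intercept]).run()

print("OLS slope/intercept:", ols.slope, ols.intercept)
print("ODR slope/intercept:", fit.beta)
```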
Abstract:
Thermal and air conditions inside animal facilities change during the day due to the influence of the external environment. For statistical and geostatistical analyses to be representative, a large number of points spatially distributed over the facility area must be monitored. This work suggests that the time variation of environmental variables of interest for animal production, monitored within an animal facility, can be modeled accurately from discrete-time records. The aim of this study was to develop a numerical method to correct for the temporal variation of these environmental variables, transforming the data so that the observations become independent of the time spent on the measurement. The proposed method brings values recorded with a time delay close to those that would be expected at the exact moment of interest, as if the data had been measured simultaneously at all spatially distributed points. The correction model was validated for the air temperature variable, and the values corrected by the method did not differ, by Tukey's test at 5% significance, from the actual values recorded by data loggers.
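The abstract does not spell out the correction formula; the sketch below shows one plausible realisation of the idea, in which readings taken at delayed instants are shifted to a common reference time using the drift recorded by a fixed data logger. All names and numbers are hypothetical.

```python
import numpy as np

# Continuous record from a fixed data logger (hypothetical), used to estimate
# the temporal drift of air temperature during the measurement walk.
logger_t = np.arange(0.0, 3600.0, 60.0)                 # s
logger_T = 24.0 + 0.0008 * logger_t                     # degC, slowly warming

def correct_reading(value, t_measured, t_ref, logger_t, logger_T):
    """Shift a reading taken at t_measured to the reference instant t_ref
    using the drift observed by the fixed logger.  One plausible form of
    the correction described above, not the authors' exact model."""
    drift = np.interp(t_ref, logger_t, logger_T) - np.interp(t_measured, logger_t, logger_T)
    return value + drift

# A point measured 20 minutes after the reference instant:
print(correct_reading(25.2, t_measured=1800.0, t_ref=600.0,
                      logger_t=logger_t, logger_T=logger_T))
```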
Abstract:
Longitudinal surveys are increasingly used to collect event history data on person-specific processes such as transitions between labour market states. Survey-based event history data pose a number of challenges for statistical analysis. These challenges include survey errors due to sampling, non-response, attrition and measurement. This study deals with non-response, attrition and measurement errors in event history data and the bias they cause in event history analysis. The study also discusses some choices faced by a researcher using longitudinal survey data for event history analysis and demonstrates their effects. These choices include whether a design-based or a model-based approach is taken, which subset of the data to use and, if a design-based approach is taken, which weights to use. The study takes advantage of the possibility of using combined longitudinal survey and register data. The Finnish subset of the European Community Household Panel (FI ECHP) survey for waves 1–5 was linked at the person level with longitudinal register data. Unemployment spells were used as the study variables of interest. Finally, a simulation study was conducted in order to assess the statistical properties of the Inverse Probability of Censoring Weighting (IPCW) method in a survey data context. The study shows how combined longitudinal survey and register data can be used to analyse and compare the non-response and attrition processes, test the type of missingness mechanism and estimate the size of the bias due to non-response and attrition. In our empirical analysis, initial non-response turned out to be a more important source of bias than attrition. Reported unemployment spells were subject to seam effects, omissions and, to a lesser extent, overreporting. The use of proxy interviews tended to cause spell omissions. An often-ignored phenomenon, classification error in reported spell outcomes, was also found in the data. Neither the Missing At Random (MAR) assumption about the non-response and attrition mechanisms, nor the classical assumptions about measurement errors, turned out to be valid. Measurement errors in both spell durations and spell outcomes were found to cause bias in estimates from event history models. Low measurement accuracy affected the estimates of the baseline hazard most. The design-based estimates based on data from respondents to all waves of interest and weighted by the last-wave weights displayed the largest bias. Using all the available data, including the spells of attriters until the time of attrition, helped to reduce attrition bias. Lastly, the simulation study showed that the IPCW correction to the design weights reduces bias due to dependent censoring in design-based Kaplan-Meier and Cox proportional hazards model estimators. The study discusses the implications of the results for survey organisations collecting event history data, researchers using surveys for event history analysis, and researchers who develop methods to correct for non-sampling biases in event history data.
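As a hedged sketch of the IPCW idea in survival analysis (not the thesis' implementation), the snippet below implements a weighted Kaplan-Meier estimator; in practice the weights would come from a fitted model of the dropout (censoring) process, and the spell data here are invented.

```python
import numpy as np

def weighted_kaplan_meier(time, event, weight):
    """Kaplan-Meier survival estimate with subject-level weights.
    With weights set to the inverse of each subject's estimated probability
    of remaining in the panel (e.g. from a censoring model), this becomes an
    IPCW-corrected estimator; a simplified sketch, not the thesis' code."""
    order = np.argsort(time)
    time, event, weight = time[order], event[order], weight[order]
    surv, t_out, s = [], [], 1.0
    at_risk = weight.sum()
    for t in np.unique(time):
        in_t = time == t
        d = weight[in_t & (event == 1)].sum()     # weighted events at t
        if at_risk > 0 and d > 0:
            s *= 1.0 - d / at_risk
        t_out.append(t)
        surv.append(s)
        at_risk -= weight[in_t].sum()             # remove everyone leaving at t
    return np.array(t_out), np.array(surv)

# Hypothetical unemployment-spell data: durations (months), event indicator
# (1 = spell ended, 0 = censored by attrition) and inverse-probability weights.
t = np.array([2, 5, 5, 8, 12, 12, 15])
e = np.array([1, 1, 0, 1, 0, 1, 1])
w = np.array([1.0, 1.1, 1.3, 1.0, 1.6, 1.2, 1.0])
print(weighted_kaplan_meier(t, e, w))
```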
Abstract:
Extensive literature shows that analysts' forecasts and recommendations are often biased. It is therefore important for the financial market to be able to recognize this bias in order to value public companies correctly. This thesis uses the characteristic approach, introduced by So (2013, pp. 615-640), to forecast analysts' forecast errors and tests whether the predictable forecast error is fully incorporated into share prices. Data are collected on listed Finnish companies. The timeframe spans ten years, from 2004 to 2013, comprising 788 firm-years. Although there is earlier evidence that the characteristic approach can predict analysts' forecast errors, no support for this is found in the Finnish market. This thesis contributes to current knowledge by showing that the characteristic approach does not work universally as such but requires further development, especially for smaller markets.
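A minimal, hedged sketch of the characteristic approach as it is usually described: regress realized analyst forecast errors on lagged firm characteristics and use the fitted coefficients to predict the next period's error. The characteristics, sample and coefficients below are hypothetical, not those of the thesis or of So (2013).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([
    np.ones(n),                  # intercept
    rng.normal(size=n),          # e.g. book-to-market (hypothetical)
    rng.normal(size=n),          # e.g. prior earnings surprise (hypothetical)
    rng.normal(size=n),          # e.g. log market capitalisation (hypothetical)
])
realized_error = X @ np.array([0.02, 0.01, -0.03, 0.005]) + rng.normal(0, 0.05, n)

# Estimate the characteristic model on one sample ...
beta, *_ = np.linalg.lstsq(X, realized_error, rcond=None)
# ... then apply the coefficients to (out-of-sample) characteristics to obtain
# the predictable component of the forecast error.
predicted_error = X @ beta
print(beta)
```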
Abstract:
Flight safety is one of the most important and frequently discussed issues in aviation. Recent accident inquiries have raised questions as to how the work of flight crews is organized and the extent to which these conditions may have been contributing factors to accidents. Fatigue stems from physiological limitations, which are reflected in performance deficits. The purpose of the present study was to analyse the periods of the day in which pilots working for a commercial airline committed major errors. Errors made by 515 captains and 472 copilots were analyzed using data from flight operation quality assurance systems. To analyze the times of day (shifts) during which incidents occurred, we divided the 24-hour light-dark cycle into four periods: morning, afternoon, night, and early morning. The differences in risk during the day were reported as the ratios of the morning error rate to the afternoon, night and early-morning error rates. For the purposes of this research, only level 3 events were taken into account, since these were the most serious, in which company operational limits were exceeded or established procedures were not followed. According to airline flight schedules, 35% of flights take place in the morning period, 32% in the afternoon, 26% at night, and 7% in the early morning. The data showed that the risk of errors increased by almost 50% in the early morning relative to the morning period (ratio of 1:1.46). For the afternoon the ratio was 1:1.04, and for the night a ratio of 1:1.05 was found. These results show that the early-morning period carries a greater risk of attention problems and fatigue.
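A hedged sketch of the rate-ratio arithmetic implied above: the per-shift error rate is the number of level 3 events divided by the share of flights flown in that shift, and ratios are then taken relative to the morning. Only the flight-share percentages come from the abstract; the event counts are invented.

```python
# Flight shares per shift are quoted in the abstract; the event counts below
# are hypothetical.
flight_share = {"morning": 0.35, "afternoon": 0.32, "night": 0.26, "early_morning": 0.07}
level3_events = {"morning": 70, "afternoon": 65, "night": 55, "early_morning": 20}

# Error rate per shift = events / exposure (share of flights in that shift);
# ratios are then expressed relative to the morning rate.
rates = {k: level3_events[k] / flight_share[k] for k in flight_share}
ratios = {k: rates[k] / rates["morning"] for k in rates}
print(ratios)   # a ratio > 1 means higher risk per flight than in the morning
```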
Abstract:
This thesis introduces heat demand forecasting models generated with data mining algorithms. The forecast spans one full day, and it can be used to regulate the heat consumption of buildings. For training the data mining models, two years of heat consumption data from a case building and weather measurement data from the Finnish Meteorological Institute are used. The thesis utilizes Microsoft SQL Server Analysis Services data mining tools to generate the data mining models and the CRISP-DM process framework to implement the research. Results show that the built models can predict heat demand at best with mean absolute percentage errors of 3.8% for the 24-hour profile and 5.9% for the full day. A deployment model for integrating the generated data mining models into an existing building energy management system is also discussed.
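For reference, a small sketch of the mean absolute percentage error (MAPE) measure quoted above; the hourly demand values are hypothetical.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, the accuracy measure quoted above."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

# Hypothetical hourly heat demand (kW) for part of a day and a model forecast.
actual = [52.0, 48.5, 47.0, 50.2, 55.8, 61.3]
forecast = [50.9, 49.7, 45.6, 51.8, 54.1, 63.0]
print(f"MAPE = {mape(actual, forecast):.1f} %")
```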
Abstract:
This note investigates the adequacy of the finite-sample approximation provided by the Functional Central Limit Theorem (FCLT) when the errors are allowed to be dependent. We compare the distribution of the scaled partial sums of some data with the distribution of the Wiener process to which they converge. Our setup is purposely very simple in that it considers data generated from an ARMA(1,1) process. Yet this is sufficient to bring out interesting conclusions about the particular elements which cause the approximation to be inadequate even in quite large sample sizes.
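A hedged sketch of the kind of experiment described (not the note's own code): simulate an ARMA(1,1) series, form its scaled partial-sum process and compare it with the Wiener limit implied by the FCLT; the parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
T, phi, theta, sigma = 500, 0.5, 0.3, 1.0        # assumed sample size and ARMA(1,1) parameters

# Simulate u_t = phi*u_{t-1} + eps_t + theta*eps_{t-1}.
eps = rng.normal(0.0, sigma, T + 1)
u = np.zeros(T)
u[0] = eps[1] + theta * eps[0]
for t in range(1, T):
    u[t] = phi * u[t - 1] + eps[t + 1] + theta * eps[t]

# Long-run standard deviation of an ARMA(1,1): sigma*(1+theta)/(1-phi).
lr_sd = sigma * (1.0 + theta) / (1.0 - phi)

r = np.arange(1, T + 1) / T
partial_sums = np.cumsum(u) / (lr_sd * np.sqrt(T))   # should behave like W(r)

# Under the FCLT, partial_sums[-1] is approximately N(0, 1); repeating the
# simulation many times and examining this distribution shows how slowly the
# approximation improves when the errors are strongly dependent.
print(r[-1], partial_sums[-1])
```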
Abstract:
In this paper we propose exact likelihood-based mean-variance efficiency tests of the market portfolio in the context of the Capital Asset Pricing Model (CAPM), allowing for a wide class of error distributions which include normality as a special case. These tests are developed in the framework of multivariate linear regressions (MLR). It is well known, however, that despite their simple statistical structure, standard asymptotically justified MLR-based tests are unreliable. In financial econometrics, exact tests have been proposed for a few specific hypotheses [Jobson and Korkie (Journal of Financial Economics, 1982), MacKinlay (Journal of Financial Economics, 1987), Gibbons, Ross and Shanken (Econometrica, 1989), Zhou (Journal of Finance, 1993)], most of which depend on normality. For the Gaussian model, our tests correspond to Gibbons, Ross and Shanken's mean-variance efficiency tests. In non-Gaussian contexts, we reconsider mean-variance efficiency tests allowing for multivariate Student-t and Gaussian mixture errors. Our framework allows us to shed more light on whether the normality assumption is too restrictive when testing the CAPM. We also propose exact multivariate diagnostic checks (including tests for multivariate GARCH and a multivariate generalization of the well-known variance ratio tests) and goodness-of-fit tests, as well as a set estimate for the intervening nuisance parameters. Our results [over five-year subperiods] show the following: (i) multivariate normality is rejected in most subperiods, (ii) residual checks reveal no significant departures from the multivariate i.i.d. assumption, and (iii) mean-variance efficiency of the market portfolio is not rejected as frequently once the possibility of non-normal errors is allowed for.
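As a hedged reference point (not the paper's exact procedure), the sketch below computes the Gibbons-Ross-Shanken (GRS) mean-variance efficiency statistic, the Gaussian benchmark the abstract mentions; the return data are simulated and all parameter choices are assumptions.

```python
import numpy as np
from scipy import stats

def grs_test(excess_returns, market_excess):
    """GRS mean-variance efficiency test of the market portfolio under
    Gaussian errors.  excess_returns: T x N asset excess returns,
    market_excess: length-T market excess return."""
    R = np.asarray(excess_returns, float)
    f = np.asarray(market_excess, float)
    T, N = R.shape
    X = np.column_stack([np.ones(T), f])                   # CAPM regressors
    B, *_ = np.linalg.lstsq(X, R, rcond=None)              # rows: alphas, betas
    alpha = B[0]
    resid = R - X @ B
    Sigma = resid.T @ resid / T                            # MLE residual covariance
    theta2 = (f.mean() / f.std(ddof=0)) ** 2               # squared market Sharpe ratio
    W = (T - N - 1) / N * (alpha @ np.linalg.solve(Sigma, alpha)) / (1.0 + theta2)
    p_value = 1.0 - stats.f.cdf(W, N, T - N - 1)
    return W, p_value

# Hypothetical data: 120 months of excess returns on 5 assets and the market.
rng = np.random.default_rng(7)
mkt = rng.normal(0.005, 0.04, 120)
assets = 0.9 * mkt[:, None] + rng.normal(0.0, 0.02, (120, 5))
print(grs_test(assets, mkt))
```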
Abstract:
There are many ways to generate geometrical models for numerical simulation, and most of them start with a segmentation step to extract the boundaries of the regions of interest. This paper presents an algorithm to generate a patient-specific three-dimensional geometric model, based on a tetrahedral mesh, without an initial extraction of contours from the volumetric data. Using the information directly available in the data, such as gray levels, we build a metric to drive a mesh adaptation process. The metric is used to specify the size and orientation of the tetrahedral elements everywhere in the mesh. Our method, which produces anisotropic meshes, gives good results with synthetic and real MRI data. The quality of the resulting model has been evaluated qualitatively and quantitatively by comparing it with an analytical solution and with a segmentation made by an expert. Results show that, in 90% of the cases, our method gives meshes as good as or better than those of a similar isotropic method, based on the accuracy of the volume reconstruction for a given mesh size. Moreover, a comparison of the Hausdorff distances between the adapted meshes of both methods and ground-truth volumes shows that our method decreases reconstruction errors faster.
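A small, hedged sketch of the Hausdorff-distance comparison mentioned above, using SciPy's directed Hausdorff routine on two hypothetical point clouds standing in for a reconstructed mesh boundary and the ground-truth surface.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(surface_a, surface_b):
    """Symmetric Hausdorff distance between two point sets sampled from
    surfaces, the reconstruction-error measure mentioned above."""
    d_ab = directed_hausdorff(surface_a, surface_b)[0]
    d_ba = directed_hausdorff(surface_b, surface_a)[0]
    return max(d_ab, d_ba)

# Hypothetical point clouds: a ground-truth boundary and a noisy reconstruction.
rng = np.random.default_rng(3)
truth = rng.normal(size=(500, 3))
truth /= np.linalg.norm(truth, axis=1, keepdims=True)       # points on a unit sphere
mesh = truth + rng.normal(0.0, 0.02, truth.shape)           # perturbed reconstruction
print(hausdorff(mesh, truth))
```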
Abstract:
The results of an investigation into the limits of the random errors contained in the basic data of Physical Oceanography, and their propagation through the computational procedures, are presented in this thesis. It also suggests a method which increases the reliability of the derived results. The thesis is presented in eight chapters, including the introductory chapter. Chapter 2 discusses the general theory of errors that is relevant in the context of the propagation of errors in Physical Oceanographic computations. The error components contained in the independent oceanographic variables, namely temperature, salinity and depth, are delineated and quantified in chapter 3. Chapter 4 discusses and derives the magnitude of the errors in the computation of the dependent oceanographic variables, in-situ density, σt, specific volume and specific volume anomaly, due to the propagation of the errors contained in the independent oceanographic variables. The errors propagated into the computed values of the derived quantities, namely dynamic depth and relative currents, are estimated and presented in chapter 5. Chapter 6 reviews the existing methods for the identification of the level of no motion and suggests a method for the identification of a reliable zero reference level. Chapter 7 discusses the available methods for the extension of the zero reference level into shallow regions of the oceans and suggests a new, more reliable method. A procedure of graphical smoothing of dynamic topographies between the error limits to provide more reliable results is also suggested in this chapter. Chapter 8 deals with the computation of the geostrophic current from these smoothed values of dynamic height, with reference to the selected zero reference level. The summary and conclusions are also presented in this chapter.
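A hedged sketch of the first-order error-propagation rule that underlies this kind of analysis, sigma_f^2 = sum_i (df/dx_i)^2 * sigma_i^2, applied to a deliberately crude, illustrative density formula; neither the formula nor the error values are the thesis' own.

```python
import numpy as np

def propagated_error(f, x, sigma, h=1e-5):
    """First-order (Gaussian) error propagation with partial derivatives
    estimated by central differences."""
    x = np.asarray(x, float)
    grad = np.empty_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        grad[i] = (f(xp) - f(xm)) / (2.0 * h)
    return float(np.sqrt(np.sum((grad * np.asarray(sigma, float)) ** 2)))

# Crude linearised density of seawater around T=15 degC, S=35, p=0 dbar
# (coefficients are illustrative only, not a real equation of state).
def density(x):
    T, S, p = x
    return 1026.0 - 0.2 * (T - 15.0) + 0.78 * (S - 35.0) + 0.0045 * p

sigma_T, sigma_S, sigma_p = 0.02, 0.01, 1.0      # assumed measurement errors
print(propagated_error(density, [15.0, 35.0, 0.0], [sigma_T, sigma_S, sigma_p]))
```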
Abstract:
Our essay aims at studying suitable statistical methods for the clustering of compositional data in situations where the observations are constituted by trajectories of compositional data, that is, by sequences of composition measurements along a domain. Observed trajectories are known as "functional data" and several methods have been proposed for their analysis. In particular, methods for clustering functional data, known as Functional Cluster Analysis (FCA), have been applied by practitioners and scientists in many fields. To our knowledge, FCA techniques have not been extended to cope with the problem of clustering compositional data trajectories. In order to extend FCA techniques to the analysis of compositional data, FCA clustering techniques have to be adapted by using a suitable compositional algebra. The present work centres on the following question: given a sample of compositional data trajectories, how can we formulate a segmentation procedure giving homogeneous classes? To address this problem we follow the steps described below. First of all, we adapt the well-known spline smoothing techniques in order to cope with the smoothing of compositional data trajectories. In fact, an observed curve can be thought of as the sum of a smooth part plus some noise due to measurement errors. Spline smoothing techniques are used to isolate the smooth part of the trajectory: clustering algorithms are then applied to these smooth curves. The second step consists in building suitable metrics for measuring the dissimilarity between trajectories: we propose a metric that accounts for differences in both shape and level, and a metric accounting for differences in shape only. A simulation study is performed in order to evaluate the proposed methodologies, using both hierarchical and partitional clustering algorithms. The quality of the obtained results is assessed by means of several indices.
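As a hedged sketch of the first step (moving compositions into a space where ordinary spline smoothing applies), the snippet below applies a centred log-ratio transform to a hypothetical 3-part compositional trajectory and smooths each coordinate with a spline; this is one plausible choice of compositional algebra, not necessarily the one adopted in the essay.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def clr(compositions):
    """Centred log-ratio transform: maps rows of the simplex into ordinary
    Euclidean coordinates before smoothing and clustering."""
    logc = np.log(compositions)
    return logc - logc.mean(axis=1, keepdims=True)

# Hypothetical trajectory: 3-part compositions observed at 30 points of a domain.
rng = np.random.default_rng(5)
t = np.linspace(0.0, 1.0, 30)
raw = np.column_stack([0.5 + 0.3 * np.sin(2 * np.pi * t),
                       0.3 + 0.1 * t,
                       0.2 + 0.05 * np.cos(2 * np.pi * t)])
raw = np.abs(raw + rng.normal(0.0, 0.02, raw.shape))         # add measurement noise
comp = raw / raw.sum(axis=1, keepdims=True)                  # close onto the simplex

# Smooth each clr coordinate separately; clustering would then operate on
# these smooth curves (or on their spline coefficients).
smooth = np.column_stack([UnivariateSpline(t, y, s=0.05)(t) for y in clr(comp).T])
print(smooth.shape)
```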
Abstract:
The representation of the diurnal cycle in the Hadley Centre climate model is evaluated using simulations of the infrared radiances observed by Meteosat 7. In both the window and water vapour channels, the standard version of the model with 19 levels produces a good simulation of the geographical distributions of the mean radiances and of the amplitude of the diurnal cycle. Increasing the vertical resolution to 30 levels leads to further improvements in the mean fields. The timing of the maximum and minimum radiances reveals significant model errors, however, which are sensitive to the frequency with which the radiation scheme is called. In most regions, these errors are consistent with well documented errors in the timing of convective precipitation, which peaks before noon in the model, in contrast to the observed peak in the late afternoon or evening. When the radiation scheme is called every model time step (half an hour), as opposed to every three hours in the standard version, the timing of the minimum radiance is improved for convective regions over central Africa, due to the creation of upper-level layer-cloud by detrainment from the convection scheme, which persists well after the convection itself has dissipated. However, this produces a decoupling between the timing of the diurnal cycles of precipitation and window channel radiance. The possibility is raised that a similar decoupling may occur in reality and the implications of this for the retrieval of the diurnal cycle of precipitation from infrared radiances are discussed.
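As a hedged aside on how the amplitude and timing of a diurnal cycle are commonly diagnosed (the paper does not state its exact diagnostic), the sketch below fits the first diurnal harmonic to hypothetical 3-hourly brightness temperatures and returns the amplitude and local time of maximum.

```python
import numpy as np

def diurnal_harmonic(hours, values):
    """Fit a*cos(2*pi*h/24) + b*sin(2*pi*h/24) + c and return the amplitude
    and local time of maximum of the first diurnal harmonic."""
    w = 2.0 * np.pi * np.asarray(hours, float) / 24.0
    X = np.column_stack([np.cos(w), np.sin(w), np.ones_like(w)])
    (a, b, c), *_ = np.linalg.lstsq(X, np.asarray(values, float), rcond=None)
    amplitude = np.hypot(a, b)
    t_max = (np.arctan2(b, a) * 24.0 / (2.0 * np.pi)) % 24.0
    return amplitude, t_max

# Hypothetical 3-hourly window-channel brightness temperatures (K) over one day.
hours = np.arange(0, 24, 3)
tb = np.array([285.0, 283.5, 286.0, 291.0, 295.5, 294.0, 290.0, 287.0])
print(diurnal_harmonic(hours, tb))
```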
Abstract:
Satellite data, reanalysis products and climate models are combined to monitor changes in water vapour, clear-sky radiative cooling of the atmosphere and precipitation over the period 1979-2006. Climate models are able to simulate the observed increases in column integrated water vapour (CWV) with surface temperature (Ts) over the ocean. Changes in the observing system lead to spurious variability in water vapour and clear-sky longwave radiation in the reanalysis products. Nevertheless, all products considered exhibit a robust increase in clear-sky longwave radiative cooling from the atmosphere to the surface; clear-sky longwave radiative cooling of the atmosphere is found to increase with Ts at a rate of ~4 W m⁻² K⁻¹ over tropical ocean regions of mean descending vertical motion. Precipitation (P) is tightly coupled to atmospheric radiative cooling rates, which implies an increase in P with warming at a slower rate than the observed increase in CWV. Since convective precipitation depends on moisture convergence, this implies enhanced precipitation over convective regions and reduced precipitation over convectively suppressed regimes. To quantify this response, observed and simulated changes in precipitation rate are analysed separately over regions of mean ascending and descending vertical motion in the tropics. The observed response is found to be substantially larger than in the model simulations and climate change projections. It is currently not clear whether this is due to deficiencies in model parametrizations or to errors in satellite retrievals.
Abstract:
Satellite-based rainfall monitoring is widely used for climatological studies because of its full global coverage, but it is also of great importance for operational purposes, especially in areas such as Africa where there is a lack of ground-based rainfall data. Satellite rainfall estimates have enormous potential benefits as input to hydrological and agricultural models because of their real-time availability, low cost and full spatial coverage. One issue that needs to be addressed is the uncertainty in these estimates. This is particularly important in assessing the likely errors in the output of non-linear models (rainfall-runoff or crop yield) which use the rainfall estimates, aggregated over an area, as input. Correct assessment of the uncertainty in the rainfall is non-trivial, as it must take account of:
• the difference in spatial support of the satellite information and the independent data used for calibration
• uncertainties in the independent calibration data
• the non-Gaussian distribution of rainfall amount
• the spatial intermittency of rainfall
• the spatial correlation of the rainfall field
This paper describes a method for estimating the uncertainty in satellite-based rainfall values that takes account of these factors. The method involves, firstly, a stochastic calibration which completely describes the probability of rainfall occurrence and the pdf of rainfall amount for a given satellite value, and secondly the generation of an ensemble of rainfall fields based on the stochastic calibration but with the correct spatial correlation structure within each ensemble member. This is achieved by the use of geostatistical sequential simulation. The ensemble generated in this way may be used to estimate the uncertainty at larger spatial scales. A case study of daily rainfall monitoring in The Gambia, West Africa, for the purpose of crop yield forecasting is presented to illustrate the method.
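A heavily simplified, hedged sketch of the first stage (the stochastic calibration) for a single pixel: rainfall occurs with a probability calibrated against the satellite value and, when it occurs, the amount is drawn from a calibrated gamma pdf. The calibration functions are hypothetical, and the geostatistical sequential simulation step that imposes the spatial correlation structure across pixels is omitted.

```python
import numpy as np

rng = np.random.default_rng(11)

def p_rain(satellite_value):
    # Hypothetical logistic calibration of rainfall occurrence probability.
    return 1.0 / (1.0 + np.exp(-(satellite_value - 20.0) / 5.0))

def amount_params(satellite_value):
    # Hypothetical calibration of the gamma pdf of rainfall amount.
    shape = 1.5
    scale = 0.3 * satellite_value + 1.0
    return shape, scale

def simulate_pixel(satellite_value, n_members=100):
    """Draw an ensemble of plausible rain amounts (mm) for one pixel."""
    occurs = rng.random(n_members) < p_rain(satellite_value)
    shape, scale = amount_params(satellite_value)
    amounts = rng.gamma(shape, scale, n_members)
    return np.where(occurs, amounts, 0.0)

ens = simulate_pixel(satellite_value=30.0)
print(ens.mean(), np.percentile(ens, [5, 95]))
```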