21 resultados para Data quality problems
em CentAUR: Central Archive University of Reading - UK
Resumo:
Data quality is a difficult notion to define precisely, and different communities have different views and understandings of the subject. This causes confusion, a lack of harmonization of data across communities and omission of vital quality information. For some existing data infrastructures, data quality standards cannot address the problem adequately and cannot fulfil all user needs or cover all concepts of data quality. In this study, we discuss some philosophical issues on data quality. We identify actual user needs on data quality, review existing standards and specifications on data quality, and propose an integrated model for data quality in the field of Earth observation (EO). We also propose a practical mechanism for applying the integrated quality information model to a large number of datasets through metadata inheritance. While our data quality management approach is in the domain of EO, we believe that the ideas and methodologies for data quality management can be applied to wider domains and disciplines to facilitate quality-enabled scientific research.
Resumo:
Geospatial information of many kinds, from topographic maps to scientific data, is increasingly being made available through web mapping services. These allow georeferenced map images to be served from data stores and displayed in websites and geographic information systems, where they can be integrated with other geographic information. The Open Geospatial Consortiums Web Map Service (WMS) standard has been widely adopted in diverse communities for sharing data in this way. However, current services typically provide little or no information about the quality or accuracy of the data they serve. In this paper we will describe the design and implementation of a new quality-enabled profile of WMS, which we call WMS-Q. This describes how information about data quality can be transmitted to the user through WMS. Such information can exist at many levels, from entire datasets to individual measurements, and includes the many different ways in which data uncertainty can be expressed. We also describe proposed extensions to the Symbology Encoding specification, which include provision for visualizing uncertainty in raster data in a number of different ways, including contours, shading and bivariate colour maps. We shall also describe new open-source implementations of the new specifications, which include both clients and servers.
Resumo:
In January 1992, there was a major pollutant event for the River Canon and downstream with its confluence to the River Fal and the Fal estuary in the west Cornwall. This incident was associated with the discharge of several million gallons of highly polluted water from the abandoned Wheal Jane tin mine that also extracted Ag, Cu and Zn ore. Later that year, the Centre for Ecology and Hydrology (CBH; then Institute of Hydrology) Wallingford undertook daily monitoring of the River Canon for a range of major, minor and trace elements to assess the nature and the dynamics of the pollutant discharges. These data cover an 18-month period when there remained major water-quality problems after the initial phase of surface water contamination. Here, a summary is provided of the water quality found, as a backdrop to set against subsequent remediation. Two types of water-quality determinant grouping were observed. The first type comprises the determinants B, Cs, Ca, Li, K, Na, SO4, Rb and Sr, and their concentrations are positively correlated with each other but inversely correlated with flow. This type of water-quality determinant shows variations in concentration that broadly link to the normal hydrogeochemical processes within the catchment, with limited confounding issues associated with mine drainage. The second type of water-quality determinant comprises Al, Be, Cd, Ce, Co, Cu, Fe, La, Pb, Pr, Nd, Ni, Si, Sb, U, Y and Zn, and concentrations for all this group are positively correlated. The determinants in this second group all have concentrations that are negatively correlated with pH. This group links primarily to pollutant mine discharge. The water-quality variations in the River Camon are described in relation to these two distinct hydrogeochemical groupings. (C) 2004 Elsevier B.V All rights reserved.
Resumo:
The GaussNewton algorithm is an iterative method regularly used for solving nonlinear least squares problems. It is particularly well suited to the treatment of very large scale variational data assimilation problems that arise in atmosphere and ocean forecasting. The procedure consists of a sequence of linear least squares approximations to the nonlinear problem, each of which is solved by an inner direct or iterative process. In comparison with Newtons method and its variants, the algorithm is attractive because it does not require the evaluation of second-order derivatives in the Hessian of the objective function. In practice the exact GaussNewton method is too expensive to apply operationally in meteorological forecasting, and various approximations are made in order to reduce computational costs and to solve the problems in real time. Here we investigate the effects on the convergence of the GaussNewton method of two types of approximation used commonly in data assimilation. First, we examine truncated GaussNewton methods where the inner linear least squares problem is not solved exactly, and second, we examine perturbed GaussNewton methods where the true linearized inner problem is approximated by a simplified, or perturbed, linear least squares problem. We give conditions ensuring that the truncated and perturbed GaussNewton methods converge and also derive rates of convergence for the iterations. The results are illustrated by a simple numerical example. A practical application to the problem of data assimilation in a typical meteorological system is presented.
Resumo:
Mediterranean ecosystems rival tropical ecosystems in terms of plant biodiversity. The Mediterranean Basin (MB) itself hosts 25 000 plant species, half of which are endemic. This rich biodiversity and the complex biogeographical and political issues make conservation a difficult task in the region. Species, habitat, ecosystem and landscape approaches have been used to identify conservation targets at various scales: ie, European, national, regional and local. Conservation decisions require adequate information at the species, community and habitat level. Nevertheless and despite recent improvements/efforts, this information is still incomplete, fragmented and varies from one country to another. This paper reviews the biogeographic data, the problems arising from current conservation efforts and methods for the conservation assessment and prioritization using GIS. GIS has an important role to play for managing spatial and attribute information on the ecosystems of the MB and to facilitate interactions with existing databases. Where limited information is available it can be used for prediction when directly or indirectly linked to externally built models. As well as being a predictive tool today GIS incorporate spatial techniques which can improve the level of information such as fuzzy logic, geostatistics, or provide insight about landscape changes such as 3D visualization. Where there are limited resources it can assist with identifying sites of conservation priority or the resolution of environmental conflicts (scenario building). Although not a panacea, GIS is an invaluable tool for improving the understanding of Mediterranean ecosystems and their dynamics and for practical management in a region that is under increasing pressure from human impact.
Resumo:
Mediterranean ecosystems rival tropical ecosystems in terms of plant biodiversity. The Mediterranean Basin (MB) itself hosts 25 000 plant species, half of which are endemic. This rich biodiversity and the complex biogeographical and political issues make conservation a difficult task in the region. Species, habitat, ecosystem and landscape approaches have been used to identify conservation targets at various scales: ie, European, national, regional and local. Conservation decisions require adequate information at the species, community and habitat level. Nevertheless and despite recent improvements/efforts, this information is still incomplete, fragmented and varies from one country to another. This paper reviews the biogeographic data, the problems arising from current conservation efforts and methods for the conservation assessment and prioritization using GIS. GIS has an important role to play for managing spatial and attribute information on the ecosystems of the MB and to facilitate interactions with existing databases. Where limited information is available it can be used for prediction when directly or indirectly linked to externally built models. As well as being a predictive tool today GIS incorporate spatial techniques which can improve the level of information such as fuzzy logic, geostatistics, or provide insight about landscape changes such as 3D visualization. Where there are limited resources it can assist with identifying sites of conservation priority or the resolution of environmental conflicts (scenario building). Although not a panacea, GIS is an invaluable tool for improving the understanding of Mediterranean ecosystems and their dynamics and for practical management in a region that is under increasing pressure from human impact.
Resumo:
Aircraft Maintenance, Repair and Overhaul (MRO) agencies rely largely on row-data based quotation systems to select the best suppliers for the customers (airlines). The data quantity and quality becomes a key issue to determining the success of an MRO job, since we need to ensure we achieve cost and quality benchmarks. This paper introduces a data mining approach to create an MRO quotation system that enhances the data quantity and data quality, and enables significantly more precise MRO job quotations. Regular Expression was utilized to analyse descriptive textual feedback (i.e. engineers reports) in order to extract more referable highly normalised data for job quotation. A text mining based key influencer analysis function enables the user to proactively select sub-parts, defects and possible solutions to make queries more accurate. Implementation results show that system data would improve cost quotation in 40% of MRO jobs, would reduce service cost without causing a drop in service quality.
Resumo:
The position of Real Estate within a multi-asset portfolio has received considerable attention recently. Previous research has concentrated on the percentage holding property would achieve given its risk/return characteristics. Such studies have invariably used Modern Portfolio Theory and these approaches have been criticised for both the quality of the real estate data and problems with the methodology itself. The first problem is now well understood, and the second can be addressed by the use of realistic constraints on asset holdings. This paper takes a different approach. We determine the level of return that Real Estate needs to achieve to justify an allocation within the multi asset portfolio. In order to test the importance of the quality of the data we use historic appraisal based and desmoothed returns to examine the sensitivity of the results. Consideration is also given to the Holding period and the imposition of realistic constraints on the asset holdings in order to model portfolios held by pension fund investors. We conclude, using several benchmark levels of portfolio risk and return, that using appraisal based data the required level of return for Real Estate was less than that achieved over the period 1972-1993. The use of desmoothed series can reverse this result at the highest levels of desmoothing although within a restricted holding period Real Estate offered returns in excess of those required to enter the portfolio and might have a role to play in the multi-asset portfolio.
Resumo:
The catchment of the River Thames, the principal river system in southern England, provides the main water supply for London but is highly vulnerable to changes in climate, land use and population. The river is eutrophic with significant algal blooms with phosphorus assumed to be the primary chemical indicator of ecosystem health. In the Thames Basin, phosphorus is available from point sources such as wastewater treatment plants and from diffuse sources such as agriculture. In order to predict vulnerability to future change, the integrated catchments model for phosphorus (INCA-P) has been applied to the river basin and used to assess the cost-effectiveness of a range of mitigation and adaptation strategies. It is shown that scenarios of future climate and land-use change will exacerbate the water quality problems, but a range of mitigation measures can improve the situation. A cost-effectiveness study has been undertaken to compare the economic benefits of each mitigation measure and to assess the phosphorus reductions achieved. The most effective strategy is to reduce fertilizer use by 20% together with the treatment of effluent to a high standard. Such measures will reduce the instream phosphorus concentrations to close to the EU Water Framework Directive target for the Thames.
Resumo:
We propose a new class of neurofuzzy construction algorithms with the aim of maximizing generalization capability specifically for imbalanced data classification problems based on leave-one-out (LOO) cross validation. The algorithms are in two stages, first an initial rule base is constructed based on estimating the Gaussian mixture model with analysis of variance decomposition from input data; the second stage carries out the joint weighted least squares parameter estimation and rule selection using orthogonal forward subspace selection (OFSS)procedure. We show how different LOO based rule selection criteria can be incorporated with OFSS, and advocate either maximizing the leave-one-out area under curve of the receiver operating characteristics, or maximizing the leave-one-out Fmeasure if the data sets exhibit imbalanced class distribution. Extensive comparative simulations illustrate the effectiveness of the proposed algorithms.
Resumo:
For users of climate services, the ability to quickly determine the datasets that best fit one's needs would be invaluable. The volume, variety and complexity of climate data makes this judgment difficult. The ambition of CHARMe ("Characterization of metadata to enable high-quality climate services") is to give a wider interdisciplinary community access to a range of supporting information, such as journal articles, technical reports or feedback on previous applications of the data. The capture and discovery of this "commentary" information, often created by data users rather than data providers, and currently not linked to the data themselves, has not been significantly addressed previously. CHARMe applies the principles of Linked Data and open web standards to associate, record, search and publish user-derived annotations in a way that can be read both by users and automated systems. Tools have been developed within the CHARMe project that enable annotation capability for data delivery systems already in wide use for discovering climate data. In addition, the project has developed advanced tools for exploring data and commentary in innovative ways, including an interactive data explorer and comparator ("CHARMe Maps") and a tool for correlating climate time series with external "significant events" (e.g. instrument failures or large volcanic eruptions) that affect the data quality. Although the project focuses on climate science, the concepts are general and could be applied to other fields. All CHARMe system software is open-source, released under a liberal licence, permitting future projects to re-use the source code as they wish.
Resumo:
Ozone and temperature profiles from the Michelson Interferometer for Passive Atmospheric Sounding (MIPAS) have been assimilated, using three-dimensional variational assimilation, into a stratosphere troposphere version of the Met Office numerical weather-prediction system. Analyses are made for the month of September 2002, when there was an unprecedented split in the southern hemisphere polar vortex. The analyses are validated against independent ozone observations from sondes, limb-occultation and total column ozone satellite instruments. Through most of the stratosphere, precision varies from 5 to 15%, and biases are 15% or less of the analysed field. Problems remain in the vortex and below the 60 hPa. level, especially at the tropopause where the analyses have too much ozone and poor agreement with independent data. Analysis problems are largely a result of the model rather than the data, giving confidence in the MIPAS ozone retrievals, though there may be a small high bias in MIPAS ozone in the lower stratosphere. Model issues include an excessive Brewer-Dobson circulation, which results both from known problems with the tracer transport scheme and from the data assimilation of dynamical variables. The extreme conditions of the vortex split reveal large differences between existing linear ozone photochemistry schemes. Despite these issues, the ozone analyses are able to successfully describe the ozone hole split and compare well to other studies of this event. Recommendations are made for the further development of the ozone assimilation system.
Resumo:
There is a concerted global effort to digitize biodiversity occurrence data from herbarium and museum collections that together offer an unparalleled archive of life on Earth over the past few centuries. The Global Biodiversity Information Facility provides the largest single gateway to these data. Since 2004 it has provided a single point of access to specimen data from databases of biological surveys and collections. Biologists now have rapid access to more than 120 million observations, for use in many biological analyses. We investigate the quality and coverage of data digitally available, from the perspective of a biologist seeking distribution data for spatial analysis on a global scale. We present an example of automatic verification of geographic data using distributions from the International Legume Database and Information Service to test empirically, issues of geographic coverage and accuracy. There are over 1/2 million records covering 31% of all Legume species, and 84% of these records pass geographic validation. These data are not yet a global biodiversity resource for all species, or all countries. A user will encounter many biases and gaps in these data which should be understood before data are used or analyzed. The data are notably deficient in many of the world's biodiversity hotspots. The deficiencies in data coverage can be resolved by an increased application of resources to digitize and publish data throughout these most diverse regions. But in the push to provide ever more data online, we should not forget that consistent data quality is of paramount importance if the data are to be useful in capturing a meaningful picture of life on Earth.