962 results for Data quality problems


Relevance:

100.00%

Publisher:

Abstract:

One challenge in data assimilation (DA) methods is how the error covariance of the model state is computed. Ensemble methods have been proposed for producing error covariance estimates, as the error is propagated in time using the non-linear model. Variational methods, on the other hand, use concepts from control theory, whereby the state estimate is optimized from both the background and the measurements. Numerical optimization schemes are applied, which avoids the memory storage and huge matrix inversions needed by classical Kalman filter methods. The Variational Ensemble Kalman Filter (VEnKF), a method inspired by the Variational Kalman Filter (VKF), enjoys the benefits of both ensemble and variational methods. It avoids the filter inbreeding problems which emerge when the ensemble spread underestimates the true error covariance; in VEnKF this is tackled by resampling the ensemble every time measurements are available. One advantage of VEnKF over VKF is that it needs neither tangent linear code nor adjoint code. In this thesis, VEnKF has been applied to a two-dimensional shallow water model simulating a dam-break experiment. The model is a public code, with water-height measurements recorded at seven stations along the mid-line of the 21.2 m long, 1.4 m wide flume. Because the data were too sparse to assimilate into the 30 171-dimensional model state vector, we chose to interpolate the data both in time and in space. The results of the assimilation were compared with those of a pure simulation. We found that the VEnKF results were more realistic, without the numerical artifacts present in the pure simulation. Creating wrapper code for a model and a DA scheme can be challenging, especially when the two were designed independently or are poorly documented. In this thesis we present a non-intrusive approach to coupling the model and a DA scheme: an external program is used to send and receive information between the model and the DA procedure using files. The advantage of this method is that the model code changes needed are minimal, only a few lines that facilitate input and output. Apart from being simple to couple, the approach can be employed even if the two are written in different programming languages, because the communication is not through code. The non-intrusive approach accommodates parallel computing simply by having the control program wait until all processes have ended before the DA procedure is invoked. It is worth mentioning the overhead introduced by the approach, as at every assimilation cycle both the model and the DA procedure have to be initialized. Nonetheless, the method can be an ideal approach for a benchmark platform for testing DA methods. The non-intrusive VEnKF has been applied to the multi-purpose hydrodynamic model COHERENS to assimilate Total Suspended Matter (TSM) in Lake Säkylän Pyhäjärvi. The lake has an area of 154 km² and an average depth of 5.4 m. Turbidity and chlorophyll-a concentrations from MERIS satellite images for 7 days between May 16 and July 6, 2009 were available. The effect of the organic matter was computationally eliminated to obtain TSM data. Because of the computational demands of both COHERENS and VEnKF, we chose to use a 1 km grid resolution. The VEnKF results were compared with the measurements recorded at an automatic station located in the north-western part of the lake. However, due to TSM data sparsity in both time and space, the two could not be well matched.
The use of multiple automatic stations with real-time data is important to avoid the temporal sparsity problem. Combined with DA, this will help, for instance, in better understanding environmental hazard variables. We found that using a very large ensemble does not necessarily improve the results, because there is a limit beyond which additional ensemble members add very little to the performance. The successful implementation of the non-intrusive VEnKF, together with this ensemble-size limit, points towards the emerging area of Reduced Order Modeling (ROM). To save computational resources, ROM avoids running the full-blown model. When ROM is applied with the non-intrusive DA approach, it may result in a cheaper algorithm that relaxes the computational challenges existing in the field of modelling and DA.
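As an illustration of the file-based, non-intrusive coupling described above, the sketch below shows a control program that launches the (possibly parallel) model processes, waits for them all to finish, and only then invokes the DA procedure, exchanging information purely through files. This is a minimal sketch under stated assumptions, not the thesis code: the executables `./model` and `./venkf`, their command-line options, and the file names `state.dat` and `analysis.dat` are all hypothetical.

```python
import subprocess
from pathlib import Path

# Hypothetical file names, executables and options; the thesis code used its own conventions.
STATE_FILE = Path("state.dat")        # the model writes its forecast state here
ANALYSIS_FILE = Path("analysis.dat")  # the DA procedure writes the analysis here
N_CYCLES = 10                         # number of assimilation cycles
N_PROCS = 4                           # parallel model processes per cycle

def run_model(cycle: int) -> None:
    """Launch the (possibly parallel) model processes and wait for all of them to end."""
    procs = [subprocess.Popen(["./model", f"--cycle={cycle}", f"--rank={r}"])
             for r in range(N_PROCS)]
    for p in procs:   # the control program simply waits ...
        p.wait()      # ... before the DA procedure is invoked

def run_da(cycle: int) -> None:
    """Run the DA procedure (e.g. VEnKF) on the state file produced by the model."""
    subprocess.run(["./venkf", f"--cycle={cycle}",
                    "--state", str(STATE_FILE),
                    "--out", str(ANALYSIS_FILE)], check=True)

for cycle in range(N_CYCLES):
    run_model(cycle)   # the model reads ANALYSIS_FILE (if present) and writes STATE_FILE
    run_da(cycle)      # the DA procedure turns STATE_FILE into a new ANALYSIS_FILE
# Only a few input/output lines inside the model need to change; because communication
# is through files, the model and the DA code may be written in different languages.
```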

Relevance:

100.00%

Publisher:

Abstract:

The accuracy of a map is dependent on the reference dataset used in its construction. Classification analyses used in thematic mapping can, for example, be sensitive to a range of sampling and data quality concerns. With particular focus on the latter, the effects of reference data quality on land cover classifications from airborne thematic mapper data are explored. Variations in sampling intensity and effort are highlighted in a dataset that is widely used in mapping and modelling studies; these may need to be accounted for in analyses. The quality of the labelling in the reference dataset was also a key variable influencing mapping accuracy. Accuracy varied with the amount and nature of mislabelled training cases, and the nature of the effects varied between classifiers. The largest impacts on accuracy occurred when mislabelling involved confusion between similar classes. Accuracy was also typically negatively related to the amount of mislabelled cases, and the support vector machine (SVM), which has been claimed to be relatively insensitive to training data error, was the most sensitive of the set of classifiers investigated: overall classification accuracy declined by 8% (significant at the 95% level of confidence) when a training set containing 20% mislabelled cases was used.
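A minimal sketch of a label-noise experiment of the kind described above is given below, using synthetic data and scikit-learn rather than the airborne thematic mapper dataset and classifiers of the study; the class counts, noise rates and SVM settings are illustrative assumptions only.

```python
# Train an SVM on progressively noisier training labels and report test accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, n_classes=4,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def accuracy_with_label_noise(noise_rate: float) -> float:
    """Train an SVM after randomly relabelling a fraction of the training cases."""
    y_noisy = y_tr.copy()
    n_flip = int(noise_rate * len(y_noisy))
    idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    # relabel each selected case to a different, randomly chosen class
    y_noisy[idx] = (y_noisy[idx] + rng.integers(1, 4, size=n_flip)) % 4
    clf = SVC(kernel="rbf").fit(X_tr, y_noisy)
    return accuracy_score(y_te, clf.predict(X_te))

for rate in (0.0, 0.1, 0.2):
    print(f"{rate:.0%} mislabelled -> accuracy {accuracy_with_label_noise(rate):.3f}")
```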

Relevance:

100.00%

Publisher:

Abstract:

Water covers over 70% of the Earth's surface and is vital for all known forms of life. But only 3% of the Earth's water is fresh water, and less than 0.3% of all freshwater is in rivers, lakes, reservoirs and the atmosphere. However, rivers and lakes are an important part of fresh surface water, amounting to about 89%. In this Master's Thesis dissertation, the focus is on three types of water bodies: rivers, lakes and reservoirs, and their water quality issues in Asian countries. The surface water quality in a region is largely determined both by natural processes, such as climate or geographic conditions, and by anthropogenic influences, such as industrial and agricultural activities or land use conversion. Water quality can be affected by pollutant discharges from a specific point, such as a sewer pipe, and also by extensive drainage from agricultural and urban areas within the basin. Hence, water pollutant sources can be divided into two categories: point source pollution and non-point source (NPS) pollution. Seasonal variations in precipitation and surface run-off have a strong effect on river discharge and the concentration of pollutants in water bodies. For example, in the rainy season, heavy and persistent rain washes off the ground, the runoff flow increases, may contain various kinds of pollutants and eventually enters the water bodies. In some cases, especially in confined water bodies, water quality may be positively related to rainfall in the wet season, because this confined type of freshwater system allows high dilution of pollutants, decreasing their possible impacts. During the dry season, water quality is largely related to pollution from industrialization and urbanization. The aim of this study is to identify the most common water quality problems in Asian countries and to enumerate and analyze the methodologies used for the assessment of water quality conditions of both rivers and confined water bodies (lakes and reservoirs). Based on the evaluation of a sample of 57 papers, dated between 2000 and 2012, it was found that over the past decade the water quality of rivers, lakes, and reservoirs in developing countries has been degraded. Water pollution and destruction of aquatic ecosystems have caused massive damage to the functions and integrity of water resources. The most widespread NPS pollution sources in Asian countries, and those with the greatest spatial impact, are urban runoff and agriculture. Locally, mine waste runoff and rice paddies are serious NPS problems. The most relevant point pollution sources are effluents from factories, sewage treatment plants, and public or household facilities. It was found that the most used methodology was unquestionably monitoring, applied in 49 of the analyzed studies (86%). Sometimes, data from historical databases were used as well. Taking samples from the water body and then carrying out laboratory work (chemical analyses) is important because it gives an understanding of the water quality. Six papers (11%) used a method that combined monitoring data and modeling, and another six (11%) applied only a model to estimate water quality. Modeling is a useful resource when the budget is limited, since some models are free to download and use. In particular, several of the models used come from the U.S.A., but they have their own purposes and features, meaning that careful application of the models to other countries and a critical discussion of the results are crucial.
Five papers (9%) used a method combining monitoring data and statistical analysis. When there is a huge data matrix, researchers need an efficient way of interpreting the information, which statistics provides. Three papers (5%) used a method combining monitoring data, statistical analysis and modeling. These different methods are all valuable for evaluating water quality. It was also found that water quality was evaluated using types of sampling other than water itself, which also provide useful information for understanding the condition of the water body. These additional monitoring activities are air sampling, sediment sampling, phytoplankton sampling and aquatic animal tissue sampling. Despite considerable progress in developing and applying control regulations to point and NPS pollution, the pollution status of rivers, lakes, and reservoirs in Asian countries is not improving. In fact, this reflects the slow pace of investment in new infrastructure for pollution control and growing population pressures. Water laws or regulations and public involvement in enforcement can play a constructive and indispensable role in environmental protection. In the near future, in order to protect water from further contamination, rapid action is urgently needed to control the various kinds of effluents in each region. Environmental remediation and treatment of industrial effluents and municipal wastewaters are essential. It is also important to prevent the direct input of agricultural and mine site runoff. Finally, stricter environmental regulation of water quality is required to support protection and management strategies. It would have been possible to obtain further information from the sample of 57 papers. For instance, it would have been interesting to compare the concentration levels of some pollutants across the different Asian countries. However, the three-month duration limit of this study prevented further work from taking place. In spite of this, the study objectives were achieved: the work provided an overview of the most relevant water quality problems in rivers, lakes and reservoirs in Asian countries, and also listed and analyzed the most common methodologies.

Relevance:

100.00%

Publisher:

Abstract:

The emergence of new business models, namely the establishment of partnerships between organizations, and the opportunity that companies have to add existing data from the web, especially from the semantic web, to their own information have brought to the fore a number of problems in databases, particularly those related to data quality. Poor data can result in a loss of competitiveness for the organizations holding those data, and may even lead to their disappearance, since many of their decision-making processes are based on these data. For this reason, data cleaning is essential. Current approaches to solving these problems are closely tied to database schemas and specific domains. For data cleaning to be usable across different repositories, computer systems must be able to understand the data, i.e., an associated semantics is needed. The solution presented in this paper includes the use of ontologies: (i) for the specification of data cleaning operations and (ii) as a way of solving the semantic heterogeneity problems of data stored in different sources. With data cleaning operations defined at a conceptual level, and with mappings between domain ontologies and an ontology derived from a database, the operations may be instantiated and proposed to the expert/specialist for execution over that database, thus enabling their interoperability.
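The sketch below illustrates the general idea of defining a cleaning operation at the conceptual (ontology) level and instantiating it on a concrete database through a mapping; it is a toy illustration, not the paper's implementation, and all names (the `Person.birthDate` property, the `customers.dob` column, and so on) are hypothetical.

```python
# A cleaning operation is specified against an ontology concept/property and
# instantiated on a concrete database column through a schema mapping.
from dataclasses import dataclass

@dataclass
class CleaningOperation:
    concept: str       # ontology class, e.g. "Person"
    ont_property: str  # ontology property, e.g. "birthDate"
    rule: str          # conceptual rule, e.g. "value must not be in the future"

# Mapping from (ontology class, property) to (table, column) of one database.
schema_mapping = {
    ("Person", "birthDate"): ("customers", "dob"),
    ("Person", "email"): ("customers", "email_addr"),
}

def instantiate(op: CleaningOperation) -> str:
    """Turn a conceptual cleaning operation into a concrete check over one database,
    to be proposed to the expert/specialist before execution."""
    table, column = schema_mapping[(op.concept, op.ont_property)]
    return f"CHECK on {table}.{column}: {op.rule}"

op = CleaningOperation("Person", "birthDate", "value must not be in the future")
print(instantiate(op))  # -> CHECK on customers.dob: value must not be in the future
```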

Relevance:

100.00%

Publisher:

Abstract:

Dissertation prepared in partial fulfilment of the requirements for the Master's Degree in Civil Engineering, in the speciality area of Hydraulics.

Relevance:

100.00%

Publisher:

Abstract:

This research aims to find out what benefits employees expect the organization of data governance to bring to an organization, and how it benefits the implementation of automated marketing capabilities. The quality and usability of data are crucial for organizations to meet various business needs. Organizations have more data and technology available that can be utilized, for example, in automated marketing. Data governance addresses the organization of decision rights and accountabilities for the management of an organization's data assets. Automated marketing means sending the right message to the right person at the right time, automatically. The research is a single case study conducted in a Finnish ICT company. The case company was starting to organize data governance and to implement automated marketing capabilities at the time of the research. The empirical material consists of interviews with employees of the case company. Content analysis is used to interpret the interviews in order to answer the research questions. The theoretical framework of the research is derived from the morphology of data governance. The findings indicate that employees expect the organization of data governance, among other things, to improve customer experience, improve sales, provide the ability to identify individual customers' life situations, ensure that data handling complies with regulations, and improve operational efficiency. The organization of data governance is expected to solve problems in customer data quality that currently hinder the implementation of automated marketing capabilities.

Relevance:

100.00%

Publisher:

Abstract:

In January 1992, there was a major pollution event affecting the River Carnon and, downstream of its confluence with the River Fal, the Fal estuary in west Cornwall. This incident was associated with the discharge of several million gallons of highly polluted water from the abandoned Wheal Jane tin mine, which also extracted Ag, Cu and Zn ore. Later that year, the Centre for Ecology and Hydrology (CEH; then the Institute of Hydrology), Wallingford, undertook daily monitoring of the River Carnon for a range of major, minor and trace elements to assess the nature and dynamics of the pollutant discharges. These data cover an 18-month period during which major water-quality problems remained after the initial phase of surface water contamination. Here, a summary is provided of the water quality found, as a backdrop against which to set subsequent remediation. Two groupings of water-quality determinants were observed. The first comprises the determinants B, Cs, Ca, Li, K, Na, SO4, Rb and Sr, whose concentrations are positively correlated with each other but inversely correlated with flow. This type of water-quality determinant shows variations in concentration that broadly link to the normal hydrogeochemical processes within the catchment, with limited confounding issues associated with mine drainage. The second grouping comprises Al, Be, Cd, Ce, Co, Cu, Fe, La, Pb, Pr, Nd, Ni, Si, Sb, U, Y and Zn, and the concentrations of all members of this group are positively correlated with each other. The determinants in this second group all have concentrations that are negatively correlated with pH; this group links primarily to the pollutant mine discharge. The water-quality variations in the River Carnon are described in relation to these two distinct hydrogeochemical groupings. (C) 2004 Elsevier B.V. All rights reserved.
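The correlation-based grouping described above can be illustrated with a small sketch: determinants whose concentrations are inversely correlated with flow behave as dilution-controlled catchment constituents, while those negatively correlated with pH point to mine drainage. The data frame below is synthetic and does not reproduce the CEH monitoring data; the two example determinants and the correlation threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
flow = rng.lognormal(mean=0.0, sigma=0.5, size=n)  # synthetic river flow
ph = 5.5 + 2.0 * rng.random(n)                     # synthetic pH between 5.5 and 7.5

df = pd.DataFrame({
    "flow": flow,
    "pH": ph,
    # dilution-controlled determinant (e.g. Ca): concentration falls as flow rises
    "Ca": 40.0 / flow + rng.normal(0.0, 1.0, n),
    # acid-mine-drainage determinant (e.g. Zn): concentration rises as pH falls
    "Zn": 5.0 * (7.5 - ph) + rng.normal(0.0, 0.5, n),
})

corr = df.corr()
for det in ("Ca", "Zn"):
    if corr.loc[det, "flow"] < -0.3:
        print(f"{det}: inversely correlated with flow -> catchment hydrogeochemistry group")
    elif corr.loc[det, "pH"] < -0.3:
        print(f"{det}: negatively correlated with pH -> mine-drainage group")
```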

Relevance:

100.00%

Publisher:

Abstract:

The Primary Care Information System (SIAB) concentrates basic healthcare information from all regions of Brazil. The information is collected by primary care teams through a paper-based procedure that degrades the quality of the information provided to the healthcare authorities and slows down decision making. To overcome these problems, we propose a new data-gathering application for the primary care teams to collect families' data, running on a mobile device connected to a 3G network and equipped with a GPS. A prototype was developed in which a digital version of one SIAB form is made available on the mobile device. The prototype was tested in a basic healthcare unit located in a suburb of Sao Paulo. The results obtained so far show that the proposed process is a better alternative for data collection in primary care, both in terms of data quality and of shorter delivery time to the healthcare authorities.
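A rough sketch of what a digitized family record captured on the mobile device might look like is given below; the field names, values and the idea of serializing the record as JSON for transmission are assumptions for illustration, not the actual SIAB form or the prototype code.

```python
# Hypothetical digital family record with GPS coordinates, serialized for upload.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class FamilyRecord:
    family_id: str
    household_members: int
    water_supply: str   # e.g. "public network", "well"
    latitude: float     # captured from the device GPS
    longitude: float
    collected_at: str   # ISO timestamp of the visit

record = FamilyRecord(
    family_id="F-000123",
    household_members=4,
    water_supply="public network",
    latitude=-23.5505,
    longitude=-46.6333,
    collected_at=datetime.now(timezone.utc).isoformat(),
)

payload = json.dumps(asdict(record))
# In a prototype of this kind the payload would be sent over the 3G connection
# to the healthcare authority's server; here we simply print it.
print(payload)
```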

Relevance:

100.00%

Publisher:

Abstract:

The Gaia space mission is a major project for the European astronomical community. Challenging as it is, the processing and analysis of the huge data flow coming from Gaia is the subject of thorough study and preparatory work by the DPAC (Data Processing and Analysis Consortium), which is in charge of all aspects of the Gaia data reduction. This PhD Thesis was carried out in the framework of the DPAC, within the team based in Bologna. The task of the Bologna team is to define the calibration model and to build a grid of spectro-photometric standard stars (SPSS) suitable for the absolute flux calibration of the Gaia G-band photometry and the BP/RP spectrophotometry. Such a flux calibration can be performed by repeatedly observing each SPSS during the lifetime of the Gaia mission and by comparing the observed Gaia spectra to the spectra obtained by our ground-based observations. Because of the different observing sites involved and the huge number of frames expected (100 000), it is essential to maintain maximum homogeneity in data quality, acquisition and treatment, and particular care has to be taken to test the capabilities of each telescope/instrument combination (through the instrument familiarization plan) and to devise methods to keep under control, and eventually correct for, the typical instrumental effects that can affect the high precision required for the Gaia SPSS grid (a few per cent with respect to Vega). I contributed to the ground-based survey of Gaia SPSS in many respects: the observations, the instrument familiarization plan, the data reduction and analysis activities (both photometry and spectroscopy), and the maintenance of the data archives. However, the field I was personally responsible for was photometry, and in particular relative photometry for the production of short-term light curves. In this context I defined and tested a semi-automated pipeline which allows for the pre-reduction of imaging SPSS data and the production of aperture photometry catalogues ready to be used for further analysis. A series of semi-automated quality control criteria are included in the pipeline at various levels, from pre-reduction, to aperture photometry, to light curve production and analysis.
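The sketch below illustrates a single aperture-photometry step of the kind such a pipeline performs: summing the flux inside a circular aperture and subtracting a local sky level estimated from an annulus. It is a self-contained toy example on a synthetic frame, not the pipeline itself; the aperture and annulus radii are arbitrary assumptions.

```python
import numpy as np

def aperture_flux(image: np.ndarray, x0: float, y0: float,
                  r_aper: float, r_in: float, r_out: float) -> float:
    """Sum the flux inside a circular aperture and subtract the local sky level
    estimated as the median of the pixels in an annulus between r_in and r_out."""
    yy, xx = np.indices(image.shape)
    r = np.hypot(xx - x0, yy - y0)
    sky = np.median(image[(r >= r_in) & (r < r_out)])
    aperture = r < r_aper
    return float(np.sum(image[aperture] - sky))

# Synthetic frame: flat sky with noise plus a Gaussian star at (50, 50).
rng = np.random.default_rng(0)
img = rng.normal(100.0, 1.0, size=(101, 101))
yy, xx = np.indices(img.shape)
img += 5000.0 * np.exp(-((xx - 50) ** 2 + (yy - 50) ** 2) / (2 * 2.0 ** 2))

print(f"instrumental flux: {aperture_flux(img, 50, 50, r_aper=8, r_in=12, r_out=20):.1f}")
```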

Relevance:

100.00%

Publisher:

Abstract:

Bovine spongiform encephalopathy (BSE) rapid tests and routine BSE-testing laboratories are subject to strict regulations for approval. Due to the lack of BSE-positive control samples, however, full assay validation at the level of individual test runs and continuous monitoring of test performance on-site are difficult. Most rapid tests use synthetic prion protein peptides as controls, but it is not known to what extent these reflect assay performance on field samples, or whether they are sufficient to indicate on-site assay quality problems. To address this question, we compared the test scores of the provided kit peptide controls to those of standardized weak BSE-positive tissue samples, both in individual test runs and continuously over time using quality control charts, in two widely used BSE rapid tests. Our results reveal only a weak correlation between the weak positive tissue control and the peptide control scores. We identified kit-lot-related shifts in assay performance that were not reflected by the peptide control scores. Conversely, not all shifts indicated by the peptide control scores reflected an actual shift in assay performance. In conclusion, these data highlight that the use of the kit peptide controls for continuous quality control purposes may result in unjustified rejection or acceptance of test runs. However, standardized weak positive tissue controls in combination with Shewhart-CUSUM control charts appear to be reliable for continuously monitoring assay performance on-site and identifying undesired deviations.
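A minimal sketch of a tabular CUSUM scheme of the kind mentioned above is shown below, applied to a synthetic series of control-sample scores with a simulated kit-lot shift; the target value, standard deviation, and the textbook slack (k = 0.5 sigma) and decision (h = 5 sigma) parameters are illustrative assumptions, not values from the study.

```python
import numpy as np

def cusum_alarms(scores, target, sigma, k=0.5, h=5.0):
    """Return indices of runs where the upper or lower CUSUM statistic exceeds h*sigma."""
    c_plus, c_minus, alarms = 0.0, 0.0, []
    for i, x in enumerate(scores):
        c_plus = max(0.0, c_plus + (x - target) - k * sigma)
        c_minus = max(0.0, c_minus + (target - x) - k * sigma)
        if c_plus > h * sigma or c_minus > h * sigma:
            alarms.append(i)
    return alarms

rng = np.random.default_rng(0)
# 30 stable runs, then a downward shift in the control score (e.g. a new kit lot)
scores = np.concatenate([rng.normal(1.00, 0.05, 30), rng.normal(0.90, 0.05, 20)])
alarms = cusum_alarms(scores, target=1.00, sigma=0.05)
print("first out-of-control signal at run:", alarms[0] if alarms else "none")
```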

Relevance:

100.00%

Publisher:

Abstract:

The current state of health and biomedicine includes an enormous number of heterogeneous data silos, collected for different purposes and represented differently, that are presently impossible to share or analyze in toto. The greatest challenge for large-scale and meaningful analyses of health-related data is to achieve a uniform data representation for data extracted from heterogeneous source representations. Based upon an analysis and categorization of heterogeneities, a process for achieving comparable data content through a uniform terminological representation is developed. This process addresses the types of representational heterogeneities that commonly arise in healthcare data integration problems. Specifically, it uses a reference terminology and associated "maps" to transform heterogeneous data to a standard representation for comparability and secondary use. Capturing the quality and precision of the maps between local terms and reference terminology concepts enhances the meaning of the aggregated data, enabling end users to formulate better-informed queries for subsequent analyses. A data integration case study in the domain of pediatric asthma illustrates the development and use of a reference terminology for creating comparable data from heterogeneous source representations. The contribution of this research is a generalized process for the integration of data from heterogeneous source representations; this process can be applied and extended to other problems where heterogeneous data need to be merged.
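The sketch below illustrates the mapping idea in miniature: local source terms are mapped to reference terminology concepts, each map carries quality and precision metadata, and a transform keeps or flags records according to a quality threshold. All codes, terms and field names are hypothetical; the sketch is not the process described in the study.

```python
from dataclasses import dataclass

@dataclass
class TermMap:
    local_term: str       # term as it appears in the source system
    reference_code: str   # concept in the reference terminology
    precision: str        # e.g. "exact", "broader", "narrower"
    quality: float        # reviewer-assigned confidence, 0..1

maps = [
    TermMap("asthma, mild persistent", "REF:0001", "exact", 0.95),
    TermMap("wheezing", "REF:0001", "broader", 0.60),
]

def transform(record: dict, maps: list[TermMap], min_quality: float = 0.8) -> dict:
    """Replace a local diagnosis term with its reference concept if a sufficiently
    good map exists; otherwise keep the original term flagged as unmapped."""
    for m in maps:
        if m.local_term == record["diagnosis"] and m.quality >= min_quality:
            return {**record, "diagnosis": m.reference_code, "map_precision": m.precision}
    return {**record, "map_precision": "unmapped"}

print(transform({"patient": "A1", "diagnosis": "asthma, mild persistent"}, maps))
print(transform({"patient": "B2", "diagnosis": "wheezing"}, maps))
```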

Relevance:

100.00%

Publisher:

Abstract:

The Data Quality Campaign (DQC) has been focused since 2005 on advocating for states to build robust state longitudinal data systems (SLDS). While states have made great progress in their data infrastructure, and should continue to emphasize this work, data systems alone will not improve outcomes. It is time for both DQC and the states to focus on building capacity to use the information that these systems are producing at every level, from classrooms to state houses. To impact system performance and student achievement, the ingrained culture must be replaced with one that focuses on data use for continuous improvement. The effective use of data to inform decisions, provide transparency, improve the measurement of outcomes, and fuel continuous improvement will not come to fruition unless there is a system-wide focus on building capacity around the collection, analysis, dissemination, and use of these data, including through research.

Relevance:

100.00%

Publisher:

Abstract:

As the number of data sources publishing their data on the Web of Data grows, we are experiencing an immense growth of the Linked Open Data cloud. The lack of control over the published sources, which may be untrustworthy or unreliable, along with their dynamic nature, which often invalidates links and causes conflicts or other discrepancies, can lead to poor-quality data. In order to judge data quality, a number of quality indicators have been proposed, coupled with quality metrics that quantify the quality level of a dataset. In addition, some approaches address how to improve the quality of datasets through a repair process that corrects invalidities caused by constraint violations by either removing or adding triples. In this paper we argue that provenance is a critical factor that should be taken into account during repairs, to ensure that the most reliable data are kept. Based on this idea, we propose quality metrics that take provenance into account and evaluate their applicability as repair guidelines in a particular data fusion setting.
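A toy sketch of a provenance-guided repair decision is shown below: when two triples violate a functional-property constraint, the one from the more trusted source is kept. The trust scores, sources and triples are hypothetical, and this rule is only one simple instance of the kind of provenance-aware repair guideline the paper argues for.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str
    source: str   # provenance: the dataset/graph the triple came from

# Hypothetical provenance-based trust scores for each source.
trust = {"http://example.org/sourceA": 0.9, "http://example.org/sourceB": 0.4}

def repair_functional_conflict(t1: Triple, t2: Triple) -> Triple:
    """Given two triples violating a functional-property constraint (same subject and
    predicate, different objects), keep the triple from the more trusted source."""
    return t1 if trust.get(t1.source, 0.0) >= trust.get(t2.source, 0.0) else t2

a = Triple(":city1", ":population", "500000", "http://example.org/sourceA")
b = Triple(":city1", ":population", "350000", "http://example.org/sourceB")
print("kept:", repair_functional_conflict(a, b))
```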