969 results for Data Quality
Abstract:
Construction organizations typically deal with large volumes of project data containing valuable information, yet these organizations do not use these data effectively for planning and decision-making. There are two reasons. First, the information systems in construction organizations are designed to support day-to-day construction operations. The data stored in these systems are often non-validated and non-integrated, and are available in a format that makes it difficult for decision makers to use them to make timely decisions. Second, the organizational structure and the IT infrastructure are often not compatible with the information systems, resulting in higher operational costs and lower productivity. These two issues were investigated in this research with the objective of developing systems that are structured for effective decision-making. A framework was developed to guide the storage and retrieval of validated and integrated data for timely decision-making and to enable construction organizations to redesign their organizational structure and IT infrastructure to match information system capabilities. The research focused on construction owner organizations that are continuously involved in multiple construction projects. Action research and data warehousing techniques were used to develop the framework. One hundred and sixty-three construction owner organizations were surveyed to assess their data needs, data management practices and extent of use of information systems in planning and decision-making. For in-depth analysis, Miami-Dade Transit (MDT), which is in charge of all transportation-related construction projects in Miami-Dade County, was selected. A functional model and a prototype system were developed to test the framework. The results revealed significant improvements in data management and decision-support operations, examined through various qualitative (ease of data access, data quality, response time, productivity improvement, etc.) and quantitative (time savings and operational cost savings) measures. The research results were validated first by MDT and then by a representative group of twenty construction owner organizations involved in various types of construction projects.
Abstract:
An array of Bio-Argo floats equipped with radiometric sensors has recently been deployed in various open-ocean areas representative of the diversity of trophic and bio-optical conditions prevailing in the so-called Case 1 waters. Around solar noon and almost every day, each float acquires 0-250 m vertical profiles of Photosynthetically Available Radiation and downward irradiance at three wavelengths (380, 412 and 490 nm). Up until now, more than 6500 profiles have been acquired for each radiometric channel. As these radiometric data are collected without operator control and regardless of meteorological conditions, specific and automatic data processing protocols have to be developed. Here, we present a data quality-control procedure aimed at verifying profile shapes and providing near real-time data distribution. This procedure is specifically developed to: 1) identify the main measurement issues (dark signal, atmospheric clouds, spikes and wave-focusing occurrences); and 2) validate the final data with a hierarchy of tests to ensure their scientific utilization. The procedure, adapted to each of the four radiometric channels, is designed to flag each profile in a way compliant with the data management procedure used by the Argo program. The main perturbations in the light field are identified by the new protocols with good performance over the whole dataset, which highlights the procedure's potential applicability at the global scale. Finally, comparison with modeled surface irradiances allows the accuracy of quality-controlled measured irradiance values to be assessed and any possible evolution over the float lifetime due to biofouling and instrumental drift to be identified.
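As an illustration of the kind of automatic flagging chain the abstract describes (dark signal, cloud perturbations, spikes and wave focusing), a minimal Python sketch is given below. The thresholds, the simple running-median spike test and the function name are illustrative assumptions, not the operational Bio-Argo procedure.

```python
import numpy as np

def quality_flag_profile(irradiance, dark_threshold=1e-4, spike_factor=5.0):
    """Assign coarse quality flags (1=good, 3=probably bad, 4=bad) to a single
    downward-irradiance profile. Illustrative only: thresholds and tests are
    assumptions, not the operational Bio-Argo quality-control chain."""
    flags = np.ones_like(irradiance, dtype=int)

    # Dark-signal test: values at depth indistinguishable from the sensor dark level.
    dark = irradiance < dark_threshold
    flags[dark] = 3

    # Spike test: compare each value with a running median of its neighbours;
    # wave focusing near the surface typically produces positive spikes.
    med = np.array([np.median(irradiance[max(0, i - 2): i + 3])
                    for i in range(len(irradiance))])
    resid = np.abs(irradiance - med)
    nonzero = resid[resid > 0]
    spikes = resid > spike_factor * np.median(nonzero) if nonzero.size else np.zeros_like(dark)
    flags[spikes] = 4

    # Cloud test (profile level): log-irradiance should decrease monotonically
    # with depth in clear conditions; large positive excursions hint at clouds.
    log_e = np.log(np.clip(irradiance, 1e-12, None))
    cloudy = np.any(np.diff(log_e) > 0.5)

    profile_flag = 4 if cloudy or spikes.mean() > 0.1 else int(flags.max())
    return flags, profile_flag

# Toy usage with a synthetic exponentially attenuated profile plus one spike.
z = np.arange(0, 250, 5.0)
ed = 1.5 * np.exp(-0.05 * z)
ed[10] *= 8          # simulated wave-focusing spike
point_flags, prof_flag = quality_flag_profile(ed)
print(prof_flag)
```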
Abstract:
The speed with which data has moved from being scarce, expensive and valuable, justifying detailed and careful verification and analysis, to a situation where streams of detailed data are almost too large to handle has caused a series of shifts to occur. Legal systems already have severe problems keeping up with, or even in touch with, the rate at which unexpected outcomes flow from information technology. Until recently, Big Data applications were driven by the capacity to harness massive quantities of existing data. Now data flows in real time are rising swiftly, becoming more invasive and offering monitoring potential that is eagerly sought by commerce and government alike. The ambiguities as to who owns this often remarkably intrusive personal data need to be resolved, and rapidly, but resolution is likely to encounter rising resistance from industrial and commercial bodies who see this data flow as 'theirs'. There have been many changes in ICT that have led to stresses in resolving the conflicts between IP exploiters and their customers, but this one is of a different scale, due to the wide potential for individual customisation of pricing, identification, and the rising commercial value of integrated streams of diverse personal data. A new reconciliation between the parties involved is needed: new business models, and a shift in the current confusion over who owns what data towards alignments that better accord with community expectations. After all, they are the customers, and the emergence of information monopolies needs to be balanced by appropriate consumer/subject rights. This will be a difficult discussion, but one that is needed to realise the great benefits to all that are clearly available if these issues can be positively resolved. The customers need to make these data flows contestable in some form. These Big Data flows are only going to grow and become ever more instructive. A better balance is necessary. For the first time these changes are directly affecting the governance of democracies, as the very effective micro-targeting tools deployed in recent elections have shown. Yet the data gathered is not available to the subjects. This is not a survivable social model. The Private Data Commons needs our help. Businesses and governments exploit big data without regard for issues of legality, data quality, disparate data meanings, and process quality. This often results in poor decisions, with individuals bearing the greatest risk. The threats harbored by big data extend far beyond the individual, however, and call for new legal structures, business processes, and concepts such as a Private Data Commons. This Web extra is the audio part of a video in which author Marcus Wigan expands on his article "Big Data's Big Unintended Consequences" and discusses these issues.
Abstract:
This document does NOT address the issue of oxygen data quality control (either real-time or delayed mode). As a preliminary step towards that goal, this document seeks to ensure that all countries deploying floats equipped with oxygen sensors document the data and metadata related to these floats properly. We produced this document in response to action item 14 from the AST-10 meeting in Hangzhou (March 22-23, 2009). Action item 14: Denis Gilbert to work with Taiyo Kobayashi and Virginie Thierry to ensure DACs are processing oxygen data according to recommendations. If the recommendations contained herein are followed, we will end up with a more uniform set of oxygen data within the Argo data system, allowing users to begin analysing not only their own oxygen data but also those of others, in the true spirit of Argo data sharing. Indications provided in this document are valid as of the date of writing. It is very likely that changes in sensors, calibrations and conversion equations will occur in the future. Please contact V. Thierry (vthierry@ifremer.fr) about any inconsistencies or missing information. A dedicated webpage on the Argo Data Management website (www) contains all information regarding Argo oxygen data management: current and previous versions of this cookbook, oxygen sensor manuals, calibration sheet examples, examples of Matlab code to process oxygen data, test data, etc.
Abstract:
Ensemble stream modeling and data-cleaning are sensor information processing systems with different training and testing methods by which their goals are cross-validated. This research examines a mechanism that seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events so as to eliminate uncorrelated noise and choose the most likely model without overfitting, thus obtaining higher model confidence. Higher-quality streams can be realized by combining many short streams into an ensemble that has the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction for events such as bush or natural forest fires, we take the burnt area (BA*), sensed ground truth obtained from logs, as our target variable. Even though this is an obvious model choice, the results are disappointing, for two reasons: first, the histogram of fire activity is highly skewed; second, the measured sensor parameters are highly correlated. Since using non-descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate the averaging effects; the resulting histogram is more satisfactory and conceptual knowledge is learned from the sensor streams. Second is the process of feature induction by cross-validating attributes against single or multi-target variables to minimize training error. We use the F-measure, which combines precision and recall, to determine the false alarm rate of fire events. The multi-target data-cleaning trees use the information purity of the target leaf nodes to learn higher-order features. A sensitive variance measure such as the F-test is performed at each node's split to select the best attribute. The ensemble stream model approach proved to improve when using complicated features with a simpler tree classifier. The ensemble framework for data-cleaning and the enhancements to quantify the quality of fitness of sensors (30% spatial, 10% temporal, and 90% mobility reduction) led to the formation of streams for sensor-enabled applications, which further motivates the novelty of stream quality labeling and its importance in handling the vast amounts of real-time mobile streams generated today.
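The abstract's use of the F-measure (the harmonic mean of precision and recall) to characterise the false-alarm behaviour of fire-event detection can be illustrated with a small, self-contained sketch; the event labels and helper function below are made up for illustration and are not taken from the thesis's data-mining tool.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, F1 and false-alarm rate for binary event labels
    (1 = fire event). Illustrative helper, not the thesis's tooling."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed events
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    false_alarm_rate = fp / (fp + tp) if fp + tp else 0.0   # equals 1 - precision
    return precision, recall, f1, false_alarm_rate

# Synthetic ground truth vs. detector output for ten sensing windows.
truth    = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
detected = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0]
print(precision_recall_f1(truth, detected))
```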
Abstract:
Recent marine long-offset transient electromagnetic (LOTEM) measurements yielded the offshore delineation of a fresh groundwater body beneath the seafloor in the region of Bat Yam, Israel. The LOTEM application was effective in detecting this freshwater body underneath the Mediterranean Sea and allowed an estimation of its seaward extent. However, the measured data set was insufficient to understand the hydrogeological configuration and the mechanism controlling the occurrence of this fresh groundwater discovery. In particular, the lateral geometry of the freshwater boundary, important for the hydrogeological modelling, could not be resolved. Without such an understanding, rational management of this unexploited groundwater reservoir is not possible. Two new high-resolution marine time-domain electromagnetic methods are theoretically developed to derive the hydrogeological structure of the western aquifer boundary. The first is called the Circular Electric Dipole (CED). It is the land-based analogue of the Vertical Electric Dipole (VED), which is commonly applied to detect resistive structures in the subsurface. Although the CED shows exceptional detectability characteristics in the step-off signal towards the sub-seafloor freshwater body, an actual application was not carried out within the scope of this study. It was found that the method suffers from insufficient signal strength to adequately delineate the resistive aquifer under realistic noise conditions. Moreover, modelling studies demonstrated that severe signal distortions are caused by the slightest geometrical inaccuracies. As a result, a successful application of the CED in Israel proved to be rather doubtful. A second method, called the Differential Electric Dipole (DED), is developed as an alternative to the intended CED method. Compared to the conventional marine time-domain electromagnetic system, which commonly applies a horizontal electric dipole transmitter, the DED is composed of two horizontal electric dipoles in an in-line configuration that share a common central electrode. Theoretically, the DED has detectability and resolution characteristics similar to the conventional LOTEM system. However, its superior lateral resolution towards multi-dimensional resistivity structures makes an application desirable. Furthermore, the method is less susceptible to geometrical errors, making an application in Israel feasible. Within the scope of this thesis, the novel marine DED method is substantiated using several one-dimensional (1D) and multi-dimensional (2D/3D) modelling studies. The main emphasis lies on the application in Israel. Preliminary resistivity models are derived from the previous marine LOTEM measurement and tested for a DED application. The DED method is effective in locating the two-dimensional resistivity structure at the western aquifer boundary. Moreover, a prediction regarding the hydrogeological boundary conditions is feasible, provided a brackish-water zone exists at the head of the interface. A seafloor-based DED transmitter/receiver system was designed and built at the Institute of Geophysics and Meteorology at the University of Cologne. The first DED measurements were carried out in Israel in April 2016. The acquired data set is the first of its kind. The measured data were processed and subsequently interpreted using 1D inversion. The intended aim of interpreting both step-on and step-off signals failed due to the insufficient data quality of the latter. Yet, the 1D inversion models of the DED step-on signals clearly detect the freshwater body for receivers located close to the Israeli coast. Additionally, a lateral resistivity contrast is observable in the 1D inversion models that allows the seaward extent of this freshwater body to be constrained. A large-scale 2D modelling study followed the 1D interpretation. In total, 425,600 forward calculations were conducted to find a sub-seafloor resistivity distribution that adequately explains the measured data. The results indicate that the western aquifer boundary is located 3600 m to 3700 m off the coast. Moreover, a brackish-water zone of 3 Ωm to 5 Ωm with a lateral extent of less than 300 m is likely located at the head of the freshwater aquifer. Based on these results, it is predicted that the sub-seafloor freshwater body is indeed open to the sea and may be vulnerable to seawater intrusion.
Abstract:
Collecting ground truth data is an important step to be accomplished before performing a supervised classification. However, its quality depends on human, financial and time resources. It is therefore important to apply a validation process to assess the reliability of the acquired data. In this study, agricultural information was collected in the Brazilian Amazonian State of Mato Grosso in order to map crop expansion based on MODIS EVI temporal profiles. The field work was carried out through interviews for the years 2005-2006 and 2006-2007. This work presents a methodology to validate the training data quality and determine the optimal sample to be used according to the classifier employed. The technique is based on the detection of outlier pixels for each class and is carried out by computing Mahalanobis distances for each pixel. The higher the distance, the further the pixel is from the class centre. Preliminary observations through the coefficient of variation validate the efficiency of the technique in detecting outliers. Then, various subsamples are defined by applying different thresholds to exclude outlier pixels from the classification process. The classification results demonstrate the robustness of the Maximum Likelihood and Spectral Angle Mapper classifiers; indeed, those classifiers were insensitive to outlier exclusion. On the contrary, the decision tree classifier showed better results when deleting 7.5% of pixels from the training data. The technique managed to detect outliers for all classes. In this study, few outliers were present in the training data, so the classification quality was not deeply affected by them.
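The outlier-screening step described above, computing the Mahalanobis distance of every training pixel to its class centre and excluding the most distant pixels, can be sketched as follows; the chi-square threshold and the toy EVI matrix are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, quantile=0.975):
    """Flag outlier samples of one class based on their Mahalanobis distance
    to the class centre. X has shape (n_pixels, n_dates) of EVI values.
    Sketch only; the threshold choice is an assumption, not the paper's."""
    mean = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X - mean
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # squared distances
    threshold = chi2.ppf(quantile, df=X.shape[1])        # chi-square cut-off
    return d2, d2 > threshold

# Toy example: 200 pixels of one crop class described by 6 EVI dates,
# with a handful of mislabelled (shifted) pixels appended.
rng = np.random.default_rng(0)
clean = rng.normal(loc=0.6, scale=0.05, size=(200, 6))
mislabelled = rng.normal(loc=0.3, scale=0.05, size=(5, 6))
X = np.vstack([clean, mislabelled])
d2, is_outlier = mahalanobis_outliers(X)
print(is_outlier.sum(), "pixels flagged as outliers")
```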
Abstract:
The term Artificial Intelligence has acquired a lot of baggage since its introduction and, in its current incarnation, is synonymous with Deep Learning (DL). The sudden availability of data and computing resources has opened the gates to myriads of applications. Not all are created equal, though, and problems might arise especially in fields not closely related to the tasks that concern the tech companies that spearheaded DL. The perspective of practitioners seems to be changing, however. Human-Centric AI emerged in the last few years as a new way of thinking about DL and AI applications from the ground up, with special attention to their relationship with humans. The goal is to design a system that can gracefully integrate into already established workflows, as in many real-world scenarios AI may not be good enough to completely replace humans; often this replacement may not even be needed or desirable. Another important perspective comes from Andrew Ng, a DL pioneer, who recently started shifting the focus of development from "better models" towards better, and smaller, data; he calls his approach Data-Centric AI. Without downplaying the importance of pushing the state of the art in DL, we must recognize that if the goal is creating a tool for humans to use, more raw performance may not align with more utility for the final user. A Human-Centric approach is compatible with a Data-Centric one, and we find that the two overlap nicely when human expertise is used as the driving force behind data quality. This thesis documents a series of case studies where these approaches were employed, to different extents, to guide the design and implementation of intelligent systems. We found that human expertise proved crucial in improving datasets and models. The last chapter includes a slight deviation, with studies on the pandemic, still preserving the human- and data-centric perspective.
Abstract:
Artificial Intelligence (AI) and Machine Learning (ML) are novel data analysis techniques providing very accurate prediction results. They are widely adopted in a variety of industries to improve efficiency and decision-making, but they are also being used to develop intelligent systems. Their success is grounded in complex mathematical models whose decisions and rationale are usually difficult for human users to comprehend, to the point of being dubbed black boxes. This is particularly relevant in sensitive and highly regulated domains. To mitigate and possibly solve this issue, the Explainable AI (XAI) field became prominent in recent years. XAI consists of models and techniques to enable understanding of the intricate patterns discovered by black-box models. In this thesis, we consider model-agnostic XAI techniques that can be applied to tabular data, with a particular focus on the Credit Scoring domain. Special attention is dedicated to the LIME framework, for which we propose several modifications to the vanilla algorithm, in particular: a pair of complementary Stability Indices that accurately measure LIME stability, and the OptiLIME policy, which helps the practitioner find the proper balance between explanation stability and reliability. We subsequently put forward GLEAMS, a model-agnostic interpretable surrogate model which needs to be trained only once while providing both local and global explanations of the black-box model. GLEAMS produces feature attributions and what-if scenarios, from both the dataset and the model perspective. Finally, we argue that synthetic data are an emerging trend in AI, being used more and more to train complex models instead of the original data. To be able to explain the outcomes of such models, we must guarantee that synthetic data are reliable enough for their explanations to translate to real-world individuals. To this end we propose DAISYnt, a suite of tests to measure synthetic tabular data quality and privacy.
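The thesis's specific Stability Indices are not reproduced here, but the underlying idea, repeating a stochastic local explanation of the same instance and measuring how much the returned feature weights vary, can be illustrated with a small library-agnostic sketch; the `explain_fn` interface, the toy explainer and the two summary statistics below are our own assumptions, not the proposed indices.

```python
import numpy as np

def explanation_stability(explain_fn, instance, n_repeats=20, top_k=3, seed=0):
    """Repeat a stochastic local explanation and summarise its stability:
    (1) mean Jaccard overlap of the top-k features across runs, and
    (2) mean coefficient of variation of each feature's weight."""
    rng = np.random.default_rng(seed)
    weights = np.array([explain_fn(instance, rng) for _ in range(n_repeats)])

    top_sets = [set(np.argsort(-np.abs(w))[:top_k]) for w in weights]
    jaccards = [len(a & b) / len(a | b)
                for i, a in enumerate(top_sets) for b in top_sets[i + 1:]]

    cv = np.abs(weights.std(axis=0) / (weights.mean(axis=0) + 1e-12))
    return float(np.mean(jaccards)), float(cv.mean())

# Toy stochastic explainer standing in for LIME: weights of a local linear fit
# on a randomly perturbed neighbourhood of the instance (noise makes it unstable).
def toy_explainer(x, rng, n_samples=200, noise=0.1):
    X = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=noise, size=n_samples)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

x0 = np.array([1.0, 0.5, -0.2, 0.3])
print(explanation_stability(toy_explainer, x0))
```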
Abstract:
This paper examines the spatial pattern of ill-defined causes of death across Brazilian regions and its relationship with the evolution of death-registry completeness and changes in the mortality age profile. We make use of the Brazilian Health Informatics Department mortality database and population censuses from 1980 to 2010. We applied demographic methods to evaluate the quality of mortality data for 137 small areas and to correct for under-registration of death counts where necessary. The second part of the analysis uses linear regression models to investigate the relationship between, on the one hand, changes in death-count coverage and the age profile of mortality and, on the other, changes in the reporting of ill-defined causes of death. The completeness of death-count coverage increased from about 80% in 1980-1991 to over 95% in 2000-2010, while the percentage of ill-defined causes of death fell by about 53% in the country. The analysis suggests that the government's efforts to improve data quality are proving successful and will allow for a better understanding of the dynamics of health and the mortality transition.
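As a rough illustration of the two analytical steps mentioned, correcting registered deaths for incomplete coverage and then regressing changes in ill-defined causes on changes in coverage across areas, a sketch follows; the figures are synthetic and the simple ratio correction stands in for the formal demographic methods the paper applies.

```python
import numpy as np

# Step 1: correct registered deaths for under-registration.
# 'completeness' is the estimated share of deaths actually captured by the registry.
def corrected_deaths(registered, completeness):
    return registered / completeness

print(corrected_deaths(8_000, 0.80))   # ~10,000 deaths once coverage is accounted for

# Step 2: simple linear regression of the change in the percentage of
# ill-defined causes on the change in death-count coverage across areas.
rng = np.random.default_rng(1)
delta_coverage = rng.uniform(0.05, 0.25, size=137)                    # 137 small areas
delta_ill_defined = -0.5 * delta_coverage + rng.normal(0, 0.02, size=137)

X = np.column_stack([np.ones_like(delta_coverage), delta_coverage])
beta, *_ = np.linalg.lstsq(X, delta_ill_defined, rcond=None)
print("intercept, slope:", beta)   # negative slope: better coverage, fewer ill-defined causes
```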
Abstract:
OBJECTIVE: To assess the quality of hospital admission data for external causes in São José dos Campos, São Paulo, Brazil. METHODS: Admissions through the Unified Health System (SUS) for injuries due to external causes in the first half of 2003 at the Municipal Hospital, the referral centre for trauma care in the municipality, were studied by comparing the data recorded in the Hospital Information System with the medical records of 990 admissions. Agreement for variables related to the victim, the admission and the injury was assessed using the crude agreement rate and the Kappa coefficient. Injuries and external causes were coded according to the 10th revision of the International Classification of Diseases, chapters XIX and XX respectively. RESULTS: The crude agreement rate was good for variables related to the victim and the admission, ranging from 89.0% to 99.2%. Injuries showed excellent agreement, except for neck injuries (k=0.73), multiple injuries (k=0.67) and thoracic fractures (k=0.49). External causes showed excellent agreement for transport accidents (k=0.90) and falls (k=0.83). Reliability was lower for assaults (k=0.50), undetermined causes (k=0.37), and complications of medical care (k=0.03). There was excellent agreement for transport accidents involving pedestrians, cyclists and motorcyclists. CONCLUSION: Most study variables showed good quality at the level of aggregation analysed. Some variables related to the victim and some types of external causes require improvements in data quality. The hospital morbidity profile found confirmed transport accidents as an important external cause of hospital admission in the municipality.
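The two agreement measures used in the study, the crude agreement rate and the Kappa coefficient, can be computed as in the sketch below; the coded categories are invented toy data, and `cohen_kappa_score` is scikit-learn's standard implementation rather than the tooling used by the authors.

```python
from sklearn.metrics import cohen_kappa_score

# External-cause group coded in the Hospital Information System versus the
# same admissions re-coded from the medical records (toy data only).
his_codes = ["transport", "fall", "assault", "fall", "transport", "undetermined",
             "fall", "transport", "assault", "fall"]
records   = ["transport", "fall", "undetermined", "fall", "transport", "undetermined",
             "fall", "transport", "fall", "fall"]

crude_agreement = sum(a == b for a, b in zip(his_codes, records)) / len(his_codes)
kappa = cohen_kappa_score(his_codes, records)

print(f"crude agreement: {crude_agreement:.2f}, kappa: {kappa:.2f}")
```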
Abstract:
Background: Cerebral palsy (CP) patients have motor limitations that can affect functionality and abilities for activities of daily living (ADL). Health-related quality of life and health status instruments validated for these patients do not directly approach the concepts of functionality or ADL. The Child Health Assessment Questionnaire (CHAQ) seems to be a good instrument to approach this dimension, but it had never been used with CP patients. The purpose of the study was to verify the psychometric properties of the CHAQ applied to children and adolescents with CP. Methods: Parents or guardians of children and adolescents with CP, aged 5 to 18 years, answered the CHAQ. A healthy group of 314 children and adolescents was recruited during the validation of the CHAQ Brazilian version. Data quality, reliability and validity were studied. Motor function was evaluated with the Gross Motor Function Measure (GMFM). Results: Ninety-six parents/guardians answered the questionnaire. The age of the patients ranged from 5 to 17.9 years (average: 9.3). The rate of missing data was low (<9.3%). A floor effect was observed in two domains, being higher only in the visual analogue scales (<=35.5%). The ceiling effect was significant in all domains and particularly high in patients with quadriplegia (81.8 to 90.9%) and extrapyramidal forms (45.4 to 91.0%). The Cronbach alpha coefficient ranged from 0.85 to 0.95. Validity was appropriate: for discriminant validity, the correlation of the disability index with the visual analogue scales was not significant; for convergent validity, the CHAQ disability index had a strong correlation with the GMFM (0.77); for divergent validity, there was no correlation between the GMFM and the pain and overall evaluation scales; for criterion validity, the GMFM as well as the CHAQ detected differences in scores among the clinical types of CP (p < 0.01); for construct validity, the patients' disability index score (mean: 2.16; SD: 0.72) was higher than that of the healthy group (mean: 0.12; SD: 0.23) (p < 0.01). Conclusion: CHAQ reliability and validity were adequate for this population. However, further studies are necessary to verify the influence of the ceiling effect on the responsiveness of the instrument.
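Internal consistency is reported above with the Cronbach alpha coefficient; a minimal implementation of the coefficient on a made-up item matrix is sketched below (the scores are purely illustrative, not CHAQ data).

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# Toy matrix: 6 respondents answering 4 items scored 0-3 (a CHAQ-like scale).
scores = [[0, 1, 0, 1],
          [2, 2, 3, 2],
          [1, 1, 1, 2],
          [3, 3, 2, 3],
          [0, 0, 1, 0],
          [2, 3, 3, 3]]
print(round(cronbach_alpha(scores), 2))
```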
Abstract:
There is substantial disagreement among published epidemiological studies regarding environmental risk factors for Parkinson's disease (PD). Differences in the quality of measurement of environmental exposures may contribute to this variation. The current study examined the test-retest repeatability of self-report data on risk factors for PD obtained from a series of 32 PD cases recruited from neurology clinics and 29 healthy sex-, age- and residential suburb-matched controls. Exposure data were collected in face-to-face interviews using a structured questionnaire derived from previous epidemiological studies. High repeatability was demonstrated for 'lifestyle' exposures such as smoking and coffee/tea consumption (kappas 0.70-1.00). Environmental exposures that involved some action by the person, such as pesticide application and use of solvents and metals, also showed high repeatability (kappas > 0.78). Lower repeatability was seen for rural residency and bore-water consumption (kappas 0.39-0.74). In general, we found that case and control participants provided similar rates of incongruent and missing responses for categorical and continuous occupational, domestic, lifestyle and medical exposures.