936 resultados para Data quality problems
Resumo:
Open Educational Resources (OER) are teaching, learning and research materials that have been released under an open licence that permits online access and re-use by others. The 2012 Paris OER Declaration encourages the open licensing of educational materials produced with public funds. Digital data and data sets produced as a result of scientific and non-scientific research are an increasingly important category of educational materials. This paper discusses the legal challenges presented when publicly funded research data is made available as OER, arising from intellectual property rights, confidentiality and information privacy laws, and the lack of a legal duty to ensure data quality. If these legal challenges are not understood, addressed and effectively managed, they may impede and restrict access to and re-use of research data. This paper identifies some of the legal challenges that need to be addressed and describes 10 proposed best practices which are recommended for adoption to so that publicly funded research data can be made available for access and re-use as OER.
Resumo:
Citizen Science projects are initiatives in which members of the general public participate in scientific research projects and perform or manage research-related tasks such as data collection and/or data annotation. Citizen Science is technologically possible and scientifically significant. However, as the gathered information is from the crowd, the data quality is always hard to manage. There are many ways to manage data quality, and reputation management is one of the common approaches. In recent year, many research teams have deployed many audio or image sensors in natural environment in order to monitor the status of animals or plants. The collected data will be analysed by ecologists. However, as the amount of collected data is exceedingly huge and the number of ecologists is very limited, it is impossible for scientists to manually analyse all these data. The functions of existing automated tools to process the data are still very limited and the results are still not very accurate. Therefore, researchers have turned to recruiting general citizens who are interested in helping scientific research to do the pre-processing tasks such as species tagging. Although research teams can save time and money by recruiting general citizens to volunteer their time and skills to help data analysis, the reliability of contributed data varies a lot. Therefore, this research aims to investigate techniques to enhance the reliability of data contributed by general citizens in scientific research projects especially for acoustic sensing projects. In particular, we aim to investigate how to use reputation management to enhance data reliability. Reputation systems have been used to solve the uncertainty and improve data quality in many marketing and E-Commerce domains. The commercial organizations which have chosen to embrace the reputation management and implement the technology have gained many benefits. Data quality issues are significant to the domain of Citizen Science due to the quantity and diversity of people and devices involved. However, research on reputation management in this area is relatively new. We therefore start our investigation by examining existing reputation systems in different domains. Then we design novel reputation management approaches for Citizen Science projects to categorise participants and data. We have investigated some critical elements which may influence data reliability in Citizen Science projects. These elements include personal information such as location and education and performance information such as the ability to recognise certain bird calls. The designed reputation framework is evaluated by a series of experiments involving many participants for collecting and interpreting data, in particular, environmental acoustic data. Our research in exploring the advantages of reputation management in Citizen Science (or crowdsourcing in general) will help increase awareness among organizations that are unacquainted with its potential benefits.
Resumo:
In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, Webput utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is also proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Experiments based on several real-world data collections demonstrate that WebPut outperforms existing approaches.
Resumo:
The Council of Australian Governments (COAG) in 2003 gave in-principle approval to a best-practice report recommending a holistic approach to managing natural disasters in Australia incorporating a move from a traditional response-centric approach to a greater focus on mitigation, recovery and resilience with community well-being at the core. Since that time, there have been a range of complementary developments that have supported the COAG recommended approach. Developments have been administrative, legislative and technological, both, in reaction to the COAG initiative and resulting from regular natural disasters. This paper reviews the characteristics of the spatial data that is becoming increasingly available at Federal, state and regional jurisdictions with respect to their being fit for the purpose for disaster planning and mitigation and strengthening community resilience. In particular, Queensland foundation spatial data, which is increasingly accessible by the public under the provisions of the Right to Information Act 2009, Information Privacy Act 2009, and recent open data reform initiatives are evaluated. The Fitzroy River catchment and floodplain is used as a case study for the review undertaken. The catchment covers an area of 142,545 km2, the largest river catchment flowing to the eastern coast of Australia. The Fitzroy River basin experienced extensive flooding during the 20102011 Queensland floods. The basin is an area of important economic, environmental and heritage values and contains significant infrastructure critical for the mining and agricultural sectors, the two most important economic sectors for Queensland State. Consequently, the spatial datasets for this area play a critical role in disaster management and for protecting critical infrastructure essential for economic and community well-being. The foundation spatial datasets are assessed for disaster planning and mitigation purposes using data quality indicators such as resolution, accuracy, integrity, validity and audit trail.
Resumo:
Severe power quality problems can arise when a large number of single-phase distributed energy resources (DERs) are connected to a low-voltage power distribution system. Due to the random location and size of DERs, it may so happen that a particular phase generates excess power than its load demand. In such an event, the excess power will be fed back to the distribution substation and will eventually find its way to the transmission network, causing undesirable voltage-current unbalance. As a solution to this problem, the article proposes the use of a distribution static compensator (DSTATCOM), which regulates voltage at the point of common coupling (PCC), thereby ensuring balanced current flow from and to the distribution substation. Additionally, this device can also support the distribution network in the absence of the utility connection, making the distribution system work as a microgrid. The proposals are validated through extensive digital computer simulation studies using PSCADTM
Resumo:
Industrialised building (IB) is considered by many to have an important role to play in China's residential construction industry due to its potential for improved quality, productivity, efficiency, safety and sustainability. It is surprising, therefore, that although a large number of construction programmes have been completed in the country in recent years, very few have been built in this manner. Quite why this situation exists is unknown. The well-known problems with IB, such as the constraints placed on designer freedom, may be the cause. It is equally possible that, as is typical with developing countries such as China, cost or government issues dominate. On the other hand, in comparison with other countries, the construction industry in China has been widely criticised for its lack of modernity. Either way, there is an urgent need to assess and understand the hindrances to the adoption of IB in residential construction in order to identify what corrective measures, if any, need to be taken. Towards this end, we first identify a set of critical factors (CFs) for assessing the hindrances to IB adoption in China. This involves the analysis of research data collected by a questionnaire survey of experienced housing developers and professionals working in China's construction industry sector. Fuzzy set theory is used in the selection of the CFs. These CFs comprise, in rank order: higher initial cost; lack of skilled labour in IB; manufacturing capability and involvement issues and product quality problems; lack of supply chain; lack of codes and standards; and lack of government incentives, directives and promotion. The establishment of the CFs provides a basis for local construction sectors to better equip themselves for future implementation of IB. The findings also indicate a current need for formulating improved policies and strategies to encourage the further development of IB in China at present.
Resumo:
Numerous statements and declarations have been made over recent decades in support of open access to research data. The growing recognition of the importance of open access to research data has been accompanied by calls on public research funding agencies and universities to facilitate better access to publicly funded research data so that it can be re-used and redistributed as public goods. International and inter-governmental bodies such as the ICSU/CODATA, the OECD and the European Union are strong supporters of open access to and re-use of publicly funded research data. This thesis focuses on the research data created by university researchers in Malaysian public universities whose research activities are funded by the Federal Government of Malaysia. Malaysia, like many countries, has not yet formulated a policy on open access to and re-use of publicly funded research data. Therefore, the aim of this thesis is to develop a policy to support the objective of enabling open access to and re-use of publicly funded research data in Malaysian public universities. Policy development is very important if the objective of enabling open access to and re-use of publicly funded research data is to be successfully achieved. In developing the policy, this thesis identifies a myriad of legal impediments arising from intellectual property rights, confidentiality, privacy and national security laws, novelty requirements in patent law and lack of a legal duty to ensure data quality. Legal impediments such as these have the effect of restricting, obstructing, hindering or slowing down the objective of enabling open access to and re-use of publicly funded research data. A key focus in the formulation of the policy was the need to resolve the various legal impediments that have been identified. This thesis analyses the existing policies and guidelines of Malaysian public universities to ascertain to what extent the legal impediments have been resolved. An international perspective is adopted by making a comparative analysis of the policies of public research funding agencies and universities in the United Kingdom, the United States and Australia to understand how they have dealt with the identified legal impediments. These countries have led the way in introducing policies which support open access to and re-use of publicly funded research data. As well as proposing a policy supporting open access to and re-use of publicly funded research data in Malaysian public universities, this thesis provides procedures for the implementation of the policy and guidelines for addressing the legal impediments to open access and re-use.
Resumo:
This article presents the field applications and validations for the controlled Monte Carlo data generation scheme. This scheme was previously derived to assist the Mahalanobis squared distancebased damage identification method to cope with data-shortage problems which often cause inadequate data multinormality and unreliable identification outcome. To do so, real-vibration datasets from two actual civil engineering structures with such data (and identification) problems are selected as the test objects which are then shown to be in need of enhancement to consolidate their conditions. By utilizing the robust probability measures of the data condition indices in controlled Monte Carlo data generation and statistical sensitivity analysis of the Mahalanobis squared distance computational system, well-conditioned synthetic data generated by an optimal controlled Monte Carlo data generation configurations can be unbiasedly evaluated against those generated by other set-ups and against the original data. The analysis results reconfirm that controlled Monte Carlo data generation is able to overcome the shortage of observations, improve the data multinormality and enhance the reliability of the Mahalanobis squared distancebased damage identification method particularly with respect to false-positive errors. The results also highlight the dynamic structure of controlled Monte Carlo data generation that makes this scheme well adaptive to any type of input data with any (original) distributional condition.
Resumo:
Objective To synthesise recent research on the use of machine learning approaches to mining textual injury surveillance data. Design Systematic review. Data sources The electronic databases which were searched included PubMed, Cinahl, Medline, Google Scholar, and Proquest. The bibliography of all relevant articles was examined and associated articles were identified using a snowballing technique. Selection criteria For inclusion, articles were required to meet the following criteria: (a) used a health-related database, (b) focused on injury-related cases, AND used machine learning approaches to analyse textual data. Methods The papers identified through the search were screened resulting in 16 papers selected for review. Articles were reviewed to describe the databases and methodology used, the strength and limitations of different techniques, and quality assurance approaches used. Due to heterogeneity between studies meta-analysis was not performed. Results Occupational injuries were the focus of half of the machine learning studies and the most common methods described were Bayesian probability or Bayesian network based methods to either predict injury categories or extract common injury scenarios. Models were evaluated through either comparison with gold standard data or content expert evaluation or statistical measures of quality. Machine learning was found to provide high precision and accuracy when predicting a small number of categories, was valuable for visualisation of injury patterns and prediction of future outcomes. However, difficulties related to generalizability, source data quality, complexity of models and integration of content and technical knowledge were discussed. Conclusions The use of narrative text for injury surveillance has grown in popularity, complexity and quality over recent years. With advances in data mining techniques, increased capacity for analysis of large databases, and involvement of computer scientists in the injury prevention field, along with more comprehensive use and description of quality assurance methods in text mining approaches, it is likely that we will see a continued growth and advancement in knowledge of text mining in the injury field.
Resumo:
Background Historically, the paper hand-held record (PHR) has been used for sharing information between hospital clinicians, general practitioners and pregnant women in a maternity shared-care environment. Recently in alignment with a National e-health agenda, an electronic health record (EHR) was introduced at an Australian tertiary maternity service to replace the PHR for collection and transfer of data. The aim of this study was to examine and compare the completeness of clinical data collected in a PHR and an EHR. Methods We undertook a comparative cohort design study to determine differences in completeness between data collected from maternity records in two phases. Phase 1 data were collected from the PHR and Phase 2 data from the EHR. Records were compared for completeness of best practice variables collected The primary outcome was the presence of best practice variables and the secondary outcomes were the differences in individual variables between the records. Results Ninety-four percent of paper medical charts were available in Phase 1 and 100% of records from an obstetric database in Phase 2. No PHR or EHR had a complete dataset of best practice variables. The variables with significant improvement in completeness of data documented in the EHR, compared with the PHR, were urine culture, glucose tolerance test, nuchal screening, morphology scans, folic acid advice, tobacco smoking, illicit drug assessment and domestic violence assessment (p=0.001). Additionally the documentation of immunisations (pertussis, hepatitis B, varicella, fluvax) were markedly improved in the EHR (p=0.001). The variables of blood pressure, proteinuria, blood group, antibody, rubella and syphilis status, showed no significant differences in completeness of recording. Conclusion This is the first paper to report on the comparison of clinical data collected on a PHR and EHR in a maternity shared-care setting. The use of an EHR demonstrated significant improvements to the collection of best practice variables. Additionally, the data in an EHR were more available to relevant clinical staff with the appropriate log-in and more easily retrieved than from the PHR. This study contributes to an under-researched area of determining data quality collected in patient records.
Resumo:
Decision-making is such an integral aspect in health care routine that the ability to make the right decisions at crucial moments can lead to patient health improvements. Evidence-based practice, the paradigm used to make those informed decisions, relies on the use of current best evidence from systematic research such as randomized controlled trials. Limitations of the outcomes from randomized controlled trials (RCT), such as quantity and quality of evidence generated, has lowered healthcare professionals confidence in using EBP. An alternate paradigm of Practice-Based Evidence has evolved with the key being evidence drawn from practice settings. Through the use of health information technology, electronic health records (EHR) capture relevant clinical practice evidence. A data-driven approach is proposed to capitalize on the benefits of EHR. The issues of data privacy, security and integrity are diminished by an information accountability concept. Data warehouse architecture completes the data-driven approach by integrating health data from multi-source systems, unique within the healthcare environment.
Resumo:
Big Datasets are endemic, but they are often notoriously difficult to analyse because of their size, heterogeneity, history and quality. The purpose of this paper is to open a discourse on the use of modern experimental design methods to analyse Big Data in order to answer particular questions of interest. By appealing to a range of examples, it is suggested that this perspective on Big Data modelling and analysis has wide generality and advantageous inferential and computational properties. In particular, the principled experimental design approach is shown to provide a flexible framework for analysis that, for certain classes of objectives and utility functions, delivers near equivalent answers compared with analyses of the full dataset under a controlled error rate. It can also provide a formalised method for iterative parameter estimation, model checking, identification of data gaps and evaluation of data quality. Finally, it has the potential to add value to other Big Data sampling algorithms, in particular divide-and-conquer strategies, by determining efficient sub-samples.
Resumo:
The explosive growth in the development of Traditional Chinese Medicine (TCM) has resulted in the continued increase in clinical and research data. The lack of standardised terminology, flaws in data quality planning and management of TCM informatics are preventing clinical decision-making, drug discovery and education. This paper argues that the introduction of data warehousing technologies to enhance the effectiveness and durability in TCM is paramount. To showcase the role of data warehousing in the improvement of TCM, this paper presents a practical model for data warehousing with detailed explanation, which is based on the structured electronic records, for TCM clinical researches and medical knowledge discovery.
Resumo:
Over recent years, the focus in road safety has shifted towards a greater understanding of road crash serious injuries in addition to fatalities. Police reported crash data are often the primary source of crash information; however, the definition of serious injury within these data is not consistent across jurisdictions and may not be accurately operationalised. This study examined the linkage of police-reported road crash data with hospital data to explore the potential for linked data to enhance the quantification of serious injury. Data from the Queensland Road Crash Database (QRCD), the Queensland Hospital Admitted Patients Data Collection (QHAPDC), Emergency Department Information System (EDIS), and the Queensland Injury Surveillance Unit (QISU) for the year 2009 were linked. Nine different estimates of serious road crash injury were produced. Results showed that there was a large amount of variation in the estimates of the number and profile of serious road crash injuries depending on the definition or measure used. The results also showed that as the definition of serious injury becomes more precise the vulnerable road users become more prominent. These results have major implications in terms of how serious injuries are identified for reporting purposes. Depending on the definitions used, the calculation of cost and understanding of the impact of serious injuries would vary greatly. This study has shown how data linkage can be used to investigate issues of data quality. It has also demonstrated the potential improvements to the understanding of the road safety problem, particularly serious injury, by conducting data linkage.
Resumo:
This study identified the areas of poor specificity in national injury hospitalization data and the areas of improvement and deterioration in specificity over time. A descriptive analysis of ten years of national hospital discharge data for Australia from July 2002-June 2012 was performed. Proportions and percentage change of defined/undefined codes over time was examined. At the intent block level, accidents and assault were the most poorly defined with over 11% undefined in each block. The mechanism blocks for accidents showed a significant deterioration in specificity over time with up to 20% more undefined codes in some mechanisms. Place and activity were poorly defined at the broad block level (43% and 72% undefined respectively). Private hospitals and hospitals in very remote locations recorded the highest proportion of undefined codes. Those aged over 60 years and females had the higher proportion of undefined code usage. This study has identified significant, and worsening, deficiencies in the specificity of coded injury data in several areas. Focal attention is needed to improve the quality of injury data, especially on those identified in this study, to provide the evidence base needed to address the significant burden of injury in the Australian community.