7 results for Open Research Data
in DigitalCommons@The Texas Medical Center
Abstract:
Clinical Research Data Quality Literature Review and Pooled Analysis
We present a literature review and secondary analysis of data accuracy in clinical research and related secondary data uses. A total of 93 papers meeting our inclusion criteria were categorized according to the data processing methods. Quantitative data accuracy information was abstracted from the articles and pooled. Our analysis demonstrates that the accuracy associated with data processing methods varies widely, with error rates ranging from 2 errors per 10,000 files to 5019 errors per 10,000 fields. Medical record abstraction was associated with the highest error rates (70–5019 errors per 10,000 fields). Data entered and processed at healthcare facilities had comparable error rates to data processed at central data processing centers. Error rates for data processed with single entry in the presence of on-screen checks were comparable to those for double-entered data. While data processing and cleaning methods may explain a significant amount of the variability in data accuracy, additional factors not resolvable here likely exist.
Defining Data Quality for Clinical Research: A Concept Analysis
Despite notable previous attempts by experts to define data quality, the concept remains ambiguous and subject to the vagaries of natural language. This lack of clarity continues to hamper research related to data quality issues. We present a formal concept analysis of data quality, which builds on and synthesizes previously published work. We further posit that discipline-level specificity may be required to achieve the desired definitional clarity. To this end, we combine work from the clinical research domain with findings from the general data quality literature to produce a discipline-specific definition and operationalization for data quality in clinical research. While the results are helpful to clinical research, the methodology of concept analysis may be useful in other fields to clarify data quality attributes and to achieve operational definitions.
Medical Record Abstractor’s Perceptions of Factors Impacting the Accuracy of Abstracted Data
Medical record abstraction (MRA) is known to be a significant source of data errors in secondary data uses. Factors impacting the accuracy of abstracted data are not reported consistently in the literature. Two Delphi processes were conducted with experienced medical record abstractors to assess abstractors’ perceptions of these factors. The Delphi process identified 9 factors that were not found in the literature, and differed from the literature by 5 factors in the top 25%. The Delphi results refuted seven factors reported in the literature as impacting the quality of abstracted data. The results provide insight into and indicate content validity of a significant number of the factors reported in the literature. Further, the results indicate general consistency between the perceptions of clinical research medical record abstractors and those of registry and quality improvement abstractors.
Distributed Cognition Artifacts on Clinical Research Data Collection Forms
Medical record abstraction, a primary mode of data collection in secondary data use, is associated with high error rates. Distributed cognition in medical record abstraction has not been studied as a possible explanation for abstraction errors. We employed the theory of distributed representation and representational analysis to systematically evaluate cognitive demands in medical record abstraction and the extent of external cognitive support employed in a sample of clinical research data collection forms. We show that the cognitive load required for abstraction in 61% of the sampled data elements was high, exceedingly so in 9%. Further, the data collection forms did not support external cognition for the most complex data elements. High working memory demands are a possible explanation for the association of data errors with data elements requiring abstractor interpretation, comparison, mapping or calculation. The representational analysis used here can be used to identify data elements with high cognitive demands.
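The pooled analysis summarized in this record expresses accuracy as errors per 10,000 fields. As a rough illustration of that normalization, the Python sketch below converts raw audit counts to that unit and pools several audits by summing their counts. The function names and the example counts are hypothetical, and count-weighted pooling is an assumption; the abstract does not state which pooling method the authors used.

```python
from typing import Iterable, Tuple


def errors_per_10k(errors: int, fields_inspected: int) -> float:
    """Normalize an error count to errors per 10,000 fields inspected."""
    if fields_inspected <= 0:
        raise ValueError("fields_inspected must be positive")
    if errors < 0:
        raise ValueError("errors must be non-negative")
    return errors * 10_000 / fields_inspected


def pooled_errors_per_10k(studies: Iterable[Tuple[int, int]]) -> float:
    """Pool (errors, fields_inspected) pairs by summing the counts.

    Assumption: simple count-weighted pooling, chosen for illustration;
    the source abstract does not describe its pooling method.
    """
    studies = list(studies)
    total_errors = sum(e for e, _ in studies)
    total_fields = sum(n for _, n in studies)
    return errors_per_10k(total_errors, total_fields)


if __name__ == "__main__":
    # Hypothetical audit counts, not taken from any reviewed paper.
    hypothetical_audits = [(7, 1_000), (3, 1_000)]
    for errors, fields in hypothetical_audits:
        print(errors_per_10k(errors, fields))
    print(pooled_errors_per_10k(hypothetical_audits))
```

Summing counts before dividing gives larger audits more weight than averaging the per-study rates would, which is one reason the choice of pooling method matters when rates vary as widely as reported here.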
Abstract:
Increasing amounts of clinical research data are collected by manual data entry into electronic source systems and directly from research subjects. For such manually entered source data, common methods of data cleaning, such as post-entry identification and resolution of discrepancies and double data entry, are not feasible. However, data error rates achieved without these mechanisms may be higher than desired for a particular research use. We evaluated a heuristic usability method for utility as a tool to independently and prospectively identify data collection form questions associated with data errors. The method evaluated had a promising sensitivity of 64% and a specificity of 67%. The method was used as described in the literature for usability, with no further adaptation or specialization for predicting data errors. We conclude that usability evaluation methodology should be further investigated for use in data quality assurance.
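For readers unfamiliar with the two metrics quoted above, the following Python sketch shows how sensitivity and specificity would be computed if each form question is classified by whether the usability method flagged it and whether it was actually associated with data errors. The function names and the counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of error-associated questions that the method flagged."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no error-associated questions to evaluate")
    return true_pos / total


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of error-free questions that the method left unflagged."""
    total = true_neg + false_pos
    if total == 0:
        raise ValueError("no error-free questions to evaluate")
    return true_neg / total


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not from the study).
    flagged_with_errors = 16      # true positives
    unflagged_with_errors = 9     # false negatives
    unflagged_without_errors = 50  # true negatives
    flagged_without_errors = 25   # false positives

    print(f"sensitivity = {sensitivity(flagged_with_errors, unflagged_with_errors):.2f}")
    print(f"specificity = {specificity(unflagged_without_errors, flagged_without_errors):.2f}")
```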
Abstract:
Context: Black women are reported to have a higher prevalence of uterine fibroids, and a threefold higher incidence rate and relative risk for clinical uterine fibroid development, compared with women of other races. Uterine fibroid research has reported that black women experience greater uterine fibroid morbidity and a disproportionate uterine fibroid disease burden. With increased interest in understanding uterine fibroid development, and with race being a critical component of uterine fibroid assessment, it is imperative that the methods used to determine the race of research participants are defined and that the operational definition of race as a variable is reported, both for methodological guidance and to enable the research community to compare statistical data and replicate studies.
Objectives: To systematically review and evaluate the methods used to assess race and racial disparities in uterine fibroid research.
Data Sources: Databases searched for this review include OVID Medline, NLM PubMed, EBSCOhost Cumulative Index to Nursing and Allied Health Plus with Full Text, and Elsevier Scopus.
Review Methods: Articles published in English were retrieved from data sources between January 2011 and March 2011. Broad search terms, uterine fibroids and race, were employed to retrieve a comprehensive list of citations for review screening. The initial database search yielded 947 articles; after duplicates were removed, 485 articles remained. In addition, 771 bibliographic citations were reviewed to identify additional articles not found through the primary database search, of which 17 new articles were included. In the first screening, 502 titles and abstracts were screened against eligibility questions to identify citations for exclusion and to retrieve full-text articles for review. In the second screening, 197 full-text articles were screened against eligibility questions to determine whether they met the full inclusion/exclusion criteria.
Results: 100 articles met inclusion criteria and were used in the results of this systematic review. The evidence suggested that black women have a higher prevalence of uterine fibroids than white women. None of the 14 studies reporting data on prevalence reported an operational definition or conceptual framework for the use of race. Only three studies reported on the prevalence of risk factors among racial subgroups. Two of these reported a lower prevalence of risk factors for black women than for other races, which was contrary to hypothesis, and none of the three reported a conceptual framework for the use of race.
Conclusion: Of the 100 uterine fibroid studies included in this review, over half (66%) reported a specific objective to assess and recruit study participants based on their race and/or ethnicity, but most (51%) failed to report a method of determining the actual race of the participants, and far fewer (4%, only four South American studies) reported a conceptual framework and/or operational definition of race as a variable. However, most studies (95%) reported race-based health outcomes. The inadequate methodological guidance on the use of race in uterine fibroid studies that purport to assess race and racial disparities may be a primary reason that uterine fibroid research continues to report racial disparities but fails to explain the high prevalence and increased exposures among African-American women. A standardized method of assessing race throughout uterine fibroid research would appear to be helpful in elucidating what race is actually measuring, and the risk of exposures for that measurement.
Abstract:
Data collected under federally funded research is subject to compliance rules and regulations. Policies affecting what you can and cannot do with your data, who is responsible, and what role your institution plays can vary with funding agencies and the type of data collected. This talk will address many of the compliance issues associated with research data, as well as funder mandates that you need to be aware of to ensure compliance.
Abstract:
Geneva Henry, Executive Director of the Center for Digital Scholarship, Rice University. Data rights and ownership of digital research data can impact how you use data, how others use data you've collected, and how rights are determined in collaborative research. Copyright rules governing data vary from one country to the next, making data ownership in international collaborations particularly murky. Licensing the use of data sets from the start is one way to address these issues early and provide a means for easily sharing datasets that can be cited and properly attributed. This talk will introduce issues associated with digital research data governance and how to protect your rights to the data you work with.
Abstract:
It is well accepted that tumorigenesis is a multi-step process involving aberrant functioning of genes regulating cell proliferation, differentiation, apoptosis, genome stability, angiogenesis and motility. To obtain a full understanding of tumorigenesis, it is necessary to collect information on all aspects of cell activity. Recent advances in high-throughput technologies allow biologists to generate massive amounts of data, more than might have been imagined decades ago. These advances have made it possible to launch comprehensive projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), which systematically characterize the molecular fingerprints of cancer cells using gene expression, methylation, copy number, microRNA and SNP microarrays as well as next-generation sequencing assays interrogating somatic mutations, insertions, deletions, translocations and structural rearrangements. Given the massive amount of data, a major challenge is to integrate information from multiple sources and formulate testable hypotheses. This thesis focuses on developing methodologies for integrative analyses of genomic assays profiled on the same set of samples. We have developed several novel methods for integrative biomarker identification and cancer classification. We introduce a regression-based approach to identify biomarkers predictive of therapy response or survival by integrating multiple assays, including gene expression, methylation and copy number data, through penalized regression. To identify key cancer-specific genes accounting for multiple mechanisms of regulation, we have developed the integIRTy software, which provides robust and reliable inferences about gene alteration by automatically adjusting for sample heterogeneity as well as technical artifacts using Item Response Theory.
To cope with the increasing need for accurate cancer diagnosis and individualized therapy, we have developed a robust and powerful algorithm called SIBER to systematically identify bimodally expressed genes using next generation RNAseq data. We have shown that prediction models built from these bimodal genes have the same accuracy as models built from all genes. Further, prediction models with dichotomized gene expression measurements based on their bimodal shapes still perform well. The effectiveness of outcome prediction using discretized signals paves the road for more accurate and interpretable cancer classification by integrating signals from multiple sources.
Abstract:
Brain tumors are among the most aggressive types of cancer in humans, with an estimated median survival time of 12 months and only 4% of patients surviving more than 5 years after diagnosis. Until recently, brain tumor prognosis has been based only on clinical information such as tumor grade and patient age, but there are reports indicating that molecular profiling of gliomas can reveal subgroups of patients with distinct survival rates. We hypothesize that coupling molecular profiling of brain tumors with clinical information might improve predictions of patient survival time and, consequently, better guide future treatment decisions. To evaluate this hypothesis, the general goal of this research is to build models for survival prediction of glioma patients using DNA molecular profiles (U133 Affymetrix gene expression microarrays) along with clinical information. First, a predictive Random Forest model is built for binary outcomes (i.e. short vs. long-term survival), and a small subset of genes whose expression values can be used to predict survival time is selected. Next, a new statistical methodology is developed for predicting time-to-death outcomes using Bayesian ensemble trees. Because of the large heterogeneity observed within the prognostic classes obtained by the Random Forest model, prediction can be improved by relating time-to-death to the gene expression profile directly. We propose a Bayesian ensemble model for survival prediction that is appropriate for high-dimensional data such as gene expression data. Our approach is based on the ensemble "sum-of-trees" model, which is flexible enough to incorporate additive and interaction effects between genes. We specify a fully Bayesian hierarchical approach and illustrate our methodology for the CPH, Weibull, and AFT survival models. We overcome the lack of conjugacy by using a latent variable formulation to model the covariate effects, which decreases computation time for model fitting.
Also, our proposed models provide a model-free way to select important predictive prognostic markers based on controlling false discovery rates. We compare the performance of our methods with baseline reference survival methods and apply our methodology to an unpublished data set of brain tumor survival times and gene expression data, selecting genes potentially related to the development of the disease under study. A closing discussion compares results obtained by the Random Forest and Bayesian ensemble methods from biological and clinical perspectives and highlights the statistical advantages and disadvantages of the new methodology in the context of DNA microarray data analysis.