21 results for Data-driven Methods
in DigitalCommons@The Texas Medical Center
Abstract:
The overarching goal of the Pathway Semantics Algorithm (PSA) is to improve the in silico identification of clinically useful hypotheses about molecular patterns in disease progression. By framing biomedical questions within a variety of matrix representations, PSA has the flexibility to analyze combined quantitative and qualitative data over a wide range of stratifications. The resulting hypothetical answers can then move to in vitro and in vivo verification, research assay optimization, clinical validation, and commercialization. Herein PSA is shown to generate novel hypotheses about the significant biological pathways in two disease domains, shock/trauma and hemophilia A, and is validated experimentally in the latter. The PSA matrix algebra approach identified differential molecular patterns in biological networks over time and outcome that would not be easily found through direct assays or literature and database searches. In this dissertation, Chapter 1 provides a broad overview of the background and motivation for the study, followed by Chapter 2 with a literature review of relevant computational methods. Chapters 3 and 4 describe PSA for node and edge analysis, respectively, and apply the method to disease progression in shock/trauma. Chapter 5 demonstrates the application of PSA to hemophilia A and its validation with experimental results. The work is summarized in Chapter 6, followed by extensive references and an Appendix with additional material.
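The abstract describes the matrix framing only at a high level; the sketch below is a hypothetical illustration, not the dissertation's actual algorithm, of how node measurements might be arranged as node-by-timepoint matrices stratified by outcome so that a simple matrix operation exposes differential patterns over time and outcome. All node names and values are invented for illustration.

```python
import numpy as np

# Hypothetical illustration only: rows are molecular nodes, columns are time points.
# One matrix per outcome stratum (e.g., survivors vs. non-survivors).
nodes = ["IL-6", "TNF-alpha", "Factor VIII"]
timepoints = ["0h", "24h", "72h"]

survivors = np.array([
    [1.0, 1.4, 1.1],   # IL-6
    [0.9, 1.2, 1.0],   # TNF-alpha
    [1.0, 1.0, 1.0],   # Factor VIII
])
non_survivors = np.array([
    [1.1, 2.8, 3.9],
    [1.0, 2.1, 2.6],
    [1.0, 0.7, 0.4],
])

# A simple matrix-algebra step: the element-wise difference across outcome strata
# highlights nodes whose trajectories diverge over time.
delta = non_survivors - survivors
for node, row in zip(nodes, delta):
    print(f"{node}: divergence by timepoint {dict(zip(timepoints, row.round(2)))}")
```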
Abstract:
Complex diseases such as cancer result from multiple genetic changes and environmental exposures. Due to the rapid development of genotyping and sequencing technologies, we are now able to more accurately assess the causal effects of many genetic and environmental factors. Genome-wide association studies have been able to localize many causal genetic variants predisposing to certain diseases. However, these studies explain only a small portion of the variation in the heritability of diseases. More advanced statistical models are urgently needed to identify and characterize additional genetic and environmental factors and their interactions, which will enable us to better understand the causes of complex diseases. In the past decade, thanks to increasing computational capabilities and novel statistical developments, Bayesian methods have been widely applied in genetics/genomics research and have demonstrated superiority over some standard approaches in certain research areas. Gene-environment and gene-gene interaction studies are among the areas where Bayesian methods can fully exert their strengths and advantages. This dissertation focuses on developing new Bayesian statistical methods for data analysis with complex gene-environment and gene-gene interactions, as well as extending some existing methods for gene-environment interactions to other related areas. It includes three sections: (1) deriving a Bayesian variable selection framework for hierarchical gene-environment and gene-gene interactions; (2) developing Bayesian Natural and Orthogonal Interaction (NOIA) models for gene-environment interactions; and (3) extending the applications of two Bayesian statistical methods, originally developed for gene-environment interaction studies, to other related types of studies such as adaptive borrowing of historical data. We propose a Bayesian hierarchical mixture model framework that allows us to investigate genetic and environmental effects, gene-gene interactions (epistasis), and gene-environment interactions in the same model. It is well known that, in many practical situations, there exists a natural hierarchical structure between the main effects and interactions in a linear model. Here we propose a model that incorporates this hierarchical structure into the Bayesian mixture model, so that irrelevant interaction effects can be removed more efficiently, resulting in more robust, parsimonious, and powerful models. We evaluate both the 'strong hierarchical' and 'weak hierarchical' models, which specify that both or at least one of the main effects of the interacting factors, respectively, must be present for the interaction to be included in the model. Extensive simulation results show that the proposed strong and weak hierarchical mixture models control the proportion of false positive discoveries and yield a powerful approach for identifying the predisposing main effects and interactions in studies with complex gene-environment and gene-gene interactions. We also compare these two models with the 'independent' model, which does not impose the hierarchical constraint, and observe their superior performance in most of the situations considered. The proposed models are applied to real data analyses of gene-environment interactions in lung cancer and cutaneous melanoma case-control studies. Bayesian statistical models have the advantage of being able to incorporate useful prior information into the modeling process.
Moreover, the Bayesian mixture model outperforms the multivariate logistic model in parameter estimation and variable selection in most cases. Our proposed models impose hierarchical constraints that further improve the Bayesian mixture model by reducing the proportion of false positive findings among the identified interactions while successfully identifying the reported associations. This is practically appealing for studies investigating causal factors among a moderate number of candidate genetic and environmental factors along with a relatively large number of interactions. The Natural and Orthogonal Interaction (NOIA) models of genetic effects were previously developed to provide an analysis framework in which the estimates of effects for a quantitative trait are statistically orthogonal regardless of whether Hardy-Weinberg Equilibrium (HWE) holds within loci. Ma et al. (2012) recently developed a NOIA model for gene-environment interaction studies and showed the advantages of using this model, compared with the usual functional model, for detecting the true main effects and interactions. In this project, we propose a novel Bayesian statistical model that combines the Bayesian hierarchical mixture model with the NOIA statistical model and the usual functional model. The proposed Bayesian NOIA model demonstrates greater power at detecting non-null effects, with higher marginal posterior probabilities. We also review two Bayesian statistical models (a Bayesian empirical shrinkage-type estimator and Bayesian model averaging) that were developed for gene-environment interaction studies. Inspired by these Bayesian models, we develop two novel statistical methods that can handle related problems such as borrowing data from historical studies. The proposed methods are analogous to the gene-environment interaction methods in that they balance statistical efficiency and bias within a unified model. Through extensive simulation studies, we compare the operating characteristics of the proposed models with those of existing models, including the hierarchical meta-analysis model. The results show that the proposed approaches adaptively borrow the historical data in a data-driven way. These novel models may have a broad range of statistical applications in both genetic/genomic and clinical studies.
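As a compact illustration of the strong and weak hierarchy described above, the inclusion indicator for an interaction can be tied to the indicators of its parent main effects. The notation below is a minimal sketch and not the dissertation's exact prior specification.

```latex
% \gamma_j and \gamma_k are inclusion indicators for the main effects of
% factors j and k; \gamma_{jk} is the indicator for their interaction and
% \pi is a prior inclusion probability (illustrative notation only).
\begin{align*}
\text{strong hierarchy:}\quad
  & P(\gamma_{jk}=1 \mid \gamma_j,\gamma_k) = \pi\,\gamma_j\gamma_k \\
\text{weak hierarchy:}\quad
  & P(\gamma_{jk}=1 \mid \gamma_j,\gamma_k)
      = \pi\,\bigl[1-(1-\gamma_j)(1-\gamma_k)\bigr] \\
\text{independent model:}\quad
  & P(\gamma_{jk}=1) = \pi
\end{align*}
```

Under the strong constraint the interaction can enter only if both parent main effects are in the model; under the weak constraint, only if at least one is; the independent model places no such restriction.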
Abstract:
Increasing amounts of clinical research data are collected by manual data entry into electronic source systems and directly from research subjects. For such manually entered source data, common data cleaning methods such as post-entry identification and resolution of discrepancies and double data entry are not feasible. However, the data error rates achieved without these mechanisms may be higher than desired for a particular research use. We evaluated a heuristic usability method as a tool to independently and prospectively identify data collection form questions associated with data errors. The evaluated method had a promising sensitivity of 64% and a specificity of 67%. The method was used as described in the usability literature, with no further adaptation or specialization for predicting data errors. We conclude that usability evaluation methodology should be further investigated for use in data quality assurance.
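For reference, the sensitivity and specificity quoted above follow the standard definitions; the underlying counts are not given in the abstract, so the symbols below are generic (TP = error-associated questions flagged by the method, FN = error-associated questions missed, TN = error-free questions not flagged, FP = error-free questions flagged).

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN} \approx 0.64,
\qquad
\text{specificity} = \frac{TN}{TN + FP} \approx 0.67
\]
```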
Abstract:
Problems due to the lack of data standardization and data management have led to work inefficiencies for the staff working with vision data for the Lifetime Surveillance of Astronaut Health. Data have been collected over 50 years in a variety of ways and then entered into software. The lack of communication between the electronic health record (EHR) form designer, epidemiologists, and optometrists has led to some confusion about the capabilities of the EHR system and how its forms can be designed to fit the needs of all relevant parties. EHR form customizations or form redesigns were found to be critical for using NASA's EHR system in the way most beneficial to its patients, optometrists, and epidemiologists. In order to implement a protocol, the data being collected were examined to find the differences in data collection methods. Changes were implemented through the establishment of a process improvement team (PIT). Based on the findings of the PIT, suggestions have been made to improve the current EHR system. If the suggestions are implemented correctly, this will not only improve the efficiency of the staff at NASA and its contractors but also set guidelines for changes in other forms such as the vision exam forms. Because NASA is at the forefront of such research and health surveillance, this management change could drastically improve the collection and adaptability of the EHR. Accurate data collection from this 50+ year study is ongoing and will help current and future generations understand the implications of space flight on human health. It is imperative that this vast amount of information be documented correctly.
Abstract:
Clinical Research Data Quality Literature Review and Pooled Analysis
We present a literature review and secondary analysis of data accuracy in clinical research and related secondary data uses. A total of 93 papers meeting our inclusion criteria were categorized according to the data processing methods. Quantitative data accuracy information was abstracted from the articles and pooled. Our analysis demonstrates that the accuracy associated with data processing methods varies widely, with error rates ranging from 2 errors per 10,000 fields to 5,019 errors per 10,000 fields. Medical record abstraction was associated with the highest error rates (70–5,019 errors per 10,000 fields). Data entered and processed at healthcare facilities had error rates comparable to those of data processed at central data processing centers. Error rates for data processed with single entry in the presence of on-screen checks were comparable to those for double-entered data. While data processing and cleaning methods may explain a significant amount of the variability in data accuracy, additional factors not resolvable here likely exist.
Defining Data Quality for Clinical Research: A Concept Analysis
Despite notable previous attempts by experts to define data quality, the concept remains ambiguous and subject to the vagaries of natural language. This current lack of clarity continues to hamper research related to data quality issues. We present a formal concept analysis of data quality, which builds on and synthesizes previously published work. We further posit that discipline-level specificity may be required to achieve the desired definitional clarity. To this end, we combine work from the clinical research domain with findings from the general data quality literature to produce a discipline-specific definition and operationalization for data quality in clinical research. While the results are helpful to clinical research, the methodology of concept analysis may be useful in other fields to clarify data quality attributes and to achieve operational definitions.
Medical Record Abstractor's Perceptions of Factors Impacting the Accuracy of Abstracted Data
Medical record abstraction (MRA) is known to be a significant source of data errors in secondary data uses. Factors impacting the accuracy of abstracted data are not reported consistently in the literature. Two Delphi processes were conducted with experienced medical record abstractors to assess abstractors' perceptions of these factors. The Delphi process identified 9 factors that were not found in the literature and differed from the literature on 5 factors in the top 25%. The Delphi results refuted 7 factors reported in the literature as impacting the quality of abstracted data. The results provide insight into, and indicate content validity of, a significant number of the factors reported in the literature. Further, the results indicate general consistency between the perceptions of clinical research medical record abstractors and those of registry and quality improvement abstractors.
Distributed Cognition Artifacts on Clinical Research Data Collection Forms
Medical record abstraction, a primary mode of data collection in secondary data use, is associated with high error rates. Distributed cognition in medical record abstraction has not been studied as a possible explanation for abstraction errors. We employed the theory of distributed representation and representational analysis to systematically evaluate cognitive demands in medical record abstraction and the extent of external cognitive support employed in a sample of clinical research data collection forms. We show that the cognitive load required for abstraction was high for 61% of the sampled data elements, and exceedingly high for 9%. Further, the data collection forms did not support external cognition for the most complex data elements. High working memory demands are a possible explanation for the association of data errors with data elements requiring abstractor interpretation, comparison, mapping, or calculation. The representational analysis used here can be used to identify data elements with high cognitive demands.
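For context, the pooled rates above normalize each study's raw error counts to a common denominator; the conversion is simply the following (the example counts are hypothetical, not taken from the reviewed papers).

```latex
\[
\text{errors per 10{,}000 fields}
  = \frac{\text{errors found}}{\text{fields inspected}} \times 10{,}000,
\qquad
\text{e.g. } \frac{35}{7{,}000} \times 10{,}000 = 50.
\]
```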
Abstract:
Accurate quantitative estimation of exposure using retrospective data has been one of the most challenging tasks in the exposure assessment field. To improve these estimates, some models have been developed using published exposure databases with their corresponding exposure determinants. These models are designed to be applied to reported exposure determinants obtained from study subjects or to exposure levels assigned by an industrial hygienist, so that quantitative exposure estimates can be obtained. In an effort to improve the prediction accuracy and generalizability of these models, and considering that the limitations encountered in previous studies might be due to limitations in the applicability of traditional statistical methods and concepts, the use of computer science-derived data analysis methods, predominantly machine learning approaches, was proposed and explored in this study. The goal of this study was to develop a set of models using decision tree/ensemble and neural network methods to predict occupational outcomes based on literature-derived databases, and to compare, using cross-validation and data splitting techniques, the resulting prediction capacity to that of traditional regression models. Two cases were addressed: the categorical case, where the exposure level was measured as an exposure rating following the American Industrial Hygiene Association guidelines, and the continuous case, where the result of the exposure is expressed as a concentration value. Previously developed literature-based exposure databases for 1,1,1-trichloroethane, methylene dichloride, and trichloroethylene were used. When compared to regression estimations, results showed better accuracy of decision tree/ensemble techniques for the categorical case, while neural networks were better for estimation of continuous exposure values. Overrepresentation of classes and overfitting were the main causes of poor neural network performance and accuracy. Estimations based on literature-based databases using machine learning techniques might provide an advantage when they are applied to other methodologies that combine 'expert inputs' with current exposure measurements, such as the Bayesian Decision Analysis tool. The use of machine learning techniques to more accurately estimate exposures from literature-based exposure databases might represent a starting point toward independence from expert judgment.
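A minimal sketch of the kind of comparison described above, assuming a hypothetical literature-derived exposure table with numeric determinant columns; the file name, column names, and model settings are assumptions, not the dissertation's actual pipeline.

```python
# Illustrative comparison of machine learning vs. regression with cross-validation.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neural_network import MLPRegressor

df = pd.read_csv("exposure_db.csv")                      # hypothetical file
X = df.drop(columns=["aiha_rating", "concentration"])    # numeric exposure determinants

# Categorical case: AIHA exposure rating.
y_cat = df["aiha_rating"]
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(n_estimators=200))]:
    acc = cross_val_score(model, X, y_cat, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {acc:.2f}")

# Continuous case: measured concentration.
y_cont = df["concentration"]
for name, model in [("linear regression", LinearRegression()),
                    ("neural network", MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000))]:
    r2 = cross_val_score(model, X, y_cont, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.2f}")
```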
Abstract:
Similar to other health care processes, referrals are susceptible to breakdowns. Breakdowns in the referral process can lead to poor continuity of care, slow diagnostic processes, delays and repetition of tests, patient and provider dissatisfaction, and a loss of confidence in providers. These facts, and the necessity for a deeper understanding of referrals in healthcare, served as the motivation to conduct a comprehensive study of referrals. The research began with the real problem and the need to understand referral communication as a means to improve patient care. Despite previous efforts to explain referrals and the dynamics and interrelations of the variables that influence them, there is no common, contemporary, and accepted definition of what a referral is in the health care context. The research agenda was guided by the need to explore referrals as an abstract concept by (1) developing a conceptual definition of referrals, (2) developing a model of referrals, and finally (3) proposing a comprehensive research framework. This dissertation has resulted in a standard conceptual definition of referrals and a model of referrals. In addition, a mixed-method framework to evaluate referrals was proposed, and finally a data-driven model was developed to predict whether a referral would be approved or denied by a specialty service. The three manuscripts included in this dissertation present the basis for studying and assessing referrals using a common framework that should allow an easier comparative research agenda to improve referrals, taking into account the context where referrals occur.
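A minimal sketch of the kind of data-driven referral-disposition model mentioned above, assuming a hypothetical referral data set; the file, columns, and classifier choice are assumptions rather than the dissertation's actual model.

```python
# Illustrative classifier for predicting whether a referral is approved or denied.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

referrals = pd.read_csv("referrals.csv")                     # hypothetical file
X = pd.get_dummies(referrals[["specialty", "urgency", "has_prior_workup"]])
y = referrals["approved"]                                    # 1 = approved, 0 = denied

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```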
Abstract:
This study applies the multilevel analysis technique to longitudinal data from a large clinical trial. The technique accounts for the correlation at different levels when modeling repeated blood pressure measurements taken throughout the trial. This modeling allows for closer inspection of the remaining correlation and non-homogeneity of variance in the data. Three methods of modeling the correlation were compared.
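A minimal sketch of one such multilevel specification, assuming a hypothetical long-format trial data set with one row per blood pressure measurement; the variable names and random effects structure are assumptions, not the study's actual model.

```python
# Illustrative linear mixed model for repeated blood pressure measurements.
import pandas as pd
import statsmodels.formula.api as smf

bp = pd.read_csv("trial_bp_long.csv")   # hypothetical: columns id, visit, treatment, sbp

# Random intercept and slope per patient account for the within-patient
# correlation of repeated systolic blood pressure measurements across visits.
model = smf.mixedlm("sbp ~ visit + treatment", data=bp,
                    groups=bp["id"], re_formula="~visit")
result = model.fit()
print(result.summary())
```

Alternative correlation structures (for example, a random intercept only, or different residual covariance assumptions) can be compared by refitting with different `re_formula` choices and inspecting the estimated variance components.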
Abstract:
Social capital, a relatively new public health concept, represents the intangible resources embedded in social relationships that facilitate collective action. Current interest in the concept stems from empirical studies linking social capital with health outcomes. However, in order for social capital to function as a meaningful research variable, conceptual development aimed at refining the domains, attributes, and boundaries of the concept is needed. An existing framework of social capital (Uphoff, 2000), developed from studies in India, was selected for its congruence with the inductive analysis of pilot data from a community that was unsuccessful at mobilizing collective action. This framework provided the underpinnings for a formal ethnographic research study designed to examine the components of social capital in a community that had successfully mobilized collective action. The specific aim of the ethnographic study was to examine the fittingness of Uphoff's framework in the contrasting American community. A contrasting context was purposefully selected to distinguish essential attributes of social capital from those that were specific to one community. Ethnographic data collection methods included participant observation, formal interviews, and public documents. Data were originally analyzed according to codes developed from Uphoff's theoretical framework. The results from this analysis were only partially satisfactory, indicating that the theoretical framework required refinement. The refinement of the coding system resulted in the emergence of an explanatory theory of social capital that was tested with the data collected from formal fieldwork. Although Uphoff's framework was useful, its refinement revealed (1) trust as the dominant attribute of social capital, (2) efficacy of mutually beneficial collective action as the outcome indicator, (3) cognitive and structural domains more appropriately defined as the cultural norms of the community and group, and (4) a definition of social capital as the combination of the cognitive norms of the community and the structural norms of the group that are either constructive or destructive to the development of trust and the efficacy of mutually beneficial collective action. This explanatory framework holds increased pragmatic utility for public health practice and research.
Abstract:
Case-control and retrospective studies have identified parental substance abuse as a risk factor for physical child abuse and neglect (Dore, Doris, & Wright, 1995, May; S. R. Dube et al., 2001; Guterman & Lee, 2005, May; Walsh, MacMillan, & Jamieson, 2003). The purpose of this paper is to present the findings of a systematic review of prospective studies from 1975 through 2005 that include parental substance abuse as a risk factor for physical child abuse or neglect. Characteristics of each study, such as the research question, sample information, data collection methods, and results, including the parent assessed and the definitions of substance abuse and physical child abuse and neglect, are discussed. Five studies were identified that met the search criteria. Four of the five studies found that parental substance abuse was a significant variable in predicting physical child abuse and neglect.
Abstract:
With the recognition of the importance of evidence-based medicine, there is an emerging need for methods to systematically synthesize available data. Specifically, methods that provide accurate estimates of test characteristics for diagnostic tests are needed to help physicians make better clinical decisions. To provide more flexible approaches for meta-analysis of diagnostic tests, we developed three Bayesian generalized linear models. Two of these models, a bivariate normal model and a binomial model, analyzed pairs of sensitivity and specificity values while incorporating the correlation between these two outcome variables. Noninformative independent uniform priors were used for the variances of sensitivity and specificity and for the correlation. We also applied an inverse Wishart prior to check the sensitivity of the results. The third model was a multinomial model in which the test results were modeled as multinomial random variables. All three models can include specific imaging techniques as covariates in order to compare performance. Vague normal priors were assigned to the coefficients of the covariates. The computations were carried out using the 'Bayesian inference using Gibbs sampling' implementation of Markov chain Monte Carlo techniques. We investigated the properties of the three proposed models through extensive simulation studies. We also applied these models to a previously published meta-analysis dataset on cervical cancer as well as to an unpublished melanoma dataset. In general, our findings show that the point estimates of sensitivity and specificity were consistent between the Bayesian and frequentist bivariate normal and binomial models. However, in the simulation studies, the estimates of the correlation coefficient from the Bayesian bivariate models were not as good as those obtained from frequentist estimation, regardless of which prior distribution was used for the covariance matrix. The Bayesian multinomial model consistently underestimated sensitivity and specificity regardless of the sample size and correlation coefficient. In conclusion, the Bayesian bivariate binomial model provides the most flexible framework for future applications because of the following strengths: (1) it facilitates direct comparison between different tests; (2) it captures the variability in both sensitivity and specificity simultaneously, as well as the intercorrelation between the two; and (3) it can be applied directly to sparse data without ad hoc correction.
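For orientation, one common parameterization of a bivariate binomial model for diagnostic meta-analysis is shown below; it is illustrative, and the dissertation's priors and covariate structure may differ.

```latex
% Study i has n_i^D diseased and n_i^N non-diseased subjects; y_i^{se} is the
% number of true positives and y_i^{sp} the number of true negatives. x_i can
% hold study-level covariates such as the imaging technique, and \rho captures
% the correlation between sensitivity and specificity across studies.
\begin{align*}
y_i^{se} &\sim \mathrm{Binomial}\!\bigl(n_i^{D},\, p_i^{se}\bigr), \qquad
y_i^{sp} \sim \mathrm{Binomial}\!\bigl(n_i^{N},\, p_i^{sp}\bigr), \\
\begin{pmatrix} \operatorname{logit} p_i^{se} \\ \operatorname{logit} p_i^{sp} \end{pmatrix}
 &\sim \mathrm{MVN}\!\left(
   \begin{pmatrix} \mu_{se} + x_i^{\top}\beta_{se} \\ \mu_{sp} + x_i^{\top}\beta_{sp} \end{pmatrix},
   \begin{pmatrix} \sigma_{se}^{2} & \rho\,\sigma_{se}\sigma_{sp} \\
                   \rho\,\sigma_{se}\sigma_{sp} & \sigma_{sp}^{2} \end{pmatrix}
 \right)
\end{align*}
```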
Abstract:
A crucial link in preserving and protecting the future of our communities resides in maintaining the health and well-being of our youth. While every member of the community holds an opinion regarding where monies for prevention and intervention are best utilized, the data to support such opinions are often scarce. In an effort to generate data-driven indices for community planning and action, the United Way of Comal County, Texas partnered with the University of Texas-Houston Health Science Center School of Public Health to accomplish a county-specific needs assessment. A community-based participatory research approach utilizing the Mobilizing for Action through Planning and Partnerships (MAPP) format developed by the National Association of County and City Health Officials (NACCHO) was implemented to engage community members in identifying and addressing community priorities. The single greatest area of consensus and concern identified by community members was the health and well-being of the youth population. Thus, a youth survey targeting these specific areas of community concern was designed, coordinated, and administered to all 9th-11th grade students in the county. Twenty percent of the 3,698 completed surveys (72% response rate) were randomly selected for analysis. These 740 surveys were coded and scanned into an electronic survey database. Statistical analysis provided youth-reported data on the status of the multiple issues affecting the health and well-being of the community's youth. These data will be reported back to the community stakeholders, as part of the larger Comal County Needs Assessment, for the purposes of community planning and action. Survey data will provide community planners with an awareness of the high-risk behaviors and habit patterns amongst their youth. This knowledge will permit more effective targeting of the means for encouraging healthy behaviors and preventing the spread of disease. Further, the community-oriented, population-based nature of this effort will provide answers to questions raised by the community and an effective launching pad for the development and implementation of targeted, preventive health strategies.
Abstract:
Patients who had started HAART (highly active antiretroviral treatment) under the previous aggressive DHHS guidelines (1997) underwent life-long continuous HAART that was associated with many short-term as well as long-term complications. Many interventions attempted to reduce those complications, including intermittent treatment, also called pulse therapy. Many studies have examined the determinants of the rate of fall in CD4 count after interruption, as these data would help guide treatment interruptions. The data set used here was part of a cohort study underway at the Johns Hopkins AIDS service since January 1984, in which data were collected both prospectively and retrospectively. This data set consisted of 47 patients receiving pulse therapy with the aim of reducing long-term complications. The aim of this project was to study the impact of virologic and immunologic factors on the rate of CD4 loss after treatment interruption. The exposure variables under investigation included CD4 cell count and viral load at treatment initiation. The rate of change of CD4 cell count after treatment interruption was estimated from observed data using advanced longitudinal data analysis methods (i.e., a linear mixed model). Random effects accounted for repeated measures of CD4 within each person after treatment interruption. The regression coefficient estimates from the model were then used to produce subject-specific rates of CD4 change accounting for group trends in change. The exposure variables of interest were age, race, gender, and CD4 cell counts and HIV RNA levels at HAART initiation. The rate of fall of CD4 count did not depend on CD4 cell count or viral load at initiation of treatment; thus, these factors may not be useful for determining who has a chance of successful treatment interruption. CD4 count and viral load were also examined with t-tests and ANOVA after grouping based on medians and quartiles to detect any differences in the mean rate of CD4 fall after interruption. There was no significant difference between the groups, suggesting no association between the rate of fall of CD4 after treatment interruption and the above-mentioned exposure variables.
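A minimal sketch of a random-intercept and random-slope model of this kind, assuming a hypothetical long-format data set of post-interruption CD4 counts; the file and variable names are assumptions, not the study's actual code.

```python
# Illustrative linear mixed model yielding subject-specific CD4 slopes.
import pandas as pd
import statsmodels.formula.api as smf

cd4 = pd.read_csv("cd4_post_interruption.csv")  # hypothetical: columns id, years_since_stop, cd4

# Random intercept and slope per patient: the random slope lets each subject's
# rate of CD4 change after treatment interruption deviate from the group trend.
m = smf.mixedlm("cd4 ~ years_since_stop", data=cd4,
                groups=cd4["id"], re_formula="~years_since_stop").fit()

# Subject-specific slope = population slope + that subject's random slope.
pop_slope = m.fe_params["years_since_stop"]
subject_slopes = {pid: pop_slope + re["years_since_stop"]
                  for pid, re in m.random_effects.items()}
print(m.summary())
```

These subject-specific slopes could then be related to baseline CD4 count, viral load, and demographics, as described above.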
Abstract:
Medication errors, one of the most frequent types of medical errors, are a common cause of patient harm in hospital systems today. Nurses at the bedside are in a position to encounter many of these errors, since they are present at the start of the process (ordering/prescribing) and at the end of the process (administration). One of the recommendations of the IOM (Institute of Medicine) report "To Err is Human" was for organizations to identify and learn from medical errors through event reporting systems. While many organizations have reporting systems in place, research studies report a significant amount of underreporting by nurses. A systematic review of the literature was performed to identify contributing factors related to the reporting and non-reporting of medication errors by nurses at the bedside. Articles included in the literature review were primary or secondary studies, dated January 1, 2000 through July 2009, related to nursing medication error reporting. All 634 articles were reviewed with an algorithm developed to standardize the review process and help filter out those that did not meet the study criteria. In addition, 142 article bibliographies were reviewed to find additional studies that were not found in the original literature search. After reviewing the 634 articles and the additional 108 articles discovered in the bibliography review, 41 articles met the study criteria and were used in the systematic literature review results. Fear of punitive reactions to medication errors was a frequent barrier to error reporting; nurses fear reactions from their leadership, peers, patients and their families, nursing boards, and the media. Anonymous reporting systems and departments/organizations with a strong safety culture in place helped to encourage the reporting of medication errors by nursing staff. Many of the studies included in this literature review do not yield results that can be generalized; the majority took place in single institutions/organizations with limited sample sizes. Stronger studies with larger sample sizes, utilizing validated data collection methods, need to be performed to determine stronger correlations between safety culture and nurse error reporting.
Abstract:
The objectives of this study were to identify and measure the average outcomes of the Open Door Mission's nine-month community-based substance abuse treatment program, identify predictors of successful outcomes, and make recommendations to the Open Door Mission for improving its treatment program. The Mission's program is exclusive to adult men who have limited financial resources, most of whom were homeless or dependent on parents or other family members for basic living needs. Many, but not all, of these men are either chemically dependent or have a history of substance abuse. This study tracked a cohort of the Mission's graduates over one year and identified various indicators of success at short-term intervals, which may be predictive of longer-term outcomes. We tracked various levels of 12-step program involvement, as well as other social and spiritual activities, such as church affiliation and recovery support. Twenty-four of the 66 subjects (36%) met the Mission's requirements for success. With respect to the specific success criteria: fifty-four (82%) reported affiliation with a home church; twenty-six (39%) reported full-time employment; sixty-one (92%) did not report, and were not identified as having, any post-treatment arrests or incarceration; and forty (61%) reported continuous abstinence from both drugs and alcohol. Five research-based hypotheses were developed and tested. The primary analysis tool was the web-based non-parametric dependency modeling tool B-Course, which revealed some strong associations among certain variables and helped the researchers generate and test several data-driven hypotheses. Full-time employment was the greatest predictor of abstinence: 95% of those who reported full-time employment also reported continuous post-treatment abstinence, while 50% of those working part-time were abstinent and 29% of those with no employment were abstinent. Working with a 12-step sponsor, attending aftercare, and service with others were also identified as predictors of abstinence. This study demonstrates that associations with abstinence and the ODM success criteria are not based on a single social or behavioral factor; rather, these relationships are interdependent and show that abstinence is achieved and maintained through a combination of several 12-step recovery activities. This study used a simple assessment methodology that demonstrated strong associations across variables and outcomes, which have practical applicability to the Open Door Mission for improving its treatment program. By leveraging the predictive capability of the various success determination methodologies discussed and developed throughout this study, we can identify accurate outcomes with both validity and reliability. This assessment instrument can also be used as an intervention that, if operationalized with the Mission's clients during the primary treatment program, may measurably improve the effectiveness and outcomes of the Open Door Mission.
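Conditional proportions like those quoted for employment and abstinence can be read off a simple cross-tabulation; the toy data frame below is purely illustrative and not the study's data.

```python
# Illustrative only: computing abstinence rates conditional on employment status.
import pandas as pd

outcomes = pd.DataFrame({
    "employment": ["full-time", "part-time", "none", "full-time", "none"],
    "abstinent":  [True, False, False, True, True],
})
rates = outcomes.groupby("employment")["abstinent"].mean()
print(rates)   # proportion abstinent within each employment category
```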