11 results for data storage concept
in DigitalCommons@The Texas Medical Center
Abstract:
A wealth of genetic associations for cardiovascular and metabolic phenotypes in humans has been accumulating over the last decade, in particular a large number of loci derived from recent genome wide association studies (GWAS). True complex disease-associated loci often exert modest effects, so their delineation currently requires integration of diverse phenotypic data from large studies to ensure robust meta-analyses. We have designed a gene-centric 50 K single nucleotide polymorphism (SNP) array to assess potentially relevant loci across a range of cardiovascular, metabolic and inflammatory syndromes. The array utilizes a "cosmopolitan" tagging approach to capture the genetic diversity across approximately 2,000 loci in populations represented in the HapMap and SeattleSNPs projects. The array content is informed by GWAS of vascular and inflammatory disease, expression quantitative trait loci implicated in atherosclerosis, pathway based approaches and comprehensive literature searching. The custom flexibility of the array platform facilitated interrogation of loci at differing stringencies, according to a gene prioritization strategy that allows saturation of high priority loci with a greater density of markers than the existing GWAS tools, particularly in African HapMap samples. We also demonstrate that the IBC array can be used to complement GWAS, increasing coverage in high priority CVD-related loci across all major HapMap populations. DNA from over 200,000 extensively phenotyped individuals will be genotyped with this array with a significant portion of the generated data being released into the academic domain facilitating in silico replication attempts, analyses of rare variants and cross-cohort meta-analyses in diverse populations. These datasets will also facilitate more robust secondary analyses, such as explorations with alternative genetic models, epistasis and gene-environment interactions.
Abstract:
The current state of health and biomedicine includes an enormity of heterogeneous data ‘silos’, collected for different purposes and represented differently, that are presently impossible to share or analyze in toto. The greatest challenge for large-scale and meaningful analyses of health-related data is to achieve a uniform data representation for data extracted from heterogeneous source representations. Based upon an analysis and categorization of heterogeneities, a process for achieving comparable data content by using a uniform terminological representation is developed. This process addresses the types of representational heterogeneities that commonly arise in healthcare data integration problems. Specifically, this process uses a reference terminology, and associated "maps" to transform heterogeneous data to a standard representation for comparability and secondary use. The capture of quality and precision of the “maps” between local terms and reference terminology concepts enhances the meaning of the aggregated data, empowering end users with better-informed queries for subsequent analyses. A data integration case study in the domain of pediatric asthma illustrates the development and use of a reference terminology for creating comparable data from heterogeneous source representations. The contribution of this research is a generalized process for the integration of data from heterogeneous source representations, and this process can be applied and extended to other problems where heterogeneous data needs to be merged.
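The mapping step described above can be sketched in a few lines. This is a minimal illustration, not the process from the study: the site names, local terms, reference concept identifier, and match-quality labels are all hypothetical, and a real reference terminology would be far larger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermMap:
    """One "map" from a local term to a reference terminology concept."""
    reference_concept: str  # hypothetical concept identifier
    match_quality: str      # hypothetical labels: "exact" or "approximate"


# Hypothetical maps, keyed by (source system, local term).
LOCAL_TO_REFERENCE = {
    ("site_a", "asthma attack"): TermMap("asthma_exacerbation", "exact"),
    ("site_b", "ASTHMA EXAC"): TermMap("asthma_exacerbation", "exact"),
    ("site_b", "breathing trouble"): TermMap("asthma_exacerbation", "approximate"),
}


def normalize(records):
    """Translate (source, local_term) pairs into the reference representation.

    Each output row keeps the original term and the quality of the map, so
    that later queries can include or exclude approximate matches.
    """
    rows = []
    for source, term in records:
        term_map = LOCAL_TO_REFERENCE.get((source, term))
        if term_map is None:
            rows.append((source, term, None, "unmapped"))
        else:
            rows.append((source, term, term_map.reference_concept, term_map.match_quality))
    return rows


def select(rows, concept, allowed_quality=("exact",)):
    """Return rows for one reference concept, filtered by map quality."""
    return [r for r in rows if r[2] == concept and r[3] in allowed_quality]
```

Keeping the map quality next to each transformed record is what lets an end user decide, at query time, whether approximate matches are acceptable for a given analysis.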
Abstract:
OBJECTIVE: To determine whether algorithms developed for the World Wide Web can be applied to the biomedical literature in order to identify articles that are important as well as relevant. DESIGN AND MEASUREMENTS: A direct comparison of eight algorithms: simple PubMed queries, clinical queries (sensitive and specific versions), vector cosine comparison, citation count, journal impact factor, PageRank, and machine learning based on polynomial support vector machines. The objective was to prioritize important articles, defined as being included in a pre-existing bibliography of important literature in surgical oncology. RESULTS: Citation-based algorithms were more effective than noncitation-based algorithms at identifying important articles. The most effective strategies were simple citation count and PageRank, which on average identified over six important articles in the first 100 results compared to 0.85 for the best noncitation-based algorithm (p < 0.001). The authors saw similar differences between citation-based and noncitation-based algorithms at 10, 20, 50, 200, 500, and 1,000 results (p < 0.001). Citation lag affects performance of PageRank more than simple citation count. However, in spite of citation lag, citation-based algorithms remain more effective than noncitation-based algorithms. CONCLUSION: Algorithms that have proved successful on the World Wide Web can be applied to biomedical information retrieval. Citation-based algorithms can help identify important articles within large sets of relevant results. Further studies are needed to determine whether citation-based algorithms can effectively meet actual user information needs.
Abstract:
Information overload is a significant problem for modern medicine. Searching MEDLINE for common topics often retrieves more relevant documents than users can review. Therefore, we must identify documents that are not only relevant, but also important. Our system ranks articles using citation counts and the PageRank algorithm, incorporating data from the Science Citation Index. However, citation data is usually incomplete. Therefore, we explore the relationship between the quantity of citation information available to the system and the quality of the result ranking. Specifically, we test the ability of citation count and PageRank to identify "important articles" as defined by experts from large result sets with decreasing citation information. We found that PageRank performs better than simple citation counts, but both algorithms are surprisingly robust to information loss. We conclude that even an incomplete citation database is likely to be effective for importance ranking.
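The two ranking signals named in this abstract can be illustrated on a toy citation graph. The sketch below is a generic power-iteration PageRank, not the authors' system: the article identifiers, the damping factor of 0.85, the iteration count, and the handling of articles with no outgoing citations are all assumptions made for the example.

```python
def citation_counts(citations):
    """Count how often each article is cited.

    `citations` maps an article id to the list of article ids it cites.
    """
    counts = {article: 0 for article in citations}
    for cited_list in citations.values():
        for cited in cited_list:
            counts[cited] = counts.get(cited, 0) + 1
    return counts


def pagerank(citations, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a citation graph."""
    nodes = set(citations)
    for cited_list in citations.values():
        nodes.update(cited_list)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            outgoing = citations.get(node, [])
            if outgoing:
                # Split this article's rank evenly among the articles it cites.
                share = damping * rank[node] / len(outgoing)
                for cited in outgoing:
                    new_rank[cited] += share
            else:
                # Article cites nothing in the graph: spread its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: A and B cite C, and C cites D.
toy_graph = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
counts = citation_counts(toy_graph)
ranks = pagerank(toy_graph)
ranking = sorted(ranks, key=ranks.get, reverse=True)
```

Deleting edges from `toy_graph` and re-running both functions is a simple way to mimic the "decreasing citation information" experiment on a small scale.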
Abstract:
People often use tools to search for information. To improve the quality of an information search, it is important to understand how internal information, which is stored in the user’s mind, and external information, which is represented by the tool’s interface, interact with each other. How information is distributed between internal and external representations significantly affects information search performance. However, few studies have examined the relationship between types of interface and types of search task in the context of information search. For a distributed information search task, how data are distributed, represented, and formatted significantly affects the user’s search performance in terms of response time and accuracy. Guided by UFuRT (User, Function, Representation, Task), a human-centered process, I propose a search model and a task taxonomy. The model defines its relationship with other existing information models. The taxonomy clarifies the legitimate operations for each type of search task over relational data. Based on the model and taxonomy, I have also developed interface prototypes for search tasks over relational data. These prototypes were used for experiments. The experiments described in this study are of a within-subject design with a sample of 24 participants recruited from the graduate schools located in the Texas Medical Center. Participants performed one-dimensional nominal search tasks over nominal, ordinal, and ratio displays, and performed one-dimensional nominal, ordinal, interval, and ratio search tasks over table and graph displays. Participants also performed the same task and display combinations for two-dimensional searches. Distributed cognition theory has been adopted as a theoretical framework for analyzing and predicting the search performance of relational data. It has been shown that the representation dimensions and data scales, as well as the search task types, are the main factors in determining search efficiency and effectiveness.
In particular, the more external representations are used, the better the search task performance, and the results suggest that ideal search performance occurs when the question type and the corresponding data scale representation match. The implications of the study lie in contributing to the effective design of search interfaces for relational data, especially laboratory results, which are often used in healthcare activities.
Abstract:
There is growing interest in the location of Treatment, Storage, and Disposal Facility (TSDF) sites in relation to minority communities. A number of studies have been completed, and their results have varied. Some of the studies have shown a strong positive correlation between the location of TSDF sites and minority populations, while a few have shown no significant relationship. The major difference between these studies has been in the areal unit used.
This study compared the minority populations of Texas census tracts and ZIP codes containing a TSDF, using the associated county as the comparison population. The hypothesis of this study was that there was no difference between using census tracts and ZIP codes to analyze the relationship between minority populations and TSDFs. The census data used were from 1990, and the initial list of TSDF sites was supplied by the Texas Natural Resource Conservation Commission. The TSDF site locations were checked using geographic information systems (GIS) programs, in order to increase the accuracy of the identification of exposed ZIP codes and census tracts. The minority populations of the exposed areal units were compared using proportional differences, crosstables, maps, and logistic regression. The dependent variable used was the exposure status of the areal units under study, including counties, census tracts, and ZIP codes. The independent variables used included minority group proportion and grouping of the proportions, educational status, household income, and home value.
In all cases, education was significant or near significant at the .05 level. Education rather than minority proportion was therefore the most significant predictor of the exposure status of a census tract or ZIP code.
Abstract:
The purpose of this study was to provide further data on the relationship between self-concept and violence, focusing on a delinquent adolescent population. Recent research has explored the relationship between self-concept and violence, with most of the research being done with adult populations. Within the literature, there are two opposing views on the question of this relationship. The traditional view supports the idea that low self-esteem is a cause of violent behavior, while the non-traditional view supports the idea that high self-esteem may be a contributor to violent behavior.
Using a sample of 200 delinquent adolescents, 100 of whom had committed acts of violence and 100 of whom had not, a group comparison study was done which addressed the following questions: (1) within a delinquent population of violent and non-violent adolescents, is there a relationship between violence and self-concept? (2) what is that relationship? (3) using the Piers-Harris Children's Self-Concept Scale, can it be determined that attributes such as behavior, anxiety, popularity, happiness, and physical appearance, as they relate to self-concept, are more predictive than others in determining who within a delinquent population will commit acts of violence? For the purposes of this study, delinquent adolescents were those who had official records of misconduct with either the school or juvenile authorities. Adolescents classified as violent were those who had committed acts such as assault, use of a weapon, use of deadly force, and sexual assault, while adolescents classified as non-violent had committed anti-social acts such as truancy, talking back, and rule breaking.
The study concluded that there is a relationship between adolescent violence and self-concept. However, there was insufficient statistical evidence that self-concept is a predictor of violence.
Abstract:
Next-generation DNA sequencing platforms can effectively detect the entire spectrum of genomic variation and are emerging as a major tool for systematic exploration of the universe of variants and interactions in the entire genome. However, the data produced by next-generation sequencing technologies will suffer from three basic problems: sequence errors, assembly errors, and missing data. Current statistical methods for genetic analysis are well suited for detecting the association of common variants, but are less suitable for rare variants. This raises a great challenge for sequence-based genetic studies of complex diseases.
This research dissertation utilized the genome continuum model as a general principle, and stochastic calculus and functional data analysis as tools, for developing novel and powerful statistical methods for the next generation of association studies of both qualitative and quantitative traits in the context of sequencing data, which ultimately leads to shifting the paradigm of association analysis from the current locus-by-locus analysis to collectively analyzing genome regions.
In this project, the functional principal component (FPC) methods coupled with high-dimensional data reduction techniques will be used to develop novel and powerful methods for testing the associations of the entire spectrum of genetic variation within a segment of genome or a gene, regardless of whether the variants are common or rare.
Classical quantitative genetics methods suffer from high type I error rates and low power for rare variants. To overcome these limitations for resequencing data, this project used functional linear models with scalar response to develop statistics for identifying quantitative trait loci (QTLs) for both common and rare variants. To illustrate their applications, the functional linear models were applied to five quantitative traits in the Framingham Heart Study.
This project proposed a novel concept of gene-gene co-association, in which a gene or a genomic region is taken as a unit of association analysis, and used stochastic calculus to develop a unified framework for testing the association of multiple genes or genomic regions for both common and rare alleles. The proposed methods were applied to gene-gene co-association analysis of psoriasis in two independent GWAS datasets, which led to the discovery of networks significantly associated with psoriasis.
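For readers unfamiliar with the term, a functional linear model with scalar response is commonly written as below. This is the generic textbook form, given here only for orientation; the abstract does not state the dissertation's exact formulation, and the symbols ($x_i(t)$ for the genotype profile of individual $i$ along a region of length $T$, $\beta(t)$ for the effect function, $\phi_j$ for the functional principal components, $k$ for the number of components retained) are assumptions of this sketch.

```latex
% Scalar trait y_i regressed on a genotype function x_i(t) over a region [0, T]
y_i = \mu + \int_0^T x_i(t)\,\beta(t)\,dt + \varepsilon_i

% Expanding x_i(t) and \beta(t) on orthonormal functional principal components \phi_j(t)
x_i(t) \approx \sum_{j=1}^{k} \xi_{ij}\,\phi_j(t),
\qquad
\xi_{ij} = \int_0^T x_i(t)\,\phi_j(t)\,dt

% reduces the model to an ordinary regression on the k component scores
y_i \approx \mu + \sum_{j=1}^{k} \xi_{ij}\,\beta_j + \varepsilon_i,
\qquad
\beta_j = \int_0^T \beta(t)\,\phi_j(t)\,dt
```

Under this form, testing a whole region for association amounts to testing $\beta_1 = \dots = \beta_k = 0$ jointly, rather than testing one locus at a time.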
Abstract:
Clinical Research Data Quality Literature Review and Pooled Analysis
We present a literature review and secondary analysis of data accuracy in clinical research and related secondary data uses. A total of 93 papers meeting our inclusion criteria were categorized according to the data processing methods. Quantitative data accuracy information was abstracted from the articles and pooled. Our analysis demonstrates that the accuracy associated with data processing methods varies widely, with error rates ranging from 2 errors per 10,000 fields to 5019 errors per 10,000 fields. Medical record abstraction was associated with the highest error rates (70–5019 errors per 10,000 fields). Data entered and processed at healthcare facilities had comparable error rates to data processed at central data processing centers. Error rates for data processed with single entry in the presence of on-screen checks were comparable to double entered data. While data processing and cleaning methods may explain a significant amount of the variability in data accuracy, additional factors not resolvable here likely exist.
Defining Data Quality for Clinical Research: A Concept Analysis
Despite notable previous attempts by experts to define data quality, the concept remains ambiguous and subject to the vagaries of natural language. This current lack of clarity continues to hamper research related to data quality issues. We present a formal concept analysis of data quality, which builds on and synthesizes previously published work. We further posit that discipline-level specificity may be required to achieve the desired definitional clarity. To this end, we combine work from the clinical research domain with findings from the general data quality literature to produce a discipline-specific definition and operationalization for data quality in clinical research. While the results are helpful to clinical research, the methodology of concept analysis may be useful in other fields to clarify data quality attributes and to achieve operational definitions.
Medical Record Abstractor’s Perceptions of Factors Impacting the Accuracy of Abstracted Data
Medical record abstraction (MRA) is known to be a significant source of data errors in secondary data uses. Factors impacting the accuracy of abstracted data are not reported consistently in the literature. Two Delphi processes were conducted with experienced medical record abstractors to assess abstractors’ perceptions about the factors. The Delphi process identified 9 factors that were not found in the literature, and differed with the literature by 5 factors in the top 25%. The Delphi results refuted seven factors reported in the literature as impacting the quality of abstracted data. The results provide insight into and indicate content validity of a significant number of the factors reported in the literature. Further, the results indicate general consistency between the perceptions of clinical research medical record abstractors and registry and quality improvement abstractors.
Distributed Cognition Artifacts on Clinical Research Data Collection Forms
Medical record abstraction, a primary mode of data collection in secondary data use, is associated with high error rates. Distributed cognition in medical record abstraction has not been studied as a possible explanation for abstraction errors. We employed the theory of distributed representation and representational analysis to systematically evaluate cognitive demands in medical record abstraction and the extent of external cognitive support employed in a sample of clinical research data collection forms. We show that the cognitive load required for abstraction in 61% of the sampled data elements was high, exceedingly so in 9%. Further, the data collection forms did not support external cognition for the most complex data elements. High working memory demands are a possible explanation for the association of data errors with data elements requiring abstractor interpretation, comparison, mapping or calculation. The representational analysis used here can be used to identify data elements with high cognitive demands.
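The "errors per 10,000 fields" unit used in the pooled analysis above is a simple normalization, sketched here for concreteness. The function names and the audit numbers are illustrative assumptions, and pooling by summing errors and fields is only one possible approach; the abstract does not say which pooling method the authors used.

```python
def errors_per_10000(errors, fields_inspected):
    """Normalize an error count to errors per 10,000 fields inspected."""
    if fields_inspected <= 0:
        raise ValueError("fields_inspected must be positive")
    return 10_000 * errors / fields_inspected


def pooled_errors_per_10000(studies):
    """Pool several (errors, fields_inspected) pairs into one rate.

    Summing numerators and denominators weights each study by the number
    of fields it inspected.
    """
    total_errors = sum(errors for errors, _ in studies)
    total_fields = sum(fields for _, fields in studies)
    return errors_per_10000(total_errors, total_fields)


# Made-up audit results, not data from the review.
double_entry_audit = (14, 20_000)  # 14 errors found in 20,000 fields
abstraction_audit = (350, 5_000)   # 350 errors found in 5,000 fields

double_entry_rate = errors_per_10000(*double_entry_audit)
abstraction_rate = errors_per_10000(*abstraction_audit)
pooled_rate = pooled_errors_per_10000([double_entry_audit, abstraction_audit])
```

Expressing every study on the same per-10,000-fields scale is what makes rates from audits of very different sizes comparable.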
Abstract:
These three manuscripts are presented as a PhD dissertation on the use of GeoVis applications to evaluate telehealth programs. The primary aim of this research was to understand how GeoVis applications can be designed and developed using a combination of the Human Centered (HC) approach and Cognitive Fit Theory (CFT), and in turn utilized to evaluate a telehealth program in Brazil.
First manuscript
The first manuscript presents background on the use of GeoVisualization to facilitate visual exploration of public health data. It covers the existing challenges associated with the adoption of existing GeoVis applications. The manuscript combines the principles of the Human Centered approach and Cognitive Fit Theory into a framework that lays the foundation of this research. The framework is then utilized to propose the design, development and evaluation of “the SanaViz” to evaluate telehealth data in Brazil, as a proof of concept.
Second manuscript
The second manuscript is a methods paper that describes the approaches that can be employed to design and develop “the SanaViz” based on the proposed framework. By defining the various elements of the HC approach and CFT, a mixed-methods approach is utilized for the card sorting and sketching techniques. A representative sample of 20 study participants currently involved in the telehealth program at the NUTES telehealth center at UFPE, Recife, Brazil was enrolled. The findings of this manuscript helped us understand the needs of the diverse group of telehealth users and the tasks that they perform, and helped us determine the essential features that might need to be included in the proposed GeoVis application “the SanaViz”.
Third manuscript
The third manuscript used a mixed-methods approach to compare the effectiveness and usefulness of the HC GeoVis application “the SanaViz” against a conventional GeoVis application, “Instant Atlas”. The same group of 20 study participants who had earlier participated in Aim 2 was enrolled, and a combination of quantitative and qualitative assessments was done. Effectiveness was gauged by the time that the participants took to complete the tasks using both GeoVis applications, the ease with which they completed the tasks, and the number of attempts taken to complete each task. Usefulness was assessed with the System Usability Scale (SUS), a validated questionnaire tested in prior studies. In-depth interviews were conducted to gather opinions about both GeoVis applications. This manuscript helped us demonstrate the usefulness and effectiveness of HC GeoVis applications in facilitating visual exploration of telehealth data, as a proof of concept.
Together, these three manuscripts address the challenges of combining the principles of the Human Centered approach and Cognitive Fit Theory to design and develop GeoVis applications as a method to evaluate telehealth data. To our knowledge, this is the first study to explore the usefulness and effectiveness of GeoVis to facilitate visual exploration of telehealth data. The results of the research enabled us to develop a framework for the design and development of GeoVis applications related to the areas of public health and especially telehealth. The results of our study identified the varied users involved with the telehealth program and the tasks that they performed. Further, they enabled us to identify the components that might be essential to include in these GeoVis applications.
The results of our research yielded the following findings:
(a) Telehealth users vary in their level of understanding about GeoVis.
(b) Interaction features such as zooming, sorting, linking, and multiple views, and representation features such as bar charts and choropleth maps, were considered the most essential features of the GeoVis applications.
(c) Comparing and sorting were two important tasks that the telehealth users would perform for exploratory data analysis.
(d) An HC GeoVis prototype application is more effective and useful for exploration of telehealth data than a conventional GeoVis application.
Future studies should incorporate the proposed HC GeoVis framework to enable comprehensive assessment of the users and the tasks they perform, in order to identify the features that might need to be part of the GeoVis applications. The results of this study demonstrate a novel approach to comprehensively and systematically enhance the evaluation of telehealth programs using the proposed GeoVis framework.
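The System Usability Scale mentioned above has a standard scoring rule, sketched below for readers who have not used it. The abstract does not describe how the study computed or compared its scores, so the function name and the sample responses are illustrative assumptions; only the scoring rule itself (ten items rated 1 to 5, odd items contributing the response minus 1, even items contributing 5 minus the response, sum multiplied by 2.5) is the standard one.

```python
def sus_score(responses):
    """Compute a System Usability Scale score on the 0-100 scale.

    `responses` holds the ten item ratings in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += rating - 1  # odd items are positively worded
        else:
            total += 5 - rating  # even items are negatively worded
    return total * 2.5


# Hypothetical responses from one participant for two applications.
prototype_responses = [5, 2, 4, 1, 5, 2, 4, 2, 5, 1]
conventional_responses = [3, 3, 3, 4, 2, 3, 3, 4, 3, 3]

prototype_sus = sus_score(prototype_responses)
conventional_sus = sus_score(conventional_responses)
```

Because every participant rates both applications in a within-group comparison like the one described, per-participant score differences are the natural quantity to analyze.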