13 results for Genomic data integration

in DigitalCommons@The Texas Medical Center


Relevance:

90.00%

Publisher:

Abstract:

The current state of health and biomedicine includes an enormity of heterogeneous data ‘silos’, collected for different purposes and represented differently, that are presently impossible to share or analyze in toto. The greatest challenge for large-scale and meaningful analyses of health-related data is to achieve a uniform data representation for data extracted from heterogeneous source representations. Based upon an analysis and categorization of heterogeneities, a process for achieving comparable data content by using a uniform terminological representation is developed. This process addresses the types of representational heterogeneities that commonly arise in healthcare data integration problems. Specifically, this process uses a reference terminology, and associated "maps" to transform heterogeneous data to a standard representation for comparability and secondary use. The capture of quality and precision of the “maps” between local terms and reference terminology concepts enhances the meaning of the aggregated data, empowering end users with better-informed queries for subsequent analyses. A data integration case study in the domain of pediatric asthma illustrates the development and use of a reference terminology for creating comparable data from heterogeneous source representations. The contribution of this research is a generalized process for the integration of data from heterogeneous source representations, and this process can be applied and extended to other problems where heterogeneous data needs to be merged.
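The mapping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source names, local terms, reference concept, and quality labels below are all invented for the example and do not come from the pediatric asthma case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermMap:
    """One 'map' from a local term to a reference terminology concept."""
    reference_concept: str
    quality: str  # hypothetical labels: "exact", "broader", "narrower"


# Hypothetical maps; every key and value here is an illustrative assumption.
LOCAL_TO_REFERENCE = {
    ("clinic_a", "wheezing episode"): TermMap("asthma_exacerbation", "broader"),
    ("clinic_b", "asthma attack"): TermMap("asthma_exacerbation", "exact"),
}


def normalize(source, local_term):
    """Translate a local term into the reference representation, keeping map quality."""
    term_map = LOCAL_TO_REFERENCE.get((source, local_term))
    if term_map is None:
        return None  # unmapped terms are left out rather than guessed
    return {
        "concept": term_map.reference_concept,
        "map_quality": term_map.quality,
        "source": source,
        "local_term": local_term,
    }


def select(records, concept, allowed_quality=("exact",)):
    """Query aggregated records, letting the user decide which map qualities to trust."""
    return [
        r for r in records
        if r["concept"] == concept and r["map_quality"] in allowed_quality
    ]


records = [
    normalize("clinic_a", "wheezing episode"),
    normalize("clinic_b", "asthma attack"),
]
strict = select(records, "asthma_exacerbation")
lenient = select(records, "asthma_exacerbation", ("exact", "broader"))
```

Carrying the map quality alongside each normalized record is what lets a later query choose between a strict and a lenient reading of the aggregated data.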

Relevance:

90.00%

Publisher:

Abstract:

It is well accepted that tumorigenesis is a multi-step procedure involving aberrant functioning of genes regulating cell proliferation, differentiation, apoptosis, genome stability, angiogenesis and motility. To obtain a full understanding of tumorigenesis, it is necessary to collect information on all aspects of cell activity. Recent advances in high-throughput technologies allow biologists to generate massive amounts of data, more than might have been imagined decades ago. These advances have made it possible to launch comprehensive projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), which systematically characterize the molecular fingerprints of cancer cells using gene expression, methylation, copy number, microRNA and SNP microarrays as well as next-generation sequencing assays interrogating somatic mutation, insertion, deletion, translocation and structural rearrangements. Given the massive amount of data, a major challenge is to integrate information from multiple sources and formulate testable hypotheses. This thesis focuses on developing methodologies for integrative analyses of genomic assays profiled on the same set of samples. We have developed several novel methods for integrative biomarker identification and cancer classification. We introduce a regression-based approach to identify biomarkers predictive of therapy response or survival by integrating multiple assays, including gene expression, methylation and copy number data, through penalized regression. To identify key cancer-specific genes accounting for multiple mechanisms of regulation, we have developed the integIRTy software, which provides robust and reliable inferences about gene alteration by automatically adjusting for sample heterogeneity as well as technical artifacts using Item Response Theory.
To cope with the increasing need for accurate cancer diagnosis and individualized therapy, we have developed a robust and powerful algorithm called SIBER to systematically identify bimodally expressed genes using next generation RNAseq data. We have shown that prediction models built from these bimodal genes have the same accuracy as models built from all genes. Further, prediction models with dichotomized gene expression measurements based on their bimodal shapes still perform well. The effectiveness of outcome prediction using discretized signals paves the road for more accurate and interpretable cancer classification by integrating signals from multiple sources.

Relevance:

40.00%

Publisher:

Abstract:

A wealth of genetic associations for cardiovascular and metabolic phenotypes in humans has been accumulating over the last decade, in particular a large number of loci derived from recent genome wide association studies (GWAS). True complex disease-associated loci often exert modest effects, so their delineation currently requires integration of diverse phenotypic data from large studies to ensure robust meta-analyses. We have designed a gene-centric 50 K single nucleotide polymorphism (SNP) array to assess potentially relevant loci across a range of cardiovascular, metabolic and inflammatory syndromes. The array utilizes a "cosmopolitan" tagging approach to capture the genetic diversity across approximately 2,000 loci in populations represented in the HapMap and SeattleSNPs projects. The array content is informed by GWAS of vascular and inflammatory disease, expression quantitative trait loci implicated in atherosclerosis, pathway based approaches and comprehensive literature searching. The custom flexibility of the array platform facilitated interrogation of loci at differing stringencies, according to a gene prioritization strategy that allows saturation of high priority loci with a greater density of markers than the existing GWAS tools, particularly in African HapMap samples. We also demonstrate that the IBC array can be used to complement GWAS, increasing coverage in high priority CVD-related loci across all major HapMap populations. DNA from over 200,000 extensively phenotyped individuals will be genotyped with this array with a significant portion of the generated data being released into the academic domain facilitating in silico replication attempts, analyses of rare variants and cross-cohort meta-analyses in diverse populations. These datasets will also facilitate more robust secondary analyses, such as explorations with alternative genetic models, epistasis and gene-environment interactions.

Relevance:

30.00%

Publisher:

Abstract:

Semantic Web technologies offer a promising framework for integration of disparate biomedical data. In this paper we present the semantic information integration platform under development at the Center for Clinical and Translational Sciences (CCTS) at the University of Texas Health Science Center at Houston (UTHSC-H) as part of our Clinical and Translational Science Award (CTSA) program. We use Semantic Web technologies not only for integrating, repurposing and classifying multi-source clinical data, but also to construct a distributed environment for online information sharing and collaboration. Service Oriented Architecture (SOA) is used to modularize and distribute reusable services in a dynamic and distributed environment. Components of the semantic solution and its overall architecture are described.

Relevance:

30.00%

Publisher:

Abstract:

High-throughput assays, such as the yeast two-hybrid system, have generated a huge amount of protein-protein interaction (PPI) data in the past decade. This tremendously increases the need for reliable methods to systematically and automatically suggest protein functions and the relationships between them. With the available PPI data, it is now possible to study these functions and relationships in the context of a large-scale network. To date, several network-based schemes have been provided to effectively annotate protein functions on a large scale. However, because of the noise inherent in high-throughput data generation, new methods and algorithms should be developed to increase the reliability of functional annotations. Previous work in a yeast PPI network (Samanta and Liang, 2003) has shown that the local connection topology, particularly for two proteins sharing an unusually large number of neighbors, can predict functional associations between proteins, and hence suggest their functions. One advantage of that work is that its algorithm is not sensitive to noise (false positives) in high-throughput PPI data. In this study, we improved their prediction scheme by developing a new algorithm and new methods, which we applied to a human PPI network to make a genome-wide functional inference. We used the new algorithm to measure and reduce the influence of hub proteins on detecting functionally associated proteins. We used the annotations of the Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) as independent and unbiased benchmarks to evaluate our algorithms and methods within the human PPI network. We showed that, compared with the previous work from Samanta and Liang, the algorithm and methods developed in this study improved the overall quality of functional inferences for human proteins. By applying the algorithms to the human PPI network, we obtained 4,233 significant functional associations among 1,754 proteins.
Further comparisons of their KEGG and GO annotations allowed us to assign 466 KEGG pathway annotations to 274 proteins and 123 GO annotations to 114 proteins, with estimated false discovery rates of <21% for KEGG and <30% for GO. We clustered 1,729 proteins by their functional associations and performed pathway analysis to identify several subclusters that are highly enriched in certain signaling pathways. In particular, we performed a detailed analysis of a subcluster enriched in the transforming growth factor β signaling pathway (P < 10^-50), which is important in cell proliferation and tumorigenesis. Analysis of another four subclusters also suggested potential new players in six signaling pathways worthy of further experimental investigation. Our study gives clear insight into the common neighbor-based prediction scheme and provides a reliable method for large-scale functional annotation in this post-genomic era.
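The common-neighbor idea can be illustrated with a hypergeometric tail probability: how likely is it that two proteins share at least m neighbors purely by chance? The sketch below is a simplified stand-in, not the algorithm of this study or the exact statistic of Samanta and Liang; the toy network, the function names, and the choice to count all proteins in the network as the background are assumptions made for the example.

```python
from math import comb


def shared_neighbor_pvalue(n_total, n1, n2, m):
    """P(X >= m) for X ~ Hypergeometric(n_total, n1, n2).

    n_total: number of proteins in the network (background size, an assumption)
    n1, n2:  neighbor counts of the two proteins
    m:       observed number of shared neighbors
    """
    upper = min(n1, n2)
    denom = comb(n_total, n2)
    numer = sum(
        comb(n1, k) * comb(n_total - n1, n2 - k) for k in range(m, upper + 1)
    )
    return numer / denom


def pair_pvalue(network, a, b):
    """Score one protein pair in an adjacency dict {protein: set(neighbors)}."""
    shared = len(network[a] & network[b])
    return shared_neighbor_pvalue(
        len(network), len(network[a]), len(network[b]), shared
    )


# Toy, made-up network: P1 and P2 share all three of their neighbors.
toy_network = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "B", "C"},
    "A": {"P1", "P2"},
    "B": {"P1", "P2"},
    "C": {"P1", "P2"},
    "D": set(), "E": set(), "F": set(), "G": set(), "H": set(),
}
p = pair_pvalue(toy_network, "P1", "P2")
```

A small p-value flags a pair whose overlap is unlikely under random wiring. Note that this plain version does nothing about hub proteins, whose large neighborhoods are the issue the study's new algorithm is said to address.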

Relevance:

30.00%

Publisher:

Abstract:

Myxobacteria are single-celled, but social, eubacterial predators. Upon starvation they build multicellular fruiting bodies using a developmental program that progressively changes the pattern of cell movement and the repertoire of genes expressed. Development terminates with spore differentiation and is coordinated by both diffusible and cell-bound signals. The growth and development of Myxococcus xanthus are regulated by the integration of multiple signals from outside the cells with physiological signals from within. A collection of M. xanthus cells behaves, in many respects, like a multicellular organism. For these reasons M. xanthus offers unparalleled access to a regulatory network that controls development and that organizes cell movement on surfaces. The genome of M. xanthus is large (9.14 Mb), considerably larger than those of the other sequenced delta-proteobacteria. We suggest that gene duplication and divergence were major contributors to genomic expansion from its progenitor. More than 1,500 duplications specific to the myxobacterial lineage were identified, representing >15% of the total genes. Genes were not duplicated at random; rather, genes for cell-cell signaling, small molecule sensing, and integrative transcription control were amplified selectively. Families of genes encoding the production of secondary metabolites are overrepresented in the genome but may have been received by horizontal gene transfer and are likely to be important for predation.

Relevance:

30.00%

Publisher:

Abstract:

Nonsyndromic cleft lip with or without cleft palate (NSCLP) is a common birth anomaly that requires prolonged multidisciplinary rehabilitation. Although variation in several genes has been identified as contributing to NSCLP, most of the genetic susceptibility loci have yet to be defined. To identify additional contributory genes, a high-throughput genomic scan was performed using the Illumina Linkage IVb Panel platform. We genotyped 6008 SNPs in nine non-Hispanic white NSCLP multiplex families and a single large African-American NSCLP multiplex family. Fourteen chromosomal regions were identified with LOD>1.5, including six regions not previously reported. Analysis of the data from the African-American and non-Hispanic white families revealed two likely chromosomal regions: 8q21.3-24.12 and 22q12.2-12.3 with LOD scores of 2.98 and 2.66, respectively. On the basis of biological function, syndecan 2 (SDC2) and growth differentiation factor 6 (GDF6) in 8q21.3-24.12 and myosin heavy-chain 9, non-muscle (MYH9) in 22q12.2-12.3 were selected as candidate genes. Association analyses from these genes yielded marginally significant P-values for SNPs in SDC2 and GDF6 (0.01

Relevance:

30.00%

Publisher:

Abstract:

This research characterized a serologically indistinguishable form of HLA-DR1 that: (1) cannot stimulate some DR1-restricted or specific T-lymphocyte clones; (2) displays an unusual electrophoretic pattern on two-dimensional gels; and (3) is marked by a polymorphic restriction site of the alpha gene. Inefficient stimulation of some DR1-restricted clones was a property of DR1⁺ cells that shared HLA-B14 on the same haplotype and/or were carriers of 21-hydroxylase (21-OH) deficiency. Nonclassical 21-OH deficiency frequently demonstrates genetic linkage with HLA-B14;DR1 haplotypes and associates with duplications of C4B and one 21-OH gene. Cells having both stimulatory (DR1ₙ) and nonstimulatory (DR1ₓ) parental haplotypes did not mediate proliferation of these clones. However, heterozygous DR1ₓ,2 and DR1ₓ,7 cells were efficient stimulators of DR2- and DR7-specific clones, respectively, suggesting that a trans-acting factor may modify DR1 alleles or products to yield a dominant DR1ₓ phenotype. Incompetent stimulator populations did not secrete an intercellular soluble or contact-dependent suppressor factor, nor did they express interleukin-2 receptors competing for T-cell growth factors. Two-dimensional gel analysis of anti-DR immunoprecipitates revealed, in addition to normal DRα and DRβ chains, a 50 kD species from DR1ₓ but not from the majority of DR1ₙ or non-DR1 cells. The 50 kD structure was stable under reducing conditions in SDS and urea, had antigenic homology with DR, and dissociated after boiling into 34 kD and 28 kD peptide chains apparently identical with DRα and DRβ, as shown by limited digest peptide maps. N-linked glycosylation and sialation of DRgp50 appeared to be unchanged from normal DRα and DRβ.
BglII digestion and DRα probing of DR1ₓ genomic DNA revealed a 4.5 kb fragment, while DR1ₙ DNA yielded 3.8 and 0.76 kb fragments; all restriction sites mapped to the 3′ untranslated region of DRα. Collectively, these data suggest that DRgp50 represents a novel combinatorial association between constitutive chains of DR that may interfere with or compete for normal T cell receptor recognition of DR1 as both an alloantigen and restricting element. Furthermore, extensive chromosomal abnormalities previously mapped to the class III region of B14;DR1 haplotypes may extend into the adjacent class II region with consequent intrusion on immune function.

Relevance:

30.00%

Publisher:

Abstract:

Cancer cell lines can be treated with a drug, and the molecular comparison of responders and non-responders may yield potential predictors that could be tested in the clinic. It is a bioinformatics challenge to apply cell line-derived multivariable response predictors to patients who respond to therapy. Using the gene expression data from 23 breast cancer cell lines, I developed three predictors of dasatinib sensitivity by selecting differentially expressed genes and applying different classification algorithms. The performance of these predictors was tested on independent cell lines with known dasatinib response. The predictor based on the weighted voting method had the best overall performance. It correctly predicted dasatinib sensitivity in 11 out of 12 (92%) breast and 17 out of 23 (74%) lung cancer cell lines. These predictors were then applied to the gene expression data from 133 breast cancer patients in an attempt to predict how the patients might respond to dasatinib therapy. Two predictors identified 13 patients in common as dasatinib sensitive. Sixty-two percent of these cases are triple negative (ER-negative, HER2-negative and PR-negative) and 76% are double negative. The result is consistent with the findings from other studies, which identified the target population for dasatinib treatment as the triple-negative or basal breast cancer subtype. In conclusion, we think that the cell line-derived dasatinib classifiers can be applied to human patients.
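The abstract does not spell out its weighted voting method, so the following is a minimal sketch assuming the commonly used signal-to-noise formulation: each gene gets a weight equal to the difference of class means divided by the sum of class standard deviations, and votes according to how far a sample lies from the midpoint of the two means. The function names and the tiny expression matrices are invented for illustration.

```python
from statistics import mean, stdev


def train_weighted_voting(sensitive, resistant):
    """Fit one (weight, boundary) pair per gene.

    Each argument is a list of samples; each sample is a list of
    expression values, one per gene, in the same gene order.
    """
    model = []
    for g in range(len(sensitive[0])):
        s = [sample[g] for sample in sensitive]
        r = [sample[g] for sample in resistant]
        mu_s, mu_r = mean(s), mean(r)
        # Signal-to-noise weight; assumes the two spreads are not both zero.
        weight = (mu_s - mu_r) / (stdev(s) + stdev(r))
        boundary = (mu_s + mu_r) / 2
        model.append((weight, boundary))
    return model


def predict(model, sample):
    """Sum the per-gene votes; a positive total means 'sensitive'."""
    total = sum(w * (x - b) for (w, b), x in zip(model, sample))
    return "sensitive" if total > 0 else "resistant"


# Made-up training data: gene 0 is high in sensitive lines, gene 1 in resistant ones.
sensitive_lines = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5]]
resistant_lines = [[1.0, 5.0], [2.0, 6.0], [1.5, 7.0]]
model = train_weighted_voting(sensitive_lines, resistant_lines)
call_a = predict(model, [6.5, 1.0])
call_b = predict(model, [1.0, 6.5])
```

In practice the genes fed to such a classifier would first be filtered for differential expression, as the abstract describes, and the patient data would need to be on a comparable scale to the cell-line data before the votes mean anything.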

Relevance:

30.00%

Publisher:

Abstract:

Next-generation DNA sequencing platforms can effectively detect the entire spectrum of genomic variation and are emerging as a major tool for systematic exploration of the universe of variants and interactions in the entire genome. However, the data produced by next-generation sequencing technologies suffer from three basic problems: sequence errors, assembly errors, and missing data. Current statistical methods for genetic analysis are well suited for detecting the association of common variants, but are less suitable for rare variants. This poses a great challenge for sequence-based genetic studies of complex diseases.
This dissertation used the genome continuum model as a general principle, and stochastic calculus and functional data analysis as tools, to develop novel and powerful statistical methods for the next generation of association studies of both qualitative and quantitative traits in the context of sequencing data, shifting the paradigm of association analysis from the current locus-by-locus analysis to the collective analysis of genome regions.
In this project, functional principal component (FPC) methods coupled with high-dimensional data reduction techniques are used to develop novel and powerful methods for testing the association of the entire spectrum of genetic variation within a segment of the genome or a gene, regardless of whether the variants are common or rare.
Classical quantitative genetics methods suffer from high type I error rates and low power for rare variants. To overcome these limitations for resequencing data, this project used functional linear models with scalar response to develop statistics for identifying quantitative trait loci (QTLs) for both common and rare variants. To illustrate their application, the functional linear models were applied to five quantitative traits in the Framingham Heart Study.
This project proposed a novel concept of gene-gene co-association, in which a gene or a genomic region is taken as the unit of association analysis, and used stochastic calculus to develop a unified framework for testing the association of multiple genes or genomic regions for both common and rare alleles. The proposed methods were applied to gene-gene co-association analysis of psoriasis in two independent GWAS datasets, which led to the discovery of networks significantly associated with psoriasis.

Relevance:

30.00%

Publisher:

Abstract:

Making healthcare comprehensive and more efficient remains a complex challenge. Health Information Technology (HIT) is recognized as an important component of this transformation, but few studies describe HIT adoption and its effect on the bedside experience of physicians, staff and patients. This study applied descriptive statistics and correlation analysis to data from the Patient-Centered Medical Home National Demonstration Project (NDP) of the American Academy of Family Physicians. Thirty-six clinics were followed for 26 months by clinician/staff questionnaires and patient surveys. This study characterizes those clinics as well as staff and patient perspectives on HIT usefulness, the doctor-patient relationship, electronic medical record (EMR) implementation, and computer connections in the practice throughout the study. The Global Practice Experience (GPE) factor, a composite score related to key components of primary care, was then correlated with clinician and patient perspectives. This study found wide adoption of HIT among NDP practices. Patient perspectives on the helpfulness of HIT to the doctor-patient relationship showed a suggestive trend that approached statistical significance (p = 0.172). Clinicians and staff noted successful integration of the EMR into clinic workflow, and their perception of its helpfulness to the doctor-patient relationship showed a suggestive increase, also approaching statistical significance (p = 0.06). GPE was correlated with clinician/staff assessment of a helpful doctor-patient relationship midway through the study (R = 0.460, p = 0.021), with the remaining time points nearing statistical significance. GPE was also correlated with patient perspectives of both EMR helpfulness in the doctor-patient relationship (R = 0.601, p = 0.001) and computer connections (R = 0.618, p = 0.0001) at the start of the study.

Relevance:

30.00%

Publisher:

Abstract:

My dissertation focuses on two aspects of RNA sequencing technology. The first is the methodology for modeling the overdispersion inherent in RNA-seq data for differential expression analysis. This aspect is addressed in three sections. The second aspect is the application of RNA-seq data to identify the CpG island methylator phenotype (CIMP) by integrating datasets of mRNA expression level and DNA methylation status. Section 1: The cost of DNA sequencing has fallen dramatically in the past decade. Consequently, genomic research increasingly depends on sequencing technology. However, it remains unclear how sequencing capacity influences the accuracy of mRNA expression measurement. We observe that accuracy improves as sequencing depth increases. To model the overdispersion, we use the beta-binomial distribution with a new parameter indicating the dependency between overdispersion and sequencing depth. Our modified beta-binomial model performs better than the binomial or the pure beta-binomial model, with a lower false discovery rate. Section 2: Although a number of methods have been proposed to accurately analyze differential RNA expression at the gene level, modeling at the base pair level is required. Here, we find that the overdispersion rate decreases as the sequencing depth increases at the base pair level. We also propose four models and compare them with each other. As expected, our beta-binomial model with a dynamic overdispersion rate is shown to be superior. Section 3: We investigate biases in RNA-seq by exploring the measurement of the external control, spike-in RNA. This study is based on two datasets with spike-in controls obtained from a recent study. We observe a previously undiscovered bias in the measurement of the spike-in transcripts that arises from the influence of the sample transcripts in RNA-seq. We also find that this influence is related to the local sequence of the random hexamer that is used in priming.
We suggest a model of the inequality between samples to correct this type of bias. Section 4: The expression of a gene can be turned off when its promoter is highly methylated. Several studies have reported that a clear threshold effect exists in gene silencing that is mediated by DNA methylation. It is reasonable to assume that the thresholds are specific to each gene. It is also intriguing to investigate genes that are largely controlled by DNA methylation. These genes are called “L-shaped” genes. We develop a method to determine the DNA methylation threshold and identify a new CIMP of BRCA. In conclusion, we provide a detailed understanding of the relationship between the overdispersion rate and sequencing depth. We also reveal a new bias in RNA-seq and provide a detailed understanding of the relationship between this bias and the local sequence. Finally, we develop a powerful method to dichotomize methylation status and consequently identify a new CIMP of breast cancer with a distinct classification of molecular characteristics and clinical features.
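For readers unfamiliar with the overdispersion discussed in Sections 1 and 2, the standard beta-binomial moments make the idea concrete. The symbols below ($n$ for the number of reads at a position, $\pi$ for the mean proportion, $\rho$ for the overdispersion) are notation chosen for this note; the dissertation's own depth-dependent parameterization is not given in the abstract and is not reproduced here.

```latex
% Y | p ~ Binomial(n, p),  p ~ Beta(\alpha, \beta)
\begin{align}
  \pi  &= \frac{\alpha}{\alpha + \beta}, &
  \rho &= \frac{1}{\alpha + \beta + 1}, \\
  \mathbb{E}[Y] &= n\pi, &
  \operatorname{Var}(Y) &= n\pi(1-\pi)\,\bigl[1 + (n-1)\rho\bigr].
\end{align}
```

With $\rho = 0$ the variance reduces to the binomial value $n\pi(1-\pi)$. A model in which the overdispersion depends on sequencing depth would, under this notation, let $\rho$ vary with $n$ rather than hold it fixed.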