14 results for RNA-seq data
at Duke University
Abstract:
BACKGROUND: There is considerable interest in the development of methods to efficiently identify all coding variants present in large sample sets of humans. There are three approaches possible: whole-genome sequencing, whole-exome sequencing using exon capture methods, and RNA-Seq. While whole-genome sequencing is the most complete, it remains sufficiently expensive that cost-effective alternatives are important. RESULTS: Here we provide a systematic exploration of how well RNA-Seq can identify human coding variants by comparing variants identified through high coverage whole-genome sequencing to those identified by high coverage RNA-Seq in the same individual. This comparison allowed us to directly evaluate the sensitivity and specificity of RNA-Seq in identifying coding variants, and to evaluate how key parameters such as the degree of coverage and the expression levels of genes interact to influence performance. We find that although only 40% of exonic variants identified by whole-genome sequencing were captured using RNA-Seq, this number rose to 81% when concentrating on genes known to be well-expressed in the source tissue. We also find that a high false positive rate can be problematic when working with RNA-Seq data, especially at higher levels of coverage. CONCLUSIONS: We conclude that as long as a tissue relevant to the trait under study is available and suitable quality control screens are implemented, RNA-Seq is a fast and inexpensive alternative approach for finding coding variants in genes with sufficiently high expression levels.
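The comparison described in this abstract can be illustrated with a minimal sketch, assuming that whole-genome calls are treated as the reference set and that a variant is identified by chromosome, position, reference allele, and alternate allele. The function name `compare_calls` and the toy variants are hypothetical and are not taken from the study.

```python
def compare_calls(wgs_variants, rnaseq_variants):
    """Compare RNA-Seq variant calls against WGS calls used as the reference set.

    Returns (sensitivity, unsupported_fraction):
      sensitivity          = share of WGS variants also found by RNA-Seq
      unsupported_fraction = share of RNA-Seq calls absent from the WGS set
    """
    wgs = set(wgs_variants)
    rna = set(rnaseq_variants)
    shared = wgs & rna
    sensitivity = len(shared) / len(wgs) if wgs else float("nan")
    unsupported_fraction = len(rna - wgs) / len(rna) if rna else float("nan")
    return sensitivity, unsupported_fraction


# Hypothetical toy data: (chromosome, position, reference allele, alternate allele)
wgs_calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr2", 900, "T", "C"),
    ("chr3", 40, "A", "C"),
]
rna_calls = [
    ("chr1", 100, "A", "G"),
    ("chr2", 75, "G", "A"),
    ("chr4", 10, "C", "G"),  # not supported by the WGS calls
]

sensitivity, unsupported_fraction = compare_calls(wgs_calls, rna_calls)
print(f"sensitivity: {sensitivity:.2f}")
print(f"RNA-Seq calls not in WGS: {unsupported_fraction:.2f}")
```

Restricting both input sets to genes that are well expressed in the source tissue before calling the function would correspond to the second figure the abstract reports.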
Abstract:
Extensive departures from balanced gene dose in aneuploids are highly deleterious. However, we know very little about the relationship between gene copy number and expression in aneuploid cells. We determined copy number and transcript abundance (expression) genome-wide in Drosophila S2 cells by DNA-Seq and RNA-Seq. We found that S2 cells are aneuploid for >43 Mb of the genome, primarily in the range of one to five copies, and show a male genotype (approximately two X chromosomes and four sets of autosomes, or 2X;4A). Both X chromosomes and autosomes showed expression dosage compensation. X chromosome expression was elevated in a fixed-fold manner regardless of actual gene dose. In engineering terms, the system "anticipates" the perturbation caused by X dose, rather than responding to an error caused by the perturbation. This feed-forward regulation resulted in precise dosage compensation only when X dose was half of the autosome dose. Insufficient compensation occurred at lower X chromosome dose and excessive expression occurred at higher doses. RNAi knockdown of the Male Specific Lethal complex abolished feed-forward regulation. Both autosome and X chromosome genes show Male Specific Lethal-independent compensation that fits a first order dose-response curve. Our data indicate that expression dosage compensation dampens the effect of altered DNA copy number genome-wide. For the X chromosome, compensation includes fixed and dose-dependent components.
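The fixed-fold behaviour described above can be sketched numerically. This is an illustration only: the two-fold value is an assumption inferred from the statement that compensation is precise when X dose is half the autosome dose, the function name is hypothetical, and the separate dose-dependent component mentioned in the abstract is deliberately left out.

```python
# Assumption: a fixed two-fold boost, inferred from the abstract's statement
# that compensation is exact when X dose is half of the autosome dose.
FIXED_FOLD = 2.0


def relative_x_expression(x_copies, autosome_copies, fold=FIXED_FOLD):
    """X-linked expression relative to autosomal expression under a fixed-fold boost.

    A value of 1.0 means exact compensation, below 1.0 means insufficient
    compensation, and above 1.0 means excessive expression.
    """
    return fold * x_copies / autosome_copies


# Four autosome sets, as in the 2X;4A genotype reported for S2 cells.
ratios = {x: relative_x_expression(x, autosome_copies=4) for x in (1, 2, 3)}
for x_copies, ratio in ratios.items():
    print(f"{x_copies} X copies vs 4 autosome sets -> relative expression {ratio:.2f}")
```

Because the boost does not depend on the actual dose, the ratio equals 1.0 only at two X copies per four autosome sets, which matches the feed-forward description in the abstract.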
Abstract:
Cryptococcus neoformans is a pathogenic basidiomycetous yeast responsible for more than 600,000 deaths each year. It occurs as two serotypes (A and D) representing two varieties (i.e. grubii and neoformans, respectively). Here, we sequenced the genome and performed an RNA-Seq-based analysis of the C. neoformans var. grubii transcriptome structure. We determined the chromosomal locations, analyzed the sequence/structural features of the centromeres, and identified origins of replication. The genome was annotated based on automated and manual curation. More than 40,000 introns populating more than 99% of the expressed genes were identified. Although most of these introns are located in the coding DNA sequences (CDS), over 2,000 introns in the untranslated regions (UTRs) were also identified. Poly(A)-containing reads were employed to locate the polyadenylation sites of more than 80% of the genes. Examination of the sequences around these sites revealed a new poly(A)-site-associated motif (AUGHAH). In addition, 1,197 miscRNAs were identified. These miscRNAs can be spliced and/or polyadenylated, but do not appear to have obvious coding capacities. Finally, this genome sequence enabled a comparative analysis of strain H99 variants obtained after laboratory passage. The spectrum of mutations identified provides insights into the genetics underlying the micro-evolution of a laboratory strain, and identifies mutations involved in stress responses, mating efficiency, and virulence.
Abstract:
Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TF) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often only provides partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference on the protein-DNA interaction landscape.
We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.
We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.
Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover the underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). By incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of the protein binding landscape and that the most accurate inference comes from modeling all available datasets.
This dissertation provides a foundation for future research by taking a step toward genome-wide inference of the protein-DNA interaction landscape through data integration.
Abstract:
Olfactory sensory neurons (OSNs), which detect a myriad of odorants, are known to express one allele of one olfactory receptor (OR) gene (Olfr) from the largest gene family in the mammalian genome. The OSNs expressing the same OR project their axons to the main olfactory bulb where they converge to form glomeruli. This “One neuron-one receptor rule” makes the olfactory epithelium (OE), which consists of a vast number of OSNs expressing unique ORs, one of the most heterogeneous cell populations. However, the mechanism by which the single OR allele is chosen remains unclear, as does the question of whether one OSN expresses only a single OR gene, a hypothesis that had not been rigorously verified when we performed the experiments. Moreover, failure of axonal targeting to a single glomerulus was observed in MeCP2-deficient OSNs, where delayed development was proposed as an explanation for the phenotype. How Mecp2 mutation causes this aberrant targeting is not entirely understood.
In this dissertation, we explored the transcriptomes of single, mature OSNs by single-cell RNA-Seq to reveal their heterogeneity and further studied OR gene expression in these isolated OSNs. The singularity of the sequenced OSNs was confirmed by observing monoallelic expression of X-linked genes in hybrid samples from crosses between mice of different strains, where strain-specific polymorphisms could be used to track the allelic origins of SNP-containing reads. The clustering of expression profiles from triplicates that originated from the same cell showed that the transcriptomic identities of OSNs were maintained through the experimental process. The average gene expression profiles of sequenced OSNs correlated well with conventional transcriptome data from FACS-sorted Omp-positive cells, and the top-ranked expression of ORs was also seen in the single-OSN transcriptomes. While exploring cellular diversity, we revealed nearly 200 genes, in addition to OR genes, that were differentially expressed among the sequenced OSNs in this study. Among the 36 sequenced OSNs, eight cells (22.2%) expressed multiple OR genes, and the additional ORs were not restricted to neighboring loci that shared the transcriptional effect of the primary OR expression, suggesting that the “One neuron-one receptor rule” might not be strictly true at the transcription level. All of the inferable ORs, including additional co-expressed ORs, were shown to be monoallelic. Our sequencing of 21 Mecp2308 mutant OSNs, of which 62% expressed more than one OR gene and in which the expression levels of the additional ORs were significantly higher than those in the wild type, suggested that MeCP2 plays a role in the regulation of singular OR gene expression.
Dual-label in situ hybridization along with the sequence data revealed that dorsal and ventral ORs were co-expressed in the same Mecp2 mutant OSN, further implying that MeCP2 might be involved in the regulation of OR territories in the OE. Our results suggest a new role for MeCP2 in OR gene choice and indicate that the multiple-OR expression caused by Mecp2 mutation was not accompanied by the delayed OSN development observed in previous studies of Mecp2 mutants.
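Tracking allelic origin with strain-specific polymorphisms, as described in this abstract, can be sketched as a simple read-count classification. This is a minimal illustration under stated assumptions: the function name, the 0.9 threshold, and the read counts are hypothetical and are not taken from the dissertation.

```python
def classify_allelic_expression(reads_strain_a, reads_strain_b, threshold=0.9):
    """Classify a gene in one cell from counts of SNP-containing reads per parental strain.

    threshold is a hypothetical cutoff: the gene is called monoallelic when at
    least this fraction of informative reads comes from a single strain.
    """
    total = reads_strain_a + reads_strain_b
    if total == 0:
        return "uninformative"  # no SNP-containing reads, allelic origin unknown
    major_fraction = max(reads_strain_a, reads_strain_b) / total
    return "monoallelic" if major_fraction >= threshold else "biallelic"


# Hypothetical read counts for three genes in one sequenced cell.
examples = {
    "gene_1": (48, 2),
    "gene_2": (30, 25),
    "gene_3": (0, 0),
}
calls = {gene: classify_allelic_expression(a, b) for gene, (a, b) in examples.items()}
for gene, call in calls.items():
    print(gene, call)
```

A cell in which X-linked genes are consistently called monoallelic would be consistent with a single captured neuron, which is the logic the abstract uses to check the singularity of sequenced OSNs.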
Abstract:
Single-molecule sequencing instruments can generate multikilobase sequences with the potential to greatly improve genome and transcriptome assembly. However, the error rates of single-molecule reads are high, which has limited their use thus far to resequencing bacteria. To address this limitation, we introduce a correction algorithm and assembly strategy that uses short, high-fidelity sequences to correct the error in single-molecule sequences. We demonstrate the utility of this approach on reads generated by a PacBio RS instrument from phage, prokaryotic and eukaryotic whole genomes, including the previously unsequenced genome of the parrot Melopsittacus undulatus, as well as for RNA-Seq reads of the corn (Zea mays) transcriptome. Our long-read correction achieves >99.9% base-call accuracy, leading to substantially better assemblies than current sequencing strategies: in the best example, the median contig size was quintupled relative to high-coverage, second-generation assemblies. Greater gains are predicted if read lengths continue to increase, including the prospect of single-contig bacterial chromosome assembly.
Abstract:
Nutrient availability profoundly influences gene expression. Many animal genes encode multiple transcript isoforms, yet the effect of nutrient availability on transcript isoform expression has not been studied in genome-wide fashion. When Caenorhabditis elegans larvae hatch without food, they arrest development in the first larval stage (L1 arrest). Starved larvae can survive L1 arrest for weeks, but growth and post-embryonic development are rapidly initiated in response to feeding. We used RNA-seq to characterize the transcriptome during L1 arrest and over time after feeding. Twenty-seven percent of detectable protein-coding genes were differentially expressed during recovery from L1 arrest, with the majority of changes initiating within the first hour, demonstrating widespread, acute effects of nutrient availability on gene expression. We used two independent approaches to track expression of individual exons and mRNA isoforms, and we connected changes in expression to functional consequences by mining a variety of databases. These two approaches identified an overlapping set of genes with alternative isoform expression, and they converged on common functional patterns. Genes affecting mRNA splicing and translation are regulated by alternative isoform expression, revealing post-transcriptional consequences of nutrient availability on gene regulation. We also found that phosphorylation sites are often alternatively expressed, revealing a common mode by which alternative isoform expression modifies protein function and signal transduction. Our results detail rich changes in C. elegans gene expression as larvae initiate growth and post-embryonic development, and they provide an excellent resource for ongoing investigation of transcriptional regulation and developmental physiology.
Abstract:
My dissertation work integrates comparative transcriptomics and functional analyses to investigate gene expression changes underlying two significant aspects of sea urchin evolution and development: the dramatic developmental changes associated with an ecologically significant shift in life history strategy and the development of the unusual radial body plan of adult sea urchins.
In Chapter 2, I investigate evolutionary changes in gene expression underlying the switch from feeding (planktotrophic) to nonfeeding (lecithotrophic) development in sea urchins. In order to identify these changes, I used Illumina RNA-seq to measure expression dynamics across 7 developmental stages in three sea urchin species: the lecithotroph Heliocidaris erythrogramma, the closely related planktotroph Heliocidaris tuberculata, and an outgroup planktotroph Lytechinus variegatus. My analyses draw on a well-characterized developmental gene regulatory network (GRN) in sea urchins to understand how the ancestral planktotrophic developmental program was altered during the evolution of lecithotrophic development. My results suggest that changes in gene expression profiles occurred more frequently across the transcriptome during the evolution of lecithotrophy than during the persistence of planktotrophy. These changes were even more pronounced within the GRN than across the transcriptome as a whole, and occurred in each network territory (skeletogenic, endomesoderm and ectoderm). I found evidence for both conservation and divergence of regulatory interactions in the network, as well as significant changes in the expression of genes with known roles in larval skeletogenesis, which is dramatically altered in lecithotrophs. I further explored network dynamics between species using coexpression analyses, which allowed me to identify novel players likely involved in sea urchin neurogenesis and endoderm patterning.
In Chapter 3, I investigate developmental changes in gene expression underlying radial body plan development and metamorphosis in H. erythrogramma. Using Illumina RNA-seq, I measured gene expression profiles across larval, metamorphic, and post-metamorphic life cycle phases. My results present a high-resolution view of gene expression dynamics during the complex transition from pre- to post-metamorphic development and suggest that distinct sets of regulatory and effector proteins are used during different life history phases.
Collectively, my investigations provide an important foundation for future, empirical studies to investigate the functional role of gene expression change in the evolution of developmental differences between species and also for the generation of the unusual radial body plan of sea urchins.
Abstract:
BACKGROUND: Small molecule inhibitors of histone deacetylases (HDACi) hold promise as anticancer agents for particular malignancies. However, clinical use is often confounded by toxicity, perhaps due to indiscriminate hyperacetylation of cellular proteins. Therefore, elucidating the mechanisms by which HDACi trigger differentiation, cell cycle arrest, or apoptosis of cancer cells could inform development of more targeted therapies. We used the myelogenous leukemia line K562 as a model of HDACi-induced differentiation to investigate chromatin accessibility (DNase-seq) and expression (RNA-seq) changes associated with this process. RESULTS: We identified several thousand specific regulatory elements [~10 % of total DNase I-hypersensitive (DHS) sites] that become significantly more or less accessible with sodium butyrate or suberanilohydroxamic acid treatment. Most of the differential DHS sites display hallmarks of enhancers, including being enriched for non-promoter regions, associating with nearby gene expression changes, and increasing luciferase reporter expression in K562 cells. Differential DHS sites were enriched for key hematopoietic lineage transcription factor motifs, including SPI1 (PU.1), a known pioneer factor. We found PU.1 increases binding at opened DHS sites with HDACi treatment by ChIP-seq, but PU.1 knockdown by shRNA fails to block the chromatin accessibility and expression changes. A machine-learning approach indicates H3K27me3 initially marks PU.1-bound sites that open with HDACi treatment, suggesting these sites are epigenetically poised. CONCLUSIONS: We find HDACi treatment of K562 cells results in site-specific chromatin remodeling at epigenetically poised regulatory elements. PU.1 shows evidence of a pioneer role in this process by marking poised enhancers but is not required for transcriptional activation.
Abstract:
BACKGROUND: Over the past two decades more than fifty thousand unique clinical and biological samples have been assayed using the Affymetrix HG-U133 and HG-U95 GeneChip microarray platforms. This substantial repository has been used extensively to characterize changes in gene expression between biological samples, but has not been previously mined en masse for changes in mRNA processing. We explored the possibility of using HG-U133 microarray data to identify changes in alternative mRNA processing in several available archival datasets. RESULTS: Data from these and other gene expression microarrays can now be mined for changes in transcript isoform abundance using a program described here, SplicerAV. Using in vivo and in vitro breast cancer microarray datasets, SplicerAV was able to perform both gene and isoform specific expression profiling within the same microarray dataset. Our reanalysis of Affymetrix U133 plus 2.0 data generated by in vitro over-expression of HRAS, E2F3, beta-catenin (CTNNB1), SRC, and MYC identified several hundred oncogene-induced mRNA isoform changes, one of which revealed a previously unknown mechanism of EGFR family activation. Using clinical data, SplicerAV predicted 241 isoform changes between low and high grade breast tumors, with changes enriched among genes coding for guanyl-nucleotide exchange factors, metalloprotease inhibitors, and mRNA processing factors. Isoform changes in 15 genes were associated with aggressive cancer across the three breast cancer datasets. CONCLUSIONS: Using SplicerAV, we identified several hundred previously uncharacterized isoform changes induced by in vitro oncogene over-expression and revealed a previously unknown mechanism of EGFR activation in human mammary epithelial cells. We analyzed Affymetrix GeneChip data from over 400 human breast tumors in three independent studies, making this the largest clinical dataset analyzed for en masse changes in alternative mRNA processing.
The capacity to detect RNA isoform changes in archival microarray data using SplicerAV allowed us to carry out the first analysis of isoform specific mRNA changes directly associated with cancer survival.
Abstract:
Beta-arrestins bind to activated G protein-coupled receptor kinase-phosphorylated receptors, which leads to their desensitization with respect to G proteins, internalization via clathrin-coated pits, and signaling via a growing list of "scaffolded" pathways. To facilitate the discovery of novel adaptor and signaling roles of beta-arrestins, we have developed and validated a generally applicable interfering RNA approach for selectively suppressing beta-arrestin 1 or 2 expression by up to 95%. Beta-arrestin depletion in HEK293 cells leads to enhanced cAMP generation in response to beta(2)-adrenergic receptor stimulation, markedly reduced beta(2)-adrenergic receptor and angiotensin II receptor internalization, and impaired activation of the MAP kinases ERK 1 and 2 by angiotensin II. This approach should allow discovery of novel signaling and regulatory roles for the beta-arrestins in many seven-membrane-spanning receptor systems.
Abstract:
With increasing recognition of the roles RNA molecules and RNA/protein complexes play in an unexpected variety of biological processes, understanding of RNA structure-function relationships is of high current importance. To make clean biological interpretations from three-dimensional structures, it is imperative to have high-quality, accurate RNA crystal structures available, and the community has thoroughly embraced that goal. However, due to the many degrees of freedom inherent in RNA structure (especially for the backbone), it is a significant challenge to succeed in building accurate experimental models for RNA structures. This chapter describes the tools and techniques our research group and our collaborators have developed over the years to help RNA structural biologists both evaluate and achieve better accuracy. Expert analysis of large, high-resolution, quality-conscious RNA datasets provides the fundamental information that enables automated methods for robust and efficient error diagnosis in validating RNA structures at all resolutions. The even more crucial goal of correcting the diagnosed outliers has steadily developed toward highly effective, computationally based techniques. Automation enables solving complex issues in large RNA structures, but cannot circumvent the need for thoughtful examination of local details, and so we also provide some guidance for interpreting and acting on the results of current structure validation for RNA.
Abstract:
Post-transcriptional regulation of cytoplasmic mRNAs is an efficient mechanism of regulating the amounts of active protein within a eukaryotic cell. RNA sequence elements located in the untranslated regions of mRNAs can influence transcript degradation or translation through associations with RNA-binding proteins. Tristetraprolin (TTP) is the best known member of a family of CCCH zinc finger proteins that targets adenosine-uridine rich element (ARE) binding sites in the 3’ untranslated regions (UTRs) of mRNAs, promoting transcript deadenylation through the recruitment of deadenylases. More specifically, TTP has been shown to bind AREs located in the 3’-UTRs of transcripts with known roles in the inflammatory response. The mRNA-binding region of the protein is the highly conserved CCCH tandem zinc finger (TZF) domain. The synthetic TTP TZF domain has been shown to bind with high affinity to the 13-mer sequence of UUUUAUUUAUUUU. However, the binding affinities of full-length TTP family members to the same sequence and its variants are unknown. Furthermore, the distance needed between two overlapping or neighboring UUAUUUAUU 9-mers for tandem binding events of a full-length TTP family member to a target transcript has not been explored. To address these questions, we recombinantly expressed and purified the full-length C. albicans TTP family member Zfs1. Using full-length Zfs1, tagged at the N-terminus with maltose binding protein (MBP), we determined the binding affinities of the protein to the optimal TTP binding sequence, UUAUUUAUU. Fluorescence anisotropy experiments determined that the binding affinities of MBP-Zfs1 to non-canonical AREs were influenced by ionic buffer strength, suggesting that transcript selectivity may be affected by intracellular conditions. Furthermore, electrophoretic mobility shift assays (EMSAs) revealed that separation of two core AUUUA sequences by two uridines is sufficient for tandem binding of MBP-Zfs1. 
Finally, we found evidence for tandem binding of MBP-Zfs1 to a 27-base RNA oligonucleotide containing only a single ARE-binding site, and showed that this binding depended on both protein concentration and RNA length; this phenomenon had not been seen previously. These data suggest that the association of the TTP TZF domain, and of the TZF domains of other species, with ARE-binding sites is highly conserved. Domains outside of the TZF domain may mediate transcript selectivity in changing cellular conditions, and promote protein-RNA interactions not associated with the ARE-binding TZF domain.
In summary, the evidence presented here suggests that Zfs1-mediated decay of mRNA targets may require interactions beyond ARE-TZF domain associations to promote transcript destabilization and degradation. These studies further our understanding of post-transcriptional steps in gene regulation.