958 resultados para RNA-seq data
Resumo:
Sequencing of pools of individuals (Pool-Seq) represents a reliable and cost-effective approach for estimating genome-wide SNP and transposable element insertion frequencies. However, Pool-Seq does not provide direct information on haplotypes so that, for example, obtaining inversion frequencies has not been possible until now. Here, we have developed a new set of diagnostic marker SNPs for seven cosmopolitan inversions in Drosophila melanogaster that can be used to infer inversion frequencies from Pool-Seq data. We applied our novel marker set to Pool-Seq data from an experimental evolution study and from North American and Australian latitudinal clines. In the experimental evolution data, we find evidence that positive selection has driven the frequencies of In(3R)C and In(3R)Mo to increase over time. In the clinal data, we confirm the existence of frequency clines for In(2L)t, In(3L)P and In(3R)Payne in both North America and Australia and detect a previously unknown latitudinal cline for In(3R)Mo in North America. The inversion markers developed here provide a versatile and robust tool for characterizing inversion frequencies and their dynamics in Pool-Seq data from diverse D. melanogaster populations.
Resumo:
BACKGROUND: There is an ever-increasing volume of data on host genes that are modulated during HIV infection, influence disease susceptibility or carry genetic variants that impact HIV infection. We created GuavaH (Genomic Utility for Association and Viral Analyses in HIV, http://www.GuavaH.org), a public resource that supports multipurpose analysis of genome-wide genetic variation and gene expression profile across multiple phenotypes relevant to HIV biology. FINDINGS: We included original data from 8 genome and transcriptome studies addressing viral and host responses in and ex vivo. These studies cover phenotypes such as HIV acquisition, plasma viral load, disease progression, viral replication cycle, latency and viral-host genome interaction. This represents genome-wide association data from more than 4,000 individuals, exome sequencing data from 392 individuals, in vivo transcriptome microarray data from 127 patients/conditions, and 60 sets of RNA-seq data. Additionally, GuavaH allows visualization of protein variation in ~8,000 individuals from the general population. The publicly available GuavaH framework supports queries on (i) unique single nucleotide polymorphism across different HIV related phenotypes, (ii) gene structure and variation, (iii) in vivo gene expression in the setting of human infection (CD4+ T cells), and (iv) in vitro gene expression data in models of permissive infection, latency and reactivation. CONCLUSIONS: The complexity of the analysis of host genetic influences on HIV biology and pathogenesis calls for comprehensive motors of research on curated data. The tool developed here allows queries and supports validation of the rapidly growing body of host genomic information pertinent to HIV research.
Resumo:
The transcriptome of the developing starchy endosperm of hexaploid wheat (Triticum aestivum) was determined using RNA-Seq isolated at five stages during grain fill. This resource represents an excellent way to identify candidate genes responsible for the starchy endosperm cell wall, which is dominated by arabinoxylan (AX), accounting for 70% of the cell wall polysaccharides, with 20% (1,3; 1,4)-beta-D-glucan, 7% glucomannan, and 4% cellulose. A complete inventory of transcripts of 124 glycosyltransferase (GT) and 72 glycosylhydrolase (GH) genes associated with cell walls is presented. The most highly expressed GT transcript (excluding those known to be involved in starch synthesis) was a GT47 family transcript similar to Arabidopsis (Arabidopsis thaliana) IRX10 involved in xylan extension, and the second most abundant was a GT61. Profiles for GT43 IRX9 and IRX14 putative orthologs were consistent with roles in AX synthesis. Low abundances were found for transcripts from genes in the acyl-coA transferase BAHD family, for which a role in AX feruloylation has been postulated. The relative expression of these was much greater in whole grain compared with starchy endosperm, correlating with the levels of bound ferulate. Transcripts associated with callose (GSL), cellulose (CESA), pectin (GAUT), and glucomannan (CSLA) synthesis were also abundant in starchy endosperm, while the corresponding cell wall polysaccharides were confirmed as low abundance (glucomannan and callose) or undetectable (pectin) in these samples. Abundant transcripts from GH families associated with the hydrolysis of these polysaccharides were also present, suggesting that they may be rapidly turned over. Abundant transcripts in the GT31 family may be responsible for the addition of Gal residues to arabinogalactan peptide.
Resumo:
Pós-graduação em Biociências e Biotecnologia Aplicadas à Farmácia - FCFAR
Resumo:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Resumo:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Resumo:
The recent advent of Next-generation sequencing technologies has revolutionized the way of analyzing the genome. This innovation allows to get deeper information at a lower cost and in less time, and provides data that are discrete measurements. One of the most important applications with these data is the differential analysis, that is investigating if one gene exhibit a different expression level in correspondence of two (or more) biological conditions (such as disease states, treatments received and so on). As for the statistical analysis, the final aim will be statistical testing and for modeling these data the Negative Binomial distribution is considered the most adequate one especially because it allows for "over dispersion". However, the estimation of the dispersion parameter is a very delicate issue because few information are usually available for estimating it. Many strategies have been proposed, but they often result in procedures based on plug-in estimates, and in this thesis we show that this discrepancy between the estimation and the testing framework can lead to uncontrolled first-type errors. We propose a mixture model that allows each gene to share information with other genes that exhibit similar variability. Afterwards, three consistent statistical tests are developed for differential expression analysis. We show that the proposed method improves the sensitivity of detecting differentially expressed genes with respect to the common procedures, since it is the best one in reaching the nominal value for the first-type error, while keeping elevate power. The method is finally illustrated on prostate cancer RNA-seq data.
Resumo:
My dissertation focuses on two aspects of RNA sequencing technology. The first is the methodology for modeling the overdispersion inherent in RNA-seq data for differential expression analysis. This aspect is addressed in three sections. The second aspect is the application of RNA-seq data to identify the CpG island methylator phenotype (CIMP) by integrating datasets of mRNA expression level and DNA methylation status. Section 1: The cost of DNA sequencing has reduced dramatically in the past decade. Consequently, genomic research increasingly depends on sequencing technology. However it remains elusive how the sequencing capacity influences the accuracy of mRNA expression measurement. We observe that accuracy improves along with the increasing sequencing depth. To model the overdispersion, we use the beta-binomial distribution with a new parameter indicating the dependency between overdispersion and sequencing depth. Our modified beta-binomial model performs better than the binomial or the pure beta-binomial model with a lower false discovery rate. Section 2: Although a number of methods have been proposed in order to accurately analyze differential RNA expression on the gene level, modeling on the base pair level is required. Here, we find that the overdispersion rate decreases as the sequencing depth increases on the base pair level. Also, we propose four models and compare them with each other. As expected, our beta binomial model with a dynamic overdispersion rate is shown to be superior. Section 3: We investigate biases in RNA-seq by exploring the measurement of the external control, spike-in RNA. This study is based on two datasets with spike-in controls obtained from a recent study. We observe an undiscovered bias in the measurement of the spike-in transcripts that arises from the influence of the sample transcripts in RNA-seq. Also, we find that this influence is related to the local sequence of the random hexamer that is used in priming. We suggest a model of the inequality between samples and to correct this type of bias. Section 4: The expression of a gene can be turned off when its promoter is highly methylated. Several studies have reported that a clear threshold effect exists in gene silencing that is mediated by DNA methylation. It is reasonable to assume the thresholds are specific for each gene. It is also intriguing to investigate genes that are largely controlled by DNA methylation. These genes are called “L-shaped” genes. We develop a method to determine the DNA methylation threshold and identify a new CIMP of BRCA. In conclusion, we provide a detailed understanding of the relationship between the overdispersion rate and sequencing depth. And we reveal a new bias in RNA-seq and provide a detailed understanding of the relationship between this new bias and the local sequence. Also we develop a powerful method to dichotomize methylation status and consequently we identify a new CIMP of breast cancer with a distinct classification of molecular characteristics and clinical features.
Resumo:
Of the ~1.7 million SINE elements in the human genome, only a tiny number are estimated to be active in transcription by RNA polymerase (Pol) III. Tracing the individual loci from which SINE transcripts originate is complicated by their highly repetitive nature. By exploiting RNA-Seq datasets and unique SINE DNA sequences, we devised a bioinformatic pipeline allowing us to identify Pol III-dependent transcripts of individual SINE elements. When applied to ENCODE transcriptomes of seven human cell lines, this search strategy identified ~1300 Alu loci and ~1100 MIR loci corresponding to detectable transcripts, with ~120 and ~60 respectively Alu and MIR loci expressed in at least three cell lines. In vitro transcription of selected SINEs did not reflect their in vivo expression properties, and required the native 5’-flanking region in addition to internal promoter. We also identified a cluster of expressed AluYa5-derived transcription units, juxtaposed to snaR genes on chromosome 19, formed by a promoter-containing left monomer fused to an Alu-unrelated downstream moiety. Autonomous Pol III transcription was also revealed for SINEs nested within Pol II-transcribed genes raising the possibility of an underlying mechanism for Pol II gene regulation by SINE transcriptional units. Moreover the application of our bioinformatic pipeline to both RNA-seq data of cells subjected to an in vitro pro-oncogenic stimulus and of in vivo matched tumor and non-tumor samples allowed us to detect increased Alu RNA expression as well as the source loci of such deregulation. The ability to investigate SINE transcriptomes at single-locus resolution will facilitate both the identification of novel biologically relevant SINE RNAs and the assessment of SINE expression alteration under pathological conditions.
Resumo:
Cnidarians are often considered simple animals, but the more than 13,000 estimated species (e.g., corals, hydroids and jellyfish) of the early diverging phylum exhibit a broad diversity of forms, functions and behaviors, some of which are demonstrably complex. In particular, cubozoans (box jellyfish) are cnidarians that have evolved a number of distinguishing features. Some cubozoan species possess complex mating behaviors or particularly potent stings, and all possess well-developed light sensation involving image-forming eyes. Like all cnidarians, cubozoans have specialized subcellular structures called nematocysts that are used in prey capture and defense. The objective of this study is to contribute to the development of the box jellyfish Alatina alata as a model cnidarian. This cubozoan species offers numerous advantages for investigating morphological and molecular traits underlying complex processes and coordinated behavior in free-living medusozoans (i.e., jellyfish), and more broadly throughout Metazoa. First, I provide an overview of Cnidaria with an emphasis on the current understanding of genes and proteins implicated in complex biological processes in a few select cnidarians. Second, to further develop resources for A. alata, I provide a formal redescription of this cubozoan and establish a neotype specimen voucher, which serve to stabilize the taxonomy of the species. Third, I generate the first functionally annotated transcriptome of adult and larval A. alata tissue and apply preliminary differential expression analyses to identify candidate genes implicated broadly in biological processes related to prey capture and defense, vision and the phototransduction pathway and sexual reproduction and gametogenesis. Fourth, to better understand venom diversity and mechanisms controlling venom synthesis in A. alata, I use bioinformatics to investigate gene candidates with dual roles in venom and digestion, and review the biology of prey capture and digestion in cubozoans. The morphological and molecular resources presented herein contribute to understanding the evolution of cubozoan characteristics and serve to facilitate further research on this emerging cubozoan model.
Resumo:
We report the first quantitative and qualitative analysis of the poly (A)(+) transcriptome of two human mammary cell lines, differentially expressing (human epidermal growth factor receptor) an oncogene over-expressed in approximately 25% of human breast tumors. Full-length cDNA populations from the two cell lines were digested enzymatically, individually tagged according to a customized method for library construction, and simultaneously sequenced by the use of the Titanium 454-Roche-platform. Comprehensive bioinformatics analysis followed by experimental validation confirmed novel genes, splicing variants, single nucleotide polymorphisms, and gene fusions indicated by RNA-seq data from both samples. Moreover, comparative analysis showed enrichment in alternative events, especially in the exon usage category, in ERBB2 over-expressing cells, data indicating regulation of alternative splicing mediated by the oncogene. Alterations in expression levels of genes, such as LOX, ATP5L, GALNT3, and MME revealed by large-scale sequencing were confirmed between cell lines as well as in tumor specimens with different ERBB2 backgrounds. This approach was shown to be suitable for structural, quantitative, and qualitative assessment of complex transcriptomes and revealed new events mediated by ERBB2 overexpression, in addition to potential molecular targets for breast cancer that are driven by this oncogene.
Resumo:
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
Resumo:
Since the advent of high-throughput DNA sequencing technologies, the ever-increasing rate at which genomes have been published has generated new challenges notably at the level of genome annotation. Even if gene predictors and annotation softwares are more and more efficient, the ultimate validation is still in the observation of predicted gene product( s). Mass-spectrometry based proteomics provides the necessary high throughput technology to show evidences of protein presence and, from the identified sequences, confirmation or invalidation of predicted annotations. We review here different strategies used to perform a MS-based proteogenomics experiment with a bottom-up approach. We start from the strengths and weaknesses of the different database construction strategies, based on different genomic information (whole genome, ORF, cDNA, EST or RNA-Seq data), which are then used for matching mass spectra to peptides and proteins. We also review the important points to be considered for a correct statistical assessment of the peptide identifications. Finally, we provide references for tools used to map and visualize the peptide identifications back to the original genomic information.