77 resultados para Genome Annotation Assessment
em Université de Lausanne, Switzerland
Resumo:
BACKGROUND: We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment. RESULTS: The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified. CONCLUSION: This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
Resumo:
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
Resumo:
A recurring task in the analysis of mass genome annotation data from high-throughput technologies is the identification of peaks or clusters in a noisy signal profile. Examples of such applications are the definition of promoters on the basis of transcription start site profiles, the mapping of transcription factor binding sites based on ChIP-chip data and the identification of quantitative trait loci (QTL) from whole genome SNP profiles. Input to such an analysis is a set of genome coordinates associated with counts or intensities. The output consists of a discrete number of peaks with respective volumes, extensions and center positions. We have developed for this purpose a flexible one-dimensional clustering tool, called MADAP, which we make available as a web server and as standalone program. A set of parameters enables the user to customize the procedure to a specific problem. The web server, which returns results in textual and graphical form, is useful for small to medium-scale applications, as well as for evaluation and parameter tuning in view of large-scale applications, requiring a local installation. The program written in C++ can be freely downloaded from ftp://ftp.epd.unil.ch/pub/software/unix/madap. The MADAP web server can be accessed at http://www.isrec.isb-sib.ch/madap/.
Resumo:
Matrix attachment regions are DNA sequences found throughout eukaryotic genomes that are believed to define boundaries interfacing heterochromatin and euchromatin domains, thereby acting as epigenetic regulators. When included in expression vectors, MARs can improve and sustain transgene expression, and a search for more potent novel elements is therefore actively pursued to further improve recombinant protein production. Here we describe the isolation of new MARs from the mouse genome using a modified in silico analysis. One of these MARs was found to be a powerful activator of transgene expression in stable transfections. Interestingly, this MAR also increased GFP and/or immunoglobulin expression from some but not all expression vectors in transient transfections. This effect was attributed to the presence or absence of elements on the vector backbone, providing an explanation for earlier discrepancies as to the ability of this class of elements to affect transgene expression under such conditions.
Resumo:
Dramatic improvements in DNA sequencing technologies have led to amore than 1,000-fold reduction in sequencing costs over the past five years.Genome-wide research approaches can thus now be applied beyond medicallyrelevant questions to examine the molecular-genetic basis of behavior,development and unique life histories in almost any organism. A first step foran emerging model organism is usually establishing a reference genomesequence. I offer insight gained from the fire ant genome project. First, I detailhow the project came to be and how sequencing, assembly and annotationstrategies were chosen. Subsequently, I describe some of the issues linked toworking with data from recently sequenced genomes. Finally, I discuss anapproach undertaken in a follow-up project based on the fire ant genomesequence.
Resumo:
BACKGROUND: Despite the continuous production of genome sequence for a number of organisms, reliable, comprehensive, and cost effective gene prediction remains problematic. This is particularly true for genomes for which there is not a large collection of known gene sequences, such as the recently published chicken genome. We used the chicken sequence to test comparative and homology-based gene-finding methods followed by experimental validation as an effective genome annotation method. RESULTS: We performed experimental evaluation by RT-PCR of three different computational gene finders, Ensembl, SGP2 and TWINSCAN, applied to the chicken genome. A Venn diagram was computed and each component of it was evaluated. The results showed that de novo comparative methods can identify up to about 700 chicken genes with no previous evidence of expression, and can correctly extend about 40% of homology-based predictions at the 5' end. CONCLUSIONS: De novo comparative gene prediction followed by experimental verification is effective at enhancing the annotation of the newly sequenced genomes provided by standard homology-based methods.
Resumo:
BACKGROUND: The Complete Arabidopsis Transcript MicroArray (CATMA) initiative combines the efforts of laboratories in eight European countries 1 to deliver gene-specific sequence tags (GSTs) for the Arabidopsis research community. The CATMA initiative offers the power and flexibility to regularly update the GST collection according to evolving knowledge about the gene repertoire. These GST amplicons can easily be reamplified and shared, subsets can be picked at will to print dedicated arrays, and the GSTs can be cloned and used for other functional studies. This ongoing initiative has already produced approximately 24,000 GSTs that have been made publicly available for spotted microarray printing and RNA interference. RESULTS: GSTs from the CATMA version 2 repertoire (CATMAv2, created in 2002) were mapped onto the gene models from two independent Arabidopsis nuclear genome annotation efforts, TIGR5 and PSB-EuGène, to consolidate a list of genes that were targeted by previously designed CATMA tags. A total of 9,027 gene models were not tagged by any amplified CATMAv2 GST, and 2,533 amplified GSTs were no longer predicted to tag an updated gene model. To validate the efficacy of GST mapping criteria and design rules, the predicted and experimentally observed hybridization characteristics associated to GST features were correlated in transcript profiling datasets obtained with the CATMAv2 microarray, confirming the reliability of this platform. To complete the CATMA repertoire, all 9,027 gene models for which no GST had yet been designed were processed with an adjusted version of the Specific Primer and Amplicon Design Software (SPADS). A total of 5,756 novel GSTs were designed and amplified by PCR from genomic DNA. Together with the pre-existing GST collection, this new addition constitutes the CATMAv3 repertoire. It comprises 30,343 unique amplified sequences that tag 24,202 and 23,009 protein-encoding nuclear gene models in the TAIR6 and EuGène genome annotations, respectively. To cover the remaining untagged genes, we identified 543 additional GSTs using less stringent design criteria and designed 990 sequence tags matching multiple members of gene families (Gene Family Tags or GFTs) to cover any remaining untagged genes. These latter 1,533 features constitute the CATMAv4 addition. CONCLUSION: To update the CATMA GST repertoire, we designed 7,289 additional sequence tags, bringing the total number of tagged TAIR6-annotated Arabidopsis nuclear protein-coding genes to 26,173. This resource is used both for the production of spotted microarrays and the large-scale cloning of hairpin RNA silencing vectors. All information about the resulting updated CATMA repertoire is available through the CATMA database http://www.catma.org.
Resumo:
OBJECTIVES: Co-morbidity between depression and anxiety disorders is common. In this study we define a quantitative measure of anxiety by summating four anxiety items from the SCAN interview in a large collection of major depression (MDD) cases to identify genes contributing to this complex phenotype. METHODS: A total of 1522 MDD cases dichotomised according to those with at least one anxiety item scored (n = 1080) and those without anxiety (n = 442) were analysed, and also compared to 1588 healthy controls at a genome-wide level, to identify genes that may contribute to anxiety in MDD. RESULTS: For the quantitative trait, suggestive evidence of association was detected for two SNPs, and for the dichotomous anxiety present/absent ratings for three SNPs at genome-wide level. In the genome-wide analysis of MDD cases with co-morbid anxiety and healthy controls, two SNPs attained P values of < 5 × 10⁻⁶. Analysing candidate genes, P values ≤ 0.0005 were found with three SNPs for the quantitative trait and three SNPs for the dichotomous trait. CONCLUSIONS: This study provides an initial genome-wide assessment of possible genetic contribution to anxiety in MDD. Although suggestive evidence of association was found for several SNPs, our findings suggest that there are no common variants strongly associated with anxious depression.
Resumo:
Since the advent of high-throughput DNA sequencing technologies, the ever-increasing rate at which genomes have been published has generated new challenges notably at the level of genome annotation. Even if gene predictors and annotation softwares are more and more efficient, the ultimate validation is still in the observation of predicted gene product( s). Mass-spectrometry based proteomics provides the necessary high throughput technology to show evidences of protein presence and, from the identified sequences, confirmation or invalidation of predicted annotations. We review here different strategies used to perform a MS-based proteogenomics experiment with a bottom-up approach. We start from the strengths and weaknesses of the different database construction strategies, based on different genomic information (whole genome, ORF, cDNA, EST or RNA-Seq data), which are then used for matching mass spectra to peptides and proteins. We also review the important points to be considered for a correct statistical assessment of the peptide identifications. Finally, we provide references for tools used to map and visualize the peptide identifications back to the original genomic information.
Resumo:
Pseudomonas knackmussii B13 was the first strain to be isolated in 1974 that could degrade chlorinated aromatic hydrocarbons. This discovery was the prologue for subsequent characterization of numerous bacterial metabolic pathways, for genetic and biochemical studies, and which spurred ideas for pollutant bioremediation. In this study, we determined the complete genome sequence of B13 using next generation sequencing technologies and optical mapping. Genome annotation indicated that B13 has a variety of metabolic pathways for degrading monoaromatic hydrocarbons including chlorobenzoate, aminophenol, anthranilate and hydroxyquinol, but not polyaromatic compounds. Comparative genome analysis revealed that B13 is closest to Pseudomonas denitrificans and Pseudomonas aeruginosa. The B13 genome contains at least eight genomic islands [prophages and integrative conjugative elements (ICEs)], which were absent in closely related pseudomonads. We confirm that two ICEs are identical copies of the 103 kb self-transmissible element ICEclc that carries the genes for chlorocatechol metabolism. Comparison of ICEclc showed that it is composed of a variable and a 'core' region, which is very conserved among proteobacterial genomes, suggesting a widely distributed family of so far uncharacterized ICE. Resequencing of two spontaneous B13 mutants revealed a number of single nucleotide substitutions, as well as excision of a large 220 kb region and a prophage that drastically change the host metabolic capacity and survivability.
Resumo:
Pseudomonas knackmussii B13 was the first strain to be isolated in 1974 that could degrade chlorinated aromatic hydrocarbons. This discovery was the prologue for subsequent characterization of numerous bacterial metabolic pathways, for genetic and biochemical studies, and which spurred ideas for pollutant bioremediation. In this study, we determined the complete genome sequence of B13 using next generation sequencing technologies and optical mapping. Genome annotation indicated that B13 has a variety of metabolic pathways for degrading monoaromatic hydrocarbons including chlorobenzoate, aminophenol, anthranilate and hydroxyquinol, but not polyaromatic compounds. Comparative genome analysis revealed that B13 is closest to Pseudomonas denitrificans and Pseudomonas aeruginosa. The B13 genome contains at least eight genomic islands [prophages and integrative conjugative elements (ICEs)], which were absent in closely related pseudomonads. We confirm that two ICEs are identical copies of the 103 kb self-transmissible element ICEclc that carries the genes for chlorocatechol metabolism. Comparison of ICEclc showed that it is composed of a variable and a 'core' region, which is very conserved among proteobacterial genomes, suggesting a widely distributed family of so far uncharacterized ICE. Resequencing of two spontaneous B13 mutants revealed a number of single nucleotide substitutions, as well as excision of a large 220 kb region and a prophage that drastically change the host metabolic capacity and survivability.
Resumo:
Recent technological progress has greatly facilitated de novo genome sequencing. However, de novo assemblies consist in many pieces of contiguous sequence (contigs) arranged in thousands of scaffolds instead of small numbers of chromosomes. Confirming and improving the quality of such assemblies is critical for subsequent analysis. We present a method to evaluate genome scaffolding by aligning independently obtained transcriptome sequences to the genome and visually summarizing the alignments using the Cytoscape software. Applying this method to the genome of the red fire ant Solenopsis invicta allowed us to identify inconsistencies in 7%, confirm contig order in 20% and extend 16% of scaffolds.Scripts that generate tables for visualization in Cytoscape from FASTA sequence and scaffolding information files are publicly available at https://github.com/ksanao/TGNet.
Resumo:
Functional RNA structures play an important role both in the context of noncoding RNA transcripts as well as regulatory elements in mRNAs. Here we present a computational study to detect functional RNA structures within the ENCODE regions of the human genome. Since structural RNAs in general lack characteristic signals in primary sequence, comparative approaches evaluating evolutionary conservation of structures are most promising. We have used three recently introduced programs based on either phylogenetic-stochastic context-free grammar (EvoFold) or energy directed folding (RNAz and AlifoldZ), yielding several thousand candidate structures (corresponding to approximately 2.7% of the ENCODE regions). EvoFold has its highest sensitivity in highly conserved and relatively AU-rich regions, while RNAz favors slightly GC-rich regions, resulting in a relatively small overlap between methods. Comparison with the GENCODE annotation points to functional RNAs in all genomic contexts, with a slightly increased density in 3'-UTRs. While we estimate a significant false discovery rate of approximately 50%-70% many of the predictions can be further substantiated by additional criteria: 248 loci are predicted by both RNAz and EvoFold, and an additional 239 RNAz or EvoFold predictions are supported by the (more stringent) AlifoldZ algorithm. Five hundred seventy RNAz structure predictions fall into regions that show signs of selection pressure also on the sequence level (i.e., conserved elements). More than 700 predictions overlap with noncoding transcripts detected by oligonucleotide tiling arrays. One hundred seventy-five selected candidates were tested by RT-PCR in six tissues, and expression could be verified in 43 cases (24.6%).
Resumo:
2 Abstract2.1 En françaisLe séquençage du génome humain est un pré-requis fondamental à la compréhension de la biologie de l'être humain. Ce projet achevé, les scientifiques ont dû faire face à une tâche aussi importante, comprendre cette suite de 3 milliards de lettres qui compose notre génome. Le consortium ENCODE (ENCyclopedia Of Dna Elements) fût formé comme une suite logique au projet du génome humain. Son rôle est d'identifier tous les éléments fonctionnels de notre génome incluant les régions transcrites, les sites d'attachement des facteurs de transcription, les sites hypersensibles à la DNAse I ainsi que les marqueurs de modification des histones. Dans le cadre de ma thèse doctorale, j'ai participé à 2 sous-projets d'ENCODE. En premier lieu, j'ai eu la tâche de développer et d'optimiser une technique de validation expérimentale à haut rendement de modèles de gènes qui m'a permis d'estimer la qualité de la plus récente annotation manuelle. Ce nouveau processus de validation est bien plus efficace que la technique RNAseq qui est actuellement en train de devenir la norme. Cette technique basée sur la RT-PCR, m'a notamment permis de découvrir de nouveaux exons dans 10% des régions interrogées. En second lieu j'ai participé à une étude ayant pour but d'identifier les extrémités de tous les gènes des chromosomes humains 21 et 22. Cette étude à permis l'identification à large échelle de transcrits chimères comportant des séquences provenant de deux gènes distincts pouvant être à une grande distance l'un de autre.2.2 In EnglishThe completion of the human genome sequence js the prerequisite to fully understand the biology of human beings. This project achieved, scientists had to face another challenging task, understanding the meaning of the 3 billion letters composing this genome. As a logical continuation of the human genome project, the ENCODE (ENCyclopedia Of DNA Elements) consortium was formed with the aim of annotating all its functional elements. These elements include transcribed regions, transcription binding sites, DNAse I hypersensitive sites and histone modification marks. In the frame of my PhD thesis, I was involved in two sub-projects of ENCODE. Firstly I developed and optimized an high throughput method to validate gene models, which allowed me to assess the quality of the most recent manually-curated annotation. This novel experimental validation pipeline is extremely effective, far more so than transcriptome profiling through RNA sequencing, which is becoming the norm. This RT-PCR-seq targeted-approach is likewise particularly efficient in identifying novel exons, as we discovered about 10% of loci with unannotated exons. Secondly, I participated to a study aiming to identify the gene boundaries of all genes in the human chromosome 21 and 22. This study led to the identification of chimeric transcripts that are composed of sequences coming form two distinct genes that can be map far away from each other.