204 results for license
Abstract:
GeneID is a program to predict genes in anonymous genomic sequences, designed with a hierarchical structure. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov model for coding DNA. In the last step, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons. In this paper we describe how the PWMs for sites and the Markov model of coding DNA were obtained for Drosophila melanogaster. We also compare other models of coding DNA with the Markov model. Finally, we present and discuss the results obtained when GeneID is used to predict genes in the Adh region. These results show that the accuracy of GeneID predictions is comparable to that of other existing tools, but that GeneID is likely to be more efficient in terms of speed and memory usage.
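The site and exon scoring described in this abstract can be illustrated with a minimal sketch. It is not the GeneID implementation: the function names, the uniform background, the pseudocount, and the toy donor sites are all assumptions made for illustration, and the coding log-likelihood ratio is passed in as a precomputed number instead of being derived from a Markov model.

```python
import math

BASES = "ACGT"

def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log-odds PWM from aligned, equal-length example sites.

    Assumption: a uniform background and a fixed pseudocount; real
    training would estimate the background from genomic sequence.
    """
    background = background or {b: 0.25 for b in BASES}
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log((counts[b] / total) / background[b])
                    for b in BASES})
    return pwm

def score_site(pwm, window):
    """Sum the per-position log-odds of one candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))

def scan(pwm, sequence):
    """Score every window of the sequence; returns (offset, score) pairs."""
    width = len(pwm)
    return [(i, score_site(pwm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

def exon_score(start_site_score, end_site_score, coding_llr):
    """Exon score = scores of the two defining sites + coding log-likelihood ratio."""
    return start_site_score + end_site_score + coding_llr

# Toy usage with hypothetical donor-like sites.
toy_pwm = build_pwm(["GTAAGT", "GTGAGT", "GTAAGA"])
best_offset, best_score = max(scan(toy_pwm, "CCGTAAGTCC"), key=lambda h: h[1])
print(best_offset, round(best_score, 3))
```

The final assembly step, which chains compatible exons so that the summed exon score is maximal, is omitted here; it would typically be a dynamic program over exons sorted by position.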
Abstract:
The completion of the sequencing of the mouse genome promises to help predict human genes with greater accuracy. While current ab initio gene prediction programs are remarkably sensitive (i.e., they predict at least a fragment of most genes), their specificity is often low, predicting a large number of false-positive genes in the human genome. Sequence conservation at the protein level with the mouse genome can help eliminate some of those false positives. Here we describe SGP2, a gene prediction program that combines ab initio gene prediction with TBLASTX searches between two genome sequences to provide both sensitive and specific gene predictions. The accuracy of SGP2 when used to predict genes by comparing the human and mouse genomes is assessed on a number of data sets, including single-gene data sets, the highly curated human chromosome 22 predictions, and entire genome predictions from ENSEMBL. Results indicate that SGP2 outperforms purely ab initio gene prediction methods. Results also indicate that SGP2 works about as well with 3x shotgun data as it does with fully assembled genomes. SGP2 provides a high enough specificity that its predictions can be experimentally verified at a reasonable cost. SGP2 was used to generate a complete set of gene predictions on both the human and mouse by comparing the genomes of these two species. Our results suggest that another few thousand human and mouse genes currently not in ENSEMBL are worth verifying experimentally.
Abstract:
The “one-gene, one-protein” rule, coined by Beadle and Tatum, has been fundamental to molecular biology. The rule implies that the genetic complexity of an organism depends essentially on its gene number. The discovery, however, that alternative gene splicing and transcription are widespread phenomena dramatically altered our understanding of the genetic complexity of higher eukaryotic organisms; in these, a limited number of genes may potentially encode a much larger number of proteins. Here we investigate yet another phenomenon that may contribute to generate additional protein diversity. Indeed, by relying on both computational and experimental analysis, we estimate that at least 4%–5% of the tandem gene pairs in the human genome can be eventually transcribed into a single RNA sequence encoding a putative chimeric protein. While the functional significance of most of these chimeric transcripts remains to be determined, we provide strong evidence that this phenomenon does not correspond to mere technical artifacts and that it is a common mechanism with the potential of generating hundreds of additional proteins in the human genome.
Abstract:
Background: Asparagine N-Glycosylation is one of the most important forms of protein post-translational modification in eukaryotes. This metabolic pathway can be subdivided into two parts: an upstream sub-pathway required for achieving proper folding for most of the proteins synthesized in the secretory pathway, and a downstream sub-pathway required to give variability to trans-membrane proteins, and involved in adaptation to the environment and innate immunity. Here we analyze the nucleotide variability of the genes of this pathway in human populations, identifying which genes show greater population differentiation and which genes show signatures of recent positive selection. We also compare how these signals are distributed between the upstream and the downstream parts of the pathway, with the aim of exploring how forces of population differentiation and positive selection vary among genes involved in the same metabolic pathway but subject to different functional constraints. Results: Our results show that genes in the downstream part of the pathway are more likely to show a signature of population differentiation, while events of positive selection are equally distributed among the two parts of the pathway. Moreover, events of positive selection are frequent on genes that are known to be at bifurcation points, and that are identified as being in key positions by a network-level analysis, such as MGAT3 and GCS1. Conclusions: These findings indicate that the upstream part of the Asparagine N-Glycosylation pathway has lower diversity among populations, while the downstream part is freer to tolerate diversity among populations. Moreover, the distribution of signatures of population differentiation and positive selection can change between parts of a pathway, especially between parts that are exposed to different functional constraints.
Our results support the hypothesis that genes involved in constitutive processes can be expected to show lower population differentiation, while genes involved in traits related to the environment should show higher variability. Taken together, this work broadens our knowledge on how events of population differentiation and of positive selection are distributed among different parts of a metabolic pathway.
Abstract:
Understanding the molecular mechanisms responsible for the regulation of the transcriptome present in eukaryotic cells is one of the most challenging tasks in the postgenomic era. In this regard, alternative splicing (AS) is a key phenomenon contributing to the production of different mature transcripts from the same primary RNA sequence. As a plethora of different transcript forms is available in databases, a first step to uncover the biology that drives AS is to identify the different types of reflected splicing variation. In this work, we present a general definition of the AS event along with a notation system that involves the relative positions of the splice sites. This nomenclature univocally and dynamically assigns a specific "AS code" to every possible pattern of splicing variation. On the basis of this definition and the corresponding codes, we have developed a computational tool (AStalavista) that automatically characterizes the complete landscape of AS events in a given transcript annotation of a genome, thus providing a platform to investigate the transcriptome diversity across genes, chromosomes, and species. Our analysis reveals that a substantial part (in human, more than a quarter) of the observed splicing variations are ignored in common classification pipelines. We have used AStalavista to investigate and to compare the AS landscape of different reference annotation sets in human and in other metazoan species and found that proportions of AS events change substantially depending on the annotation protocol, species-specific attributes, and coding constraints acting on the transcripts. The AStalavista system therefore provides a general framework to conduct specific studies investigating the occurrence, impact, and regulation of AS.
Abstract:
Background: The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results. Results: The GENCODE gene features are divided into eight different categories of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Out of 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions. Conclusions: In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of the GENCODE annotation with RefSeq and ENSEMBL shows that only 40% of GENCODE exons are contained within the two sets, which is a reflection of the high number of alternative splice forms with unique exons annotated. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP, and the GENCODE collaboration is continuing to refine its annotation of 1% of the human genome with the aid of experimental validation.
Abstract:
Background: The analysis of the promoter sequence of genes with similar expression patterns is a basic tool to annotate common regulatory elements. Multiple sequence alignments are the basis of most comparative approaches. The characterization of regulatory regions from co-expressed genes at the sequence level, however, does not yield satisfactory results on many occasions, as promoter regions of genes sharing similar expression programs often do not show nucleotide sequence conservation. Results: In a recent approach to circumvent this limitation, we proposed to align the maps of predicted transcription factors (referred to as TF-maps) instead of the nucleotide sequence of two related promoters, taking into account the label of the corresponding factor and the position in the primary sequence. We have now extended the basic algorithm to permit multiple promoter comparisons using the progressive alignment paradigm. In addition, non-collinear conservation blocks might now be identified in the resulting alignments. We have optimized the parameters of the algorithm in a small, but well-characterized collection of human-mouse-chicken-zebrafish orthologous gene promoters. Conclusion: Results in this dataset indicate that TF-map alignments are able to detect high-level regulatory conservation at the promoter and the 3'UTR gene regions, which cannot be detected by the typical sequence alignments. Three particular examples are introduced here to illustrate the power of the multiple TF-map alignments to characterize conserved regulatory elements in the absence of sequence similarity. We consider that this kind of approach can be extremely useful in the future to annotate potential transcription factor binding sites on sets of co-regulated genes from high-throughput expression experiments.
Abstract:
The construction of metagenomic libraries has permitted the study of microorganisms resistant to isolation, and the analysis of 16S rDNA sequences has been used for over two decades to examine bacterial biodiversity. Here, we show that the analysis of random sequence reads (RSRs) instead of 16S is a suitable shortcut to estimate the biodiversity of a bacterial community from metagenomic libraries. We generated 10,010 RSRs from a metagenomic library of microorganisms found in human faecal samples and then searched them with the program BLASTN against a prokaryotic sequence database to assign a taxon to each RSR. The results were compared with those obtained by screening and analysing the clones containing 16S rDNA sequences in the whole library. We found that the biodiversity observed by RSR analysis is consistent with that obtained by 16S rDNA. We also show that RSRs are suitable to compare the biodiversity between different metagenomic libraries. RSRs can thus provide a good estimate of the biodiversity of a metagenomic library and, as an alternative to 16S, this approach is both faster and cheaper.
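The taxon assignment step in this abstract can be sketched roughly as follows. This is an illustrative assumption, not the authors' pipeline: it supposes the BLASTN results have already been parsed into hypothetical `(read_id, taxon, bit_score)` tuples, keeps the top-scoring taxon per read, and summarizes the library as relative taxon abundances.

```python
from collections import Counter

def best_hits(hits):
    """Keep the highest-scoring taxon for each read.

    hits: iterable of (read_id, taxon, bit_score) tuples; this layout is
    a hypothetical pre-parsed form of the search output.
    """
    best = {}
    for read_id, taxon, score in hits:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (taxon, score)
    return {read_id: taxon for read_id, (taxon, _) in best.items()}

def taxon_profile(assignments):
    """Relative abundance of each taxon among the assigned reads."""
    counts = Counter(assignments.values())
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

# Toy usage with made-up reads and scores.
toy_hits = [
    ("r1", "Bacteroides", 50.0),
    ("r1", "Clostridium", 80.0),
    ("r2", "Bacteroides", 60.0),
    ("r3", "Clostridium", 40.0),
    ("r4", "Clostridium", 70.0),
]
print(taxon_profile(best_hits(toy_hits)))
```

A profile built this way from RSRs could then be set side by side with one built from 16S rDNA clones, or with the profile of another library.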
Abstract:
Selenoproteins are a diverse group of proteins usually misidentified and misannotated in sequence databases. The presence of an in-frame UGA (stop) codon in the coding sequence of selenoprotein genes precludes their identification and correct annotation. The in-frame UGA codons are recoded to cotranslationally incorporate selenocysteine, a rare selenium-containing amino acid. The development of ad hoc experimental and, more recently, computational approaches has allowed the efficient identification and characterization of the selenoproteomes of a growing number of species. Today, dozens of selenoprotein families have been described and more are being discovered in recently sequenced species, but the correct genomic annotation is not available for the majority of these genes. SelenoDB is a long-term project that aims to provide, through the collaborative effort of experimental and computational researchers, automatic and manually curated annotations of selenoprotein genes, proteins and SECIS elements. Version 1.0 of the database includes an initial set of eukaryotic genomic annotations, with special emphasis on the human selenoproteome, for immediate inspection by selenium researchers or incorporation into more general databases. SelenoDB is freely available at http://www.selenodb.org.
Abstract:
Selenoproteins contain the amino acid selenocysteine which is encoded by a UGA Sec codon. Recoding UGA Sec requires a complex mechanism, comprising the cis-acting SECIS RNA hairpin in the 3′UTR of selenoprotein mRNAs, and trans-acting factors. Among these, the SECIS Binding Protein 2 (SBP2) is central to the mechanism. SBP2 has been so far functionally characterized only in rats and humans. In this work, we report the characterization of the Drosophila melanogaster SBP2 (dSBP2). Despite its shorter length, it retained the same selenoprotein synthesis-promoting capabilities as the mammalian counterpart. However, a major difference resides in the SECIS recognition pattern: while human SBP2 (hSBP2) binds the distinct form 1 and 2 SECIS RNAs with similar affinities, dSBP2 exhibits high affinity toward form 2 only. In addition, we report the identification of a K (lysine)-rich domain in all SBP2s, essential for SECIS and 60S ribosomal subunit binding, differing from the well-characterized L7Ae RNA-binding domain. Swapping only five amino acids between dSBP2 and hSBP2 in the K-rich domain conferred reversed SECIS-binding properties to the proteins, thus unveiling an important sequence for form 1 binding.
Abstract:
Background: Despite the continuous production of genome sequence for a number of organisms, reliable, comprehensive, and cost-effective gene prediction remains problematic. This is particularly true for genomes for which there is not a large collection of known gene sequences, such as the recently published chicken genome. We used the chicken sequence to test comparative and homology-based gene-finding methods followed by experimental validation as an effective genome annotation method. Results: We performed experimental evaluation by RT-PCR of three different computational gene finders, Ensembl, SGP2 and TWINSCAN, applied to the chicken genome. A Venn diagram was computed and each component of it was evaluated. The results showed that de novo comparative methods can identify up to about 700 chicken genes with no previous evidence of expression, and can correctly extend about 40% of homology-based predictions at the 5' end. Conclusions: De novo comparative gene prediction followed by experimental verification is effective at enhancing the annotation of the newly sequenced genomes provided by standard homology-based methods.
Abstract:
BACKGROUND: The trithorax group (trxG) and Polycomb group (PcG) proteins are responsible for the maintenance of stable transcriptional patterns of many developmental regulators. They bind to specific regions of DNA and direct the post-translational modifications of histones, playing a role in the dynamics of chromatin structure. RESULTS: We have performed genome-wide expression studies of trx and ash2 mutants in Drosophila melanogaster. Using computational analysis of our microarray data, we have identified 25 clusters of genes potentially regulated by TRX. Most of these clusters consist of genes that encode structural proteins involved in cuticle formation. This organization appears to be a distinctive feature of the regulatory networks of TRX and other chromatin regulators, since we have observed the same arrangement in clusters after experiments performed with ASH2, as well as in experiments performed by others with NURF, dMyc, and ASH1. We have also found many of these clusters to be significantly conserved in D. simulans, D. yakuba, D. pseudoobscura and partially in Anopheles gambiae. CONCLUSION: The analysis of genes governed by chromatin regulators has led to the identification of clusters of functionally related genes conserved in other insect species, suggesting this chromosomal organization is biologically important. Moreover, our results indicate that TRX and other chromatin regulators may act globally on chromatin domains that contain transcriptionally co-regulated genes.
Abstract:
We address the problem of comparing and characterizing the promoter regions of genes with similar expression patterns. This remains a challenging problem in sequence analysis, because often the promoter regions of co-expressed genes do not show discernible sequence conservation. In our approach, thus, we have not directly compared the nucleotide sequence of promoters. Instead, we have obtained predictions of transcription factor binding sites, annotated the predicted sites with the labels of the corresponding binding factors, and aligned the resulting sequences of labels—to which we refer here as transcription factor maps (TF-maps). To obtain the global pairwise alignment of two TF-maps, we have adapted an algorithm initially developed to align restriction enzyme maps. We have optimized the parameters of the algorithm in a small, but well-curated, collection of human–mouse orthologous gene pairs. Results in this dataset, as well as in an independent much larger dataset from the CISRED database, indicate that TF-map alignments are able to uncover conserved regulatory elements, which cannot be detected by the typical sequence alignments.
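The idea of aligning label sequences instead of nucleotides can be illustrated with a simplified sketch. This is a plain Needleman-Wunsch-style global alignment over factor labels, swapped in for illustration; it is not the restriction-map algorithm the authors adapted, which also uses site positions. The `match` and `gap` values, the function name, and the example maps are assumptions.

```python
def align_tf_maps(map_a, map_b, match=2.0, gap=-1.0):
    """Globally align two TF-maps by factor label.

    map_a, map_b: lists of (label, position) sorted by position.
    Only sites with identical labels may be paired; unpaired sites
    cost `gap`. Positions are carried along but not scored here.
    Returns (score, pairs), where pairs are (index_a, index_b).
    """
    n, m = len(map_a), len(map_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(score[i - 1][j] + gap, score[i][j - 1] + gap)
            if map_a[i - 1][0] == map_b[j - 1][0]:
                best = max(best, score[i - 1][j - 1] + match)
            score[i][j] = best

    # Trace back from the bottom-right cell to recover the paired sites.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        same_label = map_a[i - 1][0] == map_b[j - 1][0]
        if same_label and score[i][j] == score[i - 1][j - 1] + match:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs

# Toy usage with hypothetical predicted sites for two orthologous promoters.
human = [("SP1", 10), ("TATA", 40), ("CREB", 80)]
mouse = [("SP1", 5), ("GATA", 20), ("TATA", 35), ("CREB", 90)]
print(align_tf_maps(human, mouse))
```

Because matching depends only on labels and order, two promoters can align well here even when their nucleotide sequences share no detectable similarity, which is the point the abstract makes.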
Abstract:
A large proportion of the death toll associated with malaria is a consequence of malaria infection during pregnancy, causing up to 200,000 infant deaths annually. We previously published the first extensive genetic association study of placental malaria infection, and here we extend this analysis considerably, investigating genetic variation in over 9,000 SNPs in more than 1,000 genes involved in immunity and inflammation for their involvement in susceptibility to placental malaria infection. We applied a new approach incorporating results from both single gene analysis and gene-gene interactions on a protein-protein interaction network. We found suggestive associations of variants in the gene KLRK1 in the single gene analysis, as well as evidence for associations of multiple members of the IL-7/IL-7R signalling cascade in the combined analysis. To our knowledge, this is the first large-scale genetic study on placental malaria infection to date, opening the door for follow-up studies trying to elucidate the genetic basis of this neglected form of malaria.
Abstract:
Background: An excess of caffeine is cytotoxic to all eukaryotic cell types. We aim to study how cells become tolerant to a toxic dose of this drug, and the relationship between caffeine and oxidative stress pathways. Methodology/Principal Findings: We searched for Schizosaccharomyces pombe mutants with inhibited growth on caffeine-containing plates. We screened a collection of 2,700 haploid mutant cells, of which 98 were sensitive to caffeine. The genes mutated in these sensitive clones were involved in a number of cellular roles including the H2O2-induced Pap1 and Sty1 stress pathways, the integrity and calcineurin pathways, cell morphology and chromatin remodeling. We have investigated the role of the oxidative stress pathways in sensing and promoting survival to caffeine. The Pap1 and the Sty1 pathways are both required for normal tolerance to caffeine, but only the Sty1 pathway is activated by the drug. Cells lacking Pap1 are sensitive to caffeine due to the decreased expression of the efflux pump Hba2. Indeed, Δhba2 cells are sensitive to caffeine, and constitutive activation of the Pap1 pathway enhances resistance to caffeine in an Hba2-dependent manner. Conclusions/Significance: With our caffeine-sensitive, genome-wide screen of an S. pombe deletion collection, we have demonstrated the importance of some oxidative stress pathway components on wild-type tolerance to the drug.