18 resultados para Nucleòtids
Resumo:
One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.
Resumo:
Annotation of protein-coding genes is a key goal of genome sequencing projects. In spite of tremendous recent advances in computational gene finding, comprehensive annotation remains a challenge. Peptide mass spectrometry is a powerful tool for researching the dynamic proteome and suggests an attractive approach to discover and validate protein-coding genes. We present algorithms to construct and efficiently search spectra against a genomic database, with no prior knowledge of encoded proteins. By searching a corpus of 18.5 million tandem mass spectra (MS/MS) from human proteomic samples, we validate 39,000 exons and 11,000 introns at the level of translation. We present translation-level evidence for novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins, and discover or confirm over 40 alternative splicing events. Polymorphisms are efficiently encoded in our database, allowing us to observe variant alleles for 308 coding SNPs. Finally, we demonstrate the use of mass spectrometry to improve automated gene prediction, adding 800 correct exons to our predictions using a simple rescoring strategy. Our results demonstrate that proteomic profiling should play a role in any genome sequencing project.
Resumo:
GeneID is a program to predict genes in anonymous genomic sequences designed with a hierarchical structure. In the first step, splice sites, and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov model for coding DNA. In the last step, from the set of predicted exons, the gene structure is assembled, maximizing the sum of the scores of the assembled exons. In this paper we describe the obtention of PWMs for sites, and the Markov model of coding DNA in Drosophila melanogaster. We also compare other models of coding DNA with the Markov model. Finally, we present and discuss the results obtained when GeneID is used to predict genes in the Adh region. These results show that the accuracy of GeneID predictions compares currently with that of other existing tools but that GeneID is likely to be more efficient in terms of speed and memory usage.
Resumo:
In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from such a pool as sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm here relies on the simple fact that such highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure model. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model specifies simply which gene features are allowed immediately upstream which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.
Resumo:
The completion of the sequencing of the mouse genome promises to help predict human genes with greater accuracy. While current ab initio gene prediction programs are remarkably sensitive (i.e., they predict at least a fragment of most genes), their specificity is often low, predicting a large number of false-positive genes in the human genome. Sequence conservation at the protein level with the mouse genome can help eliminate some of those false positives. Here we describe SGP2, a gene prediction program that combines ab initio gene prediction with TBLASTX searches between two genome sequences to provide both sensitive and specific gene predictions. The accuracy of SGP2 when used to predict genes by comparing the human and mouse genomes is assessed on a number of data sets, including single-gene data sets, the highly curated human chromosome 22 predictions, and entire genome predictions from ENSEMBL. Results indicate that SGP2 outperforms purely ab initio gene prediction methods. Results also indicate that SGP2 works about as well with 3x shotgun data as it does with fully assembled genomes. SGP2 provides a high enough specificity that its predictions can be experimentally verified at a reasonable cost. SGP2 was used to generate a complete set of gene predictions on both the human and mouse by comparing the genomes of these two species. Our results suggest that another few thousand human and mouse genes currently not in ENSEMBL are worth verifying experimentally.
Resumo:
The “one-gene, one-protein” rule, coined by Beadle and Tatum, has been fundamental to molecular biology. The rule implies that the genetic complexity of an organism depends essentially on its gene number. The discovery, however, that alternative gene splicing and transcription are widespread phenomena dramatically altered our understanding of the genetic complexity of higher eukaryotic organisms; in these, a limited number of genes may potentially encode a much larger number of proteins. Here we investigate yet another phenomenon that may contribute to generate additional protein diversity. Indeed, by relying on both computational and experimental analysis, we estimate that at least 4%–5% of the tandem gene pairs in the human genome can be eventually transcribed into a single RNA sequence encoding a putative chimeric protein. While the functional significance of most of these chimeric transcripts remains to be determined, we provide strong evidence that this phenomenon does not correspond to mere technical artifacts and that it is a common mechanism with the potential of generating hundreds of additional proteins in the human genome.
Resumo:
A number of experimental methods have been reported for estimating the number of genes in a genome, or the closely related coding density of a genome, defined as the fraction of base pairs in codons. Recently, DNA sequence data representative of the genome as a whole have become available for several organisms, making the problem of estimating coding density amenable to sequence analytic methods. Estimates of coding density for a single genome vary widely, so that methods with characterized error bounds have become increasingly desirable. We present a method to estimate the protein coding density in a corpus of DNA sequence data, in which a ‘coding statistic’ is calculated for a large number of windows of the sequence under study, and the distribution of the statistic is decomposed into two normal distributions, assumed to be the distributions of the coding statistic in the coding and noncoding fractions of the sequence windows. The accuracy of the method is evaluated using known data and application is made to the yeast chromosome III sequence and to C.elegans cosmid sequences. It can also be applied to fragmentary data, for example a collection of short sequences determined in the course of STS mapping.
Resumo:
The vast majority of the biology of a newly sequenced genome is inferred from the set of encoded proteins. Predicting this set is therefore invariably the first step after the completion of the genome DNA sequence. Here we review the main computational pipelines used to generate the human reference protein-coding gene sets.
Resumo:
Nucleoside transporters (NTs) mediate the uptake of nucleosides and nucleobases across the plasma membrane, mostly for salvage purposes. The canonical NTs belong to two gene families, SLC29 and SLC28. The former encode equilibrative nucleoside transporter proteins (ENTs), which mediate the facilitative diffusion of natural nucleosides with broad selectivity, whereas the latter encode concentrative nucleoside transporters (CNTs), which are sodium-coupled and show high affinity for substrates with variable selectivity. These proteins are expressed in most cell types, exhibiting apparent functional redundancy. This might indicate that CNTs play specific roles in the physiology of the cell beyond nucleoside salvage. Here, we addressed this possibility using adenoviral vectors to restore tumor cell expression of hCNT1 or a polymorphic variant (hCNT1S546P) lacking nucleoside translocation ability. We found that hCNT1 restoration in pancreatic cancer cells significantly altered cell-cycle progression and phosphorylation status of key signal-transducing kinases, promoted poly-(ADP ribose) polymerase hyperactivation and cell death, and reduced tumor growth and cell migration. Importantly, the translocation-defective transporter triggered these same effects on cell physiology. These data predict a novel and totally unexpected biological role for the nucleoside transporter protein hCNT1 that appears to be independent of its role as mediator of nucleoside uptake by cells, thereby suggesting a transceptor function. Cell Death & Disease Anastasis Stephanou Receiving Editor Cell Death & Disease 19th Apr 2013 Dr Perez-Torras Av/ Diagonal 643. Edif. Prevosti, Pl -1 Barcelona 08028 Spain RE: Manuscript CDDIS-13-0136R, 'CDDIS-13-0136R' Dear Dr Perez-Torras, It is a pleasure to inform you that your manuscript has been evaluated at the editorial level and has now been officially accepted for publication in Cell Death & Disease, pending you meet the following editorial requirements: 1) the list of the abbreviations is missing please include Could you send us the revised text as word file via e-mail and we will proceed and transfer the paper onto our typesetters. Please download, print, sign, and return the Licence to Publish Form using the link below. This must be returned via FAX to ++ 39 06 7259 6977 before your manuscript can be published:
Resumo:
[2,5-Dimethylfuran]-protected maleimides were placed at both internal positions and the 3'-end of oligonucleotides making use of solid-phase synthesis procedures. A new phosphoramidite derivative and a new solid support incorporating the protected maleimide moiety were prepared for this purpose. In all cases maleimide deprotection (retro-Diels-Alder reaction) followed by reaction with thiol-containing compounds afforded the target conjugate.
Resumo:
Animal olfactory systems have a critical role for the survival and reproduction of individuals. In insects, the odorant-binding proteins (OBPs) are encoded by a moderately sized gene family, and mediate the first steps of the olfactory processing. Most OBPs are organized in clusters of a few paralogs, which are conserved over time. Currently, the biological mechanism explaining the close physical proximity among OBPs is not yet established. Here, we conducted a comprehensive study aiming to gain insights into the mechanisms underlying the OBP genomic organization. We found that the OBP clusters are embedded within large conserved arrangements. These organizations also include other non-OBP genes, which often encode proteins integral to plasma membrane. Moreover, the conservation degree of such large clusters is related to the following: 1) the promoter architecture of the confined genes, 2) a characteristic transcriptional environment, and 3) the chromatin conformation of the chromosomal region. Our results suggest that chromatin domains may restrict the location of OBP genes to regions having the appropriate transcriptional environment, leading to the OBP cluster structure. However, the appropriate transcriptional environment for OBP and the other neighbor genes is not dominated by reduced levels of expression noise. Indeed, the stochastic fluctuations in the OBP transcript abundance may have a critical role in the combinatorial nature of the olfactory coding process.