Biblioteca Digital

65 resultados para coding

Assembling genes from predicted exons in linear time with dynamic programming

Relevância:

10.00% 10.00%

Publicador:

Resumo:

In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from such a pool as sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm here relies on the simple fact that such highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure model. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model specifies simply which gene features are allowed immediately upstream which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.

GENCODE: producing a reference annotation for ENCODE

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Background: The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manualannotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results.Results: The GENCODE gene features are divided into eight different categories of which onlythe first two (known and novel coding sequence) are confidently predicted to be protein-codinggenes. 5’ rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentallyverify the initial annotation. Of the 420 coding loci tested, 229 RACE products have beensequenced. They supported 5’ extensions of 30 loci and new splice variants in 50 loci. In addition,46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15putative transcripts. We assessed the comprehensiveness of the GENCODE annotation byattempting to validate all the predicted exon boundaries outside the GENCODE annotation. Outof 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only twoof them in intergenic regions.Conclusions: In total, 487 loci, of which 434 are coding, have been annotated as part of theGENCODE reference set available from the UCSC browser. Comparison of GENCODEannotation with RefSeq and ENSEMBL show only 40% of GENCODE exons are contained withinthe two sets, which is a reflection of the high number of alternative splice forms with uniqueexons annotated. Over 50% of coding loci have been experimentally verified by 5’ RACE forEGASP and the GENCODE collaboration is continuing to refine its annotation of 1% humangenome with the aid of experimental validation.

EGASP: the human ENCODE Genome Annotation Assessment Project

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Background: We present the results of EGASP, a community experiment to assess the state-ofthe-art in genome annotation within the ENCODE regions, which span 1% of the human genomesequence. The experiment had two major goals: the assessment of the accuracy of computationalmethods to predict protein coding genes; and the overall assessment of the completeness of thecurrent human genome annotations as represented in the ENCODE regions. For thecomputational prediction assessment, eighteen groups contributed gene predictions. Weevaluated these submissions against each other based on a ‘reference set’ of annotationsgenerated as part of the GENCODE project. These annotations were not available to theprediction groups prior to the submission deadline, so that their predictions were blind and anexternal advisory committee could perform a fair assessment.Results: The best methods had at least one gene transcript correctly predicted for close to 70%of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into accountalternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotidelevel, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programsrelying on mRNA and protein sequences were the most accurate in reproducing the manuallycurated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could beverified.Conclusions: This is the first such experiment in human DNA, and we have followed thestandards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe theresults presented here contribute to the value of ongoing large-scale annotation projects and shouldguide further experimental methods when being scaled up to the entire human genome sequence.

SelenoDB 1.0 : a database of selenoprotein genes, proteins and SECIS elements

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Selenoproteins are a diverse group of proteinsusually misidentified and misannotated in sequencedatabases. The presence of an in-frame UGA (stop)codon in the coding sequence of selenoproteingenes precludes their identification and correctannotation. The in-frame UGA codons are recodedto cotranslationally incorporate selenocysteine,a rare selenium-containing amino acid. The developmentof ad hoc experimental and, more recently,computational approaches have allowed the efficientidentification and characterization of theselenoproteomes of a growing number of species.Today, dozens of selenoprotein families have beendescribed and more are being discovered in recentlysequenced species, but the correct genomic annotationis not available for the majority of thesegenes. SelenoDB is a long-term project that aims toprovide, through the collaborative effort of experimentaland computational researchers, automaticand manually curated annotations of selenoproteingenes, proteins and SECIS elements. Version 1.0 ofthe database includes an initial set of eukaryoticgenomic annotations, with special emphasis on thehuman selenoproteome, for immediate inspectionby selenium researchers or incorporation into moregeneral databases. SelenoDB is freely available athttp://www.selenodb.org.

Comparative gene finding in chicken indicates that we are closing in on the set of multi-exonic widely-expressed human genes

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The recent availability of the chicken genome sequence poses the question of whether there are human protein-coding genes conserved in chicken that are currently not included in the human gene catalog. Here, we show, using comparative gene finding followed by experimental verification of exon pairs by RT–PCR, that the addition to the multi-exonic subset of this catalog could be as little as 0.2%, suggesting that we may be closing in on the human gene set. Our protocol, however, has two shortcomings: (i) the bioinformatic screening of the predicted genes, applied to filter out false positives, cannot handle intronless genes; and (ii) the experimental verification could fail to identify expression at a specific developmental time. This highlights the importance of developing methods that could provide a reliable estimate of the number of these two types of genes.

Lifespan extension by calorie restriction relies on the Sty1 MAP kinase stress pathway

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Either calorie restriction, loss of function of the nutrient-dependent PKA or TOR/SCH9 pathways, or activation of stress defences improves longevity in different eukaryotes. However, the molecular links between glucose depletion, nutrient-dependent pathways and stress responses are unknown. Here we show that either calorie restriction or inactivation of nutrient-dependent pathways induces life-span extension in fission yeast, and that such effect is dependent on the activation of the stress-dependent Sty1 MAP kinase. During transition to stationary phase in glucose-limiting conditions, Sty1 becomes activated and triggers a transcriptional stress program, whereas such activation does not occur under glucose-rich conditions. Deletion of the genes coding for the SCH9-homologue Sck2 or the Pka1 kinases, or mutations leading to constitutive activation of the Sty1 stress pathway increase life span under glucose-rich conditions, and importantly such beneficial effects depend ultimately on Sty1. Furthermore, cells lacking Pka1 display enhanced oxygen consumption and Sty1 activation under glucose-rich conditions. We conclude that calorie restriction favours oxidative metabolism, reactive oxygen species production and Sty1 MAP kinase activation, and this stress pathway favours life-span extension.

Design and evaluation of a panel os single-nucleotide polymorphisms in microRNA genomic regions for association studies in human disease

Relevância:

10.00% 10.00%

Publicador:

Resumo:

MicroRNAs (miRNA) are recognized posttranscriptional gene repressors involved in the control of almost every biological process. Allelic variants in these regions may be an important source of phenotypic diversity and contribute to disease susceptibility. We analyzed the genomic organization of 325 human miRNAs (release 7.1, miRBase) to construct a panel of 768 single-nucleotide polymorphisms (SNPs) covering approximately 1 Mb of genomic DNA, including 131 isolated miRNAs (40%) and 194 miRNAs arranged in 48 miRNA clusters, as well as their 5-kb flanking regions. Of these miRNAs, 37% were inside known protein-coding genes, which were significantly associated with biological functions regarding neurological, psychological or nutritional disorders. SNP coverage analysis revealed a lower SNP density in miRNAs compared with the average of the genome, with only 24 SNPs located in the 325 miRNAs studied. Further genotyping of 340 unrelated Spanish individuals showed that more than half of the SNPs in miRNAs were either rare or monomorphic, in agreement with the reported selective constraint on human miRNAs. A comparison of the minor allele frequencies between Spanish and HapMap population samples confirmed the applicability of this SNP panel to the study of complex disorders among the Spanish population, and revealed two miRNA regions, hsa-mir-26a-2 in the CTDSP2 gene and hsa-mir-128-1 in the R3HDM1 gene, showing geographical allelic frequency variation among the four HapMap populations, probably because of differences in natural selection. The designed miRNA SNP panel could help to identify still hidden links between miRNAs and human disease.

Genetic origin, admixture, and asymmetry in maternal and paternal human lineages in Cuba

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Background: Before the arrival of Europeans to Cuba, the island was inhabited by two Native American groups, the Tainos and the Ciboneys. Most of the present archaeological, linguistic and ancient DNA evidence indicates a South American origin for these populations. In colonial times, Cuban Native American people were replaced by European settlers and slaves from Africa. It is still unknown however, to what extent their genetic pool intermingled with and was 'diluted' by the arrival of newcomers. In order to investigate the demographic processes that gave rise to the current Cuban population, we analyzed the hypervariable region I (HVS-I) and five single nucleotide polymorphisms (SNPs) in the mitochondrial DNA (mtDNA) coding region in 245 individuals, and 40 Y-chromosome SNPs in 132 male individuals. Results: The Native American contribution to present-day Cubans accounted for 33% of the maternal lineages, whereas Africa and Eurasia contributed 45% and 22% of the lineages, respectively. This Native American substrate in Cuba cannot be traced back to a single origin within the American continent, as previously suggested by ancient DNA analyses. Strikingly, no Native American lineages were found for the Y-chromosome, for which the Eurasian and African contributions were around 80% and 20%, respectively. Conclusion: While the ancestral Native American substrate is still appreciable in the maternal lineages, the extensive process of population admixture in Cuba has left no trace of the paternal Native American lineages, mirroring the strong sexual bias in the admixture processes taking place during colonial times.

Patterns of evolutionary constraints on genes in humans

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Background: Different regions in a genome evolve at different rates depending on structural and functional constraints. Some genomic regions are highly conserved during metazoan evolution, while other regions may evolve rapidly, either in all species or in a lineage-specific manner. A strong or even moderate change in constraints in functional regions, for example in coding regions, can have significant evolutionary consequences. Results: Here we discuss a novel framework, 'BaseDiver', to classify groups of genes in humans based on the patterns of evolutionary constraints on polymorphic positions in their coding regions. Comparing the nucleotide-level divergence among mammals with the extent of deviation from the ancestral base in the human lineage, we identify patterns of evolutionary pressure on nonsynonymous base-positions in groups of genes belonging to the same functional category. Focussing on groups of genes in functional categories, we find that transcription factors contain a significant excess of nonsynonymous base-positions that are conserved in other mammals but changed in human, while immunity related genes harbour mutations at base-positions that evolve rapidly in all mammals including humans due to strong preference for advantageous alleles. Genes involved in olfaction also evolve rapidly in all mammals, and in humans this appears to be due to weak negative selection. Conclusion: While recent studies have identified genes under positive selection in humans, our approach identifies evolutionary constraints on Gene Ontology groups identifying changes in humans relative to some of the other mammals.

Differentiated evolutionary rates in alternative exons and the implications for splicing regulation

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Background: Alternatively spliced exons play an important role in the diversification of gene function in most metazoans and are highly regulated by conserved motifs in exons and introns. Two contradicting properties have been associated to evolutionary conserved alternative exons: higher sequence conservation and higher rate of non-synonymous substitutions, relative to constitutive exons. In order to clarify this issue, we have performed an analysis of the evolution of alternative and constitutive exons, using a large set of protein coding exons conserved between human and mouse and taking into account the conservation of the transcript exonic structure. Further, we have also defined a measure of the variation of the arrangement of exonic splicing enhancers (ESE-conservation score) to study the evolution of splicing regulatory sequences. We have used this measure to correlate the changes in the arrangement of ESEs with the divergence of exon and intron sequences. Results: We find evidence for a relation between the lack of conservation of the exonic structure and the weakening of the sequence evolutionary constraints in alternative and constitutive exons. Exons in transcripts with non-conserved exonic structures have higher synonymous (dS) and non-synonymous (dN) substitution rates than exons in conserved structures. Moreover, alternative exons in transcripts with non-conserved exonic structure are the least constrained in sequence evolution, and at high EST-inclusion levels they are found to be very similar to constitutive exons, whereas alternative exons in transcripts with conserved exonic structure have a dS significantly lower than average at all EST-inclusion levels. We also find higher conservation in the arrangement of ESEs in constitutive exons compared to alternative ones. Additionally, the sequence conservation at flanking introns remains constant for constitutive exons at all ESE-conservation values, but increases for alternative exons at high ESE-conservation values. Conclusion: We conclude that most of the differences in dN observed between alternative and constitutive exons can be explained by the conservation of the transcript exonic structure. Low dS values are more characteristic of alternative exons with conserved exonic structure, but not of those with non-conserved exonic structure. Additionally, constitutive exons are characterized by a higher conservation in the arrangement of ESEs, and alternative exons with an ESE-conservation similar to that of constitutive exons are characterized by a conservation of the flanking intron sequences higher than average, indicating the presence of more intronic regulatory signals.

What can spike train distances tell us about the neural code?

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Time scale parametric spike train distances like the Victor and the van Rossum distancesare often applied to study the neural code based on neural stimuli discrimination.Different neural coding hypotheses, such as rate or coincidence coding,can be assessed by combining a time scale parametric spike train distance with aclassifier in order to obtain the optimal discrimination performance. The time scalefor which the responses to different stimuli are distinguished best is assumed to bethe discriminative precision of the neural code. The relevance of temporal codingis evaluated by comparing the optimal discrimination performance with the oneachieved when assuming a rate code.We here characterize the measures quantifying the discrimination performance,the discriminative precision, and the relevance of temporal coding. Furthermore,we evaluate the information these quantities provide about the neural code. Weshow that the discriminative precision is too unspecific to be interpreted in termsof the time scales relevant for encoding. Accordingly, the time scale parametricnature of the distances is mainly an advantage because it allows maximizing thediscrimination performance across a whole set of measures with different sensitivitiesdetermined by the time scale parameter, but not due to the possibility toexamine the temporal properties of the neural code.

Distortion control for delay-sensitive sources

Relevância:

10.00% 10.00%

Publicador:

Resumo:

We investigate the problem of finding minimum-distortion policies for streaming delay-sensitive but distortion-tolerant data. We consider cross-layer approaches which exploit the coupling between presentation and transport layers. We make the natural assumption that the distortion function is convex and decreasing. We focus on a single source-destination pair and analytically find the optimum transmission policy when the transmission is done over an error-free channel. This optimum policy turns out to be independent of the exact form of the convex and decreasing distortion function. Then, for a packet-erasure channel, we analytically find the optimum open-loop transmission policy, which is also independent of the form of the convex distortion function. We then find computationally efficient closed-loop heuristic policies and show, through numerical evaluation, that they outperform the open-loop policy and have near optimal performance.

Motivation, Test Scores, and Economic Success

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This paper argues that low-stakes test scores, available in surveys, may be partially determined by test-taking motivation, which is associated with personality traits but not with cognitive ability. Therefore, such test score distributions may not be informative regarding cognitive ability distributions. Moreover, correlations, found in survey data, between high test scores and economic success may be partially caused by favorable personality traits. To demonstrate these points, I use the coding speed test that was administered without incentives to National Longitudinal Survey of Youth 1979 (NLSY) participants. I suggest that due to its simplicity its scores may especially depend on individuals' test-taking motivation. I show that controlling for conventional measures of cognitive skills, the coding speed scores are correlated with future earnings of male NLSY participants. Moreover, the coding speed scores of highly motivated, though less educated, population (potential enlists to the armed forces) are higher than NLSY participants' scores. I then use controlled experiments to show that when no performance-based incentives are provided, participants' characteristics, but not their cognitive skills, affect effort invested in the coding speed test. Thus, participants with the same ability (measured by their scores on an incentivized test) have significantly different scores on tests without performance- based incentives.

Measures of fit in multiple correspondence analysis of crisp and fuzzy coded data

Relevância:

10.00% 10.00%

Publicador:

Resumo:

When continuous data are coded to categorical variables, two types of coding are possible: crisp coding in the form of indicator, or dummy, variables with values either 0 or 1; or fuzzy coding where each observation is transformed to a set of "degrees of membership" between 0 and 1, using co-called membership functions. It is well known that the correspondence analysis of crisp coded data, namely multiple correspondence analysis, yields principal inertias (eigenvalues) that considerably underestimate the quality of the solution in a low-dimensional space. Since the crisp data only code the categories to which each individual case belongs, an alternative measure of fit is simply to count how well these categories are predicted by the solution. Another approach is to consider multiple correspondence analysis equivalently as the analysis of the Burt matrix (i.e., the matrix of all two-way cross-tabulations of the categorical variables), and then perform a joint correspondence analysis to fit just the off-diagonal tables of the Burt matrix - the measure of fit is then computed as the quality of explaining these tables only. The correspondence analysis of fuzzy coded data, called "fuzzy multiple correspondence analysis", suffers from the same problem, albeit attenuated. Again, one can count how many correct predictions are made of the categories which have highest degree of membership. But here one can also defuzzify the results of the analysis to obtain estimated values of the original data, and then calculate a measure of fit in the familiar percentage form, thanks to the resultant orthogonal decomposition of variance. Furthermore, if one thinks of fuzzy multiple correspondence analysis as explaining the two-way associations between variables, a fuzzy Burt matrix can be computed and the same strategy as in the crisp case can be applied to analyse the off-diagonal part of this matrix. In this paper these alternative measures of fit are defined and applied to a data set of continuous meteorological variables, which are coded crisply and fuzzily into three categories. Measuring the fit is further discussed when the data set consists of a mixture of discrete and continuous variables.

Motivation, test scores and economic success

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This paper argues that low-stakes test scores, available in surveys, may be partially determinedby test-taking motivation, which is associated with personality traits but not with cognitiveability. Therefore, such test score distributions may not be informative regarding cognitiveability distributions. Moreover, correlations, found in survey data, between high test scoresand economic success may be partially caused by favorable personality traits. To demonstratethese points, I use the coding speed test that was administered without incentives to NationalLongitudinal Survey of Youth 1979 (NLSY) participants. I suggest that due to its simplicityits scores may especially depend on individuals' test-taking motivation. I show that controllingfor conventional measures of cognitive skills, the coding speed scores are correlated with futureearnings of male NLSY participants. Moreover, the coding speed scores of highly motivated,though less educated, population (potential enlists to the armed forces) are higher than NLSYparticipants' scores. I then use controlled experiments to show that when no performance-basedincentives are provided, participants' characteristics, but not their cognitive skills, affect effortinvested in the coding speed test. Thus, participants with the same ability (measured by theirscores on an incentivized test) have significantly different scores on tests without performance-based incentives.

«
1
2
3
4
5
»