46 results for Human Genome
in Consorci de Serveis Universitaris de Catalunya (CSUC), Spain
Abstract:
Functional RNA structures play an important role both in noncoding RNA transcripts and in regulatory elements in mRNAs. Here we present a computational study to detect functional RNA structures within the ENCODE regions of the human genome. Since structural RNAs in general lack characteristic signals in primary sequence, comparative approaches evaluating evolutionary conservation of structures are most promising. We have used three recently introduced programs based on either phylogenetic–stochastic context-free grammar (EvoFold) or energy directed folding (RNAz and AlifoldZ), yielding several thousand candidate structures (corresponding to ∼2.7% of the ENCODE regions). EvoFold has its highest sensitivity in highly conserved and relatively AU-rich regions, while RNAz favors slightly GC-rich regions, resulting in a relatively small overlap between methods. Comparison with the GENCODE annotation points to functional RNAs in all genomic contexts, with a slightly increased density in 3′-UTRs. While we estimate a significant false discovery rate of ∼50%–70%, many of the predictions can be further substantiated by additional criteria: 248 loci are predicted by both RNAz and EvoFold, and an additional 239 RNAz or EvoFold predictions are supported by the (more stringent) AlifoldZ algorithm. Five hundred seventy RNAz structure predictions fall into regions that show signs of selection pressure also on the sequence level (i.e., conserved elements). More than 700 predictions overlap with noncoding transcripts detected by oligonucleotide tiling arrays. One hundred seventy-five selected candidates were tested by RT-PCR in six tissues, and expression could be verified in 43 cases (24.6%).
Abstract:
The “one-gene, one-protein” rule, coined by Beadle and Tatum, has been fundamental to molecular biology. The rule implies that the genetic complexity of an organism depends essentially on its gene number. The discovery, however, that alternative gene splicing and transcription are widespread phenomena dramatically altered our understanding of the genetic complexity of higher eukaryotic organisms; in these, a limited number of genes may potentially encode a much larger number of proteins. Here we investigate yet another phenomenon that may contribute to generating additional protein diversity. Indeed, by relying on both computational and experimental analysis, we estimate that at least 4%–5% of the tandem gene pairs in the human genome can eventually be transcribed into a single RNA sequence encoding a putative chimeric protein. While the functional significance of most of these chimeric transcripts remains to be determined, we provide strong evidence that this phenomenon does not correspond to mere technical artifacts and that it is a common mechanism with the potential of generating hundreds of additional proteins in the human genome.
Abstract:
Background: We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a ‘reference set’ of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment. Results: The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified. Conclusions: This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
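As an illustration of the nucleotide-level measures quoted above, the following minimal sketch computes sensitivity and specificity from two sets of coding positions. It assumes the convention common in gene-prediction benchmarks, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP); the abstract does not spell out its formulas, and the function names and toy intervals are hypothetical.

```python
def positions(intervals):
    """Expand half-open (start, end) intervals into a set of nucleotide positions."""
    return {p for start, end in intervals for p in range(start, end)}


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) for two sets of coding positions.

    Assumed convention (not stated in the abstract):
      sensitivity = TP / (TP + FN), the fraction of annotated coding nt predicted
      specificity = TP / (TP + FP), the fraction of predicted coding nt annotated
    """
    tp = len(annotated & predicted)
    fn = len(annotated - predicted)
    fp = len(predicted - annotated)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


# Toy example with made-up coordinates: two annotated exons, two predicted exons.
annotated = positions([(100, 200), (300, 400)])
predicted = positions([(110, 200), (300, 410)])
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Measuring at the nucleotide level is more forgiving than at the exon or transcript level, since a prediction that misses an exon boundary by a few bases still earns most of the credit. This may partly explain why the nucleotide figures are higher than the transcript figures.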
Abstract:
Background: Non-long terminal repeat (non-LTR) retrotransposons have contributed to shaping the structure and function of genomes. In silico and experimental approaches have been used to identify the non-LTR elements of the urochordate Ciona intestinalis. Knowledge of the types and abundance of non-LTR elements in urochordates is a key step in understanding their contribution to the structure and function of vertebrate genomes. Results: Consensus elements phylogenetically related to the I, LINE1, LINE2, LOA and R2 elements of the 14 eukaryotic non-LTR clades are described from C. intestinalis. The ascidian elements showed conservation of both the reverse transcriptase coding sequence and the overall structural organization seen in each clade. The apurinic/apyrimidinic endonuclease and nucleic-acid-binding domains encoded upstream of the reverse transcriptase, and the RNase H and the restriction enzyme-like endonuclease motifs encoded downstream of the reverse transcriptase were identified in the corresponding Ciona families. Conclusions: The genome of C. intestinalis harbors representatives of at least five clades of non-LTR retrotransposons. The copy number per haploid genome of each element is low, less than 100, far below the values reported for vertebrate counterparts but within the range for protostomes. Genomic and sequence analysis shows that the ascidian non-LTR elements are unmethylated and flanked by genomic segments with a gene density lower than average for the genome. The analysis provides valuable data for understanding the evolution of early chordate genomes and enlarges the view on the distribution of the non-LTR retrotransposons in eukaryotes.
Abstract:
Replication of human immunodeficiency virus (HIV) requires base pairing of the reverse transcriptase primer, human tRNA(Lys3), to the viral RNA. Although the major complementary base pairing occurs between the HIV primer binding sequence (PBS) and the tRNA's 3'-terminus, an important discriminatory, secondary contact occurs between the viral A-rich Loop I, 5'-adjacent to the PBS, and the modified, U-rich anticodon domain of tRNA(Lys3). The importance of individual and combined anticodon modifications to the tRNA/HIV-1 Loop I RNA's interaction was determined. The thermal stabilities of variously modified tRNA anticodon region sequences bound to the Loop I of viral sub(sero)types G and B were analyzed and the structure of one duplex containing two modified nucleosides was determined using NMR spectroscopy and restrained molecular dynamics. The modifications 2-thiouridine, s(2)U(34), and pseudouridine, Psi(39), appreciably stabilized the interaction of the anticodon region with the viral subtype G and B RNAs. The structure of the duplex results in two coaxially stacked A-form RNA stems separated by two mismatched base pairs, U(162)*Psi(39) and G(163)*A(38), that maintained a reasonable A-form helix diameter. The tRNA's s(2)U(34) stabilized the interaction between the A-rich HIV Loop I sequence and the U-rich anticodon, whereas the tRNA's Psi(39) stabilized the adjacent mismatched pairs.
Abstract:
The goals of the human genome project did not include sequencing of the heterochromatic regions. We describe here an initial sequence of 1.1 Mb of the short arm of human chromosome 21 (HSA21p), estimated to be 10% of 21p. This region contains extensive euchromatic-like sequence and includes on average one transcript every 100 kb. These transcripts show multiple inter- and intrachromosomal copies, and extensive copy number and sequence variability. The sequencing of the "heterochromatic" regions of the human genome is likely to reveal many additional functional elements and provide important evolutionary information.
Abstract:
The completion of the sequencing of the mouse genome promises to help predict human genes with greater accuracy. While current ab initio gene prediction programs are remarkably sensitive (i.e., they predict at least a fragment of most genes), their specificity is often low, predicting a large number of false-positive genes in the human genome. Sequence conservation at the protein level with the mouse genome can help eliminate some of those false positives. Here we describe SGP2, a gene prediction program that combines ab initio gene prediction with TBLASTX searches between two genome sequences to provide both sensitive and specific gene predictions. The accuracy of SGP2 when used to predict genes by comparing the human and mouse genomes is assessed on a number of data sets, including single-gene data sets, the highly curated human chromosome 22 predictions, and entire genome predictions from ENSEMBL. Results indicate that SGP2 outperforms purely ab initio gene prediction methods. Results also indicate that SGP2 works about as well with 3x shotgun data as it does with fully assembled genomes. SGP2 provides a high enough specificity that its predictions can be experimentally verified at a reasonable cost. SGP2 was used to generate a complete set of gene predictions on both the human and mouse by comparing the genomes of these two species. Our results suggest that another few thousand human and mouse genes currently not in ENSEMBL are worth verifying experimentally.
Abstract:
BACKGROUND: The need for an integrated view of data obtained from high-throughput technologies gave rise to network analyses. These are especially useful to rationalize how external perturbations propagate through the expression of genes. To address this issue in the case of drug resistance, we constructed biological association networks of genes differentially expressed in cell lines resistant to methotrexate (MTX). METHODS: Seven cell lines representative of different types of cancer, including colon cancer (HT29 and Caco2), breast cancer (MCF-7 and MDA-MB-468), pancreatic cancer (MIA PaCa-2), erythroblastic leukemia (K562) and osteosarcoma (Saos-2), were used. The differential expression pattern between sensitive and MTX-resistant cells was determined by whole human genome microarrays and analyzed with the GeneSpring GX software package. Genes deregulated in common between the different cancer cell lines served to generate biological association networks using the Pathway Architect software. RESULTS: Dickkopf homolog-1 (DKK1) is a highly interconnected node in the network generated with genes in common between the two colon cancer cell lines, and functional validations of this target using small interfering RNAs (siRNAs) showed a chemosensitization toward MTX. Members of the UDP-glucuronosyltransferase 1A (UGT1A) family formed a network of genes differentially expressed in the two breast cancer cell lines. siRNA treatment against UGT1A also showed an increase in MTX sensitivity. Eukaryotic translation elongation factor 1 alpha 1 (EEF1A1) was overexpressed among the pancreatic cancer, leukemia and osteosarcoma cell lines, and siRNA treatment against EEF1A1 produced a chemosensitization toward MTX. CONCLUSIONS: Biological association networks identified DKK1, UGT1As and EEF1A1 as important gene nodes in MTX-resistance. Treatments using siRNA technology against these three genes showed chemosensitization toward MTX.
Abstract:
Different signatures of natural selection persist over varying time scales in our genome, revealing possible episodes of adaptive evolution during human history. Here, we identify genes showing signatures of ancestral positive selection in the human lineage and investigate whether some of those genes have been evolving adaptively in extant human populations. Specifically, we compared more than 11,000 human genes with their orthologs in chimpanzee, mouse, rat and dog and applied a branch-site likelihood method to test for positive selection on the human lineage. Among the significant cases, a robust set of 11 genes were then further explored for signatures of recent positive selection using SNP data. We genotyped 223 SNPs in 39 worldwide populations from the HGDP Diversity panel and supplemented this information with available genotypes for up to 4,814 SNPs distributed along 2 Mb centered on each gene. After exploring the allele frequency spectrum, population differentiation and the maintenance of long unbroken haplotypes, we found signals of recent adaptive phenomena in only one of the 11 candidate gene regions. However, the signal of recent selection in this region may come from a different, neighbouring gene (CD5) rather than from the candidate gene itself (VPS37C). For this set of positively-selected genes in the human lineage, we find no indication that these genes maintained their rapid evolutionary pace among human populations. Based on these data, it therefore appears that adaptation for human-specific and for population-specific traits may have involved different genes.
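Branch-site likelihood methods are usually evaluated with a likelihood ratio test between a null model and a model that allows positive selection on the chosen lineage. The sketch below shows only that final step, under the assumption that twice the log-likelihood difference is compared against a chi-square distribution with one degree of freedom. The abstract does not state which reference distribution the authors used, and the log-likelihood values here are hypothetical.

```python
import math


def lrt_p_value(lnl_null, lnl_alt):
    """Likelihood ratio test, assuming a chi-square reference with 1 df.

    Returns (statistic, p_value). For 1 df the chi-square survival function
    has the closed form erfc(sqrt(x / 2)), so only the standard library is needed.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods for one gene under the null and alternative models.
stat, p = lrt_p_value(lnl_null=-12345.6, lnl_alt=-12341.2)
print(f"2*delta_lnL = {stat:.2f}, p = {p:.4f}")
```

With more than 11,000 genes tested, raw p-values like this one would normally be corrected for multiple testing before any gene is called significant.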
Abstract:
There is great scientific and popular interest in understanding the genetic history of populations in the Americas. We wish to understand when different regions of the continent were inhabited, where settlers came from, and how current inhabitants relate genetically to earlier populations. Recent studies unraveled parts of the genetic history of the continent using genotyping arrays and uniparental markers. The 1000 Genomes Project provides a unique opportunity for improving our understanding of population genetic history by providing over a hundred sequenced low coverage genomes and exomes from Colombian (CLM), Mexican-American (MXL), and Puerto Rican (PUR) populations. Here, we explore the genomic contributions of African, European, and especially Native American ancestry to these populations. Estimated Native American ancestry is 48% in MXL, 25% in CLM, and 13% in PUR. Native American ancestry in PUR is most closely related to populations surrounding the Orinoco River basin, confirming the Southern American ancestry of the Taíno people of the Caribbean. We present new methods to estimate the allele frequencies in the Native American fraction of the populations, and model their distribution using a demographic model for three ancestral Native American populations. These ancestral populations likely split in close succession: the most likely scenario, based on a peopling of the Americas 16 thousand years ago (kya), supports that the MXL Ancestors split 12.2kya, with a subsequent split of the ancestors to CLM and PUR 11.7kya. The model also features effective populations of 62,000 in Mexico, 8,700 in Colombia, and 1,900 in Puerto Rico. Modeling Identity-by-descent (IBD) and ancestry tract length, we show that post-contact populations also differ markedly in their effective sizes and migration patterns, with Puerto Rico showing the smallest effective size and the earlier migration from Europe. 
Finally, we compare IBD and ancestry assignments to find evidence for relatedness among European founders to the three populations.
Abstract:
Background: It is well known that the pattern of linkage disequilibrium varies between human populations, with remarkable geographical stratification. Indirect association studies routinely exploit linkage disequilibrium around genes, particularly in isolated populations where it is assumed to be higher. Here, we explore both the amount and the decay of linkage disequilibrium with physical distance along 211 gene regions, most of them related to complex diseases, across 39 HGDP-CEPH population samples, focusing particularly on the populations defined as isolates. Within each gene region and population we use r² between all possible single nucleotide polymorphism (SNP) pairs as a measure of linkage disequilibrium and focus on the proportion of SNP pairs with r² greater than 0.8. Results: Although the average r² was found to be significantly different both between and within continental regions, a much higher proportion of r² variance could be attributed to differences between continental regions (2.8% vs. 0.5%, respectively). Similarly, while the proportion of SNP pairs with r² > 0.8 was significantly different across continents for all distance classes, it was generally much more homogeneous within continents, except in the case of Africa and the Americas. The only isolated populations with consistently higher LD in all distance classes with respect to their continent are the Kalash (Central South Asia) and the Surui (America). Moreover, isolated populations showed only slightly higher proportions of SNP pairs with r² > 0.8 per gene region than non-isolated populations in the same continent. Thus, the number of SNPs in isolated populations that need to be genotyped may be only slightly less than in non-isolates. Conclusion: The 'isolated population' label by itself does not guarantee a greater genotyping efficiency in association studies, and properties other than increased linkage disequilibrium may make these populations interesting in genetic epidemiology.
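The summary statistic used in this abstract, the proportion of SNP pairs whose squared correlation exceeds 0.8, can be sketched as follows. This is a minimal illustration that assumes phased haplotypes coded as 0/1 alleles; the study itself worked from genotype data, where haplotype frequencies first have to be estimated. The function names and toy data are hypothetical.

```python
from itertools import combinations


def r_squared(a, b):
    """Squared correlation between two biallelic SNPs given phased 0/1 haplotypes.

    Uses r^2 = D^2 / (pA (1 - pA) pB (1 - pB)) with D = pAB - pA * pB.
    Returns None if either SNP is monomorphic in the sample.
    """
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x & y for x, y in zip(a, b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    d = p_ab - p_a * p_b
    return d * d / denom


def proportion_high_ld(snps, threshold=0.8):
    """Fraction of informative SNP pairs with r^2 above the threshold."""
    values = [r_squared(a, b) for a, b in combinations(snps, 2)]
    values = [v for v in values if v is not None]
    if not values:
        return 0.0
    return sum(v > threshold for v in values) / len(values)


# Toy data: eight haplotypes at three SNPs (rows are SNPs, columns are haplotypes).
snps = [
    [0, 0, 1, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 0, 1],  # identical to the first SNP, so r^2 = 1
    [0, 1, 0, 1, 0, 1, 0, 1],
]
proportion = proportion_high_ld(snps)
print(f"proportion of pairs with r^2 > 0.8: {proportion:.3f}")
```

A higher proportion means fewer tag SNPs are needed to capture the variation in a region, which is why the statistic is a proxy for genotyping efficiency.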
Abstract:
The success of the Human Genome Project (HGP) in 2000 brought “personalized medicine” closer to reality. The discoveries of the HGP have simplified sequencing techniques to such an extent that nowadays anyone can obtain their complete DNA sequence. Read Mapping technology stands out among these techniques and is characterized by handling a large amount of data. Hadoop, Apache's framework for data-intensive applications under the MapReduce paradigm, is a perfect ally for this type of technology and was the option chosen to carry out this project. Throughout the work, the study, analysis and experimentation needed to achieve an innovative Genetic Algorithm that uses the full potential of Hadoop are carried out.
Abstract:
Since the start of the human genome project and its success in 2001, the genomes of a multitude of species have been sequenced. Improvements in sequencing technologies have generated data volumes that grow exponentially. The project “Bioinformatic analyses on Hadoop technology” covers the parallel computation of biological data such as DNA sequences. The study has been guided by the nature of the problem to be solved: the alignment of genetic sequences with the MapReduce paradigm.
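To make the MapReduce framing concrete, here is a toy, in-memory sketch of exact seed matching between reads and a reference, written as explicit map, shuffle and reduce phases. It does not use the Hadoop API and is not the method of the project described above; the seed length `K`, the function names and the sequences are all assumptions chosen for illustration.

```python
from collections import defaultdict

K = 4  # seed (k-mer) length; illustrative value only


def map_reference(reference):
    """Map phase for the reference: emit (k-mer, ("ref", position)) pairs."""
    for i in range(len(reference) - K + 1):
        yield reference[i:i + K], ("ref", i)


def map_read(read_id, read):
    """Map phase for a read: emit its first k-mer as the seed."""
    yield read[:K], ("read", read_id)


def shuffle(pairs):
    """Shuffle phase: group all values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(kmer, values):
    """Reduce phase: pair every read sharing this seed with each reference position."""
    ref_positions = [v for tag, v in values if tag == "ref"]
    read_ids = [v for tag, v in values if tag == "read"]
    for read_id in read_ids:
        for pos in ref_positions:
            yield read_id, pos


reference = "ACGTACGGTCA"
reads = {"r1": "ACGG", "r2": "GTCA", "r3": "TTTT"}

pairs = list(map_reference(reference))
for read_id, seq in reads.items():
    pairs.extend(map_read(read_id, seq))

hits = sorted(
    hit
    for kmer, values in shuffle(pairs).items()
    for hit in reduce_seed(kmer, values)
)
print(hits)
```

Keying on the seed lets each reducer work on one k-mer independently, which is what allows the work to be spread across nodes. A real aligner would then extend each seed hit and tolerate mismatches.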
Abstract:
One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. 
Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.
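The semiartificial test set described in this abstract can be illustrated with a short sketch that joins single-gene sequences with randomly generated intergenic spacers and records where each gene lands. The abstract does not describe the actual construction procedure, so the uniform base composition, the fixed spacer length and the function names here are assumptions.

```python
import random


def random_intergenic(length, rng):
    """Random spacer with uniform base composition (an assumption of this sketch)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_test_sequence(genes, spacer_len, seed=0):
    """Concatenate gene sequences separated and flanked by random spacers.

    Returns the assembled sequence and the half-open (start, end) coordinates
    of each gene, which serve as the ground truth when scoring predictions.
    """
    rng = random.Random(seed)
    parts = [random_intergenic(spacer_len, rng)]
    coords = []
    offset = spacer_len
    for gene in genes:
        parts.append(gene)
        coords.append((offset, offset + len(gene)))
        offset += len(gene)
        parts.append(random_intergenic(spacer_len, rng))
        offset += spacer_len
    return "".join(parts), coords


# Hypothetical miniature "genes"; real inputs would be full single-gene genomic sequences.
genes = ["ATGAAATAG", "ATGCCCTGA"]
sequence, coords = build_test_sequence(genes, spacer_len=20)
print(len(sequence), coords)
```

Because the gene coordinates are known by construction, any prediction falling in a spacer counts as a false positive. This makes the drop in accuracy on long sequences measurable.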