738 resultados para Annotation de génomes


Relevância:

10.00% 10.00%

Publicador:

Resumo:

We report the construction of the mouse full-length cDNA encyclopedia, the most extensive view of a complex transcriptome, on the basis of preparing and sequencing 246 libraries. Before cloning, cDNAs were enriched in full-length by Cap-Trapper, and in most cases, aggressively subtracted/normalized. We have produced 1,442,236 successful 3'-end sequences clustered into 171,144 groups, from which 60,770 clones were fully sequenced cDNAs annotated in the FANTOM-2 annotation. We have also produced 547,149 5' end reads, which clustered into 124,258 groups. Altogether, these cDNAs were further grouped in 70,000 transcriptional units (TU), which represent the best coverage of a transcriptome so far. By monitoring the extent of normalization/subtraction, we define the tentative equivalent coverage (TEC), which was estimated to be equivalent to >12,000,000 ESTs derived from standard libraries. High coverage explains discrepancies between the very large. numbers of clusters (and TUs) of this project, which also include non-protein-coding RNAs, and the lower gene number estimation of genome annotations. Altogether, S'-end clusters identify regions that are potential promoters for 8637 known genes and S'-end clusters suggest the presence of almost 63,000 transcriptional starting points. An estimate of the frequency of polyadenylation signals suggests that at least half of the singletons in the EST set represent real mRNAs. Clones accounting for about half of the predicted TUs await further sequencing. The continued high-discovery rate suggests that the task of transcriptome discovery is not yet complete.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

With the sequencing and annotation of genomes and transcriptomes of several eukaryotes, the importance of noncoding RNA (ncRNA)-RNA molecules that are not translated to protein products-has become more evident. A subclass of ncRNA transcripts are encoded by highly regulated, multi-exon, transcriptional units, are processed like typical protein-coding mRNAs and are increasingly implicated in regulation of many cellular functions in eukaryotes. This study describes the identification of candidate functional ncRNAs from among the RIKEN mouse full-length cDNA collection, which contains 60,770 sequences, by using a systematic computational filtering approach. We initially searched for previously reported ncRNAs and found nine murine ncRNAs and homologs of several previously described nonmouse ncRNAs. Through our computational approach to filter artifact-free clones that lack protein coding potential, we extracted 4280 transcripts as the largest-candidate set. Many clones in the set had EST hits, potential CpG islands surrounding the transcription start sites, and homologies with the human genome. This implies that many candidates are indeed transcribed in a regulated manner. Our results demonstrate that ncRNAs are a major functional subclass of processed transcripts in mammals.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Zinc-finger-containing proteins can be classified into evolutionary and functionally divergent protein families that share one or more domains in which a zinc ion is tetrahedrally coordinated by cysteines and histidines. The zinc finger domain defines one of the largest protein superfamilies in mammalian genomes; 46 different conserved zinc finger domains are listed in InterPro (http://www.ebi.ac.uk/InterPro). Zinc finger proteins can bind to DNA, RNA, other proteins, or lipids as a modular domain in combination with other conserved structures. Owing to this combinatorial diversity, different members of zinc finger superfamilies contribute to many distinct cellular processes, including transcriptional regulation, mRNA stability and processing, and protein turnover. Accordingly, mutations of zinc finger genes lead to aberrations in a broad spectrum of biological processes such as development, differentiation, apoptosis, and immunological responses. This study provides the first comprehensive classification of zinc finger proteins in a mammalian transcriptome. Specific detailed analysis of the SP/Kruppel-like factors and the E3 ubiquitin-ligase RING-H2 families illustrates the importance of such an analysis for a more comprehensive functional classification of large protein families. We describe the characterization of a new family of C2H2 zinc-finger-containing proteins and a new conserved domain characteristic of this family, the identification and characterization of Sp8, a new member of the Sp family of transcriptional regulators, and the identification of five new RING-H2 proteins.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The current RIKEN transcript set represents a significant proportion of the mouse transcriptome but transcripts expressed in the innate and acquired immune systems are poorly represented. In the present study we have assessed the complexity of the transcriptome expressed in mouse macrophages before and after treatment with lipopolysaccharide, a global regulator of macrophage gene expression, using existing RIKEN 19K arrays. By comparison to array profiles of other cells and tissues, we identify a large set of macrophage-enriched genes, many of which have obvious functions in endocytosis and phagocytosis. In addition, a significant number of LPS-inducible genes were identified. The data suggest that macrophages are a complex source of mRNA for transcriptome studies. To assess complexity and identify additional macrophage expressed genes, cDNA libraries were created from purified populations of macrophage and dendritic cells, a functionally related cell type. Sequence analysis revealed a high incidence of novel mRNAs within these cDNA libraries. These studies provide insights into the depths of transcriptional complexity still untapped amongst products of inducible genes, and identify macrophage and dendritic cell populations as a starting point for sampling the inducible mammalian transcriptome.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The number of mammalian transcripts identified by full-length cDNA projects and genome sequencing projects is increasing remarkably. Clustering them into a strictly nonredundant and comprehensive set provides a platform for functional analysis of the transcriptome and proteome, but the quality of the clustering and predictive usefulness have previously required manual curation to identify truncated transcripts and inappropriate clustering of closely related sequences. A Representative Transcript and Protein Sets (RTPS) pipeline was previously designed to identify the nonredundant and comprehensive set of mouse transcripts based on clustering of a large mouse full-length cDNA set (FANTOM2). Here we propose an alternative method that is more robust, requires less manual curation, and is applicable to other organisms in addition to mouse. RTPSs of human, mouse, and rat have been produced by this method and used for validation. Their comprehensiveness and quality are discussed by comparison with other clustering approaches. The RTPSs are available at ftp://fantom2.gsc.riken.go.jp/RTPS/. (C). 2004 Elsevier Inc. All rights reserved.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Scorpion toxins are common experimental tools for studies of biochemical and pharmacological properties of ion channels. The number of functionally annotated scorpion toxins is steadily growing, but the number of identified toxin sequences is increasing at much faster pace. With an estimated 100,000 different variants, bioinformatic analysis of scorpion toxins is becoming a necessary tool for their systematic functional analysis. Here, we report a bioinformatics-driven system involving scorpion toxin structural classification, functional annotation, database technology, sequence comparison, nearest neighbour analysis, and decision rules which produces highly accurate predictions of scorpion toxin functional properties. (c) 2005 Elsevier Inc. All rights reserved.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The reconstructed cellular metabolic network of Mus musculus, based on annotated genomic data, pathway databases, and currently available biochemical and physiological information, is presented. Although incomplete, it represents the first attempt to collect and characterize the metabolic network of a mammalian cell on the basis of genomic data. The reaction network is generic in nature and attempts to capture the carbon, energy, and nitrogen metabolism of the cell. The metabolic reactions were compartmentalized between the cytosol and the mitochondria, including transport reactions between the compartments and the extracellular medium. The reaction list consists of 872 internal metabolites involved in a total of 1220 reactions, whereof 473 relate to known open reading frames. Initial in silico analysis of the reconstructed model is presented.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The prediction of regulatory elements is a problem where computational methods offer great hope. Over the past few years, numerous tools have become available for this task. The purpose of the current assessment is twofold: to provide some guidance to users regarding the accuracy of currently available tools in various settings, and to provide a benchmark of data sets for assessing future tools.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Motivation: Targeting peptides direct nascent proteins to their specific subcellular compartment. Knowledge of targeting signals enables informed drug design and reliable annotation of gene products. However, due to the low similarity of such sequences and the dynamical nature of the sorting process, the computational prediction of subcellular localization of proteins is challenging. Results: We contrast the use of feed forward models as employed by the popular TargetP/SignalP predictors with a sequence-biased recurrent network model. The models are evaluated in terms of performance at the residue level and at the sequence level, and demonstrate that recurrent networks improve the overall prediction performance. Compared to the original results reported for TargetP, an ensemble of the tested models increases the accuracy by 6 and 5% on non-plant and plant data, respectively.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Motivation: The clustering of gene profiles across some experimental conditions of interest contributes significantly to the elucidation of unknown gene function, the validation of gene discoveries and the interpretation of biological processes. However, this clustering problem is not straightforward as the profiles of the genes are not all independently distributed and the expression levels may have been obtained from an experimental design involving replicated arrays. Ignoring the dependence between the gene profiles and the structure of the replicated data can result in important sources of variability in the experiments being overlooked in the analysis, with the consequent possibility of misleading inferences being made. We propose a random-effects model that provides a unified approach to the clustering of genes with correlated expression levels measured in a wide variety of experimental situations. Our model is an extension of the normal mixture model to account for the correlations between the gene profiles and to enable covariate information to be incorporated into the clustering process. Hence the model is applicable to longitudinal studies with or without replication, for example, time-course experiments by using time as a covariate, and to cross-sectional experiments by using categorical covariates to represent the different experimental classes. Results: We show that our random-effects model can be fitted by maximum likelihood via the EM algorithm for which the E(expectation) and M(maximization) steps can be implemented in closed form. Hence our model can be fitted deterministically without the need for time-consuming Monte Carlo approximations. The effectiveness of our model-based procedure for the clustering of correlated gene profiles is demonstrated on three real datasets, representing typical microarray experimental designs, covering time-course, repeated-measurement and cross-sectional data. In these examples, relevant clusters of the genes are obtained, which are supported by existing gene-function annotation. A synthetic dataset is considered too.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Membrane organization describes the orientation of a protein with respect to the membrane and can be determined by the presence, or absence, and organization within the protein sequence of two features: endoplasmic reticulum signal peptides and alpha-helical transmembrane domains. These features allow protein sequences to be classified into one of five membrane organization categories: soluble intracellular proteins, soluble secreted proteins, type I membrane proteins, type II membrane proteins, and multi- spanning membrane proteins. Generation of protein isoforms with variable membrane organizations can change a protein's subcellular localization or association with the membrane. Application of MemO, a membrane organization annotation pipeline, to the FANTOM3 Isoform Protein Sequence mouse protein set revealed that within the 8,032 transcriptional units ( TUs) with multiple protein isoforms, 573 had variation in their use of signal peptides, 1,527 had variation in their use of transmembrane domains, and 615 generated protein isoforms from distinct membrane organization classes. The mechanisms underlying these transcript variations were analyzed. While TUs were identified encoding all pairwise combinations of membrane organization categories, the most common was conversion of membrane proteins to soluble proteins. Observed within our highconfidence set were 156 TUs predicted to generate both extracellular soluble and membrane proteins, and 217 TUs generating both intracellular soluble and membrane proteins. The differential use of endoplasmic reticulum signal peptides and transmembrane domains is a common occurrence within the variable protein output of TUs. The generation of protein isoforms that are targeted to multiple subcellular locations represents a major functional consequence of transcript variation within the mouse transcriptome.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Cdca4 (Hepp) was originally identified as a gene expressed specifically in hematopoietic progenitor cells as opposed to hematopoietic stem cells. More recently, it has been shown to stimulate p53 activity and also lead to p53-independent growth inhibition when overexpressed. We independently isolated the murine Cdca4 gene in a genomic expression-based screen for genes involved in mammalian craniofacial development, and show that Cdca4 is expressed in a spatio-temporally restricted pattern during mouse embryogenesis. In addition to expression in the facial primordia including the pharyngeal arches, Cdca4 is expressed in the developing limb buds, brain, spinal cord, dorsal root ganglia, teeth, eye and hair follicles. Along with a small number of proteins from a range of species, the predicted CDCA4 protein contains a novel SERTA motif in addition to cyclin A-binding and PHD bromodomain-binding regions of homology. While the function of the SERTA domain is unknown, proteins containing this domain have previously been linked to cell cycle progression and chromatin remodelling. Using in silico database mining we have extended the number of evolutionarily conserved orthologues of known SERTA domain proteins and identified an uncharacterised member of the SERTA domain family, SERTAD4, with orthologues to date in human, mouse, rat, dog, cow, Tetraodon and chicken. Immunolocalisation of transiently and stably transfected epitope-tagged CDCA4 protein in mammalian cells suggests that it resides predominantly in the nucleus throughout all stages of the cell cycle. (c) 2006 Elsevier B.V. All rights reserved.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Craniofacial anomalies are a common feature of human congenital dysmorphology syndromes, suggesting that genes expressed in the developing face are likely to play a wider role in embryonic development. To facilitate the identification of genes involved in embryogenesis, we previously constructed an enriched cDNA library by subtracting adult mouse liver cDNA from that of embryonic day (E)10.5 mouse pharyngeal arch cDNA. From this library, 273 unique clones were sequenced and known proteins binned into functional categories in order to assess enrichment of the library (1). We have now selected 31 novel and poorly characterised genes from this library and present bioinformatic analysis to predict proteins encoded by these genes, and to detect evolutionary conservation. Of these genes 61% (19/31) showed restricted expression in the developing embryo, and a subset of these was chosen for further in silico characterisation as well as experimental determination of subcellular localisation based on transient transfection of predicted full-length coding sequences into mammalian cell lines. Where a human orthologue of these genes was detected, chromosomal localisation was determined relative to known loci for human congenital disease.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Scorpion toxins are important experimental tools for characterization of vast array of ion channels and serve as scaffolds for drug design. General public database entries contain limited annotation whereby rich structure-function information from mutation studies is typically not available. SCORPION2 contains more than 800 records of native and mutant toxin sequences enriched with binding affinity and toxicity information, 624 three-dimensional structures and some 500 references. SCORPION2 has a set of search and prediction tools that allow users to extract and perform specific queries: text searches of scorpion toxin records, sequence similarity search, extraction of sequences, visualization of scorpion toxin structures, analysis of toxic activity, and functional annotation of previously uncharacterized scorpion toxins. The SCORPION2 database is available at http://sdmc.i2r.a-star.edu.sg/scorpion/. (c) 2006 Elsevier Ltd. All rights reserved.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Large-scale gene discovery has been performed for the grass fungal endophytes Neotyphodium coenophialum, Neotyphodium lolii, and Epichloe festucae. The resulting sequences have been annotated by comparison with public DNA and protein sequence databases and using intermediate gene ontology annotation tools. Endophyte sequences have also been analysed for the presence of simple sequence repeat and single nucleotide polymorphism molecular genetic markers. Sequences and annotation are maintained within a MySQL database that may be queried using a custom web interface. Two cDNA-based microarrays have been generated from this genome resource, They permit the interrogation of 3806 Neotyphodium genes (Nchip (TM) rnicroarray), and 4195 Neotyphodium and 920 Epichloe genes (EndoChip (TM) microarray), respectively. These microarrays provide tools for high-throughput transcriptome analysis, including genome-specific gene expression studies, profiling of novel endophyte genes, and investigation of the host grass-symbiont interaction. Comparative transcriptome analysis in Neotyphodium and Epichloe was performed. (c) 2006 Elsevier