102 results for Genomic sequence database
Abstract:
STACK is a tool for detection and visualisation of expressed transcript variation in the context of developmental and pathological states. The datasystem organises and reconstructs human transcripts from available public data in the context of expression state. The expression state of a transcript can include developmental state, pathological association, site of expression and isoform of expressed transcript. STACK consensus transcripts are reconstructed from clusters that capture and reflect the growing evidence of transcript diversity. The comprehensive capture of transcript variants is achieved by the use of a novel clustering approach that is tolerant of sub-sequence diversity and does not rely on pairwise alignment. This is in contrast with other gene indexing projects. STACK is generated at least four times a year and represents the exhaustive processing of all publicly available human EST data extracted from GenBank. This processed information can be explored through 15 tissue-specific categories, a disease-related category and a whole-body index and is accessible via WWW at http://www.sanbi.ac.za/Dbases.html. STACK represents a broadly applicable resource, as it is the only reconstructed transcript database for which the tools for its generation are also broadly available (http://www.sanbi.ac.za/CODES).
Abstract:
The emotif database is a collection of more than 170 000 highly specific and sensitive protein sequence motifs representing conserved biochemical properties and biological functions. These protein motifs are derived from 7697 sequence alignments in the BLOCKS+ database (released on June 23, 2000) and all 8244 protein sequence alignments in the PRINTS database (version 27.0) using the emotif-maker algorithm developed by Nevill-Manning et al. (Nevill-Manning,C.G., Wu,T.D. and Brutlag,D.L. (1998) Proc. Natl Acad. Sci. USA, 95, 5865–5871; Nevill-Manning,C.G., Sethi,K.S., Wu,T.D. and Brutlag,D.L. (1997) ISMB-97, 5, 202–209). Since the amino acids and the groups of amino acids in these sequence motifs represent critical positions conserved in evolution, search algorithms employing the emotif patterns can identify and classify more widely divergent sequences than methods based on global sequence similarity. The emotif protein pattern database is available at http://motif.stanford.edu/emotif/.
Abstract:
Signature databases are vital tools for identifying distant relationships in novel sequences and hence for inferring protein function. InterPro is an integrated documentation resource for protein families, domains and functional sites, which amalgamates the efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. Each InterPro entry includes a functional description, annotation, literature references and links back to the relevant member database(s). Release 2.0 of InterPro (October 2000) contains over 3000 entries, representing families, domains, repeats and sites of post-translational modification encoded by a total of 6804 different regular expressions, profiles, fingerprints and Hidden Markov Models. Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (more than 1 000 000 hits from 462 500 proteins in SWISS-PROT and TrEMBL). The database is accessible for text- and sequence-based searches at http://www.ebi.ac.uk/interpro/. Questions can be emailed to interhelp@ebi.ac.uk.
Abstract:
Methylation of cytosine in the 5 position of the pyrimidine ring is a major modification of the DNA in most organisms. In eukaryotes, the distribution and number of 5-methylcytosines (5mC) along the DNA is heritable but can also change with the developmental state of the cell and as a response to modifications of the environment. While DNA methylation probably has a number of functions, scientific interest has recently focused on the gene silencing effect methylation can have in eukaryotic cells. In particular, the discovery of changes in the methylation level during cancer development has increased the interest in this field. In the past, a vast amount of data has been generated with different levels of resolution ranging from 5mC content of total DNA to the methylation status of single nucleotides. We present here a database for DNA methylation data that attempts to unify these results in a common resource. The database is accessible via WWW (http://www.methdb.de). It stores information about the origin of the investigated sample and the experimental procedure, and contains the DNA methylation data. Query masks allow for searching for 5mC content, species, tissue, gene, sex, phenotype, sequence ID and DNA type. The output lists all available information including the relative gene expression level. DNA methylation patterns and methylation profiles are shown both as a graphical representation and as G/A/T/C/5mC-sequences or tables with sequence positions and methylation levels, respectively.
Abstract:
A new thermodynamic database for normal and modified nucleic acids has been developed. This Thermodynamic Database for Nucleic Acids (NTDB) includes sequence, structure and thermodynamic information as well as experimental methods and conditions. In this release, there are 1851 sequences containing both normal and modified nucleic acids. A user-friendly web-based interface has been developed to allow data searching under different conditions. Useful thermodynamic tools for the study of nucleic acids have been collected and linked for easy usage. NTDB is available at http://ntdb.chem.cuhk.edu.hk.
Abstract:
The Identification and Classification of Bacteria (ICB) database (http://www.mbio.co.jp/icb) contains currently available information about the DNA gyrase subunit B (gyrB) gene in bacteria. The database is designed to provide the scientific community with a reference point for using gyrB as an evolutionary and taxonomic marker. Nucleic and amino acid sequence data are currently available for over 850 strains, along with alignments at several different taxonomic levels and an exhaustive review of primer selection and background information.
Abstract:
The PlantsP database is a curated database that combines information derived from sequences with experimental functional genomics information. PlantsP focuses on plant protein kinases and protein phosphatases. The database will specifically provide a resource for information on a collection of T-DNA insertion mutants (knockouts) in each protein kinase and phosphatase in Arabidopsis thaliana. PlantsP also provides a curated view of each protein that includes a comprehensive annotation of functionally related sequence motifs, sequence family definitions, alignments and phylogenetic trees, and descriptive information drawn directly from the literature. PlantsP is available at http://PlantsP.sdsc.edu.
Abstract:
The database, called HyPaLib (for Hybrid Pattern Library), contains annotated structural elements characteristic for certain classes of structural and/or functional RNAs. These elements are described in a language specifically designed for this purpose. The language allows convenient specification of hybrid patterns, i.e. motifs consisting of sequence features and structural elements together with sequence similarity and thermodynamic constraints. We are currently developing software tools that allow a user to search sequence databases for any pattern in HyPaLib, thus providing functionality which is similar to PROSITE, but dedicated to the more complex patterns in RNA sequences. HyPaLib is available at http://bibiserv.techfak.uni-bielefeld.de/HyPa/.
Abstract:
With the completion of the determination of its entire genome sequence, one of the next major targets of Bacillus subtilis genomics is to clarify the whole gene regulatory network. To this end, the results of systematic experiments should be compared with the rich source of individual experimental results accumulated so far. Thus, we constructed a database of the upstream regulatory information of B.subtilis (DBTBS). The current version was constructed by surveying 291 references and contains information on 90 binding factors and 403 promoters. For each promoter, all of its known cis-elements are listed according to their positions, while these cis-elements are aligned to illustrate their consensus sequence for each transcription factor. All probable transcription factors coded in the genome were classified with the Pfam motifs. Using this database, we compared the character of B.subtilis promoters with that of Escherichia coli promoters. Our database is accessible at http://elmo.ims.u-tokyo.ac.jp/dbtbs/.
Abstract:
The extremely halophilic archaeon Halobacterium sp. NRC-1 can grow phototrophically by means of light-driven proton pumping by bacteriorhodopsin in the purple membrane. Here, we show by genetic analysis of the wild type, and insertion and double-frame shift mutants of Bat that this transcriptional regulator coordinates synthesis of a structural protein and a chromophore for purple membrane biogenesis in response to both light and oxygen. Analysis of the complete Halobacterium sp. NRC-1 genome sequence showed that the regulatory site, upstream activator sequence (UAS), the putative binding site for Bat upstream of the bacterio-opsin gene (bop), is also present upstream to the other Bat-regulated genes. The transcription regulator Bat contains a photoresponsive cGMP-binding (GAF) domain, and a bacterial AraC type helix–turn–helix DNA binding motif. We also provide evidence for involvement of the PAS/PAC domain of Bat in redox-sensing activity by genetic analysis of a purple membrane overproducer. Five additional Bat-like putative regulatory genes were found, which together are likely to be responsible for orchestrating the complex response of this archaeon to light and oxygen. Similarities of the bop-like UAS and transcription factors in diverse organisms, including a plant and a γ-proteobacterium, suggest an ancient origin for this regulon capable of coordinating light and oxygen responses in the three major branches of the evolutionary tree of life. Finally, sensitivity of four of five regulon genes to DNA supercoiling is demonstrated and correlated to presence of alternating purine–pyrimidine sequences (RY boxes) near the regulated promoters.
Abstract:
We describe a technique, sequence-tagged microsatellite profiling (STMP), to rapidly generate large numbers of simple sequence repeat (SSR) markers from genomic or cDNA. This technique eliminates the need for library screening to identify SSR-containing clones and provides an ∼25-fold increase in sequencing throughput compared to traditional methods. STMP generates short but characteristic nucleotide sequence tags for fragments that are present within a pool of SSR amplicons. These tags are then ligated together to form concatemers for cloning and sequencing. The analysis of thousands of tags gives rise to a representational profile of the abundance and frequency of SSRs within the DNA pool, from which low copy sequences can be identified. As each tag contains sufficient nucleotide sequence for primer design, their conversion into PCR primers allows the amplification of corresponding full-length fragments from the pool of SSR amplicons. These fragments permit the full characterisation of a SSR locus and provide flanking sequence for the development of a microsatellite marker. Alternatively, sequence tag primers can be used to directly amplify corresponding SSR loci from genomic DNA, thereby reducing the cost of developing a microsatellite marker to the synthesis of just one sequence-specific primer. We demonstrate the utility of STMP by the development of SSR markers in bread wheat.
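To make the notion of a simple sequence repeat concrete, the sketch below scans a DNA string for tandem repeats of short units. It only illustrates what an SSR looks like in sequence data and is not part of the STMP protocol; the function name `find_ssrs`, the unit-length range, and the minimum repeat count are assumptions chosen for the example.

```python
import re

def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, unit, repeat_count) for tandem repeats in a DNA string.

    Thresholds are illustrative assumptions, not values from the abstract.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # A unit of fixed length followed by at least (min_repeats - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are repeats of a shorter unit (e.g. 'ATAT' = 'AT' x 2),
            # since those are already reported at the shorter unit length.
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((m.start(), unit, len(m.group(0)) // unit_len))
    return sorted(hits)

if __name__ == "__main__":
    example = "GGC" + "AT" * 5 + "CCT" + "CAG" * 4 + "TT"
    for start, unit, count in find_ssrs(example):
        print(start, unit, count)
```

Each hit gives the 0-based start position, the repeat unit, and the number of tandem copies; the sequence on either side of a hit is what would be used for primer design.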
Abstract:
We present a method for discovering conserved sequence motifs from families of aligned protein sequences. The method has been implemented as a computer program called emotif (http://motif.stanford.edu/emotif). Given an aligned set of protein sequences, emotif generates a set of motifs with a wide range of specificities and sensitivities. emotif also can generate motifs that describe possible subfamilies of a protein superfamily. A disjunction of such motifs often can represent the entire superfamily with high specificity and sensitivity. We have used emotif to generate sets of motifs from all 7,000 protein alignments in the blocks and prints databases. The resulting database, called identify (http://motif.stanford.edu/identify), contains more than 50,000 motifs. For each alignment, the database contains several motifs having a probability of matching a false positive that ranges from 10^-10 to 10^-5. Highly specific motifs are well suited for searching entire proteomes, while generating very few false predictions. identify assigns biological functions to 25–30% of all proteins encoded by the Saccharomyces cerevisiae genome and by several bacterial genomes. In particular, identify assigned functions to 172 proteins of unknown function in the yeast genome.
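As a rough illustration of how a motif of this kind can be applied to a sequence, the sketch below treats a motif as a pattern in which each position is a single amino acid, a bracketed group of allowed amino acids, or a wildcard. The bracket syntax, the example motif, the example sequence, and the function name `find_motif` are all assumptions made for the example; none of them is taken from the emotif or identify resources.

```python
import re

# Assumed motif alphabet: the 20 standard residues, '[...]' groups, '.' wildcard.
_ALLOWED = set("ACDEFGHIKLMNPQRSTVWY[].")

def find_motif(motif, sequence):
    """Return (start, matched_text) for every occurrence of a motif.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    if not set(motif) <= _ALLOWED:
        raise ValueError("motif contains characters outside the assumed syntax")
    pattern = re.compile("(?=(" + motif + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]

if __name__ == "__main__":
    # Hypothetical motif: G, then A or S, then any residue, then K or R.
    for start, text in find_motif("G[AS].[KR]", "MGAVKLGSTRGAAK"):
        print(start, text)
```

A disjunction of several motifs, as the abstract describes for superfamilies, could be sketched by calling `find_motif` once per motif and merging the hits.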
Abstract:
We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. The agreement, moreover, between the P values that we derive from this distribution and those reported by standard programs (e.g., blast and fasta) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be related distantly shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure whereas many pairs have significant similarity in terms of structure but not sequence.
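The step from a comparison score to a P value can be sketched with the standard extreme-value (Gumbel) form, P(S > x) = 1 - exp(-exp(-(x - mu) / beta)). The code below is a minimal illustration under stated assumptions: the location `mu` and scale `beta` are estimated by the method of moments, which is a simple stand-in and not necessarily the fitting procedure the authors used, and all function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel_moments(scores):
    """Estimate Gumbel location (mu) and scale (beta) by the method of moments.

    Assumption: a moments fit is used here only for simplicity.
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta

def gumbel_p_value(score, mu, beta):
    """Probability that a chance score exceeds `score` under a Gumbel model."""
    z = (score - mu) / beta
    if z < -700:
        return 1.0  # exp(-z) would overflow; the P value is 1 to machine precision
    # 1 - exp(-exp(-z)), written with expm1 to stay accurate for small P values
    return -math.expm1(-math.exp(-z))

if __name__ == "__main__":
    background = [21.0, 25.5, 19.8, 30.2, 23.1, 27.4, 22.0, 24.9]  # made-up scores
    mu, beta = fit_gumbel_moments(background)
    print(gumbel_p_value(45.0, mu, beta))
```

Because the same functional form is applied to both kinds of score, the sequence and structure P values can be compared at a common error rate, which is the comparison the abstract reports.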
Abstract:
Olfactory receptor (OR) genes represent ≈1% of genomic coding sequence in mammals, and these genes are clustered on multiple chromosomes in both the mouse and human genomes. We have taken a comparative genomics approach to identify features that may be involved in the dynamic evolution of this gene family and in the transcriptional control that results in a single OR gene expressed per olfactory neuron. We sequenced ≈350 kb of the murine P2 OR cluster and used synteny, gene linkage, and phylogenetic analysis to identify and sequence ≈111 kb of an orthologous cluster in the human genome. In total, 18 mouse and 8 human OR genes were identified, including 7 orthologs that appear to be functional in both species. Noncoding homology is evident between orthologs and generally is confined within the transcriptional unit. We find no evidence for common regulatory features shared among paralogs, and promoter regions generally do not contain strong promoter motifs. We discuss these observations, as well as OR clustering, in the context of evolutionary expansion and transcriptional regulation of OR repertoires.
Abstract:
The root hair is a specialized cell type involved in water and nutrient uptake in plants. In legumes the root hair is also the primary site of recognition and infection by symbiotic nitrogen-fixing Rhizobium bacteria. We have studied the root hairs of Medicago truncatula, which is emerging as an increasingly important model legume for studies of symbiotic nodulation. However, only 27 genes from M. truncatula were represented in GenBank/EMBL as of October, 1997. We report here the construction of a root-hair-enriched cDNA library and single-pass sequencing of randomly selected clones. Expressed sequence tags (899 total, 603 of which have homology to known genes) were generated and made available on the Internet. We believe that the database and the associated DNA materials will provide a useful resource to the community of scientists studying the biology of roots, root tips, root hairs, and nodulation.