894 resultados para SEQUENCE DATABASES
Resumo:
In the last decade microsatellites have become one of the most useful genetic markers used in a large number of organisms due to their abundance and high level of polymorphism. Microsatellites have been used for individual identification, paternity tests, forensic studies and population genetics. Data on microsatellite abundance comes preferentially from microsatellite enriched libraries and DNA sequence databases. We have conducted a search in GenBank of more than 16,000 Schistosoma mansoni ESTs and 42,000 BAC sequences. In addition, we obtained 300 sequences from CA and AT microsatellite enriched genomic libraries. The sequences were searched for simple repeats using the RepeatMasker software. Of 16,022 ESTs, we detected 481 (3%) sequences that contained 622 microsatellites (434 perfect, 164 imperfect and 24 compounds). Of the 481 ESTs, 194 were grouped in 63 clusters containing 2 to 15 ESTs per cluster. Polymorphisms were observed in 16 clusters. The 287 remaining ESTs were orphan sequences. Of the 42,017 BAC end sequences, 1,598 (3.8%) contained microsatellites (2,335 perfect, 287 imperfect and 79 compounds). The 1,598 BAC end sequences 80 were grouped into 17 clusters containing 3 to 17 BAC end sequences per cluster. Microsatellites were present in 67 out of 300 sequences from microsatellite enriched libraries (55 perfect, 38 imperfect and 15 compounds). From all of the observed loci 55 were selected for having the longest perfect repeats and flanking regions that allowed the design of primers for PCR amplification. Additionally we describe two new polymorphic microsatellite loci.
Resumo:
There is no control over the information provided with sequences when they are deposited in the sequence databases. Consequently mistakes can seed the incorrect annotation of other sequences. Grouping genes into families and applying controlled annotation overcomes the problems of incorrect annotation associated with individual sequences. Two databases (http://www.mendel.ac.uk) were created to apply controlled annotation to plant genes and plant ESTs: Mendel-GFDb is a database of plant protein (gene) families based on gapped-BLAST analysis of all sequences in the SWISS-PROT family of databases. Sequences are aligned (ClustalW) and identical and similar residues shaded. The families are visually curated to ensure that one or more criteria, for example overall relatedness and/or domain similarity relate all sequences within a family. Sequence families are assigned a ‘Gene Family Number’ and a unified description is developed which best describes the family and its members. If authority exists the gene family is assigned a ‘Gene Family Name’. This information is placed in Mendel-GFDb. Mendel-ESTS is primarily a database of plant ESTs, which have been compared to Mendel-GFDb, completely sequenced genomes and domain databases. This approach associated ESTs with individual sequences and the controlled annotation of gene families and protein domains; the information being placed in Mendel-ESTS. The controlled annotation applied to genes and ESTs provides a basis from which a plant transcription database can be developed.
Resumo:
SBASE 8.0 is the eighth release of the SBASE library of protein domain sequences that contains 294 898 annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to most major sequence databases and sequence pattern collections. The entries are clustered into over 2005 statistically validated domain groups (SBASE-A) and 595 non-validated groups (SBASE-B), provided with several WWW-based search and browsing facilities for online use. A domain-search facility was developed, based on non-parametric pattern recognition methods, including artificial neural networks. SBASE 8.0 is freely available by anonymous ‘ftp’ file transfer from ftp.icgeb.trieste.it. Automated searching of SBASE can be carried out with the WWW servers http://www.icgeb.trieste.it/sbase/ and http://sbase.abc.hu/sbase/.
Resumo:
The release of vast quantities of DNA sequence data by large-scale genome and expressed sequence tag (EST) projects underlines the necessity for the development of efficient and inexpensive ways to link sequence databases with temporal and spatial expression profiles. Here we demonstrate the power of linking cDNA sequence data (including EST sequences) with transcript profiles revealed by cDNA-AFLP, a highly reproducible differential display method based on restriction enzyme digests and selective amplification under high stringency conditions. We have developed a computer program (GenEST) that predicts the sizes of virtual transcript-derived fragments (TDFs) of in silico-digested cDNA sequences retrieved from databases. The vast majority of the resulting virtual TDFs could be traced back among the thousands of TDFs displayed on cDNA-AFLP gels. Sequencing of the corresponding bands excised from cDNA-AFLP gels revealed no inconsistencies. As a consequence, cDNA sequence databases can be screened very efficiently to identify genes with relevant expression profiles. The other way round, it is possible to switch from cDNA-AFLP gels to sequences in the databases. Using the restriction enzyme recognition sites, the primer extensions and the estimated TDF size as identifiers, the DNA sequence(s) corresponding to a TDF with an interesting expression pattern can be identified. In this paper we show examples in both directions by analyzing the plant parasitic nematode Globodera rostochiensis. Various novel pathogenicity factors were identified by combining ESTs from the infective stage juveniles with expression profiles of ∼4000 genes in five developmental stages produced by cDNA-AFLP.
Sequence similarity analysis of Escherichia coli proteins: functional and evolutionary implications.
Resumo:
A computer analysis of 2328 protein sequences comprising about 60% of the Escherichia coli gene products was performed using methods for database screening with individual sequences and alignment blocks. A high fraction of E. coli proteins--86%--shows significant sequence similarity to other proteins in current databases; about 70% show conservation at least at the level of distantly related bacteria, and about 40% contain ancient conserved regions (ACRs) shared with eukaryotic or Archaeal proteins. For > 90% of the E. coli proteins, either functional information or sequence similarity, or both, are available. Forty-six percent of the E. coli proteins belong to 299 clusters of paralogs (intraspecies homologs) defined on the basis of pairwise similarity. Another 10% could be included in 70 superclusters using motif detection methods. The majority of the clusters contain only two to four members. In contrast, nearly 25% of all E. coli proteins belong to the four largest superclusters--namely, permeases, ATPases and GTPases with the conserved "Walker-type" motif, helix-turn-helix regulatory proteins, and NAD(FAD)-binding proteins. We conclude that bacterial protein sequences generally are highly conserved in evolution, with about 50% of all ACR-containing protein families represented among the E. coli gene products. With the current sequence databases and methods of their screening, computer analysis yields useful information on the functions and evolutionary relationships of the vast majority of genes in a bacterial genome. Sequence similarity with E. coli proteins allows the prediction of functions for a number of important eukaryotic genes, including several whose products are implicated in human diseases.
Resumo:
Several aspects of photoperception and light signal transduction have been elucidated by studies with model plants. However, the information available for economically important crops, such as Fabaceae species, is scarce. In order to incorporate the existing genomic tools into a strategy to advance soybean research, we have investigated publicly available expressed sequence tag ( EST) sequence databases in order to identify Glycine max sequences related to genes involved in light-regulated developmental control in model plants. Approximately 38,000 sequences from open-access databases were investigated, and all bona fide and putative photoreceptor gene families were found in soybean sequence databases. We have identified G. max orthologs for several families of transcriptional regulators and cytoplasmic proteins mediating photoreceptor-induced responses, although some important Arabidopsis phytochrome-signaling components are absent. Moreover, soybean and Arabidopsis gene-family homologs appear to have undergone a distinct expansion process in some cases. We propose a working model of light perception, signal transduction and response-eliciting in G. max, based on the identified key components from Arabidopsis. These results demonstrate the power of comparative genomics between model systems and crop species to elucidate several aspects of plant physiology and metabolism.
Resumo:
In this paper we included a very broad representation of grass family diversity (84% of tribes and 42% of genera). Phylogenetic inference was based on three plastid DNA regions rbcL, matK and trnL-F, using maximum parsimony and Bayesian methods. Our results resolved most of the subfamily relationships within the major clades (BEP and PACCMAD), which had previously been unclear, such as, among others the: (i) BEP and PACCMAD sister relationship, (ii) composition of clades and the sister-relationship of Ehrhartoideae and Bambusoideae + Pooideae, (iii) paraphyly of tribe Bambuseae, (iv) position of Gynerium as sister to Panicoideae, (v) phylogenetic position of Micrairoideae. With the presence of a relatively large amount of missing data, we were able to increase taxon sampling substantially in our analyses from 107 to 295 taxa. However, bootstrap support and to a lesser extent Bayesian inference posterior probabilities were generally lower in analyses involving missing data than those not including them. We produced a fully resolved phylogenetic summary tree for the grass family at subfamily level and indicated the most likely relationships of all included tribes in our analysis.
Resumo:
The MyHits web server (http://myhits.isb-sib.ch) is a new integrated service dedicated to the annotation of protein sequences and to the analysis of their domains and signatures. Guest users can use the system anonymously, with full access to (i) standard bioinformatics programs (e.g. PSI-BLAST, ClustalW, T-Coffee, Jalview); (ii) a large number of protein sequence databases, including standard (Swiss-Prot, TrEMBL) and locally developed databases (splice variants); (iii) databases of protein motifs (Prosite, Interpro); (iv) a precomputed list of matches ('hits') between the sequence and motif databases. All databases are updated on a weekly basis and the hit list is kept up to date incrementally. The MyHits server also includes a new collection of tools to generate graphical representations of pairwise and multiple sequence alignments including their annotated features. Free registration enables users to upload their own sequences and motifs to private databases. These are then made available through the same web interface and the same set of analytical tools. Registered users can manage their own sequences and annotations using only web tools and freeze their data in their private database for publication purposes.
Resumo:
The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence. The pilot phase of the Project is focused on a specified 30 megabases (approximately 1%) of the human genome sequence and is organized as an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function. The results of this pilot phase will guide future efforts to analyze the entire human genome.
Resumo:
Summary Cancer is a leading cause of morbidity and mortality in Western countries (as an example, colorectal cancer accounts for about 300'000 new cases and 200'000 deaths each year in Europe and in the USA). Despite that many patients with cancer have complete macroscopic clearance of their disease after resection, radiotherapy and/or chemotherapy, many of these patients develop fatal recurrence. Vaccination with immunogenic peptide tumor antigens has shown encouraging progresses in the last decade; immunotherapy might therefore constitute a fourth therapeutic option in the future. We dissect here and critically evaluate the numerous steps of reverse immunology, a forecast procedure to identify antigenic peptides from the sequence of a gene of interest. Bioinformatic algorithms were applied to mine sequence databases for tumor-specific transcripts. A quality assessment of publicly available sequence databanks allowed defining strengths and weaknesses of bioinformatics-based prediction of colon cancer-specific alternative splicing: new splice variants could be identified, however cancer-restricted expression could not be significantly predicted. Other sources of target transcripts were quantitatively investigated by polymerase chain reactions, as cancer-testis genes or reported overexpressed transcripts. Based on the relative expression of a defined set of housekeeping genes in colon cancer tissues, we characterized a precise procedure for accurate normalization and determined a threshold for the definition of significant overexpression of genes in cancers versus normal tissues. Further steps of reverse immunology were applied on a splice variant of the Melan¬A gene. Since it is known that the C-termini of antigenic peptides are directly produced by the proteasome, longer precursor and overlapping peptides encoded by the target sequence were synthesized chemically and digested in vitro with purified proteasome. The resulting fragments were identified by mass spectroscopy to detect cleavage sites. Using this information and based on the available anchor motifs for defined HLA class I molecules, putative antigenic peptides could be predicted. Their relative affinity for HLA molecules was confirmed experimentally with functional competitive binding assays and they were used to search patients' peripheral blood lymphocytes for the presence of specific cytolytic T lymphocytes (CTL). CTL clones specific for a splice variant of Melan-A could be isolated; although they recognized peptide-pulsed cells, they failed to lyse melanoma cells in functional assays of antigen recognition. In the conclusion, we discuss advantages and bottlenecks of reverse immunology and compare the technical aspects of this approach with the more classical procedure of direct immunology, a technique introduced by Boon and colleagues more than 10 years ago to successfully clone tumor antigens.
Resumo:
Abstract Arbuscular mycorrhizal fungi (AMF) form symbiosis with roots of approximately 80% of known land plants. These fungi play a key role in the ecology and adaptation of plants to various ecosystems.by increasing the plant resources for various nutrients. Despite their important ecological role, we still have poor understanding of their genetic structure and their molecular evolution. The work presented in this thesis aims to isolate and analyse AMF genes with various molecular techniques, in order to obtain new insights about their genetics, phylogeny and molecular evolution. Some AMF genes were shown through phylogenetic analyses to be more related with plants or mycoparasites than with other fungal organisms. These results led to the prediction that lateral gene transfers (LGT) occurred between AMF and plants during their long-term co-évolution. By phylogenetic and molecular analyses, in the chapter 2 I demonstrate that the hypothesis of LGT is most likely a consequence of analyses carried out on contaminant non AMF-DNA. In addition, various features characteristic of AMF genes have been determined, allowing researchers to scan their own sequence databases for potential non-AMF contaminants. Phylogenetic relationships of AMF with other fungi has been mostly analysed using molecular markers of ribosomal origin. In chapter 2 I successfully isolated gene encoding α- and ß-tubulins from several AMF genera. Consequently, phylogenetic analyses showed that AMF possess an unexpected relationship with ancestral aquatic fungi (chytrids). These results are consistent with the prediction stating that AMF may have played an important role in the colonisation of land by green plants through the establishment of a symbiosis and after the divergence of AMF from aquatic ancestors. In Chapter 4 I tried to isolate the entire AMF gene family encoding P-Type II ATPases, in order to determine their molecular evolution with the fungal kingdom. These genes were further analysed to detect the level of sequence polymorphism that is present within an AMF population. The results obtained show that mutational events previously thought as occurring only among divergent evolutionary lineages (gene duplications, indel mutations in coding regions) can occur within a single population of AMF. These results have far reaching consequences for our understanding of the genetics and ecology of AMF. Résumé Les champignons endomycorrhiziens arbusculaires (CEA) forment une symbiose racinaire avec environ 80% des plantes vasculaires connues. Ces champignons possèdent un rôle important dans l'écologie et l'adaptation des plantes au sein de différents écosystèmes en .augmentant leurs ressources en nutriments. Le travail présenté dans cette thèse se propose d'isoler et d'analyser certains gènes de CEA avec différentes techniques moléculaires à fin d'obtenir de pÌus amples informations concernant l'évolution moléculaire, la phylogénie et leur diversité génétique à diverses échelles taxonomiques. Certaines analyses phylogénétiques des CEA ont conduit à l'hypothèse que des transferts horizontaux de gènes (THG) ont pu avoir lieu durant leur longue co-évolution avec les plantes vasculaires. Dans le chapitre 2 de cette thèse nous démontrons par analyses moléculaire et phylogénétique que l'hypothèse de THG est une conséquence de contaminations à partir d'ADN de plante ou d'autres micro-organismes. De plus, de nombreuses caractéristiques moléculaires de CEA ont pu être déterminées, permettant la mise en place d'un plan à suivre lors de l'analyse de gènes de CEA dans les études futures. Les relations évolutives des. CEA avec d'autres champignons ont été analysées majoritairement à l'aide de marqueurs moléculaires d'origine ribosomiale. Dans les chapitres 2 et 3 j'ai isolé des gènes codant pour l'a- et la ß-tubuline chez différents genres, de CEA. Les analyses phylogénétiques ont démontré une parenté entre les CEA et des champignons aquatiques ancestraux (chytrides). Ces résultats sont en accord avec l'hypothèse selon laquelle les CEA ont probablement joué un rôle primordial dans l'établissement des plantes sur terre à travers une symbiose et suite à leur évolution à partir d'ancêtres vivant dans des milieux aquatiques: Dans le chapitre 4 j'ai isolé une entière famille de gènes chez les CEA codant des ATPases de la membrane plasmique, et étudié leur évolution moléculaire dans le règne des champignons. Ces mêmes gènes ont été analysés ultérieurement à fin de déterminer le degré de polymorphisme de séquence qui peut être présent au sein d'une population de CEA. Les résultats obtenus montrent que des évènements mutationnels considérés comme apparaissant exclusivement dans des lignées évolutives très divergentes (duplication de gènes, insertions/délétions dans des régions transcrites du génome) ont lieu sein d'une même population de CEA. Cette découverte a un impact important sur nos connaissances concernant la génétique des populations des CEA et leur écologie.
Resumo:
In recent years, both homing endonucleases (HEases) and zinc-finger nucleases (ZFNs) have been engineered and selected for the targeting of desired human loci for gene therapy. However, enzyme engineering is lengthy and expensive and the off-target effect of the manufactured endonucleases is difficult to predict. Moreover, enzymes selected to cleave a human DNA locus may not cleave the homologous locus in the genome of animal models because of sequence divergence, thus hampering attempts to assess the in vivo efficacy and safety of any engineered enzyme prior to its application in human trials. Here, we show that naturally occurring HEases can be found, that cleave desirable human targets. Some of these enzymes are also shown to cleave the homologous sequence in the genome of animal models. In addition, the distribution of off-target effects may be more predictable for native HEases. Based on our experimental observations, we present the HomeBase algorithm, database and web server that allow a high-throughput computational search and assignment of HEases for the targeting of specific loci in the human and other genomes. We validate experimentally the predicted target specificity of candidate fungal, bacterial and archaeal HEases using cell free, yeast and archaeal assays.
Resumo:
Searching for matches between large collections of short (14-30 nucleotides) words and sequence databases comprising full genomes or transcriptomes is a common task in biological sequence analysis. We investigated the performance of simple indexing strategies for handling such tasks and developed two programs, fetchGWI and tagger, that index either the database or the query set. Either strategy outperforms megablast for searches with more than 10,000 probes. FetchGWI is shown to be a versatile tool for rapidly searching multiple genomes, whose performance is limited in most cases by the speed of access to the filesystem. We have made publicly available a Web interface for searching the human, mouse, and several other genomes and transcriptomes with oligonucleotide queries.
Resumo:
The EfeUOB system of Escherichia coli is a tripartite, low pH, ferrous iron transporter. It resembles the high-affinity iron transporter (Ftr1p-Fet3p) of yeast in that EfeU is homologous to Ftr1p, an integral-membrane iron-permease. However, EfeUOB lacks an equivalent of the Fet3p component—the multicopper oxidase with three cupredoxin-like domains. EfeO and EfeB are periplasmic but their precise roles are unclear. EfeO consists primarily of a C-terminal peptidase-M75 domain with a conserved ‘HxxE’ motif potentially involved in metal binding. The smaller N-terminal domain (EfeO-N) is predicted to be cupredoxin (Cup) like, suggesting a previously unrecognised similarity between EfeO and Fet3p. Our structural modelling of the E. coli EfeO Cup domain identifies two potential metal-binding sites. Site I is predicted to bind Cu2+ using three conserved residues (C41 and 103, and E66) and M101. Of these, only one (C103) is conserved in classical cupredoxins where it also acts as a Cu ligand. Site II most probably binds Fe3+ and consists of four well conserved surface Glu residues. Phylogenetic analysis indicates that the EfeO-Cup domains form a novel Cup family, designated the ‘EfeO-Cup’ family. Structural modelling of two other representative EfeO-Cup domains indicates that different subfamilies employ distinct ligand sets at their proposed metal-binding sites. The ~100 efeO homologues in the bacterial sequence databases are all associated with various iron-transport related genes indicating a common role for EfeO-Cup proteins in iron transport, supporting a new copper-iron connection in biology.
Resumo:
BACKGROUND: In order to maintain the most comprehensive structural annotation databases we must carry out regular updates for each proteome using the latest profile-profile fold recognition methods. The ability to carry out these updates on demand is necessary to keep pace with the regular updates of sequence and structure databases. Providing the highest quality structural models requires the most intensive profile-profile fold recognition methods running with the very latest available sequence databases and fold libraries. However, running these methods on such a regular basis for every sequenced proteome requires large amounts of processing power.In this paper we describe and benchmark the JYDE (Job Yield Distribution Environment) system, which is a meta-scheduler designed to work above cluster schedulers, such as Sun Grid Engine (SGE) or Condor. We demonstrate the ability of JYDE to distribute the load of genomic-scale fold recognition across multiple independent Grid domains. We use the most recent profile-profile version of our mGenTHREADER software in order to annotate the latest version of the Human proteome against the latest sequence and structure databases in as short a time as possible. RESULTS: We show that our JYDE system is able to scale to large numbers of intensive fold recognition jobs running across several independent computer clusters. Using our JYDE system we have been able to annotate 99.9% of the protein sequences within the Human proteome in less than 24 hours, by harnessing over 500 CPUs from 3 independent Grid domains. CONCLUSION: This study clearly demonstrates the feasibility of carrying out on demand high quality structural annotations for the proteomes of major eukaryotic organisms. Specifically, we have shown that it is now possible to provide complete regular updates of profile-profile based fold recognition models for entire eukaryotic proteomes, through the use of Grid middleware such as JYDE.