846 results for Whole genome sequencing
Abstract:
Hybrid mice carrying oncogenic transgenes afford powerful systems for investigating loss of heterozygosity (LOH) in tumors. Here, we apply this approach to a neoplasm of key importance in human medicine: mammary carcinoma. We performed a whole genome search for LOH using the mouse mammary tumor virus/v-Ha-ras mammary carcinoma model in female (FVB/N × Mus musculus castaneus)F1 mice. Mammary tumors developed as expected, as well as a few tumors of a second type (uterine leiomyosarcoma) not previously associated with this transgene. Genotyping of 94 anatomically independent tumors revealed high-frequency LOH (≈38%) for markers on chromosome 4. A marked allelic bias was observed, with M. musculus castaneus alleles almost exclusively being lost. No evidence of genomic imprinting effects was noted. These data point to the presence of a tumor suppressor gene(s) on mouse chromosome 4 involved in mammary carcinogenesis induced by mutant H-ras expression, and for which a significant functional difference may exist between the M. musculus castaneus and FVB/N alleles. Provisional subchromosomal localization of this gene, designated Loh-3, can be made to a distal segment having syntenic correspondence to human chromosome 1p; LOH in this latter region is observed in several human malignancies, including breast cancers. Evidence was also obtained for a possible second locus associated with LOH with less marked allele bias on proximal chromosome 4.
Abstract:
Clusters of orthologous groups [COGs; Tatusov, R. L., Koonin, E. V. & Lipman, D. J. (1997) Science 278, 631–637] were identified for a set of 13 completely sequenced herpesviruses. Each COG represented a family of gene products conserved across several herpes genomes. These families were defined without using an arbitrary threshold criterion based on sequence similarity. The COG technique was modified so that variable stringency in COG construction was possible. High stringencies identify a core set of highly conserved genes. Varying COG stringency reveals differences in the degree of conservation between functional classes of genes. The COG data were used to construct whole-genome phylogenetic trees based on gene content. These trees agree well with trees based on other methods and are robust when tested by bootstrap analysis. The COG data also were used to construct a reciprocal tree that clustered genes with similar phylogenetic profiles. This clustering may give clues to genes with related functions or with related histories of acquisition and loss during herpesvirus evolution.
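The gene-content comparison described above can be illustrated with a small sketch. It treats each genome as the set of COGs it contains and computes a pairwise distance matrix of the kind a tree-building method could take as input. The abstract does not state which distance measure was used, so the Jaccard distance here is an assumption, and the genome names and COG identifiers are hypothetical toy data.

```python
from itertools import combinations


def gene_content_distance(profile_a, profile_b):
    """Jaccard distance between two sets of COG identifiers (assumed measure)."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return 1.0 - len(profile_a & profile_b) / len(union)


def distance_matrix(genomes):
    """Return a dict mapping (name_a, name_b) to the gene-content distance."""
    names = sorted(genomes)
    dist = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        d = gene_content_distance(genomes[a], genomes[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist


# Hypothetical toy data: which COGs are present in each genome.
toy_genomes = {
    "virus_A": {"cog1", "cog2", "cog3", "cog4"},
    "virus_B": {"cog1", "cog2", "cog3"},
    "virus_C": {"cog1", "cog5"},
}

matrix = distance_matrix(toy_genomes)
```

Transposing the same presence/absence table (COGs as rows, genomes as columns) and applying the same distance would group genes with similar phylogenetic profiles, which is the idea behind the reciprocal tree.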
Abstract:
The molecular identity and function of the Drosophila melanogaster Y-linked fertility factors have long eluded researchers. Although the D. melanogaster genome sequence was recently completed, the fertility factors still were not identified, in part because of low cloning efficiency of heterochromatic Y sequences. Here we report a method for iterative BLAST searching to assemble heterochromatic genes from shotgun assemblies, and we successfully identify kl-2 and kl-3 as 1β- and γ-dynein heavy chains, respectively. Our conclusions are supported by formal genetics with X-Y translocation lines. Reverse transcription–PCR was successful in linking together unmapped sequence fragments from the whole-genome shotgun assembly, although some sequences were missing altogether from the shotgun effort and had to be generated de novo. We also found a previously undescribed Y gene, polycystine-related (PRY). The closest paralogs of kl-2, kl-3, and PRY (and also of kl-5) are autosomal and not X-linked, suggesting that the evolution of the Drosophila Y chromosome has been driven by an accumulation of male-related genes arising de novo from the autosomes.
Abstract:
Taking advantage of the ongoing Dictyostelium genome sequencing project, we have assembled >73 kb of genomic DNA in 15 contigs harbouring 15 genes and one pseudogene of Rho-related proteins. Comparison with EST sequences revealed that every gene is interrupted by at least one and up to four introns. For racC extensive alternative splicing was identified. Northern blot analysis showed that mRNAs for racA, racE, racG, racH and racI were present at all stages of development, whereas racJ and racL were expressed only at late stages. Amino acid sequences have been analysed in the context of Rho-related proteins of other organisms. Rac1a/1b/1c, RacF1/F2 and to a lesser extent RacB and the GTPase domain of RacA can be grouped in the Rac subfamily. None of the additional Dictyostelium Rho-related proteins belongs to any of the well-defined subfamilies, like Rac, Cdc42 or Rho. RacD and RacA are unique in that they lack the prenylation motif characteristic of Rho proteins. RacD possesses a 50 residue C-terminal extension and RacA a 400 residue C-terminal extension that contains a proline-rich region, two BTB domains and a novel C-terminal domain. We have also identified homologues for RacA in Drosophila and mammals, thus defining a new subfamily of Rho proteins, RhoBTB.
Abstract:
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at the European Bioinformatics Institute (EBI) in an international collaboration with the DNA Data Bank of Japan (DDBJ) and GenBank at the NCBI (USA). Data is exchanged amongst the collaborating databases on a daily basis. The major contributors to the EMBL database are individual authors and genome project groups. Webin is the preferred web-based submission system for individual submitters, whilst automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO). Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection via ftp, email and World Wide Web interfaces. EBI’s Sequence Retrieval System (SRS), a network browser for databanks in molecular biology, integrates and links the main nucleotide and protein databases plus many specialized databases. For sequence similarity searching a variety of tools (e.g. Blitz, Fasta, BLAST) are available which allow external users to compare their own sequences against the latest data in the EMBL Nucleotide Sequence Database and SWISS-PROT.
Abstract:
FULL-malaria is a database for a full-length-enriched cDNA library from the human malaria parasite Plasmodium falciparum (http://133.11.149.55/). Because of its medical importance, this organism is the first target for genome sequencing of a eukaryotic pathogen; the sequences of two of its 14 chromosomes have already been determined. However, for the full exploitation of this rapidly accumulating information, correct identification of the genes and study of their expression are essential. Using the oligo-capping method, we have produced a full-length-enriched cDNA library from erythrocytic stage parasites and performed one-pass reading. The database consists of nucleotide sequences of 2490 random clones that include 390 (16%) known malaria genes according to BLASTN analysis of the nr-nt database in GenBank; these represent 98 genes, and the clones for 48 of these genes contain the complete protein-coding sequence (49%). On the other hand, comparisons with the complete chromosome 2 sequence revealed that 35 of 210 predicted genes are expressed, and in addition led to detection of three new gene candidates that were not previously known. In total, 19 of these 38 clones (50%) were full-length. From these observations, it is expected that the database contains ∼1000 genes, including 500 full-length clones. It should be an invaluable resource for the development of vaccines and novel drugs.
Abstract:
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources that operate on the data in GenBank and a variety of other biological data made available through NCBI’s Web site. NCBI data retrieval resources include Entrez, PubMed, LocusLink and the Taxonomy Browser. Data analysis resources include BLAST, Electronic PCR, OrfFinder, RefSeq, UniGene, HomoloGene, Database of Single Nucleotide Polymorphisms (dbSNP), Human Genome Sequencing, Human MapViewer, GeneMap’99, Human–Mouse Homology Map, Cancer Chromosome Aberration Project (CCAP), Entrez Genomes, Clusters of Orthologous Groups (COGs) database, Retroviral Genotyping Tools, Cancer Genome Anatomy Project (CGAP), SAGEmap, Gene Expression Omnibus (GEO), Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB) and the Conserved Domain Database (CDD). Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at: http://www.ncbi.nlm.nih.gov.
Abstract:
While genome sequencing projects are advancing rapidly, EST sequencing and analysis remains a primary research tool for the identification and categorization of gene sequences in a wide variety of species and an important resource for annotation of genomic sequence. The TIGR Gene Indices (http://www.tigr.org/tdb/tgi.shtml) are a collection of species-specific databases that use a highly refined protocol to analyze EST sequences in an attempt to identify the genes represented by that data and to provide additional information regarding those genes. Gene Indices are constructed by first clustering, then assembling EST and annotated gene sequences from GenBank for the targeted species. This process produces a set of unique, high-fidelity virtual transcripts, or Tentative Consensus (TC) sequences. The TC sequences can be used to provide putative genes with functional annotation, to link the transcripts to mapping and genomic sequence data, to provide links between orthologous and paralogous genes and as a resource for comparative sequence analysis.
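The "cluster first, then assemble" step can be sketched as single-linkage grouping: any two sequences that share a qualifying overlap end up in the same cluster, and each cluster would then be passed to an assembler to produce a consensus. This is only an illustration, not the TIGR protocol; the overlap pairs are assumed to have been computed elsewhere, and the sequence identifiers are hypothetical.

```python
def cluster_sequences(ids, overlaps):
    """Group sequence ids into clusters given pairs that overlap (union-find)."""
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Follow parent links to the cluster root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for seq_id in ids:
        clusters.setdefault(find(seq_id), set()).add(seq_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical ESTs and precomputed pairwise overlaps.
toy_ids = ["est1", "est2", "est3", "est4", "est5"]
toy_overlaps = [("est1", "est2"), ("est2", "est3"), ("est4", "est5")]

toy_clusters = cluster_sequences(toy_ids, toy_overlaps)
```

In this toy input, est1 and est3 never overlap directly but fall into one cluster through est2, which is the transitive behaviour that lets partial transcripts be joined before assembly.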
Abstract:
A database (SpliceDB) of known mammalian splice site sequences has been developed. We extracted 43 337 splice pairs from mammalian divisions of the gene-centered Infogene database, including sites from incomplete or alternatively spliced genes. Known EST sequences supported 22 815 of them. After discarding sequences with putative errors and ambiguous location of splice junctions, the verified dataset includes 22 489 entries. Of these, 98.71% contain canonical GT–AG junctions (22 199 entries) and 0.56% have non-canonical GC–AG splice site pairs. The remainder (0.73%) occur in many small groups (with a maximum size of 0.05%). We especially studied non-canonical splice sites, which comprise 3.73% of GenBank annotated splice pairs. EST alignments allowed us to verify only the exonic part of splice sites. To check the conserved dinucleotides we compared sequences of human non-canonical splice sites with sequences from the high throughput genome sequencing project (HTG). Out of 171 human non-canonical and EST-supported splice pairs, 156 (91.23%) had a clear match in the human HTG. After sequence analysis they can be classified as: 79 GC–AG pairs (one of which was an error corrected to GC–AG), 61 errors corrected to GT–AG canonical pairs, six AT–AC pairs (two of which were errors corrected to AT–AC), one case produced from a non-existent intron, seven cases found in HTG that were deposited to GenBank, and finally only two other cases of supported non-canonical splice pairs. The information about verified splice site sequences for canonical and non-canonical sites is presented in SpliceDB with the supporting evidence. We also built weight matrices for the major splice groups, which can be incorporated into gene prediction programs. SpliceDB is available at the computational genomic Web server of the Sanger Centre: http://genomic.sanger.ac.uk/spldb/SpliceDB.html and at http://www.softberry.com/spldb/SpliceDB.html.
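The grouping of splice pairs by their terminal dinucleotides, and the idea of a per-position matrix for a splice group, can be sketched as follows. The abstract does not describe how SpliceDB computes its weight matrices, so the plain position frequency matrix here is an assumption, and the intron sequences are hypothetical toy data.

```python
from collections import Counter


def classify_intron(intron_seq):
    """Label an intron by its first and last two bases, e.g. 'GT-AG'."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron must be at least 4 bases long")
    return f"{seq[:2]}-{seq[-2:]}"


def tally_splice_pairs(introns):
    """Count how many introns fall into each donor-acceptor class."""
    return Counter(classify_intron(seq) for seq in introns)


def position_frequency_matrix(sites):
    """Per-position base frequencies for equal-length site sequences."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for i in range(length):
        counts = Counter(site[i].upper() for site in sites)
        matrix.append({base: counts.get(base, 0) / len(sites) for base in "ACGT"})
    return matrix


# Hypothetical intron sequences (not taken from SpliceDB).
toy_introns = ["GTAAGTTTTCAG", "GTGAGTCCCTAG", "GCAAGTTTTCAG", "ATATCCTTTTAC"]
pair_counts = tally_splice_pairs(toy_introns)

# Hypothetical donor-site windows: the first six intron bases of three introns.
donor_matrix = position_frequency_matrix(["GTAAGT", "GTGAGT", "GCAAGT"])
```

A gene prediction program would typically convert such frequencies to log-odds scores against a background distribution before using them; that step is omitted here.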
Abstract:
TIGRFAMs is a collection of protein families featuring curated multiple sequence alignments, hidden Markov models and associated information designed to support the automated functional identification of proteins by sequence homology. We introduce the term ‘equivalog’ to describe members of a set of homologous proteins that are conserved with respect to function since their last common ancestor. Related proteins are grouped into equivalog families where possible, and otherwise into protein families with other hierarchically defined homology types. TIGRFAMs currently contains over 800 protein families, available for searching or downloading at www.tigr.org/TIGRFAMs. Classification by equivalog family, where achievable, complements classification by orthology, superfamily, domain or motif. It provides the information best suited for automatic assignment of specific functions to proteins from large-scale genome sequencing projects.
Abstract:
The opportunistic pathogenic bacterium Pseudomonas aeruginosa uses quorum-sensing signaling systems as global regulators of virulence genes. There are two quorum-sensing signal receptor and signal generator pairs, LasR–LasI and RhlR–RhlI. The recently completed P. aeruginosa genome-sequencing project revealed a gene coding for a homolog of the signal receptors, LasR and RhlR. Here we describe a role for this gene, which we call qscR. The qscR gene product governs the timing of quorum-sensing-controlled gene expression and it dampens virulence in an insect model. We present evidence that suggests the primary role of QscR is repression of lasI. A qscR mutant produces the LasI-generated signal prematurely, and this results in premature transcription of a number of quorum-sensing-regulated genes. When fed to Drosophila melanogaster, the qscR mutant kills the animals more rapidly than the parental P. aeruginosa. The repression of lasI by QscR could serve to ensure that quorum-sensing-controlled genes are not activated in environments where they are not useful.
Abstract:
We present here the complete genome sequence of a common avian clone of Pasteurella multocida, Pm70. The genome of Pm70 is a single circular chromosome 2,257,487 base pairs in length and contains 2,014 predicted coding regions, 6 ribosomal RNA operons, and 57 tRNAs. Genome-scale evolutionary analyses based on pairwise comparisons of 1,197 orthologous sequences between P. multocida, Haemophilus influenzae, and Escherichia coli suggest that P. multocida and H. influenzae diverged ≈270 million years ago and the γ subdivision of the proteobacteria radiated about 680 million years ago. Two previously undescribed open reading frames, accounting for ≈1% of the genome, encode large proteins with homology to the virulence-associated filamentous hemagglutinin of Bordetella pertussis. Consistent with the critical role of iron in the survival of many microbial pathogens, in silico and whole-genome microarray analyses identified more than 50 Pm70 genes with a potential role in iron acquisition and metabolism. Overall, the complete genomic sequence and preliminary functional analyses provide a foundation for future research into the mechanisms of pathogenesis and host specificity of this important multispecies pathogen.
Abstract:
For the most part, studies of grass genome structure have been limited to the generation of whole-genome genetic maps or the fine structure and sequence analysis of single genes or gene clusters. We have investigated large contiguous segments of the genomes of maize, sorghum, and rice, primarily focusing on intergenic spaces. Our data indicate that much (>50%) of the maize genome is composed of interspersed repetitive DNAs, primarily nested retrotransposons that insert between genes. These retroelements are less abundant in smaller genome plants, including rice and sorghum. Although 5- to 200-kb blocks of methylated, presumably heterochromatic, retrotransposons flank most maize genes, rice and sorghum genes are often adjacent. Similar genes are commonly found in the same relative chromosomal locations and orientations in each of these three species, although there are numerous exceptions to this collinearity (i.e., rearrangements) that can be detected at the levels of both the recombinational map and cloned DNA. Evolutionarily conserved sequences are largely confined to genes and their regulatory elements. Our results indicate that a knowledge of grass genome structure will be a useful tool for gene discovery and isolation, but the general rules and biological significance of grass genome organization remain to be determined. Moreover, the nature and frequency of exceptions to the general patterns of grass genome structure and collinearity are still largely unknown and will require extensive further investigation.
Abstract:
A whole genome cattle-hamster radiation hybrid cell panel was used to construct a map of 54 markers located on bovine chromosome 5 (BTA5). Of the 54 markers, 34 are microsatellites selected from the cattle linkage map and 20 are genes. Among the 20 mapped genes, 10 are new assignments that were made by using the comparative mapping by annotation and sequence similarity strategy. A LOD-3 radiation hybrid framework map consisting of 21 markers was constructed. The relatively low retention frequency of markers on this chromosome (19%) prevented unambiguous ordering of the other 33 markers. The length of the map is 398.7 cR, corresponding to a ratio of ≈2.8 cR5,000/cM. Type I genes were binned for comparison of gene order among cattle, humans, and mice. Multiple internal rearrangements within conserved syntenic groups were apparent upon comparison of gene order on BTA5 and HSA12 and HSA22. A similarly high number of rearrangements were observed between BTA5 and MMU6, MMU10, and MMU15. The detailed comparative map of BTA5 should facilitate identification of genes affecting economically important traits that have been mapped to this chromosome and should contribute to our understanding of mammalian chromosome evolution.
Abstract:
Unlike many pathogens that are overtly toxic to their hosts, the primary virulence determinant of Mycobacterium tuberculosis appears to be its ability to persist for years or decades within humans in a clinically latent state. Since early in the 20th century latency has been linked to hypoxic conditions within the host, but the response of M. tuberculosis to a hypoxic signal remains poorly characterized. The M. tuberculosis α-crystallin (acr) gene is powerfully and rapidly induced at reduced oxygen tensions, providing us with a means to identify regulators of the hypoxic response. Using a whole genome microarray, we identified >100 genes whose expression is rapidly altered by defined hypoxic conditions. Numerous genes involved in biosynthesis and aerobic metabolism are repressed, whereas a high proportion of the induced genes have no known function. Among the induced genes is an apparent operon that includes the putative two-component response regulator pair Rv3133c/Rv3132c. When we interrupted expression of this operon by targeted disruption of the upstream gene Rv3134c, the hypoxic regulation of acr was eliminated. These results suggest a possible role for Rv3132c/3133c/3134c in mycobacterial latency.