944 results for automated text classification


Relevance:

30.00%

Abstract:

In order to support the structural genomic initiatives, both by rapidly classifying newly determined structures and by suggesting suitable targets for structure determination, we have recently developed several new protocols for classifying structures in the CATH domain database (http://www.biochem.ucl.ac.uk/bsm/cath). These aim to increase the speed of classification of new structures using fast algorithms for structure comparison (GRATH) and to improve the sensitivity in recognising distant structural relatives by incorporating sequence information from relatives in the genomes (DomainFinder). In order to ensure the integrity of the database given the expected increase in data, the CATH Protein Family Database (CATH-PFDB), which currently includes 25 320 structural domains and a further 160 000 sequence relatives, has now been installed in a relational ORACLE database. This was essential for developing more rigorous validation procedures and for allowing efficient querying of the database, particularly for genome analysis. The associated Dictionary of Homologous Superfamilies [Bray,J.E., Todd,A.E., Pearl,F.M.G., Thornton,J.M. and Orengo,C.A. (2000) Protein Eng., 13, 153–165], which provides multiple structural alignments and functional information to assist in assigning new relatives, has also been expanded recently and now includes information for 903 homologous superfamilies. In order to improve coverage of known structures, preliminary classification levels are now provided for new structures at interim stages in the classification protocol. Since a large proportion of new structures can be rapidly classified using profile-based sequence analysis [e.g. PSI-BLAST: Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402], this provides preliminary classification for easily recognisable homologues, which in the latest release of CATH (version 1.7) represented nearly three-quarters of the non-identical structures.

Relevance:

30.00%

Abstract:

The Dali Domain Dictionary (http://www.ebi.ac.uk/dali/domain) is a numerical taxonomy of all known structures in the Protein Data Bank (PDB). The taxonomy is derived fully automatically from measurements of structural, functional and sequence similarities. Here, we report the extension of the classification to match the traditional four hierarchical levels corresponding to: (i) supersecondary structural motifs (attractors in fold space), (ii) the topology of globular domains (fold types), (iii) remote homologues (functional families) and (iv) homologues with sequence identity above 25% (sequence families). The computational definitions of attractors and functional families are new. In September 2000, the Dali classification contained 10 531 PDB entries comprising 17 101 chains, which were partitioned into five attractor regions, 1375 fold types, 2582 functional families and 3724 domain sequence families. Sequence families were further associated with 99 582 unique homologous sequences in the HSSP database, which increases the number of effectively known structures several-fold. The resulting database contains the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores and a comprehensive library of explicit multiple alignments of distantly related protein families.
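The 25% sequence-identity cutoff used for level (iv) can be illustrated with a small sketch. It assumes two sequences that have already been aligned, with `-` marking gaps, and measures identity only over columns where both sequences have a residue. The function names and toy sequences are hypothetical, and this is not the Dali implementation.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences carry a residue.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def same_sequence_family(a: str, b: str, cutoff: float = 25.0) -> bool:
    """True when identity exceeds the cutoff (25% is the figure quoted above)."""
    return percent_identity(a, b) > cutoff


# Hypothetical aligned toy sequences.
print(percent_identity("ACDE", "ACDF"))
print(same_sequence_family("ACDEFGHI", "AKLMNPQR"))
```

Other identity definitions exist (for example, dividing by the length of the shorter sequence), so the denominator here is one choice among several.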

Relevance:

30.00%

Abstract:

The database of Clusters of Orthologous Groups of proteins (COGs), which represents an attempt at a phylogenetic classification of the proteins encoded in complete genomes, currently consists of 2791 COGs including 45 350 proteins from 30 genomes of bacteria, archaea and the yeast Saccharomyces cerevisiae (http://www.ncbi.nlm.nih.gov/COG). In addition, a supplement to the COGs is available, which includes proteins encoded in the genomes of two multicellular eukaryotes, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, that are shared with bacteria and/or archaea. The new features added to the COG database include information pages with structural and functional details on each COG and literature references, improvements of the COGNITOR program that is used to fit new proteins into the COGs, and classification of genomes and COGs constructed by using principal component analysis.

Relevance:

30.00%

Abstract:

The SWISS-PROT group at EBI has developed the Proteome Analysis Database utilising existing resources and providing comparative analysis of the predicted protein coding sequences of the complete genomes of bacteria, archaea and eukaryotes (http://www.ebi.ac.uk/proteome/). The two main projects used, InterPro and CluSTr, give a new perspective on families, domains and sites and cover 31–67% (InterPro statistics) of the proteins from each of the complete genomes. CluSTr covers the three complete eukaryotic genomes and the incomplete human genome data. The Proteome Analysis Database is accompanied by a program that has been designed to carry out InterPro proteome comparisons for any one proteome against any other one or more of the proteomes in the database.

Relevance:

30.00%

Abstract:

The iProClass database is an integrated resource that provides comprehensive family relationships and structural and functional features of proteins, with rich links to various databases. It is extended from ProClass, a protein family database that integrates PIR superfamilies and PROSITE motifs. iProClass currently consists of more than 200 000 non-redundant PIR and SWISS-PROT proteins organized with more than 28 000 superfamilies, 2600 domains, 1300 motifs, 280 post-translational modification sites and links to more than 30 databases of protein families, structures, functions, genes, genomes, literature and taxonomy. Protein and family summary reports provide rich annotations, including membership information with length, taxonomy and keyword statistics, full family relationships, comprehensive enzyme and PDB cross-references and graphical feature display. The database facilitates classification-driven annotation for protein sequence databases and complete genomes, and supports structural and functional genomic research. iProClass is implemented in the Oracle 8i object-relational system and is available for sequence search and report retrieval at http://pir.georgetown.edu/iproclass/.

Relevance:

30.00%

Abstract:

TIGRFAMs is a collection of protein families featuring curated multiple sequence alignments, hidden Markov models and associated information designed to support the automated functional identification of proteins by sequence homology. We introduce the term ‘equivalog’ to describe members of a set of homologous proteins that are conserved with respect to function since their last common ancestor. Related proteins are grouped into equivalog families where possible, and otherwise into protein families with other hierarchically defined homology types. TIGRFAMs currently contains over 800 protein families, available for searching or downloading at www.tigr.org/TIGRFAMs. Classification by equivalog family, where achievable, complements classification by orthology, superfamily, domain or motif. It provides the information best suited for automatic assignment of specific functions to proteins from large-scale genome sequencing projects.

Relevance:

30.00%

Abstract:

The Identification and Classification of Bacteria (ICB) database (http://www.mbio.co.jp/icb) contains currently available information about the DNA gyrase subunit B (gyrB) gene in bacteria. The database is designed to provide the scientific community with a reference point for using gyrB as an evolutionary and taxonomic marker. Nucleic and amino acid sequence data are currently available for over 850 strains, along with alignments at several different taxonomic levels and an exhaustive review of primer selection and background information.

Relevance:

30.00%

Abstract:

Macromolecular transport systems in bacteria currently are classified by function and sequence comparisons into five basic types. In this classification system, type II and type IV secretion systems both possess members of a superfamily of genes for putative NTP hydrolase (NTPase) proteins that are strikingly similar in structure, function, and sequence. These include VirB11, TrbB, TraG, GspE, PilB, PilT, and ComG1. The predicted protein product of tadA, a recently discovered gene required for tenacious adherence of Actinobacillus actinomycetemcomitans, also has significant sequence similarity to members of this superfamily and to several unclassified and uncharacterized gene products of both Archaea and Bacteria. To understand the relationship of tadA and tadA-like genes to those encoding the putative NTPases of type II/IV secretion, we used a phylogenetic approach to obtain a genealogy of 148 NTPase genes and reconstruct a scenario of gene superfamily evolution. In this phylogeny, clear distinctions can be made between type II and type IV families and their constituent subfamilies. In addition, the subgroup containing tadA constitutes a novel and extremely widespread subfamily of the family encompassing all putative NTPases of type IV secretion systems. We report diagnostic amino acid residue positions for each major monophyletic family and subfamily in the phylogenetic tree, and we propose an easy method for precisely classifying and naming putative NTPase genes based on phylogeny. This molecular key-based method can be applied to other gene superfamilies and represents a valuable tool for genome analysis.

Relevance:

30.00%

Abstract:

Precise classification of tumors is critically important for cancer diagnosis and treatment. It is also a scientifically challenging task. Recently, efforts have been made to use gene expression profiles to improve the precision of classification, with limited success. Using a published data set for purposes of comparison, we introduce a methodology based on classification trees and demonstrate that it is significantly more accurate for discriminating among distinct colon cancer tissues than other statistical approaches used heretofore. In addition, competing classification trees are displayed, which suggest that different genes may coregulate colon cancers.
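To make the idea of a classification tree concrete, the sketch below finds the single best split of a depth-one tree by minimizing weighted Gini impurity over expression values. Gini impurity, the function names and the four toy samples are all assumptions for illustration; the abstract does not say which splitting criterion or software the authors used.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(rows, labels):
    """Return (gene_index, threshold, weighted_gini) for the best single split."""
    best = None
    n = len(rows)
    for g in range(len(rows[0])):
        values = sorted(set(r[g] for r in rows))
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[g] <= t]
            right = [y for r, y in zip(rows, labels) if r[g] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (g, t, score)
    return best


# Hypothetical expression levels: one row per tissue sample, one column per gene.
rows = [[1.0, 5.0], [1.2, 4.8], [3.0, 5.1], [3.3, 4.9]]
labels = ["normal", "normal", "tumor", "tumor"]
gene, threshold, impurity = best_split(rows, labels)
print(gene, threshold, impurity)
```

A full tree would apply the same search recursively to each side of the split. Keeping the runner-up splits instead of only the best one is one way to obtain competing trees built on different genes.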

Relevance:

30.00%

Abstract:

The 5' noncoding region of poliovirus RNA contains an internal ribosome entry site (IRES) for cap-independent initiation of translation. Utilization of the IRES requires the participation of one or more cellular proteins that mediate events in the translation initiation reaction, but whose biochemical roles have not been defined. In this report, we identify a cellular RNA binding protein isolated from the ribosomal salt wash of uninfected HeLa cells that specifically binds to stem-loop IV, a domain located in the central part of the poliovirus IRES. The protein was isolated by specific RNA affinity chromatography, and 55% of its sequence was determined by automated liquid chromatography-tandem mass spectrometry. The sequence obtained matched that of poly(rC) binding protein 2 (PCBP2), previously identified as an RNA binding protein from human cells. PCBP2, as well as a related protein, PCBP1, was over-expressed in Escherichia coli after cloning the cDNAs into an expression plasmid to produce a histidine-tagged fusion protein. Specific interaction between recombinant PCBP2 and poliovirus stem-loop IV was demonstrated by RNA mobility shift analysis. The closely related PCBP1 showed no stable interaction with the RNA. Stem-loop IV RNA containing a three nucleotide insertion that abrogates translation activity and virus viability was unable to bind PCBP2.

Relevance:

30.00%

Abstract:

Detection of loss of heterozygosity (LOH) by comparison of normal and tumor genotypes using PCR-based microsatellite loci provides considerable advantages over traditional Southern blotting-based approaches. However, current methodologies are limited by several factors, including the numbers of loci that can be evaluated for LOH in a single experiment, the discrimination of true alleles versus "stutter bands," and the use of radionucleotides in detecting PCR products. Here we describe methods for high throughput simultaneous assessment of LOH at multiple loci in human tumors; these methods rely on the detection of amplified microsatellite loci by fluorescence-based DNA sequencing technology. Data generated by this approach are processed by several computer software programs that enable the automated linear quantitation and calculation of allelic ratios, allowing rapid ascertainment of LOH. As a test of this approach, genotypes at a series of loci on chromosome 4 were determined for 58 carcinomas of the uterine cervix. The results underscore the efficacy, sensitivity, and remarkable reproducibility of this approach to LOH detection and provide subchromosomal localization of two regions of chromosome 4 commonly altered in cervical tumors.
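The allelic-ratio calculation mentioned above can be sketched as follows. The sketch assumes that each heterozygous locus yields one fluorescence peak (height or area) per allele in both the normal and the tumor sample, and that the tumor ratio is normalized by the normal ratio. The function names, the example peak values and the 0.5 cutoff are hypothetical and are not taken from the abstract.

```python
def allelic_ratio(normal_a1, normal_a2, tumor_a1, tumor_a2):
    """Tumor allele-1/allele-2 ratio divided by the matched normal ratio."""
    return (tumor_a1 / tumor_a2) / (normal_a1 / normal_a2)


def has_loh(ratio, cutoff=0.5):
    """Flag LOH when either allele is reduced past the cutoff (assumed value)."""
    return ratio < cutoff or ratio > 1.0 / cutoff


# Hypothetical peak values for one locus in one normal/tumor pair.
ratio = allelic_ratio(1000.0, 900.0, 300.0, 880.0)
print(round(ratio, 3), has_loh(ratio))
```

Testing both `ratio < cutoff` and `ratio > 1 / cutoff` makes the call symmetric, so loss of either allele is detected regardless of which one is labeled allele 1.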

Relevance:

30.00%

Abstract:

A symbiosis-based phylogeny leads to a consistent, useful classification system for all life. "Kingdoms" and "Domains" are replaced by biological names for the most inclusive taxa: Prokarya (bacteria) and Eukarya (symbiosis-derived nucleated organisms). The earliest Eukarya, anaerobic mastigotes, hypothetically originated from permanent whole-cell fusion between members of Archaea (e.g., Thermoplasma-like organisms) and of Eubacteria (e.g., Spirochaeta-like organisms). Molecular biology, life-history, and fossil record evidence support the reunification of bacteria as Prokarya while subdividing Eukarya into uniquely defined subtaxa: Protoctista, Animalia, Fungi, and Plantae.

Relevance:

30.00%

Abstract:

Transmission of human immunodeficiency virus 1 (HIV-1) from an infected woman to her offspring during gestation and delivery was found to be influenced by the infant's major histocompatibility complex class II DRB1 alleles. Forty-six HIV-infected infants and 63 seroreverting infants, born with passively acquired anti-HIV antibodies but not becoming detectably infected, were typed by an automated nucleotide-sequence-based technique that uses low-resolution PCR to select either the simpler Taq or the more demanding T7 sequencing chemistry. One or more DR13 alleles, including DRB1*1301, 1302, and 1303, were found in 31.7% of seroreverting infants and 15.2% of those becoming HIV-infected [OR (odds ratio) = 2.6 (95% confidence interval 1.0-6.8); P = 0.048]. This association was influenced by ethnicity, being seen more strongly among the 80 Black and Hispanic children [OR = 4.3 (1.2-16.4); P = 0.023], with the most pronounced effect among Black infants where 7 of 24 seroreverters inherited these alleles with none among 12 HIV-infected infants (Haldane OR = 12.3; P = 0.037). The previously recognized association of DR13 alleles with some situations of long-term nonprogression of HIV suggests that similar mechanisms may regulate both the occurrence of infection and disease progression after infection. Upon examining for residual associations, only the DR2 allele DRB1*1501 was associated with seroreversion in Caucasoid infants (OR = 24; P = 0.004). Among Caucasoids the DRB1*03011 allele was positively associated with the occurrence of HIV infection (P = 0.03).
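The headline odds ratio can be approximated from the reported percentages. The sketch below assumes the underlying counts were 20 of 63 seroreverting infants (31.7%) and 7 of 46 HIV-infected infants (15.2%) carrying a DR13 allele; these counts are inferred from the percentages, not stated in the abstract. It also assumes a Woolf (log-scale) confidence interval, since the abstract does not name the interval method.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) with a Woolf log-scale confidence interval.

    a, b: group 1 with / without the allele; c, d: group 2 with / without.
    """
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Inferred counts (assumption): 20/63 seroreverters and 7/46 infected carry DR13.
or_, lower, upper = odds_ratio_ci(20, 43, 7, 39)
print(round(or_, 1), round(lower, 1), round(upper, 1))
```

With these inferred counts the sketch gives an odds ratio of about 2.6 with an interval of roughly 1.0 to 6.8, in line with the reported figures. The formula fails when any cell is zero, which is why a continuity correction such as the Haldane estimate quoted for the Black infants is used in that case.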