132 results for sequence similarity searches
Abstract:
We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. The agreement, moreover, between the P values that we derive from this distribution and those reported by standard programs (e.g., blast and fasta) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be related distantly shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure whereas many pairs have significant similarity in terms of structure but not sequence.
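As a rough illustration of how a P value can be read off an extreme-value distribution, the sketch below evaluates the Gumbel survival function for a given score. The function names and the method-of-moments fit are assumptions made for this example; the abstract does not say how the distribution was fitted, so this is not the authors' procedure.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_p_value(score, mu, beta):
    """Probability that a chance score is at least `score` under a Gumbel
    (extreme-value) distribution with location `mu` and scale `beta`."""
    z = (score - mu) / beta
    if z < -700:  # exp(-z) would overflow; the P value is 1 to machine precision
        return 1.0
    # 1 - exp(-exp(-z)), written with expm1 to stay accurate for tiny P values
    return -math.expm1(-math.exp(-z))


def fit_gumbel_moments(scores):
    """Method-of-moments estimate of (mu, beta) from observed scores.
    Assumption for illustration only: the paper's fitting method may differ."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta
```

A typical use would be to fit `mu` and `beta` once on the all-vs.-all scores and then call `gumbel_p_value` for each comparison; higher scores give smaller P values.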
Abstract:
Streptomyces lavendulae produces complestatin, a cyclic peptide natural product that antagonizes pharmacologically relevant protein–protein interactions including formation of the C4b,2b complex in the complement cascade and gp120-CD4 binding in the HIV life cycle. Complestatin, a member of the vancomycin group of natural products, consists of an α-ketoacyl hexapeptide backbone modified by oxidative phenolic couplings and halogenations. The entire complestatin biosynthetic and regulatory gene cluster spanning ca. 50 kb was cloned and sequenced. It consisted of 16 ORFs, encoding proteins homologous to nonribosomal peptide synthetases, cytochrome P450-related oxidases, ferredoxins, nonheme halogenases, four enzymes involved in 4-hydroxyphenylglycine (Hpg) biosynthesis, transcriptional regulators, and ABC transporters. The nonribosomal peptide synthetase consisted of a priming module, six extending modules, and a terminal thioesterase; their arrangement and domain content were entirely consistent with functions required for the biosynthesis of a heptapeptide or α-ketoacyl hexapeptide backbone. Two oxidase genes were proposed to be responsible for the construction of the unique aryl-ether-aryl-aryl linkage on the linear heptapeptide intermediate. Hpg, 3,5-dichloro-Hpg, and 3,5-dichloro-hydroxybenzoylformate are unusual building blocks that represent five of the seven requisite monomers in the complestatin peptide. Heterologous expression and biochemical analysis of 4-hydroxyphenylglycine transaminase confirmed its role as an aminotransferase responsible for formation of all three precursors. The close similarity but functional divergence between complestatin and chloroeremomycin biosynthetic genes also presents a unique opportunity for the construction of hybrid vancomycin-type antibiotics.
Abstract:
In this paper, a new way to think about, and to construct, pairwise as well as multiple alignments of DNA and protein sequences is proposed. Rather than forcing alignments to either align single residues or to introduce gaps by defining an alignment as a path running right from the source up to the sink in the associated dot-matrix diagram, we propose to consider alignments as consistent equivalence relations defined on the set of all positions occurring in all sequences under consideration. We also propose constructing alignments from whole segments exhibiting highly significant overall similarity rather than by aligning individual residues. Consequently, we present an alignment algorithm that (i) is based on segment-to-segment comparison instead of the commonly used residue-to-residue comparison and which (ii) avoids the well-known difficulties concerning the choice of appropriate gap penalties: gaps are not treated explicitly, but remain as those parts of the sequences that do not belong to any of the aligned segments. Finally, we discuss the application of our algorithm to two test examples and compare it with commonly used alignment methods. As a first example, we aligned a set of 11 DNA sequences coding for functional helix-loop-helix proteins. Though the sequences show only low overall similarity, our program correctly aligned all of the 11 functional sites, which was a unique result among the methods tested. As a by-product, the reading frames of the sequences were identified. Next, we aligned a set of ribonuclease H proteins and compared our results with alignments produced by other programs as reported by McClure et al. [McClure, M. A., Vasi, T. K. & Fitch, W. M. (1994) Mol. Biol. Evol. 11, 571-592]. Our program was one of the best scoring programs. However, in contrast to other methods, our protein alignments are independent of user-defined parameters.
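The segment-to-segment idea can be sketched for the pairwise case as follows: collect gap-free segment pairs, then keep the highest-weight subset that appears in the same order in both sequences, leaving everything else unaligned. This is a simplified illustration and not the authors' algorithm: it assumes segments are runs of identical residues and weights each by its length, whereas the abstract describes selecting segments by the significance of their overall similarity. The function names and `min_len` parameter are hypothetical.

```python
def gap_free_segments(a, b, min_len=3):
    """Maximal runs of identical residues on each diagonal of the dot matrix,
    returned as (start_in_a, start_in_b, length). Simplifying assumption:
    only exact matches of at least `min_len` residues count as segments."""
    segments = []
    for offset in range(-(len(b) - 1), len(a)):
        i = max(offset, 0)
        j = i - offset
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    segments.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # run that reaches the end of the diagonal
            segments.append((i - run, j - run, run))
    return segments


def best_consistent_chain(segments):
    """Highest total-length subset of segments that do not overlap and keep
    the same order in both sequences (quadratic-time dynamic programming)."""
    segs = sorted(segments)
    if not segs:
        return []
    best = [length for (_, _, length) in segs]  # best chain weight ending at k
    prev = [None] * len(segs)
    for k, (ia, ib, length) in enumerate(segs):
        for m in range(k):
            ja, jb, jl = segs[m]
            # segment m must end before segment k starts in both sequences
            if ja + jl <= ia and jb + jl <= ib and best[m] + length > best[k]:
                best[k] = best[m] + length
                prev[k] = m
    k = max(range(len(segs)), key=best.__getitem__)
    chain = []
    while k is not None:
        chain.append(segs[k])
        k = prev[k]
    return chain[::-1]
```

Residues outside the returned chain are simply left unaligned, which mirrors the point that gaps need no explicit penalty in this formulation.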
Abstract:
The rearrangement of antibody and T-cell receptor gene segments is indispensable to the vertebrate immune response. All extant jawed vertebrates can rearrange these gene segments. This ability is conferred by the recombination activating genes I and II (RAG I and RAG II). To elucidate their origin and function, the cDNA encoding RAG I from a member of the most ancient class of extant gnathostomes, the Carcharhine sharks, was characterized. Homology domains identified within shark RAG I prompted sequence comparison analyses that suggested similarity of the RAG I and II genes, respectively, to the integrase family genes and integration host factor genes of the bacterial site-specific recombination system. Thus, the apparent explosive evolution (or "big bang") of the ancestral immune system may have been initiated by a transfer of microbial site-specific recombinases.
Abstract:
Gene recognition is one of the most important problems in computational molecular biology. Previous attempts to solve this problem were based on statistics, and applications of combinatorial methods for gene recognition were almost unexplored. Recent advances in large-scale cDNA sequencing open a way toward a new approach to gene recognition that uses previously sequenced genes as a clue for recognition of newly sequenced genes. This paper describes a spliced alignment algorithm and software tool that explores all possible exon assemblies in polynomial time and finds the multiexon structure with the best fit to a related protein. Unlike other existing methods, the algorithm successfully recognizes genes even in the case of short exons or exons with unusual codon usage; we also report correct assemblies for genes with more than 10 exons. On a test sample of human genes with known mammalian relatives, the average correlation between the predicted and actual proteins was 99%. The algorithm correctly reconstructed 87% of genes and the rare discrepancies between the predicted and real exon-intron structures were caused either by short (less than 5 amino acids) initial/terminal exons or by alternative splicing. Moreover, the algorithm predicts human genes reasonably well when the homologous protein is nonvertebrate or even prokaryotic. The surprisingly good performance of the method was confirmed by extensive simulations: in particular, with target proteins at 160 accepted point mutations (PAM) (25% similarity), the correlation between the predicted and actual genes was still as high as 95%.
Abstract:
We describe a conserved family of bacterial gene products that includes the VirB1 virulence factor encoded by tumor-inducing plasmids of Agrobacterium spp., proteins involved in conjugative DNA transfer of broad-host-range bacterial plasmids, and gene products that may be involved in invasion by Shigella spp. and Salmonella enterica. Sequence analysis and structural modeling show that the proteins in this group are related to chicken egg white lysozyme and are likely to adopt a lysozyme-like structural fold. Based on their similarity to lysozyme, we predict that these proteins have glycosidase activity. Iterative database searches with three conserved sequence motifs from this protein family detect a more distant relationship to bacterial and bacteriophage lytic transglycosylases, and goose egg white lysozyme. Two acidic residues in the VirB1 protein of Agrobacterium tumefaciens form a putative catalytic dyad. Each of these residues was changed into the corresponding amide by site-directed mutagenesis. Strains of A. tumefaciens that express mutated VirB1 proteins have a significantly reduced virulence. We hypothesize that many bacterial proteins involved in export of macromolecules belong to a widespread class of hydrolases and cleave beta-1,4-glycosidic bonds as part of their function.
Abstract:
A 69-kDa proteinase (P69), a member of the pathogenesis-related proteins, is induced and accumulates in tomato (Lycopersicon esculentum) plants as a consequence of pathogen attack. We have used the polymerase chain reaction to identify and clone a cDNA from tomato plants that represents the pathogenesis-related P69 proteinase. The nucleotide sequence analysis revealed that P69 is synthesized in a preproenzyme form, a 745-amino acid polypeptide with a 22-amino acid signal peptide, a 92-amino acid propolypeptide, and a 631-amino acid mature polypeptide. Within the mature region the most salient feature was the presence of domains homologous to the subtilisin serine protease family. The amino acid sequences surrounding Asp-146, His-203, and Ser-532 of P69 are closely related to the catalytic sites (catalytic triad) of the subtilisin-like proteases. Northern blot analysis revealed that the 2.4-kb P69 mRNA accumulates abundantly in leaves and stem tissues from viroid-infected plants, whereas the mRNA levels in tissues from healthy plants were undetectable. Our results indicate that P69, a secreted calcium-activated endopeptidase, is a plant pathogenesis-related subtilisin-like proteinase that may collaborate with other defensive proteins in a general mechanism of active defense against attacking pathogens.
Abstract:
We have implemented an approach for the detection of DNA alterations in cancer by means of computerized analysis of end-labeled genomic fragments, separated in two dimensions. Analysis of two-dimensional patterns of neuroblastoma tumors, prepared by first digesting DNA with the methylation-sensitive restriction enzyme Not I, yielded a multicopy fragment which was detected in some tumor patterns but not in normal controls. Cloning and sequencing of the fragment, isolated from two-dimensional gels, yielded a sequence with strong homology to a subtelomeric sequence in chimpanzees that was previously reported to be undetectable in humans. Fluorescence in situ hybridization indicated the occurrence of this sequence in normal tissue, for the most part in the satellite regions of acrocentric chromosomes. A product containing this sequence was obtained by telomere-anchored PCR using as a primer an oligonucleotide sequence from the cloned fragment. Our data suggest demethylation of cytosines at the cloned Not I site and in neighboring DNA in some tumors, compared with normal tissue, and suggest a greater similarity between human and chimpanzee subtelomeric sequences than was previously reported.
Abstract:
Autonomously replicating sequence (ARS) elements of the fission yeast Schizosaccharomyces pombe contain multiple imperfect copies of the consensus sequence reported by Maundrell et al. [Maundrell, K., Hutchison, A. & Shall, S. (1988) EMBO J. 7, 2203-2209]. When cell-free extracts of S. pombe were incubated with a dimer or tetramer of an oligonucleotide containing the ARS consensus sequence, several complexes were detected using a gel mobility-shift assay. The proteins forming these complexes also bind ars3002, which is the most active origin in the ura4 region of chromosome III of S. pombe. One protein, partly responsible for the binding activity observed with crude extracts, was purified to near homogeneity. It is a 60-kDa protein and was named ARS-binding protein 1 (Abp1). Abp1 preferentially binds to multiple sites in ARS 3002 and to the DNA polymer poly[d(A.T)]. The cloning and sequence of the gene coding for Abp1 revealed that it encodes a protein of 59.8 kDa (522 amino acids). Abp1 has significant homology (25% identity, 50% similarity) to the N-terminal region (approximately 300 amino acids) of the human and mouse centromere DNA-binding protein CENP-B. Because centromeres of S. pombe contain a high density of ARS elements, Abp1 may play a role connecting DNA replication and chromosome segregation.
Abstract:
Expansins are unusual proteins discovered by virtue of their ability to mediate cell wall extension in plants. We identified cDNA clones for two cucumber expansins on the basis of peptide sequences of proteins purified from cucumber hypocotyls. The expansin cDNAs encode related proteins with signal peptides predicted to direct protein secretion to the cell wall. Northern blot analysis showed moderate transcript abundance in the growing region of the hypocotyl and no detectable transcripts in the nongrowing region. Rice and Arabidopsis expansin cDNAs were identified from collections of anonymous cDNAs (expressed sequence tags). Sequence comparisons indicate at least four distinct expansin cDNAs in rice and at least six in Arabidopsis. Expansins are highly conserved in size and sequence (60-87% amino acid sequence identity and 75-95% similarity in any pairwise comparison), and phylogenetic trees indicate that this multigene family formed before the evolutionary divergence of monocotyledons and dicotyledons. Sequence and motif analyses show no similarities to known functional domains that might account for expansin action on wall extension. A series of highly conserved tryptophans may function in expansin binding to cellulose or other glycans. The high conservation of this multigene family indicates that the mechanism by which expansins promote wall extension tolerates little variation in protein structure.
Abstract:
Simple sequence repeats (SSRs), consisting of tandemly repeated multiple copies of mono-, di-, tri-, or tetranucleotide motifs, are ubiquitous in eukaryotic genomes and are frequently used as genetic markers, taking advantage of their length polymorphism. We have examined the polymorphism of such sequences in the chloroplast genomes of plants, by using a PCR-based assay. GenBank searches identified the presence of several (dA)n.(dT)n mononucleotide stretches in chloroplast genomes. A chloroplast (cp) SSR was identified in three pine species (Pinus contorta, Pinus sylvestris, and Pinus thunbergii) 312 bp upstream of the psbA gene. DNA amplification of this repeated region from 11 pine species identified nine length variants. The polymorphic amplified fragments were isolated and the DNA sequence was determined, confirming that the length polymorphism was caused by variation in the length of the repeated region. In the pines, the chloroplast genome is transmitted through pollen and this PCR assay may be used to monitor gene flow in this genus. Analysis of 305 individuals from seven populations of Pinus leucodermis Ant. revealed the presence of four variants with intrapopulational diversities ranging from 0.000 to 0.629 and an average of 0.320. Restriction fragment length polymorphism analysis of cpDNA on the same populations previously failed to detect any variation. Population subdivision based on cpSSR was higher (Gst = 0.22, where Gst is coefficient of gene differentiation) than that revealed in a previous isozyme study (Gst = 0.05). We anticipate that SSR loci within the chloroplast genome should provide a highly informative assay for the analysis of the genetic structure of plant populations.
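To make the diversity and Gst figures concrete, the sketch below computes a within-population gene diversity and the coefficient of gene differentiation from lists of haplotype (length-variant) labels. It assumes Nei's standard definitions, Gst = (Ht - Hs) / Ht with unweighted population means; the abstract does not state which estimator or sample-size correction was used, so the function names and choices here are illustrative only.

```python
from collections import Counter


def gene_diversity(haplotypes, unbiased=True):
    """Gene diversity 1 - sum(p_i^2) for one population, where p_i is the
    frequency of variant i. With unbiased=True the n/(n-1) small-sample
    correction is applied (an assumption, not taken from the abstract)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    h = 1.0 - sum((c / n) ** 2 for c in counts.values())
    if unbiased and n > 1:
        h *= n / (n - 1)
    return h


def gst(populations):
    """Coefficient of gene differentiation, (Ht - Hs) / Ht, using uncorrected
    diversities and giving every population equal weight."""
    k = len(populations)
    # Hs: mean within-population diversity
    hs = sum(gene_diversity(pop, unbiased=False) for pop in populations) / k
    # Ht: diversity of the pooled, population-averaged variant frequencies
    variants = set(v for pop in populations for v in pop)
    ht = 1.0 - sum(
        (sum(pop.count(v) / len(pop) for pop in populations) / k) ** 2
        for v in variants
    )
    return (ht - hs) / ht if ht > 0 else 0.0
```

For example, two populations fixed for different variants give a Gst of 1, while two populations with identical variant frequencies give 0.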
Abstract:
When expressed as part of a glutathione S-transferase fusion protein the NH2-terminal domain of the lymphocyte cell adhesion molecule CD2 is shown to adopt two different folds. The immunoglobulin superfamily structure of the major (85%) monomeric component has previously been determined by both x-ray crystallography and NMR spectroscopy. We now describe the structure of a second, dimeric, form present in about 15% of recombinant CD2 molecules. After denaturation and refolding in the absence of the fusion partner, dimeric CD2 is converted to monomer, illustrating that the dimeric form represents a metastable folded state. The crystal structure of this dimeric form, refined to 2.0-Å resolution, reveals two domains with overall similarity to the IgSF fold found in the monomer. However, in the dimer each domain is formed by the intercalation of two polypeptide chains. Hence each domain represents a distinct folding unit that can assemble in two different ways. In the dimer the two domains fold around a hydrophilic interface believed to mimic the cell adhesion interaction at the cell surface, and the formation of dimer can be regulated by mutating single residues at this interface. This unusual misfolded form of the protein, which appears to result from inter- rather than intramolecular interactions being favored by an intermediate structure formed during the folding process, illustrates that evolution of protein oligomers is possible from the sequence for a single protein domain.