7 resultados para Maximum entropy statistical estimate
em National Center for Biotechnology Information - NCBI
Resumo:
Structural genomics aims to solve a large number of protein structures that represent the protein space. Currently an exhaustive solution for all structures seems prohibitively expensive, so the challenge is to define a relatively small set of proteins with new, currently unknown folds. This paper presents a method that assigns each protein with a probability of having an unsolved fold. The method makes extensive use of protomap, a sequence-based classification, and scop, a structure-based classification. According to protomap, the protein space encodes the relationship among proteins as a graph whose vertices correspond to 13,354 clusters of proteins. A representative fold for a cluster with at least one solved protein is determined after superposition of all scop (release 1.37) folds onto protomap clusters. Distances within the protomap graph are computed from each representative fold to the neighboring folds. The distribution of these distances is used to create a statistical model for distances among those folds that are already known and those that have yet to be discovered. The distribution of distances for solved/unsolved proteins is significantly different. This difference makes it possible to use Bayes' rule to derive a statistical estimate that any protein has a yet undetermined fold. Proteins that score the highest probability to represent a new fold constitute the target list for structural determination. Our predicted probabilities for unsolved proteins correlate very well with the proportion of new folds among recently solved structures (new scop 1.39 records) that are disjoint from our original training set.
Resumo:
Patterns in sequences of amino acid hydrophobic free energies predict secondary structures in proteins. In protein folding, matches in hydrophobic free energy statistical wavelengths appear to contribute to selective aggregation of secondary structures in “hydrophobic zippers.” In a similar setting, the use of Fourier analysis to characterize the dominant statistical wavelengths of peptide ligands’ and receptor proteins’ hydrophobic modes to predict such matches has been limited by the aliasing and end effects of short peptide lengths, as well as the broad-band, mode multiplicity of many of their frequency (power) spectra. In addition, the sequence locations of the matching modes are lost in this transformation. We make new use of three techniques to address these difficulties: (i) eigenfunction construction from the linear decomposition of the lagged covariance matrices of the ligands and receptors as hydrophobic free energy sequences; (ii) maximum entropy, complex poles power spectra, which select the dominant modes of the hydrophobic free energy sequences or their eigenfunctions; and (iii) discrete, best bases, trigonometric wavelet transformations, which confirm the dominant spectral frequencies of the eigenfunctions and locate them as (absolute valued) moduli in the peptide or receptor sequence. The leading eigenfunction of the covariance matrix of a transmembrane receptor sequence locates the same transmembrane segments seen in n-block-averaged hydropathy plots while leaving the remaining hydrophobic modes unsmoothed and available for further analyses as secondary eigenfunctions. In these receptor eigenfunctions, we find a set of statistical wavelength matches between peptide ligands and their G-protein and tyrosine kinase coupled receptors, ranging across examples from 13.10 amino acids in acid fibroblast growth factor to 2.18 residues in corticotropin releasing factor. We find that the wavelet-located receptor modes in the extracellular loops are compatible with studies of receptor chimeric exchanges and point mutations. A nonbinding corticotropin-releasing factor receptor mutant is shown to have lost the signatory mode common to the normal receptor and its ligand. Hydrophobic free energy eigenfunctions and their transformations offer new quantitative physical homologies in database searches for peptide-receptor matches.
Resumo:
In this study, we estimate the statistical significance of structure prediction by threading. We introduce a single parameter ɛ that serves as a universal measure determining the probability that the best alignment is indeed a native-like analog. Parameter ɛ takes into account both length and composition of the query sequence and the number of decoys in threading simulation. It can be computed directly from the query sequence and potential of interactions, eliminating the need for sequence reshuffling and realignment. Although our theoretical analysis is general, here we compare its predictions with the results of gapless threading. Finally we estimate the number of decoys from which the native structure can be found by existing potentials of interactions. We discuss how this analysis can be extended to determine the optimal gap penalties for any sequence-structure alignment (threading) method, thus optimizing it to maximum possible performance.
Resumo:
In this paper we propose a method to estimate by maximum likelihood the divergence time between two populations, specifically designed for the analysis of nonrecurrent rare mutations. Given the rapidly growing amount of data, rare disease mutations affecting humans seem the most suitable candidates for this method. The estimator RD, and its conditional version RDc, were derived, assuming that the population dynamics of rare alleles can be described by using a birth–death process approximation and that each mutation arose before the split of a common ancestral population into the two diverging populations. The RD estimator seems more suitable for large sample sizes and few alleles, whose age can be approximated, whereas the RDc estimator appears preferable when this is not the case. When applied to three cystic fibrosis mutations, the estimator RD could not exclude a very recent time of divergence among three Mediterranean populations. On the other hand, the divergence time between these populations and the Danish population was estimated to be, on the average, 4,500 or 15,000 years, assuming or not a selective advantage for cystic fibrosis carriers, respectively. Confidence intervals are large, however, and can probably be reduced only by analyzing more alleles or loci.
Resumo:
A maximum likelihood estimator based on the coalescent for unequal migration rates and different subpopulation sizes is developed. The method uses a Markov chain Monte Carlo approach to investigate possible genealogies with branch lengths and with migration events. Properties of the new method are shown by using simulated data from a four-population n-island model and a source–sink population model. Our estimation method as coded in migrate is tested against genetree; both programs deliver a very similar likelihood surface. The algorithm converges to the estimates fairly quickly, even when the Markov chain is started from unfavorable parameters. The method was used to estimate gene flow in the Nile valley by using mtDNA data from three human populations.
Resumo:
How fast can a protein fold? The rate of polypeptide collapse to a compact state sets an upper limit to the rate of folding. Collapse may in turn be limited by the rate of intrachain diffusion. To address this question, we have determined the rate at which two regions of an unfolded protein are brought into contact by diffusion. Our nanosecond-resolved spectroscopy shows that under strongly denaturing conditions, regions of unfolded cytochrome separated by approximately 50 residues diffuse together in 35-40 microseconds. This result leads to an estimate of approximately (1 microsecond)-1 as the upper limit for the rate of protein folding.
Resumo:
Frequencies of meiotic configurations in cytogenetic stocks are dependent on chiasma frequencies in segments defined by centromeres, breakpoints, and telomeres. The expectation maximization algorithm is proposed as a general method to perform maximum likelihood estimations of the chiasma frequencies in the intervals between such locations. The estimates can be translated via mapping functions into genetic maps of cytogenetic landmarks. One set of observational data was analyzed to exemplify application of these methods, results of which were largely concordant with other comparable data. The method was also tested by Monte Carlo simulation of frequencies of meiotic configurations from a monotelodisomic translocation heterozygote, assuming six different sample sizes. The estimate averages were always close to the values given initially to the parameters. The maximum likelihood estimation procedures can be extended readily to other kinds of cytogenetic stocks and allow the pooling of diverse cytogenetic data to collectively estimate lengths of segments, arms, and chromosomes.