13 results for Statistical Language Model
in National Center for Biotechnology Information - NCBI
Abstract:
Structural genomics aims to solve a large number of protein structures that represent the protein space. Currently an exhaustive solution for all structures seems prohibitively expensive, so the challenge is to define a relatively small set of proteins with new, currently unknown folds. This paper presents a method that assigns each protein with a probability of having an unsolved fold. The method makes extensive use of protomap, a sequence-based classification, and scop, a structure-based classification. According to protomap, the protein space encodes the relationship among proteins as a graph whose vertices correspond to 13,354 clusters of proteins. A representative fold for a cluster with at least one solved protein is determined after superposition of all scop (release 1.37) folds onto protomap clusters. Distances within the protomap graph are computed from each representative fold to the neighboring folds. The distribution of these distances is used to create a statistical model for distances among those folds that are already known and those that have yet to be discovered. The distribution of distances for solved/unsolved proteins is significantly different. This difference makes it possible to use Bayes' rule to derive a statistical estimate that any protein has a yet undetermined fold. Proteins that score the highest probability to represent a new fold constitute the target list for structural determination. Our predicted probabilities for unsolved proteins correlate very well with the proportion of new folds among recently solved structures (new scop 1.39 records) that are disjoint from our original training set.
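The Bayes' rule step described in this abstract can be illustrated with a minimal sketch. The Gaussian distance distributions, their parameters, and the prior below are hypothetical placeholders chosen for illustration; the abstract does not give the actual distributions or values.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_new_fold(distance, prior_new, like_new, like_known):
    """P(new fold | distance) from Bayes' rule for a two-class problem."""
    p_new = prior_new * like_new(distance)
    p_known = (1.0 - prior_new) * like_known(distance)
    return p_new / (p_new + p_known)

# Hypothetical likelihoods: proteins with known folds are assumed to sit
# closer to a representative fold in the graph than proteins with new folds.
def like_known(d):
    return gaussian_pdf(d, mean=2.0, std=1.0)

def like_new(d):
    return gaussian_pdf(d, mean=5.0, std=1.5)

PRIOR_NEW = 0.3  # assumed prior fraction of proteins with an unsolved fold

for d in (1.0, 3.5, 6.0):
    print(d, posterior_new_fold(d, PRIOR_NEW, like_new, like_known))
```

Under these assumed distributions the posterior rises with graph distance, which is the behavior a target list ranked by probability of a new fold would rely on.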
Abstract:
Understanding the mechanism of protein secondary structure formation is an essential part of the protein-folding puzzle. Here, we describe a simple statistical mechanical model for the formation of a β-hairpin, the minimal structural element of the antiparallel β-pleated sheet. The model accurately describes the thermodynamic and kinetic behavior of a 16-residue, β-hairpin-forming peptide, successfully explaining its two-state behavior and apparent negative activation energy for folding. The model classifies structures according to their backbone conformation, defined by 15 pairs of dihedral angles, and is further simplified by considering only the 120 structures with contiguous stretches of native pairs of backbone dihedral angles. This single sequence approximation is tested by comparison with a more complete model that includes the 2^15 possible conformations and 15 × 2^15 possible kinetic transitions. Finally, we use the model to predict the equilibrium unfolding curves and kinetics for several variants of the β-hairpin peptide.
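The state counts in this abstract can be checked by direct enumeration. With 15 pairs of dihedral angles, each either native or non-native, there are 2^15 conformations; counting those whose native pairs form one non-empty contiguous stretch gives 120. The sketch below assumes that reading of "contiguous stretches", and assumes that a kinetic transition changes exactly one pair, which is one interpretation of the 15 × 2^15 figure rather than something the abstract states.

```python
from itertools import product

N_PAIRS = 15  # pairs of backbone dihedral angles, from the abstract

def is_single_stretch(state):
    """True if the native pairs (1s) form exactly one contiguous run."""
    core = "".join(map(str, state)).strip("0")
    return len(core) > 0 and "0" not in core

# Every assignment of native (1) or non-native (0) to the 15 pairs.
all_states = list(product((0, 1), repeat=N_PAIRS))
single_stretch_states = [s for s in all_states if is_single_stretch(s)]

# Assumed transition rule: each conformation can flip any one of its pairs.
n_transitions = N_PAIRS * len(all_states)

print(len(all_states))             # 32768
print(len(single_stretch_states))  # 120
print(n_transitions)               # 491520
```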
Abstract:
Speech recognition involves three processes: extraction of acoustic indices from the speech signal, estimation of the probability that the observed index string was caused by a hypothesized utterance segment, and determination of the recognized utterance via a search among hypothesized alternatives. This paper is not concerned with the first process. Estimation of the probability of an index string involves a model of index production by any given utterance segment (e.g., a word). Hidden Markov models (HMMs) are used for this purpose [Makhoul, J. & Schwartz, R. (1995) Proc. Natl. Acad. Sci. USA 92, 9956-9963]. Their parameters are state transition probabilities and output probability distributions associated with the transitions. The Baum algorithm that obtains the values of these parameters from speech data via their successive reestimation will be described in this paper. The recognizer wishes to find the most probable utterance that could have caused the observed acoustic index string. That probability is the product of two factors: the probability that the utterance will produce the string and the probability that the speaker will wish to produce the utterance (the language model probability). Even if the vocabulary size is moderate, it is impossible to search for the utterance exhaustively. One practical algorithm is described [Viterbi, A. J. (1967) IEEE Trans. Inf. Theory IT-13, 260-267] that, given the index string, has a high likelihood of finding the most probable utterance.
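The search for the most probable state sequence of a hidden Markov model can be sketched with a textbook Viterbi recursion. This is a simplified illustration, not the recognizer described in the paper: outputs are attached to states rather than to transitions, there is no language model factor or pruning, and the state names, index symbols, and probabilities are invented for the example.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, path) of the most probable state sequence.

    All probabilities used along the way are assumed to be nonzero.
    """
    # best[s] = (log probability of the best path ending in s, that path)
    best = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            score, path = max(
                (best[prev][0] + math.log(trans_p[prev][s]), best[prev][1])
                for prev in states
            )
            new_best[s] = (score + math.log(emit_p[s][obs]), path + [s])
        best = new_best
    return max(best.values())

# Hypothetical two-state model over three acoustic index symbols.
states = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {
    "s1": {"a": 0.5, "b": 0.4, "c": 0.1},
    "s2": {"a": 0.1, "b": 0.3, "c": 0.6},
}

log_p, path = viterbi(["a", "b", "c"], states, start_p, trans_p, emit_p)
print(path)             # ['s1', 's1', 's2']
print(math.exp(log_p))  # approximately 0.01512
```

Working in log space avoids numerical underflow on long index strings, where products of many small probabilities would otherwise round to zero.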
Abstract:
A “most probable state” equilibrium statistical theory for random distributions of hetons in a closed basin is developed here in the context of two-layer quasigeostrophic models for the spreading phase of open-ocean convection. The theory depends only on bulk conserved quantities such as energy, circulation, and the range of values of potential vorticity in each layer. The simplest theory is formulated for a uniform cooling event over the entire basin that triggers a homogeneous random distribution of convective towers. For a small Rossby deformation radius typical for open-ocean convection sites, the most probable states that arise from this theory strongly resemble the saturated baroclinic states of the spreading phase of convection, with a stabilizing barotropic rim current and localized temperature anomaly.
Abstract:
The HIV Reverse Transcriptase and Protease Sequence Database is an on-line relational database that catalogs evolutionary and drug-related sequence variation in the human immunodeficiency virus (HIV) reverse transcriptase (RT) and protease enzymes, the molecular targets of anti-HIV therapy (http://hivdb.stanford.edu). The database contains a compilation of nearly all published HIV RT and protease sequences, including submissions from International Collaboration databases and sequences published in journal articles. Sequences are linked to data about the source of the sequence sample and the antiretroviral drug treatment history of the individual from whom the isolate was obtained. During the past year 3500 sequences have been added and the data model has been expanded to include drug susceptibility data on sequenced isolates. Database content has also been integrated with didactic text and the output of two sequence analysis programs.
Abstract:
The field of natural language processing (NLP) has seen a dramatic shift in both research direction and methodology in the past several years. In the past, most work in computational linguistics tended to focus on purely symbolic methods. Recently, more and more work is shifting toward hybrid methods that combine new empirical corpus-based methods, including the use of probabilistic and information-theoretic techniques, with traditional symbolic methods. This work is made possible by the recent availability of linguistic databases that add rich linguistic annotation to corpora of natural language text. Already, these methods have led to a dramatic improvement in the performance of a variety of NLP systems with similar improvement likely in the coming years. This paper focuses on these trends, surveying in particular three areas of recent progress: part-of-speech tagging, stochastic parsing, and lexical semantics.
Abstract:
We present statistical methods for analyzing replicated cDNA microarray expression data and report the results of a controlled experiment. The study was conducted to investigate inherent variability in gene expression data and the extent to which replication in an experiment produces more consistent and reliable findings. We introduce a statistical model to describe the probability that mRNA is contained in the target sample tissue, converted to probe, and ultimately detected on the slide. We also introduce a method to analyze the combined data from all replicates. Of the 288 genes considered in this controlled experiment, 32 would be expected to produce strong hybridization signals because of the known presence of repetitive sequences within them. Results based on individual replicates, however, show that there are 55, 36, and 58 highly expressed genes in replicates 1, 2, and 3, respectively. On the other hand, an analysis using the combined data from all 3 replicates reveals that only 2 of the 288 genes are incorrectly classified as expressed. Our experiment shows that any single microarray output is subject to substantial variability. By pooling data from replicates, we can provide a more reliable analysis of gene expression data. Therefore, we conclude that designing experiments with replications will greatly reduce misclassification rates. We recommend that at least three replicates be used when designing experiments that use cDNA microarrays, particularly when gene expression data from single specimens are being analyzed.
Abstract:
The availability of complete genome sequences and mRNA expression data for all genes creates new opportunities and challenges for identifying DNA sequence motifs that control gene expression. An algorithm, “MobyDick,” is presented that decomposes a set of DNA sequences into the most probable dictionary of motifs or words. This method is applicable to any set of DNA sequences: for example, all upstream regions in a genome or all genes expressed under certain conditions. Identification of words is based on a probabilistic segmentation model in which the significance of longer words is deduced from the frequency of shorter ones of various lengths, eliminating the need for a separate set of reference data to define probabilities. We have built a dictionary with 1,200 words for the 6,000 upstream regulatory regions in the yeast genome; the 500 most significant words (some with as few as 10 copies in all of the upstream regions) match 114 of 443 experimentally determined sites (a significance level of 18 standard deviations). When analyzing all of the genes up-regulated during sporulation as a group, we find many motifs in addition to the few previously identified by analyzing the expression subclusters individually. Applying MobyDick to the genes derepressed when the general repressor Tup1 is deleted, we find known as well as putative binding sites for its regulatory partners.
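A probabilistic segmentation model of this kind can be sketched with dynamic programming. The sketch assumes that the likelihood of a sequence is the sum, over every way of cutting it into dictionary words, of the product of the word probabilities; the abstract does not spell out this formula. The tiny dictionary and its probabilities are invented for illustration, and the function names are hypothetical rather than part of MobyDick.

```python
def sequence_likelihood(seq, word_probs):
    """Sum over all segmentations of seq of the product of word probabilities."""
    z = [0.0] * (len(seq) + 1)
    z[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for i in range(1, len(seq) + 1):
        for word, p in word_probs.items():
            k = len(word)
            if k <= i and seq[i - k:i] == word:
                z[i] += p * z[i - k]
    return z[len(seq)]

def best_segmentation(seq, word_probs):
    """Return (probability, words) of the single most probable segmentation."""
    best = [(1.0, [])] + [(0.0, None)] * len(seq)
    for i in range(1, len(seq) + 1):
        for word, p in word_probs.items():
            k = len(word)
            if k <= i and seq[i - k:i] == word and best[i - k][1] is not None:
                candidate = p * best[i - k][0]
                if candidate > best[i][0]:
                    best[i] = (candidate, best[i - k][1] + [word])
    return best[len(seq)]

# Hypothetical dictionary: four single letters plus one longer motif.
word_probs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.2, "TATA": 0.1}

likelihood = sequence_likelihood("GTATA", word_probs)
best_prob, best_words = best_segmentation("GTATA", word_probs)
print(likelihood)   # approximately 0.02072
print(best_words)   # ['G', 'TATA']
```

In this toy case the segmentation that uses the longer word carries almost all of the likelihood, which illustrates how a word can be judged significant relative to the shorter words it is built from.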
Abstract:
The present work develops and implements a biomathematical statement of how reciprocal connectivity drives stress-adaptive homeostasis in the corticotropic (hypothalamo-pituitary-adrenal) axis. In initial analyses with this interactive construct, we test six specific a priori hypotheses of mechanisms linking circadian (24-h) rhythmicity to pulsatile secretory output. This formulation offers a dynamic framework for later statistical estimation of unobserved in vivo neurohormone secretion and within-axis, dose-responsive interfaces in health and disease. Explication of the core dynamics of the stress-responsive corticotropic axis based on secure physiological precepts should help to unveil new biomedical hypotheses of stressor-specific system failure.
Abstract:
A statistical modeling approach is proposed for use in searching large microarray data sets for genes that have a transcriptional response to a stimulus. The approach is unrestricted with respect to the timing, magnitude or duration of the response, or the overall abundance of the transcript. The statistical model makes an accommodation for systematic heterogeneity in expression levels. Corresponding data analyses provide gene-specific information, and the approach provides a means for evaluating the statistical significance of such information. To illustrate this strategy we have derived a model to depict the profile expected for a periodically transcribed gene and used it to look for budding yeast transcripts that adhere to this profile. Using objective criteria, this method identifies 81% of the known periodic transcripts and 1,088 genes, which show significant periodicity in at least one of the three data sets analyzed. However, only one-quarter of these genes show significant oscillations in at least two data sets and can be classified as periodic with high confidence. The method provides estimates of the mean activation and deactivation times, induced and basal expression levels, and statistical measures of the precision of these estimates for each periodic transcript.
Abstract:
A model of interdependent decision making has been developed to understand group differences in socioeconomic behavior such as nonmarital fertility, school attendance, and drug use. The statistical mechanical structure of the model illustrates how the physical sciences contain useful tools for the study of socioeconomic phenomena.
Abstract:
At the forefront of debates on language are new data demonstrating infants' early acquisition of information about their native language. The data show that infants perceptually “map” critical aspects of ambient language in the first year of life before they can speak. Statistical properties of speech are picked up through exposure to ambient language. Moreover, linguistic experience alters infants' perception of speech, warping perception in the service of language. Infants' strategies are unexpected and unpredicted by historical views. A new theoretical position has emerged, and six postulates of this position are described.
Abstract:
A molecular model of poorly understood hydrophobic effects is heuristically developed using the methods of information theory. Because primitive hydrophobic effects can be tied to the probability of observing a molecular-sized cavity in the solvent, the probability distribution of the number of solvent centers in a cavity volume is modeled on the basis of the two moments available from the density and radial distribution of oxygen atoms in liquid water. The modeled distribution then yields the probability that no solvent centers are found in the cavity volume. This model is shown to account quantitatively for the central hydrophobic phenomena of cavity formation and association of inert gas solutes. The connection of information theory to statistical thermodynamics provides a basis for clarification of hydrophobic effects. The simplicity and flexibility of the approach suggest that it should permit applications to conformational equilibria of nonpolar solutes and hydrophobic residues in biopolymers.