982 resultados para Stochastic Context-Free Grammars
Resumo:
This thesis introduces an extension of Chomsky’s context-free grammars equipped with operators for referring to left and right contexts of strings.The new model is called grammar with contexts. The semantics of these grammars are given in two equivalent ways — by language equations and by logical deduction, where a grammar is understood as a logic for the recursive definition of syntax. The motivation for grammars with contexts comes from an extensive example that completely defines the syntax and static semantics of a simple typed programming language. Grammars with contexts maintain most important practical properties of context-free grammars, including a variant of the Chomsky normal form. For grammars with one-sided contexts (that is, either left or right), there is a cubic-time tabular parsing algorithm, applicable to an arbitrary grammar. The time complexity of this algorithm can be improved to quadratic,provided that the grammar is unambiguous, that is, it only allows one parsefor every string it defines. A tabular parsing algorithm for grammars withtwo-sided contexts has fourth power time complexity. For these grammarsthere is a recognition algorithm that uses a linear amount of space. For certain subclasses of grammars with contexts there are low-degree polynomial parsing algorithms. One of them is an extension of the classical recursive descent for context-free grammars; the version for grammars with contexts still works in linear time like its prototype. Another algorithm, with time complexity varying from linear to cubic depending on the particular grammar, adapts deterministic LR parsing to the new model. If all context operators in a grammar define regular languages, then such a grammar can be transformed to an equivalent grammar without context operators at all. This allows one to represent the syntax of languages in a more succinct way by utilizing context specifications. Linear grammars with contexts turned out to be non-trivial already over a one-letter alphabet. This fact leads to some undecidability results for this family of grammars
Resumo:
We study several extensions of the notion of alternation from context-free grammars to context-sensitive and arbitrary phrase-structure grammars. Thereby new grammatical characterizations are obtained for the class of languages that are accepted by alternating pushdown automata.
Resumo:
"UIUCDCS-R-74-624"
Resumo:
Formal grammars can used for describing complex repeatable structures such as DNA sequences. In this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar. L-grammars are a special class of parallel grammars that can model the growth of living organisms, e.g. plant development, and model the morphology of a variety of organisms. We believe that parallel grammars also can be used for modeling genetic mechanisms and sequences such as promoters. Promoters are short regulatory DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species, but there are many exceptions which makes the promoter recognition a complex problem. We replace the problem of promoter recognition by induction of context-free stochastic L-grammar rules, which are later used for the structural analysis of promoter sequences. L-grammar rules are derived automatically from the drosophila and vertebrate promoter datasets using a genetic programming technique and their fitness is evaluated using a Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived L- grammar rules are analyzed and compared with natural promoter sequences.
Resumo:
Dans un premier temps, nous avons modélisé la structure d’une famille d’ARN avec une grammaire de graphes afin d’identifier les séquences qui en font partie. Plusieurs autres méthodes de modélisation ont été développées, telles que des grammaires stochastiques hors-contexte, des modèles de covariance, des profils de structures secondaires et des réseaux de contraintes. Ces méthodes de modélisation se basent sur la structure secondaire classique comparativement à nos grammaires de graphes qui se basent sur les motifs cycliques de nucléotides. Pour exemplifier notre modèle, nous avons utilisé la boucle E du ribosome qui contient le motif Sarcin-Ricin qui a été largement étudié depuis sa découverte par cristallographie aux rayons X au début des années 90. Nous avons construit une grammaire de graphes pour la structure du motif Sarcin-Ricin et avons dérivé toutes les séquences qui peuvent s’y replier. La pertinence biologique de ces séquences a été confirmée par une comparaison des séquences d’un alignement de plus de 800 séquences ribosomiques bactériennes. Cette comparaison a soulevée des alignements alternatifs pour quelques unes des séquences que nous avons supportés par des prédictions de structures secondaires et tertiaires. Les motifs cycliques de nucléotides ont été observés par les membres de notre laboratoire dans l'ARN dont la structure tertiaire a été résolue expérimentalement. Une étude des séquences et des structures tertiaires de chaque cycle composant la structure du Sarcin-Ricin a révélé que l'espace des séquences dépend grandement des interactions entre tous les nucléotides à proximité dans l’espace tridimensionnel, c’est-à-dire pas uniquement entre deux paires de bases adjacentes. Le nombre de séquences générées par la grammaire de graphes est plus petit que ceux des méthodes basées sur la structure secondaire classique. Cela suggère l’importance du contexte pour la relation entre la séquence et la structure, d’où l’utilisation d’une grammaire de graphes contextuelle plus expressive que les grammaires hors-contexte. Les grammaires de graphes que nous avons développées ne tiennent compte que de la structure tertiaire et négligent les interactions de groupes chimiques spécifiques avec des éléments extra-moléculaires, comme d’autres macromolécules ou ligands. Dans un deuxième temps et pour tenir compte de ces interactions, nous avons développé un modèle qui tient compte de la position des groupes chimiques à la surface des structures tertiaires. L’hypothèse étant que les groupes chimiques à des positions conservées dans des séquences prédéterminées actives, qui sont déplacés dans des séquences inactives pour une fonction précise, ont de plus grandes chances d’être impliqués dans des interactions avec des facteurs. En poursuivant avec l’exemple de la boucle E, nous avons cherché les groupes de cette boucle qui pourraient être impliqués dans des interactions avec des facteurs d'élongation. Une fois les groupes identifiés, on peut prédire par modélisation tridimensionnelle les séquences qui positionnent correctement ces groupes dans leurs structures tertiaires. Il existe quelques modèles pour adresser ce problème, telles que des descripteurs de molécules, des matrices d’adjacences de nucléotides et ceux basé sur la thermodynamique. Cependant, tous ces modèles utilisent une représentation trop simplifiée de la structure d’ARN, ce qui limite leur applicabilité. Nous avons appliqué notre modèle sur les structures tertiaires d’un ensemble de variants d’une séquence d’une instance du Sarcin-Ricin d’un ribosome bactérien. L’équipe de Wool à l’université de Chicago a déjà étudié cette instance expérimentalement en testant la viabilité de 12 variants. Ils ont déterminé 4 variants viables et 8 létaux. Nous avons utilisé cet ensemble de 12 séquences pour l’entraînement de notre modèle et nous avons déterminé un ensemble de propriétés essentielles à leur fonction biologique. Pour chaque variant de l’ensemble d’entraînement nous avons construit des modèles de structures tertiaires. Nous avons ensuite mesuré les charges partielles des atomes exposés sur la surface et encodé cette information dans des vecteurs. Nous avons utilisé l’analyse des composantes principales pour transformer les vecteurs en un ensemble de variables non corrélées, qu’on appelle les composantes principales. En utilisant la distance Euclidienne pondérée et l’algorithme du plus proche voisin, nous avons appliqué la technique du « Leave-One-Out Cross-Validation » pour choisir les meilleurs paramètres pour prédire l’activité d’une nouvelle séquence en la faisant correspondre à ces composantes principales. Finalement, nous avons confirmé le pouvoir prédictif du modèle à l’aide d’un nouvel ensemble de 8 variants dont la viabilité à été vérifiée expérimentalement dans notre laboratoire. En conclusion, les grammaires de graphes permettent de modéliser la relation entre la séquence et la structure d’un élément structural d’ARN, comme la boucle E contenant le motif Sarcin-Ricin du ribosome. Les applications vont de la correction à l’aide à l'alignement de séquences jusqu’au design de séquences ayant une structure prédéterminée. Nous avons également développé un modèle pour tenir compte des interactions spécifiques liées à une fonction biologique donnée, soit avec des facteurs environnants. Notre modèle est basé sur la conservation de l'exposition des groupes chimiques qui sont impliqués dans ces interactions. Ce modèle nous a permis de prédire l’activité biologique d’un ensemble de variants de la boucle E du ribosome qui se lie à des facteurs d'élongation.
Resumo:
This paper proposes a sequential coupling of a Hidden Markov Model (HMM) recognizer for offline handwritten English sentences with a probabilistic bottom-up chart parser using Stochastic Context-Free Grammars (SCFG) extracted from a text corpus. Based on extensive experiments, we conclude that syntax analysis helps to improve recognition rates significantly.
Resumo:
Multitype branching processes (MTBP) model branching structures, where the nodes of the resulting tree are particles of different types. Usually such a process is not observable in the sense of the whole tree, but only as the “generation” at a given moment in time, which consists of the number of particles of every type. This requires an EM-type algorithm to obtain a maximum likelihood (ML) estimate of the parameters of the branching process. Using a version of the inside-outside algorithm for stochastic context-free grammars (SCFG), such an estimate could be obtained for the offspring distribution of the process.
Resumo:
2000 Mathematics Subject Classification: 60J80, 60J85, 62P10, 92D25.
Resumo:
Bibliography: p. 67.
Resumo:
Continuous-valued recurrent neural networks can learn mechanisms for processing context-free languages. The dynamics of such networks is usually based on damped oscillation around fixed points in state space and requires that the dynamical components are arranged in certain ways. It is shown that qualitatively similar dynamics with similar constraints hold for a(n)b(n)c(n), a context-sensitive language. The additional difficulty with a(n)b(n)c(n), compared with the context-free language a(n)b(n), consists of 'counting up' and 'counting down' letters simultaneously. The network solution is to oscillate in two principal dimensions, one for counting up and one for counting down. This study focuses on the dynamics employed by the sequential cascaded network, in contrast to the simple recurrent network, and the use of backpropagation through time. Found solutions generalize well beyond training data, however, learning is not reliable. The contribution of this study lies in demonstrating how the dynamics in recurrent neural networks that process context-free languages can also be employed in processing some context-sensitive languages (traditionally thought of as requiring additional computation resources). This continuity of mechanism between language classes contributes to our understanding of neural networks in modelling language learning and processing.
Resumo:
The long short-term memory (LSTM) is not the only neural network which learns a context sensitive language. Second-order sequential cascaded networks (SCNs) are able to induce means from a finite fragment of a context-sensitive language for processing strings outside the training set. The dynamical behavior of the SCN is qualitatively distinct from that observed in LSTM networks. Differences in performance and dynamics are discussed.
Resumo:
Functional RNA structures play an important role both in the context of noncoding RNA transcripts as well as regulatory elements in mRNAs. Here we present a computational study to detect functional RNA structures within the ENCODE regions of the human genome. Since structural RNAs in general lack characteristic signals in primary sequence, comparative approaches evaluating evolutionary conservation of structures are most promising. We have used three recently introduced programs based on either phylogenetic-stochastic context-free grammar (EvoFold) or energy directed folding (RNAz and AlifoldZ), yielding several thousand candidate structures (corresponding to approximately 2.7% of the ENCODE regions). EvoFold has its highest sensitivity in highly conserved and relatively AU-rich regions, while RNAz favors slightly GC-rich regions, resulting in a relatively small overlap between methods. Comparison with the GENCODE annotation points to functional RNAs in all genomic contexts, with a slightly increased density in 3'-UTRs. While we estimate a significant false discovery rate of approximately 50%-70% many of the predictions can be further substantiated by additional criteria: 248 loci are predicted by both RNAz and EvoFold, and an additional 239 RNAz or EvoFold predictions are supported by the (more stringent) AlifoldZ algorithm. Five hundred seventy RNAz structure predictions fall into regions that show signs of selection pressure also on the sequence level (i.e., conserved elements). More than 700 predictions overlap with noncoding transcripts detected by oligonucleotide tiling arrays. One hundred seventy-five selected candidates were tested by RT-PCR in six tissues, and expression could be verified in 43 cases (24.6%).
Resumo:
Some programs may have their entry data specified by formalized context-free grammars. This formalization facilitates the use of tools in the systematization and the rise of the quality of their test process. This category of programs, compilers have been the first to use this kind of tool for the automation of their tests. In this work we present an approach for definition of tests from the formal description of the entries of the program. The generation of the sentences is performed by taking into account syntactic aspects defined by the specification of the entries, the grammar. For optimization, their coverage criteria are used to limit the quantity of tests without diminishing their quality. Our approach uses these criteria to drive generation to produce sentences that satisfy a specific coverage criterion. The approach presented is based on the use of Lua language, relying heavily on its resources of coroutines and dynamic construction of functions. With these resources, we propose a simple and compact implementation that can be optimized and controlled in different ways, in order to seek satisfaction the different implemented coverage criteria. To make the use of our tool simpler, the EBNF notation for the specification of the entries was adopted. Its parser was specified in the tool Meta-Environment for rapid prototyping
Resumo:
The work proposed by Cleverton Hentz (2010) presented an approach to define tests from the formal description of a program s input. Since some programs, such as compilers, may have their inputs formalized through grammars, it is common to use context-free grammars to specify the set of its valid entries. In the original work the author developed a tool that automatically generates tests for compilers. In the present work we identify types of problems in various areas where grammars are used to describe them , for example, to specify software configurations, which are potential situations to use LGen. In addition, we conducted case studies with grammars of different domains and from these studies it was possible to evaluate the behavior and performance of LGen during the generation of sentences, evaluating aspects such as execution time, number of generated sentences and satisfaction of coverage criteria available in LGen