526 results for Alignement de phrases


100.00% 100.00%



Statistical machine translation requires large quantities of parallel corpora. Such corpora are obtained by automatic sentence-level alignment. Parallel corpus alignment received a great deal of attention in the 1980s, and the community considers this step solved. In this thesis we show that this is not the case and propose a new aligner, which we compare with state-of-the-art algorithms. Our aligner is simple and fast, and can align very large amounts of data. It often produces better results than the most elaborate aligners. We analyze the robustness of our aligner with respect to the genre of the texts to be aligned and the noise they contain. To this end, our experiments fall into two main parts. In the first part, we work on the BAF corpus, where we measure alignment quality as a function of noise, at levels of up to 60%. In the second part, we work on the Europarl corpus, where we revisit the alignment procedure with which Europarl was prepared and show that statistical translation systems can achieve better performance when our aligner is used.


70.00% 70.00%



To enrich bilingual parallel corpus data, it can be worthwhile to work with so-called comparable corpora. In this type of corpus, even though the documents in the target language are not exact translations of those in the source language, they can contain words or sentences that are translations of one another. The free encyclopedia Wikipedia is a multilingual comparable corpus of several million documents. Our work consists in finding a general, endogenous method for extracting as many parallel sentences as possible. We work with the French-English language pair, but our method, which uses no external bilingual resource, can be applied to any other language pair. It has two steps. The first detects the article pairs most likely to contain translations. For this we use a neural network trained on a small data set of articles aligned at the sentence level. The second step selects sentence pairs using another neural network, whose outputs are then reinterpreted by a combinatorial optimization algorithm and an extension heuristic. Adding the roughly 560,000 sentence pairs extracted from Wikipedia to the training corpus of a baseline statistical machine translation system improves the quality of the translations produced. We make the aligned data and the extracted corpus available to the scientific community.


20.00% 20.00%



In this paper, we discuss our participation in the INEX 2008 Link-the-Wiki track. We used a sliding-window algorithm to extract frequent terms and phrases. Using the extracted phrases and terms as descriptive vectors, the anchors and relevant links (both incoming and outgoing) are recognized efficiently.
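
The abstract above does not give the details of its sliding-window algorithm. As a minimal sketch of the general idea only, assuming whitespace tokenization and hypothetical parameters `max_len` and `min_count` (none of which come from the paper), frequent terms and phrases could be counted like this:

```python
from collections import Counter


def frequent_phrases(text, max_len=3, min_count=2):
    """Slide windows of 1..max_len tokens over the text and keep frequent ones.

    Illustrative assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens = text.lower().split()
    counts = Counter()
    for size in range(1, max_len + 1):
        # Each window position yields one candidate term or phrase.
        for start in range(len(tokens) - size + 1):
            counts[" ".join(tokens[start:start + size])] += 1
    # Keep only candidates seen at least min_count times.
    return {phrase: n for phrase, n in counts.items() if n >= min_count}
```

The resulting phrase counts could then serve as a simple descriptive vector for a page; how the paper itself weights or filters them is not stated.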


20.00% 20.00%



Coordination and juxtaposed sentences. This study examines the relations between juxtaposed clauses in contemporary French. The sentences in question consist of several clauses joined without a conjunction or other connector, as in: Je détournai les yeux, mon cœur se mit à battre. The study aims to determine the nature of the relation in these sentences and the part that coordination plays in it. It also asks what this relation of coordination is, given that some grammars hold that it is expressed through a coordinating conjunction while others hold that it is marked in juxtaposed sentences through different features. The study is based on a corpus of written French drawn from literary and journalistic sources. Syntactic, semantic and textual properties of the clauses are discussed. The analysis focuses on differences between the clauses: in each case it is noted whether one clause is affirmative and the other negative, and whether the subject is left unrepeated in the second clause. The clauses are also analyzed by tense, mood, phrase structure type and thematic structure, noting in each case whether they are identical or different. Punctuation is one of the properties considered. The final aim is to eliminate subordinate sentences gradually, on the basis of the distribution of properties, so that only the hard core of coordinate sentences remains. In this way, coordination could be defined much as the phoneme is defined, as a bundle of distinctive features. The quantitative analyses lead to the conclusion that sentences interpreted semantically as coordinating show the fewest of these differences, while sentences that can be considered subordinating show the most.
The conditions of coordination are, in that sense, hierarchical, so that the syntactic constraints have to make room for semantic, textual and cognitive factors. It is interesting to note that everyone is able to produce correct coordinating structures and to recognize incorrect ones. This can be explained by the human ability to categorize, which has been widely studied in prototype semantics. The study suggests that coordination and subordination could be considered prototypical cognitive categories based on different linguistic and pragmatic features.


20.00% 20.00%



This paper presents a symbolic navigation system that uses spatial language descriptions to inform goal-directed exploration in unfamiliar office environments. An abstract map is created from a collection of natural language phrases describing the spatial layout of the environment. The spatial representation in the abstract map is controlled by a constraint-based interpretation of each natural language phrase. In goal-directed exploration of an unseen office environment, the robot links the information in the abstract map to observed symbolic information and its grounded world representation. This paper demonstrates the ability of the system, in both simulated and real-world trials, to efficiently find target rooms in environments it has never visited. In three unexplored environments, the system is shown to travel on average only 8.42% further than the optimal path when using only natural language phrases to complete navigation tasks.


20.00% 20.00%



Many popular models are available for document classification, such as the Naïve Bayes classifier, k-Nearest Neighbors and Support Vector Machines. In all these cases, the representation is based on the “Bag of Words” model. This model does not capture the actual semantic meaning of a word in a particular document. Semantics are better captured by the proximity of words and their occurrence in the document. We propose a new “Bag of Phrases” model to capture this discriminative power of phrases for text classification. We present a novel algorithm that extracts phrases from the corpus using the well-known topic model Latent Dirichlet Allocation (LDA) and integrates them into the vector space model for classification. Experiments show that classifiers perform better with the new Bag of Phrases model than with related representation models.


20.00% 20.00%



Marggraf Turley, Richard, 'Keats, Cornwall and the "Scent of Strong-Smelling Phrases"', Romanticism (2006) 12 (2), pp. 102-114. RAE2008


20.00% 20.00%



We address the problem of mining interesting phrases from subsets of a text corpus, where the subset is specified by a set of features, such as keywords, that form a query. Previous algorithms for this problem sift through either a phrase-dictionary-based index or a document-based index, so that the cost of a solution is linear in either the size of the phrase dictionary or the size of the document subset. We propose an independence assumption between query keywords given the top correlated phrases, under which pre-processing reduces to discovering phrases from among the top phrases for each feature in the query. We then outline an indexing mechanism in which per-keyword phrase lists are stored either on disk or in memory, so that popular aggregation algorithms such as No Random Access and Sort-merge Join can be adapted to do the scoring in real time and identify the top interesting phrases. Though such an approach is expected to be approximate, we show empirically that very high accuracies (over 90%) are achieved against the results of exact algorithms. Thanks to the simplified list aggregation, we are also able to provide response times that are orders of magnitude better than those of state-of-the-art algorithms. Interestingly, our disk-based approach outperforms the in-memory baselines by up to a hundred times and sometimes more, confirming the superiority of the proposed method.
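
To illustrate the per-keyword list aggregation described above, here is a minimal sketch. It is not the paper's method: it scans each list in full rather than using No Random Access or Sort-merge Join, and the index layout, the function name `top_phrases` and the scoring rule are all assumptions made for illustration. Each keyword maps to a dictionary of phrase scores (assumed strictly positive), and under the independence assumption a phrase's query score is taken to be the product of its per-keyword scores.

```python
import heapq
import math


def top_phrases(query, phrase_lists, k=2):
    """Return the k phrases with the highest combined score for the query.

    phrase_lists: hypothetical index, keyword -> {phrase: positive score}.
    Scores are combined by summing logs, i.e. multiplying per-keyword scores.
    """
    lists = [phrase_lists[keyword] for keyword in query]
    # Only phrases present in every keyword's list can be scored.
    candidates = set(lists[0]).intersection(*lists[1:])
    combined = {
        phrase: sum(math.log(scores[phrase]) for scores in lists)
        for phrase in candidates
    }
    return heapq.nlargest(k, combined, key=combined.get)
```

Because each keyword's list is built ahead of time, query-time work depends on the lengths of the stored lists rather than on the phrase dictionary or the document subset, which is the property the abstract emphasizes.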