923 resultados para Natural Language Processing,Recommender Systems,Android,Applicazione mobile
Resumo:
Abstract Since its creation, the Internet has permeated our daily life. The web is omnipresent for communication, research and organization. This exploitation has resulted in the rapid development of the Internet. Nowadays, the Internet is the biggest container of resources. Information databases such as Wikipedia, Dmoz and the open data available on the net are a great informational potentiality for mankind. The easy and free web access is one of the major feature characterizing the Internet culture. Ten years earlier, the web was completely dominated by English. Today, the web community is no longer only English speaking but it is becoming a genuinely multilingual community. The availability of content is intertwined with the availability of logical organizations (ontologies) for which multilinguality plays a fundamental role. In this work we introduce a very high-level logical organization fully based on semiotic assumptions. We thus present the theoretical foundations as well as the ontology itself, named Linguistic Meta-Model. The most important feature of Linguistic Meta-Model is its ability to support the representation of different knowledge sources developed according to different underlying semiotic theories. This is possible because mast knowledge representation schemata, either formal or informal, can be put into the context of the so-called semiotic triangle. In order to show the main characteristics of Linguistic Meta-Model from a practical paint of view, we developed VIKI (Virtual Intelligence for Knowledge Induction). VIKI is a work-in-progress system aiming at exploiting the Linguistic Meta-Model structure for knowledge expansion. It is a modular system in which each module accomplishes a natural language processing task, from terminology extraction to knowledge retrieval. VIKI is a supporting system to Linguistic Meta-Model and its main task is to give some empirical evidence regarding the use of Linguistic Meta-Model without claiming to be thorough.
Resumo:
By providing a better understanding of paraphrase and coreference in terms of similarities and differences in their linguistic nature, this article delimits what the focus of paraphrase extraction and coreference resolution tasks should be, and to what extent they can help each other. We argue for the relevance of this discussion to Natural Language Processing.
Resumo:
Finding an adequate paraphrase representation formalism is a challenging issue in Natural Language Processing. In this paper, we analyse the performance of Tree Edit Distance as a paraphrase representation baseline. Our experiments using Edit Distance Textual Entailment Suite show that, as Tree Edit Distance consists of a purely syntactic approach, paraphrase alternations not based on structural reorganizations do not find an adequate representation. They also show that there is much scope for better modelling of the way trees are aligned.
Resumo:
In this paper, we present a critical analysis of the state of the art in the definition and typologies of paraphrasing. This analysis shows that there exists no characterization of paraphrasing that is comprehensive, linguistically based and computationally tractable at the same time. The following sets out to define and delimit the concept on the basis of the propositional content. We present a general, inclusive and computationally oriented typology of the linguistic mechanisms that give rise to form variations between paraphrase pairs.
Resumo:
In this paper we present ClInt (Clinical Interview), a bilingual Spanish-Catalan spoken corpus that contains 15 hours of clinical interviews. It consists of audio files aligned with multiple-level transcriptions comprising orthographic, phonetic and morphological information, as well as linguistic and extralinguistic encoding. This is a previously non-existent resource for these languages and it offers a wide-ranging exploitation potential in a broad variety of disciplines such as Linguistics, Natural Language Processing and related fields.
Resumo:
CoCo is a collaborative web interface for the compilation of linguistic resources. In this demo we are presenting one of its possible applications: paraphrase acquisition.
Resumo:
Recent advances in machine learning methods enable increasingly the automatic construction of various types of computer assisted methods that have been difficult or laborious to program by human experts. The tasks for which this kind of tools are needed arise in many areas, here especially in the fields of bioinformatics and natural language processing. The machine learning methods may not work satisfactorily if they are not appropriately tailored to the task in question. However, their learning performance can often be improved by taking advantage of deeper insight of the application domain or the learning problem at hand. This thesis considers developing kernel-based learning algorithms incorporating this kind of prior knowledge of the task in question in an advantageous way. Moreover, computationally efficient algorithms for training the learning machines for specific tasks are presented. In the context of kernel-based learning methods, the incorporation of prior knowledge is often done by designing appropriate kernel functions. Another well-known way is to develop cost functions that fit to the task under consideration. For disambiguation tasks in natural language, we develop kernel functions that take account of the positional information and the mutual similarities of words. It is shown that the use of this information significantly improves the disambiguation performance of the learning machine. Further, we design a new cost function that is better suitable for the task of information retrieval and for more general ranking problems than the cost functions designed for regression and classification. We also consider other applications of the kernel-based learning algorithms such as text categorization, and pattern recognition in differential display. We develop computationally efficient algorithms for training the considered learning machines with the proposed kernel functions. We also design a fast cross-validation algorithm for regularized least-squares type of learning algorithm. Further, an efficient version of the regularized least-squares algorithm that can be used together with the new cost function for preference learning and ranking tasks is proposed. In summary, we demonstrate that the incorporation of prior knowledge is possible and beneficial, and novel advanced kernels and cost functions can be used in algorithms efficiently.
Resumo:
In this thesis we study the field of opinion mining by giving a comprehensive review of the available research that has been done in this topic. Also using this available knowledge we present a case study of a multilevel opinion mining system for a student organization's sales management system. We describe the field of opinion mining by discussing its historical roots, its motivations and applications as well as the different scientific approaches that have been used to solve this challenging problem of mining opinions. To deal with this huge subfield of natural language processing, we first give an abstraction of the problem of opinion mining and describe the theoretical frameworks that are available for dealing with appraisal language. Then we discuss the relation between opinion mining and computational linguistics which is a crucial pre-processing step for the accuracy of the subsequent steps of opinion mining. The second part of our thesis deals with the semantics of opinions where we describe the different ways used to collect lists of opinion words as well as the methods and techniques available for extracting knowledge from opinions present in unstructured textual data. In the part about collecting lists of opinion words we describe manual, semi manual and automatic ways to do so and give a review of the available lists that are used as gold standards in opinion mining research. For the methods and techniques of opinion mining we divide the task into three levels that are the document, sentence and feature level. The techniques that are presented in the document and sentence level are divided into supervised and unsupervised approaches that are used to determine the subjectivity and polarity of texts and sentences at these levels of analysis. At the feature level we give a description of the techniques available for finding the opinion targets, the polarity of the opinions about these opinion targets and the opinion holders. Also at the feature level we discuss the various ways to summarize and visualize the results of this level of analysis. In the third part of our thesis we present a case study of a sales management system that uses free form text and that can benefit from an opinion mining system. Using the knowledge gathered in the review of this field we provide a theoretical multi level opinion mining system (MLOM) that can perform most of the tasks needed from an opinion mining system. Based on the previous research we give some hints that many of the laborious market research tasks that are done by the sales force, which uses this sales management system, can improve their insight about their partners and by that increase the quality of their sales services and their overall results.
Resumo:
Learning of preference relations has recently received significant attention in machine learning community. It is closely related to the classification and regression analysis and can be reduced to these tasks. However, preference learning involves prediction of ordering of the data points rather than prediction of a single numerical value as in case of regression or a class label as in case of classification. Therefore, studying preference relations within a separate framework facilitates not only better theoretical understanding of the problem, but also motivates development of the efficient algorithms for the task. Preference learning has many applications in domains such as information retrieval, bioinformatics, natural language processing, etc. For example, algorithms that learn to rank are frequently used in search engines for ordering documents retrieved by the query. Preference learning methods have been also applied to collaborative filtering problems for predicting individual customer choices from the vast amount of user generated feedback. In this thesis we propose several algorithms for learning preference relations. These algorithms stem from well founded and robust class of regularized least-squares methods and have many attractive computational properties. In order to improve the performance of our methods, we introduce several non-linear kernel functions. Thus, contribution of this thesis is twofold: kernel functions for structured data that are used to take advantage of various non-vectorial data representations and the preference learning algorithms that are suitable for different tasks, namely efficient learning of preference relations, learning with large amount of training data, and semi-supervised preference learning. Proposed kernel-based algorithms and kernels are applied to the parse ranking task in natural language processing, document ranking in information retrieval, and remote homology detection in bioinformatics domain. Training of kernel-based ranking algorithms can be infeasible when the size of the training set is large. This problem is addressed by proposing a preference learning algorithm whose computation complexity scales linearly with the number of training data points. We also introduce sparse approximation of the algorithm that can be efficiently trained with large amount of data. For situations when small amount of labeled data but a large amount of unlabeled data is available, we propose a co-regularized preference learning algorithm. To conclude, the methods presented in this thesis address not only the problem of the efficient training of the algorithms but also fast regularization parameter selection, multiple output prediction, and cross-validation. Furthermore, proposed algorithms lead to notably better performance in many preference learning tasks considered.
Resumo:
Biomedical natural language processing (BioNLP) is a subfield of natural language processing, an area of computational linguistics concerned with developing programs that work with natural language: written texts and speech. Biomedical relation extraction concerns the detection of semantic relations such as protein-protein interactions (PPI) from scientific texts. The aim is to enhance information retrieval by detecting relations between concepts, not just individual concepts as with a keyword search. In recent years, events have been proposed as a more detailed alternative for simple pairwise PPI relations. Events provide a systematic, structural representation for annotating the content of natural language texts. Events are characterized by annotated trigger words, directed and typed arguments and the ability to nest other events. For example, the sentence “Protein A causes protein B to bind protein C” can be annotated with the nested event structure CAUSE(A, BIND(B, C)). Converted to such formal representations, the information of natural language texts can be used by computational applications. Biomedical event annotations were introduced by the BioInfer and GENIA corpora, and event extraction was popularized by the BioNLP'09 Shared Task on Event Extraction. In this thesis we present a method for automated event extraction, implemented as the Turku Event Extraction System (TEES). A unified graph format is defined for representing event annotations and the problem of extracting complex event structures is decomposed into a number of independent classification tasks. These classification tasks are solved using SVM and RLS classifiers, utilizing rich feature representations built from full dependency parsing. Building on earlier work on pairwise relation extraction and using a generalized graph representation, the resulting TEES system is capable of detecting binary relations as well as complex event structures. We show that this event extraction system has good performance, reaching the first place in the BioNLP'09 Shared Task on Event Extraction. Subsequently, TEES has achieved several first ranks in the BioNLP'11 and BioNLP'13 Shared Tasks, as well as shown competitive performance in the binary relation Drug-Drug Interaction Extraction 2011 and 2013 shared tasks. The Turku Event Extraction System is published as a freely available open-source project, documenting the research in detail as well as making the method available for practical applications. In particular, in this thesis we describe the application of the event extraction method to PubMed-scale text mining, showing how the developed approach not only shows good performance, but is generalizable and applicable to large-scale real-world text mining projects. Finally, we discuss related literature, summarize the contributions of the work and present some thoughts on future directions for biomedical event extraction. This thesis includes and builds on six original research publications. The first of these introduces the analysis of dependency parses that leads to development of TEES. The entries in the three BioNLP Shared Tasks, as well as in the DDIExtraction 2011 task are covered in four publications, and the sixth one demonstrates the application of the system to PubMed-scale text mining.
Resumo:
Les avancés dans le domaine de l’intelligence artificielle, permettent à des systèmes informatiques de résoudre des tâches de plus en plus complexes liées par exemple à la vision, à la compréhension de signaux sonores ou au traitement de la langue. Parmi les modèles existants, on retrouve les Réseaux de Neurones Artificiels (RNA), dont la popularité a fait un grand bond en avant avec la découverte de Hinton et al. [22], soit l’utilisation de Machines de Boltzmann Restreintes (RBM) pour un pré-entraînement non-supervisé couche après couche, facilitant grandement l’entraînement supervisé du réseau à plusieurs couches cachées (DBN), entraînement qui s’avérait jusqu’alors très difficile à réussir. Depuis cette découverte, des chercheurs ont étudié l’efficacité de nouvelles stratégies de pré-entraînement, telles que l’empilement d’auto-encodeurs traditionnels(SAE) [5, 38], et l’empilement d’auto-encodeur débruiteur (SDAE) [44]. C’est dans ce contexte qu’a débuté la présente étude. Après un bref passage en revue des notions de base du domaine de l’apprentissage machine et des méthodes de pré-entraînement employées jusqu’à présent avec les modules RBM, AE et DAE, nous avons approfondi notre compréhension du pré-entraînement de type SDAE, exploré ses différentes propriétés et étudié des variantes de SDAE comme stratégie d’initialisation d’architecture profonde. Nous avons ainsi pu, entre autres choses, mettre en lumière l’influence du niveau de bruit, du nombre de couches et du nombre d’unités cachées sur l’erreur de généralisation du SDAE. Nous avons constaté une amélioration de la performance sur la tâche supervisée avec l’utilisation des bruits poivre et sel (PS) et gaussien (GS), bruits s’avérant mieux justifiés que celui utilisé jusqu’à présent, soit le masque à zéro (MN). De plus, nous avons démontré que la performance profitait d’une emphase imposée sur la reconstruction des données corrompues durant l’entraînement des différents DAE. Nos travaux ont aussi permis de révéler que le DAE était en mesure d’apprendre, sur des images naturelles, des filtres semblables à ceux retrouvés dans les cellules V1 du cortex visuel, soit des filtres détecteurs de bordures. Nous aurons par ailleurs pu montrer que les représentations apprises du SDAE, composées des caractéristiques ainsi extraites, s’avéraient fort utiles à l’apprentissage d’une machine à vecteurs de support (SVM) linéaire ou à noyau gaussien, améliorant grandement sa performance de généralisation. Aussi, nous aurons observé que similairement au DBN, et contrairement au SAE, le SDAE possédait une bonne capacité en tant que modèle générateur. Nous avons également ouvert la porte à de nouvelles stratégies de pré-entraînement et découvert le potentiel de l’une d’entre elles, soit l’empilement d’auto-encodeurs rebruiteurs (SRAE).
Resumo:
Les logiciels de correction grammaticale commettent parfois des détections illégitimes (fausses alertes), que nous appelons ici surdétections. La présente étude décrit les expériences de mise au point d’un système créé pour identifier et mettre en sourdine les surdétections produites par le correcteur du français conçu par la société Druide informatique. Plusieurs classificateurs ont été entraînés de manière supervisée sur 14 types de détections faites par le correcteur, en employant des traits couvrant di-verses informations linguistiques (dépendances et catégories syntaxiques, exploration du contexte des mots, etc.) extraites de phrases avec et sans surdétections. Huit des 14 classificateurs développés sont maintenant intégrés à la nouvelle version d’un correcteur commercial très populaire. Nos expériences ont aussi montré que les modèles de langue probabilistes, les SVM et la désambiguïsation sémantique améliorent la qualité de ces classificateurs. Ce travail est un exemple réussi de déploiement d’une approche d’apprentissage machine au service d’une application langagière grand public robuste.
Resumo:
Mémoire numérisé par la Division de la gestion de documents et des archives de l'Université de Montréal
Resumo:
Mémoire numérisé par la Division de la gestion de documents et des archives de l'Université de Montréal
Resumo:
Cette thèse présente le résultat de plusieurs années de recherche dans le domaine de la génération automatique de résumés. Trois contributions majeures, présentées sous la forme d'articles publiés ou soumis pour publication, en forment le coeur. Elles retracent un cheminement qui part des méthodes par extraction en résumé jusqu'aux méthodes par abstraction. L'expérience HexTac, sujet du premier article, a d'abord été menée pour évaluer le niveau de performance des êtres humains dans la rédaction de résumés par extraction de phrases. Les résultats montrent un écart important entre la performance humaine sous la contrainte d'extraire des phrases du texte source par rapport à la rédaction de résumés sans contrainte. Cette limite à la rédaction de résumés par extraction de phrases, observée empiriquement, démontre l'intérêt de développer d'autres approches automatiques pour le résumé. Nous avons ensuite développé un premier système selon l'approche Fully Abstractive Summarization, qui se situe dans la catégorie des approches semi-extractives, comme la compression de phrases et la fusion de phrases. Le développement et l'évaluation du système, décrits dans le second article, ont permis de constater le grand défi de générer un résumé facile à lire sans faire de l'extraction de phrases. Dans cette approche, le niveau de compréhension du contenu du texte source demeure insuffisant pour guider le processus de sélection du contenu pour le résumé, comme dans les approches par extraction de phrases. Enfin, l'approche par abstraction basée sur des connaissances nommée K-BABS est proposée dans un troisième article. Un repérage des éléments d'information pertinents est effectué, menant directement à la génération de phrases pour le résumé. Cette approche a été implémentée dans le système ABSUM, qui produit des résumés très courts mais riches en contenu. Ils ont été évalués selon les standards d'aujourd'hui et cette évaluation montre que des résumés hybrides formés à la fois de la sortie d'ABSUM et de phrases extraites ont un contenu informatif significativement plus élevé qu'un système provenant de l'état de l'art en extraction de phrases.