967 resultados para Natural language techniques, Semantic spaces, Random projection, Documents
Resumo:
* This work was financially supported by RFBF-04-01-00858.
Resumo:
* This paper was made according to the program of fundamental scientific research of the Presidium of the Russian Academy of Sciences «Mathematical simulation and intellectual systems», the project "Theoretical foundation of the intellectual systems based on ontologies for intellectual support of scientific researches".
Resumo:
A model of the cognitive process of natural language processing has been developed using the formalism of generalized nets. Following this stage-simulating model, the treatment of information inevitably includes phases, which require joint operations in two knowledge spaces – language and semantics. In order to examine and formalize the relations between the language and the semantic levels of treatment, the language is presented as an information system, conceived on the bases of human cognitive resources, semantic primitives, semantic operators and language rules and data. This approach is applied for modeling a specific grammatical rule – the secondary predication in Russian. Grammatical rules of the language space are expressed as operators in the semantic space. Examples from the linguistics domain are treated and several conclusions for the semantics of the modeled rule are made. The results of applying the information system approach to the language turn up to be consistent with the stages of treatment modeled with the generalized net.
Resumo:
In this paper, we propose an unsupervised methodology to automatically discover pairs of semantically related words by highlighting their local environment and evaluating their semantic similarity in local and global semantic spaces. This proposal di®ers from previous research as it tries to take the best of two different methodologies i.e. semantic space models and information extraction models. It can be applied to extract close semantic relations, it limits the search space and it is unsupervised.
Resumo:
One of the ultimate aims of Natural Language Processing is to automate the analysis of the meaning of text. A fundamental step in that direction consists in enabling effective ways to automatically link textual references to their referents, that is, real world objects. The work presented in this paper addresses the problem of attributing a sense to proper names in a given text, i.e., automatically associating words representing Named Entities with their referents. The method for Named Entity Disambiguation proposed here is based on the concept of semantic relatedness, which in this work is obtained via a graph-based model over Wikipedia. We show that, without building the traditional bag of words representation of the text, but instead only considering named entities within the text, the proposed method achieves results competitive with the state-of-the-art on two different datasets.
Resumo:
The semantic model developed in this research was in response to the difficulty a group of mathematics learners had with conventional mathematical language and their interpretation of mathematical constructs. In order to develop the model ideas from linguistics, psycholinguistics, cognitive psychology, formal languages and natural language processing were investigated. This investigation led to the identification of four main processes: the parsing process, syntactic processing, semantic processing and conceptual processing. The model showed the complex interdependency between these four processes and provided a theoretical framework in which the behaviour of the mathematics learner could be analysed. The model was then extended to include the use of technological artefacts into the learning process. To facilitate this aspect of the research, the theory of instrumentation was incorporated into the semantic model. The conclusion of this research was that although the cognitive processes were interdependent, they could develop at different rates until mastery of a topic was achieved. It also found that the introduction of a technological artefact into the learning environment introduced another layer of complexity, both in terms of the learning process and the underlying relationship between the four cognitive processes.
Resumo:
Rhythm analysis of written texts focuses on literary analysis and it mainly considers poetry. In this paper we investigate the relevance of rhythmic features for categorizing texts in prosaic form pertaining to different genres. Our contribution is threefold. First, we define a set of rhythmic features for written texts. Second, we extract these features from three corpora, of speeches, essays, and newspaper articles. Third, we perform feature selection by means of statistical analyses, and determine a subset of features which efficiently discriminates between the three genres. We find that using as little as eight rhythmic features, documents can be adequately assigned to a given genre with an accuracy of around 80 %, significantly higher than the 33 % baseline which results from random assignment.
Resumo:
L’augmentation de la croissance des réseaux, des blogs et des utilisateurs des sites d’examen sociaux font d’Internet une énorme source de données, en particulier sur la façon dont les gens pensent, sentent et agissent envers différentes questions. Ces jours-ci, les opinions des gens jouent un rôle important dans la politique, l’industrie, l’éducation, etc. Alors, les gouvernements, les grandes et petites industries, les instituts universitaires, les entreprises et les individus cherchent à étudier des techniques automatiques fin d’extraire les informations dont ils ont besoin dans les larges volumes de données. L’analyse des sentiments est une véritable réponse à ce besoin. Elle est une application de traitement du langage naturel et linguistique informatique qui se compose de techniques de pointe telles que l’apprentissage machine et les modèles de langue pour capturer les évaluations positives, négatives ou neutre, avec ou sans leur force, dans des texte brut. Dans ce mémoire, nous étudions une approche basée sur les cas pour l’analyse des sentiments au niveau des documents. Notre approche basée sur les cas génère un classificateur binaire qui utilise un ensemble de documents classifies, et cinq lexiques de sentiments différents pour extraire la polarité sur les scores correspondants aux commentaires. Puisque l’analyse des sentiments est en soi une tâche dépendante du domaine qui rend le travail difficile et coûteux, nous appliquons une approche «cross domain» en basant notre classificateur sur les six différents domaines au lieu de le limiter à un seul domaine. Pour améliorer la précision de la classification, nous ajoutons la détection de la négation comme une partie de notre algorithme. En outre, pour améliorer la performance de notre approche, quelques modifications innovantes sont appliquées. Il est intéressant de mentionner que notre approche ouvre la voie à nouveaux développements en ajoutant plus de lexiques de sentiment et ensembles de données à l’avenir.
Resumo:
The overwhelming amount and unprecedented speed of publication in the biomedical domain make it difficult for life science researchers to acquire and maintain a broad view of the field and gather all information that would be relevant for their research. As a response to this problem, the BioNLP (Biomedical Natural Language Processing) community of researches has emerged and strives to assist life science researchers by developing modern natural language processing (NLP), information extraction (IE) and information retrieval (IR) methods that can be applied at large-scale, to scan the whole publicly available biomedical literature and extract and aggregate the information found within, while automatically normalizing the variability of natural language statements. Among different tasks, biomedical event extraction has received much attention within BioNLP community recently. Biomedical event extraction constitutes the identification of biological processes and interactions described in biomedical literature, and their representation as a set of recursive event structures. The 2009–2013 series of BioNLP Shared Tasks on Event Extraction have given raise to a number of event extraction systems, several of which have been applied at a large scale (the full set of PubMed abstracts and PubMed Central Open Access full text articles), leading to creation of massive biomedical event databases, each of which containing millions of events. Sinece top-ranking event extraction systems are based on machine-learning approach and are trained on the narrow-domain, carefully selected Shared Task training data, their performance drops when being faced with the topically highly varied PubMed and PubMed Central documents. Specifically, false-positive predictions by these systems lead to generation of incorrect biomolecular events which are spotted by the end-users. This thesis proposes a novel post-processing approach, utilizing a combination of supervised and unsupervised learning techniques, that can automatically identify and filter out a considerable proportion of incorrect events from large-scale event databases, thus increasing the general credibility of those databases. The second part of this thesis is dedicated to a system we developed for hypothesis generation from large-scale event databases, which is able to discover novel biomolecular interactions among genes/gene-products. We cast the hypothesis generation problem as a supervised network topology prediction, i.e predicting new edges in the network, as well as types and directions for these edges, utilizing a set of features that can be extracted from large biomedical event networks. Routine machine learning evaluation results, as well as manual evaluation results suggest that the problem is indeed learnable. This work won the Best Paper Award in The 5th International Symposium on Languages in Biology and Medicine (LBM 2013).
Resumo:
A primary goal of context-aware systems is delivering the right information at the right place and right time to users in order to enable them to make effective decisions and improve their quality of life. There are three key requirements for achieving this goal: determining what information is relevant, personalizing it based on the users’ context (location, preferences, behavioral history etc.), and delivering it to them in a timely manner without an explicit request from them. These requirements create a paradigm that we term as “Proactive Context-aware Computing”. Most of the existing context-aware systems fulfill only a subset of these requirements. Many of these systems focus only on personalization of the requested information based on users’ current context. Moreover, they are often designed for specific domains. In addition, most of the existing systems are reactive - the users request for some information and the system delivers it to them. These systems are not proactive i.e. they cannot anticipate users’ intent and behavior and act proactively without an explicit request from them. In order to overcome these limitations, we need to conduct a deeper analysis and enhance our understanding of context-aware systems that are generic, universal, proactive and applicable to a wide variety of domains. To support this dissertation, we explore several directions. Clearly the most significant sources of information about users today are smartphones. A large amount of users’ context can be acquired through them and they can be used as an effective means to deliver information to users. In addition, social media such as Facebook, Flickr and Foursquare provide a rich and powerful platform to mine users’ interests, preferences and behavioral history. We employ the ubiquity of smartphones and the wealth of information available from social media to address the challenge of building proactive context-aware systems. We have implemented and evaluated a few approaches, including some as part of the Rover framework, to achieve the paradigm of Proactive Context-aware Computing. Rover is a context-aware research platform which has been evolving for the last 6 years. Since location is one of the most important context for users, we have developed ‘Locus’, an indoor localization, tracking and navigation system for multi-story buildings. Other important dimensions of users’ context include the activities that they are engaged in. To this end, we have developed ‘SenseMe’, a system that leverages the smartphone and its multiple sensors in order to perform multidimensional context and activity recognition for users. As part of the ‘SenseMe’ project, we also conducted an exploratory study of privacy, trust, risks and other concerns of users with smart phone based personal sensing systems and applications. To determine what information would be relevant to users’ situations, we have developed ‘TellMe’ - a system that employs a new, flexible and scalable approach based on Natural Language Processing techniques to perform bootstrapped discovery and ranking of relevant information in context-aware systems. In order to personalize the relevant information, we have also developed an algorithm and system for mining a broad range of users’ preferences from their social network profiles and activities. For recommending new information to the users based on their past behavior and context history (such as visited locations, activities and time), we have developed a recommender system and approach for performing multi-dimensional collaborative recommendations using tensor factorization. For timely delivery of personalized and relevant information, it is essential to anticipate and predict users’ behavior. To this end, we have developed a unified infrastructure, within the Rover framework, and implemented several novel approaches and algorithms that employ various contextual features and state of the art machine learning techniques for building diverse behavioral models of users. Examples of generated models include classifying users’ semantic places and mobility states, predicting their availability for accepting calls on smartphones and inferring their device charging behavior. Finally, to enable proactivity in context-aware systems, we have also developed a planning framework based on HTN planning. Together, these works provide a major push in the direction of proactive context-aware computing.
Resumo:
A computer vision system that has to interact in natural language needs to understand the visual appearance of interactions between objects along with the appearance of objects themselves. Relationships between objects are frequently mentioned in queries of tasks like semantic image retrieval, image captioning, visual question answering and natural language object detection. Hence, it is essential to model context between objects for solving these tasks. In the first part of this thesis, we present a technique for detecting an object mentioned in a natural language query. Specifically, we work with referring expressions which are sentences that identify a particular object instance in an image. In many referring expressions, an object is described in relation to another object using prepositions, comparative adjectives, action verbs etc. Our proposed technique can identify both the referred object and the context object mentioned in such expressions. Context is also useful for incrementally understanding scenes and videos. In the second part of this thesis, we propose techniques for searching for objects in an image and events in a video. Our proposed incremental algorithms use the context from previously explored regions to prioritize the regions to explore next. The advantage of incremental understanding is restricting the amount of computation time and/or resources spent for various detection tasks. Our first proposed technique shows how to learn context in indoor scenes in an implicit manner and use it for searching for objects. The second technique shows how explicitly written context rules of one-on-one basketball can be used to sequentially detect events in a game.
Resumo:
Dissertação de Mestrado, Ciências da Linguagem, Faculdade de Ciências Humanas e Sociais, Universidade do Algarve, 2014
Resumo:
A evolução tecnológica tem provocado uma evolução na medicina, através de sistemas computacionais voltados para o armazenamento, captura e disponibilização de informações médicas. Os relatórios médicos são, na maior parte das vezes, guardados num texto livre não estruturado e escritos com vocabulário proprietário, podendo ocasionar falhas de interpretação. Através das linguagens da Web Semântica, é possível utilizar antologias como modo de estruturar e padronizar a informação dos relatórios médicos, adicionando¬ lhe anotações semânticas. A informação contida nos relatórios pode desta forma ser publicada na Web, permitindo às máquinas o processamento automático da informação. No entanto, o processo de criação de antologias é bastante complexo, pois existe o problema de criar uma ontologia que não cubra todo o domínio pretendido. Este trabalho incide na criação de uma ontologia e respectiva povoação, através de técnicas de PLN e Aprendizagem Automática que permitem extrair a informação dos relatórios médicos. Foi desenvolvida uma aplicação, que permite ao utilizador converter relatórios do formato digital para o formato OWL. ABSTRACT: Technological evolution has caused a medicine evolution through computer systems which allow storage, gathering and availability of medical information. Medical reports are, most of the times, stored in a non-structured free text and written in a personal way so that misunderstandings may occur. Through Semantic Web languages, it’s possible to use ontology as a way to structure and standardize medical reports information by adding semantic notes. The information in those reports can, by these means, be displayed on the web, allowing machines automatic information processing. However, the process of creating ontology is very complex, as there is a risk creating of an ontology that not covering the whole desired domain. This work is about creation of an ontology and its population through NLP and Machine Learning techniques to extract information from medical reports. An application was developed which allows the user to convert reports from digital for¬ mat to OWL format.
Resumo:
Natural Language Processing (NLP) has seen tremendous improvements over the last few years. Transformer architectures achieved impressive results in almost any NLP task, such as Text Classification, Machine Translation, and Language Generation. As time went by, transformers continued to improve thanks to larger corpora and bigger networks, reaching hundreds of billions of parameters. Training and deploying such large models has become prohibitively expensive, such that only big high tech companies can afford to train those models. Therefore, a lot of research has been dedicated to reducing a model’s size. In this thesis, we investigate the effects of Vocabulary Transfer and Knowledge Distillation for compressing large Language Models. The goal is to combine these two methodologies to further compress models without significant loss of performance. In particular, we designed different combination strategies and conducted a series of experiments on different vertical domains (medical, legal, news) and downstream tasks (Text Classification and Named Entity Recognition). Four different methods involving Vocabulary Transfer (VIPI) with and without a Masked Language Modelling (MLM) step and with and without Knowledge Distillation are compared against a baseline that assigns random vectors to new elements of the vocabulary. Results indicate that VIPI effectively transfers information of the original vocabulary and that MLM is beneficial. It is also noted that both vocabulary transfer and knowledge distillation are orthogonal to one another and may be applied jointly. The application of knowledge distillation first before subsequently applying vocabulary transfer is recommended. Finally, model performance due to vocabulary transfer does not always show a consistent trend as the vocabulary size is reduced. Hence, the choice of vocabulary size should be empirically selected by evaluation on the downstream task similar to hyperparameter tuning.
Resumo:
Nowadays the idea of injecting world or domain-specific structured knowledge into pre-trained language models (PLMs) is becoming an increasingly popular approach for solving problems such as biases, hallucinations, huge architectural sizes, and explainability lack—critical for real-world natural language processing applications in sensitive fields like bioinformatics. One recent work that has garnered much attention in Neuro-symbolic AI is QA-GNN, an end-to-end model for multiple-choice open-domain question answering (MCOQA) tasks via interpretable text-graph reasoning. Unlike previous publications, QA-GNN mutually informs PLMs and graph neural networks (GNNs) on top of relevant facts retrieved from knowledge graphs (KGs). However, taking a more holistic view, existing PLM+KG contributions mainly consider commonsense benchmarks and ignore or shallowly analyze performances on biomedical datasets. This thesis start from a propose of a deep investigation of QA-GNN for biomedicine, comparing existing or brand-new PLMs, KGs, edge-aware GNNs, preprocessing techniques, and initialization strategies. By combining the insights emerged in DISI's research, we introduce Bio-QA-GNN that include a KG. Working with this part has led to an improvement in state-of-the-art of MCOQA model on biomedical/clinical text, largely outperforming the original one (+3.63\% accuracy on MedQA). Our findings also contribute to a better understanding of the explanation degree allowed by joint text-graph reasoning architectures and their effectiveness on different medical subjects and reasoning types. Codes, models, datasets, and demos to reproduce the results are freely available at: \url{https://github.com/disi-unibo-nlp/bio-qagnn}.