995 results for Relation extraction
Abstract:
We identify relation completion (RC) as a recurring problem that is central to the success of novel big data applications such as Entity Reconstruction and Data Enrichment. Given a semantic relation, RC attempts to link entity pairs between two entity lists under the relation. To accomplish this, we propose formulating search queries for each query entity α based on some auxiliary information, so as to detect its target entity β in the set of retrieved documents. For instance, a pattern-based method (PaRE) uses extracted patterns as the auxiliary information in formulating search queries. However, high-quality patterns may decrease the probability of finding suitable target entities. As an alternative, we propose the CoRE method, which uses context terms learned from the text surrounding the expression of a relation as the auxiliary information in formulating queries. Experimental results on several real-world web data collections demonstrate that CoRE reaches a much higher accuracy than PaRE for RC.
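The context-term idea can be illustrated with a minimal sketch. It is not the CoRE implementation: the function names, the frequency-based term selection, and the in-memory list standing in for retrieved documents are all assumptions made for illustration.

```python
from collections import Counter

def _tokens(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    stripped = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in stripped if tok]

def learn_context_terms(seed_examples, stopwords, k=2):
    """Return the k most frequent terms around known (alpha, beta) pairs.

    seed_examples: iterable of (alpha, beta, snippet) triples (hypothetical format).
    Entity tokens and stopwords are excluded so only context terms remain.
    """
    counts = Counter()
    for alpha, beta, snippet in seed_examples:
        skip = set(_tokens(alpha)) | set(_tokens(beta)) | set(stopwords)
        counts.update(tok for tok in _tokens(snippet) if tok not in skip)
    return [term for term, _ in counts.most_common(k)]

def formulate_query(alpha, context_terms):
    """Build a search query: the quoted query entity plus the context terms."""
    return " ".join([f'"{alpha}"', *context_terms])

def pick_target(candidates, documents):
    """Pick the candidate beta mentioned in the most retrieved documents."""
    hits = Counter()
    for doc in documents:
        text = doc.lower()
        for beta in candidates:
            if beta.lower() in text:
                hits[beta] += 1
    return hits.most_common(1)[0][0] if hits else None

seeds = [
    ("Paris", "France", "Paris is the capital city of France"),
    ("Rome", "Italy", "Rome, the capital of Italy"),
    ("Tokyo", "Japan", "Tokyo is the capital city of Japan"),
]
terms = learn_context_terms(seeds, stopwords={"is", "the", "of"})
query = formulate_query("Madrid", terms)
```

In a real setting the query would be sent to a search engine and `pick_target` would run over the returned documents; here the retrieval step is left out on purpose.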
Abstract:
Biomedical relation extraction aims to uncover high-quality relations from life science literature with high accuracy and efficiency. Early biomedical relation extraction tasks focused on capturing binary relations, such as protein-protein interactions, which are crucial for virtually every process in a living cell. Information about these interactions provides the foundations for new therapeutic approaches. In recent years, interest has shifted to the extraction of complex relations such as biomolecular events. Complex relations go beyond binary relations: they involve more than two arguments and may also take another relation as an argument. In this paper, we conduct a thorough survey of research in biomedical relation extraction. We first present a general framework for biomedical relation extraction and then discuss the approaches proposed for binary and complex relation extraction, with a focus on the latter since it is a much more difficult task than binary relation extraction. Finally, we discuss the challenges we face in complex relation extraction and outline possible solutions and future directions.
Abstract:
The electronic storage of medical patient data is becoming a daily experience in most practices and hospitals worldwide. However, much of the available data is in free-form text, a convenient way of expressing concepts and events but especially challenging if one wants to perform automatic searches, summarization or statistical analysis. Information Extraction can relieve some of these problems by offering a semantically informed interpretation and abstraction of the texts. MedInX, the Medical Information eXtraction system presented in this document, is the first information extraction system developed to process textual clinical discharge records written in Portuguese. The main goal of the system is to improve access to the information locked up in unstructured text and, consequently, the efficiency of the health care process, by allowing faster and more reliable access to quality health information for both patients and health professionals. MedInX components are based on Natural Language Processing principles and provide several mechanisms to read, process and use external resources, such as terminologies and ontologies, in the automatic mapping of free-text reports onto a structured representation. The flexible and scalable architecture of the system also allowed its application to the task of Named Entity Recognition in a shared evaluation contest focused on Portuguese general-domain free-form texts. The evaluation of the system on a set of authentic hospital discharge letters indicates that it achieves a 95% F-measure on the task of entity recognition and 95% precision on the task of relation extraction. Example applications, demonstrating the use of MedInX capabilities in the hospital setting, are also presented in this document.
These applications were designed to address common clinical problems related to the automatic coding of diagnoses and other health-related conditions described in the documents, according to the international classification systems ICD-9-CM and ICF. Another application, the MedInX Clinical Audit system, automatically reviews the content and completeness of the documents.
Abstract:
Ontology design and population, core aspects of semantic technologies, have recently become fields of great interest due to the increasing need for domain-specific knowledge bases that can boost the use of the Semantic Web. For building such knowledge resources, state-of-the-art tools for ontology design require a lot of human work. Producing meaningful schemas and populating them with domain-specific data is in fact a very difficult and time-consuming task, even more so if the task consists of modelling knowledge at web scale. The primary aim of this work is to investigate a novel and flexible methodology for automatically learning ontologies from textual data, lightening the human workload required for conceptualizing domain-specific knowledge and populating an extracted schema with real data, and so speeding up the whole ontology production process. Here computational linguistics plays a fundamental role, from automatically identifying facts in natural language and extracting frames of relations among recognized entities, to producing linked data with which to extend existing knowledge bases or create new ones. In the state of the art, automatic ontology learning systems are mainly based on plain pipelined linguistic classifiers performing tasks such as Named Entity recognition, Entity resolution, Taxonomy and Relation extraction [11]. These approaches present some weaknesses, especially in capturing the structures through which the meaning of complex concepts is expressed [24]. Humans, in fact, tend to organize knowledge in well-defined patterns, which include participant entities and meaningful relations linking entities with each other. In the literature, these structures have been called Semantic Frames by Fillmore [20] or, more recently, Knowledge Patterns [23]. Some NLP studies have recently shown the possibility of performing more accurate deep parsing with the ability to logically understand the structure of discourse [7].
In this work, some of these technologies have been investigated and employed to produce accurate ontology schemas. The long-term goal is to collect large amounts of semantically structured information from the web of crowds, through an automated process, in order to identify and investigate the cognitive patterns used by humans to organize their knowledge.
Abstract:
This paper proposes a novel framework for incorporating protein-protein interaction (PPI) ontology knowledge into PPI extraction from biomedical literature in order to address the emerging challenges of deep natural language understanding. It builds upon existing work on relation extraction using the Hidden Vector State (HVS) model. The HVS model belongs to the category of statistical learning methods. It can be trained directly from un-annotated data in a constrained way while still capturing the underlying named entity relationships. However, it is difficult to incorporate background knowledge or non-local information into the HVS model. This paper proposes to represent the HVS model as a conditionally trained undirected graphical model in which non-local features derived from the PPI ontology through inference can be easily incorporated. The seamless fusion of ontology inference with statistical learning produces a new paradigm for information extraction.
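Abstract:
With the development of the Internet and electronic office work, large volumes of text resources have appeared. Information extraction technology can help people quickly obtain useful information from large-scale text. Named entity recognition and relation extraction are two basic tasks of information extraction. Based on a survey of the main methods currently used for named entity recognition and entity relation extraction, this thesis proposes a solution for each. The main work is as follows: (1) The state of research in China and abroad on named entity recognition and entity relation extraction is summarized from the two perspectives of model selection and feature selection, with emphasis on statistical learning methods for named entity recognition and kernel-based methods for entity relation extraction. (2) To address the problem of determining the boundaries of named entity segments, we study how to apply the Semi-Markov CRFs model to Chinese named entity recognition. This model only requires that the Markov property hold between segments, while the text within a segment can be flexibly assigned various rules. When the model is applied to Chinese named entity recognition, features that help identify segment boundaries can be designed more effectively and more freely. Experiments show that adding segment-level features improves named entity recognition performance by 4-5 percentage points. (3) The task of entity relation extraction is to determine the semantic relation between two entities. Previous research has shown that the syntactic tree structure between the two entities is very useful for determining their relation category, while the relatively mature relation extraction methods based on flat features have limited ability to fully capture syntactic tree structure. This thesis therefore studies kernel-based Chinese entity relation extraction. Taking the characteristics of Chinese into account, we explore how using different syntactic trees in the convolution kernel affects Chinese entity relation extraction performance, construct several composite kernels based on the convolution kernel, and improve the shortest path dependency kernel. When kernel methods were first applied to English relation extraction, the F1 score was only about 40%, whereas using only the convolution kernel over syntactic trees we reach an F1 score of 35% for Chinese relation extraction, which shows that kernel methods are also effective for Chinese relation extraction.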
Abstract:
Distant supervision methods for information extraction use correct tuples to acquire mentions of those tuples, and then use the mentions to train a traditional supervised information extraction system. In this article we analyze the sources of noise in the mentions and explore simple methods for filtering out noisy mentions. The results show that combining tuple filtering by frequency, mutual information, and the removal of mentions far from the centroids of their respective labels significantly improves the results of two information extraction models.
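One of the three filters, removing mentions far from the centroid of their label, can be sketched as follows. This is an illustrative sketch only: the feature vectors, the Euclidean distance, the fixed threshold `max_dist`, and the function names are assumptions, not details given in the abstract.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def filter_by_centroid(mentions, max_dist):
    """Keep mentions whose vector lies within max_dist of their label centroid.

    mentions: list of (label, vector) pairs obtained by distant supervision.
    """
    by_label = defaultdict(list)
    for label, vec in mentions:
        by_label[label].append(vec)
    centroids = {label: centroid(vecs) for label, vecs in by_label.items()}
    return [
        (label, vec)
        for label, vec in mentions
        if math.dist(vec, centroids[label]) <= max_dist
    ]

mentions = [
    ("born_in", [1.0, 0.0]),
    ("born_in", [0.9, 0.1]),
    ("born_in", [1.1, -0.1]),
    ("born_in", [-3.0, 4.0]),   # likely a noisy mention
    ("works_for", [0.0, 1.0]),
    ("works_for", [0.1, 0.9]),
]
kept = filter_by_centroid(mentions, max_dist=2.0)
```

Note that an outlier also pulls the centroid toward itself, so in practice the threshold (or a per-label percentile) would need tuning on held-out data.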
Abstract:
We introduce a variation density function that profiles the relationship between multiple scalar fields over isosurfaces of a given scalar field. This profile serves as a valuable tool for multifield data exploration because it provides the user with cues to identify interesting isovalues of scalar fields. Existing isosurface-based techniques for scalar data exploration like Reeb graphs, contour spectra, isosurface statistics, etc., study a scalar field in isolation. We argue that the identification of interesting isovalues in a multifield data set should necessarily be based on the interaction between the different fields. We demonstrate the effectiveness of our approach by applying it to explore data from a wide variety of applications.
Abstract:
The relationship between charge carrier lifetime and mobility in a bulk heterojunction organic solar cell, using a diketopyrrolopyrrole-naphthalene co-polymer and PC71BM in the photoactive blend layer, is investigated using the photoinduced charge extraction by linearly increasing voltage technique. Light intensity, delay time, and temperature dependent experiments are used to quantify the charge carrier mobility and density as well as the temperature dependence of both. From the saturation of the photoinduced current at high laser intensities, it is shown that Langevin-type bimolecular recombination is present in the studied system. The charge carrier lifetime, especially in Langevin systems, is argued to be an ambiguous and unreliable parameter for determining the performance of organic solar cells, because it depends on charge carrier density, mobility, and type of recombination. It is revealed that charge mobility (μ) and lifetime (τ) are inversely proportional, with the μτ product independent of temperature. The results indicate that in photovoltaic systems with Langevin-type bimolecular recombination, strategies to increase the charge lifetime might not be beneficial because of an accompanying reduction in charge carrier mobility. Instead, a focus on non-Langevin recombination mechanisms is crucial, because this allows an increase in the charge extraction rate by improving the carrier lifetime, density, and mobility simultaneously. © 2013 AIP Publishing LLC.
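The inverse relation between mobility and lifetime can be made plausible with the textbook Langevin expressions. The short derivation below is a sketch under stated assumptions (equal electron and hole densities n, lifetime defined as τ = 1/(β_L n)); it is not taken from the paper.

```latex
\beta_L = \frac{e\,(\mu_e + \mu_h)}{\varepsilon_0 \varepsilon_r},
\qquad
\tau = \frac{1}{\beta_L\, n}
\quad\Longrightarrow\quad
(\mu_e + \mu_h)\,\tau = \frac{\varepsilon_0 \varepsilon_r}{e\, n}
```

Under these assumptions the μτ product depends only on the carrier density and the permittivity, so any increase in mobility is offset by a proportional decrease in lifetime.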
Abstract:
This project is a step forward in the study of text mining, where text representation enhanced with semantic information plays a significant role. It develops effective methods for entity-oriented retrieval, semantic relation identification and text clustering using semantically annotated data. These methods are based on an enriched text representation generated by introducing semantic information extracted from Wikipedia into the input text data. The proposed methods are evaluated against several state-of-the-art benchmark methods on real-life data sets. In particular, this thesis improves the performance of entity-oriented retrieval, identifies different lexical forms of an entity relation, and handles the clustering of documents with multiple feature spaces.
Abstract:
Reverse osmosis is the dominant technology utilized for desalination of saline water produced during the extraction of coal seam gas. Alternatively, ion exchange is of interest due to potential cost advantages. However, there is limited information regarding the column performance of strong acid cation resin for removal of sodium ions from both model and actual coal seam water samples. In particular, the impact of bed depth, flow rate, and regeneration was not clear. Consequently, this study applied Bed Depth Service Time (BDST) models to reveal that increasing sodium ion concentration and flow rates diminished the time required for breakthrough to occur. The loading of sodium ions on fresh resin was calculated to be ca. 71.1 g Na/kg resin. Difficulties in regeneration of the resin using hydrochloric acid solutions were discovered, with 86% recovery of exchange sites observed. The maximum concentration of sodium ions in the regenerant brine was found to be 47,400 mg/L under the conditions employed. The volume of regenerant waste formed was 6.2% of the total volume of water treated. A coal seam water sample was found to load the resin with only 53.5 g Na/kg resin, which was consistent with not only the co-presence of more favoured ions such as calcium, magnesium, barium and strontium, but also inefficient regeneration of the resin prior to the coal seam water test.
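The reported trend (shorter breakthrough times at higher sodium concentration and flow rate) follows directly from the standard BDST equation, t = N0·Z/(C0·v) − ln(C0/Cb − 1)/(k·C0). The sketch below evaluates it; the parameter values are purely illustrative assumptions and are not data from the study.

```python
import math

def bdst_service_time(n0, bed_depth, c0, velocity, k, c_break):
    """Service time to breakthrough from the Bed Depth Service Time model.

    n0:        bed capacity per unit bed volume (mg/L)
    bed_depth: bed depth Z (cm)
    c0:        influent concentration (mg/L)
    velocity:  linear flow velocity v (cm/min)
    k:         rate constant (L/(mg*min))
    c_break:   effluent concentration defining breakthrough (mg/L)
    Returns the service time in minutes.
    """
    if not 0 < c_break < c0:
        raise ValueError("c_break must lie between 0 and c0")
    capacity_term = n0 * bed_depth / (c0 * velocity)
    kinetic_term = math.log(c0 / c_break - 1.0) / (k * c0)
    return capacity_term - kinetic_term

# Illustrative (assumed) values, not taken from the study.
base = bdst_service_time(n0=50000, bed_depth=20, c0=1000, velocity=2, k=0.001, c_break=50)
faster_flow = bdst_service_time(n0=50000, bed_depth=20, c0=1000, velocity=4, k=0.001, c_break=50)
higher_conc = bdst_service_time(n0=50000, bed_depth=20, c0=2000, velocity=2, k=0.001, c_break=50)
```

With these inputs both `faster_flow` and `higher_conc` come out smaller than `base`, which matches the qualitative behaviour described in the abstract.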
Abstract:
An aerodynamic sound source extraction from a general flow field is applied to a number of model problems and to a problem of engineering interest. The extraction technique is based on a decomposition of each flow variable into a dominant flow component and a perturbation component, which results in an acoustic correction method. The dominant flow component is obtained with a general-purpose Computational Fluid Dynamics (CFD) code which uses a cell-centred finite volume method to solve the Reynolds-averaged Navier–Stokes equations. The perturbations are calculated from a set of acoustic perturbation equations, with source terms extracted from unsteady CFD solutions at each time step, using a staggered dispersion-relation-preserving (DRP) finite-difference scheme. Numerical experiments include (1) propagation of a 1-D acoustic pulse without mean flow, (2) propagation of a 2-D acoustic pulse with and without mean flow, (3) reflection of an acoustic pulse from a flat plate with mean flow, and (4) flow-induced noise generated by an unsteady laminar flow past a 2-D cavity. The computational results demonstrate the accuracy of the source extraction technique for model problems and illustrate its feasibility for more complex aeroacoustic problems.