894 results for Query expansion, Text mining, Information retrieval, Chinese IR
Abstract:
The Semantic Annotation component is a software application that provides support for automated text classification, a process grounded in a cohesion-centered representation of discourse that facilitates topic extraction. The component enables the semantic meta-annotation of text resources, including automated classification, thus facilitating information retrieval within the RAGE ecosystem. It is available in the ReaderBench framework (http://readerbench.com/), which integrates advanced Natural Language Processing (NLP) techniques. The component uses Cohesion Network Analysis (CNA) to ensure an in-depth representation of discourse, useful for mining keywords and performing automated text categorization. Our component automatically classifies documents into the categories provided by the ACM Computing Classification System (http://dl.acm.org/ccs_flat.cfm), as well as into the categories of a high-level serious games categorization provisionally developed by RAGE. English and French are already covered by the provided web service, and the entire framework can be extended to support additional languages.
Abstract:
Background and aims: Machine learning techniques for the text mining of cancer-related clinical documents have not been sufficiently explored. Here some techniques are presented for the pre-processing of free-text breast cancer pathology reports, with the aim of facilitating the extraction of information relevant to cancer staging.
Materials and methods: The first technique was implemented using the freely available software RapidMiner to classify the reports according to their general layout: ‘semi-structured’ and ‘unstructured’. The second technique was developed using the open source language engineering framework GATE and aimed at the prediction of chunks of the report text containing information pertaining to the cancer morphology, the tumour size, its hormone receptor status and the number of positive nodes. The classifiers were trained and tested respectively on sets of 635 and 163 manually classified or annotated reports, from the Northern Ireland Cancer Registry.
Results: The best result of 99.4% accuracy – which included only one semi-structured report predicted as unstructured – was produced by the layout classifier with the k-nearest neighbours algorithm, using the binary term occurrence word vector type with stopword filtering and pruning. For chunk recognition, the best results were found using the PAUM algorithm with the same parameters for all cases, except for the prediction of chunks containing cancer morphology. For semi-structured reports, precision ranged from 0.97 to 0.94 and recall from 0.92 to 0.83, while for unstructured reports precision ranged from 0.91 to 0.64 and recall from 0.68 to 0.41. Poor results were found when the classifier was trained on semi-structured reports but tested on unstructured ones.
Conclusions: These results show that it is possible and beneficial to predict the layout of reports and that the accuracy of prediction of which segments of a report may contain certain information is sensitive to the report layout and the type of information sought.
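The layout-classification recipe in this abstract (binary term-occurrence vectors, stopword filtering, nearest-neighbour classification) can be sketched in a few lines; the tiny report snippets, the stopword list, and the use of Jaccard distance below are illustrative stand-ins, not the actual RapidMiner configuration:

```python
import re

# Minimal, invented stopword list standing in for RapidMiner's stopword filter.
STOPWORDS = {"the", "a", "an", "of", "with", "shows", "about"}

def binary_vector(text):
    # Binary term occurrence after stopword filtering: a set of distinct terms.
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def nearest_layout(query, labelled):
    # 1-nearest-neighbour over binary term sets, using Jaccard distance.
    def dist(a, b):
        union = a | b
        return 1 - (len(a & b) / len(union) if union else 0.0)
    q = binary_vector(query)
    return min(labelled, key=lambda item: dist(q, binary_vector(item[0])))[1]

# Hypothetical stand-ins for the 635 manually classified training reports.
reports = [
    ("Tumour size: 22 mm. ER: positive. Nodes positive: 2.", "semi-structured"),
    ("The specimen shows an invasive carcinoma measuring about two centimetres.", "unstructured"),
    ("Morphology: ductal. PR: negative. Nodes examined: 14.", "semi-structured"),
    ("The biopsy demonstrates malignant cells with hormone receptors present.", "unstructured"),
]
print(nearest_layout("Tumour size: 8 mm. ER: negative.", reports))  # semi-structured
```

An unseen report phrased as field/value pairs shares far more vocabulary with the semi-structured examples than with the free-prose ones, which is the intuition the layout classifier exploits.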
Abstract:
The overwhelming amount and unprecedented speed of publication in the biomedical domain make it difficult for life science researchers to acquire and maintain a broad view of the field and gather all information relevant to their research. In response to this problem, the BioNLP (Biomedical Natural Language Processing) community of researchers has emerged, striving to assist life science researchers by developing modern natural language processing (NLP), information extraction (IE) and information retrieval (IR) methods that can be applied at large scale to scan the whole publicly available biomedical literature, extract and aggregate the information found within, and automatically normalize the variability of natural language statements. Among different tasks, biomedical event extraction has recently received much attention within the BioNLP community. Biomedical event extraction is the identification of biological processes and interactions described in biomedical literature, and their representation as a set of recursive event structures. The 2009–2013 series of BioNLP Shared Tasks on Event Extraction has given rise to a number of event extraction systems, several of which have been applied at a large scale (the full set of PubMed abstracts and PubMed Central Open Access full-text articles), leading to the creation of massive biomedical event databases, each containing millions of events. Since top-ranking event extraction systems are based on machine-learning approaches and are trained on narrow-domain, carefully selected Shared Task training data, their performance drops when faced with the topically highly varied PubMed and PubMed Central documents. Specifically, false-positive predictions by these systems lead to the generation of incorrect biomolecular events, which are spotted by end-users.
This thesis proposes a novel post-processing approach, utilizing a combination of supervised and unsupervised learning techniques, that can automatically identify and filter out a considerable proportion of incorrect events from large-scale event databases, thus increasing the overall credibility of those databases. The second part of this thesis is dedicated to a system we developed for hypothesis generation from large-scale event databases, which is able to discover novel biomolecular interactions among genes and gene products. We cast the hypothesis generation problem as supervised network topology prediction, i.e. predicting new edges in the network, as well as the types and directions of these edges, utilizing a set of features that can be extracted from large biomedical event networks. Routine machine learning evaluation results, as well as manual evaluation results, suggest that the problem is indeed learnable. This work won the Best Paper Award at the 5th International Symposium on Languages in Biology and Medicine (LBM 2013).
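The "predicting new edges" formulation can be illustrated with one of the simplest topological features such a supervised predictor could use, the common-neighbour count of an unlinked pair; the gene names and the toy event network below are invented for illustration and do not come from the thesis:

```python
# Toy illustration of hypothesis generation as network topology prediction:
# unlinked gene/gene-product pairs are scored by how many interaction partners
# they already share. A real system would feed such features, among others,
# into a supervised classifier that also predicts edge type and direction.
def common_neighbour_scores(edges, nodes):
    neigh = {n: set() for n in nodes}
    for a, b in edges:
        neigh[a].add(b)
        neigh[b].add(a)
    scores = {}
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if b not in neigh[a]:  # only unlinked pairs are candidate hypotheses
                scores[(a, b)] = len(neigh[a] & neigh[b])
    return scores

network_nodes = ["TP53", "MDM2", "EGFR", "KRAS"]
known_edges = [("TP53", "MDM2"), ("TP53", "EGFR"), ("MDM2", "EGFR"),
               ("EGFR", "KRAS"), ("TP53", "KRAS")]
print(common_neighbour_scores(known_edges, network_nodes))  # {('MDM2', 'KRAS'): 2}
```

Here the single unlinked pair shares two partners, so it would be proposed as the most plausible new interaction in this toy network.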
Abstract:
While news stories are an important traditional medium to broadcast and consume news, microblogging has recently emerged as a place where people can discuss, disseminate, collect or report information about news. However, the massive amount of information in the microblogosphere makes it hard for readers to keep up with these real-time updates. This is especially a problem when it comes to breaking news, where people are more eager to know “what is happening”. Therefore, this dissertation is intended as an exploratory effort to investigate computational methods to augment human effort when monitoring the development of breaking news on a given topic from a microblog stream, by extractively summarizing the updates in a timely manner. More specifically, given an interest in a topic, either entered as a query or presented as an initial news report, a microblog temporal summarization system is proposed to filter microblog posts from a stream with three primary concerns: topical relevance, novelty, and salience. Considering the relatively high arrival rate of microblog streams, a cascade framework consisting of three stages is proposed to progressively reduce the quantity of posts. For each step in the cascade, this dissertation studies methods that improve over current baselines. In the relevance filtering stage, query and document expansion techniques are applied to mitigate sparsity and vocabulary mismatch issues. The use of word embeddings as a basis for filtering is also explored, using unsupervised and supervised modeling to characterize lexical and semantic similarity. In the novelty filtering stage, several statistical ways of characterizing novelty are investigated and ensemble learning techniques are used to integrate results from these diverse techniques. These results are compared with a baseline clustering approach using both standard and delay-discounted measures.
In the salience filtering stage, because of the real-time prediction requirement, a method of learning verb phrase usage from past relevant news reports is used in conjunction with some standard measures for characterizing writing quality. Following a Cranfield-like evaluation paradigm, this dissertation includes a series of experiments to evaluate the proposed methods for each step, and for the end-to-end system. New microblog novelty and salience judgments are created, building on existing relevance judgments from the TREC Microblog track. The results point to future research directions at the intersection of social media, computational journalism, information retrieval, automatic summarization, and machine learning.
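The cascade idea (successive relevance and novelty filters that progressively thin the stream) can be sketched with plain term-frequency cosine similarity; the thresholds, the query, and the example posts are invented, and the salience stage is omitted for brevity:

```python
import math
import re

def tf_vector(text):
    # Term-frequency vector over lowercased word tokens.
    vec = {}
    for tok in re.findall(r"[a-z]+", text.lower()):
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(x * x for x in a.values()))
           * math.sqrt(sum(x * x for x in b.values())))
    return num / den if den else 0.0

def cascade(query, stream, rel_threshold=0.2, nov_threshold=0.6):
    # Stage 1 keeps posts topically similar to the query; stage 2 drops posts
    # too similar to an already-selected update.
    q = tf_vector(query)
    selected = []
    for post in stream:
        p = tf_vector(post)
        if cosine(q, p) < rel_threshold:
            continue  # stage 1: not on topic
        if any(cosine(p, tf_vector(s)) > nov_threshold for s in selected):
            continue  # stage 2: redundant with an earlier update
        selected.append(post)
    return selected

stream = [
    "Strong earthquake reported downtown this morning",
    "Strong earthquake reported downtown this morning",  # exact repeat
    "I love my morning coffee",
    "Rescue teams respond to the earthquake, two buildings damaged",
]
print(cascade("earthquake downtown", stream))  # keeps the first and last posts
```

Each stage is cheap relative to the one after it, which is what makes the cascade attractive at microblog arrival rates: most posts are discarded before any expensive processing runs.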
Abstract:
Discovery Driven Analysis (DDA) is a common feature of OLAP technology for analyzing structured data. In essence, DDA helps analysts discover anomalous data by highlighting 'unexpected' values in the OLAP cube. By giving the analyst indications of which dimensions to explore, DDA speeds up the process of discovering anomalies and their causes. However, Discovery Driven Analysis (and OLAP in general) is only applicable to structured data, such as records in databases. We propose a system that extends DDA technology to semi-structured text documents, that is, text documents accompanied by a small amount of structured data. Our system pipeline consists of two stages: first, the text part of each document is structured around user-specified dimensions, using a semi-PLSA algorithm; then, we adapt DDA to these fully structured documents, thus enabling DDA on text documents. We present some applications of this system in OLAP analysis and show how scalability issues are solved. Results show that our system can handle reasonably large document datasets, in real time, without any need for pre-computation.
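The core of DDA, flagging 'unexpected' cube cells, can be sketched as follows; note that the baseline used here (the mean of all cells) is a much-simplified stand-in for the statistical model a real DDA implementation fits, and the sales figures are invented:

```python
# Much-simplified sketch of DDA-style highlighting: every cell of a small
# two-dimensional aggregate is compared against a baseline expectation, and
# cells deviating by more than a given fraction are flagged as 'unexpected'.
def unexpected_cells(cube, threshold=0.5):
    mean = sum(cube.values()) / len(cube)  # crude baseline expectation
    return [cell for cell, value in cube.items()
            if abs(value - mean) / mean > threshold]

# A tiny region x quarter cube with one anomalous cell.
sales = {(region, quarter): 10
         for region in ("A", "B", "C") for quarter in ("Q1", "Q2", "Q3")}
sales[("C", "Q3")] = 100  # the anomaly a DDA tool should surface
print(unexpected_cells(sales))  # [('C', 'Q3')]
```

The flagged cell tells the analyst which (region, quarter) combination to drill into, which is exactly the guidance the abstract describes DDA providing.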
Abstract:
The incredibly rapid growth of air travel to huge volumes, driven mainly by the jet airliners that appeared in the skies in the 1950s, created the need for systematic aviation safety research and for collecting data about air traffic. Structured data can be analysed easily by querying databases and running these results through graphic tools. However, analysing the narratives, which often give more accurate information about a case, requires mining tools. The analysis of textual data with computers was not possible until data mining tools were developed, and their use, at least in aviation, is still at a moderate level. The research aims at discovering lethal trends in flight safety reports. The narratives of 1,200 flight safety reports in Finnish from the years 1994–1996 were processed with three text mining tools. One of them was totally language independent, another had a specific configuration for Finnish, and the third was originally created for English, but encouraging results had been achieved with Spanish, so a Finnish test was undertaken too. The global rate of accidents is stabilising and the situation can now be regarded as satisfactory, but because of the growth in air traffic, the absolute number of fatal accidents per year might increase if flight safety is not improved. The collection of data and reporting systems have reached their top level; the focal point in increasing flight safety is analysis. Air traffic has generally been forecast to grow 5–6 per cent annually over the next two decades. During this period, global air travel will probably double, even with relatively conservative expectations of economic growth. This development makes airline management confront growing pressure due to increasing competition, a significant rise in fuel prices and the need to reduce the incident rate in line with the expected growth in air traffic volumes.
All this emphasises the urgent need for new tools and methods. All of the systems provided encouraging results, while also revealing challenges still to be overcome. Flight safety can be improved through the development and utilisation of sophisticated analysis tools and methods, like data mining, using their results to support the decision-making of executives.
Abstract:
Understanding the reasons that make a text coherent is very useful for understanding the text itself, since coherence and coherence relations help us determine whether one or more texts are coherent. In this work, 3 coherence relations from Cross-Document Structure Theory, or CST (Radev, 2000), holding between different texts on the same topic are analyzed and classified. To make this possible, guidelines are proposed for segmenting texts written in Basque about the same topic and for annotating the relations between them. A corpus of 10 texts has been annotated; 3 of its clusters were analyzed by two annotators. We report the agreement between the annotators. Developing coherence relations is very important for several Natural Language Processing systems, such as information extraction systems, machine translation, question answering systems and automatic summarization. If, in the future, all CST relations were analyzed on a substantial corpus, the work done here would be a first step towards using cross-document coherence relations to guide the automatic processing of Basque texts.
Abstract:
Prior research shows that electronic word of mouth (eWOM) wields considerable influence over consumer behavior. However, as the volume and variety of eWOM grows, firms are faced with challenges in analyzing and responding to this information. In this dissertation, I argue that to meet the new challenges and opportunities posed by the expansion of eWOM and to more accurately measure its impacts on firms and consumers, we need to revisit our methodologies for extracting insights from eWOM. This dissertation consists of three essays that further our understanding of the value of social media analytics, especially with respect to eWOM. In the first essay, I use machine learning techniques to extract semantic structure from online reviews. These semantic dimensions describe the experiences of consumers in the service industry more accurately than traditional numerical variables. To demonstrate the value of these dimensions, I show that they can be used to substantially improve the accuracy of econometric models of firm survival. In the second essay, I explore the effects on eWOM of online deals, such as those offered by Groupon, the value of which to both consumers and merchants is controversial. Through a combination of Bayesian econometric models and controlled lab experiments, I examine the conditions under which online deals affect online reviews and provide strategies to mitigate the potential negative eWOM effects resulting from online deals. In the third essay, I focus on how eWOM can be incorporated into efforts to reduce foodborne illness, a major public health concern. I demonstrate how machine learning techniques can be used to monitor hygiene in restaurants through crowd-sourced online reviews. I am able to identify instances of moral hazard within the hygiene inspection scheme used in New York City by leveraging a dictionary specifically crafted for this purpose. 
To the extent that online reviews provide some visibility into the hygiene practices of restaurants, I show how losses from information asymmetry may be partially mitigated in this context. Taken together, this dissertation contributes by revisiting and refining the use of eWOM in the service sector through a combination of machine learning and econometric methodologies.
Abstract:
This paper presents work done within the Medical Minner Project on the TREC-2011 Medical Records Track. The paper proposes four models for medical information retrieval based on a Lucene index approach. Our retrieval engine used a Lucene index scheme with traditional stopping and stemming, enhanced with entity recognition software applied to query terms. Our aim in this first competition is to set up a broader project that involves the development of a configurable Apache Lucene-based framework allowing the rapid development of medical search facilities. Results around the track median have been achieved. In this exploratory track, we think that these results are a good beginning and encourage us towards future developments.
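The "index with traditional stopping and stemming" pipeline the abstract mentions can be illustrated with a toy inverted index; the stopword list, the naive suffix-stripping stemmer (a stand-in for a real stemmer such as Porter's, which Lucene provides), and the sample documents are all invented for illustration:

```python
import re

STOPWORDS = {"the", "of", "a", "in", "and", "for", "with", "on", "to"}

def stem(token):
    # Naive suffix stripping, standing in for a real stemmer such as Porter's.
    for suffix in ("ing", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)] + ("y" if suffix == "ies" else "")
    return token

def analyse(text):
    # Tokenize, lowercase, drop stopwords, stem -- the analysis chain applied
    # identically to documents and queries.
    return [stem(t) for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]

def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for term in analyse(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    # Conjunctive (AND) retrieval over the analysed query terms.
    postings = [index.get(term, set()) for term in analyse(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Patient admitted with chest infections",
    2: "Chest X-ray shows no infection",
    3: "Patient discharged after treatment",
}
index = build_index(docs)
print(sorted(search(index, "chest infection")))  # [1, 2]
```

Because "infections" and "infection" stem to the same term, the query matches both reports, which is the recall benefit stemming buys in medical retrieval.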
Abstract:
This paper presents our work at the 2016 FIRE CHIS task. Given a CHIS query and a document associated with that query, the task is to classify the sentences in the document as relevant to the query or not, and to further classify the relevant sentences as supporting, neutral or opposing the claim made in the query. In this paper, we present two different approaches to this classification. With the first approach, we implement two models to address the task. We first implement an information retrieval model to retrieve the sentences that are relevant to the query; we then use a supervised learning method to train a classification model that classifies the relevant sentences as support, oppose or neutral. With the second approach, we use machine learning techniques alone to learn a model that classifies the sentences into four classes (relevant & support, relevant & neutral, relevant & oppose, irrelevant & neutral). Our submission for CHIS uses the first approach.
Abstract:
Many real-world datasets, including textual data, can be represented using graph structures. Using graphs to represent textual data has many advantages, mainly related to preserving a greater amount of information, such as the relationships between words and their types. In recent years, many neural network architectures have been proposed to deal with tasks on graphs. Many of them consider only node features, ignoring or not giving proper relevance to the relationships between nodes; however, in many node classification tasks these relationships play a fundamental role. This thesis aims to analyze the main GNNs, evaluate their advantages and disadvantages, propose an innovative solution conceived as an extension of GAT, and apply them to a case study in the biomedical field. We implement the reference GNNs with the methodologies analyzed later, and then apply them to a question answering system in the biomedical field as a replacement for the pre-existing GNN. We attempt to obtain better results by using models that can accept both node and edge features as input. As shown later, our proposed models beat the original solution and define the state of the art for the task under analysis.
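The idea of letting edge features influence attention, which is the spirit of the GAT extension described above, can be shown in a deliberately tiny scalar form; real GAT-style layers use learned weight matrices, a LeakyReLU scoring function, and multi-head attention, none of which appear in this sketch, and the fixed scalar weights are invented:

```python
import math

def attend(h_self, neighbours, w_node=1.0, w_edge=1.0):
    # neighbours: list of (node_feature, edge_feature) scalar pairs.
    # The unnormalised score depends on BOTH the neighbour's node feature and
    # the feature of the connecting edge; softmax normalises over neighbours.
    scores = [w_node * (h_self * h) + w_edge * e for h, e in neighbours]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    alphas = [x / z for x in exps]
    # Aggregate neighbour node features weighted by attention.
    h_new = sum(a * h for a, (h, _) in zip(alphas, neighbours))
    return h_new, alphas

# Two neighbours with identical node features but different edge features:
# only an edge-aware model can tell them apart.
h_new, alphas = attend(1.0, [(1.0, 0.0), (1.0, 2.0)])
print(round(alphas[1], 2))  # 0.88
```

With node features alone the two neighbours would receive equal attention; the edge feature breaks the tie, which is the motivation the thesis gives for feeding edge features into the model.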
Abstract:
Artificial Intelligence is reshaping the fashion industry in different ways. E-commerce retailers exploit their data through AI to enhance their search engines, make outfit suggestions and forecast the success of specific fashion products. However, this is a challenging endeavour, as the data they possess is huge, complex and multi-modal. The most common way to search for fashion products online is by matching keywords with phrases in the product's description, which are often cluttered, inadequate, and differ across collections and sellers. A customer may also browse an online store's taxonomy, although this is time-consuming and doesn't guarantee relevant items. With the advent of Deep Learning architectures, particularly Vision-Language models, ad-hoc solutions have been proposed that model both the product image and the description to solve these problems. However, the suggested solutions do not effectively exploit the semantic or syntactic information of these modalities, nor the unique qualities and relations of clothing items. In this thesis, a novel approach is proposed to address these issues: it models and processes images and text descriptions as graphs in order to exploit the relations within and between each modality, and employs specific techniques to extract syntactic and semantic information. The results obtained show promising performance on different tasks when compared to present state-of-the-art deep learning architectures.
Abstract:
This article discusses issues related to the organization and reception of information in the context of services and public information systems driven by technology. It stems from the assumption that in a "technologized" society, the distance between users and information is almost always of a cognitive and socio-cultural nature, a product of our effort to design communication. In this context, we favor the approach of the information sign, seeking to answer how a documentary message turns into information, i.e. a structure recognized as socially useful. Observing the structural, cognitive and communicative aspects of the documentary message, based on Documentary Linguistics, Terminology, and Textual Linguistics, the knowledge management and innovation policy of the Government of the State of Sao Paulo, which authorizes the use of Web 2.0, is analyzed, also questioning to what extent this initiative represents innovation in the library environment.