1000 results for Term weighting


Relevance:

70.00%

Publisher:

Abstract:

This paper presents a graph-based method to weight medical concepts in documents for the purposes of information retrieval. Medical concepts are extracted from free-text documents using a state-of-the-art technique that maps n-grams to concepts from the SNOMED CT medical ontology. In our graph-based concept representation, concepts are vertices in a graph built from a document, and edges represent associations between concepts. This representation naturally captures dependencies between concepts, an important requirement for interpreting medical text and a feature lacking in bag-of-words representations. We apply existing graph-based term weighting methods to weight medical concepts. Using concepts rather than terms addresses vocabulary mismatch and encapsulates the terms belonging to a single medical entity in a single concept. In addition, we further extend previous graph-based approaches by injecting domain knowledge that estimates the importance of a concept within the global medical domain. Retrieval experiments on the TREC Medical Records collection show that our method outperforms both term and concept baselines. More generally, this work provides a means of integrating the background knowledge contained in medical ontologies into data-driven information retrieval approaches.
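A minimal sketch of the kind of graph-based weighting the abstract describes: concepts extracted from a document become vertices, co-occurrence within a sliding window becomes edges, and a PageRank-style iteration assigns weights. The concept IDs and the window-based association rule are illustrative assumptions; the SNOMED CT concept extraction step is out of scope here.

```python
# Graph-based concept weighting sketch: PageRank over a co-occurrence graph.
from collections import defaultdict

def concept_graph(concepts, window=3):
    """Build an undirected co-occurrence graph from a concept sequence."""
    edges = defaultdict(set)
    for i, c in enumerate(concepts):
        for other in concepts[i + 1 : i + window]:
            if other != c:
                edges[c].add(other)
                edges[other].add(c)
    return edges

def pagerank_weights(edges, damping=0.85, iters=30):
    """Iteratively score vertices; a higher score marks a more central concept."""
    nodes = list(edges)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            incoming = sum(score[m] / len(edges[m]) for m in edges[n])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        score = new
    return score

# Toy document already mapped to SNOMED-like concept IDs (hypothetical names).
doc = ["C_heart_failure", "C_dyspnea", "C_heart_failure", "C_furosemide", "C_dyspnea"]
print(pagerank_weights(concept_graph(doc)))
```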

Relevance:

60.00%

Publisher:

Abstract:

Information Retrieval is an important albeit imperfect component of information technologies. A problem of insufficient diversity of retrieved documents is one of the primary issues studied in this research. This study shows that this problem leads to a decrease in precision and recall, the traditional measures of information retrieval effectiveness. This thesis presents an adaptive IR system based on the theory of adaptive dual control. The aim of the approach is the optimization of retrieval precision after all feedback has been issued. This is done by increasing the diversity of retrieved documents; this study shows that the value of recall reflects this diversity. The Probability Ranking Principle is viewed in the literature as the "bedrock" of current probabilistic Information Retrieval theory. Neither the proposed approach nor other methods of diversification of retrieved documents from the literature conform to this principle. This study shows by counterexample that the Probability Ranking Principle does not in general lead to optimal precision in a search session with feedback (for which it may not have been designed, but for which it is actively used). To accomplish the aim, retrieval precision of the search session should be optimized with a multistage stochastic programming model. However, such models are computationally intractable; therefore, approximate linear multistage stochastic programming models are derived in this study, where the multistage improvement of the probability distribution is modelled using the proposed feedback correctness method. The proposed optimization models are based on several assumptions, starting with the assumption that Information Retrieval is conducted in units of topics. The use of clusters is the primary reason why a new method of probability estimation is proposed. The adaptive dual control topic-based IR system was evaluated in a series of experiments conducted on the Reuters, Wikipedia and TREC document collections. The Wikipedia experiment revealed that the dual control feedback mechanism improves precision and S-recall when all the underlying assumptions are satisfied. In the TREC experiment, this feedback mechanism was compared to a state-of-the-art adaptive IR system based on BM-25 term weighting and the Rocchio relevance feedback algorithm. The baseline system exhibited better effectiveness than the cluster-based optimization model of ADTIR, the main reason being the insufficient quality of the clusters generated from the TREC collection, which violated the underlying assumption.
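For reference, a minimal sketch of the two baseline components named in the abstract, BM25 term weighting and Rocchio relevance feedback. Parameter values are conventional defaults, not those of the evaluated system, and document vectors are simplified to plain term-weight dictionaries.

```python
import math

def bm25_weight(tf, df, N, doc_len, avg_len, k1=1.2, b=0.75):
    """Standard BM25 weight for one term in one document."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant and away from non-relevant docs."""
    terms = set(query) | {t for d in relevant for t in d} | {t for d in nonrelevant for t in d}
    new_q = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        nonrel = sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1)
        w = alpha * query.get(t, 0.0) + beta * rel - gamma * nonrel
        if w > 0:
            new_q[t] = w
    return new_q

# Toy feedback round: one relevant document expands the query with "chf".
q = {"heart": 1.0, "failure": 1.0}
rel = [{"heart": 2.0, "failure": 1.5, "chf": 1.0}]
print(rocchio(q, rel, nonrelevant=[]))
```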

Relevance:

60.00%

Publisher:

Abstract:

Nowadays communication is switching from a centralized scenario, where media like newspapers, radio and TV programs produce information and people are just consumers, to a completely different decentralized scenario, where everyone is potentially an information producer through the use of social networks, blogs and forums that allow real-time worldwide information exchange. As a result of their widespread diffusion, these new instruments have started playing an important socio-economic role. They are the most used communication media and, as a consequence, they constitute the main source of information that enterprises, political parties and other organizations can rely on. Analyzing the data stored in servers all over the world is feasible by means of Text Mining techniques like Sentiment Analysis, which aims to extract opinions from huge amounts of unstructured text. This makes it possible to determine, for instance, the degree of user satisfaction with products, services, politicians and so on. In this context, this dissertation presents new Document Sentiment Classification methods based on the mathematical theory of Markov Chains. All of these approaches rely on a Markov Chain based model, which is language independent and whose key features are simplicity and generality, making it attractive compared with previous, more sophisticated techniques. Every technique discussed has been tested in both Single-Domain and Cross-Domain Sentiment Classification, comparing its performance with that of two previous works. The analysis shows that some of the examined algorithms produce results comparable with the best methods in the literature, on both single-domain and cross-domain tasks, for 2-class (i.e., positive and negative) Document Sentiment Classification. There is still room for improvement, however: this work also shows how performance could be enhanced, namely that a good novel feature selection process would be enough to outperform the state of the art. Furthermore, since some of the proposed approaches show promising results in 2-class Single-Domain Sentiment Classification, future work will also validate these results on tasks with more than 2 classes.
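A minimal sketch of the general idea, assuming the simplest reading of the abstract: train one word-transition Markov chain per sentiment class, then assign a document to the class whose chain gives it the higher smoothed log-likelihood. The smoothing choice and toy data are illustrative, not the dissertation's actual models.

```python
import math
from collections import defaultdict

def train_chain(docs):
    """Count word-to-word transitions over a class's training documents."""
    counts, totals = defaultdict(lambda: defaultdict(int)), defaultdict(int)
    for doc in docs:
        for prev, cur in zip(doc, doc[1:]):
            counts[prev][cur] += 1
            totals[prev] += 1
    return counts, totals

def log_likelihood(doc, model, vocab_size, eps=1.0):
    counts, totals = model
    ll = 0.0
    for prev, cur in zip(doc, doc[1:]):
        # Laplace-smoothed transition probability P(cur | prev)
        ll += math.log((counts[prev][cur] + eps) / (totals[prev] + eps * vocab_size))
    return ll

pos = [["great", "movie", "loved", "it"], ["loved", "great", "acting"]]
neg = [["boring", "movie", "hated", "it"], ["hated", "boring", "plot"]]
models = {"pos": train_chain(pos), "neg": train_chain(neg)}
vocab = {w for d in pos + neg for w in d}

doc = ["loved", "it"]
print(max(models, key=lambda c: log_likelihood(doc, models[c], len(vocab))))
```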

Relevance:

60.00%

Publisher:

Abstract:

This paper presents the development of an information retrieval interface for online public access catalogues (CDS/ISIS platform), based on the concept of similarity to generate search results ordered by likely relevance. The theoretical foundations involved are set out, followed by a detailed account of how the technological application was carried out, made explicit at the programming level. Finally, the implementation problems posed by the environment are outlined.
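A minimal sketch of similarity-based ranking in the sense the abstract uses: catalogue records and the query are represented as tf-idf vectors and results are ordered by cosine similarity. The CDS/ISIS record handling and the paper's actual similarity formula are not reproduced here, and tokenisation is deliberately naive.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenised records into tf-idf dictionaries."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

records = [["catalogacion", "opac", "isis"], ["opac", "recuperacion"], ["historia", "archivo"]]
vecs = tfidf_vectors(records)
query = {"opac": 1.0, "recuperacion": 1.0}
ranked = sorted(range(len(records)), key=lambda i: cosine(query, vecs[i]), reverse=True)
print(ranked)  # record indices ordered by estimated relevance
```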

Relevance:

60.00%

Publisher:

Abstract:

In this paper we explore the use of semantic classes in an existing information retrieval system in order to improve its results. We use two different ontologies of semantic classes (WordNet Domains and Basic Level Concepts) to re-rank the retrieved documents and obtain better recall and precision. Finally, we implement a new method for weighting the expanded terms that takes into account the weights of the original query terms and their WordNet relations to the new ones, which has been shown to improve the results. These approaches were evaluated in the CLEF Robust-WSD Task, yielding an improvement of 1.8% in GMAP for the semantic classes approach and 10% in MAP for the WordNet term weighting approach.
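A minimal sketch of one plausible form of such expanded-term weighting: each expansion term inherits a fraction of its originating query term's weight, discounted by the WordNet relation that produced it. The discount values below are illustrative assumptions, not the paper's.

```python
# Hypothetical per-relation discounts; stronger relations keep more weight.
RELATION_DISCOUNT = {"synonym": 0.9, "hypernym": 0.6, "hyponym": 0.5, "domain": 0.4}

def weight_expansion(query_weights, expansions):
    """expansions: list of (original_term, new_term, relation) triples."""
    expanded = dict(query_weights)
    for orig, new, rel in expansions:
        w = query_weights.get(orig, 0.0) * RELATION_DISCOUNT.get(rel, 0.3)
        # Keep the strongest weight if a term is reached via several relations.
        expanded[new] = max(expanded.get(new, 0.0), w)
    return expanded

q = {"car": 1.0, "engine": 0.8}
exp = [("car", "automobile", "synonym"), ("car", "vehicle", "hypernym"),
       ("engine", "motor", "synonym")]
print(weight_expansion(q, exp))
```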

Relevance:

30.00%

Publisher:

Abstract:

Shade plots, simple visual representations of abundance matrices from multivariate species assemblage studies, are shown to be an effective aid in choosing an overall transformation (or other pre-treatment) of quantitative data for long-term use, striking an appropriate balance between dominant and less abundant taxa in ensuing resemblance-based multivariate analyses. Though the exposition is entirely general and applicable to all community studies, detailed illustrations of the comparative power and interpretative possibilities of shade plots are given for two estuarine assemblage studies in south-western Australia: (a) macrobenthos in the upper Swan Estuary over a two-year period covering a highly significant precipitation event for the Perth area; and (b) a wide-scale spatial study of the nearshore fish fauna of five divergent estuaries. The utility of transformations of intermediate severity is again demonstrated and, with greater novelty, so is the potential importance of a further mild transformation of all data after differential down-weighting (dispersion weighting) of spatially 'clumped' or 'schooled' species. Among the new techniques utilized is a two-way form of the RELATE test, which demonstrates the linking of assemblage structure (fish) to continuous environmental variables (water quality) after removing a categorical factor (estuary differences). Re-orderings of the sample and species axes in the associated shade plots are seen to provide transparent, species-level explanations for such continuous multivariate patterns.
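A minimal sketch of dispersion weighting as the abstract uses the term: each species' counts are divided by its index of dispersion (the variance-to-mean ratio across replicates), so spatially clumped or schooling species are down-weighted before any further mild transformation. Pooling the index across groups of replicates is simplified to a single group here.

```python
def dispersion_weight(matrix):
    """matrix: rows = replicate samples, columns = species counts."""
    n = len(matrix)
    weighted = [row[:] for row in matrix]
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        d = var / mean if mean > 0 else 1.0  # index of dispersion
        d = max(d, 1.0)                      # never up-weight (Poisson floor)
        for i in range(n):
            weighted[i][j] = matrix[i][j] / d
    return weighted

# Third species is heavily clumped (one huge school) and gets down-weighted.
counts = [[0, 5, 100], [2, 6, 0], [1, 4, 0], [0, 5, 0]]
for row in dispersion_weight(counts):
    print([round(x, 2) for x in row])
```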

Relevance:

30.00%

Publisher:

Abstract:

Feature selection and feature weighting are useful techniques for improving the classification accuracy of the K-nearest-neighbor (K-NN) rule. The term feature selection refers to algorithms that select the best subset of the input feature set. In feature weighting, each feature is multiplied by a weight value proportional to its ability to distinguish pattern classes. In this paper, a novel hybrid approach based on the Tabu Search (TS) heuristic is proposed for simultaneous feature selection and feature weighting of the K-NN rule. The proposed TS heuristic in combination with a K-NN classifier is compared with several classifiers on various available data sets; the results indicate a significant improvement in classification accuracy. The proposed TS heuristic is also compared with various feature selection algorithms. The experiments revealed that the proposed hybrid TS heuristic is superior to both simple TS and sequential search algorithms. We also present results for the classification of prostate cancer using multispectral images, an important problem in biomedicine.
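A minimal sketch of the objective such a Tabu Search would optimise: a feature-weight vector (a zero weight amounts to deselecting the feature) applied inside the K-NN distance, scored by leave-one-out accuracy. The Tabu Search loop itself (moves, tabu list, aspiration criteria) is omitted, and the data are toy values.

```python
import math

def weighted_dist(a, b, w):
    """Euclidean distance with per-feature weights."""
    return math.sqrt(sum(wi * (x - y) ** 2 for x, y, wi in zip(a, b, w)))

def loo_knn_accuracy(X, y, w, k=3):
    """Leave-one-out K-NN accuracy under a given weight vector."""
    correct = 0
    for i in range(len(X)):
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: weighted_dist(X[i], X[j], w),
        )[:k]
        votes = [y[j] for j in neighbours]
        if max(set(votes), key=votes.count) == y[i]:
            correct += 1
    return correct / len(X)

X = [[1.0, 0, 5], [1.1, 9, 5], [4.0, 0, 5], [4.2, 8, 5]]
y = ["a", "a", "b", "b"]
# Feature 2 is noise and feature 3 is constant; the weights reflect that.
print(loo_knn_accuracy(X, y, w=[1.0, 0.0, 0.0], k=1))
```

A Tabu Search would then explore neighbouring weight vectors under this objective, keeping a short-term memory of recent moves to avoid cycling and escape local optima.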

Relevance:

30.00%

Publisher:

Abstract:

Aims: Newer-generation everolimus-eluting stents (EES) have been shown to improve clinical outcomes compared with early-generation sirolimus-eluting (SES) and paclitaxel-eluting stents (PES) in patients undergoing percutaneous coronary intervention (PCI). Whether this benefit is maintained among patients with saphenous vein graft (SVG) disease remains controversial. Methods and results: We assessed cumulative incidence rates (CIR) per 100 patient-years after inverse probability of treatment weighting to compare clinical outcomes. The pre-specified primary endpoint was the composite of cardiac death, myocardial infarction (MI) and target vessel revascularisation (TVR). Of 12,339 consecutively treated patients, 288 (5.7%) underwent PCI of at least one SVG lesion with EES (n=127), SES (n=103) or PES (n=58). Up to four years, CIR of the primary endpoint were 58.7 for EES, 45.2 for SES and 45.6 for PES, with similar adjusted risks between groups (EES vs. SES: HR 0.94, 95% CI: 0.55-1.60; EES vs. PES: HR 1.07, 95% CI: 0.60-1.91). Adjusted risks showed no significant differences between stent types for cardiac death, MI and TVR. Conclusions: Among patients undergoing PCI for SVG lesions, newer-generation EES showed similar safety and efficacy to early-generation SES and PES during long-term follow-up to four years.
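A minimal sketch of inverse probability of treatment weighting as a general technique: a propensity model estimates each patient's probability of receiving treatment given baseline covariates, and patients are weighted by the inverse probability of the treatment they actually received so that the arms become comparable. The covariates and logistic model below are illustrative stand-ins, not the study's analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))         # baseline covariates (hypothetical)
p_true = 1 / (1 + np.exp(-X[:, 0]))   # treatment assignment depends on covariate 0
treated = rng.random(500) < p_true

# Propensity scores, then IPTW weights: 1/p for treated, 1/(1-p) for control.
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
weights = np.where(treated, 1 / ps, 1 / (1 - ps))

# Weighted covariate means should be closer between arms than raw means.
for arm, name in [(treated, "treated"), (~treated, "control")]:
    raw = X[arm, 0].mean()
    wtd = np.average(X[arm, 0], weights=weights[arm])
    print(f"{name}: raw mean {raw:+.2f}, IPTW mean {wtd:+.2f}")
```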

Relevance:

20.00%

Publisher: