Abstract:
We address the problem of mining interesting phrases from subsets of a text corpus, where the subset is specified using a set of features, such as keywords, that form a query. Previous algorithms for the problem have proposed solutions that sift through either a phrase-dictionary-based index or a document-based index, and are therefore linear in either the phrase dictionary size or the size of the document subset. We propose the use of an independence assumption between query keywords given the top correlated phrases, under which pre-processing can be reduced to discovering phrases from among the top phrases for each feature in the query. We then outline an indexing mechanism in which per-keyword phrase lists are stored either on disk or in memory, so that popular aggregation algorithms such as No Random Access and Sort-merge Join may be adapted to perform the scoring in real time and identify the top interesting phrases. Although such an approach is expected to be approximate, we empirically illustrate that very high accuracies (of over 90%) are achieved against the results of exact algorithms. Due to the simplified list aggregation, we are also able to provide response times that are orders of magnitude better than state-of-the-art algorithms. Interestingly, our disk-based approach outperforms the in-memory baselines by up to a hundred times and sometimes more, confirming the superiority of the proposed method.
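The list-aggregation idea can be sketched in a few lines. The sketch below (all names hypothetical) uses a plain exhaustive merge as a stand-in for No Random Access, which would additionally stop early once the top-k answer is provably stable:

```python
import heapq

def top_phrases(per_keyword_lists, k):
    """Aggregate per-keyword (phrase, score) lists into an overall
    top-k by summed score. Exhaustive merge; NRA would terminate
    as soon as the top-k result can no longer change."""
    totals = {}
    for plist in per_keyword_lists.values():
        for phrase, score in plist:
            totals[phrase] = totals.get(phrase, 0.0) + score
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

# Hypothetical per-keyword phrase lists, sorted by score.
lists = {
    "jaguar": [("big cat", 0.75), ("luxury car", 0.5)],
    "speed":  [("luxury car", 0.75), ("big cat", 0.25)],
}
print(top_phrases(lists, 1))  # → [('luxury car', 1.25)]
```

Because each list is already sorted by score, a threshold-style algorithm can bound the best possible total of unseen phrases and cut the scan short, which is where the reported speedups come from.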
Abstract:
When a user of a microblogging site authors or browses a microblog post, it provides cues as to what topic she is interested in at that point in time. Example-based search, which retrieves similar tweets given one exemplary tweet such as the one just authored, can help provide the user with relevant content. We investigate various components of microblog posts, such as the associated timestamp, the author's social network, and the content of the post, and develop approaches that harness such factors in finding relevant tweets given a query tweet. An empirical analysis of such techniques on real-world Twitter data is then presented to quantify the utility of the various factors in assessing tweet relevance. We observe that content-wise similar tweets that also contain extra information not already present in the query are perceived as useful. We then develop a composite technique that combines the various approaches by scoring tweets using a dynamic, query-specific linear combination of the separate techniques. An empirical evaluation establishes the effectiveness of the composite technique and shows that it outperforms each of its constituents.
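A rough sketch of such a composite scorer follows; the two component scorers and the fixed weights are illustrative only (the approach described learns a dynamic, query-specific weighting):

```python
def jaccard(q, c):
    """Word-overlap similarity between two tweets' texts."""
    a, b = set(q["text"].split()), set(c["text"].split())
    return len(a & b) / len(a | b) if a | b else 0.0

def recency(q, c):
    """Closeness of posting times (illustrative time units)."""
    return 1.0 / (1.0 + abs(q["time"] - c["time"]))

def composite_score(query, cand, scorers, weights):
    """Weighted linear combination of the component scores."""
    return sum(w * f(query, cand) for f, w in zip(scorers, weights))

q  = {"text": "dengue fever outbreak", "time": 100}
c1 = {"text": "dengue fever cases rising", "time": 99}
c2 = {"text": "football scores tonight", "time": 100}

for cand in (c1, c2):
    print(composite_score(q, cand, [jaccard, recency], [0.7, 0.3]))
```

The content-similar tweet c1 outscores the merely recent c2, matching the observation that similar tweets carrying extra information are the useful ones.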
Abstract:
Online forums are becoming a popular way of finding useful information on the web. Search over forums for existing discussion threads has so far been limited to keyword-based search, owing to the minimal effort it requires of users. However, it is often not possible to capture all the relevant context of a complex query using a small number of keywords. Example-based search, which retrieves similar discussion threads given one exemplary thread, is an alternative approach that can help the user provide richer context and vastly improve forum search results. In this paper, we address the problem of finding threads similar to a given thread. Towards this, we propose a novel methodology to estimate similarity between discussion threads. Our method exploits the thread structure to decompose threads into sets of weighted overlapping components. It then estimates pairwise thread similarities by quantifying how well the information in the threads is mutually contained within each other, using lexical similarities between their underlying components. We compare our proposed methods on real datasets against state-of-the-art thread retrieval mechanisms and illustrate that our techniques outperform the others by large margins on popular retrieval evaluation measures such as NDCG, MAP, Precision@k, and MRR. In particular, consistent improvements of up to 10% are observed on all evaluation measures.
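A minimal sketch of the mutual-containment idea, assuming uniform component weights (one component per post) and Jaccard word overlap as the lexical similarity; both are simplifications of the weighted overlapping decomposition described:

```python
def lexical_sim(a, b):
    """Jaccard word overlap between two components."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def containment(src, dst):
    """How well thread src is covered by thread dst: each component
    of src matched to its best counterpart in dst, then averaged."""
    if not src or not dst:
        return 0.0
    return sum(max(lexical_sim(c, d) for d in dst) for c in src) / len(src)

def thread_similarity(t1, t2):
    """Symmetric score: mutual containment averaged both ways."""
    return 0.5 * (containment(t1, t2) + containment(t2, t1))

wifi_a = ["my wifi driver keeps crashing", "try reinstalling the driver"]
wifi_b = ["wifi driver crashing after update", "reinstalling the driver fixed it"]
pizza  = ["best pizza dough recipe", "use fresh basil on top"]

print(thread_similarity(wifi_a, wifi_b))
print(thread_similarity(wifi_a, pizza))
```

The asymmetric `containment` captures the "information in one thread contained within the other" notion; averaging both directions makes the final score symmetric.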
Abstract:
We consider the problem of linking web search queries to entities from a knowledge base such as Wikipedia. Such linking enables converting a user's web search session into a footprint in the knowledge base that could be used to enrich the user profile. Traditional methods for entity linking have been directed towards finding entity mentions in text documents such as news reports, each of which is possibly linked to multiple entities, enabling the use of measures like entity-set coherence. Since web search queries are very small text fragments, such criteria, which rely on the existence of a multitude of mentions, do not work well on them. We propose a three-phase method for linking web search queries to Wikipedia entities. The first phase performs IR-style scoring of entities against the search query to narrow down to a subset of entities, which are expanded using hyperlink information in the second phase to a larger set. Lastly, we use a graph-traversal approach to identify the top entities to link the query to. Through an empirical evaluation on real-world web search queries, we illustrate that our methods significantly enhance linking accuracy over state-of-the-art methods.
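The three-phase pipeline might be sketched as follows; the entity names, the term-overlap scorer, and the degree-based ranking are all illustrative stand-ins for the IR scoring and graph traversal described:

```python
def link_query(query, entity_texts, links, k=2):
    # Phase 1: IR-style scoring (term overlap) to pick seed entities.
    q = set(query.lower().split())
    scores = {e: len(q & set(t.split())) for e, t in entity_texts.items()}
    seeds = {e for e, s in scores.items() if s > 0}
    # Phase 2: expand the seeds with their hyperlink neighbours.
    expanded = set(seeds)
    for e in seeds:
        expanded |= links.get(e, set())
    # Phase 3: rank by connectivity inside the expanded subgraph,
    # breaking ties by the phase-1 score.
    def degree(e):
        return len(links.get(e, set()) & expanded)
    ranked = sorted(expanded, key=lambda e: (degree(e), scores.get(e, 0)),
                    reverse=True)
    return ranked[:k]

# Hypothetical miniature knowledge base.
entity_texts = {
    "Jaguar_Cars":   "jaguar british luxury car manufacturer",
    "Jaguar_animal": "jaguar big cat of the americas",
    "Jaguar_XF":     "xf executive car made by jaguar",
}
links = {"Jaguar_XF": {"Jaguar_Cars"}, "Jaguar_Cars": {"Jaguar_XF"}}
print(link_query("jaguar xf price", entity_texts, links))
# → ['Jaguar_XF', 'Jaguar_Cars']
```

Even with a one-word ambiguous mention ("jaguar"), the hyperlink structure pulls the mutually connected car entities ahead of the animal, which is the intuition behind replacing mention-set coherence with graph traversal.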
Abstract:
Massive amounts of geo-tagged data associated with text information are being generated at an unprecedented scale. These geo-textual data cover a wide range of topics. Users are interested in receiving up-to-date geo-textual objects (e.g., geo-tagged tweets) whose locations meet their needs and whose texts interest them. For example, a user may want to be updated with tweets near her home on the topic "dengue fever headache." In this demonstration, we present SOPS, the Spatial-Keyword Publish/Subscribe System, which is capable of efficiently processing spatial-keyword continuous queries. SOPS supports two types of queries: (1) the Boolean Range Continuous (BRC) query, which can be used to subscribe to geo-textual objects satisfying a Boolean keyword expression and falling in a specified spatial region; and (2) the Temporal Spatial-Keyword Top-k Continuous (TaSK) query, which continuously maintains up-to-date top-k most relevant results over a stream of geo-textual objects. SOPS enables users to formulate their queries and view real-time results over a stream of geo-textual objects through browser-based user interfaces. On the server side, we propose solutions for efficiently processing a large number of BRC queries (tens of millions) and TaSK queries over a stream of geo-textual objects.
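A BRC subscription reduces to a predicate evaluated against each incoming geo-textual object. The sketch below (hypothetical field names) handles only conjunctive keyword expressions over a rectangular region, whereas the system described supports general Boolean expressions and index-based matching of millions of subscriptions:

```python
def matches_brc(obj, region, required_terms):
    """BRC predicate: every keyword present AND location inside
    the rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = region
    x, y = obj["loc"]
    in_region = xmin <= x <= xmax and ymin <= y <= ymax
    has_terms = required_terms <= set(obj["text"].lower().split())
    return in_region and has_terms

home  = (0.0, 0.0, 5.0, 5.0)   # rectangle around the user's home
terms = {"dengue", "fever"}
near = {"loc": (3.2, 4.1), "text": "Dengue fever reported near the park"}
far  = {"loc": (9.0, 9.0), "text": "dengue fever alert"}
print(matches_brc(near, home, terms))  # → True
print(matches_brc(far, home, terms))   # → False
```

The efficiency challenge in the demonstration is the inverse of this sketch: rather than testing one object against one query, the server must route each arriving object to the tens of millions of standing subscriptions it satisfies.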
Abstract:
The complexity of modern SCADA networks and of their associated cyber-attacks requires an expressive yet flexible manner of representing both domain knowledge and collected intrusion alerts, with the ability to integrate them for enhanced analytical capabilities and a better understanding of attacks. This paper proposes an ontology-based approach for contextualizing intrusion alerts in SCADA networks. In this approach, three security ontologies were developed to represent and store information on intrusion alerts, Modbus communications, and Modbus attack descriptions. This information is correlated into enriched intrusion alerts using simple ontology logic rules written in the Semantic Query-Enhanced Web Rule Language (SQWRL). The contextualized alerts give analysts the means to better understand evolving attacks and to uncover the semantic relationships between sequences of individual attack events. The proposed system is illustrated by two use-case scenarios.
Abstract:
The past decade has witnessed unprecedented growth in the amount of available digital content, and its volume is expected to continue to grow over the next few years. Unstructured text data generated from web and enterprise sources form a large fraction of such content. Much of this content comprises reusable data, such as solutions to frequently occurring problems and general know-how that may be reused in appropriate contexts. In this work, we address issues around leveraging unstructured text data from sources as diverse as the web and the enterprise within the Case-based Reasoning framework. Case-based Reasoning (CBR) provides a framework and methodology for the systematic reuse of historical knowledge, available in the form of problem-solution pairs, in solving new problems. Here, we consider possibilities for enhancing Textual CBR systems under three main themes: procurement, maintenance, and retrieval. We adapt and build upon state-of-the-art techniques from data mining and natural language processing in addressing the various challenges therein. Under procurement, we investigate the problem of extracting cases (i.e., problem-solution pairs) from data sources such as incident/experience reports. We develop case-base maintenance methods specifically tuned to text, targeted towards retaining solutions such that the utility of the filtered case base in solving new problems is maximized. Further, we address the problem of query suggestion for textual case bases and show that exploiting the problem-solution partition can enhance retrieval effectiveness by prioritizing more useful query suggestions. Additionally, we illustrate interpretable clustering as a tool to drill down into domain-specific text collections (since CBR systems are usually very domain-specific) and develop techniques for improved similarity assessment in social-media sources such as microblogs. Through extensive empirical evaluations, we illustrate the improvements we are able to achieve over the state-of-the-art methods for the respective tasks.
Abstract:
Gene expression connectivity mapping has gained much popularity recently, with a number of successful applications in biomedical research testifying to its utility and promise. Previous methodological research in connectivity mapping has mainly focused on two of the key components in the framework, namely the reference gene expression profiles and the connectivity mapping algorithms. The other key component in this framework, the query gene signature, has been left to users to construct without much consensus on how this should be done, although it is the issue most relevant to end users. As a key input to the connectivity mapping process, the gene signature is crucially important for returning biologically meaningful and relevant results. This paper sets out to formulate a standardized procedure for constructing high-quality gene signatures from a user's perspective.
Abstract:
Within the study of domestic violence, typological approaches have gained prominence in part as a response to the wider feminist canon that presumes perpetrators are all simply motivated by male power. In this article we use a single case study to query the presumption, inherent in the most commonly used typological approaches, that offender motivations remain largely static over time and can be read off easily from self-reports or official records. We conclude by pointing to the need, for both academics and practitioners, to engage interpretively with the specific meanings that acts of violence hold for domestic violence perpetrators - informed as they can be by sexist values, perceptions of entitlement, and a specific history of conflict, suspicion or grievance - which can change who they are and the way they behave in the aftermath of assaults and breakups, as the foreground of crime is reincorporated into a background narrative.
Abstract:
The main objective of the work presented in this dissertation was the design, modelling, and development of a middleware platform enabling the integration of information systems at all levels (data, logic, and presentation), forming a federation of distributed and eclectic digital libraries. To this end, the various approaches to modelling and organizing digital libraries were studied, as were the support systems and technologies that existed at the start of the work. Recognizing that many gaps still remain in this domain, notably in the interoperability of heterogeneous systems and the integration of metadata semantics, a programme of research and development was undertaken to present possible solutions for filling those gaps. Two technologies, XML and Dublin Core, thus serve as the basis for all the remaining technologies used for interoperability and integration. Building on these base technologies, simple but efficient means of storing, indexing, and searching information were studied and developed, while maintaining independence from the major database vendors, which by themselves do not solve some of the most critical problems of research in the digital-libraries domain.
Abstract:
The development of equipment for massive genome sequencing has dramatically increased the amount of available data. However, to uncover relevant information from the analysis of these data, increasingly specific software is needed, oriented towards tasks that help the researcher reach conclusions as quickly as possible. This is the field in which bioinformatics emerges as a fundamental ally of biology, taking advantage of computational methods and infrastructures to develop algorithms and applications. Moreover, new biological questions usually demand new, specific solutions, so application development is a permanent challenge for software engineers. It was in this context that the main objectives of this work arose, centred on the analysis of triplets and of repeats in DNA primary structures. To this end, new methods and algorithms were proposed that allow processing and obtaining results over large volumes of data. For the analysis of codon and amino-acid triplets, a system was proposed with two facets: on one side, the processing of the data; on the other, the publication of the processed data on the Web through a visual query-composition mechanism. Regarding repeat analysis, a system was proposed and developed to identify repeated nucleotide and amino-acid patterns in specific sequences, with particular application to orthologous genes. The proposed solutions were subsequently validated through case studies that attest to the added value of the work developed.
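The repeat-identification component can be illustrated with a toy k-mer counter over a primary DNA sequence; this is a drastic simplification of the pattern-identification system described, intended only to show the kind of question it answers:

```python
from collections import Counter

def repeated_kmers(seq, k, min_count=2):
    """Length-k substrings occurring at least min_count times
    in a nucleotide (or amino-acid) sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

print(repeated_kmers("ATGATGCGATG", 3))  # → {'ATG': 3, 'GAT': 2}
```

Sliding a window of length k over the sequence and counting occurrences is linear in the sequence length, which is what makes this style of analysis feasible over the large data volumes mentioned above.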
Abstract:
When the scribes of ancient Mesopotamia rewrote the Epic of Gilgamesh over a period of more than two thousand years, the modifications made reflected the social transformations occurring during the same era. The dethroning of the goddess Inanna-Ishtar and the devaluation of other female characters in the evolving Epic of Gilgamesh coincided with the declining status of women in society. Since the 1960s, translations into modern languages have been readily available, and the Mesopotamian myth has been reused in a wide variety of mythic and mythological texts by Quebecois, Canadian and American authors. Our analysis of the first group of mythic texts, written in the 1960s and 1970s, shows a reversal of the tendency of the Mesopotamian texts. Written at a time when the feminist movement was transforming North American society, these retellings feature a goddess with her high status restored and her ancient attributes re-established. Another group of writers, publishing in the 1980s and 1990s, makes a radical shift away from these feminist tendencies while still basically rewriting the Epic. In this group of mythic texts, the goddess and other female characters find their roles reduced, while the male gods and characters have expanded and glorified roles. The third group of texts analysed does not rewrite the Epic; the Epic is reused here intertextually to give depth to mythological works set in the twentieth century or later. The dialogue created between the contemporary text and the Epic emphasises the role that the individual has in society. A large-scale comparative mythotextual study of texts that share a common hypotext can, especially when socio-historical factors are considered, provide a window onto the relationship between text and society. A comparative study of how the Epic of Gilgamesh is rewritten and referred to intertextually through time can help us relativize the understanding of our own time and culture.
Abstract:
The first four essays in this volume all focus on issues of gender in the works of different English authors and thinkers. Shorter versions of each of these essays were formerly presented as papers in an autonomous section of the Research and Educational Programme on Studies of Identity at the XXth Meeting of the Portuguese Association of Anglo-American Studies (Póvoa de Varzim, 1999) and published in the proceedings of the conference. The second cluster of essays in this volume — two of which (Jennie Wang’s and Teresa Cid’s) were first presented, in shorter versions, at the joint ASA/CAAS Conference (Montréal, 1999) — addresses the work of American women variously engaged in contexts of cultural diversity and grappling with the ideas of what it means to be an American and a woman, particularly in the twentieth century. These essays approach, from different angles, the definitional quandaries and semantic difficulties encountered when speaking about the self and the United States and provide, in one way or another, a sort of feminine rewriting of American myths and history.