990 results for Top-K
Abstract:
This paper describes a new method of indexing and searching large binary signature collections to efficiently find similar signatures, addressing the scalability problem in signature search. Signatures offer efficient computation with an acceptable measure of similarity in numerous applications. However, performing a complete search with a given search argument (a signature) requires a Hamming distance calculation against every signature in the collection. This quickly becomes excessive for large collections, presenting scalability issues that limit their applicability. Our method efficiently finds similar signatures in very large collections, trading memory use and precision for greatly improved search speed. Experimental results demonstrate that our approach finds a set of nearest signatures to a given search argument with high speed and fidelity.
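To make the cost of the exhaustive baseline concrete, here is a minimal Python sketch of top-k search by Hamming distance over binary signatures (this is the full scan the paper's index is designed to avoid, not the paper's method; signatures are modelled as plain integers):

    import heapq

    def hamming(a: int, b: int) -> int:
        # Number of differing bits between two equal-width binary signatures.
        return bin(a ^ b).count("1")

    def top_k_nearest(query: int, signatures, k: int):
        # Full scan: one distance computation per stored signature.
        return heapq.nsmallest(k, signatures, key=lambda s: hamming(query, s))

    # Example with 64-bit signatures.
    collection = [0x0F0F0F0F0F0F0F0F, 0xFFFF000000000000, 0x0F0F0F0F0F0F0F00]
    print(top_k_nearest(0x0F0F0F0F0F0F0F0F, collection, k=2))

Every query touches every stored signature, which is exactly the linear cost the proposed index trades memory and a little precision to reduce.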
Abstract:
The top-k retrieval problem aims to find the optimal set of k documents from a number of relevant documents given the user’s query. The key issue is to balance the relevance and diversity of the top-k search results. In this paper, we address this problem using Facility Location Analysis taken from Operations Research, where the locations of facilities are optimally chosen according to some criteria. We show how this analysis technique is a generalization of state-of-the-art retrieval models for diversification (such as the Modern Portfolio Theory for Information Retrieval), which treat the top-k search results like “obnoxious facilities” that should be dispersed as far as possible from each other. However, Facility Location Analysis suggests that the top-k search results could be treated like “desirable facilities” to be placed as close as possible to their customers. This leads to a new top-k retrieval model where the best representatives of the relevant documents are selected. In a series of experiments conducted on two TREC diversity collections, we show that significant improvements can be made over the current state-of-the-art through this alternative treatment of the top-k retrieval problem.
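As an illustration of the "desirable facilities" reading, the following hedged Python sketch (the dissimilarity measure and the greedy heuristic are assumptions, not the paper's model) selects k representative documents by greedily minimising the total dissimilarity between every relevant document and its closest selected representative:

    def greedy_representatives(docs, dissim, k):
        # docs: list of document ids; dissim(a, b): dissimilarity in [0, 1].
        selected = []
        while len(selected) < k and len(selected) < len(docs):
            def cost_with(c):
                # Total distance of every document to its nearest facility if c is added.
                return sum(min(dissim(d, f) for f in selected + [c]) for d in docs)
            best = min((d for d in docs if d not in selected), key=cost_with)
            selected.append(best)
        return selected

Under this view the selected results sit "close to their customers" (the remaining relevant documents), rather than being dispersed from each other as in the obnoxious-facilities treatment.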
Abstract:
In this paper, we consider the problem of selecting, for any given positive integer k, the top-k nodes in a social network, based on a certain measure appropriate for the social network. This problem is relevant in many settings such as analysis of co-authorship networks, diffusion of information, viral marketing, etc. However, in most situations, this problem turns out to be NP-hard. The existing approaches for solving this problem are based on approximation algorithms and assume that the objective function is sub-modular. In this paper, we propose a novel and intuitive algorithm based on the Shapley value, for efficiently computing an approximate solution to this problem. Our proposed algorithm does not use the sub-modularity of the underlying objective function and hence it is a general approach. We demonstrate the efficacy of the algorithm using a co-authorship data set from e-print arXiv (www.arxiv.org), having 8361 authors.
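The general idea can be sketched in Python as follows; permutation sampling is a standard way to approximate Shapley values, but the coverage-style objective used here (a node plus its neighbours) is only an illustrative stand-in for whatever measure the network calls for:

    import random

    def shapley_top_k(graph, k, samples=200):
        # graph: dict mapping node -> set of neighbours.
        nodes = list(graph)
        value = {v: 0.0 for v in nodes}
        for _ in range(samples):
            random.shuffle(nodes)
            covered = set()
            for v in nodes:
                gain = len(({v} | graph[v]) - covered)  # marginal coverage of v
                value[v] += gain
                covered |= {v} | graph[v]
        # Rank nodes by estimated average marginal contribution.
        return sorted(nodes, key=lambda v: value[v], reverse=True)[:k]

Nothing in the estimator requires the objective to be submodular, which is the point the abstract emphasises.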
Abstract:
Massive amounts of geo-tagged data associated with text information are being generated at an unprecedented scale. These geo-textual data cover a wide range of topics. Users are interested in receiving up-to-date tweets whose locations are close to a user-specified location and whose texts are interesting to them. For example, a user may want to be updated with tweets near her home on the topic "food poisoning vomiting." We consider the Temporal Spatial-Keyword Top-k Subscription (TaSK) query. Given a TaSK query, we continuously maintain the up-to-date top-k most relevant results over a stream of geo-textual objects (e.g., geo-tagged tweets) for the query. The TaSK query takes into account text relevance, spatial proximity, and recency when evaluating the relevance of a geo-textual object to the query. We propose a novel solution to efficiently process a large number of TaSK queries over a stream of geo-textual objects. We evaluate the efficiency of our approach on two real-world datasets, and the experimental results show that our solution reduces processing time by 70-80% compared with two baselines.
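A minimal Python sketch of how the three signals might be combined for one subscription is shown below; the weights, the linear combination, and the exponential time decay are assumptions for illustration, not the paper's scoring function:

    import math, heapq

    def task_score(obj, query, now, alpha=0.4, beta=0.4, decay=0.001):
        # Text relevance: fraction of query keywords present in the object text.
        text = len(query["keywords"] & obj["terms"]) / max(1, len(query["keywords"]))
        # Spatial proximity: distance mapped into (0, 1], closer is better.
        dist = math.dist(query["loc"], obj["loc"])
        spatial = 1.0 / (1.0 + dist)
        # Recency: exponential decay with the object's age.
        recency = math.exp(-decay * (now - obj["time"]))
        return alpha * text + beta * spatial + (1 - alpha - beta) * recency

    def current_top_k(objects, query, now, k):
        # Re-ranking everything per arrival is the naive baseline; the paper's
        # contribution is avoiding this for large numbers of subscriptions.
        return heapq.nlargest(k, objects, key=lambda o: task_score(o, query, now))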
Abstract:
Edge-labeled graphs have proliferated rapidly over the last decade due to the increased popularity of social networks and the Semantic Web. In social networks, relationships between people are represented by edges, and each edge is labeled with a semantic annotation. Hence, a huge single graph can express many different relationships between entities. The Semantic Web represents each single fragment of knowledge as a triple (subject, predicate, object), which is conceptually identical to an edge from subject to object labeled with the predicate. A set of triples constitutes an edge-labeled graph on which knowledge inference is performed. Subgraph matching has been extensively used as a query language for patterns in the context of edge-labeled graphs. For example, in social networks, users can specify a subgraph matching query to find all people that have certain neighborhood relationships. Heavily used fragments of the SPARQL query language for the Semantic Web and the graph queries of other graph DBMSs can also be viewed as subgraph matching over large graphs. Though subgraph matching has been extensively studied as a query paradigm in the Semantic Web and in social networks, a user can get a large number of answers in response to a query. These answers can be shown to the user in accordance with an importance ranking. In this thesis proposal, we present four different scoring models along with scalable algorithms to find the top-k answers via a suite of intelligent pruning techniques. The suggested models cover a practically important subset of the SPARQL query language augmented with some additional useful features. The first model, called Substitution Importance Query (SIQ), identifies the top-k answers whose scores are calculated from matched vertices' properties in each answer in accordance with a user-specified notion of importance. The second model, called Vertex Importance Query (VIQ), identifies important vertices in accordance with a user-defined scoring method that builds on top of various subgraphs articulated by the user. Approximate Importance Query (AIQ), our third model, allows partial and inexact matchings and returns the top-k of them under user-specified approximation terms and scoring functions. In the fourth model, called Probabilistic Importance Query (PIQ), a query consists of several sub-blocks: one mandatory block that must be mapped and other blocks that can be opportunistically mapped. The probability is calculated from various aspects of the answers, such as the number of mapped blocks and the vertices' properties in each block, and the top-k most probable answers are returned. An important distinguishing feature of our work is that we allow the user a huge amount of freedom in specifying: (i) what pattern and approximation he considers important, (ii) how to score answers, irrespective of whether they are vertices or substitutions, and (iii) how to combine and aggregate scores generated by multiple patterns and/or multiple substitutions. Because so much power is given to the user, indexing is more challenging than in situations where additional restrictions are imposed on the queries the user can ask. The proposed algorithms for the first model can also be used for answering SPARQL queries with ORDER BY and LIMIT, and the method for the second model also works for SPARQL queries with GROUP BY, ORDER BY and LIMIT. We test our algorithms on multiple real-world graph databases, showing that our algorithms are far more efficient than popular triple stores.
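As a rough illustration of the generic top-k-with-pruning pattern that all four models rely on (the actual indexing and pruning techniques are the thesis's contribution and are not reproduced here), the following Python sketch keeps a size-k heap of the best answers and skips any candidate whose cheap upper-bound score cannot beat the current k-th best:

    import heapq

    def top_k_answers(candidates, k, score, upper_bound):
        # candidates: iterable of candidate matches;
        # score(c): exact score; upper_bound(c): cheap optimistic estimate.
        heap = []  # min-heap of (score, tiebreaker, answer)
        for i, c in enumerate(candidates):
            if len(heap) == k and upper_bound(c) <= heap[0][0]:
                continue  # pruned without full evaluation
            s = score(c)
            entry = (s, i, c)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif s > heap[0][0]:
                heapq.heapreplace(heap, entry)
        return [c for s, _, c in sorted(heap, key=lambda e: e[0], reverse=True)]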
Abstract:
In a pilot application based on a web search engine, called Web-based Relation Completion (WebRC), we propose to join two columns of entities linked by a predefined relation by mining knowledge from the web through a web search engine. To achieve this, a novel retrieval task, Relation Query Expansion (RelQE), is modelled: given an entity (query), the task is to retrieve documents containing entities in a predefined relation to the given one. Solving this problem entails expanding the query before submitting it to a web search engine, to ensure that mostly documents containing the linked entity are returned in the top K search results. In this paper, we propose a novel Learning-based Relevance Feedback (LRF) approach to solve this retrieval task. Expansion terms are learned from training pairs of entities linked by the predefined relation and applied to new entity-queries to find entities linked by the same relation. After describing the approach, we present experimental results on real-world web data collections, which show that the LRF approach always improves the precision of top-ranked search results, by up to 8.6 times over the baseline. Using LRF, WebRC also performs well above the baseline.
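The following Python sketch conveys the flavour of learning expansion terms from training pairs; the simple frequency-based selection here is an assumption for illustration, not the paper's LRF learner:

    from collections import Counter

    def learn_expansion_terms(training_docs, n_terms=10):
        # training_docs: list of token lists, each drawn from a document known
        # to contain a linked pair of the target relation.
        counts = Counter(t for doc in training_docs for t in set(doc))
        return [term for term, _ in counts.most_common(n_terms)]

    def expand_query(entity, expansion_terms):
        # Submit the entity together with the learned terms to the search engine.
        return entity + " " + " ".join(expansion_terms)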
Abstract:
With the overwhelming increase in the amount of data on the web and in databases, many text mining techniques have been proposed for mining useful patterns in text documents. Extracting closed sequential patterns using the Pattern Taxonomy Model (PTM) is one of the pruning methods for removing noisy, inconsistent, and redundant patterns. However, the PTM model treats each extracted pattern as a whole without considering its constituent terms, which can affect the quality of the extracted patterns. This paper proposes an innovative and effective method that extends the random set to accurately weight patterns based on their distribution in the documents and the distribution of their terms in patterns. The proposed approach then finds the specific closed sequential patterns (SCSP) based on the newly calculated weights. Experimental results on the Reuters Corpus Volume 1 (RCV1) data collection and TREC topics show that the proposed method significantly outperforms other state-of-the-art methods on several popular measures.
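To illustrate the kind of weighting being described (the exact formula is the paper's contribution; the one below is only an illustrative stand-in), a pattern's weight can be made to grow with its support across documents and with how informative its terms are across patterns:

    def pattern_weight(pattern, documents, term_freq):
        # pattern: set of terms; documents: list of term sets;
        # term_freq: dict mapping term -> number of patterns containing it.
        support = sum(1 for d in documents if pattern <= d) / max(1, len(documents))
        term_spread = sum(1.0 / term_freq[t] for t in pattern) / max(1, len(pattern))
        return support * term_spread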
Abstract:
Our study concerns an important current problem, that of diffusion of information in social networks. This problem has received significant attention from the Internet research community in recent times, driven by many potential applications such as viral marketing and sales promotions. In this paper, we focus on the target set selection problem, which involves discovering a small subset of influential players in a given social network to perform a certain task of information diffusion. The target set selection problem manifests in two forms: 1) the top-k nodes problem and 2) the lambda-coverage problem. In the top-k nodes problem, we are required to find a set of k key nodes that would maximize the number of nodes being influenced in the network. The lambda-coverage problem is concerned with finding a set of key nodes of minimal size that can influence a given percentage lambda of the nodes in the entire network. We propose a new way of solving these problems using the Shapley value, a well-known solution concept in cooperative game theory. Our approach leads to algorithms which we call the ShaPley value-based Influential Nodes (SPIN) algorithms for solving the top-k nodes problem and the lambda-coverage problem. We compare the performance of the proposed SPIN algorithms with well-known algorithms in the literature. Through extensive experimentation on four synthetically generated random graphs and six real-world data sets (Celegans, Jazz, the NIPS coauthorship data set, the Netscience data set, the High-Energy Physics data set, and the Political Books data set), we show that the proposed SPIN approach is more powerful and computationally efficient. Note to Practitioners: In recent times, social networks have received a high level of attention due to their proven ability to improve the performance of web search, recommendations in collaborative filtering systems, the spread of a technology in the market through viral marketing techniques, etc. It is well known that the interpersonal relationships (or ties or links) between individuals cause change or improvement in the social system, because the decisions made by individuals are influenced heavily by the behavior of their neighbors. An interesting and key problem in social networks is to discover the most influential nodes, that is, those that can influence other nodes in a strong and deep way. This problem is called the target set selection problem and has two variants: 1) the top-k nodes problem, where we are required to identify a set of k influential nodes that maximize the number of nodes being influenced in the network, and 2) the lambda-coverage problem, which involves finding a minimum-size set of influential nodes that can influence a given percentage lambda of the nodes in the entire network. There are many existing algorithms in the literature for solving these problems. In this paper, we propose a new algorithm based on a novel interpretation of information diffusion in a social network as a cooperative game. Using this analogy, we develop an algorithm based on the Shapley value of the underlying cooperative game. The proposed algorithm outperforms the existing algorithms in terms of generality, computational complexity, or both. Our results are validated through extensive experimentation on both synthetically generated and real-world data sets.
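The two variants can be contrasted with a small Python sketch (an illustration built on any node ranking and any spread estimator, not the SPIN algorithms themselves): top-k simply takes the k highest-ranked nodes, while lambda-coverage keeps adding ranked nodes until the estimated spread covers the requested fraction of the network.

    def top_k_nodes(ranked_nodes, k):
        # ranked_nodes: nodes sorted by decreasing estimated importance.
        return ranked_nodes[:k]

    def lambda_coverage(ranked_nodes, simulate_spread, n_nodes, lam):
        # simulate_spread(seeds) -> number of nodes influenced by the seed set,
        # e.g. estimated by Monte Carlo simulation of the diffusion model.
        seeds = []
        for v in ranked_nodes:
            seeds.append(v)
            if simulate_spread(seeds) >= lam * n_nodes:
                break
        return seeds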
Abstract:
Increasing network lifetime is important in wireless sensor/ad-hoc networks. In this paper, we are concerned with algorithms that increase network lifetime and the amount of data delivered during that lifetime by deploying multiple mobile base stations in the sensor network field. Specifically, we allow multiple mobile base stations to be deployed along the periphery of the sensor network field and develop algorithms to dynamically choose the locations of these base stations so as to improve network lifetime. We propose energy-efficient, low-complexity algorithms to determine the locations of the base stations: i) the Top-K-max algorithm, ii) the maximizing the minimum residual energy (Max-Min-RE) algorithm, and iii) the minimizing the residual energy difference (MinDiff-RE) algorithm. We show that the proposed base station placement algorithms provide increased network lifetime and amount of data delivered during the network lifetime compared to the single base station scenario as well as the multiple static base stations scenario, and close to those obtained by solving an integer linear program (ILP) to determine the locations of the mobile base stations. We also investigate the lifetime gain when an energy-aware routing protocol is employed along with multiple base stations.
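As a hedged sketch of the flavour of one such placement rule (the real algorithms and their energy model are the paper's contribution and are not reproduced here), a Max-Min-RE style choice can be expressed in Python as picking, among candidate peripheral placements, the one whose predicted per-node residual energies have the largest minimum:

    def max_min_re(candidate_placements, residual_energy_after):
        # residual_energy_after(placement) -> list of per-node residual energies
        # predicted for the next round under that placement (e.g., from the
        # routing and energy-consumption model in use).
        return max(candidate_placements,
                   key=lambda p: min(residual_energy_after(p)))

Maximising the minimum residual energy protects the most energy-starved node, which is typically what determines network lifetime.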
Abstract:
We investigate the problem of influence limitation in the presence of competing campaigns in a social network. Given a negative campaign that starts propagating from a specified source and a positive/counter campaign that is initiated, after a certain time delay, to limit the influence or spread of misinformation by the negative campaign, we are interested in finding the top k influential nodes at which the positive campaign may be triggered. This problem has numerous applications in situations such as limiting the propagation of rumors, arresting the spread of a virus through inoculation, initiating a counter-campaign against malicious propaganda, etc. The influence function for the generic influence limitation problem is non-submodular. Restricted versions of the influence limitation problem reported in the literature assume submodularity of the influence function and do not capture the problem in a realistic setting. In this paper, we propose a novel computational approach to the influence limitation problem based on the Shapley value, a solution concept in cooperative game theory. Our approach works equally effectively for both submodular and non-submodular influence functions. Experiments on standard real-world social network datasets reveal that the proposed approach outperforms existing heuristics in the literature. As a non-trivial extension, we also address the problem of influence limitation in the presence of multiple competing campaigns.
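For intuition, the following Python sketch shows a simple greedy baseline for the same problem (it is an illustration, not the paper's Shapley-value approach): counter-campaign seeds are chosen one at a time to minimise the simulated spread of the negative campaign, and nothing in the loop requires the objective to be submodular.

    def limit_influence(nodes, k, negative_spread_with):
        # negative_spread_with(positive_seeds) -> expected number of nodes the
        # negative campaign still reaches when the counter-campaign starts from
        # positive_seeds (e.g., estimated by Monte Carlo simulation).
        seeds = []
        for _ in range(k):
            best = min((v for v in nodes if v not in seeds),
                       key=lambda v: negative_spread_with(seeds + [v]))
            seeds.append(best)
        return seeds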
Abstract:
With the widespread adoption of information processing technology in communications, finance, industrial production, and other fields, data is no longer confined to traditional forms such as files and tables. Large volumes of continuous, changing streaming data appear in more and more modern applications, for example military command, traffic control, sensor data processing, network monitoring, and financial data analysis. In these applications, data arrives continuously as streams, and the system must process it continuously and promptly. Although existing data stream applications already collect large amounts of stream data, the events users care about are usually the anomalous ones, because anomalous events tend to carry more information worth attention. To deliver anomalous events detected on individual data streams to the composite event detection module promptly and accurately, an event notification model with QoS adaptivity is an ideal choice. Composite events answer the complex requirements of practical applications; they are typically built from atomic events combined through logical connectives and various operators. In addition, to respond in time, predefined actions should be triggered, for example raising an alarm or taking a photograph where the anomalous event occurred, which requires the detection system to have a degree of proactivity. ECA rules, the cornerstone of the "detect-respond" model, can satisfy this requirement. Most existing research focuses on offline mining and analysis of normal events, while relatively little work addresses atomic anomalous event detection over real-time data streams, QoS-adaptive event notification models, and composite event detection over data streams arriving in arbitrary order; this thesis therefore concentrates on these three aspects. First, the thesis systematically analyzes the requirements and characteristics of data stream applications and proposes HAPS, an anomalous event detection framework for real-time data streams. HAPS consists of four layers: an atomic anomalous event detection layer for data streams, a QoS-adaptive real-time event notification service, a composite event detection layer, and an action execution layer. Second, the thesis gives a comprehensive and detailed survey of existing work on these four aspects and analyzes its shortcomings. It then studies each of the four aspects in depth and obtains several innovative results. Finally, a prototype of the framework is implemented and evaluated with extensive simulated and real data streams; the experimental results show that the work on all four aspects meets the expected goals. The main contributions of this thesis are: 1. Based on the local correlation integral, an incremental anomaly detection algorithm for data streams (incLOCI) is proposed, with a time complexity of only O(N log N); it is proved that both the insertion of a new event and the deletion of an expired event affect only a bounded number of its neighbors. 2. An "approximate" top-k real-time event notification model is proposed, which uses the degree of relevance between event content and subscription requirements as the matching criterion and adaptively selects the "approximately" top-k relevant data within the deadline. 3. A composite event detection model for data streams arriving in arbitrary order is proposed; it supports three typical event contexts as well as aggregation functions, and the data structures and algorithms used in the model are designed and implemented. 4. A prototype anomalous event detection system for real-time data streams is implemented; it not only detects atomic anomalous events in real-time data streams, but also dispatches atomic events to the composite event layer through the QoS-adaptive real-time event notification service to form composite events, and finally uses RECA rules to respond promptly to anomalous situations. Research on anomalous event detection over real-time data streams has high application value and broad prospects. The results of this thesis provide a solid foundation for further study of atomic anomalous event detection and composite event detection over real-time data streams.
Abstract:
Matching query interfaces is a crucial step in data integration across multiple Web databases. The problem is closely related to schema matching, which typically exploits different features of schemas. Relying on a single feature of schemas is not sufficient. We propose an evidential approach to combining multiple matchers using the Dempster-Shafer theory of evidence. First, our approach views the match results of an individual matcher as a source of evidence that provides a level of confidence in the validity of each candidate attribute correspondence. Second, it combines multiple sources of evidence to obtain a combined mass function that represents the overall level of confidence, taking into account the match results of the different matchers. Our combination mechanism does not require weighting parameters, so no setting and tuning of them is needed. Third, it selects the top k attribute correspondences for each source attribute from the target schema based on the combined mass function. Finally, it uses some heuristics to resolve any conflicts between the attribute correspondences of different source attributes. Our experimental results show that our approach is highly accurate and effective.
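The combination step can be sketched with Dempster's rule in a few lines of Python; the focal elements and mass values in the example are illustrative, not the paper's actual matchers:

    def combine(m1, m2):
        # m1, m2: dicts mapping frozenset-of-hypotheses -> mass in [0, 1].
        combined, conflict = {}, 0.0
        for a, wa in m1.items():
            for b, wb in m2.items():
                inter = a & b
                if inter:
                    combined[inter] = combined.get(inter, 0.0) + wa * wb
                else:
                    conflict += wa * wb
        norm = 1.0 - conflict  # Dempster's normalisation (assumes conflict < 1)
        return {h: w / norm for h, w in combined.items()}

    # Example: two matchers assign mass to candidate correspondences for one
    # source attribute; the frame {"name", "title"} represents ignorance.
    m_edit = {frozenset({"name"}): 0.6, frozenset({"name", "title"}): 0.4}
    m_type = {frozenset({"title"}): 0.3, frozenset({"name", "title"}): 0.7}
    print(combine(m_edit, m_type))

The combined mass function can then be used directly to rank the top k candidate correspondences for each source attribute.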