996 results for Top-K


Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper describes a new method of indexing and searching large binary signature collections to efficiently find similar signatures, addressing the scalability problem in signature search. Signatures offer efficient computation with an acceptable measure of similarity in numerous applications. However, performing a complete search with a given search argument (a signature) requires a Hamming distance calculation against every signature in the collection. This quickly becomes excessive when dealing with large collections, presenting scalability issues that limit the applicability of signatures. Our method efficiently finds similar signatures in very large collections, trading memory use and precision for greatly improved search speed. Experimental results demonstrate that our approach is capable of finding a set of nearest signatures to a given search argument with a high degree of speed and fidelity.
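The exhaustive search that the abstract describes as the scalability bottleneck can be sketched as follows. This is only the brute-force baseline, not the indexing method the paper proposes; the function names and the representation of signatures as Python integers are assumptions made for illustration.

```python
import heapq

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures stored as ints."""
    return bin(a ^ b).count("1")

def top_k_nearest(query: int, signatures: list[int], k: int) -> list[tuple[int, int]]:
    """Return the k closest signatures as (distance, index) pairs.

    Every signature in the collection is compared against the query,
    so the cost grows linearly with the collection size.
    """
    scored = ((hamming(query, s), i) for i, s in enumerate(signatures))
    return heapq.nsmallest(k, scored)

# Hypothetical 8-bit signatures, for illustration only.
signatures = [0b10110010, 0b10110011, 0b01001101, 0b11110000]
print(top_k_nearest(0b10110010, signatures, 2))
```

Because each query touches every signature, the running time is proportional to the collection size, which is the cost an index is meant to avoid.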

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The top-k retrieval problem aims to find the optimal set of k documents from a number of relevant documents given the user’s query. The key issue is to balance the relevance and diversity of the top-k search results. In this paper, we address this problem using Facility Location Analysis taken from Operations Research, where the locations of facilities are optimally chosen according to some criteria. We show how this analysis technique is a generalization of state-of-the-art retrieval models for diversification (such as the Modern Portfolio Theory for Information Retrieval), which treat the top-k search results like “obnoxious facilities” that should be dispersed as far as possible from each other. However, Facility Location Analysis suggests that the top-k search results could be treated like “desirable facilities” to be placed as close as possible to their customers. This leads to a new top-k retrieval model where the best representatives of the relevant documents are selected. In a series of experiments conducted on two TREC diversity collections, we show that significant improvements can be made over the current state-of-the-art through this alternative treatment of the top-k retrieval problem.
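One possible reading of the "desirable facilities" view is a coverage-style objective: choose k documents so that every relevant document is close to at least one chosen representative. The sketch below greedily maximises the sum, over all documents, of the best similarity to a selected document. The objective, the greedy strategy, and the similarity matrix are assumptions for illustration and are not taken from the paper.

```python
def greedy_representatives(sim: list[list[float]], k: int) -> list[int]:
    """Greedily pick k documents maximising sum_j max_{i in S} sim[i][j]."""
    n = len(sim)
    selected: list[int] = []
    # best[j] = highest similarity between document j and any selected document
    best = [0.0] * n
    for _ in range(min(k, n)):
        def gain(i: int) -> float:
            # Improvement in total coverage if document i were added.
            return sum(max(sim[i][j] - best[j], 0.0) for j in range(n))
        candidates = [i for i in range(n) if i not in selected]
        choice = max(candidates, key=gain)
        selected.append(choice)
        best = [max(best[j], sim[choice][j]) for j in range(n)]
    return selected

# Hypothetical similarities: documents 0-1 form one topic, 2-4 another,
# with document 3 central to the second topic.
sim = [
    [1.0, 0.8, 0.0, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.7, 0.2],
    [0.0, 0.0, 0.7, 1.0, 0.7],
    [0.0, 0.0, 0.2, 0.7, 1.0],
]
print(greedy_representatives(sim, 2))
```

With this toy matrix the first pick is the document central to the larger topic, and the second pick comes from the other topic, so the result is diverse without dispersion being an explicit goal.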

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this paper, we consider the problem of selecting, for any given positive integer k, the top-k nodes in a social network, based on a certain measure appropriate for the social network. This problem is relevant in many settings such as analysis of co-authorship networks, diffusion of information, viral marketing, etc. However, in most situations, this problem turns out to be NP-hard. The existing approaches for solving this problem are based on approximation algorithms and assume that the objective function is sub-modular. In this paper, we propose a novel and intuitive algorithm based on the Shapley value, for efficiently computing an approximate solution to this problem. Our proposed algorithm does not use the sub-modularity of the underlying objective function and hence it is a general approach. We demonstrate the efficacy of the algorithm using a co-authorship data set from e-print arXiv (www.arxiv.org), having 8361 authors.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Massive amounts of geo-tagged data associated with text information are being generated at an unprecedented scale. These geo-textual data cover a wide range of topics. Users are interested in receiving up-to-date tweets whose locations are close to a user-specified location and whose texts are of interest to them. For example, a user may want to be updated with tweets near her home on the topic “food poisoning vomiting.” We consider the Temporal Spatial-Keyword Top-k Subscription (TaSK) query. Given a TaSK query, we continuously maintain the up-to-date top-k most relevant results over a stream of geo-textual objects (e.g., geo-tagged tweets) for the query. The TaSK query takes into account text relevance, spatial proximity, and recency of geo-textual objects in evaluating its relevance to a geo-textual object. We propose a novel solution to efficiently process a large number of TaSK queries over a stream of geo-textual objects. We evaluate the efficiency of our approach on two real-world datasets, and the experimental results show that our solution reduces processing time by 70-80% compared with two baselines.
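A minimal sketch of how the three factors named above (text relevance, spatial proximity, recency) might be combined into one score and used to pick the top-k objects at a single point in time. The weights, the Jaccard text measure, the linear distance falloff, and the exponential decay are all assumptions for illustration; the abstract does not give the paper's scoring function, and this sketch does not address continuous maintenance over a stream.

```python
import heapq
import math

def task_score(obj, query, now, alpha=0.4, beta=0.3, decay=0.01, max_dist=10.0):
    """Weighted sum of text relevance, spatial proximity and recency (all in [0, 1])."""
    q, t = set(query["keywords"]), set(obj["keywords"])
    text = len(q & t) / len(q | t) if (q | t) else 0.0    # Jaccard overlap
    dist = math.dist(query["loc"], obj["loc"])             # Euclidean distance
    spatial = max(0.0, 1.0 - dist / max_dist)               # linear falloff
    recency = math.exp(-decay * (now - obj["time"]))        # exponential decay
    return alpha * text + beta * spatial + (1.0 - alpha - beta) * recency

def top_k(objects, query, now, k):
    """Return the k highest-scoring objects at time `now`."""
    return heapq.nlargest(k, objects, key=lambda o: task_score(o, query, now))

# Hypothetical query and objects.
query = {"keywords": ["food", "poisoning"], "loc": (0.0, 0.0)}
objects = [
    {"id": "a", "keywords": ["food", "poisoning"], "loc": (1.0, 0.0), "time": 95},
    {"id": "b", "keywords": ["food", "festival"], "loc": (2.0, 0.0), "time": 90},
    {"id": "c", "keywords": ["traffic"], "loc": (50.0, 0.0), "time": 10},
]
print([o["id"] for o in top_k(objects, query, now=100, k=2)])
```

Because the recency term changes as time passes, a ranking computed once can become stale, which is one reason maintaining results continuously for many subscriptions is harder than this one-shot ranking.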

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper introduces a new type of discriminative subgraph pattern called breaker emerging subgraph pattern by introducing three constraints and two new concepts: base and breaker. A breaker emerging subgraph pattern consists of three subpatterns: a constrained emerging subgraph pattern, a set of bases and a set of breakers. An efficient approach is proposed for the discovery of top-k breaker emerging subgraph patterns from graph datasets. Experimental results show that the approach is capable of efficiently discovering top-k breaker emerging subgraph patterns from given datasets, and is more efficient than two previous methods for mining discriminative subgraph patterns. The discovered top-k breaker emerging subgraph patterns are more informative, more discriminative, more accurate and more compact than the minimal distinguishing subgraph patterns. The top-k breaker emerging patterns are more useful for substructure analysis, such as molecular fragment analysis. © 2009, Australian Computer Society, Inc.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In recent years, evaluating the influence of nodes and finding the top-k influential nodes in social networks has drawn wide attention and become a hot research issue. Considering the characteristics of social networks, we present a novel mechanism to mine the top-k influential nodes in mobile social networks. The proposed mechanism is based on behavior analysis of SMS/MMS (short message service / multimedia messaging service) communication between mobile users. We introduce complex network theory to build a social relation graph, which is used to reveal the relationship between people's social contacts and the messages they send. Moreover, an intimacy degree is introduced to characterize the social frequency among nodes. An election mechanism is employed to find the most influential node, and then a heap sorting algorithm is used to sort the voting results to find the k most influential nodes. The experimental results show that the mechanism can find the top-k most influential nodes efficiently and effectively. © 2013 IEEE.
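The last two steps described above (tally votes, then use a heap to extract the k highest-voted nodes) can be sketched as follows. The voting rule used here, where each user casts one vote for the contact with the highest intimacy degree, is an assumption for illustration; the abstract does not spell out the election mechanism.

```python
import heapq
from collections import Counter

def top_k_by_votes(intimacy: dict[str, dict[str, float]], k: int) -> list[tuple[str, int]]:
    """Each user votes for their most intimate contact; return the k most-voted nodes."""
    votes: Counter[str] = Counter()
    for user, contacts in intimacy.items():
        if contacts:
            favourite = max(contacts, key=contacts.get)
            votes[favourite] += 1
    # heapq.nlargest selects the k largest tallies without fully sorting them.
    return heapq.nlargest(k, votes.items(), key=lambda item: item[1])

# Hypothetical intimacy degrees, e.g. derived from message counts.
intimacy = {
    "a": {"b": 5, "c": 1},
    "b": {"c": 4, "a": 2},
    "c": {"b": 7},
    "d": {"b": 3, "c": 2},
}
print(top_k_by_votes(intimacy, 2))
```

Selecting with a heap costs roughly O(n log k) rather than the O(n log n) of a full sort, which matters when k is much smaller than the number of nodes.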

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Edge-labeled graphs have proliferated rapidly over the last decade due to the increased popularity of social networks and the Semantic Web. In social networks, relationships between people are represented by edges and each edge is labeled with a semantic annotation. Hence, a huge single graph can express many different relationships between entities. The Semantic Web represents each single fragment of knowledge as a triple (subject, predicate, object), which is conceptually identical to an edge from subject to object labeled with predicates. A set of triples constitutes an edge-labeled graph on which knowledge inference is performed. Subgraph matching has been extensively used as a query language for patterns in the context of edge-labeled graphs. For example, in social networks, users can specify a subgraph matching query to find all people that have certain neighborhood relationships. Heavily used fragments of the SPARQL query language for the Semantic Web and graph queries of other graph DBMS can also be viewed as subgraph matching over large graphs. Though subgraph matching has been extensively studied as a query paradigm in the Semantic Web and in social networks, a user can get a large number of answers in response to a query. These answers can be shown to the user in accordance with an importance ranking. In this thesis proposal, we present four different scoring models along with scalable algorithms to find the top-k answers via a suite of intelligent pruning techniques. The suggested models consist of a practically important subset of the SPARQL query language augmented with some additional useful features. The first model called Substitution Importance Query (SIQ) identifies the top-k answers whose scores are calculated from matched vertices' properties in each answer in accordance with a user-specified notion of importance. 
The second model, called Vertex Importance Query (VIQ), identifies important vertices in accordance with a user-defined scoring method that builds on top of various subgraphs articulated by the user. Approximate Importance Query (AIQ), our third model, allows partial and inexact matchings and returns the top-k of them according to user-specified approximation terms and scoring functions. In the fourth model, called Probabilistic Importance Query (PIQ), a query consists of several sub-blocks: one mandatory block that must be mapped and other blocks that can be opportunistically mapped. The probability is calculated from various aspects of answers, such as the number of mapped blocks and the vertices' properties in each block, and the top-k most probable answers are returned. An important distinguishing feature of our work is that we allow the user a huge amount of freedom in specifying: (i) what pattern and approximation he considers important, (ii) how to score answers, irrespective of whether they are vertices or substitutions, and (iii) how to combine and aggregate scores generated by multiple patterns and/or multiple substitutions. Because so much power is given to the user, indexing is more challenging than in situations where additional restrictions are imposed on the queries the user can ask. The proposed algorithms for the first model can also be used for answering SPARQL queries with ORDER BY and LIMIT, and the method for the second model also works for SPARQL queries with GROUP BY, ORDER BY and LIMIT. We test our algorithms on multiple real-world graph databases, showing that our algorithms are far more efficient than popular triple stores.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

In a pilot application based on a web search engine, called Web-based Relation Completion (WebRC), we propose to join two columns of entities linked by a predefined relation by mining knowledge from the web through a web search engine. To achieve this, a novel retrieval task, Relation Query Expansion (RelQE), is modelled: given an entity (query), the task is to retrieve documents containing entities in a predefined relation to the given one. Solving this problem entails expanding the query before submitting it to a web search engine to ensure that mostly documents containing the linked entity are returned in the top K search results. In this paper, we propose a novel Learning-based Relevance Feedback (LRF) approach to solve this retrieval task. Expansion terms are learned from training pairs of entities linked by the predefined relation and applied to new entity-queries to find entities linked by the same relation. After describing the approach, we present experimental results on real-world web data collections, which show that the LRF approach always improves the precision of top-ranked search results, by up to 8.6 times the baseline. Using LRF, WebRC also performs well above the baseline.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

With the overwhelming increase in the amount of data on the web and in databases, many text mining techniques have been proposed for mining useful patterns in text documents. Extracting closed sequential patterns using the Pattern Taxonomy Model (PTM) is one of the pruning methods to remove noisy, inconsistent, and redundant patterns. However, PTM treats each extracted pattern as a whole without considering the terms it includes, which could affect the quality of extracted patterns. This paper proposes an innovative and effective method that extends the random set to accurately weigh patterns based on their distribution in the documents and the distribution of their terms in patterns. Then, the proposed approach finds the specific closed sequential patterns (SCSP) based on the newly calculated weight. The experimental results on the Reuters Corpus Volume 1 (RCV1) data collection and TREC topics show that the proposed method significantly outperforms other state-of-the-art methods on several popular measures.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Our study concerns an important current problem, that of diffusion of information in social networks. This problem has received significant attention from the Internet research community in recent times, driven by many potential applications such as viral marketing and sales promotions. In this paper, we focus on the target set selection problem, which involves discovering a small subset of influential players in a given social network, to perform a certain task of information diffusion. The target set selection problem manifests in two forms: 1) top-k nodes problem and 2) lambda-coverage problem. In the top-k nodes problem, we are required to find a set of k key nodes that would maximize the number of nodes being influenced in the network. The lambda-coverage problem is concerned with finding a set of key nodes having minimal size that can influence a given percentage lambda of the nodes in the entire network. We propose a new way of solving these problems using the concept of Shapley value, which is a well-known solution concept in cooperative game theory. Our approach leads to algorithms which we call the ShaPley value-based Influential Nodes (SPINs) algorithms for solving the top-k nodes problem and the lambda-coverage problem. We compare the performance of the proposed SPIN algorithms with well-known algorithms in the literature. Through extensive experimentation on four synthetically generated random graphs and six real-world data sets (Celegans, Jazz, NIPS coauthorship data set, Netscience data set, High-Energy Physics data set, and Political Books data set), we show that the proposed SPIN approach is more powerful and computationally efficient. Note to Practitioners: In recent times, social networks have received a high level of attention due to their proven ability in improving the performance of web search, recommendations in collaborative filtering systems, spreading a technology in the market using viral marketing techniques, etc.
It is well known that the interpersonal relationships (or ties or links) between individuals cause change or improvement in the social system because the decisions made by individuals are influenced heavily by the behavior of their neighbors. An interesting and key problem in social networks is to discover the most influential nodes in the social network which can influence other nodes in the social network in a strong and deep way. This problem is called the target set selection problem and has two variants: 1) the top-k nodes problem, where we are required to identify a set of k influential nodes that maximize the number of nodes being influenced in the network and 2) the lambda-coverage problem which involves finding a set of influential nodes having minimum size that can influence a given percentage lambda of the nodes in the entire network. There are many existing algorithms in the literature for solving these problems. In this paper, we propose a new algorithm which is based on a novel interpretation of information diffusion in a social network as a cooperative game. Using this analogy, we develop an algorithm based on the Shapley value of the underlying cooperative game. The proposed algorithm outperforms the existing algorithms in terms of generality or computational complexity or both. Our results are validated through extensive experimentation on both synthetically generated and real-world data sets.
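To make the Shapley-value idea concrete, the sketch below estimates each node's Shapley value by sampling random orderings and averaging marginal contributions, then ranks nodes by that estimate. It is not the SPIN algorithm itself: the one-hop coverage value function, the sampling scheme, the toy graph, and all names are assumptions chosen for illustration.

```python
import random

def approx_shapley(players, value, samples=200, seed=0):
    """Monte Carlo estimate of Shapley values via random permutations."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(samples):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = value(coalition)
        for p in order:
            coalition.add(p)
            cur = value(coalition)
            phi[p] += cur - prev   # marginal contribution of p in this ordering
            prev = cur
    return {p: total / samples for p, total in phi.items()}

# Hypothetical graph: a star centred on node 0, plus a separate edge 5-6.
graph = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {6}, 6: {5}}

def coverage(coalition):
    """Assumed value function: nodes in the coalition or adjacent to it."""
    covered = set(coalition)
    for node in coalition:
        covered |= graph[node]
    return len(covered)

phi = approx_shapley(list(graph), coverage)
top_2 = sorted(phi, key=phi.get, reverse=True)[:2]
print(top_2)
```

In this toy game the exact Shapley value of the hub (node 0) is 2.2, against 1.0 for nodes 5 and 6 and 0.7 for each leaf, so the estimate should rank the hub first. Ranking by Shapley value alone can still pick neighbouring nodes whose influence overlaps, which a practical selection procedure would need to handle.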

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Increasing network lifetime is important in wireless sensor/ad-hoc networks. In this paper, we are concerned with algorithms to increase the network lifetime and the amount of data delivered during the lifetime by deploying multiple mobile base stations in the sensor network field. Specifically, we allow multiple mobile base stations to be deployed along the periphery of the sensor network field and develop algorithms to dynamically choose the locations of these base stations so as to improve network lifetime. We propose energy-efficient, low-complexity algorithms to determine the locations of the base stations; they include i) the Top-K-max algorithm, ii) the maximizing the minimum residual energy (Max-Min-RE) algorithm, and iii) the minimizing the residual energy difference (MinDiff-RE) algorithm. We show that the proposed base station placement algorithms provide increased network lifetimes and amounts of data delivered during the network lifetime compared to a single base station scenario as well as a scenario with multiple static base stations, and that these are close to those obtained by solving an integer linear program (ILP) to determine the locations of the mobile base stations. We also investigate the lifetime gain when an energy-aware routing protocol is employed along with multiple base stations.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

We investigate the problem of influence limitation in the presence of competing campaigns in a social network. Given a negative campaign which starts propagating from a specified source and a positive/counter campaign that is initiated, after a certain time delay, to limit the influence or spread of misinformation by the negative campaign, we are interested in finding the top k influential nodes at which the positive campaign may be triggered. This problem has numerous applications in situations such as limiting the propagation of rumors, arresting the spread of a virus through inoculation, initiating a counter-campaign against malicious propaganda, etc. The influence function for the generic influence limitation problem is non-submodular. Restricted versions of the influence limitation problem, reported in the literature, assume submodularity of the influence function and do not capture the problem in a realistic setting. In this paper, we propose a novel computational approach for the influence limitation problem based on the Shapley value, a solution concept in cooperative game theory. Our approach works equally effectively for both submodular and non-submodular influence functions. Experiments on standard real-world social network datasets reveal that the proposed approach outperforms existing heuristics in the literature. As a non-trivial extension, we also address the problem of influence limitation in the presence of multiple competing campaigns.