901 resultados para Vector Space IR, Search Engines, Document Clustering, Document
Resumo:
Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these were not so far linked to general purpose, signature file based, search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature file based indexing and retrieval, performance that is comparable to that of state of the art inverted file based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and positions the file signatures model in the class of Vector Space retrieval models.
Resumo:
This paper analyses the pairwise distances of signatures produced by the TopSig retrieval model on two document collections. The distribution of the distances are compared to purely random signatures. It explains why TopSig is only competitive with state of the art retrieval models at early precision. Only the local neighbourhood of the signatures is interpretable. We suggest this is a common property of vector space models.
Resumo:
In this paper, we propose a text mining method called LRD (latent relation discovery), which extends the traditional vector space model of document representation in order to improve information retrieval (IR) on documents and document clustering. Our LRD method extracts terms and entities, such as person, organization, or project names, and discovers relationships between them by taking into account their co-occurrence in textual corpora. Given a target entity, LRD discovers other entities closely related to the target effectively and efficiently. With respect to such relatedness, a measure of relation strength between entities is defined. LRD uses relation strength to enhance the vector space model, and uses the enhanced vector space model for query based IR on documents and clustering documents in order to discover complex relationships among terms and entities. Our experiments on a standard dataset for query based IR shows that our LRD method performed significantly better than traditional vector space model and other five standard statistical methods for vector expansion.
Resumo:
In this paper we propose a method that integrates the no- tion of understandability, as a factor of document relevance, into the evaluation of information retrieval systems for con- sumer health search. We consider the gain-discount evaluation framework (RBP, nDCG, ERR) and propose two understandability-based variants (uRBP) of rank biased precision, characterised by an estimation of understandability based on document readability and by different models of how readability influences user understanding of document content. The proposed uRBP measures are empirically contrasted to RBP by comparing system rankings obtained with each measure. The findings suggest that considering understandability along with topicality in the evaluation of in- formation retrieval systems lead to different claims about systems effectiveness than considering topicality alone.
Resumo:
Entity-oriented search has become an essential component of modern search engines. It focuses on retrieving a list of entities or information about the specific entities instead of documents. In this paper, we study the problem of finding entity related information, referred to as attribute-value pairs, that play a significant role in searching target entities. We propose a novel decomposition framework combining reduced relations and the discriminative model, Conditional Random Field (CRF), for automatically finding entity-related attribute-value pairs from free text documents. This decomposition framework allows us to locate potential text fragments and identify the hidden semantics, in the form of attribute-value pairs for user queries. Empirical analysis shows that the decomposition framework outperforms pattern-based approaches due to its capability of effective integration of syntactic and semantic features.
Resumo:
When publishing information on the web, one expects it to reach all the people that could be interested in. This is mainly achieved with general purpose indexing and search engines like Google which is the most used today. In the particular case of geographic information (GI) domain, exposing content to mainstream search engines is a complex task that needs specific actions. In many occasions it is convenient to provide a web site with a specially tailored search engine. Such is the case for on-line dictionaries (wikipedia, wordreference), stores (amazon, ebay), and generally all those holding thematic databases. Due to proliferation of these engines, A9.com proposed a standard interface called OpenSearch, used by modern web browsers to manage custom search engines. Geographic information can also benefit from the use of specific search engines. We can distinguish between two main approaches in GI retrieval information efforts: Classical OGC standardization on one hand (CSW, WFS filters), which are very complex for the mainstream user, and on the other hand the neogeographer’s approach, usually in the form of specific APIs lacking a common query interface and standard geographic formats. A draft ‘geo’ extension for OpenSearch has been proposed. It adds geographic filtering for queries and recommends a set of simple standard response geographic formats, such as KML, Atom and GeoRSS. This proposal enables standardization while keeping simplicity, thus covering a wide range of use cases, in both OGC and the neogeography paradigms. In this article we will analyze the OpenSearch geo extension in detail and its use cases, demonstrating its applicability to both the SDI and the geoweb. Open source implementations will be presented as well
Resumo:
This thesis investigates how web search evaluation can be improved using historical interaction data. Modern search engines combine offline and online evaluation approaches in a sequence of steps that a tested change needs to pass through to be accepted as an improvement and subsequently deployed. We refer to such a sequence of steps as an evaluation pipeline. In this thesis, we consider the evaluation pipeline to contain three sequential steps: an offline evaluation step, an online evaluation scheduling step, and an online evaluation step. In this thesis we show that historical user interaction data can aid in improving the accuracy or efficiency of each of the steps of the web search evaluation pipeline. As a result of these improvements, the overall efficiency of the entire evaluation pipeline is increased. Firstly, we investigate how user interaction data can be used to build accurate offline evaluation methods for query auto-completion mechanisms. We propose a family of offline evaluation metrics for query auto-completion that represents the effort the user has to spend in order to submit their query. The parameters of our proposed metrics are trained against a set of user interactions recorded in the search engine’s query logs. From our experimental study, we observe that our proposed metrics are significantly more correlated with an online user satisfaction indicator than the metrics proposed in the existing literature. Hence, fewer changes will pass the offline evaluation step to be rejected after the online evaluation step. As a result, this would allow us to achieve a higher efficiency of the entire evaluation pipeline. Secondly, we state the problem of the optimised scheduling of online experiments. We tackle this problem by considering a greedy scheduler that prioritises the evaluation queue according to the predicted likelihood of success of a particular experiment. This predictor is trained on a set of online experiments, and uses a diverse set of features to represent an online experiment. Our study demonstrates that a higher number of successful experiments per unit of time can be achieved by deploying such a scheduler on the second step of the evaluation pipeline. Consequently, we argue that the efficiency of the evaluation pipeline can be increased. Next, to improve the efficiency of the online evaluation step, we propose the Generalised Team Draft interleaving framework. Generalised Team Draft considers both the interleaving policy (how often a particular combination of results is shown) and click scoring (how important each click is) as parameters in a data-driven optimisation of the interleaving sensitivity. Further, Generalised Team Draft is applicable beyond domains with a list-based representation of results, i.e. in domains with a grid-based representation, such as image search. Our study using datasets of interleaving experiments performed both in document and image search domains demonstrates that Generalised Team Draft achieves the highest sensitivity. A higher sensitivity indicates that the interleaving experiments can be deployed for a shorter period of time or use a smaller sample of users. Importantly, Generalised Team Draft optimises the interleaving parameters w.r.t. historical interaction data recorded in the interleaving experiments. Finally, we propose to apply the sequential testing methods to reduce the mean deployment time for the interleaving experiments. We adapt two sequential tests for the interleaving experimentation. We demonstrate that one can achieve a significant decrease in experiment duration by using such sequential testing methods. The highest efficiency is achieved by the sequential tests that adjust their stopping thresholds using historical interaction data recorded in diagnostic experiments. Our further experimental study demonstrates that cumulative gains in the online experimentation efficiency can be achieved by combining the interleaving sensitivity optimisation approaches, including Generalised Team Draft, and the sequential testing approaches. Overall, the central contributions of this thesis are the proposed approaches to improve the accuracy or efficiency of the steps of the evaluation pipeline: the offline evaluation frameworks for the query auto-completion, an approach for the optimised scheduling of online experiments, a general framework for the efficient online interleaving evaluation, and a sequential testing approach for the online search evaluation. The experiments in this thesis are based on massive real-life datasets obtained from Yandex, a leading commercial search engine. These experiments demonstrate the potential of the proposed approaches to improve the efficiency of the evaluation pipeline.
Resumo:
Internet growth has provoked that information search had come to have one of the most relevant roles in the industry and to be one of the most current topics in research environments. Internet is the largest information container in history and its facility to generate new information leads to new challenges when talking about retrieving information and discern which one is more relevant than the rest. Parallel to the information growth in quantity, the way information is provided has also changed. One of these changes that has provoked more information traffic has been the emergence of social networks. We have seen how social networks can provoke more traffic than search engines themselves. We can draw conclusions that allow us to take a new approach to the information retrieval problem. Public trusts the most information coming from known contacts. In this document we will explore a possible change in classic search engines to bring them closer to the social side and adquire those social advantages.
Resumo:
As the Web evolves unexpectedly fast, information grows explosively. Useful resources become more and more difficult to find because of their dynamic and unstructured characteristics. A vertical search engine is designed and implemented towards a specific domain. Instead of processing the giant volume of miscellaneous information distributed in the Web, a vertical search engine targets at identifying relevant information in specific domains or topics and eventually provides users with up-to-date information, highly focused insights and actionable knowledge representation. As the mobile device gets more popular, the nature of the search is changing. So, acquiring information on a mobile device poses unique requirements on traditional search engines, which will potentially change every feature they used to have. To summarize, users are strongly expecting search engines that can satisfy their individual information needs, adapt their current situation, and present highly personalized search results. ^ In my research, the next generation vertical search engine means to utilize and enrich existing domain information to close the loop of vertical search engine's system that mutually facilitate knowledge discovering, actionable information extraction, and user interests modeling and recommendation. I investigate three problems in which domain taxonomy plays an important role, including taxonomy generation using a vertical search engine, actionable information extraction based on domain taxonomy, and the use of ensemble taxonomy to catch user's interests. As the fundamental theory, ultra-metric, dendrogram, and hierarchical clustering are intensively discussed. Methods on taxonomy generation using my research on hierarchical clustering are developed. The related vertical search engine techniques are practically used in Disaster Management Domain. Especially, three disaster information management systems are developed and represented as real use cases of my research work.^
Resumo:
As the Web evolves unexpectedly fast, information grows explosively. Useful resources become more and more difficult to find because of their dynamic and unstructured characteristics. A vertical search engine is designed and implemented towards a specific domain. Instead of processing the giant volume of miscellaneous information distributed in the Web, a vertical search engine targets at identifying relevant information in specific domains or topics and eventually provides users with up-to-date information, highly focused insights and actionable knowledge representation. As the mobile device gets more popular, the nature of the search is changing. So, acquiring information on a mobile device poses unique requirements on traditional search engines, which will potentially change every feature they used to have. To summarize, users are strongly expecting search engines that can satisfy their individual information needs, adapt their current situation, and present highly personalized search results. In my research, the next generation vertical search engine means to utilize and enrich existing domain information to close the loop of vertical search engine's system that mutually facilitate knowledge discovering, actionable information extraction, and user interests modeling and recommendation. I investigate three problems in which domain taxonomy plays an important role, including taxonomy generation using a vertical search engine, actionable information extraction based on domain taxonomy, and the use of ensemble taxonomy to catch user's interests. As the fundamental theory, ultra-metric, dendrogram, and hierarchical clustering are intensively discussed. Methods on taxonomy generation using my research on hierarchical clustering are developed. The related vertical search engine techniques are practically used in Disaster Management Domain. Especially, three disaster information management systems are developed and represented as real use cases of my research work.