935 results for Web Search
Abstract:
In this paper we propose a method that integrates the notion of understandability, as a factor of document relevance, into the evaluation of information retrieval systems for consumer health search. We consider the gain-discount evaluation framework (RBP, nDCG, ERR) and propose two understandability-based variants (uRBP) of rank-biased precision, characterised by an estimation of understandability based on document readability and by different models of how readability influences user understanding of document content. The proposed uRBP measures are empirically contrasted with RBP by comparing system rankings obtained with each measure. The findings suggest that considering understandability along with topicality in the evaluation of information retrieval systems leads to different claims about system effectiveness than considering topicality alone.
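For concreteness, here is a minimal sketch of the gain-discount idea behind this abstract: standard RBP, and an understandability-biased variant in which the topical gain at each rank is scaled by an understandability gain. The simple product of the two gains and the example understandability values are illustrative assumptions, not necessarily the paper's exact uRBP formulation.

```python
def rbp(relevances, p=0.8):
    """Rank-biased precision: (1 - p) * sum over ranks k of p^(k-1) * r_k."""
    return (1 - p) * sum(r * p**k for k, r in enumerate(relevances))

def urbp(relevances, understandabilities, p=0.8):
    """Understandability-biased RBP (illustrative): the topical gain at each
    rank is scaled by the probability that the user understands the document,
    e.g. estimated from a readability score."""
    return (1 - p) * sum(
        r * u * p**k
        for k, (r, u) in enumerate(zip(relevances, understandabilities))
    )

# Same topical relevance, different understandability: the two measures
# can rank systems differently, which is the paper's central point.
print(rbp([1, 1, 0, 1]))                          # topicality only
print(urbp([1, 1, 0, 1], [0.9, 0.4, 0.5, 0.8]))   # discounted by understanding
```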
Abstract:
In a pilot application based on a web search engine, called Web-based Relation Completion (WebRC), we propose to join two columns of entities linked by a predefined relation by mining knowledge from the web through a web search engine. To achieve this, a novel retrieval task, Relation Query Expansion (RelQE), is modelled: given an entity (query), the task is to retrieve documents containing entities in a predefined relation to the given one. Solving this problem entails expanding the query before submitting it to a web search engine, to ensure that mostly documents containing the linked entity are returned in the top K search results. In this paper, we propose a novel Learning-based Relevance Feedback (LRF) approach to solve this retrieval task. Expansion terms are learned from training pairs of entities linked by the predefined relation and applied to new entity queries to find entities linked by the same relation. After describing the approach, we present experimental results on real-world web data collections, which show that the LRF approach always improves the precision of the top-ranked search results, by up to 8.6 times over the baseline. Using LRF, WebRC also performs well above the baseline.
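A hedged sketch of how expansion terms might be learned from training entity pairs: candidate terms are scored by how often they appear in retrieved documents that also contain the linked entity. The scoring heuristic and the fetch_docs callback are illustrative assumptions, not the paper's exact LRF model.

```python
from collections import Counter

def learn_expansion_terms(training_pairs, fetch_docs, top_n=10):
    """Score candidate expansion terms by their frequency in documents that
    contain both entities of a training pair (illustrative heuristic)."""
    counts = Counter()
    for query_entity, linked_entity in training_pairs:
        for doc in fetch_docs(query_entity):          # assumed search callback
            if linked_entity.lower() in doc.lower():  # doc exhibits the relation
                for term in set(doc.lower().split()):
                    counts[term] += 1
    return [term for term, _ in counts.most_common(top_n)]

def expand_query(entity, expansion_terms):
    """Append the learned relation terms before submitting to the engine."""
    return entity + " " + " ".join(expansion_terms)
```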
Abstract:
For people with cognitive disabilities, technology is more often thought of as a support mechanism than as a source of division that may require intervention to equalize access across the cognitive spectrum. This paper presents a first attempt at formalizing the digital gap created by the generalization of search engines. This was achieved through the development of a mapping of the cognitive abilities required by users to execute low-level tasks during a standard Web search task. The mapping demonstrates how critical these abilities are to using search engines successfully with an adequate level of independence. It will lead to a set of design guidelines for engaging users of all abilities, both in search engine interfaces and, more importantly, in search algorithms such as query suggestion and measures of relevance (i.e. ranking).
Abstract:
The proliferation of the web presents an unsolved problem: automatically analyzing billions of pages of natural language. We introduce a scalable algorithm that clusters hundreds of millions of web pages into hundreds of thousands of clusters. It does this on a single mid-range machine using efficient algorithms and compressed document representations. It is applied to two web-scale crawls covering tens of terabytes: ClueWeb09 and ClueWeb12 contain 500 and 733 million web pages respectively, and were clustered into 500,000 to 700,000 clusters. To the best of our knowledge, such fine-grained clustering has not been previously demonstrated. Previous approaches clustered a sample, which limits the maximum number of discoverable clusters. The proposed EM-tree algorithm uses the entire collection in clustering and produces several orders of magnitude more clusters than existing algorithms. Fine-grained clustering is necessary for meaningful clustering in massive collections, where the number of distinct topics grows linearly with collection size. These fine-grained clusters show improved cluster quality when assessed with two novel evaluations using ad hoc search relevance judgments and spam classifications for external validation. These evaluations solve the problem of assessing cluster quality where categorical labeling is unavailable or infeasible.
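The EM-tree algorithm itself is more involved than can be shown here, but a toy sketch of the underlying idea, clustering compressed binary-signature document representations under Hamming distance, might look as follows. The signature width, the flat k-means loop, and the majority-vote center update are simplifying assumptions; the real algorithm nests clustering hierarchically to reach hundreds of thousands of clusters.

```python
import numpy as np

def hamming_kmeans(signatures, k, iters=10, seed=0):
    """Toy k-means over binary document signatures (0/1 arrays) using
    Hamming distance, the kind of compressed representation that makes
    single-machine clustering of huge collections feasible."""
    rng = np.random.default_rng(seed)
    centers = signatures[rng.choice(len(signatures), k, replace=False)]
    for _ in range(iters):
        # Assign each signature to the nearest center by Hamming distance.
        dists = (signatures[:, None, :] != centers[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the bitwise majority of its cluster.
        for c in range(k):
            members = signatures[labels == c]
            if len(members):
                centers[c] = (members.mean(axis=0) >= 0.5).astype(signatures.dtype)
    return labels, centers

# Example: 1,000 random 64-bit signatures clustered into 8 groups.
sigs = (np.random.default_rng(1).random((1000, 64)) > 0.5).astype(np.uint8)
labels, centers = hamming_kmeans(sigs, k=8)
```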
Abstract:
Information available on company websites can help people navigate to the offices of groups and individuals within the company. Automatically retrieving this within-organisation spatial information is a challenging AI problem. This paper introduces a novel unsupervised pattern-based method to extract within-organisation spatial information by taking advantage of HTML structure patterns, together with a novel Conditional Random Fields (CRF) based method to identify different categories of within-organisation spatial information. The results show that the proposed method achieves high performance in terms of F-score, indicating that this purely syntactic method based on web search and an analysis of HTML structure is well suited to retrieving within-organisation spatial information.
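A hedged sketch of the CRF labelling step, using the sklearn-crfsuite library. The feature set, the label categories (B-ROOM, B-BUILDING, etc.), and the training example are invented for illustration; the paper's actual features would also encode the HTML structure patterns it describes.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, i):
    """Simple per-token features; a real system would add features derived
    from the surrounding HTML structure."""
    t = tokens[i]
    return {
        "lower": t.lower(),
        "is_digit": t.isdigit(),
        "is_title": t.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# Hypothetical training example: tokens from a staff page, labelled with
# assumed spatial categories in BIO notation.
tokens = ["Office", ":", "Room", "3.14", ",", "Ada", "Building"]
labels = ["O", "O", "B-ROOM", "I-ROOM", "O", "B-BUILDING", "I-BUILDING"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # unseen pages would be labelled the same way
```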
Abstract:
This paper investigates the effect that text pre-processing approaches have on the estimation of the readability of web pages. Readability has been highlighted in previous work as an important aspect of web search result personalisation. The most widely used text readability measures rely on surface-level characteristics of text, such as the length of words and sentences. We demonstrate that different tools for extracting text from web pages lead to very different estimations of readability. This has an important implication for search engines, because search result personalisation strategies that consider users' reading ability may fail if incorrect text readability estimations are computed.
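To make the sensitivity to extraction concrete, here is a minimal sketch of a standard surface-level measure, the Flesch Reading Ease formula, applied to two different extractions of the same page. The naive vowel-group syllable counter and the toy example texts are assumptions for illustration.

```python
import re

def count_syllables(word):
    """Crude vowel-group syllable count; real tools use pronunciation
    dictionaries or better heuristics."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

# Two extractions of the same page: one keeps navigation boilerplate,
# one keeps only the main content; the scores can differ substantially.
with_boilerplate = "Home Products About Contact. Readability depends on extraction."
main_content = "Readability depends on extraction."
print(flesch_reading_ease(with_boilerplate), flesch_reading_ease(main_content))
```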
Abstract:
For many, particularly in the Anglophone world and Western Europe, it may be obvious that Google has a monopoly over online search and advertising, and that this is an undesirable state of affairs due to Google's ability to mediate information flows online. The baffling question may be why governments and regulators are doing little to nothing about this situation, given the increasingly pivotal importance of the internet and free-flowing communications in our lives. However, the law concerning monopolies, namely antitrust or competition law, works in what the general public may see as a less intuitive way. Monopolies themselves are not illegal. The conduct that is unlawful, i.e. abuses of that market power, is defined by a complex set of rules and revolves principally around economic harm suffered due to anticompetitive behavior. However, the effect of information monopolies over search, such as Google's, is more than just economic, yet competition law does not address this. Furthermore, Google's collection and analysis of user data and its portfolio of related services make it difficult for others to compete. Such a situation may also explain why Google's established search rivals, Bing and Yahoo, have not managed to provide services that are as effective or popular as Google's own (on this issue see also the texts by Dirk Lewandowski and Astrid Mager in this reader). Users, however, are not entirely powerless. Google's business model rests, at least partially, on them, especially on the data collected about them. If they stop using Google, then Google is nothing.
Abstract:
This paper investigates how people return to information in a dynamic information environment. For example, a person might want to return to Web content via a link encountered earlier on a Web page, only to learn that the link has since been removed. Changes can benefit users by providing new information, but they hinder returning to previously viewed information. The observational study presented here analyzed instances, collected via a Web search, where people expressed difficulty re-finding information because of changes to the information or its environment. A number of interesting observations arose from this analysis, including that the path originally taken to get to the information target appeared important in its re-retrieval, whereas, surprisingly, the temporal aspects of when the information was seen before were not. While people expressed frustration when problems arose, an explanation of why the change had occurred was often sufficient to allay that frustration, even in the absence of a solution. The implications of these observations for systems that support re-finding in dynamic environments are discussed.
Abstract:
ImageRover is a search-by-image-content navigation tool for the World Wide Web. To gather images expediently, the image collection subsystem utilizes a distributed fleet of WWW robots running on different computers. The image robots gather information about the images they find, compute the appropriate image decompositions and indices, and store this extracted information in vector form for searches based on image content. At search time, users can iteratively guide the search through the selection of relevant examples. Search performance is made efficient through the use of an approximate, optimized k-d tree algorithm. The system employs a novel relevance feedback algorithm that selects the distance metrics appropriate for a particular query.
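A minimal sketch of the approximate k-d tree search step over image feature vectors, using SciPy's cKDTree. The feature dimensionality, random features, and the eps approximation setting are illustrative assumptions, not ImageRover's actual implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical precomputed image feature vectors (standing in for the
# image decompositions stored by the crawling robots).
rng = np.random.default_rng(0)
features = rng.random((10_000, 32))

tree = cKDTree(features)

query_vector = rng.random(32)
# eps > 0 permits approximate nearest neighbours, trading a small loss of
# accuracy for faster search, in the spirit of an optimized k-d tree.
distances, indices = tree.query(query_vector, k=5, eps=0.5)
print(indices)  # indices of the 5 (approximately) closest images
```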
Abstract:
The main lines of work on the Semantic Web in the field of television archives are analysed and described. To that end, the Semantic Web is first analysed and contextualised from a general perspective, and the main initiatives working with audiovisual material are then examined: the MuNCH project, the S5T project, Semantic Television and VideoActive.
Abstract:
Search engines, such as Google, have been characterized as "Databases of intentions". This class will focus on different aspects of intentionality on the web, including goal mining, goal modeling and goal-oriented search.
Readings:
M. Strohmaier, M. Lux, M. Granitzer, P. Scheir, S. Liaskos and E. Yu, How Do Users Express Goals on the Web? An Exploration of Intentional Structures in Web Search, We Know'07 International Workshop on Collaborative Knowledge Management for Web Information Systems, in conjunction with WISE'07, Nancy, France, 2007. [Web link]
U. Lee, Z. Liu and J. Cho, Automatic Identification of User Goals in Web Search, WWW '05: Proceedings of the 14th International World Wide Web Conference, pp. 391-400, 2005. [Web link]
Abstract:
Given the evolution on the Internet of the news portals of the media, the idea arose of a search engine aimed at harvesting news scattered across the web pages of the major Spanish media outlets, which would make it possible to obtain information on "contracted descriptors" chosen by the users of a portal. The first objective is to analyse the needs to be covered for a hypothetical client of the application; the second is algorithmic: to devise a working methodology for obtaining each news item. At the programming level, three stages are considered: downloading the required web pages, which is done with the tools provided by the cUrl library; analysing the news items (obtaining all the links that correspond to news stories, filtering the descriptors to decide whether a story should be kept, and analysing the internal structure of the selected stories so that only the established parts are saved); and the database that must allow the selected news items to be organised and managed.
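A hedged sketch of the three-stage pipeline this abstract describes, using the pycurl bindings for the cUrl library the project mentions, plus SQLite for storage. The function names, the descriptor-matching heuristic, and the database schema are illustrative assumptions.

```python
import sqlite3
from io import BytesIO
import pycurl  # Python bindings for the cURL library the project builds on

def download(url):
    """Stage 1: fetch a page with libcurl."""
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.WRITEDATA, buf)
    c.setopt(c.FOLLOWLOCATION, True)
    c.perform()
    c.close()
    return buf.getvalue().decode("utf-8", errors="replace")

def matches_descriptors(article_text, descriptors):
    """Stage 2 (simplified): keep a story only if it mentions one of the
    user's contracted descriptors."""
    text = article_text.lower()
    return any(d.lower() in text for d in descriptors)

# Stage 3: store the selected stories for later organisation and querying.
db = sqlite3.connect("news.db")
db.execute("CREATE TABLE IF NOT EXISTS news (url TEXT PRIMARY KEY, body TEXT)")
```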
Abstract:
The iSAC project (Intelligent Citizen Attention Service via the web) began in January 2006, drawing on new scientific knowledge about intelligent agents together with the application of Information and Communication Technologies (ICT) and search engines. The current citizen attention service consists of two areas: face-to-face attention at the offices and telephone attention through the call centre. Limitations on staffing and opening hours make this service lose effectiveness. The aim is to develop a product with a technology capable of extending and improving the capacity and quality of citizen attention in public administrations of any size. Even so, this project will be exploited especially by town councils, which citizens approach with all kinds of questions and doubts, usually not restricted to the local sphere. More specifically, the aim is to automate citizen attention through a web portal in order to provide a more effective service.
Abstract:
This paper describes the implementation of a semantic web search engine on conversation-style transcripts. Our choice of data is Hansard, a publicly available conversation-style transcript of parliamentary debates. The current search engine implementation on Hansard is limited to running search queries based on keywords or phrases, and hence lacks the ability to make semantic inferences from user queries. By making use of knowledge such as the relationships between members of parliament, constituencies, terms of office, and topics of debates, the search results can be improved in terms of both relevance and coverage. Our contribution is not algorithmic; instead, we describe how we exploit a collection of external data sources, ontologies, semantic web vocabularies and named entity extraction in analysing the underlying semantics of user queries, as well as in the semantic enrichment of the search index, thereby improving the quality of results.
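A hedged sketch of the semantic-enrichment idea: named entities extracted from a transcript fragment are attached to its index entry alongside the raw text, so queries can match on entities rather than only on keywords. The spaCy model, the index layout, and the example sentence are illustrative assumptions, not the paper's actual pipeline.

```python
import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def enrich(document_text):
    """Build an index entry combining raw text with extracted entities, so
    a query about a member of parliament can match debates that refer to
    them by constituency or date rather than by exact keyword."""
    doc = nlp(document_text)
    return {
        "text": document_text,
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    }

entry = enrich("The Member for Finchley spoke on the Education Reform Bill in 1988.")
print(entry["entities"])  # e.g. [('Finchley', 'GPE'), ('1988', 'DATE'), ...]
```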