2 resultados para focused web crawling
em CentAUR: Central Archive University of Reading - UK
Resumo:
Search engines exploit the Web's hyperlink structure to help infer information content. The new phenomenon of personal Web logs, or 'blogs', encourage more extensive annotation of Web content. If their resulting link structures bias the Web crawling applications that search engines depend upon, there are implications for another form of annotation rapidly on the rise, the Semantic Web. We conducted a Web crawl of 160 000 pages in which the link structure of the Web is compared with that of several thousand blogs. Results show that the two link structures are significantly different. We analyse the differences and infer the likely effect upon the performance of existing and future Web agents. The Semantic Web offers new opportunities to navigate the Web, but Web agents should be designed to take advantage of the emerging link structures, or their effectiveness will diminish.
Resumo:
The Web's link structure (termed the Web Graph) is a richly connected set of Web pages. Current applications use this graph for indexing and information retrieval purposes. In contrast the relationship between Web Graph and application is reversed by letting the structure of the Web Graph influence the behaviour of an application. Presents a novel Web crawling agent, AlienBot, the output of which is orthogonally coupled to the enemy generation strategy of a computer game. The Web Graph guides AlienBot, causing it to generate a stochastic process. Shows the effectiveness of such unorthodox coupling to both the playability of the game and the heuristics of the Web crawler. In addition, presents the results of the sample of Web pages collected by the crawling process. In particular, shows: how AlienBot was able to identify the power law inherent in the link structure of the Web; that 61.74 per cent of Web pages use some form of scripting technology; that the size of the Web can be estimated at just over 5.2 billion pages; and that less than 7 per cent of Web pages fully comply with some variant of (X)HTML.