950 results for web page similarity


100.00% 100.00%



This paper proposes a hyperlink-based web page similarity measurement and two matrix-based hierarchical web page clustering algorithms. The web page similarity measurement incorporates hyperlink transitivity and page importance within the concerned web page space. One clustering algorithm takes cluster overlapping into account; the other does not. These algorithms do not require predefined similarity thresholds for clustering, and are independent of the page order. The primary evaluations show the effectiveness of the proposed algorithms in clustering improvement.


100.00% 100.00%



The rapid increase of web complexity and size makes web search results far from satisfactory in many cases, due to the huge amount of information returned by search engines. How to find intrinsic relationships among web pages at a higher level, so as to manage and retrieve web search information efficiently, is becoming a challenging problem. In this paper, we propose an approach to measure web page similarity. This approach takes hyperlink transitivity and page importance into consideration. From this new similarity measurement, an effective hierarchical web page clustering algorithm is proposed. The primary evaluations show the effectiveness of the new similarity measurement and the improvement of web page clustering. The proposed page similarity, as well as the matrix-based hyperlink analysis methods, could be applied to other web-based research areas.


100.00% 100.00%



With the size and state of the Internet today, a good quality approach to organizing this mass of information is of great importance. Clustering web pages into groups of similar documents is one approach, but relies heavily on good feature extraction and document representation as well as a good clustering approach and algorithm. Due to the changing nature of the Internet, resulting in a dynamic dataset, an incremental approach is preferred. In this work we propose an enhanced incremental clustering approach to develop a better clustering algorithm that can help to better organize the information available on the Internet in an incremental fashion. Experiments show that the enhanced algorithm outperforms the original histogram based algorithm by up to 7.5%.


100.00% 100.00%



Many web sites incorporate dynamic web pages to deliver customized contents to their users. However, dynamic pages result in increased user response times due to their construction overheads. In this paper, we consider mechanisms for reducing these overheads by utilizing the excess capacity with which web servers are typically provisioned. Specifically, we present a caching technique that integrates fragment caching with anticipatory page pre-generation in order to deliver dynamic pages faster during normal operating situations. A feedback mechanism is used to tune the page pre-generation process to match the current system load. The experimental results from a detailed simulation study of our technique indicate that, given a fixed cache budget, page construction speedups of more than fifty percent can be consistently achieved as compared to a pure fragment caching approach.


100.00% 100.00%



Automatically determining and assigning shared and meaningful text labels to data extracted from an e-Commerce web page is a challenging problem. An e-Commerce web page can display a list of data records, each of which can contain a combination of data items (e.g. product name and price) and explicit labels, which describe some of these data items. Recent advances in extraction techniques have made it much easier to precisely extract individual data items and labels from a web page. However, there are two open problems: 1. assigning an explicit label to a data item, and 2. determining labels for the remaining data items. Furthermore, improvements in the availability and coverage of vocabularies, especially in the context of e-Commerce web sites, mean that we now have access to a bank of relevant, meaningful and shared labels which can be assigned to extracted data items. However, there is a need for a technique which will take as input a set of extracted data items and automatically assign to them the most relevant and meaningful labels from a shared vocabulary. We observe that the Information Extraction (IE) community has developed a great number of techniques which solve problems similar to our own. In this work-in-progress paper we propose to theoretically and experimentally evaluate different IE techniques to ascertain which is most suitable to solve this problem.


100.00% 100.00%



The use of the Web has become an essential part of teaching and learning in Australian schools. Nevertheless, many students lack knowledge of how to properly evaluate and cite Web-sourced information. This paper presents criteria by which students can judge the reliability of Web resources, and guidelines on the citation of Web information. These strategies have been implemented successfully by the authors in their classes.


100.00% 100.00%



The World Wide Web is now a huge information source with its own characteristics. In most cases, traditional database-based technologies are no longer suitable for web information processing and management. For effectively processing and managing web information, it is necessary to reveal intrinsic relationships/structures among concerned web information objects such as web pages. In this work, a set of web pages that has its own intrinsic structure is called a web page community. This paper proposes a matrix model to describe relationships among concerned web pages. Based on this model, intrinsic relationships among pages could be revealed, and in turn a web page community could be constructed. The issues that are related to this model in its application are deeply investigated and studied. Some applications based on this model are presented, which demonstrate the potential of this matrix model in different kinds of web page community construction and information processing.


100.00% 100.00%



The development of the Internet has boosted the prosperity of the World Wide Web, which is now a huge information source. Because of the characteristics of the web, in most cases, traditional database-based technologies are no longer suitable for web information retrieval and management. To effectively manage web information, it is necessary to reveal intrinsic relationships/structures among web information objects by eliminating noise factors. This paper proposes a mechanism that could be widely used in information processing, including web information processing and noise factor elimination, to reveal more intrinsic relationships. As an application case of this mechanism, a relevant web page finding algorithm is proposed to uncover intrinsic relationships among web pages from their hyperlink patterns, and to find more semantically relevant web pages. The experimental evaluation shows the feasibility and effectiveness of the algorithm and demonstrates the potential of the proposed mechanism in web applications.


100.00% 100.00%



Discovering intrinsic relationships/structures among concerned web information objects such as web pages is important for effectively processing and managing web information. In this work, a set of web pages that has its own intrinsic structure is called a web page community. This paper proposes a matrix model to describe relationships among concerned web pages. Based on this model, intrinsic relationships among pages could be revealed, and in turn a web page community could be constructed. The issues that are related to this model and its applications are investigated and studied. Some applications based on this model are presented, which demonstrate the potential of this matrix model in different kinds of web page community construction and information processing.


100.00% 100.00%



The rapid development of network technologies has made the web a huge information source with its own characteristics. In most cases, traditional database-based technologies are no longer suitable for web information processing and management. For effectively processing and managing web information, it is necessary to reveal intrinsic relationships/structures among concerned web information objects such as web pages. In this work, a set of web pages that has its own intrinsic relationships is called a web page community. This paper proposes a matrix-based model to describe relationships among concerned web pages. Based on this model, intrinsic relationships among pages could be revealed, and in turn a web page community could be constructed. The issues that are related to the application of the model are deeply investigated and studied. The concepts of community and intrinsic relationships, as well as the proposed matrix-based model, are then extended to other application areas such as biological data processing. Some application cases of the model in a broad range of areas are presented, demonstrating the potential of this matrix-based model.


100.00% 100.00%



The rapid development of Web technologies has made the World Wide Web a huge information source. However, due to the lack of a well-defined underlying data model for Web documents, effectively and efficiently finding required information and managing Web data are usually tedious and difficult tasks when using conventional information retrieval and data management techniques. The Web page community, defined as a set of Web-based documents with its own logical and/or semantic structures, provides a flexible and effective approach to support Web data management, information retrieval and applications. This book addresses using hyperlink information to discover Web page communities. The work establishes a uniform framework for hyperlink analysis and community construction. Algorithms, supporting mechanisms and data models are proposed in the book. This book should help shed some light on this new and exciting research and application area. It is useful to researchers and students in Web mining, Web data management and information retrieval, as well as to professionals who may be considering utilizing Web communities to improve their applications.


100.00% 100.00%



In spite of the increasing presence of Semantic Web facilities, only a limited amount of the resources available on the Internet provide semantic access. Recent initiatives such as the emerging Linked Data Web are providing semantic access to available data by porting existing resources to the semantic web using different technologies, such as database-semantic mapping and scraping. Nevertheless, existing scraping solutions are ad hoc, complemented with graphical interfaces for speeding up scraper development. This article proposes a generic framework for web scraping based on semantic technologies. This framework is structured in three levels: scraping services, semantic scraping model and syntactic scraping. The first level provides an interface to generic applications or intelligent agents for gathering information from the web at a high level. The second level defines a semantic RDF model of the scraping process, in order to provide a declarative approach to the scraping task. Finally, the third level provides an implementation of the RDF scraping model for specific technologies. The work has been validated in a scenario that illustrates its application to mashup technologies.