Web Data Extraction from Query Result Pages Based on Visual and Content Features
Date(s) |
2012
|
---|---|
Abstract |
A rapidly increasing number of Web databases are now accessible via their HTML form-based query interfaces. Query result pages, which encode structured data and are displayed for human use, are dynamically generated in response to user queries. Query result pages usually contain other types of information in addition to query results, e.g., advertisements, navigation bars, etc. The problem of extracting structured data from query result pages is critical for web data integration applications, such as comparison shopping and meta-search engines, and has been intensively studied. A number of approaches have been proposed. As the structures of Web pages become more and more complex, the existing approaches start to fail, and most of them do not remove irrelevant contents which may affect the accuracy of data record extraction. We propose an automated approach for Web data extraction. First, it makes use of visual features and query terms to identify data sections and extracts data records in these sections. We also represent several content and visual features of visual blocks in a data section, and use them to filter out noisy blocks. Second, it measures similarity between data items in different data records based on their visual and content features, and aligns them into different groups so that the data in the same group have the same semantics. The results of our experiments with a large set of Web query result pages in different domains show that our proposed approaches are highly effective. |
Identifier | |
Language(s) |
eng |
Rights |
info:eu-repo/semantics/restrictedAccess |
Source |
Weng, D, Hong, J & Bell, D 2012, 'Web Data Extraction from Query Result Pages Based on Visual and Content Features', International Journal of Software and Informatics, vol. 6, no. 3, pp. 453-472. |
Type |
article |