Digital Library

Web Data Extraction from Query Result Pages Based on Visual and Content Features

**Author(s):** Weng, Daiyue; Hong, Jun; Bell, David
Date(s)	2012
Abstract	A rapidly increasing number of Web databases are now accessible via their HTML form-based query interfaces. Query result pages are dynamically generated in response to user queries; they encode structured data and are displayed for human use. Query result pages usually contain other types of information in addition to query results, e.g., advertisements and navigation bars. The problem of extracting structured data from query result pages is critical for Web data integration applications, such as comparison shopping and meta-search engines, and has been intensively studied. A number of approaches have been proposed. As the structures of Web pages become more and more complex, the existing approaches start to fail, and most of them do not remove irrelevant content, which may affect the accuracy of data record extraction. We propose an automated approach for Web data extraction. First, it makes use of visual features and query terms to identify data sections and extracts data records in these sections. We also represent several content and visual features of visual blocks in a data section, and use them to filter out noisy blocks. Second, it measures similarity between data items in different data records based on their visual and content features, and aligns them into different groups so that the data in the same group have the same semantics. The results of our experiments with a large set of Web query result pages in different domains show that our proposed approaches are highly effective.
Identifier	http://pure.qub.ac.uk/portal/en/publications/web-data-extraction-from-query-result-pages-based-on-visual-and-content-features(5729cb55-9224-4a3a-bc9e-09e26dfafff4).html
Language(s)	eng
Rights	info:eu-repo/semantics/restrictedAccess
Source	Weng, D, Hong, J & Bell, D 2012, 'Web Data Extraction from Query Result Pages Based on Visual and Content Features', International Journal of Software and Informatics, vol. 6, no. 3, pp. 453-472.
Type	article
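
The alignment step summarised in the abstract (measuring similarity between data items from different records and grouping items with the same semantics) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' method: the `DataItem` fields, the two visual features (left offset and font size), the coarse `content_type` feature, the equal weighting, the 0.75 threshold, and the greedy grouping strategy are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    text: str       # rendered text of the data item
    left: int       # hypothetical visual feature: x offset in pixels
    font_size: int  # hypothetical visual feature: font size in pixels


def content_type(text):
    """Hypothetical content feature: 'number' for prices/digits, else 'text'."""
    stripped = text.replace(",", "").replace(".", "").lstrip("$£€")
    return "number" if stripped.isdigit() else "text"


def similarity(a, b, visual_weight=0.5):
    """Weighted mix of visual and content agreement, in the range [0, 1]."""
    visual = 0.5 * (a.left == b.left) + 0.5 * (a.font_size == b.font_size)
    content = 1.0 if content_type(a.text) == content_type(b.text) else 0.0
    return visual_weight * visual + (1 - visual_weight) * content


def align(records, threshold=0.75):
    """Greedily place each item in the most similar existing group.

    Each group takes at most one item per record, and an item starts a new
    group when no unused group reaches the similarity threshold.
    """
    groups = []
    for record in records:
        used = set()  # groups already filled by an item of this record
        for item in record:
            best, best_score = None, 0.0
            for i, group in enumerate(groups):
                if i in used:
                    continue
                # Compare against the first item as the group representative.
                score = similarity(item, group[0])
                if score > best_score:
                    best, best_score = i, score
            if best is not None and best_score >= threshold:
                groups[best].append(item)
                used.add(best)
            else:
                groups.append([item])
                used.add(len(groups) - 1)
    return groups


# Illustrative records only; the values are invented for this sketch.
records = [
    [DataItem("Clean Code", 100, 16), DataItem("$29.99", 100, 12)],
    [DataItem("Refactoring", 100, 16), DataItem("$35.50", 100, 12)],
    [DataItem("Only Title", 100, 16)],
]
groups = align(records)
```

With these invented records, the titles end up in one group and the prices in another, including the third record, which has no price. A fuller treatment would likely use richer features and a more robust matching strategy than this greedy pass.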
