WebPut : Efficient web-based data imputation


Author(s): Li, Zhixu; Sharaf, Mohamed A.; Sitbon, Laurianne; Sadiq, Shazia; Indulska, Marta; Zhou, Xiaofang
Contributor(s)

Wang, Sean X.

Cruz, Isabel

Delis, Alex

Huang, Guangyan

Date(s)

28/11/2012

Abstract

In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. To this end, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is also proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Experiments based on several real-world data collections demonstrate that WebPut outperforms existing approaches.
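The abstract names a confidence-based scheme and a greedy iterative schedule but does not spell either out. The following is only a minimal sketch of the general idea, under stated assumptions: it is not WebPut's algorithm. The function `greedy_impute`, the callbacks `estimate` and `retrieve`, and the toy lookup tables that stand in for web search are all hypothetical names introduced here for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]
Cell = Tuple[int, str]  # (row index, attribute name)


def greedy_impute(
    rows: List[Row],
    estimate: Callable[[List[Row], Cell], float],
    retrieve: Callable[[List[Row], Cell], Optional[str]],
) -> List[Cell]:
    """Fill missing cells in place, highest estimated confidence first.

    Hypothetical sketch: `estimate` scores how reliably a cell could be
    imputed right now; `retrieve` stands in for issuing a web query.
    Returns the cells in the order they were filled.
    """
    pending = {
        (i, attr)
        for i, row in enumerate(rows)
        for attr, value in row.items()
        if value is None
    }
    filled_order: List[Cell] = []
    while pending:
        # Re-score every pending cell each round, because values filled
        # earlier may raise the confidence of the remaining cells.
        best = max(sorted(pending), key=lambda cell: estimate(rows, cell))
        if estimate(rows, best) <= 0.0:
            break  # nothing left that can be imputed with any confidence
        pending.discard(best)
        value = retrieve(rows, best)
        if value is not None:
            rows[best[0]][best[1]] = value
            filled_order.append(best)
    return filled_order


# Toy stand-ins for web retrieval (assumed data, not from the paper).
CITY_OF = {"QUT": "Brisbane"}
COUNTRY_OF = {"Brisbane": "Australia"}


def toy_estimate(rows: List[Row], cell: Cell) -> float:
    row, attr = rows[cell[0]], cell[1]
    if attr == "city":
        return 0.9 if row["university"] in CITY_OF else 0.0
    if attr == "country":
        return 0.8 if row["city"] in COUNTRY_OF else 0.0
    return 0.0


def toy_retrieve(rows: List[Row], cell: Cell) -> Optional[str]:
    row, attr = rows[cell[0]], cell[1]
    if attr == "city":
        return CITY_OF.get(row["university"])
    if attr == "country":
        return COUNTRY_OF.get(row["city"])
    return None


table = [{"university": "QUT", "city": None, "country": None}]
order = greedy_impute(table, toy_estimate, toy_retrieve)
```

In this toy run the country cannot be imputed until the city is known, so the greedy loop fills the city first and the country second. That dependency between missing values is the reason an imputation order matters at all.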

Format

application/pdf

Identifier

http://eprints.qut.edu.au/56384/

Publisher

Springer Berlin Heidelberg

Relation

http://eprints.qut.edu.au/56384/1/webImp-WISE2012final.pdf

DOI:10.1007/978-3-642-35063-4_18

Li, Zhixu, Sharaf, Mohamed A., Sitbon, Laurianne, Sadiq, Shazia, Indulska, Marta, & Zhou, Xiaofang (2012) WebPut : Efficient web-based data imputation. In Wang, Sean X., Cruz, Isabel, Delis, Alex, & Huang, Guangyan (Eds.) 13th International Conference on Web Information Systems Engineering - WISE 2012, Springer Berlin Heidelberg, Paphos, Cyprus, pp. 243-256.

Rights

Copyright 2012 Springer-Verlag Berlin Heidelberg

Author retains, in addition to uses permitted by law, the right to communicate the content of the Contribution to other scientists, to share the Contribution with them in manuscript form, to perform or present the Contribution or to use the content for non-commercial internal and educational purposes, provided that the Springer publication is mentioned as the original source of publication in any printed or electronic materials. Author retains the right to republish the Contribution in any collection consisting solely of Author’s own works without charge but must ensure that the publication by Springer is properly credited and that the relevant copyright notice is repeated verbatim. Author may self-archive an author-created version of his/her Contribution on his/her own website and/or in his/her institutional repository, as well as on a non-commercial archival repository such as ArXiv/CoRR and HAL, including his/her final version. Author may also deposit this version on his/her funder’s or funder’s designated repository at the funder’s request or as a result of a legal obligation. Author may not use the publisher’s PDF version, which is posted on www.springerlink.com, for the purpose of self-archiving or deposit. Furthermore, Author may only post his/her version provided acknowledgement is given to the original source of publication and a link is inserted to the published article on Springer’s website. The link should be accompanied by the following text: "The original publication is available at www.springerlink.com". Author retains the right to use his/her Contribution for his/her further scientific career by including the final published paper in his/her dissertation or doctoral thesis provided acknowledgement is given to the original source of publication. Author also retains the right to use, without having to pay a fee and without having to inform the publisher, parts of the Contribution (e.g. illustrations) for inclusion in future work, and to publish a substantially revised version (at least 30% new content) elsewhere, provided that the original Springer Contribution is properly cited.

Source

School of Electrical Engineering & Computer Science; Science & Engineering Faculty

Keywords

#080000 INFORMATION AND COMPUTING SCIENCES #080107 Natural Language Processing #080604 Database Management #Web-based Data Imputation #WebPut #Incomplete Data #Data quality

Type

Conference Paper