TY - JOUR
T1 - A web-based approach to data imputation
AU - Li, Zhixu
AU - Sharaf, Mohamed Abdel Fattah
AU - Sitbon, Laurianne
AU - Sadiq, Shazia Wasim
AU - Indulska, Marta
AU - Zhou, Xiaofang
N1 - KAUST Repository Item: Exported on 2020-10-01
Acknowledgements: This research is partially supported by National 863 High-tech Program (Grant No. 2012AA011001) and the Australian Research Council (Grant No. DP110102777).
PY - 2013/10/24
Y1 - 2013/10/24
N2 - In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Moreover, several optimization techniques are also proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple-level and the database-level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques. © 2013 Springer Science+Business Media New York.
AB - In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Moreover, several optimization techniques are also proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple-level and the database-level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques. © 2013 Springer Science+Business Media New York.
UR - http://hdl.handle.net/10754/575587
UR - http://link.springer.com/10.1007/s11280-013-0263-z
UR - http://www.scopus.com/inward/record.url?scp=84927804608&partnerID=8YFLogxK
U2 - 10.1007/s11280-013-0263-z
DO - 10.1007/s11280-013-0263-z
M3 - Article
SN - 1386-145X
VL - 17
SP - 873
EP - 897
JO - World Wide Web
JF - World Wide Web
IS - 5
ER -