Addressing instance ambiguity in web harvesting

Zhixu Li, Xiang Liang Zhang, Hai Huang, Qing Xie, Jia Zhu, Xiaofang Zhou

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

1 Scopus citation

Abstract

Web Harvesting enables the enrichment of incomplete data sets by retrieving required information from the Web. However, instance ambiguity may greatly decrease the quality of the harvested data, given that any instance in the local data set may become ambiguous when attempting to identify it on the Web. Although plenty of disambiguation methods have been proposed to deal with ambiguity problems in various settings, none of them can handle the instance ambiguity problem in Web Harvesting. In this paper, we propose to perform instance disambiguation in Web Harvesting with a novel disambiguation method inspired by the idea of collaborative identity recognition. In particular, we expect to find some common properties, in the form of latent shared attribute values, among the instances in the list, such that these shared attribute values can differentiate the instances in the list from the ambiguous ones on the Web. Our extensive experimental evaluation illustrates the utility of collaborative disambiguation for a popular Web Harvesting application, and shows that it substantially improves the accuracy of the harvested data.

Original language: English (US)
Title of host publication: 18th International Workshop on the Web and Databases, WebDB 2015
Subtitle of host publication: Freshness, Correctness, Quality of Information and Knowledge on the Web - Proceedings
Editors: Julia Stoyanovich, Fabian M. Suchanek
Publisher: Association for Computing Machinery, Inc
Pages: 6-12
Number of pages: 7
ISBN (Electronic): 9781450336277
State: Published - May 31 2015
Event: 18th International Workshop on the Web and Databases, WebDB 2015 - Melbourne, Australia
Duration: May 31 2015 → …

Publication series

Name: 18th International Workshop on the Web and Databases, WebDB 2015: Freshness, Correctness, Quality of Information and Knowledge on the Web - Proceedings

Other

Other: 18th International Workshop on the Web and Databases, WebDB 2015
Country/Territory: Australia
City: Melbourne
Period: 05/31/15 → …

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems
