Research Article

Robust Web Data Extraction Based on Weighted Path-layer Similarity

Pages 536-546 | Published online: 02 Mar 2021
 

ABSTRACT

Web data extraction techniques often focus on accurate and efficient information acquisition from webpages. However, webpage variants frequently cause extraction to fail, resulting in high maintenance costs. Considerable effort has been devoted to robust extraction, but most existing approaches require either complex pre-processing or supplementary files. In this paper, a novel method is proposed that enhances extraction robustness by using the datatype and weight information of path-layers. The similarities between the path of the target node in the original webpage and the paths of candidate nodes in page variants are calculated to identify the candidate node most likely to correspond to the target. Experiments on a large set of real data show that this method yields better robustness than existing approaches.
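
As a rough illustration of the idea summarised in the abstract, the following Python sketch scores candidate nodes in a page variant by a weighted, layer-by-layer comparison of their paths against the path of the original target node, and then discounts candidates whose text has a different datatype. The abstract does not give the actual weighting scheme or datatype rules, so everything specific here is an assumption made for illustration: the geometric `decay` weights that favour layers nearest the node, the half credit for a matching tag at a different sibling position, the `type_penalty` factor, and the helper names `parse_path`, `path_similarity`, `guess_datatype` and `best_candidate`.

```python
import re
from typing import List, Tuple

Layer = Tuple[str, int]  # (tag name, 1-based position among same-tag siblings)


def parse_path(xpath: str) -> List[Layer]:
    """Split a simple absolute XPath such as /html/body/div[2]/span into layers."""
    layers = []
    for step in xpath.strip("/").split("/"):
        m = re.fullmatch(r"([\w-]+)(?:\[(\d+)\])?", step)
        if m is None:
            raise ValueError(f"unsupported path step: {step!r}")
        layers.append((m.group(1), int(m.group(2) or 1)))
    return layers


def layer_score(a: Layer, b: Layer) -> float:
    """Assumed scoring: 1.0 for same tag and position, 0.5 for same tag only."""
    if a[0] != b[0]:
        return 0.0
    return 1.0 if a[1] == b[1] else 0.5


def path_similarity(target: List[Layer], candidate: List[Layer],
                    decay: float = 0.8) -> float:
    """Weighted similarity in [0, 1]; layers are aligned from the node upwards.

    The layer closest to the node gets weight 1, the next decay, then decay**2,
    and so on (an assumed weighting, not the one used in the paper).
    """
    t, c = target[::-1], candidate[::-1]
    total = norm = 0.0
    for i in range(max(len(t), len(c))):
        w = decay ** i
        norm += w
        if i < len(t) and i < len(c):  # unmatched layers contribute zero
            total += w * layer_score(t[i], c[i])
    return total / norm if norm else 0.0


def guess_datatype(text: str) -> str:
    """Very coarse datatype guess: 'date' (ISO form), 'number', or 'text'."""
    text = text.strip()
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return "date"
    if re.fullmatch(r"[-+]?\d+(?:[.,]\d+)*", text):
        return "number"
    return "text"


def best_candidate(target_xpath: str, target_type: str,
                   candidates: List[Tuple[str, str]],
                   type_penalty: float = 0.5) -> Tuple[str, float]:
    """Return (xpath, score) of the highest-scoring (xpath, text) candidate."""
    target = parse_path(target_xpath)
    scored = []
    for xpath, text in candidates:
        score = path_similarity(target, parse_path(xpath))
        if guess_datatype(text) != target_type:
            score *= type_penalty  # assumed discount for a datatype mismatch
        scored.append((xpath, score))
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    variant_nodes = [
        ("/html/body/div[3]/table/tr[3]/td[2]", "19.99"),
        ("/html/body/div[2]/table/tr[3]/td[1]", "Widget"),
        ("/html/body/footer/span", "2021"),
    ]
    print(best_candidate("/html/body/div[2]/table/tr[3]/td[2]", "number",
                         variant_nodes))
```

In this toy input the first candidate is selected: its path differs from the original only at a layer far from the node (`div[3]` instead of `div[2]`), and its text is numeric, as expected for the target. Weighting layers near the node more heavily is one plausible way to tolerate layout changes higher up the tree; other weightings would fit the same description.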
