Research Article

Robust Web Data Extraction Based on Weighted Path-layer Similarity
