Journal of Computer Information Systems Volume 62, 2022 - Issue 3

Research Article

Robust Web Data Extraction Based on Weighted Path-layer Similarity

Peng Gao, Laboratory of Information Security and Software Engineering, NARI Group Corporation/State Grid Electric Power Research Institute, Nanjing, China (corresponding author)

https://orcid.org/0000-0001-6331-9383

Hao Han, Konica Minolta, Tokyo, Japan

Pages 536-546 | Published online: 02 Mar 2021

https://doi.org/10.1080/08874417.2020.1861571
References

Guo J, Han H. A method for facilitating end-user mashup based on description. Int J Web Eng Tech. 2014;9(2):99–124. doi:https://doi.org/10.1504/IJWET.2014.064767.
Dalvi N, Bohannon P, Sha F. Robust web extraction: an approach based on a probabilistic tree-edit model. Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. New York, NY, USA: Association for Computing Machinery; 2009. p. 335–48.
Gao P, Han H, Guo JX, Saeki M. Stable web scraping: an approach based on neighbour zone and path similarity of page elements. Int J Web Eng Tech. 2018;13(4):301–33. doi:https://doi.org/10.1504/IJWET.2018.097561.
Zhao F, Zhou J, Nie C, Huang H, Jin JH. SmartCrawler: a two-stage crawler for efficiently harvesting deep-web interfaces. IEEE Trans Serv Comput. 2016;9(4):608–20. doi:https://doi.org/10.1109/TSC.2015.2414931.
Munir K, Sheraz Anjum M. The use of ontologies for effective knowledge modelling and information retrieval. Appl Comput Inf. 2018;14(2):116–26. doi:https://doi.org/10.1016/j.aci.2017.07.003.
Sheikh M, Conlon S. A rule-based system to extract financial information. J Comput Inf Syst. 2012;52:10–19.
Furche T, Gottlob G, Grasso G, Guo XN, Giorgio O. OPAL: automated form understanding for the deep web. Proceedings of the 21st international conference on World Wide Web. New York, NY, USA: Association for Computing Machinery; 2012. p. 829–38.
Gatterbauer W, Bohunsky P, Herzog M, Krüpl B, Pollak B. Towards domain-independent information extraction from web tables. Proceedings of the 16th international conference on World Wide Web. New York, NY, USA: Association for Computing Machinery; 2007. p. 71–80.
Chen Z, Cafarella M. Automatic web spreadsheet data extraction. Proceedings of the 3rd International Workshop on Semantic Search Over the Web. New York, NY, USA: Association for Computing Machinery; 2013. p. 1–8.
Audeh B, Beigbeder M, Zimmermann A, Jaillon P, Bousquet C, Choo KKR. Vigi4Med scraper: a framework for web forum structured data extraction and semantic representation. PLoS One. 2017;12(1):e0169658. doi:https://doi.org/10.1371/journal.pone.0169658.
Liu W, Meng X, Meng W. ViDE: a vision-based approach for deep web data extraction. IEEE Trans Knowl Data Eng. 2010;22(3):447–60. doi:https://doi.org/10.1109/TKDE.2009.109.
Fang Y, Xie X, Zhang X, Cheng R, Zhang Z. STEM: a suffix tree-based method for web data records extraction. Knowl Inf Syst. 2018;55:305–31.
Figueiredo LNL, de Assis GT, Ferreira AA. DERIN: a data extraction method based on rendering information and n-gram. Inf Process Manag. 2017;53(5):1120–38. doi:https://doi.org/10.1016/j.ipm.2017.04.007.
Reis DC, Golgher PB, Silva AS, Laender AH. Automatic web news extraction using tree edit distance. Proceedings of the 13th international conference on World Wide Web. New York, NY, USA: Association for Computing Machinery; 2004. p. 502–11.
Han H. Extracting news from server side databases by query interfaces. J Comput Inf Syst. 2014;54(2):57–65. doi:https://doi.org/10.1080/08874417.2014.11645686.
Wu S, Liu J, Fan J. Automatic web content extraction by combination of learning and grouping. Proceedings of the 24th International Conference on World Wide Web. Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee; 2015. p. 1264–74.
Nielandt J, De Mol R, Bronselaer A, de Tré G. Wrapper induction by XPath alignment. Proceedings of the International Conference on Knowledge Discovery and Information Retrieval. Rome, Italy: Science and Technology Publications; 2014. p. 492–500.
Nielandt J, Bronselaer A, de Tré G. Predicate enrichment of aligned XPaths for wrapper induction. Expert Syst Appl. 2016;51:259–75. doi:https://doi.org/10.1016/j.eswa.2015.12.040.
Ferrara E, Baumgartner R. Automatic wrapper adaptation by tree edit distance matching. In: Hatzilygeroudis I, Prentzas J, editors. Combinations of intelligent methods and applications. Berlin (Heidelberg): Springer; 2011. p. 41–54.
Cohen JP, Ding W, Bagherjeiran A. Semi-supervised web wrapper repair via recursive tree matching. arXiv:1505.01303v1 [cs]. 2015. Available from: http://arxiv.org/abs/1505.01303v1. [accessed 2020 Jan 12]
Omari A, Shoham S, Yahav E. Synthesis of forgiving data extractors. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. New York, NY, USA: Association for Computing Machinery; 2017. p. 385–94.
Furche T, Gottlob G, Grasso G, Schallhart C, Sellers A. OXPath: a language for scalable data extraction, automation, and crawling on the deep web. VLDB J. 2013;22(1):47–72. doi:https://doi.org/10.1007/s00778-012-0286-6.
Leotta M, Stocco A, Ricca F, Tonella P. Reducing web test cases aging by means of robust XPath locators. Proceedings of the 25th IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW 2014); 3–6 November 2014; Napoli, Italy. p. 449–54.
Leotta M, Stocco A, Ricca F, Tonella P. Robula+: an algorithm for generating robust XPath locators for web testing. J Software Evol Process. 2016;28:177–204.
Robie J, Chamberlin D, Dyck M, Snelson J. XML path language (XPath) 3.0. W3C recommendation. World Wide Web Consortium (W3C). 2014.
Wagner RA, Fischer MJ. The string-to-string correction problem. J ACM. 1974;21(1):168–73. doi:https://doi.org/10.1145/321796.321811.
Hao Q, Cai R, Pang Y, Zhang L. From one tree to a forest: a unified solution for structured web data extraction. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. New York, NY, USA: Association for Computing Machinery; 2011. p. 775–84.
