
An ontology learning based approach for focused web crawling using combined normalized pointwise mutual information and Resnik algorithm

P. R. Joe Dhanith & B. Surendiran
Pages 1123-1129 | Received 25 Jun 2019, Accepted 18 Oct 2019, Published online: 30 Oct 2019
 

Abstract

Many existing focused crawlers compute the priority of unvisited Uniform Resource Locators (URLs) as a linear combination of the similarities between different texts of a web page and the specified topic, each with an associated weight. These weights, however, are chosen using methods such as Term Frequency-Inverse Document Frequency (TF-IDF), which can cause severe deviations in the priorities of unvisited web pages. Such methods also register similarity only when a word actually occurs in the web page; they do not consider the semantic similarity of the words in the page. To overcome these problems, this article presents a new focused web crawler, called P-crawler, based on a combined Normalized Pointwise Mutual Information (NPMI) and Resnik semantic similarity algorithm. In P-crawler, the record of an unvisited web page consists of the page text, anchor text, title text, bold text and heading text. The experimental findings show that the proposed algorithm improves the efficiency of focused crawling. In conclusion, the technique is efficient and promising for focused web crawlers.
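For readers unfamiliar with the two measures named in the abstract, the following Python sketch shows their standard textbook definitions. It is not the authors' implementation: the abstract does not say how P-crawler combines the two scores, so they are left separate here. The toy taxonomy `PARENT`, the concept probabilities `PROB`, and the function names are hypothetical, chosen only for illustration.

```python
import math

def npmi(p_x, p_y, p_xy):
    """Normalized pointwise mutual information, in the range [-1, 1].

    p_x and p_y are marginal probabilities of two terms; p_xy is the
    probability that they co-occur.
    """
    if p_xy <= 0.0:
        return -1.0  # the terms never co-occur
    if p_xy >= 1.0:
        return 1.0   # the terms always co-occur; avoids dividing by log(1) = 0
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)

def ancestors(concept, parent):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = parent.get(concept)
    return chain

def resnik(c1, c2, parent, prob):
    """Resnik similarity: information content, -log p(c), of the most
    informative concept that subsumes both c1 and c2."""
    common = set(ancestors(c1, parent)) & set(ancestors(c2, parent))
    if not common:
        return 0.0
    return max(-math.log(prob[c]) for c in common)

# Hypothetical toy taxonomy (child -> parent) with made-up concept probabilities.
PARENT = {"entity": None, "animal": "entity", "car": "entity",
          "dog": "animal", "cat": "animal"}
PROB = {"entity": 1.0, "animal": 0.5, "car": 0.5, "dog": 0.2, "cat": 0.2}

if __name__ == "__main__":
    print(npmi(0.5, 0.5, 0.25))              # independent terms
    print(resnik("dog", "cat", PARENT, PROB))  # share the subsumer "animal"
    print(resnik("dog", "car", PARENT, PROB))  # share only the root "entity"
```

Unlike TF-IDF term matching, both measures can assign a non-zero score to a pair of different words: NPMI through co-occurrence statistics, Resnik through a shared ancestor in a concept hierarchy.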

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Notes on contributors

P. R. Joe Dhanith

P. R. Joe Dhanith received his B.Tech degree in Information Technology from Anna University in 2010 and his M.E degree in Computer Science and Engineering from Anna University in 2012. He is currently pursuing his Ph.D degree in Computer Science and Engineering at National Institute of Technology Puducherry. His main research interests include web mining, web crawling and information retrieval.

B. Surendiran

B. Surendiran is currently working as Assistant Professor in the Department of Computer Science and Engineering at National Institute of Technology Puducherry, Karaikal, India. He completed his Ph.D in Computer Science and Engineering at National Institute of Technology Tiruchirapalli. His research interests include recommender systems and data mining. He received the “Best Paper Award” for his paper at the artcom2009 international conference.
