Original Articles

The Classification Power of Web Features

Pages 421-457 | Published online: 15 Sep 2014
 

Abstract

In this article we give a comprehensive overview of features devised for web spam detection and investigate how much the various feature classes, some of which require very high computational effort, add to classification accuracy.

  • We collect and handle a large number of features based on recent advances in web spam filtering, including temporal ones; in particular, we analyze the strength and sensitivity of linkage change.

  • We propose new, temporal link-similarity-based features and show how to compute them efficiently on large graphs.

  • We show that machine learning techniques, including ensemble selection, LogitBoost, and random forest, significantly improve accuracy.

  • We conclude that, with appropriate learning techniques, a simple and computationally inexpensive feature subset outperforms all results published so far on our dataset and can be improved only slightly by adding computationally expensive features.

  • We test our method on three major publicly available datasets: the Web Spam Challenge 2008 dataset WEBSPAM-UK2007, the ECML/PKDD Discovery Challenge dataset DC2010, and the Waterloo Spam Rankings for ClueWeb09.

Our classifier ensemble sets the strongest classification benchmark compared to participants of the Web Spam and ECML/PKDD Discovery Challenges as well as the TREC Web track.
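
As a rough illustration of the ensemble selection idea named above, the following minimal sketch greedily averages the outputs of a small library of models, adding at each step (with replacement) the model that most improves the score of the averaged prediction. It is not the configuration used in the article: the use of AUC as the selection metric, the toy labels and scores, and the names `auc`, `ensemble_selection`, and `ensemble_scores` are all assumptions made for this example, and refinements such as a separate hill-climbing set or bagged selection are omitted.

```python
from typing import Dict, List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ensemble_selection(
    labels: Sequence[int],
    model_scores: Dict[str, List[float]],
    rounds: int = 10,
) -> List[str]:
    """Greedy forward selection with replacement.

    Returns the chosen model names; a name that appears several times
    simply carries more weight in the final average.
    """
    chosen: List[str] = []
    total = [0.0] * len(labels)  # running sum of the chosen models' scores
    best_auc = 0.0
    for _ in range(rounds):
        best_name, best_candidate_auc = None, best_auc
        for name, scores in model_scores.items():
            # Score of the ensemble if this model were added next.
            candidate = [(t + s) / (len(chosen) + 1) for t, s in zip(total, scores)]
            candidate_auc = auc(labels, candidate)
            if candidate_auc > best_candidate_auc:
                best_name, best_candidate_auc = name, candidate_auc
        if best_name is None:  # no model improves the ensemble any further
            break
        chosen.append(best_name)
        total = [t + s for t, s in zip(total, model_scores[best_name])]
        best_auc = best_candidate_auc
    return chosen


def ensemble_scores(chosen: List[str], model_scores: Dict[str, List[float]]) -> List[float]:
    """Average the scores of the chosen models (repeats included)."""
    n = len(next(iter(model_scores.values())))
    return [sum(model_scores[m][i] for m in chosen) / len(chosen) for i in range(n)]


# Toy validation data (hypothetical): 1 = spam, 0 = non-spam.
labels = [1, 1, 1, 0, 0, 0]
model_scores = {
    "A": [0.9, 0.8, 0.3, 0.4, 0.2, 0.1],
    "B": [0.6, 0.4, 0.9, 0.1, 0.7, 0.3],
    "C": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
chosen = ensemble_selection(labels, model_scores)
combined = ensemble_scores(chosen, model_scores)
```

On this toy data neither model A nor model B separates the two classes perfectly on its own, but their average does, which is the effect ensemble selection tries to exploit.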

To foster research in the area, we make several feature sets and source code public (https://datamining.sztaki.hu/en/download/web-spam-resources), including the temporal features of eight .uk crawl snapshots that include WEBSPAM-UK2007, as well as the Web Spam Challenge features for the labeled part of ClueWeb09.

Notes

The dataset can be downloaded from: http://law.di.unimi.it/datasets.php

http://icu-project.org/

The exact classifier model specification files used for Weka and the data files used for the experiments are available upon request from the authors.

A summary is available as part of our data release at https://dms.sztaki.hu/sites/dms.sztaki.hu/files/download/2013/enpt-queries.txt.gz.

The temporal feature data used in our research is available at: https://datamining.sztaki.hu/en/download/web-spam-resources
