The Classification Power of Web Features


