Original Articles

Internet Search Result Probabilities: Heaps' Law and Word Associativity*

Pages 40-66 | Published online: 25 Feb 2009
 

Abstract

We study how the number of internet search results returned for a multi-word query relates to the number of results returned when each word is searched for individually. We derive a model of search result counts for multi-word queries from the total number of pages indexed by Google, applying the Zipf power law to the distribution of words per page on the internet and Heaps' law to unique word counts. Based on data from 351 word pairs, each with exactly one hit when searched for together, and a Zipf law coefficient determined in other studies, we estimate the Heaps' law coefficient for the indexed World Wide Web (about 8 billion pages) to be β = 0.52. Previous studies used fewer than 20,000 pages. We demonstrate through examples how the model can be used to automatically analyse the relatedness of word pairs, assigning each a value we call “strength of associativity”. We validate our method with word triplets and through two experiments conducted 8 months apart. We then use our model to compare the index sizes of the competing search giants Yahoo and Google.
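
As a rough illustration of the two ingredients named in the abstract, the sketch below evaluates the textbook form of Heaps' law, V(n) = K·n^β, and a naive independence baseline for the number of pages containing both words of a pair. It is not the model derived in the article, which is not reproduced on this page. The function names, the constant `k`, and the individual hit counts are hypothetical; only β = 0.52 and the index size of about 8 billion pages come from the abstract.

```python
def heaps_vocabulary(n_tokens: float, k: float, beta: float = 0.52) -> float:
    """Textbook Heaps' law: distinct words expected in a text of n_tokens words.

    k is corpus dependent and is an assumed value here; beta defaults to the
    estimate quoted in the abstract.
    """
    return k * n_tokens ** beta


def independent_pair_hits(hits_a: float, hits_b: float, total_pages: float) -> float:
    """Pages expected to contain both words if the words occurred independently.

    This is a simple baseline, not the article's model: each word appears on a
    fraction hits/total_pages of pages, and the two fractions are multiplied.
    """
    return hits_a * hits_b / total_pages


if __name__ == "__main__":
    total_pages = 8e9  # approximate index size quoted in the abstract

    # Hypothetical individual hit counts for two words.
    hits_a, hits_b = 4e6, 2e6
    print("independence baseline:", independent_pair_hits(hits_a, hits_b, total_pages))

    # With beta near 0.5, quadrupling the text roughly doubles the vocabulary.
    k = 10.0  # assumed constant, for illustration only
    ratio = heaps_vocabulary(4e6, k) / heaps_vocabulary(1e6, k)
    print("vocabulary growth for 4x more text:", ratio)
```

Comparing an observed pair count against such a baseline is one simple way to think about word relatedness: a pair that co-occurs far more often than independence predicts is plausibly associated. How the article defines “strength of associativity” may differ.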
