Research Article

Why One Cannot Estimate the Entropy of English by Sampling

Pages 77-106 | Published online: 06 Jul 2017
 

Abstract

There have been attempts to approximate the entropy of English by frequency analysis of large corpora. Our original goal was to deduce more precise estimates by extensive calculations. This did not work well, thus confirming a widely held belief in linguistics. In order to put this belief on a firm basis, we used a simplified language model, closely related to others in the literature. This model exhibits an unexpected trichotomy: for very small n, say up to in our case, n-gram counting is reasonably reliable; for medium n, up to 14, increasing statistical noise is added, and beyond that we see statistical noise only. The model is precise enough to yield explicit values for the thresholds given above dependent on the corpus size. Even though a mathematically rigorous proof for English itself is out of reach, our model gives a strong indication that frequency counting in (large) corpora is a dead end for approximating the entropy of English, and different linguistic tools and insights are required. As far as we know, this is the first rigorous quantifiable argument concerning the linguistic intuition that frequency counting of samples is insufficient for entropy determination.
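The saturation effect described in the abstract can be illustrated with a minimal sketch of plug-in (frequency-count) entropy estimation. This is not the authors' model or code: the function names, the uniform random "language" over 26 letters, and the sample length are assumptions chosen for illustration only. The point it shows is that a sample of N letters contains at most N - n + 1 n-grams, so the estimated block entropy can never exceed log2(N - n + 1), no matter how large the true value is.

```python
import math
import random
from collections import Counter


def plugin_block_entropy(text: str, n: int) -> float:
    """Plug-in (maximum-likelihood) entropy, in bits, of the n-gram
    frequencies observed in `text` (overlapping n-grams)."""
    total = len(text) - n + 1
    if total <= 0:
        raise ValueError("text is shorter than n")
    counts = Counter(text[i:i + n] for i in range(total))
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def per_letter_entropy(text: str, n: int) -> float:
    """Block entropy divided by the block length n."""
    return plugin_block_entropy(text, n) / n


def saturation_ceiling(length: int, n: int) -> float:
    """Largest value the plug-in estimate can take: every one of the
    length - n + 1 observed n-grams is distinct."""
    return math.log2(length - n + 1)


# Hypothetical toy source: independent uniform letters, so the true
# per-letter entropy is log2(26) for every n.
rng = random.Random(0)
sample = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(10_000))

for n in (1, 2, 4, 8):
    print(n,
          round(per_letter_entropy(sample, n), 3),
          round(saturation_ceiling(len(sample), n) / n, 3))
```

For small n the printed estimate stays close to the true value of the toy source; once the number of possible n-grams far exceeds the sample size, the estimate is pinned near the ceiling and reflects the sample size rather than the source.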

Acknowledgements

The authors are grateful to Graeme Hirst for advice on the subject and to Reinhard Köhler for enlightening discussions.

Notes

No potential conflict of interest was reported by the authors.

1 The entropy of the distribution of lowercase Roman letters without space is 4.19 bits.

2 In the purged COCA, we have . Even if only for each letter n-gram occurs only once, we still get relative letter entropy .

3 Following Hilbert and López (Citation2011), the storage used nowadays is estimated as 295 exabyte.

Additional information

Funding

The authors’ work was funded by the Bonn–Aachen International Center for Information Technology foundation (the B-IT Foundation) and the state of Nordrhein-Westfalen.
