Research Article

Why One Cannot Estimate the Entropy of English by Sampling

Abstract

There have been attempts to approximate the entropy of English by frequency analysis of large corpora. Our original goal was to deduce more precise estimates by extensive calculations. This did not work well, thus confirming a widely held belief in linguistics. In order to put this belief on a firm basis, we used a simplified language model, closely related to others in the literature. This model exhibits an unexpected trichotomy: for very small n, say up to in our case, n-gram counting is reasonably reliable; for medium n, up to 14, increasing statistical noise is added, and beyond that we see statistical noise only. The model is precise enough to yield explicit values for the thresholds given above dependent on the corpus size. Even though a mathematically rigorous proof for English itself is out of reach, our model gives a strong indication that frequency counting in (large) corpora is a dead end for approximating the entropy of English, and different linguistic tools and insights are required. As far as we know, this is the first rigorous quantifiable argument concerning the linguistic intuition that frequency counting of samples is insufficient for entropy determination.
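The saturation effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' code or their language model; the function names `plugin_entropy_per_letter` and `saturation_bound` and the toy corpus are hypothetical and chosen only for illustration. The sketch computes the plug-in (frequency-counting) n-gram entropy per letter and compares it with the largest value that estimate can possibly take for a text of a given length: a text of N letters contains only N - n + 1 n-grams, so the counted entropy can never exceed log2(N - n + 1) / n bits per letter, whatever the language is.

```python
import math
from collections import Counter


def plugin_entropy_per_letter(text: str, n: int) -> float:
    """Plug-in n-gram entropy of `text`, divided by n (bits per letter).

    The probability of each n-gram is estimated by its relative
    frequency among all overlapping n-grams of the text.
    """
    total = len(text) - n + 1  # number of overlapping n-grams
    if n < 1 or total < 1:
        raise ValueError("need 1 <= n <= len(text)")
    counts = Counter(text[i:i + n] for i in range(total))
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / n


def saturation_bound(length: int, n: int) -> float:
    """Upper limit of the plug-in estimate for a text of `length` letters.

    It is reached when every n-gram in the text occurs exactly once.
    """
    total = length - n + 1
    if n < 1 or total < 1:
        raise ValueError("need 1 <= n <= length")
    return math.log2(total) / n


if __name__ == "__main__":
    # Toy corpus (an assumption for illustration, not data from the article).
    corpus = "thequickbrownfoxjumpsoverthelazydog" * 3
    for n in (1, 2, 4, 8, 16):
        estimate = plugin_entropy_per_letter(corpus, n)
        bound = saturation_bound(len(corpus), n)
        print(f"n={n:2d}  estimate={estimate:.3f}  bound={bound:.3f}")
```

As n grows, the bound shrinks roughly like log2(N) / n, so for large n the counted value reflects the corpus size rather than any property of the language. This is only the elementary counting argument; the explicit thresholds in the abstract come from the authors' model, which is not reproduced here.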

Acknowledgements

The authors are grateful to Graeme Hirst for advice on the subject and to Reinhard Köhler for enlightening discussions.

Notes

No potential conflict of interest was reported by the authors.

1 The entropy of the distribution of lowercase Roman letters without space is 4.19 bits per letter.

2 In the purged COCA, we have . Even if only for each letter n-gram occurs only once, we still get relative letter entropy .

3 Following Hilbert and López (2011), the storage used nowadays is estimated at 295 exabytes.

Additional information

Funding

The authors’ work was funded by the Bonn–Aachen International Center for Information Technology foundation (the B-IT Foundation) and the state of Nordrhein-Westfalen.
