Good‐Turing frequency estimation without tears*
Address correspondence to Geoffrey Sampson, School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton BN1 9QH, England, e‐mail: [email protected], tel.: +44 1273 678525, fax: +44 1273 671320.
The authors are very grateful to Professor I.J. Good for detailed comments on a draft of this paper. Responsibility for the contents of the paper is the authors’ alone.
W.A. Gale has retired, March 1995.

William A. Gale AT&T Bell Laboratories, USA

Geoffrey Sampson School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton, BN1 9QH, England Phone: +44 1273 678525 Fax: +44 1273 671320 E-mail: [email protected]

Abstract

Linguists and speech researchers who use statistical methods often need to estimate the frequency of some type of item in a population containing items of various types. A common approach is to divide the number of cases observed in a sample by the size of the sample; sometimes small positive quantities are added to divisor and dividend in order to avoid zero estimates for types missing from the sample. These approaches are obvious and simple, but they lack principled justification, and yield estimates that can be wildly inaccurate. I.J. Good and Alan Turing developed a family of theoretically well‐founded techniques appropriate to this domain. Some versions of the Good‐Turing approach are very demanding computationally, but we define a version, the Simple Good‐Turing estimator, which is straightforward to use. Tested on a variety of natural‐language‐related data sets, the Simple Good‐Turing estimator performs well, absolutely and relative both to the approaches just discussed and to other, more sophisticated techniques.
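The two simple approaches the abstract describes can be sketched as follows, as a minimal illustration rather than code from the paper. The function names, the toy sample, the assumed vocabulary size `num_types`, and the added constant `k` are all hypothetical choices. The last function shows the basic Turing estimate of the total probability of unseen types (the number of types seen exactly once, divided by the sample size); it is not the Simple Good‐Turing estimator, whose details the abstract does not give.

```python
from collections import Counter


def relative_frequency(counts, item):
    """Observed count divided by sample size; zero for unseen types."""
    total = sum(counts.values())
    return counts.get(item, 0) / total


def additive_estimate(counts, item, num_types, k=1.0):
    """Add k to the dividend and k * num_types to the divisor.

    num_types (the assumed number of types in the population) and k
    are illustrative assumptions, not values taken from the paper.
    """
    total = sum(counts.values())
    return (counts.get(item, 0) + k) / (total + k * num_types)


def unseen_mass(counts):
    """Basic Turing estimate of the total probability of all unseen
    types: the number of types observed exactly once over sample size."""
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


# Toy sample: 7 tokens, 4 observed types; "z" is an unseen type.
counts = Counter("a a a b b c d".split())

p_seen = relative_frequency(counts, "a")
p_unseen = relative_frequency(counts, "z")
p_unseen_additive = additive_estimate(counts, "z", num_types=5)
p0 = unseen_mass(counts)
```

On this sample the relative-frequency estimate assigns the unseen type "z" a probability of zero, while the additive estimate depends entirely on the arbitrary choices of `k` and `num_types`. That arbitrariness is the lack of principled justification the abstract refers to.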
