
Estimating Vocabulary Size


  • Some of these problems have been treated in G. U. Yule's classical The Statistical Study of Literary Vocabulary, Cambridge, 1944; and more recently in Pierre Guiraud, Les Caractères Statistiques du Vocabulaire, Paris, 1954.
  • The symbols used are Yule's. W stands for “words,” V for “vocabulary.”
  • One possible procedure is described in detail in A. D. Booth, L. Brandwood, and J. P. Cleave, Mechanical Resolution of Linguistic Problems, London, 1958, pp. 35–43.
  • E. L. Thorndike and I. Lorge, The Teacher's Word Book of 30,000 Words, New York, 1944.
  • The curve is drawn on the basis of the following data. Three of the first 500 entries (0.6%), twelve of the following 500 (2.4%), and about 11.5% of the first 19,440 words were found to be proper names, leaving roughly 17,200 other lexical units above the frequency one per million. As will be found below, the first 500 entries contribute about 70% of the occurrences, the next 500 about 10%, and the remainder about 20% of the running words of a text sample. Hence the proper names will contribute about 0.006 × 70 + 0.024 × 10 + 0.115 × 20 = 2.96% of the occurrences due to the words with a frequency above one per million. Accordingly the removal of proper names lowers V by about 11.5% in the upper (right-hand) portion of the diagram, while W is reduced by about 3% only.
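The arithmetic of this correction can be checked mechanically. A minimal sketch in Python, using only the band shares and proper-name rates quoted in the note (the variable names are illustrative):

```python
# Recompute the proper-name correction from the figures quoted above:
# each frequency band's proper-name rate, weighted by that band's share
# of the running words.
shares = [70, 10, 20]                 # % of occurrences per band
name_rates = [0.006, 0.024, 0.115]    # proper-name proportion per band

reduction = sum(rate * share for rate, share in zip(name_rates, shares))
print(f"W is reduced by about {reduction:.2f}%")  # -> about 2.96%
```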
  • Professor Tord Ganelius, Göteborg, informs me that the curve representing the TL material including proper names can be described with remarkable accuracy by means of the simple formula: which implies that for low values of IV, V is one-tenth of W, and for high values of W, increasing W by a factor of e increases V by 7,520 units. (My own extrapolation of Curve I in Diagram 1, which excludes proper names, assumes that for high values of W, an increase by a factor of e increases V by 8,690 units. Curve I cannot be described in the same simple manner anyway.)
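The formula itself did not survive extraction, but both stated properties are reproduced by the natural logarithmic form V = 7,520 · ln(1 + W/75,200); treat that reconstruction as an assumption rather than Ganelius's own expression. A quick numerical check:

```python
import math

# Assumed reconstruction of the Ganelius curve: V(W) = a * ln(1 + W / (10a))
# with a = 7,520.  For small W this behaves like W/10; for large W each
# e-fold increase in W adds about 7,520 to V.
A = 7_520.0

def vocabulary(w: float) -> float:
    return A * math.log(1.0 + w / (10.0 * A))

print(vocabulary(1_000) / 1_000)                   # ~0.099, i.e. about W/10
print(vocabulary(1e8 * math.e) - vocabulary(1e8))  # ~7,517, close to 7,520
```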
  • The following simplified argument yields a rough estimate. In a text mass of one million words, all the 17,000 lexical units with relative frequency of 10⁻⁶ or more should occur at least once, and the words with a relative frequency below 10⁻⁶ should occur at most once (i.e., either once, or not at all). Hence all the occurrences of the rare words will be once-occurrences; and so the rare words, which contribute 0.69% of the running words (according to Table 1), should contribute 6,900 different words. This would bring the total to be expected up to 17,000 + 6,900 = 23,900. In fact, however, this reasoning yields too high a value for the number of words to be found in a given text mass. In a random sample of the TL material, some of the words with relative frequency above 10⁻⁶ would fail to appear, and some of the words with relative frequency below 10⁻⁶ would occur more than once. This is clearly brought out in our Table 2.
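Why the naive total overshoots can be made concrete with a simple Poisson model (an assumption for illustration, not the author's own calculation): a word of relative frequency p appears in a sample of N running words with probability about 1 − e^(−pN).

```python
import math

# Under a Poisson model, a word with relative frequency p has probability
# 1 - exp(-p * N) of appearing at least once in N running words.
N = 1_000_000
p = 1e-6   # exactly one per million

print(1 - math.exp(-p * N))   # ~0.632: a 37% chance of missing the sample

# So a fair share of the 17,000 units at or just above one per million
# will be absent from any single million-word sample, which is why the
# observed vocabulary falls short of 23,900.
```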
  • I have taken over this formula from Guiraud, Caractères Statistiques, p. 33, to whom I refer for a fuller discussion of it. The symbol L is used by Guiraud for Lexique, or lexicon, the hypothetical potential vocabulary of which the actual vocabulary is but a reflex (see below, note 20). It must be understood that the formula yields an approximation only, since it assumes that the probability of occurrence is the same for all the words belonging to the selected frequency range, which is not strictly the case.
  • Clearly that proportion is simply the sum of the relative frequencies of each word in the range. If we assign rank numbers to all the words in the language, and let a be the rank of the most frequent word, and b the rank of the least frequent word in the range we are considering, we have Σp = p_a + p_(a+1) + … + p_b, where p_r is the relative frequency of the word of rank r.
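In code, this definition is a one-line sum over the ranks in question (a minimal sketch; the frequency list is invented for illustration):

```python
# Proportion of running words contributed by the range of ranks a..b:
# the sum of the relative frequencies of those words.
def range_proportion(freqs: list[float], a: int, b: int) -> float:
    """freqs[r - 1] is the relative frequency of the word of rank r."""
    return sum(freqs[a - 1 : b])

print(range_proportion([0.07, 0.035, 0.023, 0.018], 2, 4))  # 0.076
```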
  • When estimating P from the diagram, Curve I has been divided into small portions, for which a reasonably accurate average frequency (P) can be obtained by simply dividing the logarithmic distance between the two limits of the portion into two equal parts, and reading off the frequency corresponding to the point so obtained. Table 1 has been condensed for reasons of space: for the purposes of calculation, the curve portion between p = 10⁻² and p = 10⁻³ was divided into seven parts, whose limits were (p expressed in thousandths): 6.7, 5, 3.3, 2.5, 2, 1.43, and with the following averages: 8.2, 5.8, 4.1, 2.9, 2.2, 1.7, 1.2. Similarly for the lower frequency ranges. Later calculations in this paper have of course been based on this complete set of figures.
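The range-by-range calculation can be mimicked numerically. The sketch below assumes that the formula of note 8 has the familiar form V = L(1 − e^(−p̄W)), with every one of the L words in a range sharing the range's average probability p̄; the reliance on exponential tables (cited two notes below) points the same way, but the form is an inference, not a quotation.

```python
import math

# Expected number of distinct words contributed by each sub-range of the
# band p = 10^-2 .. 10^-3, assuming V = L * (1 - exp(-p_bar * W)) per range.
averages = [8.2, 5.8, 4.1, 2.9, 2.2, 1.7, 1.2]   # average p, in thousandths

def expected_vocabulary(words_per_part: list[int], w: int) -> float:
    return sum(L * (1 - math.exp(-(p / 1000) * w))
               for L, p in zip(words_per_part, averages))

# With a hypothetical 10 words per sub-range and a 10,000-word sample,
# virtually every word in these frequent ranges is expected to appear:
print(expected_vocabulary([10] * 7, 10_000))   # ~70.0
```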
  • See note 8, above.
  • E.g. Tables of the Exponential Function eˣ, U.S. National Bureau of Standards, Applied Mathematics Series 14, Washington, 1951.
  • J. Mersand, Chaucer's Romance Vocabulary, New York, 1939.
  • O. Jespersen, Growth and Structure of the English Language, Oxford, 1952, pp. 199–200.
  • Ibid.
  • Miles L. Hanley, Word Index to James Joyce's Ulysses, Madison, 1951.
  • See above, note 5.
  • O. Jespersen, Growth and Structure, p. 200. I assume that the count excluded proper names.
  • For the counts presented below, which all include proper names, I employ the following figures, arrived at after some trial and error: W = 5,000, reduction percentage 90%; W = 10,000, 85%; W = 20,000, 80%; W = 30,000, 75%; W = 50,000, 70%; W = 100,000 or more, 65%.
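These figures amount to a step function from sample size to reduction percentage. A sketch (how intermediate sizes are bracketed is a guess, though the Eldridge count in the next note, 44,000 words with 6,002 word forms equated to 4,200 lexical units, fits a 70% factor):

```python
# Word-form to lexical-unit reduction percentages quoted in the note,
# as a step function of sample size W.
def reduction_percentage(w: int) -> int:
    steps = [(5_000, 90), (10_000, 85), (20_000, 80),
             (30_000, 75), (50_000, 70)]
    for limit, pct in steps:
        if w <= limit:
            return pct
    return 65   # W = 100,000 or more

# 6,002 word forms * 0.70 = 4,201, matching the 4,200 lexical units
# quoted in the next note.
print(reduction_percentage(44_000))   # 70
```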
  • The following data are from G. K. Zipf, Human Behavior and the Principle of Least Effort (Cambridge, Mass., 1949), pp. 25, 123, 291. One set of data concerns a count, carried out by Eldridge, of 44,000 words of American newspaper material. It yielded 6,002 word-form units, which I equate with 4,200 lexical units. Reducing the W value by 3% for proper names, we arrive at the position indicated by the dot at “E” in Diagram 1.
  • The ringed dots in Diagram 1 represent the following texts: Paston Letters (W = 20,000, V = 1,600 lexical units); Hamlet (W = 30,000, V = 3,750); Wycherley (W = 10,000, V = 1,020); Congreve (W = 10,000, V = 1,280); Johnson, Rasselas (W = 10,000, V = 1,700); T. S. Eliot (W = 5,000, V = 1,440).
  • The crosses in Diagram 1 represent six different samples of letters written by a schizophrenic patient—I: W = 50,000, V = 1,640; II: W = 30,000, V = 1,270; III: W = 20,000, V = 1,000; IV: W = 10,000, V = 650; V: W = 5,000, V = 405; VI: W = 2,000, V = 210.
  • Zipf, op. cit., p. 546, held that the “Ulysses data provided the most clear-cut example of a harmonic distribution in speech of which I know,” and regarded them as confirmation that his “law,” namely, rank × frequency = constant (the constant being 1/10), was valid for large samples as well as for smaller ones. As a matter of fact, Ulysses is too exceptional a text to serve as a standard of reference. The other text data given above show that Zipf's law holds only for a limited range of text sizes. Expressed in our terms, Zipf's law states that Vp = 0.1. This implies that the values for individual texts should all fall on a straight line with slope 45°, starting from V = 1, p = 10⁻¹. That is approximately true of samples between W = 5,000 and W = 50,000 (or rather more if Zipf's word definition is used). But above that region the value of Vp decreases more and more: for W = 10⁶ it is 0.022, with our word definition. For a discussion of Zipf's law, see also Guiraud, Caractères Statistiques, pp. 12, 24.
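Restated computationally: under Vp = 0.1, the number of words with relative frequency at least p is simply 0.1/p, which gives the 45° line on doubly logarithmic axes. A minimal sketch:

```python
# Zipf's law in the note's terms: V * p = 0.1, so the number of words
# with relative frequency >= p is V = 0.1 / p (a straight line of unit
# slope on log-log axes).
def zipf_vocabulary(p: float) -> float:
    return 0.1 / p

print(zipf_vocabulary(1e-1))   # 1: the single most frequent word
print(zipf_vocabulary(1e-5))   # 10,000 words at or above ten per million

# The note's point: observed samples obey this only between roughly
# W = 5,000 and W = 50,000; at W = 10^6 the product Vp has fallen to 0.022.
```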
  • Another way of estimating a writer's vocabulary has been suggested by Guiraud. Guiraud makes a distinction between the vocabulary, V, the words appearing in the actual text, and the lexicon, L, the hypothetical potential vocabulary on which the writer draws when composing his text—the “words at risk,” as Yule said. Guiraud estimates L from the ratio of once-occurrences to the total number of words in the vocabulary. Further, he estimates the concentration, C, of the writer's vocabulary by dividing the number of occurrences contributed by the 50 most common words by the total number of words in the sample. Finally, he measures the dispersion or richness, R, of the vocabulary in terms of the formula V/√N. Both R and C are in their turn related to standard values and to L.
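The two directly observable measures, C and R, are straightforward to compute from a token list; a minimal sketch (tokenisation and lemmatisation are left aside, and the function name is illustrative):

```python
import math
from collections import Counter

# Guiraud's concentration C (share of the 50 most common words) and
# richness R = V / sqrt(N), computed from a list of tokens.
def guiraud_measures(tokens: list[str]) -> tuple[float, float]:
    n = len(tokens)                       # N: running words
    counts = Counter(tokens)
    v = len(counts)                       # V: distinct words
    top50 = sum(c for _, c in counts.most_common(50))
    return top50 / n, v / math.sqrt(n)    # (C, R)

# Usage: c, r = guiraud_measures(text.lower().split())
```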
  • Guiraud establishes beyond doubt that L, R, and C, and their derivatives, are (approximately) constants for individual authors, at least for text sizes around 5,000–100,000 words. My own material confirms this in the case of R: the Chaucer value is around 14 within that range. I am somewhat more doubtful about the interpretation of those of Guiraud's constants which involve L. L itself is calculated on the assumption that the writer's lexicon, whether great or small, has the same proportion of words within each frequency range as the “normal” lexicon of 24,000 words. That seems hardly realistic; it appears more likely that writers with a large vocabulary differ from those with a smaller one mainly in the number of words belonging to the infrequent ranges. However, as Guiraud himself stresses, whatever the interpretation of L, there is no doubt that it is a characteristic constant of a writer's language, and any discussion of these matters must take account of Guiraud's important and stimulating contribution.
  • Polysemy had to be left out of account. It should be noted in this connection that it is a problem that concerns above all the most frequent words in the language. The less frequent a word is, the less likely it is to have several meanings (Zipf, Human Behavior, pp. 27 ff.; Guiraud, Caractères Statistiques, p. 2).
  • In this calculation we assume that one or more lexical units always yield one, and only one, semantic element. We do not list endings, prefixes, etc., as semantic elements. This is necessary because the frequencies are expressed in terms of the number of occurrences of word units. To convert this into number of occurrences of semantic elements proper appeared too baffling a task. The procedure adopted also implies that compounds can be referred to one of the components only in the count of semantic elements.
  • F. W. Kaeding, Häufigkeitswörterbuch der deutschen Sprache, Steglitz bei Berlin, 1898.
  • Kaeding's list included personal names, but not place names. I have drawn the curve only to 1/p = 2.6 × 10⁶, i.e. to include the words used four times in the material of 10.9 × 10⁶ words.
  • Kaeding's “Stammsilben” were not etymological stems, but, by and large, the stressed syllable of the word. Thus hauf (as in Haufen) was distinguished from häuf (as in häufig), and gnos was put down as the stem syllable of Diagnose.
  • Zipf, Human Behavior, p. 89.
  • Hanley's Ulysses list provides a further check. My sample count of that list indicated that 38.9% (standard error ±2%) of the items could be reckoned as semantic units in our sense, yielding 11,600 units for the whole text of 246,000 words excluding names, as compared with 18,400 lexical units: a percentage of 63%. For a text of this size, Diagram 1 gives V = 14,000, and Diagram 2 gives V = 9,000, a percentage of 64%.
  • For a discussion of work on this subject, see Dorothea McCarthy, “Language Development in Children,” in Manual of Child Psychology, ed. Leonard Carmichael, New York, 1954, pp. 492–630.
  • The easiest procedure would be to use the frequency ranges described above, note 8. The average frequency in the range is obtained by dividing the aggregate proportion of occurrences (Σp) in the range by the number of words in the range. For the range 10⁻⁵ to 10⁻⁶ this yields 0.02530/7,400 = 3.42 × 10⁻⁶. Hence the test words for the range should be taken from those listed by Thorndike as occurring 3 or 4 times per million.
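A one-line check of the division, using only the figures quoted in the note:

```python
# Average frequency for the 10^-5 .. 10^-6 range: aggregate proportion of
# occurrences divided by the number of words in the range.
sigma_p = 0.02530   # aggregate proportion, from Table 1
n_words = 7_400     # words in the range
print(sigma_p / n_words)   # ~3.42e-06, i.e. three to four per million
```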
  • In actual fact, the test words were selected before the theory set out above was worked out, and the limits of the frequency ranges had to be chosen so as to fit the already selected averages, which explains the somewhat unsystematic appearance of the range divisions. The following are the actual words used in one of the tests: 1.22 × 10⁻⁴ (120–130 per million): action, cause, contain, desire, finally, important, light, member, often, probably. 5.6 × 10⁻⁵ (55–65 per million): accident, affect, castle, choice, coast, defeat, introduce, meal, nail, occur. 2.2 × 10⁻⁵ (20–30 per million): witch, wreath, amendment, array, bewilder, bough, broad, canvas, challenge, chop. 5.5 × 10⁻⁶ (5 per million): arbitration, bail, cadence, deluge, ember, flop, gaudy, hag, lair, maim. 1.3 × 10⁻⁶ (1 per million): arable, banister, cad, dally, easel, feint, gamut, halibut, laity, manse. 4.2 × 10⁻⁷ (less than 1 per million): accretion, blarney, captious, daft, eavesdrop, fairway, flotsam, garble, halcyon, indict.
  • In strictness, of course, the W values in the last two rows should be those corresponding to the total vocabularies, not, as in the table, those corresponding to the vocabularies as curtailed above the frequency p = 10⁻⁸. But the correction is of little practical importance.
