Regular articles

Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment

Pages 1665-1692 | Received 04 Jun 2014, Accepted 18 Feb 2015, Published online: 08 Apr 2015
 

Abstract

We use the results of a large online experiment on word knowledge in Dutch to investigate variables influencing vocabulary size in a large population and to examine the effect of word prevalence—the percentage of a population knowing a word—as a measure of word occurrence. Nearly 300,000 participants were presented with about 70 word stimuli (selected from a list of 53,000 words) in an adapted lexical decision task. We identify age, education, and multilingualism as the most important factors influencing vocabulary size. The results suggest that the accumulation of vocabulary throughout life and in multiple languages mirrors the logarithmic growth of number of types with number of tokens observed in text corpora (Herdan's law). Moreover, the vocabulary that multilinguals acquire in related languages seems to increase their first language (L1) vocabulary size and outweighs the loss caused by decreased exposure to L1. In addition, we show that corpus word frequency and prevalence are complementary measures of word occurrence covering a broad range of language experiences. Prevalence is shown to be the strongest independent predictor of word processing times in the Dutch Lexicon Project, making it an important variable for psycholinguistic research.

This work was supported by an Odysseus grant awarded by the Government of Flanders to Marc Brysbaert.

Supplemental material

The prevalence values for Dutch words used in this article can be downloaded from http://crr.ugent.be/prevalence/.

Notes

1. Whenever profile data were saved or changed on a device, we saved a new profile identifier to that device. The number of unique profile identifiers associated with one or more finished sessions gives us a rough estimate of the number of participants, keeping in mind that multiple participants could have used the same identifier and that the same individual can have different identifiers when using multiple devices.

2. Although the current paper focuses on the accuracy measures obtained from our study, reaction times were also collected for each trial. For reference, we report the intraclass correlation coefficients (ICCs) as a measure of the reliability of the obtained reaction times. ICC(C, k), or the expected correlation for a repeat study on the average reaction times for individual words, was .999. ICC(C, 1), or the expected correlation of the average reaction times per word with those of a single new participant, was .168.
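
As an illustration of the two coefficients in Note 2, the following is a minimal sketch of the standard two-way consistency ICC formulas computed from ANOVA mean squares. It assumes a complete words × participants matrix of reaction times, which is a simplification: in the study each participant saw only a subset of the words, and the article does not state how the ICCs were estimated. The function name `icc_consistency` and the toy data are hypothetical.

```python
def icc_consistency(matrix):
    """Return (ICC(C,1), ICC(C,k)) for a complete n-by-k matrix.

    Rows are targets (here: words), columns are raters (here: participants).
    Assumption: no missing cells, unlike the sparse design of the study.
    """
    n = len(matrix)          # number of words
    k = len(matrix[0])       # number of participants
    grand = sum(sum(row) for row in matrix) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Consistency definitions: column (participant) offsets are ignored.
    icc_single = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    icc_average = (ms_rows - ms_error) / ms_rows
    return icc_single, icc_average


# Toy reaction times (ms) for 4 words and 3 participants; invented for illustration.
toy = [
    [510.0, 540.0, 600.0],
    [620.0, 610.0, 650.0],
    [700.0, 760.0, 730.0],
    [560.0, 590.0, 640.0],
]
icc_1, icc_k = icc_consistency(toy)
print(icc_1, icc_k)
```

Under these formulas the two coefficients are linked by the Spearman-Brown relation, ICC(C, k) = k · ICC(C, 1) / (1 + (k − 1) · ICC(C, 1)), which is why a modest single-participant reliability can coexist with a very high reliability for per-word averages over many participants.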

3. The plotted values are conditioned on the weighted mean for the discrete variables and on the true mean for the continuous variables. As such, the effect plots can be interpreted as if they resulted from a completely balanced design for all discrete variables (i.e., the same number of Dutch and Belgian participants, the same number of female and male participants, etc.).

4. For reference, the correlation between log frequency and its quadratic trend was .96. Adding the quadratic trend for log frequency to the model led to an increase in R² from .5235 to .5457. Eta squared for the quadratic trend was .0181. Eta squared for log frequency decreased from .1573 to .0652, and eta squared for prevalence increased from .2107 to .2194.

5. We thank Michael Ramscar for suggesting this interpretation.
