Regular articles

SUBTLEX-UK: A new and improved word frequency database for British English

Pages 1176-1190 | Received 21 Jun 2013, Accepted 16 Sep 2013, Published online: 13 Jan 2014
 

Abstract

We present word frequencies based on subtitles of British television programmes. We show that the SUBTLEX-UK word frequencies explain more of the variance in the lexical decision times of the British Lexicon Project than the word frequencies based on the British National Corpus and the SUBTLEX-US frequencies. In addition to the word form frequencies, we also present measures of contextual diversity, part-of-speech specific word frequencies, word frequencies in children's programmes, and word bigram frequencies, giving researchers of British English access to the full range of norms recently made available for other languages. Finally, we introduce a new measure of word frequency, the Zipf scale, which we hope will stop the current misunderstandings of the word frequency effect.
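As an illustration of the Zipf scale mentioned above, the following minimal sketch assumes the commonly cited definition, in which a Zipf value is the base-10 logarithm of the frequency per billion words (equivalently, log10 of the frequency per million words plus 3). The function name and the example counts are hypothetical, and the published norms may differ in details such as how words with zero counts are handled.

```python
import math


def zipf_value(count: int, corpus_tokens: int) -> float:
    """Return an unsmoothed Zipf value for a word.

    Assumed definition: log10(frequency per million words) + 3,
    which equals log10(frequency per billion words).
    `count` is the raw number of occurrences of the word and
    `corpus_tokens` is the total number of word tokens in the corpus.
    """
    if count <= 0 or corpus_tokens <= 0:
        raise ValueError("count and corpus_tokens must be positive")
    per_million = count / (corpus_tokens / 1_000_000)
    return math.log10(per_million) + 3


# Hypothetical example: a word seen once per million tokens.
print(zipf_value(1, 1_000_000))
```

Under this definition, each step of one unit on the scale corresponds to a tenfold change in frequency.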

Notes

1 On the basis of anecdotal evidence we can add that these subtitles are also appreciated by viewers with English as a second language.

2 Brysbaert and New (2009) reported that the word type frequencies themselves show little difference once the corpus contains 30 million words, a finding that was replicated in the present analyses.

3 Dehyphenation also occurs in automatic text parsers, such as CLAWS and the Stanford parser (to be described later). Because the Stanford parser dehyphenates more words than CLAWS, the outcome of this parser outperformed that of CLAWS on the raw corpus, but no longer on the dehyphenated corpus.

4 The reason why these numbers are very similar is that both channels have a similar rotation of programmes with repeats after a rather short period of time.

5 A disadvantage of the Stanford tagger is that in its default mode it Americanizes the spellings of the words. So, one must be careful to change this when one is working with British spellings.

6 A notorious example is “horsefly”, which both CLAWS and Stanford parse as an adverb (arguably because the word is not in the programme's lexicon, so that too much reliance is put on the end letters –ly). Ironically, Stanford does correctly classify “horseflies” as a noun associated with the lemma “horsefly” (presumably because the end letters, –lies, are more likely to be associated with plural nouns than with other parts of speech).

7 These frequencies were not subtracted from the frequencies of the individual words, under the assumption that the component words of a hyphenated word get coactivated upon seeing the hyphenated word.

8 An easy introduction to the test and an Excel file to calculate the exact values are available on the website (http://crr.ugent.be/archives/546).

9 SUBTLEX-UK frequencies not including childhood frequencies can easily be obtained by subtracting the CBeebies and CBBC frequency counts from the total frequency counts.

10 The speller was the MS Office 2007 spellchecker, augmented with a list of lemmas that one of the authors (M.B.) is compiling.
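The subtraction described in note 9 can be sketched as follows. This is a minimal illustration only: the function and argument names are hypothetical and do not correspond to actual column names in the SUBTLEX-UK files, and the counts in the example are invented.

```python
def non_childhood_count(total: int, cbeebies: int, cbbc: int) -> int:
    """Frequency count excluding the two children's channels.

    Follows note 9: subtract the CBeebies and CBBC counts from the
    total count. All three arguments are raw occurrence counts for
    the same word.
    """
    remainder = total - cbeebies - cbbc
    if remainder < 0:
        # The channel counts are assumed to be included in the total.
        raise ValueError("channel counts exceed the total count")
    return remainder


# Invented counts for a single word, for illustration only.
print(non_childhood_count(total=1000, cbeebies=120, cbbc=80))
```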