Abstract
The present study explores linguistic predictors and behavioural implications of the orthographic alternation between a spaced (bell tower), hyphenated (bell-tower), and concatenated (belltower) format observed in English compound words. On the basis of two English corpora, we model the evolution of spelling for compounds undergoing lexicalisation, and we define the set of orthographic, distributional, and semantic properties of the compound's constituents that co-determine the preference for one of the available realisations. We explore iconicity and economy as competing motivations for both the diachronic change and the synchronic preferences in spelling. Observed patterns of written production closely mirror the demands and strategies of recognition of compound words in reading. Orthographic choices that go against the reader's economy of effort come with a high recognition cost, as evidenced by inflated lexical decision and naming latencies to concatenated compounds that occur in other spelling formats.
Acknowledgements
The authors would like to thank Valentin Spitkovsky for extensive and insightful discussions of computational aspects of this work, and Joan Bresnan, Barbara Juhasz, and Emmanuel Keuleers for their comments on an earlier draft.
Notes
1. Throughout this article, we will use the plus sign in the compound spelling (e.g., girl+friend) to refer to the compound regardless of how it is spelled.
2. While this procedure meets our goal of describing orthographic alternation, it cannot be used to identify nonalternating concatenated compounds. Unlike spaced and hyphenated compounds, concatenated compounds cannot be told apart from morphologically simple singular common nouns on the basis of part-of-speech tags or orthography in parsed Wikipedia, and the size of the corpus prohibits manual identification of concatenated compounds. To roughly estimate the number of nonalternating concatenated compounds, we extracted all concatenated compounds identified in the morphological coding of the lexical database CELEX (Baayen, Piepenbrock, & Gulikers, 1995). We found 500 compounds that were not part of our alternating set in Wikipedia. The sum of the alternating concatenated compounds and the nonalternating ones is 2,100 (= 1,600 + 500); it is a more precise, though potentially still too low, estimate of the total type count of (alternating or nonalternating) concatenated compounds in the Wikipedia corpus.
3. The list of alternating compounds comprises: audio+book, call+center, care+giver, chick+pea, coffee+house, copy+cat, cyber+cafe, data+base, die+cast, field+house, fund+raiser, help+line, house+cat, jump+start, race+day, road+show, salary+cap, shoe+box, show+biz, slide+show, soap+box, sound+board, steak+house, paint+ball.
4. We are indebted to Emmanuel Keuleers for raising this possibility.
5. We also considered entropy as the measure of uncertainty in the choice of one of several alternatives. Entropy is minimal (zero) when a compound occurs in only one of the available formats, and it is maximal when all alternatives are equiprobable (see Milin et al., 2009b). Entropy of orthographic choice was not a significant predictor (at the 0.05 level) of either lexical decision or word naming response times.
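The entropy measure described in the last note can be illustrated with a minimal Python sketch. It assumes Shannon entropy in bits over the relative frequencies of the three spelling formats. The function name `format_entropy` and the token counts are hypothetical and are not taken from the study's corpora, and the logarithm base is likewise an assumption, since the note does not state one.

```python
import math

def format_entropy(counts):
    """Shannon entropy (in bits) of a compound's spelling-format distribution.

    `counts` maps a format label to a token count. Labels and counts
    here are illustrative assumptions, not data from the study.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("at least one occurrence is required")
    # Formats that never occur contribute nothing to the entropy.
    probs = [c / total for c in counts.values() if c > 0]
    return sum(p * math.log2(1 / p) for p in probs)

# Hypothetical compound attested in a single format: entropy is zero.
single = format_entropy({"spaced": 0, "hyphenated": 0, "concatenated": 120})

# Hypothetical compound with three equiprobable formats: entropy is
# maximal for three alternatives, i.e. log2(3) bits.
uniform = format_entropy({"spaced": 40, "hyphenated": 40, "concatenated": 40})

# A skewed (hypothetical) distribution falls between the two extremes.
skewed = format_entropy({"spaced": 100, "hyphenated": 15, "concatenated": 5})
```

Under these assumptions, `single` is 0, `uniform` equals log2(3), and `skewed` lies strictly between the two, matching the minimum and maximum described in the note.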