Research Article

The role of hyponymy and context concreteness in compound word processing

Received 08 May 2023, Accepted 27 Mar 2024, Published online: 08 Jun 2024
 

ABSTRACT

This paper describes the effects of hyponymy and emotion on the comprehension and production of compound words. The materials are more than 2000 concatenated English compounds taken from the LADEC database (Gagné et al. 2019). The study builds on Charitonidis (2022), in which context concreteness for the second constituent was a significant positive predictor of lexical decision and naming times from the English Lexicon Project (ELP) and the British Lexicon Project (BLP). In the present paper, the hyponymy norms from Gagné et al. (2020) were added to the analysis. The results show that both hyponymy and context concreteness for the second constituent are relevant. In addition, all models including both variables have a better fit than nested models omitting one of these variables. There is thus strong evidence that both hyponymy and context concreteness for the second constituent are obligatory parameters in compound word processing.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Large language models

GPT-3.5 (https://chat.openai.com) and Google Gemini (https://gemini.google.com) were selectively utilized as auxiliary proofreading and editing aids (accessed 14 March 2024). Subsequently, the responses generated by both AI tools underwent additional proofreading and editing by the author.

Notes

1 The descriptions in this section were taken from Charitonidis (2022: section 1). Minor changes were made.

2 As Jackendoff (2010: 416) reports, ‘there are some families of left-headed compounds in English, such as attorney general, mother-in-law, blowup, and pickpocket’.

3 Overviews of word recognition models can be found in Schreuder & Baayen (1995), Kuperman (2013), Norris (2013), and Snefjella & Kuperman (2016).

4 In the literature and in the present paper, the terms ‘norms’ and ‘ratings’ are often used interchangeably to refer to the same values. Actually, the output datasets contain norms obtained by averaging native speakers’ ratings.

5 The descriptions regarding semantic transparency were taken from Charitonidis (2022: section 1.2). Minor changes were made.

6 This statement indicates that the histogram depicting the data for the second constituent exhibits a taller and narrower peak compared to the histogram representing the data for the first constituent. Furthermore, the distribution for the second constituent contains a higher overall frequency or number of data points than the distribution for the first constituent.

7 In the parentheses, means and standard deviations are given in percentages. In the present study, the three morphological levels ‘compound’, ‘first constituent’, and ‘second constituent’ are abbreviated as ‘cmp’, ‘c1’, and ‘c2’, respectively.

8 Steiger’s (1980) z test showed that this difference was significant, z = 27.71, p < .0001 (Gagné et al. 2019).

9 By referring to previous research, Gagné et al. (2019: 10) report that ‘the modifier (the first constituent in English) tends to play a larger role in the ease-of-relation selection during the processing of compounds and noun phrases.’

10 The descriptions in this section were taken from Charitonidis (2022: Introduction). Minor changes were made.

11 Warriner et al.’s (2013) dataset also included ‘dominance’ norms. Dominance refers to the ‘degree of control’ exerted by the stimulus word.

12 In the literature and in the present paper, the terms ‘norms’ and ‘ratings’ are used interchangeably to refer to the same values. In fact, the datasets referred to above contain norms obtained through averaging of native speakers’ ratings.

13 In Snefjella & Kuperman (2016), the term ‘content words’ is equivalent to the term ‘non-stopwords’. Stopwords correspond to the default English stopword list of the R tm-package (personal communication).

14 Also excluded were 493 words whose overall context values ‘were more than three standard deviations above or below the mean of the respective variable’ (Snefjella & Kuperman 2016: 136).

15 The full list of norms can be found in the supplementary dataset of Snefjella & Kuperman (2016).

16 This section draws on various sections in Charitonidis (2022).

17 This version was also used in Snefjella & Kuperman (2016) and Gagné et al. (2019).

18 The SUBTLEX-US corpus was based on subtitles from US films and television programs. As Chen et al. (2018) note, ‘a series of recent studies demonstrated that frequency norms derived from subtitles of films and TV programs tended to outperform those from printed texts in accounting for the variance of lexical processing time (and sometimes also accuracy) among native speakers of different languages’ (see Chen et al. 2018: 2 and the references therein).

19 does not include the BLP lexical decision model with the log compound frequencies from the British National Corpus (BNC) as control variable. This model was dissociated from the ELP and BLP lexical decision models with frequencies from the SUBTLEX-US corpus. It should be noted that BNC frequencies were derived from a corpus ‘with a mixture of written and spoken genres’ (Chen et al. 2018: 8).

20 The potential inhibitory function of the second compound constituent, as outlined in this paragraph, was first explored in Charitonidis (2022: sections 5.1.2 and 7).

21 The following can be found on the website Researchgate.net: (a) the full set of compounds with the corresponding input values, and (b) the descriptions and sources for all variables.

22 In other words, hyponymy and context concreteness for the second constituent were utilized with both positive and negative features (+/–).

23 The Akaike Information Criterion (AIC) encounters challenges when dealing with different sample sizes (Burnham & Anderson 2002). All information criteria rely on the likelihood function, which is influenced by sample size. As sample size increases, likelihood decreases, leading to higher (=inferior) information criterion values. To mitigate this issue, this study employs scaled AIC, dividing AIC by the sample size. While not universally embraced, this method is commonly utilized across diverse applications and aids in adjusting for discrepancies in sample sizes when evaluating models (see, for instance, Hastie et al. 2009: 230–231; for further details, see Charitonidis 2022).
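The scaling described in note 23 can be illustrated with a minimal sketch. It assumes the standard definition AIC = 2k − 2 ln L; the function names, the log-likelihoods, the parameter counts, and the sample size below are hypothetical and are not taken from the study.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Standard Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def scaled_aic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """AIC divided by the sample size, as described in note 23."""
    if n_obs <= 0:
        raise ValueError("n_obs must be positive")
    return aic(log_likelihood, n_params) / n_obs


# Hypothetical values for illustration only (not results from the paper).
n_items = 1000
full_model = scaled_aic(log_likelihood=-1200.0, n_params=5, n_obs=n_items)
nested_model = scaled_aic(log_likelihood=-1215.0, n_params=4, n_obs=n_items)

# A lower value indicates the better-fitting model.
print(f"full: {full_model:.3f}, nested: {nested_model:.3f}")
```

With these made-up inputs the full model obtains the lower scaled AIC. Dividing by the sample size does not change the ranking of models fitted to the same items; it only makes values from samples of different sizes easier to compare.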

24 In listwise deletion, only data points with complete information for all variables are used. In the present analysis, listwise deletion left a substantial number of cases (856 items out of the original sample) and is therefore a viable option. It should be noted that Gagné et al. (2019) also used listwise deletion in their correlation analysis.
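As a rough illustration of listwise deletion as defined in note 24, the sketch below keeps only those items that have a value for every variable. The variable names and the toy data are hypothetical; they do not reproduce the study's dataset.

```python
import math


def is_missing(value) -> bool:
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def listwise_delete(rows, variables):
    """Keep only rows with complete information for all listed variables."""
    return [
        row for row in rows
        if not any(is_missing(row.get(var)) for var in variables)
    ]


# Toy items with hypothetical variable names and values.
items = [
    {"compound": "snowball", "hyponymy": 0.8, "c2_context_concreteness": 0.4},
    {"compound": "hogwash", "hyponymy": None, "c2_context_concreteness": 0.1},
    {"compound": "doorbell", "hyponymy": 0.9, "c2_context_concreteness": float("nan")},
    {"compound": "teacup", "hyponymy": 0.95, "c2_context_concreteness": 0.5},
]

complete = listwise_delete(items, ["hyponymy", "c2_context_concreteness"])
print([row["compound"] for row in complete])
```

Only the two items with values on both variables remain, which is why the procedure can shrink a sample considerably when many variables are combined.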

25 Regarding the emotion variables themselves, most of the strongest correlations were between representation and context norms (see also Charitonidis 2022). In particular, all representation norms correlated moderately and positively with context norms at the same morphological level (cmp, c1, c2). The correlation between representation and context concreteness was moderate, i.e. r = .56*** (in Snefjella & Kuperman 2016, the correlation between the same variables was strong, i.e. r = .72***; see section 2.2.1). Context valence and context arousal correlated moderately and negatively at the same morphological level, whereas the correlations between representation valence and representation arousal were weak and negative, again at the same morphological level.

26 It should be noted that the coefficient for hyponymy in Model 3 did not reach statistical significance, β = −0.032, p = 0.246.

27 The coefficient for hyponymy in Model 3 reached statistical significance, β = −0.073*, p = 0.026, in contrast to the corresponding ELP lexical decision model (see section 5.2.1).

28 The coefficient of determination (R2) is an effect-size estimation referring to ‘the percentage of variance in one variable that is predicted or explained by the other’ (Ozer 1985: 307).
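For the two-variable case referred to in note 28, R2 equals the squared Pearson correlation. The following is a minimal sketch under that assumption; the function names and the data are hypothetical and serve only to show the computation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_squared(x, y):
    """Coefficient of determination for the two-variable case: r squared."""
    return pearson_r(x, y) ** 2


# Hypothetical data: y is an exact linear function of x.
predictor = [1.0, 2.0, 3.0, 4.0]
outcome = [2.0, 4.0, 6.0, 8.0]
print(r_squared(predictor, outcome))
```

In a regression with several predictors, R2 is instead the squared correlation between observed and fitted values; the sketch covers only the bivariate case.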
