Research Article

The Dimensionality of Lexical Features in General, Academic, and Disciplinary Vocabulary


ABSTRACT

Purpose

There are many aspects of words that can influence our lexical processing, and the words we are exposed to influence our opportunities for language and reading development. The purpose of this study is to establish a more comprehensive understanding of the lexical challenges and opportunities students face.

Method

We explore the latent relationships of word features across three established word lists: the General Service List, Academic Word List, and discipline-specific word lists from the Academic Vocabulary List. We fit exploratory factor models using 22 non-behavioral, empirical measures to three sets of vocabulary words: 2,060 high-frequency words, 1,051 general academic words, and 3,413 domain-specific words.

Results

We found Frequency, Complexity, Proximity, Polysemy, and Diversity were largely stable factors across the sets of high-frequency and general academic words, but that the challenge facing learners is structurally different for domain-specific words.

Conclusion

Despite substantial stability, there are important differences in the latent lexical features that learners encounter. We discuss these results and provide our latent factor estimates for words in our sample.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1. Using the raw frequency can be problematic in model estimation because of Zipf’s law: the frequency of a word is inversely proportional to its ranking. A few high-ranking words take up a significant portion of corpora (e.g. “the,” “and,” “a”), many low-ranking words take up a small portion of corpora (e.g. “projectile,” “calendar”), and frequency and rankings are not linearly related. For this reason, linear models tend to instead be based on some transformation of the raw frequency, either a log transformation or a Zipfian transformation. The Zipfian transformation accounts for the word frequency effect based on Zipf’s law (Van Heuven et al., 2014) and is calculated as: $\log_{10}\left(\frac{\text{raw frequency} + 1}{\text{corpus size in millions} + \text{word types in millions}}\right) + 3$

2. Because the COCA splits by part of speech, we totaled word frequency for all parts of speech before taking the Zipfian transformation.

3. Because the OED is split by part of speech, we used the highest occurring frequency band for each word.

4. Neighborhood sizes were collected from the English Lexicon Project (Yap et al., 2012); however, no specific corpus is disclosed.

5. Because Ngram data splits by part of speech, we used the oldest occurrence for word age.

6. Because WordNet splits by part of speech, we took the average score for each word.

7. Other measures were considered for the factor analysis, but were too highly correlated with other measures (r > .98; Standardized Frequency Index (SFI) from Subtlex with zenozipf, and Contextual Diversity and Word Frequency from Subtlex with subzipf) or did not have enough variability to warrant inclusion for any word set (MSA < .60; mean bigram from English Lexicon Project and word age from Oxford English Dictionary).

8. This paper uses the general academic word (AWL-reference) model to estimate factor scores on a specific set of vocabulary from the Word Generation trials. Factor score estimates for these words differ in the current paper when the words are scaled based on the GSL-reference, AWL-reference, and AVL-DS-reference as opposed to scaled amongst themselves.
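
The preprocessing described in Notes 1, 2, 3, 5, and 6 can be illustrated with a minimal Python sketch. It is not the authors' code. The field names (`coca_freq`, `oed_band`, `first_year`, `wordnet_score`), the function names, and every number in the sample records are hypothetical and chosen only for illustration. The sketch also assumes that "highest occurring frequency band" in Note 3 means the numerically largest band and that "oldest occurrence" in Note 5 means the earliest year.

```python
import math
from statistics import mean


def zipf_value(raw_frequency, corpus_size_millions, word_types_millions):
    """Zipf-scale value following the formula in Note 1."""
    return math.log10(
        (raw_frequency + 1) / (corpus_size_millions + word_types_millions)
    ) + 3


def collapse_by_word(rows):
    """Collapse per-part-of-speech rows into one row per word.

    Aggregation rules mirror Notes 2, 3, 5, and 6 (as interpreted above):
    sum the frequencies, keep the largest frequency band, keep the earliest
    year, and average the WordNet-based score.
    """
    grouped = {}
    for row in rows:
        grouped.setdefault(row["word"], []).append(row)

    collapsed = {}
    for word, group in grouped.items():
        collapsed[word] = {
            "coca_freq": sum(r["coca_freq"] for r in group),
            "oed_band": max(r["oed_band"] for r in group),
            "first_year": min(r["first_year"] for r in group),
            "wordnet_score": mean(r["wordnet_score"] for r in group),
        }
    return collapsed


# Hypothetical records for one word that appears as a noun and as a verb.
records = [
    {"word": "record", "pos": "noun", "coca_freq": 120,
     "oed_band": 6, "first_year": 1300, "wordnet_score": 4.0},
    {"word": "record", "pos": "verb", "coca_freq": 80,
     "oed_band": 5, "first_year": 1400, "wordnet_score": 2.0},
]

collapsed = collapse_by_word(records)

# Hypothetical corpus dimensions, in millions of tokens and of word types.
CORPUS_SIZE_MILLIONS = 50.0
WORD_TYPES_MILLIONS = 0.2

# Frequencies are totaled across parts of speech first, then transformed.
zipf_record = zipf_value(
    collapsed["record"]["coca_freq"], CORPUS_SIZE_MILLIONS, WORD_TYPES_MILLIONS
)
```

Totaling before transforming matters because the Zipfian transformation is logarithmic: the sum of two per-part-of-speech Zipf values is not the Zipf value of the combined frequency.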

Additional information

Funding

This research was supported by grant No. R305A170151 (“Improving the Accuracy of Academic Vocabulary Assessment for English Language Learners”), grant No. R305A090555 (“Word Generation an Efficacy Trial”), and grant No. R305A080647 (“Measuring the Development of Vocabulary and Word Learning to Support Content Area Reading and Learning”) from the Institute of Education Sciences.