Research Article

From tweets to trends: analyzing sociolinguistic variation and change using the Twitter Corpus of English in Hong Kong (TCOEHK)

Received 09 Sep 2022, Accepted 21 Aug 2023, Published online: 18 Oct 2023
 

ABSTRACT

This article presents the Twitter Corpus of English in Hong Kong (TCOEHK): a 123-million-word corpus derived from sampling tweets across the 18 districts and three geographical (macro-)regions of Hong Kong from 2010 to 2022. It introduces the corpus and demonstrates its utility by examining four linguistic variables found in English in Hong Kong (EngHK) and the dominant variety Hong Kong English (HKE): tense marking, ‘-ize/-ise’ suffix use, adverb syntactic position, and copula (non-)use. It explores their relationship with intralinguistic, stylistic (e.g. formality), and extralinguistic factors (e.g. region, year, affect). The findings show that the distribution of variants in all four variables (e.g. rates of -ize use) is similar to the patterns identified in prior HKE work. In addition to confirming previous research, the results also reveal how intralinguistic, stylistic, and extralinguistic factors can each influence the distribution of variants differently depending on the variable studied, highlighting the complex and ever-changing nature of EngHK. The availability of social metadata and the large size of the TCOEHK make it viable for examining (socio)linguistic variation and changes in contemporary (Twitter-style) EngHK, as well as potential regional and social sub-varieties/styles within EngHK. It promises to advance research on variation and change in EngHK.

Acknowledgements

This article has benefitted from the support of The Chinese University of Hong Kong Faculty of Arts Direct Grant (Exploring Variation and Change in Chinese-related Multilingual Practices in East Asia, Project # 4051228).

Disclosure statement

No potential conflict of interest was reported by the author.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/13488678.2023.2251771

Notes

1. Since early 2023, Twitter has altered its API access, making it impossible for the public to scrape geo-location data without paying a large sum of money. This has made the TCOEHK additionally valuable.

3. Both style predictors were estimated using Grafmiller, Szmrecsanyi, and Hinrichs' (Citation2018) method.

4. I included ‘North vs. non-North’ as a variable as the North district is close to the Mainland border (see ), and speakers of English in this area may have patterns that differ from speakers living away from the border. This is plausible given the research on sociopolitical borders (including the Hong Kong–Shenzhen China border), which has uncovered the existence of complex and dynamic identities and the central role linguistic behavior plays in constructing and negotiating between identities (Danielewicz-Betz & Graddol, Citation2014; Holguín Mendoza, Citation2018; Watt, Llamas, Docherty, Hall, & Nycz, Citation2022).

5. Sentiment was extracted using an R package that estimates the polarity of a string. A positive value indicates that the utterance is positive while a negative value indicates that the utterance is negative (Rinker, Citation2022).

6. I also hope to demonstrate the application of the TCOEHK for conducting comprehensive and intricate sociolinguistic analyses. To achieve this, I will employ Wang et al.'s (Citation2019) M3 demographic inference tool, utilizing the final variable as a specific case study. This program takes a Twitter identification number as input, which is available in the TCOEHK, and examines various aspects such as the profile image, username, screen name, and biography of the Twitter user. It then generates probabilities that indicate the age, sex, and entity type (i.e. organization or non-organization) of the user, with relatively high precision and recall (Macro-F1: gender = 0.918, age = 0.522, entity type = 0.898) (Wang et al., Citation2019). These probabilities, which pertain to imputed or stylistic age and sex, can subsequently be utilized to explore the relationship between the copula variable and two influential predictors of sociolinguistic variation: age and sex. This exploration can be conducted even in the absence of actual age and sex metadata within the corpus, providing valuable insights into the connection between the copula variable and these demographic factors.

7. The RegEx terms used were ‘did_AUX\s(?:n’t_PART\s|not_PART\s)?(?:say|make|know|think|see|want|take|go|need|find|come|use|give|help|look|like|work|keep|feel|become|believe|tell|go|try|love|base|understand|seem|start|provide|live)_VERB\s’ and ‘did_AUX\s(?:n’t_PART\s|not_PART\s)?(?:said|made|knew|thought|saw|wanted|took|went|needed|found|came|used|gave|helped|looked|liked|worked|kept|felt|became|believed|told|went|tried|loved|based|understood|seemed|started|provided|lived)_VERB\s’.

8. The 21 'eyes words' were selected based on their frequency in the GloWbE corpus. They are presented in note 9.

9. The RegEx terms used in my analyses are “recognize|realize|organize|utilize|minimize|maximize|apologize|emphasize|criticize|optimize|customize|specialize|summarize|visualize|capitalize|mobilize|prioritize|stabilize|characterize|authorize|memorize” and “recognise|realise|organise|utilise|minimise|maximise|apologise|emphasise|criticise|optimise|customise|specialise|summarise|visualise|capitalise|mobilise|prioritise|stabilise|characterise|authorise|memorise”. My TCOEHK analyses randomly sampled 50% of all tokens that met the RegEx criteria.

10. The RegEx used is “(?:also|already|only)_ADV\s[\w]+_VERB” and “[\w]+_VERB\s(?:[\w]+_(?:PROPN\s|NOUN\s|PRON\s))?(?:also|already|only)_ADV(?:$|\s[.!?,]_PUNCT)”.

11. The RegEx formula used is ‘(?:You|They|We|He|She|It)_PRON\s(?:is|are)_AUX\s(?:[\w]+_ADV\s)?(?:[\w]+_ADJ\s)’ and ‘(?:You|They|We|He|She|It)_PRON\s(?:[\w]+_ADV\s)?(?:[\w]+_ADJ\s)’. The sampling rate is 50%.
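
The RegEx-based extraction described in the notes above can be illustrated with a minimal Python sketch, using the two copula patterns from note 11 and the 50% sampling step. It is not the author's pipeline: the ‘word_TAG’ token format separated by single spaces, the helper names `classify_copula` and `sample_half`, the random seed, and the example tweets are all assumptions made for illustration.

```python
import random
import re

# Patterns copied from note 11. The storage format ("word_TAG" tokens
# separated by single spaces) is an assumption, not stated in the article.
COPULA = re.compile(
    r"(?:You|They|We|He|She|It)_PRON\s(?:is|are)_AUX\s"
    r"(?:[\w]+_ADV\s)?(?:[\w]+_ADJ\s)"
)
ZERO_COPULA = re.compile(
    r"(?:You|They|We|He|She|It)_PRON\s(?:[\w]+_ADV\s)?(?:[\w]+_ADJ\s)"
)


def classify_copula(tagged):
    """Return 'copula', 'zero', or None for one POS-tagged tweet.

    Simplification: only the first matching pattern is reported, so a
    tweet containing both variants is counted once, as 'copula'.
    """
    if COPULA.search(tagged):
        return "copula"
    if ZERO_COPULA.search(tagged):
        return "zero"
    return None


def sample_half(tokens, seed=0):
    """Randomly keep 50% of the matched tokens (the seed is hypothetical)."""
    rng = random.Random(seed)
    return rng.sample(tokens, len(tokens) // 2)


# Invented example tweets, not drawn from the TCOEHK.
tweets = [
    "She_PRON is_AUX very_ADV happy_ADJ ._PUNCT",
    "They_PRON happy_ADJ today_NOUN",
    "We_PRON are_AUX ready_ADJ !_PUNCT",
    "He_PRON really_ADV tired_ADJ ._PUNCT",
]

labels = [classify_copula(t) for t in tweets]
tokens = [label for label in labels if label is not None]

# Share of zero-copula tokens among all tokens matching either pattern.
zero_rate = tokens.count("zero") / len(tokens)

# 50% random sample of the matched tokens, as in notes 9 and 11.
kept = sample_half(tokens)
```

Both patterns require whitespace after the adjective, so a tweet ending directly in an adjective is not matched. The same approach should carry over to the patterns in notes 7, 9, and 10 by swapping in the corresponding expressions.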

Additional information

Funding

This work was supported by The Chinese University of Hong Kong Faculty of Arts [4051228].
