Research Article

Thematic Concentration as a Discriminating Feature of Text Types


Abstract

Human readers can generally grasp the gist of a text's thematic content intuitively through comprehensive reading, and this human-like process of generalization can be given a more exact, quantitative basis. Focusing on three representative text types in Chinese and English drawn from two comparable corpora, LCMC (the Lancaster Corpus of Mandarin Chinese) and Frown (the Freiburg-Brown Corpus of American English), this study compares the thematic characteristics of these texts using PAM (Partitioning Around Medoids) and HA (Hierarchical Agglomerative) clustering on three quantitative indicators, namely TC (Thematic Concentration), STC (Secondary Thematic Concentration) and PTC (Proportional Thematic Concentration). The results show that: (1) eigenvectors standing for the thematic characteristics of the three text types can be clustered into their corresponding categories in both Chinese and English; (2) two factors contribute to the clustering results. One is that the TC, STC and PTC values of the three text types lie at different hierarchical levels; the other is that the text types differ in their percentages of ‘thematic words’, especially nouns, in the pre-h-point and pre-2h-point domains. The characterization of the three text types as thematic-intensive (Official Document), thematic-balanced (News) and thematic-dispersive (Fiction) holds cross-linguistically in both Chinese and English.

Acknowledgement

We sincerely thank Professor Vladimir Matlach who kindly and patiently provided necessary technical support for the current study. We also sincerely thank the reviewer for the insightful and helpful suggestions, which contributed greatly to the improvement of this paper.

Notes

1. The h-index has an advantage over some other indicators, such as ‘total number of papers’ (which does not account for the quality of scientific publications), ‘number of significant papers’ (for which the threshold of being ‘significant’ may be arbitrary), or ‘number of citations received by each of the most-cited papers’ (which does not consider the fact ‘that total number of citations can be disproportionately affected by participation in a single publication of major influence, for instance, methodological papers proposing successful new techniques, methods or approximations, which can generate a large number of citations’) (from Wikipedia at https://en.wikipedia.org/wiki/H-index, accessed 30 December 2016).
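
As a minimal illustration of the index discussed in Note 1, the sketch below computes an h-index from a list of citation counts, using the standard definition (the largest h such that at least h papers have at least h citations each). The function name `h_index` and the citation counts are made up for illustration and are not taken from the study.

```python
def h_index(citations):
    """Return the largest h such that at least h items have a count >= h."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the paper at this rank still has enough citations
        else:
            break         # counts only get smaller from here on
    return h


# Hypothetical citation counts, for illustration only.
print(h_index([10, 8, 5, 4, 3]))   # 4
# A single very influential paper does not inflate the index.
print(h_index([1000]))             # 1
```

The second call shows the property the note emphasizes: one publication with a very large number of citations raises the total citation count but leaves the h-index at 1.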

2. Three subtypes of News and five subtypes of Fiction (excluding humour) are chosen, since the current analyses are concerned with the thematic features of texts in a more general sense.

3. Among the three text types, Official has only 30 samples. This number is taken as the reference for selecting texts in News and Fiction, since the number of texts in each type should be comparable for the subsequent clustering analysis.

4. Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighbouring clusters, and thus provides a way to assess parameters such as the number of clusters visually. This measure has a range of [-1, 1]. Silhouette coefficients near +1 indicate that the sample is far away from the neighbouring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighbouring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster.
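
The measure in Note 4 can be sketched in a few lines, assuming the usual definition s(i) = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the mean distance to the nearest neighbouring cluster. The function name `silhouette_values`, the use of Euclidean distance and the toy points are assumptions made for illustration; they are not the study's data or software.

```python
import math


def silhouette_values(points, labels):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for every point."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    n = len(points)
    scores = []
    for i in range(n):
        # Mean distance from point i to each cluster, leaving out i itself.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                total = sum(math.dist(points[i], points[j]) for j in members)
                mean_dist[c] = total / len(members)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)   # a singleton cluster is scored 0 by convention
            continue
        a = mean_dist[own]
        b = min(d for c, d in mean_dist.items() if c != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


# Toy data: two tight, well-separated groups (illustrative only).
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_values(pts, [0, 0, 1, 1]))   # values close to +1
print(silhouette_values(pts, [0, 1, 0, 1]))   # negative: points look misassigned
```

With the sensible labelling every coefficient is close to +1; with the deliberately wrong labelling every coefficient is negative, which matches the reading of the plot given in the note.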

5. CPCC (the cophenetic correlation coefficient) is a statistical index that measures the degree of similarity between the distance matrix (the pairwise distances on which the hierarchical clustering is computed) and the cophenetic matrix (for each pair of items, the distance at which they are first merged into the same cluster). The closer the CPCC is to 100%, the better the clustering fits the data.
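
The index in Note 5 can be sketched as follows, assuming Euclidean distances and average linkage (the linkage method actually used in the study is not stated in this note, so that choice is an assumption). The helper names `cophenetic_distances`, `pearson` and `cpcc` are hypothetical, and the points are toy data.

```python
import math
from itertools import combinations


def cophenetic_distances(dist):
    """Average-linkage clustering on a full distance matrix.

    Returns {(i, j): height at which items i and j are first merged}.
    """
    clusters = {i: [i] for i in range(len(dist))}
    coph = {}
    while len(clusters) > 1:
        # Find the two clusters with the smallest mean pairwise distance.
        best = None
        for a, b in combinations(sorted(clusters), 2):
            total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        height, a, b = best
        # Every cross pair is merged for the first time at this height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[(min(i, j), max(i, j))] = height
        clusters[a] = clusters[a] + clusters.pop(b)
    return coph


def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cpcc(points):
    """Correlation between original and cophenetic pairwise distances."""
    dist = [[math.dist(p, q) for q in points] for p in points]
    coph = cophenetic_distances(dist)
    pairs = list(combinations(range(len(points)), 2))
    return pearson([dist[i][j] for i, j in pairs],
                   [coph[(i, j)] for i, j in pairs])


# Toy data with an obvious two-group structure (illustrative only).
print(cpcc([(0, 0), (0, 1), (10, 0), (10, 1)]))
```

For clearly grouped data such as this, the dendrogram heights reproduce the original distances almost exactly, so the value is very close to 1 (100%).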

6. In Figure , ‘x’ stands for the data matrix or data frame, or the dissimilarity matrix or object, depending on the value of the ‘diss’ argument; ‘k’ stands for the number of clusters, which is 3 in the current study for both LCMC and Frown; ‘diss’ is a logical flag: if it is TRUE (the default when ‘x’ is a dist or dissimilarity object), ‘x’ is treated as a dissimilarity matrix; if FALSE, ‘x’ is treated as a matrix of observations by variables. ‘Component 1’ and ‘Component 2’ are principal components from Principal Component Analysis (PCA). The plot can thus be read as a dimensionality reduction: a multivariate data set, viewed as a set of coordinates in a high-dimensional data space, is projected onto a lower-dimensional picture along the directions that explain most of the variance in the data, so that its internal structure becomes visible.
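
To make the roles of ‘x’, ‘k’ and ‘diss’ in Note 6 concrete, here is a simplified k-medoids sketch in Python. It is not the routine used in the study: the function `pam` below is a hypothetical stand-in that only mirrors the three arguments, uses Euclidean distance when `diss` is False, defaults `diss` to False for simplicity, and accepts the first improving swap rather than the best one.

```python
import math


def pam(x, k, diss=False):
    """Simplified k-medoids. Returns (medoid indices, label per observation).

    diss=True:  x is already a square dissimilarity matrix.
    diss=False: x is a matrix of observations by variables.
    """
    d = x if diss else [[math.dist(p, q) for q in x] for p in x]
    n = len(d)

    def cost(medoids):
        # Total dissimilarity of every observation to its nearest medoid.
        return sum(min(d[i][m] for m in medoids) for i in range(n))

    # BUILD: greedily add the medoid that lowers the total cost most.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates, key=lambda c: cost(medoids + [c])))

    # SWAP: exchange a medoid with a non-medoid while that lowers the cost.
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                if cost(trial) < cost(medoids):
                    medoids, improved = trial, True

    labels = [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]
    return medoids, labels


# Toy observations forming three groups (illustrative only).
obs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
       (20, 0), (20, 1), (21, 0)]
medoids, labels = pam(obs, k=3)
print(sorted(medoids), labels)
```

Passing a precomputed dissimilarity matrix with `diss=True` gives the same partition as passing the raw observations, which is the distinction the ‘diss’ flag draws.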

7. Text vectors are represented by the numbers at the bottom of the dendrogram, in which numbers 1–30 stand for News, 31–60 for Official and 61–90 for Fiction.

8. The test distribution of the one-sample Kolmogorov-Smirnov test is normal for all three indicators; the p-values are .011 (TC), .074 (STC) and .936 (PTC) in LCMC, and .026 (TC), .116 (STC) and .086 (PTC) in Frown.
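
For readers unfamiliar with the test named in Note 8, the sketch below shows how a one-sample Kolmogorov-Smirnov statistic and an asymptotic p-value can be computed against a normal distribution. It assumes that the mean and standard deviation are estimated from the sample and that the asymptotic Kolmogorov formula is used; the statistical package and settings behind the reported p-values are not stated in the note, so this will not necessarily reproduce them. The function name `ks_normal` and the samples are hypothetical.

```python
import math


def ks_normal(sample):
    """One-sample K-S test against a normal distribution whose mean and
    standard deviation are estimated from the sample (needs n >= 2, sd > 0).

    Returns (D statistic, asymptotic p-value).
    """
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
    d = 0.0
    for i, v in enumerate(sorted(sample)):
        cdf = 0.5 * (1.0 + math.erf((v - mean) / (sd * math.sqrt(2.0))))
        # Largest gap between the empirical step function and the normal CDF.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    # Asymptotic Kolmogorov distribution; the series converges quickly.
    lam = math.sqrt(n) * d
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))


# Made-up samples: one roughly symmetric, one dominated by an outlier.
print(ks_normal([-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]))
print(ks_normal([0.0] * 19 + [100.0]))
```

A large D (and hence a small p-value) means the empirical distribution departs strongly from the fitted normal curve, as in the second sample. Because the parameters are estimated from the same data, the asymptotic p-value tends to be conservative.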

9. The percentage of verbs is quite small in the pre-h and pre-2h domains in Frown, despite the fact that the current study calculates all values based on the lemmatized forms of words (for example, the lemma ‘do’ represents the word forms does, did, done, and doing). One possible reason is that the pre-h and pre-2h domains are occupied by too many non-thematic contributors, such as the (determiner), which in most cases occupies the first rank, a/an (articles), other determiners, prepositions and pronouns, thus greatly reducing the opportunities for verbs to count as thematic words. Coulthard (1994, p. 202) also mentions that ‘the most frequent 10 words in English are the, of, to, and, a, in, that, it, is and for, and roughly 6% of any text consists simply of the word the’.

10. The following categories, which carry little content meaning, are classified as ‘Markers’ in the current study: the lemmatized forms of all variations of ‘be’, ‘do’ and ‘have’, determiners, negation markers ‘not’/‘n’t’, possessive markers ‘’s’, and articles ‘the’ and ‘a/an’.

11. In Official, the strong emphasis on standardized language use may result in the repeated use of some terms or expressions.

12. Pronouns in Fiction may serve to provide necessary background information about prior and upcoming events in a clear and concise way without causing ambiguity; to provide a factual, objective reportage of recent events (Biber, 1988); and to encourage ‘readers to (temporarily) take an internal field position and to simulate being in the position of the news’ (Van Krieken, Sanders, & Hoeken, 2015, p. 228).
