Abstract
A novel method for the statistical analysis of texts is proposed. The frequency distribution of the first significant digits of numerals in connected authorial English-language texts is considered. Benford’s law is found to hold approximately for these frequencies, with a marked predominance of the digit 1. Differences between the Benford-like distributions for texts by different authors are statistically significant authorial peculiarities that, under certain conditions, make it possible to address the problem of authorship. For the first significant digits 1, 2, and sometimes 3, the actual frequency of occurrence is usually higher than the probability given by Benford’s law; for greater digits the situation is reversed, and the digit distributions show strong fluctuations, which makes them unrepresentative for our purpose. The suggested approach and the conclusions are supported by examples of computer analysis of works by W. M. Thackeray, M. Twain, R. L. Stevenson, et al. The results are confirmed by the parametric Pearson chi-squared test as well as by the non-parametric Mann–Whitney U test and Kruskal–Wallis test.
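For reference, Benford's law assigns the first significant digit d (d = 1, ..., 9) the probability log10(1 + 1/d), which is about 0.301 for d = 1. The comparison described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the function names and the NUMERAL pattern are hypothetical, and the sketch assumes, for simplicity, that only numerals written with digits are counted, whereas numerals written as words would require a separate mapping.

```python
import math
import re
from collections import Counter

# Hypothetical pattern: a run of digits, optionally with thousands
# separators and a decimal part (e.g. "100", "3,000", "0.05").
NUMERAL = re.compile(r"\d[\d,]*(?:\.\d+)?")


def benford_probabilities():
    """Benford's law: P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_counts(text):
    """Count first significant (non-zero) digits of digit-written numerals."""
    counts = Counter()
    for token in NUMERAL.findall(text):
        digit = next((c for c in token if c in "123456789"), None)
        if digit is not None:  # skip tokens such as "0"
            counts[int(digit)] += 1
    return counts


def chi_squared_vs_benford(counts):
    """Pearson chi-squared statistic of observed counts against Benford's law."""
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no numerals found")
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_probabilities().items()
    )


if __name__ == "__main__":
    sample = "World War I ended almost 100 years ago"
    observed = first_digit_counts(sample)
    print(observed)
    print(chi_squared_vs_benford(observed))
```

With nine digit classes, the statistic would be compared with a chi-squared distribution with 8 degrees of freedom; a sample as small as the one in this usage example is far too short for such a test to be meaningful.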
Acknowledgement
I am grateful to Dr William M. Goodman for his valuable comments.
Notes
1. Russian has no grammatical articles, but some constructions can substitute for them. ‘A certain circumstance had crept in, a disagreeable and troublesome factor, which threatened to overturn the whole business’ (‘The Idiot’ by F. Dostoevsky, Part I, Chapter IV, translated by E. Martin): in the Russian original, both occurrences of the indefinite article ‘a’ correspond to the Russian numeral ‘odin’ (one).
2. As an example: ‘World War I … ended almost 100 years ago’, a phrase from Battlefield Events: Landscape, commemoration and heritage (Routledge Advances in Event Research Series), by K. Reeves, G. R. Bird, L. James, B. Stichelbaut, & J. Bourgeois (Eds.), 1st Edition, 2015.
3. Since idioms cannot be decomposed into single words without loss of meaning, numerals that happen to occur in such expressions are not taken into account in the analysis of the use of numerals. Itemizations are analogous to page numbering: apart from the fact that they are not always set by the author himself (they may result from editorial corrections), they are merely a conventional system of markings rather than an expression of the author’s intention. In any case, the deleted items occur rarely and have a negligible influence on the results obtained.
4. They are easily performed in statistical packages (we used SPSS).
5. According to Pearson's chi-squared test, the frequency distribution of text No. 1 (by Harper Lee) significantly differs from that of No. 3 (by Capote) at α = 0.01. If we take No. 2 instead of No. 1, the difference is only significant at α = 0.1.