Original Articles

A Method of Text Attribution Based on the Statistics of Numerals

Pages 256-270 | Published online: 20 Sep 2017
 

Abstract

A novel method for the statistical analysis of texts is proposed. It considers the frequency distribution of the first significant digits of numerals in connected English-language texts by individual authors. Benford’s law is found to hold approximately for these frequencies, with a marked predominance of the digit 1. Differences between the Benford-like distributions of texts by different authors are statistically significant authorial peculiarities that, under certain conditions, make it possible to address the problem of authorship. For the first significant digits 1, 2, and sometimes 3, the observed frequency is usually higher than the probability given by Benford’s law; for larger digits the situation is reversed, and the frequencies fluctuate so strongly that the distributions for these digits are unrepresentative for this purpose. The approach and its conclusions are supported by examples of computer analysis of works by W. M. Thackeray, M. Twain, R. L. Stevenson, and others. The results are confirmed by the parametric Pearson chi-squared test as well as by the non-parametric Mann–Whitney U test and Kruskal–Wallis test.
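As a rough illustration of the quantities the abstract refers to, the following Python sketch computes the Benford probabilities P(d) = log10(1 + 1/d) and tallies the first significant digits of numerals in a text. It is not the article's procedure: the names `BENFORD`, `NUMERAL`, `first_digit_counts`, and `relative_frequencies` are hypothetical, and the sketch assumes that only numerals written with digits are counted (numerals spelled out as words are ignored).

```python
import math
import re
from collections import Counter

# Benford's law: probability that the first significant digit equals d.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Assumption: a numeral is a run of digits, optionally with thousands
# separators and a decimal part. Spelled-out numerals are not matched.
NUMERAL = re.compile(r"\d[\d,]*(?:\.\d+)?")


def first_significant_digit(token):
    """Return the first non-zero digit of a numeral token, or None."""
    for ch in token:
        if ch in "123456789":
            return int(ch)
    return None


def first_digit_counts(text):
    """Count first significant digits over all numerals found in text."""
    counts = Counter()
    for token in NUMERAL.findall(text):
        digit = first_significant_digit(token)
        if digit is not None:
            counts[digit] += 1
    return counts


def relative_frequencies(counts):
    """Convert digit counts to observed frequencies for digits 1..9."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: counts.get(d, 0) / total for d in range(1, 10)}
```

Comparing `relative_frequencies(first_digit_counts(text))[d]` with `BENFORD[d]` for each digit shows where a text's observed frequencies lie above or below the Benford probabilities.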

Acknowledgement

I am grateful to Dr William M. Goodman for his valuable comments.

Notes

1. Russian has no grammatical articles, but some constructions can substitute for them. ‘A certain circumstance had crept in, a disagreeable and troublesome factor, which threatened to overturn the whole business’ (‘The Idiot’ by F. Dostoevsky, Part I, Chapter IV, translated by E. Martin): in the Russian original, both occurrences of the indefinite article ‘a’ correspond to the Russian numeral ‘odin’ (one).

2. For example: ‘World War I … ended almost 100 years ago’, a phrase from Battlefield Events: Landscape, commemoration and heritage (Routledge Advances in Event Research Series), edited by K. Reeves, G. R. Bird, L. James, B. Stichelbaut, & J. Bourgeois, 1st Edition, 2015.

3. Since idioms cannot be decomposed into single words without loss of meaning, numerals that happen to occur in such expressions are not taken into account in the analysis of the use of numerals. Itemizations are analogous to page numbering: they are not always set by the author himself (they may result from editorial corrections), and they are merely a conventional system of markings rather than an expression of the author’s intention. In any case, because the deleted items occur rarely, they have a negligible influence on the results obtained.

4. They are easily performed in statistical packages (we used SPSS).

5. According to Pearson's chi-squared test, the frequency distribution of text No. 1 (by Harper Lee) significantly differs from that of No. 3 (by Capote) at α = 0.01. If we take No. 2 instead of No. 1, the difference is only significant at α = 0.1.
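The kind of pairwise comparison described in Note 5 can be sketched in Python as a Pearson chi-squared test of homogeneity on two rows of first-digit counts. This is only an illustration under stated assumptions: the article's analysis was run in SPSS, the function names `chi2_homogeneity` and `significant_at` are hypothetical, all nine digit classes are assumed to be used (8 degrees of freedom), and the counts below are invented for demonstration and are not data from the article.

```python
# Upper-tail critical values of the chi-squared distribution for
# 8 degrees of freedom (two texts, nine digit classes).
CRITICAL_DF8 = {0.10: 13.362, 0.05: 15.507, 0.01: 20.090}


def chi2_homogeneity(counts_a, counts_b):
    """Pearson chi-squared statistic for a 2 x k table of counts."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both texts need the same number of digit classes")
    n_a, n_b = sum(counts_a), sum(counts_b)
    n = n_a + n_b
    stat = 0.0
    for x, y in zip(counts_a, counts_b):
        column = x + y
        if column == 0:
            raise ValueError("a digit class is empty in both texts")
        # Expected counts if both texts shared one digit distribution.
        expected_a = n_a * column / n
        expected_b = n_b * column / n
        stat += (x - expected_a) ** 2 / expected_a
        stat += (y - expected_b) ** 2 / expected_b
    return stat


def significant_at(stat, alpha, critical=CRITICAL_DF8):
    """True if the statistic exceeds the critical value at level alpha."""
    return stat > critical[alpha]


# Hypothetical first-digit counts for digits 1..9 (not from the article).
text_a = [120, 60, 35, 20, 15, 12, 10, 8, 7]
text_b = [90, 70, 50, 25, 18, 10, 9, 8, 6]

stat = chi2_homogeneity(text_a, text_b)
for alpha in (0.10, 0.05, 0.01):
    print(alpha, significant_at(stat, alpha))
```

A difference that is significant at α = 0.1 but not at α = 0.01, as reported for texts No. 2 and No. 3, corresponds here to a statistic between the two critical values.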
