Abstract
Differences in the writing style of authors provide the basis for authorship attribution and classification studies. This paper attempts to quantify the literary style of various forms of media, including broadsheet and tabloid newspapers, technical periodicals and television news scripts. The aim is to investigate the richness of vocabulary exhibited in these texts, on the premise that writing style usually varies with the targeted readership or audience. We show the importance of using maximum likelihood estimates for all three parameters of the Sichel distribution, as opposed to using the inverse Gaussian Poisson distribution, which is a particular case of it. We then use multivariate pattern recognition techniques such as discriminant analysis, classification trees and neural networks to establish differences between the aforementioned types of media.
Acknowledgements
We would like to thank Diana Lewis, from the Centre for Linguistics and Philology, University of Oxford, for helpful comments and advice. We are also grateful to Lou Burnard and Mike Popham from the Humanities Computing Unit, Oxford University Computing Services, for valuable assistance in retrieving the data from the British National Corpus. Special thanks go to Susan Hutchinson, from the Department of Statistics, University of Oxford, for her advice on computing issues and for providing the algorithm for the initial data tabulation, and to Julian Stander, from the School of Mathematics and Statistics, University of Plymouth, for his comments on a preliminary version of the paper. Most of this work was done when both authors were at the Department of Statistics, University of Oxford. Research at the UCL Institute of Child Health and Great Ormond Street Hospital for Children NHS Trust benefits from Research and Development funding received from the NHS Executive.
Notes
1. In the UK the quality of writing in broadsheet newspapers is generally considered better than that of the tabloids; we would expect the former to have a richer vocabulary than the latter. Note that in 2005 The Independent changed to the tabloid format and The Guardian to the Berliner format. However, we use the terms ‘Broadsheet’ and ‘Tabloid’ to differentiate newspapers by their contents rather than by their formats.