Original Articles

Finding the Minimum Document Length for Reliable Clustering of Multi-Document Natural Language Corpora

Pages 23-52 | Published online: 24 Feb 2011
 

Abstract

This paper deals with a problem that can arise when the aim is to cluster a document collection by textual feature frequency and there is substantial variation in document length. The first part of the discussion shows why such length variation can be a problem for frequency-based clustering. The second describes data normalizations to deal with the problem and shows that these are unreliable where documents are too short to provide accurate probability estimates for data variables. The third uses statistical sampling theory to develop a method for identifying and eliminating such documents from the analysis.

Cluster analysis is used across many science and engineering disciplines to identify interesting structure in data; see for example Gan, Ma and Wu (2007, ch. 18), Xu & Wunsch (2009, pp. 8–12), and the extensive references to cluster analysis applications on the Web. The advent of digital electronic natural language text has seen its application in disciplines like information retrieval (Manning et al., 2008) and data mining (Feldman & Sanger, 2007) and, increasingly, in corpus-based linguistics (Moisl, 2009). In all these application areas, the reliability of cluster analytical results depends on the combination of the clustering algorithm being used and the characteristics of the data being analysed. Here “reliability” is understood as the extent to which the result identifies structure that really is present in the domain from which the data was abstracted, given that some well-defined sense of what it means for structure to be “really present” is available. This discussion focuses on how the reliability of cluster analysis can be compromised by one particular characteristic of data abstracted from natural language corpora, and what to do about it.

That characteristic arises when the aim is to cluster a collection of length-varying documents based on the frequency of occurrence of one or more linguistic or textual features; recent examples are clustering of the suras of the Qur'an on the basis of lexical frequency (Thabet, 2005) and of dialect speakers on the basis of phonetic segment frequency (Moisl et al., 2006). Because longer documents are likely to contain more examples of the features of interest than shorter ones, the frequencies of the data variables representing those features will be numerically larger for the longer documents than for the shorter ones. This in turn leads one to expect that the documents will cluster in accordance with relative length rather than with more interesting criteria latent in the data; this expectation has been empirically confirmed (for example Thabet, 2005). The solution is to eliminate relative document length as a factor by adjusting the data frequencies using a length normalization method. This is not a panacea, however. One or more documents in the collection might be too short to provide accurate population probability estimates for the variables, and, because length normalization methods exacerbate such inaccuracies, analysis based on the normalized data clusters the documents in question inaccurately. To deal with this problem, the present discussion proposes defining a minimum length threshold for acceptably accurate variable probability estimation, and eliminating from the analysis any documents which fall below that threshold.
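The argument above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three toy documents and their counts are invented, document length is taken to be the sum of the counts, the function names are hypothetical, and the threshold uses the textbook sample-size formula for estimating a proportion, n = p(1 - p)(z / e)^2, which may differ from the method the paper itself develops.

```python
import math


def relative_frequencies(counts):
    """Length-normalize raw counts by dividing each by the document length."""
    total = sum(counts)
    return [c / total for c in counts]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_length(p, margin, z=1.96):
    """Sample size needed to estimate proportion p within +/- margin.

    Standard formula n = p(1 - p)(z / margin)^2; z = 1.96 corresponds to
    roughly 95% confidence. Assumed here, not taken from the paper.
    """
    return math.ceil(p * (1 - p) * (z / margin) ** 2)


# Invented counts for three features. "long" and "short" share a profile
# but differ in length; "other" has a different profile but matches
# "long" in length.
docs = {
    "long": [300, 100, 600],
    "short": [30, 10, 60],
    "other": [100, 300, 600],
}

# On raw counts, length dominates: "long" is nearer to "other".
raw_same_profile = euclidean(docs["long"], docs["short"])
raw_same_length = euclidean(docs["long"], docs["other"])

# After normalization, profile dominates: "long" is nearer to "short".
norm = {name: relative_frequencies(c) for name, c in docs.items()}
norm_same_profile = euclidean(norm["long"], norm["short"])
norm_same_length = euclidean(norm["long"], norm["other"])

# Estimate each feature's probability from the pooled corpus, then take
# the largest required length over all features as the threshold.
pooled = [sum(column) for column in zip(*docs.values())]
pooled_p = relative_frequencies(pooled)
threshold = max(min_length(p, margin=0.05) for p in pooled_p)

# Keep only documents long enough for acceptably accurate estimates.
kept = [name for name, counts in docs.items() if sum(counts) >= threshold]
```

In this toy case the short document happens to have exactly the same proportions as the long one, so normalization works perfectly; in real data a document of 100 tokens would show sampling noise, which is the reason for filtering by `threshold` before clustering.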
