Abstract
Quantifying the concepts of co-occurrence and iterated co-occurrence yields indices of similarity between words or between documents. These similarities are associated with a reversible Markov transition matrix, whose formal properties enable us to define Euclidean distances; these in turn make it possible to perform word-document correspondence analysis as well as word (or document) classifications at various co-occurrence orders.
Acknowledgements
Thanks to M. Rajman and J.-C. Chappelier for stimulating discussions, and to N. Jufer and S. Durrer for their textual data.
Notes
1 This distribution is unique if $n_{jk}$ is irreducible, that is, if it does not degenerate into two or more components (for instance, one component containing French words occurring only in French documents and another containing German words occurring only in German documents, with no lexical intersection).
2 Reversibility here characterizes the word-word or document-document association and does not, of course, refer to the sequential ordering of words inside documents.
3 Cf. the behaviour of category DETDEMFS in illustration 5 below.
4 A significant exception to this is the case of co-ordination.
5 La Liberté, published in Fribourg, Switzerland.
6 Key to the abbreviations: PREP = preposition, ADV = adverb, NC(M|F)(S|P) = masculine/feminine singular/plural common noun, ADJ(M|F)(S|P) = masculine/feminine singular/plural adjective, ADJ(S|P)IG = idem, gender-invariant, DET(I|D|DEM|POSS)(M|F)S = indefinite/definite/demonstrative/possessive masculine/feminine singular article, DET(I|D|DEM|POSS)(S|P)IG = idem, singular/plural gender-invariant.
7 Proof:
The associativity
Singular-plural factor α = 2 opposes cluster 2 ((ADJ|DETPOSS)SIG) to cluster 4 (NC(F|M)P, ADJ(F|M)P, ADJPIG, ADJNUM, DET(D|I|DEM|POSS)PIG). Masculine-feminine factor α = 3 opposes cluster 1 ((NC|ADJ)MS, DET(D|I|DEM|POSS)MS) to cluster 3 ((NC|ADJ)FS, DET(D|I|DEM|POSS)FS).