Abstract
In sequential pattern analysis, the frequency of patterns is evaluated by the support. While the support can be computed efficiently from large databases, we show that it cannot be compared between different databases, since it is influenced by the actual sequence length distribution. Models for this sequence length distribution are surveyed. One of these models, the Good distribution, appears to be sufficiently flexible for practical use. It is used to exemplify an approach for adjusting the relative support such that the resulting adjusted support values are more comparable between different databases. We illustrate our findings with texts from the bilingual FinDe corpus.
Mathematics Subject Classification:
Notes
(1) We abbreviate this type of frequency by , since later in Section 2, p is used to denote the true pattern observation probability.
(2) In typical control chart applications, even the stricter 0.99730-quantile is used for outlier identification. Our motivation: the expected number of “false declarations” is only I · P(L ⩾ b_I) ≈ I · 1/I = 1.
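The arithmetic in note (2) can be checked with a minimal sketch. It assumes that I is the number of sequences, that L is the sequence length, and that b_I is the smallest bound whose tail probability P(L ⩾ b_I) does not exceed 1/I. The geometric length model and the function names are hypothetical choices made for illustration only; they are not taken from the paper.

```python
def tail_prob(b, q):
    """P(L >= b) for a geometric length L on {1, 2, ...} with parameter q.

    The geometric model is an illustrative assumption, not the paper's model.
    """
    return (1.0 - q) ** (b - 1)


def threshold(num_sequences, q):
    """Smallest integer b with P(L >= b) <= 1/I (assumed reading of b_I)."""
    b = 1
    while tail_prob(b, q) > 1.0 / num_sequences:
        b += 1
    return b


def expected_false_declarations(num_sequences, q):
    """I * P(L >= b_I): expected count of regular sequences flagged as outliers."""
    b = threshold(num_sequences, q)
    return num_sequences * tail_prob(b, q)


if __name__ == "__main__":
    for num_sequences in (100, 1000, 10000):
        b = threshold(num_sequences, q=0.1)
        expected = expected_false_declarations(num_sequences, q=0.1)
        print(num_sequences, b, round(expected, 3))
```

Under these assumptions the expected number of false declarations stays at or just below 1 for any I, because the bound moves outward as the database grows. A fixed quantile level would instead let that number grow in proportion to I.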