Original Articles

Sequential Pattern Analysis: A Statistical Investigation of Sequence Length and Support

Pages 1044-1062 | Received 12 Aug 2010, Accepted 22 Dec 2011, Published online: 02 Jan 2013
 

Abstract

In sequential pattern analysis, the frequency of a pattern is measured by its support. Although the support can be computed efficiently from large databases, we show that support values cannot be compared directly between different databases, because they are influenced by the actual sequence length distribution. We survey models for this sequence length distribution. One of these models, the Good distribution, appears to be sufficiently flexible for practical use. We use it to exemplify an approach for adjusting the relative support so that the resulting adjusted support values are more comparable between different databases. We illustrate our findings with texts from the bilingual FinDe corpus.
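
The dependence of the support on sequence length can be illustrated with a small sketch. It assumes the common definition of relative support as the fraction of database sequences that contain the pattern as a (not necessarily contiguous) subsequence; the abstract itself does not spell out the definition, and the function names and toy databases below are hypothetical, not taken from the article or the FinDe corpus.

```python
def contains_subsequence(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` in order, gaps allowed."""
    it = iter(sequence)
    # `item in it` advances the iterator until it finds `item`,
    # so successive pattern items must appear in order.
    return all(item in it for item in pattern)


def relative_support(database, pattern):
    """Fraction of sequences in `database` that contain `pattern` (assumed definition)."""
    if not database:
        raise ValueError("database must contain at least one sequence")
    hits = sum(contains_subsequence(seq, pattern) for seq in database)
    return hits / len(database)


# Hypothetical toy databases over the same alphabet {a, b, c}:
# one with short sequences, one with long sequences.
short_db = [["a", "b"], ["b", "a"], ["a", "c"], ["c", "b"]]
long_db = [
    ["a", "c", "b", "a"],
    ["b", "a", "c", "b"],
    ["c", "a", "b", "c"],
    ["a", "b", "c", "a"],
]
pattern = ["a", "b"]

print(relative_support(short_db, pattern))  # 0.25
print(relative_support(long_db, pattern))   # 1.0
```

In this toy setting the longer sequences offer more opportunities for the pattern to occur, so its relative support is higher even though both databases use the same items. This is only meant to make the comparability problem concrete; it does not implement the adjustment based on the Good distribution.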


Notes

We abbreviate this type of frequency by , since later in Section 2, p is used to denote the true pattern observation probability.

(2) In typical control chart applications, even the stricter 0.99730-quantile is used for outlier identification. Our motivation: the expected number of “false declarations” is only I · P(LbI ) ≈ I · 1/I = 1.
