Discovery of process variants based on trace context tree


Abstract

Process variants usually exhibit a high degree of internal heterogeneity, in the sense that the executions of the process differ widely from each other due to contextual factors, human factors, or deliberate business decisions. Understanding the differences among process variants helps analysts and managers make informed decisions about how to standardise or otherwise improve a business process. Existing process variant mining approaches typically fall short of fully supporting semantic process variability mining; in particular, they rarely take activity behaviour relationships and trace context semantics into consideration. Here, we propose a semantic process variant discovery method that addresses the difficulty of distinguishing similar-but-different behaviours directly from event logs. More specifically, we adapt the concepts of benchmark log and trace context tree to formalise the context semantics of an event log and to partition the benchmark logs into several parts; the resulting trace cohorts are then mapped to configurable process variants. In the experimental part, performance metrics of the proposed method are evaluated on real-world event logs, supporting its usefulness. The experimental results show that the proposed method is able to distinguish similar-but-different behaviours and is superior to the characteristic trace clustering method using conventional neural networks.


1. Introduction

Owing to inevitable software maintenance and the adaptability of process models, many process models in practical applications are derived from the same base model in order to match the increasing individualisation of customer demands. This kind of configurable process model is gaining importance because it offers benefits such as reusability and flexibility compared to traditional predefined business process models. Configurable process models derived from the same base model usually have a high level of similarity, and cannot be differentiated from each other using state-of-the-art similarity measurements. This is especially true in the scenario of process variant mining directly from event logs, without relying on any a priori reference model.

Process variant analysis is a set of techniques to analyse event logs produced during the execution of a process, in order to identify and explain the differences between two or more process models (van der Aalst, Citation2022). The goal of process variant analysis is to help business analysts or stakeholders to understand why and how multiple variants of a process differ (Lopez-Martinez-Carrasco et al., Citation2021).

In this setting, a process variant is a subset of executions of a business process that can be distinguished from others based on some characteristics (Taymouri et al., Citation2021). In this work, we refer to sets of execution logs that have a high degree of similarity as similar-but-different process variants. For example, an organisation may orchestrate a given business process differently in different settings, such as product sales processes in different countries (say $C_1$, $C_2$, $C_3$, $C_4$), or accounting processes in different branches (say $C_1$, $C_2$, $C_3$, $C_4$). Because the actual executions of the same process may vary with time and geography, we obtain 4 similar-but-different process variants: one for each of these countries or branches. In these variants, some relevant event data such as location, business modules, products, and customer types may change, but the main process models are similar and can be divided into differentiated clusters. The sub-models of the clusters are functionally homogeneous, but can be differentiated from each other by a number of partial variations; these similar models can be formalised, understood, and expressed as process variants.

Process variants have proven to be a mainstream development technology for flexible business systems adapting to different markets, and a wide range of methods for process variant analysis have been proposed in the past decade, such as configurable BPMN (Business Process Modelling Notation) and configurable Petri nets (Van Den Ingh et al., Citation2021). The latest process variant mining or discovery techniques can be divided into three categories:

Process variability modelling methods, which mainly deduce process variants through configurable operations to base model;
Configurable process mining methods, which discover process variants through semantic trace fragmentation or slicing operations;
Trace clustering methods based on machine learning, which utilise characteristic data clustering methods to extract process variants.

Due to the interdisciplinary nature of this field, the existing methods and the types of differences they can identify vary widely. The challenges encountered in process variant discovery relate to model creation and configuration. Recently, process mining has offered advanced techniques to discover configurable process models, check their conformance, and enhance them using a collection of event logs that captures traces recorded during the execution of process variants (Bettina et al., Citation2022). However, existing works in configurable process mining lack the incorporation of semantics in the resulting model. Historically, semantic process mining has been applied to event logs to improve process discovery with respect to semantics (De Leoni et al., Citation2016; Khannat et al., Citation2021).

This paper integrates the advantages of configurable process mining and trace clustering methods, and presents a log-based process variant discovery method. The main contributions of this paper are as follows:

The formalisation of the behaviour semantics of an event log, enriching the collection of event logs with configurable benchmark log concepts that capture the variability of elements present in the logs. This is an important step towards discovering semantically enriched process variants. First, the concepts of benchmark log and trace context tree are formalised to describe the behaviour semantics of an event log, where the kth-strict order relationship between activities is highlighted. Then, a weighted frequency cosine similarity measurement is presented in order to select the representative activity nodes and their neighbourhood length in the benchmark log. Finally, the context tree of the event log is constructed in the form of a frequent pattern tree (abbreviated as FP tree).
The construction of a configurable process variant discovery method based on the trace context tree, named the semantic α splitting method. The semantic α splitting method discovers trace clusters directly from an event log, and it combines behaviour profiles and trace clustering techniques. In the experimental part, results show that the semantic α splitting method can identify process variants that cannot be distinguished by the existing methods, and that it has higher fitness and precision than characteristic trace clustering methods using conventional neural network (abbreviated as CNN), especially in scenarios where different variants have different probability distributions.

The trace semantic α splitting method takes configurable process mining as its starting point, and it improves on the characteristic trace clustering method in fitness and precision. It has three advantages. First, it constructs a configurable process mining framework that incorporates behaviour profiles, which enriches the process variant mining techniques available in configurable process mining. Second, it builds a trace similarity measurement that incorporates the various behaviour relationships of activities in an event log, which extends the classical similarity measurement in machine learning. Third, it simplifies the traces of an event log to the greatest extent while preserving the behavioural relationships of activities.

In addition to these three advantages, the main innovation of the proposed method is the extraction of both short and long dependencies among activities as trace semantics, expressed as a trace context tree with neighbourhood length k ($k \geq 1$).

The remainder of this paper is structured as follows. Section 2 reviews some basic concepts and notations, and Section 3 introduces the related work. Section 4 presents an illustrative motivation example. Section 5 introduces the proposed method of this work, named semantic α splitting method based on the activity context of event log. Section 6 conducts experiments and analyses the experimental results. Finally, Section 7 concludes this paper.

2. Preliminaries

In this section, we briefly review some terminology, such as events, traces, event logs, and log behaviour profiles, based on previous work (Agostinelli et al., Citation2023; Lu et al., Citation2022; Wang et al., Citation2022), in order to improve the readability of this paper.

A business process is a set of activities executed in a given setting to achieve a predefined business objective. An activity is an expression of the form $A(a_1, a_2, \dots, a_{n_A})$, where A is the activity name and each $a_i$ is an attribute name. We call $n_A$ the arity of A. The attribute names of an activity are all distinct, but different activities may contain attributes with matching names (Agostinelli et al., Citation2023).

We assume a finite set Act of activities, all with distinct names; thus, activities can be identified by their name instead of by the whole tuple. Every attribute $a_i$ of an activity A is associated with a type $D_A(a_i)$, i.e. the set of values that can be assigned to $a_i$ when the activity is executed (Agostinelli et al., Citation2023).

An event is the execution of an activity and is formally captured by an expression of the form $e = A(v_1, v_2, \dots, v_{n_A})$, where $A \in Act$ is an activity name and $v_i \in D_A(a_i)$ (Agostinelli et al., Citation2023). The set of events is denoted as Event.

A trace is formally defined as a finite sequence of events $σ = ⟨e_1, e_2, \dots, e_n⟩$ with $e_i = A_i(v_1, v_2, \dots, v_{n_{A_i}})$. Traces model process executions, i.e. the sequences of activities performed by a process instance CID. A finite collection of executions gathered into a set L of traces is called an event log (Agostinelli et al., Citation2023).
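The definitions above can be pictured as plain data structures. The following is a minimal sketch, not the representation used in the cited work: the names `Event`, `control_flow` and `variant_counts` are hypothetical, attribute values are kept as an untyped tuple, and the log is a list so that repeated executions are kept.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Event:
    """One execution of an activity, e = A(v_1, ..., v_n)."""
    activity: str                    # activity name A in Act
    values: Tuple[object, ...] = ()  # attribute values v_i (untyped in this sketch)


def control_flow(trace):
    """Keep only the activity names of a trace (a tuple of events)."""
    return tuple(e.activity for e in trace)


def variant_counts(log):
    """Count how many traces in the log share the same activity sequence."""
    return Counter(control_flow(t) for t in log)


# Hypothetical log: three cases following <A, B, C, E> and one following <A, C, E>.
t1 = tuple(Event(a) for a in "ABCE")
t2 = tuple(Event(a) for a in "ACE")
log = [t1, t1, t1, t2]
counts = variant_counts(log)
```

Grouping traces by their activity sequence gives the compact "sequence with frequency" view of a log that the later sections rely on.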

Take the event log in Table 1 as an example. The event log contains 10 traces, with 3 instances of trace $C_1$ (cf. column “CID”), 2 instances of $C_2$, etc. Herein, $σ = ⟨A, B, C, E⟩$ refers to any of the three instances of $C_1$, such that $E_σ = \{e_0, e_1, e_2, e_4\}$ and $ψ(e_0) = A, ψ(e_1) = B, \dots, ψ(e_4) = E$.

Table 1. An example of event logs.


Definition 2.1

Weak order relationship of events (Fang et al., Citation2020; Lu et al., Citation2022)

Let L be an event log, and let $σ_i = ⟨e_1, e_2, \dots, e_m⟩ \in L$ be any trace in L. The weak order relationship $≺_L \subseteq (Event \times Event)$ contains all event pairs $(x, y)$ such that $\exists j, k \in \{1, \dots, m\} : j < k \leq m \land e_j = x \land e_k = y$.

Definition 2.2

Log behaviour profiles (Fang et al., Citation2020; Lu et al., Citation2022)

Let L be an event log. An event pair $(x, y) \in (Event \times Event)$ is in at most one of the following relations:

the strict order relation $\to$, if $x ≺_L y \land y ⊀_L x$.
the interleaving order relation ∥, if $x ≺_L y \land y ≺_L x$.
the strict reverse order relation $\leftarrow^{-1}$, if $y ≺_L x \land x ⊀_L y$.

The set $B_L = \{\to, ∥, \leftarrow^{-1}\}$ is the log behaviour profile of L.

The exclusiveness relation does not appear in the behaviour profile $B_L$ of L. If a pair of events is mutually exclusive, then the two events never occur in the same trace; however, the converse does not necessarily hold, because two events that never co-occur in the recorded traces need not be exclusive. Therefore, the exclusiveness relationship cannot be deduced from the event log alone.
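As an illustration of Definitions 2.1 and 2.2, the sketch below derives the weak order and the three profile relations from a small log. It is a simplified reading that works on activity names rather than on individual events; the function names and the toy log are hypothetical.

```python
from itertools import combinations


def weak_order(log):
    """All pairs (x, y) such that x occurs before y in at least one trace."""
    pairs = set()
    for trace in log:
        for i, j in combinations(range(len(trace)), 2):  # always i < j
            pairs.add((trace[i], trace[j]))
    return pairs


def behaviour_profile(log):
    """Map each related ordered pair (x, y) to '->', '||' or '<-'."""
    weak = weak_order(log)
    profile = {}
    for x, y in weak:
        if x == y:
            continue  # repeated activity inside one trace; not classified here
        if (y, x) in weak:
            profile[(x, y)] = "||"   # seen in both orders: interleaving
        else:
            profile[(x, y)] = "->"   # only x before y: strict order
            profile[(y, x)] = "<-"   # reverse strict order for the mirrored pair
    return profile


# Hypothetical toy log.
log = [("A", "B", "C", "E"), ("A", "C", "B", "E"), ("A", "D", "E")]
profile = behaviour_profile(log)
```

Pairs that never co-occur, such as B and D in this toy log, receive no entry at all, which mirrors the remark that exclusiveness cannot be concluded from the log alone.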

Definition 2.3

Frequent Pattern Tree (FP tree) (Borah & Nath, Citation2018)

A triple $FPT = (Tr, P, H)$ that meets the following conditions is called a frequent pattern tree, where:

Tr is the root node of the tree;
P is the item prefix subtree, whose node item is denoted as $tr = (tname, count, nodelink)$, where tname represents the identifier of the node item, count represents the number of subpaths from the root to it, and nodelink represents the next node with the same identifier tname in the prefix subtree;
$H = (iname, hnodelink)$ is a frequent item header table, where iname stands for the frequent item identification domain, and hnodelink is a pointer to the first frequent item node with the same item identifier in the prefix subtree.
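The triple in Definition 2.3 can be sketched as a small prefix tree with a header table. This is an illustrative sketch under assumptions rather than the construction used in the cited work: the class names and the `_last` helper are hypothetical, `count` is read as the number of inserted sequences passing through a node, and items keep the order they have in each trace instead of being re-sorted by frequency as in classic FP-growth.

```python
class FPNode:
    def __init__(self, tname, parent=None):
        self.tname = tname      # item identifier
        self.count = 0          # inserted sequences whose path passes through this node
        self.parent = parent
        self.children = {}      # tname -> FPNode
        self.nodelink = None    # next node carrying the same tname


class FPTree:
    def __init__(self):
        self.root = FPNode(None)  # Tr
        self.header = {}          # H: iname -> first node with that name (hnodelink)
        self._last = {}           # helper: iname -> most recently linked node

    def insert(self, items, count=1):
        """Insert one sequence of items with the given frequency."""
        node = self.root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                if item in self._last:
                    self._last[item].nodelink = child  # extend the node-link chain
                else:
                    self.header[item] = child          # first node for this item
                self._last[item] = child
            child.count += count
            node = child

    def nodes(self, item):
        """Follow the header table and node links for one item."""
        node = self.header.get(item)
        while node is not None:
            yield node
            node = node.nodelink


tree = FPTree()
tree.insert(("E", "A", "B"), 274)
tree.insert(("E", "F", "B"), 375)
shared_prefix_count = tree.root.children["E"].count
b_counts = [n.count for n in tree.nodes("B")]
```

The two inserted sequences share the prefix node E, while B appears on two branches that are reachable from the header table through the node links.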

3. Related work

Process variant analysis is a rather broad topic, and the main research question (RQ for short) of this field can be summarised as “given a set of executions of two or more process variants, how to identify and explain the differences among variants?” (Taymouri et al., Citation2021). The RQ can be tackled from three different perspectives: process variability modelling, configurable process mining and trace clustering.

3.1. Process variability modelling methods

From the perspective of process variability modelling, process variants are expressed as a cluster of process models, and different operational management of the base model generates process cohorts (Döhring et al., Citation2014; Li et al., Citation2011; Rosa et al., Citation2017; Taymouri et al., Citation2021; Van Den Ingh et al., Citation2021). In this scenario, the above-mentioned RQ can be simplified into a fine-grained problem named RQ1.

Given a set of two or more process variants models, how to identify and explain the difference among variants?

Suppose that the base model of a specific business process is known. Process variant clusters or families can then be obtained by applying personalised configuration operations to the base model. These configurable operations are performed by a stakeholder, process manager, or end-user, and can be expressed in the form of Not-Functional-Requirement (Taymouri et al., Citation2021), reasonable process fragments (Schunselaar et al., Citation2012), declarative variability rules (van Beest et al., Citation2019), etc.

Although the digital and the physical worlds are closely aligned, and it is possible to track operational management processes in detail to some extent, challenges remain in identifying these variants (Pourbafrani et al., Citation2020). When model-based comparison is applied to process variants, the key problem is that the variants are compared in terms of their model structure, whereas we aim to compare their behaviour. A low-level behavioural representation, i.e. transition systems, is therefore preferred over high-level process modelling languages such as BPMN or Petri nets. However, low-level modelling methods suffer from state explosion. Moreover, existing process variability modelling methods mostly take the control-flow perspective, and a further drawback of model-based approaches is that they are unable to detect differences in terms of frequency or other perspectives. Therefore, additional comprehensive techniques should be taken into consideration to support advanced process variability modelling.

3.2. Configurable process mining methods

Process mining (van der Aalst, Citation2022) is a body of methods and tools to analyse business execution logs (named event logs), and many organisations have adopted process mining tools that use these event data for the discovery and analysis of the actual execution of their business process. In this context, an event log describes all that occurred during the execution of the relevant system by the end-users, such as events, activities, time stamps, case instance, etc.

From the perspective of process mining, process variants are finally formalised as a cluster of process models, so the RQ defined above can be further refined into a fine-grained problem, named RQ2.

Given a set of event logs of two or more process variants executions, how to identify and explain the differences among process variants?

This kind of process variant mining method starts directly from event logs and does not depend on any a priori knowledge about the business process model; the first step is usually to split the event logs into cohorts using trace merging or splitting operations. Chan et al. used process mining technology to mine configurable process models from event log collections, and proposed a frequency-based method to guide the configuration process for discovering process variants (Chan et al., Citation2014). Folino et al. proposed an automatic discovery method from the perspective of the control flow of event logs (Folino et al., Citation2015). This method generates collections of workflow patterns of process models; each workflow pattern describes a trace cluster, which can be further used to discover process variants. Bolt et al. integrated the control-flow perspective and the performance perspective of the event log in order to detect related process variants in an interactive manner (Bolt et al., Citation2017). To address the problem that existing works in configurable process discovery lack the incorporation of semantics in the resulting model, Khannat et al. proposed a novel method to enrich the collection of event logs with configurable process ontology concepts by introducing semantic annotations, which can capture the variability of elements in the event log (Khannat et al., Citation2021).

Besides the studies mentioned above, there are other related process fragmentation or slicing studies about process variants. Because process variants are composed of several process fragments with commonalities and differences, they usually have high similarity. These similarities can be used to merge a cluster of variants together. Hasankiyadeh et al. used a process slicing algorithm to identify process fragments in the event log (Hasankiyadeh et al., Citation2014). Later, Pourmasoumi et al. proposed an algorithm to extract morphological fragments from the event log (Pourmasoumi et al., Citation2017). Reusing the extracted morphological fragments is expected to reduce the cost of designing a new process model and to speed up the design process. Hmami et al. introduced configurable process models and the variability concept in change mining approaches, and proposed an approach for merging and filtering a collection of event logs from the same family with respect to variability (Hmami et al., Citation2021). This method aims to enhance change mining from a collection of event logs and to detect changes in variable fragments of the obtained event log.

Historically, semantic process mining has been applied to event logs to improve process discovery with respect to semantics. At present, the greatest challenge in configurable process mining is how to introduce semantics into the mining procedure so as to distinguish similar-but-different variants effectively. This is why, in this paper, we have opted for a context tree that combines frequency and behaviour relationships as the semantic expression of an event log.

3.3. Trace clustering methods

From the perspective of machine learning, process variants are finally formalised as a cluster of process traces, so the RQ defined above is equivalent to RQ2.

Machine learning is the systematic design, analysis and study of algorithms and systems that learn from past experience. Machine learning is inherently a multidisciplinary field. As for RQ2, trace clustering is a suitable technique to divide event log into several clusters or cohorts.

Most of the state-of-the-art process model discovery methods focus on how to find well-structured and understandable process models (De Leoni et al., Citation2016; Zandkarimi et al., Citation2020). However, due to the highly flexible and complex nature of processes, it may be particularly difficult to find the actual process models being executed in some real-life environments, such as health-care, product development, customer support and other processes. When processing such unstructured processes, process mining algorithms can generate an incomprehensible “spaghetti-like” process model (De Koninck et al., Citation2021). One of the main reasons is the diversity of event logs, that is, there are local, non-significant differences between several process execution instances. There are many implicit process variants in these process execution instances, and the model of each variant is more suitable for describing some personalised logs than the complete process model.

Therefore, in order to solve the problems caused by the local diversity of event logs, researchers have proposed many different techniques. In addition to event log filtering, event log conversion, event log sampling and specially developed process mining algorithm (De Leoni et al., Citation2016), another method to overcome this limitation is trace clustering (De Koninck et al., Citation2021; Delias et al., Citation2023; Luengo & Sepúlveda, Citation2011; Tariq et al., Citation2021; Tavares et al., Citation2022; Vertuam Neto et al., Citation2021; Xu & Liu, Citation2019; Zandkarimi et al., Citation2020).

Trace clustering techniques divide the log into more homogeneous subsets by reducing the number of process log instances that participate in the analysis at one time, and by using similarity metrics to measure the similarity between process instances. Reported results indicate that trace clustering can enhance process mining, because the traces in each cluster can be analysed independently in a flexible environment. In subsequent research, many researchers have tried to improve trace clustering algorithms from the perspectives of feature extraction, trace coding and distance measurement, in order to enable the process mining algorithm to generate more accurate process models.

Trace clustering not only has advantages in model discovery, but also can be used in the fields of conformance checking (De Koninck et al., Citation2021), compliance detection (Tavares et al., Citation2022; Xu & Liu, Citation2019), log repairing (Tariq et al., Citation2021; Vertuam Neto et al., Citation2021), concept drift detection (Richetti et al., Citation2022), and process monitoring prediction (Tang et al., Citation2022).

3.4. Summary of related work

As outlined in the previous subsections, a wide range of methods have been proposed to tackle the problem of process variant mining. However, because of the heterogeneous nature of the underlying algorithms, there remain some deficiencies and challenges, shown in Table 2, that need further study and improvement. The state-of-the-art log-based process variant mining methods, whether from configurable process mining or trace clustering, are designed to distinguish similar-but-different behaviours of process variants, but only a few studies take the relationships of activities or event semantics into consideration.

Table 2. The comparisons of different process variants discovery methods.


Behaviour profile has been proved to be an effective technique to evaluate the relationships between activities in event logs (Tang et al., Citation2022). Therefore, based on our previous work (Fang et al., Citation2020), this paper takes the relationships between activities named activity context semantics into consideration, and proposes a new log-based approach to discover process variants, incorporating the concept of Frequent Pattern tree in pattern mining field (Borah & Nath, Citation2018), in order to distinguish similar-but-different behaviours among process variant clusters.

4. Motivation

In order to provide a better understanding of process variants mining method, we first introduce an example to illustrate the motivation of this paper.

Assume that there are two event logs $L_1 = \{⟨E, A, B⟩^{274}, ⟨E, F, B⟩^{375}, ⟨I, G, J, H⟩^{476}, ⟨I, C, D, J, H⟩^{875}\}$ and $L_2 = \{⟨E, A, B, J⟩^{274}, ⟨E, F, B, J⟩^{300}, ⟨E, X, B, J⟩^{75}, ⟨I, G, J, H⟩^{276}, ⟨E, C, D, I, H⟩^{675}, ⟨E, G, H⟩^{400}\}$. Model $M_1$, shown in Figure 1(a), is mined from $L_1$ using the inductive mining algorithm on the pm4py platform (pm4py is the leading open source process mining platform written in Python). Model $M_2$, shown in Figure 1(b), is obtained by applying four personalised operations to model $M_1$, namely $Move(M_1, E, start, I)$, $Move(M_1, J, B, end)$, $Insert(M_1, X, E, B)$, and $Move(M_1, I, D, H)$. Here, $Insert(M_1, X_1, X_2, X_3)$ means inserting activity $X_1$ after $X_2$ and before $X_3$ in model $M_1$, and $Move(M_1, X_1, X_2, X_3)$ means moving activity $X_1$ to the position after $X_2$ and before $X_3$ in model $M_1$.

Figure 1. A reference model and its process variant. (a) The mined reference model $M_{1}$ for log $L_{1}$ using inductive mining method (b) A process variant model $M_{2}$ from $M_{1}$ by personalised operations.

Suppose that $L_1$ and $L_2$ are known, while the models $M_1$ and $M_2$ are not known a priori. Let $L = L_1 ⊔ L_2$, that is, the two groups of event logs are mixed together. We apply inductive mining on pm4py to log L, and the mined model is shown in Figure 2. The fitness of the model for log L is 1; however, the precision is only 0.41.
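The mixed log can be pictured as a multiset sum of the two logs. The sketch below stores each log as a `collections.Counter` keyed by activity sequence and assumes that ⊔ adds the frequencies of identical traces; that reading of the operator is an assumption made for illustration.

```python
from collections import Counter

# The two logs of the motivating example, as trace -> frequency.
L1 = Counter({
    ("E", "A", "B"): 274,
    ("E", "F", "B"): 375,
    ("I", "G", "J", "H"): 476,
    ("I", "C", "D", "J", "H"): 875,
})
L2 = Counter({
    ("E", "A", "B", "J"): 274,
    ("E", "F", "B", "J"): 300,
    ("E", "X", "B", "J"): 75,
    ("I", "G", "J", "H"): 276,
    ("E", "C", "D", "I", "H"): 675,
    ("E", "G", "H"): 400,
})

# Counter addition sums the frequencies of traces that occur in both logs.
L = L1 + L2
total_traces = sum(L.values())
distinct_variants = len(L)
```

Only the trace ⟨I, G, J, H⟩ occurs in both logs, so the mixed log carries no marker of which variant produced those executions.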

Figure 2. The mined model of log L through inductive mining.

Trace clustering is a common method that can enhance the discovery of process models, and has been widely used in the field of process mining. Here, we apply a trace clustering method with the event frequency coding technique, which yields the two model clusters shown in Figure 3. In order to further evaluate the effectiveness of different methods in identifying similar-but-different process variants, we conduct a series of trace clustering and process mining experiments on log L; the related experimental results are listed in Table 3.

Figure 3. Two clusters deduced from log L through the trace clustering method (Xu & Liu, Citation2019). (a) The first cluster model through the trace clustering method (b) The second cluster model through the trace clustering method.

Table 3. Experimental results comparison among different methods for log $L = L_{1} ⊔ L_{2}$ .


Trace clustering methods clearly enhance process mining approaches (as shown in Table 3), as the fitness and precision of each cluster are higher than those obtained using the process mining method alone. It is noteworthy that these 4 characteristic clustering methods share a common bottleneck: the fitness and precision are both equal to 1 in one cluster, whereas the precision in the other cluster is relatively low. This indicates that characteristic trace clustering methods cannot distinguish the above-mentioned similar-but-different behaviours of process variants. The experimental results in Table 3 also verify that characteristic trace clustering and process variant mining methods differ from each other, especially in the scenario of distinguishing similar-but-different behaviours.

Motivated by this example, this paper proposes a novel approach to discover process variants, named the trace semantic α splitting method. It is assumed that no reference model is known, and the process variants are mined directly from the event log. The intuitive idea of our approach is to extract the benchmark log from the event log by trace compression. Trace compression preserves properties such as the behaviour profile of activities and the trace frequencies, while the obtained benchmark log simplifies the initial log to the greatest extent and thereby reduces the complexity of the corresponding log processing algorithm.

5. Proposed trace semantic α splitting method

5.1. Benchmark log extraction

In order to discover process variants from event log, this section formalises the concept of context log, and uses it as the basis of benchmark log extraction.

Let $L = \{σ_i : i \geq 1\}$ be the available event log set, which is a simplified formalisation of $L = \{CID, σ, \{Lattr\}_{\{1, 2, \dots, m\}}\}$, and let $A = \{A_i : i \geq 1\}$ be the activities set. For a given activity $a \in A$, the sub-log related to the activity a is denoted as $L_{sub}(a) = \{π_{context(a)}(σ_i) ∣ i \geq 1\}$, where $context(a) = \{\{a\} \cup \{x\} ∣ x \in A \land (x, a) \in B_L\}$, and $π_X(σ_i)$ represents the projection of event sequence $σ_i$ on set X.

Definition 5.1

kth-strict order relationship

In the event log $L = \{σ_i : i \geq 1\}$, two activities x and y are in kth-strict order relationship, denoted as $x \to_k y$, if and only if $(x \to y) \land (\exists t = a_1 a_2 \dots a_n \in L : a_i = x \land a_{i+k} = y, 1 \leq i \leq n) \land (\forall t_p = a_1 a_2 \dots a_n \in L : t_p \neq t, a_i = x \land a_{i+l} = y \Rightarrow k < l)$.

The relationship $x \to y$ means that there exists a flow relationship between activity x and activity y, and the kth-strict order relationship has more relaxed preconditions than the strict order relationship. We can therefore choose a reasonable value of k in the $x \to_k y$ relationship to limit the neighbourhood length of the activity x.
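A possible reading of Definition 5.1 can be sketched in a few lines. The sketch checks only the first two conjuncts, namely that x and y are in strict order and that some trace contains y exactly k positions after x; the third conjunct is not enforced, because this simpler reading is the one that reproduces the neighbour sets listed for activity z in the example after Definition 5.2. The function names are hypothetical, and the toy input is the context log $L_z^2$ from that example without its frequencies.

```python
from itertools import combinations


def gaps(log):
    """Map each ordered pair (x, y) to the set of distances j - i seen with x at i, y at j."""
    found = {}
    for trace in log:
        for i, j in combinations(range(len(trace)), 2):  # always i < j
            found.setdefault((trace[i], trace[j]), set()).add(j - i)
    return found


def kth_strict_order(g, x, y, k):
    """True if x -> y (strict order) and y occurs exactly k positions after x somewhere."""
    strict = (x, y) in g and (y, x) not in g
    return strict and k in g[(x, y)]


def neighbours(log, a, k):
    """Activities in k-th strict order with a, in either direction."""
    g = gaps(log)
    others = {b for trace in log for b in trace} - {a}
    return sorted(b for b in others
                  if kth_strict_order(g, b, a, k) or kth_strict_order(g, a, b, k))


# Toy input: the four distinct traces of L_z^2.
log = [tuple("bcizde"), tuple("cbizdh"), tuple("cibzlm"), tuple("ibczlm")]
first = neighbours(log, "z", 1)
second = neighbours(log, "z", 2)
```

Under this reading the same activity can be a neighbour at several distances, which is why b, c and i appear for both k = 1 and k = 2.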

Definition 5.2

Context Log

Let $A_a^k$ be the kth-context alphabet corresponding to the activity a, and $L_a^k$ be the context log of the activity a, $L_a^k = \{t_i ∣ t_i \in \Pr(L, A_a^k)\}$, where:

$A_a^k = \{a\} \cup \{a_i : a_i \in A\}$, where each $a_i$ satisfies $(a \to_l a_i \lor a_i \to_l a, 1 \leq l \leq k) \lor (a ∥ a_i) \lor (a \leftarrow^{-1} a_i)$;
$\Pr(L, A)$ is a mapping function extended from $π_A(σ_i)$, which represents the projection of event log L on activity set A, $\Pr(L, A_a^k) = \{π_{A_a^k}(σ_i) ∣ σ_i \in L \land 1 \leq i \leq |L|\}$.

The event log in Table 4 is used to illustrate the concept of context log, with activity z selected as an example. According to the kth-strict order relationship, activities b, c, i, d, l and activity z are in 1st-strict order relationship, and activities b, c, e, h, m, i and activity z are in 2nd-strict order relationship. Therefore, according to Definition 5.2, $b \to z$ and $z \to_2 m$ can be obtained. If the length is limited to 2, the 2nd-context alphabet corresponding to activity z is $A_z^2 = \{z, b, c, d, e, l, h, i, m\}$, and the context log is $L_z^2 = \{(b c i z d e)^5, (c b i z d h)^3, (c i b z l m)^2, (i b c z l m)^2\}$.
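The projection $\Pr(L, A_a^k)$ used in this example can be sketched as follows. The alphabet is taken from the text, whereas the input log is hypothetical: it was made up so that two extra activities, x and y, fall outside the alphabet and are dropped by the projection. The function names are likewise hypothetical.

```python
from collections import Counter


def project(trace, alphabet):
    """pi_A(sigma): keep only the activities of the trace that belong to the alphabet."""
    return tuple(a for a in trace if a in alphabet)


def context_log(log, alphabet):
    """Pr(L, A): project every trace and add up the frequencies of equal projections."""
    result = Counter()
    for trace, freq in log.items():
        result[project(trace, alphabet)] += freq
    return result


# 2nd-context alphabet of activity z, as given in the text.
A_z2 = set("zbcdelhim")

# Hypothetical original log (trace -> frequency); 'x' and 'y' are outside A_z2.
log = Counter({
    tuple("xbcizde"): 5,
    tuple("cbizdhy"): 3,
    tuple("cibzlm"): 2,
    tuple("ibxczlm"): 2,
})
L_z2 = context_log(log, A_z2)
```

With this input the projection returns the four traces and frequencies of $L_z^2$ listed above.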

Table 4. An example of event log L to illustrate context log.

Download CSV Display Table

Since there may be more than one context log for a given activity, the concept of weighted frequency cosine similarity is proposed on the basis of cosine similarity; it helps to select the most suitable context log as the benchmark log.

Definition 5.3

Benchmark log

Let $L = \{σ_i : i \geq 1\}$ be the available event log set, $A = \{A_i : i \geq 1\}$ be the activities set, $A_a^k$ be the kth-context alphabet corresponding to the activity a, and $L_a^k$ be the context log of the activity a, $L_a^k = \{t_i ∣ t_i \in \Pr(L, A_a^k)\}$. The benchmark log is defined as a set of $L_a^k$, denoted as $BenchL = \{L_a^k ∣ a \in A \land k \geq 1\}$.

Definition 5.4

Weighted frequency cosine similarity

Let $X (x_{j}^{\to}, x_{j}^{∥}, x_{j}^{\leftarrow^{-1}})$ and $Y (y_{j}^{\to}, y_{j}^{∥}, y_{j}^{\leftarrow^{-1}})$ be two three-dimensional vectors, where $x_{j}^{α}$ ( $α \in \{\to, ∥, \leftarrow^{-1}\}$ ) is the frequency of each trace in the context log with the α relationship, $y_{j}^{α}$ is the frequency of each trace in the original event log with the α relationship, $ω_{j}^{α}$ is the weight assigned to the α relationship, and $p_{i}$ is the percentage of each trace frequency in the event log L. Then the weighted frequency cosine similarity between X and Y is denoted as $COS (X, Y)$ :

$COS (X, Y) = \sum_{i = 1}^{n} p_{i} \cdot \frac{\sum_{j = 1}^{3} x_{j}^{α} y_{j}^{α} (ω_{j}^{α})^{2}}{\sqrt{\sum_{j = 1}^{3} (x_{j}^{α})^{2} \cdot (ω_{j}^{α})^{2}} \cdot \sqrt{\sum_{j = 1}^{3} (y_{j}^{α})^{2} \cdot (ω_{j}^{α})^{2}}}$ (1)

If $COS (X, Y) = 1$ , then X and Y match exactly; conversely, if $COS (X, Y) = 0$ , then X and Y do not match at all. The closer the weighted frequency cosine similarity is to 1, the higher the matching degree. In this paper, $COS (X, Y)$ is therefore used as the metric to select the benchmark log.
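As an illustration of how Equation (1) might be evaluated, the following Python sketch computes a weighted cosine similarity per trace and sums the results scaled by the trace shares $p_{i}$. It rests on one reading of the formula, namely that every trace contributes a pair of three-dimensional relation-frequency vectors and that the weights enter squared. The function names, the example vectors and the shares are hypothetical; the weights 0.45, 0.30 and 0.25 are the ones used later in the case study.

```python
import math

def weighted_cosine(x, y, weights):
    """Weighted cosine similarity of two equally long vectors."""
    num = sum(w * w * xi * yi for xi, yi, w in zip(x, y, weights))
    norm_x = math.sqrt(sum((w * xi) ** 2 for xi, w in zip(x, weights)))
    norm_y = math.sqrt(sum((w * yi) ** 2 for yi, w in zip(y, weights)))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return num / (norm_x * norm_y)

def weighted_frequency_cosine(xs, ys, shares, weights):
    """Sum the per-trace weighted cosine similarities, scaled by the trace shares p_i."""
    return sum(p * weighted_cosine(x, y, weights) for x, y, p in zip(xs, ys, shares))

weights = (0.45, 0.30, 0.25)   # strict order, interleaving, strict inverse order
xs = [(4, 1, 0), (2, 2, 1)]    # hypothetical relation counts in a context log
ys = [(8, 2, 0), (2, 0, 3)]    # hypothetical relation counts in the original log
shares = [0.6, 0.4]            # hypothetical trace shares p_i, summing to 1
score = weighted_frequency_cosine(xs, ys, shares, weights)
print(round(score, 3))
```

Because the shares sum to 1 and each per-trace similarity lies between 0 and 1 for non-negative frequencies, the overall score stays in the same range, which matches the interpretation of the two extreme values given above.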

5.2. Context tree construction based on FP tree

Generally, different process variants share a set of common activities, and the variants are usually realised through individual orchestration and configuration of these common activities. Therefore, this subsection starts from the common activities of traces and defines the trace context and the context tree of an event log, to illustrate the semantic contexts of activities and traces.

Definition 5.5

Trace context

Let $L = \{σ_{i} : i \geq 1\}$ be the available event log set, $σ_{i}$ be a trace, and LCP be the longest common prefix of the traces in log L. SP is called the context of the trace $σ_{i}$ if and only if $SP = \{d \in 2^{σ_{i}} \mid σ_{i} = LCP \,|\, d\}$ , where the symbol “|” in $LCP \,|\, d$ represents the concatenation operator.
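The split of a trace into the longest common prefix and its context can be sketched as follows. This is a minimal Python illustration, assuming traces are strings of single-letter activity labels; the helper names and the three example traces are hypothetical.

```python
def longest_common_prefix(traces):
    """Return the longest prefix shared by every trace in `traces`."""
    if not traces:
        return ""
    shortest = min(traces, key=len)
    for idx, act in enumerate(shortest):
        if any(trace[idx] != act for trace in traces):
            return shortest[:idx]
    return shortest

def trace_contexts(traces):
    """Split each trace so that trace == LCP + context."""
    lcp = longest_common_prefix(traces)
    return lcp, {trace: trace[len(lcp):] for trace in traces}

# Hypothetical traces over single-letter activity labels.
lcp, contexts = trace_contexts(["abcdhw", "abdchw", "abhw"])
print(lcp, contexts)
```

Each trace is scanned at most once up to the length of the shortest trace, so the split is linear in the total size of the log.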

As the activity common prefix can be represented by a prefix tree, in order to effectively identify the semantic context, a novel prefix tree structure named context tree is introduced here on the basis of the frequent pattern tree (Definition 2.6).

Definition 5.6

Context Tree

A triple $CT = (Tr^{'}, P^{'}, H^{'})$ that fulfils the following conditions is called a context tree, where:

$Tr^{'}$ is the root node of the context tree;
$P^{'}$ is the context prefix subtree, in which each node has the form $t^{'} = (tname, count, nodelink)$ . Here, tname represents the activity name of the node, count represents the number of subpaths from the root to it, and nodelink represents the next node with the same identifier tname in the prefix subtree (if there is none, the next node is recorded as null);
$H^{'} = (iname, hnodelink)$ is the context header table, where iname represents the activity name and hnodelink is a pointer to the first node with that activity name in the prefix subtree.

The context tree corresponding to the event log in Table is shown in Figure .

Figure 4. The context tree corresponding to the event logs in Table .

It can be concluded that each trace in the event log is represented as a branch of the context tree (as shown in Figure ). The context tree has a top-down layout, and traces with the same prefix share a branch starting from the root node. In addition, the context header table speeds up retrieval during the dynamic construction and querying of the tree.
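A context tree in the sense of Definition 5.6 can be sketched as an FP-tree-style prefix tree. The Python code below is a simplified illustration rather than the paper's Algorithm 2: the class names, the `parent` and `children` attributes and the example log are assumptions, and `count` is interpreted here as the number of trace occurrences whose path passes through a node.

```python
class Node:
    def __init__(self, tname, parent=None):
        self.tname = tname      # activity name of the node
        self.count = 0          # trace occurrences passing through this node
        self.nodelink = None    # next node carrying the same activity name
        self.parent = parent    # helper attribute, not part of the definition
        self.children = {}      # helper attribute: activity name -> child node

class ContextTree:
    def __init__(self):
        self.root = Node(None)
        self.header = {}        # iname -> first node with that activity name
        self._last = {}         # iname -> last node, used to extend the link chain

    def insert(self, trace, freq=1):
        """Add one trace (a sequence of activity names) with frequency `freq`."""
        node = self.root
        for act in trace:
            child = node.children.get(act)
            if child is None:
                child = Node(act, node)
                node.children[act] = child
                if act in self._last:
                    self._last[act].nodelink = child
                else:
                    self.header[act] = child
                self._last[act] = child
            child.count += freq
            node = child

    def nodes(self, iname):
        """Follow the header table and the nodelinks for activity `iname`."""
        node = self.header.get(iname)
        while node is not None:
            yield node
            node = node.nodelink

# Hypothetical log: trace -> frequency.
tree = ContextTree()
for trace, freq in {"abcdhw": 2, "abdchw": 1, "abhw": 3}.items():
    tree.insert(trace, freq)
print([n.count for n in tree.nodes("h")])
```

Traces that share a prefix reuse the same nodes, so the shared prefix `ab` is stored once, and the header table gives direct access to every occurrence of an activity without traversing the whole tree.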

5.3. Selection of activity node and neighbourhood length

Depending on the choice of activity node and neighbourhood length, the context logs extracted from the same event log are likely to differ, which results in different benchmark logs. If $A_{a}^{k}$ were calculated for every activity node in the activity set, the number of generated context logs would be huge, making the selection of the benchmark log computationally expensive.

To reduce this computational cost, the activity node and its neighbourhood length should be selected carefully.

Firstly, it is reasonable to narrow the nodes in the activity set to a controllable range. Activity nodes are selected mainly according to their frequency, i.e. the number of times they occur in the event log, and are sorted by frequency in descending order. The average frequency of the activity nodes is then calculated, and the nodes whose frequency is below the average are excluded, which reduces the selection range. The remaining activity nodes can be further divided into clusters based on frequency, and a representative activity node is selected from each cluster for the next operation.
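The frequency-based narrowing step can be sketched as below. This is a minimal Python illustration under the assumption that the log maps traces (strings of single-letter activities) to frequencies; the function name and the example log are hypothetical, and the subsequent clustering and choice of representatives are left out because the text above does not fix a particular clustering procedure.

```python
from collections import Counter

def frequent_activities(log):
    """Return the average activity frequency and the activities at or above it.

    `log` maps each trace to its frequency; the kept activities are returned
    as (activity, frequency) pairs, most frequent first.
    """
    freq = Counter()
    for trace, count in log.items():
        for act in trace:
            freq[act] += count
    if not freq:
        return 0.0, []
    avg = sum(freq.values()) / len(freq)
    kept = [(act, n) for act, n in freq.most_common() if n >= avg]
    return avg, kept

log = {"abcdhw": 2, "abdchw": 1, "abhw": 3}   # hypothetical log
avg, kept = frequent_activities(log)
print(avg, kept)
```

In this example, activities c and d occur 3 times each and fall below the average of 5, so only a, b, h and w remain as candidate nodes.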

Secondly, it is also crucial to choose the length of the neighbourhood in order to select the appropriate context log. For the selected activity node a, the corresponding kth-context alphabet $A_{a}^{k}$ is determined; that is, the alphabet contains the context activities of the activity node.

In summary, for the selected activity node a, an important step is determining the value of k in $A_{a}^{k}$ , i.e. the neighbourhood length.

5.4. Algorithms and complexity analysis

Here, we propose a semantic α trace splitting method to discover process variants directly from event logs. Three algorithms (Algorithms 1–3) are formalised to illustrate the detailed procedures of this method.

5.4.1. Algorithms

Algorithm 1 extracts the benchmark log from the initial event log. Four parameters must be given as constants in advance: the threshold ϕ and the three weights $ω_{j}^{α}$ ( $α \in \{\to, ∥, \leftarrow^{-1}\}$ ). The threshold ϕ is user defined, and a value between 0.6 and 0.8 is suggested. Algorithm 2 constructs the context tree on the basis of the benchmark logs, and Algorithm 3 uses the benchmark logs and context trees as input to discover process variants. It is noteworthy that the final mined process variants are depicted in the form of Petri nets.

5.4.2. Complexity analysis

Given an event log L, let n be the number of activities contained in the log and m be the number of traces. L is used as input to analyse the complexity of each algorithm.

The core of Algorithm 1 is to extract the benchmark log: calculating the average frequency avg of all activities in the event log, deleting the activity nodes whose frequencies are lower than avg from the alphabet, and classifying the remaining activities into clusters. After these operations, $p \times k$ benchmark logs are obtained, so the time complexity of extracting the benchmark log is $O (p k)$ . The core of Algorithm 2 is to construct the context tree; assuming that the number of activities in the longest common prefix is z and the number of activities in the remaining sequence is q, the corresponding time complexity is $O (m (z + q))$ . The core of Algorithm 3 is mining process variants. Suppose that there are x clusters and the traces in the benchmark log are added to the clusters using the context tree; the time complexity is then $O (m x)$ . Additionally, the complexity of mapping the clusters of the benchmark log to their counterparts in the original event log is $O (x)$ . Therefore, the total time complexity of mining process variants from the event log is $O (p k + m (z + q) + m x + x)$ , which is of the same order as $O (n^{2})$ .

6. Experiment and evaluation

In this section, we apply the proposed method to three kinds of real, complex industrial logs. Although the method is designed to distinguish similar-but-different behaviours, such as those in a bank's credit authorisation system (BPIC 2012) (Bautista et al., 2012), it can also provide insights and unveil some deficiencies of existing methods. In particular, the proposed process variant mining method can remedy the deficiency of trace clustering in variant mining, especially when different variants have imbalanced distributions. The first part of this section describes the variant mining procedure in the credit authorisation system (BPIC 2012), the second part provides a comparative analysis between this work and existing methods, and the third part provides an in-depth discussion of the findings and potential limitations.

6.1. Case study

In practice, the handling of bank credit authorisation may follow different policies for different types of lenders. Based on these policies, event logs are generated during the execution of the credit business process from BPIC 2012, which is taken as a case study to validate the effectiveness of the proposed method. By using the credit model alignment technology (Borah & Nath, 2018) for change operations, the operational logs were extracted from the CPN Tools platform, and the extracted event logs are shown in Table . There are 23 activities in the event log, and the activity label table is $N = \{a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w\}$ . The meaning of each label in Table is as follows: a (accepting credit applications), b (starting to process credit applications), c (registering customer information), d (checking customer credit), e (contacting the bank), f (checking company funds), g (check interest), h (stop inspection), i (verify residential area type), j (check property information), k (request the mortgage insurer), l (verify the information is qualified), m (approving loan type), n (approval request), o (check income), p (archive request), q (contact customer), r (check documents), s (verify the loan amount), t (verify the system funds), u (end the verification), v (end the inspection phase), w (the loan is successful).

Table 5. Event logs of credit disposal processes.

In Algorithm 1, lines 1–2 are executed to obtain the frequency of each activity in the activity label table N; the frequencies are listed in Table in descending order. The average frequency avg of these activities is calculated as 166.87 and is used as the threshold value.

Table 6. Frequency of activities.

Every activity node whose frequency is below avg is filtered out according to Algorithm 1, in order to reduce the number of activity nodes in the activity alphabet N. The resulting filtered activity alphabet is $N^{'} = \{a, b, w, c, d, e, h, n, o, p\}$ . Line 5 of Algorithm 1 is executed to classify the nodes in $N^{'}$ into two clusters according to their frequencies, i.e. 350 and 200, respectively, and line 7 is executed to randomly select a representative node from each cluster, namely b and h, which are used in the subsequent operations.

After that, lines 10–11 of Algorithm 1 are executed to calculate the context activities corresponding to the kth-strict order relationship of the representative nodes b and h for each length k; the resulting activity nodes $A_{i}^{k}$ are listed in Table .

Table 7. kth-strict order relationships of nodes b and h.

In order to avoid underfitting due to a small number of activity nodes, the number of activity nodes is kept within a certain range by the condition in line 12 of Algorithm 1. Lines 13–14 of Algorithm 1 keep the number of activities contained in the activity relationship table within a threshold range, which is a user-defined parameter (in this paper, the threshold range is set to 60%–80%). Therefore, any length l below or above k is not taken into consideration, which keeps the context alphabets at a reasonable complexity. As there are 23 activities in total in the activity alphabet, the number of activity nodes for a length k ranges from 13.8 to 18.4 (i.e. from $0.6 \times 23$ to $0.8 \times 23$ ); this range is used to obtain $A_{i}^{k^{'}}$ , where i = b or h.

From Tables and , it can be inferred that the neighbourhood length selected for node b is k = 5, and the lengths selected for node h can be k = 5, k = 6 or k = 7 according to the range of 13.8–18.4. However, since the activity alphabets corresponding to lengths k = 6 and k = 7 of node h are the same, k is set to 6 for node h.
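The range check on the neighbourhood length can be sketched as follows. This Python snippet is a minimal illustration: the function name and the example alphabets are hypothetical, and skipping a length whose alphabet repeats that of a smaller length is one possible reading of the treatment of k = 6 and k = 7 above. How the final length is chosen among the remaining candidates is not covered by the sketch.

```python
def admissible_lengths(alphabets, n_activities, low=0.6, high=0.8):
    """Return the lengths k whose context alphabet size lies in [low*n, high*n].

    `alphabets` maps each length k to the context alphabet (a set) obtained
    for it. A length whose alphabet equals that of a smaller admissible
    length is skipped, since it adds no new context activities.
    """
    lower, upper = low * n_activities, high * n_activities
    seen, result = [], []
    for k in sorted(alphabets):
        alphabet = alphabets[k]
        if lower <= len(alphabet) <= upper and alphabet not in seen:
            seen.append(alphabet)
            result.append(k)
    return result

labels = "abcdefghijklmnopqrstuvw"          # 23 activity labels
alphabets = {                                # hypothetical alphabet sizes 12, 14, 16, 16
    4: set(labels[:12]),
    5: set(labels[:14]),
    6: set(labels[:16]),
    7: set(labels[:16]),
}
print(admissible_lengths(alphabets, 23))
```

With 23 activities the bounds are 13.8 and 18.4, so only alphabets with 14 to 18 activities pass the check.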

Table 8. Activity alphabet with node b of length k.

Table 9. Activity alphabet with node h of length k.

The context logs of nodes b and h are extracted from the event log by mapping L to $A_{i}^{k^{'}}$ in line 17. The details are shown in Table , where each row in the table represents a trace in the context log.

Table 10. Context logs for nodes b and h.

Hereafter, the context logs in Table are used as the basis for calculating Equation (1) (lines 18–20). The weights of the three behavioural relations are set as follows: the weight of the strict order relation is set to 45%, the weight of the interleaving order relation is set to 30%, and the weight of the strict inverse order relation is set to 25%. The calculation results are listed in Table .

Table 11. Weighted frequency cosine similarity calculation results.

As indicated in Table , $L_{h}^{6}$ is selected as the benchmark log. Based on it, line 3 of Algorithm 2 is executed to split each trace in the benchmark log $L_{h}^{6}$ into two parts: the longest common prefix of the trace and the remaining activity sequence. Taking the first trace in the benchmark log $L_{h}^{6}$ as an illustration, ab is the longest common prefix and the remaining activity sequence is $d_{1} = ⟨ c d e h i j l k m n o p w ⟩$ . After that, Algorithm 2 is executed to update the context tree with new activities iteratively, and the final context tree is constructed as shown in Figure .

Figure 5. Context tree of $L_{h}^{6}$ .

In Algorithm 3, each trace in the benchmark log initially forms its own cluster. A trace distance measurement $distance (bt_{i}, bt_{j}) = 1 - | π_{A} (bt_{i}) \cap π_{A} (bt_{j}) | / | A |$ is then used to merge the two nearest traces into the same cluster iteratively, until the number of clusters is less than or equal to a value set a priori. After the traces in the benchmark log have been clustered, the counterpart of each benchmark trace is identified in the original event log. Based on the context tree in Figure , the benchmark log is clustered into 4 clusters, i.e. $L_{h_{1}}^{6} = \{⟨ a b c d e h i j l k m n o p w ⟩^{47}, ⟨ a b d c e h i j k l m n o q w ⟩^{28}, ⟨ a b c d e h i j l k m n o p ⟩^{25}\}$ , $L_{h_{2}}^{6} = \{⟨ a b h w ⟩^{50}\}$ , $L_{h_{3}}^{6} = \{⟨ a b c d f h n w ⟩^{48}, ⟨ a b d c f h n w ⟩^{52}\}$ , $L_{h_{4}}^{6} = \{⟨ a b h o q w ⟩^{48}, ⟨ a b h o p w ⟩^{52}\}$ . The last 3 lines of Algorithm 3 are executed to map the benchmark log to the original event log, and the final process variant models are mined in the form of Petri nets, as shown in Figure .
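The distance measurement and the iterative merging can be sketched in Python as below. This is a simplified illustration and not Algorithm 3 itself: it ignores the context tree and trace frequencies, treats $π_{A} (bt_{i}) \cap π_{A} (bt_{j})$ as the set of activities shared by two traces, and uses the minimum pairwise distance between clusters (single linkage), all of which are assumptions. The traces are those of the benchmark log listed above; the sketch is not expected to reproduce the four clusters reported there.

```python
from itertools import combinations

def distance(t1, t2, alphabet):
    """1 - |shared activities| / |alphabet|, following the distance in the text."""
    return 1 - len(set(t1) & set(t2) & alphabet) / len(alphabet)

def cluster_distance(c1, c2, alphabet):
    """Single linkage: the smallest distance between any two member traces."""
    return min(distance(a, b, alphabet) for a in c1 for b in c2)

def cluster_traces(traces, alphabet, n_clusters):
    """Merge the two nearest clusters until at most `n_clusters` remain."""
    clusters = [[t] for t in traces]
    while len(clusters) > max(n_clusters, 1):
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(
                clusters[pair[0]], clusters[pair[1]], alphabet
            ),
        )
        clusters[i].extend(clusters.pop(j))  # i < j, so index i stays valid
    return clusters

traces = [
    "abcdehijlkmnopw", "abdcehijklmnoqw", "abcdehijlkmnop", "abhw",
    "abcdfhnw", "abdcfhnw", "abhoqw", "abhopw",
]
alphabet = set("".join(traces))
clusters = cluster_traces(traces, alphabet, 4)
print(clusters)
```

Since the distance only counts shared activities, two traces that contain the same activities in a different order have the same distance to every other trace, which is one reason the method described above complements it with the order information in the context tree.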

Figure 6. Process variants (a), (b), (c), (d) found in the event log. (a) Housing loan process (b) Student loan process (c) Commercial loan process (d) Small loan process.

As each activity in Figure has a specific meaning, the credit processes can be interpreted as housing loan processes, student loan processes, commercial loan processes, and microloan processes. These 4 process variants clearly share many commonalities while differing to a certain degree. Distinguishing the behaviours of these similar-but-different process variants can enhance process mining and facilitate subsequent management by organisational managers.

6.2. Comparative analysis

In order to validate the feasibility of our approach, we evaluated two kinds of real-life event logs, and compared this work with the activity recommendation approach (Chan et al., 2014) and the trace clustering method (Xu & Liu, 2019) using CNN coding (Pan et al., 2020).

The activity recommendation approach (Chan et al., 2014) develops process variants by using event logs to recommend activities in the process model, provided that the process model is known. The main deficiencies of this method are as follows: (1) it requires an a priori known process model; (2) it has a large computational complexity. By comparison, the trace semantic α splitting method proposed in this paper overcomes these two problems. Table gives an illustrative comparison between the two methods, where $n_{A}$ represents the number of activities, $n_{P}$ represents the number of business process variants, n represents the maximum number of public activities located on a layer, and k represents the number of layers considered.

Table 12. Comparison of process variants discovery methods.

To validate the event complexity calculated by the method in this paper, the dataset used for the experiments in this section consists of two parts: the event logs of the bank credit process and those of the library book checkout and return process. The data can be obtained from the website.² The main indices of the event logs are shown in Table .

Table 13. Holistic information of datasets.

For an effective comparison, we calculate: (1) the average number of configuration steps required to derive the process variants; (2) the proportion of traces in the log that the process variants can completely replay; (3) the accuracy of the model. The results are shown in Table .

Table 14. Performance indicator comparisons with Chan et al. (2014).

Furthermore, we conduct another comparison experiment with the trace clustering method using CNN, based on the same two datasets. The experimental results are listed in Table , and graphical comparisons are shown in Figure .

Table 15. Performance indicator comparisons with Xu and Liu (2019).

Figure 7. Fitness and precision comparisons based on dataset1 and dataset2. (a) Performance comparisons with Chan et al. (2014) (b) Performance comparisons with Xu and Liu (2019).

From Tables and , the following conclusions can be drawn.

Compared with the work of Chan et al. (2014), the average number of configuration steps required to derive the process variants is similar for both methods, although our method is slightly better than the activity recommendation method. Furthermore, the differences between our work and Chan et al. (2014) are statistically significant in fitness and accuracy, as the p-values of the Wilcoxon test are equal to 0.0017 for both indicators. Our proposed method is therefore superior to the activity recommendation method.
Compared with the work of Xu and Liu (2019), the differences in fitness and accuracy are not statistically significant, as the p-value of the Wilcoxon test is 0.27 for fitness and 0.74 for accuracy, both greater than the significance level (0.05). However, the mean fitness and the mean accuracy of our method are both higher than those in Xu and Liu (2019).

In order to further discuss the differences between our work and the trace clustering method in Xu and Liu (2019), the next subsection presents two more experiments on event logs with different variant distributions.

6.3. Further experimental discussions

In some scenarios, the distribution of variants may be extremely non-uniform, as in BPIC2015. In the event logs of BPIC2015, most trace instances occur only once, so the number of variants is very close to the number of traces. To evaluate the validity of our method on this kind of event log, we conduct two comparative experiments. The first is based on a set of small-scale handmade event logs, where each trace instance occurs only once; the second is based on the event logs of BPIC2015.

As shown in Section 4 (Motivation), two sets of event logs are used to illustrate the novelty of this work. Here, we convert these two sets of event logs into new ones in which each trace occurs only once, i.e. $L_{1}^{'} = \{⟨ E, A, B ⟩^{1}, ⟨ E, F, B ⟩^{1}, ⟨ I, G, J, H ⟩^{1}, ⟨ I, C, D, J, H ⟩^{1}\}$ and $L_{2}^{'} = \{⟨ E, A, B, J ⟩^{1}, ⟨ E, F, B, J ⟩^{1}, ⟨ E, X, B, J ⟩^{1}, ⟨ I, G, J, H ⟩^{1}, ⟨ E, C, D, I, H ⟩^{1}, ⟨ E, G, H ⟩^{1}\}$ . Then, based on the event log $L^{'} = L_{1}^{'} ⊔ L_{2}^{'}$ , the experimental results are listed in Table .

Table 16. Comparisons among different methods for log $L^{'} = L_{1}^{'} ⊔ L_{2}^{'}$ .

Similarly, we conduct another experiment on the BPIC 2015 dataset. There are 5 sub-logs in BPIC2015, named BPIC2015-1, BPIC2015-2, BPIC2015-3, BPIC2015-4 and BPIC2015-5, and BPIC2015 denotes the union of the 5 sub-logs. Taking BPIC2015-1 as an example, it contains 1199 trace instances, and the frequency of each trace is equal to 1. We apply the inductive mining method, the trace clustering method and the process variant discovery method proposed in this work to BPIC2015, and the results are shown in Table .

Table 17. Comparisons among different methods for BPIC 2015 log.

Figure gives a comprehensive graphical comparison for the two kinds of event logs mentioned above. From Tables – and Figure , it can be seen that almost all mining methods reach the same fitness level, including inductive mining, trace clustering by SOM, trace clustering by K-means, trace clustering by agglomerative clustering, and our proposed method; the exception is heuristics mining. Compared with fitness, however, precision is a more discriminating indicator. From Tables and and Figure , we can see that our method has a higher precision level in all sub-logs of BPIC 2015 and in the artificial log $L^{'}$ . The data in Tables and and Figure thus provide evidence that the proposed method is superior to the characteristic trace clustering method in tackling event logs with imbalanced variant distributions. The reason is that the proposed method not only considers the frequency information of traces but also emphasises the activity relationships of the log, so it can effectively capture the behavioural differences of different variants.

Figure 8. Fitness and precision comparisons for event logs with different variant distributions. (a) fitness of $L^{'}$ (b) precision of $L^{'}$ (c) fitness of BPIC2015 (d) precision of BPIC2015.

Admittedly, as the proposed method requires additional procedures for activity context calculation within the k neighbourhood length, its running time is longer than that of the characteristic clustering methods. Taking the BPIC2015 event log as an example, the characteristic trace clustering method using CNN takes 3.97 seconds, while our method takes 16.87 seconds. Based on the three kinds of medium-scale datasets mentioned above, a detailed execution time comparison is depicted in Figure .

Figure 9. Execution time of different methods.

7. Conclusions and future work

Process variants are a set of models or execution logs that have a high degree of similarity, yet the behaviour of each process variant differs from the others. When only event logs are available, mining process variants remains an open and difficult problem. At present, the latest trace clustering methods can significantly improve process mining in terms of fitness and precision. However, for process variant mining, state-of-the-art trace clustering methods cannot work effectively because of the inherently high similarity of variants. To the best of our knowledge, only a few studies aim at discovering process variants directly from the perspective of configurable process mining without relying on any a priori process model.

In this work, we propose a semantic α splitting method based on the activity context of the event log, to effectively discover process variants directly from event logs and obtain the process variant clusters. Unlike previous work based on configuration operations, the proposed method combines the advantages of configurable process mining and trace clustering methods, and presents a trace similarity measurement incorporating the behavioural profiles of the event log. The main innovation of the proposed method is that it extracts the trace semantics in the form of a trace context tree, where the short and long dependencies of activities are expressed as kth-strict order relationships. The kth-strict order relationship is a kind of behavioural profile; it helps to simplify the event logs and hence reduces the computational complexity as far as possible. The paper achieves two main objectives:

A framework for semantic process variant mining is constructed, which depicts the semantics of the event log as a context tree and simplifies the event log by using the activity alphabet within the k neighbourhood length. This approach operates on traces directly without converting them into any other form, and shortens the traces to be handled in the event log.
A process variant discovery method is presented, which can effectively discover the process cohorts. The mined process variants are hard to identify because of their inherently high similarity. We conduct a series of experiments based on real-life datasets, and compare the proposed work with methods from configurable process mining and trace clustering. The experiments and discussions demonstrate that the proposed method works effectively with high fitness and precision; above all, it can discover variants that cannot be mined through the characteristic trace clustering method.

Admittedly, while the proposed method has the above-mentioned advantages, it also has some shortcomings. For example, it has a longer execution time than characteristic trace clustering methods, although this execution time is still within an acceptable range. As future work, we will focus on improving the performance of configurable process mining algorithms, and build a development environment as a plug-in component in ProM (Process Mining Framework). Exploring comprehensive configurable process mining methods that involve multi-perspective event attributes, such as resources and organisations, is another key direction for future research.

Acknowledgments

We gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China [grant number 61902002].

Notes

1 https://pm4py.fit.fraunhofer.de/

2 https://github.com/aliuisbigger/python-trcae-clustering.git

References

Agostinelli, S., Chiariello, F., Maggi, F. M., Marrella, A., & Patrizi, F. (2023). Process mining meets model learning: Discovering deterministic finite state automata from event logs for business process analysis. Information Systems, 114, 102180. https://doi.org/10.1016/j.is.2023.102180
Bautista, A. D., Wangikar, L., & Akbar, S. M. K. (2012). Process mining-driven optimization of a consumer loan approvals process. In BPM 2012: Business process management workshops (pp. 219–220). BPI Challenge.
Bettina, F., Sergio, F., Filippo, F., & Luigi, P. (2022). Process mining meets argumentation: Explainable interpretations of low-level event logs via abstract argumentation. Information Systems, 107, 101987. https://doi.org/10.1016/j.is.2022.101987
Bolt, A., van der Aalst, W. M., & Leoni, M. D. (2017, October 23–27). Finding process variants in event logs: (Short Paper). In On the move to meaningful Internet systems. OTM 2017 conferences: Confederated International Conferences: CoopIS, C&TC, and ODBASE 2017, Rhodes, Greece, Proceedings, Part I (pp. 45–52). Springer International Publishing.
Borah, A., & Nath, B. (2018). Fp-tree and its variants: Towards solving the pattern mining challenges. In Proceedings of first international conference on smart system, innovations and computing, Singapore (pp. 535–543). Springer.
Chan, N. N., Yongsiriwit, K., Gaaloul, W., & Mendling, J. (2014). Mining event logs to assist the development of executable process variants. In International conference on advanced information systems engineering (pp. 548–563). Springer International Publishing.
De Koninck, P., Nelissen, K., Baesens, B., Snoeck, M., & De Weerdt, J. (2021). Expert-driven trace clustering with instance-level constraints. Knowledge and Information Systems, 63(5), 1197–1220. https://doi.org/10.1007/s10115-021-01548-6
De Leoni, M., van der Aalst, W. M., & Dees, M. (2016). A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs. Information Systems, 56, 235–257. https://doi.org/10.1016/j.is.2015.07.003
Delias, P., Doumpos, M., Grigoroudis, E., Manolitzas, P., & Matsatsinis, N. (2015). Supporting healthcare management decisions via robust clustering of event logs. Knowledge-Based Systems, 84, 203–213. https://doi.org/10.1016/j.knosys.2015.04.012
Delias, P., Doumpos, M., Grigoroudis, E., & Matsatsinis, N. (2023). Improving the non-compensatory trace-clustering decision process. International Transactions in Operational Research, 30(3), 1387–1406. https://doi.org/10.1111/itor.v30.3
Döhring, M., Reijers, H. A., & Smirnov, S. (2014). Configuration vs. adaptation for business process variant maintenance: An empirical study. Information Systems, 39, 108–133. https://doi.org/10.1016/j.is.2013.06.002
Fang, H., Jin, P. P., Fang, X. W., & Wang, L. L. (2020). Process variants cluster mining method based on causal behavioral profiles. Computer Integrated Manufacturing System, 26(6), 1538–1547. https://doi.org/10.13196/j.cims.2020.06.010
Folino, F., Guarascio, M., & Pontieri, L. (2015). Mining multi-variant process models from low-level logs. In International conference on business information systems (pp. 165–177). Springer International Publishing.
Hasankiyadeh, A. P., Kahani, M., Bagheri, E., & Asadi, M. (2014). Mining common morphological fragments from process event logs. In Proceedings of 24th annual international conference on computer science and software engineering (pp. 179–191). IBM Corp.
Hmami, A., Sbai, H., & Fredj, M. (2021). Enhancing change mining from a collection of event logs: Merging and filtering approaches. Journal of Physics: Conference Series, 1743(1), 012020. https://doi.org/10.1088/1742-6596/1743/1/012020
Khannat, A., Sbai, H., & Kjiri, L. (2021). Configurable process mining: Semantic variability in event logs. In ICEIS (pp. 768–775). SCITEPRESS.
Li, C., Reichert, M., & Wombacher, A. (2011). Mining business process variants: Challenges, scenarios, algorithms. Data & Knowledge Engineering, 70(5), 409–434. https://doi.org/10.1016/j.datak.2011.01.005
Lopez-Martinez-Carrasco, A., Juarez, J. M., Campos, M., & Canovas-Segura, B. (2021). A methodology based on trace-based clustering for patient phenotyping. Knowledge-Based Systems, 232, 107469. https://doi.org/10.1016/j.knosys.2021.107469
Lu, K., Fang, X., Fang, N., & Asare, E. (2022). Discovery of effective infrequent sequences based on maximum probability path. Connection Science, 34(1), 63–82. https://doi.org/10.1080/09540091.2021.1951667
Luengo, D., & Sepúlveda, M. (2011). Applying clustering in process mining to find different versions of a business process that changes over time. In International conference on business process management (pp. 153–158). Springer.
Medeiros, A. K. A. D., Guzzo, A., Greco, G., Van der Aalst, W. M., Weijters, A., Dongen, B. F. V., & Sacca, D. (2007). Process mining based on clustering: A quest for precision. In International conference on business process management (pp. 17–29). Springer.
Pan, Y., Zhang, L., & Li, Z. (2020). Mining event logs for knowledge discovery based on adaptive efficient fuzzy Kohonen clustering network. Knowledge-Based Systems, 209, 106482. https://doi.org/10.1016/j.knosys.2020.106482
Pourbafrani, M., van Zelst, S., & Aalst, W. (2020). Supporting automatic system dynamics model generation for simulation in the context of process mining. In 23rd International conference on business information systems (pp. 249–263). Springer International Publishing.
Pourmasoumi, A., Kahani, M., & Bagheri, E. (2017). Mining variable fragments from process event logs. Information Systems Frontiers, 19(6), 1423–1443. https://doi.org/10.1007/s10796-016-9662-x
Richetti, P., Jazbik, L. S., Baiao, F. A., & Campos, M. (2022). Deviance mining with treatment learning and declare-based encoding of event logs. Expert Systems with Applications, 187, 115962. https://doi.org/10.1016/j.eswa.2021.115962
Rosa, M. L., Aalst, W. M. V. D., Dumas, M., & Milani, F. P. (2017). Business process variability modeling: A survey. ACM Computing Surveys (CSUR), 50(1), 1–45. https://doi.org/10.1145/3041957
Schunselaar, D. M., Verbeek, E., Van Der Aalst, W. M., & Reijers, H. A. (2012). Creating sound and reversible configurable process models using CoSeNets. In International conference on business information systems (pp. 24–35). Springer Berlin Heidelberg.
Tang, Y., Li, T., Zhu, R., Liu, C., & Zhang, S. (2022). A hybrid genetic service mining method based on trace clustering population. IEICE Transactions on Information and Systems, E105D(8), 1443–1455. https://doi.org/10.1587/transinf.2021EDP7190
Tariq, Z., Charles, D., McClean, S., McChesney, I., & Taylor, P. (2021). An event-level clustering framework for process mining using common sequential rules. In International conference for emerging technologies in computing (pp. 147–160). Springer International Publishing.
Tavares, G. M., Barbon Junior, S., Damiani, E., & Ceravolo, P. (2022). Selecting optimal trace clustering pipelines with meta-learning. In J. C. Xavier-Junior & R. A. Rios (Eds.), Intelligent systems (pp. 150–164). Springer International Publishing.
Taymouri, F., La Rosa, M., Dumas, M., & Maggi, F. M. (2021). Business process variant analysis: Survey and classification. Knowledge-Based Systems, 211, 106557. https://doi.org/10.1016/j.knosys.2020.106557
van Beest, N., Groefsema, H., García-Bañuelos, L., & Aiello, M. (2019). Variability in business processes: Automatically obtaining a generic specification. Information Systems, 80, 36–55. https://doi.org/10.1016/j.is.2018.09.005
Van Den Ingh, L., Eshuis, R., & Gelper, S. (2021). Assessing performance of mined business process variants. Enterprise Information Systems, 15(5), 676–693. https://doi.org/10.1080/17517575.2020.1746405
Van der Aalst, W. (2011). Process mining: Discovery, conformance and enhancement of business processes (2nd ed.). Springer.
van der Aalst, W. M. P. (2022). Process mining: A 360 degree overview. Springer International Publishing.
Vertuam Neto, R., Tavares, G., Ceravolo, P., & Barbon, S. (2021). On the use of online clustering for anomaly detection in trace streams. In XVII Brazilian symposium on information systems (pp. 1–8). Association for Computing Machinery.
Wang, Q., Shao, C., Fang, X., & Zhang, H. (2022). Business process recommendation method based on cost constraints. Connection Science, 34(1), 2520–2537. https://doi.org/10.1080/09540091.2022.2133083
Xu, J., & Liu, J. (2019). A profile clustering based event logs repairing approach for process mining. IEEE Access, 7, 17872–17881. https://doi.org/10.1109/ACCESS.2019.2894905
Zandkarimi, F., Rehse, J. R., Soudmand, P., & Hoehle, H. (2020). A generic framework for trace clustering in process mining. In 2020 2nd International conference on process mining (ICPM) (pp. 177–184). IEEE.

