Cybernetics and Systems
An International Journal
Volume 55, 2024 - Issue 1
Research Articles

A Hybrid Clustering Method Based on the Several Diverse Basic Clustering and Meta-Clustering Aggregation Technique


Abstract

In hybrid clustering, several basic clusterings are first generated, and an aggregation function is then used to create a final clustering that is as similar as possible to all of the basic clusterings. The input of this function is the set of basic clusterings and its output is a clustering called the consensus (agreement) clustering. This works well, however, only if certain conditions are met. This study presents a hybrid clustering method that uses the k-means clustering method as its basic clusterer and increases the diversity of the consensus through several measures. The aggregation of the basic clusters is performed with a meta-clustering technique, in which the primary clusters are re-clustered to form the final clusters. The proposed hybrid clustering method retains the main advantage of k-means, its high speed, without its major weakness, the inability to detect non-spherical and non-uniform clusters. In the empirical studies, we evaluate the proposed hybrid clustering method against other up-to-date and robust clustering methods on several datasets. According to the simulation results, the proposed hybrid clustering method outperforms the other clustering methods.

1. Introduction

Nowadays, clustering plays an important role in most research fields such as engineering, medicine, biology, and data mining (Sun et al. Citation2018; Tan et al. Citation2020). Clustering is one of the fields of unsupervised learning and is an automatic process during which samples are divided into categories whose members are similar to each other, and these categories are called clusters. Therefore, a cluster is a collection of samples in which the samples are similar to each other and are not similar to the samples in other clusters (Wei et al. Citation2019; Trik, Pour Mozaffari, and Bidgoli Citation2021). Different criteria can be considered for similarity. For example, the distance criterion can be used for clustering and samples that are closer to each other can be considered as a cluster. This type of clustering is known as distance-based clustering. In simple words, the purpose is to separate groups with similar features and divide them into clusters (Yang et al. Citation2021; Ma et al. Citation2021).

Clustering methods take the data and form these groups using some kind of similarity criterion. The resulting clusters/groups can be used in many applications such as image processing, pattern recognition, social network analysis, recommendation engines, and information retrieval (Zhao et al. Citation2019). In the machine learning process for clustering, a distance-based similarity measure plays a pivotal role in clustering decisions (Ghobaei-Arani and Shahidinejad Citation2021). In all kinds of clustering methods, two main objectives should be pursued in order to minimize error: first, the similarity of data points within a cluster and, second, the distinction of those points from the points in other clusters (Forouzandeh et al. Citation2021; Berahmand et al. Citation2021). A basic prerequisite for such divisions is the ability to scale to large datasets. Another challenge in clustering is the different types of features in the data: data can be structured, unstructured, hierarchical, and continuous (Ghobaei-Arani Citation2021; Shahidinejad, Ghobaei-Arani, and Esmaeili Citation2020). Moreover, data is not dimensionally limited and is multidimensional in nature.

Basically, a suitable distance measure can be very effective in clustering. However, clusters can take arbitrary geometric shapes, so this challenge must also be considered. In addition, the results of a clustering method should be interpretable in order to solve business problems. Therefore, scalability, features, dimensionality, cluster shape, noise, and interpretability are the issues that clustering methods should address (Nasiri et al. Citation2022; Jadidi and Dizadji Citation2021). In general, clustering with different methods follows a similar architecture; the differences among the methods lie in the distance/similarity criteria, the initial cluster values, and how the final clusters are formed. These differences have led to the development of different clustering methods over time. Basically, there are five main classes of clustering methods, namely Density-based Clustering (DC), Grid-based Clustering (GC), Model-based Clustering (MC), Hierarchical Clustering (HC), and Partitional Clustering (PC), as shown in Figure 1 (Wei, Li, and Zhang Citation2018).

Figure 1. Taxonomy of clustering methods.


Since most basic clustering methods emphasize specific aspects of the data, they are efficient only on specific datasets (Niu et al. Citation2020; Li, Qian, and Wang Citation2021). For this reason, approaches are needed that can produce better results by combining these methods and exploiting their individual strengths. Hybrid clustering is such an approach: a clustering obtained by combining the results of different clustering methods. Accuracy, correctness, and stability are important advantages of a hybrid clustering method over classical clustering methods (Zheng et al. Citation2021; Zhu et al. Citation2021; Tan et al. Citation2020). In fact, the main purpose of hybrid clustering is to search for better and stronger results by combining the information and results obtained from several primary clusterings (partitions). Many studies have been carried out on hybrid clustering, and recent research in this field has shown that data clustering can benefit significantly from the combination of several data parts. In addition, the parallelization potential of hybrid clustering adapts naturally to the needs of distributed data mining. Hybrid clustering can provide better solutions in terms of robustness, scalability, stability, and flexibility than individual basic clustering methods.

Basically, hybrid clustering includes two main steps: (1) producing different results from basic clustering methods and (2) combining those results to produce the final clusters (Zhu et al. Citation2021). The first step concerns the creation of partitions with dispersion and diversity by different methods, and the second refers to an agreement (consensus) function that combines the results (Wei et al. Citation2019). Usually, in the first step a number of primary clusterings are created, each of which emphasizes a specific feature of the data. Applying one clustering method on several different parts of the data, or using several different clustering methods, can introduce dispersion and diversity into the partition results (Yang et al. Citation2021). After the primary partitions are formed, they are usually combined using an agreement function. One of the most common ways of combining the results is the co-association (correlation) matrix, which records how often each pair of samples is grouped together. A hybrid clustering framework is shown in Figure 2, where the results of several basic clustering methods are combined to achieve more stable, scalable, and higher-quality clustering.
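As an illustration of the co-association idea, the following sketch (Python with numpy/scipy; the function names and the choice of average-link agglomeration are ours, not a procedure from this paper) builds the co-association matrix from several base label vectors and derives a consensus partition from it.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def co_association_matrix(partitions):
    """Fraction of base partitions in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    co = np.zeros((n, n))
    for labels in partitions:
        labels = np.asarray(labels)
        co += (labels[:, None] == labels[None, :]).astype(float)
    return co / len(partitions)

def consensus_from_co_association(partitions, k):
    """Average-link hierarchical clustering on the co-association distances."""
    co = co_association_matrix(partitions)
    dist = 1.0 - co                      # dissimilarity = 1 - co-association
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(z, t=k, criterion="maxclust")

# Toy usage: three base partitions of six samples.
parts = [np.array([0, 0, 0, 1, 1, 1]),
         np.array([0, 0, 1, 1, 1, 1]),
         np.array([1, 1, 1, 0, 0, 0])]
print(consensus_from_co_association(parts, k=2))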

Figure 2. Hybrid clustering framework.


Therefore, instead of trying to build a single strong global clustering method, more attention is nowadays paid to building frameworks that integrate several weak clusterers (Zhao et al. Citation2019; Trik et al. Citation2022). In this regard, "hybrid clustering" or "cluster aggregation" has been proposed to improve the strength and quality of the clustering process (Tan et al. Citation2020). The k-means clustering method, one of the flat (partitional) approaches, is known as a very fast and fairly efficient method (Yang et al. Citation2021; Ma et al. Citation2021). As a weak clusterer, it is one of the best basic clustering methods for contributing to consensus building in hybrid clustering. This paper addresses the existing problems by presenting a theory of valid local clusters. The similarity between valid local clusters is estimated by applying an inter-cluster and intra-cluster similarity metric. In the next step of the method, the aggregation of the basic clusters is performed with a meta-clustering technique, where the primary clusters are re-clustered to form the final clusters. Eventually, the output of these clusters is considered along with the average credits to optimize the final agreement. The proposed hybrid clustering method has the advantage of k-means, its high speed, without its major weaknesses.

The main contributions of this paper are as follows:

  • The aggregation process of the basic clusters with a new meta-clustering technique.

  • Definition of valid local clusters by considering the data around the cluster centers in k-means.

  • Generating diverse primary clusters by repeatedly re-clustering the data that has not yet appeared in any valid local cluster.

  • Performing extensive experiments to demonstrate the efficacy of the proposed clustering method and support our idea.

The rest of the paper is organized as follows. A brief discussion of related works in the literature is provided in Section 2. The formulation of the problem is provided in Section 3. The proposed clustering method is presented in Section 4. Experimental results are demonstrated in Section 5. Finally, Section 6 concludes the paper.

2. Related Works

So far, many studies on the development of clustering methods have been presented by the research community (Jain Citation2010; Hansen and Mladenović Citation2001; Zhang, Hsu, and Dayal Citation2000). The k-means method is one of the popular clustering approaches and has many improved versions. For example, H-means solves the empty-cluster problem of k-means (Jain Citation2010; Walid et al. Citation2021). Problems of k-means such as sensitivity to outliers and noise and convergence to local optima are addressed by the J-means method (Hansen and Mladenović Citation2001), which can also solve the degeneracy problem of k-means. Jiang et al. (Citation2010) applied K-Harmonic Means (KHM) to address the sensitivity of k-means to the initial cluster centers. KHM obtains high-quality results by taking the harmonic mean of the distances as the objective function. However, KHM is not suitable for global optimization, and swarm intelligence techniques have been developed to complement it. The ACOKHM (Ant Colony Optimization and K-Harmonic Means) method for clustering with a global approach was presented by Bouyer and Hatamlou (Citation2018). Although ACOKHM provides high-quality and accurate results, it converges slowly to the global optimum.

Hybrid clustering has become very popular as a technique to improve clustering results. The results of hybrid clustering are considerably more accurate when the basic clusterings have higher diversity and quality (Bouyer and Hatamlou Citation2018). However, it has not been proven that greater diversity always yields more accurate results on every collection (Azimi and Fern Citation2009). The Link-based Cluster Ensemble (LCE) was proposed as a hybrid clustering method by Jain (Citation2010). LCE is an improved version of the Hybrid Bipartite Graph Formulation (HBGF), in which a bipartite graph is used. The authors first create a dense graph over pairs of samples and clusters and then form the final clusters using spectral clustering. Niu et al. (Citation2020) proposed a hybrid clustering method based on the combination of locally reliable cluster solutions. This method is built on k-medoids and introduces the concept of valid local clusters; a weighted undirected graph is used to find the relationships between clusters.

Huang, Wang, and Lai (Citation2017) proposed Locally Weighted Meta-Clustering (LWMC) to improve hybrid clustering methods. Here, the Jaccard coefficient is used to calculate the weights of the connections between clusters. LWMC uses the normalized-cut method to create meta-clusters, where each meta-cluster contains several clusters (Huang et al. Citation2020), and a weighted voting technique to create the final clusters. Consensus clustering by partitioning a similarity graph was proposed by Hamidi, Akbari, and Motameni (Citation2019). This method uses graph pruning for clustering, and the number of clusters is estimated automatically. The authors use meta-clusters and a majority vote as the aggregation function to create the final clusters, with the Jaccard coefficient again used to calculate similarity. The Iterative Combining Clustering Method (ICCM) was proposed by Khedairia and Khadir (Citation2022). ICCM uses an iterative technique to analyze the data and create primary clusters, and a voting method to create the set of partitions: each sample votes for its own sub-cluster, samples with higher votes are assigned to the corresponding sub-clusters, and samples that do not obtain the highest vote are clustered in subsequent iterations.

Hybrid clustering is still both a practical tool and an active field of theoretical study. A review of these methods is presented by Golalipour et al. (Citation2021). Because precision in clustering does not have a straightforward meaning as it does in classification, an alternative notion is used: a precise clustering is one that is most similar to the other clusterings formed on the given data; in other words, a better clustering is a more stable clustering. For a reason analogous to why a diverse collection of classifiers is suitable for hybrid classification, a set of clusterings is considered a good set if its basic clusterings are varied (Bai, Liang, and Cao Citation2020). In order to generate a diverse clustering consensus, a weak clustering method must be applied to the data several times.

We use the k-means clustering method as the weak clusterer to solve this problem (Abapour, Shafiesabet, and Mahboub Citation2021). Four sub-problems arise in hybrid clustering: (1) Recognizing relatively correct labels in clustering: unlike classification, there is no ground-truth information about labels in clustering. (2) Obtaining a variety of clusterings that together describe the entire data: in ensemble learning, several weak learners are combined into a strong learner, and the more the basic learners complement each other, the better the ensemble performs (Rezaeipanah, Nazari, and Ahmadi Citation2019; Rezaeipanah et al. Citation2021); that is, each weak clustering should cover what the other clusterings miss. For this purpose, we need to create several complementary clusterings by applying the k-means clustering method repeatedly. (3) Determining the correspondence between clusters: unlike classification, in which each label is exclusively assigned to a category, labels have no fixed meaning in clustering and simply indicate that data belong to the same cluster (Mojarad et al. Citation2021). Clusters with the same name in two different clusterings do not necessarily correspond; therefore, before anything else in hybrid clustering, the labels of the different clusterings must be re-labeled to establish correspondence. In addition, two clusters of the same clustering may in fact represent a single real cluster. (4) Combining the results of the matched basic clusterings: in different clusterings, each sample may receive different labels, so a final label, called the agreement label, must be determined. In ensemble learning, several weak learners are combined into a strong learner, and the more effective this combination, the better the ensemble performs (Li, Rezaeipanah, and El Din Citation2022).
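To illustrate sub-problem (3), the correspondence between the labels of two partitions can be recovered by maximizing the contingency-table overlap. The sketch below (Python; the helper name and the use of the Hungarian algorithm are our own illustration, not the procedure used later in this paper) relabels one partition so that its cluster ids match a reference partition.

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(reference, other):
    """Relabel `other` so that its cluster ids best match `reference`."""
    ref_ids = np.unique(reference)
    oth_ids = np.unique(other)
    # Contingency table: overlap[i, j] = samples with reference id i and other id j.
    overlap = np.array([[np.sum((reference == r) & (other == o)) for o in oth_ids]
                        for r in ref_ids])
    row, col = linear_sum_assignment(-overlap)          # maximize total overlap
    mapping = {oth_ids[c]: ref_ids[r] for r, c in zip(row, col)}
    return np.array([mapping.get(label, label) for label in other])

# Toy usage: the second partition is the first one with permuted label names.
a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([2, 2, 0, 0, 1, 1])
print(align_labels(a, b))    # -> [0 0 1 1 2 2]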

3. Problem Formulation

A dataset is defined as a set of data samples, each of which is a numerical (feature) vector. The dataset is denoted by $X$ and each data sample by $x_i$, where obviously $x_i \in X$. The $j$-th feature of the sample $x_i$ is denoted by $x_{ij}$. The size of the dataset $X$ is denoted by $|X|$ and the number of its features by $|x_1|$. Let $N$ be the number of samples and $M$ the number of features of a dataset, and let $c$ be the number of subsets into which the data is clustered/partitioned. When the union of all subsets equals the original dataset and every pair of subsets has an empty intersection, each subset can be defined as a cluster. A clustering is denoted by $\pi=\{\pi_1,\pi_2,\ldots,\pi_c\}$, where $\pi_i$ represents the $i$-th cluster. Obviously, $\bigcup_{i=1}^{c}\pi_i=X$ and $\forall i,j\in\{1,2,\ldots,c\},\ i\neq j:\ \pi_i\cap\pi_j=\emptyset$. The center of each cluster $\pi_i$ is denoted by $C^{\pi_i}$, and its $j$-th feature is defined as Eq. (1)
(1) $C_j^{\pi_i}=\dfrac{\sum_{k\in\pi_i} x_{kj}}{|\pi_i|}$

A valid sub-cluster of a cluster $\pi_i$ is denoted by $r^{\pi_i}$ and is defined according to Eq. (2)
(2) $r^{\pi_i}=\left\{x_k\in\pi_i \;:\; \sum_{j=1}^{|x_1|}\left|C_j^{\pi_i}-x_{kj}\right|^2\le\gamma\right\}$
where $\gamma$ is a threshold parameter. It should be noted that a sub-cluster can itself be considered as a cluster.
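A minimal numpy sketch of Eqs. (1)-(2) under our reading of the notation (`gamma` stands for the threshold γ): the cluster center is the feature-wise mean, and the valid sub-cluster keeps the samples whose squared distance to that center does not exceed γ.

import numpy as np

def cluster_center(cluster):
    """Eq. (1): feature-wise mean of the samples in a cluster (rows = samples)."""
    return cluster.mean(axis=0)

def valid_subcluster(cluster, gamma):
    """Eq. (2): samples whose squared Euclidean distance to the center is <= gamma."""
    center = cluster_center(cluster)
    sq_dist = ((cluster - center) ** 2).sum(axis=1)
    return cluster[sq_dist <= gamma]

# Toy usage: a 2-D cluster with one far-away point that falls outside the valid sub-cluster.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [-0.2, 0.1], [3.0, 3.0]])
print(valid_subcluster(pts, gamma=1.0))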

Basically, many similarity/distance measures exist in the literature to quantify the difference between two clusters. In this paper, we define the similarity metric between two clusters $\pi_i$ and $\pi_j$, denoted $\mathrm{sim}(\pi_i,\pi_j)$, as Eq. (3)
(3) $\mathrm{sim}(\pi_i,\pi_j)=\begin{cases}\dfrac{|\pi_i\cap\pi_j|}{|\pi_i\cup\pi_j|}+\dfrac{\sum_{q=1}^{9}\left|T_q(\pi_i,\pi_j)\cap(\pi_i\cup\pi_j)\right|}{\sqrt{\sum_{w=1}^{|x_1|}\left|C_w^{\pi_i}-C_w^{\pi_j}\right|^2}}, & \sqrt{\sum_{w=1}^{|x_1|}\left|C_w^{\pi_i}-C_w^{\pi_j}\right|^2}\le 4\gamma\\ 0, & \text{otherwise}\end{cases}$
where $T_q(\pi_i,\pi_j)$ is calculated using Eq. (4)
(4) $T_q(\pi_i,\pi_j)=\left\{x_k\in X \;:\; \sum_{w=1}^{|x_1|}\left|p_{qw}(\pi_i,\pi_j)-x_{kw}\right|^2\le\gamma\right\}$
where $p_q(\pi_i,\pi_j)$ is a point whose $w$-th feature is defined as Eq. (5)
(5) $p_{qw}(\pi_i,\pi_j)=\dfrac{q\times C_w^{\pi_i}+(10-q)\times C_w^{\pi_j}}{10}$

Let $X=\{x_1,x_2,\ldots,x_i,\ldots,x_n\}$ be a set of $n$ samples of the dataset $X$, where $x_i=[x_{i1},x_{i2},\ldots,x_{ij},\ldots,x_{id}]$ is the $i$-th sample with $d$ features. Also, let $\Pi=\{\pi_1,\pi_2,\ldots,\pi_k,\ldots,\pi_m\}$ be a hybrid of $m$ individual clustering methods, where $\pi_k$ is the $k$-th member of the hybrid. Each $\pi_k\in\Pi$ returns a set of clusters $\pi_k=[c_1^k,c_2^k,\ldots,c_l^k,\ldots,c_{|\pi_k|}^k]$ (a partition), where $|\pi_k|$ is the number of clusters created by $\pi_k$. For each $x_i\in X$, $\pi_k(x_i)$ denotes the cluster label of $x_i$ in $\pi_k$. Here, the problem of hybrid clustering is defined as finding a new partition $\pi^*=[c_1^*,c_2^*,\ldots,c_l^*,\ldots,c_K^*]$ from the consensus results of the set $\Pi$, where $K$ is the number of final clusters.

A weighted graph corresponding to a consensus of clusterings $\Pi$ is denoted by $G(\Pi)$ and defined as $G(\Pi)=[V(\Pi),E(\Pi)]$. The vertex set of this graph consists of the valid sub-clusters of all clusters in the consensus, namely $V(\Pi)=\{r^{\pi_1}_1,\ldots,r^{\pi_1}_{c_1},r^{\pi_2}_1,\ldots,r^{\pi_2}_{c_2},\ldots,r^{\pi_B}_1,\ldots,r^{\pi_B}_{c_B}\}$. The weight of the edge between two vertices of this graph (a cluster-cluster connection) is their similarity value, as shown in Eq. (6)
(6) $E(v_i,v_j)=\mathrm{sim}(v_i,v_j)$

Basically, the k-means clustering method is an unsupervised learning method used to process unlabeled data. Its purpose is to find the best grouping of the data, where $k$ determines the number of clusters. The data is placed into clusters according to the degree of similarity, such that the data with the greatest similarity are placed in one group and have the least similarity with the other groups. Here, $k$ specifies the number of clusters and "means" refers to the averaging of the samples in each cluster. Clusters have two characteristics: first, all the data in a cluster must be as similar as possible to each other; second, the data in different clusters should differ as much as possible. The time complexity of the k-means method is $O(N\cdot k\cdot I)$, where $I$ is the number of iterations.
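For reference, the base clusterer can be reproduced with any standard k-means implementation; the toy example below (scikit-learn; the data and parameter values are illustrative only) shows the per-sample labels, the cluster centers, and the number of iterations $I$ that enters the $O(N\cdot k\cdot I)$ cost.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs as a toy dataset.
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(3, 0.3, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])        # cluster label of the first five samples
print(km.cluster_centers_)   # the k cluster centers (feature-wise means)
print(km.n_iter_)            # number of iterations I actually performed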

The pseudocode for k-means-based hybrid clustering is shown in Algorithm 1. In this pseudocode, the original dataset is saved as TX and then an improved version of k-means is called sequentially to find and store the clustering results.

Algorithm 1.

The hybrid clustering method based on the k-means method.

01: Π = ∅;
02: TX = X;
03: For i = 1 to B do
04:   π_i = modified k-means(TX, c_i);
05:   TX = TX \ ⋃_{j=1}^{c_i} r_j^{π_i};
06:   Π = Π ∪ {π_i};
07: End For
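Read as Python, the loop of Algorithm 1 is roughly the sketch below; `modified_kmeans` is a placeholder for the MKM routine of Algorithm 2 and is assumed to return both the partition labels and the indices of the samples that received valid local labels.

import numpy as np

def hybrid_generation(X, B, c, modified_kmeans):
    """Algorithm 1 (sketch): generate up to B base partitions, removing
    the valid-local samples from the working set after each round.

    X: (N, d) data matrix; B: number of base clusterings; c: clusters per base clustering;
    modified_kmeans(TX, c) -> (labels, valid_idx), valid_idx indexing rows of TX.
    """
    ensemble = []                      # the set Pi in the pseudocode
    TX = X.copy()                      # working copy of the dataset
    for _ in range(B):
        labels, valid_idx = modified_kmeans(TX, c)
        ensemble.append(labels)
        # Remove the union of valid sub-clusters from the working set (line 05).
        mask = np.ones(len(TX), dtype=bool)
        mask[valid_idx] = False
        TX = TX[mask]
        if len(TX) == 0:
            break
    return ensemble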

Since diversity among the basic clusterings is a prerequisite for the effectiveness of the cluster ensemble, we now discuss how to obtain several k-means clusterings with different valid local labels. We first define an optimization problem for generating the basic clusterings as Eqs. (7) and (8)
(7) $\min_{\Pi}\left[Z(\Pi)=\sum_{h=1}^{T}\sum_{i=1}^{N}\theta_h(X_i)\, d\!\left(X_i,V^{\pi_h(X_i)}\right)\right]$
(8) $\sum_{h=1}^{T}\theta_h(X_i)=1,\quad 1\le i\le N$
where $\theta_h(X_i)$ is a Boolean variable: if it equals 1, $X_i$ plays a role in the production of the $h$-th basic clustering. $\theta_h(X_i)$ is provided to control how many times each sample is used.

Here, the constraint ensures that each sample is used only once at a time in producing the basic clusterings, which is enforced through the cluster centers. The purpose of minimizing the objective function $Z$ is to place the cluster centers of each basic clustering so that they represent samples in valid local regions of the space. We suggest an incremental learning method for solving this optimization problem. This method gradually produces the basic clusterings by optimizing an incremental sub-problem at each step. Given that $\Pi$ already contains the first $g$ basic clusterings $(0<g<T)$, the incremental problem is as Eq. (9)
(9) $\min\; Z\!\left(\Pi\cup\{\pi_{g+1}\}\right)$

In addition, $\theta_{g+1}(X_i)$ is estimated through Eq. (10)
(10) $\theta_{g+1}(X_i)=\begin{cases}1, & X_i\ \text{has no valid local label in}\ \pi_1,\ldots,\pi_g\\ 0, & \text{otherwise}\end{cases}$
where $1\le i\le N$.

Given this constraint, the samples that are not yet represented by cluster centers in $\Pi$ play the main role in the $(g+1)$-th basic clustering. The incremental learning method works as follows. We first set $h=1$, $\theta_h(X_i)=1$ for $1\le i\le N$, and $S=X$. At each step, we randomly select $k$ samples from $S$ as the primary cluster centers and run a constrained k-means on $S$. In this clustering, the cluster centers are restricted to their neighborhoods, as required by Eq. (9); this ensures that the final cluster centers represent samples in valid local regions. After executing k-means, we update $S=S\setminus S'$, where $S'$ is the set of samples that received valid local labels in the $h$-th basic clustering. We then update $h=h+1$ and set, for $1\le i\le N$, $\theta_h(X_i)=1$ if $x_i\in S$ and $\theta_h(X_i)=0$ otherwise.

The above procedure is repeated until the number of samples in $S$ is less than $k_h^2$. Updating the cluster centers at each step through this iterative mechanism leads to the final cluster centers and guarantees that the data is described by multiple clusterings. On the other hand, the stopping condition matters: many researchers have argued that the number of clusters for a sample set $S$ should not exceed $\sqrt{|S|}$ (Zheng et al. Citation2021; Zhu et al. Citation2021; Tan et al. Citation2020). Thus, when the number of samples in $S$ falls below $k_h^2$, we assume that $S$ can no longer be divided into $k_h$ clusters, and the iteration stops.

Algorithm 2.

Pseudocode of the MKM scheme.

Input: X, k, ε

Output: Π, V

01: Π = ∅; V = ∅; S = X; h = 1;
02: θ_h(X_i) = 1, for 1 ≤ i ≤ N;
03: While |S| ≥ k_h² do
04:   Randomly select k_h primary cluster centers v_h from S;
05:   Repeat
06:     Given v_h, assign each X_i ∈ S to π_h(X_i) = argmin_{1≤l≤k_h} d(X_i, v_h^l);
07:     Given π_h, update each center v_h^l = (Σ_{X_i∈D} X_i) / |D|,
08:       where D = {X_i ∈ S : π_h(X_i) = l, X_i ∈ B(v_h^l)}, for 1 ≤ l ≤ k_h;
09:     F = Σ_{l=1}^{k_h} Σ_{π_h(X_i)=l, X_i∈S} d(X_i, v_h^l)²;
10:   Until F no longer decreases
11:   S′ = {X_i ∈ S : X_i has a valid local label under π_h};
12:   For i = 1 to N do
13:     If X_i ∈ S′ then θ_{h+1}(X_i) = 0;
14:     else θ_{h+1}(X_i) = θ_h(X_i);
15:     End If
16:   End For
17:   Π = Π ∪ {π_h}; V = V ∪ {v_h};
18:   S = S \ S′; h = h + 1;
19: End While

This incremental method is called the Modified k-means (MKM) clustering method and is formally described in Algorithm 2. The time complexity of MKM is $O(N\,t\,T\,k_h)$, where $T$ is the number of partitions generated. The outputs of the algorithm are the clustering set $\Pi=\{\pi_h,\ 1\le h\le T\}$ and the set of cluster centers $V=\{v_h,\ 1\le h\le T\}$. To simplify the generation process, we fix the number of clusters in each basic clustering to $k$, i.e., $k_h=k$ for $1\le h\le T$. We continue with the example in Figure 3. Here, we set $\varepsilon=0.8$ on the dataset and generate 10 basic clusterings. Part (d) shows the partition boundaries of the basic clusterings generated by the MKM scheme. We observe that these basic clusterings differ from one another, which is useful for the cluster ensemble.

Figure 3. An example of MKM: (a) real class labels, (b) clustering from k-means, (c) local hypothesis of the clusters, and (d) multiple partitions by MKM.


Note that the number $T$ of basic clusterings depends on the parameter $\varepsilon$. When $\varepsilon$ decreases, $T$ must increase, because a small $\varepsilon$ means that each basic clustering captures only a small local portion of the data. Therefore, when $\varepsilon$ is set to a smaller value, more basic clusterings are needed to describe the whole data. The setting of $\varepsilon$ depends on the needs of the users, who can use this parameter to control the number of basic clusterings.

4. Proposed Clustering Method

This study provides a hybrid clustering method that uses the k-means clustering method as its basic clusterer and increases the diversity of the aggregation through several measures. The aggregation of the basic clusters is performed with a meta-clustering technique, in which the primary clusters are re-clustered to form the final clusters. The proposed hybrid clustering method has the advantage of k-means, its high speed, without its major weaknesses.

In general, the labels in a dataset represent classes, whereas the labels in a clustering only represent groups. Therefore, clustering labels cannot be used directly for comparison and cluster analysis, and it is necessary to align the labels across clusterings. Additionally, since the k-means method can only detect spherical and uniform clusters, two clusters of the same clustering may together represent a single real cluster. Hence, the relationships between clusters in similar clusterings must be analyzed. Several measures of the dissimilarity between clusters have been proposed in the literature (Yang et al. Citation2021; Ma et al. Citation2021). An example is linkage (chain) clustering, where the connection between clusters is determined by the distance between the closest or farthest samples of the two clusters (Zhao et al. Citation2019). This approach is sensitive to noise because it depends on a few specific samples to determine the final clusters. On the other hand, center-based clustering approaches use the distance between centers as the measure of dissimilarity. This approach cannot effectively identify the border between clusters, but it has high computational efficiency and is resistant to noise.

In general, the similarity between two clusters from different partitions can be estimated from the number of samples they share. This strategy, however, cannot account for samples with wrong labels in a cluster, even though some of these samples can strongly influence the similarity value. Also, two clusters from the same partition share no samples, so this metric cannot measure their similarity at all. Although such measures work reasonably well in practice, they are not suitable for hybrid clustering. As mentioned, the labels of the created base partitions differ from the valid local labels; in other words, the validity of the labels of each cluster may be low or high. Hence, the difference between clusters should be computed based on local labels. However, because MKM is used to generate the initial partitions, the overlap between valid local regions is relatively small. We therefore use an indirect overlap technique to calculate the similarity between clusters.

Let $c_h^l$ and $c_g^i$ be two clusters, $V_h^l$ and $V_g^i$ their cluster centers, and $(V_h^l+V_g^i)/2$ the midpoint of the two centers. We assume there is a hidden cluster $c_z$ whose center is $(V_h^l+V_g^i)/2$. The denser a region of samples, the more likely its samples are in valid local locations. If such a hidden cluster exists and the distance between $V_h^l$ and $V_g^i$ is not greater than $4\times\varepsilon$, the valid local regions of the clusters $c_h^l$ and $c_g^i$ overlap with the hidden cluster $c_z$, as shown in Figure 4. In this case, the valid local regions of $c_h^l$ and $c_g^i$ overlap indirectly through the hidden cluster. For the clusters $c_h^l$ and $c_g^i$, we use these quantities to estimate the similarity between clusters.

Figure 4. Hidden cluster between clusters.


The distance between cluster centers is weighed against the probability of a hidden cluster between them. The smaller $d(V_h^l,V_g^i)$ is, the more the valid local regions of the two clusters overlap with $c_z$; accordingly, their similarity is inversely related to $d(V_h^l,V_g^i)$. Also, k-means is a clustering approach with a linear mechanism and identifies the border between two clusters as a line between their centers. If the region around this border contains enough samples, the border can be identified clearly. We use the example in Figure 5.

Figure 5. Similarity between clusters.


It can be seen that the centers of clusters B and C are farther apart than those of clusters A and B, yet it is easier to determine the border between clusters A and B. Hence, the distance between the centers of A and B may effectively be increased to account for how clearly the boundary can be identified. Following this hypothesis, the similarity between two clusters is estimated through the hidden cluster. Formally, the similarity is measured as Eq. (11)
(11) $\delta(c_h^l,c_g^i)=\begin{cases}\dfrac{\left|B\!\left(\frac{V_h^l+V_g^i}{2}\right)\right|}{d(V_h^l,V_g^i)}, & d(V_h^l,V_g^i)\le 4\times\varepsilon\\ 0, & \text{otherwise}\end{cases}$
where $B(\cdot)$ denotes the set of samples in the $\varepsilon$-neighborhood of a point.
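Under our reading of Eq. (11), the similarity of two clusters is the number of samples in the ε-ball around the midpoint of their centers divided by the distance between the centers, and zero whenever the centers are more than 4ε apart. A small numpy sketch (the names are ours):

import numpy as np

def hidden_cluster_similarity(X, center_a, center_b, eps):
    """Eq. (11) sketch: |B((Va+Vb)/2)| / d(Va, Vb) if d(Va, Vb) <= 4*eps, else 0."""
    d = np.linalg.norm(center_a - center_b)
    if d == 0 or d > 4 * eps:
        return 0.0
    midpoint = (center_a + center_b) / 2.0
    # Count dataset samples falling inside the eps-ball around the midpoint.
    in_ball = np.linalg.norm(X - midpoint, axis=1) <= eps
    return in_ball.sum() / d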

Given the defined similarity, we use an undirected weighted graph $G=\langle A,W\rangle$ to describe the relationships between clusters. Here, $A$ is the set of nodes, which represent the cluster labels in $\Pi$, and $W$ is the set of edge weights, which express the similarity between clusters: for $x,y\in A$, $w_{xy}=\delta(c_x,c_y)$, so that larger weights indicate more similar clusters. With this weighted graph, the relationships between the clusters can be mapped to the normalized graph-cut problem, as in Eq. (12)
(12) $\min_{\Omega}\left[Q(\Omega)=\dfrac{1}{K}\sum_{l=1}^{K}\dfrac{\sum_{x\in A_l,\; y\in A\setminus A_l} w_{xy}}{\sum_{x\in A_l,\; z\in A} w_{xz}}\right]$
where $\Omega=\{A_l,\ l=1,2,\ldots,K\}$ is a partition of the nodes in $G$ and $A_l$ is one of the subsets of $A$.

Our goal is to obtain this partition by minimizing the objective function $Q$, which yields a partition with high similarity between nodes in the same subset and low similarity to nodes in other subsets. To solve this problem and create the partition of $A$, the normalized spectral clustering method is used, where the nodes of the same subset represent one group of clusters. Hence, if $L(c_x)$ is the label of the subset to which $c_x$ belongs, then $L(c_x)=l$ whenever $c_x\in A_l$, for $1\le l\le K$ and $x\in A$. The time complexity of building the cluster relationships is $O(N(T\cdot k_h)^2)$.
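Given the weight matrix W of the cluster graph, the grouping of base clusters into K subsets can be obtained with an off-the-shelf normalized spectral clustering routine; a sketch using scikit-learn's precomputed-affinity mode (the toy matrix is illustrative):

import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_the_clusters(W, K):
    """Partition the cluster-similarity graph W (symmetric, non-negative) into K groups."""
    sc = SpectralClustering(n_clusters=K, affinity="precomputed",
                            assign_labels="kmeans", random_state=0)
    return sc.fit_predict(W)   # label L(c_x) for every base cluster c_x

# Toy usage: four base clusters, two pairs of which are strongly similar.
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 0.8],
              [0.0, 0.1, 0.8, 0.0]])
print(cluster_the_clusters(W, K=2))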

The use of hybrid clustering maps the clustering problem from the sample level to the cluster level. Assume that $PC$ is the set containing all primary clusters created by all basic methods. Taking each cluster as a sample, the clustering process is applied again, this time clustering the clusters. This technique creates meta-clusters, where each meta-cluster contains several clusters. Meta-clusters carry more knowledge about the data than individual clusters because they combine the latent knowledge of different clustering methods. Here, the clustering of the clusters is performed with k-means. Let $\Gamma(x_i,x_j)$ be the similarity of two samples from the dataset. In meta-clusters, the concept of similarity is extended from the sample level to the cluster level. We define the similarity measure of clusters in a meta-cluster through Eq. (13)
(13) $\Psi(mc_\alpha,mc_\beta)=\dfrac{1}{|mc_\alpha|\cdot|mc_\beta|}\sum_{v=1}^{|mc_\alpha|}\sum_{w=1}^{|mc_\beta|}\left[\dfrac{\sum_{i=1}^{|c_v|}\sum_{j=1}^{|c_w|}\Gamma(x_i,x_j)}{|c_v|\cdot|c_w|}\right],\quad x_i\in c_v,\ x_j\in c_w$
where $mc_\alpha$ and $mc_\beta$ are two meta-clusters and $\Psi(mc_\alpha,mc_\beta)$ is the average similarity between them. Also, $|mc_\alpha|$ and $|mc_\beta|$ are the numbers of clusters in $mc_\alpha$ and $mc_\beta$, respectively, and $|c_v|$ and $|c_w|$ are the numbers of samples in $c_v$ and $c_w$, respectively.
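Eq. (13) can be computed directly by averaging pairwise sample similarities over all cluster pairs of two meta-clusters. In the sketch below, each meta-cluster is a list of clusters and each cluster an array of samples; the RBF kernel used for Γ is purely illustrative, since the paper leaves the sample-level similarity generic.

import numpy as np

def gamma_sim(xi, xj, sigma=1.0):
    """Illustrative sample-level similarity Gamma (RBF kernel)."""
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

def meta_cluster_similarity(mc_a, mc_b):
    """Eq. (13): average over cluster pairs of the average pairwise sample similarity."""
    total = 0.0
    for cv in mc_a:
        for cw in mc_b:
            pair_sum = sum(gamma_sim(xi, xj) for xi in cv for xj in cw)
            total += pair_sum / (len(cv) * len(cw))
    return total / (len(mc_a) * len(mc_b))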

We create the final clusters from the meta-clusters, assigning each sample of the dataset to the meta-cluster with which it has maximum similarity. Meanwhile, a suitable number of clusters can be found by merging the initial clusters subject to a threshold value. Thus, $k$ is determined as the number of optimal clusters by merging the initial clusters as long as some pair of clusters has similarity of at least $\theta$, as defined in Eq. (14)
(14) $\mathrm{merge}(c_a,c_b)=\begin{cases}\text{True}, & \sigma(c_a,c_b)\ge\theta\\ \text{False}, & \text{otherwise}\end{cases},\quad \forall a,b\in PC$
where $c_a$ and $c_b$ are two clusters of $PC$ and $\sigma(c_a,c_b)$ denotes the average similarity between $c_a$ and $c_b$.
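The merging rule of Eq. (14) can be sketched as repeated agglomeration of the most similar pair of primary clusters while some pair still reaches the threshold θ; `sigma_sim` below is a placeholder for the pairwise cluster similarity σ, and the greedy merge order is our own choice.

def merge_until_threshold(clusters, sigma_sim, theta):
    """Eq. (14) sketch: merge primary clusters while any pair has similarity >= theta.

    clusters: list of sample-index lists; sigma_sim(a, b): similarity of two such lists.
    Returns the merged clusters; their number is taken as the final k.
    """
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        best = max(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: sigma_sim(clusters[ij[0]], clusters[ij[1]]))
        i, j = best
        if sigma_sim(clusters[i], clusters[j]) < theta:
            break                         # no remaining pair reaches the threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters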

5. Experimental Results

This section evaluates the proposed clustering method on four synthetic datasets and five real datasets. The efficiency of the proposed method is assessed through different validation measures and runtime analysis. The proposed method is compared with several state-of-the-art methods: COllaborative-Single Link (CO-SL) (Fred and Jain Citation2005), COllaborative-Average Link (CO-AL) (Fred and Jain Citation2005), Combined Similarity Measure-Single Link (CSM-SL) (Iam-On et al. Citation2011), Combined Similarity Measure-Average Link (CSM-AL) (Iam-On et al. Citation2011), Weighted Triple Quality-Single Link (WTQ-SL) (Iam-On et al. Citation2011), Weighted Triple Quality-Average Link (WTQ-AL) (Iam-On et al. Citation2011), Weighted Connection Triple-Single Link (WCT-SL) (Iam-On et al. Citation2011), Weighted Connection Triple-Average Link (WCT-AL) (Iam-On et al. Citation2011), Meta-Clustering Algorithm (MCLA) (Strehl and Ghosh Citation2002), HyperGraph Partitioning Algorithm (HGPA) (Strehl and Ghosh Citation2002), Cluster-based Similarity Partitioning Algorithm (CSPA) (Strehl and Ghosh Citation2002), Selective Voting (SV) (Zhou and Tang Citation2006), Selective Weighted Voting (SWV) (Zhou and Tang Citation2006), Iterative Voting Consensus (IVC) (Nguyen and Caruana Citation2007), Expectation-Maximization (EM) (Topchy, Jain, and Punch Citation2005), Normalized Spectral Clustering (NSC) (Ng, Jordan, and Weiss Citation2001), Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (Ester et al. Citation1996), and Clustering by Fast Search and Find of Density Peaks (CFSFDP) (Rodriguez and Laio Citation2014).

5.1. Experiment Settings

A number of settings for the compared methods are listed below to ensure a uniform comparison environment. The number of clusters in each basic clustering is set equal to the actual number of classes in the corresponding dataset, and k-means is used as the generator of the basic clusterings. Two schemes are used to produce the basic clusterings: (1) T independent runs of k-means, each with randomly initialized cluster centers, where T is set according to the dataset scale (N denotes the number of samples): if N ≤ 500 then T = 25, if 500 < N < 1,000 then T = 45, and if N ≥ 1,000 then T = 15; (2) the proposed MKM scheme, which requires the input parameter ε, adapted so that the resulting group size is essentially consistent with the first scheme.

We implemented all methods in MATLAB 2019a. The simulations were run on a Dell Latitude laptop with an Intel Core i7 processor at 3.3 GHz and 16 GB of RAM. The input parameters of the proposed method were tuned using the Taguchi approach (Yang et al. Citation2021).

5.2. Evaluation Criteria

Since the real labels of the original datasets are available, we use two common external validation measures to estimate the similarity between the clustering results of the different methods and the correct division of the dataset. Given a dataset $X$ and two partitions of its samples, namely $C=\{c_1,c_2,\ldots,c_k\}$ (the clustering result) and $P=\{p_1,p_2,\ldots,p_k\}$ (the real partition), the relation between $C$ and $P$ can be summarized in a contingency table (Table 1), where $n_{ij}$ denotes the number of samples shared by clusters $c_i$ and $p_j$: $n_{ij}=|c_i\cap p_j|$.

Table 1. The contingency table used to compare two partitions.

Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI) are the evaluation metrics used in the experiments. The details of these criteria are described below.

5.2.1. NMI

Let $\pi^\alpha=[c_1^\alpha,c_2^\alpha,\ldots,c_{|\pi^\alpha|}^\alpha]$ and $\pi^\beta=[c_1^\beta,c_2^\beta,\ldots,c_{|\pi^\beta|}^\beta]$ be the results of two clustering methods, i.e., two partitions with $|\pi^\alpha|$ and $|\pi^\beta|$ clusters, respectively. $\mathrm{NMI}(\pi^\alpha,\pi^\beta)$ measures the shared information (agreement) between these partitions (Li, Qian, and Wang Citation2021), as shown in Eq. (15)
(15) $\mathrm{NMI}(\pi^\alpha,\pi^\beta)=\dfrac{-2\sum_{i=1}^{|\pi^\alpha|}\sum_{j=1}^{|\pi^\beta|} n_{ij}\log\left(\dfrac{n\cdot n_{ij}}{n_{\alpha i}\cdot n_{\beta j}}\right)}{\sum_{i=1}^{|\pi^\alpha|} n_{\alpha i}\log\left(\dfrac{n_{\alpha i}}{n}\right)+\sum_{j=1}^{|\pi^\beta|} n_{\beta j}\log\left(\dfrac{n_{\beta j}}{n}\right)}$
where $n$ is the number of samples, $n_{ij}$ is the number of samples shared by $c_i^\alpha$ and $c_j^\beta$, $n_{\alpha i}$ is the number of samples in $c_i^\alpha$, and $n_{\beta j}$ is the number of samples in $c_j^\beta$.

5.2.2. ARI

This measure is often used in cluster validation and indicates the agreement between two partitions (Niu et al. Citation2020). The ARI is calculated as a corrected-for-chance version of the Rand Index, as defined in Eq. (16)
(16) $\mathrm{ARI}(\pi^\alpha,\pi^\beta)=\dfrac{\sum_{i=1}^{|\pi^\alpha|}\sum_{j=1}^{|\pi^\beta|}\binom{n_{ij}}{2}-\left[\sum_{i=1}^{|\pi^\alpha|}\binom{n_{\alpha i}}{2}\sum_{j=1}^{|\pi^\beta|}\binom{n_{\beta j}}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i=1}^{|\pi^\alpha|}\binom{n_{\alpha i}}{2}+\sum_{j=1}^{|\pi^\beta|}\binom{n_{\beta j}}{2}\right]-\left[\sum_{i=1}^{|\pi^\alpha|}\binom{n_{\alpha i}}{2}\sum_{j=1}^{|\pi^\beta|}\binom{n_{\beta j}}{2}\right]\Big/\binom{n}{2}}$
where $n$ is the number of samples, $n_{ij}$ is the number of samples shared by $c_i^\alpha$ and $c_j^\beta$, $n_{\alpha i}$ is the number of samples in $c_i^\alpha$, and $n_{\beta j}$ is the number of samples in $c_j^\beta$.
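In practice, both criteria are available off the shelf; the snippet below evaluates a hand-made example with scikit-learn (the label vectors are illustrative, and the library's normalization convention for NMI may differ slightly from Eq. (15)).

from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# Two partitions of eight samples; label values are arbitrary, only the grouping matters.
pi_alpha = [0, 0, 0, 1, 1, 1, 2, 2]
pi_beta  = [1, 1, 0, 0, 0, 2, 2, 2]

print("NMI:", normalized_mutual_info_score(pi_alpha, pi_beta))
print("ARI:", adjusted_rand_score(pi_alpha, pi_beta))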

5.3. Datasets

Experimental evaluations were performed on nine datasets: four synthetic and five real. More information describing the datasets used in the experiments can be found in Table 2. The cluster distributions of the synthetic 2D datasets are shown in Figure 6. The real datasets are taken from the UCI machine learning repository (Golrou et al. Citation2018; Movahhed Neya, Saberi, and Rezaie Citation2022).

Figure 6. Distribution of four synthetic datasets: (a) imbalance, (b) aggregation, (c) banana, and (d) ring.


Table 2. Description of the datasets used.

5.4. Compared Methods

The proposed hybrid clustering method is evaluated against a wide range of clustering methods, most of which are state-of-the-art hybrid clustering methods. They include CO, a pairwise-similarity approach that performs clustering through a shared (co-association) similarity matrix (Fred and Jain Citation2005). The similarity matrices based on CSM, WTQ, and WCT also belong to the pairwise-similarity approaches and are considered for comparison (Iam-On et al. Citation2011). Here, CO, CSM, WTQ, and WCT are combined with Single-Link (SL) and Average-Link (AL) hierarchical clustering to compute the final results.

HGPA, MCLA, and CSPA are hybrid clustering methods presented by Strehl and Ghosh (Citation2002) and are also considered for the comparison and evaluation of the proposed method. In addition, we use SV and SWV as voting-based weighted clustering methods (Zhou and Tang Citation2006), and two feature-based consensus methods, IVC and EM. IVC was presented by Nguyen and Caruana (Citation2007) and EM by Topchy, Jain, and Punch (Citation2005).

We also evaluated the proposed method against some strong individual clustering approaches: NSC, DBSCAN, and CFSFDP. NSC was presented by Ng, Jordan, and Weiss (Citation2001), DBSCAN by Ester et al. (Citation1996), and CFSFDP by Rodriguez and Laio (Citation2014).

5.5. Results and Discussions

This section analyzes the results of the proposed method in comparison with existing clustering methods. First, the parameter ε is analyzed, since it is the key input parameter of the proposed method and setting it is an important challenge. As discussed, the choice of this parameter depends on the number of basic clusterings desired by the users. We therefore examined the effect of ε on the performance of the proposed method; as an example, this analysis is reported for the Wine and Iris datasets. As shown in part (a) of Figures 7 and 8, the number of basic clusterings generated by the MKM scheme decreases as ε increases. However, as shown in part (b) of these figures, the quality of the clustering results does not keep increasing, so ε should be increased only slightly. The results also clearly show that a fixed number of basic clusterings is not suitable: the number of clusterings needed to produce high-quality final clusters can be too high or too low. Therefore, an appropriate value of ε must be selected to control the number of basic clusterings on each dataset.

Figure 7. Analysis of the ε parameter of the proposed method on the Wine dataset. (a) number of basic clusters generated and (b) quality of clustering results.


Figure 8. Analysis of the ε parameter of the proposed method on the Iris dataset. (a) number of basic clusters generated and (b) quality of clustering results.


Next, the proposed method is evaluated against other hybrid clustering methods. Based on the NMI and ARI validity criteria, the performance of the different clustering methods is compared on the synthetic and real datasets. Table 3 shows the NMI results for the synthetic datasets and Table 4 the NMI results for the real datasets; the corresponding ARI comparisons are presented in Tables 5 and 6, respectively. Here, the last two columns give the average and Standard Deviation (SD) of each method on these datasets under the MKM scheme. As illustrated, the superiority of the proposed method in creating high-quality, high-accuracy clusters on the synthetic datasets is clear, and this is confirmed by the results of the compared clustering methods. The proposed method creates higher-quality clusters thanks to the use of the valid local clustering theory and of meta-clusters; it therefore identifies the final clusters more effectively and increases efficiency.

Table 3. Results of evaluations on synthetic datasets based on NMI metric.

Table 4. Results of evaluations on real datasets based on NMI metric.

Table 5. Results of evaluations on synthetic datasets based on ARI metric.

Table 6. Results of evaluations on real datasets based on ARI metric.

The performance of the proposed method is also better than that of the other methods on the real datasets, although the accuracy improvement on the real datasets is smaller than on the synthetic datasets. One of the most important reasons is that the dimensionality of the real datasets is much larger than that of the synthetic datasets. In addition, according to the results, most of the compared clustering methods perform better under the random scheme than under the MKM scheme, because each basic clustering generated by the MKM scheme covers only a local portion of the dataset, and these existing methods do not recognize or exploit such local information; hence they cannot obtain good ensemble results under the MKM scheme. The proposed method performs better under the MKM scheme than the other methods. Note that the proposed method runs only under the MKM scheme, because the MKM scheme is part of it. We observe that the proposed method under the MKM scheme outperforms, in terms of NMI and ARI, the other methods under the randomized scheme.

Next, the proposed method is evaluated against strong individual clustering methods. The results of comparing the proposed method with three strong clustering methods (i.e., NSC, DBSCAN, and CFSFDP) on the synthetic and real datasets are reported in Tables 7 and 8, respectively. Here, the last two rows give the mean and SD of each clustering method. As these experiments show, the clustering quality of the proposed method is better than, or comparable to, that of the other methods. The simulation results indicate that the proposed method can match the results of strong clustering methods, realizing the idea that "several weak clusterings equal a strong clustering."

Table 7. Results of evaluations on synthetic datasets.

Table 8. Results of evaluations on real datasets.

In another experiment, the computational complexity of the clustering methods is evaluated through runtime analysis. The efficiency of the proposed method was tested on the KDD-CUP99 dataset with k = 2 and ε = 0.14. The runtime of the method as a function of the number of samples (i.e., N) is shown in Table 9. It is evident that the number T of basic clusterings increases with the number of samples. Given the time complexity of the proposed method, the runtime is quadratic in T. However, since T < N and T grows slowly compared with N, the extra cost caused by the growth of T is acceptable. As depicted, the runtime of the proposed method grows approximately linearly with the number of samples. Therefore, the proposed method can quickly obtain the final clustering on large-scale datasets; as illustrated, it is very efficient.

Table 9. Performance of the proposed method based on runtime (s) on the KDD-CUP99 dataset.

6. Conclusion

Among clustering approaches, hybrid clustering is one of the popular methods with high stability and robustness, which provides the ability to discover hidden patterns with high accuracy. Hybrid clustering can adapt itself to the input dataset by using the knowledge of different methods and increase the quality of the final solution. The differing quality of the partitions produced by basic clustering methods is one of the motivations of hybrid clustering, which can achieve better results by combining them. Although k-means is a weak clustering method, it has a low computational cost, which makes it well suited as a basic clusterer. Therefore, this study used the k-means clustering method as the basic clusterer. We presented a definition of valid local clusters based on the data around the cluster centers in k-means. To increase the diversity of the primary clusters, we repeatedly re-clustered the data that had not yet appeared in any valid local cluster. Also, we used inter-cluster and intra-cluster similarity measures to estimate the similarity between valid local clusters. This process produces a weighted graph in which the edge weights express the degree of similarity between clusters. An aggregation function based on meta-clustering was used to create the final clusters, in which the primary clusters were re-clustered to obtain the final clusters. In general, the idea of the proposed method is to realize, with k-means, the notion that several weak clusterings equal a strong clustering. The results obtained from the proposed hybrid clustering method are more consistent with the real data structure, and the method reports better results than state-of-the-art methods on different datasets. Based on the results, the proposed method is effective for dealing with large-scale datasets. Following the concept of granular computing, how to extract the relationship between basic clustering methods and primary partitions is worth studying in future work. The proposed method could also become more effective with feature extraction/selection approaches. Finally, it is recommended to use diversity-increasing techniques such as bagging to select suitable basic clustering methods in future work.

Data availability

Data sharing not applicable to this manuscript as no datasets were generated or analyzed during the current study.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work was supported by Training plan for young backbone teachers in Henan Province (No.2018GGJS267).

References

  • Abapour, N., A. Shafiesabet, and R. Mahboub. 2021. A novel security based routing method using ant colony optimization algorithms and RPL protocol in the IoT networks. International Journal of Electrical and Computer Sciences 3 (1):1–9.
  • Azimi, J., and X. Fern. 2009. Adaptive cluster ensemble selection. In Twenty-First International Joint Conference on Artificial Intelligence, Vol. 9, 992–7, California, USA, July 11–17.
  • Bai, L., J. Liang, and F. Cao. 2020. A multiple k-means clustering ensemble algorithm to find nonlinearly separable clusters. Information Fusion 61:36–47. doi:10.1016/j.inffus.2020.03.009.
  • Berahmand, K., E. Nasiri, R. Pir Mohammadiani, and Y. Li. 2021. Spectral clustering on protein-protein interaction networks via constructing affinity matrix using attributed graph embedding. Computers in Biology and Medicine 138:104933.
  • Bouyer, A., and A. Hatamlou. 2018. An efficient hybrid clustering method based on improved cuckoo optimization and modified particle swarm optimization algorithms. Applied Soft Computing 67:172–82. doi:10.1016/j.asoc.2018.03.011.
  • Ester, M., H. P. Kriegel, J. Sander, and X. Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD-96, 226–31, Portland, Oregon, USA, August 2–4.
  • Forouzandeh, S., K. Berahmand, E. Nasiri, and M. Rostami. 2021. A hotel recommender system for tourists using the Artificial Bee Colony Algorithm and Fuzzy TOPSIS Model: A case study of tripadvisor. International Journal of Information Technology & Decision Making 20 (1):399–429. doi:10.1142/S0219622020500522.
  • Fred, A. L., and A. K. Jain. 2005. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (6):835–50.
  • Ghobaei-Arani, M. 2021. A workload clustering based resource provisioning mechanism using Biogeography based optimization technique in the cloud based systems. Soft Computing 25 (5):3813–30. doi:10.1007/s00500-020-05409-2.
  • Ghobaei-Arani, M., and A. Shahidinejad. 2021. An efficient resource provisioning approach for analyzing cloud workloads: a metaheuristic-based clustering approach. The Journal of Supercomputing 77 (1):711–50. doi:10.1007/s11227-020-03296-w.
  • Golalipour, K., E. Akbari, S. S. Hamidi, M. Lee, and R. Enayatifar. 2021. From clustering to clustering ensemble selection: A review. Engineering Applications of Artificial Intelligence 104:104388. doi:10.1016/j.engappai.2021.104388.
  • Golrou, A., A. Sheikhani, A. M. Nasrabadi, and M. R. Saebipour. 2018. Enhancement of sleep quality and stability using acoustic stimulation during slow wave sleep. International Clinical Neuroscience Journal 5 (4):126–34. doi:10.15171/icnj.2018.25.
  • Hamidi, S. S., E. Akbari, and H. Motameni. 2019. Consensus clustering algorithm based on the automatic partitioning similarity graph. Data & Knowledge Engineering 124:101754. doi:10.1016/j.datak.2019.101754.
  • Hansen, P., and N. Mladenović. 2001. J-means: A new local search heuristic for minimum sum of squares clustering. Pattern Recognition 34 (2):405–13. doi:10.1016/S0031-3203(99)00216-2.
  • Huang, D., C. D. Wang, and J. H. Lai. 2017. LWMC: A locally weighted meta-clustering algorithm for ensemble clustering. In International Conference on Neural Information Processing, 167–76. Cham: Springer.
  • Huang, D., C. D. Wang, J. S. Wu, J. H. Lai, and C. K. Kwoh. 2020. Ultra-scalable spectral clustering and ensemble clustering. IEEE Transactions on Knowledge and Data Engineering 32 (6):1212–26. doi:10.1109/TKDE.2019.2903410.
  • Iam-On, N., T. Boongoen, S. Garrett, and C. Price. 2011. A link-based approach to the cluster ensemble problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (12):2396–409. doi:10.1109/TPAMI.2011.84.
  • Jadidi, A., and M. R. Dizadji. 2021. Node clustering in binary asymmetric stochastic block model with noisy label attributes via SDP. In 2021 International Conference on Smart Applications, Communications and Networking (SmartNets), 1–6. New York: IEEE. doi:10.1109/SmartNets50376.2021.9555421.
  • Jain, A. K. 2010. Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31 (8):651–66. doi:10.1016/j.patrec.2009.09.011.
  • Jiang, H., S. Yi, J. Li, F. Yang, and X. Hu. 2010. Ant clustering algorithm with K-harmonic means clustering. Expert Systems with Applications 37 (12):8679–84. doi:10.1016/j.eswa.2010.06.061.
  • Khedairia, S., and M. T. Khadir. 2022. A multiple clustering combination approach based on iterative voting process. Journal of King Saud University – Computer and Information Sciences 34 (1):1370–80. doi:10.1016/j.jksuci.2019.09.013.
  • Li, F., Y. Qian, and J. Wang. 2021. GoT: A growing tree model for clustering ensemble. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 8349–56, California, USA, February 2–9.
  • Li, T., A. Rezaeipanah, and E. M. T. El Din. 2022. An ensemble agglomerative hierarchical clustering algorithm based on clusters clustering technique and the novel similarity measurement. Journal of King Saud University – Computer and Information Sciences 34 (6):3828–42. doi:10.1016/j.jksuci.2022.04.010.
  • Ma, T., Z. Zhang, L. Guo, X. Wang, Y. Qian, and N. Al-Nabhan. 2021. Semi-supervised Selective Clustering Ensemble based on constraint information. Neurocomputing 462:412–25. doi:10.1016/j.neucom.2021.07.056.
  • Mojarad, M., F. Sarhangnia, A. Rezaeipanah, H. Parvin, and S. Nejatian. 2021. Modeling hereditary disease behavior using an innovative similarity criterion and ensemble clustering. Current Bioinformatics 16 (5):749–64. doi:10.2174/1574893616999210128175715.
  • Movahhed Neya, N., S. Saberi, and B. Rezaie. 2022. Design of an adaptive controller to capture maximum power from a variable speed wind turbine system without any prior knowledge of system parameters. Transactions of the Institute of Measurement and Control 44 (3):609–19. doi:10.1177/01423312211039041.
  • Nasiri, E., K. Berahmand, Z. Samei, and Y. Li. 2022. Impact of centrality measures on the common neighbors in link prediction for multiplex networks. Big Data 10 (2):138–50.
  • Ng, A., M. Jordan, and Y. Weiss. 2001. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14:849–56.
  • Nguyen, N., and R. Caruana. 2007. Consensus clusterings. In Seventh IEEE International Conference on Data Mining (ICDM 2007), 607–12. New York: IEEE. doi:10.1109/ICDM.2007.73.
  • Niu, H., N. Khozouie, H. Parvin, H. Alinejad-Rokny, A. Beheshti, and M. R. Mahmoudi. 2020. An ensemble of locally reliable cluster solutions. Applied Sciences 10 (5):1891. doi:10.3390/app10051891.
  • Rezaeipanah, A., P. Amiri, H. Nazari, M. Mojarad, and H. Parvin. 2021. An energy-aware hybrid approach for wireless sensor networks using re-clustering-based multi-hop routing. Wireless Personal Communications 120 (4):3293–314. doi:10.1007/s11277-021-08614-w.
  • Rezaeipanah, A., H. Nazari, and G. Ahmadi. 2019. A hybrid approach for prolonging lifetime of wireless sensor networks using genetic algorithm and online clustering. Journal of Computing Science and Engineering 13 (4):163–74. doi:10.5626/JCSE.2019.13.4.163.
  • Rodriguez, A., and A. Laio. 2014. Clustering by fast search and find of density peaks. Science (New York, N.Y.) 344 (6191):1492–6. doi:10.1126/science.1242072.
  • Shahidinejad, A., M. Ghobaei-Arani, and L. Esmaeili. 2020. An elastic controller using Colored Petri Nets in cloud computing environment. Cluster Computing 23 (2):1045–71. doi:10.1007/s10586-019-02972-8.
  • Strehl, A., and J. Ghosh. 2002. Cluster ensembles – A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3:583–617.
  • Sun, S., S. Wang, G. Zhang, and J. Zheng. 2018. A decomposition-clustering-ensemble learning approach for solar radiation forecasting. Solar Energy 163:189–99. doi:10.1016/j.solener.2018.02.006.
  • Tan, H., Y. Tian, L. Wang, and G. Lin. 2020. Name disambiguation using meta clusters and clustering ensemble. Journal of Intelligent & Fuzzy Systems 38 (2):1559–68. doi:10.3233/JIFS-179519.
  • Topchy, A., A. K. Jain, and W. Punch. 2005. Clustering ensembles: Models of consensus and weak partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (12):1866–81. doi:10.1109/TPAMI.2005.237.
  • Trik, M., A. M. N. G. Molk, F. Ghasemi, and P. Pouryeganeh. 2022. A hybrid selection strategy based on traffic analysis for improving performance in networks on chip. Journal of Sensors 2022:1–19. doi:10.1155/2022/3112170.
  • Trik, M., S. Pour Mozaffari, and A. M. Bidgoli. 2021. Providing an adaptive routing along with a hybrid selection strategy to increase efficiency in NoC-based neuromorphic systems. Computational Intelligence and Neuroscience 2021:8338903. doi:10.1155/2021/8338903.
  • Walid, W., M. Awais, A. Ahmed, G. Masera, and M. Martina. 2021. Real-time implementation of fast discriminative scale space tracking algorithm. Journal of Real-Time Image Processing 18 (6):2347–60. doi:10.1007/s11554-021-01119-6.
  • Wei, S., Z. Li, and C. Zhang. 2018. Combined constraint-based with metric-based in semi-supervised clustering ensemble. International Journal of Machine Learning and Cybernetics 9 (7):1085–100. doi:10.1007/s13042-016-0628-6.
  • Wei, Y., S. Sun, J. Ma, S. Wang, and K. K. Lai. 2019. A decomposition clustering ensemble learning approach for forecasting foreign exchange rates. Journal of Management Science and Engineering 4 (1):45–54.
  • Yang, W., Y. Zhang, H. Wang, P. Deng, and T. Li. 2021. Hybrid genetic model for clustering ensemble. Knowledge-Based Systems 231:107457. doi:10.1016/j.knosys.2021.107457.
  • Zhang, B., M. Hsu, and U. Dayal. 2000. K-harmonic means-a spatial clustering algorithm with boosting. In International Workshop on Temporal, Spatial, and Spatio-Temporal Data Mining, 31–45. Berlin, Heidelberg: Springer.
  • Zhao, Q., Y. Zhu, D. Wan, Y. Yu, and Y. Lu. 2019. Similarity analysis of small-and medium-sized watersheds based on clustering ensemble model. Water 12 (1):69. doi:10.3390/w12010069.
  • Zheng, Y., Z. Long, C. Wei, and H. Wang. 2021. Particle swarm optimization for clustering ensemble. In 2021 16th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), 385–91. New York: IEEE. doi:10.1109/ISKE54062.2021.9755338.
  • Zhou, Z. H., and W. Tang. 2006. Clusterer ensemble. Knowledge-Based Systems 19 (1):77–83. doi:10.1016/j.knosys.2005.11.003.
  • Zhu, X., B. Fei, D. Liu, and W. Bao. 2021. Adaptive clustering ensemble method based on uncertain entropy decision-making. In 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 61–7. New York: IEEE. doi:10.1109/TrustCom53373.2021.00026.