Research Article

Research on hybrid intrusion detection based on improved Harris Hawk optimization algorithm

Article: 2195595 | Received 18 Nov 2022, Accepted 22 Mar 2023, Published online: 19 Apr 2023

Abstract

Traditional intrusion detection methods classify network traffic data types with low accuracy. To enhance detection capability, we propose a hybrid intrusion detection method based on an improved Harris Hawk optimization algorithm. The improved algorithm is used as a feature selection scheme to reduce the impact of redundant and noisy features on the performance of the classification model. It introduces the Singer map to initialise the population, uses multi-information fusion to obtain the best prey position, and applies a sine function-based escape energy to execute the prey search strategy and obtain the optimal subset of features. In addition, the original data is preprocessed by the k-nearest neighbour algorithm and a deep denoising autoencoder (KNN-DDAE) to relieve the imbalance problem of the network traffic data. Finally, a deep neural network (DNN) is used to complete the classification. Simulation experiments are conducted on the NSL-KDD, KDD CUP99, and UNSW-NB15 datasets. The results show that our feature selection and data balancing scheme improves the detection accuracy, and that the detection performance of this method is better than that of current popular intrusion detection schemes.

1. Introduction

At present, network attacks occur frequently and seriously threaten the privacy and security of users. According to the Cyber Attack Trends: 2022 Mid-Year Report released by Point (Citation2022), global weekly cyberattacks increased by 42%, causing great damage to people's daily lives. Industrial control systems are an important part of major infrastructure, and a network attack on them can cause serious consequences such as major economic losses and casualties. How to protect networks from attacks has become an urgent problem.

Intrusion detection technology has made a significant contribution to network protection. Its essence is to classify network traffic through data processing and modelling analysis. Scholars apply traditional machine learning algorithms to solve network security problems (Ahmim et al., Citation2018), such as the Decision Tree algorithm (Kunhare et al., Citation2020), SVM (Binbusayyis & Vaiyapuri, Citation2021), and the Random Forest algorithm (Alzubi et al., Citation2022). However, as attack traffic becomes more and more complex, traditional machine learning algorithms perform poorly in extracting high-dimensional features of the data.

Deep learning (Shrestha & Mahmood, Citation2019) is the product of the further development of machine learning, and the features it extracts have stronger generalisation performance. It is widely used in many fields (Diao et al., Citation2022; Hu et al., Citation2022; Li et al., Citation2022; Liang et al., Citation2021).

Ahmad et al. (Citation2021) show that deep learning is more suitable for intrusion detection than traditional machine learning methods. For example, Andresini, Appice, Malerba et al. (Citation2021) proposed a technique based on a convolutional neural network (CNN) that combines nearest neighbour search with a clustering algorithm to generate image information of network flows, retaining other categories of potential data that have a significant impact on the results. However, original information may be lost when the features are converted through the autoencoder, which affects the performance of the model. Kanna and Santhi (Citation2021) learned temporal features of network traffic data with a Long Short-Term Memory network (LSTM) and spatial features with a CNN, and adjusted the hyperparameters of the CNN using the Lion Swarm Optimization (LSO) algorithm. The intrusion detection performance of this approach was demonstrated on the ISCX-IDS and UNSW-NB15 datasets, but optimising the CNN hyperparameters can greatly increase the system overhead. Gavel et al. (Citation2022) used mutual information to select the important features and a Kernel Extreme Learning Machine (KELM) as the classifier for intrusion detection, and this scheme obtains high detection accuracy.

Deep learning-based intrusion detection shows better results than traditional machine learning, but problems such as the high dimensionality of network traffic data and the imbalance of data categories prevent further improvement of the classification capability of the model. Feature selection and data balancing can address these problems. However, the existing feature selection schemes cannot effectively extract the best feature subset. In addition, most researchers address data imbalance by generating minority class samples without comprehensively considering the overlap between majority class and minority class samples, which makes it difficult for the model to separate them.

Aiming at the above problems, we propose a hybrid intrusion detection model based on the improved Harris Hawk algorithm. The contributions of this research are summarised as follows:

  1. This work proposes an improved Harris Hawk algorithm for feature selection to obtain the optimal feature subset of network traffic data. We compared the original data and the data after feature selection on three benchmark data sets.

  2. We upsample the data using DDAE and downsample it using the KNN algorithm. The classification accuracy of the model is further improved by removing the interference data through the data balancing method.

  3. To verify the effectiveness of the model, we use the well-known NSL-KDD, KDD CUP99, and UNSW-NB15 datasets for performance evaluation.

The rest of the paper is organised as follows. In Section 2, we review feature selection and balance processing methods in the field of intrusion detection and briefly introduce the Harris Hawk algorithm. In Section 3, we describe in detail the improved Harris Hawk algorithm and the method for balancing the data. In Section 4, we conduct experiments and analyse the performance of the model. Section 5 summarises the paper and briefly discusses future research directions.

2. Related work

2.1. Data processing for intrusion detection

Most of the datasets used for intrusion detection system evaluation contain a large amount of feature information. Irrelevant and redundant features may degrade the performance of the intrusion detection system, and feature selection is a common method of selecting the important features. In the literature on intrusion detection, scholars have used feature selection schemes such as wrapper methods, embedded methods, and filter methods to obtain a subset of features that can further improve the performance of intrusion detection systems.

Long et al. (Citation2022) used a random attention-based data fusion method to remove redundant features, and then used a semi-supervised ladder network model for intrusion detection. Talita et al. (Citation2021) used particle swarm optimisation (PSO) for feature selection and validated it with a Naïve Bayes classifier. The experiment shows that the best classification results are achieved when 38 features are used. However, because the method was tested on only one dataset, it is not possible to determine whether it is representative. Li et al. (Citation2020) applied random forest and the AP clustering algorithm to group the selected features into multiple feature subsets, and then trained the same number of autoencoders as there were subsets. The results show that this method can speed up the training and testing process. Li et al. (Citation2021) proposed an improved krill herd (KH) optimisation algorithm for feature selection on the NSL-KDD and CICIDS-2017 datasets, which retained an average of 7 features and 10.2 features on the two datasets respectively. It effectively eliminates redundant features and ensures high detection accuracy. However, the authors did not take into account the problem of data imbalance, resulting in low classification accuracy of the model for minority data. Mojtahedi et al. (Citation2022) combined the whale optimisation algorithm (WOA) and genetic algorithm (GA) for feature selection and used KNN for classification, which gave better results than previous methods. In addition, to address data imbalance, Chuang and Wu (Citation2019) used a deep variational autoencoder to generate more minority data. The results demonstrate that the method improves the classification performance on the minority classes. Huang and Lei (Citation2020) proposed an imbalanced generative adversarial network (IGAN), which introduces an imbalanced data filter and convolutional layers into the typical GAN and generates more representative data for minority classes. However, network traffic has many features, and these methods do not consider the negative impact of redundant or noisy features on model accuracy. Table 1 summarises the works discussed above.

Table 1. Comparison of intrusion detection schemes.

According to the above literature, intrusion detection data has the characteristics of high dimensions and extreme imbalance between attack data and normal data. Many intrusion detection methods need feature selection and data balance to reduce the training time of the model and improve classification accuracy. The feature selection and data balance processing scheme proposed in this paper ensures the accuracy of the detection.

2.2. Harris Hawk optimization algorithm

In 2019, Heidari et al. proposed the Harris Hawk optimization algorithm (HHO) (Heidari et al., Citation2019). The algorithm executes the optimal hunting strategy through the information exchange between Harris Hawks, and finally captures the prey. It consists of three parts: the search phase, the conversion phase between search and development, and the development phase. The specific steps are as follows.

2.2.1. Search phase

In this stage, the Harris Hawks perch at random locations and search for prey within the range $[lb, ub]$. The prey is located by formulas (1) and (2), and the strategy used to update the individual position is chosen through $q$.

$$X(t+1)=\begin{cases} X_{rand}(t)-r_1\left|X_{rand}(t)-2r_2X(t)\right|, & q\ge 0.5\\ \left[X_{rabbit}(t)-X_m(t)\right]-r_3\left[lb+r_4(ub-lb)\right], & q<0.5 \end{cases} \tag{1}$$

$$X_m(t)=\frac{1}{M}\sum_{k=1}^{M}X_k(t) \tag{2}$$

Among them, $X(t)$ represents the current individual's location, $X_{rand}(t)$ is the position of an individual randomly selected from the current population, $X_{rabbit}(t)$ represents the current position of the individual with the best fitness, $X_m(t)$ is the average position of individuals in the current population, $r_1, r_2, r_3, r_4, q \in (0,1)$ are random numbers, and $M$ is the population size.
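The search-phase update in formulas (1) and (2) can be illustrated with a short NumPy sketch. The function name `exploration_step`, the array layout, and the final clipping to $[lb, ub]$ are assumptions made for illustration; the text does not state how out-of-range positions are handled.

```python
import numpy as np

def exploration_step(X, X_rabbit, lb, ub, rng):
    """One search-phase update of every hawk, following formulas (1) and (2).

    X        : (M, D) array of current positions
    X_rabbit : (D,) position of the individual with the best fitness
    lb, ub   : lower and upper bounds of the search range
    rng      : numpy.random.Generator
    """
    M, _ = X.shape
    X_mean = X.mean(axis=0)                     # formula (2): average position
    X_new = np.empty_like(X)
    for i in range(M):
        q, r1, r2, r3, r4 = rng.random(5)       # random numbers in [0, 1)
        if q >= 0.5:
            X_rand = X[rng.integers(M)]         # randomly selected individual
            X_new[i] = X_rand - r1 * np.abs(X_rand - 2.0 * r2 * X[i])
        else:
            X_new[i] = (X_rabbit - X_mean) - r3 * (lb + r4 * (ub - lb))
    # Assumption: positions leaving the search range are clipped back into it.
    return np.clip(X_new, lb, ub)

rng = np.random.default_rng(0)
hawks = rng.uniform(0.0, 1.0, size=(6, 4))
updated = exploration_step(hawks, hawks[0], 0.0, 1.0, rng)
```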

2.2.2. Search and development transformation

To ensure the Harris Hawk algorithm can accurately find the optimal solution, it is necessary to enter the local search after the global search. The conversion between the two search methods is realised through the prey's escape energy $E$, as shown in formula (3):

$$E=2E_0\left(1-\frac{t}{T}\right) \tag{3}$$

Among them, $E$ represents the energy required for the prey to escape, $E_0$ is a random number in $(-1,1)$, $t$ is the current iteration number, and $T$ is the maximum number of iterations.
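As a minimal sketch of formula (3), the helper below (the name `escape_energy` is hypothetical) draws $E_0$ uniformly from $(-1,1)$ and scales it by the remaining fraction of iterations. The magnitude of $E$ is bounded by $2(1-t/T)$, so it can only exceed 1 during the first half of the run and shrinks to 0 at $t=T$.

```python
import random

def escape_energy(t, T, rng=random):
    """Escape energy of formula (3): E = 2 * E0 * (1 - t / T)."""
    E0 = rng.uniform(-1.0, 1.0)     # initial energy, redrawn at every call
    return 2.0 * E0 * (1.0 - t / T)

# The bound 2 * (1 - t / T) decays linearly with the iteration counter.
for t in (0, 25, 50, 75, 100):
    print(t, round(escape_energy(t, 100), 3))
```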

2.2.3. Development stage

At this stage, the Harris Hawk determines the location of the prey and forms a circle around the prey to execute the assault strategy. In this process, the prey will try different strategies to escape the pursuit after sensing the danger. To hunt more accurately, the Harris Hawk algorithm adopts four strategies, chosen by the escape energy $E$ and a random number $r$, to deal with the escape behaviour of the prey.

Soft siege strategy: when $0.5\le|E|<1$ and $r\ge 0.5$, the prey has sufficient energy and tries to get rid of the siege. At this moment, the Harris Hawk updates the position by executing the soft siege strategy, as shown in formula (4):

$$X(t+1)=\Delta X(t)-E\left|JX_{rabbit}(t)-X(t)\right| \tag{4}$$

Among them, $\Delta X(t)$ represents the difference between the optimal individual position and the current individual position, and $J$ denotes the random jumping distance during the prey's escape, which is between 0 and 2.

Hard siege strategy: when $|E|<0.5$ and $r\ge 0.5$, the prey does not have enough energy to get rid of the siege, and the Harris Hawk executes a hard siege strategy to update the position, as shown in formula (5):

$$X(t+1)=X_{rabbit}(t)-E\left|\Delta X(t)\right| \tag{5}$$

Soft encircling strategy with asymptotic rapid dive: when $0.5\le|E|<1$ and $r<0.5$, the prey has sufficient energy to get rid of the encirclement, and the Harris Hawk executes the soft encircling strategy with asymptotic rapid dive to capture the prey, as shown in formula (6):

$$X(t+1)=\begin{cases} Y=X_{rabbit}(t)-E\left|JX_{rabbit}(t)-X(t)\right|, & f(Y)<f(X(t))\\ Z=Y+S\times LF(D), & f(Z)<f(X(t)) \end{cases} \tag{6}$$

Among them, $f$ is the fitness function, $D$ is the dimension of the problem, $S$ is a randomly generated $D$-dimensional vector with values between 0 and 1, and $LF$ is the Levy flight formula.
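The two siege updates that do not involve a dive, formulas (4) and (5), can be sketched as follows. The function name `siege_update` is hypothetical, and drawing $J$ uniformly from $[0, 2)$ is an assumption based on the stated range of the jumping distance.

```python
import numpy as np

def siege_update(X_i, X_rabbit, E, rng):
    """Position update for r >= 0.5: soft siege (4) or hard siege (5)."""
    delta = X_rabbit - X_i                      # ΔX(t): best position minus current position
    if abs(E) >= 0.5:
        # Soft siege, formula (4); J is the random jumping distance in [0, 2).
        J = rng.uniform(0.0, 2.0)
        return delta - E * np.abs(J * X_rabbit - X_i)
    # Hard siege, formula (5).
    return X_rabbit - E * np.abs(delta)

rng = np.random.default_rng(0)
hard = siege_update(np.zeros(3), np.ones(3), 0.2, rng)   # every entry is 1 - 0.2 * 1
soft = siege_update(np.zeros(3), np.ones(3), 0.8, rng)
```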

Hard encircling strategy with asymptotic rapid dive: when $|E|<0.5$ and $r<0.5$, the energy of the prey is insufficient, but it still has a chance to get rid of the pursuit. The Harris Hawk executes the hard encircling strategy with asymptotic rapid dive to update the position, as shown in formula (7):

$$X(t+1)=\begin{cases} Y=X_{rabbit}(t)-E\left|JX_{rabbit}(t)-X_m(t)\right|, & f(Y)<f(X(t))\\ Z=Y+S\times LF(D), & f(Z)<f(X(t)) \end{cases} \tag{7}$$

In the iterative process of the population, the objective function used in this paper is shown in formula (8):

$$Fun=\alpha(1-acc)+(1-\alpha)\frac{N_{feat}}{M_{feat}} \tag{8}$$

Among them, $\alpha$ is set to 0.999, $N_{feat}$ is the number of selected features, $M_{feat}$ is the total number of features, and $acc$ represents the test accuracy obtained with the currently selected features.
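A small worked example may help to show how formula (8) trades accuracy against subset size. The function name `fitness` and the sample numbers (95% accuracy with 20 of 41 features) are illustrative assumptions, not results from the paper.

```python
def fitness(acc, n_selected, n_total, alpha=0.999):
    """Objective of formula (8); smaller values are better."""
    return alpha * (1.0 - acc) + (1.0 - alpha) * (n_selected / n_total)

# Error term: 0.999 * 0.05 = 0.04995; size term: 0.001 * 20 / 41, about 0.00049.
value = fitness(acc=0.95, n_selected=20, n_total=41)
print(round(value, 5))
```

With $\alpha = 0.999$ the error term dominates, so the number of selected features mainly acts as a tie-breaker between subsets of similar accuracy.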

3. A hybrid intrusion detection model based on improved Harris Hawk algorithm

In reality, most network traffic data is normal, and some attack traffic has a low number of records. In addition, each sample of network traffic data contains many features, but not all features contain correct information about the class to which the sample belongs. According to these characteristics of network traffic data, a hybrid intrusion detection method is used to improve detection accuracy. Figure 1 summarises the overall structure of the model. First, we preprocess the original data, and the improved Harris Hawk optimization algorithm performs feature selection on the preprocessed data. Meanwhile, KNN-DDAE is applied to the preprocessed data to form new data, and the best subset of features is then applied to the new data. Finally, the data is fed into the DNN to train a high-performance classification model.

Figure 1. The hybrid model proposed in this paper.

3.1. Improved Harris Hawk optimization algorithm

During iteration, the Harris Hawk algorithm easily falls into a local optimal solution and converges slowly. To address these problems, we improve the traditional Harris Hawk algorithm in three aspects: the population initialisation stage, the optimal prey position, and the energy function. The improved algorithm achieves a better balance between global search and local search, and therefore has a greater chance of obtaining the optimal solution.

3.1.1. Singer chaos map

In the population initialisation stage, initialising the Harris Hawk population with traditional pseudo-random numbers has the problem of uneven distribution, which may impact the quality of the optimal solution. We use the Singer chaotic map (Ibrahim et al., Citation2017) to initialise the Harris Hawk population. The generated population has a more symmetrical probability distribution and often achieves better results in finding the optimal solution. The Singer chaotic map is shown in formula (9):

$$z_{k+1}=\mu\left(7.86z_k-23.31z_k^2+28.75z_k^3-13.302875z_k^4\right) \tag{9}$$

Among them, $z_k\in(0,1)$ is the $k$-th value of the chaotic sequence and $\mu$ is the control parameter. When $\mu\in[0.9,1.08]$, the Singer map exhibits chaotic behaviour.
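The following sketch shows one way formula (9) could be used to initialise a population. The function names, the seed value `z0 = 0.7`, the choice `mu = 1.07`, and the linear mapping of the chaotic values onto $[lb, ub]$ are all assumptions for illustration; the text does not specify them.

```python
import numpy as np

def singer_sequence(n, z0=0.7, mu=1.07):
    """Iterate the Singer map of formula (9) n times starting from z0."""
    z = np.empty(n)
    zk = z0
    for k in range(n):
        zk = mu * (7.86 * zk - 23.31 * zk**2 + 28.75 * zk**3 - 13.302875 * zk**4)
        z[k] = zk
    return z

def init_population(M, D, lb, ub, z0=0.7, mu=1.07):
    """Map a chaotic sequence of length M * D onto M hawks in [lb, ub]^D."""
    z = singer_sequence(M * D, z0, mu).reshape(M, D)
    return lb + z * (ub - lb)      # assumption: linear scaling of z into the range

population = init_population(M=10, D=5, lb=-1.0, ub=1.0)
```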

3.1.2. Multi-Information fusion

During population iteration in the Harris Hawk algorithm, the optimal prey position only considers the information of the current optimal individual and ignores the information of other individuals, which may also have an important impact on the results. Because its information comes from a single source, the algorithm is prone to a lack of population diversity, which ultimately leads to poor quality of the optimal solution. According to the change rule of the objective function in the iterative process, this paper introduces two strategies for fusing non-optimal individual information to ensure the convergence accuracy of the algorithm, and executes the relevant strategy under different conditions to change the current prey location, as shown in formulas (10)–(12):

$$X_{rabbit}(i)=\mu x_m+(1-\mu)x_n,\quad m,n\in(2,\hat{N}) \tag{10}$$

$$X_{rabbit}(i)=\omega_1X_{rabbit}(t)+(1-\omega_1)X_{rabbit}(i),\quad fit[t]<fit[t-1] \tag{11}$$

$$X_{rabbit}(i)=\omega_2X_{rabbit}(t)+(1-\omega_2)X_{rabbit}(i),\quad \left|fit[t]-fit[t-1]\right|<m \tag{12}$$

Among them, $X_{rabbit}(i)$ in formula (10) is the new $i$-th individual position formed by fusing two individuals with non-optimal fitness according to different weights. $X_{rabbit}(t)$ stands for the position of the current optimal individual, which is the current prey position. The indices $m, n$ in formula (10) identify individuals of the population and are non-repeating random integers between 2 and $\hat{N}$, where $\hat{N}=N/2-1$. $fit[t]$ represents the best fitness value of the $t$-th iteration, $\omega_1\in(0.3,0.4)$, $\omega_2\in(0.5,0.7)$, and $\mu\in(0,1)$. The $m$ on the right-hand side of formula (12) is the threshold on the change in best fitness. The left-hand side of formulas (11) and (12) is the new optimal prey position formed by fusing this information. The best prey position is updated when the best fitness value of the current round is less than that of the previous round, or when the difference between the best fitness values of the two rounds is less than the threshold. Otherwise, the multi-information fusion strategy is not executed.
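One possible reading of formulas (10)–(12) is sketched below. It assumes that individuals are ranked by fitness (best first, for a minimised objective) so that ranks 2 to $\hat{N}$ are the non-optimal candidates, and that the weights are drawn uniformly from their stated intervals. The function name `fuse_prey` and the `threshold` argument are hypothetical.

```python
import numpy as np

def fuse_prey(population, fitness, X_rabbit, fit_now, fit_prev, threshold, rng):
    """Return the prey position after multi-information fusion, formulas (10)-(12)."""
    N = len(population)
    N_hat = N // 2 - 1
    if N_hat < 3:
        raise ValueError("need at least 8 individuals to draw two distinct ranks")
    order = np.argsort(fitness)                 # assumption: rank 1 is the best individual
    # Ranks 2..N_hat (1-based) are positions 1..N_hat-1 in the 0-based ordering.
    m, n = rng.choice(np.arange(1, N_hat), size=2, replace=False)
    mu = rng.random()
    fused = mu * population[order[m]] + (1.0 - mu) * population[order[n]]   # formula (10)
    if fit_now < fit_prev:                      # condition of formula (11)
        w = rng.uniform(0.3, 0.4)
    elif abs(fit_now - fit_prev) < threshold:   # condition of formula (12)
        w = rng.uniform(0.5, 0.7)
    else:
        return X_rabbit                         # fusion is not executed this round
    return w * X_rabbit + (1.0 - w) * fused

rng = np.random.default_rng(1)
pop = rng.uniform(0.0, 1.0, size=(10, 3))
fit = pop.sum(axis=1)
best = pop[np.argmin(fit)]
unchanged = fuse_prey(pop, fit, best, 0.5, 0.1, 1e-3, rng)
fused_prey = fuse_prey(pop, fit, best, 0.1, 0.5, 1e-3, rng)
```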

3.1.3. Escape energy based on sine function

In the standard HHO algorithm, the escape energy factor $E$ controls which search strategy the algorithm performs, and the search is fully converted to a local search in the later iterations. With the traditional escape energy factor, the proportion of local search increases greatly in the middle and late iterations, which makes the algorithm easily sink into a local optimum. We propose an escape energy factor $E$ based on the sine function to overcome this shortcoming. Here $r_1, r_2$ are random numbers between 0 and 1, and $t$, $T$ represent the current iteration number and the maximum number of iterations respectively. The energy factor update is shown in formula (13):

$$E=\frac{\left(1.5(T-t)+r_1\right)\sin\left(2(T-t)+r_2\right)}{T} \tag{13}$$

The escape energy based on the sine function used in this paper is shown in Figure 2.
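The sketch below evaluates a sine-based escape energy under one reading of formula (13), namely $E = (1.5(T-t)+r_1)\,\sin(2(T-t)+r_2)/T$. This reading and the function name `sine_escape_energy` are assumptions, because the formula is flattened in the source text. Under this reading, $|E|$ is bounded by the envelope $(1.5(T-t)+1)/T$, while the sine term keeps the sign and magnitude oscillating, so large values of $|E|$ remain possible later in the run than with the linear decay of formula (3).

```python
import math
import random

def sine_escape_energy(t, T, rng=random):
    """Escape energy under the assumed reading of formula (13)."""
    r1, r2 = rng.random(), rng.random()
    return (1.5 * (T - t) + r1) * math.sin(2.0 * (T - t) + r2) / T

T = 100
energies = [sine_escape_energy(t, T) for t in range(T + 1)]
```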

Figure 2. Escape energy based on sine function.

We used the NSL-KDD dataset to compare the fitness of the original Harris Hawk algorithm and the improved Harris Hawk algorithm. Figure 3 shows the specific results: the improved Harris Hawk algorithm converges faster than the original Harris Hawk algorithm and obtains smaller fitness values. The pseudocode of the improved Harris Hawk optimization algorithm is described as Algorithm 1.

Figure 3. Comparison between improved HHO and original HHO.

3.2. KNN data downsampling

Data imbalance processing includes data upsampling and downsampling. Random downsampling removes a randomly chosen portion of the data to achieve balance. However, category-balanced data can still produce poor classification performance, because random removal does not take into account that samples of different categories may overlap in the same feature space. Some researchers transform the original data through feature mapping to increase the distance between different categories, but this scheme risks losing original data information.

According to the characteristics of intrusion detection datasets, this paper performs KNN downsampling on majority class data to solve the above problems. When a neighbour of a minority class sample belongs to a majority class, that majority class sample is deleted. By processing the majority class data in this way, the overlap between different classes is reduced and the minority class data can be expressed more accurately, which makes the classes easier for the classifier to distinguish.

The downsampling is based on the KNN algorithm. For a given set of data $D=\{x_1,x_2,\dots,x_i\}$, the Minkowski distance is used to find the $K$ nearest instances of the sample to be tested, and the current sample is assigned to the category that accounts for the highest proportion of the $K$ nearest neighbour samples according to the majority voting principle. The Minkowski distance between samples $x$ and $y$ is shown in formula (14):

$$d_{xy}=\left(\sum_{k=1}^{n}\left|x_k-y_k\right|^{p}\right)^{1/p} \tag{14}$$

For $p=2$ this is the Euclidean distance. The pseudocode of under-sampling is described as Algorithm 2.
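Since Algorithm 2 is not reproduced here, the following is only a sketch of one reading of the downsampling rule: every majority class sample that appears among the $K$ nearest neighbours of some minority class sample is removed. The function names, the brute-force neighbour search, and the toy data are assumptions for illustration.

```python
import numpy as np

def minkowski(a, B, p=2):
    """Minkowski distance of formula (14) between one sample a and every row of B."""
    return (np.abs(B - a) ** p).sum(axis=1) ** (1.0 / p)

def knn_downsample(X, y, minority_label, majority_label, k=3, p=2):
    """Drop majority samples found among the k nearest neighbours of any minority sample."""
    drop = np.zeros(len(X), dtype=bool)
    for i in np.flatnonzero(y == minority_label):
        d = minkowski(X[i], X, p)
        d[i] = np.inf                           # a sample is not its own neighbour
        neighbours = np.argsort(d)[:k]
        drop[neighbours[y[neighbours] == majority_label]] = True
    keep = ~drop
    return X[keep], y[keep]

# Toy data: one minority sample at 0.05 surrounded by two majority samples.
X_toy = np.array([[0.0], [0.1], [5.0], [6.0], [0.05]])
y_toy = np.array([0, 0, 0, 0, 1])
X_bal, y_bal = knn_downsample(X_toy, y_toy, minority_label=1, majority_label=0, k=2)
```

In the toy data the two majority samples that overlap the minority sample are removed, and the distant majority samples at 5.0 and 6.0 are kept.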

3.3. Deep denoising autoencoder

The autoencoder (AE) is an unsupervised learning model that is widely used in data dimensionality reduction and data denoising because it does not require data labels. In addition, it is a typical generative model which achieves good results in the fields of image generation and data enhancement. An AE includes an input layer, an encoding layer, and a decoding layer. The weight and bias of each layer are adjusted by the back-propagation algorithm so that the decoded data is as consistent as possible with the original data. The denoising autoencoder (DAE) is an optimisation and improvement of the autoencoder, and the model is shown in Figure 4. Given a matrix $X=\{x_1,x_2,\dots,x_n\}$, where $x_i\in R^n$, the original data $X$ is converted to $\tilde{X}=\{\tilde{x}_1,\tilde{x}_2,\dots,\tilde{x}_n\}$ by adding Gaussian noise. In the encoding stage, each vector is transformed into the corresponding low-dimensional vector $x_k\in R^d$, where $d$ represents the dimension of the low-dimensional vector. In the decoding stage, the low-dimensional vector is converted into a vector with the same dimension as the input vector. The output matrix is represented as $Y$.

Figure 4.  Denoising autoencoder.

The whole denoising autoencoder training process is as follows.

Given a matrix $X=\{x_1,x_2,\dots,x_n\}$, where $x_i\in R^n$, noise is first added to the original data according to formula (15):

$$\tilde{X}=X+Noise \tag{15}$$

Among them, the added noise follows the standard Gaussian distribution with $\mu=0$ and $\sigma=1$. In addition, the randomly generated noise is limited to the range $[0, 1]$ through upper and lower boundaries, which avoids noise that is too large or too small.

In the encoding stage, $\tilde{X}$ is used as the input of the DAE, where $\tilde{x}_j\in R^n$ corresponds to the $j$-th vector of the matrix $\tilde{X}$. An encoding layer with $d$ neurons is obtained by formula (16):

$$h^{(i)}=g_f\left(W_i^{T}\tilde{X}+b_i\right) \tag{16}$$

Among them, $g_f$ is the nonlinear activation function, such as the sigmoid function or the ReLU function, $W_i$ represents the weight matrix of the $i$-th layer, and $b_i$ denotes the bias vector of the layer.

The data of the encoding layer is decoded to obtain the decoded data $Y$, as shown in formula (17):

$$Y=g_f\left(W_i^{T}h^{(i)}+b_i\right) \tag{17}$$

The denoising autoencoder automatically adjusts the weight matrix and bias vector so that the error between $X$ and $Y$ is minimised. The loss function to be minimised is shown in formula (18):

$$L_{DAE}=\frac{1}{m}\sum_{i=1}^{m}\left(X_i-Y_i\right)^2 \tag{18}$$

To improve the performance of the denoising autoencoder, we combine multiple denoising autoencoders into a deep denoising autoencoder (DDAE). The model structure is shown in Figure 5. The denoising autoencoder of the first layer is trained, and its output is taken as the input of the next layer in turn, until the training of the entire deep denoising autoencoder is finished.
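The single-layer computations in formulas (15)–(18) can be sketched with NumPy as below. Only the noise step, the forward pass, and the loss are shown; training by back-propagation is omitted. The function names, the separate encoder and decoder weights, the row-per-sample layout, and the random initial weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def add_noise(X, rng):
    """Formula (15): add standard Gaussian noise limited to the range [0, 1]."""
    noise = np.clip(rng.standard_normal(X.shape), 0.0, 1.0)
    return X + noise

def dae_forward(X_noisy, W_enc, b_enc, W_dec, b_dec):
    """Formulas (16) and (17), with samples stored as rows of X_noisy."""
    H = sigmoid(X_noisy @ W_enc + b_enc)        # encoding layer with d neurons
    Y = sigmoid(H @ W_dec + b_dec)              # decoded data, same width as the input
    return H, Y

def dae_loss(X, Y):
    """Formula (18): squared reconstruction error averaged over the m samples."""
    return float(np.mean(np.sum((X - Y) ** 2, axis=1)))

rng = np.random.default_rng(0)
X_clean = rng.random((8, 6))                    # m = 8 samples, n = 6 features
X_noisy = add_noise(X_clean, rng)
W_enc, b_enc = rng.normal(0.0, 0.1, (6, 3)), np.zeros(3)   # d = 3
W_dec, b_dec = rng.normal(0.0, 0.1, (3, 6)), np.zeros(6)
H, Y = dae_forward(X_noisy, W_enc, b_enc, W_dec, b_dec)
loss = dae_loss(X_clean, Y)
```

Note that the loss compares the reconstruction with the clean input, not the noisy one, which is what pushes the network to remove the noise.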

Figure 5. Deep denoising autoencoder.

In this paper, we use DDAE to generate minority class data. This involves setting the number of neurons $d$ of each denoising autoencoder, the nonlinear activation function, and the additive noise factor $f$. The minority class data $X$ is fed as the input of the first-layer denoising autoencoder for training, and after layer-by-layer training, the generated minority class data $Y$ is finally obtained.

4. Experiments and results

The experiments were conducted on an R5-3500H CPU at 2.10 GHz with 16 GB RAM under Windows 11. The whole program was run in a Python 3.7 environment, and the deep neural network was run on TensorFlow 2.0. The NSL-KDD, UNSW-NB15 and KDD CUP99 datasets were used in the experiment. In the NSL-KDD dataset, KDDTraining+ is used for model training, and KDDTest+ is used for model testing. In the KDD CUP99 dataset, 10_percent_corrected is used for model training, and corrected is used for model testing. In the UNSW-NB15 dataset, a partition of the dataset is used for model training and testing.

4.1. Benchmark datasets and data preprocessing

KDD CUP99 (Tavallaee et al., Citation2009) is the dataset used for the KDD CUP competition. The original data is 9 weeks of network connection data from a simulated US Air Force LAN. The NSL-KDD dataset solves some of the problems of the KDD dataset, including redundant data and duplicate records. Scholars still experiment with these datasets for intrusion detection because the data are valid, even though they were generated a long time ago. Both datasets contain 5 categories: Normal represents normal traffic data, DoS is a denial of service attack, Probe is a kind of attack which attempts to obtain information from the network, U2R represents unauthorised local superuser privileged access, and R2L indicates unauthorised access from a remote host. In the NSL-KDD dataset, the size of the training set is 125973, and the size of the test set is 22544. In the KDD CUP99 dataset, the size of the training set is 494021, and the size of the test set is 311029. The features of the data include basic features, content features, time traffic features, and host traffic features.

The UNSW-NB15 (Moustafa & Slay, Citation2015) dataset comes from the Cyber Range Lab of UNSW Canberra. This dataset contains current common attack traffic and normal traffic, which can truly reflect modern network traffic scenarios. We conduct experiments using a partition of this dataset, in which each record has a total of 43 features, including 39 numerical features, 3 discrete features, and 1 categorical feature. There are 175341 records in the training set and 82332 records in the test set.

Data preprocessing includes three parts: numericalization, normalisation, and one-hot encoding, as follows:

  1. Numericalization: Converting text data to numerical data before training the model. In the experimental dataset, the protocol_type feature, the service feature, and the flag feature belong to text data. We use continuous numbers to represent the discrete attributes of these features.

  2. Normalization: The values of the different features are not of the same order of magnitude. To reduce the large difference between the features, feature values are normalised to between 0 and 1. The conversion is shown in formula (19): $$x'=\frac{x_{ij}-Min}{Max-Min} \tag{19}$$ where $x_{ij}$ represents the data in row $i$ and column $j$, $Min$ and $Max$ represent the minimum and maximum values of the column where the data is located, and $x'$ is the normalised data.

  3. One-hot encoding: There is no logical relationship between the numerical codes of the text data, so one-hot encoding is used to solve the problem of data discreteness. For example, there are 3 different attributes in the protocol_type feature, which are respectively processed into [1 0 0], [0 1 0], [0 0 1]. The same processing method is also used for the service and flag features. Finally, the 41-dimensional feature data is mapped to 122-dimensional feature data.
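The three preprocessing steps listed above can be sketched with NumPy as follows. The helper names and the toy values are assumptions, and ordering the categories alphabetically is an illustrative choice (the text only says that continuous numbers are assigned).

```python
import numpy as np

def min_max(column):
    """Formula (19): scale one numeric column to the range [0, 1]."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    if hi == lo:                                # constant column: avoid division by zero
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)

def one_hot(values):
    """Numericalise a text feature, then expand the codes into one-hot rows."""
    categories = sorted(set(values))            # assumption: alphabetical code order
    index = {c: i for i, c in enumerate(categories)}
    codes = np.array([index[v] for v in values])   # numericalization step
    return np.eye(len(categories))[codes], categories

scaled = min_max([0.0, 5.0, 10.0])
encoded, cats = one_hot(["tcp", "udp", "icmp", "tcp"])
```

The jump from 41 to 122 dimensions is consistent with the commonly reported category counts for NSL-KDD (3 protocol types, 70 services and 11 flags), since 38 + 3 + 70 + 11 = 122; these counts are not stated in the text and are given here only as a plausibility check.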

4.2. Performance metrics

For the classification problem, there are two cases for the results of each category, and this article takes the normal traffic data as the reference. The classification details are shown in Table 2.

Table 2. Confusion matrix for intrusion detection.

TP indicates the amount of normal data predicted to be normal, TN represents the amount of attack data predicted as attack data, FP indicates the amount of normal data predicted as attack data, and FN represents the amount of attack data predicted as normal. Among them, FP and FN count the misclassified samples (false positives and false negatives, respectively). The accuracy, precision, recall, and F1 score obtained from these parameters are commonly used to measure the actual performance of a model, so we use them as evaluation metrics for the experimental results. The formulas are as follows:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \tag{20}$$

$$Precision=\frac{TP}{TP+FP} \tag{21}$$

$$Recall=\frac{TP}{TP+FN} \tag{22}$$

$$F1=\frac{2\times Precision\times Recall}{Precision+Recall} \tag{23}$$
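As a worked example of formulas (20)–(23), the helper below (the name `classification_metrics` and the counts are made up for illustration) computes the four metrics from the entries of a confusion matrix.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts: accuracy = 170 / 200, precision = 80 / 100, recall = 80 / 90.
acc, prec, rec, f1 = classification_metrics(tp=80, tn=90, fp=20, fn=10)
```

For these counts the F1 score equals $2TP/(2TP+FP+FN)=160/190$, which lies between the precision and the recall.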

4.3. Feature selection and data balance processing

In the feature selection stage, two groups of classification accuracy are used to evaluate the fitness value, to ensure the selected features perform well in both binary classification and multi-classification. One group uses multi-classification accuracy, and the other group uses binary classification accuracy. Finally, the two groups of feature selection results are combined to form the final optimal feature subset. In the experiment, 10 populations are initialised, and each test population is iterated 50 times. Table 3 shows the feature subsets obtained in the two rounds of experiments and the final feature subsets used. Table 4 shows the best subset of features selected by this paper and by other feature selection methods, such as NSGA2-LR (Khammassi & Krichen, Citation2020), RF (Kunhare et al., Citation2020), GA-FCM (Nguyen & Kim, Citation2020), MCMIFS (Gavel et al., Citation2022), Sigmoid PIO (Alazzam et al., Citation2020), and GLCC (Mohammadi et al., Citation2019).

Table 3. Feature selection on three datasets.

Table 4. Comparison of feature selection results.

In the stage of processing unbalanced data, KNN is used to downsample the majority class data. After repeated experiments on the NSL-KDD dataset, 1000 neighbour nodes around each sample to be tested are used for evaluation for the Normal type, and 10 neighbour nodes around each sample to be tested are used for the DoS and Probe types. The downsampling of the training set is completed with these parameters. DDAE is used to upsample the data. Through many experiments, the number of epochs is set to 20, the batch size is set to 64, and the hidden layers are set to [80, 50, 30, 20] to achieve the best generated data. For classes with only a few samples, we do not upsample. For example, the number of U2R samples is small, and data generated from them may be unrepresentative and have a negative impact on the classification performance, so only R2L is used for data generation. The hyperparameter settings of DDAE are shown in Table 5. The data processed by KNN and DDAE are combined into new training data, and the details are shown in Table 6.

Table 5. DAE hyperparameter settings.

Table 6. Preprocessed data.

In the training stage of the DNN classifier, we set 4 hidden layers and use ReLU as the activation function of each layer to extract features more effectively. In addition, a Dropout layer is used to discard a part of the nodes, which helps to avoid overfitting. Table 7 shows the specific parameters of the DNN model. Figure 6 shows the changes of loss during DDAE and DNN training on the NSL-KDD dataset.

Figure 6. The change of loss with the increase of epoch during DDAE and DNN training. (a) DDAE training (b) DNN training.

Table 7. DNN structure for NSL-KDD.
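The classifier structure described above (4 hidden layers, ReLU activations, and Dropout) can be illustrated with a small forward pass. This is a sketch under stated assumptions, not the model of Table 7: the layer widths, the dropout rate of 0.2, and the weight initialisation are hypothetical, and training is omitted.

```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` in training."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def dense(values, weights, biases):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    scale = (2.0 / n_in) ** 0.5  # He-style scaling, a common choice for ReLU
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers, rate, rng, training):
    h = x
    for weights, biases in layers[:-1]:  # the hidden layers
        h = dropout(relu(dense(h, weights, biases)), rate, rng, training)
    weights, biases = layers[-1]
    return dense(h, weights, biases)  # raw class scores (logits)


rng = random.Random(0)
sizes = [10, 16, 12, 8, 6, 5]  # HYPOTHETICAL: input, 4 hidden layers, 5 classes
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
x = [rng.random() for _ in range(sizes[0])]
logits = forward(x, layers, rate=0.2, rng=rng, training=False)
print(len(logits))
```

Dropout is active only when `training=True`; at test time every unit is kept, and the inverted scaling during training keeps the expected activation unchanged.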

4.4. Binary classification task

We analyse each stage to verify the effectiveness of the proposed hybrid detection model. The original data, the data after feature selection, and the data after data balance processing are used for comparison. Figure 7 shows the confusion matrix of the proposed model in binary classification, and Table 8 shows the specific experimental data.

Figure 7. Confusion matrix for binary classification. (a) NSL-KDD, (b) KDD CUP99, and (c) UNSW-NB15.


Table 8. Binary classification experiment results (%).

The data in Table 8 show that the feature selection and data preprocessing are effective and improve classification performance. For NSL-KDD, the accuracy after feature selection is 4.13% higher than that of the original DNN, and the accuracy of the final integrated model is 13.12% higher than that of the original model. In addition, the precision increases by 12.22% and the F1 score by 13.19%. For KDD CUP99, the accuracy after feature selection is 1.4% higher than that of the original model, and the final integrated model improves on the original DNN by 1.76% in accuracy, 4.71% in precision, and 3.32% in F1 score. For UNSW-NB15, the final model achieves an accuracy of 89.34%, a precision of 85.52%, and an F1 score of 90.94%. The results show that the feature subset selected by the improved Harris Hawk algorithm proposed in this paper removes redundant and noisy features and improves the classification ability of the DNN model. They also show that all performance metrics improve after the imbalance of the dataset is handled, and that KNN-DDAE outperforms the SMOTE method on the dataset.
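The relationship between the reported metrics can be checked with a short calculation. The sketch below assumes the standard definition of the F1 score as the harmonic mean of precision and recall, and solves it for the recall implied by the UNSW-NB15 precision and F1 score quoted above. The recall is not reported in this section, so the result is a derived value rather than a published one.

```python
def recall_from_precision_f1(precision, f1):
    """Solve F1 = 2PR / (P + R) for R."""
    denom = 2.0 * precision - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than 2 * precision")
    return f1 * precision / denom


def f1_score(precision, recall):
    return 2.0 * precision * recall / (precision + recall)


# UNSW-NB15 binary classification values quoted in the text
precision, f1 = 0.8552, 0.9094
recall = recall_from_precision_f1(precision, f1)
print(round(recall * 100, 2))
```

Under this assumption the implied recall is about 97.1%, which is consistent with an F1 score that is higher than the precision.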

In addition, we plotted the Receiver Operating Characteristic (ROC) curves at different stages of the model. As shown in Figure 8, the proposed model has excellent classification performance.

Figure 8. ROC curves.
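For readers unfamiliar with how such curves are produced, the following sketch computes ROC points and the area under the curve (AUC) from predicted scores by sweeping the decision threshold. The labels and scores are hypothetical toy values, not outputs of the proposed model, and the function names are illustrative.

```python
def roc_curve(y_true, scores):
    """Return (fpr, tpr) points obtained by sweeping the decision threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = attack) and attack scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_curve(y_true, scores)
print(auc(curve))
```

In this toy example 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9.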


Table 9 shows the results of this approach compared with other advanced intrusion detection methods, such as LightGBM-AE (Tang et al., Citation2020), OCSVM (Khraisat et al., Citation2020), MAGNETO-GAN (Andresini, Appice, De Rose et al., Citation2021), MLP (Lopez-Martin et al., Citation2019).

Table 9. Binary classification performance comparison (%).

According to Table 9, the accuracy of the model is 93.12% on NSL-KDD, 94.87% on KDD CUP99, and 89.34% on UNSW-NB15. It outperforms the other intrusion detection models compared, indicating that this method performs better on binary classification tasks.

4.5. Multi-classification task

Security personnel need detailed attack types when analysing data, so it is necessary to distinguish specific types of attack traffic. In this study, all categories in the test dataset were detected and judged. Figure 9 shows the confusion matrix of the proposed model in multi-classification, and Table 10 shows the specific experimental data.

Figure 9. Confusion matrix for multi-classification. (a) NSL-KDD, (b) KDD CUP99, and (c) UNSW-NB15.


Table 10. Multi-classification tasks for datasets (%).

Table 10 shows the effectiveness of our method. For NSL-KDD, the accuracy after feature selection is 1.61% higher than that of the original DNN, and the accuracy of the final integrated model is 8.08% higher than that of the original DNN. For KDD CUP99, the accuracy after feature selection is 0.58% higher than that of the original DNN, and the accuracy of the final integrated model is 1.58% higher. For UNSW-NB15, the accuracy of the proposed model is 81.92%, which is 6.56% higher than that of the original DNN.
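The multi-classification metrics can be derived from a confusion matrix as sketched below. The five class names follow the NSL-KDD categories mentioned in the text, while the true and predicted labels are hypothetical toy values, not results from Table 10.

```python
def confusion_matrix(y_true, y_pred, classes):
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1  # rows: true class, columns: predicted
    return matrix


def per_class_metrics(matrix):
    """Return a (precision, recall, f1) tuple for each class."""
    n = len(matrix)
    results = []
    for c in range(n):
        tp = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))
        actual = sum(matrix[c])
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append((precision, recall, f1))
    return results


def overall_accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


classes = ["Normal", "DoS", "Probe", "R2L", "U2R"]
# Hypothetical labels for eight test samples
y_true = ["Normal", "Normal", "DoS", "DoS", "Probe", "R2L", "R2L", "U2R"]
y_pred = ["Normal", "DoS", "DoS", "DoS", "Probe", "Normal", "R2L", "R2L"]
cm = confusion_matrix(y_true, y_pred, classes)
metrics = per_class_metrics(cm)
print(overall_accuracy(cm))
```

Per-class values make it visible when a rare class such as U2R is never predicted correctly, which overall accuracy alone can hide.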

To show its performance, the proposed method is compared with other advanced intrusion detection schemes for multi-classification tasks, such as DAE-DNN (Khraisat et al., Citation2020), CNN-BiLSTM (Jiang et al., Citation2020), DAE-DNN (Kunang et al., Citation2021), BAT-MC (Su et al., Citation2020), Adaptive Ensemble (Gao et al., Citation2019), AE-DBN (Li et al., Citation2015), US-CCI (Prasad et al., Citation2020), Multi-level ELM (Al-Yaseen et al., Citation2017), AE (Choi et al., Citation2019), MGWO (Alzaqebah et al., Citation2022).

Table 11 shows the results of the comparison between the proposed model and the other models. The accuracy reaches 86.79% on NSL-KDD, 94.03% on KDD CUP99, and 81.92% on UNSW-NB15. In addition, the results show that our method has clear advantages in multi-classification tasks in terms of precision and F1 score.

Table 11. Multi-classification performance comparison (%).

4.6. Time complexity

Time complexity characterises the time required to process tasks and reflects the execution speed of the algorithm, which makes it an important indicator for evaluating the performance of the algorithm. The method proposed in this paper includes three parts: feature selection, data processing, and model training. We measured the runtime of multi-classification. Table 12 shows the time consumed for training and testing on the dataset using different methods.

Table 12. Comparison of running time of different methods (sec).

Table 12 shows that traditional machine learning methods complete the test quickly, while deep learning algorithms have a large number of parameters and take more time to run. In addition, the test time difference between our proposed method and the currently popular deep learning methods is small, yet our method achieves much better performance, which indicates that the method is suitable for intrusion detection.
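Training and testing times of this kind can be measured with a wall-clock timer, as in the sketch below. The `train` and `predict` functions are hypothetical stand-ins for the real model, and the measurement procedure used for Table 12 is not described in this section.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def train(data):
    """Hypothetical stand-in for model fitting."""
    return sum(data) / len(data)


def predict(model, data):
    """Hypothetical stand-in for inference."""
    return [x > model for x in data]


data = list(range(10000))
model, train_time = timed(train, data)
preds, test_time = timed(predict, model, data)
print(f"train {train_time:.4f}s, test {test_time:.4f}s")
```

A single run is sensitive to system load, so averaging over several repetitions gives a more stable comparison.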

5. Conclusion

According to the characteristics of network traffic data, we propose a hybrid intrusion detection framework based on DNN. The improved Harris Hawk algorithm is used to select the optimal features. The algorithm uses a reasonable energy factor and multi-information fusion of prey locations, and it addresses the problem of easily falling into local optima during population iteration. In addition, we use KNN-DDAE to perform imbalance processing on the preprocessed data, which to some extent alleviates the bias of the model towards the majority class caused by imbalanced data. Experiments are conducted on three common datasets to validate the performance of the proposed model. We experimentally compare the data without any processing, the data with feature selection, and the data with both feature selection and imbalance processing to ensure that each part of the processing is effective. Finally, we compare our model with other intrusion detection models, and it performs better in accuracy, precision, and F1 score.

In recent years, graph neural networks have made achievements in social networks, traffic networks, and protein networks. There are potential correlations within network traffic, and graphs can be built based on such correlations. Therefore, future work will consider combining graph neural networks with intrusion detection.

Abbreviations

The following abbreviations are used in this manuscript:

SVM = Support Vector Machines
LSTM = Long Short Term Memory
CNN = Convolutional Neural Network
HHO = Harris Hawk Optimization
KNN = K-Nearest Neighbor
DDAE = Deep Denoising Autoencoder
DNN = Deep Neural Network
LSO = Lion Swarm Optimization
PSO = Particle Swarm Optimization
WOA = Whale Optimization Algorithm
GAN = Generative Adversarial Network
KELM = Kernel Extreme Learning Machine
GA = Genetic Algorithm
KH = Krill Swarm Optimization Algorithm
DAE = Denoising Autoencoder
AE = Autoencoder

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China, ‘Research on High-frequency Blockchain Data Access Control and Autonomous Authentication’ [project number 62072170].

References

  • Ahmad, Z., Shahid Khan, A., Wai Shiang, C., Abdullah, J., & Ahmad, F. (2021). Network intrusion detection system: A systematic study of machine learning and deep learning approaches. Transactions on Emerging Telecommunications Technologies, 32(1), e4150. https://doi.org/10.1002/ett.v32.1
  • Ahmim, A., Derdour, M., & Ferrag, M. A. (2018). An intrusion detection system based on combining probability predictions of a tree of classifiers. International Journal of Communication Systems, 31(9), e3547. https://doi.org/10.1002/dac.v31.9
  • Alazzam, H., Sharieh, A., & Sabri, K. E. (2020). A feature selection algorithm for intrusion detection system based on pigeon inspired optimizer. Expert Systems with Applications, 148, 113249. https://doi.org/10.1016/j.eswa.2020.113249
  • Al-Yaseen, W. L., Othman, Z. A., & Nazri, M. Z. A. (2017). Multi-level hybrid support vector machine and extreme learning machine based on modified K-means for intrusion detection system. Expert Systems with Applications, 67, 296–303. https://doi.org/10.1016/j.eswa.2016.09.041
  • Alzaqebah, A., Aljarah, I., Al-Kadi, O., & Damaševičius, R. (2022). A modified grey wolf optimization algorithm for an intrusion detection system. Mathematics, 10(6), 999. https://doi.org/10.3390/math10060999
  • Alzubi, Q. M., Anbar, M., Sanjalawe, Y., Al-Betar, M. A., & Abdullah, R. (2022). Intrusion detection system based on hybridizing a modified binary grey wolf optimization and particle swarm optimization. Expert Systems with Applications, 204, 117597. https://doi.org/10.1016/j.eswa.2022.117597
  • Andresini, G., Appice, A., De Rose, L., & Malerba, D. (2021). GAN augmentation to deal with imbalance in imaging-based intrusion detection. Future Generation Computer Systems, 123, 108–127. https://doi.org/10.1016/j.future.2021.04.017
  • Andresini, G., Appice, A., & Malerba, D. (2021). Nearest cluster-based intrusion detection through convolutional neural networks. Knowledge-Based Systems, 216, 106798. https://doi.org/10.1016/j.knosys.2021.106798
  • Binbusayyis, A., & Vaiyapuri, T. (2021). Unsupervised deep learning approach for network intrusion detection combining convolutional autoencoder and one-class SVM. Applied Intelligence, 51(10), 7094–7108. https://doi.org/10.1007/s10489-021-02205-9
  • Choi, H., Kim, M., Lee, G., & Kim, W. (2019). Unsupervised learning approach for network intrusion detection system using autoencoders. The Journal of Supercomputing, 75(9), 5597–5621. https://doi.org/10.1007/s11227-019-02805-w
  • Chuang, P. J., & Wu, D. Y. (2019). Applying deep learning to balancing network intrusion detection datasets. In IEEE 11th International Conference on Advanced Infocomm Technology (ICAIT) (pp. 213–217).
  • Diao, C., Zhang, D., Liang, W., Li, K. C., Hong, Y., & Gaudiot, J. L. (2022). A novel spatial-temporal multi-scale alignment graph neural network security model for vehicles prediction. IEEE Transactions on Intelligent Transportation Systems, 24(1), 904–914. https://doi.org/10.1109/TITS.2022.3140229
  • Gao, X., Shan, C., Hu, C., Niu, Z., & Liu, Z. (2019). An adaptive ensemble machine learning model for intrusion detection. IEEE Access, 7, 82512–82521. https://doi.org/10.1109/Access.6287639
  • Gavel, S., Raghuvanshi, A. S., & Tiwari, S. (2022). Maximum correlation based mutual information scheme for intrusion detection in the data networks. Expert Systems with Applications, 189, 116089. https://doi.org/10.1016/j.eswa.2021.116089
  • Heidari, A. A., Mirjalili, S., Faris, H., Aljarah, I., Mafarja, M., & Chen, H. (2019). Harris Hawks optimization: Algorithm and applications. Future Generation Computer Systems, 97, 849–872. https://doi.org/10.1016/j.future.2019.02.028
  • Hu, N., Zhang, D., Xie, K., Liang, W., & Hsieh, M. Y. (2022). Graph learning-based spatial-temporal graph convolutional neural networks for traffic forecasting. Connection Science, 34(1), 429–448. https://doi.org/10.1080/09540091.2021.2006607
  • Huang, S., & Lei, K. (2020). IGAN-IDS: An imbalanced generative adversarial network towards intrusion detection system in ad-hoc networks. Ad Hoc Networks, 105, 102177. https://doi.org/10.1016/j.adhoc.2020.102177
  • Ibrahim, R. A., Oliva, D., Ewees, A. A., & Lu, S. (2017). Feature selection based on improved runner-root algorithm using chaotic singer map and opposition-based learning. In Neural information processing (pp. 156–166). Springer International Publishing.
  • Jiang, K., Wang, W., Wang, A., & Wu, H. (2020). Network intrusion detection combined hybrid sampling with deep hierarchical network. IEEE Access, 8, 32464–32476. https://doi.org/10.1109/Access.6287639
  • Kanna, P. R., & Santhi, P. (2021). Unified deep learning approach for efficient intrusion detection system using integrated spatial–temporal features. Knowledge-Based Systems, 226, 107132. https://doi.org/10.1016/j.knosys.2021.107132
  • Kasongo, S. M., & Sun, Y. (2020). A deep learning method with wrapper based feature extraction for wireless intrusion detection system. Computers & Security, 92, 101752. https://doi.org/10.1016/j.cose.2020.101752
  • Khammassi, C., & Krichen, S. (2020). A NSGA2-LR wrapper approach for feature selection in network intrusion detection. Computer Networks, 172, 107183. https://doi.org/10.1016/j.comnet.2020.107183
  • Khraisat, A., Gondal, I., Vamplew, P., Kamruzzaman, J., & Alazab, A. (2020). Hybrid intrusion detection system based on the stacking ensemble of C5 decision tree classifier and one class support vector machine. Electronics, 9(1), 173. https://doi.org/10.3390/electronics9010173
  • Kunang, Y. N., Nurmaini, S., Stiawan, D., & Suprapto, B. Y. (2021). Attack classification of an intrusion detection system using deep learning and hyperparameter optimization. Journal of Information Security and Applications, 58, 102804. https://doi.org/10.1016/j.jisa.2021.102804
  • Kunhare, N., Tiwari, R., & Dhar, J. (2020). Particle swarm optimization and feature selection for intrusion detection system. Sādhanā, 45(1), 1–14. https://doi.org/10.1007/s12046-020-1308-5
  • Li, X., Chen, W., Zhang, Q., & Wu, L. (2020). Building auto-encoder intrusion detection system based on random forest feature selection. Computers & Security, 95, 101851. https://doi.org/10.1016/j.cose.2020.101851
  • Li, X., Yi, P., Wei, W., Jiang, Y., & Tian, L. (2021). LNNLS-KH: A feature selection method for network intrusion detection. Security and Communication Networks, 2021, 1–22. https://doi.org/10.1155/2021/8830431
  • Li, Y., Liang, W., Peng, L., Zhang, D., Yang, C., & Li, K. C. (2022). Predicting drug-target interactions via dual-stream graph neural network. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1–11. https://doi.org/10.1109/TCBB.2022.3204188
  • Li, Y., Ma, R., & Jiao, R. (2015). A hybrid Malicious code detection method based on deep learning. International Journal of Software Engineering and Its Applications, 9(5), 279–288. https://doi.org/10.14257/ijseia
  • Liang, W., Xie, S., Cai, J., Xu, J., Hu, Y., Xu, Y., & Qiu, M. (2021). Deep neural network security collaborative filtering scheme for service recommendation in intelligent cyber–physical systems. IEEE Internet of Things Journal, 9(22), 22123–22132. https://doi.org/10.1109/JIOT.2021.3086845
  • Long, J., Liang, W., Li, K. C., Wei, Y., & Marino, M. D. (2022). A regularized cross-layer ladder network for intrusion detection in industrial internet of things. IEEE Transactions on Industrial Informatics, 19(2), 1747–1755. https://doi.org/10.1109/TII.2022.3204034
  • Lopez-Martin, M., Carro, B., Sanchez-Esguevillas, A., & Lloret, J. (2019). Shallow neural network with kernel approximation for prediction problems in highly demanding data networks. Expert Systems with Applications, 124, 196–208. https://doi.org/10.1016/j.eswa.2019.01.063
  • Mohammadi, S., Mirvaziri, H., Ghazizadeh-Ahsaee, M., & Karimipour, H. (2019). Cyber intrusion detection by combined feature selection algorithm. Journal of Information Security and Applications, 44, 80–88. https://doi.org/10.1016/j.jisa.2018.11.007
  • Mojtahedi, A., Sorouri, F., Souha, A. N., Molazadeh, A., & Mehr, S. S. (2022). Feature selection-based intrusion detection system using genetic whale optimization algorithm and sample-based classification. arXiv preprint arXiv:2201.00584.
  • Moustafa, N., & Slay, J. (2015). UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Military Communications and Information Systems Conference (MILCIS) (pp. 1–6).
  • Nguyen, M. T., & Kim, K. (2020). Genetic convolutional neural network for intrusion detection systems. Future Generation Computer Systems, 113, 418–427. https://doi.org/10.1016/j.future.2020.07.042
  • Point, C. (2022). Cyber attack trends: Mid-year report. https://pages.checkpoint.com/cyber-attack-2022-trends.html
  • Prasad, M., Tripathi, S., & Dahal, K. (2020). Unsupervised feature selection and cluster center initialization based arbitrary shaped clusters for intrusion detection. Computers & Security, 99, 102062. https://doi.org/10.1016/j.cose.2020.102062
  • Sharafaldin, I., Lashkari, A. H., & Ghorbani, A. A. (2018). Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP, 1, 108–116. https://doi.org/10.5220/0006639801080116
  • Shrestha, A., & Mahmood, A. (2019). Review of deep learning algorithms and architectures. IEEE Access, 7, 53040–53065. https://doi.org/10.1109/Access.6287639
  • Su, T., Sun, H., Zhu, J., Wang, S., & Li, Y. (2020). BAT: Deep learning methods on network intrusion detection using NSL-KDD dataset. IEEE Access, 8, 29575–29585. https://doi.org/10.1109/Access.6287639
  • Talita, A., Nataza, O., & Rustam, Z. (2021). Naïve bayes classifier and particle swarm optimization feature selection method for classifying intrusion detection system dataset. Journal of Physics: Conference Series, 1752(1), 012021. https://doi.org/10.1088/1742-6596/1752/1/012021
  • Tang, C., Luktarhan, N., & Zhao, Y. (2020). An efficient intrusion detection method based on LightGBM and autoencoder. Symmetry, 12(9), 1458. https://doi.org/10.3390/sym12091458
  • Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009). A detailed analysis of the KDD CUP 99 data set. In IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA) (pp. 1–6). IEEE.
  • Vinayakumar, R., Alazab, M., Soman, K., Poornachandran, P., Al-Nemrat, A., & Venkatraman, S. (2019). Deep learning approach for intelligent intrusion detection system. IEEE Access, 7, 41525–41550. https://doi.org/10.1109/Access.6287639
  • Wang, Z., Liu, Y., He, D., & Chan, S. (2021). Intrusion detection methods based on integrated deep learning model. Computers & Security, 103, 102177. https://doi.org/10.1016/j.cose.2021.102177
  • Xu, X., Li, J., Yang, Y., & Shen, F. (2020). Toward effective intrusion detection using log-cosh conditional variational autoencoder. IEEE Internet of Things Journal, 8(8), 6187–6196. https://doi.org/10.1109/JIOT.2020.3034621
  • Zhang, X., Yang, F., Hu, Y., Tian, Z., Liu, W., Li, Y., & She, W. (2022). RANet: Network intrusion detection with group-gating convolutional neural network. Journal of Network and Computer Applications, 198, 103266. https://doi.org/10.1016/j.jnca.2021.103266
  • Zhang, Z., Li, Y., Dong, H., Gao, H., Jin, Y., & Wang, W. (2021). AESMOTE: Adversarial reinforcement learning with SMOTE for anomaly detection.