A multimodal hybrid parallel network intrusion detection model

Abstract

With the rapid growth of Internet data traffic, the means of malicious attack have become more diversified. A single-modal intrusion detection model cannot fully exploit the rich feature information in massive network traffic data, resulting in unsatisfactory detection results. To address this issue, this paper proposes a multimodal hybrid parallel network intrusion detection model (MHPN). The proposed model extracts network traffic features from two modalities, the statistical information of network traffic and the original load of traffic, and constructs an appropriate neural network model for each modality. First, a two-branch convolutional neural network is combined with a Long Short-Term Memory (LSTM) network to extract the spatio-temporal feature information of network traffic from the original load modality, and a convolutional neural network is used to extract the feature information of traffic statistics. Then, the feature information extracted from the two modalities is fused and fed to the CosMargin classifier for network traffic classification. The experimental results on the ISCX-IDS 2012 and CIC-IDS2017 datasets show that the MHPN model outperforms the single-modal models and achieves an average accuracy of 99.98%. The model also demonstrates strong robustness and positive sample recognition ability.

1. Introduction

The accelerated construction of new-generation information infrastructure (Yang et al., 2022), the rapid iterative upgrading of digital technology, and the rise of digital platforms in various fields have resulted in a surge in global Internet traffic. The COVID-19 pandemic has further accelerated global digitisation, leading to exponential growth in Internet traffic. At the same time, malicious attack traffic has grown even faster, with frequent and varied network attacks posing a severe threat to network security and even national security.

An Intrusion Detection System (IDS) (Ahmim et al., 2018) is an effective security mechanism for monitoring network traffic and preventing malicious requests. Attack behaviour can be detected by analysing the data traffic generated by network users. Traditional Machine Learning (ML) techniques have been widely used in IDS. However, IDS based on traditional ML require professional researchers to annotate features and therefore cannot adapt to the dynamic and challenging network environment. With the development of Deep Learning (DL) technology, convolutional neural networks (CNN) (Liu & Zhang, 2020), recurrent neural networks (RNN) (Yin et al., 2017) and Long Short-Term Memory (LSTM) networks (Roy et al., 2017) are becoming increasingly popular in intrusion detection.

Existing IDS based on DL lack flexibility and adaptability. End-to-end DL IDS use only a single modality, resulting in significant deviations in the detection results. This is a major issue in intrusion detection, as even small deviations in detection can lead to more attacks and potential threats (Z. Wang et al., 2022). When auditing and analysing malicious traffic, network security researchers often make a comprehensive judgment from several kinds of information, such as abnormal traffic information, traffic load content, and the communication interaction process. Multimodal learning is a method that seeks to process and understand information from different modalities through ML and DL. It is more closely aligned with the general laws of human understanding of the world than traditional ML methods that rely on a single modality.

Moreover, data imbalance (Buda et al., 2018; Cai et al., 2022) is a serious problem in intrusion detection research based on DL and ML, and it significantly affects the performance of intrusion detection algorithms. Data imbalance refers to the huge difference in the amount of attack traffic data across categories. To reduce the impact of data imbalance on the accuracy of the IDS, this paper applies imbalance handling to the input of each modality. For the statistical traffic information modality, a feature screening method deletes redundant features and retains the features with stronger distinguishing ability. For the original data modality of the data stream, the original feature distribution of the stream is retained, and an improved original feature extraction algorithm is proposed that avoids introducing too many 0 elements, which are not conducive to network learning, and thereby accelerates the convergence of the network.

In this paper, a parallel network intrusion detection model based on multimodal feature fusion is proposed, combining multimodal DL methods with intrusion detection technology, to make full use of complex network traffic information and further improve the performance of IDS. The main contributions of this paper are summarised as follows.

A hybrid parallel network intrusion detection model (MHPN) is proposed. The network traffic characteristics are extracted from the statistical information of network traffic and the original load of traffic, and the appropriate neural network models are constructed for different modal information. Firstly, a two-branch CNN combined with LSTM network is used to extract the spatio-temporal feature information of network traffic for the original load mode of traffic, and CNN is used to extract the feature information of traffic statistics. Then, the feature information extracted from the two modalities is fused and fed to the CosMargin classifier for network traffic classification.
Ablation experiments were conducted on the ISCX-IDS 2012 and CIC-IDS2017 datasets to evaluate the performance of the MHPN model. The results demonstrate that the MHPN model is capable of effectively dealing with unbalanced abnormal traffic data, improving the accuracy of anomaly detection, and exhibiting strong robustness and positive sample recognition ability. Furthermore, the experimental results indicate that the MHPN model is significantly better than the single-modal model in terms of intrusion detection performance. The proposed model can also improve the security and reliability of the industrial internet of things (IIoT), a recent research area that connects digital devices and services with physical systems. The IIoT produces large volumes of data from multiple sensors, but also faces various cyberattacks that threaten its operations and information.

The rest of the paper is organised as follows. Section 2 describes related work. Section 3 presents the design and method of the model. Section 4 reports simulation experiments on the proposed model. Section 5 concludes the paper (Table ).

Table 1. Table of acronyms.

Table 2. Related work.

2. Related work

Existing methods for intrusion detection are typically classified into three categories, namely traditional pattern matching, ML, and DL.

In the early stages of the study, pattern matching algorithms were commonly used. They are the central algorithms of IDS based on feature matching (H. Zhang, 2009). Wu and Shen (2012) summarise the pattern matching algorithms used in IDS, which include KMP, BM, BMH, BMHS, AC, and AC-BM. To improve matching speed, an improved AC-BM algorithm is proposed. Hnaif et al. (2021) parallelise the Direct Matching Algorithm (PDMA) using OpenMP on a multi-core architecture to improve the speed of the IDS detection engine. However, traditional pattern matching algorithms exhibit high false negative and false positive rates, which cannot meet the requirements of IDS.

To overcome these limitations, scholars have increasingly turned to combining traditional ML with IDS, with promising outcomes. Anthi et al. (2019) propose a three-layer intrusion detection system and employ a Decision Tree (DT) for attack classification. Jan et al. (2019) utilise a Support Vector Machine (SVM) to detect abnormal traffic: they extract a feature pool from samples and use the remaining label vectors to train the SVM, which yields satisfactory classification accuracy. J. Li et al. (2019) propose a two-stage intrusion detection system that employs the bat algorithm, population splitting, and binary differential mutation to select typical features, and then utilises a random forest that adaptively adjusts sample weights for traffic classification.

Although IDS based on traditional ML address the issues of traffic classification and computational cost, they require network security researchers to label features, and they depend on extensive preprocessing of abnormal traffic data and intricate feature engineering (Guo & Han, 2023; Su et al., 2020). In a complex and dynamic network environment, ML methods cannot resolve the problem of classifying massive volumes of abnormal traffic.

Considering the limitations of IDS based on traditional ML, network security researchers have applied DL to IDS. DL (Alzubaidi et al., 2021) is a branch of ML. IDS based on DL achieve automatic feature extraction by learning from structured tabular features and by taking raw data directly as training input. At present, DL has achieved remarkable results in the field of intrusion detection. CNNs and their related architectures, such as LeNet-5 (Lecun et al., 1998), have gained considerable attention due to their outstanding performance in computer vision (CV) (He et al., 2016). Although CNN architectures are commonly applied to CV problems, they have demonstrated promising results in IDS (Dong et al., 2019; Vinayakumar et al., 2017). LSTM networks are highly suitable for classification and prediction based on time series data. Bontemps et al. (2016) propose a real-time anomaly detection model based on neural network learning, which trains an LSTM on normal time series data and then performs real-time prediction at each time step. Experiments show that the model can provide effective anomaly detection. Y. Zhang et al. (2019) propose a DL model with two-layer parallel learning cross fusion (PCCN), which learns flow features by fusing two-branch CNNs and improves the detection results for unbalanced abnormal flows.

DL has yielded promising results in traffic classification. However, current IDS based on DL are limited in their flexibility and adaptability in the face of rapidly increasing network traffic. These systems utilise only a single modality and do not fully exploit the heterogeneous nature of traffic data, leading to significant discrepancies in the results obtained, and even minor deviations in detection can result in additional attacks and potential security threats (Cai et al., 2022). Multimodal learning is a promising technique for network intrusion detection, as it can leverage the rich and diverse information in network traffic data, such as traffic statistics and raw payload, to enhance the performance and robustness of traffic classification models. X. Wang et al. (2020) proposed App-Net, a hybrid neural network based method for encrypted mobile traffic classification, which combines a bidirectional Long Short-Term Memory network (bi-LSTM) and a one-dimensional convolutional neural network (1D-CNN) to extract effective features from the packet length sequence of the TLS bidirectional flow and the payload bytes of the initial packets, respectively. A feature fusion layer then learns a joint flow-app embedding, which can differentiate various mobile apps. MIMETIC (Aceto et al., 2019) can exploit the heterogeneity of traffic data, learn the characteristics of different views, and capture the dependencies between views, to improve traffic classification performance in mobile scenarios. DISTILLER (Aceto et al., 2021) can simultaneously solve multiple traffic classification tasks by leveraging different types of input data, such as payload bytes and header fields. The method uses a two-step training process to learn intra-modality and inter-modality dependencies and avoid overfitting.

Existing works on multimodal deep learning for traffic classification have some limitations regarding the use environment, feature extraction, feature fusion and classification tasks. To overcome these limitations and achieve better results, we propose a novel multimodal deep learning model for traffic classification in this paper.

3. Design and method of the model

In this section, a multimodal hybrid parallel network intrusion detection model, called MHPN, is proposed. MHPN selects appropriate sub-models for training on two different modes, namely statistical traffic information and original traffic load, and then fuses the output results of multiple models to integrate different types of features to enhance the performance of the model detection. As illustrated in Figure , for the training of traffic statistics, a CNN is employed in this paper to extract spatial feature information in traffic statistics. Additionally, a parallel DL model that fuses spatio-temporal features is applied in the training of the original load of traffic. Two parallel CNNs are utilised to extract spatial information, and then the fused features are fed into two parallel LSTMs to analyse the temporal information of the fused features. The processed feature information of the two modalities is fused, and finally, the CosMargin classifier is utilised for classification.

Figure 1. Network structure diagram.

3.1. Parallel cross-fused spatio-temporal feature sub-model

In this paper, a parallel cross-fused spatio-temporal feature model is proposed to process the raw traffic data. The model consists of two parallel CNNs for spatial features extraction and two parallel LSTMs for temporal features extraction. Moreover, feature fusion technology is employed to cross-fuse the features of the two branches, thus enhancing the robustness of the learned features.

The sub-model has two branches. The upper branch integrates a Fully Convolutional Network (FCN) (Long et al., 2015) with an LSTM, while the lower branch combines a conventional CNN with an LSTM. Multiple features of the upper and lower branches are cross-fused to strengthen them, so the network learns more discriminative and robust features by fusing the features extracted from the two branch networks.

3.1.1. Upper branch

The upper branch can extract richer feature information from network traffic for highly imbalanced small-sample attacks. An FCN is characterised by using convolutional layers in place of pooling layers, and the MHPN model accordingly performs downsampling with strided convolution instead of pooling. The upper branch is fully convolutional and passes the input through four successive convolutional layers. The first and third convolutional layers have a kernel size of 3, a padding of 1, and a stride of 1. The second and fourth convolutional layers have a kernel size of 3, a padding of 1, and a stride of 2.
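To make the downsampling behaviour of these four layers concrete, the following minimal sketch traces how the sequence length changes through the upper branch. It assumes one-dimensional convolutions without dilation and an illustrative input length of 96; neither assumption is stated for this branch in the text, and the function names are hypothetical.

```python
def conv_out_len(length, kernel=3, padding=1, stride=1):
    """Output length of a 1-D convolution without dilation."""
    return (length + 2 * padding - kernel) // stride + 1


# Strides of the four upper-branch layers as described: 1, 2, 1, 2.
# Kernel size 3 and padding 1 are shared by all four layers.
UPPER_BRANCH_STRIDES = (1, 2, 1, 2)


def upper_branch_lengths(input_len, strides=UPPER_BRANCH_STRIDES):
    """Return the sequence length before and after each layer."""
    lengths = [input_len]
    for stride in strides:
        lengths.append(conv_out_len(lengths[-1], stride=stride))
    return lengths


# Input length 96 is an assumed example value.
print(upper_branch_lengths(96))  # [96, 96, 48, 48, 24]
```

Under these assumptions the stride-1 layers preserve the length and each stride-2 layer halves it, which is the role pooling would otherwise play.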

The four-step convolution operation in the upper branch has two main advantages over the traditional convolution plus pooling operation.

It extracts richer semantic information from streaming data without losing traffic information, thanks to the increased number of convolutional layers.
It allows for more robustness by enabling flexible control of the network parameters, which prevents overfitting.

3.1.2. Lower branch

The lower branch can extract the spatial and temporal feature information of network traffic. For complex network traffic, too many redundant features are learned when convolution is performed. The pooling operation reduces redundant features by dividing the input into several regions and outputting only one value per region as a feature. Since each packet in a data stream contains varying information and malicious traffic often contains less information, the analysis needs to consider the invariance of different packets. The pooling layer can provide translation, rotation, and scale invariance to facilitate this analysis. In the lower branch, convolution and pooling are combined to remove redundant information, expand the receptive field, and reduce the dimension and the number of parameters.

3.1.3. Feature fusion

The parallel cross-fused spatio-temporal feature sub-model uses four feature fusion operations. Unlike previous feature fusion approaches that use only channel concatenation, both channel concatenation and addition operations are used in this paper. The convolution kernels for each output channel are independent, so it is sufficient to examine the output of a single channel. Suppose the channels of the two inputs are $X_{1}, X_{2}, \dots, X_{c}$ and $Y_{1}, Y_{2}, \dots, Y_{c}$. Then the single output channel of the channel concatenation is: $Z_{concat} = \sum_{i = 1}^{c} X_{i} \otimes K_{i} + \sum_{i = 1}^{c} Y_{i} \otimes K_{i + c}$ (1) The single output channel for the addition operation is: $Z_{add} = \sum_{i = 1}^{c} (X_{i} + Y_{i}) \otimes K_{i} = \sum_{i = 1}^{c} X_{i} \otimes K_{i} + \sum_{i = 1}^{c} Y_{i} \otimes K_{i}$ (2) The addition operation is thus a channel concatenation in which corresponding channels share a kernel ($K_{i + c} = K_{i}$). It can therefore be used instead of the channel concatenation operation when the feature maps of the corresponding channels of the two inputs have semantically similar properties, which saves parameters and computation. In the parallel cross-fused spatio-temporal feature sub-model, the CNN semantics of the upper and lower branches are similar. To reduce the network overhead, the first three cross-fusion operations use addition to fuse the CNN features of the two branches. The channel concatenation operation is then used before passing the features to the LSTM to avoid information loss.
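The relationship between Eq. (1) and Eq. (2) can be checked numerically. The sketch below is an illustration only: it assumes one-dimensional "valid" cross-correlation for the operator in the equations, and all function names and sample values are hypothetical.

```python
def conv1d_valid(x, k):
    """'Valid' 1-D cross-correlation of sequence x with kernel k."""
    n = len(x) - len(k) + 1
    return [sum(x[i + j] * k[j] for j in range(len(k))) for i in range(n)]


def vec_add(a, b):
    """Element-wise sum of two equal-length sequences."""
    return [p + q for p, q in zip(a, b)]


def z_concat(xs, ys, kernels):
    """Eq. (1): 2c kernels; the first c act on X, the last c on Y."""
    c = len(xs)
    out = [0.0] * (len(xs[0]) - len(kernels[0]) + 1)
    for i in range(c):
        out = vec_add(out, conv1d_valid(xs[i], kernels[i]))
        out = vec_add(out, conv1d_valid(ys[i], kernels[i + c]))
    return out


def z_add(xs, ys, kernels):
    """Eq. (2): c kernels, each shared by X_i and Y_i."""
    out = [0.0] * (len(xs[0]) - len(kernels[0]) + 1)
    for x, y, k in zip(xs, ys, kernels):
        out = vec_add(out, conv1d_valid(vec_add(x, y), k))
    return out


# Two inputs with c = 2 channels each (illustrative values).
xs = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
ys = [[2.0, 0.0, 1.0, 3.0], [1.0, 1.0, 2.0, 0.0]]
ks = [[1.0, -1.0], [0.5, 2.0]]

added = z_add(xs, ys, ks)
tied = z_concat(xs, ys, ks + ks)  # concatenation with K_{i+c} = K_i
print(added, tied)
```

Addition needs c kernels per output channel where concatenation needs 2c, which is the parameter saving referred to above; the price is that the two inputs can no longer be weighted independently.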

3.2. CNN sub-model

CNNs are a type of neural network characterised by local connections and weight sharing, and they have been shown to achieve good results in the field of computer vision (D. Li et al., 2022a; Rawat & Wang, 2017; Shen et al., 2022). The convolutional layer is used to extract the spatial features of network traffic, and ReLU is selected as the activation function. Compared to Sigmoid and other functions, ReLU propagates gradients more effectively during backpropagation and is simpler to compute.

The output of the convolution neuron is as follows: $y_{i} = \max \left(0, \sum_{i \in [0, n]} x[i : i + c, : channel] * c_{j} + b_{j}\right)$ (3) where i represents the position after convolution, j represents the convolution kernel sequence number, c represents the convolution kernel length, and b represents the bias.

To address the issue of imbalanced abnormal traffic data, a batch normalisation layer is added after each convolutional layer so that each batch has a consistent distribution before it is input to the next layer. This helps to stabilise the output of each convolutional layer in the network, making it easier to converge and reducing the risk of overfitting. Batch normalisation (Chen et al., 2022; Ioffe & Szegedy, 2015; J. Li et al., 2023) is an improvement over plain standardisation: it normalises with the mean and variance of each mini-batch and then learns an appropriate scale and offset, thus accelerating the convergence process.

For each batch of input $B = \{x_{1}, \dots, x_{m}\}$, the mean and variance of the training batch data are obtained, and then the obtained mean and variance are used to normalise the training data of the batch. The process is as follows. $μ_{B} = \frac{1}{m} \sum_{i = 1}^{m} x_{i}$ (4) $σ_{B}^{2} = \frac{1}{m} \sum_{i = 1}^{m} (x_{i} - μ_{B})^{2}$ (5) $\hat{x_{i}} = \frac{x_{i} - μ_{B}}{\sqrt{σ_{B}^{2} + ε}}$ (6) where $μ_{B}$ is the mean of the batch data and $σ_{B}^{2}$ is the variance of the batch data. The batch mean is subtracted from the sample $x_{i}$ and the result is divided by the standard deviation of the batch to get $\hat{x_{i}}$, where ε is a tiny positive number that prevents division by zero.

To enhance the robustness of the network, the normalised data are then scaled and shifted. Because the normalised $\hat{x_{i}}$ is restricted to zero mean and unit variance, the expressive ability of the network is reduced. To solve this problem, two new parameters are introduced: $\hat{x_{i}}$ is multiplied by the scale factor γ and the translation factor β is added to get $y_{i}$. γ and β are learned by the network itself during training.

The batch normalisation transform is therefore: $y_{i} \leftarrow BN_{γ, β} (x_{i}) = γ \hat{x_{i}} + β$ (7) and the final output of the convolution neuron after adding batch normalisation is expressed as: $y_{i} = \max \left(0, BN_{γ, β} \left(\sum_{i \in [0, n]} x[i : i + c, : channel] * c_{j} + b_{j}\right)\right)$ (8)
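As a worked illustration of Eqs. (4) to (8), the sketch below applies training-mode batch normalisation to a batch of scalar pre-activations and then ReLU. It is a simplified sketch, not the model's implementation: the value of ε, the fixed γ = 1 and β = 0 (learned in practice), and the sample values are all assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one mini-batch of scalars, then scale and shift."""
    m = len(batch)
    mu = sum(batch) / m                                        # Eq. (4)
    var = sum((x - mu) ** 2 for x in batch) / m                # Eq. (5)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]   # Eq. (6)
    return [gamma * xh + beta for xh in x_hat]                 # Eq. (7)


def relu(values):
    """max(0, v) applied element-wise, as in Eq. (8)."""
    return [max(0.0, v) for v in values]


# Hypothetical convolution outputs for one channel across a batch of 4.
pre_activations = [2.0, 4.0, 6.0, 8.0]
normalised = batch_norm(pre_activations)
activated = relu(normalised)
print(normalised)
print(activated)
```

With γ = 1 and β = 0 the normalised batch has mean 0 and variance close to 1, so ReLU zeroes the below-average half of the batch; learned γ and β let the network move that cut-off.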

3.3. Sequential learning model

In a real network environment, network traffic occurs in sequence rather than in isolation, and the number and content of packets in traffic vary over time, suggesting the presence of time-structured information in traffic data. In this paper, LSTM with hidden layers is exploited to automatically extract temporal features, as opposed to the RNN approach proposed in Yin et al. (2017).

3.4. Traffic classification based on multimode

With the ongoing advancement of the attack-defense game, malicious traffic constantly upgrades technologies to evade detection, making malicious network traffic more and more concealed. To extract the abundant hidden information in network traffic, multimodal fusion and parallel cross-fused spatio-temporal feature fusion models can be employed to further extract and integrate this information, thereby enhancing the accuracy of recognition.

Before traffic classification, the features obtained from the two sub-models, the parallel cross-fused spatio-temporal feature sub-model and the CNN sub-model, need to be aggregated. To preserve the characteristics of multiple modes of network traffic as much as possible, the feature vector matrix $X = [x_{1}, x_{2}, \dots, x_{m}]$ output from the parallel cross-fused spatio-temporal feature sub-model and the feature vector matrix $Y = [y_{1}, y_{2}, \dots, y_{m}]$ output from the CNN sub-model are concatenated to obtain the fused feature matrix $Z_{concat} = [x_{1}, x_{2}, \dots, x_{m}, y_{1}, y_{2}, \dots, y_{m}]$. The channel concatenation operation splices the original features and feeds them into the network, which learns on its own how to fuse the features without losing information.

The CosMargin classifier is utilised to classify network traffic instead of the widely used Softmax. The Softmax loss is the most commonly used loss function for classification tasks in the field of deep learning. It is defined as follows: $L_{S} = - \frac{1}{m} \sum_{i = 1}^{m} \log \frac{e^{w_{y_{i}}^{T} x_{i} + b_{y_{i}}}}{\sum_{j = 1}^{n} e^{w_{j}^{T} x_{i} + b_{j}}}$ (9) In this formula, $x_{i}$ represents the deep feature of the ith sample, which belongs to class $y_{i}$, m denotes the batch size, n is the number of classes in the dataset, $b_{j}$ is the bias, w represents the weight, and $w_{j}$ is the weight vector of the jth class. Furthermore, $e^{w_{j}^{T} x_{i} + b_{j}}$ is the exponentiated output of the fully connected layer of the network. As the loss $L_{S}$ decreases, backpropagation updates the parameters so that the proportion of $e^{w_{y_{i}}^{T} x_{i} + b_{y_{i}}}$ continuously increases. This ensures that more samples of the same class are correctly classified and fall within the decision boundary.

It is evident that the primary objective of Softmax is to ascertain whether the sample is correctly classified. It lacks intra-class and inter-class constraints, which makes its classification effect inadequate for the imbalance problem. To address this issue, CosMargin (Cai et al., 2022) is chosen as the network classifier. It modifies Softmax to indirectly establish a marginal boundary on the feature layer, thereby allowing the network to achieve superior generalisation performance and a better trade-off between basic and novel categories. To derive a margin loss from Softmax, we set the bias $b_{j} = 0$ and express the inner product of the weights $W_{j}$ and the input $x_{i}$ as $W_{j}^{T} x_{i} = ‖ W_{j} ‖ ‖ x_{i} ‖ \cos θ_{j}$, where $θ_{j}$ is the angle between $W_{j}$ and $x_{i}$. We apply L2 normalisation to both $W_{j}$ and $x_{i}$ so that $‖ W_{j} ‖ = 1$ and $‖ x_{i} ‖ = 1$. However, since a small norm of $x_{i}$ leads to a large training loss (a small value of Softmax), we scale it by a fixed factor s. Next, we introduce a cosine margin to impose a constraint, such that the current sample still belongs to its class after subtracting a margin m, that is $\cos θ_{1} - m > \cos θ_{2}$ (inside the exponent m denotes this margin, whereas in the prefactor and summation limit m is the batch size). This gives us the CosMargin loss used in this paper: $L_{ML} = - \frac{1}{m} \sum_{i = 1}^{m} \log \frac{e^{s (\cos θ_{y_{i}} - m)}}{e^{s (\cos θ_{y_{i}} - m)} + \sum_{j = 1, j \neq y_{i}}^{n} e^{s \cos θ_{j}}}$ (10)
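The following sketch shows one way a cosine-margin loss of this kind can be computed, following the constraint $\cos θ_{1} - m > \cos θ_{2}$ stated above. It is an illustration under assumptions rather than the authors' implementation: the function names are hypothetical, and the scale s = 10 and margin 0.35 are arbitrary example values, not settings taken from the paper.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cos_margin_loss(features, labels, class_weights, s=10.0, margin=0.35):
    """Mean cosine-margin loss over a batch.

    features:      list of feature vectors x_i
    labels:        list of class indices y_i
    class_weights: one weight vector W_j per class
    """
    weights = [l2_normalise(w) for w in class_weights]
    total = 0.0
    for x, y in zip(features, labels):
        x_n = l2_normalise(x)
        # With unit norms, the inner product equals cos(theta_j).
        cosines = [sum(a * b for a, b in zip(w, x_n)) for w in weights]
        # The margin is subtracted from the true-class cosine only.
        logits = [s * (c - margin) if j == y else s * c
                  for j, c in enumerate(cosines)]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[y]
    return total / len(features)


# One sample of class 0 and two classes (illustrative values).
feats, labs = [[2.0, 1.0]], [0]
W = [[1.0, 0.0], [0.0, 1.0]]
plain = cos_margin_loss(feats, labs, W, margin=0.0)
with_margin = cos_margin_loss(feats, labs, W, margin=0.35)
print(plain, with_margin)
```

For the same sample the loss with a margin is larger than without, so training has to push the true-class cosine further above the others before the loss becomes small.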

4. Experiment and result analysis

The experimental environment is shown in Table .

Table 3. Experiment environment.

4.1. Datasets and data processing

In the field of network intrusion detection, there are numerous public datasets available. Traditional IDS can extract network traffic features from network traffic header files and train models based on these features. However, the method of extracting features from packet headers based on fixed rules only utilises the information stored in the header and overlooks the payload. In addition, each single data packet is treated as the detection object, disregarding the correlation between data packets. Therefore, a novel multimodal intrusion detection dataset construction method is proposed using the CIC-IDS2017 (Sharafaldin et al., 2018) and ISCX 2012 (Shiravi et al., 2012) datasets, which cover a wide range and are closer to the real world. The original PCAP files of the network flows are used in order to make full use of the abundant information in the network flows. Tables and display the distribution of attack flows in our experiments.

Table 4. The distribution of CIC-IDS2017.

Table 5. The distribution of ISCX 2012.

4.1.1. Traffic statistics processing

Traditional IDS extract a variety of features from the packet headers in the PCAP files of network traffic. For instance, CICFlowMeter can extract over 80 network traffic features. Using traffic feature extraction tools, CIC-IDS2017 and ISCX 2012 both provide network flow features extracted from network traffic, stored with labels in CSV files named by date. The provided label files are processed as follows.

Data cleaning: The data in the label file is inspected and analysed, and rows with missing values and duplicate rows are removed. Data refinement is conducted to ensure data quality. A random forest feature selection algorithm (Breiman, 2001) is employed to delete redundant features and retain features with stronger distinguishing abilities, so as to improve the classification and identification of unbalanced network traffic data. The statistical features of the CIC-IDS2017 dataset are screened by the random forest feature selection algorithm, and the redundant features that have little effect on classification are discarded, leaving 64 statistical features.
Data standardisation: As can be seen from the label files, the range of each feature value in intrusion detection varies greatly. In this paper, Z-Score normalisation is used to standardise network flow features. $x_{i}^{'} = \frac{x_{i} - \bar{x}}{v}$ (11) In the above formula, $\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$, $v = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}}$, n is the total sample number, $x_{i}$ is the value of one feature dimension of the sample data before standardisation, and $x_{i}^{'}$ is the value of the corresponding feature dimension after standardisation (Liu & Zhang, 2020).
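The standardisation in Eq. (11) can be sketched for a single feature column as follows. The sample values and the function name are hypothetical, and returning zeros for a constant column is an assumption made here to avoid division by zero; the text does not say how such columns are treated.

```python
import statistics


def z_score(column):
    """Standardise one feature column as in Eq. (11)."""
    mean = statistics.fmean(column)
    # statistics.stdev uses the (n - 1) denominator, matching v in Eq. (11).
    std = statistics.stdev(column)
    if std == 0:
        # Assumed handling of a constant column.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Hypothetical values of one statistical flow feature.
flow_feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = z_score(flow_feature)
print(scaled)
```

After scaling, the column has mean 0 and sample standard deviation 1, so features with very different ranges contribute on a comparable scale.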

4.1.2. Original traffic load processing

The DL-based malicious traffic detection method is employed to build an end-to-end monitoring model. The original traffic load is used as the information of another mode for feature extraction to enable the model to learn the deep features implied in the network traffic.

A novel original traffic feature extraction method is proposed to address the data imbalance problem of intrusion detection datasets. The proposed algorithm divides the data traffic into data streams based on the five-tuple of source IP address, source port, destination IP address, destination port, and transport layer protocol. The number of data packets in each divided data stream is limited in order to increase the number of data streams, which acts as a data enhancement method that alleviates the problem of data imbalance. The specific process of the algorithm is as follows.

Extract malicious traffic: Tshark is used to obtain the original hexadecimal data of the flow, and scapy is used to parse each traffic packet in turn to obtain its five-tuple information, which is compared with the five-tuples in the label file provided by the dataset to determine whether the data packet belongs to malicious traffic. To account for the real network environment, we retain the header fields of the MAC layer. If the five-tuple information of the packet is the same as the five-tuple information of the malicious traffic in the label file, the five-tuple, the original data, and the corresponding label are stored in the corresponding CSV file named after the malicious traffic.
Data cleaning: Each CSV file is screened to find duplicate and empty data, and eliminate them. The data quality is improved, and the model error is reduced.
Partition the data stream: The five-tuple information in the CSV file is used to partition the data stream. The number of packets contained in each stream is limited to s, and the number of bytes in each packet is limited to len. When a flow contains more than s packets, the extra packets are taken as new data streams, which increases the number of data streams and enhances the data. If a stream contains fewer than s packets, previous data packets are used to fill it, which avoids too many 0 elements in the data and reduces redundant features. If the number of bytes in a packet is less than len, the packet is padded with 0. In this study, s is usually set to 5 and len is set to 96.
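The partitioning step above can be sketched as follows. This is one possible reading, not the authors' code: filling a short stream by repeating its own earlier packets in order is an assumed interpretation of "previous data packets", the record layout is hypothetical, and so are the function names. The defaults s = 5 and length = 96 come from the text.

```python
from collections import defaultdict


def group_by_five_tuple(records):
    """Group (five_tuple, payload) records into flows, keeping order."""
    flows = defaultdict(list)
    for five_tuple, payload in records:
        flows[five_tuple].append(payload)
    return flows


def fit_packet(payload, length=96):
    """Truncate to `length` bytes, or right-pad with zero bytes."""
    return payload[:length].ljust(length, b"\x00")


def partition_flow(packets, s=5, length=96):
    """Split one flow into streams of exactly s fixed-length packets."""
    streams = []
    for start in range(0, len(packets), s):
        chunk = [fit_packet(p, length) for p in packets[start:start + s]]
        i = 0
        while len(chunk) < s:
            # Assumed fill rule: reuse earlier packets of this stream.
            chunk.append(chunk[i])
            i += 1
        streams.append(chunk)
    return streams


# Seven packets of one hypothetical flow, with varying payload sizes.
key = ("10.0.0.1", 1234, "10.0.0.2", 80, "TCP")
records = [(key, bytes([i + 1]) * (10 + i)) for i in range(7)]
flows = group_by_five_tuple(records)
streams = partition_flow(flows[key])
print(len(streams), [len(st) for st in streams])
```

Seven packets yield two streams of five packets each; the second holds only two original packets, so the remaining three slots are filled by repeating them instead of inserting all-zero packets.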

4.2. Evaluation criteria

In this paper, five common threshold-independent metrics, namely Precision, Recall, False Alarm Rate (FAR), Accuracy, and F1-score, are employed to evaluate the performance of the proposed method. Specifically, TP denotes the number of abnormal samples correctly identified as abnormal, TN the number of normal samples correctly identified as normal, FP the number of normal samples incorrectly identified as abnormal, and FN the number of abnormal samples incorrectly identified as normal. $Precision = \frac{TP}{TP + FP}$ (12) $Recall = \frac{TP}{TP + FN}$ (13) $FAR = \frac{FP}{FP + TN}$ (14) $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (15) $F1\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$ (16)
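Eqs. (12) to (16) translate directly into code. The sketch below uses made-up confusion counts purely for illustration; they are not results from the paper, and the function name is hypothetical. It also assumes every denominator is non-zero.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute the five metrics of Eqs. (12)-(16) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "far": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 100 abnormal and 900 normal samples.
metrics = detection_metrics(tp=90, tn=880, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.97 while the precision is about 0.82, which illustrates why accuracy alone can look favourable on imbalanced traffic and why FAR and F1-score are reported alongside it.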

4.3. The setting of the hyperparameter

To train our model, we utilised the Adam optimiser to expedite network convergence. The momentum was fixed at 0.9, and the weight decay was set at 0.0005 to mitigate overfitting. As different learning rates have varying effects on convergence, we employed a learning rate of 0.001 for the first eight rounds, 0.0001 for the next three rounds, and 0.00001 for the final two rounds, totalling 15 epochs. Our experiments were conducted on an RTX2060 graphics card, with a batch size of 256 during training and 512 during validation and testing. To effectively evaluate the performance of the MHPN model, we refrained from utilising additional data augmentation techniques during training and testing.
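The stepwise learning rate described above can be sketched as a small lookup function. This is only an illustration: the name `lr_for_epoch` and the `STAGES` table are hypothetical, the stage lengths are taken literally from the text, and keeping the final rate for any epoch beyond the listed stages is an assumption.

```python
from typing import List, Tuple

# (number of epochs, learning rate) for each stage, as listed in the text.
STAGES: List[Tuple[int, float]] = [(8, 0.001), (3, 0.0001), (2, 0.00001)]


def lr_for_epoch(epoch: int, stages: List[Tuple[int, float]] = STAGES) -> float:
    """Return the learning rate for a zero-based epoch index.

    Assumption: epochs past the last listed stage keep the final rate.
    """
    boundary = 0
    for length, rate in stages:
        boundary += length
        if epoch < boundary:
            return rate
    return stages[-1][1]


# Learning rate for each of the 15 epochs mentioned in the text.
example_schedule = [lr_for_epoch(e) for e in range(15)]
```

In a training loop, the value returned for the current epoch would be assigned to the optimiser before that epoch's batches are processed.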

4.4. Experiments and results

To comprehensively evaluate the proposed MHPN model, experiments are conducted on the CIC-IDS2017 and ISCX-IDS 2012 datasets, and the model is compared with methods from previous research. By default, the original traffic load, which gives the better classification effect, is used as the input of the comparison models, whereas models marked with the suffix (T) take the statistical traffic features as input. The detailed experimental results on the CIC-IDS2017 dataset are presented in Tables 6–9.

Figure 2 illustrates the overall accuracy of the models on the CIC-IDS2017 dataset. The MHPN model has a higher overall accuracy than the other models, regardless of whether they use the original traffic load or the statistical traffic features as input.

Figure 2. Comparison of CICIDS2017 experimental results.

It can be seen from Table 6 that the detection accuracy of the proposed MHPN model is significantly superior to that of the existing intrusion detection models. Among the 12 types of unbalanced abnormal traffic, the accuracy for the 11th type is slightly lower than that of the PCCN model, but still higher than that of the other models. For the other 11 types, the classification accuracy is significantly better than that of all comparable models. Tables 7–9 demonstrate that the Recall, F1-score, and FAR of the proposed MHPN model are all optimal in the 12-class classification, indicating that the model has enhanced robustness and positive sample recognition ability.

Figure 3 presents the confusion matrix of the final test results of the model, which reveals that all 12 kinds of abnormal flows are correctly classified. There is almost no misclassification of abnormal flows with large samples, and a satisfactory classification effect is achieved for abnormal flows with small samples. Consequently, the model has excellent detection performance for unbalanced abnormal traffic.

Figure 3. Confusion matrix of CIC-IDS2017.
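Per-class values of Precision, Recall, and FAR can be read off a multi-class confusion matrix by treating each class in turn as the positive class (one-vs-rest). The sketch below illustrates this under stated assumptions: the function name `per_class_metrics` is hypothetical, rows are taken to be true classes and columns predicted classes, and the 3 × 3 matrix is invented for illustration and is not taken from the paper's results.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]]) -> List[Dict[str, float]]:
    """One-vs-rest Precision, Recall and FAR for each class of a square matrix.

    Assumption: cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # class k predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp   # other classes predicted as k
        tn = total - tp - fn - fp
        results.append({
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "far": fp / (fp + tn) if fp + tn else 0.0,
        })
    return results


# Invented 3-class matrix with one large, one medium and one small class.
example_cm = [
    [50, 2, 0],
    [1, 30, 1],
    [0, 0, 5],
]
example_per_class = per_class_metrics(example_cm)
```

The small third class shows why per-class metrics matter for unbalanced traffic: a single misclassified sample from another class lowers its Precision to 5/6, even though its Recall is 1.0.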

To verify the generality of the proposed multimodal model, the same experiments are conducted on the ISCX-IDS 2012 dataset. As shown in Figure 4, the overall detection accuracy of the MHPN model still surpasses that of the other models. Subfigures (a), (b), (c) and (d) in Figure 5 illustrate that the proposed model is superior to the other models on all four indicators on the ISCX-IDS 2012 dataset.

Figure 4. Comparison of ISCX2012 experimental results.

Figure 5. ISCX2012 dataset results. (a) Precision. (b) Recall. (c) F1-score. (d) FAR.

In conclusion, the proposed MHPN model is superior to the previous network models: all of its evaluation indicators are better than those of the comparison models and remain at excellent values. The model not only has strong stability and positive sample recognition ability, but also shows satisfactory detection performance for unbalanced abnormal traffic.

4.5. Multimodal functional analysis

Unlike other intrusion detection models, the model proposed in this paper utilises multiple modalities, namely traffic statistics and raw load, to represent network traffic information from multiple perspectives. To evaluate the complementarity and multimodality of network features, three scenarios are tested: traffic statistical information mode, original traffic load mode, and multimodal mode. Three models are compared, and the results are presented.

CNN model: Only traffic statistics are used as model input for abnormal traffic detection.
PCSN model: A cross-fused spatio-temporal feature model using only the raw load of traffic as input.
MHPN model: Abnormal traffic detection is performed using both traffic statistics and raw load of traffic.

Comparing the multimodal model with its two single-mode branch sub-models, it can be seen from Figure 6 that the performance of the multimodal model is significantly improved.

Figure 6. Comparison of multimodal experimental results.

As shown in Figure 7, the multimodal model proposed in this paper improves markedly on the single-mode models. For the Precision, Recall, F1-score, and FAR of the 12 kinds of abnormal flows, the multimodal model significantly improves the evaluation indicators of each category compared with the single-mode models. In general, the multimodal model improves the performance of intrusion detection and detects abnormal traffic better than the single-modal models.

Figure 7. Comparison of multimodal experimental results. (a) Precision. (b) Recall. (c) F1-score. (d) FAR.

4.6. Functional analysis of FCN

This section examines the functionality of the FCN. Table 10 presents a comparison that illustrates the advantages of the FCN.

Table 10. Comparison of models with and without the FCN.

CNN+LSTM extracts features using a single set of CNN and LSTM. CMHPN is a variant of MHPN that uses two sets of CNN+LSTM to extract features, replacing the full convolution of the upper branch with convolution and pooling operations. The experimental results in Table 10 demonstrate that using two sets of CNNs and LSTMs yields a significant improvement over using only one set. Additionally, using the FCN in the upper branch of MHPN yields significantly better results than CMHPN.

4.7. Complexity analysis

To assess the time efficiency of the model, we measured the test time of each model during the inference phase. For a fair comparison, we ran all models on the same machine and used the same parameters for all neural network models. Table 11 shows the test times for each model.

Table 11. Test time of models.

The results indicate that the proposed MHPN model is only slightly slower than the fastest of the compared models. As shown in Table 11, MHPN takes 6.33 seconds at test time, which is second only to CNN_LSTM and faster than PCCN. This suggests that MHPN processes data efficiently and is more concise and better optimised than PCCN. Moreover, the training time of the machine learning algorithm KNN is much longer than that of the deep learning algorithms, which makes it unsuitable for big data computing in network intrusion detection. In summary, MHPN is a novel and efficient deep learning model with a reasonable and acceptable test time, and it outperforms the other models.
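A test-time measurement of the kind described in this section can be sketched with a simple wall-clock harness. The sketch is illustrative only: `measure_test_time` and the stand-in `dummy_predict` are hypothetical names, the paper does not state how its timings were taken, and on a GPU an asynchronous framework would need to be synchronised before the clock is read.

```python
import time
from typing import Callable, List, Sequence


def measure_test_time(
    predict: Callable[[Sequence[int]], object],
    batches: Sequence[Sequence[int]],
    repeats: int = 3,
) -> float:
    """Return the best wall-clock time in seconds for one pass over `batches`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for batch in batches:
            predict(batch)
        best = min(best, time.perf_counter() - start)
    return best


def dummy_predict(batch: Sequence[int]) -> List[int]:
    """Stand-in for a model's inference call (hypothetical)."""
    return [x % 2 for x in batch]


# Four hypothetical test batches of 512 samples, matching the test batch size in the text.
example_batches = [list(range(512)) for _ in range(4)]
example_seconds = measure_test_time(dummy_predict, example_batches)
```

Taking the minimum over several repeats reduces the influence of one-off delays such as caching or background load, so that the models are compared on their steady-state inference cost.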

5. Conclusion

Faced with complex network traffic, single-modal intrusion detection models cannot make the most of the information contained in network traffic. To address this issue, this paper proposes a multimodal hybrid parallel network intrusion detection model, MHPN. The MHPN model synthesises network traffic characteristics from two modes, traffic statistics information and original traffic load, and builds an appropriate neural network sub-model for each mode. For the original load mode, the spatio-temporal features of network traffic are extracted by combining two-branch convolutional neural networks with LSTM. For the statistical traffic information, a convolutional neural network is used for feature extraction. Finally, the information from the two modes is fused and sent to the fully convolutional neural network to extract features for classification. Ablation experiments are conducted on the ISCX-IDS 2012 and CIC-IDS-2017 datasets to evaluate the performance of the MHPN model. The results demonstrate that the MHPN model deals effectively with unbalanced abnormal traffic data, improves the accuracy of anomaly detection, and exhibits strong robustness and positive sample recognition. The model achieves an average accuracy of 99.98%. Furthermore, the MHPN model outperforms the single-modal models in intrusion detection performance.

Despite its superiority in all aspects, the MHPN model has a limitation in scalability due to its reliance on a closed set protocol. This protocol only allows the classification of known classes that appear in the training data, and fails to detect unknown attacks or even misclassifies them as known classes. Considering the increasing complexity of the network environment, scalable open-set recognition models are more suitable for real network applications. Therefore, in future research, we will further investigate and address the scalability of MHPN, and then test and refine it in real-time network environments. In addition, the model relies on a centralised architecture that may not be scalable or resilient to attacks. A possible direction for future work is to integrate blockchain technology with intrusion detection systems to enhance their security and trustworthiness (Gao et al., Citation2022; Han, Zhu et al., Citation2022; D. Li et al., Citation2022b). Blockchain is a decentralised and distributed ledger that can provide data integrity, transparency, and immutability (Han, Pan et al., Citation2022; H. Li et al., Citation2022). Blockchain can also enable collaborative intrusion detection systems (CIDS) that can share data and alerts among different nodes without compromising data privacy and confidentiality.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

Aceto, G., Ciuonzo, D., Montieri, A., & Pescapé, A. (2019, December). MIMETIC: Mobile encrypted traffic classification using multimodal deep learning. Computer Networks, 165, 106944. https://doi.org/10.1016/j.comnet.2019.106944
Aceto, G., Ciuonzo, D., Montieri, A., & Pescapé, A. (2021, June). DISTILLER: Encrypted traffic classification via multimodal multitask deep learning. Journal of Network and Computer Applications, 183–184, 102985. https://doi.org/10.1016/j.jnca.2021.102985
Ahmim, A., Derdour, M., & Ferrag, M. A. (2018). An intrusion detection system based on combining probability predictions of a tree of classifiers. International Journal of Communication Systems, 31(9), e3547. https://doi.org/10.1002/dac.v31.9
Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Al-Amidie, M., & Farhan, L. (2021, March). Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data, 8(1), 53. https://doi.org/10.1186/s40537-021-00444-8
Anthi, E., Williams, L., Słowińka, M., Theodorakopoulos, G., & Burnap, P. (2019, October). A supervised intrusion detection system for smart home IoT devices. IEEE Internet of Things Journal, 6(5), 9042–9053. https://doi.org/10.1109/JIoT.6488907
Bontemps, L., Cao, V. L., McDermott, J., & Le-Khac, N. A. (2016). Collective anomaly detection based on long short-term memory recurrent neural networks. In Future data and security engineering (pp. 141–152). Springer. Retrieved December 22, 2022, from https://doi.org/10.1007/978-3-319-48057-29.
Breiman, L. (2001, October). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Buda, M., Maki, A., & Mazurowski, M. A. (2018, October). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106, 249–259. https://doi.org/10.1016/j.neunet.2018.07.011
Cai, S., Han, D., Yin, X., Li, D., & Chang, C. C. (2022, December). A hybrid parallel deep learning model for efficient intrusion detection based on metric learning. Connection Science, 34(1), 551–577. https://doi.org/10.1080/09540091.2021.2024509
Chen, C., Han, D., & Chang, C. C. (2022, December). CAAN: Context-aware attention network for visual question answering. Pattern Recognition, 132, 108980. https://doi.org/10.1016/j.patcog.2022.108980
Dong, Y., Wang, R., & He, J. (2019, October). Real-time network intrusion detection system based on deep learning. In 2019 IEEE 10th international conference on software engineering and service science (ICSESS) (pp. 1–4). ISSN: 2327-0594.
Gao, N., Han, D., Weng, T. H., Xia, B., Li, D., Castiglione, A., & Li, K. C. (2022, October). Modeling and analysis of port supply chain system based on Fabric blockchain. Computers & Industrial Engineering, 172, 108527. https://doi.org/10.1016/j.cie.2022.108527
Guo, Z., & Han, D. (2023, January). Sparse co-attention visual question answering networks based on thresholds. Applied Intelligence, 53(1), 586–600. https://doi.org/10.1007/s10489-022-03559-4
Han, D., Pan, N., & Li, K. C. (2022, January). A traceable and revocable ciphertext-policy attribute-based encryption scheme based on privacy protection. IEEE Transactions on Dependable and Secure Computing, 19(1), 316–327. https://doi.org/10.1109/TDSC.2020.2977646
Han, D., Zhu, Y., Li, D., Liang, W., Souri, A., & Li, K. C. (2022, May). A blockchain-based auditable access control system for private data in service-centric IoT environments. IEEE Transactions on Industrial Informatics, 18(5), 3530–3540. https://doi.org/10.1109/TII.2021.3114621
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In (pp. 770–778). Retrieved December 22, 2022, from https://openaccess.thecvf.com/contentcvpr2016/html/HeDeepResidualLearningCVPR2016paper.html.
Hnaif, A., Jaber, K., Alia, M., & Daghbosheh, M. (2021, January). Parallel scalable approximate matching algorithm for network intrusion detection systems. International Arab Journal of Information Technology, 18(1), 77–84. https://www.webofscience.com/wos/alldb/full-record/WOS:000607690000009.
Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd international conference on machine learning (pp. 448–456). PMLR. Retrieved December 25, 2022, from https://proceedings.mlr.press/v37/ioffe15.html (ISSN: 1938-7228).
Jan, S. U., Ahmed, S., Shakhov, V., & Koo, I. (2019). Toward a lightweight intrusion detection system for the internet of things. IEEE Access, 7, 42450–42471. https://doi.org/10.1109/Access.6287639
Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998, November). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. https://doi.org/10.1109/5.726791
Li, D., Han, D., Weng, T. H., Zheng, Z., Li, H., Liu, H., Castiglione, A., & Li, K. C. (2022a, May). Blockchain for federated learning toward secure distributed machine learning systems: A systemic survey. Soft Computing, 26(9), 4423–4440. https://doi.org/10.1007/s00500-021-06496-5
Li, D., Han, D., Weng, T. H., Zheng, Z., Li, H., Liu, H., Castiglione, A., & Li, K. C. (2022b, April). MOOCsChain: A blockchain-based secure storage and sharing scheme for MOOCs learning. Computer Standards & Interfaces, 81, 103597. https://doi.org/10.1016/j.csi.2021.103597
Li, H., Han, D., & Tang, M. (2022, March). A privacy-preserving storage scheme for logistics data with assistance of blockchain. IEEE Internet of Things Journal, 9(6), 4704–4720. https://doi.org/10.1109/JIOT.2021.3107846
Li, J., Han, D., Wu, Z., Wang, J., Li, K. C., & Castiglione, A. (2023, May). A novel system for medical equipment supply chain traceability based on alliance chain and attribute and role access control. Future Generation Computer Systems, 142, 195–211. https://doi.org/10.1016/j.future.2022.12.037
Li, J., Zhao, Z., Li, R., & Zhang, H. (2019, April). AI-based two-stage intrusion detection for software defined IoT networks. IEEE Internet of Things Journal, 6(2), 2093–2102. https://doi.org/10.1109/JIoT.6488907
Liu, G., & Zhang, J. (2020, May). CNID: Research of network intrusion detection based on convolutional neural network. Discrete Dynamics in Nature and Society, 2020, 1–11. https://www.hindawi.com/journals/ddns/2020/4705982/.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In (pp. 3431–3440). Retrieved December 25, 2022, from https://openaccess.thecvf.com/contentcvpr2015/html/LongFullyConvolutionalNetworks2015CVPRpaper.
Rawat, W., & Wang, Z. (2017, September). Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29(9), 2352–2449. https://doi.org/10.1162/neco_a_00990
Roy, S. S., Mallik, A., Gulati, R., Obaidat, M. S., & Krishna, P. V. (2017). A deep learning based artificial neural network approach for intrusion detection. In D. Giri, R. N. Mohapatra, H. Begehr, & M. S. Obaidat (Eds.), Mathematics and computing (Vol. 655, pp. 44–53). Springer Singapore. Retrieved December 16, 2022, from https://doi.org/10.1007/978-981-10-4642-15.
Sharafaldin, I., Habibi Lashkari, A., & Ghorbani, A. A. (2018). Toward generating a new intrusion detection dataset and intrusion traffic characterization. In Proceedings of the 4th international conference on information systems security and privacy (pp. 108–116). SCITEPRESS - Science and Technology Publications. Retrieved December 22, 2022, from http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0006639801080116.
Shen, X., Han, D., Guo, Z., Chen, C., Hua, J., & Luo, G. (2022, December). Local self-attention in transformer for visual question answering. Applied Intelligence, 53(13), 16706–16723. https://doi.org/10.1007/s10489-022-04355-w.
Shiravi, A., Shiravi, H., Tavallaee, M., & Ghorbani, A. A. (2012, May). Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Computers & Security, 31(3), 357–374. https://doi.org/10.1016/j.cose.2011.12.012
Su, T., Sun, H., Zhu, J., Wang, S., & Li, Y. (2020). BAT: Deep learning methods on network intrusion detection using NSL-KDD dataset. IEEE Access, 8, 29575–29585. https://doi.org/10.1109/Access.6287639
Vinayakumar, R., Soman, K. P., & Poornachandran, P. (2017, September). Applying convolutional neural network for network intrusion detection. In 2017 international conference on advances in computing, communications and informatics (ICACCI) (pp. 1222–1228).
Wang, X., Chen, S., & Su, J. (2020, July). App-Net: A hybrid neural network for encrypted mobile traffic classification. In IEEE INFOCOM 2020 - IEEE conference on computer communications workshops (INFOCOM WKSHPS) (pp. 424–429). IEEE. Retrieved May 08, 2023, from https://ieeexplore.ieee.org/document/9162891/.
Wang, Z., Han, D., Li, M., Liu, H., & Cui, M. (2022, December). The abnormal traffic detection scheme based on PCA and SSH. Connection Science, 34(1), 1201–1220. https://doi.org/10.1080/09540091.2022.2051434
Wu, P. f., & Shen, H. j. (2012, June). The research and amelioration of pattern-matching algorithm in intrusion detection system. In 2012 IEEE 14th international conference on high performance computing and communication & 2012 IEEE 9th international conference on embedded software and systems (pp. 1712–1715).
Yang, Z., Liu, X., Li, T., Wu, D., Wang, J., Zhao, Y., & Han, H. (2022). A systematic literature review of methods and datasets for anomaly-based network intrusion detection. Computers & Security, 116, 102675. https://doi.org/10.1016/j.cose.2022.102675
Yin, C., Zhu, Y., Fei, J., & He, X. (2017). A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access, 5, 21954–21961. https://doi.org/10.1109/ACCESS.2017.2762418
Zhang, H. (2009, January). Design of intrusion detection system based on a new pattern matching algorithm. In 2009 international conference on computer engineering and technology (pp. 545–548). IEEE. Retrieved December 20, 2022, from https://ieeexplore.ieee.org/document/4769526/.
Zhang, Y., Chen, X., Guo, D., Song, M., Teng, Y., & Wang, X. (2019). PCCN: Parallel cross convolutional neural network for abnormal network traffic flows detection in multi-class imbalanced network traffic flows. IEEE Access, 7, 119904–119916. https://doi.org/10.1109/Access.6287639

Appendix. Source code

The source code for the algorithms and implementations presented in this paper is available upon request. To facilitate this, we gladly offer the source code to interested readers for research and academic purposes. For those seeking access, please feel free to contact us via email using the subject line “Source Code Request – [A multimodal hybrid parallel network intrusion detection model].” In your email, kindly provide a brief overview of how you intend to use the code and include information about your affiliation. We will be prompt in our response and look forward to supporting your research endeavors.
