Research Article

TfrAdmCov: a robust transformer encoder based model with Adam optimizer algorithm for COVID-19 mutation prediction

Article: 2365334 | Received 06 Feb 2024, Accepted 03 Jun 2024, Published online: 12 Jun 2024

ABSTRACT

The development of vaccines and drugs is very important in combating the coronavirus disease 2019 (COVID-19) virus. The effectiveness of these vaccines and drugs has decreased as a result of the mutation of the COVID-19 virus, so combating COVID-19 mutations is critically important. Most studies published in the literature address aspects of COVID-19 other than mutation prediction; we focus on this gap in this study. This study proposes a robust transformer encoder based model with the Adam optimizer algorithm, called TfrAdmCov, for COVID-19 mutation prediction. Our main motivation is to predict the mutations occurring in the COVID-19 virus using the proposed TfrAdmCov model. The experimental results show that the proposed TfrAdmCov model outperforms both baseline models and several state-of-the-art models. The proposed TfrAdmCov model reached an accuracy of 99.93%, precision of 100.00%, recall of 97.38%, f1-score of 98.67% and MCC of 98.65% on the COVID-19 testing dataset. Moreover, to further evaluate the performance of the proposed TfrAdmCov model, we carried out mutation prediction on the influenza A/H3N2 HA dataset. The results obtained are promising for the development of vaccines and drugs.

1. Introduction

Coronaviruses contain a single-stranded, positive-polarity Ribonucleic Acid (RNA) genome. Coronaviruses were first discovered in the 1960s (Haimed et al., Citation2021). The first coronaviruses discovered were HCoV-229E and HCoV-OC43. These were followed by Severe Acute Respiratory Syndrome Coronavirus 1 (SARS-CoV-1) in 2003, HCoV-NL63 in 2004, HCoV-HKU1 in 2005, the Middle East Respiratory Syndrome coronavirus (MERS) in 2012 and, finally, COVID-19, caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), at the end of December 2019 (Haimed et al., Citation2021). The COVID-19 virus, named by the World Health Organization (WHO) (Sohrabi et al., Citation2020), first appeared in the city of Wuhan, the capital of China’s Hubei province, at the end of December 2019 (Wu et al., Citation2020). The COVID-19 virus spread from China to many other countries. Countries had to go into partial or full shutdowns to combat the COVID-19 virus, which caused great fear and panic among people (Hai-Dong et al., Citation2022). WHO declared a world emergency for the COVID-19 outbreak on 30 January 2020 and then announced to the world on 11 March 2020 that it had turned into a pandemic (Sharma et al., Citation2021; Tang et al., Citation2024). As of 28 April 2024, the number of confirmed cases worldwide is 775,379,864, while the number of confirmed deaths is 7,047,396 (World Health Organization, Citation2023). The COVID-19 virus causes mild symptoms in about 80% of infected people, but acute respiratory distress syndrome (ARDS) in some (Sharma et al., Citation2021). ARDS can cause multiple organ failure and other serious conditions (Suri et al., Citation2020). Many test kits have been described for the diagnosis of the COVID-19 virus (Rashid et al., Citation2020). The most widely used and proven of these, real-time reverse transcriptase polymerase chain reaction (rRT-PCR), is frequently used in the detection of the COVID-19 virus (Serena Low et al., Citation2021). rRT-PCR test results are usually available within a few hours to two days (Sharma et al., Citation2021). Measures such as physical or social distancing, quarantine, ventilation of closed areas, and covering the mouth and nose when coughing and sneezing have been taken to reduce the spread of the COVID-19 virus. Additionally, various vaccines have been developed, and the impact of the COVID-19 virus has been reduced to a certain extent. However, the frequent mutation of the COVID-19 virus has either greatly reduced or eliminated the effectiveness of these vaccines and drugs. For this reason, it has become very difficult to fight the COVID-19 virus. To overcome these challenges, it is vitally important to predict mutations that may occur in the COVID-19 virus. If mutations in the COVID-19 virus structure, especially in the S protein, can be predicted, vaccines can be updated quickly even when the COVID-19 virus mutates. Recently, transformer based models have been used very effectively in sequence-based mutation tasks. Transformer-based models are particularly successful in natural language processing tasks, mainly because they are entirely attention-based and their many attention heads operate in parallel. In this study, we propose a robust transformer encoder based model with the Adam optimizer algorithm, TfrAdmCov, for COVID-19 mutation prediction. We aim to predict mutations in the COVID-19 virus using the proposed TfrAdmCov model.
The proposed TfrAdmCov model has the ability to easily capture long-term dependencies. The transformer layer of the proposed TfrAdmCov model makes large-scale parallel computing possible. In addition, the information learned through the MHA architecture can be retained more easily, which contributes to the robustness of the proposed TfrAdmCov model. As a result, the proposed TfrAdmCov model has proved quite successful in mutation prediction of the COVID-19 virus compared to both other models and the literature. The contributions of this article are given below.

  • We propose a robust transformer encoder based model with Adam optimizer algorithm, TfrAdmCov, for COVID-19 mutation prediction.

  • We use the GridSearchCV hyperparameter tuning algorithm to improve the performance of machine learning-based models.

  • We use agglomerative clustering algorithm to create Training and Testing datasets.

  • We use the stratified 10-fold cross validation technique in addition to the holdout technique to evaluate the performance of machine learning-based models reliably.

  • We perform mutation prediction on both the COVID-19 and influenza A/H3N2 HA datasets.

  • We conduct a detailed performance comparison of the proposed TfrAdmCov model with traditional machine learning algorithms and deep learning algorithms on both the COVID-19 and influenza A/H3N2 HA datasets.

To facilitate readability, the organisation of the remaining part of the article is as follows: related works are discussed in Section 2. Section 3 provides detailed information about the COVID-19 virus. Section 4 presents the proposed TfrAdmCov model. Section 5 presents the experimentation (COVID-19 dataset, the GridSearchCV hyperparameter tuning technique, baseline models, etc.). Results and discussion are presented in Section 6. The limitations of the study are mentioned in Section 7. Finally, the findings are discussed in Section 8.

2. Background

When the literature is examined in detail, the majority of studies address aspects of COVID-19 other than mutation prediction. Additionally, mutation prediction has frequently been performed on influenza viruses. Some relevant studies in the literature are given below. Tarek et al. (Citation2023) proposed a convolutional neural network (CNN)-gated recurrent unit (GRU) hybrid model (CNN-GRU) for COVID-19 death prediction. They predicted COVID-19 deaths on the Indian dataset using the CNN-GRU model. When they compared the proposed CNN-GRU model with existing models, they observed that the proposed CNN-GRU model was more successful. ElAraby et al. (Citation2022) proposed the Gray-scale Spatial Exploitation Net (GSEN) model with the stochastic gradient descent optimisation technique to classify COVID-19 Chest X-ray (CXR) images. The GSEN model outperformed other models. Elzeki et al. (Citation2021) proposed the Chest X-Ray COVID Network (CXRVN) model with mini-batch gradient descent and the Adam optimizer to detect the COVID-19 virus from Chest X-ray (CXR) images. The proposed model was tested on three datasets and proved very successful in detecting the COVID-19 virus. Raheja et al. (Citation2023) proposed a diffusion prediction model for predicting the number of COVID-19 cases in India, France, China and Nepal. They compared the proposed model with other state-of-the-art models, and the proposed model outperformed them. Elzeki et al. (Citation2021) proposed a novel perceptual two-layer image fusion deep learning based model (NSCT + CNN_VGG19) to obtain CXR images for an imbalanced COVID-19 dataset. They compared the proposed model in detail with other models, and it achieved better performance. Chakraborty et al. (Citation2022) proposed a COVID-19 risk prediction approach for diabetic patients using a fuzzy inference system and machine learning models. They used the stratified k-fold cross-validation technique to evaluate the performance of the proposed model. Experimental results showed that the proposed model is more successful in COVID-19 risk prediction than other existing models. Hassan et al. (Citation2024) proposed a Deep Convolutional Neural Network (DCNN) model for the classification of COVID-19 from CT scan images. Experimental results showed that the proposed model with the Adam optimizer outperformed several state-of-the-art models in the COVID-19 classification task. Shrestha et al. (Citation2022) proposed a deep learning based convolutional neural network model, called DCNN, with the Adam optimizer to detect brain tumours. Experimental results showed that the proposed DCNN model achieved remarkable results in detecting brain tumours. Hassan et al. (Citation2023) proposed a deep-learning based automatic COVID-19 detection model for smart cities. The proposed model could help reduce further spread of COVID-19, especially in crowded places. Cai et al. (Citation2024) proposed an encoder-decoder based FluPMT model to predict the haemagglutinin (HA) protein sequence of the next season's dominant strain of influenza A viruses. They used attention mechanisms to investigate dependencies among residues of sequences and used time series to model the evolution of influenza A viruses. As a result, they showed that the performance of the FluPMT model on both the H1N1 dataset and the H3N2 dataset was better than that of other models. Li et al. 
(Citation2023) proposed a graph deep learning network-based model, GraphLncLoc, for predicting the subcellular localisation of long non-coding RNAs (lncRNAs). GraphLncLoc uses graph convolutional networks to learn latent features, and the high-level features obtained are then fed into a fully connected layer to carry out the final prediction. They showed that the GraphLncLoc model achieved better performance than other models. Yin et al. (Citation2022) proposed a 2D convolutional neural network (CNN) based model, named IAV-CNN, to predict influenza antigenic variants. The IAV-CNN model and other models were trained and tested on three influenza datasets (H1N1, H3N2 and H5N1). Experimental results show that IAV-CNN outperforms state-of-the-art models on all three influenza datasets. Abbas et al. (Citation2022) applied various sequence-to-sequence deep learning models to antigenic influenza HA sequence pairs and then attempted to generate the antigenic pair of an emerging influenza A virus. They observed that the proposed deep learning models achieved remarkable results on the influenza A virus. Salama et al. (Citation2016) estimated RNA virus mutations using the amino acid sequences of the proteins encoded by the RNA, neural networks (NN) and rough set techniques. In their study, they used a dataset consisting of Newcastle RNA virus sequences obtained from China and South Korea. When the results were examined, they observed that the rough set technique gave better results than neural networks, with an accuracy rate of over 75%. Mohamed et al. (Citation2021) predicted the next DNA sequence using a seq2seq LSTM deep learning model. In their study, they used the Newcastle disease virus dataset and the H1N1 influenza virus dataset. The Newcastle disease virus dataset consists of 83 samples with a sequence length of 1778, while the H1N1 influenza virus dataset consists of 4609 samples with a sequence length of 535. The success rate (accuracy) of the proposed seq2seq LSTM model was 96.9% on the Newcastle disease virus dataset and 98.9% on the H1N1 influenza virus dataset. Yin et al. (Citation2020) predicted whether mutations would occur in the next flu season using the haemagglutinin (HA) protein sequences of influenza A/H1N1, H3N2 and H5N1 viruses. They proposed the Tempel model, an efficient and robust time series mutation prediction model for mutation prediction of influenza A viruses. When the experimental results on the three influenza datasets (H1N1, H3N2, H5N1) are examined, the proposed Tempel model compares favourably with other approaches commonly used in the literature (baseline, LR, SVM, Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), LSTM). They showed that it can significantly improve predictive performance and provide new insights into viral mutation and evolutionary dynamics. The proposed Tempel model achieved a performance value of 0.991. Yin et al. (Citation2023) proposed the ViPal model, a general framework for virulence prediction. They used a posterior regularisation technique to transform prior viral knowledge into constraint features. The ViPal model could improve virulence prediction performance compared to other models. As far as we have found in the literature, there are very few studies on mutation prediction of the COVID-19 virus. We carried out this study to contribute to the literature.
Some studies in the literature on COVID-19 mutations and mutations of other, earlier coronaviruses can be summarised as follows. Saha et al. (Citation2020) used 566 COVID-19 genome sequences isolated in India for mutation analysis. Alignment of the sequences was performed with the multiple sequence alignment (MSA) method CLUSTALW (Anonymous, Citation2023a) using the reference sequence (NC_45512.2) from the National Center for Biotechnology Information (NCBI). Once the sequences were aligned, they created a consensus sequence to locate the mutation sites and analyse each COVID-19 genome. As a result of the study, they identified a total of 3384 mutation points, including 933 substitution/point mutations, 2449 deletion mutations and 2 insertion mutations, in the 566 genome sequences isolated in India. Wang et al. (Citation2020) used 31,421 COVID-19 genome sequences for mutation analysis. They calculated the mutation rate and mutation h-index of all COVID-19 genes. They showed that among the genes that make up the structure of the COVID-19 virus, the N gene was mutated the most, and they stated that the N gene is the most vulnerable gene in the COVID-19 genome. Haimed et al. (Citation2021) proposed a reverse engineering approach to reveal the patterns and evolutionary behaviour of the COVID-19 virus using AI and Big Data. They used the Long Short Term Memory (LSTM) method to predict the next evolved instance of the COVID-19 virus, working on the amino acid sequence of ORF7a, a small, 29 amino acid long protein of the COVID-19 virus. At the end of the study, they predicted the possible evolved sample of the ORF7a protein with a success rate of 40-50%, which they then increased to 72% by using consistent patterns. Nawaz et al. (Citation2021) obtained detailed information from COVID-19 genome sequences using AI techniques. They experimented with sequential pattern mining (SPM) to see whether there are hidden patterns that reveal frequent patterns of nucleotide bases and their relationships with each other. They also proposed an algorithm for mutation analysis in genome sequences, to find the places where nucleotide bases change and to calculate the mutation rate. The results showed that SPM reveals important information and patterns in COVID-19 genome sequences for studying the evolution of and variations in COVID-19 strains. Hossain et al. (Citation2021a) estimated the mutation rate of the 2000th variant, which may occur in the future, by applying the LSTM deep learning model to COVID-19 genome sequences. They used a total of 259,044 COVID-19 whole genome sequences from various regions and identified a total of 3,334,545 mutations in these samples. Zhou et al. (Citation2023) proposed the TEMPO model, a transformer-based mutation prediction framework, for COVID-19 mutation prediction. They designed a phylogenetic tree-based sampling method to generate viral sequences assembled with temporal information. In addition, they stated that the proposed TEMPO model could successfully predict 22 mutations that had not occurred before. The TEMPO model obtained an accuracy value of 0.655 on the COVID-19 dataset.
As a result, when the literature is examined in detail, the majority of studies concern other aspects of the COVID-19 virus; there are very few studies on mutation prediction of the COVID-19 virus using artificial intelligence-based models.

In this study, we focus on this gap in the literature. This study presents a robust transformer encoder based model with the Adam optimizer algorithm, TfrAdmCov, for COVID-19 mutation prediction. The proposed TfrAdmCov model has the ability to easily capture long-term dependencies. The attention mechanism in the transformer encoder layer focuses only on the most important features in the feature set to maximise the performance of the model during training. In this way, it reduces unnecessary computation and allows the model to achieve better generalisation performance. The transformer encoder model can easily capture long-term dependencies through its attention mechanism over the input sequence and, unlike other deep learning models, can perform large-scale parallel computation. As a result, the proposed TfrAdmCov model is quite successful in genetic sequence mutation prediction of the COVID-19 virus, compared to both other models and the state-of-the-art models in the literature.

3. COVID-19 (SARS-CoV-2)

The COVID-19 virus is a positive-sense betacoronavirus belonging to the Coronaviridae family, with an enveloped, single-stranded RNA genome (de Wit & Cook, Citation2020). Coronaviruses have four genera: alpha, beta, gamma and delta (Jaimes et al., Citation2020). Human coronaviruses belong to the alpha and beta genera (Shereen et al., Citation2020). Some human coronaviruses (HCoV-229E, HCoV-OC43, HCoV-NL63, HCoV-HKU1) cause simple seasonal upper respiratory tract infections, while others (SARS-CoV-1, MERS and COVID-19) can cause pneumonia and severe illness such as ARDS (Cui et al., Citation2019). Among all coronaviruses, the genome of the COVID-19 virus has the highest genome similarity, over 96%, with the bat coronavirus RaTG13. It also shows genomic similarity of over 79% to SARS-CoV-1 and over 50% to the MERS coronavirus (Sharma et al., Citation2021). The genome of the first COVID-19 virus to appear in China is approximately 29,903 bases long (Nawaz et al., Citation2021). The structure of the COVID-19 virus consists of structural proteins (S, Envelope (E), Membrane (M) and Nucleocapsid (N)), non-structural proteins (NSP1-NSP16) and accessory proteins (ORF3a, ORF3b, ORF6, ORF7a, ORF7b, ORF8, ORF9b, ORF9c and ORF10) (C. rong Wu et al., Citation2022). Figure 1 shows the structure of the COVID-19 virus.

Figure 1. Structure of the COVID-19 virus (Shereen et al., Citation2020).


3.1 COVID-19 S (Spike) protein

The S protein on the surface of the COVID-19 virus, which is used as the dataset in this study, is a transmembrane glycoprotein (Zhang et al., Citation2021). The S protein consists of 1273 amino acids in total (Anonymous, Citation2023b). As seen in Figure 2, the COVID-19 S protein consists of the Signal Peptide (SP), S1, S1/S2 and S2 subunits (Huang et al., Citation2020). The S1 subunit consists of the N-Terminal Domain (NTD), Receptor-Binding Domain (RBD), C-Terminal Domain 1 (CTD1) and C-Terminal Domain 2 (CTD2), while the S2 subunit consists of the Fusion Peptide (FP), Fusion-Peptide Proximal Region (FPPR), Heptad Repeat 1 (HR1), Central Helix (CH), Connector Domain (CD), Heptad Repeat 2 (HR2), Transmembrane Domain (TM) and Cytoplasmic Tail (CT) domains (Barnes et al., Citation2020). Via the RBD in the COVID-19 S protein structure, the virus binds to the Angiotensin Converting Enzyme 2 (ACE-2) protein on the host cell surface. Then, using the S2 subunit, fusion with the host cell takes place, and the COVID-19 virus enters the host cell. Upon binding to the ACE2 receptor, the S protein undergoes conformational changes, enabling cleavage of the S protein by furin proteases in the S1/S2 region to produce the S1 and S2 subunits. To facilitate the entry of the COVID-19 virus into the host cell, the transmembrane serine protease 2 (TMPRSS2) on the cell surface plays a role in priming the S protein by cleaving the S2′ site in the S2 subunit. The COVID-19 S RBD contains a receptor binding motif (RBM) and a core structure (Jackson et al., Citation2022). Figure 2 shows the detailed structure of the COVID-19 S protein.

Figure 2. Detailed structure of the COVID-19 S protein (Jackson et al., Citation2022).


3.2. Mutation

Mutation can be briefly expressed as a permanent change that occurs in the Deoxyribonucleic Acid (DNA) or RNA sequence of an organism's genome. RNA viruses mutate more than DNA viruses (Shaikh et al., Citation2021). In particular, a virus frequently mutates while copying its RNA genome in the host cell (Qin et al., Citation2021). The COVID-19 virus has mutated many times (Hossain et al., Citation2021b). Developed test kits cannot fully capture the dominant COVID-19 variants, and the effectiveness of existing vaccines is also significantly reduced against mutations. Understanding the genome sequence of the COVID-19 virus, its behaviour and origin, and how quickly it mutates is very important for the development of vaccines and drugs (Haimed et al., Citation2021). The COVID-19 virus has mutated in different regions over time, producing new variants. Although the majority of these new variants did not cause any adverse effects, some dominant variants, such as Delta and Omicron, changed the course of the epidemic due to their contagiousness and fatality (Shiehzadegan et al., Citation2021). Detailed analysis of the genome sequence of the COVID-19 virus and mutation analysis will contribute to the development of vaccines or drugs (Ahmed & Jeon, Citation2022). Some of the dominant variants of the COVID-19 virus so far are the B.1.1.7 (Alpha) variant detected in the UK, the B.1.351 (Beta) and B.1.1.529 (Omicron) variants detected in South Africa, the P.1 (Gamma) variant detected in Brazil, and the B.1.617.2 (Delta) variant detected in India (Qin et al., Citation2021; Lopez-Rincon et al., Citation2021; Gage et al., Citation2021; Sokhansanj & Rosen, Citation2022). Based on available data, the B.1.1.529 (Omicron) variant is the predominant variant worldwide (Madhi et al., Citation2022). COVID-19 test kits are also updated in accordance with the detected variants; in this way, incorrect negative results can be prevented in patients infected with mutated viruses. With the emergence of COVID-19 mutations, the effects of available vaccines have decreased significantly. Analysis of the genome sequence of the COVID-19 coronavirus and the use of advanced machine learning-based models can help physicians understand the genetic makeup of the COVID-19 virus. In addition, understanding the genome sequence of the COVID-19 virus will contribute to the development of future vaccines or drugs (Ahmed & Jeon, Citation2022).

4. The proposed TfrAdmCov model

The transformer encoder model is a fully attention mechanism-based architecture and is frequently used in natural language processing (NLP) tasks (Kalyan et al., Citation2021). The attention mechanism in the transformer encoder layer focuses only on the most important features in the feature set to maximise the performance of the model during training. In this way, it reduces unnecessary computation and allows the model to achieve better generalisation performance. The transformer encoder model can easily capture long-term dependencies through its attention mechanism over the input sequence and can perform large-scale parallel computation (Zhou et al., Citation2023). The standard transformer architecture is composed of transformer encoder and decoder layers. Because we do not perform machine translation, we only use the transformer encoder layer. Each transformer encoder layer has two sublayers: multi-head attention (MHA) and a feedforward network (FFN). Moreover, the transformer encoder layer has a residual connection around each of the two sublayers, followed by layer normalisation (Vaswani et al., Citation2017; Pacal, Citation2024a). Scaled dot-product attention is shown in Figure 3. The scaled dot-product attention mechanism uses the weight matrices $W_q$, $W_k$, $W_v$, whose entries are adjusted as model parameters during training. The $Q$, $K$ and $V$ matrices are obtained via matrix multiplication between the weight matrices $W$ and the embedded inputs $x$, where the index $i$ denotes the token position in the input sequence: $Q = x_i W_q$, $K = x_i W_k$, $V = x_i W_v$. An attention function maps a query ($Q = \{Q_1, \ldots, Q_N\}$) and a set of key-value pairs ($\{K, V\} = \{K_1, V_1, \ldots, K_M, V_M\}$) to an output, which is computed as a weighted sum of the values (Vaswani et al., Citation2017; Galassi et al., Citation2021). The attention function is given in Equation (1):

(1) $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$

where $d_k$ is the key dimensionality and $d_v$ is the value dimensionality. The factor $\frac{1}{\sqrt{d_k}}$ is used to scale the attention weights.
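To make this computation concrete, a minimal PyTorch sketch of scaled dot-product attention is given below. It is an illustration only: the function name and the toy dimensions (N = 5 tokens, d = 100, d_k = 50) are ours, not part of the released TfrAdmCov code.

    import math
    import torch

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, V are obtained from the embedded inputs x via the learned
        # weight matrices W_q, W_k, W_v (Equation (1)).
        d_k = K.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # scaled attention weights
        weights = torch.softmax(scores, dim=-1)            # normalise each row
        return weights @ V                                 # weighted sum of the values

    # Toy usage: N = 5 tokens of dimensionality d = 100, per-head size d_k = 50.
    x = torch.randn(5, 100)
    W_q, W_k, W_v = (torch.randn(100, 50) for _ in range(3))
    out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)  # shape (5, 50)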

Figure 3. Scaled dot-product attention. N is the number of tokens in the input sequence and d is the dimensionality of those tokens.


As seen in Figure 3, the matrices $Q$, $K$, $V$ are obtained from the input sequence $x$. The attention weights are then obtained by computing the product $QK^{T}$ and scaling it by $\frac{1}{\sqrt{d_k}}$. The scaled data is given as input to the softmax function, which normalises it, and the final output is obtained by multiplying the normalised data with the $V$ matrix. Here $N$ is the number of tokens in the input sequence and $d$ is the dimensionality of those tokens. MHA can be expressed as a mechanism that allows the model to jointly attend to information from different representation subspaces at different positions. In other words, MHA allows each token in the input sequence to be processed by one or more self-attention heads running simultaneously, or in parallel. This allows multiple operations to be performed at the same time, unlike RNN-based models, in which each output depends on the previous input. The MHA mechanism relies on one or more scaled dot-product attention (self-attention) heads operating on a key $K$, a value $V$ and a query $Q$ (Voita et al., Citation2020). Letting the heads of MHA be indexed by $i$, the multi-head attention function is calculated as in Equations (2) and (3):

(2) $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}$

(3) $\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$

where $W_i^{Q} \in \mathbb{R}^{d \times d_q}$, $W_i^{K} \in \mathbb{R}^{d \times d_k}$, $W_i^{V} \in \mathbb{R}^{d \times d_v}$ and $W^{O} \in \mathbb{R}^{h d_v \times d}$ (Vaswani et al., Citation2017). Here $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are projection matrices and $W^{O}$ is a final linear projection matrix (Voita et al., Citation2020). MHA is shown in Figure 4. In this work, we employ $h = 2$ scaled dot-product attention layers, or heads, running in parallel in the MHA. For each scaled dot-product attention layer, we use $d_k = d_v = d/h = 50$, with $d = 100$.
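As an illustration of these settings, the snippet below builds an MHA block with PyTorch's built-in nn.MultiheadAttention using h = 2 heads and d = 100, so that each head internally works with d_k = d_v = 50; the batch size and sequence length are arbitrary example values.

    import torch
    import torch.nn as nn

    # d = 100 with h = 2 heads gives d_k = d_v = 100 / 2 = 50 per head.
    mha = nn.MultiheadAttention(embed_dim=100, num_heads=2, batch_first=True)

    x = torch.randn(32, 5, 100)       # (batch, N tokens, d)
    out, attn_weights = mha(x, x, x)  # self-attention: Q, K and V all come from x
    print(out.shape)                  # torch.Size([32, 5, 100])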

Figure 4. MHA. h is the total number of scaled dot-product attention heads running in parallel.


As seen in Figure 4, to obtain the $Q$, $K$, $V$ matrices, the input sequence $x$ is passed through linear layers. The $Q$, $K$, $V$ matrices are then given as input to the scaled dot-product attention layers. The outputs obtained from the scaled dot-product attention layers are concatenated and passed through a linear layer to obtain the final output. The transformer encoder layer also contains one fully connected FFN layer, which consists of two linear transformations with a ReLU activation between them. The formulation of the FFN layer is given in Equation (4):

(4) $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$

where $x$, ($W_1$, $W_2$) and ($b_1$, $b_2$) represent the input embedding, the weights and the biases, respectively (Vaswani et al., Citation2017). In addition, in the transformer encoder, the residual or skip connection helps preserve the input sequence, allowing the transformer model to learn more complex functions. The residual connection also helps avoid the vanishing gradient problem in the transformer encoder and improves the transformer model's performance. In this study, because we did not perform a machine translation task, we only used the transformer encoder layer. Accordingly, the proposed TfrAdmCov model learns the features of the input sequence and performs COVID-19 mutation prediction based on these learned features. The workflow of the proposed TfrAdmCov model for mutation prediction of the COVID-19 virus is shown in Figure 5.

Figure 5. The workflow of the proposed TfrAdmCov model for mutation prediction of the COVID-19 virus. N is the number of tokens in the input sequence and d is the dimensionality of those tokens. The size of the input sequence is N × d.


As seen in Figure 5, the processed COVID-19 dataset is given as input to the MHA. The output of the MHA is given as input to layer normalisation, whose output is passed to the FFN layer. The output of the FFN layer is given to a second layer normalisation, and its output is passed to the linear transformation layer. The output of the linear transformation layer is finally passed through the softmax layer, and the mutation prediction of the COVID-19 virus is carried out.
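A minimal sketch of this workflow is given below, assuming PyTorch. It uses the built-in nn.TransformerEncoderLayer (MHA, FFN, residual connections and layer normalisation) followed by a linear layer and softmax; the class name, the mean pooling over tokens and any layer sizes other than d = 100 and two heads are our own illustrative choices, not the released TfrAdmCov code.

    import torch
    import torch.nn as nn

    class MutationClassifier(nn.Module):
        # Sketch of the Figure 5 pipeline: MHA + add & norm, FFN + add & norm,
        # then a linear transformation and a softmax over the two classes.
        def __init__(self, d_model=100, nhead=2, dim_ff=128, num_classes=2):
            super().__init__()
            self.encoder = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead, dim_feedforward=dim_ff,
                dropout=0.5, batch_first=True)
            self.classifier = nn.Linear(d_model, num_classes)

        def forward(self, x):          # x: (batch, N tokens, d_model)
            h = self.encoder(x)        # transformer encoder layer
            h = h.mean(dim=1)          # pool over the token dimension
            return torch.softmax(self.classifier(h), dim=-1)

    model = MutationClassifier()
    probs = model(torch.randn(32, 5, 100))  # (32, 2): P(no mutation), P(mutation)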

4.1. Softmax layer

The softmax function is a type of activation function often used in deep learning tasks. It maps real values to probability values between 0 and 1, and it has also been used in attention mechanisms recently. The softmax formula is given in Equation (5):

(5) $\mathrm{softmax}(x_a) = \dfrac{\exp(x_a)}{\sum_{b=1}^{G} \exp(x_b)}$

where $x_a$ denotes the $a$-th value of the input sequence $x$, $x_b$ denotes the other values in $x$, and $G$ is the dimension of the sequence $x$.
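As a small worked example, applying the softmax of Equation (5) to the vector (2.0, 1.0, 0.1) yields probabilities that sum to one:

    import torch

    x = torch.tensor([2.0, 1.0, 0.1])
    print(torch.softmax(x, dim=0))  # tensor([0.6590, 0.2424, 0.0986]), sums to 1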

4.2. Loss function

Binary classification tasks use a loss function called binary cross-entropy, which compares each predicted probability with the true class output and updates the model parameters depending on the distance from the expected value. The task addressed in this study is a two-class problem (mutation, no mutation). Therefore, we use binary cross-entropy to calculate the loss between the true labels $y_t$ and the predictions $\dot{y}_t$. The loss function $LF$ is calculated as in Equation (6):

(6) $LF = -\dfrac{1}{N} \sum_{i=1}^{N} \dfrac{1}{D^{(i)} - 1} \sum_{t=1}^{D^{(i)} - 1} \sum_{z \in F} \left\{ y_t(z)^{T} \log(\dot{y}_t(z)) + (1 - y_t(z)^{T}) \log(1 - \dot{y}_t(z)) \right\}$

where $N$ is the number of input samples, $F$ is the set of selected residue sites and $D^{(i)}$ is the number of selected positions of the $i$-th training sample for COVID-19 mutation prediction.
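For illustration, the snippet below evaluates binary cross-entropy with PyTorch's nn.BCELoss on a few hypothetical residue-site predictions; the probability values are invented for the example.

    import torch
    import torch.nn as nn

    bce = nn.BCELoss()  # binary cross-entropy between probabilities and labels
    y_true = torch.tensor([1.0, 0.0, 0.0, 1.0])      # 1 = mutation, 0 = no mutation
    y_pred = torch.tensor([0.92, 0.03, 0.10, 0.85])  # predicted probabilities
    loss = bce(y_pred, y_true)
    print(loss.item())  # mean negative log-likelihood over the four sites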

5. Experimentations

The COVID-19 S protein dataset, the COVID-19 pre-processing steps, holdout method and stratified 10-fold cross validation method, GridSearchCV hyperparameter tuning technique and baseline models have been explained below.

5.1. COVID-19 S protein dataset

The COVID-19 S protein dataset consists of S protein sequences, each of which contains 1273 amino acids in total (Anonymous, Citation2023b; Zhang et al., Citation2021). In this study, a total of 15,000 COVID-19 S protein sequences covering the years 2020 to 2022 have been downloaded from the reference address (Anonymous, Citation2023b). After downloading all S protein sequences, all sequences have been aligned by year using the CLUSTALW (Anonymous, Citation2023a) multiple sequence alignment (MSA) method.

5.2. Preparation and pre-processing steps of the COVID-19 S protein dataset

While most residues in the strains from 2020 to 2022 are among the amino acids directly encoded by the 20 universal genetic codes, some strains contain a few ambiguous amino acid letters. To eliminate this ambiguity, the ambiguous letter “B” has been randomly replaced with one of the letters “D” or “N”, and the ambiguous letter “Z” has been randomly replaced with one of the letters “E” or “Q”. Finally, the ambiguous letter “X” has been randomly replaced with one of the 20 universal amino acids. In this way, all ambiguities have been removed (Yin et al., Citation2020). In this study, the dataset generation method presented by Yin et al. (Citation2020) has been used. That method is based on the KMeans clustering algorithm. We initially used the KMeans clustering algorithm, but the machine learning based models could not reach the desired performance level with it. Therefore, the agglomerative clustering algorithm (Sasirekha & Baby, Citation2013), which proved to be an important factor in increasing the success rate, has been preferred instead of the KMeans clustering algorithm while creating the dataset.
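A minimal sketch of this ambiguity-removal step is given below; the function name is ours and the input sequence is a toy example.

    import random

    AMBIGUOUS = {"B": "DN", "Z": "EQ"}  # B -> D or N, Z -> E or Q
    UNIVERSAL = "ACDEFGHIKLMNPQRSTVWY"  # the 20 universal amino acids

    def resolve_ambiguity(sequence):
        # Randomly replace each ambiguous letter as described above.
        out = []
        for aa in sequence:
            if aa in AMBIGUOUS:
                out.append(random.choice(AMBIGUOUS[aa]))
            elif aa == "X":
                out.append(random.choice(UNIVERSAL))
            else:
                out.append(aa)
        return "".join(out)

    print(resolve_ambiguity("MFVXLB"))  # e.g. "MFVKLD"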

5.3. Agglomerative clustering

Agglomerative clustering is a variant of the hierarchical clustering method (Sasirekha & Baby, Citation2013) and is referred to as a bottom-up (part-to-whole) approach. The agglomerative clustering algorithm starts by treating each object/instance as a single cluster. Pairs of clusters that are close to each other, according to a distance measure, are then successively merged until all the clusters are combined into one large cluster containing all the objects (Sasirekha & Baby, Citation2013). At the stage of creating the training dataset, the COVID-19 strains have been divided according to years, and the agglomerative clustering algorithm has been used to divide the strains in each year into clusters. In addition, the parameters of the agglomerative clustering algorithm and their values are shown in Table .
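For illustration, the snippet below clusters the strains of one year into two clusters with scikit-learn's AgglomerativeClustering, as in Figure 6; the input vectors are random stand-ins for the embedded strains.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    strain_vectors = np.random.rand(100, 100)  # stand-in embedded strains

    clustering = AgglomerativeClustering(n_clusters=2)  # two clusters per year
    labels = clustering.fit_predict(strain_vectors)     # cluster label 0 or 1 per strain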

In this study, Training, Testing and Kfold datasets have been used. The amounts of the Training, Testing and Kfold datasets by year are shown in Table 1. As seen in Table 1, for the Training dataset, 30 strains have been randomly selected for each year among 11,250 COVID-19 S protein strains. For the Testing dataset, 10 strains have been randomly selected for each year among 3750 COVID-19 S protein strains. For the Kfold dataset, 40 strains have been randomly selected for each year among 15,000 COVID-19 S protein strains. The data quantities for each dataset have been chosen in this way because we use the GridSearchCV hyperparameter tuning method, and obtaining the best parameter values with GridSearchCV for each machine learning based model takes a long time. For the datasets used in this study, two clusters have been created for each year using the agglomerative clustering algorithm, as also shown in Figure 6. For example, for a strain selected from the B1 cluster of the 2020 year, a random strain from the A1 or A2 cluster of 2021 with the lowest Hamming distance (Norouzi et al., Citation2012) to this strain has been selected. Similarly, for a strain selected from the B1 cluster of the 2020 year, a random strain from the C1 or C2 cluster of 2022 with the lowest Hamming distance to this strain has been selected. This process continues until all strains have been included in the datasets. Ultimately, the datasets have been obtained by combining data from different years of COVID-19 strains one by one (Yin et al., Citation2020).

Figure 6. Example of creation of COVID-19 S protein datasets (Yin et al., Citation2020).


Table 1. Number of strains of COVID-19 S protein datasets by years.

In this study, training examples for the Training, Testing and Kfold datasets have been created using the strains of the 2020 and 2021 years. The label/target data samples of the Training, Testing and Kfold datasets have been generated using the years 2021 (penultimate year) and 2022 (year to be predicted). Our aim is to predict the mutations of the COVID-19 virus in 2022 using the 2020 and 2021 strains. The stages of creating training examples for the Training, Testing and Kfold datasets are shown in Figure 7. As seen in Figure 7, while creating the training datasets, a window of 5 sites/residues has been used to represent each amino acid. For example, the sequence “VAIHA” has been used to represent the “I” amino acid; it has then been split into 3 overlapping 3-gram [VAI, AIH, IHA] small sequences. Similarly, the sequence “AIHAD” has been used to represent the “H” amino acid and has been split into 3 overlapping 3-gram [AIH, IHA, HAD] small sequences. This process has been continued in this way for all sites in the whole COVID-19 S protein structure. After dividing all strains used in the study into 3 overlapping 3-gram sites, each 3-gram has been represented by a 100-dimensional embedding vector based on the ProtVec presented by Asgari & Mofrad (Citation2015). Then, a single 100-dimensional vector has been obtained by taking the sum of the three 100-dimensional vectors. This process has been continued until the last strain. To create the label data samples of the training and testing datasets for each year, 2021 (penultimate year) and 2022 (last year, to be predicted) have been used. The creation phase of the label samples for the training and testing datasets is shown in Figure 8. As seen in Figure 8, to check whether the amino acid represented by the “VAIHA” window taken from 2021 (penultimate year) has mutated, the centre position of the window (3rd position, amino acid I) is checked. If the amino acid at this central position has changed in 2022 (last year), the mutation label is “1”; if it has not changed, the mutation label is “0”. Similarly, to check whether the amino acid represented by the “AIHAD” window taken from 2021 has mutated, the centre position of the window (3rd position, amino acid H) is checked. If the amino acid at this central position has changed in 2022, the mutation label is “1”; otherwise it is “0”. This process has been continued until the last strain (Yin et al., Citation2020).
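A minimal sketch of the window splitting, embedding and labelling steps described above is given below; protvec stands for a hypothetical lookup table from 3-grams to their 100-dimensional ProtVec vectors.

    import numpy as np

    def overlapping_3grams(window):
        # "VAIHA" -> ["VAI", "AIH", "IHA"]: 3 overlapping 3-grams per 5-residue window
        return [window[i:i + 3] for i in range(len(window) - 2)]

    def embed_window(window, protvec):
        # Sum the three 100-dimensional 3-gram vectors into one 100-dimensional vector.
        return np.sum([protvec[g] for g in overlapping_3grams(window)], axis=0)

    def mutation_label(centre_2021, centre_2022):
        # 1 = mutation (the centre residue changed in the last year), 0 = no mutation
        return int(centre_2021 != centre_2022)

    print(overlapping_3grams("VAIHA"))  # ['VAI', 'AIH', 'IHA']
    print(mutation_label("I", "T"))     # 1: the centre residue changed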

Figure 7. Creation phases of training and testing samples for training and testing datasets.


Figure 8. Creation phase of label samples for training and testing datasets.


The processed COVID-19 virus dataset and its details are shown in Figure 9. As seen in Figure 9, the processed COVID-19 dataset consists of label and input data. If the label value is 1, it means there is a “mutation”; if it is 0, it means there is “no mutation”. Each input data point consists of 5 training samples (3 overlapping 3-grams). Every 3-gram is represented by a 100-dimensional embedding vector based on the ProtVec presented by Asgari & Mofrad (Citation2015). In the training phase of the model, the three 100-dimensional 3-gram vectors are summed and used as a single 100-dimensional vector.

Figure 9. Processed COVID-19 Dataset.


In this study, the total amounts of data in the Training, Testing and Kfold datasets, which are given as input to the deep learning and machine learning based models, are shown in Tables 2-4, respectively. In Table 2, the total amounts of the Training dataset by year are shown. In Table 3, the total amounts of the Testing dataset by year are shown. In Table 4, the total amounts of the Kfold dataset by year are shown.

Table 2. Total amount of Training dataset by years.

Table 3. Total amount of Testing dataset by years.

Table 4. Total amounts of Kfold dataset by years.

5.4. Influenza A/H3N2 HA dataset

In this study, the previously emerged influenza A/H3N2 HA protein dataset has been used to measure the performance of the proposed TfrAdmCov model. The influenza A/H3N2 dataset presented by Yin et al. (Citation2020) consists of HA protein sequences collected between 1991 and 2016. This dataset consists of a total of 132,000 sequence samples (3 overlapping 3-grams) (Yin et al., Citation2020). The class quantities and approximate percentages for the Training and Testing datasets of the influenza A/H3N2 HA protein dataset presented by Yin et al. (Citation2020) are shown in Table 5.

Table 5. Class quantities and approximate percentages for Training and Testing datasets of the Influenza A/H3N2 HA protein dataset.

5.5. Holdout method and stratified 10-fold cross validation method

In this study, the holdout and stratified 10-fold cross validation techniques have been used to evaluate the performances of the machine learning based models. In the holdout technique (Kohavi, Citation1995), the Training and Testing datasets have been used. The deep learning and machine learning based models have first been trained on the Training dataset. They have then been tested on the Testing dataset, which they had never seen before, and performance measurements have been obtained for each algorithm. In the stratified 10-fold cross validation technique (Kohavi, Citation1995), the Kfold dataset has been used. The Kfold dataset has been divided into 10 parts, and the performances of the machine learning based models have been measured using 90% of the data for training and 10% for testing in each fold. As seen in Table 8, there is a class imbalance in the COVID-19 S protein datasets. When the holdout technique is used for model evaluation, it is not guaranteed that samples from all classes appear in their original proportions, which is a significant problem. To overcome this problem, the stratified 10-fold cross validation technique, which uses the data while preserving the sample percentages of each class of the COVID-19 S protein datasets, has also been preferred (Thölke et al., Citation2022; Mbow et al., Citation2021). Table 6 shows the total amounts of the Training and Testing datasets by year for the holdout technique. Table 7 shows the amounts of the COVID-19 S protein Kfold dataset by year for the stratified 10-fold cross validation technique. Table 8 shows the class amounts and approximate percentages for the Training, Testing and Kfold datasets.
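A minimal sketch of the stratified 10-fold procedure with scikit-learn is given below; X and y are random stand-ins for the embedded samples and their mutation labels.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.random((400, 100))   # stand-in embedded samples
    y = rng.integers(0, 2, 400)  # stand-in mutation labels

    # Each fold preserves the class percentages of the full dataset.
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = SVC().fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    print(np.mean(scores))       # mean accuracy over the 10 folds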

Table 6. Total amount of Training and Testing datasets by years for holdout technique.

Table 7. Total amounts of Kfold dataset by years for stratified 10-fold cross validation technique.

Table 8. Class quantities and approximate percentages for Training, Testing, and Kfold datasets.

5.6. GridSearchCV hyperparameter tuning

In this study, the GridSearchCV (Ahmad et al., Citation2022) hyperparameter tuning method with 5-fold cross-validation has been used to select the best parameter values of each machine learning based model. The default settings of the GridSearchCV algorithm have been used. Among the hyperparameters of each model, three have been randomly chosen, and a total of five candidate values, including the default value, have been selected for each chosen hyperparameter. Then, the best parameter values have been obtained for each hyperparameter using the GridSearchCV algorithm. The hyperparameters for the machine learning models are shown in Table 9.

Table 9. Hyperparameters for the machine learning models.

As seen in Table 9, for each machine learning model, three hyperparameters have been randomly selected among all hyperparameters, and candidate values have been set for each of them. Then, the best values among these hyperparameter values have been selected using the GridSearchCV algorithm, as sketched below. The hyperparameters selected for each machine learning based model and their values are shown in Table for SVM, Table for KNN, Table for XGBoost and Table for LR.
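The snippet below sketches this selection procedure for the SVM model with scikit-learn's GridSearchCV and 5-fold cross-validation; the grids shown are illustrative examples, not the exact value sets used in the tables.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X, y = rng.random((200, 100)), rng.integers(0, 2, 200)  # stand-in data

    # Three hyperparameters with candidate value lists (defaults included).
    param_grid = {
        "C": [0.01, 0.1, 1.0, 10.0, 100.0],
        "gamma": ["scale", "auto", 0.001, 0.01, 0.1],
        "kernel": ["linear", "rbf", "poly", "sigmoid"],
    }
    search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold cross-validation
    search.fit(X, y)
    print(search.best_params_)                      # best value per hyperparameter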

5.7. Baseline models

In this study, the support vector machine (SVM), k-nearest neighbours (KNN), eXtreme gradient boosting (XGBoost), logistic regression (LR), RNN, long short-term memory (LSTM) and gated recurrent unit (GRU) models are explained below.

5.7.1. SVM model

The SVM model is a supervised learning approach that is frequently used for solving classification and regression problems and has a high generalisation ability. The purpose of the SVM model is to find a linear optimal hyperplane that maximises the margin of separation between the two classes. One of the most important advantages of the SVM model is that it achieves successful results at a high rate; its disadvantage is that it can be slow to produce results (Cortes & Vapnik, Citation1995).

5.7.2. KNN model

The KNN model is a learning approach frequently used for solving classification and regression problems. The KNN model can be expressed as an effective machine learning based model that performs classification by examining the K closest samples among previously seen data (Guo et al., Citation2003). The K value is typically chosen to be a small, single-digit number.

5.7.3. XGBoost model

The XGBoost model is a machine learning based model that can be described as an optimised and performance-enhanced version of the Gradient Boosting model. Among the important advantages of this model are fast results, prevention of overfitting (memorisation) and high performance (Memon et al., Citation2019).

5.7.4. LR model

The LR model is a standard probabilistic statistical classification model that is widely used in different fields (statistics, data mining, finance, etc.). The output of this model for any sample is expressed as a probability, and it is especially commonly used in binary classification problems (Feng et al., Citation2014). Figure 10 shows the workflow of the machine learning models for mutation prediction of the COVID-19 virus. As seen in Figure 10, the COVID-19 S protein strains have first been downloaded from the web address in (Anonymous, Citation2023b). These downloaded strains have then been aligned with the CLUSTALW (Anonymous, Citation2023a) MSA method. For each strain taken from the dataset, 1273 small 5-residue windows, representing the 1273 sites (amino acids) forming the COVID-19 S protein sequence, have been extracted. These 5-residue windows have been split into 3 overlapping 3-gram sequences, and each of the 3 overlapping 3-gram small sequences has been represented by a 100-dimensional embedded sequence based on ProtVec. Then, the 3 overlapping 3-gram small sequences have been represented by a single 100-dimensional vector by taking the sum of the sequences. The resulting 100-dimensional vectors have been standardised by applying the StandardScaler (Thara et al., Citation2019) method, as sketched below. Machine learning based models with or without GridSearchCV have then been applied to these standardised data, and the Accuracy, Precision, Recall, F1-Score and MCC performance measurement values have been obtained using the holdout and stratified 10-fold cross validation techniques.
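A minimal sketch of the standardisation-plus-classifier part of this workflow is given below, with the SVM as an example baseline; the arrays are random stand-ins for the 100-dimensional embedded vectors.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train, y_train = rng.random((300, 100)), rng.integers(0, 2, 300)
    X_test = rng.random((50, 100))

    # StandardScaler standardises each 100-dimensional vector before it is
    # passed to the baseline model.
    pipeline = make_pipeline(StandardScaler(), SVC())
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)  # 1 = mutation, 0 = no mutation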

Figure 10. The workflows of the machine learning models for mutation prediction of COVID-19 virus.


5.7.5. RNN model

Unlike feed-forward networks, RNN models contain feedback connections that pass information from one step back into the network, which allows them to model sequences. Although RNNs can model short-term dependencies, they cannot model long-term dependencies due to the vanishing/exploding gradient problem (Zaremba et al., Citation2014).

5.7.6. LSTM model

The LSTM model can be defined as a variant of the RNN that captures long-term dependencies. Its difference from the RNN is that it has its own memory cell, which makes the LSTM more powerful than plain RNNs. Additionally, the introduction of the LSTM largely eliminated the vanishing/exploding gradient problem of RNNs (Hochreiter & Schmidhuber, Citation1997).

5.7.7. GRU model

The GRU model is a simplified version of the LSTM model, developed especially to learn long-term dependencies efficiently. In addition, it has fewer gates than the LSTM (Chung et al., Citation2014).

6. Results and discussions

6.1. Implementation details

In this study, in order to maximise the performance of the deep learning models (RNN, LSTM, GRU, Transformer), the hyper-parameter values (hidden size, dropout, batch size, etc.) have been tested many times (trial and error) and the best hyper-parameter values have been selected. For all deep learning models (that is, excluding the traditional machine learning models), Adam with a batch size of 32 (256 for the H3N2 HA dataset) has been used for model optimisation. The learning rate has been set to 0.001 and the hidden size to 128 in the encoder of the proposed TfrAdmCov model and all deep learning models. Cross entropy has been used as the objective function (to minimise the losses). A dropout value of 0.5 and an epoch value of 500 (350 for the H3N2 HA dataset) have been used for training the proposed TfrAdmCov model and all deep learning models. In addition, the number of heads used in the MHA of the transformer encoder layer (the default value in the open source library is 8 (Vaswani et al., Citation2017)) has been tested many times, and the best hyper-parameter value has been chosen as 2. All the experimental results for the deep learning models have been averaged over 10 random trials with different random seeds. The hyperparameters of the proposed TfrAdmCov model and the other models are shown in Table 10.
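The snippet below sketches a training loop with these settings (Adam, learning rate 0.001, batch size 32, cross-entropy, 500 epochs), assuming PyTorch; the data and the placeholder classifier are stand-ins rather than the actual TfrAdmCov configuration.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in data: 5-token windows of 100-dimensional embeddings, binary labels.
    dataset = TensorDataset(torch.randn(320, 5, 100), torch.randint(0, 2, (320,)))
    loader = DataLoader(dataset, batch_size=32, shuffle=True)   # batch size 32

    model = nn.Sequential(nn.Flatten(), nn.Linear(500, 2))      # placeholder classifier
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, lr = 0.001
    criterion = nn.CrossEntropyLoss()                           # cross-entropy objective

    for epoch in range(500):                                    # 350 for H3N2 HA
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()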

Table 10. The hyperparameters of the proposed TfrAdmCov model and other models.

6.2. Performance evaluation

In this study, the Training, Testing and Kfold datasets have been trained and tested on a PC with a 2-core Intel(R) Core(TM) i5 7200U CPU @ 2.5 GHz, 12 GB RAM and an Intel(R) HD Graphics 620 GPU. All training, testing and simulations have been carried out with Scikit-learn (Pedregosa et al., Citation2011) and PyTorch (Paszke et al., Citation2017). In the study, the performance measurements of the deep learning and machine learning based models (Accuracy, Precision, Recall, F1-Score, MCC) have been obtained using the Confusion Matrix in Table 11.

Table 11. Confusion matrix (Luque et al., Citation2019).

True Positive (TP) refers to samples that are actually positive (mutation) and are classified as positive (mutation) when predicted. False Negative (FN) refers to samples that are actually positive (mutation) and classified as negative (no mutation) when predicted. False Positive (FP) refers to samples that are actually negative (no mutation) and are classified as positive (mutation) when predicted. True Negative (TN) refers to samples that are actually negative (no mutation) and are classified as negative (no mutation) when predicted (Chicco & Jurman, Citation2020). The accuracy, precision, recall or sensitivity, f1-score and Matthews correlation coefficient (MCC) performance evaluation metrics used in this study are given in Equations (7)-(11), respectively (Pacal, Citation2024b):

(7) $\mathrm{Accuracy} = \dfrac{TP + TN}{TP + FN + FP + TN}$

(8) $\mathrm{Precision} = \dfrac{TP}{TP + FP}$

(9) $\mathrm{Recall\ (Sensitivity)} = \dfrac{TP}{TP + FN}$

(10) $\mathrm{F1\text{-}Score} = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

(11) $\mathrm{MCC} = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
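For reference, all five metrics can be computed directly from predictions with scikit-learn, which implements Equations (7)-(11); the label vectors below are toy examples.

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, matthews_corrcoef)

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = mutation, 0 = no mutation
    y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

    print(accuracy_score(y_true, y_pred))     # (TP + TN) / (TP + FN + FP + TN)
    print(precision_score(y_true, y_pred))    # TP / (TP + FP)
    print(recall_score(y_true, y_pred))       # TP / (TP + FN)
    print(f1_score(y_true, y_pred))           # 2PR / (P + R)
    print(matthews_corrcoef(y_true, y_pred))  # Equation (11)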

6.3. Experimental results

In Table 12, the performance values of the SVM model with or without GridSearchCV have been shown.

Table 12. Performance values of the SVM model with or without GridSearchCV.

As seen in Table 12, the SVM model with or without GridSearchCV has been trained on the Training dataset and tested on the Testing dataset. Compared to the SVM model without the GridSearchCV method, the SVM model with GridSearchCV increased accuracy (from 99.90% to 99.91%), recall (from 96.22% to 96.51%), F1-Score (from 98.07% to 98.22%) and MCC (from 98.04% to 98.19%) on the Testing dataset. As a result, it has been observed that the SVM model with the GridSearchCV method significantly improves performance on the Training and Testing datasets. The performance values of the KNN model with or without GridSearchCV have been shown in Table 13.

Table 13. Performance values of the KNN model with or without GridSearchCV.

As seen in Table 13, the KNN model with or without GridSearchCV has been trained on the Training dataset and tested on the Testing dataset. Compared to the KNN model without the GridSearchCV method, the KNN model with GridSearchCV increased accuracy (from 99.90% to 99.91%), recall (from 96.80% to 97.09%), F1-Score (from 98.09% to 98.24%) and MCC (from 98.04% to 98.19%) on the Testing dataset. As a result, it has been observed that the KNN model with the GridSearchCV method significantly improves performance on the Training and Testing datasets. The performance values of the XGBoost model with or without GridSearchCV have been shown in Table 14.

Table 14. Performance values of the XGBoost model with or without GridSearchCV.

As seen in Table 14, the XGBoost model with or without GridSearchCV has been trained on the Training dataset and tested on the Testing dataset. Compared to the XGBoost model without the GridSearchCV method, the XGBoost model with GridSearchCV increased accuracy (from 99.90% to 99.91%), precision (from 99.70% to 100.00%), F1-Score (from 98.08% to 98.22%) and MCC (from 98.08% to 98.19%) on the Testing dataset. As a result, it has been observed that the XGBoost model with the GridSearchCV method significantly improves performance on the Testing dataset. The performance values of the LR model with or without GridSearchCV have been shown in Table 15.

Table 15. Performance values of the LR model with or without GridSearchCV.

As seen in Table 15, the LR model with or without GridSearchCV has been trained on the Training dataset and tested on the Testing dataset. Compared to the LR model without the GridSearchCV method, the LR model with GridSearchCV increased accuracy (from 99.57% to 99.79%), recall (from 85.17% to 93.90%), F1-Score (from 91.42% to 95.99%) and MCC (from 91.64% to 95.90%) on the Testing dataset. As a result, it has been observed that the LR model with the GridSearchCV method significantly improves performance on the Training and Testing datasets. The comparison of the performance values of the machine learning based models with or without GridSearchCV has been shown in Table 16.

Table 16. Comparison of performance values of Machine learning-based models with or without GridSearchCV.

As seen in Table 16, a comparison of the Accuracy, Precision, Recall, F1-Score and MCC values of the machine learning based models (SVM, KNN, XGBoost, LR) with or without GridSearchCV on the Kfold dataset has been carried out. The SVM model with GridSearchCV achieved the highest accuracy value of 99.893% on the Kfold dataset. The SVM model, with or without GridSearchCV, achieved the highest precision value of 99.92% on the Kfold dataset. The SVM model with GridSearchCV achieved the highest f1-score value of 97.97% on the Kfold dataset, while the KNN model with GridSearchCV reached an f1-score of 96.22%. The SVM model with GridSearchCV also achieved the highest MCC value of 97.95% on the Kfold dataset. The performance comparison of the deep learning models (RNN, LSTM, GRU and the proposed TfrAdmCov) on the Testing dataset has been shown in Table 17.

Table 17. Performance comparison of the deep learning models on the Testing dataset.

As seen in Table 17, the proposed TfrAdmCov model has achieved better results than the other models in terms of accuracy of 99.93%, recall of 97.38%, F1-score of 98.67% and MCC of 98.65% on the COVID-19 testing dataset. In addition, the proposed TfrAdmCov model, RNN, LSTM and GRU have achieved the same precision of 100.00% on the COVID-19 testing dataset. The performance comparison of the proposed TfrAdmCov model with the Adam, RMSprop and AdamW optimizers on the Testing dataset has been shown in Table 18.
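
To make this comparison concrete, the following is an illustrative PyTorch sketch of a transformer-encoder sequence classifier of the kind evaluated here; the vocabulary size, sequence length and layer sizes are hypothetical placeholders, not the actual TfrAdmCov hyperparameters.

    # Illustrative transformer-encoder classifier sketch, assuming PyTorch.
    import torch
    import torch.nn as nn

    class EncoderClassifier(nn.Module):
        def __init__(self, vocab_size=25, d_model=64, nhead=4,
                     num_layers=2, max_len=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)   # token embedding
            self.pos = nn.Embedding(max_len, d_model)        # learned positions
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            self.head = nn.Linear(d_model, 2)                # mutation / no-mutation

        def forward(self, tokens):
            positions = torch.arange(tokens.size(1), device=tokens.device)
            h = self.embed(tokens) + self.pos(positions)
            h = self.encoder(h).mean(dim=1)                  # pool over sequence
            return self.head(h)

    model = EncoderClassifier()
    logits = model(torch.randint(0, 25, (8, 128)))           # dummy batch
    print(logits.shape)  # torch.Size([8, 2])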

Table 18. Performance comparison of the proposed TfrAdmCov model with the Adam, RMSprop and AdamW optimizers on the Testing dataset.

Table 18 aims to select the optimizer algorithm that achieves the best performance for the proposed TfrAdmCov model among three optimizer algorithms (Adam, RMSprop, AdamW). As seen in Table 18, the proposed TfrAdmCov model with the Adam optimizer algorithm performed better than with the RMSprop and AdamW optimizer algorithms. For this reason, the Adam optimizer algorithm was preferred for the proposed TfrAdmCov model. The confusion matrix obtained using the TfrAdmCov model with the Adam optimizer algorithm on the COVID-19 Testing dataset has been shown in Figure 11, while the confusion matrices obtained with the RMSprop and AdamW optimizer algorithms have been shown in Figures 12 and 13, respectively.
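
A minimal sketch of this optimizer comparison, assuming PyTorch, is given below; `build_model` and `train_and_test` are hypothetical helpers standing in for the actual training pipeline, and the learning rate is illustrative.

    # Sketch: train identical model copies with each candidate optimizer.
    import torch

    def compare_optimizers(build_model, train_and_test, lr=1e-3):
        results = {}
        for name, opt_cls in [("Adam", torch.optim.Adam),
                              ("RMSprop", torch.optim.RMSprop),
                              ("AdamW", torch.optim.AdamW)]:
            torch.manual_seed(0)                  # identical initialisation
            model = build_model()
            optimizer = opt_cls(model.parameters(), lr=lr)
            results[name] = train_and_test(model, optimizer)  # e.g. test accuracy
        return results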

Figure 11. Confusion Matrix obtained using TfrAdmCov model with Adam optimizer algorithm on COVID-19 Testing dataset.

Figure 12. Confusion Matrix obtained using TfrAdmCov model with RMSprop optimizer algorithm on COVID-19 Testing dataset.

As seen in Figure 11, on the COVID-19 testing dataset, the proposed TfrAdmCov model with the Adam optimizer algorithm correctly predicted 335 of the 344 samples in the mutation class and incorrectly predicted the remaining 9. In addition, it correctly predicted all 12,386 samples in the no-mutation class.
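
These counts fully determine the reported metrics; the short check below recomputes them from the standard confusion-matrix definitions (TP = 335, FN = 9, TN = 12,386, FP = 0).

    # Recompute the Table 17 metrics from the Figure 11 counts.
    import math

    tp, fn, tn, fp = 335, 9, 12386, 0

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    # Prints: accuracy=99.93% precision=100.00% recall=97.38% f1=98.67% mcc=98.65%
    print(f"accuracy={accuracy:.2%} precision={precision:.2%} "
          f"recall={recall:.2%} f1={f1:.2%} mcc={mcc:.2%}")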

As seen in Figure 12, on the COVID-19 testing dataset, the proposed TfrAdmCov model with the RMSprop optimizer algorithm correctly predicted 334 of the 344 samples in the mutation class and incorrectly predicted the remaining 10. In addition, it correctly predicted all 12,386 samples in the no-mutation class.

As seen in Figure 13, on the COVID-19 testing dataset, the proposed TfrAdmCov model with the AdamW optimizer algorithm correctly predicted 334 of the 344 samples in the mutation class and incorrectly predicted the remaining 10. In addition, it correctly predicted all 12,386 samples in the no-mutation class. The performance comparison of the deep learning models over 10 random trials with different random seeds on the Testing dataset has been shown in Table 19.

Figure 13. Confusion Matrix obtained using TfrAdmCov model with AdamW optimizer algorithm on COVID-19 Testing dataset.

Table 19. Performance comparison of the RNN, LSTM, GRU and proposed TfrAdmCov models over 10 random trials with different random seeds on the Testing dataset.

As seen in Table 19, the proposed TfrAdmCov model has achieved better results than the other models in terms of accuracy of 99.924%, recall of 97.18%, F1-score of 98.57% and MCC of 98.54% on the COVID-19 testing dataset. In addition, the proposed TfrAdmCov model, RNN, LSTM and GRU have achieved the same precision of 100.00% on the COVID-19 testing dataset. Performance comparisons of the proposed TfrAdmCov model and the machine learning-based models with GridSearchCV on the Testing dataset have been shown in Table 20.

Table 20. Performance comparison of the proposed TfrAdmCov model and the machine learning-based models with GridSearchCV on the Testing dataset.

As seen in Table 20, the proposed TfrAdmCov model has achieved better results than the other models in terms of accuracy of 99.93%, recall of 97.38%, F1-score of 98.67% and MCC of 98.65% on the COVID-19 testing dataset. In addition, the proposed TfrAdmCov model, SVM, XGBoost, RNN, LSTM and GRU have achieved the same precision of 100.00% on the COVID-19 testing dataset. On the other hand, the LR model has obtained the worst results, with accuracy of 99.79%, precision of 98.18%, recall of 93.90%, F1-score of 95.99% and MCC of 95.90% on the COVID-19 testing dataset. As a result, the proposed TfrAdmCov model is very successful on the sequence-based COVID-19 dataset. Accuracy values for the proposed TfrAdmCov model and the other models on the COVID-19 Testing dataset have been shown in Figure 14.

Figure 14. Accuracy values for the proposed TfrAdmCov model and other models on COVID-19 Testing dataset.

As seen in Figure 14, the proposed TfrAdmCov model reached the best accuracy value of 99.93%, while the LR model reached the worst accuracy value of 99.79%. A comparison of the average performance values of the machine learning models with the stratified 10-fold cross-validation technique and of the proposed TfrAdmCov model and deep learning models over 10 random trials with different random seeds on the Testing dataset has been shown in Table 21.

Table 21. Comparison of the average performance values of the machine learning models with the stratified 10-fold cross-validation technique and of the proposed TfrAdmCov model and deep learning models over 10 random trials with different random seeds on the Testing dataset.

As seen in Table 21, the proposed TfrAdmCov model has achieved better results than the other models in terms of accuracy of 99.924%, recall of 97.18%, F1-score of 98.57% and MCC of 98.54% on the COVID-19 testing dataset, averaged over 10 random trials with different random seeds. In addition, the proposed TfrAdmCov model, SVM, XGBoost, RNN, LSTM and GRU have achieved the same precision of 100.00% on the COVID-19 testing dataset. On the other hand, the LR model has obtained the worst results, with accuracy of 99.79%, precision of 98.14%, recall of 94.90%, F1-score of 95.99% and MCC of 95.89% on the COVID-19 testing dataset. As a result, the proposed TfrAdmCov model is quite successful on the sequence-based COVID-19 dataset. Average accuracy values for the proposed TfrAdmCov model and the other models on the COVID-19 Testing dataset have been shown in Figure 15.

Figure 15. Average accuracy values for the proposed TfrAdmCov model and other models on the COVID-19 Testing Dataset.

As seen in Figure 15, the proposed TfrAdmCov model reached the best average accuracy value of 99.924%, while the LR model reached the worst average accuracy value of 99.79%.

6.4. Statistical analyses for the proposed TfrAdmCov model and deep learning models

The results reported in this study have been obtained by averaging 10 random trials with different random seeds for both the proposed TfrAdmCov model and the deep learning models. Detailed analyses have been performed for each model in terms of the accuracy, precision, recall, F1-score and MCC performance metrics using statistical measures such as average, standard deviation, median, minimum and maximum. The statistical analysis of the proposed TfrAdmCov model over 10 random trials with different random seeds on the Testing dataset has been shown in Table 22.
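
A minimal sketch of how such per-metric statistics can be aggregated is given below, assuming NumPy; `train_and_evaluate` is a hypothetical helper that trains one model with a given seed and returns a dict of its test metrics.

    # Aggregate per-metric statistics over several seeded runs.
    import numpy as np

    def summarise(run_metrics):
        """run_metrics: list of dicts (one per random seed) of test metrics."""
        summary = {}
        for name in run_metrics[0]:
            vals = np.array([m[name] for m in run_metrics])
            summary[name] = {"average": vals.mean(),
                             "std": vals.std(ddof=1),  # sample standard deviation
                             "median": np.median(vals),
                             "min": vals.min(),
                             "max": vals.max()}
        return summary

    # runs = [train_and_evaluate(seed) for seed in range(10)]
    # print(summarise(runs)["accuracy"])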

Table 22. Statistical analysis of the proposed TfrAdmCov model over 10 random trials with different random seeds on the Testing dataset.

As seen in Table 22, the proposed TfrAdmCov model has achieved an average of 0.999238, standard deviation of 0.000036, median of 0.999214, minimum of 0.999214 and maximum of 0.999293 across the 10 accuracy values obtained. The statistical analysis of the RNN model over 10 random trials with different random seeds on the Testing dataset has been shown in Table 23.

Table 23. Statistical analysis of the RNN model over 10 random trials with different random seeds on the Testing dataset.

As seen in Table 23, the RNN model has achieved an average of 0.999175, standard deviation of 0.000039, median of 0.999175, minimum of 0.999136 and maximum of 0.999214 across the 10 accuracy values obtained. The statistical analysis of the LSTM model over 10 random trials with different random seeds on the Testing dataset has been shown in Table 24.

Table 24. Statistical analysis of the LSTM model over 10 random trials with different random seeds on the Testing dataset.

As seen in Table 24, the LSTM model has achieved an average of 0.999159, standard deviation of 0.000036, median of 0.999136, minimum of 0.999136 and maximum of 0.999214 across the 10 accuracy values obtained. The statistical analysis of the GRU model over 10 random trials with different random seeds on the Testing dataset has been shown in Table 25.

Table 25. Statistical analysis of the GRU model over 10 random trials with different random seeds on the Testing dataset.

As seen in Table 25, the GRU model has achieved an average of 0.999144, standard deviation of 0.000023, median of 0.999136, minimum of 0.999136 and maximum of 0.999214 across the 10 accuracy values obtained.

6.5. Why the agglomerative clustering algorithm is preferred over K-means in creating the training, testing and Kfold datasets

In this study, we compared the K-means and agglomerative clustering algorithms for creating the training, testing and Kfold datasets. As seen in Table 26, the performance of the proposed TfrAdmCov model has been higher on the datasets created with the agglomerative clustering algorithm than on those created with the K-means clustering algorithm; therefore, the agglomerative clustering algorithm has been preferred. The performance comparison of the proposed TfrAdmCov model on the Testing datasets created using the K-means and agglomerative clustering algorithms has been shown in Table 26.
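
A minimal sketch of this clustering step, assuming scikit-learn, is shown below; the synthetic data, number of clusters and linkage are illustrative stand-ins for the actual settings listed in Table A1.

    # Agglomerative clustering sketch, assuming scikit-learn.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    # Synthetic stand-in for the encoded sequence features.
    X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=0)

    clusterer = AgglomerativeClustering(n_clusters=4, linkage="ward")
    labels = clusterer.fit_predict(X)

    # Cluster labels can then guide how sequences are assigned to the
    # training, testing and Kfold datasets.
    print(np.bincount(labels))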

Table 26. Performance comparison of the proposed TfrAdmCov model on the Testing datasets created using the K-means and agglomerative clustering algorithms.

6.6. Performance evaluation of the proposed TfrAdmCov model on the influenza A/H3N2 HA dataset

The performance values of the proposed TfrAdmCov model and the other models on the H3N2 HA testing dataset have been shown in Table 27.

Table 27. Performance values of the proposed TfrAdmCov model and other models on the H3N2 HA testing dataset.

As seen in Table 27, the proposed TfrAdmCov model has achieved better results than the other models, with accuracy of 96.33%, precision of 81.55%, recall of 52.33%, F1-score of 63.75% and MCC of 63.61% on the influenza A subtype H3N2 HA testing dataset. On the other hand, the LR model has obtained the worst results, with accuracy of 93.79%, recall of 4.29%, F1-score of 7.87% and MCC of 12.77%. The results on the influenza A subtype H3N2 HA testing dataset showed that the proposed TfrAdmCov model is quite robust. The confusion matrix obtained using the TfrAdmCov model on the H3N2 HA testing dataset has been shown in Figure 16.

Figure 16. Confusion Matrix obtained using TfrAdmCov model on H3N2 HA testing dataset.

As seen in Figure 16, on the H3N2 HA testing dataset, the proposed TfrAdmCov model correctly predicted 853 of the 1630 samples in the mutation class and incorrectly predicted the remaining 777. Moreover, it correctly predicted 24,577 of the 24,770 samples in the no-mutation class and incorrectly predicted the remaining 193. Accuracy values for the proposed TfrAdmCov model and the other models on the H3N2 HA testing dataset have been shown in Figure 17.
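
As a consistency check, the counts in Figure 16 reproduce the reported H3N2 HA metrics; the snippet below reconstructs label vectors from those counts and scores them with scikit-learn.

    # Verify the H3N2 HA metrics from the Figure 16 counts.
    import numpy as np
    from sklearn.metrics import (accuracy_score, f1_score,
                                 precision_score, recall_score)

    # 853 TP and 777 FN in the mutation class; 24,577 TN and 193 FP
    # in the no-mutation class.
    y_true = np.array([1] * 1630 + [0] * 24770)
    y_pred = np.array([1] * 853 + [0] * 777 + [0] * 24577 + [1] * 193)

    print(f"accuracy={accuracy_score(y_true, y_pred):.2%}")    # 96.33%
    print(f"precision={precision_score(y_true, y_pred):.2%}")  # 81.55%
    print(f"recall={recall_score(y_true, y_pred):.2%}")        # 52.33%
    print(f"f1={f1_score(y_true, y_pred):.2%}")                # 63.75%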

Figure 17. Accuracy values for the proposed TfrAdmCov model and other models on the H3N2 HA Testing dataset.

As seen in Figure 17, the proposed TfrAdmCov model reached the best accuracy value of 96.33%, while the LR model reached the worst accuracy value of 93.79%. The comparison of the proposed TfrAdmCov model with state-of-the-art works has been shown in Table 28.

Table 28. Comparison of the proposed TfrAdmCov model with the state-of-the-art works.

As seen in Table 28, the proposed TfrAdmCov model achieved accuracy of 99.93% on the Testing dataset. In addition, the proposed TfrAdmCov model achieved accuracy of 99.924% averaged over 10 random trials with different random seeds. As seen in Table 28, the proposed TfrAdmCov model outperformed the state-of-the-art works. The majority of studies published in the literature concern either other aspects of the COVID-19 virus or mutations of other viruses. We conducted this study to address this gap in the literature. In conclusion, the proposed TfrAdmCov model can successfully perform mutation prediction on the COVID-19 dataset.

7. Limitations

  • When we increased the amount of data used in the study, we experienced problems due to insufficient computer hardware. For this reason, obtaining the analysis results of each deep learning and machine learning-based model took a lot of time.

  • As we reviewed the literature while writing this article, the very limited resources available on the prediction of COVID-19 S protein mutations constituted a major disadvantage for us.

  • Some difficulties have been encountered while preparing the datasets; in particular, the data preprocessing steps required to turn raw COVID-19 data into inputs for the deep learning and machine learning-based models take a lot of time.

8. Conclusion

This study addressed the gap in the literature regarding COVID-19 mutation prediction, because predicting mutations in advance will facilitate the development of vaccines and drugs. Therefore, this study proposes a robust transformer encoder based model with the Adam optimizer algorithm, TfrAdmCov, for COVID-19 mutation prediction. We aimed to predict mutations that may occur in the COVID-19 S (Spike) protein in the year 2022 using the proposed TfrAdmCov model. We used the agglomerative clustering algorithm to create the datasets. In addition, we used the GridSearchCV hyperparameter tuning method to improve the performance of the machine learning-based models. The holdout technique and the stratified 10-fold cross-validation technique have been used to evaluate the performance of each machine learning-based model. We also performed statistical analyses to verify the performance of the proposed TfrAdmCov model and the deep learning-based models. When the experimental results are examined, the proposed TfrAdmCov model outperformed both baseline and several state-of-the-art methods. The proposed TfrAdmCov model reached accuracy of 99.93%, precision of 100.00%, recall of 97.38%, F1-score of 98.67% and MCC of 98.65% on the COVID-19 testing dataset. Besides, the proposed TfrAdmCov model achieved better results than the other models in terms of accuracy of 99.924%, recall of 97.18%, F1-score of 98.57% and MCC of 98.54% on the COVID-19 testing dataset, averaged over 10 random trials with different random seeds. Similarly, the proposed TfrAdmCov model achieved better results than the other models, with accuracy of 96.33%, precision of 81.55%, recall of 52.33%, F1-score of 63.75% and MCC of 63.61% on the influenza A/H3N2 HA protein dataset. As a result, the proposed TfrAdmCov model can successfully predict mutations occurring in both the COVID-19 dataset and the influenza A/H3N2 HA dataset. We believe that this will help the development of vaccines and drugs.

In future studies, we plan to apply the proposed TfrAdmCov model to mutation prediction for other viruses. Additionally, we plan to perform mutation prediction on other proteins of the COVID-19 virus.

Authors contribution statement

Mehmet Burukanli: Software, Methodology, Conceptualisation, Validation, Data curation, Writing original draft. Nejat Yumusak: Methodology, Conceptualisation, Validation, Investigation, Supervision.

Data availability and access

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Ethical and informed consent for data used

This article does not contain any studies with human participants or animals performed by any of the authors. Informed consent was obtained from all individual participants included in the study.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Abbas, M. E., Chengzhang, Z., Fathalla, A., & Xiao, Y. (2022). End-to-end antigenic variant generation for H1N1 influenza HA protein using sequence to sequence models. PLoS ONE, 17(3 March), 1–14. https://doi.org/10.1371/journal.pone.0266198
  • Ahmad, G. N., Fatima, H., Ullah, S., Saidi, A. S., & Imdadullah. (2022). Efficient medical diagnosis of human heart diseases using machine learning techniques with and without GridSearchCV. IEEE Access, 10(March), 80151–80173. https://doi.org/10.1109/ACCESS.2022.3165792
  • Ahmed, I., & Jeon, G. (2022). Enabling artificial intelligence for genome sequence analysis of COVID-19 and alike viruses. Interdisciplinary Sciences: Computational Life Sciences, 14(2), 504–519. https://doi.org/10.1007/s12539-021-00465-0
  • Anonymous. (2023a). clustalw. https://www.genome.jp/tools-bin/clustalw.
  • Anonymous. (2023b). COVID-19 S protein dataset. https://www.ncbi.nlm.nih.gov/datasets/taxonomy/2697049/.
  • Asgari, E., & Mofrad, M. R. K. (2015). Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE, 10(11), e0141287. https://doi.org/10.1371/journal.pone.0141287
  • Barnes, C. O., West, A. P., Huey-Tubman, K. E., Hoffmann, M. A. G., Sharaf, N. G., Hoffman, P. R., Koranda, N., Gristick, H. B., Gaebler, C., Muecksch, F., Lorenzi, J. C. C., Finkin, S., Hägglöf, T., Hurley, A., Millard, K. G., Weisblum, Y., Schmidt, F., Hatziioannou, T., Bieniasz, P. D., … Bjorkman, P. J. (2020). Structures of human antibodies bound to SARS-CoV-2 spike reveal common epitopes and recurrent features of antibodies. Cell, 182(4), 828–842.e16. https://doi.org/10.1016/j.cell.2020.06.025
  • Cai, C., Li, J., Xia, Y., & Li, W. (2024). FluPMT: Prediction of predominant strains of influenza a viruses via multi-task learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1–11. https://doi.org/10.1109/TCBB.2024.3378468
  • Chakraborty, G. S., Singh, D., Rakhra, M., Batra, S., & Singh, A. (2022). Covid-19 and diabetes risk prediction for diabetic patient using advance machine learning techniques and fuzzy inference system. Proceedings of 5th International Conference on Contemporary Computing and Informatics, 14-16 December, IC3I 2022, 2022, 1212–1219. https://doi.org/10.1109/IC3I56241.2022.10073256
  • Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1), 1–13. https://doi.org/10.1186/s12864-019-6413-7
  • Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv:1412.3555, 1–9. http://arxiv.org/abs/1412.3555.
  • Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
  • Cui, J., Li, F., & Shi, Z.-L. (2019). Origin and evolution of pathogenic coronaviruses. Nature Reviews Microbiology, 17(3), 181–192. https://doi.org/10.1038/s41579-018-0118-9
  • de Wit, J. J., & Cook, J. K. A. (2020). Spotlight on avian coronaviruses. Avian Pathology, 49(4), 313–316. https://doi.org/10.1080/03079457.2020.1761010
  • ElAraby, M. E., Elzeki, O. M., Shams, M. Y., Mahmoud, A., & Salem, H. (2022). A novel gray-scale spatial exploitation learning Net for COVID-19 by crawling internet resources. Biomedical Signal Processing and Control, 73(2022), 103441. https://doi.org/10.1016/j.bspc.2021.103441
  • Elzeki, O. M., Elfattah, M. A., Salem, H., Hassanien, A. E., & Shams, M. (2021). A novel perceptual two layer image fusion using deep learning for imbalanced COVID-19 dataset. PeerJ Computer Science, 7:e364, 1–35. https://doi.org/10.7717/PEERJ-CS.364
  • Elzeki, O. M., Shams, M., Sarhan, S., Elfattah, M. A., & Hassanien, A. E. (2021). COVID-19: A new deep learning computer-aided model for classification. PeerJ Computer Science, 7:e358, 1–33. https://doi.org/10.7717/peerj-cs.358
  • Feng, J., Xu, H., Mannor, S., & Yan, S. (2014). Robust logistic regression and classification. Advances in Neural Information Processing Systems, 27, 253–261.
  • Gage, A., Brunson, K., Morris, K., Wallen, S. L., Dhau, J., Gohel, H., & Kaushik, A. (2021). Perspectives of manipulative and high-performance nanosystems to manage consequences of emerging new severe acute respiratory syndrome coronavirus 2 variants. Frontiers in Nanotechnology, 3(June), 1–7. https://doi.org/10.3389/fnano.2021.700888
  • Galassi, A., Lippi, M., & Torroni, P. (2021). Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems, 32(10), 4291–4308. https://doi.org/10.1109/TNNLS.2020.3019893
  • Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2003). KNN Model-based approach in classification. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2888, 986–996. https://doi.org/10.1007/978-3-540-39964-3_62
  • Hai-Dong, L., Ya-Juan, Y., & Lu, L. (2022). In the context of COVID-19: the impact of employees' risk perception on work engagement. Connection Science, 34(1), 1367–1383. https://doi.org/10.1080/09540091.2022.2071839
  • Haimed, A. M. A., Saba, T., Albasha, A., Rehman, A., & Kolivand, M. (2021). Viral reverse engineering using artificial intelligence and big data COVID-19 infection with long short-term memory (LSTM). Environmental Technology & Innovation, 22, 101531. https://doi.org/10.1016/j.eti.2021.101531
  • Hassan, E., Shams, M. Y., Hikal, N. A., & Elmougy, S. (2023). COVID-19 diagnosis-based deep learning approaches for COVIDx dataset: A preliminary survey. In G. K. Mostefaoui, S. M. R. Islam, & F. Tariq (Eds.), Artificial intelligence for disease diagnosis and prognosis in smart healthcare (pp.107–122)
  • Hassan, E., Shams, M. Y., Hikal, N. A., & Elmougy, S. (2024). Detecting COVID-19 in chest CT images based on several pre-trained models. Multimedia Tools and Applications, 1–21. https://doi.org/10.1007/s11042-023-17990-3
  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
  • Hossain, M. K., Hassanzadeganroudsari, M., & Apostolopoulos, V. (2021b). The emergence of new strains of SARS-CoV-2. What does it mean for COVID-19 vaccines? Expert Review of Vaccines, 20(6), 635–638. https://doi.org/10.1080/14760584.2021.1915140
  • Hossain, M. S., Pathan, A. Q. M. S. U., Islam, M. N., Tonmoy, M. I. Q., Rakib, M. I., Munim, M. A., Saha, O., Fariha, A., Al Reza, H., Roy, M., Bahadur, N. M., & Rahaman, M. M. (2021a). Genome-wide identification and prediction of SARS-CoV-2 mutations show an abundance of variants: Integrated study of bioinformatics and deep neural learning. Informatics in Medicine Unlocked, 27, 100798. https://doi.org/10.1016/j.imu.2021.100798
  • Huang, Y., Yang, C., Xu, X., Xu, W., & Liu, S. (2020). Structural and functional properties of SARS-CoV-2 spike protein: Potential antivirus drug development for COVID-19. Acta Pharmacologica Sinica, 41(9), 1141–1149. https://doi.org/10.1038/s41401-020-0485-4
  • Jackson, C. B., Farzan, M., Chen, B., & Choe, H. (2022). Mechanisms of SARS-CoV-2 entry into cells. Nature Reviews Molecular Cell Biology, 23(1), 3–20. https://doi.org/10.1038/s41580-021-00418-x
  • Jaimes, J. A., André, N. M., Chappie, J. S., Millet, J. K., & Whittaker, G. R. (2020). Phylogenetic analysis and structural modeling of SARS-CoV-2 spike protein reveals an evolutionary distinct and proteolytically sensitive activation loop. Journal of Molecular Biology, 432(10), 3309–3325. https://doi.org/10.1016/j.jmb.2020.04.009
  • Kalyan, K. S., Rajasekharan, A., & Sangeetha, S. (2021). AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing. ArXiv Preprint ArXiv:2108.05542., 1–42. http://arxiv.org/abs/2108.05542.
  • Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, IJCAI’95, Morgan Kaufmann, United States, Montreal, QC, Canada, 20–25 August, pp. 1137–1143.
  • Li, M., Zhao, B., Yin, R., Lu, C., Guo, F., & Zeng, M. (2023). Graphlncloc: Long non-coding RNA subcellular localization prediction using graph convolutional networks based on sequence to graph transformation. Briefings in Bioinformatics, 24(1), 1–12. https://doi.org/10.1093/bib/bbac565
  • Lopez-Rincon, A., Perez-Romero, C. A., Tonda, A., Mendoza-Maldonado, L., Claassen, E., Garssen, J., & Kraneveld, A. D. (2021). Design of specific primer sets for the detection of B.1.1.7, B.1.351 and P.1 SARS-CoV-2 variants using deep learning. BioRxiv, 70, 2021.01.20.427043. https://doi.org/10.1101/2021.01.20.427043
  • Luque, A., Carrasco, A., Martín, A., & de las Heras, A. (2019). The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognition, 91, 216–231. https://doi.org/10.1016/j.patcog.2019.02.023
  • Madhi, S. A., Kwatra, G., Myers, J. E., Jassat, W., Dhar, N., Mukendi, C. K., Nana, A. J., Blumberg, L., Welch, R., Ngorima-Mabhena, N., & Mutevedzi, P. C. (2022). Population immunity and Covid-19 severity with Omicron Variant in South Africa. New England Journal of Medicine, 386(14), 1314–1326. https://doi.org/10.1056/NEJMoa2119658
  • Mbow, M., Koide, H., & Sakurai, K. (2021). An intrusion detection system for imbalanced dataset based on deep learning. 2021 Ninth International Symposium on Computing and Networking (CANDAR), 23-26 November 2021, 38–47. https://doi.org/10.1109/CANDAR53791.2021.00013
  • Memon, N., Patel, S. B., & Patel, D. P. (2019). Comparative analysis of artificial neural network and XGBoost algorithm for PolSAR image classification. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 11941(LNCS(Lc)), 452–460. https://doi.org/10.1007/978-3-030-34869-4_49
  • Mohamed, T., Sayed, S., Salah, A., & Houssein, E. H. (2021). Long short-term memory neural networks for RNA viruses mutations prediction. Mathematical Problems in Engineering, 2021. https://doi.org/10.1155/2021/9980347
  • Nawaz, M. S., Fournier-Viger, P., Shojaee, A., & Fujita, H. (2021). Using artificial intelligence techniques for COVID-19 genome analysis. Applied Intelligence, 51(5), 3086–3103. https://doi.org/10.1007/s10489-021-02193-w
  • Norouzi, M., Fleet, D. J., & Salakhutdinov, R. (2012). Hamming distance metric learning. Advances in Neural Information Processing Systems, 2, 1061–1069.
  • Pacal, I. (2024a). A novel Swin transformer approach utilizing residual multi-layer perceptron for diagnosing brain tumors in MRI images. International Journal of Machine Learning and Cybernetics, 1–21. https://doi.org/10.1007/s13042-024-02110-w
  • Pacal, I. (2024b). Maxcervixt: A novel lightweight vision transformer-based approach for precise cervical cancer detection. Knowledge-Based Systems, 289(October 2023). https://doi.org/10.1016/j.knosys.2024.111482
  • Paszke, A., Gross, S., Chintala, S., Chanan, G., & Yang, E. (2017). Automatic differentiation in pytorch. 31st conference on Neural Information Processing Systems (NIPS 2017), Nips, 1–4. https://openreview.net/forum?id=BJJsrmfCZ .
  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, D. (2011). Scikit-learn: Machine learning in Python Fabian. Journal of Machine Learning Research, 12, 2825–2830.
  • Qin, L., Ding, X., Li, Y., Chen, Q., Meng, J., & Jiang, T. (2021). Co-mutation modules capture the evolution and transmission patterns of SARS-CoV-2. Briefings in Bioinformatics, 00(May), 1–10. https://doi.org/10.1093/bib/bbab222
  • Raheja, S., Kasturia, S., Cheng, X., & Kumar, M. (2023). Machine learning-based diffusion model for prediction of coronavirus-19 outbreak. Neural Computing and Applications, 35(19), 13755–13774. https://doi.org/10.1007/s00521-021-06376-x
  • Rashid, Z.Z., Othman, S. N., Abdul Samat, M. N., Ali, U. K., & Ken, W. K. (2020). Diagnostic performance of COVID-19 serology assays. Malaysian Journal of Pathology, 42(1), 13–21.
  • Saha, I., Ghosh, N., Maity, D., Sharma, N., Sarkar, J. P., & Mitra, K. (2020). Genome-wide analysis of Indian SARS-CoV-2 genomes for the identification of genetic mutation and SNP. Infection, Genetics and Evolution, 85, 104457. https://doi.org/10.1016/j.meegid.2020.104457
  • Salama, M. A., Hassanien, A. E., & Mostafa, A. (2016). The prediction of virus mutation using neural networks and rough set techniques. Eurasip Journal on Bioinformatics and Systems Biology, 2016(1), 1–11. https://doi.org/10.1186/s13637-016-0042-0
  • Sasirekha, K., & Baby, P. (2013). Agglomerative hierarchical clustering algorithm – A review. International Journal of Scientific and Research Publications, 3(3), 83–85.
  • Serena Low, W. C., Chuah, J. H., Tee, C. A. T. H., Anis, S., Shoaib, M. A., Faisal, A., Khalil, A., & Lai, K. W. (2021). An overview of deep learning techniques on chest X-Ray and CT scan identification of COVID-19. Computational and Mathematical Methods in Medicine, 2021. https://doi.org/10.1155/2021/5528144
  • Shaikh, F., Andersen, M. B., Sohail, M. R., Mulero, F., Awan, O., Dupont-Roettger, D., Kubassova, O., Dehmeshki, J., & Bisdas, S. (2021). Current landscape of imaging and the potential role for artificial intelligence in the management of COVID-19. Current Problems in Diagnostic Radiology, 50(3), 430–435. https://doi.org/10.1067/j.cpradiol.2020.06.009
  • Sharma, A., Ahmad Farouk, I., & Lal, S. K. (2021). Covid-19: A review on the novel coronavirus disease evolution, transmission, detection, control and prevention. Viruses, 13(2), 1–25. https://doi.org/10.3390/v13020202
  • Shereen, M. A., Khan, S., Kazmi, A., Bashir, N., & Siddique, R. (2020). COVID-19 infection: Emergence, transmission, and characteristics of human coronaviruses. Journal of Advanced Research, 24, 91–98. https://doi.org/10.1016/j.jare.2020.03.005
  • Shiehzadegan, S., Alaghemand, N., Fox, M., & Venketaraman, V. (2021). Analysis of the Delta Variant B.1.617.2 COVID-19. Clinics and Practice, 11(4), 778–784. https://doi.org/10.3390/clinpract11040093
  • Shrestha, H., Dhasarathan, C., Kumar, M., Nidhya, R., Shankar, A., & Kumar, M. (2022). A deep learning based convolution neural network-DCNN approach to detect brain tumor (pp. 115–127). https://doi.org/10.1007/978-981-16-6887-6_11
  • Sohrabi, C., Alsafi, Z., O’Neill, N., Khan, M., Kerwan, A., Al-Jabir, A., Iosifidis, C., & Agha, R. (2020). World Health Organization declares global emergency: A review of the 2019 novel coronavirus (COVID-19). International Journal of Surgery, 76, 71–76. https://doi.org/10.1016/j.ijsu.2020.02.034
  • Sokhansanj, B. A., & Rosen, G. L. (2022). Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning. Computers in Biology and Medicine, 149. https://doi.org/10.1016/j.compbiomed.2022.105969
  • Suri, J. S., Puvvula, A., Biswas, M., Majhail, M., Saba, L., Faa, G., Singh, I. M., Oberleitner, R., Turk, M., Chadha, P. S., Johri, A. M., Sanches, J. M., Khanna, N. N., Viskovic, K., Mavrogeni, S., Laird, J. R., Pareek, G., Miner, M., Sobel, D. W., … Naidu, S. (2020). COVID-19 pathways for brain and heart injury in comorbidity patients: A role of medical imaging and artificial intelligence-based COVID severity classification: A review. Computers in Biology and Medicine, 124(January), 103960. https://doi.org/10.1016/j.compbiomed.2020.103960
  • Tang, L., Wu, T., Chen, X., Wen, S., Zhou, W., Zhu, X., & Xiang, Y. (2024). How COVID-19 impacts telehealth: An empirical study of telehealth services, users and the use of metaverse. Connection Science, 36(1). https://doi.org/10.1080/09540091.2023.2282942
  • Tarek, Z., Shams, M. Y., Towfek, S. K., Alkahtani, H. K., Ibrahim, A., Abdelhamid, A. A., Eid, M. M., Khodadadi, N., Abualigah, L., Khafaga, D. S., & Elshewey, A. M. (2023). An optimized model based on deep learning and gated recurrent unit for COVID-19 death prediction. Biomimetics, 8(7), 552. https://doi.org/10.3390/biomimetics8070552
  • Thara, D. K., Prema Sudha, B. G., & Xiong, F. (2019). Auto-detection of epileptic seizure events using deep neural network with different feature scaling techniques. Pattern Recognition Letters, 128, 544–550. https://doi.org/10.1016/j.patrec.2019.10.029
  • Thölke, P., Jose, Y., Ramos, M., & Abdelhedi, H. (2022). Class imbalance should not throw you off balance: choosing the right classifiers and performance metrics for brain decoding with imbalanced data. BioRxiv, 1–35. https://doi.org/10.1101/2022.07.18.500262
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł, & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 2017-Decem(Nips), 5999–6009.
  • Voita, E., Talbot, D., Moiseev, F., Sennrich, R., & Titov, I. (2020). Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. ACL 2019 – 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 28 July – 2 August, 5797–5808. https://doi.org/10.18653/v1/p19-1580
  • Wang, R., Hozumi, Y., Yin, C., & Wei, G.-W. (2020). Mutations on COVID-19 diagnostic targets. Genomics, 112(6), 5204–5213. https://doi.org/10.1016/j.ygeno.2020.09.028
  • World Health Organization. (2023). Novel-Coronavirus-2019 Www.Who.Int. https://www.who.int/es/emergencies/diseases/novel-coronavirus-2019.
  • Wu, C., Yin, W., Jiang, Y., & Xu, H. E. (2022). Structure genomics of SARS-CoV-2 and its omicron variant: Drug design templates for COVID-19. Acta Pharmacologica Sinica, December 2021, 1–13. https://doi.org/10.1038/s41401-021-00851-w
  • Wu, F., Zhao, S., Yu, B., Chen, Y.-M., Wang, W., Song, Z.-G., Hu, Y., Tao, Z.-W., Tian, J.-H., Pei, Y.-Y., Yuan, M.-L., Zhang, Y.-L., Dai, F.-H., Liu, Y., Wang, Q.-M., Zheng, J.-J., Xu, L., Holmes, E. C., & Zhang, Y.-Z. (2020). A new coronavirus associated with human respiratory disease in China. Nature, 579(7798), 265–269. https://doi.org/10.1038/s41586-020-2008-3
  • Yin, R., Luo, Z., Zhuang, P., Zeng, M., Li, M., Lin, Z., & Kwoh, C. K. (2023). Vipal: A framework for virulence prediction of influenza viruses with prior viral knowledge using genomic sequences. Journal of Biomedical Informatics, 142(April). https://doi.org/10.1016/j.jbi.2023.104388
  • Yin, R., Luusua, E., Dabrowski, J., Zhang, Y., & Kwoh, C. K. (2020). Tempel: Time-series mutation prediction of influenza A viruses via attention-based recurrent neural networks. Bioinformatics (oxford, England), 36(9), 2697–2704. https://doi.org/10.1093/bioinformatics/btaa050
  • Yin, R., Thwin, N. N., Zhuang, P., Lin, Z., & Kwoh, C. K. (2022). IAV-CNN: A 2D convolutional neural network model to predict antigenic variants of influenza A virus. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(6), 3497–3506. https://doi.org/10.1109/TCBB.2021.3108971
  • Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent Neural Network Regularization. ArXiv:1409.2329, 2013, 1–8. http://arxiv.org/abs/1409.2329.
  • Zhang, J., Xiao, T., Cai, Y., & Chen, B. (2021). Structure of SARS-CoV-2 spike protein. Current Opinion in Virology, 50, 173–182. https://doi.org/10.1016/j.coviro.2021.08.010
  • Zhou, B., Zhou, H., Zhang, X., Xu, X., Chai, Y., Zheng, Z., Kot, A. C., & Zhou, Z. (2023). TEMPO: A transformer-based mutation prediction framework for SARS-CoV-2 evolution. Computers in Biology and Medicine, 152, 106264. https://doi.org/10.1016/j.compbiomed.2022.106264

Appendix

Table A1. The parameters of the agglomerative clustering algorithm and the values of these parameters.

Table A2. The best values obtained by using the GridSearchCV algorithm for 3 randomly selected hyperparameters of the SVM model.

Table A3. The best values obtained by using the GridSearchCV algorithm for 3 randomly selected hyperparameters of the KNN model.

Table A4. The best values obtained by using the GridSearchCV algorithm for 3 randomly selected hyperparameters of the XGBoost model.

Table A5. The best values obtained by using the GridSearchCV algorithm for 3 randomly selected hyperparameters of the LR model.