
Research on three-step accelerated gradient algorithm in deep learning

Pages 40-57 | Received 04 Jul 2020, Accepted 02 Nov 2020, Published online: 23 Nov 2020

Abstract

The gradient descent (GD) algorithm is the most widely used optimisation method for training machine learning and deep learning models. In this paper, based on GD, Polyak's momentum (PM), and the Nesterov accelerated gradient (NAG), we establish the convergence of these algorithms from an initial value to the optimal value of an objective function in simple quadratic form. Based on the convergence property of the quadratic function, the two sister sequences of NAG's iteration, and the parallel tangent method in neural networks, we propose the three-step accelerated gradient (TAG) algorithm, which has three sequences rather than two sister sequences. To illustrate the performance of this algorithm, we compare it with the other three algorithms on a quadratic function, high-dimensional quadratic functions, and a nonquadratic function. We then combine the TAG algorithm with the backpropagation algorithm and the stochastic gradient descent algorithm in deep learning. To conveniently implement the proposed algorithms, we rewrite the R package ‘neuralnet’ and extend it to ‘supneuralnet’. All deep learning algorithms in this paper are included in the ‘supneuralnet’ package. Finally, we show that our algorithms are superior to other algorithms in four case studies.

1. Introduction

In recent years, the scale of industrial application systems has expanded rapidly, and industrial application data has grown exponentially due to recent advances in computer and information technology. Big Data brings an all-round revolution in thinking, industry, and management, because it combines vast amounts of data with cloud-level data processing ability and is characterised by Volume, Velocity, Variety, and Veracity (the 4Vs). Enterprise data has reached hundreds of terabytes or even hundreds of petabytes, far beyond the processing capacity of traditional computing technology and information systems. Therefore, studying effective Big Data processing technologies, methods, and measurements has become an urgent requirement in our daily lives.

Viktor Mayer-Schonberger, known as ‘the Prophet of the Big Data Era’, lists a large number of detailed cases of Big Data applications in his book ‘Big Data: A Revolution That Will Transform How We Live, Work, and Think’ (Schonberger & Cukier, Citation2013). He analysed and predicted the development and trends of Big Data and put forward many important ideas and development strategies. He argued that Big Data opens up a stage of great transformation and will bring great changes: it will change our lives, work, and way of thinking, change our business models, and change our economics, politics, technologies, societies, and so on.

‘The Outline of Promoting Development of Big Data’ was published by the State Council of China in September 2015. The Prime Minister of China, Keqiang Li, said that Big Data programmes should be deployed systematically. The ‘Outline’ describes how to promote Big Data development and application and how to create accurate control and a new collaborative mode of social governance in the next 5 to 10 years. Big Data can establish a new mechanism of economic operation that is safe, efficient, and stable; construct a new people-oriented system of public services that benefits all; open a new pattern of innovation-driven development that gives more people a sense of entrepreneurship and innovation; and cultivate a new environment of industrial development that is intelligent, smart, and prosperous.

Machine learning and deep learning are very important Big Data processing technologies. Deep learning, in particular, has become very popular in industry and academia in recent years. An important reason is that Google developed AlphaGo, the first artificial intelligence system to defeat professional Go players and the first to defeat the Go world champion. It was developed by a team led by Demis Hassabis at Google's DeepMind, and its main working principle is ‘deep learning’. AlphaGo uses many new technologies, such as neural networks (Flach, Citation2012), deep learning (Lisa, Citation2015), and Monte Carlo tree search (Bishop, Citation2006; Murphy, Citation2012), which give it substantial power in training on data and in prediction. The performance of deep learning is shown in Figure 1: as the amount of data increases, the performance of deep learning improves more and more compared with machine learning.

Figure 1. The performance comparison of deep learning and machine learning.

Feedforward neural networks are the most basic type of artificial neural network and were the first type to be devised. The connections between neurons in such a network do not form a cycle. They are called feedforward because information in these networks only moves forward, with no loops: it passes first through the input nodes, then through the hidden nodes (if present), and finally through the output nodes (Nielsen, Citation2019). In industry and academia, a neural network with at least one hidden layer is referred to as deep learning.

The gradient descent (GD) algorithm is an optimisation algorithm that minimises a function by iteratively moving in the direction of steepest descent, defined by the negative of the gradient. In artificial neural networks, backpropagation is a specific implementation of the gradient descent algorithm. The name of the backpropagation algorithm originates from the way the gradient of the neural network is calculated, namely backward through the network (Rumelhart & McClelland, Citation1986). It is the most extensively used algorithm for training multilayer feedforward neural networks. Given an artificial neural network and an error function, backpropagation mainly calculates the gradient of the error function with respect to the weights of the neural network. However, throughout the current books and the literature (Chen, Citation2016; Gordon, Citation2018; Le, Citation2015; Magoulas et al., Citation1997; Moreira & Fiesler, Citation1995), the mathematical representations of the backpropagation algorithm differ. This paper presents it with mathematical formulas in the subsequent sections.

The actual procedure is that the gradients of the last layer of weights are calculated first, and the gradients of the first layer of weights are calculated last. When calculating the gradient of the previous layer, the partial derivatives already computed in the current layer can be reused. This technique can be seen clearly when deep learning neural networks are introduced in subsequent sections, and it is also very important in programming: it saves both running time and memory. Therefore, the backward flow of the error information allows the gradient of the weights of each layer to be calculated efficiently, instead of using a very naive method that calculates each gradient separately.

The popularity of backpropagation has resurged in recent years, mainly because deep neural networks have obvious advantages in image recognition and speech recognition. Backpropagation is an iterative gradient descent algorithm, and with the cooperation of the Graphics Processing Unit (GPU) its performance has been greatly improved. However, from Churchland and Sejnowski (Citation2016) and Rumelhart et al. (Citation1986), it is well known that such a learning scheme can be very slow, so various methods have been created to improve the efficiency of conventional backpropagation. Momentum algorithms, first proposed by Polyak (Citation1964), are well-known techniques for accelerating deterministic smooth convex optimisation. Recently, Lan (Citation2009, Citation2012) and Deng et al. (Citation2018) have incorporated these techniques into optimisation.

Intuitively, the reason for using momentum is that the GD algorithm is very slow when the error function surface is a very long and narrow valley. In this case, the gradient direction of the GD algorithm is almost perpendicular to the long axis of the valley, which causes the iterates to oscillate back and forth along the short axis while moving very slowly along the long axis. Qian (Citation1999) pointed out that the role of momentum is to correct the search direction from the short axis towards the long axis of the valley.

Besides, other methods have also been proposed to improve the speed of convergence of the GD algorithm. The conjugate gradient algorithm has been shown to be superior to the GD algorithm in most applications, see Press et al. (Citation1992), but it is less biologically plausible and harder to implement in hardware. Qian (Citation1999) indicated that the momentum algorithm appears to be dominant in the connectionist learning literature. However, the Polyak momentum (PM) algorithm may fail to converge on relatively simple convex optimisation problems. Nesterov (Citation1983, Citation2004) made further improvements to the PM algorithm, later called the Nesterov accelerated gradient (NAG) algorithm. Nesterov's idea was that the improved algorithm should have the same acceleration effect as the PM algorithm but must converge for general convex optimisation problems. Finally, Nesterov proposed a pair of sister sequences that have a corrective effect on the direction of the iterates.

In global optimisation problems, Shah et al. (Citation1964) and Petalas and Vrahatis (Citation2004) used the parallel tangent method to improve the GD algorithm and speed up its convergence. The main idea of the parallel tangent method is that, during the iteration, the parallel tangent direction is occasionally used instead of the gradient descent direction $-\nabla f(x_i)$. Based on the NAG algorithm, the parallel tangent method, and some properties of optimal gradient descent for quadratic functions, we propose a three-step accelerated gradient (TAG) algorithm, which has three sequences.

This paper mainly studies the TAG algorithm, applies it to deep learning, and compares it with the traditional GD, PM, and NAG algorithms. The rest of the paper proceeds as follows. Section 2 introduces the GD, PM, and NAG algorithms, develops the TAG algorithm, and applies them to a quadratic function; experiments on high-dimensional functions then show that TAG has the fastest convergence among these four algorithms. Section 3 introduces deep learning and the backpropagation algorithm, and adds the TAG algorithm to backpropagation. Since the R package ‘neuralnet’ only contains the GD algorithm, we write a new package ‘supneuralnet’, which extends the ‘neuralnet’ package with the PM, NAG, and TAG algorithms. We also add the stochastic gradient descent (SGD) algorithm to the package, which makes it convenient to implement these algorithms. Section 4 applies the four algorithms to the iris and spiral data sets in the deep learning framework to compare their classification accuracy and convergence speed; the proposed algorithm performs best. Section 5 combines the SGD algorithm with the four algorithms mentioned in Section 2 and applies them to the iris and spiral data sets under the deep learning framework to compare their classification accuracy and convergence speed; the proposed algorithm again performs best. Finally, we draw conclusions in Section 6.

2. Three-step accelerated gradient algorithm

2.1. Gradient descent algorithm

From Yin (Citation2015), the main iteration step of the GD algorithm is
(1) $x_i = x_{i-1} - \alpha \nabla f(x_{i-1})$,
for $i = 2, 3, \ldots$. Here $f(x): \mathbb{R}^n \to \mathbb{R}$ is a differentiable function, $\nabla f(x)$ is its gradient, and α is the step size.

Assume that $A$ is symmetric and positive definite ($x^T A x > 0$ for any $x \neq 0$). Consider the quadratic function
(2) $f(x) = \frac{1}{2} x^T A x + b^T x + c$,
with
(3) $\nabla f(x) = A x + b$.
Let
$A = \begin{pmatrix} 6 & 5 \\ 5 & 10 \end{pmatrix}, \quad b = \begin{pmatrix} 10 \\ 2 \end{pmatrix}, \quad x_1 = \begin{pmatrix} 12 \\ 10 \end{pmatrix}.$
Then we can show the GD iteration trajectory.
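To make the iteration concrete, the following R sketch implements the GD update (1) for the quadratic (2) with the $A$, $b$, and $x_1$ given above; the stopping tolerance and the call at the end are illustrative choices rather than the paper's exact settings.

# GD iteration (1) on the quadratic (2); grad_f implements the gradient (3).
A <- matrix(c(6, 5, 5, 10), nrow = 2)
b <- c(10, 2)
grad_f <- function(x) as.vector(A %*% x + b)

gd <- function(x1, alpha, tol = 1e-6, max_iter = 10000) {
  x <- x1
  for (i in seq_len(max_iter)) {
    g <- grad_f(x)
    if (sqrt(sum(g^2)) < tol) break
    x <- x - alpha * g            # update (1)
  }
  list(x = x, steps = i)
}

gd(c(12, 10), alpha = 0.05)       # approaches x* = -solve(A) %*% b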

From Figure 2, the initial point is $x_1 = (12, 10)^T$. The grey elliptic curves are contour lines, on each of which the objective function values are equal. After the first iteration, the initial point $x_1$ jumps to $x_2$. With the following iterations, the point moves step by step and finally reaches the optimal point $x^* = (-2.5714, 1.0857)^T$.

As we can see from Figures 2–4, all trajectories converge to the optimal point $x^* = (-2.5714, 1.0857)^T$, while different learning rates lead to different numbers of iteration steps: 117, 54, and 44 respectively. So a larger learning rate means fewer iteration steps, but too large a learning rate will lead to divergence of the algorithm.

Figure 2. Illustration of GD iteration trajectory with learning rate 0.05.

Figure 3. Illustration of GD iteration trajectory with learning rate 0.1.

Figure 4. Illustration of GD iteration trajectory with learning rate 0.12.

The learning rates above are all constant. Actually, we can use a variable learning rate during the iteration. The optimal gradient descent (OGD) method, which chooses the step size by exact line search, is effective if the objective function is not too difficult.

We start from any $x_1$ and set
(4) $x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad k = 1, 2, \ldots.$
Then
(5) $\alpha_k = \arg\min_{\alpha \ge 0} f\bigl(x_k - \alpha \nabla f(x_k)\bigr) = \dfrac{\nabla f(x_k)^T \nabla f(x_k)}{\nabla f(x_k)^T A \nabla f(x_k)}.$
Equation (5) gives the OGD learning rate.
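A sketch of the OGD iteration, reusing A and grad_f from the GD sketch above: the step size has the closed form (5) for the quadratic, so no numerical line search is needed.

ogd <- function(x1, tol = 1e-6, max_iter = 10000) {
  x <- x1
  for (i in seq_len(max_iter)) {
    g <- grad_f(x)
    if (sqrt(sum(g^2)) < tol) break
    alpha_k <- sum(g * g) / as.numeric(t(g) %*% A %*% g)   # learning rate (5)
    x <- x - alpha_k * g                                   # update (4)
  }
  list(x = x, steps = i)
}

ogd(c(12, 10))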

From Figure 5, we can see that the OGD algorithm converges to the optimal point $x^* = (-2.5714, 1.0857)^T$, and the number of iteration steps is 41.

Figure 5. Illustration of OGD iteration trajectory.

There are some properties of the OGD algorithm for quadratic functions. We can observe that the segments $\overline{x_1 x_3}$, $\overline{x_3 x_5}$, etc. may lie on one line (the dotted lines in Figure 5), and the same holds for the dashed lines. If this is true, then we can accelerate the GD algorithm. We will discuss it later.

2.2. Polyak momentum algorithm

The learning scheme of general gradient descent can sometimes be very slow. The addition of a momentum term has been found to increase the rate of convergence dramatically, see Qian (Citation1999). With this method, (1) takes the form
(6) $x_i = x_{i-1} - \alpha \nabla f(x_{i-1}) + \mu (x_{i-1} - x_{i-2}),$
for $i = 3, 4, \ldots$.

If $\mu \in [0, 1]$ and $\alpha > 0$ in (6), we call it the Polyak momentum (PM) algorithm. The PM algorithm is also called the heavy-ball method in the literature, see Polyak (Citation1964). Here μ is the momentum parameter. That is, the modification of the vector at the current time step depends on both the current gradient and the change of the previous step. Intuitively, the rationale for using the momentum term is that the GD algorithm is particularly slow when the error function surface is a very long and narrow valley. In this situation, the gradient direction of the GD algorithm is almost perpendicular to the long axis of the valley, which causes the iterates to oscillate back and forth along the short axis while moving very slowly along the long axis. Rumelhart and McClelland (Citation1986) also showed that the momentum term helps average out the oscillation along the short axis while adding up contributions along the long axis.
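A sketch of the PM update (6) on the same quadratic, reusing grad_f from above; the first step is plain GD because the momentum term needs two previous points, and the values of alpha and mu in the call are the ones used in Figure 6.

pm <- function(x1, alpha, mu, tol = 1e-6, max_iter = 10000) {
  x_prev <- x1
  x <- x1 - alpha * grad_f(x1)                  # first step: no momentum yet
  for (i in seq_len(max_iter)) {
    g <- grad_f(x)
    if (sqrt(sum(g^2)) < tol) break
    x_new <- x - alpha * g + mu * (x - x_prev)  # update (6)
    x_prev <- x
    x <- x_new
  }
  list(x = x, steps = i)
}

pm(c(12, 10), alpha = 0.1, mu = 0.25)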

We can also show the PM iteration trajectory with different values of the momentum parameter μ. When μ = 0, the PM algorithm reduces to the GD algorithm; that is, the GD algorithm is just a special case of the PM algorithm.

As we can see, when μ is small, for example μ = 0.25, the PM algorithm converges to the optimal point $x^* = (-2.5714, 1.0857)^T$ very fast, see Figure 6. But when μ becomes larger, it converges to the optimal point more slowly, and the iterates repeatedly cross the optimal point, see Figures 7 and 8. The dashed arrow starts from the initial point, where the momentum term cannot yet be used. The numbers of iteration steps are given in Tables 1–3, which will be discussed later.

Figure 6. Illustration of PM iteration trajectory with α=0.1 and μ=0.25.

Figure 7. Illustration of PM iteration trajectory with α=0.1 and μ=0.5.

Figure 8. Illustration of PM iteration trajectory with α=0.1 and μ=0.75.

Table 1. Iteration steps of PM, NAG, and TAG algorithms with learning rate α=0.05, while the number of steps of the GD algorithm is 117.

Table 2. Iteration steps of PM, NAG, and TAG algorithms with learning rate α=0.1, while the number of steps of the GD algorithm is 54.

Table 3. Iteration steps of PM, NAG, and TAG algorithms with learning rate α=0.12, while the number of steps of the GD algorithm is 44.

2.3. Nesterov accelerated gradient algorithm

The PM algorithm may diverge on relatively simple convex optimisation problems. Mitliagkas (Citation2018) presented the following example.

Let $f$ be defined by its gradient as
$\nabla f(x) = \begin{cases} 25x, & \text{if } x < 1; \\ x + 24, & \text{if } 1 \le x < 2; \\ 25x - 24, & \text{otherwise.} \end{cases}$
The proof revolves around the idea that the iterates are stuck in a limit cycle: the following subsequences of the iterates tend to three points that are not the optimal value of $f$,
$x_{3n} \to p, \quad x_{3n+1} \to q, \quad x_{3n+2} \to r,$
and it can be shown that $p \approx 0.65$, $q \approx 1.80$, and $r \approx 2.12$.

Leveraging the idea of momentum introduced by Polyak, Nesterov solved this problem by finding an algorithm that achieves the same acceleration as the PM algorithm but converges for general convex functions. Nesterov gave the Nesterov accelerated gradient (NAG) algorithm in Nesterov (Citation1983); for $i = 3, 4, \ldots$, its main iteration steps are
(7) $y_i = x_{i-1} - \alpha \nabla f(x_{i-1}), \quad x_i = y_i + \mu (y_i - y_{i-1}).$
The main difference between the PM and NAG algorithms comes from the fact that the latter has two sister sequences working together to produce a corrective movement. Comparing with (6), we see that the PM algorithm evaluates the gradient before adding momentum, while the NAG algorithm evaluates the gradient after applying momentum.
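A sketch of the two sister sequences (7), again reusing grad_f from the sketches above; the initialisation of the y sequence and the values of alpha and mu in the call are illustrative choices.

nag <- function(x1, alpha, mu, tol = 1e-6, max_iter = 10000) {
  x <- x1
  y_prev <- x1
  for (i in seq_len(max_iter)) {
    g <- grad_f(x)
    if (sqrt(sum(g^2)) < tol) break
    y <- x - alpha * g            # first sister sequence in (7)
    x <- y + mu * (y - y_prev)    # second sister sequence in (7)
    y_prev <- y
  }
  list(x = x, steps = i)
}

nag(c(12, 10), alpha = 0.1, mu = 0.5)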

We show the NAG iteration trajectory with different values of the momentum parameter μ in Figures 9–11. When μ = 0, like the PM algorithm, the NAG algorithm reduces to the GD algorithm.

Figure 9. Illustration of NAG iteration trajectory with α=0.1 and μ=0.25.

Figure 10. Illustration of NAG iteration trajectory with α=0.1 and μ=0.5.

Figure 11. Illustration of NAG iteration trajectory with α=0.1 and μ=0.75.

When μ is small, for example μ = 0.25 or μ = 0.5, the NAG algorithm converges to the optimal point $x^* = (-2.5714, 1.0857)^T$ very fast, see Figures 9 and 10. But when μ becomes larger, its convergence slows down and the iterates repeatedly cross the optimal point, see Figure 11. The dashed arrow starts from the initial point, where the momentum term cannot yet be used. The numbers of iteration steps are given in Tables 1–3 and will be discussed later.

2.4. Three-step accelerated gradient algorithm

If the dotted lines $\overline{x_1 x_3}$, $\overline{x_3 x_5}$, etc. in Figure 5 lie on one line, or the dashed lines $\overline{x_2 x_4}$, $\overline{x_4 x_6}$, etc. lie on one line, then we can also accelerate the GD algorithm, because they all point to the optimal point. In fact they do lie on one line; we prove this in Theorem 2.1.

Theorem 2.1

Three-step Accelerated Gradient Algorithm for Two-dimensional Quadratic Function

Assume that $A$ is symmetric and positive definite ($x^T A x > 0$ for any $x \neq 0$). Consider the two-dimensional quadratic function
(8) $f(x) = \frac{1}{2} x^T A x + b^T x + c.$
With the optimal gradient descent iteration and a line search, we need only three steps to reach the optimal point.

Proof.

From (8), we get $\nabla f(x) = A x + b$. Let
$\varphi(\alpha) = f(x_{k+1}) = f\bigl(x_k - \alpha \nabla f(x_k)\bigr).$
Then
$\varphi'(\alpha) = -[\nabla f(x_{k+1})]^T \nabla f(x_k).$
Because $\varphi'(\alpha) = 0$ for the optimal gradient descent, we have
$[\nabla f(x_{k+1})]^T \nabla f(x_k) = 0,$
that is,
$(A x_{k+1} + b)^T (A x_k + b) = 0.$
Let $x^*$ be the optimal point, then
$x^* = x_k - A^{-1}(A x_k + b),$
so
$x_k - x^* = A^{-1}(A x_k + b), \qquad x_{k+1} - x^* = A^{-1}(A x_{k+1} + b).$
Then
$(x_{k+1} - x^*)^T A^2 (x_k - x^*) = (A x_{k+1} + b)^T A^{-1} A^2 A^{-1} (A x_k + b) = (A x_{k+1} + b)^T (A x_k + b) = 0.$
Similarly,
$(x_{k+2} - x^*)^T A^2 (x_{k+1} - x^*) = 0.$
Let $Q = A^2$. When $f: \mathbb{R}^2 \to \mathbb{R}$, $Q$ is a $2 \times 2$ matrix, and we have
$(x_{k+2} - x^*)^T [Q (x_{k+1} - x^*)] = 0, \qquad [Q (x_{k+1} - x^*)]^T (x_k - x^*) = 0.$
So both $x_{k+2} - x^*$ and $x_k - x^*$ are orthogonal to $Q (x_{k+1} - x^*)$.

Since $Q(x_{k+1} - x^*)$ is a nonzero vector in $\mathbb{R}^2$, its orthogonal complement is one-dimensional, so $x_{k+2} - x^*$ and $x_k - x^*$ lie in the same one-dimensional subspace. Thus they must be on one line through $x^*$.

Let $k = 1$. With the direction $x_3 - x_1$ and a line search, we reach $x^*$ at $x_4$. So we need only three steps to get to the optimal point. Thus the result holds.
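The construction in the proof can be carried out directly in R, reusing A and grad_f from the sketches above: two optimal-gradient steps from $x_1$, then a line search from $x_3$ along the direction $x_3 - x_1$. For the quadratic the line-search step has a closed form, which is used here in place of the numerical search via ‘pracma’ mentioned below.

ogd_step <- function(x) {                      # one optimal gradient descent step, (4)-(5)
  g <- grad_f(x)
  x - (sum(g * g) / as.numeric(t(g) %*% A %*% g)) * g
}
line_search_step <- function(x, d) {           # argmin over t of f(x + t d) for the quadratic
  t_star <- -as.numeric(t(d) %*% grad_f(x)) / as.numeric(t(d) %*% A %*% d)
  x + t_star * d
}

x1 <- c(12, 10)
x2 <- ogd_step(x1)
x3 <- ogd_step(x2)
x4 <- line_search_step(x3, x3 - x1)            # x4 equals x* = -solve(A) %*% b
x4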

We show the iteration trajectory of the three-step accelerated gradient (TAG) algorithm for the two-dimensional quadratic function in Figure 12. For the line search in the last step, we use the R package ‘pracma’ developed by Borchers (Citation2019).

Figure 12. Illustration of TAG iteration trajectory for two-dimensional quadratic function with line search.

Armijo gave three theorems in Armijo (Citation1966) to illustrate the convergence of the GD algorithm. With the Lipschitz condition in those theorems, we can explain why the algorithms always converge when the learning rate α is small and diverge when the learning rate is large.

When the target function is very complex, such as the error function in deep learning, the line search may take a long time and we would still need many three-step cycles to reach the optimal point. So in this case we use a momentum parameter μ instead of the line search. Based also on the parallel tangent method proposed by Petalas and Vrahatis (Citation2004), we give the TAG algorithm as follows.

The main difference between the NAG and TAG algorithms comes from the fact that the latter has three sequences working together to produce a corrective movement. Based on the idea of Petalas and Vrahatis (Citation2004), the TAG algorithm is also better suited to finding the global optimal value.

We show the TAG iteration trajectory with different values of the momentum parameter μ in Figures 13–16. When μ = 0, just like the above two algorithms, the TAG algorithm reduces to the GD algorithm.

Figure 13. Illustration of TAG iteration trajectory with α=0.1 and μ=0.25.

Figure 14. Illustration of TAG iteration trajectory with α=0.1 and μ=0.5.

Figure 15. Illustration of TAG iteration trajectory with α=0.1 and μ=0.75.

Figure 16. Illustration of TAG iteration trajectory with α=0.1 and μ=1.

When μ = 0.25, μ = 0.5, and μ = 0.75, the TAG algorithm converges to the optimal point $x^* = (-2.5714, 1.0857)^T$ very fast, see Figures 13–15. But when μ becomes larger, it converges more slowly and the iterates repeatedly cross the optimal point, see Figure 16. Unlike the PM and NAG algorithms, however, the TAG algorithm still converges to the optimal point even when μ = 1. The numbers of iteration steps are given in Tables 1–3.

From Tables 1–3, we can see that the TAG algorithm has an advantage in terms of iteration steps, especially when α = 0.1 and α = 0.12. There are two main reasons. One is that the TAG algorithm has three sequences working together to produce a corrective movement. The other is that the objective function is a two-dimensional quadratic function, for which Theorem 2.1 shows the TAG algorithm has an extra advantage. Furthermore, the TAG algorithm is more robust than the PM and NAG algorithms, as the latter two always diverge when μ = 1. To compare the four algorithms further, we give the following two cases.

2.5. Case studies

To compare the above four algorithms (GD, PM, NAG, and TAG), we use two kinds of functions: a high-dimensional quadratic function and a high-dimensional nonquadratic function. The symmetric and positive definite matrix $A$ of the quadratic function is generated with the R package ‘pracma’ developed by Borchers (Citation2019); all kinds of high-dimensional quadratic functions can be generated with this package. For the nonquadratic function, the FLETCHCR function is used, which comes from the CUTE (constrained and unconstrained testing environments, from the Association for Computing Machinery) collection (Andrei, Citation2008).

2.5.1. Quadratic function

Tables 4–6 report the iteration steps and runtime for 10-, 100-, and 500-dimensional quadratic functions. The runtime in seconds is within parentheses. The first row gives the value of the learning rate α. The sign ‘–’ means the algorithm diverges in that scenario.

Table 4. Iteration steps and runtime (within parentheses) of 10-dimensional quadratic function.

Table 5. Iteration steps and runtime (within parentheses) of 100-dimensional quadratic function.

Table 6. Iteration steps and runtime (within parentheses) of 500-dimensional quadratic function.

Tables 4–6 show that, with respect to both iteration steps and runtime, the TAG algorithm is superior to the NAG algorithm, the NAG algorithm is superior to the PM algorithm, and the PM algorithm is superior to the GD algorithm. With respect to the momentum parameter μ, TAG is more robust than NAG, and NAG is more robust than PM; notice that when μ = 1, PM always diverges. The larger the learning rate α, the fewer the iteration steps and the less runtime needed, but too large a learning rate will lead to divergence of the algorithms. When μ changes from 0 to 1, the iteration steps and runtime first decrease and then increase, so there may exist an optimal μ for the PM, NAG, and TAG algorithms. The momentum parameter μ of the TAG algorithm can even be larger than 1. Next, we apply these four algorithms to the FLETCHCR function and compare them.

2.5.2. FLETCHCR function

The FLETCHCR function, which comes from the CUTE collection, is given by
(9) $f(x) = \sum_{i=1}^{n-1} C\,(x_{i+1} - x_i + 1 - x_i^2)^2, \quad x_0 = (0, 0, \ldots, 0), \quad C = 100.$
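For reference, a sketch of the FLETCHCR objective (9) and its gradient as used in the experiments below; the dimension in the example call is illustrative.

fletchcr <- function(x, C = 100) {             # objective (9)
  i <- 1:(length(x) - 1)
  r <- x[i + 1] - x[i] + 1 - x[i]^2
  C * sum(r^2)
}
fletchcr_grad <- function(x, C = 100) {        # gradient of (9)
  n <- length(x)
  g <- numeric(n)
  i <- 1:(n - 1)
  r <- x[i + 1] - x[i] + 1 - x[i]^2
  g[i] <- g[i] + 2 * C * r * (-1 - 2 * x[i])   # contribution through x_i
  g[i + 1] <- g[i + 1] + 2 * C * r             # contribution through x_{i+1}
  g
}

fletchcr(rep(0, 10))                           # equals 100 * (n - 1) at the starting point x0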

Tables 7–9 report the iteration steps, runtime, and optimal μ with its range for 10-, 100-, and 500-dimensional FLETCHCR functions. The runtime in seconds is within parentheses. The range of μ is within brackets, with the optimal value on the left.

Table 7. Iteration steps, runtime (within parentheses), and optimal μ with its range (within brackets) for 10-dimensional FLETCHCR function.

Table 8. Iteration steps, runtime (within parentheses), and optimal μ with its range (within brackets) for 100-dimensional FLETCHCR function.

Table 9. Iteration steps, runtime (within parentheses), and optimal μ with its range (within brackets) for 500-dimensional FLETCHCR function.

From Tables 7–9, we see that, compared with the NAG, PM, and GD algorithms, the TAG algorithm has fewer iteration steps, less runtime, and a longer range of admissible μ, which means the TAG algorithm is more robust than the other algorithms. The optimal value of μ for the TAG algorithm can be larger than 1. The PM and NAG algorithms have similar performance, but both are superior to the GD algorithm.

3. TAG algorithm in deep learning

3.1. Why use deep learning?

The development history of deep learning can be seen in Mcculloch and Pitts (Citation1990), Hebb (Citation1949), Rumelhart et al. (Citation1986), Hinton and Salakhutdinov (Citation2006), and Krizhevsky et al. (Citation2017). The reason we consider accelerating the backpropagation algorithm in deep learning is that deep learning has excellent effects in classification projects, speech recognition, and image processing. This paper mainly studies classification problems. From Matloff (Citation2017) and Hastie et al. (Citation2015), multinomial logistic regression (MLR), a generalisation of logistic regression, is the most practical and effective way to solve multi-class classification problems. The MLR model with $K$ categories is as follows:
(10) $P(y = k \mid x, w) = \dfrac{e^{h(x \mid w_k)}}{1 + \sum_{i=1}^{K-1} e^{h(x \mid w_i)}},$
(11) $P(y = K \mid x, w) = \dfrac{1}{1 + \sum_{i=1}^{K-1} e^{h(x \mid w_i)}}.$
Here $h(x \mid w_k) = w_{k0} + w_{k1} x_1 + w_{k2} x_2 + \cdots + w_{kn} x_n$, and $k = 1, 2, \ldots, K-1$.

Let us look at a spiral data set from a computer science course at Stanford University (Stanford, Citation2019). The input of this data set is $800 \times 2$, the output is $800 \times 4$, and the output values are divided into 4 categories. The scatter plot of the spiral data is given in Figure 17, plotted with the R package ‘ggplot2’ developed by Wickham (Citation2016).

Figure 17. The scatter plot of spiral data.

Referring to James et al. (Citation2015), we estimate the coefficients of the MLR model using the R packages ‘nnet’ developed by Ripley and Venables (Citation2020) and ‘tidyverse’ developed by Wickham (Citation2014). The classification accuracy of this model is only 0.3188, so the classification effect is clearly not good. Hastie et al. (Citation2009) pointed out that the Support Vector Machine (SVM) is a very effective classification method in statistical learning; it can turn a linearly inseparable problem into a separable one. The main part of the SVM model is the following constrained optimisation problem:
(12) $\min_{w, w_0} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i,$
(13) $\text{s.t. } \xi_i \ge 0, \quad y_i (x_i^T w + w_0) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N.$
Here $N$ is the sample size, $\xi_i$ is the slack variable, and $C$ is the penalty coefficient.

Referring to Karatzoglou and Meyer (Citation2006), we use the R package ‘e1071’ developed by Meyer et al. (Citation2019) to train the spiral data set with the SVM model. The selected kernel function is ‘radial’ with a penalty coefficient of C = 5. After training, 547 support vectors are generated and the classification accuracy is 0.9163, which shows that the classification effect is very good.
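For readers who want to reproduce these two baselines, the following is a minimal sketch, assuming the spiral data sit in a data frame named spiral with feature columns x1, x2 and a 4-level factor column label; the column names and the data-generation step are assumptions, and the exact accuracies depend on the generated data.

library(nnet)    # multinomial logistic regression
library(e1071)   # support vector machine

mlr_fit <- multinom(label ~ x1 + x2, data = spiral, trace = FALSE)
mean(predict(mlr_fit, spiral) == spiral$label)        # about 0.32 in the paper

svm_fit <- svm(label ~ x1 + x2, data = spiral,
               kernel = "radial", cost = 5)           # penalty coefficient C = 5
mean(predict(svm_fit, spiral) == spiral$label)        # about 0.92 in the paper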

Finally, we use the R package ‘supneuralnet’ developed by ourselves to train the spiral data set with a deep learning model. This package, which will be introduced in detail later, is rewritten from the R package ‘neuralnet’ developed by Fritsch et al. (Citation2019). Here the neural network contains only one hidden layer with 50 neurons. After 6000 iterations, the classification accuracy reaches 0.98, which exceeds the SVM model.

‘Statistics at a crossroads: who is for the challenge?’ This question is discussed in Berger and He (Citation2019). Research in deep learning is ruinously expensive, and the standard algorithms are very slow to converge, so accelerated algorithms in deep learning are very necessary.

3.2. Backpropagation with three-step accelerated gradient

Feedforward neural networks are the first type of artificial neural network, in which the connections between neurons do not form a cycle. They are simpler than their counterpart, recurrent neural networks. They are called feedforward because the information in these networks only moves forward, with no loops: it passes first through the input neurons, then through the hidden neurons (if present), and finally through the output neurons. In industry and academia, a neural network with at least one hidden layer is referred to as deep learning. Figure 18 shows the information transmission procedure of a feedforward neural network with 3 hidden layers.

Figure 18. Feedforward neural network with 3 hidden layers.

Throughout the books and literature on neural networks and deep learning (Anthony & Bartlett, Citation2009; Blum & Rivest, Citation1992; Hinton et al., Citation2006; Hornik, Citation1991; Kawaguchi, Citation2016; Prechelt, Citation1994), the mathematical representations of neural networks differ. The representation in Goodfellow et al. (Citation2016) is very popular. We combine them and present the mathematical formulas used in this paper.

We consider a feedforward neural network with $M$ layers, whose $l$th layer contains $N_l$ neurons, for $l = 1, 2, \ldots, M$. Let $y_j^l$ denote the output of the $j$th neuron at the $l$th layer. For $l = 1$, the input layer, we let $y_j^1$ equal the input vector $v = (v_1, v_2, \ldots, v_{N_1})$, that is,
(14) $y_j^1 = v_j, \quad j = 1, 2, \ldots, N_1.$
For the 2nd layer to the $(M-1)$th layer, i.e. $2 \le l \le M-1$, $y_j^l$ is calculated by
(15) $\mathrm{net}_j^l = b_j^l + \sum_{i=1}^{N_{l-1}} w_{ij}^{l-1,l} y_i^{l-1},$
(16) $y_j^l = f(\mathrm{net}_j^l),$
where $j = 1, 2, \ldots, N_l$. Here $\mathrm{net}_j^l$ is the sum of the neuron's weighted inputs, $w_{ij}^{l-1,l}$ is the weight from the $i$th neuron at the $(l-1)$th layer to the $j$th neuron at the $l$th layer, $b_j^l$ is the bias of the $j$th neuron at the $l$th layer, and $f(\mathrm{net}_j^l)$ is its activation function. There are many commonly used activation functions; four of them are listed below.

  • Sigmoid function: (17) $f(x) = \dfrac{1}{1 + e^{-x}}$,

  • Tangent function: (18) $f(x) = \tanh(x)$,

  • ReLU function: (19) $f(x) = \max\{0, x\}$,

  • Softplus function: (20) $f(x) = \ln(1 + e^{x})$.

The Softplus function is an approximation of the ReLU function, but it is smooth, which makes it easy to differentiate.

For the $M$th layer, the output layer, $y_j^M$ is calculated by
(21) $\mathrm{net}_j^M = b_j^M + \sum_{i=1}^{N_{M-1}} w_{ij}^{M-1,M} y_i^{M-1},$
(22) $y_j^M = f(\mathrm{net}_j^M),$
where $j = 1, 2, \ldots, N_M$. We give the formulas of the output layer separately because the backpropagation algorithm starts from the output layer in order to minimise the error function. In the following, a combination of numbers is used to represent the structure of a neural network; for example, the neural network in Figure 18 is represented as an 8-9-9-9-4 type neural network.
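A sketch of the forward pass (14)-(16) and (21)-(22) for a fully connected network. Weights are stored as one matrix per layer transition, with the bias in the first row as in the convention of (25) below; the sigmoid activation and the 2-3-2 example network are illustrative choices.

sigmoid <- function(x) 1 / (1 + exp(-x))            # activation (17)

feedforward <- function(v, weights, act = sigmoid) {
  y <- v                                            # layer 1 output equals the input, (14)
  for (l in seq_along(weights)) {
    net <- as.vector(c(1, y) %*% weights[[l]])      # net in (15)/(21), bias via the leading 1
    y <- act(net)                                   # output in (16)/(22)
  }
  y                                                 # output-layer values y^M
}

set.seed(1)                                         # illustrative 2-3-2 network
W <- list(matrix(rnorm(3 * 3), 3, 3), matrix(rnorm(4 * 2), 4, 2))
feedforward(c(0.5, -1), W)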

The backpropagation algorithm was discovered in the 1970s, but its importance was not recognised until Rumelhart, Hinton, and Williams published a paper in Nature (Rumelhart et al., Citation1986). Backpropagation is short for ‘backward propagation of errors’; it is an algorithm for supervised learning of artificial neural networks using gradient descent, and it is the most widely used algorithm for training multilayer feedforward neural networks. Given an artificial neural network and an error function, backpropagation mainly uses the chain rule to calculate the gradient of the error function with respect to the weights.

Let the input–output pair of the neural network be $(v, t)$ and the sample size be $P$. Then the error function (or loss function) is
(23) $E = \frac{1}{2} \sum_{p=1}^{P} \sum_{j=1}^{N_M} (y_{j,p}^M - t_{j,p})^2 = \sum_{p=1}^{P} E_p,$
where the error of the $p$th sample is
(24) $E_p = \frac{1}{2} \sum_{j=1}^{N_M} (y_{j,p}^M - t_{j,p})^2.$
In some of the literature, (23) is divided by $P$, but since $P$ is a constant this has no effect on minimising the error function. There are many kinds of loss functions; this is just the most widely used one.

To simplify the mathematical formulas further, the bias $b_j^l$ of the $j$th neuron at the $l$th layer is incorporated into the weights as $w_{0j}^{l-1,l}$ with a fixed output $y_{0,p}^{l-1} = 1$. Thus
(25) $\mathrm{net}_{j,p}^l = b_j^l + \sum_{i=1}^{N_{l-1}} w_{ij}^{l-1,l} y_{i,p}^{l-1} = \sum_{i=0}^{N_{l-1}} w_{ij}^{l-1,l} y_{i,p}^{l-1}.$
In the backpropagation algorithm, $E$ is usually minimised by gradient descent with a constant learning rate α, and the gradient of $E$ is computed by applying the chain rule and product rule of differential calculus. The backpropagation algorithm is mainly concerned with how to calculate
(26) $\dfrac{\partial E}{\partial w_{ij}^{l-1,l}} = \sum_{p=1}^{P} \dfrac{\partial E_p}{\partial w_{ij}^{l-1,l}}.$
According to the chain rule, we have
(27) $\dfrac{\partial E_p}{\partial w_{ij}^{l-1,l}} = \dfrac{\partial E_p}{\partial \mathrm{net}_{j,p}^l} \dfrac{\partial \mathrm{net}_{j,p}^l}{\partial w_{ij}^{l-1,l}},$
where
$\dfrac{\partial \mathrm{net}_{j,p}^l}{\partial w_{ij}^{l-1,l}} = \dfrac{\partial}{\partial w_{ij}^{l-1,l}} \sum_{k=0}^{N_{l-1}} w_{kj}^{l-1,l} y_{k,p}^{l-1} = y_{i,p}^{l-1}.$
Let
(28) $\delta_{j,p}^l = \dfrac{\partial E_p}{\partial \mathrm{net}_{j,p}^l},$
thus
(29) $\dfrac{\partial E_p}{\partial w_{ij}^{l-1,l}} = \delta_{j,p}^l \, y_{i,p}^{l-1}.$
For the $M$th layer, we have
(30) $\delta_{j,p}^M = \dfrac{\partial E_p}{\partial \mathrm{net}_{j,p}^M} = (y_{j,p}^M - t_{j,p}) f'(\mathrm{net}_{j,p}^M).$
For the hidden layers, i.e. $2 \le l \le M-1$, according to Rumelhart et al. (Citation1986),
(31) $\delta_{j,p}^l = f'(\mathrm{net}_{j,p}^l) \sum_{k=1}^{N_{l+1}} w_{jk}^{l,l+1} \delta_{k,p}^{l+1}.$
Artificial neural networks generally spend a lot of time training data sets with the backpropagation algorithm, so accelerating backpropagation is very meaningful. We give the backpropagation with three-step accelerated gradient (BPTAG) algorithm as follows.
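A sketch of the recursions (29)-(31) for a single sample, with the squared error (24) and the sigmoid activation, using the weight layout of the feedforward sketch above (bias row first in each matrix); it returns one gradient matrix per layer.

sigmoid_prime <- function(net) sigmoid(net) * (1 - sigmoid(net))

backprop_sample <- function(v, t, weights) {
  L <- length(weights)
  ys <- list(v)                                     # y^1 = v, (14)
  nets <- vector("list", L)
  for (l in 1:L) {                                  # forward pass, storing net and y per layer
    nets[[l]] <- as.vector(c(1, ys[[l]]) %*% weights[[l]])
    ys[[l + 1]] <- sigmoid(nets[[l]])
  }
  grads <- vector("list", L)
  delta <- (ys[[L + 1]] - t) * sigmoid_prime(nets[[L]])      # output-layer deltas, (30)
  for (l in L:1) {
    grads[[l]] <- outer(c(1, ys[[l]]), delta)                # gradients dE_p/dw, (29)
    if (l > 1) {                                             # hidden-layer deltas, (31)
      delta <- sigmoid_prime(nets[[l - 1]]) *
        as.vector(weights[[l]][-1, , drop = FALSE] %*% delta)
    }
  }
  grads
}

backprop_sample(c(0.5, -1), t = c(1, 0), weights = W)        # W from the sketch above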

3.3. R package supneuralnet

Programming is important once an algorithm is given, but programming deep learning is difficult. It is relatively easy to program a neural network with only one hidden layer, but such code is not versatile and cannot be used for a neural network with two hidden layers.

The R package ‘neuralnet’ developed by Fritsch et al. (Citation2019) is a good deep learning package, but of the algorithms in this paper it only includes the backpropagation with gradient descent (BPGD) algorithm. So we rewrite this package as ‘supneuralnet’, which includes all the algorithms in this paper. In addition to the BPGD algorithm, the backpropagation with Polyak momentum (BPPM), backpropagation with Nesterov accelerated gradient (BPNAG), and BPTAG algorithms are added to the ‘supneuralnet’ package.

The stochastic gradient descent algorithm is also added to ‘supneuralnet’; accordingly, the backpropagation with stochastic gradient descent (BPSGD), backpropagation with Polyak momentum and stochastic gradient descent (BPPMSGD), backpropagation with Nesterov accelerated stochastic gradient (BPNASG), and backpropagation with three-step accelerated stochastic gradient (BPTASG) algorithms are included in this package as well. These will be used in Section 5.

4. Case studies based on BPTAG algorithm

We apply the BPGD, BPPM, BPNAG, and BPTAG algorithms to two cases and compare their performance. The purposes of the two cases are different: for the iris data set, we fix the training accuracy and compare the iteration steps and runtime; for the spiral data set, we fix the number of iteration steps and compare the training accuracy. In addition, linear methods perform reasonably well on the iris data set, whereas their performance is unacceptable for the spiral data set.

4.1. Iris data set

The iris data set is an open-source data set in R (R Core Team, Citation2020); it has 150 rows and 5 columns. The scatter plot of iris petal measurements is given in Figure 19.

Figure 19. The scatter plot of iris petal.

The ‘supneuralnet’ package is used to train the iris data with the BPGD, BPPM, BPNAG, and BPTAG algorithms. The data is trained 100 times with a 2-6-3 type neural network for each algorithm, and the optimal run is selected. Figure 20 shows the training results of the BPTAG algorithm with learning rate α = 0.05 and momentum parameter μ = 0.1. The machine's processor is a 3.1 GHz Dual-Core Intel Core i7 with 16 GB of memory. The iteration steps and runtime of the algorithms for different values of the learning rate and momentum are given in Tables 10–12. The runtime in seconds is within parentheses.

Figure 20. BPTAG algorithm training results of iris data set with α=0.05 and μ=0.1.

Table 10. Iteration steps and runtime (within parentheses) of BPPM, BPNAG, and BPTAG algorithms with learning rate α=0.025 for the iris data set, while the steps and runtime of the BPGD algorithm are 64822 and 17.88 s, respectively.

Table 11. Iteration steps and runtime (within parentheses) of BPPM, BPNAG, and BPTAG algorithms with learning rate α=0.05 for the iris data set, while the steps and runtime of the BPGD algorithm are 48849 and 11.77 s, respectively.

Table 12. Iteration steps and runtime (within parentheses) of BPPM, BPNAG, and BPTAG algorithms with learning rate α=0.075 for the iris data set, while the steps and runtime of the BPGD algorithm are 33503 and 8.09 s, respectively.

It can be seen from the three tables that the BPPM, BPNAG, and BPTAG algorithms are better than the BPGD algorithm. If α is too large, the algorithms will diverge; if α is too small, more iteration steps and runtime are needed. The momentum parameter μ shows a similar tendency. This is consistent with the behaviour of the learning rate and momentum of the GD, PM, NAG, and TAG algorithms in Section 2. In fact, during the 100 training runs behind Tables 10–12, some runs of these algorithms diverge. Taken together, the BPTAG algorithm is better than the BPPM and BPNAG algorithms and is more robust with respect to the momentum parameter μ, because it still converges when μ becomes larger. The BPPM algorithm has similar iteration steps and runtime to the BPNAG algorithm, but the BPNAG algorithm is more robust, because the BPPM algorithm is more likely to diverge when μ becomes larger.

4.2. Spiral data set

The spiral data set was introduced in Section 3 and cannot be classified with linear models. In this case, the softmax loss function is chosen to train the data, thus
(32) $E_p = -\sum_{j=1}^{N_M} t_{j,p} \ln \dfrac{e^{y_{j,p}^M}}{\sum_{k=1}^{N_M} e^{y_{k,p}^M}}.$
In order to avoid overfitting, L2 regularisation is added to $E$. Therefore the loss function becomes
(33) $E = -\sum_{p=1}^{P} \sum_{j=1}^{N_M} t_{j,p} \ln \dfrac{e^{y_{j,p}^M}}{\sum_{k=1}^{N_M} e^{y_{k,p}^M}} + \dfrac{\lambda}{2} \sum_{l=2}^{M} \sum_{i=1}^{N_{l-1}} \sum_{j=1}^{N_l} \bigl(w_{ij}^{l-1,l}\bigr)^2,$
where λ is the regularisation parameter.
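A sketch of a loss of the form (32)-(33), with the class scores given by the output-layer values: scores is a P x N_M matrix, labels gives the integer class of each sample, weights is the list of weight matrices, and lambda is the regularisation parameter. This is an illustrative form, not the internal implementation of ‘supneuralnet’.

softmax_l2_loss <- function(scores, labels, weights, lambda) {
  z <- exp(scores - apply(scores, 1, max))                   # numerically stabilised softmax
  prob <- z / rowSums(z)
  data_loss <- -sum(log(prob[cbind(seq_len(nrow(scores)), labels)]))    # softmax part of (33)
  reg_loss <- (lambda / 2) * sum(sapply(weights, function(W) sum(W^2))) # L2 penalty in (33)
  data_loss + reg_loss
}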

The spiral data set is trained by the BPGD, BPPM, BPNAG, and BPTAG algorithms with the ReLU activation function. The number of iteration steps is set to 3000 with regularisation parameter λ = 0.0002 for all algorithms. The data is trained 100 times with a 2-50-4 type neural network for each algorithm, and the optimal run is selected. The training accuracy of the algorithms for different values of the learning rate α and momentum μ is given in Tables 13–15.

Table 13. Training accuracy of BPPM, BPNAG, and BPTAG algorithms with learning rate α=0.25 for spiral data set, while the accuracy of BPGD algorithm is 0.8775.

Table 14. Training accuracy of BPPM, BPNAG, and BPTAG algorithms with learning rate α=0.5 for spiral data set, while the accuracy of BPGD algorithm is 0.8950.

Table 15. Training accuracy of BPPM, BPNAG, and BPTAG algorithms with learning rate α=0.75 for spiral data set, while the accuracy of BPGD algorithm is 0.8975.

It can be seen from the three tables that the four algorithms achieve their highest accuracy when α = 0.75. As the learning rate α decreases to 0.5 and then to 0.25, the training accuracy of the four algorithms decreases slowly. If the momentum μ is too large or too small, the accuracy of the BPPM and BPNAG algorithms drops. In particular, the training accuracy of the BPPM algorithm is 0.25 when μ = 1, which means that it classifies all the training data into one class, and the training accuracy of the BPNAG algorithm is occasionally 0.25 when μ = 1, which is very low. At the same time, the training accuracy of the BPTAG algorithm remains particularly high, even reaching 0.99, which is a great advantage. Once again, this shows that the BPTAG algorithm is more robust than the BPPM and BPNAG algorithms with respect to the momentum parameter μ. In general, the relative performance of the four algorithms is the same as for the iris data set: the BPTAG algorithm has the highest training accuracy, the BPGD algorithm has the lowest, and the training accuracies of the BPPM and BPNAG algorithms are almost the same.

5. Three-step accelerated stochastic gradient algorithm in deep learning

5.1. Backpropagation with three-step accelerated stochastic gradient

The gradient descent algorithm needs to use all the data in each iteration during training, which makes solving optimisation problems on large-scale data challenging. Given an artificial neural network, in order to update the weights, the BPGD algorithm needs to use all the training data every time. When $P$ is large, this requires a lot of computational resources and computing time, which is not feasible in an actual production process. The stochastic gradient descent (SGD) algorithm is proposed to solve this problem.

The SGD algorithm uses a minibatch of the data instead of all the data to calculate the gradient each time. That is to say, the SGD algorithm randomly selects $Q$ samples from the $P$ samples when calculating the gradient in each iteration, where $1 \le Q \le P - 1$.

In practical applications, parameter tuning is used to obtain the optimal $Q$. Generally, the computer program can take full advantage of matrix operations when $Q = 2^n$, such as 16, 32, 64, 128, 256, and so on.
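A minimal sketch of the minibatch selection described above: in each iteration, Q of the P samples are drawn at random and only their gradients are accumulated. Here data is a P x N_1 matrix, targets is a P x N_M matrix, and grad_sample stands for a per-sample gradient routine such as the backprop_sample sketch above; this interface is an assumption for illustration.

sgd_minibatch_grad <- function(data, targets, weights, Q, grad_sample) {
  idx <- sample.int(nrow(data), size = Q)         # random minibatch, 1 <= Q <= P - 1
  grads <- NULL
  for (p in idx) {
    g <- grad_sample(data[p, ], targets[p, ], weights)
    grads <- if (is.null(grads)) g else Map(`+`, grads, g)   # accumulate layer-wise gradients
  }
  grads
}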

Since the backpropagation with stochastic gradient descent (BPSGD) algorithm uses only part of the data set in each iteration, it needs more iteration steps to converge, so accelerating the BPSGD algorithm is very helpful for big data computing. The BPSGD, backpropagation with Polyak momentum and stochastic gradient descent (BPPMSGD), backpropagation with Nesterov accelerated stochastic gradient (BPNASG), and backpropagation with three-step accelerated stochastic gradient (BPTASG) algorithms have been added to the R package ‘supneuralnet’, see Section 3. The BPTASG algorithm is given as follows.

5.2. Iris data set

The ‘supneuralnet’ package is used to train the iris data with the BPSGD, BPPMSGD, BPNASG, and BPTASG algorithms. The data is trained 100 times with a 2-6-3 type neural network and minibatch size Q = 64 for each algorithm, and the optimal run is selected. Following the results of Section 4, we choose the learning rate α = 0.05. Figure 21 shows the training results of the BPTASG algorithm with momentum parameter μ = 0.1. The iteration steps and runtime of the algorithms for different values of the momentum are given in Table 16. The runtime in seconds is within parentheses.

Figure 21. BPTASG algorithm training results of iris data set with α=0.05 and μ=0.1.

Table 16. Iteration steps and runtime (within parentheses) of BPPMSGD, BPNASG, and BPTASG algorithms with learning rate α=0.05 for the iris data set, while the steps and runtime of the BPSGD algorithm are 123859 and 22.13 s, respectively.

Compared with Table 11, the BPSGD algorithm requires 123859 steps to converge, which is about 2.5 times the iteration steps of the BPGD algorithm, and 22.13 seconds, which is about 2 times the runtime of the BPGD algorithm. The iteration steps required by the accelerated algorithms are almost twice those obtained with the full data set; some need slightly less than twice, such as the BPTASG algorithm with momentum μ = 0.05. The runtime required by the accelerated algorithms is about 1.5 times that with the full data set. This shows that the accelerated algorithms have a very good effect, and the BPPMSGD algorithm even converges when μ = 0.2. In general, too large or too small a momentum μ leads to more iteration steps and runtime for the BPPMSGD, BPNASG, and BPTASG algorithms. The BPTASG algorithm has the fewest iteration steps and least runtime, the BPSGD algorithm has the most, and the BPPMSGD and BPNASG algorithms are almost the same.

5.3. Spiral data set

Following the results of Section 4, we choose the learning rate α = 0.5. The spiral data set is trained by the BPSGD, BPPMSGD, BPNASG, and BPTASG algorithms with the ReLU activation function, softmax loss function, and L2 regularisation. The number of iteration steps is set to 3000 with regularisation parameter λ = 0.0002 for all algorithms. The data is trained 100 times with a 2-50-4 type neural network and minibatch size Q = 128 for each algorithm, and the optimal run is selected. The training accuracy of the algorithms for different values of the momentum μ is given in Table 17.

Table 17. Training accuracy of BPPMSGD, BPNASG, and BPTASG algorithms with learning rate α=0.5 for spiral data set, while the accuracy of BPSGD algorithm is 0.8887.

Compared with Table 14, we find that with only 3000 iterations the SGD variant reduces the training accuracy; it actually needs more iteration steps to reach the accuracy in Table 14. However, the runtime is much shorter after adding the SGD algorithm, and the accuracy is still high enough. It can be seen from Table 17 that too large or too small a momentum μ leads to lower accuracy for the accelerated algorithms. In particular, the training accuracy of the BPPMSGD algorithm is 0.25 when μ = 1, which means that it classifies all the training data into one class, and the training accuracy of the BPNASG algorithm is 0.3075 at this point, which is very low. The training accuracy of the BPTASG algorithm remains very high, which is a great advantage. This shows that the BPTASG algorithm is more robust than the BPPMSGD and BPNASG algorithms with respect to the momentum parameter μ. Overall, the BPTASG algorithm is superior to the BPNASG algorithm, the BPNASG algorithm is superior to the BPPMSGD algorithm, and the BPPMSGD algorithm is superior to the BPSGD algorithm.

6. Conclusion

This paper proposes the TAG algorithm and applies it to deep learning. Firstly, we introduce the GD algorithm and apply it to a quadratic function, then state the PM and NAG algorithms, give the TAG algorithm, and apply all of them to the quadratic function. Experiments on high-dimensional functions show that TAG has the fastest convergence among the four algorithms. Secondly, we introduce deep learning and the backpropagation algorithm, and the TAG algorithm is added to backpropagation. Since the R package ‘neuralnet’ only contains the GD algorithm among the algorithms in this paper, we rewrite the ‘neuralnet’ package as ‘supneuralnet’, adding the PM, NAG, and TAG algorithms. We also add the SGD algorithm, so that SGD can be combined freely with the other algorithms. Thirdly, we apply the four algorithms to the iris and spiral data sets in the deep learning framework and compare their classification accuracy and convergence speed; our algorithm is the best. Finally, we combine the SGD algorithm with the above four algorithms and apply them to the iris and spiral data sets in the deep learning framework, again comparing classification accuracy and convergence speed; our algorithm is again the best. In the future, more research can be done on the scalability of the BPTASG algorithm with more data sets, and the theoretical convergence analysis of the BPTASG algorithm is also well worth studying.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by National Natural Science Foundation of China (11271136, 81530086), Program of Shanghai Subject Chief Scientist (14XD1401600) and the 111 Project of China (No. B14019).

Notes on contributors

Yongqiang Lian

Yongqiang Lian is a Ph.D. candidate in Statistics at East China Normal University.

Yincai Tang

Yincai Tang is a Professor in School of Statistics at East China Normal University.

Shirong Zhou

Shirong Zhou is a Ph.D. student in Statistics at East China Normal University.

References