Iteration and stochastic first-order oracle complexities of stochastic gradient descent using constant and decaying learning rates

Received 23 Feb 2024, Accepted 08 Jun 2024, Published online: 19 Jun 2024

Abstract

The performance of stochastic gradient descent (SGD), which is the simplest first-order optimizer for training deep neural networks, depends not only on the learning rate but also on the batch size. Both affect the number of iterations and the stochastic first-order oracle (SFO) complexity needed for training. In particular, previous numerical results indicated that, for SGD using a constant learning rate, the number of iterations needed for training decreases as the batch size increases, and that the SFO complexity needed for training is minimized at a critical batch size and increases once the batch size exceeds that size. Here, we study the relationship between batch size and the iteration and SFO complexities needed for nonconvex optimization in deep learning with SGD using constant or decaying learning rates and show that SGD using the critical batch size minimizes the SFO complexity. We also provide numerical comparisons of SGD with the existing first-order optimizers and show the usefulness of SGD using a critical batch size. Moreover, we show that the measured critical batch sizes are close to the sizes estimated from our theoretical results.


1. Introduction

1.1. Background

First-order optimizers can train deep neural networks by minimizing loss functions called the expected and empirical risks. They use stochastic first-order derivatives (stochastic gradients), which are estimates of the full gradient of the loss function. The simplest first-order optimizer is stochastic gradient descent (SGD) [Citation1–5], which has a number of variants, including momentum variants [Citation6, Citation7] and numerous adaptive variants, such as adaptive gradient (AdaGrad) [Citation8], root mean square propagation (RMSProp) [Citation9], adaptive moment estimation (Adam) [Citation10], adaptive mean square gradient (AMSGrad) [Citation11], and Adam with decoupled weight decay (AdamW) [Citation12].

SGD can be applied to nonconvex optimization [Citation13–22], where its performance strongly depends on the learning rate $\alpha_k$. For example, under the bounded variance assumption, SGD using a constant learning rate $\alpha_k = \alpha$ satisfies $\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla f(\theta_k)\|^2] = O(\frac{1}{K}) + \sigma^2$ [Citation18, Theorem 12], and SGD using a decaying learning rate (i.e. $\alpha_k \to 0$) satisfies $\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla f(\theta_k)\|^2] = O(\frac{1}{\sqrt{K}})$ [Citation18, Theorem 11], where $(\theta_k)_{k\in\mathbb{N}}$ is the sequence generated by SGD to find a local minimizer of $f$, $K$ is the number of iterations, and $\sigma^2$ is the upper bound of the variance.

The performance of SGD also depends on the batch size b. The convergence analyses reported in [Citation14, Citation17, Citation21, Citation23, Citation24] indicated that SGD with a decaying learning rate and a large batch size converges to a local minimizer of the loss function. In [Citation25], it was numerically shown that using an enormous batch reduces both the number of parameter updates and the model training time. Moreover, setting appropriate batch sizes for the optimizers used in training generative adversarial networks was investigated in [Citation26].

1.2. Motivation

The previous numerical results in [Citation27] indicated that, for SGD using constant or linearly decaying learning rates, the number of iterations $K$ needed to train a deep neural network decreases as the batch size $b$ increases. Motivated by the numerical results in [Citation27], we decided to clarify the theoretical iteration complexity of SGD with a constant or decaying learning rate in training a deep neural network. We used the performance measure of previous theoretical analyses of SGD, i.e. $\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|] \le \epsilon$, where $\epsilon$ ($>0$) is the precision and $[0:K-1] := \{0, 1, \ldots, K-1\}$. We found that, if SGD is an $\epsilon$-approximation, i.e. $\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|] \le \epsilon$, then it can train a deep neural network in $K$ iterations.

In addition, the numerical results in [Citation27] indicated an interesting fact: diminishing returns exist beyond a critical batch size; i.e. the number of iterations needed to train a deep neural network does not strictly decrease beyond the critical batch size. Here, we define the stochastic first-order oracle (SFO) complexity as $N := Kb$, where $K$ is the number of iterations needed to train a deep neural network and $b$ is the batch size, as stated above. The model uses $b$ stochastic gradients of the loss functions per iteration, so training incurs a stochastic gradient computation cost of $N = Kb$. From the numerical results in [Citation27, Figures 4 and 5], we can conclude that the critical batch size $b^\star$ (if it exists) is useful for SGD, since the SFO complexity $N(b)$ is minimized at $b = b^\star$ and increases once the batch size exceeds $b^\star$. Hence, on the basis of the first motivation stated above, we decided to clarify the SFO complexities needed for SGD using a constant or decaying learning rate to be an $\epsilon$-approximation.

1.3. Contribution

1.3.1. Upper bound of theoretical performance measure

To clarify the iteration and SFO complexities needed for SGD to be an $\epsilon$-approximation, we first give upper bounds of $\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2]$ for the sequence $(\theta_k)_{k\in\mathbb{N}}$ generated by SGD with constant or decaying learning rates (see Theorem 3.1 for the definitions of $C_i$ and $D_i$). As our aim is to show that SGD is an $\epsilon$-approximation, i.e. $\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \epsilon^2$, it is desirable that the upper bounds of $\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2]$ be small. Table 1 indicates that the upper bounds become small when the number of iterations and the batch size are large. The table also indicates that the convergence of SGD strongly depends on the batch size, since the variance terms (including $\sigma^2$ and $b$; see Theorem 3.1 for the definitions of $C_2$ and $D_2$) in the upper bounds of $\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2]$ decrease as the batch size becomes larger.

Table 1. Upper bounds of $\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2]$ for SGD using a constant or decaying learning rate and the critical batch sizes that minimize the SFO complexities and achieve $\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|] \le \epsilon$ ($C_i$ and $D_i$ are positive constants, $K$ is the number of iterations, $b$ is the batch size, $T \ge 1$, $\epsilon > 0$, and $L$ is the Lipschitz constant of $\nabla f$).

1.3.2. Critical batch size to reduce SFO complexity

Section 1.3.1 showed that using large batches is appropriate for SGD in the sense of minimizing the upper bound of the performance measure. Here, we are interested in finding appropriate batch sizes from the viewpoint of the computation cost, because the SFO complexity increases with the batch size. As indicated in Section 1.2, the critical batch size $b^\star$ minimizes the SFO complexity $N = Kb$. Hence, we investigate the properties of the SFO complexity $N = Kb$ needed to achieve an $\epsilon$-approximation. Let us first consider SGD using a constant learning rate. From the "Upper Bound" row in Table 1, we have
$$\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \frac{C_1}{K} + \frac{C_2}{b} \le \epsilon^2 \iff K \ge K(b) := \frac{C_1 b}{\epsilon^2 b - C_2} \quad \left(b > \frac{C_2}{\epsilon^2}\right).$$
We can check that the number of iterations $K(b) := \frac{C_1 b}{\epsilon^2 b - C_2}$ needed to achieve an $\epsilon$-approximation is monotone decreasing and convex with respect to the batch size (Theorem 3.2). Accordingly, we have that $K(b) \ge \inf\{K : \min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|] \le \epsilon\}$, where SGD using the batch size $b$ generates $(\theta_k)_{k=0}^{K-1}$. Moreover, the SFO complexity is $N(b) = K(b)b = \frac{C_1 b^2}{\epsilon^2 b - C_2}$. The convexity of $N(b) = \frac{C_1 b^2}{\epsilon^2 b - C_2}$ (Theorem 3.3) ensures that a critical batch size $b^\star = \frac{2C_2}{\epsilon^2}$ at which $N'(b^\star) = 0$ exists such that $N(b)$ is minimized at $b^\star$ (see the "Critical Batch Size" row in Table 1). A similar discussion guarantees the existence of a critical batch size for SGD using a decaying learning rate $\alpha_k = \frac{1}{(\lfloor k/T\rfloor + 1)^a}$, where $T \ge 1$, $a\in(0,\frac{1}{2})\cup(\frac{1}{2},1)$, and $\lfloor\cdot\rfloor$ is the floor function (see the "Critical Batch Size" row in Table 1).
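As a sanity check of this reasoning, the following minimal Python sketch evaluates $K(b)$ and $N(b)$ for the constant-learning-rate case with made-up constants ($C_1$, $C_2$, and $\epsilon$ are placeholders; the true values depend on $f(\theta_0)-f^\star$, $L$, $\sigma^2$, and $\alpha$): $K(b)$ decreases monotonically, while $N(b) = K(b)b$ is minimized at $b^\star = 2C_2/\epsilon^2$.

```python
import numpy as np

# Placeholder constants for illustration only; the true C1 and C2 depend on
# f(theta_0) - f*, L, sigma^2, and the learning rate alpha (Theorem 3.1).
C1, C2, eps = 10.0, 0.5, 0.1

def K(b):
    """Iterations needed for an epsilon-approximation with a constant learning rate."""
    return C1 * b / (eps**2 * b - C2)

def N(b):
    """SFO complexity N(b) = K(b) * b."""
    return K(b) * b

b = np.linspace(C2 / eps**2 + 1.0, 500.0, 10_000)  # domain: b > C2 / eps^2
b_star = 2.0 * C2 / eps**2                          # critical batch size from Theorem 3.3

assert np.all(np.diff(K(b)) < 0)                    # K(b) is monotone decreasing in b
print(b_star, b[np.argmin(N(b))])                   # 100.0 and approximately 100
```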

Meanwhile, for the decaying learning rate $\alpha_k = \frac{1}{\sqrt{\lfloor k/T\rfloor + 1}}$, although $N(b)$ is convex with respect to $b$, we have that $N'(b) > 0$ for all $b > \frac{D_2}{\epsilon^2}$ (Theorem 3.3(iii)). Hence, in this case, a critical batch size $b^\star$ defined by $N'(b^\star) = 0$ does not exist. However, since the critical batch size should minimize the SFO complexity $N$, we can define one as $b^\star \approx \frac{D_2}{\epsilon^2}$. Accordingly, we have that $N(b^\star) \approx \inf\{Kb : \min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|] \le \epsilon\}$, where SGD using $b^\star$ generates $(\theta_k)_{k=0}^{K-1}$.

1.3.3. Iteration and SFO complexities

Let $\mathcal{F}(n,\Delta_0,L)$ be an $L$-smooth function class with $f := \frac{1}{n}\sum_{i=1}^{n} f_i$ and $f(\theta_0) - f^\star \le \Delta_0$ (see (C1)) and let $\mathcal{O}(b,\sigma^2)$ be a stochastic first-order oracle class (see (C2) and (C3)). The iteration complexity $K_\epsilon$ [Citation21, (7)] and the SFO complexity $N_\epsilon$ needed for SGD to be an $\epsilon$-approximation are defined as
(1) $K_\epsilon(n,b,\alpha_k,\Delta_0,L,\sigma^2) := \sup_{O\in\mathcal{O}(b,\sigma^2)}\sup_{f\in\mathcal{F}(n,\Delta_0,L)}\inf\{K : \min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|] \le \epsilon\}$,
(2) $N_\epsilon(n,b,\alpha_k,\Delta_0,L,\sigma^2) := K_\epsilon(n,b,\alpha_k,\Delta_0,L,\sigma^2)\, b$.
Table 2 summarizes the iteration and SFO complexities (see also Theorem 3.4). Corollaries 6 and 7 in [Citation14] coincide with our results for SGD with a constant learning rate in Theorems 3.1 and 3.3, since the randomized stochastic projected gradient-free algorithm in [Citation14] is a stochastic zeroth-order (SZO) method that coincides with SGD and can be applied to situations where only noisy function values are available. In particular, Corollary 6 in [Citation14] gave the convergence rate of the SZO method using a fixed batch size, and Corollary 7 showed that the SZO complexity of that method matches the SFO complexity. Hence, Corollaries 6 and 7 in [Citation14] imply that the iteration complexity of SGD using a constant learning rate is $O(1/\epsilon^2)$ and its SFO complexity is $O(1/\epsilon^4)$.

Table 2. Iteration and SFO complexities needed for SGD using a constant or decaying learning rate to be an $\epsilon$-approximation (the critical batch sizes are used to compute $K_\epsilon$ and $N_\epsilon$).

Since the positive constants $C_i$ and $D_i$ depend on the learning rate, we need to compare numerically the performance of SGD with a constant learning rate with that of SGD with a decaying learning rate. Moreover, we also need to compare SGD with the existing first-order optimizers in order to verify its usefulness. Section 4 presents numerical comparisons showing that SGD using the critical batch size outperforms the existing first-order optimizers. We also show that the measured critical batch sizes are close to the theoretical sizes.

2. Nonconvex optimization and SGD

2.1. Nonconvex optimization in deep learning

Let $\mathbb{R}^d$ be a $d$-dimensional Euclidean space with inner product $\langle x, y\rangle := x^\top y$ inducing the norm $\|x\|$, and let $\mathbb{N}$ be the set of nonnegative integers. Define $[0:n] := \{0, 1, \ldots, n\}$ for $n \ge 1$. Let $(x_k)_{k\in\mathbb{N}}$ and $(y_k)_{k\in\mathbb{N}}$ be positive real sequences and let $x(\epsilon), y(\epsilon) > 0$, where $\epsilon > 0$. $O$ denotes Landau's symbol; i.e. $y_k = O(x_k)$ if there exist $c > 0$ and $k_0\in\mathbb{N}$ such that $y_k \le c x_k$ for all $k \ge k_0$, and $y(\epsilon) = O(x(\epsilon))$ if there exists $c > 0$ such that $y(\epsilon) \le c x(\epsilon)$. Given a parameter $\theta\in\mathbb{R}^d$ and a data point $z$ in a data domain $Z$, a machine-learning model provides a prediction whose quality is measured by a differentiable nonconvex loss function $\ell(\theta; z)$. We aim to minimize the empirical loss defined for all $\theta\in\mathbb{R}^d$ by $f(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell(\theta; z_i) = \frac{1}{n}\sum_{i=1}^{n} f_i(\theta)$, where $S = (z_1, z_2, \ldots, z_n)$ denotes the training set (we assume that the number of training data $n$ is large) and $f_i(\cdot) := \ell(\cdot; z_i)$ denotes the loss function corresponding to the $i$-th training data $z_i$.
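For concreteness, a minimal sketch of this empirical loss for a classification model in PyTorch (the model, data, and cross-entropy loss are illustrative assumptions, not part of the paper's formulation):

```python
import torch
import torch.nn.functional as F

def empirical_loss(model: torch.nn.Module, data: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """f(theta) = (1/n) * sum_i ell(theta; z_i), the mean of the per-example losses f_i."""
    logits = model(data)                                              # predictions for all n examples
    per_example = F.cross_entropy(logits, targets, reduction="none")  # f_i(theta) = ell(theta; z_i)
    return per_example.mean()                                         # empirical risk f(theta)
```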

2.2. SGD

2.2.1. Conditions and algorithm

We assume that a stochastic first-order oracle (SFO) exists such that, for a given $\theta\in\mathbb{R}^d$, it returns a stochastic gradient $\mathsf{G}_\xi(\theta)$ of the function $f$, where the random variable $\xi$ is independent of $\theta$. Let $\mathbb{E}_\xi[\cdot]$ denote the expectation with respect to $\xi$. The following are standard conditions.

  • (C1) $f := \frac{1}{n}\sum_{i=1}^{n} f_i : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth, i.e. $\nabla f : \mathbb{R}^d \to \mathbb{R}^d$ is $L$-Lipschitz continuous (i.e. $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$), and $f$ is bounded below by $f^\star\in\mathbb{R}$. Let $\Delta_0 > 0$ satisfy $f(\theta_0) - f^\star \le \Delta_0$, where $\theta_0$ is an initial point.

  • (C2) Let $(\theta_k)_{k\in\mathbb{N}} \subset \mathbb{R}^d$ be the sequence generated by SGD. For each iteration $k$, $\mathbb{E}_{\xi_k}[\mathsf{G}_{\xi_k}(\theta_k)] = \nabla f(\theta_k)$, where $\xi_0, \xi_1, \ldots$ are independent samples and the random variable $\xi_k$ is independent of $(\theta_l)_{l=0}^{k}$. There exists a nonnegative constant $\sigma^2$ such that $\mathbb{E}_{\xi_k}[\|\mathsf{G}_{\xi_k}(\theta_k) - \nabla f(\theta_k)\|^2] \le \sigma^2$.

  • (C3) For each iteration $k$, SGD samples a batch $B_k$ of size $b$ independently of $k$ and estimates the full gradient $\nabla f$ as $\nabla f_{B_k}(\theta_k) := \frac{1}{b}\sum_{i\in[b]}\mathsf{G}_{\xi_{k,i}}(\theta_k)$, where $b \le n$ and $\xi_{k,i}$ is the random variable generated by the $i$-th sampling in the $k$-th iteration.

Algorithm 1 is the SGD optimizer under (C1)–(C3).
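Algorithm 1 itself is not reproduced here, but a minimal PyTorch sketch of the mini-batch SGD update it describes, $\theta_{k+1} = \theta_k - \alpha_k\nabla f_{B_k}(\theta_k)$, is given below; the loader, model, and loss function are placeholders rather than the paper's experimental setup.

```python
import itertools
import torch

def sgd(model, loss_fn, loader, lr_schedule, num_iters):
    """Mini-batch SGD under (C1)-(C3): theta_{k+1} = theta_k - alpha_k * grad f_{B_k}(theta_k)."""
    params = [p for p in model.parameters() if p.requires_grad]
    batches = itertools.cycle(loader)                # keep yielding batches of size b
    for k in range(num_iters):
        x, y = next(batches)                         # mini-batch B_k
        loss = loss_fn(model(x), y)                  # (1/b) * sum_{i in B_k} f_i(theta_k)
        grads = torch.autograd.grad(loss, params)    # stochastic gradient grad f_{B_k}(theta_k)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= lr_schedule(k) * g              # theta_{k+1} = theta_k - alpha_k * g
    return model
```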

2.2.2. Learning rates

We use the following learning rates:

(Constant rate) $\alpha_k$ does not depend on $k\in\mathbb{N}$, i.e. $\alpha_k = \alpha < \frac{2}{L}$ ($k\in\mathbb{N}$), where the upper bound $\frac{2}{L}$ on $\alpha$ is needed to analyse SGD (see Appendix A.2).

(Decaying rate) $(\alpha_k)_{k\in\mathbb{N}} \subset (0, +\infty)$ is monotone decreasing in $k$ (i.e. $\alpha_k \ge \alpha_{k+1}$) and converges to 0. In particular, we will use $\alpha_k = \frac{1}{(\lfloor k/T\rfloor + 1)^a}$, where $T \ge 1$, $\lfloor\cdot\rfloor$ is the floor function, and (Decay 1) $a\in(0,\frac{1}{2})$, (Decay 2) $a = \frac{1}{2}$, or (Decay 3) $a\in(\frac{1}{2},1)$. It is guaranteed that there exists $k_0\in\mathbb{N}$ such that, for all $k \ge k_0$, $\alpha_k < \frac{2}{L}$. Furthermore, we assume that $k_0 = 0$, since we can replace $\alpha_k$ with $\frac{\alpha}{(\lfloor k/T\rfloor + 1)^a} \le \alpha < \frac{2}{L}$ ($k\in\mathbb{N}$), where $\alpha\in(0,\frac{2}{L})$ is defined as in (Constant).
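A minimal sketch of these schedules, written in the scaled form $\alpha_k = \frac{\alpha}{(\lfloor k/T\rfloor + 1)^a}$ mentioned above (the numeric values are placeholders):

```python
import math

def make_lr_schedule(alpha, T=1, a=None, L=None):
    """Return a function k -> alpha_k: (Constant) if a is None, otherwise
    alpha_k = alpha / (floor(k / T) + 1)**a, covering (Decay 1)-(Decay 3)."""
    if L is not None:
        assert alpha < 2.0 / L, "the analysis assumes alpha < 2/L"
    if a is None:
        return lambda k: alpha                                 # (Constant)
    return lambda k: alpha / (math.floor(k / T) + 1) ** a      # (Decay 1)-(Decay 3)

constant = make_lr_schedule(alpha=0.1)
decay1 = make_lr_schedule(alpha=0.1, T=10, a=0.25)   # a in (0, 1/2)
decay2 = make_lr_schedule(alpha=0.1, T=10, a=0.5)    # a = 1/2
decay3 = make_lr_schedule(alpha=0.1, T=10, a=0.75)   # a in (1/2, 1)
```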

3. Our results

3.1. Upper bound of the squared norm of the full gradient

Here, we give an upper bound of $\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2]$, where $\mathbb{E}[\cdot]$ stands for the total expectation, for the sequence generated by SGD using each of the learning rates defined in Section 2.2.2.

Theorem 3.1

Upper bound of the squared norm of the full gradient

The sequence $(\theta_k)_{k\in\mathbb{N}}$ generated by Algorithm 1 under (C1)–(C3) satisfies that, for all $K \ge 1$,
$$\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \begin{cases}\dfrac{C_1}{K} + \dfrac{C_2}{b} & \text{(Constant)}\\[4pt] \dfrac{D_1}{K^{a}} + \dfrac{D_2}{(1-2a)K^{a}b} & \text{(Decay 1)}\\[4pt] \dfrac{D_1}{\sqrt{K}} + \left(\dfrac{1}{\sqrt{K}} + 1\right)\dfrac{D_2}{b} & \text{(Decay 2)}\\[4pt] \dfrac{D_1}{K^{1-a}} + \dfrac{2aD_2}{(2a-1)K^{1-a}b} & \text{(Decay 3)}\end{cases}$$
where
$$C_1 := \frac{2(f(\theta_0) - f^\star)}{(2 - L\alpha)\alpha},\quad C_2 := \frac{L\sigma^2\alpha}{2 - L\alpha},\quad D_1 := \frac{2(f(\theta_0) - f^\star)}{\alpha(2 - L\alpha_0)},\quad D_2 := \frac{T\alpha^2 L\sigma^2}{2 - L\alpha_0}.$$

Theorem 3.1 indicates that the upper bound of $\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2]$ consists of a bias term including $f(\theta_0) - f^\star$ and a variance term including $\sigma^2$, and that these terms become small when the number of iterations and the batch size are large. In particular, the bias term using (Constant) is $O(\frac{1}{K})$, which is a better rate than those obtained using (Decay 1)–(Decay 3).

3.2. Number of iterations needed for SGD to be an ϵ–approximation

Let us suppose that SGD is an $\epsilon$-approximation defined as follows:
(3) $\mathbb{E}[\|\nabla f(\theta_{k^\star})\|^2] := \min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \epsilon^2$,
where $\epsilon > 0$ is the precision and $k^\star \in [0:K-1]$. Condition (3) implies that $\mathbb{E}[\|\nabla f(\theta_{k^\star})\|] \le \epsilon$. Theorem 3.2 below gives the number of iterations needed to achieve an $\epsilon$-approximation (3).

Theorem 3.2

Numbers of iterations needed for nonconvex optimization using SGD

Let $(\theta_k)_{k\in\mathbb{N}}$ be the sequence generated by Algorithm 1 under (C1)–(C3) and let $K : \mathbb{R} \to \mathbb{R}$ be
$$K(b) = \begin{cases}\dfrac{C_1 b}{\epsilon^2 b - C_2} & \text{(Constant)}\\[4pt] \left\{\dfrac{1}{\epsilon^2}\left(\dfrac{D_2}{(1-2a)b} + D_1\right)\right\}^{\frac{1}{a}} & \text{(Decay 1)}\\[4pt] \left(\dfrac{D_1 b + D_2}{\epsilon^2 b - D_2}\right)^{2} & \text{(Decay 2)}\\[4pt] \left\{\dfrac{1}{\epsilon^2}\left(\dfrac{2aD_2}{(2a-1)b} + D_1\right)\right\}^{\frac{1}{1-a}} & \text{(Decay 3)}\end{cases}$$
where $C_1$, $C_2$, $D_1$, and $D_2$ are defined as in Theorem 3.1, the domain of $K$ in (Constant) is $b > \frac{C_2}{\epsilon^2}$, and the domain of $K$ in (Decay 2) is $b > \frac{D_2}{\epsilon^2}$. Then, we have the following:

  1. The above $K$ achieves an $\epsilon$-approximation (3).

  2. The above $K$ is a monotone decreasing and convex function with respect to the batch size $b$.

Theorem 3.2 indicates that the number of iterations needed for SGD using constant or decaying learning rates to be an $\epsilon$-approximation is small when the batch size is large. Hence, it is appropriate to set a large batch size in order to minimize the number of iterations needed for an $\epsilon$-approximation (3). However, the SFO complexity, i.e. the cost of the stochastic gradient computation, grows with $b$; thus, the appropriate batch size should also minimize the SFO complexity.

3.3. SFO complexity needed for SGD to be an ϵ–approximation

Theorem 3.2 leads to the following theorem on the properties of the SFO complexity $N$ needed for SGD to be an $\epsilon$-approximation (3).

Theorem 3.3

SFO complexity needed for nonconvex optimization of SGD

Let $(\theta_k)_{k\in\mathbb{N}}$ be the sequence generated by Algorithm 1 under (C1)–(C3) and define $N : \mathbb{R} \to \mathbb{R}$ by
$$N(b) = K(b)\,b = \begin{cases}\dfrac{C_1 b^2}{\epsilon^2 b - C_2} & \text{(Constant)}\\[4pt] \left\{\dfrac{1}{\epsilon^2}\left(\dfrac{D_2}{(1-2a)b} + D_1\right)\right\}^{\frac{1}{a}} b & \text{(Decay 1)}\\[4pt] \left(\dfrac{D_1 b + D_2}{\epsilon^2 b - D_2}\right)^{2} b & \text{(Decay 2)}\\[4pt] \left\{\dfrac{1}{\epsilon^2}\left(\dfrac{2aD_2}{(2a-1)b} + D_1\right)\right\}^{\frac{1}{1-a}} b & \text{(Decay 3)}\end{cases}$$
where $C_1$, $C_2$, $D_1$, and $D_2$ are as in Theorem 3.1, the domain of $N$ in (Constant) is $b > \frac{C_2}{\epsilon^2}$, and the domain of $N$ in (Decay 2) is $b > \frac{D_2}{\epsilon^2}$. Then, we have the following:

  1. The above $N$ is convex with respect to the batch size $b$.

  2. There exists a critical batch size
(4) $$b^\star = \begin{cases}\dfrac{2C_2}{\epsilon^2} & \text{(Constant)}\\[4pt] \dfrac{(1-a)D_2}{a(1-2a)D_1} & \text{(Decay 1)}\\[4pt] \dfrac{2a^2 D_2}{(1-a)(2a-1)D_1} & \text{(Decay 3)}\end{cases}$$
satisfying $N'(b^\star) = 0$ such that $b^\star$ minimizes the SFO complexity $N$.

  3. For (Decay 2), $N'(b) > 0$ holds for all $b > \frac{D_2}{\epsilon^2}$.

Theorem 3.3(ii) indicates that, if we can set the critical batch size (4) for each of (Constant), (Decay 1), and (Decay 3), then the SFO complexity will be minimized. However, it would be difficult to set $b^\star$ in (4) before implementing SGD, since $b^\star$ in (4) involves unknown parameters, such as $L$ and $\sigma^2$ (computing $L$ is NP-hard [Citation28]). Hence, we estimate the critical batch sizes by using Theorem 3.3(ii) (see Section 4.3). Theorem 3.3(ii) also indicates that the smaller $\epsilon$ is, the larger the critical batch size $b^\star$ in (Constant) becomes. Theorem 3.3(iii) indicates that the critical batch size is close to $\frac{D_2}{\epsilon^2}$ when using (Decay 2) to minimize the SFO complexity $N$.
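Since the constants in (4) are unknown in practice, the following sketch only evaluates the closed-form expressions in (4) with made-up values of $C_2$, $D_1$, and $D_2$ (placeholders, not values measured in Section 4):

```python
def critical_batch_size(schedule, eps=0.1, C2=0.5, D1=10.0, D2=4.0, a=0.25):
    """Evaluate the critical batch size b* in (4); all constants are placeholders."""
    if schedule == "constant":
        return 2.0 * C2 / eps**2
    if schedule == "decay1":                                  # a in (0, 1/2)
        return (1.0 - a) * D2 / (a * (1.0 - 2.0 * a) * D1)
    if schedule == "decay3":                                  # a in (1/2, 1)
        return 2.0 * a**2 * D2 / ((1.0 - a) * (2.0 * a - 1.0) * D1)
    raise ValueError("(Decay 2) has no finite minimizer; use b close to D2 / eps^2")

print(critical_batch_size("constant"))                        # 100.0
print(critical_batch_size("decay1", a=0.25))                  # 2.4
print(critical_batch_size("decay3", a=0.75))                  # 3.6
```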

3.4. Iteration and SFO complexities of SGD

Theorems 3.2 and 3.3 lead to the following theorem indicating the iteration and SFO complexities needed for SGD to be an $\epsilon$-approximation (see also Table 2).

Theorem 3.4

Iteration and SFO complexities of SGD

The iteration and SFO complexities needed for Algorithm 1 under (C1)–(C3) to be an $\epsilon$-approximation (3) are as follows:
$$\left(K_\epsilon(n,b,\alpha_k,\Delta_0,L,\sigma^2),\, N_\epsilon(n,b,\alpha_k,\Delta_0,L,\sigma^2)\right) = \begin{cases}\left(O\!\left(\dfrac{1}{\epsilon^{2}}\right),\, O\!\left(\dfrac{1}{\epsilon^{4}}\right)\right) & \text{(Constant)}\\[6pt] \left(O\!\left(\dfrac{1}{\epsilon^{\frac{2}{a}}}\right),\, O\!\left(\dfrac{1}{\epsilon^{\frac{2}{a}}}\right)\right) & \text{(Decay 1)}\\[6pt] \left(O\!\left(\dfrac{1}{\epsilon^{4}}\right),\, O\!\left(\dfrac{1}{\epsilon^{6}}\right)\right) & \text{(Decay 2)}\\[6pt] \left(O\!\left(\dfrac{1}{\epsilon^{\frac{2}{1-a}}}\right),\, O\!\left(\dfrac{1}{\epsilon^{\frac{2}{1-a}}}\right)\right) & \text{(Decay 3)}\end{cases}$$
where $K_\epsilon(n,b,\alpha_k,\Delta_0,L,\sigma^2)$ and $N_\epsilon(n,b,\alpha_k,\Delta_0,L,\sigma^2)$ are defined as in (1) and (2), and the critical batch sizes in Theorem 3.3 are used to compute $K_\epsilon$ and $N_\epsilon$. In (Decay 2), we assume that $b = \frac{D_2 + 1}{\epsilon^2}$ (see also (4)).

Theorem 3.4 indicates that the iteration and SFO complexities for (Constant) are smaller than those for (Decay 1)–(Decay 3).

4. Numerical results

We numerically verified the number of iterations and the SFO complexity needed to achieve high test accuracy for different batch sizes in training ResNet [Citation29] and Wide-ResNet [Citation30]. The parameter $\alpha$ used in (Constant) was determined by a grid search over $\{0.001, 0.005, 0.01, 0.05, 0.1, 0.5\}$. The parameters $\alpha$ and $T$ used in the decaying learning rates (Decay 1)–(Decay 3), defined by $\alpha_k = \frac{\alpha}{(\lfloor k/T\rfloor + 1)^a}$, were determined by a grid search over $\alpha\in\{0.001, 0.1, 0.125, 0.25, 0.5, 1.0\}$ and $T\in\{5, 10, 20, 30, 40, 50\}$. The parameter $a$ was set to $a = \frac{1}{4}$ in (Decay 1) and $a = \frac{3}{4}$ in (Decay 3). We compared SGD with SGD with momentum (momentum), Adam, AdamW, and RMSProp. The learning rates and hyperparameters of these four optimizers were determined on the basis of the previous results [Citation9, Citation10, Citation12] (the weight decay used in momentum was $5\times 10^{-4}$). The experimental environment consisted of an NVIDIA DGX A100 (8 GPUs) and dual AMD Rome 7742 CPUs (2.25 GHz, 128 cores). The software environment was Python 3.10.6, PyTorch 1.13.1, and CUDA 11.6. The code is available at https://github.com/imakn0907/SGD_using_decaying.
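A minimal sketch of the grid search described above (evaluate_test_accuracy is a hypothetical callback that trains SGD with the given schedule and returns the test accuracy; it is not part of the released code):

```python
import itertools

constant_alpha_grid = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5]   # alpha for (Constant)
decay_alpha_grid = [0.001, 0.1, 0.125, 0.25, 0.5, 1.0]       # alpha for (Decay 1)-(Decay 3)
decay_T_grid = [5, 10, 20, 30, 40, 50]                       # T for (Decay 1)-(Decay 3)

def grid_search_decay(evaluate_test_accuracy, a):
    """Pick (alpha, T) maximizing test accuracy for alpha_k = alpha / (floor(k/T) + 1)**a."""
    return max(itertools.product(decay_alpha_grid, decay_T_grid),
               key=lambda cfg: evaluate_test_accuracy(alpha=cfg[0], T=cfg[1], a=a))
```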

4.1. Training ResNet-18 on the CIFAR-10 and CIFAR-100 datasets

First, we trained ResNet-18 on the CIFAR-10 dataset. The stopping condition of the optimizers was 200 epochs. Figure 1 indicates that the number of iterations is monotone decreasing and convex with respect to the batch size for SGD using a constant or decaying learning rate. Figure 2 indicates that, in each case of SGD with (Constant)–(Decay 3), a critical batch size $b^\star = 2^4$ exists at which the SFO complexity is minimized.

Figure 1. Number of iterations needed for SGD with (Constant), (Decay 1), (Decay 2), and (Decay 3) to achieve a test accuracy of 0.9 versus batch size (ResNet-18 on CIFAR-10).

Figure 2. SFO complexity needed for SGD with (Constant), (Decay 1), (Decay 2), and (Decay 3) to achieve a test accuracy of 0.9 versus batch size (ResNet-18 on CIFAR-10).

Figures 3 and 4 plot the number of iterations and the SFO complexity for the four learning rates needed to achieve a test accuracy of 0.6 when training ResNet-18 on the CIFAR-100 dataset. The figures indicate that critical batch sizes existed when using (Constant)–(Decay 3).

Figure 3. Number of iterations needed for SGD with (Constant), (Decay 1), (Decay 2), and (Decay 3) to achieve a test accuracy of 0.6 versus batch size (ResNet-18 on CIFAR-100).

Figure 4. SFO complexity needed for SGD with (Constant), (Decay 1), (Decay 2), and (Decay 3) to achieve a test accuracy of 0.6 versus batch size (ResNet-18 on CIFAR-100).

Figures 5 and 6 compare SGD with (Decay 1) against the other optimizers in training ResNet-18 on the CIFAR-100 dataset. These figures indicate that SGD with (Decay 1) and a critical batch size ($b^\star = 2^4$) outperformed the other optimizers in the sense of minimizing the number of iterations and the SFO complexity. Figure 6 also indicates that the existing optimizers using constant learning rates had critical batch sizes minimizing the SFO complexities. In particular, AdamW using the critical batch size $b^\star = 2^5$ performed well.

Figure 5. Number of iterations needed for SGD with (Decay 1), momentum, Adam, AdamW, and RMSProp to achieve a test accuracy of 0.6 versus batch size (ResNet-18 on CIFAR-100).

Figure 6. SFO complexity needed for SGD with (Decay 1), momentum, Adam, AdamW, and RMSProp to achieve a test accuracy of 0.6 versus batch size (ResNet-18 on CIFAR-100).

4.2. Training Wide-ResNet on the CIFAR-10 and CIFAR-100 datasets

Next, we trained Wide-ResNet-28 [Citation30] on the CIFAR-10 and CIFAR-100 datasets. The stopping condition of the optimizers was 200 epochs. Figures 7 and 8 show the number of iterations and the SFO complexity needed for SGD to achieve a test accuracy of 0.9 (CIFAR-10) versus batch size. They indicate that the critical batch size was $b^\star = 2^4$ in each case of SGD using (Constant)–(Decay 3).

Figure 7. Number of iterations needed for SGD with (Constant), (Decay 1), (Decay 2), and (Decay 3) to achieve a test accuracy of 0.9 versus batch size (Wide-ResNet-28-10 on CIFAR-10).

Figure 8. SFO complexity needed for SGD with (Constant), (Decay 1), (Decay 2), and (Decay 3) to achieve a test accuracy of 0.9 versus batch size (Wide-ResNet-28-10 on CIFAR-10).

Figures 9 and 10 plot the number of iterations and the SFO complexity needed for SGD to achieve a test accuracy of 0.6 (CIFAR-100) versus batch size and show that a critical batch size existed for (Constant)–(Decay 3).

Figure 9. Number of iterations needed for SGD with (Constant), (Decay 1), (Decay 2), and (Decay 3) to achieve a test accuracy of 0.6 versus batch size (WideResNet-28-12 on CIFAR-100).

Figure 10. SFO complexity needed for SGD with (Constant), (Decay 1), (Decay 2), and (Decay 3) to achieve a test accuracy of 0.6 versus batch size (WideResNet-28-12 on CIFAR-100).

As in Section 4.1, SGD using (Decay 1) and the existing optimizers using constant learning rates had critical batch sizes minimizing the SFO complexities; Figures 11 and 12 show similar results for SGD using (Decay 3).

Figure 11. Number of iterations needed for SGD with (Decay 3), momentum, Adam, AdamW, and RMSProp to achieve a test accuracy of 0.9 versus batch size (WideResNet-28-10 on CIFAR-10).

Figure 12. SFO complexity needed for SGD with (Decay 3), momentum, Adam, AdamW, and RMSProp to achieve a test accuracy of 0.9 versus batch size (WideResNet-28-10 on CIFAR-10).

4.3. Estimation of critical batch sizes

We estimated the critical batch sizes of (Decay 3) using Theorem 3.3 and measured the critical batch sizes of (Decay 1). From (4) and $a = \frac{1}{4}$ ((Decay 1)), we have that, for training ResNet-18 on the CIFAR-10 dataset, $b^\star = 2^4 = \frac{(1-a)D_2}{a(1-2a)D_1}$, i.e. $\frac{D_2}{D_1} = \frac{8}{3}$. Then, the estimated critical batch size of SGD using (Decay 3) ($a = \frac{3}{4}$) for training ResNet-18 on the CIFAR-10 dataset is $b^\star = \frac{2a^2}{(1-a)(2a-1)}\frac{D_2}{D_1} = \frac{2a^2}{(1-a)(2a-1)}\cdot\frac{8}{3} = 24 \in (2^4, 2^5)$, which implies that the estimated critical batch size $b^\star = 24$ is close to the measured size $b^\star = 2^4$. We also found that the estimated critical batch sizes are close to the measured critical batch sizes (see Table 3).
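A minimal sketch of this estimation in code (only the exponents $a$ and the measured (Decay 1) critical batch size are inputs; the constants $D_1$ and $D_2$ are never needed individually, only their ratio):

```python
from fractions import Fraction

def ratio_D2_D1(b_measured, a):
    """Invert the (Decay 1) entry of (4), b* = (1 - a) D2 / (a (1 - 2a) D1), for D2 / D1."""
    a = Fraction(a)
    return Fraction(b_measured) * a * (1 - 2 * a) / (1 - a)

def estimate_decay3(b_measured_decay1, a_decay1, a_decay3):
    """Estimate the (Decay 3) critical batch size from the measured (Decay 1) one."""
    r = ratio_D2_D1(b_measured_decay1, a_decay1)
    a = Fraction(a_decay3)
    return 2 * a**2 * r / ((1 - a) * (2 * a - 1))

print(ratio_D2_D1(16, Fraction(1, 4)))                       # 8/3
print(estimate_decay3(16, Fraction(1, 4), Fraction(3, 4)))   # 24
```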

Table 3. Measured (left) and estimated (right; bold) critical batch sizes (D1 and D3 stand for (Decay 1) and (Decay 3)).

5. Conclusion and future work

This paper investigated the number of iterations and the SFO complexity required for SGD using constant or decaying learning rates to achieve an $\epsilon$-approximation. Our theoretical analyses indicated that the number of iterations needed for an $\epsilon$-approximation is monotone decreasing and convex with respect to the batch size, and that the SFO complexity needed for an $\epsilon$-approximation is convex with respect to the batch size. Moreover, we showed that SGD using a critical batch size reduces the SFO complexity. The numerical results indicated that SGD using the critical batch size performs better than the existing optimizers in the sense of minimizing the SFO complexity. We also estimated critical batch sizes of SGD using our theoretical results and showed that they are close to the measured critical batch sizes.

The results in this paper can only be applied to SGD, which is a limitation of our work. Hence, in the future, we should investigate whether our results can be extended to variants of SGD, such as momentum and adaptive methods.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Robbins H, Monro S. A stochastic approximation method. Ann Math Stat. 1951;22:400–407. doi: 10.1214/aoms/1177729586
  • Zinkevich M. Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th International Conference on Machine Learning; Washington, DC, USA; 2003. p. 928–936.
  • Nemirovski A, Juditsky A, Lan G, et al. Robust stochastic approximation approach to stochastic programming. SIAM J Optim. 2009;19:1574–1609. doi: 10.1137/070704277
  • Ghadimi S, Lan G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM J Optim. 2012;22:1469–1492. doi: 10.1137/110848864
  • Ghadimi S, Lan G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization II: shrinking procedures and optimal algorithms. SIAM J Optim. 2013;23:2061–2089. doi: 10.1137/110848876
  • Polyak BT. Some methods of speeding up the convergence of iteration methods. USSR Comput Math Math Phys. 1964;4:1–17. doi: 10.1016/0041-5553(64)90137-5
  • Nesterov Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2). Doklady AN USSR. 1983;269:543–547.
  • Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 2011;12:2121–2159.
  • Tieleman T, Hinton G. RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw Mach Learn. 2012;4:26–31.
  • Kingma DP, Ba J. Adam: a method for stochastic optimization. In: Proceedings of The International Conference on Learning Representations; San Diego, CA, USA; 2015.
  • Reddi SJ, Kale S, Kumar S. On the convergence of Adam and beyond. In: Proceedings of The International Conference on Learning Representations; Vancouver, British Columbia, Canada; 2018.
  • Loshchilov I, Hutter F. Decoupled weight decay regularization. In: Proceedings of The International Conference on Learning Representations; New Orleans, Louisiana, USA; 2019.
  • Ghadimi S, Lan G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J Optim. 2013;23(4):2341–2368. doi: 10.1137/120880811
  • Ghadimi S, Lan G, Zhang H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math Program. 2016;155(1):267–305. doi: 10.1007/s10107-014-0846-1
  • Vaswani S, Mishkin A, Laradji I, et al. Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Advances in Neural Information Processing Systems; Vancouver, British Columbia, Canada; Vol. 32; 2019.
  • Fehrman B, Gess B, Jentzen A. Convergence rates for the stochastic gradient descent method for non-convex objective functions. J Mach Learn Res. 2020;21:1–48.
  • Chen H, Zheng L, AL Kontar R, et al. Stochastic gradient descent in correlated settings: a study on Gaussian processes. In: Advances in Neural Information Processing Systems; Virtual conference, Vol. 33; 2020.
  • Scaman K, Malherbe C. Robustness analysis of non-convex stochastic gradient descent using biased expectations. In: Advances in Neural Information Processing Systems; Virtual conference, Vol. 33; 2020.
  • Loizou N, Vaswani S, Laradji I, et al. Stochastic polyak step-size for SGD: an adaptive learning rate for fast convergence. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics; Virtual conference, Vol. 130; 2021.
  • Wang X, Magnússon S, Johansson M. On the convergence of step decay step-size for stochastic optimization. In: Beygelzimer A, Dauphin Y, Liang P, et al., editors. Advances in Neural Information Processing Systems; Virtual conference, 2021. Available from: https://openreview.net/forum?id=M-W0asp3fD.
  • Arjevani Y, Carmon Y, Duchi JC, et al. Lower bounds for non-convex stochastic optimization. Math Program. 2023;199(1):165–214. doi: 10.1007/s10107-022-01822-7
  • Khaled A, Richtárik P. Better theory for SGD in the nonconvex world. Trans Mach Learn Res. 2023.
  • Jain P, Kakade SM, Kidambi R, et al. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. J Mach Learn Res. 2018;18(223):1–42.
  • Cotter A, Shamir O, Srebro N, et al. Better mini-batch algorithms via accelerated gradient methods. In: Advances in Neural Information Processing Systems; Granada, Spain; Vol. 24; 2011.
  • Smith SL, Kindermans PJ, Le QV. Don't decay the learning rate, increase the batch size. In: International Conference on Learning Representations; Vancouver, British Columbia, Canada; 2018.
  • Sato N, Iiduka H. Existence and estimation of critical batch size for training generative adversarial networks with two time-scale update rule. In: Proceedings of the 40th International Conference on Machine Learning; (Proceedings of Machine Learning Research; Vol. 202). PMLR; 2023. p. 30080–30104.
  • Shallue CJ, Lee J, Antognini J, et al. Measuring the effects of data parallelism on neural network training. J Mach Learn Res. 2019;20:1–49.
  • Virmaux A, Scaman K. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In: Advances in Neural Information Processing Systems; Vol. 31; 2018.
  • He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 770–778.
  • Zagoruyko S, Komodakis N. Wide residual networks. arXiv preprint arXiv:1605.07146. 2016.

Appendix

A.1. Lemma

First, we will prove the following lemma.

Lemma A.1

The sequence $(\theta_k)_{k\in\mathbb{N}}$ generated by Algorithm 1 under (C1)–(C3) satisfies that, for all $K \ge 1$,
$$\sum_{k=0}^{K-1}\alpha_k\left(1 - \frac{L\alpha_k}{2}\right)\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \mathbb{E}[f(\theta_0) - f^\star] + \frac{L\sigma^2}{2b}\sum_{k=0}^{K-1}\alpha_k^2,$$
where $\mathbb{E}$ stands for the total expectation.

Proof.

Condition (C1) ($L$-smoothness of $f$) implies that the descent lemma holds, i.e. for all $k\in\mathbb{N}$,
$$f(\theta_{k+1}) \le f(\theta_k) + \langle\nabla f(\theta_k), \theta_{k+1} - \theta_k\rangle + \frac{L}{2}\|\theta_{k+1} - \theta_k\|^2,$$
which, together with $\theta_{k+1} := \theta_k - \alpha_k\nabla f_{B_k}(\theta_k)$, implies that
(A1) $$f(\theta_{k+1}) \le f(\theta_k) - \alpha_k\langle\nabla f(\theta_k), \nabla f_{B_k}(\theta_k)\rangle + \frac{L\alpha_k^2}{2}\|\nabla f_{B_k}(\theta_k)\|^2.$$
Conditions (C2) and (C3) guarantee that
(A2) $$\mathbb{E}_{\xi_k}[\nabla f_{B_k}(\theta_k)\mid\theta_k] = \nabla f(\theta_k) \quad\text{and}\quad \mathbb{E}_{\xi_k}[\|\nabla f_{B_k}(\theta_k) - \nabla f(\theta_k)\|^2\mid\theta_k] \le \frac{\sigma^2}{b}.$$
Hence, we have
(A3) $$\mathbb{E}_{\xi_k}[\|\nabla f_{B_k}(\theta_k)\|^2\mid\theta_k] = \mathbb{E}_{\xi_k}[\|\nabla f_{B_k}(\theta_k) - \nabla f(\theta_k)\|^2\mid\theta_k] + 2\,\mathbb{E}_{\xi_k}[\langle\nabla f_{B_k}(\theta_k) - \nabla f(\theta_k), \nabla f(\theta_k)\rangle\mid\theta_k] + \|\nabla f(\theta_k)\|^2 \le \frac{\sigma^2}{b} + \|\nabla f(\theta_k)\|^2.$$
Taking the expectation conditioned on $\theta_k$ on both sides of (A1), together with (A2) and (A3), guarantees that, for all $k\in\mathbb{N}$,
$$\mathbb{E}_{\xi_k}[f(\theta_{k+1})\mid\theta_k] \le f(\theta_k) - \alpha_k\|\nabla f(\theta_k)\|^2 + \frac{L\alpha_k^2}{2}\left(\frac{\sigma^2}{b} + \|\nabla f(\theta_k)\|^2\right).$$
Hence, taking the total expectation on both sides of the above inequality ensures that, for all $k\in\mathbb{N}$,
$$\alpha_k\left(1 - \frac{L\alpha_k}{2}\right)\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \mathbb{E}[f(\theta_k) - f(\theta_{k+1})] + \frac{L\sigma^2\alpha_k^2}{2b}.$$
Let $K \ge 1$. Summing the above inequality from $k = 0$ to $k = K-1$ ensures that
$$\sum_{k=0}^{K-1}\alpha_k\left(1 - \frac{L\alpha_k}{2}\right)\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \mathbb{E}[f(\theta_0) - f(\theta_K)] + \frac{L\sigma^2}{2b}\sum_{k=0}^{K-1}\alpha_k^2,$$
which, together with (C1) (the lower bound $f^\star$ of $f$), implies that the assertion in Lemma A.1 holds.

A.2. Proof of Theorem 3.1

(Constant): Lemma A.1 with $\alpha_k = \alpha$ implies that
$$\alpha\left(1 - \frac{L\alpha}{2}\right)\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \mathbb{E}[f(\theta_0) - f^\star] + \frac{L\sigma^2\alpha^2 K}{2b}.$$
Since $\alpha < \frac{2}{L}$, we have that
$$\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \underbrace{\frac{2(f(\theta_0) - f^\star)}{(2 - L\alpha)\alpha}}_{C_1}\frac{1}{K} + \underbrace{\frac{L\sigma^2\alpha}{2 - L\alpha}}_{C_2}\frac{1}{b}.$$

(Decay): Since $(\alpha_k)_{k\in\mathbb{N}}$ converges to 0, there exists $k_0\in\mathbb{N}$ such that, for all $k \ge k_0$, $\alpha_k < \frac{2}{L}$. We assume that $k_0 = 0$ (see Section 2.2.2). Lemma A.1 ensures that, for all $K \ge 1$,
$$\sum_{k=0}^{K-1}\alpha_k\left(1 - \frac{L\alpha_k}{2}\right)\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \mathbb{E}[f(\theta_0) - f^\star] + \frac{L\sigma^2}{2b}\sum_{k=0}^{K-1}\alpha_k^2,$$
which, together with $\alpha_{k+1} \le \alpha_k < \frac{2}{L}$ ($k\in\mathbb{N}$), implies that
$$\alpha_{K-1}\left(1 - \frac{L\alpha_0}{2}\right)\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \mathbb{E}[f(\theta_0) - f^\star] + \frac{L\sigma^2}{2b}\sum_{k=0}^{K-1}\alpha_k^2.$$
Hence, we have that
$$\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \frac{2(f(\theta_0) - f^\star)}{(2 - L\alpha_0)\alpha_{K-1}} + \frac{L\sigma^2}{b(2 - L\alpha_0)\alpha_{K-1}}\sum_{k=0}^{K-1}\alpha_k^2,$$
which implies that
$$\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \frac{2(f(\theta_0) - f^\star)}{2 - L\alpha_0}\frac{1}{K\alpha_{K-1}} + \frac{1}{b}\frac{L\sigma^2}{2 - L\alpha_0}\frac{1}{K\alpha_{K-1}}\sum_{k=0}^{K-1}\alpha_k^2.$$
Meanwhile, we have that
$$\sum_{k=0}^{K-1}\alpha_k^2 \le \sum_{k=0}^{K-1}\frac{T\alpha^2}{(k+1)^{2a}} \le T\alpha^2\left(1 + \int_0^{K-1}\frac{\mathrm{d}t}{(t+1)^{2a}}\right) \le \begin{cases}\dfrac{T\alpha^2}{1-2a}K^{1-2a} & \text{(Decay 1)}\\[4pt] T\alpha^2(1 + \log K) & \text{(Decay 2)}\\[4pt] \dfrac{2aT\alpha^2}{2a-1} & \text{(Decay 3)}\end{cases}$$
and
$$\alpha_{K-1} = \frac{\alpha}{\left(\left\lfloor\frac{K-1}{T}\right\rfloor + 1\right)^a} \ge \frac{\alpha}{\left(\frac{K-1}{T} + 1\right)^a} \ge \frac{\alpha}{K^a}.$$
Here, we define
$$D_1 := \frac{2(f(\theta_0) - f^\star)}{\alpha(2 - L\alpha_0)} \quad\text{and}\quad D_2 := \frac{T\alpha^2 L\sigma^2}{2 - L\alpha_0}.$$
Accordingly, we have that
$$\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \begin{cases}\dfrac{D_1}{K^{1-a}} + \dfrac{D_2}{(1-2a)K^{a}b} & \text{(Decay 1)}\\[4pt] \dfrac{D_1}{\sqrt{K}} + \dfrac{D_2(1+\log K)}{\sqrt{K}\,b} & \text{(Decay 2)}\\[4pt] \dfrac{D_1}{K^{1-a}} + \dfrac{2aD_2}{(2a-1)K^{1-a}b} & \text{(Decay 3)}\end{cases}$$
which, together with $\log K \le \sqrt{K}$ and the condition on $a$, implies that
$$\min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|^2] \le \begin{cases}\dfrac{D_1}{K^{a}} + \dfrac{D_2}{(1-2a)K^{a}b} & \text{(Decay 1)}\\[4pt] \dfrac{D_1}{\sqrt{K}} + \left(\dfrac{1}{\sqrt{K}} + 1\right)\dfrac{D_2}{b} & \text{(Decay 2)}\\[4pt] \dfrac{D_1}{K^{1-a}} + \dfrac{2aD_2}{(2a-1)K^{1-a}b} & \text{(Decay 3).}\end{cases}$$

A.3. Proof of Theorem 3.2

  1. Let us consider the case of (Constant). Setting the upper bound $\frac{C_1}{K} + \frac{C_2}{b}$ in Theorem 3.1 equal to $\epsilon^2$ implies that $K = \frac{C_1 b}{\epsilon^2 b - C_2}$ achieves an $\epsilon$-approximation (3). A similar discussion for the other learning rates ensures that the assertion in Theorem 3.2(i) is true.

  2. It is sufficient to prove that $K'(b) < 0$ and $K''(b) > 0$ hold.

(Constant): Let $K(b) = \frac{C_1 b}{\epsilon^2 b - C_2}$. Then, we have that
$$K'(b) = \frac{C_1(\epsilon^2 b - C_2) - \epsilon^2 C_1 b}{(\epsilon^2 b - C_2)^2} = -\frac{C_1 C_2}{(\epsilon^2 b - C_2)^2} < 0, \qquad K''(b) = \frac{2\epsilon^2 C_1 C_2}{(\epsilon^2 b - C_2)^3} > 0.$$

(Decay 1): Let $K(b) = \left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{D_2}{(1-2a)b}\right)\right\}^{\frac{1}{a}}$. Then, we have that
$$K'(b) = -\frac{D_2}{a\epsilon^2(1-2a)b^2}\left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{D_2}{(1-2a)b}\right)\right\}^{\frac{1-a}{a}} < 0,$$
$$K''(b) = \frac{2D_2}{a\epsilon^2(1-2a)b^3}\left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{D_2}{(1-2a)b}\right)\right\}^{\frac{1-a}{a}} + \frac{(1-a)D_2^2}{a^2\epsilon^4(1-2a)^2 b^4}\left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{D_2}{(1-2a)b}\right)\right\}^{\frac{1-2a}{a}} > 0.$$

(Decay 2): Let $K(b) = \left(\frac{bD_1 + D_2}{b\epsilon^2 - D_2}\right)^2$. Then, we have that
$$K'(b) = \frac{2D_1(bD_1+D_2)(b\epsilon^2-D_2)^2 - 2\epsilon^2(b\epsilon^2-D_2)(bD_1+D_2)^2}{(b\epsilon^2-D_2)^4},$$
which, together with $b\epsilon^2 - D_2 > 0$, implies that
$$(b\epsilon^2-D_2)^3 K'(b) = 2(bD_1+D_2)\left\{D_1(b\epsilon^2-D_2) - \epsilon^2(bD_1+D_2)\right\} = -2D_2(bD_1+D_2)(D_1+\epsilon^2) < 0.$$
Moreover,
$$(b\epsilon^2-D_2)^4 K''(b) = -2D_1D_2(D_1+\epsilon^2)(b\epsilon^2-D_2) + 6D_2\epsilon^2(bD_1+D_2)(D_1+\epsilon^2) = 2D_2(D_1+\epsilon^2)(2D_1\epsilon^2 b + D_1D_2 + 3D_2\epsilon^2) > 0.$$

(Decay 3): Let $K(b) = \left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{2aD_2}{(2a-1)b}\right)\right\}^{\frac{1}{1-a}}$. Then, we have that
$$K'(b) = -\frac{2aD_2}{(1-a)\epsilon^2(2a-1)b^2}\left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{2aD_2}{(2a-1)b}\right)\right\}^{\frac{a}{1-a}} < 0,$$
$$K''(b) = \frac{4aD_2}{(1-a)\epsilon^2(2a-1)b^3}\left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{2aD_2}{(2a-1)b}\right)\right\}^{\frac{a}{1-a}} + \frac{4a^3D_2^2}{(1-a)^2\epsilon^4(2a-1)^2 b^4}\left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{2aD_2}{(2a-1)b}\right)\right\}^{\frac{2a-1}{1-a}} > 0.$$

A.4. Proof of Theorem 3.3

(Constant): Let $N(b) = \frac{C_1 b^2}{\epsilon^2 b - C_2}$. Then, we have that
$$N'(b) = \frac{2C_1 b(\epsilon^2 b - C_2) - \epsilon^2 C_1 b^2}{(\epsilon^2 b - C_2)^2} = \frac{C_1 b(\epsilon^2 b - 2C_2)}{(\epsilon^2 b - C_2)^2}.$$
If $N'(b) = 0$, we have that $\epsilon^2 b - 2C_2 = 0$, i.e. $b = \frac{2C_2}{\epsilon^2}$. Moreover,
$$(\epsilon^2 b - C_2)^3 N''(b) = (2\epsilon^2 C_1 b - 2C_1C_2)(\epsilon^2 b - C_2) - 2\epsilon^2(\epsilon^2 C_1 b^2 - 2C_1C_2 b) = 2C_1C_2^2 > 0,$$
which implies that $N$ is convex. Hence, there is a critical batch size $b^\star = \frac{2C_2}{\epsilon^2} > 0$ at which $N$ is minimized.

(Decay 1): Let $N(b) = K(b)b$. Then, we have that
$$N'(b) = K(b) + bK'(b) = \left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{D_2}{(1-2a)b}\right)\right\}^{\frac{1}{a}-1}\left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{D_2}{(1-2a)b}\right) - \frac{D_2}{a\epsilon^2(1-2a)b}\right\}.$$
If $N'(b) = 0$, we have that $\frac{1}{\epsilon^2}\left(D_1 + \frac{D_2}{(1-2a)b}\right) - \frac{D_2}{a\epsilon^2(1-2a)b} = 0$, i.e. $b = \frac{(1-a)D_2}{a(1-2a)D_1}$. Moreover,
$$N''(b) = 2K'(b) + bK''(b) = \frac{(1-a)D_2^2}{a^2\epsilon^4(1-2a)^2 b^3}\left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{D_2}{(1-2a)b}\right)\right\}^{\frac{1-2a}{a}} > 0,$$
which implies that $N$ is convex. Hence, there is a critical batch size $b^\star = \frac{(1-a)D_2}{a(1-2a)D_1} > 0$.

(Decay 2): Let $N(b) = K(b)b$. Then, we have that
$$N'(b) = K(b) + bK'(b) = \frac{(bD_1+D_2)^2}{(b\epsilon^2-D_2)^2} - \frac{2D_2 b(bD_1+D_2)(D_1+\epsilon^2)}{(b\epsilon^2-D_2)^3} = \frac{bD_1+D_2}{(b\epsilon^2-D_2)^3}\left\{(bD_1+D_2)(b\epsilon^2-D_2) - 2D_2 b(D_1+\epsilon^2)\right\}.$$
If $N'(b) = 0$, we have that $D_1 b + D_2 = 0$, i.e. $b = -\frac{D_2}{D_1} < 0$. Moreover,
$$N''(b) = 2K'(b) + bK''(b) = \frac{2D_2(D_1+\epsilon^2)}{(b\epsilon^2-D_2)^4}\left\{-2(bD_1+D_2)(b\epsilon^2-D_2) + b(2D_1\epsilon^2 b + D_1D_2 + 3D_2\epsilon^2)\right\} = \frac{2D_2(D_1+\epsilon^2)(3D_1D_2 b + D_2\epsilon^2 b + 2D_2^2)}{(b\epsilon^2-D_2)^4} > 0,$$
which implies that $N$ is convex. We can check that $N'(b) > 0$ for all $b > \frac{D_2}{\epsilon^2}$.

(Decay 3): Let $N(b) = K(b)b$. Then, we have that
$$N'(b) = K(b) + bK'(b) = \left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{2aD_2}{(2a-1)b}\right)\right\}^{\frac{1}{1-a}-1}\left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{2aD_2}{(2a-1)b}\right) - \frac{2aD_2}{(1-a)\epsilon^2(2a-1)b}\right\}.$$
If $N'(b) = 0$, we have that $\frac{1}{\epsilon^2}\left(D_1 + \frac{2aD_2}{(2a-1)b}\right) - \frac{2aD_2}{(1-a)\epsilon^2(2a-1)b} = 0$, i.e. $b = \frac{2a^2D_2}{(1-a)(2a-1)D_1}$. Moreover,
$$N''(b) = 2K'(b) + bK''(b) = \frac{4a^3D_2^2}{(1-a)^2\epsilon^4(2a-1)^2 b^3}\left\{\frac{1}{\epsilon^2}\left(D_1 + \frac{2aD_2}{(2a-1)b}\right)\right\}^{\frac{2a-1}{1-a}} > 0,$$
which implies that $N$ is convex. Hence, there is a critical batch size $b^\star = \frac{2a^2D_2}{(1-a)(2a-1)D_1} > 0$.

A.5. Proof of Theorem 3.4

Using $K$ defined in Theorem 3.2 leads to the iteration complexity. For example, SGD using (Constant) satisfies $N(b) = \frac{C_1 b^2}{\epsilon^2 b - C_2}$ (Theorem 3.3). Using the critical batch size $b^\star = \frac{2C_2}{\epsilon^2}$ in (4) leads to
$$\inf\{N : \min_{k\in[0:K-1]}\mathbb{E}[\|\nabla f(\theta_k)\|] \le \epsilon\} \le N(b^\star) = \frac{4C_1C_2}{\epsilon^4},\quad\text{i.e. } N_\epsilon = O\!\left(\frac{1}{\epsilon^4}\right).$$
A similar discussion, together with $N$ defined in Theorem 3.3 and the critical batch sizes $b^\star$ in (4), leads to the SFO complexities of (Decay 1) and (Decay 3). Using $N$ defined in Theorem 3.3 and the batch size $b = \frac{D_2+1}{\epsilon^2}$ leads to the SFO complexity of (Decay 2).