Research Article

Gradient methods with memory

Yurii Nesterov & Mihai I. Florea
Pages 936-953 | Received 22 Apr 2020, Accepted 30 Nov 2020, Published online: 13 Jan 2021

ABSTRACT

In this paper, we consider gradient methods for minimizing smooth convex functions, which employ the information obtained at the previous iterations in order to accelerate the convergence towards the optimal solution. This information is used in the form of a piece-wise linear model of the objective function, which provides us with much better prediction abilities as compared with the standard linear model. To the best of our knowledge, this approach was never really applied in Convex Minimization to differentiable functions in view of the high complexity of the corresponding auxiliary problems. However, we show that all necessary computations can be done very efficiently. Consequently, we get new optimization methods, which are better than the usual Gradient Methods both in the number of oracle calls and in the computational time. Our theoretical conclusions are confirmed by preliminary computational experiments.

1. Introduction

1.1. Motivation

First-order gradient methods for minimizing smooth convex functions generate a sequence of test points based on the information obtained from the oracle: the function values and the gradients. Most methods either use the information from the last test point or accumulate it in the form of an aggregated linear function (see, for example, Chapter 2 in [Citation7]). This approach is very different from the technique used in Nonsmooth Optimization, where the piece-wise linear model of the objective is a standard and powerful tool. It is enough to mention the Bundle Method, the Level Method, cutting plane schemes, etc. The reason for this situation is quite clear. The presence of piece-wise linear models in the auxiliary problems, which we need to solve at each iteration of the method, usually significantly increases the complexity of the corresponding computations. This is acceptable in Nonsmooth Optimization, which has the reputation of a difficult field. By contrast, Smooth Optimization admits very simple and elegant schemes, with a very small computational cost of each iteration, preventing us from introducing there such a heavy machinery.

After the preparation of this manuscript, we became aware of a highly specialized attempt in [Citation2], which uses quadratic lower bounds instead of linear ones. Although the results presented there seem promising, the study is limited to smooth unconstrained problems with strongly convex objectives, and it states that extending the results to a wider context is a difficult open problem.

The main goal of this paper is to demonstrate that the above situation is not as clear-cut as it looks. We will show that the Gradient Method,Footnote1 equipped with a piece-wise linear model of the objective function, has much better chances of accelerating on particular optimization problems. At the same time, it appears that the corresponding auxiliary optimization problems can be easily solved by an appropriate version of the Frank–Wolfe algorithm. All our claims are supported by a complexity analysis. In the end, we present preliminary computational results, which show that very often the new schemes have a much better computational time.

1.2. Contents

In Section 2, we analyse the Gradient Method with Memory as applied to the composite form of smooth convex optimization problems [Citation5]. In order to measure the level of smoothness of our objective function, we introduce the relative smoothness condition [Citation1,Citation4], based on an arbitrary strictly convex distance function. The main novelty here is the piece-wise linear model of the objective function, formed around the current test point. We analyse the corresponding auxiliary optimization problem and propose a condition on its approximate solution which does not destroy the rate of convergence of the algorithm. In Section 3, we analyse the complexity of solving the auxiliary optimization problem using the Frank–Wolfe algorithm. More precisely, we consider the anti-dualFootnote2 of the auxiliary problem.

In this section, we restrict ourselves to strongly convex distance functions. We show that our auxiliary problem can be easily solved by the Frank–Wolfe method. Its complexity is proportional to the maximal squared norm of the gradient in the current model of the objective divided by the desired accuracy.

In Section 4, we specify our complexity results for the Euclidean setup, when all distances are measured by a Euclidean norm. We show that, for some strategies of updating the piece-wise linear model, the complexity of the auxiliary computations is very low.

Finally, in Section 5 we present preliminary computational results. We compare the usual Gradient Method with two gradient methods with memory, which use different strategies for updating the piece-wise linear model of the objective function. Our conclusion is that the new schemes are always better, both in the number of oracle calls and in the total computational time.

1.3. Notation and generalities

In what follows, we denote by $\mathbb{E}$ a finite-dimensional real vector space and by $\mathbb{E}^*$ its dual space, the space of linear functions on $\mathbb{E}$. The value of a function $s \in \mathbb{E}^*$ at a point $x \in \mathbb{E}$ is denoted by $\langle s, x \rangle$. Let us fix some arbitrary (possibly non-Euclidean) norm $\|\cdot\|$ on the space $\mathbb{E}$ and define the dual norm on $\mathbb{E}^*$ in the standard way: $\|s\|_* \stackrel{\mathrm{def}}{=} \sup_{h \in \mathbb{E}} \{ \langle s, h \rangle : \|h\| \le 1 \}$. Let us choose a simple closed convex prox-function $d(\cdot)$, which is differentiable on the interior of its domain.Footnote3 This function must be strictly convex: (1) $d(y) > d(x) + \langle \nabla d(x), y - x \rangle$, $x \in \operatorname{int}(\operatorname{dom} d)$, $y \in \operatorname{dom} d$, $x \ne y$. Using this function, we can define the Bregman distance between two points $x$ and $y$: (2) $\beta_d(x, y) = d(y) - d(x) - \langle \nabla d(x), y - x \rangle$, $x \in \operatorname{int}(\operatorname{dom} d)$, $y \in \operatorname{dom} d$. Clearly, by (1), $\beta_d(x, y) > 0$ for $x \ne y$ and $\beta_d(x, x) = 0$.
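As a quick numerical illustration of definition (2) (our own sketch, not from the paper), the following snippet evaluates the Bregman distance for two standard prox-functions: the squared Euclidean norm, for which $\beta_d(x, y) = \frac{1}{2}\|y - x\|^2$, and the entropy function, for which it becomes the Kullback–Leibler divergence on the simplex.

```python
import numpy as np

def bregman(d, grad_d, x, y):
    """Bregman distance (2): beta_d(x, y) = d(y) - d(x) - <grad d(x), y - x>."""
    return d(y) - d(x) - grad_d(x) @ (y - x)

# Prox-function 1: d(x) = 1/2 ||x||^2  gives  beta_d(x, y) = 1/2 ||y - x||^2.
sq, sq_grad = lambda x: 0.5 * x @ x, lambda x: x
x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert abs(bregman(sq, sq_grad, x, y) - 0.5 * np.sum((y - x) ** 2)) < 1e-12
assert bregman(sq, sq_grad, x, x) == 0.0          # beta_d(x, x) = 0

# Prox-function 2: entropy d(x) = sum_i x_i ln x_i; between points of the
# simplex this Bregman distance is the KL divergence.
ent, ent_grad = lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0
p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
assert bregman(ent, ent_grad, p, q) > 0.0         # strict convexity, cf. (1)
```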

We will use Bregman distances for measuring the level of relative smoothness of convex functions (see [Citation4]). Namely, for a differentiable closed convex function $f$ with open $\operatorname{dom} f \subseteq \operatorname{dom} d$, we define two constants $L_d(f) \ge \mu_d(f) \ge 0$ such that (3) $\mu_d(f)\, \beta_d(x, y) \le f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le L_d(f)\, \beta_d(x, y)$ for all $x, y \in \operatorname{dom} f$. See [Citation1] and [Citation4] for definitions, motivations, and examples.

2. Gradient method with memory

In this paper, we are solving the following composite minimization problem: (4) $\min_{x \in \operatorname{dom}\psi} \{ F(x) \equiv f(x) + \psi(x) \}$, where the function $f$ satisfies the relative smoothness condition (3), possibly with $\mu_d(f) = 0$. The function $\psi : \mathbb{E} \to \mathbb{R} \cup \{+\infty\}$ is a proper closed convex function with $\operatorname{dom}\psi \subseteq \operatorname{dom} f$ and $\operatorname{int}(\operatorname{dom}\psi)$ non-empty. The function $\psi$ is simple (in the sense of satisfying Assumptions 2.1 and 2.2, stated in the sequel) but it does not have to be differentiable or even continuous. For instance, $\psi$ can incorporate the indicator function of the feasible set. We assume that a solution $x^* \in \operatorname{dom}\psi$ of problem (4) exists, and denote $F^* = F(x^*)$.

The simplest method for solving problem (4) is the usual Gradient Method: (5) Choose $x_0 \in \operatorname{int}(\operatorname{dom}\psi)$. For $k \ge 0$, iterate: $x_{k+1} = \arg\min_{y \in \operatorname{dom}\psi} \{ f(x_k) + \langle \nabla f(x_k), y - x_k \rangle + \psi(y) + L\, \beta_d(x_k, y) \}$. The constant $L$ in this method has to be big enough to ensure $f(x_{k+1}) \le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + L\, \beta_d(x_k, x_{k+1})$. In view of (3), this is definitely true for $L \ge L_d(f)$. However, we are interested in choosing $L$ as small as possible, since this would significantly increase the rate of convergence of the scheme.
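For intuition, in the Euclidean setting with $\psi \equiv 0$ and $d(x) = \frac{1}{2}\|x\|^2$, the argmin in (5) has the closed form $x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$. A minimal sketch of this special case on a toy quadratic (the objective and constants are our own hypothetical choices):

```python
import numpy as np

# Toy objective f(x) = 1/2 x^T A x with gradient A x; here L_d(f) = lambda_max(A).
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
L = 10.0                     # any L >= L_d(f) is admissible

x = np.array([1.0, 1.0])
for _ in range(200):
    x = x - grad(x) / L      # step (5) in the Euclidean unconstrained case

assert np.linalg.norm(x) < 1e-3   # converges to the minimizer x* = 0
```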

Method (5) is based on the simplest linear model of the function $f(\cdot)$ around the point $x_k$. In our paper, we suggest replacing it by a piece-wise linear model, defined by the information collected at other test points.

Namely, for each $k \ge 0$ define a discrete set $Z_k$ of $m_k$ feasible points ($m_k \ge 1$): $Z_k = \{ z_i \in \operatorname{dom}\psi,\ i = 1, \dots, m_k \}$. Then we can use a more sophisticated model of the smooth part of the objective function, (6) $f(y) \ge \ell_k(y) \stackrel{\mathrm{def}}{=} \max_{z_i \in Z_k} \{ f(z_i) + \langle \nabla f(z_i), y - z_i \rangle \}$, $y \in \operatorname{dom}\psi$. This model is always better than the initial linear model provided that (7) $x_k \in Z_k$. In what follows, we always assume that this condition is satisfied.
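The model (6) is easy to evaluate: it is just a maximum of $m_k$ linear functions. A small sketch (our own illustration, using a hypothetical quadratic objective) checking numerically that the model lower-bounds $f$ and dominates every individual linearization:

```python
import numpy as np

def model(zs, f, grad, y):
    """Piece-wise linear model (6): max_i [ f(z_i) + <grad f(z_i), y - z_i> ]."""
    return max(f(z) + grad(z) @ (y - z) for z in zs)

# Hypothetical smooth convex objective f(x) = ||x||^2.
f = lambda x: x @ x
grad = lambda x: 2.0 * x

rng = np.random.default_rng(0)
zs = [rng.standard_normal(3) for _ in range(5)]
for _ in range(100):
    y = rng.standard_normal(3)
    ell = model(zs, f, grad, y)
    assert ell <= f(y) + 1e-9                                  # lower bound, as in (6)
    assert ell >= f(zs[0]) + grad(zs[0]) @ (y - zs[0]) - 1e-9  # beats a single linearization
```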

Thus, we come to the following natural generalization of method (5), which we call the Gradient Method with Memory (GMM): (8) Choose $x_0 \in \operatorname{int}(\operatorname{dom}\psi)$. For $k \ge 0$, iterate: $x_{k+1} = \arg\min_{y \in \operatorname{dom}\psi} \{ \ell_k(y) + \psi(y) + L\, \beta_d(x_k, y) \}$.

Remark 2.1

Note that for any $x \in \operatorname{dom} f$ we have $f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + L\, \beta_d(x_k, x) \overset{(7)}{\le} \ell_k(x) + L\, \beta_d(x_k, x) \overset{(6)}{\le} f(x) + L\, \beta_d(x_k, x) \overset{(3)}{\le} f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + (L + L_d(f))\, \beta_d(x_k, x)$. Therefore, we can count on better convergence of method (8) only if we are able to choose the parameter $L$ significantly smaller than $L_d(f)$.

At each iteration of method (8), we need to solve a non-trivial auxiliary minimization problem. Therefore, the practical efficiency of this method crucially depends on the complexity of this computation. In what follows, we suggest solving this problem approximately, using a special method for its dual problem.

Let us start by presenting the corresponding technique. For the sake of notation, we omit the iteration index. Thus, our auxiliary problem is as follows: $\min_{y \in \operatorname{dom}\psi} \max_{\lambda \in \Delta_m} \sum_{i=1}^m \lambda^{(i)} [ f_i + \langle g_i, y - z_i \rangle ] + \psi(y) + L\, \beta_d(\bar x, y)$, where $f_i = f(z_i)$, $g_i = \nabla f(z_i)$, $i = 1, \dots, m$, and $\Delta_m$ is the standard simplex in $\mathbb{R}^m$. Introducing now the vector $f \in \mathbb{R}^m$ with coordinates (9) $f^{(i)} = \langle g_i, z_i \rangle - f_i$, $i = 1, \dots, m$, we get the following representation of our problem: (10) $\min_{y \in \operatorname{dom}\psi} \max_{\lambda \in \Delta_m} \langle \lambda, G^T y - f \rangle + \psi(y) + L\, \beta_d(\bar x, y)$, where $G = (g_1, \dots, g_m)$ is the linear operator from $\mathbb{R}^m$ to $\mathbb{E}^*$ with columns $g_i$. Note that the pay-off function in this saddle point problem can be written as follows: $\langle \lambda, G^T y - f \rangle + \psi(y) + L\, \beta_d(\bar x, y) = \langle \lambda, G^T y - f \rangle + \psi(y) + L[ d(y) - d(\bar x) - \langle \nabla d(\bar x), y - \bar x \rangle ] = -\big[ \langle L \nabla d(\bar x) - G\lambda, y \rangle - L d(y) - \psi(y) \big] - \langle \lambda, f \rangle + L[ \langle \nabla d(\bar x), \bar x \rangle - d(\bar x) ]$. Hence, we need to introduce the following dual function: (11) $\Phi_L(s) = \max_{y \in \operatorname{dom}\psi} \langle s, y \rangle - L d(y) - \psi(y)$, $s \in \mathbb{E}^*$. Our main joint assumption on the functions $d(\cdot)$ and $\psi(\cdot)$ is as follows.

Assumption 2.1

For any $L > 0$, the function $\Phi_L(s)$ is well defined for every $s \in \mathbb{E}^*$.

This can be ensured, for example, by the strong convexity of function d(), or by the boundedness of domψ, or in many other ways.

Since the objective function in definition (11) is strictly concave, its maximizer $y_L(s) = \arg\max_{y \in \operatorname{dom}\psi} \langle s, y \rangle - L d(y) - \psi(y)$ is uniquely defined for any $s \in \mathbb{E}^*$. Moreover, the function $\Phi_L(\cdot)$ is differentiable and (12) $\nabla \Phi_L(s) = y_L(s)$, $s \in \mathbb{E}^*$.

Now we can write down the problem anti-dual to (10): (13) $\xi_L^* \stackrel{\mathrm{def}}{=} \min_{\lambda \in \Delta_m} \big\{ \xi_L(\lambda) \stackrel{\mathrm{def}}{=} \Phi_L(L \nabla d(\bar x) - G\lambda) + \langle \lambda, f \rangle + \alpha \big\}$, where $\alpha = L[ d(\bar x) - \langle \nabla d(\bar x), \bar x \rangle ]$. This is a convex optimization problem with a differentiable objective function. Our second main assumption is as follows.

Assumption 2.2

The function $\Phi_L(\cdot)$ in problem (13) is easily computable.

We will discuss reasonable strategies for finding an approximate solution to problem (13) in Section 3. At this moment, it is enough to assume that we are able to compute a point $\bar\lambda = \bar\lambda(\bar x, Z, L)$ such that (14) $\langle \nabla \xi_L(\bar\lambda), \bar\lambda - \lambda \rangle \le \delta$ for all $\lambda \in \Delta_m$, where $\delta \ge 0$ is some tolerance parameter. Clearly, if $\delta = 0$, then $\bar\lambda$ is an optimal solution to problem (13). Note that condition (14) also ensures a small functional gap: (15) $\xi_L(\bar\lambda) - \xi_L^* = \max_{\lambda \in \Delta_m} [ \xi_L(\bar\lambda) - \xi_L(\lambda) ] \le \max_{\lambda \in \Delta_m} \langle \nabla \xi_L(\bar\lambda), \bar\lambda - \lambda \rangle \overset{(14)}{\le} \delta$.

Condition (14) immediately leads to the following result.

Lemma 2.1

Let $\bar\lambda \in \Delta_m$ satisfy condition (14). Then for $\bar s = L \nabla d(\bar x) - G\bar\lambda$ we have (16) $\sum_{i=1}^m \bar\lambda^{(i)} [ f_i + \langle g_i, y_L(\bar s) - z_i \rangle ] \ge \max_{1 \le i \le m} [ f_i + \langle g_i, y_L(\bar s) - z_i \rangle ] - \delta$.

Proof.

Indeed, $\nabla \xi_L(\bar\lambda) = f - G^T y_L(\bar s)$. Thus, inequality (14) can be rewritten as follows: $\langle \bar\lambda, G^T y_L(\bar s) - f \rangle \ge \langle \lambda, G^T y_L(\bar s) - f \rangle - \delta$ for all $\lambda \in \Delta_m$. It remains to note that $( G^T y_L(\bar s) - f )^{(i)} = f_i + \langle g_i, y_L(\bar s) - z_i \rangle$, $i = 1, \dots, m$.

Now we are able to analyse one iteration of the inexact version of method (8). (17) Input: a point $\bar x \in \operatorname{int}(\operatorname{dom}\psi)$, a set of test points $Z$ containing $\bar x$, a constant $L > 0$, and a tolerance $\delta \ge 0$. Iteration: using the input data, form the optimization problem (13) and compute its approximate solution $\bar\lambda$ satisfying condition (14). Output: the points $\bar s = L \nabla d(\bar x) - G\bar\lambda$ and $x_+ = y_L(\bar s)$.

Theorem 2.1

  1. Let the point $x_+$ be generated by one iteration (17) of the Inexact Gradient Method with Memory (IGMM), and let $L \ge L_d(f)$. Then for any $y \in \operatorname{dom}\psi$, (18) $\beta_d(x_+, y) \le \beta_d(\bar x, y) + \frac{1}{L}[ F(y) - F(x_+) + \delta ]$;

  2. For every $y \in \operatorname{dom}\psi$ satisfying $\beta_d(z_i, y) \ge \beta_d(\bar x, y)$, $i = 1, \dots, m$, we have (19) $\beta_d(x_+, y) \le \big( 1 - \frac{\mu_d(f)}{L} \big) \beta_d(\bar x, y) + \frac{1}{L}[ F(y) - F(x_+) + \delta ]$.

Proof.

Note that $\beta_d(x_+, y) - \beta_d(\bar x, y) \overset{(2)}{=} d(y) - d(x_+) - \langle \nabla d(x_+), y - x_+ \rangle - d(y) + d(\bar x) + \langle \nabla d(\bar x), y - \bar x \rangle \overset{(2)}{=} \langle \nabla d(\bar x) - \nabla d(x_+), y - x_+ \rangle - \beta_d(\bar x, x_+)$. The point $x_+$ is defined as $x_+ = \arg\max_{x \in \operatorname{dom}\psi} \langle L \nabla d(\bar x) - G\bar\lambda, x \rangle - L d(x) - \psi(x)$. The first-order optimality condition for $x_+$, written at the point $y$, gives $\psi(x_+) \le \psi(y) + \langle G\bar\lambda + L( \nabla d(x_+) - \nabla d(\bar x) ), y - x_+ \rangle$. Hence, $\beta_d(x_+, y) - \beta_d(\bar x, y) \le \frac{1}{L}\big[ \psi(y) - \psi(x_+) + \langle G\bar\lambda, y - x_+ \rangle \big] - \beta_d(\bar x, x_+)$. Note that $\langle G\bar\lambda, y - x_+ \rangle = \langle \bar\lambda, G^T(y - x_+) \rangle = \sum_{i=1}^m \bar\lambda^{(i)} \langle g_i, y - x_+ \rangle = \sum_{i=1}^m \bar\lambda^{(i)} \big[ \langle g_i, z_i - x_+ \rangle + \langle g_i, y - z_i \rangle \big] \overset{(3)}{\le} \sum_{i=1}^m \bar\lambda^{(i)} \big[ \langle g_i, z_i - x_+ \rangle + f(y) - f_i - \mu_d(f)\, \beta_d(z_i, y) \big] = f(y) - \sum_{i=1}^m \bar\lambda^{(i)} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \mu_d(f) \sum_{i=1}^m \bar\lambda^{(i)} \beta_d(z_i, y)$. Under the conditions of Item 1, we drop the last term in the above inequality and, by Lemma 2.1, obtain $\beta_d(x_+, y) - \beta_d(\bar x, y) \le \frac{1}{L}\Big[ F(y) - \sum_{i=1}^m \bar\lambda^{(i)} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \psi(x_+) - L\, \beta_d(\bar x, x_+) \Big] \le \frac{1}{L}\Big[ F(y) - \max_{1 \le i \le m} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \psi(x_+) - L\, \beta_d(\bar x, x_+) + \delta \Big]$. Under the conditions of Item 2, by the same reasoning we get $\beta_d(x_+, y) - \beta_d(\bar x, y) \le \frac{1}{L}\Big[ F(y) - \mu_d(f)\, \beta_d(\bar x, y) - \sum_{i=1}^m \bar\lambda^{(i)} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \psi(x_+) - L\, \beta_d(\bar x, x_+) \Big] \le \frac{1}{L}\Big[ F(y) - \mu_d(f)\, \beta_d(\bar x, y) - \max_{1 \le i \le m} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \psi(x_+) - L\, \beta_d(\bar x, x_+) + \delta \Big]$. In both cases, since $\bar x \in Z$, we have $\max_{1 \le i \le m} [ f_i + \langle g_i, x_+ - z_i \rangle ] + L\, \beta_d(\bar x, x_+) \ge f(\bar x) + \langle \nabla f(\bar x), x_+ - \bar x \rangle + L\, \beta_d(\bar x, x_+) \overset{(3)}{\ge} f(x_+)$. Thus, we obtain inequalities (18) and (19).

Remark 2.2

At the end of the proof, we have seen that the statement of Theorem 2.1 remains valid if the condition $L \ge L_d(f)$ is replaced by the following: (20) $\max_{1 \le i \le m} [ f_i + \langle g_i, x_+ - z_i \rangle ] + L\, \beta_d(\bar x, x_+) \ge f(x_+)$.

Denote the output of iteration (17) by $x_{\delta, L}(\bar x, Z)$. Then we can define the following Inexact Gradient Method with Memory: (21) Choose $x_0 \in \operatorname{int}(\operatorname{dom}\psi)$, $\delta \ge 0$ and $L > 0$. For $k \ge 0$, iterate: (1) choose the set $Z_k$ containing $x_k$; (2) compute $x_{k+1} = x_{\delta, L}(x_k, Z_k)$.

Let us describe the rate of convergence of this process.

Theorem 2.2

Let the sequence $\{x_k\}_{k \ge 1}$ be generated by IGMM (21) with $L \ge L_d(f)$. Then, for any $T \ge 1$ and $y \in \operatorname{dom}\psi$ we have (22) $\frac{1}{T} \sum_{k=1}^T F(x_k) \le F(y) + \frac{L}{T}\, \beta_d(x_0, y) + \delta$.

Proof.

Indeed, in view of inequality (18), we have $\beta_d(x_{k+1}, y) \le \beta_d(x_k, y) + \frac{1}{L}[ F(y) - F(x_{k+1}) + \delta ]$, $k \ge 0$. Summing up these inequalities for $k = 0, \dots, T-1$, we get inequality (22).

In the above result, the only restriction on the sets $Z_k$ is the inclusion (7). If we apply a more accurate strategy of choosing $Z_k$, we can obtain a finer estimate of the rate of convergence of this scheme.

Theorem 2.3

Let the sequence $\{x_k\}_{k \ge 1}$ be generated by IGMM (21) with $L \ge L_d(f)$. Assume that, besides condition (7), the sets $Z_k$ also satisfy the following condition: (23) $Z_k \subseteq \{ x_0, \dots, x_k \}$, $k \ge 0$. Suppose that $\mu_d(f) > 0$ and that for all $k$, $1 \le k \le T$, we have (24) $F(x_k) - F^* \ge \delta$. Then for $\Delta_T = \min_{1 \le k \le T} [ F(x_k) - F^* ]$ we get the following rate of convergence: (25) $\Delta_T \le \delta + \frac{(1-\gamma)^T \mu_d(f)}{1 - (1-\gamma)^T}\, \beta_d(x_0, x^*) \le \delta + \frac{\mu_d(f)}{e^{\gamma T} - 1}\, \beta_d(x_0, x^*) \le \delta + \frac{L}{T}\, \beta_d(x_0, x^*)$, where $\gamma = \frac{\mu_d(f)}{L}$.

Proof.

In view of assumption (24) and inequality (18), we have $\beta_d(x_{k+1}, x^*) \le \beta_d(x_k, x^*)$, $0 \le k \le T-1$. Therefore, in view of inequality (19), for $r_k \stackrel{\mathrm{def}}{=} \beta_d(x_k, x^*)$ and all $k = 0, \dots, T-1$ we have $r_{k+1} \le (1-\gamma) r_k - \frac{1}{L}[ F(x_{k+1}) - F^* - \delta ] \le (1-\gamma) r_k - \frac{1}{L}[ \Delta_T - \delta ]$. Applying this inequality recursively, we get $\frac{1}{L}[ \Delta_T - \delta ] \cdot \frac{1 - (1-\gamma)^T}{1 - (1-\gamma)} \le (1-\gamma)^T r_0$. This can be rewritten as $\Delta_T \le \delta + \frac{L \gamma (1-\gamma)^T}{1 - (1-\gamma)^T}\, r_0 = \delta + \frac{(1-\gamma)^T \mu_d(f)}{1 - (1-\gamma)^T}\, \beta_d(x_0, x^*)$.

Note that the rate of convergence given by inequality (25) remains continuous as $\mu_d(f) \to 0$.

As we have mentioned in Remark 2.1, it is important to adjust the value of the constant $L$ during the minimization process. Therefore, we present an adaptive version of method (21): (26) Choose $x_0 \in \operatorname{int}(\operatorname{dom}\psi)$, $\delta \ge 0$, and some $L_0 \in (0, L_d(f)]$. For $k \ge 0$, iterate: (1) choose the set $Z_k$ containing $x_k$; (2) find the smallest integer $i_k \ge 0$ such that for the point $x_k^+ = x_{\delta,\, 2^{i_k} L_k}(x_k, Z_k)$ we have $f(x_k^+) \le \ell_k(x_k^+) + 2^{i_k} L_k\, \beta_d(x_k, x_k^+)$; (3) set $x_{k+1} = x_k^+$ and $L_{k+1} = 2^{i_k - 1} L_k$.
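In the simplest special case (Euclidean setup, $\psi \equiv 0$, bundle of size $m = 1$), the search for $i_k$ in (26) reduces to the familiar doubling/halving backtracking on $L$. A minimal sketch of that special case (our own illustration; the objective and constants are hypothetical):

```python
import numpy as np

def adaptive_step(f, grad, x, L):
    """One iteration of (26) with m = 1: find the smallest i >= 0 such that
    f(x+) <= f(x) + <f'(x), x+ - x> + 2^i L / 2 * ||x+ - x||^2,
    then return x+ and L_{k+1} = 2^{i-1} L."""
    g = grad(x)
    i = 0
    while True:
        Lk = (2.0 ** i) * L
        x_plus = x - g / Lk
        h = x_plus - x
        if f(x_plus) <= f(x) + g @ h + 0.5 * Lk * (h @ h):
            return x_plus, Lk / 2.0
        i += 1

A = np.diag([1.0, 100.0])            # f(x) = 1/2 x^T A x, so L_d(f) = 100
f = lambda x: 0.5 * x @ (A @ x)
grad = lambda x: A @ x

x, L = np.array([1.0, 1.0]), 1.0     # deliberately underestimated initial L
for _ in range(5000):
    x, L = adaptive_step(f, grad, x, L)

assert f(x) < 1e-8                   # converged to x* = 0
assert L <= 2.0 * 100.0              # L_k never exceeds 2 L_d(f)
```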

The rate of convergence of this algorithm can be established in exactly the same way as for method (21). The main fact is that during the minimization process we always have $L_k \le 2 L_d(f)$, $k \ge 0$. Therefore, for any $y \in \operatorname{dom}\psi$ we will have $\beta_d(x_{k+1}, y) \le \beta_d(x_k, y) + \frac{1}{2 L_d(f)}[ F(y) - F(x_{k+1}) + \delta ]$, with corresponding consequences for the rate of convergence. At the same time, the average number of oracle calls at each iteration of this method is bounded by two (see [Citation7] for justification details).

3. Getting an approximate solution of the anti-dual problem

The complexity of solving the auxiliary problem (13) crucially depends on the properties of the prox-function $d(\cdot)$. In the previous section, we assumed its strict convexity and the solvability of problem (11) (see Assumption 2.1). It is now time to make a stronger assumption, which ensures both of these properties.

Assumption 3.1

The function $d(\cdot)$ is differentiable in the interior of its domain and strongly convex with convexity parameter one: (27) $d(y) \ge d(x) + \langle \nabla d(x), y - x \rangle + \frac{1}{2}\|y - x\|^2$, $x \in \operatorname{int}(\operatorname{dom} d)$, $y \in \operatorname{dom} d$.

Clearly, for all $x \in \operatorname{int}(\operatorname{dom} d)$, $y \in \operatorname{dom} d$ we have (28) $\beta_d(x, y) \ge \frac{1}{2}\|y - x\|^2$.

The main consequence of Assumption 3.1 is the Lipschitz continuity of the gradient of the function $\Phi_L(\cdot)$. Since usually this fact is proved only for $\psi(\cdot)$ being the indicator function of a closed convex set, we provide it with a simple proof.

Lemma 3.1

Let the function $d(\cdot)$ satisfy Assumption 3.1. Then the gradient $\nabla \Phi_L(s) = y_L(s)$, $s \in \mathbb{E}^*$, is Lipschitz continuous: (29) $\| \nabla \Phi_L(s_1) - \nabla \Phi_L(s_2) \| \le \frac{1}{L} \| s_1 - s_2 \|_*$, $s_1, s_2 \in \mathbb{E}^*$.

Proof.

Let us write down the first-order optimality conditions for the optimization problems defining the points $y_1 \stackrel{\mathrm{def}}{=} y_L(s_1)$ and $y_2 \stackrel{\mathrm{def}}{=} y_L(s_2)$: $\langle s_1 - L \nabla d(y_1), y - y_1 \rangle \le \psi(y) - \psi(y_1)$ for all $y \in \operatorname{dom}\psi$, and $\langle s_2 - L \nabla d(y_2), y - y_2 \rangle \le \psi(y) - \psi(y_2)$ for all $y \in \operatorname{dom}\psi$. Taking $y = y_2$ in the first inequality and $y = y_1$ in the second one, and adding the results, we obtain $\langle s_1 - s_2, y_2 - y_1 \rangle \le L \langle \nabla d(y_1) - \nabla d(y_2), y_2 - y_1 \rangle$. Thus, $\langle s_1 - s_2, y_1 - y_2 \rangle \ge L \langle \nabla d(y_2) - \nabla d(y_1), y_2 - y_1 \rangle \overset{(27)}{\ge} L \| y_1 - y_2 \|^2$. Therefore, by the Cauchy–Schwarz inequality, we get $\| y_L(s_1) - y_L(s_2) \| \le \frac{1}{L} \| s_1 - s_2 \|_*$.

Thus, in this section, our main problem of interest is as follows: (30) $\xi_L^* = \min_{\lambda \in \Delta_m} \big\{ \xi_L(\lambda) \stackrel{\mathrm{def}}{=} \Phi_L(L \nabla d(\bar x) - G\lambda) + \langle \lambda, f \rangle + \alpha \big\}$, where $\alpha = L[ d(\bar x) - \langle \nabla d(\bar x), \bar x \rangle ]$. This is a convex optimization problem over a simplex, where the objective function has a Lipschitz-continuous gradient.

The most natural algorithm for solving problem (13) is the Frank–Wolfe algorithm [Citation3] (also known as the Conditional Gradient Method). For our problem, it looks as follows. (31) Set $\lambda_0 = \frac{1}{m} \bar e_m$. For $k \ge 0$, iterate: 1. Compute the gradient $\nabla \xi_L(\lambda_k)$. 2. Compute $i_k = \arg\min_{1 \le i \le m} \nabla_i \xi_L(\lambda_k)$. 3. Set $\lambda_{k+1} = \frac{k}{k+2} \lambda_k + \frac{2}{k+2} e_{i_k}$. In this scheme, $\bar e_m \in \mathbb{R}^m$ is the vector of all ones, and $e_i$ is the $i$th coordinate vector in $\mathbb{R}^m$.
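A minimal self-contained sketch of scheme (31), applied to a quadratic objective of the Euclidean form $\xi_L(\lambda) = \frac{1}{2L}\langle \lambda, Q\lambda \rangle - \langle \lambda, \bar f \rangle$ derived in Section 4 (the data here is random and hypothetical):

```python
import numpy as np

def frank_wolfe(grad, m, T):
    """Scheme (31): conditional gradient over the standard simplex Delta_m.
    Returns the iterate with the smallest accuracy measure delta_L(lambda)."""
    lam = np.full(m, 1.0 / m)            # lambda_0 = (1/m) * ones
    best, best_gap = lam.copy(), np.inf
    for k in range(T):
        u = grad(lam)
        i_k = int(np.argmin(u))          # step 2: best vertex of the simplex
        gap = u @ lam - u[i_k]           # accuracy measure delta_L(lambda_k)
        if gap < best_gap:
            best, best_gap = lam.copy(), gap
        e = np.zeros(m); e[i_k] = 1.0
        lam = (k / (k + 2.0)) * lam + (2.0 / (k + 2.0)) * e  # step 3
    return best, best_gap

# Hypothetical data: Q = G^T G positive semidefinite, as in the Euclidean case.
rng = np.random.default_rng(1)
m, L = 20, 5.0
G = rng.standard_normal((30, m))
Q, fbar = G.T @ G, rng.standard_normal(m)

lam, gap = frank_wolfe(lambda l: Q @ l / L - fbar, m, 5000)
assert abs(lam.sum() - 1.0) < 1e-9 and lam.min() >= 0.0   # iterates stay feasible
assert gap < 0.1                                          # condition (14) with delta = 0.1
```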

In order to estimate the rate of convergence of this method, we introduce the following accuracy measure: $\delta_L(\bar\lambda) = \max_{\lambda \in \Delta_m} \langle \nabla \xi_L(\bar\lambda), \bar\lambda - \lambda \rangle$. For the sequence $\{\lambda_k\}_{k \ge 0}$ generated by method (31), denote $\delta_L^*(T) = \min_{0 \le k \le T} \delta_L(\lambda_k)$, $T \ge 0$. For estimating the rate of convergence of method (31), we need to choose an appropriate norm in $\mathbb{R}^m$. Since the feasible set of problem (13) is the standard simplex, it is reasonable to use the $\ell_1$-norm: $\|\lambda\|_1 = \sum_{i=1}^m |\lambda^{(i)}|$, $\lambda \in \mathbb{R}^m$. Then, for measuring the gradients of the function $\xi_L(\cdot)$, we can use the $\ell_\infty$-norm: $\|\lambda\|_\infty = \max_{1 \le i \le m} |\lambda^{(i)}|$, $\lambda \in \mathbb{R}^m$. In this case, the Lipschitz constant for the gradients of the function $\xi_L(\cdot)$ can be estimated as follows: $\| \nabla \xi_L(\lambda_1) - \nabla \xi_L(\lambda_2) \|_\infty = \max_{1 \le i \le m} | \langle g_i, \nabla \Phi_L(L \nabla d(\bar x) - G\lambda_2) - \nabla \Phi_L(L \nabla d(\bar x) - G\lambda_1) \rangle | \le \max_{1 \le i \le m} \| g_i \|_* \, \| \nabla \Phi_L(L \nabla d(\bar x) - G\lambda_2) - \nabla \Phi_L(L \nabla d(\bar x) - G\lambda_1) \| \overset{(29)}{\le} \max_{1 \le i \le m} \| g_i \|_* \cdot \frac{1}{L} \| G(\lambda_1 - \lambda_2) \|_* \le \frac{1}{L} \max_{1 \le i \le m} \| g_i \|_*^2 \, \| \lambda_1 - \lambda_2 \|_1$. Thus, the gradients of the function $\xi_L(\cdot)$ are Lipschitz continuous with the constant (32) $L(\xi_L) = \frac{1}{L} \max_{1 \le i \le m} \| g_i \|_*^2$. Since the diameter of the standard simplex in $\mathbb{R}^m$ in the $\ell_1$-norm is two, in accordance with the estimate (3.13) in [Citation6], we can guarantee the following rate of convergence: (33) $\delta_L^*(T) \le \frac{18}{L T} \max_{1 \le i \le m} \| g_i \|_*^2$, $T \ge 1$. (Here we replace the constant $\frac{136}{11 \ln 2}$ from [Citation6] by the bigger value 18.) In accordance with condition (14), this means that we need (34) $N_L(\delta) = \frac{18}{L \delta} \max_{1 \le i \le m} \| g_i \|_*^2$ iterations of method (31) in order to generate an appropriate dual solution $\bar\lambda$.

4. Unconstrained minimization in Euclidean setup

In this section we consider the simplest unconstrained minimization problem (35) $f^* = \min_{x \in \mathbb{E}} f(x)$, where $f(\cdot)$ is a smooth convex function. For measuring distances in $\mathbb{E}$, we introduce a Euclidean norm $\|x\| = \langle Bx, x \rangle^{1/2}$, $x \in \mathbb{E}$, where $B = B^* \succ 0$ is a linear operator from $\mathbb{E}$ to $\mathbb{E}^*$. Then the dual norm is defined as follows: $\|g\|_* = \langle g, B^{-1} g \rangle^{1/2}$, $g \in \mathbb{E}^*$. Let us now choose the distance function $d(x) = \frac{1}{2}\|x\|^2$. Then the Bregman distance is given by $\beta_d(x, y) = \frac{1}{2}\|x - y\|^2$, $x, y \in \mathbb{E}$. In this case, the relative smoothness condition (3) is equivalent to strong convexity and Lipschitz continuity of the gradient: (36) $\frac{1}{2}\mu_d(f)\|x - y\|^2 \le f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le \frac{1}{2} L_d(f) \|x - y\|^2$ for all $x, y \in \operatorname{dom} f$.

Let us now write down the specific form of the objective function $\xi_L(\cdot)$ in problem (13). Note that in our case $\Phi_L(s) = \max_{y \in \mathbb{E}} \langle s, y \rangle - \frac{L}{2}\|y\|^2 = \frac{1}{2L}\|s\|_*^2$, $s \in \mathbb{E}^*$. Therefore, $\xi_L(\lambda) = \frac{1}{2L}\| L B \bar x - G\lambda \|_*^2 + \langle \lambda, f \rangle + \alpha$, where $\alpha = -\frac{L}{2}\|\bar x\|^2$. The gradient of the function $\xi_L(\cdot)$ can be computed as follows: (37) $\nabla \xi_L(\lambda) = -\frac{1}{L} G^T B^{-1} ( L B \bar x - G\lambda ) + f = \frac{1}{L} Q\lambda - \bar f$, $\lambda \in \mathbb{R}^m$, where $Q = G^T B^{-1} G$ and $\bar f = G^T \bar x - f$. Note that $\bar f^{(i)} \overset{(9)}{=} f_i + \langle g_i, \bar x - z_i \rangle$, $i = 1, \dots, m$. Thus, in the Euclidean setup, our auxiliary problem (13) can be written as follows: (38) $\min_{\lambda \in \Delta_m} \big\{ \xi_L(\lambda) = \frac{1}{2L} \langle \lambda, Q\lambda \rangle - \langle \lambda, \bar f \rangle \big\}$. The stopping criterion (14) for this problem reads $\langle \bar\lambda, \nabla \xi_L(\bar\lambda) \rangle \overset{(37)}{=} \langle \bar\lambda, \frac{1}{L} Q\bar\lambda - \bar f \rangle \overset{(14)}{\le} \delta + \min_{1 \le i \le m} \nabla_i \xi_L(\bar\lambda) = \delta + \min_{1 \le i \le m} \big( \frac{1}{L} Q\bar\lambda - \bar f \big)^{(i)}$. Note that the main output of the minimization process for problem (38) is $x_+ = y_L( L B \bar x - G\bar\lambda ) \overset{(12)}{=} \frac{1}{L} B^{-1} ( L B \bar x - G\bar\lambda ) = \bar x - \frac{1}{L} B^{-1} G \bar\lambda$. Then $\bar f - \frac{1}{L} Q\bar\lambda = \bar f - \frac{1}{L} G^T B^{-1} G \bar\lambda = \bar f + G^T( x_+ - \bar x )$. Hence, in the Euclidean case, the stopping criterion (14) can be written as follows: (39) $\sum_{i=1}^m \bar\lambda^{(i)} [ f_i + \langle g_i, x_+ - z_i \rangle ] \ge \max_{1 \le i \le m} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \delta$.
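Putting the pieces of this section together (our own sketch with hypothetical data; $B = I$ and $f(x) = \frac{1}{2}\|x\|^2$, so that $g_i = z_i$): build $Q$ and $\bar f$, run Frank–Wolfe steps of the form (31) on (38), recover $x_+ = \bar x - \frac{1}{L} B^{-1} G \bar\lambda$, and check the identity behind the stopping criterion (39).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, L = 10, 6, 4.0

Z = rng.standard_normal((m, n))          # bundle points z_i (rows)
fs = 0.5 * np.sum(Z ** 2, axis=1)        # f_i = f(z_i) for f(x) = 1/2 ||x||^2
G = Z.T                                  # columns g_i = grad f(z_i) = z_i
xbar = rng.standard_normal(n)

Q = G.T @ G                              # Q = G^T B^{-1} G with B = I
fbar = fs + Z @ xbar - np.sum(Z * Z, axis=1)   # fbar_i = f_i + <g_i, xbar - z_i>

# Frank-Wolfe on (38), keeping the iterate with the smallest gap.
lam = np.full(m, 1.0 / m)
best, best_gap = lam.copy(), np.inf
for k in range(10000):
    u = Q @ lam / L - fbar               # gradient (37)
    gap = u @ lam - u.min()
    if gap < best_gap:
        best, best_gap = lam.copy(), gap
    e = np.zeros(m); e[int(np.argmin(u))] = 1.0
    lam = (k / (k + 2.0)) * lam + (2.0 / (k + 2.0)) * e

x_plus = xbar - G @ best / L             # primal recovery via (12)
vals = fs + Z @ x_plus - np.sum(Z * Z, axis=1)  # f_i + <g_i, x+ - z_i>
assert np.allclose(vals, fbar - Q @ best / L)   # identity behind (39)
assert best @ vals >= vals.max() - 0.05         # criterion (39) with delta = 0.05
```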

Now we can estimate the computational expenses of method (31) as applied to the auxiliary problem (38).

  1. Computation of the matrix $Q$: $O(m^2 n)$ arithmetic operations. For certain strategies of updating the sets $Z_k$, this can be reduced to $O(mn)$ operations.

  2. Computation of the vector f¯: O(mn) operations.

  3. Computation of the initial gradient $u_0 = \frac{1}{L} Q \lambda_0 - \bar f$: $O(m^2)$ operations. For certain updating strategies it can be $O(m)$.

  4. Expenses at each iteration:

    • Computing the index ik: O(m) operations.

    • Updating the point λk: O(m) operations.

    • Updating the gradient $u_k = \frac{1}{L} Q \lambda_k - \bar f$: $O(m)$ operations.

Thus, taking into account the upper bound (34) for the number of iterations in method (31), we obtain the following bound on the arithmetic complexity of problem (38) with reasonable updating strategies for the sets $Z_k$: (40) $O\big( mn + \frac{m}{L \delta} \max_{1 \le i \le m} \| g_i \|_*^2 \big)$. Taking into account that in problem (35) we can expect $\frac{1}{2 L_d(f)} \| \nabla f(z_i) \|_*^2 \overset{(36)}{\le} f(z_i) - f^* \to 0$ as $i \to \infty$, the bound (40) suggests that the overhead of solving the inner problem (38) decreases to $O(mn)$ as the algorithm approaches the optimum.

5. Numerical experiments

In this section we present preliminary computational results for method (26) as applied to the following unconstrained minimization problem: (41) $\min_{x \in \mathbb{R}^n} \big\{ f(x) = \mu \ln \sum_{j=1}^M e^{(\langle a_j, x \rangle - b_j)/\mu} \big\}$. The data defining this function is randomly generated in the following way. First of all, we generate a collection of random vectors $\hat a_1, \dots, \hat a_M$ with entries uniformly distributed in the interval $[-1, 1]$. Using the same distribution, we generate the values $b_j$, $j = 1, \dots, M$. Using this data, we form the preliminary function $\hat f(x) = \mu \ln \sum_{j=1}^M e^{(\langle \hat a_j, x \rangle - b_j)/\mu}$ and compute $g = \nabla \hat f(0)$. Then, we define $a_j = \hat a_j - g$, $j = 1, \dots, M$. Clearly, in this case we have $\nabla f(0) = 0$, so the unique solution of our test problem (41) is $x^* = 0$. The starting point $x_0$ is chosen uniformly at random on the Euclidean sphere of radius one.
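The shift $a_j = \hat a_j - g$ indeed forces the gradient to vanish at the origin, because the softmax weights at $x = 0$ depend only on $b$ and $\mu$ and sum to one. A quick numerical check of this construction (our own sketch; the dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n, mu = 20, 0.05
M = 6 * n

A_hat = rng.uniform(-1.0, 1.0, size=(M, n))
b = rng.uniform(-1.0, 1.0, size=M)

def grad(A, x):
    """Gradient of f(x) = mu * ln sum_j exp((<a_j, x> - b_j)/mu)."""
    w = (A @ x - b) / mu
    w -= w.max()                       # softmax stabilization
    lam = np.exp(w); lam /= lam.sum()  # weights sum to one
    return A.T @ lam

g = grad(A_hat, np.zeros(n))           # g = grad fhat(0)
A = A_hat - g                          # shift: a_j = ahat_j - g
assert np.linalg.norm(grad(A, np.zeros(n))) < 1e-12   # now grad f(0) = 0
```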

Thus, problem (41) has three parameters: the dimension $n$, the number of linear functions $M \ge n$, and the smoothness coefficient $\mu > 0$. In our experiments, we always choose $M = 6n$. Let us present our computational results for different values of $n$ and $\mu$.

In the definition of method (26) we have some freedom in the choice of the bundle $Z_k$. Let us bound its maximal size by a parameter $m \ge 1$. Then $m = 1$ corresponds to the usual Gradient Method. In the first series of our experiments (shown in Tables 1 and 2) we always choose $m = n$. We also have some freedom in the updating strategy for the sets $Z_k$. Clearly, during the first $m$ steps we can simply add all new points to the bundle. However, at subsequent iterations we need to decide on a strategy for replacing the old information. In our experiments we implemented two strategies:

  • Cyclic replacement (Cyclic).

  • Replacement of the linear function with the maximal norm of the gradient (Max-Norm).

Table 1. Smoothness parameter $\mu = 0.05$.

Table 2. Smoothness parameter $\mu = 0.01$.

The second strategy is motivated by formula (32) for the Lipschitz constant of the gradient of the function $\xi_L(\cdot)$. For both strategies, at each iteration we need to update only one column of the matrix $Q_k$ (see (37)), which costs $O(mn)$ operations.

Let us present the results of our numerical experiments. All methods were stopped when the residual in the function value was smaller than $\epsilon = 10^{-6}$. The parameter $\delta$ for the stopping criterion (14) was chosen as $\delta = \epsilon / 2$.

In Tables 1 and 2, the first line indicates the total number of iterations. The second line displays the total number of oracle calls. The third line shows the average number of Frank–Wolfe steps per iteration (for the Gradient Method we simply put two). The next line indicates the total computational time (in seconds). Finally, the last line shows the average time spent on one iteration of the corresponding method (in milliseconds).

As we can see from these tables, in all our experiments the gradient methods with memory were better than the standard Gradient Method, both in the number of iterations and, rather surprisingly, in the total computational time. The Max-Norm version usually outperforms the Cyclic version. It is interesting that the auxiliary algorithm (31) works very well. The average time spent on one iteration of the methods with memory never exceeds that of the simple Gradient Method by more than 50%. This is partially explained by the fact that in our test problems the data is fully dense, so each oracle call is very expensive ($O(Mn)$ operations).

Let us now look at how small bundles can accelerate the Gradient Method. In Tables 3 and 4, the first line, with parameter Bundle = 1, corresponds to the Gradient Method with line search. The next lines display the results for different sizes of the bundle. We list the number of iterations, the average number of Frank–Wolfe steps per iteration, and the total computational time in seconds. Table 3 displays the results for the IGMM (26) with the cyclic replacement strategy for each bundle size. In Table 4, we show the results for the Max-Norm replacement strategy. The accuracy parameters for the experiments shown in Tables 3 and 4 are $\epsilon = 10^{-4}$ and $\delta = \epsilon / 2$. The smoothness parameter for our objective function is chosen as $\mu = 0.05$.

Table 3. Gradient method with cyclic memory replacement.

Table 4. Gradient method with Max-Norm memory replacement.

As we can see from these tables, the Max-Norm replacement strategy was always better than the cyclic one. Even for small bundle sizes, the total number of iterations decreases very quickly. More importantly, this decrease is also seen in the total computation time. The number of auxiliary Frank–Wolfe steps remains at an acceptable level and does not significantly increase the computational time of each iteration as compared with the Gradient Method. Recall that our test function has an expensive oracle, requiring $O(Mn)$ operations for computing the function value and the gradient. For the Max-Norm version of IGMM, the optimal size of the bundle is probably between 8 and 16. Another candidate is 256, but it needs many more Frank–Wolfe steps.

Maybe our preliminary conclusions are problem specific. However, we believe that in any case they demonstrate a high potential of our approach in increasing the efficiency of gradient methods, both in accelerated and, hopefully, non-accelerated variants.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the European Research Council (ERC) under Advanced Grant 788368.

Notes on contributors

Yurii Nesterov

Yurii Nesterov Born: 1956, Moscow. Master degree 1977, Moscow State University. Doctor degree 1984. Professor at Center for Operations Research and Econometrics, UCLouvain, Belgium. Author of 5 monographs and more than 100 refereed papers in the leading optimization journals. Dantzig Prize, John von Neumann Theory Prize 2009, Charles Broyden prize 2010, Francqui Chair (Liege University 2011–2012), SIAM Outstanding paper award 2014, EURO Gold Medal 2016.

Main direction is the development of efficient numerical methods for convex and nonconvex optimization problems supported by the global complexity analysis: general interior-point methods (theory of self-concordant functions), fast gradient methods (smoothing technique), global complexity analysis of second-order and tensor schemes (cubic regularization of Newton’s method).

Mihai I. Florea

Mihai I. Florea Bachelor degree 2009, Tohoku University. Master degree 2013, Uppsala University. Doctor degree 2018, Aalto University. Postdoctoral Fellow with the Department of Mathematical Engineering, UCLouvain, Belgium. Japanese Government (MEXT) Scholarship 2004–2009, Tohoku University Head of School of Engineering Prize 2009, Aalto University Dissertation Award 2018.

Research interests include first-order methods for convex minimization and applications to medical imaging.

Notes

1 By Gradient Method we denote the extended scheme described in [Citation1], which encompasses Gradient Descent and the Proximal Gradient Method.

2 The dual of the problem, with the objective multiplied by $-1$.

3 Recall that a function is closed if its epigraph is a closed set.

References

  • H. Bauschke, J. Bolte, and M. Teboulle, A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications, Math. Oper. Res. 42 (2017), pp. 330–348.
  • D. Drusvyatskiy, M. Fazel, and S. Roy, An optimal first order method based on optimal quadratic averaging, SIAM J. Optim. 28 (2018), pp. 251–271.
  • M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Res. Logist. Q. 3 (1956), pp. 149–154.
  • H. Lu, R. Freund, and Yu. Nesterov, Relatively smooth convex optimization by first-order methods, and applications, SIAM J. Optim. 28S (2018), pp. 333–354.
  • Yu. Nesterov, Gradient methods for minimizing composite functions, Math. Program. 140 (2013), pp. 125–161.
  • Yu. Nesterov, Complexity bounds for primal–dual methods minimizing the model of objective function, Math. Program. 171 (2018), pp. 311–330.
  • Yu. Nesterov, Lectures on Convex Optimization, Springer, Berlin, Germany, 2018.