Research Article

Gradient methods with memory

Yurii Nesterov & Mihai I. Florea
Pages 936-953 | Received 22 Apr 2020, Accepted 30 Nov 2020, Published online: 13 Jan 2021

ABSTRACT

In this paper, we consider gradient methods for minimizing smooth convex functions, which employ the information obtained at the previous iterations in order to accelerate the convergence towards the optimal solution. This information is used in the form of a piece-wise linear model of the objective function, which provides us with much better prediction abilities as compared with the standard linear model. To the best of our knowledge, this approach was never really applied in Convex Minimization to differentiable functions in view of the high complexity of the corresponding auxiliary problems. However, we show that all necessary computations can be done very efficiently. Consequently, we get new optimization methods, which are better than the usual Gradient Methods both in the number of oracle calls and in the computational time. Our theoretical conclusions are confirmed by preliminary computational experiments.

1. Introduction

1.1. Motivation

First-order gradient methods for minimizing smooth convex functions generate a sequence of test points based on the information obtained from the oracle: the function values and the gradients. Most methods either use the information from the last test point or accumulate it in the form of an aggregated linear function (see, for example, Chapter 2 in [Citation7]). This approach is very different from the technique used in Nonsmooth Optimization, where the piece-wise linear model of the objective is a standard and powerful tool. It is enough to mention the Bundle Method, the Level Method, cutting plane schemes, etc. The reason for this situation is quite clear. The presence of piece-wise linear models in the auxiliary problems, which we need to solve at each iteration of the method, usually significantly increases the complexity of the corresponding computations. This is acceptable in Nonsmooth Optimization, which has the reputation of a difficult field. By contrast, Smooth Optimization admits very simple and elegant schemes, with a very small computational cost of each iteration, preventing us from introducing there such a heavy machinery.

After the preparation of this manuscript, we became aware of a highly specialized attempt in [Citation2], which uses quadratic lower bounds instead of linear ones. Although the results presented there seem promising, the study is limited to smooth unconstrained problems with strongly convex objectives, and it states that extending the results to a wider context is a difficult open problem.

The main goal of this paper is to demonstrate that the above situation is not as clear-cut as it looks. We will show that the Gradient Method,Footnote1 equipped with a piece-wise linear model of the objective function, has much better chances of accelerating on particular optimization problems. At the same time, it appears that the corresponding auxiliary optimization problems can be easily solved by an appropriate version of the Frank–Wolfe algorithm. All our claims are supported by a complexity analysis. In the end, we present preliminary computational results, which show that very often the new schemes have a much better computational time.

1.2. Contents

In Section 2, we analyse the Gradient Method with Memory as applied to the composite form of smooth convex optimization problems [Citation5]. In order to measure the level of smoothness of our objective function, we introduce the relative smoothness condition [Citation1,Citation4], based on an arbitrary strictly convex distance function. The main novelty here is the piece-wise linear model of the objective function, formed around the current test point. We analyse the corresponding auxiliary optimization problem and propose a condition on its approximate solution which does not destroy the rate of convergence of the algorithm. In Section 3, we analyse the complexity of solving the auxiliary optimization problem using the Frank–Wolfe algorithm. More precisely, we consider the anti-dualFootnote2 of the auxiliary problem.

In this section, we restrict ourselves to strongly convex distance functions. We show that our auxiliary problem can be easily solved by the Frank–Wolfe method. Its complexity is proportional to the maximal squared norm of the gradient in the current model of the objective divided by the desired accuracy.

In Section 4, we specify our complexity results for the Euclidean setup, when all distances are measured by a Euclidean norm. We show that, for some strategies of updating the piece-wise linear model, the complexity of the auxiliary computations is very low.

Finally, in Section 5 we present preliminary computational results. We compare the usual Gradient Method with two gradient methods with memory, which use different strategies for updating the piece-wise linear model of the objective function. Our conclusion is that the new schemes are always better, both in the number of oracle calls and in the total computational time.

1.3. Notation and generalities

In what follows, we denote by $\mathbb{E}$ a finite-dimensional real vector space and by $\mathbb{E}^*$ its dual space, the space of linear functions on $\mathbb{E}$. The value of a function $s \in \mathbb{E}^*$ at a point $x \in \mathbb{E}$ is denoted by $\langle s, x \rangle$. Let us fix some arbitrary (possibly non-Euclidean) norm $\|\cdot\|$ on the space $\mathbb{E}$ and define the dual norm on $\mathbb{E}^*$ in the standard way: $\|s\|_* \stackrel{\mathrm{def}}{=} \sup_{h \in \mathbb{E}} \{ \langle s, h \rangle : \|h\| \le 1 \}$. Let us choose a simple closed convex prox-function $d(\cdot)$, which is differentiable on the interior of its domain.Footnote3 This function must be strictly convex: (1) $d(y) > d(x) + \langle \nabla d(x), y - x \rangle$, $x \in \operatorname{int}(\operatorname{dom} d)$, $y \in \operatorname{dom} d$, $x \ne y$. Using this function, we can define the Bregman distance between two points $x$ and $y$: (2) $\beta_d(x, y) = d(y) - d(x) - \langle \nabla d(x), y - x \rangle$, $x \in \operatorname{int}(\operatorname{dom} d)$, $y \in \operatorname{dom} d$. Clearly, by (1), $\beta_d(x, y) > 0$ for $x \ne y$ and $\beta_d(x, x) = 0$.
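As a quick numerical illustration of definition (2) (our own sketch, not from the paper), the following snippet evaluates the Bregman distance for two standard prox-functions: the squared Euclidean norm, for which $\beta_d(x, y) = \frac{1}{2}\|y - x\|^2$, and the entropy function, for which it becomes the Kullback–Leibler divergence on the simplex.

```python
import numpy as np

def bregman(d, grad_d, x, y):
    """Bregman distance (2): beta_d(x, y) = d(y) - d(x) - <grad d(x), y - x>."""
    return d(y) - d(x) - grad_d(x) @ (y - x)

# Prox-function 1: d(x) = 1/2 ||x||^2  gives  beta_d(x, y) = 1/2 ||y - x||^2.
sq, sq_grad = lambda x: 0.5 * x @ x, lambda x: x
x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert abs(bregman(sq, sq_grad, x, y) - 0.5 * np.sum((y - x) ** 2)) < 1e-12
assert bregman(sq, sq_grad, x, x) == 0.0          # beta_d(x, x) = 0

# Prox-function 2: entropy d(x) = sum_i x_i ln x_i; between points of the
# simplex this Bregman distance is the KL divergence.
ent, ent_grad = lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0
p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
assert bregman(ent, ent_grad, p, q) > 0.0         # strict convexity, cf. (1)
```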

We will use Bregman distances for measuring the level of relative smoothness of convex functions (see [Citation4]). Namely, for a differentiable closed convex function $f$ with open $\operatorname{dom} f \subseteq \operatorname{dom} d$, we define two constants $L_d(f) \ge \mu_d(f) \ge 0$ such that (3) $\mu_d(f)\, \beta_d(x, y) \le f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le L_d(f)\, \beta_d(x, y)$ for all $x, y \in \operatorname{dom} f$. See [Citation1] and [Citation4] for definitions, motivations, and examples.

2. Gradient method with memory

In this paper, we are solving the following composite minimization problem: (4) $\min_{x \in \operatorname{dom}\psi} \{ F(x) \equiv f(x) + \psi(x) \}$, where the function $f$ satisfies the relative smoothness condition (3), possibly with $\mu_d(f) = 0$. The function $\psi : \mathbb{E} \to \mathbb{R} \cup \{+\infty\}$ is a proper closed convex function with $\operatorname{dom}\psi \subseteq \operatorname{dom} f$ and $\operatorname{int}(\operatorname{dom}\psi)$ non-empty. The function $\psi$ is simple (in the sense of satisfying Assumptions 2.1 and 2.2, stated in the sequel) but it does not have to be differentiable or even continuous. For instance, $\psi$ can incorporate the indicator function of the feasible set. We assume that a solution $x^* \in \operatorname{dom}\psi$ of problem (4) exists, and denote $F^* = F(x^*)$.

The simplest method for solving problem (4) is the usual Gradient Method: (5) Choose $x_0 \in \operatorname{int}(\operatorname{dom}\psi)$. For $k \ge 0$, iterate: $x_{k+1} = \arg\min_{y \in \operatorname{dom}\psi} \{ f(x_k) + \langle \nabla f(x_k), y - x_k \rangle + \psi(y) + L\, \beta_d(x_k, y) \}$. The constant $L$ in this method has to be big enough to ensure $f(x_{k+1}) \le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + L\, \beta_d(x_k, x_{k+1})$. In view of (3), this is definitely true for $L \ge L_d(f)$. However, we are interested in choosing $L$ as small as possible, since this would significantly increase the rate of convergence of the scheme.
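For intuition, in the Euclidean setting with $\psi \equiv 0$ and $d(x) = \frac{1}{2}\|x\|^2$, the argmin in (5) has the closed form $x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$. A minimal sketch of this special case on a toy quadratic (the objective and constants are our own hypothetical choices):

```python
import numpy as np

# Toy objective f(x) = 1/2 x^T A x with gradient A x; here L_d(f) = lambda_max(A).
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
L = 10.0                     # any L >= L_d(f) is admissible

x = np.array([1.0, 1.0])
for _ in range(200):
    x = x - grad(x) / L      # step (5) in the Euclidean unconstrained case

assert np.linalg.norm(x) < 1e-3   # converges to the minimizer x* = 0
```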

Method (5) is based on the simplest linear model of the function $f(\cdot)$ around the point $x_k$. In our paper, we suggest replacing it by a piece-wise linear model, defined by the information collected at other test points.

Namely, for each $k \ge 0$ define a discrete set $Z_k$ of $m_k$ feasible points ($m_k \ge 1$): $Z_k = \{ z_i \in \operatorname{dom}\psi,\ i = 1, \dots, m_k \}$. Then we can use a more sophisticated model of the smooth part of the objective function, (6) $f(y) \ge \ell_k(y) \stackrel{\mathrm{def}}{=} \max_{z_i \in Z_k} \{ f(z_i) + \langle \nabla f(z_i), y - z_i \rangle \}$, $y \in \operatorname{dom}\psi$. This model is always better than the initial linear model provided that (7) $x_k \in Z_k$. In what follows, we always assume that this condition is satisfied.
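The model (6) is easy to evaluate: it is just a maximum of $m_k$ linear functions. A small sketch (our own illustration, using a hypothetical quadratic objective) checking numerically that the model lower-bounds $f$ and dominates every individual linearization:

```python
import numpy as np

def model(zs, f, grad, y):
    """Piece-wise linear model (6): max_i [ f(z_i) + <grad f(z_i), y - z_i> ]."""
    return max(f(z) + grad(z) @ (y - z) for z in zs)

# Hypothetical smooth convex objective f(x) = ||x||^2.
f = lambda x: x @ x
grad = lambda x: 2.0 * x

rng = np.random.default_rng(0)
zs = [rng.standard_normal(3) for _ in range(5)]
for _ in range(100):
    y = rng.standard_normal(3)
    ell = model(zs, f, grad, y)
    assert ell <= f(y) + 1e-9                                  # lower bound, as in (6)
    assert ell >= f(zs[0]) + grad(zs[0]) @ (y - zs[0]) - 1e-9  # beats a single linearization
```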

Thus, we come to the following natural generalization of method (5), which we call the Gradient Method with Memory (GMM): (8) Choose $x_0 \in \operatorname{int}(\operatorname{dom}\psi)$. For $k \ge 0$, iterate: $x_{k+1} = \arg\min_{y \in \operatorname{dom}\psi} \{ \ell_k(y) + \psi(y) + L\, \beta_d(x_k, y) \}$.

Remark 2.1

Note that for any $x \in \operatorname{dom} f$ we have $f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + L\, \beta_d(x_k, x) \overset{(7)}{\le} \ell_k(x) + L\, \beta_d(x_k, x) \overset{(6)}{\le} f(x) + L\, \beta_d(x_k, x) \overset{(3)}{\le} f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + (L + L_d(f))\, \beta_d(x_k, x)$. Therefore, we can count on better convergence of method (8) only if we are able to choose the parameter $L$ significantly smaller than $L_d(f)$.

At each iteration of method (8), we need to solve a non-trivial auxiliary minimization problem. Therefore, the practical efficiency of this method crucially depends on the complexity of this computation. In what follows, we suggest solving this problem approximately, using a special method for its dual problem.

Let us start by presenting the corresponding technique. For the sake of notation, we omit the iteration index. Thus, our auxiliary problem is as follows: $\min_{y \in \operatorname{dom}\psi} \max_{\lambda \in \Delta_m} \sum_{i=1}^m \lambda^{(i)} [ f_i + \langle g_i, y - z_i \rangle ] + \psi(y) + L\, \beta_d(\bar x, y)$, where $f_i = f(z_i)$, $g_i = \nabla f(z_i)$, $i = 1, \dots, m$, and $\Delta_m$ is the standard simplex in $\mathbb{R}^m$. Introducing now the vector $f \in \mathbb{R}^m$ with coordinates (9) $f^{(i)} = \langle g_i, z_i \rangle - f_i$, $i = 1, \dots, m$, we get the following representation of our problem: (10) $\min_{y \in \operatorname{dom}\psi} \max_{\lambda \in \Delta_m} \langle \lambda, G^T y - f \rangle + \psi(y) + L\, \beta_d(\bar x, y)$, where $G = (g_1, \dots, g_m)$ is the linear operator from $\mathbb{R}^m$ to $\mathbb{E}^*$ with columns $g_i$. Note that the pay-off function in this saddle point problem can be written as follows: $\langle \lambda, G^T y - f \rangle + \psi(y) + L\, \beta_d(\bar x, y) = \langle \lambda, G^T y - f \rangle + \psi(y) + L[ d(y) - d(\bar x) - \langle \nabla d(\bar x), y - \bar x \rangle ] = -\big[ \langle L \nabla d(\bar x) - G\lambda, y \rangle - L d(y) - \psi(y) \big] - \langle \lambda, f \rangle + L[ \langle \nabla d(\bar x), \bar x \rangle - d(\bar x) ]$. Hence, we need to introduce the following dual function: (11) $\Phi_L(s) = \max_{y \in \operatorname{dom}\psi} \langle s, y \rangle - L d(y) - \psi(y)$, $s \in \mathbb{E}^*$. Our main joint assumption on the functions $d(\cdot)$ and $\psi(\cdot)$ is as follows.

Assumption 2.1

For any $L > 0$, the function $\Phi_L(s)$ is well defined for every $s \in \mathbb{E}^*$.

This can be ensured, for example, by the strong convexity of function d(), or by the boundedness of domψ, or in many other ways.

Since the objective function in definition (11) is strictly concave, its maximizer $y_L(s) = \arg\max_{y \in \operatorname{dom}\psi} \langle s, y \rangle - L d(y) - \psi(y)$ is uniquely defined for any $s \in \mathbb{E}^*$. Moreover, the function $\Phi_L(\cdot)$ is differentiable and (12) $\nabla \Phi_L(s) = y_L(s)$, $s \in \mathbb{E}^*$.

Now we can write down the problem anti-dual to (10): (13) $\xi_L^* \stackrel{\mathrm{def}}{=} \min_{\lambda \in \Delta_m} \big\{ \xi_L(\lambda) \stackrel{\mathrm{def}}{=} \Phi_L(L \nabla d(\bar x) - G\lambda) + \langle \lambda, f \rangle + \alpha \big\}$, where $\alpha = L[ d(\bar x) - \langle \nabla d(\bar x), \bar x \rangle ]$. This is a convex optimization problem with a differentiable objective function. Our second main assumption is as follows.

Assumption 2.2

The function $\Phi_L(\cdot)$ in problem (13) is easily computable.

We will discuss reasonable strategies for finding an approximate solution to problem (13) in Section 3. At this moment, it is enough to assume that we are able to compute a point $\bar\lambda = \bar\lambda(\bar x, Z, L)$ such that (14) $\langle \nabla \xi_L(\bar\lambda), \bar\lambda - \lambda \rangle \le \delta$ for all $\lambda \in \Delta_m$, where $\delta \ge 0$ is some tolerance parameter. Clearly, if $\delta = 0$, then $\bar\lambda$ is an optimal solution to problem (13). Note that condition (14) also ensures a small functional gap: (15) $\xi_L(\bar\lambda) - \xi_L^* = \max_{\lambda \in \Delta_m} [ \xi_L(\bar\lambda) - \xi_L(\lambda) ] \le \max_{\lambda \in \Delta_m} \langle \nabla \xi_L(\bar\lambda), \bar\lambda - \lambda \rangle \overset{(14)}{\le} \delta$.

Condition (14) immediately leads to the following result.

Lemma 2.1

Let $\bar\lambda \in \Delta_m$ satisfy condition (14). Then for $\bar s = L \nabla d(\bar x) - G\bar\lambda$ we have (16) $\sum_{i=1}^m \bar\lambda^{(i)} [ f_i + \langle g_i, y_L(\bar s) - z_i \rangle ] \ge \max_{1 \le i \le m} [ f_i + \langle g_i, y_L(\bar s) - z_i \rangle ] - \delta$.

Proof.

Indeed, $\nabla \xi_L(\bar\lambda) = f - G^T y_L(\bar s)$. Thus, inequality (14) can be rewritten as follows: $\langle \bar\lambda, G^T y_L(\bar s) - f \rangle \ge \langle \lambda, G^T y_L(\bar s) - f \rangle - \delta$ for all $\lambda \in \Delta_m$. It remains to note that $( G^T y_L(\bar s) - f )^{(i)} = f_i + \langle g_i, y_L(\bar s) - z_i \rangle$, $i = 1, \dots, m$.

Now we are able to analyse one iteration of the inexact version of method (8). (17) Input: a point $\bar x \in \operatorname{int}(\operatorname{dom}\psi)$, a set of test points $Z$ containing $\bar x$, a constant $L > 0$, and a tolerance $\delta \ge 0$. Iteration: using the input data, form the optimization problem (13) and compute its approximate solution $\bar\lambda$ satisfying condition (14). Output: the points $\bar s = L \nabla d(\bar x) - G\bar\lambda$ and $x_+ = y_L(\bar s)$.

Theorem 2.1

  1. Let the point $x_+$ be generated by one iteration (17) of the Inexact Gradient Method with Memory (IGMM), and let $L \ge L_d(f)$. Then for any $y \in \operatorname{dom}\psi$, (18) $\beta_d(x_+, y) \le \beta_d(\bar x, y) + \frac{1}{L}[ F(y) - F(x_+) + \delta ]$;

  2. For every $y \in \operatorname{dom}\psi$ satisfying $\beta_d(z_i, y) \ge \beta_d(\bar x, y)$, $i = 1, \dots, m$, we have (19) $\beta_d(x_+, y) \le \big( 1 - \frac{\mu_d(f)}{L} \big) \beta_d(\bar x, y) + \frac{1}{L}[ F(y) - F(x_+) + \delta ]$.

Proof.

Note that $\beta_d(x_+, y) - \beta_d(\bar x, y) \overset{(2)}{=} d(y) - d(x_+) - \langle \nabla d(x_+), y - x_+ \rangle - d(y) + d(\bar x) + \langle \nabla d(\bar x), y - \bar x \rangle \overset{(2)}{=} \langle \nabla d(\bar x) - \nabla d(x_+), y - x_+ \rangle - \beta_d(\bar x, x_+)$. The point $x_+$ is defined as $x_+ = \arg\max_{x \in \operatorname{dom}\psi} \langle L \nabla d(\bar x) - G\bar\lambda, x \rangle - L d(x) - \psi(x)$. The first-order optimality condition for $x_+$, written at the point $y$, gives $\psi(x_+) \le \psi(y) + \langle G\bar\lambda + L( \nabla d(x_+) - \nabla d(\bar x) ), y - x_+ \rangle$. Hence, $\beta_d(x_+, y) - \beta_d(\bar x, y) \le \frac{1}{L}\big[ \psi(y) - \psi(x_+) + \langle G\bar\lambda, y - x_+ \rangle \big] - \beta_d(\bar x, x_+)$. Note that $\langle G\bar\lambda, y - x_+ \rangle = \langle \bar\lambda, G^T(y - x_+) \rangle = \sum_{i=1}^m \bar\lambda^{(i)} \langle g_i, y - x_+ \rangle = \sum_{i=1}^m \bar\lambda^{(i)} \big[ \langle g_i, z_i - x_+ \rangle + \langle g_i, y - z_i \rangle \big] \overset{(3)}{\le} \sum_{i=1}^m \bar\lambda^{(i)} \big[ \langle g_i, z_i - x_+ \rangle + f(y) - f_i - \mu_d(f)\, \beta_d(z_i, y) \big] = f(y) - \sum_{i=1}^m \bar\lambda^{(i)} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \mu_d(f) \sum_{i=1}^m \bar\lambda^{(i)} \beta_d(z_i, y)$. Under the conditions of Item 1, we drop the last term in the above inequality and, by Lemma 2.1, obtain $\beta_d(x_+, y) - \beta_d(\bar x, y) \le \frac{1}{L}\Big[ F(y) - \sum_{i=1}^m \bar\lambda^{(i)} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \psi(x_+) - L\, \beta_d(\bar x, x_+) \Big] \le \frac{1}{L}\Big[ F(y) - \max_{1 \le i \le m} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \psi(x_+) - L\, \beta_d(\bar x, x_+) + \delta \Big]$. Under the conditions of Item 2, by the same reasoning we get $\beta_d(x_+, y) - \beta_d(\bar x, y) \le \frac{1}{L}\Big[ F(y) - \mu_d(f)\, \beta_d(\bar x, y) - \sum_{i=1}^m \bar\lambda^{(i)} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \psi(x_+) - L\, \beta_d(\bar x, x_+) \Big] \le \frac{1}{L}\Big[ F(y) - \mu_d(f)\, \beta_d(\bar x, y) - \max_{1 \le i \le m} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \psi(x_+) - L\, \beta_d(\bar x, x_+) + \delta \Big]$. In both cases, since $\bar x \in Z$, we have $\max_{1 \le i \le m} [ f_i + \langle g_i, x_+ - z_i \rangle ] + L\, \beta_d(\bar x, x_+) \ge f(\bar x) + \langle \nabla f(\bar x), x_+ - \bar x \rangle + L\, \beta_d(\bar x, x_+) \overset{(3)}{\ge} f(x_+)$. Thus, we obtain inequalities (18) and (19).

Remark 2.2

At the end of the proof, we have seen that the statement of Theorem 2.1 remains valid if the condition $L \ge L_d(f)$ is replaced by the following: (20) $\max_{1 \le i \le m} [ f_i + \langle g_i, x_+ - z_i \rangle ] + L\, \beta_d(\bar x, x_+) \ge f(x_+)$.

Denote the output of iteration (17) by $x_{\delta, L}(\bar x, Z)$. Then we can define the following Inexact Gradient Method with Memory: (21) Choose $x_0 \in \operatorname{int}(\operatorname{dom}\psi)$, $\delta \ge 0$ and $L > 0$. For $k \ge 0$, iterate: (1) choose the set $Z_k$ containing $x_k$; (2) compute $x_{k+1} = x_{\delta, L}(x_k, Z_k)$.

Let us describe the rate of convergence of this process.

Theorem 2.2

Let the sequence $\{x_k\}_{k \ge 1}$ be generated by IGMM (21) with $L \ge L_d(f)$. Then, for any $T \ge 1$ and $y \in \operatorname{dom}\psi$ we have (22) $\frac{1}{T} \sum_{k=1}^T F(x_k) \le F(y) + \frac{L}{T}\, \beta_d(x_0, y) + \delta$.

Proof.

Indeed, in view of inequality (18), we have $\beta_d(x_{k+1}, y) \le \beta_d(x_k, y) + \frac{1}{L}[ F(y) - F(x_{k+1}) + \delta ]$, $k \ge 0$. Summing up these inequalities for $k = 0, \dots, T-1$, we get inequality (22).

In the above result, the only restriction on the sets $Z_k$ is the inclusion (7). If we apply a more accurate strategy of choosing $Z_k$, we can obtain a finer estimate of the rate of convergence of this scheme.

Theorem 2.3

Let the sequence $\{x_k\}_{k \ge 1}$ be generated by IGMM (21) with $L \ge L_d(f)$. Assume that, besides condition (7), the sets $Z_k$ also satisfy the following condition: (23) $Z_k \subseteq \{ x_0, \dots, x_k \}$, $k \ge 0$. Suppose that $\mu_d(f) > 0$ and that for all $k$, $1 \le k \le T$, we have (24) $F(x_k) - F^* \ge \delta$. Then for $\Delta_T = \min_{1 \le k \le T} [ F(x_k) - F^* ]$ we get the following rate of convergence: (25) $\Delta_T \le \delta + \frac{(1-\gamma)^T \mu_d(f)}{1 - (1-\gamma)^T}\, \beta_d(x_0, x^*) \le \delta + \frac{\mu_d(f)}{e^{\gamma T} - 1}\, \beta_d(x_0, x^*) \le \delta + \frac{L}{T}\, \beta_d(x_0, x^*)$, where $\gamma = \frac{\mu_d(f)}{L}$.

Proof.

In view of assumption (24) and inequality (18), we have $\beta_d(x_{k+1}, x^*) \le \beta_d(x_k, x^*)$, $0 \le k \le T-1$. Therefore, in view of inequality (19), for $r_k \stackrel{\mathrm{def}}{=} \beta_d(x_k, x^*)$ and all $k = 0, \dots, T-1$ we have $r_{k+1} \le (1-\gamma) r_k - \frac{1}{L}[ F(x_{k+1}) - F^* - \delta ] \le (1-\gamma) r_k - \frac{1}{L}[ \Delta_T - \delta ]$. Applying this inequality recursively, we get $\frac{1}{L}[ \Delta_T - \delta ] \cdot \frac{1 - (1-\gamma)^T}{1 - (1-\gamma)} \le (1-\gamma)^T r_0$. This can be rewritten as $\Delta_T \le \delta + \frac{L \gamma (1-\gamma)^T}{1 - (1-\gamma)^T}\, r_0 = \delta + \frac{(1-\gamma)^T \mu_d(f)}{1 - (1-\gamma)^T}\, \beta_d(x_0, x^*)$.

Note that the rate of convergence given by inequality (25) remains continuous as $\mu_d(f) \to 0$.

As we have mentioned in Remark 2.1, it is important to adjust the value of the constant $L$ during the minimization process. Therefore, we present an adaptive version of method (21): (26) Choose $x_0 \in \operatorname{int}(\operatorname{dom}\psi)$, $\delta \ge 0$, and some $L_0 \in (0, L_d(f)]$. For $k \ge 0$, iterate: (1) choose the set $Z_k$ containing $x_k$; (2) find the smallest integer $i_k \ge 0$ such that for the point $x_k^+ = x_{\delta,\, 2^{i_k} L_k}(x_k, Z_k)$ we have $f(x_k^+) \le \ell_k(x_k^+) + 2^{i_k} L_k\, \beta_d(x_k, x_k^+)$; (3) set $x_{k+1} = x_k^+$ and $L_{k+1} = 2^{i_k - 1} L_k$.
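In the simplest special case (Euclidean setup, $\psi \equiv 0$, bundle of size $m = 1$), the search for $i_k$ in (26) reduces to the familiar doubling/halving backtracking on $L$. A minimal sketch of that special case (our own illustration; the objective and constants are hypothetical):

```python
import numpy as np

def adaptive_step(f, grad, x, L):
    """One iteration of (26) with m = 1: find the smallest i >= 0 such that
    f(x+) <= f(x) + <f'(x), x+ - x> + 2^i L / 2 * ||x+ - x||^2,
    then return x+ and L_{k+1} = 2^{i-1} L."""
    g = grad(x)
    i = 0
    while True:
        Lk = (2.0 ** i) * L
        x_plus = x - g / Lk
        h = x_plus - x
        if f(x_plus) <= f(x) + g @ h + 0.5 * Lk * (h @ h):
            return x_plus, Lk / 2.0
        i += 1

A = np.diag([1.0, 100.0])            # f(x) = 1/2 x^T A x, so L_d(f) = 100
f = lambda x: 0.5 * x @ (A @ x)
grad = lambda x: A @ x

x, L = np.array([1.0, 1.0]), 1.0     # deliberately underestimated initial L
for _ in range(5000):
    x, L = adaptive_step(f, grad, x, L)

assert f(x) < 1e-8                   # converged to x* = 0
assert L <= 2.0 * 100.0              # L_k never exceeds 2 L_d(f)
```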

The rate of convergence of this algorithm can be established in exactly the same way as for method (21). The main fact is that during the minimization process we always have $L_k \le 2 L_d(f)$, $k \ge 0$. Therefore, for any $y \in \operatorname{dom}\psi$ we will have $\beta_d(x_{k+1}, y) \le \beta_d(x_k, y) + \frac{1}{2 L_d(f)}[ F(y) - F(x_{k+1}) + \delta ]$, with corresponding consequences for the rate of convergence. At the same time, the average number of oracle calls at each iteration of this method is bounded by two (see [Citation7] for justification details).

3. Getting an approximate solution of the anti-dual problem

The complexity of solving the auxiliary problem (13) crucially depends on the properties of the prox-function $d(\cdot)$. In the previous section, we assumed its strict convexity and the solvability of problem (11) (see Assumption 2.1). It is now time to make a stronger assumption, which ensures both of these properties.

Assumption 3.1

The function $d(\cdot)$ is differentiable in the interior of its domain and strongly convex with convexity parameter one: (27) $d(y) \ge d(x) + \langle \nabla d(x), y - x \rangle + \frac{1}{2}\|y - x\|^2$, $x \in \operatorname{int}(\operatorname{dom} d)$, $y \in \operatorname{dom} d$.

Clearly, for all $x \in \operatorname{int}(\operatorname{dom} d)$, $y \in \operatorname{dom} d$ we have (28) $\beta_d(x, y) \ge \frac{1}{2}\|y - x\|^2$.

The main consequence of Assumption 3.1 is the Lipschitz continuity of the gradient of the function $\Phi_L(\cdot)$. Since usually this fact is proved only for $\psi(\cdot)$ being the indicator function of a closed convex set, we provide it with a simple proof.

Lemma 3.1

Let the function $d(\cdot)$ satisfy Assumption 3.1. Then the gradient $\nabla \Phi_L(s) = y_L(s)$, $s \in \mathbb{E}^*$, is Lipschitz continuous: (29) $\| \nabla \Phi_L(s_1) - \nabla \Phi_L(s_2) \| \le \frac{1}{L} \| s_1 - s_2 \|_*$, $s_1, s_2 \in \mathbb{E}^*$.

Proof.

Let us write down the first-order optimality conditions for the optimization problems defining the points $y_1 \stackrel{\mathrm{def}}{=} y_L(s_1)$ and $y_2 \stackrel{\mathrm{def}}{=} y_L(s_2)$: $\langle s_1 - L \nabla d(y_1), y - y_1 \rangle \le \psi(y) - \psi(y_1)$ for all $y \in \operatorname{dom}\psi$, and $\langle s_2 - L \nabla d(y_2), y - y_2 \rangle \le \psi(y) - \psi(y_2)$ for all $y \in \operatorname{dom}\psi$. Taking $y = y_2$ in the first inequality and $y = y_1$ in the second one, and adding the results, we obtain $\langle s_1 - s_2, y_2 - y_1 \rangle \le L \langle \nabla d(y_1) - \nabla d(y_2), y_2 - y_1 \rangle$. Thus, $\langle s_1 - s_2, y_1 - y_2 \rangle \ge L \langle \nabla d(y_2) - \nabla d(y_1), y_2 - y_1 \rangle \overset{(27)}{\ge} L \| y_1 - y_2 \|^2$. Therefore, by the Cauchy–Schwarz inequality, we get $\| y_L(s_1) - y_L(s_2) \| \le \frac{1}{L} \| s_1 - s_2 \|_*$.

Thus, in this section, our main problem of interest is as follows: (30) $\xi_L^* = \min_{\lambda \in \Delta_m} \big\{ \xi_L(\lambda) \stackrel{\mathrm{def}}{=} \Phi_L(L \nabla d(\bar x) - G\lambda) + \langle \lambda, f \rangle + \alpha \big\}$, where $\alpha = L[ d(\bar x) - \langle \nabla d(\bar x), \bar x \rangle ]$. This is a convex optimization problem over a simplex, where the objective function has a Lipschitz-continuous gradient.

The most natural algorithm for solving problem (13) is the Frank–Wolfe algorithm [Citation3] (also known as the Conditional Gradient Method). For our problem, it looks as follows. (31) Set $\lambda_0 = \frac{1}{m} \bar e_m$. For $k \ge 0$, iterate: 1. Compute the gradient $\nabla \xi_L(\lambda_k)$. 2. Compute $i_k = \arg\min_{1 \le i \le m} \nabla_i \xi_L(\lambda_k)$. 3. Set $\lambda_{k+1} = \frac{k}{k+2} \lambda_k + \frac{2}{k+2} e_{i_k}$. In this scheme, $\bar e_m \in \mathbb{R}^m$ is the vector of all ones, and $e_i$ is the $i$th coordinate vector in $\mathbb{R}^m$.
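A minimal self-contained sketch of scheme (31), applied to a quadratic objective of the Euclidean form $\xi_L(\lambda) = \frac{1}{2L}\langle \lambda, Q\lambda \rangle - \langle \lambda, \bar f \rangle$ derived in Section 4 (the data here is random and hypothetical):

```python
import numpy as np

def frank_wolfe(grad, m, T):
    """Scheme (31): conditional gradient over the standard simplex Delta_m.
    Returns the iterate with the smallest accuracy measure delta_L(lambda)."""
    lam = np.full(m, 1.0 / m)            # lambda_0 = (1/m) * ones
    best, best_gap = lam.copy(), np.inf
    for k in range(T):
        u = grad(lam)
        i_k = int(np.argmin(u))          # step 2: best vertex of the simplex
        gap = u @ lam - u[i_k]           # accuracy measure delta_L(lambda_k)
        if gap < best_gap:
            best, best_gap = lam.copy(), gap
        e = np.zeros(m); e[i_k] = 1.0
        lam = (k / (k + 2.0)) * lam + (2.0 / (k + 2.0)) * e  # step 3
    return best, best_gap

# Hypothetical data: Q = G^T G positive semidefinite, as in the Euclidean case.
rng = np.random.default_rng(1)
m, L = 20, 5.0
G = rng.standard_normal((30, m))
Q, fbar = G.T @ G, rng.standard_normal(m)

lam, gap = frank_wolfe(lambda l: Q @ l / L - fbar, m, 5000)
assert abs(lam.sum() - 1.0) < 1e-9 and lam.min() >= 0.0   # iterates stay feasible
assert gap < 0.1                                          # condition (14) with delta = 0.1
```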

In order to estimate the rate of convergence of this method, we introduce the following accuracy measure: $\delta_L(\bar\lambda) = \max_{\lambda \in \Delta_m} \langle \nabla \xi_L(\bar\lambda), \bar\lambda - \lambda \rangle$. For the sequence $\{\lambda_k\}_{k \ge 0}$ generated by method (31), denote $\delta_L^*(T) = \min_{0 \le k \le T} \delta_L(\lambda_k)$, $T \ge 0$. For estimating the rate of convergence of method (31), we need to choose an appropriate norm in $\mathbb{R}^m$. Since the feasible set of problem (13) is the standard simplex, it is reasonable to use the $\ell_1$-norm: $\|\lambda\|_1 = \sum_{i=1}^m |\lambda^{(i)}|$, $\lambda \in \mathbb{R}^m$. Then, for measuring the gradients of the function $\xi_L(\cdot)$, we can use the $\ell_\infty$-norm: $\|\lambda\|_\infty = \max_{1 \le i \le m} |\lambda^{(i)}|$, $\lambda \in \mathbb{R}^m$. In this case, the Lipschitz constant for the gradients of the function $\xi_L(\cdot)$ can be estimated as follows: $\| \nabla \xi_L(\lambda_1) - \nabla \xi_L(\lambda_2) \|_\infty = \max_{1 \le i \le m} | \langle g_i, \nabla \Phi_L(L \nabla d(\bar x) - G\lambda_2) - \nabla \Phi_L(L \nabla d(\bar x) - G\lambda_1) \rangle | \le \max_{1 \le i \le m} \| g_i \|_* \, \| \nabla \Phi_L(L \nabla d(\bar x) - G\lambda_2) - \nabla \Phi_L(L \nabla d(\bar x) - G\lambda_1) \| \overset{(29)}{\le} \max_{1 \le i \le m} \| g_i \|_* \cdot \frac{1}{L} \| G(\lambda_1 - \lambda_2) \|_* \le \frac{1}{L} \max_{1 \le i \le m} \| g_i \|_*^2 \, \| \lambda_1 - \lambda_2 \|_1$. Thus, the gradients of the function $\xi_L(\cdot)$ are Lipschitz continuous with the constant (32) $L(\xi_L) = \frac{1}{L} \max_{1 \le i \le m} \| g_i \|_*^2$. Since the diameter of the standard simplex in $\mathbb{R}^m$ in the $\ell_1$-norm is two, in accordance with the estimate (3.13) in [Citation6], we can guarantee the following rate of convergence: (33) $\delta_L^*(T) \le \frac{18}{L T} \max_{1 \le i \le m} \| g_i \|_*^2$, $T \ge 1$. (Here we replace the constant $\frac{136}{11 \ln 2}$ from [Citation6] by the bigger value 18.) In accordance with condition (14), this means that we need (34) $N_L(\delta) = \frac{18}{L \delta} \max_{1 \le i \le m} \| g_i \|_*^2$ iterations of method (31) in order to generate an appropriate dual solution $\bar\lambda$.

4. Unconstrained minimization in Euclidean setup

In this section we consider the simplest unconstrained minimization problem (35) $f^* = \min_{x \in \mathbb{E}} f(x)$, where $f(\cdot)$ is a smooth convex function. For measuring distances in $\mathbb{E}$, we introduce a Euclidean norm $\|x\| = \langle Bx, x \rangle^{1/2}$, $x \in \mathbb{E}$, where $B = B^* \succ 0$ is a linear operator from $\mathbb{E}$ to $\mathbb{E}^*$. Then the dual norm is defined as follows: $\|g\|_* = \langle g, B^{-1} g \rangle^{1/2}$, $g \in \mathbb{E}^*$. Let us now choose the distance function $d(x) = \frac{1}{2}\|x\|^2$. Then the Bregman distance is given by $\beta_d(x, y) = \frac{1}{2}\|x - y\|^2$, $x, y \in \mathbb{E}$. In this case, the relative smoothness condition (3) is equivalent to strong convexity and Lipschitz continuity of the gradient: (36) $\frac{1}{2}\mu_d(f)\|x - y\|^2 \le f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le \frac{1}{2} L_d(f) \|x - y\|^2$ for all $x, y \in \operatorname{dom} f$.

Let us now write down the specific form of the objective function $\xi_L(\cdot)$ in problem (13). Note that in our case $\Phi_L(s) = \max_{y \in \mathbb{E}} \langle s, y \rangle - \frac{L}{2}\|y\|^2 = \frac{1}{2L}\|s\|_*^2$, $s \in \mathbb{E}^*$. Therefore, $\xi_L(\lambda) = \frac{1}{2L}\| L B \bar x - G\lambda \|_*^2 + \langle \lambda, f \rangle + \alpha$, where $\alpha = -\frac{L}{2}\|\bar x\|^2$. The gradient of the function $\xi_L(\cdot)$ can be computed as follows: (37) $\nabla \xi_L(\lambda) = -\frac{1}{L} G^T B^{-1} ( L B \bar x - G\lambda ) + f = \frac{1}{L} Q\lambda - \bar f$, $\lambda \in \mathbb{R}^m$, where $Q = G^T B^{-1} G$ and $\bar f = G^T \bar x - f$. Note that $\bar f^{(i)} \overset{(9)}{=} f_i + \langle g_i, \bar x - z_i \rangle$, $i = 1, \dots, m$. Thus, in the Euclidean setup, our auxiliary problem (13) can be written as follows: (38) $\min_{\lambda \in \Delta_m} \big\{ \xi_L(\lambda) = \frac{1}{2L} \langle \lambda, Q\lambda \rangle - \langle \lambda, \bar f \rangle \big\}$. The stopping criterion (14) for this problem reads $\langle \bar\lambda, \nabla \xi_L(\bar\lambda) \rangle \overset{(37)}{=} \langle \bar\lambda, \frac{1}{L} Q\bar\lambda - \bar f \rangle \overset{(14)}{\le} \delta + \min_{1 \le i \le m} \nabla_i \xi_L(\bar\lambda) = \delta + \min_{1 \le i \le m} \big( \frac{1}{L} Q\bar\lambda - \bar f \big)^{(i)}$. Note that the main output of the minimization process for problem (38) is $x_+ = y_L( L B \bar x - G\bar\lambda ) \overset{(12)}{=} \frac{1}{L} B^{-1} ( L B \bar x - G\bar\lambda ) = \bar x - \frac{1}{L} B^{-1} G \bar\lambda$. Then $\bar f - \frac{1}{L} Q\bar\lambda = \bar f - \frac{1}{L} G^T B^{-1} G \bar\lambda = \bar f + G^T( x_+ - \bar x )$. Hence, in the Euclidean case, the stopping criterion (14) can be written as follows: (39) $\sum_{i=1}^m \bar\lambda^{(i)} [ f_i + \langle g_i, x_+ - z_i \rangle ] \ge \max_{1 \le i \le m} [ f_i + \langle g_i, x_+ - z_i \rangle ] - \delta$.
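Putting the pieces of this section together (our own sketch with hypothetical data; $B = I$ and $f(x) = \frac{1}{2}\|x\|^2$, so that $g_i = z_i$): build $Q$ and $\bar f$, run Frank–Wolfe steps of the form (31) on (38), recover $x_+ = \bar x - \frac{1}{L} B^{-1} G \bar\lambda$, and check the identity behind the stopping criterion (39).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, L = 10, 6, 4.0

Z = rng.standard_normal((m, n))          # bundle points z_i (rows)
fs = 0.5 * np.sum(Z ** 2, axis=1)        # f_i = f(z_i) for f(x) = 1/2 ||x||^2
G = Z.T                                  # columns g_i = grad f(z_i) = z_i
xbar = rng.standard_normal(n)

Q = G.T @ G                              # Q = G^T B^{-1} G with B = I
fbar = fs + Z @ xbar - np.sum(Z * Z, axis=1)   # fbar_i = f_i + <g_i, xbar - z_i>

# Frank-Wolfe on (38), keeping the iterate with the smallest gap.
lam = np.full(m, 1.0 / m)
best, best_gap = lam.copy(), np.inf
for k in range(10000):
    u = Q @ lam / L - fbar               # gradient (37)
    gap = u @ lam - u.min()
    if gap < best_gap:
        best, best_gap = lam.copy(), gap
    e = np.zeros(m); e[int(np.argmin(u))] = 1.0
    lam = (k / (k + 2.0)) * lam + (2.0 / (k + 2.0)) * e

x_plus = xbar - G @ best / L             # primal recovery via (12)
vals = fs + Z @ x_plus - np.sum(Z * Z, axis=1)  # f_i + <g_i, x+ - z_i>
assert np.allclose(vals, fbar - Q @ best / L)   # identity behind (39)
assert best @ vals >= vals.max() - 0.05         # criterion (39) with delta = 0.05
```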

Now we can estimate the computational expenses of method (31) as applied to the auxiliary problem (38).

  1. Computation of the matrix $Q$: $O(m^2 n)$ arithmetic operations. For certain strategies of updating the sets $Z_k$, this can be reduced to $O(mn)$ operations.

  2. Computation of the vector f¯: O(mn) operations.

  3. Computation of the initial gradient $u_0 = \frac{1}{L} Q \lambda_0 - \bar f$: $O(m^2)$ operations. For certain updating strategies it can be $O(m)$.

  4. Expenses at each iteration:

    • Computing the index ik: O(m) operations.

    • Updating the point λk: O(m) operations.

    • Updating the gradient $u_k = \frac{1}{L} Q \lambda_k - \bar f$: $O(m)$ operations.

Thus, taking into account the upper bound (34) for the number of iterations in method (31), we obtain the following bound on the arithmetic complexity of problem (38) with reasonable updating strategies for the sets $Z_k$: (40) $O\big( mn + \frac{m}{L \delta} \max_{1 \le i \le m} \| g_i \|_*^2 \big)$. Taking into account that in problem (35) we can expect $\frac{1}{2 L_d(f)} \| \nabla f(z_i) \|_*^2 \overset{(36)}{\le} f(z_i) - f^* \to 0$ as $i \to \infty$, the bound (40) suggests that the overhead of solving the inner problem (38) decreases to $O(mn)$ as the algorithm approaches the optimum.

5. Numerical experiments

In this section we present preliminary computational results for method (26) as applied to the following unconstrained minimization problem: (41) $\min_{x \in \mathbb{R}^n} \big\{ f(x) = \mu \ln \sum_{j=1}^M e^{(\langle a_j, x \rangle - b_j)/\mu} \big\}$. The data defining this function is randomly generated in the following way. First of all, we generate a collection of random vectors $\hat a_1, \dots, \hat a_M$ with entries uniformly distributed in the interval $[-1, 1]$. Using the same distribution, we generate the values $b_j$, $j = 1, \dots, M$. Using this data, we form the preliminary function $\hat f(x) = \mu \ln \sum_{j=1}^M e^{(\langle \hat a_j, x \rangle - b_j)/\mu}$ and compute $g = \nabla \hat f(0)$. Then, we define $a_j = \hat a_j - g$, $j = 1, \dots, M$. Clearly, in this case we have $\nabla f(0) = 0$, so the unique solution of our test problem (41) is $x^* = 0$. The starting point $x_0$ is chosen uniformly at random on the Euclidean sphere of radius one.
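The shift $a_j = \hat a_j - g$ indeed forces the gradient to vanish at the origin, because the softmax weights at $x = 0$ depend only on $b$ and $\mu$ and sum to one. A quick numerical check of this construction (our own sketch; the dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n, mu = 20, 0.05
M = 6 * n

A_hat = rng.uniform(-1.0, 1.0, size=(M, n))
b = rng.uniform(-1.0, 1.0, size=M)

def grad(A, x):
    """Gradient of f(x) = mu * ln sum_j exp((<a_j, x> - b_j)/mu)."""
    w = (A @ x - b) / mu
    w -= w.max()                       # softmax stabilization
    lam = np.exp(w); lam /= lam.sum()  # weights sum to one
    return A.T @ lam

g = grad(A_hat, np.zeros(n))           # g = grad fhat(0)
A = A_hat - g                          # shift: a_j = ahat_j - g
assert np.linalg.norm(grad(A, np.zeros(n))) < 1e-12   # now grad f(0) = 0
```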

Thus, problem (41) has three parameters: the dimension $n$, the number of linear functions $M \ge n$, and the smoothness coefficient $\mu > 0$. In our experiments, we always choose $M = 6n$. Let us present our computational results for different values of $n$ and $\mu$.

In the definition of method (26) we have some freedom in the choice of the bundle $Z_k$. Let us bound its maximal size by a parameter $m \ge 1$. Then $m = 1$ corresponds to the usual Gradient Method. In the first series of our experiments (shown in Tables 1 and 2) we always choose $m = n$. We also have some freedom in the updating strategy for the sets $Z_k$. Clearly, during the first $m$ steps we can simply add all new points to the bundle. However, at subsequent iterations we need to decide on a strategy for replacing the old information. In our experiments we implemented two strategies:

  • Cyclic replacement (Cyclic).

  • Replacement of the linear function with the maximal norm of the gradient (Max-Norm).

Table 1. Smoothness parameter $\mu = 0.05$.

Table 2. Smoothness parameter $\mu = 0.01$.

The second strategy is motivated by formula (32) for the Lipschitz constant of the gradient of the function $\xi_L(\cdot)$. For both strategies, at each iteration we need to update only one column of the matrix $Q_k$ (see (37)), which costs $O(mn)$ operations.

Let us present the results of our numerical experiments. All methods were stopped when the residual in the function value was smaller than $\epsilon = 10^{-6}$. The parameter $\delta$ for the stopping criterion (14) was chosen as $\delta = \epsilon / 2$.

In Tables 1 and 2, the first line indicates the total number of iterations. The second line displays the total number of oracle calls. The third line shows the average number of Frank–Wolfe steps per iteration (for the Gradient Method we simply put two). The next line indicates the total computational time (in seconds). Finally, the last line shows the average time spent on one iteration of the corresponding method (in milliseconds).

As we can see from these tables, in all our experiments the gradient methods with memory were better than the standard Gradient Method, both in the number of iterations and, rather surprisingly, in the total computational time. The Max-Norm version usually outperforms the Cyclic version. It is interesting that the auxiliary algorithm (31) works very well. The average time spent on one iteration of the methods with memory never exceeds that of the simple Gradient Method by more than 50%. This is partially explained by the fact that in our test problems the data is fully dense, so each oracle call is very expensive ($O(Mn)$ operations).

Let us now look at how small bundles can accelerate the Gradient Method. In Tables 3 and 4, the first line, with parameter Bundle = 1, corresponds to the Gradient Method with line search. The next lines display the results for different sizes of the bundle. We list the number of iterations, the average number of Frank–Wolfe steps per iteration, and the total computational time in seconds. Table 3 displays the results for the IGMM (26) with the cyclic replacement strategy for each bundle size. In Table 4, we show the results for the Max-Norm replacement strategy. The accuracy parameters for the experiments shown in Tables 3 and 4 are $\epsilon = 10^{-4}$ and $\delta = \epsilon / 2$. The smoothness parameter for our objective function is chosen as $\mu = 0.05$.

Table 3. Gradient method with cyclic memory replacement.

Table 4. Gradient method with Max-Norm memory replacement.

As we can see from these tables, the Max-Norm replacement strategy was always better than the cyclic one. Even for small bundle sizes, the total number of iterations decreases very quickly. More importantly, this decrease is also seen in the total computation time. The number of auxiliary Frank–Wolfe steps remains at an acceptable level and does not significantly increase the computational time of each iteration as compared with the Gradient Method. Recall that our test function has an expensive oracle, requiring $O(Mn)$ operations for computing the function value and the gradient. For the Max-Norm version of IGMM, the optimal size of the bundle is probably between 8 and 16. Another candidate is 256, but it needs many more Frank–Wolfe steps.

Maybe our preliminary conclusions are problem specific. However, we believe that in any case they demonstrate a high potential of our approach in increasing the efficiency of gradient methods, both in accelerated and, hopefully, non-accelerated variants.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the European Research Council (ERC) under Advanced Grant 788368.

Notes on contributors

Yurii Nesterov

Yurii Nesterov Born: 1956, Moscow. Master degree 1977, Moscow State University. Doctor degree 1984. Professor at Center for Operations Research and Econometrics, UCLouvain, Belgium. Author of 5 monographs and more than 100 refereed papers in the leading optimization journals. Dantzig Prize, John von Neumann Theory Prize 2009, Charles Broyden prize 2010, Francqui Chair (Liege University 2011–2012), SIAM Outstanding paper award 2014, EURO Gold Medal 2016.

Main direction is the development of efficient numerical methods for convex and nonconvex optimization problems supported by the global complexity analysis: general interior-point methods (theory of self-concordant functions), fast gradient methods (smoothing technique), global complexity analysis of second-order and tensor schemes (cubic regularization of Newton’s method).

Mihai I. Florea

Mihai I. Florea Bachelor degree 2009, Tohoku University. Master degree 2013, Uppsala University. Doctor degree 2018, Aalto University. Postdoctoral Fellow with the Department of Mathematical Engineering, UCLouvain, Belgium. Japanese Government (MEXT) Scholarship 2004–2009, Tohoku University Head of School of Engineering Prize 2009, Aalto University Dissertation Award 2018.

Research interests include first-order methods for convex minimization and applications to medical imaging.

Notes

1 By Gradient Method we denote the extended scheme described in [Citation1], which encompasses Gradient Descent and the Proximal Gradient Method.

2 The dual of the problem, with the objective multiplied by $-1$.

3 Recall that a function is closed if its epigraph is a closed set.

References

  • H. Bauschke, J. Bolte, and M. Teboulle, A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications, Math. Oper. Res. 42 (2017), pp. 330–348.
  • D. Drusvyatskiy, M. Fazel, and S. Roy, An optimal first order method based on optimal quadratic averaging, SIAM J. Optim. 28 (2018), pp. 251–271.
  • M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Res. Logist. Q. 3 (1956), pp. 149–154.
  • H. Lu, R. Freund, and Yu. Nesterov, Relatively smooth convex optimization by first-order methods, and applications, SIAM J. Optim. 28S (2018), pp. 333–354.
  • Yu. Nesterov, Gradient methods for minimizing composite functions, Math. Program. 140 (2013), pp. 125–161.
  • Yu. Nesterov, Complexity bounds for primal–dual methods minimizing the model of objective function, Math. Program. 171 (2018), pp. 311–330.
  • Yu. Nesterov, Lectures on Convex Optimization, Springer, Berlin, Germany, 2018.