Tensor methods for finding approximate stationary points of convex functions

G. N. Grapiglia & Yurii Nesterov
Pages 605-638 | Received 22 May 2020, Accepted 29 Aug 2020, Published online: 23 Sep 2020

Abstract

In this paper, we consider the problem of finding ε-approximate stationary points of convex functions that are p-times differentiable with ν-Hölder continuous pth derivatives. We present tensor methods with and without acceleration. Specifically, we show that the non-accelerated schemes take at most $O(\epsilon^{-1/(p+\nu-1)})$ iterations to reduce the norm of the gradient of the objective below a given $\epsilon \in (0,1)$. For accelerated tensor schemes, we establish improved complexity bounds of $O(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]})$ and $O(|\log(\epsilon)|\,\epsilon^{-1/(p+\nu)})$ when the Hölder parameter $\nu \in [0,1]$ is known. For the case in which ν is unknown, we obtain a bound of $O(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]})$ for a universal accelerated scheme. Finally, we also obtain a lower complexity bound of $O(\epsilon^{-2/[3(p+\nu)-2]})$ for finding ε-approximate stationary points using p-order tensor methods.

2010 Mathematics Subject Classifications:

1. Introduction

1.1. Motivation

In this paper, we study tensor methods for unconstrained optimization, i.e. methods in which the iterates are obtained by the (approximate) minimization of models defined from high-order Taylor approximations of the objective function. This type of method is not new in the optimization literature (see, e.g. [Citation1,Citation4,Citation33]). Recently, interest in tensor methods has been renewed by the work in [Citation2], where p-order tensor methods were proposed for unconstrained minimization of nonconvex functions with Lipschitz continuous pth derivatives. It was shown that these methods take at most $O(\epsilon^{-(p+1)/p})$ iterations to find an ε-approximate first-order stationary point of the objective, generalizing the bound of $O(\epsilon^{-3/2})$ originally established in [Citation31] for the Cubic Regularization of Newton's Method (p = 2). After [Citation2], several high-order methods have been proposed and analysed for nonconvex optimization (see, e.g. [Citation9–11,Citation23]), including worst-case complexity bounds for the number of iterations that p-order methods need to generate approximate qth order stationary points [Citation7,Citation8].

More recently, in [Citation27], p-order tensor methods with and without acceleration were proposed for unconstrained minimization of convex functions with Lipschitz continuous pth derivatives. As is usual in Convex Optimization, these methods aim at generating a point $\bar{x}$ such that $f(\bar{x}) - f^* \le \epsilon$, where f is the objective function, $f^*$ is its optimal value and $\epsilon > 0$ is a given precision. Specifically, it was shown that the non-accelerated scheme takes at most $O(\epsilon^{-1/p})$ iterations to reduce the functional residual below a given $\epsilon > 0$, while the accelerated scheme takes at most $O(\epsilon^{-1/(p+1)})$ iterations to accomplish the same task. The auxiliary problem in these methods consists of minimizing a (p+1)-regularization of the pth order Taylor approximation of the objective, which is a multivariate polynomial. A remarkable result shown in [Citation27] (which distinguishes this work from [Citation1]) is that, in the convex case, the auxiliary problems in tensor methods become convex when the corresponding regularization parameter is sufficiently large. Since [Citation27], several high-order methods have been proposed for convex optimization (see, e.g. [Citation13,Citation14,Citation18,Citation20]), including near-optimal methods [Citation5,Citation15,Citation21,Citation28,Citation29] motivated by the second-order method in [Citation24]. In particular, in [Citation18], we adapted and generalized the methods in [Citation16,Citation17,Citation27] to handle convex functions with ν-Hölder continuous pth derivatives. It was shown that the non-accelerated schemes take at most $O(\epsilon^{-1/(p+\nu-1)})$ iterations to generate a point with functional residual smaller than a given $\epsilon \in (0,1)$, while the accelerated variants take only $O(\epsilon^{-1/(p+\nu)})$ iterations when the parameter ν is explicitly used in the scheme. For the case in which ν is not known, we also proposed a universal accelerated scheme for which we established an iteration complexity bound of $O(\epsilon^{-p/[(p+1)(p+\nu-1)]})$.

As a natural development, in this paper we present variants of the p-order methods ($p \ge 2$) proposed in [Citation18] that aim at generating a point $\bar{x}$ such that $\|\nabla f(\bar{x})\|_* \le \epsilon$, for a given threshold $\epsilon \in (0,1)$. In the context of nonconvex optimization, finding approximate stationary points is usually the best one can expect from local optimization methods. In the context of convex optimization, one motivation to search for approximate stationary points is the fact that the norm of the gradient may serve as a measure of feasibility and optimality when one applies the dual approach for solving constrained convex problems (see, e.g. [Citation26]). Another motivation comes from the inexact high-order proximal-point methods recently proposed in [Citation28,Citation29], in which the iterates are computed as approximate stationary points of uniformly convex models.

Specifically, our contributions are the following:

  1. We show that the non-accelerated schemes in [Citation18] take at most $O(\epsilon^{-1/(p+\nu-1)})$ iterations to reduce the norm of the gradient of the objective below a given $\epsilon \in (0,1)$ when the objective is convex, and $O(\epsilon^{-(p+\nu)/(p+\nu-1)})$ iterations when f is nonconvex. These complexity bounds extend our previous results reported in [Citation16] for regularized Newton methods (case p = 2). Moreover, our complexity bound for the nonconvex case agrees in order with the bounds obtained in [Citation9,Citation23] for different tensor methods.

  2. For accelerated tensor schemes, we establish improved complexity bounds of $O(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]})$ when the Hölder parameter $\nu \in [0,1]$ is known. This result generalizes the bound of $O(\epsilon^{-2/3})$ obtained in [Citation26] for the accelerated gradient method ($p = \nu = 1$). In contrast, when ν is unknown, we prove a bound of $O(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]})$ for a universal accelerated scheme.

  3. For the case in which ν and the corresponding Hölder constant are known, we propose tensor schemes for the composite minimization problem. In particular, we prove a bound of $O\big(R^{\frac{p+\nu-1}{p+\nu}}\,|\log_2(\epsilon)|\,\epsilon^{-\frac{1}{p+\nu}}\big)$ iterations, where R is an upper bound for the initial distance to the optimal set. This result generalizes the bounds obtained in [Citation26] for first-order and second-order accelerated schemes combined with a regularization approach (p = 1, 2 and $\nu = 1$). We also prove a bound of $O\big(S^{\frac{p+\nu-1}{p+\nu}}\,|\log_2(\epsilon)|\,\epsilon^{-1}\big)$ iterations, where S is an upper bound for the initial functional residual.

  4. Considering the same class of difficult functions described in [Citation18], we derive a lower complexity bound of $O(\epsilon^{-2/[3(p+\nu)-2]})$ iterations (in terms of the initial distance to the optimal set), and a lower complexity bound of $O(\epsilon^{-2(p+\nu)/[3(p+\nu)-2]})$ iterations (in terms of the initial functional residual), for p-order tensor methods to find ε-approximate stationary points of convex functions with ν-Hölder continuous pth derivatives. These bounds generalize the corresponding bounds given in [Citation6] for first-order methods ($p = \nu = 1$).

The paper is organized as follows. In Section 2, we define our problem. In Section 3, we present complexity results for tensor schemes without acceleration. In Section 4, we present complexity results for accelerated schemes. In Section 5, we analyse tensor schemes for the composite minimization problem. Finally, in Section 6, we establish lower complexity bounds for tensor methods to find ε-approximate stationary points of convex functions under the Hölder condition. Some auxiliary results are given in the Appendix.

1.2. Notations and generalities

Let $\mathbb{E}$ be a finite-dimensional real vector space, and $\mathbb{E}^*$ be its dual space. We denote by $\langle s, x\rangle$ the value of the linear functional $s \in \mathbb{E}^*$ at point $x \in \mathbb{E}$. Spaces $\mathbb{E}$ and $\mathbb{E}^*$ are equipped with conjugate Euclidean norms:
\[
\|x\| = \langle Bx, x\rangle^{1/2}, \; x \in \mathbb{E}, \qquad \|s\|_* = \langle s, B^{-1}s\rangle^{1/2}, \; s \in \mathbb{E}^*, \tag{1}
\]
where $B: \mathbb{E} \to \mathbb{E}^*$ is a self-adjoint positive definite operator ($B \succ 0$). For a smooth function $f: \mathbb{E} \to \mathbb{R}$, we denote by $\nabla f(x)$ its gradient and by $\nabla^2 f(x)$ its Hessian evaluated at point $x \in \mathbb{E}$. Then $\nabla f(x) \in \mathbb{E}^*$ and $\nabla^2 f(x)h \in \mathbb{E}^*$ for $x, h \in \mathbb{E}$.

For any integer $p \ge 1$, denote by $D^p f(x)[h_1, \ldots, h_p]$ the directional derivative of function f at x along directions $h_i \in \mathbb{E}$, $i = 1, \ldots, p$. For any $x \in \operatorname{dom} f$ and $h_1, h_2 \in \mathbb{E}$ we have
\[
Df(x)[h_1] = \langle \nabla f(x), h_1\rangle \quad \text{and} \quad D^2 f(x)[h_1, h_2] = \langle \nabla^2 f(x)h_1, h_2\rangle.
\]
If $h_1 = \cdots = h_p = h \in \mathbb{E}$, we denote $D^p f(x)[h_1, \ldots, h_p]$ by $D^p f(x)[h]^p$. Using this notation, the pth order Taylor approximation of function f at $x \in \mathbb{E}$ can be written as follows:
\[
f(x+h) = \Phi_{x,p}(x+h) + o(\|h\|^p), \tag{2}
\]
where
\[
\Phi_{x,p}(y) \equiv f(x) + \sum_{i=1}^{p} \frac{1}{i!}\, D^i f(x)[y-x]^i, \quad y \in \mathbb{E}. \tag{3}
\]
Since $D^p f(x)[\cdot]$ is a symmetric p-linear form, its norm is defined as
\[
\|D^p f(x)\| = \max_{h_1, \ldots, h_p} \Big\{ D^p f(x)[h_1, \ldots, h_p] \,:\, \|h_i\| \le 1,\ i = 1, \ldots, p \Big\}.
\]
It can be shown (see, e.g. Appendix 1 in [Citation30]) that
\[
\|D^p f(x)\| = \max_{h} \Big\{ \big|D^p f(x)[h]^p\big| \,:\, \|h\| \le 1 \Big\}.
\]
Similarly, since $D^p f(x)[\cdot,\ldots,\cdot] - D^p f(y)[\cdot,\ldots,\cdot]$ is also a symmetric p-linear form for fixed $x, y \in \mathbb{E}$, it follows that
\[
\|D^p f(x) - D^p f(y)\| = \max_{h} \Big\{ \big|D^p f(x)[h]^p - D^p f(y)[h]^p\big| \,:\, \|h\| \le 1 \Big\}.
\]
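As a quick numerical illustration of the conjugate norms in (1), the following Python sketch builds an arbitrary positive definite operator B (an example of our own choosing, not from the paper) and checks the Cauchy–Schwarz-type inequality $\langle s, x\rangle \le \|s\|_*\|x\|$ for random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# A self-adjoint positive definite operator B (example choice).
n = 5
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)

def primal_norm(x, B):
    """||x|| = <Bx, x>^{1/2} on E."""
    return np.sqrt(x @ B @ x)

def dual_norm(s, B):
    """||s||_* = <s, B^{-1} s>^{1/2} on E^*."""
    return np.sqrt(s @ np.linalg.solve(B, s))

x = rng.standard_normal(n)
s = rng.standard_normal(n)
# Conjugacy of the two norms: <s, x> <= ||s||_* ||x||.
assert s @ x <= dual_norm(s, B) * primal_norm(x, B) + 1e-12
print(primal_norm(x, B), dual_norm(s, B))
```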

2. Problem statement

In this paper, we consider methods for solving the following minimization problem
\[
\min_{x \in \mathbb{E}} f(x), \tag{4}
\]
where $f: \mathbb{E} \to \mathbb{R}$ is a convex, p-times differentiable function. We assume that (4) has at least one optimal solution $x^* \in \mathbb{E}$. As in [Citation18], the level of smoothness of the objective f will be characterized by the family of Hölder constants
\[
H_{f,p}(\nu) \equiv \sup_{x, y \in \mathbb{E},\, x \neq y} \frac{\|D^p f(x) - D^p f(y)\|}{\|x - y\|^{\nu}}, \quad 0 \le \nu \le 1. \tag{5}
\]
From (5), it can be shown that, for all $x, y \in \mathbb{E}$,
\[
|f(y) - \Phi_{x,p}(y)| \le \frac{H_{f,p}(\nu)}{p!} \|y - x\|^{p+\nu}, \tag{6}
\]
\[
\|\nabla f(y) - \nabla \Phi_{x,p}(y)\|_* \le \frac{H_{f,p}(\nu)}{(p-1)!} \|y - x\|^{p+\nu-1}, \tag{7}
\]
and
\[
\|\nabla^2 f(y) - \nabla^2 \Phi_{x,p}(y)\| \le \frac{H_{f,p}(\nu)}{(p-2)!} \|y - x\|^{p+\nu-2}. \tag{8}
\]
Given $x \in \mathbb{E}$, if $H_{f,p}(\nu) < +\infty$ and $H \ge H_{f,p}(\nu)$, by (6) we have
\[
f(y) \le \Phi_{x,p}(y) + \frac{H}{p!} \|y - x\|^{p+\nu}, \quad \forall y \in \mathbb{E}. \tag{9}
\]
This property motivates the use of the following class of models of f around $x \in \mathbb{E}$:
\[
\Omega_{x,p,H}^{(\alpha)}(y) = \Phi_{x,p}(y) + \frac{H}{p!} \|y - x\|^{p+\alpha}, \quad \alpha \in [0,1]. \tag{10}
\]
Note that, by (9), if $H \ge H_{f,p}(\nu)$ then $f(y) \le \Omega_{x,p,H}^{(\nu)}(y)$ for all $y \in \mathbb{E}$.
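To make the model (10) concrete, here is a minimal Python sketch, using a one-dimensional softplus example of our own choosing (not from the paper), that builds $\Phi_{x,2}$ and $\Omega_{x,2,H}^{(1)}$ and checks the majorization property (9) on a grid.

```python
import math
import numpy as np

# 1-D illustration of (9)-(10): f(x) = log(1 + e^x) (softplus) is convex with
# a Lipschitz continuous second derivative, so we take p = 2 and nu = 1.
f   = lambda x: np.logaddexp(0.0, x)
df  = lambda x: 1.0 / (1.0 + np.exp(-x))   # f'
d2f = lambda x: df(x) * (1.0 - df(x))      # f''

H = 0.1  # any H >= H_{f,2}(1) = sup|f'''| (~0.096 for softplus) will do

def phi(xc, y):
    """Second-order Taylor approximation Phi_{x,2}(y) from (3)."""
    h = y - xc
    return f(xc) + df(xc) * h + 0.5 * d2f(xc) * h ** 2

def omega(xc, y, H, p=2, alpha=1.0):
    """Regularized model Omega_{x,p,H}^{(alpha)}(y) from (10)."""
    return phi(xc, y) + (H / math.factorial(p)) * np.abs(y - xc) ** (p + alpha)

xc = 0.3
ys = np.linspace(-4.0, 4.0, 2001)
# Majorization property (9): f(y) <= Omega_{x,2,H}^{(1)}(y) for all y.
assert np.all(f(ys) <= omega(xc, ys, H) + 1e-12)
```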

3. Tensor schemes without acceleration

Let us consider the following assumption:

H1

$H_{f,p}(\nu) < +\infty$ for some $\nu \in [0,1]$.

Regarding the smoothness parameter ν, there are only two possible situations: either ν is known, or ν is unknown. In order to cover both cases in a single framework, as in [Citation18], we shall consider the parameter
\[
\alpha = \begin{cases} \nu, & \text{if } \nu \text{ is known}, \\ 1, & \text{if } \nu \text{ is unknown}. \end{cases} \tag{11}
\]

Remark 3.1

If ν is unknown, by (Equation11) we set α=1 in Algorithm 1. The resulting algorithm is a universal scheme that can be viewed as a generalization of the universal second-order method (6.10) in [Citation16]. Moreover, it is worth mentioning that for p = 3 and α=ν, one can use Gradient Methods with Bregman distance [Citation19] to approximately solve (12) in the sense of (13).

For both cases (ν known or unknown), Algorithm 1 is a particular instance of Algorithm 1 in [Citation18] in which $M_t = 2^{i_t} H_t$ for all $t \ge 0$. Let us define the following function of ε:
\[
N_{\nu}(\epsilon) = \begin{cases} \max\left\{ \dfrac{3H_{f,p}(\nu)}{2},\, 3\theta(p-1)! \right\}, & \text{if } \nu \text{ is known}, \\[3mm] \max\left\{ \theta,\, \left(\dfrac{3H_{f,p}(\nu)}{2}\right)^{\frac{p}{p+\nu-1}} 4^{\frac{1-\nu}{p+\nu-1}}\, \epsilon^{-\frac{1-\nu}{p+\nu-1}} \right\}, & \text{if } \nu \text{ is unknown}. \end{cases} \tag{15}
\]
The next lemma provides upper bounds on $M_t$ and on the number of calls of the oracle in Algorithm 1.

Lemma 3.1

Suppose that H1 holds. Given $\epsilon \in (0,1)$, assume that $\{x_t\}_{t=0}^{T}$ is a sequence generated by Algorithm 1 such that
\[
\|\nabla f(x_t)\|_* > \epsilon, \quad t = 0, \ldots, T. \tag{16}
\]
Then,
\[
H_t \le \max\{H_0, N_{\nu}(\epsilon)\}, \quad \text{for } t = 0, \ldots, T, \tag{17}
\]
and, consequently,
\[
M_t = 2^{i_t} H_t \le 2\max\{H_0, N_{\nu}(\epsilon)\}, \quad \text{for } t = 0, \ldots, T-1. \tag{18}
\]
Moreover, the number $O_T$ of calls of the oracle after T iterations is bounded as follows:
\[
O_T \le 2T + \log_2\big(\max\{H_0, N_{\nu}(\epsilon)\}\big) - \log_2 H_0. \tag{19}
\]

Proof.

Let us prove (17) by induction. Clearly, it holds for t = 0. Assume that (17) is true for some t, $0 \le t \le T-1$. If ν is known, then by (11) we have $\alpha = \nu$. Thus, it follows from H1 and Lemma A.2 in [Citation18] that the final value of $2^{i_t} H_t$ cannot be bigger than $2\max\{(3/2)H_{f,p}(\nu), 3\theta(p-1)!\}$, since otherwise the line search would have stopped earlier. Therefore,
\[
H_{t+1} = \tfrac{1}{2}\, 2^{i_t} H_t \le \max\left\{ \frac{3H_{f,p}(\nu)}{2},\, 3\theta(p-1)! \right\} = N_{\nu}(\epsilon) \le \max\{H_0, N_{\nu}(\epsilon)\},
\]
that is, (17) holds for t + 1. On the other hand, if ν is unknown, we have $\alpha = 1$. In view of (16), Corollary A.5 in [Citation18] and $\epsilon \in (0,1)$, we must have
\[
2^{i_t} H_t \le 2\max\left\{ \theta,\, \left(\frac{3H_{f,p}(\nu)}{2}\right)^{\frac{p}{p+\nu-1}} \left(\frac{4}{\epsilon}\right)^{\frac{1-\nu}{p+\nu-1}} \right\} \le 2N_{\nu}(\epsilon).
\]
Consequently, it follows that $H_{t+1} = \tfrac{1}{2}\, 2^{i_t} H_t \le N_{\nu}(\epsilon) \le \max\{H_0, N_{\nu}(\epsilon)\}$, that is, (17) holds for t + 1. This completes the induction argument. Using (17), for $t = 0, \ldots, T-1$ we get $M_t = 2H_{t+1} \le 2\max\{H_0, N_{\nu}(\epsilon)\}$. Finally, note that at the tth iteration of Algorithm 1, the oracle is called $i_t + 1$ times. Since $H_{t+1} = 2^{i_t - 1} H_t$, it follows that $i_t - 1 = \log_2 H_{t+1} - \log_2 H_t$. Thus, by (17) we get
\[
O_T = \sum_{t=0}^{T-1}(i_t + 1) = \sum_{t=0}^{T-1}\big(2 + \log_2 H_{t+1} - \log_2 H_t\big) = 2T + \log_2 H_T - \log_2 H_0 \le 2T + \log_2\big(\max\{H_0, N_{\nu}(\epsilon)\}\big) - \log_2 H_0,
\]
and the proof is complete.
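The doubling update of $H_t$ and the oracle-call count (19) can be mimicked schematically in Python. The acceptance test `accept(H)` below is only a stand-in (an assumption of this illustration) for the sufficient-decrease condition used by Algorithm 1; Lemma A.2 / Corollary A.5 of [Citation18] guarantee acceptance once the trial value exceeds some threshold, which the toy rule imitates.

```python
import math

def adaptive_run(H0, T, accept):
    """Schematic doubling line search: at iteration t, try H, 2H, 4H, ...
    until accept(.) holds; then set H_{t+1} to half of the accepted value.
    Returns the final H and the total number of oracle calls."""
    H, calls = H0, 0
    for _ in range(T):
        i = 0
        while True:
            calls += 1                      # one oracle call per trial value
            if accept(2 ** i * H):
                break
            i += 1
        H = (2 ** i * H) / 2.0              # H_{t+1} = 2^{i_t - 1} H_t
    return H, calls

# Toy acceptance rule (assumed): accept once the trial exceeds a threshold N.
N, H0, T = 37.0, 1.0, 50
H_T, calls = adaptive_run(H0, T, accept=lambda M: M >= N)
# Bound (19): O_T <= 2T + log2(max{H0, N}) - log2(H0).
assert calls <= 2 * T + math.log2(max(H0, N)) - math.log2(H0)
```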

Let us consider the additional assumption:

H2

The level sets of f are bounded, that is, $\max_{x \in \mathcal{L}(x_0)} \|x - x^*\| \le D_0 \in (1, +\infty)$ for $\mathcal{L}(x_0) \equiv \{x \in \mathbb{E} : f(x) \le f(x_0)\}$, with $x_0$ being the starting point.

The next theorem gives global convergence rates for Algorithm 1 in terms of the functional residual.

Theorem 3.2

Suppose that H1 and H2 are true and let $\{x_t\}_{t=0}^{T}$ be a sequence generated by Algorithm 1 such that, for $t = 0, \ldots, T$, we have $\|\nabla f(x_{t,i}^{+})\|_* > \epsilon$, $i = 0, \ldots, i_t$. Let m be the first iteration number such that
\[
f(x_m) - f(x^*) \le 4\,[8(p+1)!]^{p+\alpha-1} \max\{H_0, N_{\nu}(\epsilon)\}\, D_0^{p+\alpha},
\]
and assume that m < T. Then
\[
m \le \frac{1}{\ln\frac{p+\alpha}{p+\alpha-1}} \ln\left( \max\left\{ 1,\, \log_2 \frac{f(x_0) - f(x^*)}{2\,[8(p+1)!]^{p+\alpha-1} \max\{H_0, N_{\nu}(\epsilon)\}\, D_0^{p+\alpha}} \right\} \right) \tag{20}
\]
and, for all k, $m < k \le T$, we have
\[
f(x_k) - f(x^*) \le \frac{[24p(p+1)!]^{p+\alpha-1}\, 2\max\{H_0, N_{\nu}(\epsilon)\}\, D_0^{p+\alpha}}{(k-m)^{p+\alpha-1}}. \tag{21}
\]

Proof.

By Lemma 3.1, this result follows from Theorem 3.2 in [Citation18] with $M_{\nu} = 2\max\{H_0, N_{\nu}(\epsilon)\}$.

Now, we can derive global convergence rates for Algorithm 1 in terms of the norm of the gradient.

Theorem 3.3

Under the same assumptions of Theorem 3.2, if T = m + 3s for some $s \ge 1$, then
\[
g_T \equiv \min_{0 \le t \le T} \|\nabla f(x_t)\|_* \le 2\left( \frac{288\,p\,(p+1)!\,D_0}{T-m} \right)^{p+\alpha-1} \max\{H_0, N_{\nu}(\epsilon)\}. \tag{22}
\]
Consequently,
\[
T < m + \kappa_1(\nu)\,[288\,p\,(p+1)!\,D_0]\, \epsilon^{-\frac{1}{p+\nu-1}}, \tag{23}
\]
with
\[
\kappa_1(\nu) = \begin{cases} \left[ 2\max\left\{ H_0,\, \dfrac{3H_{f,p}(\nu)}{2},\, 3\theta(p-1)! \right\} \right]^{\frac{1}{p+\nu-1}}, & \text{if } \nu \text{ is known}, \\[3mm] \left[ 2\max\left\{ H_0,\, \theta,\, \left(\dfrac{3H_{f,p}(\nu)}{2}\right)^{\frac{p}{p+\nu-1}} 4^{\frac{1-\nu}{p+\nu-1}} \right\} \right]^{\frac{1}{p}}, & \text{if } \nu \text{ is unknown}. \end{cases}
\]

Proof.

By Theorem 3.2, we have
\[
f(x_k) - f(x^*) \le \frac{[24p(p+1)!]^{p+\alpha-1}\, 2\max\{H_0, N_{\nu}(\epsilon)\}\, D_0^{p+\alpha}}{(k-m)^{p+\alpha-1}}, \tag{24}
\]
for all k, $m < k \le T$. In particular, it follows from (14) and (24) that
\[
\frac{[24p(p+1)!]^{p+\alpha-1}\, 2\max\{H_0, N_{\nu}(\epsilon)\}\, D_0^{p+\alpha}}{(2s)^{p+\alpha-1}} \ge f(x_{m+2s}) - f(x^*) = f(x_T) - f(x^*) + \sum_{k=m+2s}^{T-1}\big(f(x_k) - f(x_{k+1})\big) \ge \frac{s}{8(p+1)!} \left[ \frac{1}{2\max\{H_0, N_{\nu}(\epsilon)\}} \right]^{\frac{1}{p+\alpha-1}} (g_T)^{\frac{p+\alpha}{p+\alpha-1}}.
\]
Therefore,
\[
(g_T)^{\frac{p+\alpha}{p+\alpha-1}} \le \frac{8(p+1)!\,[24p(p+1)!]^{p+\alpha-1}\,\big[2\max\{H_0, N_{\nu}(\epsilon)\}\big]^{\frac{p+\alpha}{p+\alpha-1}}\, D_0^{p+\alpha}}{2^{p+\alpha-1}\, s^{p+\alpha}} \le \frac{(96p)^{p+\alpha-1}\, 3^{p+\alpha}\, [(p+1)!]^{p+\alpha}\, \big[2\max\{H_0, N_{\nu}(\epsilon)\}\big]^{\frac{p+\alpha}{p+\alpha-1}}\, D_0^{p+\alpha}}{(T-m)^{p+\alpha}},
\]
and so (22) holds. By assumption, we have $g_T > \epsilon$. Thus, by (22) we get
\[
\epsilon < 2\left( \frac{288\,p\,(p+1)!\,D_0}{T-m} \right)^{p+\alpha-1} \max\{H_0, N_{\nu}(\epsilon)\} \;\Longrightarrow\; T < m + [288\,p\,(p+1)!\,D_0]\,\big[2\max\{H_0, N_{\nu}(\epsilon)\}\big]^{\frac{1}{p+\alpha-1}}\, \epsilon^{-\frac{1}{p+\alpha-1}}. \tag{25}
\]
Finally, by analysing separately the cases in which ν is known and unknown, it follows from (25) and (15) that (23) is true.

Remark 3.2

Suppose that the objective f in (4) is nonconvex and bounded from below by $f^*$. Then, it follows from (14) and (18) that
\[
f(x_t) - f(x_{t+1}) \ge \frac{1}{8(p+1)!} \left[ \frac{1}{2\max\{H_0, N_{\nu}(\epsilon)\}} \right]^{\frac{1}{p+\alpha-1}} \epsilon^{\frac{p+\alpha}{p+\alpha-1}}, \quad t = 0, \ldots, T-1.
\]
Summing up these inequalities, we get
\[
f(x_0) - f^* \ge f(x_0) - f(x_T) \ge \frac{T}{8(p+1)!} \left[ \frac{1}{2\max\{H_0, N_{\nu}(\epsilon)\}} \right]^{\frac{1}{p+\alpha-1}} \epsilon^{\frac{p+\alpha}{p+\alpha-1}},
\]
and so, by (15), we obtain $T \le O\big(\epsilon^{-\frac{p+\nu}{p+\nu-1}}\big)$. This bound generalizes the bound of $O\big(\epsilon^{-\frac{2+\nu}{1+\nu}}\big)$ proved in [Citation16] for p = 2. It agrees in order with the complexity bounds proved in [Citation23] and [Citation9] for different universal tensor methods.

4. Accelerated tensor schemes

The schemes presented here generalize the procedures described in [Citation26] for p = 1 and p = 2. Specifically, our general scheme is obtained by adding Step 2 of Algorithm 1 at the end of Algorithm 4 in [Citation18], in order to relate the functional decrease with the norm of the gradient of f at suitable points:

Let us define the following function of ε:
\[
\tilde{N}_{\nu}(\epsilon) = \begin{cases} (p+\nu-1)\big(H_{f,p}(\nu) + \theta(p-1)!\big), & \text{if } \nu \text{ is known}, \\[2mm] \max\left\{ 4\theta(p-1)!,\; \big(4H_{f,p}(\nu)\big)^{\frac{p}{p+\nu-1}} \left(\dfrac{4}{\epsilon}\right)^{\frac{1-\nu}{p+\nu-1}} \right\}, & \text{if } \nu \text{ is unknown}. \end{cases} \tag{27}
\]
In Algorithm 2, note that $\{x_t\}$ is independent of $\{z_t\}$. The next theorem establishes global convergence rates for the functional residual with respect to $\{x_t\}$.

Theorem 4.1

Assume that H1 holds and let the sequence $\{x_t\}_{t=0}^{T}$ be generated by Algorithm 2 such that, for $t = 0, \ldots, T$ we have
\[
\|\nabla f(x_{t,i}^{+})\|_* > \epsilon, \quad i = 0, \ldots, i_t. \tag{28}
\]
Then,
\[
f(x_t) - f(x^*) \le \frac{2^{3p}\max\{\tilde{H}_0, \tilde{N}_{\nu}(\epsilon)\}\,(p+\alpha)^{p+\alpha-1}\, \|x_0 - x^*\|^{p+\alpha}}{(p-1)!\,(t-1)^{p+\alpha}}, \tag{29}
\]
for $t = 2, \ldots, T$.

Proof.

As in the proof of Lemma 3.1, it follows from (28), (27) and Lemmas A.6 and A.7 in [Citation18] that
\[
\tilde{H}_t \le \max\{\tilde{H}_0, \tilde{N}_{\nu}(\epsilon)\}, \quad t = 0, \ldots, T,
\]
which gives
\[
\tilde{M}_t = 2^{i_t}\tilde{H}_t = 2 \cdot 2^{i_t-1}\tilde{H}_t = 2\tilde{H}_{t+1} \le 2\max\{\tilde{H}_0, \tilde{N}_{\nu}(\epsilon)\}, \quad t = 0, \ldots, T-1.
\]
Then, (29) follows directly from Theorem 4.3 in [Citation18] with $M_{\nu} = 2\max\{\tilde{H}_0, \tilde{N}_{\nu}(\epsilon)\}$.

Now we can obtain global convergence rates for Algorithm 2 in terms of the norm of the gradient.

Theorem 4.2

Suppose that H1 holds and let sequences $\{x_t\}_{t=0}^{T}$ and $\{z_t\}_{t=0}^{T}$ be generated by Algorithm 2. Assume that, for $t = 0, \ldots, T$, we have
\[
\min\left\{ \|\nabla f(x_{t,i}^{+})\|_*,\, \|\nabla f(z_{t,j}^{+})\|_* \right\} > \epsilon, \quad i = 0, \ldots, i_t,\; j = 0, \ldots, j_t. \tag{30}
\]
If T = 2s for some s > 1, then
\[
g_T \equiv \min_{0 \le k \le T} \|\nabla f(z_k)\|_* \le C_{\nu}(\epsilon)\, \|x_0 - x^*\|^{p+\alpha-1} \left( \frac{p+1}{T-2} \right)^{\frac{(p+\alpha-1)(p+\alpha+1)}{p+\alpha}}, \tag{31}
\]
where
\[
C_{\nu}(\epsilon) = \left[ 2(4p+6)\max\{\tilde{H}_0, \tilde{N}_{\nu}(\epsilon)\}\, \big[\max\{H_0, N_{\nu}(\epsilon)\}\big]^{\frac{1}{p+\alpha-1}} \right]^{\frac{p+\alpha-1}{p+\alpha}},
\]

with $N_{\nu}(\epsilon)$ and $\tilde{N}_{\nu}(\epsilon)$ defined in (15) and (27), respectively. Consequently,
\[
T \le 2 + 2(4p+6)(p+1)\max\left\{ 1,\, \tilde{H}_0,\, H_0,\, 3p\big(H_{f,p}(\nu)+\theta(p-1)!\big) \right\} \|x_0 - x^*\|^{\frac{p+\nu}{p+\nu+1}}\, \epsilon^{-\frac{p+\nu}{(p+\nu-1)(p+\nu+1)}}, \tag{32}
\]
if ν is known (i.e. $\alpha = \nu$), and

\[
T < 2 + 2(2p+3)(p+1)\left[ \max\left\{ 1,\, \tilde{H}_0,\, H_0,\, 4\big(4H_{f,p}(\nu)\big)^{\frac{p}{p+\nu-1}}+\theta(p-1)! \right\} \right]^{\frac{1}{2}} \|x_0 - x^*\|^{\frac{p+1}{p+2}}\, \epsilon^{-\frac{p+1}{(p+\nu-1)(p+2)}}, \tag{33}
\]

if ν is unknown (i.e. $\alpha = 1$).

Proof.

By Theorem 4.1, we have (34) f(xt)f(x)23pmaxH~0,N~ν(ϵ)(p+α)p+α1x0xp+α(p1)!(t1)p+α(34) for t=2,,T. On the other hand, as in Lemma 3.1, by (Equation30) we get 2jtHt2maxH0,Nν(ϵ),t=0,,T1, where Nν(ϵ) is defined in (Equation15). Then, in view of (26), it follows that (35) f(z¯t)f(zt+1)18(p+1)!12maxH0,Nν(ϵ)1p+α1f(zt+1)p+αp+α1,(35) for t=0,,T1. In particular, f(zt+1)f(z¯t) for t=0,,T1. Moreover, by the definition of z¯t, we get f(z¯t)f(xt+1) and f(z¯t)f(zt). Therefore (36) f(zt)f(xt),t=0,,T,(36) and (37) f(zt)f(zt+1)18(p+1)!12maxH0,Nν(ϵ)1p+α1f(zt+1)p+αp+α1.(37) Now, since T = 2s, summing up (Equation37), we get 23pmaxH~0,N~ν(ϵ)(p+α)p+α1x0xp+α(p1)!(s1)p+αf(xs)f(x)f(zs)f(x)=f(zT)f(x)+k=sT1(f(zk)f(zk+1))(s1)8(p+1)!12maxH0,Nν(ϵ)1p+α1(gT)p+αp+α1. Thus, (gT)p+αp+α12(4p+6)maxH~0,N~ν(ϵ)maxH0,Nν(ϵ)1p+α1(p+1)p+α+1x0xp+α(T2)p+α+1, and so (Equation31) holds. By assumption, we have gT>ϵ. Thus, it follows from (Equation31) that (38) ϵ<Cν(ϵ)x0xp+α1p+1T2(p+α1)(p+α+1)(p+α)T<2+(p+1)Cν(ϵ)ϵp+α(p+α1)(p+α+1)x0xp+αp+α+1.(38) If ν is known, by (Equation15) and (Equation27) we have max{Nν(ϵ),N~ν(ϵ)}3p(Hf,p(ν)+θ(p1)!). Then, maxH~0,N~ν(ϵ)maxH0,Nν(ϵ)1p+ν1maxH~0,H0,3pHf,p(ν)+θ(p1)!p+νp+ν1, and so (39) Cν(ϵ)2(4p+6)maxH~0,H0,3pHf,p(ν)+θ(p1)!.(39) Combining (Equation38), (Equation39) and p+ν(p+ν1)(p+ν+1)1, we obtain (Equation32). If ν is unknown, it follows from (Equation15) and (Equation27) that maxNν(ϵ),N~ν(ϵ)4(4Hf,p(ν))pp+ν1+θ(p1)!ϵ1νp+ν1. Then, maxH~0,N~ν(ϵ)maxH0,Nν(ϵ)1pmaxH~0,H0,4(4Hf,p(ν))pp+ν1+θ(p1)!ϵ1νp+ν1p+1p, and so (40) Cν(ϵ)2(4p+6)maxH~0,H0,4(4Hf,p(ν))pp+ν1+θ(p1)!ϵ1νp+ν1.(40) Combining (Equation38), (Equation40) and p+1p(p+2)<12, we obtain (Equation33).

Remark 4.1

When $\nu = 1$, bounds (32) and (33) have the same dependence on ε. However, when $\nu < 1$, the bound of $O(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]})$ obtained for the universal scheme (i.e. $\alpha = 1$) is worse than the bound of $O(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]})$ obtained for the non-universal scheme (i.e. $\alpha = \nu$). In both cases, these complexity bounds are better than the bound of $O(\epsilon^{-1/(p+\nu-1)})$ proved for Algorithm 1.
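A quick way to see Remark 4.1 numerically: the following Python sketch compares the three ε-exponents for a few values of p and ν using exact rational arithmetic (nothing here is paper-specific beyond the exponents themselves).

```python
from fractions import Fraction

def exponents(p, nu):
    """Negative powers of epsilon in the three bounds discussed above."""
    non_acc   = Fraction(1) / (p + nu - 1)                         # O(eps^{-1/(p+nu-1)})
    acc_known = Fraction(p + nu) / ((p + nu - 1) * (p + nu + 1))   # accelerated, nu known
    acc_univ  = Fraction(p + 1) / ((p + nu - 1) * (p + 2))         # universal (alpha = 1)
    return non_acc, acc_known, acc_univ

for p in (2, 3, 4):
    for nu in (Fraction(1, 2), Fraction(1)):
        na, ak, au = exponents(p, nu)
        # Smaller exponent = better bound; for nu < 1 the universal scheme is worse.
        print(p, float(nu), float(na), float(ak), float(au))
        assert ak <= au <= na
```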

5. Composite minimization

From now on, we will assume that ν and $H_{f,p}(\nu)$ are known. In this setting, we can consider the composite minimization problem
\[
\min_{x \in \mathbb{E}} \tilde{f}(x) \equiv f(x) + \phi(x), \tag{41}
\]
where $f: \mathbb{E} \to \mathbb{R}$ is a convex function satisfying H1 (see page 5), and $\phi: \mathbb{E} \to \mathbb{R} \cup \{+\infty\}$ is a simple closed convex function whose effective domain has nonempty relative interior, that is, $\operatorname{ri}(\operatorname{dom}\phi) \neq \emptyset$. We assume that there exists at least one optimal solution $x^* \in \mathbb{E}$ of (41). By (6), if $H \ge H_{f,p}(\nu)$ we have $\tilde{f}(y) \le \Omega_{x,p,H}^{(\nu)}(y) + \phi(y)$ for all $y \in \mathbb{E}$. This motivates the following class of models of $\tilde{f}$ around a fixed point $x \in \mathbb{E}$:
\[
\tilde{\Omega}_{x,p,H}^{(\nu)}(y) \equiv \Omega_{x,p,H}^{(\nu)}(y) + \phi(y), \tag{42}
\]
where $\Omega_{x,p,H}^{(\nu)}(\cdot)$ is defined in (10). The next lemma gives a sufficient condition for function $\Omega_{x,p,H}^{(\nu)}(\cdot)$ to be convex. Its proof is an adaptation of the proof of Theorem 1 in [Citation27].

Lemma 5.1

Suppose that H1 holds for some $p \ge 2$. Then, for any $x, y \in \mathbb{E}$ we have
\[
\nabla^2 f(y) \preceq \nabla^2 \Phi_{x,p}(y) + \frac{H_{f,p}(\nu)}{(p-2)!} \|y - x\|^{p+\nu-2} B. \tag{43}
\]
Moreover, if $H \ge (p-1)H_{f,p}(\nu)$, then function $\Omega_{x,p,H}^{(\nu)}(\cdot)$ is convex for any $x \in \mathbb{E}$.

Proof.

For any $u \in \mathbb{E}$, it follows from (8) that
\[
\big\langle \big(\nabla^2 f(y) - \nabla^2 \Phi_{x,p}(y)\big)u, u \big\rangle \le \|\nabla^2 f(y) - \nabla^2 \Phi_{x,p}(y)\|\, \|u\|^2 \le \frac{H_{f,p}(\nu)}{(p-2)!} \|y - x\|^{p+\nu-2} \|u\|^2.
\]
Since $u \in \mathbb{E}$ is arbitrary, we get (43).

Now, suppose that $H \ge (p-1)H_{f,p}(\nu)$. Then, by (43) we have
\[
0 \preceq \nabla^2 f(y) \preceq \nabla^2 \Phi_{x,p}(y) + \frac{H_{f,p}(\nu)(p-1)}{(p-1)!}\|y-x\|^{p+\nu-2}B \preceq \nabla^2 \Phi_{x,p}(y) + \frac{H}{(p-1)!}\|y-x\|^{p+\nu-2}B \preceq \nabla^2 \Phi_{x,p}(y) + \frac{H(p+\nu)}{p!}\|y-x\|^{p+\nu-2}B \preceq \nabla^2 \Phi_{x,p}(y) + \nabla^2\left( \frac{H}{p!}\|y-x\|^{p+\nu} \right) = \nabla^2 \Omega_{x,p,H}^{(\nu)}(y).
\]
Therefore, $\Omega_{x,p,H}^{(\nu)}(\cdot)$ is convex.
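The convexification effect in Lemma 5.1 is easy to see in one dimension. The Python sketch below uses a softplus example with p = 3 and ν = 1 (our own choice; the value $H_{f,3}(1) = 1/8$ is computed by hand for this particular f and is an assumption of the illustration). It shows that the Taylor model $\Phi_{x,3}$ is nonconvex while $\Omega_{x,3,H}^{(1)}$ with $H = (p-1)H_{f,3}(1)$ has a nonnegative second derivative on a grid.

```python
import numpy as np

# 1-D illustration of Lemma 5.1 with p = 3, nu = 1 and f(x) = log(1 + e^x).
sig = lambda x: 1.0 / (1.0 + np.exp(-x))
d2f = lambda x: sig(x) * (1.0 - sig(x))            # f''
d3f = lambda x: d2f(x) * (1.0 - 2.0 * sig(x))      # f'''
H_f31 = 0.125   # sup |f''''| for softplus, i.e. H_{f,3}(1) (assumed value)

xc = 1.0
H = (3 - 1) * H_f31   # Lemma 5.1: H >= (p-1) H_{f,p}(nu) makes Omega convex

h = np.linspace(-20.0, 20.0, 4001)
# Second derivatives (in y) of the Taylor model and of the regularized model:
phi3_2nd   = d2f(xc) + d3f(xc) * h                  # Phi''_{x,3}: affine in h -> goes negative
omega3_2nd = phi3_2nd + (H / 6.0) * 12.0 * h ** 2   # adds d^2/dy^2 of (H/3!)|y-x|^4

print(phi3_2nd.min())                  # negative: Phi_{x,3} itself is not convex
assert np.all(omega3_2nd >= -1e-12)    # but Omega_{x,3,H}^{(1)} is convex on the grid
```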

From Lemma 5.1, if $H \ge (p-1)H_{f,p}(\nu)$, it follows that $\tilde{\Omega}_{x,p,H}^{(\nu)}(\cdot)$ is also convex. In this case, since $\operatorname{ri}(\operatorname{dom}\phi) \neq \emptyset$, any solution $x^{+}$ of
\[
\min_{y \in \mathbb{E}} \tilde{\Omega}_{x,p,H}^{(\nu)}(y) \tag{44}
\]
satisfies the first-order optimality condition
\[
0 \in \partial \tilde{\Omega}_{x,p,H}^{(\nu)}(x^{+}) = \partial\big( \Omega_{x,p,H}^{(\nu)} + \phi \big)(x^{+}) = \nabla \Omega_{x,p,H}^{(\nu)}(x^{+}) + \partial \phi(x^{+}). \tag{45}
\]
Therefore, there exists $g_{\phi}(x^{+}) \in \partial \phi(x^{+})$ such that
\[
\nabla \Omega_{x,p,H}^{(\nu)}(x^{+}) + g_{\phi}(x^{+}) = 0. \tag{46}
\]
Instead of solving (44) exactly, in our algorithms we consider inexact solutions $x^{+}$ such that (see Note 1)
\[
\tilde{\Omega}_{x,p,H}^{(\nu)}(x^{+}) \le \tilde{f}(x) \quad \text{and} \quad \|\nabla \Omega_{x,p,H}^{(\nu)}(x^{+}) + g_{\phi}(x^{+})\|_* \le \theta \|x^{+} - x\|^{p+\nu-1}, \tag{47}
\]
for some $g_{\phi}(x^{+}) \in \partial \phi(x^{+})$ and $\theta \ge 0$. For such points $x^{+}$, we define
\[
\tilde{f}'(x^{+}) \equiv \nabla f(x^{+}) + g_{\phi}(x^{+}), \tag{48}
\]
with $g_{\phi}(x^{+})$ satisfying (47). Clearly, we have $\tilde{f}'(x^{+}) \in \partial \tilde{f}(x^{+})$.

Lemma 5.2

Suppose that H1 holds and let $x^{+}$ be an approximate solution of (44) such that (47) holds for some $x \in \mathbb{E}$. If
\[
H \ge \max\left\{ pH_{f,p}(\nu),\, 3\theta(p-1)! \right\}, \tag{49}
\]
then
\[
\tilde{f}(x) - \tilde{f}(x^{+}) \ge \frac{1}{8(p+1)!} \left( \frac{1}{H} \right)^{\frac{1}{p+\nu-1}} \|\tilde{f}'(x^{+})\|_*^{\frac{p+\nu}{p+\nu-1}}. \tag{50}
\]

Proof.

By (Equation48), (Equation7), (Equation10), (Equation47) and (Equation49) we have (51) f~(x+)=f(x+)+gϕ(x+)f(x+)Φx,p(x+)+Φx,p(x+)Ωx,p,H(ν)(x+)+Ωx,p,H(ν)(x+)+gϕ(x+)Hf,p(ν)(p1)!+H(p+ν)p!+θx+xp+ν12Hx+xp+ν1,(51) where the last inequality is due to p2. On the other hand, by (Equation6), (Equation42), (Equation49), we have f~(x+)Ω~x,p,Hf,p(ν)(ν)(x+)=Φx,p(x+)+Hf,p(ν)p!x+xp+ν+ϕ(x+)=Φx,p(x+)+Hp!x+xp+ν(HHf,p(ν))p!x+xp+ν+ϕ(x+)=Ω~x,p,H(ν)(x+)HHf,p(ν)p!x+xp+νf~(x+)HHf,p(ν)p!x+xp+ν. Note that HpHf,p(ν)(p+1p)Hf,p(ν). Thus, (52) f~(x)f~(x+)HHf,p(ν)p!x+xp+νH1p+1Hp!x+xp+ν=H(p+1)!x+xp+ν.(52) Finally, combining (Equation51) and (Equation52), we get (Equation50).

In this composite context, let us consider the following scheme:

For p = 3, the point $x_{t+1}$ in Step 1 can be computed by Algorithm 2 in [Citation19], which is linearly convergent. As far as we know, the development of efficient algorithms to approximately solve (44) with the model (42) for p > 3 is still an open problem.

Theorem 5.3

Suppose that H1 holds and that $\tilde{f}$ is bounded from below by $\tilde{f}^*$. Given $\epsilon \in (0,1)$, assume that $\{x_t\}_{t=0}^{T}$ is a sequence generated by Algorithm 3 such that $\|\tilde{f}'(x_t)\|_* > \epsilon$ for $t = 0, \ldots, T$. Then,
\[
T \le 8(p+1)!\, M^{\frac{1}{p+\nu-1}}\, \frac{\tilde{f}(x_0) - \tilde{f}^*}{\epsilon^{\frac{p+\nu}{p+\nu-1}}}. \tag{53}
\]

Proof.

By Lemma 5.2, bound (53) follows as in Remark 3.2.

5.1. Extended accelerated scheme

Let us consider the following variant of Algorithm 2 for composite minimization:

The next theorem gives the global convergence rate for Algorithm 4 in terms of the norm of the gradient. Its proof is a direct adaptation of the proof of Theorem 4.2.

Theorem 5.4

Suppose that H1 holds. Assume that $\{z_t\}_{t=0}^{T}$ is a sequence generated by Algorithm 4 such that
\[
\|\tilde{f}'(z_t)\|_* > \epsilon, \quad t = 0, \ldots, T. \tag{56}
\]
If T = 2s for some s > 1, then
\[
\tilde{g}_T \equiv \min_{0 \le k \le T} \|\tilde{f}'(z_k)\|_* \le \frac{[24(p+1)]^{\frac{p+\nu-1}{p+\nu}}\, M\, \|x_0 - x^*\|^{p+\nu-1}}{(T-2)^{\frac{(p+\nu-1)(p+\nu+1)}{p+\nu}}}. \tag{57}
\]
Consequently,
\[
T \le 2 + [24(p+1)]^{\frac{1}{p+\nu+1}}\, \|x_0 - x^*\|^{\frac{p+\nu}{p+\nu+1}} \left( \frac{M}{\epsilon} \right)^{\frac{p+\nu}{(p+\nu-1)(p+\nu+1)}}. \tag{58}
\]

Proof.

In view of Theorem A.2, we have (59) f~(xt)f~(x)23p1M(p+ν)p+ν1x0xp+ν(p1)!(t1)p+ν,(59) for t=2,,T. On the other hand, by (55) and Lemma 5.2, we have (60) f~(z¯t)f~(zt+1)18(p+1)!M1p+ν1f~(zt+1)p+νp+ν1,(60) for t=0,,T1. Thus, f(zt+1)f(z¯t)min{f(xt+1),f(zt)} and, consequently, (61) f~(zt)f~(xt),t=0,,T,(61) and (62) f~(zt)f~(zt+1)18(p+1)!M1p+ν1f~(zt+1)p+νp+ν1,t=0,,T1.(62) Since T = 2s, combining (Equation59), (Equation61) and (Equation62), we obtain 23p1M(p+ν)p+ν1x0xp+ν(p1)!(s1)p+νf~(xs)f~(x)f~(zs)f~(x)=f~(zT)f~(x)+k=sT1f~(zt)f~(zt+1)(s1)8(p+1)!M1p+ν1g~Tp+νp+ν1, where g~T=min0kTf~(zt). Therefore, g~Tp+νp+ν124(p+1)Mp+νp+ν1(p+1)p+ν+1x0xp+ν(T2)p+ν+1, which gives (Equation57). Finally, by (Equation56) we have g~T>ϵ. Thus, (Equation58) follows directly from (Equation57).

5.2. Regularization approach

Now, let us consider the ideal situation in which ν, $H_{f,p}(\nu)$ and $R \ge \|x_0 - x^*\|$ are known. In this case, a complexity bound with a better dependence on ε can be obtained by repeatedly applying an accelerated algorithm to a suitable regularization of $\tilde{f}$. Specifically, given $\delta > 0$, consider the regularized problem
\[
\min_{x \in \mathbb{R}^n} \tilde{F}_{\delta}(x) \equiv F_{\delta}(x) + \phi(x), \tag{63}
\]
for
\[
F_{\delta}(x) = f(x) + \frac{\delta}{p+\nu}\, \|x - x_0\|^{p+\nu}. \tag{64}
\]

Lemma 5.5

Given $x_0 \in \mathbb{E}$ and $\nu \in [0,1]$, let $d_{p+\nu}: \mathbb{E} \to \mathbb{R}$ be defined by $d_{p+\nu}(x) = \|x - x_0\|^{p+\nu}$, where $\|\cdot\|$ is the Euclidean norm defined in (1). Then, $\|D^p d_{p+\nu}(x) - D^p d_{p+\nu}(y)\| \le C_{p,\nu} \|x - y\|^{\nu}$ for all $x, y \in \mathbb{E}$, where $C_{p,\nu} = 2\prod_{i=1}^{p}(\nu + i)$.

Proof.

See [Citation32].

As a consequence of the lemma above, we have the following property.

Lemma 5.6

If H1 holds, then the pth derivative of $F_{\delta}(\cdot)$ in (64) is ν-Hölder continuous with constant $H_{F_{\delta},p}(\nu) = H_{f,p}(\nu) + \frac{\delta}{p+\nu} C_{p,\nu}$.
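Lemmas 5.5 and 5.6 combine into a simple formula for the Hölder constant of the regularized objective; the tiny Python sketch below just evaluates it (the numbers $H_{f,p}(\nu) = 1$, p = 3, ν = 1/2, δ = 0.1 are arbitrary example values, not from the paper).

```python
from math import prod

def C(p, nu):
    """Smoothness constant of d_{p+nu}(x) = ||x - x0||^{p+nu} (Lemma 5.5):
    C_{p,nu} = 2 * prod_{i=1..p} (nu + i)."""
    return 2.0 * prod(nu + i for i in range(1, p + 1))

def H_F_delta(H_f, p, nu, delta):
    """Hoelder constant of the regularized objective F_delta (Lemma 5.6)."""
    return H_f + delta / (p + nu) * C(p, nu)

# Example with assumed values H_{f,p}(nu) = 1, p = 3, nu = 1/2, delta = 0.1:
print(C(3, 0.5))                      # 2 * 1.5 * 2.5 * 3.5 = 26.25
print(H_F_delta(1.0, 3, 0.5, 0.1))    # 1 + (0.1 / 3.5) * 26.25 = 1.75
```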

In view of Lemma 5.6, to solve (Equation63) we can use the following instance of Algorithm A1 (see Appendix):

Let us consider the following restart procedure based on Algorithm 5.

Theorem 5.7

Suppose that H1 holds and let $\{u_k\}_{k=0}^{T}$ be a sequence generated by Algorithm 6 such that
\[
\|\tilde{F}'_{\delta}(u_k)\|_* > \frac{\epsilon}{2}, \quad k = 0, \ldots, T. \tag{68}
\]
Then,
\[
T \le 1 + \log_2\left( \frac{32(p+1)!\, H_{\delta}^{\frac{1}{p+\nu-1}}\, \delta\, R^{p+\nu}}{2^{p+\nu-1}(p+\nu)\, \epsilon^{\frac{p+\nu}{p+\nu-1}}} \right). \tag{69}
\]

Proof.

Let xδ=argminxEF~δ(x). By Theorem A.2 and (66), we have (70) F~δ(yk+1)F~δ(xδ)=F~δ(xm(k))F~δ(xδ)23p1Hδ(p+ν)p+ν1x0(k)xδp+ν(p1)!(m1)p+νδ2(p+ν2)2(p+ν)ykxδp+ν.(70) On the other hand, by Lemma 5 in [Citation12] and Lemma 1 in [Citation25], function Fδ() is uniformly convex of degree p+ν with parameter 2(p+ν2). Thus, (71) F~δ(yk+1)F~δ(xδ)δ2(p+ν2)p+νyk+1xδp+ν.(71) Combining (Equation70) and (Equation71), we obtain yk+1xδp+ν12ykxδp+ν, and so (72) ykxδp+ν12ky0xδp+ν=12kx0xδp+ν.(72) Thus, it follows from (Equation70) and (Equation72) that (73) F~δ(yk+1)F~δ(xδ)δ2p+ν1(p+ν)12kx0xδp+ν.(73) In view of Lemma 5.2, by (67) and (65), we get (74) F~δ(yk+1)F~δ(uk+1)18(p+1)!Hδ1p+ν1F~δ(uk+1)p+νp+ν1.(74) Then, combining (Equation73) and (Equation74), it follows that 18(p+1)!Hδ1p+ν1F~δ(uk+1)p+νp+ν1δ2p+ν1(p+ν)12kx0xδp+ν. In particular, for k = T−1, it follows from (Equation68) that (75) 2T132(p+1)!Hδ1p+ν1δx0xδp+ν2p+ν1(p+ν)ϵp+νp+ν1.(75) Since F~δ(xδ)F~δ(x), it follows that x0xδx0xR. Thus, combining this with (Equation75), we get (Equation69).

Corollary 5.8

Suppose that H1 holds and that $R \ge 1$. Then, Algorithm 6 with
\[
\delta = \frac{\epsilon}{2(p+\nu) R^{p+\nu-1}} \tag{76}
\]
performs at most
\[
O\left( \log_2\!\left( \frac{R^{p+\nu-1}}{\epsilon} \right) \left[ \frac{R^{p+\nu-1}}{\epsilon} \right]^{\frac{1}{p+\nu}} \right) \tag{77}
\]
iterations of Algorithm 5 in order to generate $u_T$ such that $\|\tilde{f}'(u_T)\|_* \le \epsilon$.

Proof.

By Theorem 5.7, we can obtain F~δ(uT)ϵ/2 with (78) T2+log232(p+1)!Hδ1p+ν1δRp+ν2p+ν1(p+ν)ϵp+νp+ν1.(78) Moreover, it follows from (65), (Equation76), the definition of HFδ,p(ν) in Lemma 5.6, ϵ(0,1) and R1 that (79) Hδ=pHFδ,p+3θ(p1)!=pHf,p(ν)+δp+νCp,ν+3θ(p1)!=pHf,p(ν)+ϵ2(p+ν)Rp+ν1Cp,ν(p+ν)+3θ(p1)!pHf,p(ν)+Cp,ν+3θ(p1)!.(79) Combining (Equation78), (Equation79) and (Equation76), we have (80) T2+log232(p+1)!pHf,p(ν)+Cp,ν+3θ(p1)!1p+ν1R22(p+ν2)(p+ν)ϵ1p+ν1.(80) At this point uT, we have (81) f~(uT)F~δ(uT)+δp+νdp+ν(uT)ϵ2+δuTx0p+ν1.(81) Since F~δ() is uniformly convex of degree p+ν with parameter 2(p+ν2), it follows from (Equation74) and (Equation73) that δ2(p+ν2)p+νuTxδp+νF~δ(uT)F~δ(xδ)F~δ(yT)F~δ(xδ)δ2p+ν1(p+ν)12T1x0xδp+ν. Therefore, uTxδx0xδ, and so (82) uTx0p+ν1uTxδ+xδx0p+ν12p+ν1x0xδp+ν1(82) (83) 2p+ν1Rp+ν1.(83) Now, combining (Equation81), (Equation83) and (Equation76), we obtain (84) f~(uT)ϵ2+ϵ2=ϵ.(84) The conclusion is obtained by noticing that, for δ given in (Equation76) we have (85) m=1+24p+ν2(p+ν)p+νHδδ(p1)!1p+ν1+24p+ν2(p+ν)p+νpHf,p(ν)+Cp,ν+3θ(p1)!δ(p1)!1p+ν=1+25p+2ν2(p+ν)p+νRp+ν1pHf,p(ν)+Cp,ν+3θ(p1)!ϵ(p1)!1p+ν(85) Thus, (Equation77) follows from multiplying (Equation80) and (Equation85).
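The choice (76) keeps the Hölder constant of the regularized problem bounded independently of ε, which is the content of (79). A small numerical sanity check in Python (H_δ is taken in the form used in (79); all constants below are arbitrary placeholders, not values from the paper):

```python
import math

def C(p, nu):
    return 2.0 * math.prod(nu + i for i in range(1, p + 1))

def delta(eps, p, nu, R):
    """Regularization weight from (76): delta = eps / (2 (p+nu) R^{p+nu-1})."""
    return eps / (2.0 * (p + nu) * R ** (p + nu - 1))

def H_delta(eps, p, nu, R, H_f, theta):
    """H_delta = p * H_{F_delta,p}(nu) + 3 theta (p-1)!  (as read off from (79))."""
    return p * (H_f + delta(eps, p, nu, R) / (p + nu) * C(p, nu)) \
        + 3.0 * theta * math.factorial(p - 1)

# With R >= 1 and eps in (0,1), H_delta stays below the eps-independent
# constant p (H_f + C_{p,nu}) + 3 theta (p-1)! -- the point of (79).
p, nu, R, H_f, theta = 3, 0.5, 2.0, 1.0, 0.1
cap = p * (H_f + C(p, nu)) + 3.0 * theta * math.factorial(p - 1)
for eps in (1e-1, 1e-3, 1e-6):
    assert H_delta(eps, p, nu, R, H_f, theta) <= cap
```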

Suppose now that $S \ge \tilde{f}(x_0) - \tilde{f}(x^*)$ is known. In this case, we have the following variant of Theorem 5.7.

Theorem 5.9

Suppose that H1 holds and let $\{u_k\}_{k=0}^{T}$ be a sequence generated by Algorithm 6 such that
\[
\|\tilde{F}'_{\delta}(u_k)\|_* > \frac{\epsilon}{2}, \quad k = 0, \ldots, T. \tag{86}
\]
Then,
\[
T \le 1 + \log_2\left( \frac{16(p+1)!\, H_{\delta}^{\frac{1}{p+\nu-1}}\, S}{\epsilon^{\frac{p+\nu}{p+\nu-1}}} \right). \tag{87}
\]

Proof.

By (Equation75), we have (88) T1+log232(p+1)!Hδ1p+ν12ϵp+νp+ν1δ2p+ν2(p+ν)x0xδp+ν.(88) Since F~δ() is uniformly convex of degree p+ν with parameter δ2(p+ν2) we have (89) δ2p+ν2(p+ν)x0xδp+νF~δ(x0)F~δ(xδ)=f~(x0)f~(xδ)δp+νxδx0p+νf~(x0)f~(xδ)f~(x0)f~(x)S.(89) Combining (Equation88) and (Equation89) we get (Equation87).

Corollary 5.10

Suppose that H1 holds and that $S \ge 1$. Then, Algorithm 6 with
\[
\delta = \left( \frac{\epsilon}{2^{p+\nu}\, \big[2^{p+\nu-2}(p+\nu)S\big]^{\frac{p+\nu-1}{p+\nu}}} \right)^{p+\nu} \tag{90}
\]
performs at most
\[
O\left( \log_2\!\left( \frac{S}{\epsilon^{\frac{p+\nu-1}{p+\nu}}} \right) \frac{S^{\frac{p+\nu-1}{p+\nu}}}{\epsilon} \right) \tag{91}
\]
iterations of Algorithm 5 in order to generate $u_T$ such that $\|\tilde{f}'(u_T)\|_* \le \epsilon$.

Proof.

By Theorem 5.9, we can obtain F~δ(uT)ϵ/2 with (92) T2+log216(p+1)!Hδ1p+ν1Sϵp+νp+ν1.(92) In view of (Equation90), ϵ(0,1) and S1, we also have (93) HδpHf,p(ν)+Cp,ν+3θ(p1)!.(93) Thus, from (Equation92) and (Equation93), it follows that (94) T2+log216(p+1)!pHf,p(ν)+Cp,ν+3θ(p1)!1p+ν1Sϵp+νp+ν1.(94) At this point uT we have f~(uT)F~δ(uT)+δp+νdp+ν(uT)ϵ2+δuTx0p+ν1. By (Equation82) and (Equation89), (95) uTx0p+ν12p+ν1x0xδp+ν12p+ν12p+ν2(p+ν)Sδp+ν1p+ν=1δp+ν1p+ν2p+ν12p+ν2(p+ν)Sp+ν1p+ν.(95) Thus, it follows from (Equation95), (Equation95) and (Equation90) that f~(uT)ϵ2+δ1p+ν2p+ν12p+ν2(p+ν)Sp+ν1p+νϵ2+ϵ2=ϵ. Finally, by (66) and (Equation90) we have m=1+24p+ν2(p+ν)p+νHδδ(p1)!1p+ν1+24p+ν2(p+ν)p+νpHf,p(ν)+Cp,ν+3θ(p1)!(p1)!1p+ν2p+ν2p+ν2(p+ν)Sp+ν1p+νϵ. Thus, (Equation91) follows by multiplying (Equation94) by the upper bound on m given above.

6. Lower complexity bounds under Hölder condition

In this section, we derive lower complexity bounds for p-order tensor methods applied to problem (4) in terms of the norm of the gradient of f, where the objective f is convex and $H_{f,p}(\nu) < +\infty$ for some $\nu \in [0,1]$.

For simplicity, assume that $\mathbb{E} = \mathbb{R}^n$ and $B = I_n$. Given an approximation $\bar{x}$ for the solution of (4), we consider p-order methods that compute trial points of the form $x^{+} = \bar{x} + \bar{h}$, where the search direction $\bar{h}$ is the solution of an auxiliary problem of the form
\[
\min_{h \in \mathbb{R}^n} \varphi_{a,\gamma,q}(h) \equiv \sum_{i=1}^{p} a^{(i)} D^i f(\bar{x})[h]^i + \gamma \|h\|^q, \tag{96}
\]
with $a \in \mathbb{R}^p$, $\gamma > 0$ and $q > 1$. Denote by $\Gamma_{\bar{x},f}(a,\gamma,q)$ the set of all stationary points of function $\varphi_{a,\gamma,q}(\cdot)$ and define the linear subspace
\[
S_f(\bar{x}) = \operatorname{Lin}\left\{ \Gamma_{\bar{x},f}(a,\gamma,q) \;\big|\; a \in \mathbb{R}^p,\, \gamma > 0,\, q > 1 \right\}. \tag{97}
\]
More specifically, we consider the class of p-order tensor methods characterized by the following assumption.

Assumption 6.1

Given $x_0 \in \mathbb{R}^n$, the method generates a sequence of test points $\{x_k\}_{k \ge 0}$ such that
\[
x_{k+1} \in x_0 + \sum_{i=0}^{k} S_f(x_i), \quad k \ge 0. \tag{98}
\]
Given $\nu \in [0,1]$, we consider the same family of difficult problems discussed in [Citation18], namely:
\[
f_k(x) = \frac{1}{p+\nu} \left[ \sum_{i=1}^{k-1} |x^{(i)} - x^{(i+1)}|^{p+\nu} + \sum_{i=k}^{n} |x^{(i)}|^{p+\nu} \right] - x^{(1)}, \quad 2 \le k \le n. \tag{99}
\]
The next lemma establishes that for each $f_k(\cdot)$ we have $H_{f_k,p}(\nu) < +\infty$.

Lemma 6.1

Given an integer $k \in [2, n]$, the pth derivative of $f_k(\cdot)$ is ν-Hölder continuous with
\[
H_{f_k,p}(\nu) = 2^{\frac{2+\nu}{2}} \prod_{i=1}^{p-1}(p+\nu-i). \tag{100}
\]

Proof.

See Lemma 5.1 in [Citation18].

The next lemma provides additional properties of fk().

Lemma 6.2

Given an integer $k \in [2, n]$, let function $f_k(\cdot)$ be defined by (99). Then, $f_k(\cdot)$ has a unique global minimizer $x_k^*$. Moreover,
\[
f_k^* = -\frac{(p+\nu-1)k}{p+\nu} \quad \text{and} \quad \|x_k^*\| < \frac{(k+1)^{3/2}}{\sqrt{3}}. \tag{101}
\]

Proof.

See Lemma 5.2 in [Citation18].
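Both quantities in (101) are easy to check numerically for small instances. The Python sketch below implements $f_k$ from (99) and minimizes it with a generic quasi-Newton routine (the dimensions, the choice p = 2, ν = 1 and the use of SciPy's BFGS are ours, purely for illustration); the printed values can be compared with the lemma.

```python
import numpy as np
from scipy.optimize import minimize

def f_k(x, k, p=2, nu=1.0):
    """The 'difficult' function (99): (1/(p+nu)) * [sum of |x_i - x_{i+1}|^{p+nu}
    for i < k plus sum of |x_i|^{p+nu} for i >= k] minus x_1."""
    s = p + nu
    head = np.sum(np.abs(x[:k-1] - x[1:k]) ** s)
    tail = np.sum(np.abs(x[k-1:]) ** s)
    return (head + tail) / s - x[0]

k, n, p, nu = 3, 6, 2, 1.0
res = minimize(f_k, np.zeros(n), args=(k, p, nu), method="BFGS", tol=1e-10)
# Lemma 6.2: the optimal value is -(p+nu-1) k / (p+nu) and the minimizer's
# norm is below (k+1)^{3/2} / sqrt(3).
print(res.fun, -(p + nu - 1) * k / (p + nu))           # ~ -2.0 vs -2.0
print(np.linalg.norm(res.x), (k + 1) ** 1.5 / np.sqrt(3))
```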

Our goal is to understand the behaviour of the tensor methods specified by Assumption 6.1 when applied to the minimization of $f_k(\cdot)$ with a suitable k. For that, let us consider the following subspaces:
\[
\mathbb{R}^n_k = \left\{ x \in \mathbb{R}^n \;\big|\; x^{(i)} = 0,\; i = k+1, \ldots, n \right\}, \quad 1 \le k \le n-1.
\]

Lemma 6.3

For any $q \ge 0$ and $x \in \mathbb{R}^n_k$, $f_{k+q}(x) = f_k(x)$.

Proof.

It follows directly from (99).

Lemma 6.4

Let M be a p-order tensor method satisfying Assumption 6.1. If M is applied to the minimization of $f_t(\cdot)$ ($2 \le t \le n$) starting from $x_0 = 0$, then the sequence $\{x_k\}_{k \ge 0}$ of test points generated by M satisfies
\[
x_{k+1} \in \sum_{i=0}^{k} S_{f_t}(x_i) \subseteq \mathbb{R}^n_{k+1}, \quad 0 \le k \le t-1.
\]

Proof.

See Lemma 2 in [Citation27].

The next lemma gives a lower bound for the norm of the gradient of ft() on suitable points.

Lemma 6.5

Let k be an integer in the interval $[1, t-1)$, with $t+1 \le n$. If $x \in \mathbb{R}^n_k$, then $\|\nabla f_t(x)\| \ge \frac{1}{\sqrt{k+1}}$.

Proof.

In view of (99) we have
\[
f_k(x) = \eta_{p+\nu}(A_k x) - \langle e_1, x \rangle, \tag{102}
\]
where
\[
\eta_{p+\nu}(u) = \frac{1}{p+\nu} \sum_{i=1}^{n} |u^{(i)}|^{p+\nu}, \tag{103}
\]
and
\[
A_k = \begin{pmatrix} U_k & 0 \\ 0 & I_{n-k} \end{pmatrix}, \quad \text{with } U_k = \begin{pmatrix} 1 & -1 & & & \\ & 1 & -1 & & \\ & & \ddots & \ddots & \\ & & & 1 & -1 \\ & & & & 1 \end{pmatrix} \in \mathbb{R}^{k \times k}. \tag{104}
\]
By (104) and (103), we have
\[
\big[\nabla \eta_{p+\nu}(A_t x)\big]^{(i)} = \begin{cases} |x^{(i)} - x^{(i+1)}|^{p+\nu-2}\,\big(x^{(i)} - x^{(i+1)}\big), & i = 1, \ldots, t-1, \\ |x^{(i)}|^{p+\nu-2}\, x^{(i)}, & i = t, \ldots, n. \end{cases}
\]
Since $x \in \mathbb{R}^n_k$, it follows that $x^{(i)} = 0$ for $i > k$. Therefore,
\[
\big[\nabla \eta_{p+\nu}(A_t x)\big]^{(i)} = 0, \quad i = k+1, \ldots, n,
\]
which means that $\nabla \eta_{p+\nu}(A_t x) \in \mathbb{R}^n_k$. Then, from (102), we obtain
\[
\|\nabla f_t(x)\|^2 = \|A_t^T \nabla \eta_{p+\nu}(A_t x) - e_1\|^2 \ge \inf_{y \in \mathbb{R}^n_k} \|A_t^T y - e_1\|^2 = \inf_{z \in \mathbb{R}^k} \|Bz - e_1\|^2, \quad \text{where } B = A_t^T \begin{pmatrix} I_k \\ 0 \end{pmatrix}, \tag{105}
\]
and this infimum equals
\[
\|B(B^T B)^{-1} B^T e_1 - e_1\|^2 = \sum_{i=1}^{n} \Big( \big[B(B^T B)^{-1} B^T e_1\big]^{(i)} - e_1^{(i)} \Big)^2.
\]
By (104), we have
\[
B = \begin{pmatrix} \tilde{U} \\ 0 \end{pmatrix} \in \mathbb{R}^{n \times k}, \quad \text{with } \tilde{U} = \begin{pmatrix} 1 & & & \\ -1 & 1 & & \\ & -1 & \ddots & \\ & & \ddots & 1 \\ & & & -1 \end{pmatrix} \in \mathbb{R}^{(k+1) \times k}. \tag{106}
\]
Consequently,
\[
B^T e_1 = (1, 0, \ldots, 0)^T \in \mathbb{R}^k, \tag{107}
\]
and
\[
B^T B = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix} \in \mathbb{R}^{k \times k}. \tag{108}
\]
From (108), it can be checked that
\[
(B^T B)^{-1} = \frac{1}{k+1} \tilde{B} \in \mathbb{R}^{k \times k}, \tag{109}
\]
with
\[
\tilde{B}_{ij} = \begin{cases} i\,[(k+1) - j], & \text{if } j \ge i, \\ j\,[(k+1) - i], & \text{otherwise}. \end{cases} \tag{110}
\]
Now, combining (107) and (108)–(109), we get
\[
\big[(B^T B)^{-1} B^T e_1\big]^{(i)} = \frac{(k+1) - i}{k+1}, \quad i = 1, \ldots, k. \tag{111}
\]
Then, it follows from (106) and (111) that
\[
\big[B(B^T B)^{-1} B^T e_1\big]^{(i)} = \begin{cases} \dfrac{k}{k+1}, & i = 1, \\[1mm] -\dfrac{(k+1)-(i-1)}{k+1} + \dfrac{(k+1)-i}{k+1}, & i = 2, \ldots, k, \\[1mm] -\dfrac{1}{k+1}, & i = k+1, \\[1mm] 0, & i = k+2, \ldots, n, \end{cases} \;=\; \begin{cases} \dfrac{k}{k+1}, & i = 1, \\[1mm] -\dfrac{1}{k+1}, & i = 2, \ldots, k+1, \\[1mm] 0, & i = k+2, \ldots, n. \end{cases} \tag{112}
\]
Finally, by (105) and (112) we have
\[
\|\nabla f_t(x)\|^2 \ge \sum_{i=1}^{n} \Big( \big[B(B^T B)^{-1} B^T e_1\big]^{(i)} - e_1^{(i)} \Big)^2 = \left( \frac{1}{k+1} \right)^2 + \sum_{i=2}^{k+1} \left( \frac{1}{k+1} \right)^2 = \sum_{i=1}^{k+1} \frac{1}{(k+1)^2} = \frac{1}{k+1},
\]
and the proof is complete.
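The key computation in this proof, the least-squares residual $\|B(B^TB)^{-1}B^Te_1 - e_1\|^2 = 1/(k+1)$, is easy to verify numerically. The sketch below (dimensions chosen arbitrarily) builds the matrix B of (106) and checks the identity for several values of k.

```python
import numpy as np

def residual_sq(k, n):
    """Squared norm of B(B^T B)^{-1} B^T e_1 - e_1 for the matrix B in (106):
    B stacks the (k+1) x k lower bidiagonal matrix U~ (1 on the diagonal,
    -1 on the subdiagonal) on top of an (n-k-1) x k zero block."""
    U = np.eye(k + 1, k) - np.eye(k + 1, k, k=-1)
    B = np.vstack([U, np.zeros((n - k - 1, k))])
    e1 = np.zeros(n); e1[0] = 1.0
    proj = B @ np.linalg.solve(B.T @ B, B.T @ e1)
    return float(np.dot(proj - e1, proj - e1))

for k in range(1, 8):
    # Lemma 6.5: the squared residual, and hence ||grad f_t(x)||^2 on R^n_k,
    # is bounded below by 1/(k+1).
    assert abs(residual_sq(k, n=12) - 1.0 / (k + 1)) < 1e-12
```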

The next theorem establishes a lower bound for the rate of convergence of p-order tensor methods with respect to the initial functional residual $f(x_0) - f^*$.

Theorem 6.6

Let M be a p-order tensor method satisfying Assumption 6.1. Assume that for any function f with $H_{f,p}(\nu) < +\infty$ this method ensures the rate of convergence
\[
\min_{1 \le k \le t-1} \|\nabla f(x_k)\| \le \frac{H_{f,p}(\nu)^{\frac{1}{p+\nu}}\, \big(f(x_0) - f^*\big)^{\frac{p+\nu-1}{p+\nu}}}{\kappa(t)}, \quad t \ge 2, \tag{113}
\]
where $\{x_k\}_{k \ge 0}$ is the sequence generated by method M and $f^*$ is the optimal value of f. Then, for all $t \ge 2$ such that $t+1 \le n$ we have
\[
\kappa(t) \le D_{p,\nu}\, t^{\frac{3(p+\nu)-2}{2(p+\nu)}} \quad \text{with} \quad D_{p,\nu} = \left[ 2^{\frac{2+\nu}{2}} \prod_{i=1}^{p-1}(p+\nu-i) \right]^{\frac{1}{p+\nu}} \left( \frac{p+\nu-1}{p+\nu} \right)^{\frac{p+\nu-1}{p+\nu}}. \tag{114}
\]

Proof.

Suppose that method M is applied to minimize function $f_t(\cdot)$ with initial point $x_0 = 0$. By Lemma 6.4, we have $x_k \in \mathbb{R}^n_k$ for all k, $1 \le k \le t-1$. Thus, from Lemma 6.5, it follows that
\[
\min_{1 \le k \le t-1} \|\nabla f_t(x_k)\| \ge \min_{1 \le k \le t-1} \frac{1}{\sqrt{k+1}} = \frac{1}{\sqrt{t}}. \tag{115}
\]
Then, combining (113), (115), Lemma 6.1 and Lemma 6.2, we get
\[
\kappa(t) \le \frac{H_{f_t,p}(\nu)^{\frac{1}{p+\nu}}\, \big(f_t(x_0) - f_t^*\big)^{\frac{p+\nu-1}{p+\nu}}}{\min_{1 \le k \le t-1} \|\nabla f_t(x_k)\|} \le \left[ 2^{\frac{2+\nu}{2}} \prod_{i=1}^{p-1}(p+\nu-i) \right]^{\frac{1}{p+\nu}} \left( \frac{p+\nu-1}{p+\nu} \right)^{\frac{p+\nu-1}{p+\nu}} t^{\frac{p+\nu-1}{p+\nu}}\, t^{\frac{1}{2}} \le D_{p,\nu}\,(t+1)^{\frac{3(p+\nu)-2}{2(p+\nu)}},
\]
where constant $D_{p,\nu}$ is given in (114).

Remark 6.1

Theorem 6.6 gives a lower bound of $O\big((1/k)^{\frac{3(p+\nu)-2}{2(p+\nu)}}\big)$ for the rate of convergence of tensor methods with respect to the initial functional residual. For first-order methods in the Lipschitz case (i.e. $p = \nu = 1$), this rate is $O(1/k)$. This gives a lower complexity bound of $O(\epsilon^{-1})$ iterations for finding ε-stationary points of convex functions using first-order methods, which coincides with the lower bound (8a) in [Citation6]. Moreover, in view of Corollary 5.8, Algorithm 6 is suboptimal in terms of the initial residual, with a complexity gap that increases as p grows.

Now, we obtain a lower bound for the rate of convergence of p-order tensor methods with respect to the distance $\|x_0 - x^*\|$.

Theorem 6.7

Let M be a p-order tensor method satisfying Assumption 6.1. Assume that for any function f with $H_{f,p}(\nu) < +\infty$ this method ensures the rate of convergence
\[
\min_{1 \le k \le t-1} \|\nabla f(x_k)\| \le \frac{H_{f,p}(\nu)\, \|x_0 - x^*\|^{p+\nu-1}}{\kappa(t)}, \quad t \ge 2, \tag{116}
\]
where $\{x_k\}_{k \ge 0}$ is the sequence generated by method M and $x^*$ is a global minimizer of f. Then, for all $t \ge 2$ such that $t+1 \le n$ we have
\[
\kappa(t) \le L_{p,\nu}\,(t+1)^{\frac{3(p+\nu)-2}{2}} \quad \text{with} \quad L_{p,\nu} = \frac{2^{\frac{2+\nu}{2}}}{3^{\frac{p+\nu-1}{2}}} \prod_{i=0}^{p-1}(p+\nu-i). \tag{117}
\]

Proof.

Let us apply method M for minimizing function $f_t(\cdot)$ starting from point $x_0 = 0$. By Lemma 6.4, we have $x_k \in \mathbb{R}^n_k$ for all k, $1 \le k \le t-1$. Thus, from Lemma 6.5, it follows that
\[
\min_{1 \le k \le t-1} \|\nabla f_t(x_k)\| \ge \min_{1 \le k \le t-1} \frac{1}{\sqrt{k+1}} = \frac{1}{\sqrt{t}}. \tag{118}
\]
Then, combining (116), (118), Lemma 6.1 and Lemma 6.2 we get
\[
\kappa(t) \le \frac{H_{f_t,p}(\nu)\, \|x_0 - x_t^*\|^{p+\nu-1}}{\min_{1 \le k \le t-1} \|\nabla f_t(x_k)\|} \le 2^{\frac{2+\nu}{2}} \prod_{i=1}^{p-1}(p+\nu-i) \left( \frac{(t+1)^{3/2}}{\sqrt{3}} \right)^{p+\nu-1} (t+1)^{\frac{1}{2}} \le L_{p,\nu}\,(t+1)^{\frac{3(p+\nu)-2}{2}},
\]
where constant $L_{p,\nu}$ is given in (117).

Remark 6.2

Theorem 6.7 establishes that the lower bound for the rate of convergence of tensor methods in terms of the norm of the gradient is also of $O\big((1/k)^{\frac{3(p+\nu)-2}{2}}\big)$. For first-order methods in the Lipschitz case (i.e. $p = \nu = 1$), this rate is $O(1/k^2)$. This gives a lower complexity bound of $O(\epsilon^{-1/2})$ for finding ε-stationary points of convex functions using first-order methods, which coincides with the lower bound (8b) in [Citation6].

Remark 6.3

The rate of $O\big((1/k)^{\frac{3(p+\nu)-2}{2}}\big)$ corresponds to a worst-case complexity bound of $O\big(\epsilon^{-2/[3(p+\nu)-2]}\big)$ iterations necessary to ensure $\|\nabla f(x_k)\| \le \epsilon$. Note that, for $\epsilon \in (0,1)$, we have
\[
\frac{1}{\epsilon^{\frac{2}{3(p+\nu)-2}}} \;\le\; \frac{1}{\epsilon^{\frac{p+\nu}{(p+\nu-1)(p+\nu+1)}}} \;\le\; \frac{1}{\epsilon^{\frac{1}{p+\nu-1}}}.
\]
Thus, by increasing the power of the oracle (i.e. the order p), our non-universal schemes become nearly optimal. For example, if $\epsilon = 10^{-6}$ and $p \ge 4$, we have $\left(\frac{1}{\epsilon}\right)^{\frac{1}{p+\nu-1}} \le 10\left(\frac{1}{\epsilon}\right)^{\frac{2}{3(p+\nu)-2}}$.
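A quick numerical check of the comparison made in this remark (ν = 1 is used for concreteness; constants in the bounds are ignored):

```python
def iteration_exponents(p, nu):
    """Negative epsilon-exponents: non-accelerated upper bound, accelerated
    (nu known) upper bound, and the lower bound from this section."""
    s = p + nu
    return 1.0 / (s - 1.0), s / ((s - 1.0) * (s + 1.0)), 2.0 / (3.0 * s - 2.0)

eps = 1e-6
for p in (2, 3, 4, 5):
    non_acc, acc, lower = iteration_exponents(p, 1.0)
    # Iteration counts implied by each exponent at eps = 1e-6.
    print(p, (1 / eps) ** non_acc, (1 / eps) ** acc, (1 / eps) ** lower)
    # The remark's comparison for p >= 4: even the non-accelerated count is
    # within a factor of 10 of the lower-bound count.
    if p >= 4:
        assert (1 / eps) ** non_acc <= 10 * (1 / eps) ** lower
```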

7. Conclusion

In this paper, we presented p-order methods that can find ε-approximate stationary points of convex functions that are p-times differentiable with ν-Hölder continuous pth derivatives. For the universal and the non-universal schemes without acceleration, we established iteration complexity bounds of $O(\epsilon^{-1/(p+\nu-1)})$ for finding $\bar{x}$ such that $\|\nabla f(\bar{x})\|_* \le \epsilon$. For the case in which ν is known, we obtained improved complexity bounds of $O(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]})$ and $O(|\log(\epsilon)|\,\epsilon^{-1/(p+\nu)})$ for the corresponding accelerated schemes. For the case in which ν is unknown, we obtained a bound of $O(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]})$ for a universal accelerated scheme. Similar bounds were also obtained for tensor schemes adapted to the minimization of composite convex functions. A lower complexity bound of $O(\epsilon^{-2/[3(p+\nu)-2]})$ was obtained for the referred problem class. Therefore, in practice, our non-universal schemes become nearly optimal as we increase the order p.

As an additional result, we showed that Algorithm 6 takes at most $O(\log(\epsilon^{-1}))$ iterations to find ε-stationary points of uniformly convex functions of degree $p+\nu$ of the form (64). Notice that strongly convex functions are uniformly convex of degree 2. Thus, our result generalizes the known bound of $O(\log(\epsilon^{-1}))$ obtained for first-order schemes (p = 1) applied to strongly convex functions with Lipschitz continuous gradients ($\nu = 1$). At this point, it is not clear to us how p-order methods (with $p \ge 2$) behave when the objective functions are strongly convex with ν-Hölder continuous pth derivatives. Nevertheless, from the remarks made in [Citation12, p. 6] for p = 2, it appears that in our case the class of uniformly convex functions of degree $p+\nu$ is the most suitable for p-order methods from a physical point of view.

Acknowledgements

The authors are very grateful to an anonymous referee, whose comments helped to improve the first version of this paper.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

G.N. Grapiglia was supported by the National Council for Scientific and Technological Development - Brazil [grant number 406269/2016-5] and by the European Research Council Advanced [grant number 788368]. Yurii Nesterov was supported by the European Research Council Advanced [grant number 788368].

Notes on contributors

G. N. Grapiglia

G. N. Grapiglia obtained his doctoral degree in Mathematics in 2014 at Universidade Federal do Paraná (UFPR), Brazil. Currently, he is an Assistant Professor at UFPR. His research covers the development, analysis and application of optimization methods.

Yurii Nesterov

Yurii Nesterov is a professor at the Center for Operations Research and Econometrics (CORE) in Catholic University of Louvain (UCL), Belgium. He received his Ph.D. degree (Applied Mathematics) in 1984 at the Institute of Control Sciences, Moscow. His research interests are related to complexity issues and efficient methods for solving various optimization problems. He has received several international prizes, among which are the Dantzig Prize from SIAM and Mathematical Programming society (2000), the John von Neumann Theory Prize from INFORMS (2009), the SIAM Outstanding Paper Award (2014), and the Euro Gold Medal from the Association of European Operations Research Societies (2016). In 2018, he also won an Advanced Grant from the European Research Council.

Notes

1 Conditions (47) have already been used in [Citation20] and are the composite analogue of the conditions proposed in [Citation2]. It is worth mentioning that, for p = 3 and $\nu = 1$, the tensor model $\Omega_{x,p,M}^{(\nu)}(\cdot)$ has very nice relative smoothness properties (see [Citation27]), which allow the approximate solution of (44) by Bregman Proximal Gradient Algorithms [Citation3,Citation22].

References

  • M. Baes, Estimate sequence methods: Extensions and approximations (2009). Available at http://www.optimization-online.org/DB_FILE/2009/08/2372.pdf.
  • E.G. Birgin, J.L. Gardenghi, J.M. Martínez, S.A. Santos, and Ph.L. Toint, Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models, Math. Program. 163 (2017), pp. 359–368. doi: 10.1007/s10107-016-1065-8
  • J. Bolte, S. Sabach, M. Teboulle, and Y. Vaesburg, First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems, SIAM. J. Optim. 28 (2018), pp. 2131–2151. doi: 10.1137/17M1138558
  • A. Bouaricha, Tensor methods for large, sparse unconstrained optimization, SIAM. J. Optim. 7 (1997), pp. 732–756. doi: 10.1137/S1052623494267723
  • S. Bubeck, Q. Jiang, Y.T. Lee, Y. Li, and A. Sidford, Near-optimal method for highly smooth convex optimization (2019). Available at arXiv, math. OC/1812.08026v2.
  • Y. Carmon, J.C. Duchi, O. Hinder, and A. Sidford, Lower bounds for finding stationary points II: first-order methods (2017). Available at arXiv, math. OC/1711.00841.
  • C. Cartis, N.I.M. Gould, and Ph.L. Toint, Second-order optimality and beyond: characterization and evaluation complexity in convexly constrained nonlinear optimization, Found. Comput. Math. 18 (2018), pp. 1073–1107. doi: 10.1007/s10208-017-9363-y
  • C. Cartis, N.I.M. Gould, and Ph.L. Toint, Strong evaluation complexity bounds for arbitrary-order optimization of nonconvex nonsmooth composite functions (2020). Available at arXiv, math. OC/2001.10802.
  • C. Cartis, N.I.M. Gould, and Ph.L. Toint, Universal regularized methods – varying the power, the smoothness, and the accuracy, SIAM. J. Optim. 29 (2019), pp. 595–615. doi: 10.1137/16M1106316
  • X. Chen, Ph.L. Toint, and H. Wang, Complexity of partially separable convexly constrained optimization with non-lipschitz singularities, SIAM. J. Optim. 29 (2019), pp. 874–903. doi: 10.1137/18M1166511
  • X. Chen and Ph.L. Toint, High-order evaluation complexity for convexly-constrained optimization with non-Lipschitzian group sparsity terms, Math. Program. (2020). doi:10.1007/s10107-020-01470-9.
  • N. Doikov and Yu. Nesterov, Minimizing uniformly convex functions by cubic regularization of newton method (2019). Available at arXiv, math. OC/1905.02671.
  • N. Doikov and Yu. Nesterov, Contracting proximal methods for smooth convex optimization. CORE Discussion Paper 2019/27.
  • N. Doikov and Yu. Nesterov, Inexact tensor methods with dynamic accuracies (2020). Available at arXiv, math. OC/2002.09403.
  • A. Gasnikov, P. Dvurechensky, E. Gorbunov, E. Vorontsova, D. Selikhanovych, and C.A. Uribe, The global rate of convergence for optimal tensor methods in smooth convex optimization (2019). Available at arXiv, math. OC/1809.00389v11.
  • G.N. Grapiglia and Yu. Nesterov, Regularized Newton methods for minimizing functions with Hölder continuous hessians, SIAM. J. Optim. 27 (2017), pp. 478–506. doi: 10.1137/16M1087801
  • G.N. Grapiglia and Yu. Nesterov, Accelerated regularized Newton methods for minimizing composite convex functions, SIAM. J. Optim. 29 (2019), pp. 77–99. doi: 10.1137/17M1142077
  • G.N Grapiglia and Yu. Nesterov, Tensor methods for minimizing convex functions with Hölder continuous higher-order derivatives. To appear in SIAM Journal on Optimization.
  • G.N. Grapiglia and Yu. Nesterov, On inexact solution of auxiliary problems in tensor methods for convex optimization, Optim. Methods Softw. (2020). doi:10.1080/10556788.2020.1731749.
  • B. Jiang, T. Lin, and S. Zhang, A unified adaptive tensor approximation scheme to accelerated composite convex optimization (2020). Available at arXiv, math. OC/1811.02427v2.
  • B. Jiang, H. Wang, and S. Zhang, An optimal high-order tensor method for convex optimization (2020). Available at arXiv, math OC/1812.06557v3.
  • H. Lu, R.M. Freund, and Yu. Nesterov, Relatively smooth convex optimization by first-order methods, and applications, SIAM. J. Optim. 28 (2018), pp. 333–354. doi: 10.1137/16M1099546
  • J.M. Martínez, On high-order model regularization for constrained optimization, SIAM. J. Optim. 27 (2017), pp. 2447–2458. doi: 10.1137/17M1115472
  • R.D.C. Monteiro and B.F. Svaiter, An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods, SIAM. J. Optim. 23 (2013), pp. 1092–1125. doi: 10.1137/110833786
  • Yu. Nesterov, Accelerating the cubic regularization of Newton's method on convex problems, Math. Program. 112 (2008), pp. 159–181. doi: 10.1007/s10107-006-0089-x
  • Yu. Nesterov, How to make gradients small, Optima 88 (2012), pp. 10–11.
  • Yu. Nesterov, Implementable tensor methods in unconstrained convex optimization, Math. Program. (2019). doi:10.1007/s10107-019-01449-1.
  • Yu. Nesterov, Inexact accelerated high-order proximal-point methods. CORE Discussion Paper 2020/08.
  • Yu. Nesterov, Inexact high-order proximal-point methods with auxiliary search procedure. CORE Discussion Paper 2020/10.
  • Yu. Nesterov and A. Nemirovskii, Interior Point Polynomial Methods in Convex Programming: Theory and Applications, SIAM, Philadelphia, 1994.
  • Yu. Nesterov and B.T. Polyak, Cubic regularization of Newton method and its global performance, Math. Program. 108 (2006), pp. 177–205. doi: 10.1007/s10107-006-0706-8
  • A. Rodomanov and Yu. Nesterov, Smoothness parameter of power of Euclidean norm, J. Optim. Theory. Appl. 185 (2020), pp. 303–326. doi: 10.1007/s10957-020-01653-6
  • R.B. Schnabel and T. Chow, Tensor methods for unconstrained optimization using second derivatives, SIAM. J. Optim. 1 (1991), pp. 293–315. doi: 10.1137/0801020

Appendix. Accelerated scheme for composite minimization

To solve problem (Equation41), we can apply the following modification of Algorithm 3 in [Citation18]:

In order to establish a convergence rate for Algorithm A1, we will need the following result.

Lemma A.1

Suppose that H1 holds and let $x^{+}$ be an approximate solution to $\min_{y \in \mathbb{E}} \tilde{\Omega}_{x,p,H}^{(\nu)}(y)$ such that
\[
\tilde{\Omega}_{x,p,H}^{(\nu)}(x^{+}) \le \tilde{f}(x) \quad \text{and} \quad \|\nabla \Omega_{x,p,H}^{(\nu)}(x^{+}) + g_{\phi}(x^{+})\|_* \le \theta \|x^{+} - x\|^{p+\nu-1}, \tag{A3}
\]
for some $g_{\phi}(x^{+}) \in \partial\phi(x^{+})$. If $H \ge (p+\nu-1)\big(H_{f,p}(\nu) + \theta(p-1)!\big)$, then
\[
\langle \tilde{f}'(x^{+}), x - x^{+} \rangle \ge \frac{1}{3(p-1)!} \left( \frac{1}{H} \right)^{\frac{1}{p+\nu-1}} \|\tilde{f}'(x^{+})\|_*^{\frac{p+\nu}{p+\nu-1}}. \tag{A4}
\]

Proof.

Denote r=x+x. Then, f~(x+)+H(p+ν)p!rp+ν2B(x+x)=f(x+)Φx,p(x+)+Ωx,p,H(ν)(x+)+gϕ(x+)f(x+)Φx,p(x+)+Ωx,p,H(ν)(x+)+gϕ(x+)Hf,p(ν)(p1)!+θrp+ν1, which gives (A5) Hf,p(ν)(p1)!+θr2(p+ν1)f~(x+)+H(p+ν)p!rp+ν2B(x+x)2=f~(x+)2+2(p+ν)p!Hrp+ν2f~(x+),x+x+H2(p+ν)2(p!)2r2(p+ν1).(A5) From (EquationA5), the rest of the proof follows exactly as in the proof of Lemma A.6 in [Citation18].

Theorem A.2

Suppose that H1 holds and let the sequence $\{x_t\}_{t=0}^{T}$ be generated by Algorithm A1. Then, for $t = 2, \ldots, T$,
\[
\tilde{f}(x_t) - \tilde{f}(x^*) \le \frac{2^{3p-1} M\,(p+\nu)^{p+\nu}\, \|x_0 - x^*\|^{p+\nu}}{(p-1)!\,(t-1)^{p+\nu}}. \tag{A6}
\]

Proof.

For all t0, we have (A7) ψt(x)Atf~(x)+1p+νxx0p+ν, xE.(A7) Indeed, (EquationA7) is true for t = 0 because A0=0 and ψ0(x)=1p+νxx0p+ν. Suppose that (EquationA7) is true for some t0. Then, ψt+1(x)=ψt(x)+atf(xt+1)+f(xt+1),xxt+1+ϕ(x)Atf~(x)+atf~(x)+1p+νxx0p+ν=At+1f~(x)+1p+νxx0p+ν. Thus, (EquationA7) follows by induction. Now, let us prove that (A8) Atf~(xt)ψtminxEψt(x).(A8) Again, using A0=0, we see that (EquationA8) is true for t = 0. Assume that (EquationA8) is true for some t0. Note that ψt() is uniformly convex of degree p+ν with parameter 2(p+ν2). Thus, by the induction assumption ψt(x)ψt+2(p+ν2)p+νxvtp+νAtf~(xt)+2(p+ν2)p+νxvtp+ν. Consequently, (A9) ψt+1=minxdomϕψt(x)+atf(xt+1)+f(xt+1),xxt+1+ϕ(x)minxdomϕAtf~(xt)+2(p+ν2)p+νxvtp+ν+atf(xt+1)+f(xt+1),xxt+1+ϕ(x).(A9) Since f is convex and differentiable and gϕ(xt+1)ϕ(xt+1), we have (A10) f~(xt)f~(xt+1)+f~(xt+1),xtxt+1(A10) and (A11) ϕ(x)ϕ(xt+1)+gϕ(xt+1),xxt+1.(A11) Using (EquationA10) and (EquationA11) in (EquationA9), it follows that (A12) ψt+1minxdomϕAt+1f~(xt+1)+f~(xt+1),AtxtAtxt+1+atf~(xt+1),xxt+1+2(p+ν2)p+νxvtp+ν.(A12) Note that Atxt=At+1ytatvt and At+1xt+1=Atxt+1+atxt+1. Thus, combining (EquationA12) and Lemma A.1, we obtain ψt+1At+1f~(xt+1)+minxdomϕAt+114(p1)!M1p+ν1f~(xt+1)p+νp+ν1+atf~(xt+1),xvt+2(p+ν2)p+νxtvtp+νAt+1f~(xt+1), where the last inequality follows from (A1) exactly as in the proof of Theorem 4.3 in [Citation18]. Thus, (EquationA8) also holds for t + 1, which completes the induction argument.

Now, combining (EquationA7) and (EquationA8), we have (A13) f~(xt)f~(x)1At1p+νx0x+p+ν.(A13) Once again, as in the proof of Theorem 4.3 in [Citation18], it follows from (A1) that (A14) At(p1)!23p1M1p+ν12p+ν1p+νp+ν(t1)p+ν, t2.(A14) Finally, (EquationA6) follows directly from (EquationA13) and (EquationA14).