Convergence rate analysis of the gradient descent–ascent method for convex–concave saddle-point problems

Moslem Zamani, Hadi Abbaszadehpeivasti & Etienne de Klerk
Received 15 Sep 2022, Accepted 22 May 2024, Published online: 20 Jun 2024

Abstract

In this paper, we study the gradient descent–ascent method for convex–concave saddle-point problems. We derive a new non-asymptotic global convergence rate, in terms of the distance to the solution set, by using the semidefinite programming performance estimation method. The given convergence rate incorporates most parameters of the problem, and it is exact, for one iteration, on a large class of strongly convex–strongly concave saddle-point problems. We also investigate the algorithm without strong convexity and provide necessary and sufficient conditions under which the gradient descent–ascent method enjoys linear convergence.


1. Introduction

We consider the convex–concave saddle point problem
$$\min_{x\in\mathbb{R}^n}\ \max_{y\in\mathbb{R}^m}\ F(x,y),\tag{1}$$
where $F:\mathbb{R}^n\times\mathbb{R}^m\to(-\infty,\infty)$, and $F(\cdot,y)$ and $F(x,\cdot)$ are convex and concave, respectively, for any fixed $x\in\mathbb{R}^n$ and $y\in\mathbb{R}^m$. We assume that problem (1) has a solution, that is, there exists a saddle point $(x^\star,y^\star)\in\mathbb{R}^n\times\mathbb{R}^m$ with
$$F(x^\star,y)\le F(x^\star,y^\star)\le F(x,y^\star),\qquad\forall x\in\mathbb{R}^n,\ \forall y\in\mathbb{R}^m.$$
We denote the solution set of problem (1) by $S$. We call $F$ smooth if, for some $L_x,L_y,L_{xy}$, we have
$$\begin{aligned}
(i)\quad&\|\nabla_xF(x_2,y)-\nabla_xF(x_1,y)\|\le L_x\|x_2-x_1\|&&\forall x_1,x_2,y\\
(ii)\quad&\|\nabla_yF(x,y_2)-\nabla_yF(x,y_1)\|\le L_y\|y_2-y_1\|&&\forall x,y_1,y_2\\
(iii)\quad&\|\nabla_xF(x,y_2)-\nabla_xF(x,y_1)\|\le L_{xy}\|y_2-y_1\|&&\forall x,y_1,y_2\\
(iv)\quad&\|\nabla_yF(x_2,y)-\nabla_yF(x_1,y)\|\le L_{xy}\|x_2-x_1\|&&\forall x_1,x_2,y.
\end{aligned}$$
The function $F$ is said to be strongly convex–strongly concave if
$$\begin{aligned}
(i)\quad&F(\cdot,y)-\tfrac{\mu_x}{2}\|\cdot\|^2\ \text{is convex for any fixed}\ y\\
(ii)\quad&F(x,\cdot)+\tfrac{\mu_y}{2}\|\cdot\|^2\ \text{is concave for any fixed}\ x,
\end{aligned}$$
for some $\mu_x,\mu_y>0$. Note that strong convexity–strong concavity implies that problem (1) has a unique solution $(x^\star,y^\star)$. We denote the set of smooth strongly convex–strongly concave functions by $\mathcal{F}(L_x,L_y,L_{xy},\mu_x,\mu_y)$.

Problem (1) has applications in game theory [4], robust optimization [5], adversarial training [14], and reinforcement learning [30], to name but a few. In addition, various other algorithms have been developed for solving saddle point problems; see e.g. [15,16,18,25,26,28,29].

One of the simplest approaches for handling problem (1), introduced in [2, Chapter 6], is the gradient descent–ascent method, which may be regarded as a generalization of the gradient method to saddle point problems. The gradient descent–ascent method is described in Algorithm 1: given a starting point $(x_1,y_1)$ and a step length $t>0$, it iterates the simultaneous update $x_{k+1}=x_k-t\nabla_xF(x_k,y_k)$ and $y_{k+1}=y_k+t\nabla_yF(x_k,y_k)$; a minimal sketch follows.
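Since the algorithm display itself did not survive extraction, the sketch below restates the update in code; the oracle names `grad_x` and `grad_y` are our own placeholders for the partial gradients of $F$, while the update rule is the one used throughout the paper (cf. the constraints of problem (6)).

```python
import numpy as np

def gradient_descent_ascent(grad_x, grad_y, x0, y0, t, num_iters):
    """Gradient descent-ascent with a fixed step length t:
    a descent step in x and an ascent step in y, taken simultaneously."""
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for _ in range(num_iters):
        # tuple assignment evaluates both right-hand sides first,
        # so the update is simultaneous, not alternating
        x, y = x - t * grad_x(x, y), y + t * grad_y(x, y)
    return x, y
```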

The local and global linear convergence of Algorithm 1 have been investigated in the literature; see [3,13,17,33] and the references therein. As we investigate the global linear convergence rate of Algorithm 1, we mention one known global convergence result, which is derived by using variational inequality techniques. Let $z=(x,y)$ and let the map $\phi:\mathbb{R}^{n+m}\to\mathbb{R}^{n+m}$ be given by $\phi(z)=\big(\nabla_xF(z),-\nabla_yF(z)\big)^T$. It is shown, see e.g. [22], that
$$\|\phi(\bar z)-\phi(\hat z)\|\le2L\|\bar z-\hat z\|,\qquad\langle\phi(\bar z)-\phi(\hat z),\bar z-\hat z\rangle\ge\mu\|\bar z-\hat z\|^2,$$
where $L=\max\{L_x,L_y,L_{xy}\}$ and $\mu=\min\{\mu_x,\mu_y\}$. Indeed, $\phi$ is Lipschitz continuous and strongly monotone. By Facchinei and Pang [12, Theorem 12.1.2], for $t\in\big(0,\tfrac{\mu}{2L^2}\big)$, we have
$$\|x_2-x^\star\|^2+\|y_2-y^\star\|^2\le\big(1+4L^2t^2-2\mu t\big)\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big).\tag{2}$$
In this study, we revisit Algorithm 1 and improve the convergence rate (2). Indeed, we derive a new convergence rate involving most parameters of problem (1). It is worth noting that if one sets $L=\max\{L_x,L_y,L_{xy}\}$ and $\mu=\min\{\mu_x,\mu_y\}$, the new bound dominates the convergence rate (2) for any step length $t\in\big(0,\tfrac{\mu}{2L^2}\big)$. Furthermore, by setting $t=\tfrac{\mu}{4L^2}$, one can infer that Algorithm 1 has a complexity of $O\big(\tfrac{L^2}{\mu^2}\ln\tfrac{1}{\epsilon}\big)$ to compute iterates $(x_k,y_k)$ such that $\|x_k-x^\star\|^2+\|y_k-y^\star\|^2\le\epsilon\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big)$, which is the known iteration complexity bound in the literature; see e.g. [6,32]. In this study, thanks to the new convergence rate given in Theorem 2.2, a complexity of order $O\big(\big(\tfrac{L}{\mu}+\tfrac{L_{xy}^2}{\mu^2}\big)\ln\tfrac{1}{\epsilon}\big)$ is obtained, where now $L=\max\{L_x,L_y\}$ and $\mu=\min\{\mu_x,\mu_y\}$, which is more informative in comparison with the above-mentioned one. Moreover, by providing an example, we show that the given convergence rate is exact for one iteration.

The goal of this work is not to achieve the optimal algorithmic complexity for the class of saddle point problems introduced above. Rather, we have the more modest goal of giving the best possible worst-case complexity analysis of the gradient descent–ascent method (Algorithm 1). It is important to note that there are accelerated and extra-gradient descent–ascent methods with better worst-case complexity than Algorithm 1; see e.g. [18,28]. In particular, the accelerated methods may be shown to have a worst-case complexity of $O\Big(\sqrt{\tfrac{L_x^2}{\mu_x^2}+\tfrac{L_{xy}^2}{\mu_x\mu_y}+\tfrac{L_y^2}{\mu_y^2}}\,\ln\tfrac{1}{\epsilon}\Big)$, which may be compared to the best-known lower complexity bound $\Omega\Big(\sqrt{\tfrac{L_x}{\mu_x}+\tfrac{L_{xy}^2}{\mu_x\mu_y}+\tfrac{L_y}{\mu_y}}\,\ln\tfrac{1}{\epsilon}\Big)$ for the class of pure first-order algorithms [34].

The paper is organized as follows. First, we present basic definitions and preliminaries used to establish the results. Section 2 is devoted to the study of the linear convergence of Algorithm 1. In Section 3, we study the linear convergence of the gradient descent–ascent method without strong convexity. Indeed, we let $F\in\mathcal{F}(L_x,L_y,L_{xy},0,0)$ and give some necessary and sufficient conditions for linear convergence. Moreover, we derive a convergence rate under this setting.

Notation 1

The $n$-dimensional Euclidean space is denoted by $\mathbb{R}^n$. We use $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ to denote the Euclidean inner product and norm, respectively. For a matrix $A$, $A_{ij}$ denotes its $(i,j)$th entry, and $A^T$ represents the transpose of $A$. We use $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ to denote the largest and the smallest eigenvalue of a symmetric matrix $A$, respectively.

Let $X\subseteq\mathbb{R}^n$. We denote the distance function to $X$ by $d_X(x):=\inf_{\bar x\in X}\|x-\bar x\|$, and the set-valued mapping $\Pi_X(x)$ stands for the projection of $x$ onto $X$, i.e. $\Pi_X(x):=\{y\in X:\|x-y\|=d_X(x)\}$.

We call a differentiable function $f:\mathbb{R}^n\to(-\infty,\infty)$ $L$-smooth if $\|\nabla f(x_1)-\nabla f(x_2)\|\le L\|x_1-x_2\|$ for all $x_1,x_2\in\mathbb{R}^n$. The function $f:\mathbb{R}^n\to\mathbb{R}$ is called $\mu$-strongly convex if the function $x\mapsto f(x)-\tfrac{\mu}{2}\|x\|^2$ is convex. Clearly, any convex function is $0$-strongly convex. We denote the set of real-valued convex functions which are $L$-smooth and $\mu$-strongly convex by $\mathcal{F}_{\mu,L}(\mathbb{R}^n)$.

Let $I$ be a finite index set and let $\{(x^i;g^i;f^i)\}_{i\in I}\subseteq\mathbb{R}^n\times\mathbb{R}^n\times\mathbb{R}$. The set $\{(x^i;g^i;f^i)\}_{i\in I}$ is called $\mathcal{F}_{\mu,L}$-interpolable if there exists $f\in\mathcal{F}_{\mu,L}(\mathbb{R}^n)$ with $f(x^i)=f^i$ and $g^i\in\partial f(x^i)$ for all $i\in I$. The next theorem gives necessary and sufficient conditions for $\mathcal{F}_{\mu,L}$-interpolability.

Theorem 1.1

[27, Theorem 4]

Let $L\in(0,\infty)$ and $\mu\in[0,\infty)$, and let $I$ be a finite index set. The set $\{(x^i;g^i;f^i)\}_{i\in I}\subseteq\mathbb{R}^n\times\mathbb{R}^n\times\mathbb{R}$ is $\mathcal{F}_{\mu,L}$-interpolable if and only if, for any $i,j\in I$, we have

$$\frac{1}{2\big(1-\frac{\mu}{L}\big)}\left(\frac1L\|g^i-g^j\|^2+\mu\|x^i-x^j\|^2-\frac{2\mu}{L}\langle g^j-g^i,x^j-x^i\rangle\right)\le f^i-f^j-\langle g^j,x^i-x^j\rangle.\tag{3}$$

It is worth mentioning that, under the assumptions of Theorem 1.1, the set $\{(x^i;g^i;f^i)\}_{i\in I}$ is interpolable by an $L$-smooth $\mu$-strongly concave function if and only if, for any $i,j\in I$, we have

$$\frac{1}{2\big(1-\frac{\mu}{L}\big)}\left(\frac1L\|g^i-g^j\|^2+\mu\|x^i-x^j\|^2+\frac{2\mu}{L}\langle g^j-g^i,x^j-x^i\rangle\right)\le -f^i+f^j+\langle g^j,x^i-x^j\rangle.\tag{4}$$

2. The gradient descent–ascent method

In this section, we study the convergence rate of the gradient descent–ascent method when $F\in\mathcal{F}(L_x,L_y,L_{xy},\mu_x,\mu_y)$ with $\min\{\mu_x,\mu_y\}>0$. Indeed, we investigate the worst-case behaviour of one step of Algorithm 1 in terms of the distance to the unique saddle point $(x^\star,y^\star)$. Let $(x_2,y_2)$ be generated by the algorithm from the starting point $(x_1,y_1)$. The worst-case convergence rate is, in essence, the optimal value of an optimization problem. Indeed, the worst-case convergence rate of Algorithm 1 is given by the solution of the following abstract optimization problem:
$$\begin{array}{ll}
\max&\dfrac{\|x_2-x^\star\|^2+\|y_2-y^\star\|^2}{\|x_1-x^\star\|^2+\|y_1-y^\star\|^2}\\[2mm]
\text{s.t.}&(x_2,y_2)\ \text{is generated by Algorithm 1 w.r.t.}\ F,x_1,y_1\\
&(x^\star,y^\star)\ \text{is the unique saddle point of problem (1)}\\
&F\in\mathcal{F}(L_x,L_y,L_{xy},\mu_x,\mu_y)\\
&x_1\in\mathbb{R}^n,\ y_1\in\mathbb{R}^m.
\end{array}\tag{5}$$
In problem (5), $F,x_1,x_2,x^\star,y_1,y_2,y^\star$ are decision variables, and $\mu_x,L_x,\mu_y,L_y,L_{xy},t$ are fixed parameters. At first glance, problem (5) seems completely intractable, but its solution may in fact be approximated using a suitable semidefinite programming (SDP) problem, as shown below. This type of approximation is an example of the so-called SDP performance estimation method, which was introduced by Drori and Teboulle [10].

Suppose that
$$F^{i,j}=F(x_i,y_j),\qquad G_x^{i,j}=\nabla_xF(x_i,y_j),\qquad G_y^{i,j}=\nabla_yF(x_i,y_j),\qquad i,j\in\{1,2,\star\},$$
where the indices $\{1,2,\star\}$ refer to the starting point, the point generated by Algorithm 1, and the saddle point of the problem, respectively. Note that, due to the necessary and sufficient optimality conditions for convex–concave saddle point problems, we have $G_x^{\star,\star}=0$ and $G_y^{\star,\star}=0$. By using Theorem 1.1, problem (5) may be relaxed to a finite dimensional optimization problem,
$$\begin{array}{ll}
\max&\dfrac{\|x_2-x^\star\|^2+\|y_2-y^\star\|^2}{\|x_1-x^\star\|^2+\|y_1-y^\star\|^2}\\[2mm]
\text{s.t.}&\{(x_1;G_x^{1,k};F^{1,k}),(x_2;G_x^{2,k};F^{2,k}),(x^\star;G_x^{\star,k};F^{\star,k})\}\ \text{satisfy (3) for}\ k\in\{1,2,\star\}\ \text{w.r.t.}\ \mu_x,L_x\\
&\{(y_1;G_y^{k,1};F^{k,1}),(y_2;G_y^{k,2};F^{k,2}),(y^\star;G_y^{k,\star};F^{k,\star})\}\ \text{satisfy (4) for}\ k\in\{1,2,\star\}\ \text{w.r.t.}\ \mu_y,L_y\\
&\|G_x^{k,i}-G_x^{k,j}\|\le L_{xy}\|y_i-y_j\|,\quad i,j,k\in\{1,2,\star\}\\
&\|G_y^{i,k}-G_y^{j,k}\|\le L_{xy}\|x_i-x_j\|,\quad i,j,k\in\{1,2,\star\}\\
&x_2=x_1-tG_x^{1,1},\quad y_2=y_1+tG_y^{1,1},\quad G_x^{\star,\star}=0,\quad G_y^{\star,\star}=0.
\end{array}\tag{6}$$
In problem (6), $\{(x_i;G_x^{i,j};F^{i,j})\}$ and $\{(y_i;G_y^{j,i};F^{j,i})\}$ $(i,j\in\{1,2,\star\})$ are decision variables. We may assume that $x^\star=0$ and $y^\star=0$, as Algorithm 1 is invariant under translation. By elimination, problem (6) may be reformulated as follows,
$$\begin{array}{ll}
\max&\dfrac{\|x_1-tG_x^{1,1}\|^2+\|y_1+tG_y^{1,1}\|^2}{\|x_1\|^2+\|y_1\|^2}\\[2mm]
\text{s.t.}&\{(x_1;G_x^{1,k};F^{1,k}),(x_1-tG_x^{1,1};G_x^{2,k};F^{2,k}),(0;G_x^{\star,k};F^{\star,k})\}\ \text{satisfy (3) for}\ k\in\{1,2,\star\}\ \text{w.r.t.}\ \mu_x,L_x\\
&\{(y_1;G_y^{k,1};F^{k,1}),(y_1+tG_y^{1,1};G_y^{k,2};F^{k,2}),(0;G_y^{k,\star};F^{k,\star})\}\ \text{satisfy (4) for}\ k\in\{1,2,\star\}\ \text{w.r.t.}\ \mu_y,L_y\\
&\|G_x^{k,1}-G_x^{k,2}\|^2\le L_{xy}^2t^2\|G_y^{1,1}\|^2,\quad k\in\{1,2,\star\}\\
&\|G_y^{1,k}-G_y^{2,k}\|^2\le L_{xy}^2t^2\|G_x^{1,1}\|^2,\quad k\in\{1,2,\star\}\\
&\|G_x^{k,1}-G_x^{k,\star}\|^2\le L_{xy}^2\|y_1\|^2,\quad k\in\{1,2,\star\}\\
&\|G_y^{1,k}-G_y^{\star,k}\|^2\le L_{xy}^2\|x_1\|^2,\quad k\in\{1,2,\star\}\\
&\|G_x^{k,2}-G_x^{k,\star}\|^2\le L_{xy}^2\|y_1+tG_y^{1,1}\|^2,\quad k\in\{1,2,\star\}\\
&\|G_y^{2,k}-G_y^{\star,k}\|^2\le L_{xy}^2\|x_1-tG_x^{1,1}\|^2,\quad k\in\{1,2,\star\}\\
&G_x^{\star,\star}=0,\quad G_y^{\star,\star}=0.
\end{array}\tag{7}$$
To approximate the solution of problem (7), we reformulate it as a semidefinite program by using the Gram matrix of the unknown vectors in the problem. Indeed, we form the Gram matrices $X$ and $Y$ as follows,
$$U=\begin{pmatrix}x_1&x_2&G_x^{1,1}&G_x^{1,2}&G_x^{1,\star}&G_x^{2,1}&G_x^{2,2}&G_x^{2,\star}&G_x^{\star,1}&G_x^{\star,2}\end{pmatrix},$$
$$V=\begin{pmatrix}y_1&y_2&G_y^{1,1}&G_y^{1,2}&G_y^{1,\star}&G_y^{2,1}&G_y^{2,2}&G_y^{2,\star}&G_y^{\star,1}&G_y^{\star,2}\end{pmatrix},\qquad X=U^TU,\quad Y=V^TV.$$
This results in an SDP problem, as long as we view the value $\|x_1\|^2+\|y_1\|^2$, which appears in the denominator of the objective of problem (7), as a fixed parameter. For this reason, we may indeed view problem (7) as an SDP problem in the positive semidefinite matrix variables $X$ and $Y$. The interested reader may refer to [27,31] for more details concerning the Gram matrix formulation, and SDP performance estimation in general.
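Problem (7) itself is lengthy, but the mechanics of SDP performance estimation are easy to illustrate on a smaller instance. The sketch below is not problem (7); it is a toy performance estimation problem, with parameter values of our own choosing, for one gradient step on $\mathcal{F}_{\mu,L}(\mathbb{R}^n)$, built from the same two ingredients: a Gram matrix of the unknown vectors and the interpolation conditions (3). Its optimal value should match the classical one-step contraction factor $\max(|1-t\mu|,|1-tL|)^2$; an SDP-capable solver (e.g. SCS) is required.

```python
import cvxpy as cp

# Toy SDP performance estimation: worst case of one gradient step
# x2 = x1 - t*g1 on F_{mu,L}, measured by ||x2 - x*||^2 / ||x1 - x*||^2,
# taking x* = 0, g* = 0, f* = 0 without loss of generality.
L, mu, t = 2.0, 0.5, 0.4
X = cp.Variable((2, 2), PSD=True)  # Gram matrix of (x1, g1):
# X[0,0] = ||x1||^2, X[1,1] = ||g1||^2, X[0,1] = <x1, g1>
f1 = cp.Variable()
c = 1 / (2 * (1 - mu / L))
# interpolation condition (3) for the index pairs (i,j) = (1,*) and (*,1)
rhs = c * (X[1, 1] / L + mu * X[0, 0] - (2 * mu / L) * X[0, 1])
prob = cp.Problem(
    cp.Maximize(X[0, 0] - 2 * t * X[0, 1] + t**2 * X[1, 1]),  # ||x1 - t g1||^2
    [f1 >= rhs,                 # (3) with i = 1, j = *
     -f1 + X[0, 1] >= rhs,      # (3) with i = *, j = 1
     X[0, 0] == 1])             # normalize ||x1 - x*||^2 = 1
prob.solve()
print(prob.value, max(abs(1 - t * mu), abs(1 - t * L))**2)  # both ~ 0.64
```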

For the convenience of the analysis, we investigate the linear convergence of Algorithm 1 in terms of $L=\max\{L_x,L_y\}$ and $\mu=\min\{\mu_x,\mu_y\}$. Before presenting the main theorem of this section, we need a lemma.

Lemma 2.1

Let $0<\mu\le L$, $c\ge0$, and let $I=\big(0,\frac{2\mu}{\mu L+c^2}\big)$. Let the function $u:I\to\mathbb{R}$ be given by
$$u(t)=\tfrac12(L^2+\mu^2+2c^2)t^2-(L+\mu)t+\tfrac12(L-\mu)t\sqrt{(Lt+\mu t-2)^2+4c^2t^2}.$$
Then $u$ is convex on $I$ and $u(I)\subseteq[-1,0)$.

Proof.

Consider the function $v:I\to\mathbb{R}$ given by
$$v(t)=(L^2+\mu^2+2c^2)t+(L-\mu)\sqrt{(Lt+\mu t-2)^2+4c^2t^2},$$
so that $u(t)=\tfrac12tv(t)-(L+\mu)t$. The function $v$ is convex and positive on $I$. By elementary calculus, one can show that $v'(0)>0$, so $v$ is increasing on $I$ due to convexity. As the product of positive increasing convex functions is a convex function, the function $t\mapsto tv(t)$ is also convex, which implies the convexity of $u$; indeed, $u$ is strictly convex on $I$. Since a convex function attains its maximum over a closed interval at an endpoint, strict convexity gives $u(t)<\max\big\{u(0),u\big(\tfrac{2\mu}{\mu L+c^2}\big)\big\}=0$ for $t\in I$. It remains to show that $\min_{t\in I}u(t)\ge-1$. This follows from
$$u(t)\ge\tfrac12(L^2+\mu^2)t^2-(L+\mu)t\ge-\frac{(L+\mu)^2}{2(L^2+\mu^2)}\ge-1,$$
and the proof is complete.
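As a quick sanity check of Lemma 2.1, one may evaluate $u$ on a grid over $I$ for sample parameter values (our own choice) and test both conclusions numerically:

```python
import numpy as np

L, mu, c = 4.0, 0.7, 1.5
ts = np.linspace(1e-6, 2 * mu / (mu * L + c**2), 400)[:-1]   # grid inside open I
u = (0.5 * (L**2 + mu**2 + 2 * c**2) * ts**2 - (L + mu) * ts
     + 0.5 * (L - mu) * ts * np.sqrt((L*ts + mu*ts - 2)**2 + 4 * c**2 * ts**2))
assert np.all(u < 0) and np.all(u >= -1)    # u(I) is contained in [-1, 0)
assert np.all(np.diff(u, 2) >= -1e-9)       # second differences: discrete convexity
```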

By the weak duality theorem for SDP, one may demonstrate an upper bound on the optimal value of the SDP problem (7) by constructing a feasible solution to its dual problem, i.e. feasible dual multipliers for the constraints of problem (7). This is done in the next theorem. In the proof, the correct values of the dual multipliers are simply given, and their correctness is verified. These values were obtained by solving the SDP problem (7) repeatedly for different numerical values of the parameters, and noting the (numerically) optimal dual multiplier values. Based on these values, it was possible to deduce analytical expressions for the multipliers. For this reason, the proof was found in a computer-assisted way, but it does not rely on any numerical calculations. Having said that, the proof involves a long identity, given in full in Appendix 2 of this paper, that is so long that it could only be obtained in a computer-assisted way.

Theorem 2.2

Let $F\in\mathcal{F}(L_x,L_y,L_{xy},\mu_x,\mu_y)$. Suppose that $L=\max\{L_x,L_y\}$ and $\mu=\min\{\mu_x,\mu_y\}>0$. If $t\in\big(0,\frac{2\mu}{\mu L+L_{xy}^2}\big)$, then Algorithm 1 generates $(x_2,y_2)$ such that
$$\|x_2-x^\star\|^2+\|y_2-y^\star\|^2\le\alpha\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big),\tag{8}$$
where
$$\alpha=1+\tfrac12(L^2+\mu^2+2L_{xy}^2)t^2-(L+\mu)t+\tfrac12(L-\mu)t\sqrt{(Lt+\mu t-2)^2+4L_{xy}^2t^2}.$$

Proof.

As mentioned earlier, we may assume without loss of generality that $x^\star=0$ and $y^\star=0$. By assumption, $F(\cdot,y)\in\mathcal{F}_{\mu,L}(\mathbb{R}^n)$ and $-F(x,\cdot)\in\mathcal{F}_{\mu,L}(\mathbb{R}^m)$ for any fixed $x,y$. Without loss of generality, we may also assume that $L_{xy}=1$, by replacing $F$ with $\frac{1}{L_{xy}}F$. This follows from the observation that Algorithm 1 generates the same point $(x_2,y_2)$ for the problem
$$\min_{x\in\mathbb{R}^n}\max_{y\in\mathbb{R}^m}\ \frac{1}{L_{xy}}F(x,y),$$
with the step length $L_{xy}t$. Moreover, one has $\frac{1}{L_{xy}}F\in\mathcal{F}\big(\frac{L_x}{L_{xy}},\frac{L_y}{L_{xy}},1,\frac{\mu_x}{L_{xy}},\frac{\mu_y}{L_{xy}}\big)$ if and only if $F\in\mathcal{F}(L_x,L_y,L_{xy},\mu_x,\mu_y)$. Now let $t\in\big(0,\frac{2\mu}{\mu L+1}\big)$ and define (the multipliers):
$$\bar\alpha=1+\tfrac12(L^2+\mu^2+2)t^2-(L+\mu)t+\tfrac12(L-\mu)t\sqrt{(Lt+\mu t-2)^2+4t^2},\qquad\beta=\sqrt{(Lt+\mu t-2)^2+4t^2},$$
$$\gamma_1=\frac{t\big(t^2(2+L^2+L\mu)-t(3L+\mu)+(Lt-1)\beta+2\big)}{\beta},\qquad\gamma_2=\frac{t\big(t^2(2+\mu^2+L\mu)-t(3\mu+L)+(1-\mu t)\beta+2\big)}{\beta},$$
$$\gamma_3=\frac{t^2(\beta+Lt-\mu t)}{2\beta}.$$
It is readily verified that $\gamma_1,\gamma_2,\gamma_3\ge0$; since this calculation is somewhat tedious, we present it in Appendix 1. Moreover, Lemma 2.1 implies that $\bar\alpha\in[0,1)$.

The idea of the proof is now as follows: we first establish that, for any feasible solution of the SDP problem (7), it holds that
$$\|x_1-tG_x^{1,1}\|^2+\|y_1+tG_y^{1,1}\|^2-\bar\alpha\big(\|x_1\|^2+\|y_1\|^2\big)\le0.\tag{9}$$
We do this by establishing an algebraic identity for the left-hand side of inequality (9). The first and last terms of this identity (shown in full in Appendix 2) are as follows:
$$\|x_1-tG_x^{1,1}\|^2+\|y_1+tG_y^{1,1}\|^2-\bar\alpha\big(\|x_1\|^2+\|y_1\|^2\big)=-\gamma_1\Big(F^{1,1}-F^{\star,1}-\langle G_x^{\star,1},x_1\rangle-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|G_x^{1,1}-G_x^{\star,1}\|^2+\mu\|x_1\|^2-\tfrac{2\mu}{L}\langle G_x^{\star,1}-G_x^{1,1},x_1\rangle\big)\Big)-\cdots-\frac{t(\beta+Lt-\mu t)^2}{4(L-\mu)\beta}\big\|G_y^{1,1}-G_y^{\star,1}-G_y^{1,\star}\big\|^2.$$
Note that the first term on the right-hand side is indeed non-positive, since $\gamma_1\ge0$, and the expression in brackets is non-negative at any feasible solution of the SDP problem (7), since it corresponds to one of the constraints in (7). The last term is non-positive as well, since it is the product of a non-positive multiplier with a squared norm. The remaining terms in the identity are similarly non-positive (see Appendix 2), proving inequality (9). All that remains is to recognize that, in (9), $G_x^{1,1}$ corresponds to $\nabla_xF(x_1,y_1)$, so that $x_1-tG_x^{1,1}$ corresponds to $x_2$, etc. This yields the statement of the theorem, after rescaling to remove the assumption $L_{xy}=1$.
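The bound (8) is easy to spot-check numerically. The sketch below (with instance sizes and parameter ranges of our own choosing) builds a random strongly convex–strongly concave quadratic whose saddle point is the origin, takes one step of Algorithm 1, and compares the contraction ratio with $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 4

def random_spd(k, lo, hi):
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return Q @ np.diag(rng.uniform(lo, hi, k)) @ Q.T

# F(x,y) = x'Ax x / 2 + x'Axy y - y'Ay y / 2, saddle point at the origin
Ax, Ay = random_spd(n, 0.5, 5.0), random_spd(m, 0.5, 5.0)
Axy = rng.standard_normal((n, m))
L = max(np.linalg.eigvalsh(Ax)[-1], np.linalg.eigvalsh(Ay)[-1])
mu = min(np.linalg.eigvalsh(Ax)[0], np.linalg.eigvalsh(Ay)[0])
Lxy = np.linalg.norm(Axy, 2)

def alpha(L, mu, Lxy, t):  # the rate in (8)
    return (1 + 0.5 * (L**2 + mu**2 + 2 * Lxy**2) * t**2 - (L + mu) * t
            + 0.5 * (L - mu) * t * np.sqrt((L*t + mu*t - 2)**2 + 4 * Lxy**2 * t**2))

t = 0.9 * 2 * mu / (mu * L + Lxy**2)       # inside (0, 2 mu / (mu L + Lxy^2))
x1, y1 = rng.standard_normal(n), rng.standard_normal(m)
x2 = x1 - t * (Ax @ x1 + Axy @ y1)         # descent step in x
y2 = y1 + t * (Axy.T @ x1 - Ay @ y1)       # ascent step in y
ratio = (x2 @ x2 + y2 @ y2) / (x1 @ x1 + y1 @ y1)
assert ratio <= alpha(L, mu, Lxy, t) + 1e-12
```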

One may wonder how we obtained the (analytical) expression for $\alpha$ in Theorem 2.2. Consider the optimization problem
$$\min_{x\in\mathbb{R}^n}f(x),\tag{10}$$
where $f\in\mathcal{F}_{\mu,L}$. It is known that the quadratic function $q(x)=x^TQx$ with $\lambda_{\max}(Q)=L$ and $\lambda_{\min}(Q)=\mu$ attains the worst-case convergence rate for the gradient method; see e.g. [9]. We guessed that this property may also hold for problem (1), and we investigated the bilinear saddle point problem
$$\min_{x\in\mathbb{R}^2}\max_{y\in\mathbb{R}^2}\ \tfrac12x^T\begin{pmatrix}L_x&0\\0&\mu_x\end{pmatrix}x+x^T\begin{pmatrix}0&L_{xy}\\L_{xy}&0\end{pmatrix}y-\tfrac12y^T\begin{pmatrix}L_y&0\\0&\mu_y\end{pmatrix}y,\tag{11}$$
where $L_x\ge\mu_x>0$, $L_y\ge\mu_y>0$ and $L_{xy}$ are fixed parameters, and we derived the worst-case convergence rate of Algorithm 1 with respect to this problem. Our numerical experiments showed that the derived convergence rate is the same as the optimal value of the semidefinite programming problem corresponding to problem (7). Moreover, as a by-product, we show that the convergence rate (8) is exact for one iteration by using problem (11); see Proposition 2.4.

Theorem 2.2 provides some new information concerning Algorithm 1. Firstly, it improves the known convergence factor in the literature; see our discussion in the Introduction. In addition, it establishes the convergence rate for step lengths in a larger interval. Secondly, it does not assume second-order continuous differentiability of $F$, which is commonly used for deriving local convergence rates; see [17,20,33]. Finally, the given convergence rate incorporates the three parameters $\mu=\min\{\mu_x,\mu_y\}$, $L=\max\{L_x,L_y\}$ and $L_{xy}$, which is more informative in comparison with the results in the literature, mostly given in terms of $\mu=\min\{\mu_x,\mu_y\}$ and $L=\max\{L_x,L_y,L_{xy}\}$; see [21,33,34] and references therein. Even if one takes $L=\max\{L_x,L_y,L_{xy}\}$ and $\mu=\min\{\mu_x,\mu_y\}$, convergence rate (8) dominates (2). This follows from the fact that, for $t\in\big(0,\frac{\mu}{2L^2}\big)$, one has

$$\big(1+4L^2t^2-2\mu t\big)-\Big(1+\tfrac12(3L^2+\mu^2)t^2-(L+\mu)t+\tfrac12(L-\mu)t\sqrt{(Lt+\mu t-2)^2+4L^2t^2}\Big)\ge4L^2t^2-(2L^2+\mu^2-L\mu)t^2\ge0,$$
where the first inequality results from $\sqrt{(Lt+\mu t-2)^2+4L^2t^2}\le(2-Lt-\mu t)+2Lt$. In addition, in this case, the step length can take values in a larger interval, as $\big(0,\frac{\mu}{2L^2}\big)\subseteq\big(0,\frac{2\mu}{L(L+\mu)}\big)$. Moreover, Conjecture 2.6 discusses the convergence rate in terms of $L_x,L_y,L_{xy},\mu_x,\mu_y$. The sketch below spot-checks this domination numerically.
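In the following sketch, the parameter values are our own illustrative choices; it evaluates both rates on a grid of admissible step lengths and checks that the bound of Theorem 2.2 never exceeds the classical bound (2):

```python
import numpy as np

def alpha_new(L, mu, Lxy, t):
    # bound (8), with L = max{Lx, Ly}, mu = min{mux, muy}
    return (1 + 0.5 * (L**2 + mu**2 + 2 * Lxy**2) * t**2 - (L + mu) * t
            + 0.5 * (L - mu) * t * np.sqrt((L*t + mu*t - 2)**2 + 4 * Lxy**2 * t**2))

def alpha_vi(L, mu, t):
    # classical bound (2), with L = max{Lx, Ly, Lxy}
    return 1 + 4 * L**2 * t**2 - 2 * mu * t

L, mu = 5.0, 0.4                          # treat L as max{Lx, Ly, Lxy}
for t in np.linspace(1e-4, mu / (2 * L**2), 20):
    assert alpha_new(L, mu, L, t) <= alpha_vi(L, mu, t) + 1e-12
```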

The next proposition gives the optimal step length with respect to the worst case convergence rate.

Proposition 2.3

Let $F\in\mathcal{F}(L_x,L_y,L_{xy},\mu_x,\mu_y)$. If $L=\max\{L_x,L_y\}$ and $\mu=\min\{\mu_x,\mu_y\}>0$, then the optimal step length for Algorithm 1 with respect to the bound (8) is
$$t^\star=\frac{2\Big((L+\mu)\sqrt{L_{xy}^2+L\mu}+L_{xy}(\mu-L)\Big)}{\Big(4L_{xy}^2+(L+\mu)^2\Big)\sqrt{L_{xy}^2+L\mu}}.\tag{12}$$
Moreover, the convergence rate with respect to $t^\star$ is
$$\alpha^\star=\frac{8L_{xy}(L^2-\mu^2)\sqrt{L_{xy}^2+L\mu}+(L^2-\mu^2)^2+16L_{xy}^2\big(L\mu+L_{xy}^2\big)}{\Big((L+\mu)^2+4L_{xy}^2\Big)^2}.\tag{13}$$

Proof.

Let $\alpha:\big[0,\frac{2\mu}{\mu L+L_{xy}^2}\big]\to\mathbb{R}$ be given by
$$\alpha(t)=1+\tfrac12(L^2+\mu^2+2L_{xy}^2)t^2-(L+\mu)t+\tfrac12(L-\mu)t\sqrt{(Lt+\mu t-2)^2+4L_{xy}^2t^2}.$$
By Lemma 2.1, $\alpha$ is a strictly convex function on its domain. By doing some algebra, one can verify that $\alpha'(t^\star)=0$, which implies that $t^\star$ is the minimizer.

If $L_{xy}=0$, problem (1) reduces to a separable optimization problem; indeed, the variables $x$ and $y$ are then decoupled. Under this assumption, the optimal step length given by Proposition 2.3 is $t^\star=\frac{2}{L+\mu}$, which is the well-known optimal step length for the optimization problem $\min_{x\in\mathbb{R}^n}f(x)$ with $f\in\mathcal{F}_{\mu,L}$; see [24, Theorem 2.1.15]. Moreover, the convergence rate corresponding to $t^\star$ is then $\alpha^\star=\big(\frac{L-\mu}{L+\mu}\big)^2$. By some algebra, one can show that, under the assumptions of Proposition 2.3, Algorithm 1 has a complexity of $O\big(\big(\frac{L}{\mu}+\frac{L_{xy}^2}{\mu^2}\big)\ln\frac{1}{\epsilon}\big)$. Note that the lower iteration complexity bound for first-order methods with $L=\max\{L_x,L_y\}$ and $\mu=\min\{\mu_x,\mu_y\}$ is $\Omega\Big(\sqrt{\frac{L}{\mu}+\frac{L_{xy}^2}{\mu^2}}\,\ln\frac{1}{\epsilon}\Big)$; see [34].
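Under our reading of formulas (12) and (13), the following sketch (with sample parameters of our own choosing) checks that $\alpha(t^\star)$ agrees with $\alpha^\star$, and that a fine grid search over the admissible step interval does not find a better step length:

```python
import numpy as np

L, mu, Lxy = 3.0, 0.4, 1.2
s = np.sqrt(Lxy**2 + L * mu)
t_star = 2 * ((L + mu) * s + Lxy * (mu - L)) / ((4 * Lxy**2 + (L + mu)**2) * s)   # (12)
a_star = (8 * Lxy * (L**2 - mu**2) * s + (L**2 - mu**2)**2
          + 16 * Lxy**2 * (L * mu + Lxy**2)) / ((L + mu)**2 + 4 * Lxy**2)**2      # (13)

def alpha(t):  # the rate in (8)
    return (1 + 0.5 * (L**2 + mu**2 + 2 * Lxy**2) * t**2 - (L + mu) * t
            + 0.5 * (L - mu) * t * np.sqrt((L*t + mu*t - 2)**2 + 4 * Lxy**2 * t**2))

ts = np.linspace(1e-6, 2 * mu / (mu * L + Lxy**2) - 1e-6, 100_000)
print(abs(alpha(t_star) - a_star))   # ~ 0 if (12) and (13) are consistent
print(alpha(ts).min() - a_star)      # grid minimum should not beat a_star
```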

As mentioned earlier, we calculated the convergence rate by using problem (Equation11). The next proposition states that the bound (Equation8) is tight for some class of bilinear saddle point problems.

Proposition 2.4

Let $F\in\mathcal{F}(L_x,L_y,L_{xy},\mu_x,\mu_y)$. Suppose that $L_x=L_y$ and $\min\{\mu_x,\mu_y\}>0$. If $t\in\big(0,\frac{2\mu}{\mu L+L_{xy}^2}\big)$, then convergence rate (8) is exact for one iteration.

Proof.

To establish the proposition, it suffices to exhibit a problem for which Algorithm 1 generates $(x_2,y_2)$ from the initial point $(x_1,y_1)$ such that
$$\|x_2-x^\star\|^2+\|y_2-y^\star\|^2=\alpha\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big),$$
where $\alpha$ is the convergence rate factor given in Theorem 2.2. Consider problem (11). Due to the symmetry of Algorithm 1 and the class of problems, we may assume $\mu_x\ge\mu_y$. Moreover, without loss of generality, we can take $L_{xy}=1$; see our discussion in the proof of Theorem 2.2. Suppose that $L=L_x$, $\mu=\mu_y$ and $\beta=\sqrt{(Lt+\mu t-2)^2+4t^2}$. One can verify that Algorithm 1 with the initial point
$$x_1^1=0,\qquad x_2^1=\sqrt{\frac{2-t(L+\mu)+\beta}{2\beta}},\qquad y_1^1=t\sqrt{\frac{2}{\beta\big(2-t(L+\mu)+\beta\big)}},\qquad y_2^1=0$$
generates $(x_2,y_2)$ satisfying the desired equality.
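Since one step of Algorithm 1 on the bilinear problem (11) is a linear map, the worst-case one-step ratio over all starting points equals the squared spectral norm of the iteration matrix. The sketch below (with parameters of our own choosing) uses this to check the claimed tightness when $L_x=L_y$:

```python
import numpy as np

L, mux, muy, Lxy, t = 3.0, 1.0, 0.6, 1.0, 0.2   # Lx = Ly = L, mux >= muy
mu = min(mux, muy)
assert 0 < t < 2 * mu / (mu * L + Lxy**2)
Ax, Ay = np.diag([L, mux]), np.diag([L, muy])
Axy = np.array([[0.0, Lxy], [Lxy, 0.0]])
# one GDA step on problem (11) is linear: (x2, y2) = M (x1, y1)
M = np.block([[np.eye(2) - t * Ax, -t * Axy],
              [t * Axy.T, np.eye(2) - t * Ay]])
worst_ratio = np.linalg.norm(M, 2)**2           # max ratio over unit starting points

alpha = (1 + 0.5 * (L**2 + mu**2 + 2 * Lxy**2) * t**2 - (L + mu) * t
         + 0.5 * (L - mu) * t * np.sqrt((L*t + mu*t - 2)**2 + 4 * Lxy**2 * t**2))
print(worst_ratio, alpha)                       # should coincide (tightness)
```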

One may wonder why we stress one iteration in Proposition 2.4. Based on our numerical results, if $L_{xy}>0$, under the setting of Theorem 2.2 we observed that
$$\|x_k-x^\star\|^2+\|y_k-y^\star\|^2<\alpha^{k-1}\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big),\qquad k\ge3,$$
for some $t\in\big(0,\frac{2\mu}{\mu L+L_{xy}^2}\big)$. The reason may be related to the fact that the vector field $\big(\nabla_xF(x,y),-\nabla_yF(x,y)\big)^T$ is not conservative.

It may be of interest whether inequality (8) can hold without strong convexity. Without strong convexity, the solution set need not be a singleton. Hence, we investigate the distance to the solution set, that is, whether there exists $0\le\alpha<1$ with
$$d_S^2\big((x_2,y_2)\big)\le\alpha\,d_S^2\big((x_1,y_1)\big).$$
The next proposition says that, in general, the answer is negative. Indeed, it gives an example with $\min\{\mu_x,\mu_y\}=0$ and a unique saddle point for which
$$\|x_2-x^\star\|^2+\|y_2-y^\star\|^2\ge\alpha\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big),$$
for some $\alpha\ge1$, no matter how close $(x_1,y_1)$ is to the unique saddle point and which positive step length $t$ is taken. In the next proposition, we assume without loss of generality that $\mu_x=0$, and construct an example analogous to that given in Proposition 2.4.

Proposition 2.5

Let $L,L_{xy},\mu_y,t,r>0$ be given. Then there exist $\alpha\ge1$ and a function $F\in\mathcal{F}(L,L,L_{xy},0,\mu_y)$ with a unique saddle point $(x^\star,y^\star)$, together with a starting point $(x_1,y_1)$, such that, for $(x_2,y_2)$ generated by Algorithm 1, we have
$$\|x_2-x^\star\|^2+\|y_2-y^\star\|^2\ge\alpha\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big)\qquad\text{and}\qquad\|x_1-x^\star\|^2+\|y_1-y^\star\|^2=r^2.$$

Proof.

As discussed before, we may assume $L_{xy}=1$. Consider the bilinear saddle point problem
$$\min_{x\in\mathbb{R}^2}\max_{y\in\mathbb{R}^2}\ F(x,y)=\tfrac12x^T\begin{pmatrix}L&0\\0&0\end{pmatrix}x+x^T\begin{pmatrix}0&1\\1&0\end{pmatrix}y-\tfrac12y^T\begin{pmatrix}L&0\\0&\mu_y\end{pmatrix}y.$$
It is clear that $F\in\mathcal{F}(L,L,L_{xy},0,\mu_y)$, and the unique saddle point is $(x^\star,y^\star)=(0,0)$. Suppose that
$$x_1^1=0,\qquad x_2^1=r\sqrt{\frac{2-tL+\beta}{2\beta}},\qquad y_1^1=rt\sqrt{\frac{2}{\beta(2-tL+\beta)}},\qquad y_2^1=0,$$
where $\beta=\sqrt{(Lt-2)^2+4t^2}$. One can verify that Algorithm 1 generates $(x_2,y_2)$ with
$$\|x_2-x^\star\|^2+\|y_2-y^\star\|^2=\alpha\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big)=\alpha r^2,$$
where $\alpha=1+\tfrac12(L^2+2)t^2-Lt+\tfrac12Lt\sqrt{(Lt-2)^2+4t^2}$. By Proposition 2.3, one can infer that $\alpha\ge1$.

Note that $r$ can take any positive value in Proposition 2.5. By Proposition 2.4, one can infer that the convergence rate factor for bilinear saddle point problems cannot be improved for one iteration, since the example given there is a bilinear saddle point problem. Furthermore, the given convergence rate factor is tight when $L_x=L_y$. As discussed in [28], the function $H(x,y)=F\big(\big(\tfrac{L_y}{L_x}\big)^{1/4}x,\big(\tfrac{L_x}{L_y}\big)^{1/4}y\big)$ shares the same smoothness constant with respect to $x$ and $y$, that is, $\nabla_xH(\cdot,y)$ and $\nabla_yH(x,\cdot)$ are Lipschitz continuous with the same modulus $\sqrt{L_xL_y}$. However, gradient methods are not invariant under scaling; see [8, Chapter 9]. Hence, we would lose generality by assuming this condition.

Based on our numerical results and our analysis of problem (11), we conjecture the following (exact) convergence rate of Algorithm 1 in terms of $L_x,L_y,L_{xy},\mu_x,\mu_y$. Due to the symmetry of Algorithm 1, we may assume that $L_x\ge L_y$. Moreover, Proposition 2.4 implies that bound (8) is tight when $\mu_y\le\mu_x$. Hence, we need only consider $\mu_y>\mu_x$.

Conjecture 2.6

Let $F\in\mathcal{F}(L_x,L_y,L_{xy},\mu_x,\mu_y)$. Suppose that $\mu_y>\mu_x>0$, $\max\{L_x,L_y\}=L_x$, and define
$$c=\tfrac12(L_y^2+\mu_x^2)t-(L_y+\mu_x)+\tfrac12(L_y-\mu_x)\sqrt{(L_yt+\mu_xt-2)^2+4L_{xy}^2t^2},$$
$$\bar\mu=\frac{c+2L_x-L_x^2t+L_xL_{xy}^2t^2-2\sqrt{\big(c+L_x(2-L_xt)\big)\big(1+t(c+tL_{xy}^2)\big)}}{tL_{xy}\big(c+tL_{xy}^2+L_x(2-L_xt)\big)},$$
$$\alpha(\mu,L,L_{xy},t)=1+\tfrac12(L^2+\mu^2+2L_{xy}^2)t^2-(L+\mu)t+\tfrac12(L-\mu)t\sqrt{(Lt+\mu t-2)^2+4L_{xy}^2t^2}.$$
Then one of the following scenarios holds.

  1. Assume that $\mu_x\mu_y(L_x-L_y)\ge L_{xy}^2(\mu_y-\mu_x)$ and $t\in\big(0,\frac{2\mu_y}{L_x\mu_y+L_{xy}^2}\big)$.

    1. If $\mu_y\le\bar\mu$, then $\|x_2-x^\star\|^2+\|y_2-y^\star\|^2\le\alpha(\mu_y,L_x,L_{xy},t)\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big)$.

    2. If $\mu_y\ge\bar\mu$, then $\|x_2-x^\star\|^2+\|y_2-y^\star\|^2\le\alpha(\mu_x,L_y,L_{xy},t)\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big)$.

  2. Assume that $\mu_x\mu_y(L_x-L_y)\le L_{xy}^2(\mu_y-\mu_x)$ and $t\in\big(0,\frac{2\mu_x}{L_y\mu_x+L_{xy}^2}\big)$.

    1. If $\mu_y\le\bar\mu$, then $\|x_2-x^\star\|^2+\|y_2-y^\star\|^2\le\alpha(\mu_y,L_x,L_{xy},t)\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big)$.

    2. If $\mu_y\ge\bar\mu$, then $\|x_2-x^\star\|^2+\|y_2-y^\star\|^2\le\alpha(\mu_x,L_y,L_{xy},t)\big(\|x_1-x^\star\|^2+\|y_1-y^\star\|^2\big)$.

Although we have extensive numerical evidence supporting Conjecture 2.6, we have been unable to prove either part 1 or part 2. To be more precise, we have verified numerically that the optimal values of the SDP problem (7) and of problem (11) correspond to the expressions in Conjecture 2.6 for many different numerical values of the parameters $L_x$, $L_y$, $L_{xy}$, $\mu_x$ and $\mu_y$, but we did not manage to derive analytical expressions for the dual multipliers of SDP problem (7) that would be needed to prove the conjecture.

2.1. Numerical illustration

In this section we provide randomly generated examples to compare the optimal step length (12) given in this paper with the known step length $t=\mu/(4L^2)$, for the bilinear problem
$$\min_{x\in\mathbb{R}^5}\max_{y\in\mathbb{R}^4}\ \tfrac12x^TA_xx+x^TA_{xy}y-\tfrac12y^TA_yy,$$
where $A_x$ and $A_y$ are symmetric positive definite matrices. The instances are constructed such that the spectra of $A_x$ and $A_y$ are contained in the interval $[0.5,5]$. For this class of instances, one has $L=\max\{\lambda_{\max}(A_x),\lambda_{\max}(A_y)\}\in[0.5,5]$ and $\mu=\min\{\lambda_{\min}(A_x),\lambda_{\min}(A_y)\}\in[0.5,L]$. The matrix $A_{xy}\in\mathbb{R}^{5\times4}$ has entries chosen uniformly at random from $[0,1]$, and subsequently we set $L_{xy}=\|A_{xy}\|_2$. By construction, the solution (saddle point) is $(x^\star,y^\star)=(0,0)$. The starting points $x_1$ and $y_1$ are randomly drawn unit vectors, so that the initial condition $\|x_1-x^\star\|^2+\|y_1-y^\star\|^2=2$ is satisfied. A sketch of the experiment follows.
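The sketch below follows the construction just described; the random number generator, the seed, and the reading of the known step length as $\mu/(4L^2)$ with $L=\max\{L_x,L_y,L_{xy}\}$ are our own choices:

```python
import numpy as np

rng = np.random.default_rng(42)

def random_spd(k):
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return Q @ np.diag(rng.uniform(0.5, 5.0, k)) @ Q.T   # spectrum in [0.5, 5]

def run(num_instances=100, num_iters=50):
    dist = {"t_star": np.zeros(num_iters + 1), "t_known": np.zeros(num_iters + 1)}
    for _ in range(num_instances):
        Ax, Ay, Axy = random_spd(5), random_spd(4), rng.uniform(0.0, 1.0, (5, 4))
        L = max(np.linalg.eigvalsh(Ax)[-1], np.linalg.eigvalsh(Ay)[-1])
        mu = min(np.linalg.eigvalsh(Ax)[0], np.linalg.eigvalsh(Ay)[0])
        Lxy = np.linalg.norm(Axy, 2)
        s = np.sqrt(Lxy**2 + L * mu)
        steps = {"t_star": 2 * ((L + mu) * s + Lxy * (mu - L))
                           / ((4 * Lxy**2 + (L + mu)**2) * s),   # step length (12)
                 "t_known": mu / (4 * max(L, Lxy)**2)}           # classical step
        x0 = rng.standard_normal(5); x0 /= np.linalg.norm(x0)    # unit starting
        y0 = rng.standard_normal(4); y0 /= np.linalg.norm(y0)    # vectors: dist^2 = 2
        for name, t in steps.items():
            x, y = x0.copy(), y0.copy()
            for k in range(num_iters + 1):
                dist[name][k] += x @ x + y @ y                   # saddle is origin
                x, y = x - t * (Ax @ x + Axy @ y), y + t * (Axy.T @ x - Ay @ y)
    return {name: total / num_instances for name, total in dist.items()}
```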

In Figure 1 we show average values (over 100 randomly generated instances) of the convergence indicator $\|x_k-x^\star\|^2+\|y_k-y^\star\|^2$ after $k$ iterations, for the two step lengths (see Note 1). Note that our new step length (12) gives a clear improvement over the known step length $\mu/(4L^2)$.

Figure 1. Mean values of $\|x_k-x^\star\|^2+\|y_k-y^\star\|^2$ over 100 randomly generated instances, for each iteration $k$, using the two different step lengths $t$.

3. Linear convergence without strong convexity

In this section, we study the linear convergence of Algorithm 1 without assuming strong convexity. Indeed, we suppose that $F\in\mathcal{F}(L_x,L_y,L_{xy},0,0)$ and propose some necessary and sufficient conditions for linear convergence. This subject has received some attention in recent years, and sufficient conditions have been proposed in [11,33] under which Algorithm 1 enjoys a local linear convergence rate, or is linearly convergent for bilinear saddle point problems. This topic has been investigated extensively in the context of optimization; the interested reader may refer to [1,7,19,23] and references therein. In this study, we extend the quadratic gradient growth property introduced in [19] to saddle point problems.

Recall that we denote the non-empty solution set of problem (1) by $S$. As we do not assume strong convexity (concavity), $S$ need not be a singleton. Note that $S$ is a closed convex set under our assumptions. Recall also that $\Pi_S((x,y))$ denotes the projection of $(x,y)$ onto $S$.

Definition 3.1

Let $\mu_F>0$. A function $F$ has quadratic gradient growth if, for any $x\in\mathbb{R}^n$ and $y\in\mathbb{R}^m$,
$$\langle\nabla_xF(x,y),x-\bar x\rangle-\langle\nabla_yF(x,y),y-\bar y\rangle\ge\mu_F\,d_S^2\big((x,y)\big),\tag{14}$$
where $(\bar x,\bar y)=\Pi_S((x,y))$.

Note that if we set $y=\bar y$ in (14), we obtain
$$\langle\nabla_xF(x,\bar y),x-\bar x\rangle\ge\mu_F\|x-\bar x\|^2.$$
Hence, $L_x$-smoothness implies that $\mu_F\le L_x$. Consequently, by symmetry, we have $\mu_F\le\min\{L_x,L_y\}$. The next proposition states that the quadratic gradient growth condition is weaker than strong convexity–strong concavity; indeed, strong convexity–strong concavity implies the quadratic gradient growth property.

Proposition 3.2

Let $F\in\mathcal{F}(L_x,L_y,L_{xy},\mu_x,\mu_y)$. If $\min\{\mu_x,\mu_y\}>0$, then $F$ has quadratic gradient growth with $\mu_F=\min\{\mu_x,\mu_y\}$.

Proof.

Under the assumptions, problem (1) has a unique solution $(x^\star,y^\star)$, and $\nabla_xF(x^\star,y^\star)=0$ and $\nabla_yF(x^\star,y^\star)=0$. Let $\mu=\min\{\mu_x,\mu_y\}$ and $L=\max\{L_x,L_y\}$, and suppose that $(x,y)\in\mathbb{R}^n\times\mathbb{R}^m$. By Theorem 1.1, we have

$$\begin{aligned}
0\ge{}&\Big(F(x^\star,y)-F(x,y)+\langle\nabla_xF(x,y),x-x^\star\rangle\\
&\quad-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|\nabla_xF(x,y)-\nabla_xF(x^\star,y)\|^2+\mu\|x-x^\star\|^2-\tfrac{2\mu}{L}\langle\nabla_xF(x,y)-\nabla_xF(x^\star,y),x-x^\star\rangle\big)\Big)\\
&+\Big(F(x,y^\star)-F(x^\star,y^\star)-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|\nabla_xF(x,y^\star)\|^2+\mu\|x-x^\star\|^2-\tfrac{2\mu}{L}\langle\nabla_xF(x,y^\star),x-x^\star\rangle\big)\Big)\\
&+\Big(F(x,y^\star)-F(x,y)-\langle\nabla_yF(x,y),y^\star-y\rangle\\
&\quad-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|\nabla_yF(x,y)-\nabla_yF(x,y^\star)\|^2+\mu\|y-y^\star\|^2-\tfrac{2\mu}{L}\langle\nabla_yF(x,y)-\nabla_yF(x,y^\star),y-y^\star\rangle\big)\Big)\\
&+\Big(F(x^\star,y)-F(x^\star,y^\star)-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|\nabla_yF(x^\star,y)\|^2+\mu\|y-y^\star\|^2-\tfrac{2\mu}{L}\langle\nabla_yF(x^\star,y),y^\star-y\rangle\big)\Big)\\
={}&-\tfrac{\mu^2}{L-\mu}\Big\|(x-x^\star)-\tfrac{1}{2\mu}\big(\nabla_xF(x,y)+\nabla_xF(x,y^\star)-\nabla_xF(x^\star,y)\big)\Big\|^2-\tfrac{1}{4(L-\mu)}\big\|\nabla_xF(x,y)-\nabla_xF(x,y^\star)-\nabla_xF(x^\star,y)\big\|^2\\
&-\tfrac{\mu^2}{L-\mu}\Big\|(y-y^\star)+\tfrac{1}{2\mu}\big(\nabla_yF(x,y)-\nabla_yF(x,y^\star)+\nabla_yF(x^\star,y)\big)\Big\|^2-\tfrac{1}{4(L-\mu)}\big\|\nabla_yF(x,y)-\nabla_yF(x^\star,y)-\nabla_yF(x,y^\star)\big\|^2\\
&-\mu\big(\|x-x^\star\|^2+\|y-y^\star\|^2\big)+\langle\nabla_xF(x,y),x-x^\star\rangle-\langle\nabla_yF(x,y),y-y^\star\rangle.
\end{aligned}$$
Hence,
$$\mu\big(\|x-x^\star\|^2+\|y-y^\star\|^2\big)\le\langle\nabla_xF(x,y),x-x^\star\rangle-\langle\nabla_yF(x,y),y-y^\star\rangle,$$
and the proof is complete.

Note that the converse of Proposition 3.2 does not necessarily hold. Consider the following saddle point problem
$$\min_{x\in\mathbb{R}}\max_{y\in\mathbb{R}}\ F(x,y):=f(x+y)-2y^2,\tag{15}$$
where
$$f(s)=\begin{cases}0&|s|\le1,\\(s-1)^2&s>1,\\(s+1)^2&s<-1.\end{cases}$$
It is seen that $F$ is not strongly convex–strongly concave, and the solution set of problem (15) is $\{(x,0):|x|\le1\}$. By doing some algebra, one can check that $F$ has quadratic gradient growth with $\mu_F=1$, while it is not strongly convex with respect to the first component. For the case where $F(\cdot,y)$ is not strongly convex and $F(x,\cdot)$ is not strongly concave, one may consider the uncoupled problem $\min_{x\in\mathbb{R}}\max_{y\in\mathbb{R}}f(x)-f(y)$.
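The claim $\mu_F=1$ for example (15) can be spot-checked numerically. The sketch below samples random points and tests inequality (14), using the closed-form projection onto $S=[-1,1]\times\{0\}$ (sampling range and seed are our own choices):

```python
import numpy as np

fp = lambda s: 2 * (s - 1) * (s > 1) + 2 * (s + 1) * (s < -1)   # f'(s)

rng = np.random.default_rng(1)
for _ in range(10_000):
    x, y = rng.uniform(-10, 10, 2)
    gx, gy = fp(x + y), fp(x + y) - 4 * y     # grad_x F and grad_y F for (15)
    xbar = np.clip(x, -1.0, 1.0)              # projection onto S = [-1,1] x {0}
    lhs = gx * (x - xbar) - gy * (y - 0.0)
    assert lhs >= 1.0 * ((x - xbar)**2 + y**2) - 1e-9   # growth with mu_F = 1
```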

In what follows, by using performance estimation, we establish that Algorithm 1 enjoys linear convergence whenever $F\in\mathcal{F}(L_x,L_y,L_{xy},0,0)$ has quadratic gradient growth. Without loss of generality, we may assume that $(0,0)=\Pi_S((x_1,y_1))$. To establish linear convergence, it suffices to show that
$$d_S^2\big((x_2,y_2)\big)\le\|x_2\|^2+\|y_2\|^2\le\alpha\,d_S^2\big((x_1,y_1)\big),$$
for some $\alpha\in[0,1)$. Similarly to Section 2, we formulate the following optimization problem:
$$\begin{array}{ll}
\max&\dfrac{\|x_1-tG_x^{1,1}\|^2+\|y_1+tG_y^{1,1}\|^2}{\|x_1\|^2+\|y_1\|^2}\\[2mm]
\text{s.t.}&\{(x_1;G_x^{1,k};F^{1,k}),(x_1-tG_x^{1,1};G_x^{2,k};F^{2,k}),(0;G_x^{\star,k};F^{\star,k})\}\ \text{satisfy (3) for}\ k\in\{1,2,\star\}\ \text{w.r.t.}\ \mu_x=0,L_x\\
&\{(y_1;G_y^{k,1};F^{k,1}),(y_1+tG_y^{1,1};G_y^{k,2};F^{k,2}),(0;G_y^{k,\star};F^{k,\star})\}\ \text{satisfy (4) for}\ k\in\{1,2,\star\}\ \text{w.r.t.}\ \mu_y=0,L_y\\
&\|G_x^{k,i}-G_x^{k,j}\|\le L_{xy}\|y_i-y_j\|,\quad i,j,k\in\{1,2,\star\}\\
&\|G_y^{i,k}-G_y^{j,k}\|\le L_{xy}\|x_i-x_j\|,\quad i,j,k\in\{1,2,\star\}\\
&\mu_F\big(\|x_1\|^2+\|y_1\|^2\big)\le\langle G_x^{1,1},x_1\rangle-\langle G_y^{1,1},y_1\rangle,\quad G_x^{\star,\star}=0,\quad G_y^{\star,\star}=0.
\end{array}\tag{16}$$
Note that in formulation (16) we only use a subset of the available constraints for the performance estimation. In the next theorem, we prove linear convergence of Algorithm 1 when $F$ has quadratic gradient growth.

Theorem 3.3

Let $F\in\mathcal{F}(L_x,L_y,L_{xy},0,0)$ and $L=\max\{L_x,L_y\}$. Assume that $F$ has quadratic gradient growth with $\mu_F>0$. If $t\in\Big(0,\frac{2\mu_F}{L\mu_F+2L_{xy}\sqrt{\mu_F(L-\mu_F)}+L_{xy}^2}\Big)$, then Algorithm 1 generates $(x_2,y_2)$ such that
$$d_S^2\big((x_2,y_2)\big)\le\alpha\,d_S^2\big((x_1,y_1)\big),\tag{17}$$
where $\alpha=1+t\Big(2L_{xy}t\sqrt{\mu_F(L-\mu_F)}+\mu_F(Lt-2)+tL_{xy}^2\Big)$.

Proof.

The argument is similar to that of Theorem 2.2. It is seen that, for any step length $t$ in the given interval, $\alpha\in[0,1)$. We may assume without loss of generality that $L_{xy}=1$. By the assumptions, $F(\cdot,y)\in\mathcal{F}_{0,L}(\mathbb{R}^n)$ and $-F(x,\cdot)\in\mathcal{F}_{0,L}(\mathbb{R}^m)$ for any fixed $x,y$. Suppose that
$$\bar\alpha=1+t\Big(2t\sqrt{\mu_F(L-\mu_F)}+\mu_F(Lt-2)+t\Big),\qquad\beta=t^2\Big(\sqrt{\mu_F(L-\mu_F)}+\mu_F\Big),$$
$$\gamma_1=t^2\Big(\mu_F\sqrt{\mu_F(L-\mu_F)}+\mu_F\Big),\qquad\gamma_2=\frac{t^2\Big(\mu_F(L-\mu_F)+\sqrt{\mu_F(L-\mu_F)}\Big)}{\mu_F},$$
$$\gamma_3=\frac{t^2\Big(\mu_F(L+\mu_F)+\sqrt{\mu_F(L-\mu_F)}\Big)}{\mu_F}+\beta\sqrt{\frac{L-\mu_F}{\mu_F}}+2t,\qquad\gamma_4=\tfrac12t^2\Big(\sqrt{\mu_F(L-\mu_F)}+1\Big).$$
One may readily verify that $\gamma_1,\gamma_2,\gamma_3,\gamma_4\ge0$. By doing some algebra, one can show that
$$\begin{aligned}
&\|x_1-tG_x^{1,1}\|^2+\|y_1+tG_y^{1,1}\|^2-\bar\alpha\big(\|x_1\|^2+\|y_1\|^2\big)\\
&\;+\gamma_1\Big(F^{1,1}-F^{\star,1}-\langle G_x^{\star,1},x_1\rangle-\tfrac{1}{2L}\|G_x^{1,1}-G_x^{\star,1}\|^2\Big)+\gamma_2\Big(F^{\star,1}-F^{1,1}+\langle G_x^{1,1},x_1\rangle-\tfrac{1}{2L}\|G_x^{\star,1}-G_x^{1,1}\|^2\Big)\\
&\;+\gamma_2\Big(F^{1,\star}-F^{\star,\star}-\tfrac{1}{2L}\|G_x^{1,\star}\|^2\Big)+\gamma_1\Big(F^{\star,\star}-F^{1,\star}+\langle G_x^{1,\star},x_1\rangle-\tfrac{1}{2L}\|G_x^{1,\star}\|^2\Big)\\
&\;+\gamma_1\Big(F^{1,\star}-F^{1,1}+\langle G_y^{1,\star},y_1\rangle-\tfrac{1}{2L}\|G_y^{1,1}-G_y^{1,\star}\|^2\Big)+\gamma_2\Big(F^{1,1}-F^{1,\star}-\langle G_y^{1,1},y_1\rangle-\tfrac{1}{2L}\|G_y^{1,\star}-G_y^{1,1}\|^2\Big)\\
&\;+\gamma_2\Big(-F^{\star,1}+F^{\star,\star}-\tfrac{1}{2L}\|G_y^{\star,1}\|^2\Big)+\gamma_1\Big(-F^{\star,\star}+F^{\star,1}+\langle G_y^{\star,1},y_1\rangle-\tfrac{1}{2L}\|G_y^{\star,1}\|^2\Big)\\
&\;+\gamma_3\Big(\langle G_x^{1,1},x_1\rangle-\langle G_y^{1,1},y_1\rangle-\mu_F\big(\|x_1\|^2+\|y_1\|^2\big)\Big)\\
&\;+\gamma_4\big(\|x_1\|^2-\|G_y^{1,1}-G_y^{\star,1}\|^2\big)+\gamma_4\big(\|x_1\|^2-\|G_y^{1,\star}\|^2\big)+\gamma_4\big(\|y_1\|^2-\|G_x^{1,1}-G_x^{1,\star}\|^2\big)+\gamma_4\big(\|y_1\|^2-\|G_x^{\star,1}\|^2\big)\\
&=-\zeta_1\big\|x_1+\zeta_2G_x^{1,1}-\zeta_3\big(G_x^{1,\star}-G_x^{\star,1}\big)\big\|^2-\zeta_4\big\|G_x^{1,1}-G_x^{1,\star}-G_x^{\star,1}\big\|^2\\
&\;\;-\zeta_1\big\|y_1-\zeta_2G_y^{1,1}-\zeta_3\big(G_y^{1,\star}-G_y^{\star,1}\big)\big\|^2-\zeta_4\big\|G_y^{1,1}-G_y^{\star,1}-G_y^{1,\star}\big\|^2\le0,
\end{aligned}$$
where the multipliers $\zeta_1,\zeta_2,\zeta_3,\zeta_4$ are given as follows:
$$\zeta_1=\mu_F\Big(\beta\sqrt{\tfrac{L-\mu_F}{\mu_F}}-\mu_Ft^2\Big),\qquad\zeta_2=\frac{\beta}{2\mu_F}-\mu_Ft^2-\frac{1}{\mu_F},\qquad\zeta_3=\frac{t^2\big(\sqrt{\mu_F(L-\mu_F)}+1\big)}{2\mu_F}-t^2,$$
$$\zeta_4=\frac14\left(2t^2\big(\sqrt{\mu_F(L-\mu_F)}+1\big)-\frac{\sqrt{\mu_F(L-\mu_F)}\Big(2\mu_Ft^2(\mu_F-L)+\beta\sqrt{L-\mu_F}\Big)}{2\mu_F(L-\mu_F)}-t^2\sqrt{\mu_F(L-\mu_F)}\right).$$
One can show by some algebra that $\zeta_1,\zeta_4\ge0$. Hence, for any feasible solution of problem (16), we have
$$\frac{\|x_1-tG_x^{1,1}\|^2+\|y_1+tG_y^{1,1}\|^2}{\|x_1\|^2+\|y_1\|^2}\le\bar\alpha,$$
and the proof is complete.

We obtained linear convergence by using quadratic gradient growth in Theorem 3.3. The next theorem states that the quadratic gradient growth property is also a necessary condition for linear convergence.

Theorem 3.4

If Algorithm 1 is linearly convergent for any initial point, then $F$ has quadratic gradient growth for some $\mu_F>0$.

Proof.

Let $(x_1,y_1)\in\mathbb{R}^n\times\mathbb{R}^m$ and let $(x_2,y_2)$ be generated by Algorithm 1. Suppose that $(\bar x,\bar y)=\Pi_S((x_2,y_2))$. As Algorithm 1 is linearly convergent, there exists $\alpha\in[0,1)$ with
$$d_S^2\big((x_2,y_2)\big)\le\alpha\,d_S^2\big((x_1,y_1)\big)\le\alpha\big(\|x_1-\bar x\|^2+\|y_1-\bar y\|^2\big).\tag{18}$$
By setting $x_2=x_1-t\nabla_xF(x_1,y_1)$ and $y_2=y_1+t\nabla_yF(x_1,y_1)$ in inequality (18), and noting that $d_S^2((x_2,y_2))=\|x_2-\bar x\|^2+\|y_2-\bar y\|^2$, we get
$$\frac{1-\alpha}{2t}\big(\|x_1-\bar x\|^2+\|y_1-\bar y\|^2\big)\le\langle\nabla_xF(x_1,y_1),x_1-\bar x\rangle-\langle\nabla_yF(x_1,y_1),y_1-\bar y\rangle,$$
which implies that
$$\mu_F\,d_S^2\big((x_1,y_1)\big)\le\langle\nabla_xF(x_1,y_1),x_1-\bar x\rangle-\langle\nabla_yF(x_1,y_1),y_1-\bar y\rangle,$$
for $\mu_F=\frac{1-\alpha}{2t}$, and the proof is complete.

4. Concluding remarks

In this study, we provided a new convergence rate for the gradient descent–ascent method for saddle point problems. Furthermore, we gave some necessary and sufficient conditions for linear convergence without strong convexity. We employed the performance estimation method to prove the results. For future work, it would be interesting to consider the case where the variables $x$ and $y$ in the saddle point problem are constrained to lie in given compact convex sets, since many saddle point problems fall into this category. In this setting, one could also use the performance estimation framework to analyse other methods, e.g. proximal-type algorithms.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the Dutch Scientific Council (NWO) grant OCENW.GROOT.2019.015, Optimization for and with Machine Learning (OPTIMAL).

Notes on contributors

Moslem Zamani

Moslem Zamani obtained his PhD degree from the University of Avignon and the University of Tehran. He worked as a postdoctoral researcher at Tilburg University. He is currently a researcher at UCLouvain. His main research interests include non-linear optimization and machine learning algorithms.

Hadi Abbaszadehpeivasti

Hadi Abbaszadehpeivasti earned his bachelor's degree in applied mathematics from the University of Zanjan and subsequently obtained master's degrees in industrial engineering from both Sharif University of Technology and Sabancı University. He is currently a PhD student at Tilburg University, focusing on the complexity of first-order methods. His primary research interests include mathematical optimization and machine learning algorithms.

Etienne de Klerk

Etienne de Klerk obtained his PhD degree from the Delft University of Technology in The Netherlands in 1997. From January 1998 to September 2003, he held assistant professorships at the Delft University of Technology, and from September 2003 to September 2005 an associate professorship at the University of Waterloo, Canada, in the Department of Combinatorics & Optimization. In September 2004 he was appointed at Tilburg University, The Netherlands, first as an associate professor, and then as full professor (from June 2009). From August 29th, 2012, until August 31st, 2013, he was also appointed as full professor in the Division of Mathematics of the School of Physical and Mathematical Sciences at the Nanyang Technological University in Singapore. From September 1st, 2015 to August 31st 2019, he also held a part-time full professorship at the Delft University of Technology. Dr. De Klerk's main research interest is mathematical optimization, and, in particular, semidefinite programming.

Notes

1 The 100 random instances and starting points that we generated to produce Figure 1 may be found on GitHub; see: https://github.com/molsemzamani/Bilinear-Minimax.

References

  • H. Abbaszadehpeivasti, E. de Klerk, and M. Zamani, Conditions for linear convergence of the gradient method for non-convex optimization, Optim. Lett. 17 (2023), pp. 1105–1125.
  • K.J. Arrow, H. Azawa, L. Hurwicz, H. Uzawa, H.B. Chenery, S.M. Johnson, and S. Karlin, Studies in Linear and Non-linear Programming, Vol. 2, Stanford University Press, California, 1958.
  • W. Azizian, I. Mitliagkas, S. Lacoste-Julien and G. Gidel, A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games, in International Conference on Artificial Intelligence and Statistics, PMLR, 2020, pp. 2863–2873.
  • T. Başar and G.J. Olsder, Dynamic Noncooperative Game Theory, SIAM, Philadelphia, PA, 1998.
  • A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust Optimization, Vol. 28, Princeton University Press, Princeton, 2009.
  • A. Beznosikov, B. Polyak, E. Gorbunov, D. Kovalev, and A. Gasnikov, Smooth monotone stochastic variational inequalities and saddle point problems – survey, preprint (2022). Available at arXiv:2208.13592.
  • J. Bolte, T.P. Nguyen, J. Peypouquet, and B.W. Suter, From error bounds to the complexity of first-order descent methods for convex functions, Math. Program. 165 (2017), pp. 471–507.
  • S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
  • E. De Klerk, F. Glineur, and A.B. Taylor, On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions, Optim. Lett. 11 (2017), pp. 1185–1199.
  • Y. Drori and M. Teboulle, Performance of first-order methods for smooth convex minimization: A novel approach, Math. Program. 145 (2014), pp. 451–482.
  • S.S. Du and W. Hu, Linear convergence of the primal-dual gradient method for convex–concave saddle point problems without strong convexity, in The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, 2019, pp. 196–205.
  • F. Facchinei and J.S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
  • A. Fallah, A. Ozdaglar and S. Pattathil, An optimal multistage stochastic gradient method for minimax problems, in 2020 59th IEEE Conference on Decision and Control (CDC), IEEE, 2020, pp. 3573–3579.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, Adv. Neural. Inf. Process. Syst. 27 (2014), pp. 1–9.
  • E.Y. Hamedani and N.S. Aybat, A primal-dual algorithm with line search for general convex–concave saddle point problems, SIAM. J. Optim. 31 (2021), pp. 1299–1329.
  • R. Jiang and A. Mokhtari, Generalized optimistic methods for convex–concave saddle point problems, preprint (2022). Available at arXiv:2202.09674.
  • T. Liang and J. Stokes, Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks, in The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, 2019, pp. 907–915.
  • T. Lin, C. Jin and M.I. Jordan, Near-optimal algorithms for minimax optimization, in Conference on Learning Theory, PMLR, 2020, pp. 2738–2779.
  • Z.Q. Luo and P. Tseng, Error bounds and convergence analysis of feasible descent methods: A general approach, Ann. Oper. Res. 46 (1993), pp. 157–178.
  • L. Mescheder, S. Nowozin, and A. Geiger, The numerics of GANs, Adv. Neural. Inf. Process. Syst. 30 (2017), pp. 1–11.
  • A. Mokhtari, A. Ozdaglar and S. Pattathil, A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach, in International Conference on Artificial Intelligence and Statistics, PMLR, 2020, pp. 1497–1507.
  • A. Mokhtari, A.E. Ozdaglar, and S. Pattathil, Convergence rate of O(1/k) for optimistic gradient and extragradient methods in smooth convex–concave saddle point problems, SIAM. J. Optim. 30 (2020), pp. 3230–3251.
  • I. Necoara, Y. Nesterov, and F. Glineur, Linear convergence of first order methods for non-strongly convex optimization, Math. Program. 175 (2019), pp. 69–107.
  • Y. Nesterov, Lectures on Convex Optimization, Vol. 137, Springer, Berlin, 2018.
  • J. Nie, Z. Yang, and G. Zhou, The saddle point problem of polynomials, Found. Comput. Math. 22 (2022), pp. 1133–1169.
  • J.W. Simpson-Porco, B.K. Poolla, N. Monshizadeh, and F. Dörfler, Input–output performance of linear–quadratic saddle-point algorithms with application to distributed resource allocation problems, IEEE. Trans. Automat. Contr. 65 (2019), pp. 2032–2045.
  • A.B. Taylor, J.M. Hendrickx, and F. Glineur, Smooth strongly convex interpolation and exact worst-case performance of first-order methods, Math. Program. 161 (2017), pp. 307–345.
  • Y. Wang and J. Li, Improved algorithms for convex–concave minimax optimization, Adv. Neural. Inf. Process. Syst. 33 (2020), pp. 4800–4810.
  • Z. Xu, H. Zhang, Y. Xu, and G. Lan, A unified single-loop alternating gradient projection algorithm for nonconvex–concave and convex–nonconcave minimax problems, Math. Program. 201 (2023), pp. 635–706.
  • M. Yang, O. Nachum, B. Dai, L. Li, and D. Schuurmans, Off-policy evaluation via the regularized Lagrangian, Adv. Neural. Inf. Process. Syst. 33 (2020), pp. 6551–6561.
  • M. Zamani, H. Abbaszadehpeivasti, and E. de Klerk, The exact worst-case convergence rate of the alternating direction method of multipliers, preprint (2022). Available at arXiv:2206.09865.
  • G. Zhang, X. Bao, L. Lessard, and R. Grosse, A unified analysis of first-order methods for smooth games via integral quadratic constraints, J. Mach. Learn. Res. 22 (2021), pp. 1–39.
  • G. Zhang, Y. Wang, L. Lessard and R.B. Grosse, Near-optimal local convergence of alternating gradient descent–ascent for minimax optimization, in International Conference on Artificial Intelligence and Statistics, PMLR, 2022, pp. 7659–7679.
  • J. Zhang, M. Hong, and S. Zhang, On lower iteration complexity bounds for the convex concave saddle point problems, Math. Program. 194 (2022), pp. 901–935.

Appendices

Appendix 1.

Non-negativity of multipliers in Theorem 2.2

Recall that, in the proof of Theorem 2.2, $\gamma_1$ is defined by
$$\gamma_1=\frac{t\Big(t^2(2+L^2+L\mu)-t(3L+\mu)+(Lt-1)\sqrt{(Lt+\mu t-2)^2+4t^2}+2\Big)}{\sqrt{(Lt+\mu t-2)^2+4t^2}}.$$
Since $t$ is non-negative, we only need to prove that
$$\hat\gamma_1:=t^2(2+L^2+L\mu)-t(3L+\mu)+(Lt-1)\sqrt{(Lt+\mu t-2)^2+4t^2}+2$$
is non-negative. We show that the following optimization problem is bounded below by zero:
$$\min_{L,t,\mu}\ \hat\gamma_1\qquad\text{s.t.}\ L\ge\mu,\ \mu\ge0,\ t\ge0,$$
where $L,t,\mu$ are the decision variables. First we consider the case $Lt-1\le0$. We then have the optimization problem
$$\min_{L,t}\Big(\min_{0\le\mu\le L}\ t^2(2+L^2+L\mu)-t(3L+\mu)+(Lt-1)\sqrt{(Lt+\mu t-2)^2+4t^2}+2\Big)\quad\text{s.t.}\ Lt\le1,\ L\ge\mu,\ t\ge0.\tag{A1}$$
The function $\hat\gamma_1$ is concave in $\mu$; therefore, we need only consider $\mu=0$ and $\mu=L$.

First we consider the case $\mu=0$. Substituting $\mu=0$ into $\hat\gamma_1$, we get
$$\hat\gamma_1=2t^2+(Lt-1)\Big(Lt-2+\sqrt{(Lt-2)^2+4t^2}\Big).$$
We argue that this expression is non-negative on the feasible set of problem (A1). By a conjugate multiplication of $Lt-2+\sqrt{(Lt-2)^2+4t^2}$, one has
$$\hat\gamma_1=2t^2\left(1-\frac{2(1-Lt)}{(2-Lt)+\sqrt{(Lt-2)^2+4t^2}}\right).$$
Since $(2-Lt)+\sqrt{(Lt-2)^2+4t^2}\ge2(2-Lt)\ge2(1-Lt)$, we conclude that
$$0\le\frac{2(1-Lt)}{(2-Lt)+\sqrt{(Lt-2)^2+4t^2}}\le1,$$
which proves that $\hat\gamma_1$ is non-negative.

Now we consider the case $\mu=L$. Substituting $\mu=L$, we get
$$\hat\gamma_1=2t^2+2L^2t^2-4Lt+2(Lt-1)\sqrt{(Lt-1)^2+t^2}+2.$$
We show that $\tfrac12\hat\gamma_1=t^2+(Lt-1)\Big((Lt-1)+\sqrt{(Lt-1)^2+t^2}\Big)$ is non-negative on the given set. Note that, again by conjugate multiplication,
$$\tfrac12\hat\gamma_1=t^2\left(1-\frac{1-Lt}{(1-Lt)+\sqrt{(Lt-1)^2+t^2}}\right),$$
which is always non-negative due to the non-negativity of $1-Lt$.

Now we consider the case $Lt-1>0$. We have
$$\hat\gamma_1=t^2(2+L^2+L\mu)-t(3L+\mu)+(Lt-1)\sqrt{(Lt+\mu t-2)^2+4t^2}+2\ge t^2(2+L^2+L\mu)-t(3L+\mu)+(Lt-1)|Lt+\mu t-2|+2.$$
Here we need to consider two sub-cases. Firstly, when $2-Lt-\mu t\ge0$, we have
$$t^2(2+L^2+L\mu)-t(3L+\mu)+(Lt-1)(2-Lt-\mu t)+2=2t^2\ge0.$$
If $Lt+\mu t-2\ge0$, we have
$$t^2(2+L^2+L\mu)-t(3L+\mu)+(Lt-1)(Lt+\mu t-2)+2=2\big((Lt-1)(Lt+\mu t-2)+t^2\big)\ge0,$$
which completes the proof.

To show that $\gamma_2$ is non-negative we follow the same procedure. Recall the definition of $\gamma_2$:
$$\gamma_2=\frac{t\Big(t^2(2+\mu^2+L\mu)-t(3\mu+L)+(1-\mu t)\sqrt{(Lt+\mu t-2)^2+4t^2}+2\Big)}{\sqrt{(Lt+\mu t-2)^2+4t^2}}.$$
We define
$$\hat\gamma_2:=t^2(2+\mu^2+L\mu)-t(3\mu+L)+(1-\mu t)\sqrt{(Lt+\mu t-2)^2+4t^2}+2.$$
Since $t\ge0$, we only need to show that $\hat\gamma_2$ is non-negative. To this end, we show that the following optimization problem is bounded below by zero:
$$\min_{L,t,\mu}\ \hat\gamma_2\qquad\text{s.t.}\ L\ge\mu,\ \mu\ge0,\ t\ge0.$$
First we consider the case $1-\mu t\ge0$, so that
$$\hat\gamma_2\ge t^2(2+\mu^2+L\mu)-t(3\mu+L)+(1-\mu t)|Lt+\mu t-2|+2.$$
We consider two sub-cases. Firstly, if $Lt+\mu t-2\ge0$,
$$\hat\gamma_2\ge t^2(2+\mu^2+L\mu)-t(3\mu+L)+(1-\mu t)(Lt+\mu t-2)+2=2t^2\ge0.$$
Now assume that $Lt+\mu t-2\le0$. Then
$$\hat\gamma_2\ge t^2(2+\mu^2+L\mu)-t(3\mu+L)+(1-\mu t)(2-Lt-\mu t)+2=2\big((1-\mu t)(2-Lt-\mu t)+t^2\big)\ge0.$$
Now we consider the case $1-\mu t\le0$. Here we use the restriction $t\le\frac{2\mu}{\mu L+1}$ on the step length, which gives $L\le\frac{2\mu-t}{\mu t}$, and consider
$$\min_{t,\mu}\Big(\min_{\mu\le L\le\frac{2\mu-t}{\mu t}}\ \hat\gamma_2\Big)\qquad\text{s.t.}\ \mu t\ge1,\ \mu\ge0,\ t\ge0.$$
Note that $\hat\gamma_2$ is concave with respect to the variable $L$; therefore, we only need to study boundary values of $L$. If we set $L=\mu$, we have
$$\hat\gamma_2=2\Big(t^2+\mu^2t^2-2\mu t+(1-\mu t)\sqrt{(\mu t-1)^2+t^2}+1\Big)=2t^2+2(\mu t-1)\Big((\mu t-1)-\sqrt{(\mu t-1)^2+t^2}\Big).$$
By conjugate multiplication, we have
$$\hat\gamma_2=2t^2\left(1-\frac{\mu t-1}{(\mu t-1)+\sqrt{(\mu t-1)^2+t^2}}\right)\ge0.$$
From $L\le\frac{2\mu-t}{\mu t}$, one can see that $L\le\frac2t$. Setting $L=\frac2t$:
$$\hat\gamma_2=-\mu t+2t^2+\mu^2t^2+(1-\mu t)\sqrt{4t^2+\mu^2t^2}=2t^2\left(1-\frac{2(\mu t-1)}{\mu t+\sqrt{\mu^2t^2+4t^2}}\right)\ge0,$$
since $\mu t+\sqrt{\mu^2t^2+4t^2}\ge2\mu t-2$. By concavity of $\hat\gamma_2$ in $L$, non-negativity at $L=\mu$ and $L=\frac2t$ implies non-negativity on the whole interval $\big[\mu,\frac{2\mu-t}{\mu t}\big]$. This completes the proof.

Appendix 2.

Identity used in the proof of Theorem 2.2

The proof of Theorem 2.2 requires the following identity, that may be verified through direct (symbolic) calculation:

$$\begin{aligned}
&\|x_1-tG_x^{1,1}\|^2+\|y_1+tG_y^{1,1}\|^2-\bar\alpha\big(\|x_1\|^2+\|y_1\|^2\big)\\
&\;+\gamma_1\Big(F^{1,1}-F^{\star,1}-\langle G_x^{\star,1},x_1\rangle-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|G_x^{1,1}-G_x^{\star,1}\|^2+\mu\|x_1\|^2-\tfrac{2\mu}{L}\langle G_x^{\star,1}-G_x^{1,1},x_1\rangle\big)\Big)\\
&\;+\gamma_2\Big(F^{\star,1}-F^{1,1}+\langle G_x^{1,1},x_1\rangle-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|G_x^{\star,1}-G_x^{1,1}\|^2+\mu\|x_1\|^2-\tfrac{2\mu}{L}\langle G_x^{1,1}-G_x^{\star,1},x_1\rangle\big)\Big)\\
&\;+\gamma_2\Big(F^{1,\star}-F^{\star,\star}-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|G_x^{1,\star}\|^2+\mu\|x_1\|^2-\tfrac{2\mu}{L}\langle G_x^{1,\star},x_1\rangle\big)\Big)\\
&\;+\gamma_1\Big(F^{\star,\star}-F^{1,\star}+\langle G_x^{1,\star},x_1\rangle-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|G_x^{1,\star}\|^2+\mu\|x_1\|^2-\tfrac{2\mu}{L}\langle G_x^{1,\star},x_1\rangle\big)\Big)\\
&\;+\gamma_1\Big(F^{1,\star}-F^{1,1}+\langle G_y^{1,\star},y_1\rangle-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|G_y^{1,1}-G_y^{1,\star}\|^2+\mu\|y_1\|^2-\tfrac{2\mu}{L}\langle G_y^{1,\star}-G_y^{1,1},y_1\rangle\big)\Big)\\
&\;+\gamma_2\Big(F^{1,1}-F^{1,\star}-\langle G_y^{1,1},y_1\rangle-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|G_y^{1,\star}-G_y^{1,1}\|^2+\mu\|y_1\|^2-\tfrac{2\mu}{L}\langle G_y^{1,1}-G_y^{1,\star},y_1\rangle\big)\Big)\\
&\;+\gamma_2\Big(-F^{\star,1}+F^{\star,\star}-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|G_y^{\star,1}\|^2+\mu\|y_1\|^2-\tfrac{2\mu}{L}\langle G_y^{\star,1},y_1\rangle\big)\Big)\\
&\;+\gamma_1\Big(-F^{\star,\star}+F^{\star,1}+\langle G_y^{\star,1},y_1\rangle-\tfrac{L}{2(L-\mu)}\big(\tfrac1L\|G_y^{\star,1}\|^2+\mu\|y_1\|^2-\tfrac{2\mu}{L}\langle G_y^{\star,1},y_1\rangle\big)\Big)\\
&\;+\gamma_3\big(\|x_1\|^2-\|G_y^{1,1}-G_y^{\star,1}\|^2\big)+\gamma_3\big(\|x_1\|^2-\|G_y^{1,\star}\|^2\big)+\gamma_3\big(\|y_1\|^2-\|G_x^{1,1}-G_x^{1,\star}\|^2\big)+\gamma_3\big(\|y_1\|^2-\|G_x^{\star,1}\|^2\big)\\
&=-\zeta_1\big\|x_1-\zeta_2G_x^{1,1}-\zeta_3\big(G_x^{1,\star}-G_x^{\star,1}\big)\big\|^2-\zeta_4\big\|G_x^{1,1}-G_x^{1,\star}-G_x^{\star,1}\big\|^2\\
&\;\;-\zeta_1\big\|y_1+\zeta_2G_y^{1,1}-\zeta_3\big(G_y^{1,\star}-G_y^{\star,1}\big)\big\|^2-\zeta_4\big\|G_y^{1,1}-G_y^{\star,1}-G_y^{1,\star}\big\|^2,
\end{aligned}$$
where $\zeta_1,\zeta_2,\zeta_3,\zeta_4$ are given by
$$\zeta_1=\frac12t\left(\frac{(L^2+\mu^2)\beta-L\mu^2t}{2(L-\mu)\beta}+(L+\mu)\big(t(L+\mu)-2\big)\right),$$
$$\zeta_2=\frac{(L^2t-L\mu^2t+\mu)\beta-L^2t(Lt+\mu t-3)-(L+\mu)(\mu^2t^2-2\mu t+2t^2+2)+\mu^2t^2}{t^2(L+\mu)^2(L\mu+1)-8L\mu t(L+\mu)+8},$$
$$\zeta_3=\frac{t(L^2+6L\mu+\mu^2)-2t^2(L+\mu)(L\mu+1)-(L-\mu)\beta-2(L+\mu)}{t^2(L+\mu)^2(L\mu+1)-8L\mu t(L+\mu)+8},\qquad\zeta_4=\frac{t(\beta+Lt-\mu t)^2}{4(L-\mu)\beta}.$$
Note that $\zeta_1,\zeta_4\ge0$, as required.