
An incremental mirror descent subgradient algorithm with random sweeping and proximal step

Pages 33-50 | Received 01 Nov 2017, Accepted 25 May 2018, Published online: 14 Jun 2018

ABSTRACT

We investigate the convergence properties of incremental mirror descent type subgradient algorithms for minimizing the sum of convex functions. In each step, we only evaluate the subgradient of a single component function and mirror it back to the feasible domain, which makes iterations very cheap to compute. The analysis is made for a randomized selection of the component functions, which yields the deterministic algorithm as a special case. Under supplementary differentiability assumptions on the function which induces the mirror map, we are also able to deal with the presence of another term in the objective function, which is evaluated via a proximal type step. In both cases, we derive convergence rates of $\mathcal{O}(1/\sqrt{k})$ in expectation for the kth best objective function value and illustrate our theoretical findings by numerical experiments in positron emission tomography and machine learning.


1. Introduction

We consider the problem of minimizing a sum of nonsmooth convex functions,
(1) $\min_{x\in C}\ \sum_{i=1}^{m}f_i(x),$
where $C\subseteq\mathbb{R}^n$ is a nonempty, convex and closed set and, for every $i=1,\ldots,m$, the so-called component functions $f_i:\mathbb{R}^n\to\overline{\mathbb{R}}:=\mathbb{R}\cup\{\pm\infty\}$ are assumed to be proper and convex and will be evaluated via their respective subgradients. Implicitly we will assume that m is large and that it is therefore very costly to evaluate all component functions in each iteration. Consequently, we will examine algorithms which only use the subgradient of a single component function in each iteration. These so-called incremental algorithms, see [1,2], have been applied to large-scale problems arising in tomography [3], generalized assignment problems [2] or machine learning [4]. We refer also to [5] for a slightly different approach, where, in the spirit of incremental algorithms, only the gradient of one of the component functions is evaluated in each step, while gradients at old iterates are used for the other components. Both subgradient algorithms and incremental methods usually require decreasing stepsizes in order to converge, which makes them slow near an optimal solution. However, they provide after a very small number of iterations an optimal value of low accuracy, and they possess a rate of convergence which is almost independent of the dimension of the problem. We refer the reader to [6] for a subgradient algorithm designed for the minimization of a nonsmooth nonconvex function which makes use of proximal subgradients.

When solving optimization problems of type (1), one might want to capture in the formulation of the iterative scheme the geometry of the feasible set C. This can be done by a so-called mirror map, which mirrors each iterate back onto the feasible set. The Bregman distance associated with the function that induces the mirror map plays an essential role in the convergence analysis and in the formulation of convergence rate results (see [7,8]). So-called mirror descent algorithms were first discussed in [9] and more recently in [8,10,11] in a very general framework, in [4,12] from a statistical learning point of view and in [13] for the case of dynamical systems. The mirror map can be viewed as a generalization of the ordinary orthogonal projection onto C in Hilbert spaces (see Example 2.4), but it also allows a more careful consideration of the structure of the problem, as is the case when the objective function is subdifferentiable only on the relative interior of the feasible set. In such a setting, one can design a mirror map which maps not onto the entire feasible set but only onto a subset of it where the objective function is subdifferentiable (see Example 2.5).

There already exists a rich literature on incremental algorithms dealing with similar problems. In [1,2] incremental subgradient methods with a random selection of the component functions and even projections onto a feasible set are considered, but no mirror descent. Incremental subgradient algorithms utilizing mirror descent techniques are investigated in [3]; however, an additional projection onto the feasible set is required there, which excludes the case where $\operatorname{dom} f\not\supseteq C$ (this is taken care of in our case by the weak assumption that $\operatorname{im}(\nabla H^{*})\subseteq\operatorname{dom}\partial f$). Furthermore, the results appearing in Section 4 discussing Bregman proximal steps appear to be completely novel for this kind of problems and are only known from a forward–backward setting [7].

The basic concepts in the formulation of mirror descent algorithms are recalled in Section 2. We also provide some illustrating examples, which present special cases, as the general setting might not be immediately intuitive. In Section 3 we formulate an incremental mirror descent subgradient algorithm with random sweeping of the component functions, which we show to have a convergence rate of $\mathcal{O}(1/\sqrt{k})$ in expectation for the kth best objective function value. In Section 4 we additionally ask for differentiability of the function which induces the mirror map and are then able to add another nonsmooth convex function to the objective, which is evaluated in the iterative scheme by a proximal type step. For the resulting algorithm, we show a similar convergence result. In the last section, we illustrate the theoretical findings by numerical experiments in positron emission tomography (PET) and machine learning.

2. Elements of convex analysis and the mirror descent algorithm

Throughout the paper, we assume that $\mathbb{R}^n$ is endowed with the Euclidean inner product $\langle\cdot,\cdot\rangle$ and the corresponding norm $\|\cdot\|=\sqrt{\langle\cdot,\cdot\rangle}$. For a nonempty convex set $C\subseteq\mathbb{R}^n$ we denote by $\operatorname{ri}C$ its relative interior, which is the interior of C relative to its affine hull. For a convex function $f:\mathbb{R}^n\to\overline{\mathbb{R}}$ we denote by $\operatorname{dom}f:=\{x\in\mathbb{R}^n: f(x)<+\infty\}$ its effective domain and say that f is proper if $f>-\infty$ and $\operatorname{dom}f\neq\emptyset$. The subdifferential of f at $x\in\mathbb{R}^n$ is defined as $\partial f(x):=\{p\in\mathbb{R}^n: f(y)\geq f(x)+\langle p,y-x\rangle\ \forall y\in\mathbb{R}^n\}$ when $f(x)\in\mathbb{R}$, and as $\partial f(x):=\emptyset$ otherwise. We will write $f'(x)$ for an arbitrary subgradient, i.e. an element of the subdifferential $\partial f(x)$.

Problem 2.1

Consider the optimization problem
(2) $\min_{x\in C}\ f(x),$
where $C\subseteq\mathbb{R}^n$ is a nonempty, convex and closed set, $f:\mathbb{R}^n\to\overline{\mathbb{R}}$ is a proper and convex function, and $H:\mathbb{R}^n\to\overline{\mathbb{R}}$ is a proper, lower semicontinuous and $\sigma$-strongly convex function such that $C=\overline{\operatorname{dom}H}$ and $\operatorname{im}(\nabla H^{*})\subseteq\operatorname{int}(\operatorname{dom}f)$.

We say that $H:\mathbb{R}^n\to\overline{\mathbb{R}}$ is $\sigma$-strongly convex for $\sigma>0$ if for every $x,x'\in\mathbb{R}^n$ and every $\lambda\in[0,1]$ it holds
$\frac{\sigma}{2}\lambda(1-\lambda)\|x-x'\|^2+H(\lambda x+(1-\lambda)x')\leq\lambda H(x)+(1-\lambda)H(x').$
It is well known that, when H is proper, lower semicontinuous and $\sigma$-strongly convex, its conjugate function $H^{*}:\mathbb{R}^n\to\overline{\mathbb{R}}$, $H^{*}(y)=\sup_{x\in\mathbb{R}^n}\{\langle y,x\rangle-H(x)\}$, is Fréchet differentiable (thus it has full domain) and its gradient $\nabla H^{*}$ is $1/\sigma$-Lipschitz continuous or, equivalently, $H^{*}$ is Fréchet differentiable and its gradient $\nabla H^{*}$ is $\sigma$-cocoercive, which means that for every $y,y'\in\mathbb{R}^n$ it holds
$\sigma\|\nabla H^{*}(y)-\nabla H^{*}(y')\|^2\leq\langle y-y',\nabla H^{*}(y)-\nabla H^{*}(y')\rangle.$
Recall that $\operatorname{im}(\nabla H^{*}):=\{\nabla H^{*}(y): y\in\mathbb{R}^n\}$.

The following mirror descent algorithm has been introduced in [10] under the name dual averaging.

Algorithm 2.2

Consider for some initial values $x_0\in\operatorname{int}(\operatorname{dom}f)$, $y_0\in\mathbb{R}^n$ and a sequence of positive stepsizes $(t_k)_{k\geq0}$ the following iterative scheme:
$(\forall k\geq0)\qquad y_{k+1}=y_k-t_kf'(x_k),\qquad x_{k+1}=\nabla H^{*}(y_{k+1}).$
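A minimal numerical sketch of this dual averaging scheme may help to fix ideas; the oracle names subgrad_f and grad_H_star below are placeholders for problem-specific implementations, not part of the paper:

```python
import numpy as np

def dual_averaging(subgrad_f, grad_H_star, x0, y0, stepsizes):
    """Sketch of Algorithm 2.2: y_{k+1} = y_k - t_k f'(x_k),
    x_{k+1} = grad H*(y_{k+1})."""
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for t in stepsizes:
        y = y - t * subgrad_f(x)   # gradient step in the dual variable
        x = grad_H_star(y)         # mirror back via the gradient of the conjugate
    return x
```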

We notice that, since the sequence $(x_k)_{k\geq0}$ is contained in the interior of the effective domain of f, the algorithm is well defined. The assumptions concerning the function H, which induces the mirror map $\nabla H^{*}$, are not consistent in the literature. Sometimes H is assumed to be a Legendre function as in [7], or strongly convex and differentiable as in [8,11]. In the following section, we will only assume that H is proper, lower semicontinuous and strongly convex.

Example 2.3

For $H=\frac12\|\cdot\|^2$ we have that $H^{*}=\frac12\|\cdot\|^2$ and thus $\nabla H^{*}$ is the identity operator on $\mathbb{R}^n$. Consequently, Algorithm 2.2 reduces to the classical subgradient method:
$(\forall k\geq0)\qquad x_{k+1}=x_k-t_kf'(x_k).$

Example 2.4

For $C\subseteq\mathbb{R}^n$ a nonempty, convex and closed set, take $H(x)=\frac12\|x\|^2$ for $x\in C$ and $H(x)=+\infty$ otherwise. Then $\nabla H^{*}=P_C$, where $P_C$ denotes the orthogonal projection onto C. In this setting, Algorithm 2.2 becomes:
$(\forall k\geq0)\qquad y_{k+1}=y_k-t_kf'(x_k),\qquad x_{k+1}=P_C(y_{k+1}).$
This iterative scheme is similar to, but different from, the well-known subgradient projection algorithm, which reads:
$(\forall k\geq0)\qquad y_{k+1}=x_k-t_kf'(x_k),\qquad x_{k+1}=P_C(y_{k+1}).$

Example 2.5

When considering numerical experiments in PET, one often minimizes over the unit simplex $\Delta:=\{x=(x_1,\ldots,x_n)^T\in\mathbb{R}^n:\sum_{j=1}^{n}x_j=1,\ x\geq0\}$. An appropriate choice for the function H is $H(x)=\sum_{j=1}^{n}x_j\log(x_j)$ for $x\in\Delta$, with the convention $0\log(0)=0$, and $H(x)=+\infty$ if $x\notin\Delta$. In this case $\nabla H^{*}$ is given for every $y\in\mathbb{R}^n$ by
$\nabla H^{*}(y)=\frac{1}{\sum_{i=1}^{n}\exp(y_i)}\big(\exp(y_1),\exp(y_2),\ldots,\exp(y_n)\big)^T$
and maps into the relative interior of $\Delta$.
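In floating point arithmetic this map is conveniently evaluated with a max-shift; a small sketch:

```python
import numpy as np

def grad_H_star_entropy(y):
    """Mirror map induced by the negative entropy on the unit simplex:
    grad H*(y) = exp(y) / sum(exp(y)).  Subtracting max(y) leaves the
    result unchanged but avoids overflow."""
    z = np.exp(y - np.max(y))
    return z / z.sum()
```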

The following result will play an important role in the convergence analysis that we will carry out in the next sections.

Lemma 2.6

Let $H:\mathbb{R}^n\to\overline{\mathbb{R}}$ be a proper, lower semicontinuous and $\sigma$-strongly convex function for $\sigma>0$, $x\in\mathbb{R}^n$ and $y\in\partial H(x)$. Then it holds
$H(x)+\langle y,\tilde{x}-x\rangle+\frac{\sigma}{2}\|\tilde{x}-x\|^2\leq H(\tilde{x})\quad\forall\tilde{x}\in\mathbb{R}^n.$

Proof.

The function $\tilde{H}(\cdot):=H(\cdot)-\frac{\sigma}{2}\|\cdot\|^2$ is convex and $y-\sigma x\in\partial\tilde{H}(x)$. Thus
$\tilde{H}(x)+\langle y-\sigma x,\tilde{x}-x\rangle\leq\tilde{H}(\tilde{x})\quad\forall\tilde{x}\in\mathbb{R}^n$
or, equivalently,
$H(x)-\frac{\sigma}{2}\|x\|^2+\langle y-\sigma x,\tilde{x}-x\rangle\leq H(\tilde{x})-\frac{\sigma}{2}\|\tilde{x}\|^2\quad\forall\tilde{x}\in\mathbb{R}^n.$
Rearranging the terms leads to the desired conclusion.

3. A stochastic incremental mirror descent algorithm

In this section, we will address the following optimization problem.

Problem 3.1

Consider the optimization problem
(3) $\min_{x\in C}\ \sum_{i=1}^{m}f_i(x),$
where $C\subseteq\mathbb{R}^n$ is a nonempty, convex and closed set, for every $i=1,\ldots,m$ the functions $f_i:\mathbb{R}^n\to\overline{\mathbb{R}}$ are proper and convex, and $H:\mathbb{R}^n\to\overline{\mathbb{R}}$ is a proper, lower semicontinuous and $\sigma$-strongly convex function such that $C=\overline{\operatorname{dom}H}$ and $\operatorname{im}(\nabla H^{*})\subseteq\operatorname{int}\big(\bigcap_{i=1}^{m}\operatorname{dom}f_i\big)$.

In this section, we apply the dual averaging approach described in Algorithm 2.2 to the optimization problem (3) by using the subgradient of only one component function at a time. This incremental approach (see also [1,2]) is similar to, but slightly different from, the extension suggested in [8]. Furthermore, we introduce a stochastic sweeping of the component functions. For a similar strategy, but for the random selection of coordinates, we refer to [14].

Algorithm 3.2

Consider for some initial values $x_0\in\operatorname{int}\big(\bigcap_{i=1}^{m}\operatorname{dom}f_i\big)$, $y_{m,-1}\in\mathbb{R}^n$ and a sequence of positive stepsizes $(t_k)_{k\geq0}$ the following iterative scheme:

$(\forall k\geq0)$:
$\psi_{0,k}=x_k$, $y_{0,k}=y_{m,k-1}$;
for $i=1,\ldots,m$: $y_{i,k}=y_{i-1,k}-\epsilon_{i,k}\frac{t_k}{p_i}f_i'(\psi_{i-1,k})$, $\psi_{i,k}=\nabla H^{*}(y_{i,k})$;
$x_{k+1}=\psi_{m,k}$,

where $\epsilon_{i,k}$ is a $\{0,1\}$-valued random variable for every $i=1,\ldots,m$ and $k\geq0$, such that $\epsilon_{i,k}$ is independent of $\psi_{i-1,k}$ and $P(\epsilon_{i,k}=1)=p_i$ for every $i=1,\ldots,m$ and $k\geq0$.
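The scheme translates almost line by line into code; a sketch under the assumption of generic oracles subgrads[i] for the subgradients $f_i'$ and grad_H_star for $\nabla H^{*}$ (all names are placeholders):

```python
import numpy as np

def stochastic_incremental_md(subgrads, grad_H_star, x0, y0, stepsizes, probs, seed=None):
    """Sketch of Algorithm 3.2.  subgrads[i](x) returns a subgradient of f_i at x,
    probs[i] is the probability p_i of sweeping component i in a given cycle."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for t in stepsizes:
        psi = x
        for g, p in zip(subgrads, probs):
            if rng.random() < p:            # epsilon_{i,k} = 1 with probability p_i
                y = y - (t / p) * g(psi)
                psi = grad_H_star(y)        # mirror map only needed after an update
        x = psi
    return x
```

Note that when $\epsilon_{i,k}=0$ the dual variable is unchanged, so $\psi_{i,k}=\psi_{i-1,k}$ and the evaluation of $\nabla H^{*}$ can be skipped; this is what makes the stochastic version cheaper per sweep.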

One can notice that in the above iterative scheme $y_{i,k}\in\partial H(\psi_{i,k})$ for every $i=1,\ldots,m$ and $k\geq0$.

In the convergence analysis of Algorithm 3.2 we will make use of the following Bregman-distance-like function associated to the proper and convex function $H:\mathbb{R}^n\to\overline{\mathbb{R}}$, defined as
(4) $d_H:\mathbb{R}^n\times\operatorname{dom}H\times\mathbb{R}^n\to\overline{\mathbb{R}},\quad d_H(x,y,z):=H(x)-H(y)-\langle z,x-y\rangle.$
We notice that $d_H(x,y,z)\geq0$ for every $(x,y)\in\mathbb{R}^n\times\operatorname{dom}H$ and every $z\in\partial H(y)$, due to the subgradient inequality.

The function $d_H$ is an extension of the Bregman distance (see [4,7,11]), which is associated to a proper and convex function $H:\mathbb{R}^n\to\overline{\mathbb{R}}$ with $\operatorname{dom}\nabla H:=\{x\in\mathbb{R}^n: H\text{ is differentiable at }x\}$ and defined as
(5) $D_H:\mathbb{R}^n\times\operatorname{dom}\nabla H\to\overline{\mathbb{R}},\quad D_H(x,y)=H(x)-H(y)-\langle\nabla H(y),x-y\rangle.$
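For the negative entropy of Example 2.5, restricted to the unit simplex, $D_H$ reduces to the Kullback–Leibler divergence; a small sketch:

```python
import numpy as np

def bregman_entropy(x, y):
    """D_H(x, y) for H(x) = sum_j x_j log x_j on the unit simplex, which
    simplifies to the Kullback-Leibler divergence sum_j x_j log(x_j / y_j)
    (with the convention 0 log 0 = 0)."""
    mask = x > 0
    return float(np.sum(x[mask] * np.log(x[mask] / y[mask])))
```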

Theorem 3.3

In the setting of Problem 3.1, assume that the functions $f_i$ are $L_{f_i}$-Lipschitz continuous on $\operatorname{im}(\nabla H^{*})$ for $i=1,\ldots,m$. Let $(x_k)_{k\geq0}$ be a sequence generated by Algorithm 3.2. Then for every $N\geq1$ and every $y\in\mathbb{R}^n$ it holds
$E\Big(\min_{0\leq k\leq N-1}\sum_{i=1}^{m}f_i(x_k)-\sum_{i=1}^{m}f_i(y)\Big)\leq\frac{d_H(y,x_0,y_{0,0})+\frac{1}{\sigma}\big(\sum_{i=1}^{m}L_{f_i}\big)^2\Big(\big(\sum_{i=1}^{m}\frac{1}{p_i^2}\big)^{1/2}+1\Big)\sum_{k=0}^{N-1}t_k^2}{\sum_{k=0}^{N-1}t_k}.$

Proof.

Let $y\in\bigcap_{i=1}^{m}\operatorname{dom}f_i\cap\operatorname{dom}H$ be fixed. For y outside this set the conclusion follows automatically.

For every $i=1,\ldots,m$ and every $k\geq0$ it holds
$d_H(y,\psi_{i,k},y_{i,k})=H(y)-H(\psi_{i,k})-\langle y_{i,k},y-\psi_{i,k}\rangle=H(y)-H(\psi_{i,k})-\Big\langle y_{i-1,k}-\frac{t_k}{p_i}\epsilon_{i,k}f_i'(\psi_{i-1,k}),\,y-\psi_{i,k}\Big\rangle.$
Rearranging the terms, this yields for every $i=1,\ldots,m$ and every $k\geq0$
$d_H(y,\psi_{i,k},y_{i,k})=d_H(y,\psi_{i-1,k},y_{i-1,k})-d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})+\frac{t_k}{p_i}\epsilon_{i,k}\langle f_i'(\psi_{i-1,k}),y-\psi_{i,k}\rangle$
$=d_H(y,\psi_{i-1,k},y_{i-1,k})-d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})+\frac{t_k}{p_i}\epsilon_{i,k}\langle f_i'(\psi_{i-1,k}),y-\psi_{i-1,k}\rangle-\frac{t_k}{p_i}\epsilon_{i,k}\langle f_i'(\psi_{i-1,k}),\psi_{i,k}-\psi_{i-1,k}\rangle$
$\leq d_H(y,\psi_{i-1,k},y_{i-1,k})-d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})+\frac{t_k}{p_i}\epsilon_{i,k}\big(f_i(y)-f_i(\psi_{i-1,k})\big)+\frac{t_k}{p_i}\epsilon_{i,k}\|f_i'(\psi_{i-1,k})\|\,\|\psi_{i-1,k}-\psi_{i,k}\|.$
From here, by Young's inequality, we get for every $i=1,\ldots,m$ and every $k\geq0$
$d_H(y,\psi_{i,k},y_{i,k})\leq d_H(y,\psi_{i-1,k},y_{i-1,k})-d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})+\frac{t_k}{p_i}\epsilon_{i,k}\big(f_i(y)-f_i(\psi_{i-1,k})\big)+\frac{1}{\sigma}t_k^2\frac{1}{p_i^2}\epsilon_{i,k}^2\|f_i'(\psi_{i-1,k})\|^2+\frac{\sigma}{4}\|\psi_{i-1,k}-\psi_{i,k}\|^2,$
which, by using that H is $\sigma$-strongly convex and Lemma 2.6 (note also that $\epsilon_{i,k}^2=\epsilon_{i,k}$), yields
$d_H(y,\psi_{i,k},y_{i,k})\leq d_H(y,\psi_{i-1,k},y_{i-1,k})-d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})+\frac{t_k}{p_i}\epsilon_{i,k}\big(f_i(y)-f_i(\psi_{i-1,k})\big)+\frac{1}{\sigma}t_k^2\frac{1}{p_i^2}\epsilon_{i,k}\|f_i'(\psi_{i-1,k})\|^2+\frac12 d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})$
$=d_H(y,\psi_{i-1,k},y_{i-1,k})+\frac{t_k}{p_i}\epsilon_{i,k}\big(f_i(y)-f_i(\psi_{i-1,k})\big)+\frac{1}{\sigma}t_k^2\frac{1}{p_i^2}\epsilon_{i,k}\|f_i'(\psi_{i-1,k})\|^2-\frac12 d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k}).$
Using the fact that $f_i$ is $L_{f_i}$-Lipschitz continuous, it follows that $\|f_i'(\psi_{i-1,k})\|\leq L_{f_i}$ for every $i=1,\ldots,m$ and every $k\geq0$, thus
$d_H(y,\psi_{i,k},y_{i,k})\leq d_H(y,\psi_{i-1,k},y_{i-1,k})+\frac{t_k}{p_i}\epsilon_{i,k}\big(f_i(y)-f_i(\psi_{i-1,k})\big)+\frac{1}{\sigma}t_k^2\frac{1}{p_i^2}\epsilon_{i,k}L_{f_i}^2-\frac12 d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k}).$
Since all the involved functions are measurable, we can take the expected value on both sides of the above inequality and get, due to the assumed independence of $\epsilon_{i,k}$ and $\psi_{i-1,k}$, for every $i=1,\ldots,m$ and every $k\geq0$
$E\big(d_H(y,\psi_{i,k},y_{i,k})\big)\leq E\big(d_H(y,\psi_{i-1,k},y_{i-1,k})\big)+E\Big(\frac{t_k}{p_i}\big(f_i(y)-f_i(\psi_{i-1,k})\big)\Big)E(\epsilon_{i,k})+\frac{1}{\sigma}t_k^2\frac{1}{p_i^2}L_{f_i}^2E(\epsilon_{i,k})-E\Big(\frac12 d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})\Big).$
Since $E(\epsilon_{i,k})=p_i$, we get for every $i=1,\ldots,m$ and every $k\geq0$
$E\big(d_H(y,\psi_{i,k},y_{i,k})\big)\leq E\big(d_H(y,\psi_{i-1,k},y_{i-1,k})\big)+E\big(t_k(f_i(y)-f_i(\psi_{i-1,k}))\big)+\frac{1}{\sigma}t_k^2\frac{1}{p_i}L_{f_i}^2-E\Big(\frac12 d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})\Big).$
Summing the above inequality for $i=1,\ldots,m$ and using that
$\sum_{i=1}^{m}L_{f_i}^2\frac{1}{p_i}\leq\Big(\sum_{i=1}^{m}L_{f_i}^4\Big)^{1/2}\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}\leq\sum_{i=1}^{m}L_{f_i}^2\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2},$
it yields for every $k\geq0$
$E\big(d_H(y,\psi_{m,k},y_{m,k})\big)\leq E\big(d_H(y,x_k,y_{0,k})\big)+E\Big(t_k\sum_{i=1}^{m}\big(f_i(y)-f_i(\psi_{i-1,k})\big)\Big)+\frac{1}{\sigma}t_k^2\sum_{i=1}^{m}L_{f_i}^2\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}-E\Big(\sum_{i=1}^{m}\frac12 d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})\Big)$
or, equivalently,
$E\big(d_H(y,\psi_{m,k},y_{m,k})\big)\leq E\big(d_H(y,x_k,y_{0,k})\big)+E\Big(t_k\sum_{i=1}^{m}\big(f_i(y)-f_i(x_k)+f_i(x_k)-f_i(\psi_{i-1,k})\big)\Big)+\frac{1}{\sigma}t_k^2\sum_{i=1}^{m}L_{f_i}^2\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}-E\Big(\sum_{i=1}^{m}\frac12 d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})\Big).$
Thus, for every $k\geq0$,
(6) $E\big(d_H(y,\psi_{m,k},y_{m,k})\big)\leq E\big(d_H(y,x_k,y_{0,k})\big)+t_kE\Big(\sum_{i=1}^{m}f_i(y)-\sum_{i=1}^{m}f_i(x_k)\Big)+\frac{1}{\sigma}t_k^2\sum_{i=1}^{m}L_{f_i}^2\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}-E\Big(\sum_{i=1}^{m}\frac12 d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})\Big)+E\Big(t_k\sum_{i=1}^{m}\big(f_i(x_k)-f_i(\psi_{i-1,k})\big)\Big).$
On the other hand, by using the Lipschitz continuity of $\nabla H^{*}$, it yields for every $k\geq0$
$\sum_{i=1}^{m}\big(f_i(x_k)-f_i(\psi_{i-1,k})\big)=\sum_{i=2}^{m}\sum_{j=1}^{i-1}\big(f_i(\psi_{j-1,k})-f_i(\psi_{j,k})\big)\leq\sum_{i=2}^{m}\sum_{j=1}^{i-1}L_{f_i}\|\psi_{j-1,k}-\psi_{j,k}\|\leq\sum_{l=1}^{m}L_{f_l}\sum_{i=2}^{m}\|\psi_{i-1,k}-\psi_{i,k}\|$
$=\sum_{l=1}^{m}L_{f_l}\sum_{i=2}^{m}\|\nabla H^{*}(y_{i-1,k})-\nabla H^{*}(y_{i,k})\|\leq\frac{1}{\sigma}\sum_{l=1}^{m}L_{f_l}\sum_{i=2}^{m}\|y_{i-1,k}-y_{i,k}\|=\frac{1}{\sigma}\sum_{l=1}^{m}L_{f_l}\sum_{i=2}^{m}\epsilon_{i,k}\frac{t_k}{p_i}\|f_i'(\psi_{i-1,k})\|\leq\frac{1}{\sigma}t_k\sum_{l=1}^{m}L_{f_l}\sum_{i=1}^{m}\frac{\epsilon_{i,k}}{p_i}L_{f_i}.$
Therefore, for every $k\geq0$,
(7) $E\Big(t_k\sum_{i=1}^{m}\big(f_i(x_k)-f_i(\psi_{i-1,k})\big)\Big)\leq\frac{1}{\sigma}t_k^2\sum_{l=1}^{m}L_{f_l}\,E\Big(\sum_{i=1}^{m}\frac{\epsilon_{i,k}}{p_i}L_{f_i}\Big)=\frac{1}{\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2.$
Combining (6) and (7) gives for every $k\geq0$
(8) $E\big(d_H(y,\psi_{m,k},y_{m,k})\big)\leq E\big(d_H(y,x_k,y_{0,k})\big)+t_kE\Big(\sum_{i=1}^{m}f_i(y)-\sum_{i=1}^{m}f_i(x_k)\Big)+\frac{1}{\sigma}t_k^2\sum_{i=1}^{m}L_{f_i}^2\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}-E\Big(\sum_{i=1}^{m}\frac12 d_H(\psi_{i,k},\psi_{i-1,k},y_{i-1,k})\Big)+\frac{1}{\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2.$
Since $\psi_{m,k}=x_{k+1}$ and $y_{m,k}=y_{0,k+1}$, and using that $\sum_{i=1}^{m}L_{f_i}^2\leq\big(\sum_{i=1}^{m}L_{f_i}\big)^2$, we get for every $k\geq0$ that
$E\big(d_H(y,x_{k+1},y_{0,k+1})\big)\leq E\big(d_H(y,x_k,y_{0,k})\big)+t_kE\Big(\sum_{i=1}^{m}f_i(y)-\sum_{i=1}^{m}f_i(x_k)\Big)+\frac{1}{\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2\Big(\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}+1\Big).$
By summing up this inequality from k=0 to N−1, where $N\geq1$, we get
$\sum_{k=0}^{N-1}t_kE\Big(\sum_{i=1}^{m}f_i(x_k)-\sum_{i=1}^{m}f_i(y)\Big)+E\big(d_H(y,x_N,y_{0,N})\big)\leq d_H(y,x_0,y_{0,0})+\sum_{k=0}^{N-1}\frac{1}{\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2\Big(\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}+1\Big).$
Since $d_H(y,x_N,y_{0,N})\geq0$, as $y_{0,N}\in\partial H(x_N)$, we get
$E\Big(\min_{0\leq k\leq N-1}\sum_{i=1}^{m}f_i(x_k)-\sum_{i=1}^{m}f_i(y)\Big)\leq\frac{d_H(y,x_0,y_{0,0})+\frac{1}{\sigma}\big(\sum_{i=1}^{m}L_{f_i}\big)^2\Big(\big(\sum_{i=1}^{m}\frac{1}{p_i^2}\big)^{1/2}+1\Big)\sum_{k=0}^{N-1}t_k^2}{\sum_{k=0}^{N-1}t_k}$
and this finishes the proof.

Remark 3.4

The set from which the variable y is chosen in the previous theorem might seem restrictive; however, we recall that in many applications $\operatorname{dom}H$ is the set of feasible solutions of the optimization problem (3). Since $\operatorname{im}(\nabla H^{*})=\operatorname{dom}\partial H:=\{x\in\mathbb{R}^n:\partial H(x)\neq\emptyset\}\subseteq\operatorname{dom}H$, the inequality in Theorem 3.3 is fulfilled for every $y\in\operatorname{im}(\nabla H^{*})$.

Remark 3.5

Note furthermore that so far we have not made any assumptions about the stepsizes in Theorem 3.3. It is, however, clear from the statement that, in the case where $y=x^{*}$ for an optimal solution $x^{*}$ and the stepsizes $(t_k)_{k\geq0}$ fulfil the classical conditions $\sum_{k\geq0}t_k=+\infty$ and $\sum_{k\geq0}t_k^2<+\infty$, it follows that
$\lim_{N\to\infty}E\Big(\min_{0\leq k\leq N-1}\sum_{i=1}^{m}f_i(x_k)-\sum_{i=1}^{m}f_i(x^{*})\Big)=0.$

The optimal stepsize choice, which we provide in the following corollary, is a consequence of [8, Proposition 4.1], which states that the function $z\mapsto\frac{c+(2\sigma)^{-1}z^TDz}{b^Tz}$, where $c>0$, $b\in\mathbb{R}^d_{++}:=\{(z_1,\ldots,z_d)^T\in\mathbb{R}^d:z_i>0,\ i=1,\ldots,d\}$ and $D\in\mathbb{R}^{d\times d}$ is a symmetric positive definite matrix, attains its minimum on $\mathbb{R}^d_{++}$ at $z^{*}=\sqrt{2c\sigma/(b^TD^{-1}b)}\,D^{-1}b$, and this provides $\sqrt{2c/(\sigma\,b^TD^{-1}b)}$ as optimal objective value.
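A quick numerical sanity check of this statement (a sketch with a diagonal positive definite D, so that $D^{-1}b$ has positive entries and $z^{*}$ indeed lies in $\mathbb{R}^d_{++}$):

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, c = 5, 1.3, 0.7
b = rng.uniform(0.5, 2.0, size=d)
D = np.diag(rng.uniform(0.5, 2.0, size=d))       # diagonal positive definite

phi = lambda z: (c + z @ D @ z / (2 * sigma)) / (b @ z)
Dinv_b = np.linalg.solve(D, b)
z_star = np.sqrt(2 * c * sigma / (b @ Dinv_b)) * Dinv_b

# the claimed minimizer attains the claimed value and beats random positive points
assert np.isclose(phi(z_star), np.sqrt(2 * c / (sigma * (b @ Dinv_b))))
assert all(phi(z_star) <= phi(rng.uniform(0.01, 3.0, size=d)) + 1e-12 for _ in range(1000))
```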

Corollary 3.6

In the setting of Problem 3.1, assume that the functions $f_i$ are $L_{f_i}$-Lipschitz continuous on $\operatorname{im}(\nabla H^{*})$ for $i=1,\ldots,m$. Let $x^{*}\in\operatorname{dom}H$ be an optimal solution of (3) and $(x_k)_{k\geq0}$ be a sequence generated by Algorithm 3.2 with the optimal stepsize
$t_k:=\frac{1}{\sum_{i=1}^{m}L_{f_i}}\sqrt{\frac{\sigma\,d_H(x^{*},x_0,y_{0,0})}{\big(\sum_{i=1}^{m}\frac{1}{p_i^2}\big)^{1/2}+1}}\,\frac{1}{\sqrt{N}},\quad k=0,\ldots,N-1.$
Then for every $N\geq1$ it holds
$E\Big(\min_{0\leq k\leq N-1}\sum_{i=1}^{m}f_i(x_k)-\sum_{i=1}^{m}f_i(x^{*})\Big)\leq2\sum_{i=1}^{m}L_{f_i}\sqrt{\frac{d_H(x^{*},x_0,y_{0,0})\Big(\big(\sum_{i=1}^{m}\frac{1}{p_i^2}\big)^{1/2}+1\Big)}{\sigma}}\,\frac{1}{\sqrt{N}}.$

Remark 3.7

In the last step of the proof of Theorem 3.3, one could have chosen to use the inequality
$\sum_{k=0}^{N-1}t_k\,E\Big(\sum_{i=1}^{m}f_i\Big(\frac{\sum_{k=0}^{N-1}t_kx_k}{\sum_{k=0}^{N-1}t_k}\Big)-\sum_{i=1}^{m}f_i(y)\Big)\leq\sum_{k=0}^{N-1}t_k\,E\Big(\sum_{i=1}^{m}f_i(x_k)-\sum_{i=1}^{m}f_i(y)\Big),$
given by the convexity of $\sum_{i=1}^{m}f_i$, in order to prove convergence of the function values for the ergodic sequence $\bar{x}_k:=\frac{1}{\sum_{i=0}^{k}t_i}\sum_{i=0}^{k}t_ix_i$, $k\geq0$. This would lead for every $N\geq1$ and every $y\in\mathbb{R}^n$ to
$E\Big(\sum_{i=1}^{m}f_i(\bar{x}_{N-1})-\sum_{i=1}^{m}f_i(y)\Big)\leq\frac{d_H(y,x_0,y_{0,0})+\frac{1}{\sigma}\big(\sum_{i=1}^{m}L_{f_i}\big)^2\Big(\big(\sum_{i=1}^{m}\frac{1}{p_i^2}\big)^{1/2}+1\Big)\sum_{k=0}^{N-1}t_k^2}{\sum_{k=0}^{N-1}t_k}$
and, for the optimal stepsize choice from Corollary 3.6, to
$E\Big(\sum_{i=1}^{m}f_i(\bar{x}_{N-1})-\sum_{i=1}^{m}f_i(y)\Big)\leq2\sum_{i=1}^{m}L_{f_i}\sqrt{\frac{d_H(x^{*},x_0,y_{0,0})\Big(\big(\sum_{i=1}^{m}\frac{1}{p_i^2}\big)^{1/2}+1\Big)}{\sigma}}\,\frac{1}{\sqrt{N}}.$
This might be beneficial, as it does not require the computation of objective function values, which are, by our implicit assumption of m being large, expensive to compute.
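The weighted ergodic average can be maintained on the fly, without storing past iterates or evaluating the objective; a small sketch:

```python
import numpy as np

def ergodic_average(iterates, stepsizes):
    """Running weighted average bar{x}_k = (sum_i t_i x_i) / (sum_i t_i),
    updated incrementally as pairs (x_k, t_k) arrive."""
    total, avg = 0.0, None
    for x, t in zip(iterates, stepsizes):
        total += t
        avg = np.asarray(x, dtype=float) if avg is None else avg + (t / total) * (x - avg)
    return avg
```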

4. A stochastic incremental mirror descent algorithm with Bregman proximal step

In this section, we add another nonsmooth convex function to the objective of the optimization problem (3) and provide an extension of Algorithm 3.2 which evaluates this new summand by a proximal type step. This, however, asks for a supplementary differentiability assumption on the function inducing the mirror map.

Problem 4.1

Consider the optimization problem
(9) $\min_{x\in C}\ \sum_{i=1}^{m}f_i(x)+g(x),$
where $C\subseteq\mathbb{R}^n$ is a nonempty, convex and closed set, for every $i=1,\ldots,m$ the functions $f_i:\mathbb{R}^n\to\overline{\mathbb{R}}$ are proper and convex, $g:\mathbb{R}^n\to\overline{\mathbb{R}}$ is a proper, convex and lower semicontinuous function, and $H:\mathbb{R}^n\to\overline{\mathbb{R}}$ is a proper, $\sigma$-strongly convex and lower semicontinuous function such that $C=\overline{\operatorname{dom}H}$, H is continuously differentiable on $\operatorname{int}(\operatorname{dom}H)$, $\operatorname{im}(\nabla H^{*})\subseteq\operatorname{int}\big(\bigcap_{i=1}^{m}\operatorname{dom}f_i\big)\cap\operatorname{int}(\operatorname{dom}H)$ and $\operatorname{int}(\operatorname{dom}H)\cap\operatorname{dom}g\neq\emptyset$.

For a proper, convex and lower semicontinuous function $h:\mathbb{R}^n\to\overline{\mathbb{R}}$ we define its Bregman proximal operator with respect to the proper, $\sigma$-strongly convex and lower semicontinuous function $H:\mathbb{R}^n\to\overline{\mathbb{R}}$ as
$\operatorname{prox}_h^H:\operatorname{dom}\nabla H\to\mathbb{R}^n,\quad\operatorname{prox}_h^H(x):=\operatorname{argmin}_{u\in\mathbb{R}^n}\{h(u)+D_H(u,x)\}.$
Due to the strong convexity of H, the Bregman proximal operator is well defined. For $H=\frac12\|\cdot\|^2$ it coincides with the classical proximal operator.
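For instance, for $H=\frac12\|\cdot\|^2$ and $h=\lambda\|\cdot\|_1$ the operator reduces to the well-known soft-thresholding map; a minimal sketch:

```python
import numpy as np

def prox_l1(x, lam):
    """Classical proximal operator (H = (1/2)||.||^2) of h = lam * ||.||_1,
    i.e. argmin_u { lam*||u||_1 + (1/2)||u - x||^2 }: componentwise
    soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```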

We are now in the position to formulate the iterative scheme we propose for solving (9). In the case g=0, this algorithm is exactly the incremental version of the iterative method in [8], as actually suggested by the two authors of that paper.

Algorithm 4.2

Consider for some initial value $x_0\in\operatorname{im}(\nabla H^{*})$ and a sequence of positive stepsizes $(t_k)_{k\geq0}$ the following iterative scheme:

$(\forall k\geq0)$:
$\psi_{0,k}=x_k$;
for $i=1,\ldots,m$: $\psi_{i,k}=\nabla H^{*}\Big(\nabla H(\psi_{i-1,k})-\epsilon_{i,k}\frac{t_k}{p_i}f_i'(\psi_{i-1,k})\Big)$;
$x_{k+1}=\operatorname{prox}_{t_kg}^H(\psi_{m,k})$,

where $\epsilon_{i,k}$ is a $\{0,1\}$-valued random variable for every $i=1,\ldots,m$ and $k\geq0$, such that $\epsilon_{i,k}$ is independent of $\psi_{i-1,k}$ and $P(\epsilon_{i,k}=1)=p_i$ for every $i=1,\ldots,m$ and $k\geq0$.
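As before, the scheme is short in code; a sketch assuming generic oracles for the subgradients, the mirror map pair and the Bregman proximal step (prox_g(x, t) standing in for $\operatorname{prox}_{tg}^H(x)$; all names are placeholders):

```python
import numpy as np

def stochastic_incremental_md_prox(subgrads, grad_H, grad_H_star, prox_g,
                                   x0, stepsizes, probs, seed=None):
    """Sketch of Algorithm 4.2: incremental mirror steps on the f_i followed
    by a Bregman proximal step for g in each cycle."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in stepsizes:
        psi = x
        for g_i, p in zip(subgrads, probs):
            if rng.random() < p:                       # epsilon_{i,k}
                psi = grad_H_star(grad_H(psi) - (t / p) * g_i(psi))
        x = prox_g(psi, t)                             # Bregman proximal step
    return x
```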

Lemma 4.3

In the setting of Problem 4.1, Algorithm 4.2 is well defined.

Proof.

As $\operatorname{im}(\nabla H^{*})\subseteq\operatorname{int}\big(\bigcap_{i=1}^{m}\operatorname{dom}f_i\big)$, it follows immediately for every $i=2,\ldots,m$ and every $k\geq0$ that $\psi_{i-1,k}\in\operatorname{int}(\operatorname{dom}f_i)$, thus a subgradient of $f_i$ at $\psi_{i-1,k}$ exists.

In what follows we prove that this is the case also for $\psi_{0,k}$, for every $k\geq0$. To this aim, it is enough to show that $x_k\in\operatorname{im}(\nabla H^{*})$ for every $k\geq0$. For k=0 this statement is true by the choice of the initial value. For every $k\geq0$ we have that
$0\in\partial\big(t_kg+H-\langle\nabla H(\psi_{m,k}),\cdot\rangle\big)(x_{k+1}),$
which, according to $\operatorname{int}(\operatorname{dom}H)\cap\operatorname{dom}g\neq\emptyset$, is equivalent to
$0\in t_k\partial g(x_{k+1})+\nabla H(x_{k+1})-\nabla H(\psi_{m,k}).$
Thus $x_{k+1}\in\operatorname{dom}\nabla H=\operatorname{im}(\nabla H^{*})$ for every $k\geq0$, and this concludes the proof.

Example 4.4

Consider the case when m=1, $\epsilon_{1,k}=1$ for every $k\geq0$ and $H(x)=\frac12\|x\|^2$ for $x\in C$, while $H(x)=+\infty$ for $x\notin C$, where $C\subseteq\mathbb{R}^n$ is a nonempty, convex and closed set. In this setting, $\nabla H^{*}$ is equal to the orthogonal projection $P_C$ onto the set C. Algorithm 4.2 yields the following iterative scheme, which basically minimizes the sum $f_1+g$ over the set C:
(10) $(\forall k\geq0)\qquad x_{k+1}=\operatorname{prox}_{t_kg}^H\big(P_C(x_k-t_kf_1'(x_k))\big).$
The difficulty in Example 4.4, assuming that it is reasonably possible to project onto the set C, lies in evaluating $\operatorname{prox}_{t_kg}^H$ for every $k\geq0$, as this is itself a constrained optimization problem,
$\operatorname{prox}_{t_kg}^H(x)=\operatorname{argmin}_{u\in C}\Big\{t_kg(u)+\frac12\|x-u\|^2\Big\}.$
When $C=\mathbb{R}^n$, the iterative scheme (10) becomes the proximal subgradient algorithm investigated in [15].

Theorem 4.5

In the setting of Problem 4.1, assume that the functions $f_i$ are $L_{f_i}$-Lipschitz continuous on $\operatorname{im}(\nabla H^{*})$ for $i=1,\ldots,m$. Let $(x_k)_{k\geq0}$ be a sequence generated by Algorithm 4.2. Then for every $N\geq1$ and every $y\in\mathbb{R}^n$ it holds
$E\Big(\min_{0\leq k\leq N-1}\Big(\sum_{i=1}^{m}f_i+g\Big)(x_{k+1})-\Big(\sum_{i=1}^{m}f_i+g\Big)(y)\Big)\leq\frac{2\sigma D_H(y,x_0)+\Big(2\big(\sum_{i=1}^{m}\frac{1}{p_i^2}\big)^{1/2}+3+2m\Big)\big(\sum_{i=1}^{m}L_{f_i}\big)^2\sum_{k=0}^{N-1}t_k^2}{2\sigma\sum_{k=0}^{N-1}t_k}.$

Proof.

Let $y\in\bigcap_{i=1}^{m}\operatorname{dom}f_i\cap\operatorname{dom}g\cap\operatorname{dom}H$ be fixed. For y outside this set the conclusion follows automatically.

As in the first part of the proof of Theorem 3.3, we obtain instead of (8) the following inequality, which holds for every $k\geq0$:
(11) $E\big(D_H(y,\psi_{m,k})\big)\leq E\big(D_H(y,x_k)\big)+t_kE\Big(\sum_{i=1}^{m}f_i(y)-\sum_{i=1}^{m}f_i(x_k)\Big)+\frac{1}{\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2\Big(\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}+1\Big)-E\Big(\sum_{i=1}^{m}\frac12 D_H(\psi_{i,k},\psi_{i-1,k})\Big).$
As pointed out in the proof of Lemma 4.3, for every $k\geq0$ we have $0\in t_k\partial g(x_{k+1})+\nabla H(x_{k+1})-\nabla H(\psi_{m,k})$, thus
$t_k\big(g(y)-g(x_{k+1})\big)\geq\langle\nabla H(\psi_{m,k})-\nabla H(x_{k+1}),\,y-x_{k+1}\rangle.$
The three point identity leads to
$t_k\big(g(y)-g(x_{k+1})\big)\geq-\big(D_H(y,\psi_{m,k})-D_H(y,x_{k+1})-D_H(x_{k+1},\psi_{m,k})\big)$
or, equivalently,
$t_k\big(g(x_{k+1})-g(y)\big)+D_H(y,x_{k+1})\leq D_H(y,\psi_{m,k})-D_H(x_{k+1},\psi_{m,k})$
for every $k\geq0$. Since the involved functions are measurable, we can take the expected value on both sides and obtain for every $k\geq0$
(12) $t_kE\big(g(x_{k+1})-g(y)\big)+E\big(D_H(y,x_{k+1})\big)\leq E\big(D_H(y,\psi_{m,k})\big)-E\big(D_H(x_{k+1},\psi_{m,k})\big).$
Combining (11) and (12) gives for every $k\geq0$
$t_kE\big(g(x_{k+1})-g(y)\big)+t_kE\Big(\sum_{i=1}^{m}f_i(x_k)-\sum_{i=1}^{m}f_i(y)\Big)+E\big(D_H(y,x_{k+1})\big)\leq E\big(D_H(y,x_k)\big)+\frac{1}{\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2\Big(\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}+1\Big)-E\big(D_H(x_{k+1},\psi_{m,k})\big)-\sum_{i=1}^{m}\frac12E\big(D_H(\psi_{i,k},\psi_{i-1,k})\big).$
By adding and subtracting $E\big(\sum_{i=1}^{m}f_i(x_{k+1})\big)$ and by using afterwards the Lipschitz continuity of $\sum_{i=1}^{m}f_i$, we get for every $k\geq0$
$t_kE\Big(\Big(\sum_{i=1}^{m}f_i+g\Big)(x_{k+1})-\Big(\sum_{i=1}^{m}f_i+g\Big)(y)\Big)-t_k\sum_{i=1}^{m}L_{f_i}E\big(\|x_k-x_{k+1}\|\big)+E\big(D_H(y,x_{k+1})\big)\leq E\big(D_H(y,x_k)\big)+\frac{1}{\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2\Big(\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}+1\Big)-E\big(D_H(x_{k+1},\psi_{m,k})\big)-\sum_{i=1}^{m}\frac12E\big(D_H(\psi_{i,k},\psi_{i-1,k})\big).$
By the triangle inequality, we obtain for every $k\geq0$
$t_kE\Big(\Big(\sum_{i=1}^{m}f_i+g\Big)(x_{k+1})-\Big(\sum_{i=1}^{m}f_i+g\Big)(y)\Big)+E\big(D_H(y,x_{k+1})\big)\leq E\big(D_H(y,x_k)\big)+\frac{1}{\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2\Big(\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}+1\Big)-E\big(D_H(x_{k+1},\psi_{m,k})\big)+t_k\sum_{i=1}^{m}L_{f_i}E\big(\|x_k-\psi_{m,k}\|\big)+t_k\sum_{i=1}^{m}L_{f_i}E\big(\|\psi_{m,k}-x_{k+1}\|\big)-\sum_{i=1}^{m}\frac12E\big(D_H(\psi_{i,k},\psi_{i-1,k})\big),$
which, due to Young's inequality and the strong convexity of H, leads to
$t_kE\Big(\Big(\sum_{i=1}^{m}f_i+g\Big)(x_{k+1})-\Big(\sum_{i=1}^{m}f_i+g\Big)(y)\Big)+E\big(D_H(y,x_{k+1})\big)\leq E\big(D_H(y,x_k)\big)+\frac{1}{\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2\Big(\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}+1\Big)-E\big(D_H(x_{k+1},\psi_{m,k})\big)+t_k\sum_{i=1}^{m}L_{f_i}E\big(\|x_k-\psi_{m,k}\|\big)+\frac{1}{2\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2+E\big(D_H(x_{k+1},\psi_{m,k})\big)-\sum_{i=1}^{m}\frac12E\big(D_H(\psi_{i,k},\psi_{i-1,k})\big).$
Since $\|x_k-\psi_{m,k}\|=\big\|\sum_{i=1}^{m}(\psi_{i-1,k}-\psi_{i,k})\big\|\leq\sum_{i=1}^{m}\|\psi_{i-1,k}-\psi_{i,k}\|$, we get for every $k\geq0$ that
$t_kE\Big(\Big(\sum_{i=1}^{m}f_i+g\Big)(x_{k+1})-\Big(\sum_{i=1}^{m}f_i+g\Big)(y)\Big)+E\big(D_H(y,x_{k+1})\big)\leq E\big(D_H(y,x_k)\big)+\frac{1}{2\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2\Big(2\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}+3\Big)+t_k\sum_{i=1}^{m}L_{f_i}E\Big(\sum_{i=1}^{m}\|\psi_{i-1,k}-\psi_{i,k}\|\Big)-\sum_{i=1}^{m}\frac12E\big(D_H(\psi_{i,k},\psi_{i-1,k})\big).$
Young's inequality and the strong convexity of H imply that for every $i=1,\ldots,m$ and every $k\geq0$
$t_k\sum_{l=1}^{m}L_{f_l}\|\psi_{i-1,k}-\psi_{i,k}\|\leq\frac{1}{\sigma}t_k^2\Big(\sum_{l=1}^{m}L_{f_l}\Big)^2+\frac{\sigma}{4}\|\psi_{i-1,k}-\psi_{i,k}\|^2\leq\frac{1}{\sigma}t_k^2\Big(\sum_{l=1}^{m}L_{f_l}\Big)^2+\frac12D_H(\psi_{i,k},\psi_{i-1,k})$
and thus
$t_kE\Big(\Big(\sum_{i=1}^{m}f_i+g\Big)(x_{k+1})-\Big(\sum_{i=1}^{m}f_i+g\Big)(y)\Big)+E\big(D_H(y,x_{k+1})\big)\leq E\big(D_H(y,x_k)\big)+\frac{1}{2\sigma}t_k^2\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2\Big(2\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}+3+2m\Big).$
Summing up this inequality from k=0 to N−1, for $N\geq1$, we get
$\sum_{k=0}^{N-1}t_kE\Big(\Big(\sum_{i=1}^{m}f_i+g\Big)(x_{k+1})-\Big(\sum_{i=1}^{m}f_i+g\Big)(y)\Big)+E\big(D_H(y,x_N)\big)\leq D_H(y,x_0)+\frac{1}{2\sigma}\Big(\sum_{i=1}^{m}L_{f_i}\Big)^2\Big(2\Big(\sum_{i=1}^{m}\frac{1}{p_i^2}\Big)^{1/2}+3+2m\Big)\sum_{k=0}^{N-1}t_k^2.$
This shows that
$E\Big(\min_{0\leq k\leq N-1}\Big(\sum_{i=1}^{m}f_i+g\Big)(x_{k+1})-\Big(\sum_{i=1}^{m}f_i+g\Big)(y)\Big)\leq\frac{2\sigma D_H(y,x_0)+\Big(2\big(\sum_{i=1}^{m}\frac{1}{p_i^2}\big)^{1/2}+3+2m\Big)\big(\sum_{i=1}^{m}L_{f_i}\big)^2\sum_{k=0}^{N-1}t_k^2}{2\sigma\sum_{k=0}^{N-1}t_k}$
and therefore finishes the proof.

The following result is again a consequence of [8, Proposition 4.1].

Corollary 4.6

In the setting of Problem 4.1, assume that the functions $f_i$ are $L_{f_i}$-Lipschitz continuous on $\operatorname{im}(\nabla H^{*})$ for $i=1,\ldots,m$. Let $x^{*}\in\operatorname{dom}H$ be an optimal solution of (9) and $(x_k)_{k\geq0}$ be a sequence generated by Algorithm 4.2 with the optimal stepsize
$t_k:=\frac{1}{\sum_{i=1}^{m}L_{f_i}}\sqrt{\frac{2\sigma\,D_H(x^{*},x_0)}{2\big(\sum_{i=1}^{m}\frac{1}{p_i^2}\big)^{1/2}+3+2m}}\,\frac{1}{\sqrt{N}},\quad k=0,\ldots,N-1.$
Then for every $N\geq1$ it holds
$E\Big(\min_{0\leq k\leq N-1}\Big(\sum_{i=1}^{m}f_i+g\Big)(x_k)-\Big(\sum_{i=1}^{m}f_i+g\Big)(x^{*})\Big)\leq\sum_{i=1}^{m}L_{f_i}\sqrt{\frac{2D_H(x^{*},x_0)\Big(2\big(\sum_{i=1}^{m}\frac{1}{p_i^2}\big)^{1/2}+3+2m\Big)}{\sigma}}\,\frac{1}{\sqrt{N}}.$

Remark 4.7

The same considerations as in Remark 3.7 about ergodic convergence apply also to the rates provided in Theorem 4.5 and Corollary 4.6.

Remark 4.8

Note the straightforward dependence of the optimal stepsizes, as well as of the right-hand side of the convergence statement, on the data, i.e. the distance of the initial point to optimality, the Lipschitz constants $L_{f_i}$ and the probabilities $p_i$. This backs up the intuition that fewer subgradient evaluations, i.e. smaller $p_i$, do not come for free but at the cost of a worse constant in the convergence rate.

5. Applications

In the numerical experiments carried out in this section, we compare three versions of the provided algorithms. First, the non-incremental version, which takes full subgradient steps with respect to the sum of all component functions instead of each single one individually; this can be viewed as a special case of the given algorithms with m=1 and $\epsilon_{1,k}=1$ for all $k\geq0$. Secondly, we discuss the deterministic incremental version, which uses the subgradient of every single component function in every iteration and thus corresponds to the case $\epsilon_{i,k}=1$ for every $i=1,\ldots,m$ and every $k\geq0$. Lastly, we apply the algorithms as intended, by evaluating the subgradients of the respective component functions incrementally with a probability different from 1.

5.1. Tomography

This application can be found in [3] and arises in the reconstruction of images in PET. We consider the following problem
(13) $\min_{x\in\Delta}\ -\sum_{i=1}^{m}y_i\log\Big(\sum_{j=1}^{n}r_{ij}x_j\Big),$
where $\Delta:=\{x\in\mathbb{R}^n:\sum_{j=1}^{n}x_j=1,\ x\geq0\}$ and $r_{ij}$ denotes for $i=1,\ldots,m$ and $j=1,\ldots,n$ the entry of the matrix $R\in\mathbb{R}^{m\times n}$ in the i-th row and j-th column, all of which are assumed to be strictly positive. Furthermore, $y_i$ denotes for $i=1,\ldots,m$ the positive number of photons measured in the i-th bin. As discussed in Example 2.5, this can be incorporated into our framework with the mirror map induced by $H(x)=\sum_{i=1}^{n}x_i\log(x_i)$ for $x\in\Delta$ and $H(x)=+\infty$ otherwise. As initial value, we use the all-ones vector divided by the dimension n.
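The i-th component function $f_i(x)=-y_i\log(r_i^Tx)$ is differentiable on the relative interior of $\Delta$ with $\nabla f_i(x)=-y_ir_i/(r_i^Tx)$. A self-contained sketch of the stochastic incremental version on synthetic data follows; the matrix, counts, probability and stepsizes are illustrative choices, not those used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 20
R = rng.uniform(0.1, 1.0, size=(m, n))            # strictly positive entries
y = rng.integers(1, 100, size=m).astype(float)    # photon counts

def grad_f(i, x):
    # gradient of f_i(x) = -y_i * log(r_i^T x)
    return -y[i] * R[i] / (R[i] @ x)

def grad_H_star(v):
    # entropy mirror map of Example 2.5 (max-shifted for numerical stability)
    z = np.exp(v - np.max(v))
    return z / z.sum()

p, t0 = 0.2, 1e-3                 # sweeping probability and stepsize scale
x = np.full(n, 1.0 / n)           # all-ones vector divided by n
v = np.log(x)                     # dual point with grad_H_star(v) = x
for k in range(200):
    tk = t0 / np.sqrt(k + 1)
    for i in range(m):
        if rng.random() < p:      # epsilon_{i,k}
            v = v - (tk / p) * grad_f(i, x)
            x = grad_H_star(v)
print("objective:", -float(y @ np.log(R @ x)))
```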

We also want to point out that a similar example given in [8], in which a convex function is minimized over the unit simplex $\Delta$, does not quite match the assumptions made throughout that paper, as the interior of $\Delta$ is empty and the function H can therefore not be continuously differentiable in the classical sense. With the setting of Section 3, however, we are able to tackle this problem.

The bad performance of the deterministic incremental version of Algorithm 3.2, see Figure 1, can be explained by the fact that many more evaluations of the mirror map are needed, which increases the overall computation time dramatically. The stochastic version, however, performs rather well after evaluating only roughly a fifth of the total number of component functions, see Table 1.

Figure 1. Results for the optimization problem (13). A plot of $(f_N-f(x_{\text{best}}))/(f(x_0)-f(x_{\text{best}}))$, where $f_N:=\min_{0\leq k\leq N}f(x_k)$, as a function of time, i.e. $x_N$ is the last iterate computed before a given point in time.


Table 1. Results for the optimization problem (13), where NI denotes the non-incremental, DI the deterministic incremental and SI the stochastic incremental version of Algorithm 3.2.

5.2. Support vector machines

We deal with the classical machine learning problem of binary classification on the well-known MNIST dataset, which contains 28 by 28 grey-scale images of handwritten digits. For each of the digits, the dataset comprises around 6000 training images and roughly 1000 test images. In line with [4], we train a classifier to distinguish the digits 6 and 7 by solving the optimization problem
(14) $\min_{w\in\mathbb{R}^{784}}\ \sum_{i=1}^{m}\max\{0,1-y_i\langle w,x_i\rangle\}+\lambda\|w\|_1,$
where, for $i=1,\ldots,m$, $x_i\in\{0,1,\ldots,255\}^{784}$ denotes the i-th training image and $y_i\in\{-1,1\}$ its label. The $\ell_1$-norm serves as a regularization term and $\lambda>0$ balances the two objectives of minimizing the classification error and reducing the $\ell_1$-norm of the classifier w. To incorporate this problem into our framework, we set $H=\frac12\|\cdot\|^2$, which leaves us with the identity as mirror map, as this problem is unconstrained. The results comparing the three versions of Algorithm 4.2 discussed at the beginning of this section are illustrated in Figure 2. As initial value we simply use the all-ones vector. All three versions show classical first-order behaviour, giving a fast decrease in objective function value first but then slowing down dramatically. More information about the performance can be seen in Table 2. All three algorithms achieve a significant decrease in objective function value after being run for only 4 s each. However, from a machine learning point of view, only the misclassification rate is of actual importance. In both regards, the stochastic incremental version clearly trumps the other two implementations. It is also interesting to note that it needs only a small fraction of the number of subgradient evaluations in comparison to the full non-incremental algorithm.
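In this setting Algorithm 4.2 amounts to incremental subgradient steps on the hinge losses followed by soft-thresholding for the $\ell_1$-term. A sketch on synthetic stand-in data (the data generation, probability and stepsizes below are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, lam = 1000, 784, 0.01
X = rng.integers(0, 256, size=(m, d)).astype(float)   # stand-in for MNIST pixels
ylab = rng.choice([-1.0, 1.0], size=m)                # stand-in labels

def hinge_subgrad(i, w):
    # subgradient of w -> max{0, 1 - y_i <w, x_i>}
    return -ylab[i] * X[i] if ylab[i] * (X[i] @ w) < 1 else np.zeros(d)

def prox_l1(w, step):
    # prox of step * lam * ||.||_1: soft-thresholding
    return np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)

p, t0 = 0.2, 1e-7                  # sweeping probability, stepsize scale
w = np.ones(d)                     # all-ones initial value
for k in range(100):
    tk = t0 / np.sqrt(k + 1)
    psi = w
    for i in range(m):
        if rng.random() < p:       # epsilon_{i,k}
            psi = psi - (tk / p) * hinge_subgrad(i, psi)
    w = prox_l1(psi, tk)           # proximal step for the l1 regularizer
```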

Figure 2. Numerical results for the optimization problem (14) with $\lambda=0.01$. The plot shows $\min_{0\leq k\leq N}f(x_k)$ as a function of time, i.e. $x_N$ is the last iterate computed before a given point in time.


Table 2. Numerical results for the optimization problem (14), where NI denotes the non-incremental, DI the deterministic incremental and SI the stochastic incremental version of Algorithm 4.2.

6. Conclusion

In this paper, we present two algorithms for solving nonsmooth convex optimization problems in which the objective function is a sum of many functions that are evaluated via their respective subgradients, under the implicit presence of a constraint set which is dealt with by a so-called mirror map. By allowing a random selection of the component functions to be evaluated in each iteration, the proposed methods become suitable even for very large-scale problems. We prove a convergence order of $\mathcal{O}(1/\sqrt{k})$ in expectation for the kth best objective function value, which is standard for subgradient methods. However, even in the case where all component functions are differentiable, it is not clear whether better theoretical estimates can be achieved, due to the need for diminishing stepsizes in order to obtain convergence of incremental algorithms. Future work could comprise the investigation of different stepsizes, such as constant or dynamic stepsizes as in [2]. Another possible extension would be to use different selection procedures, such as random subsets of fixed size. Our framework, however, does not provide the right setting for such a batch approach, as it would leave $\epsilon_{i,k}$ and $\epsilon_{j,k}$ dependent.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

The research of Radu Ioan Boţ has been partially supported by FWF (Austrian Science Fund) [project I 2419-N32]. The research of Axel Böhm has been supported by the doctoral programme Vienna Graduate School on Computational Optimization (VGSCO), FWF (Austrian Science Fund) [project W 1260].

References

[1] Bertsekas DP. Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In: Sra S, Nowozin S, Wright SJ, editors. Optimization for machine learning. Neural Information Processing Series. Cambridge (MA): MIT Press; 2012. p. 85–120.
[2] Nedic A, Bertsekas DP. Incremental subgradient methods for nondifferentiable optimization. SIAM J Optim. 2001;12(1):109–138. doi: 10.1137/S1052623499362111
[3] Ben-Tal A, Margalit T, Nemirovski A. The ordered subsets mirror descent optimization method with applications to tomography. SIAM J Optim. 2001;12(1):79–108. doi: 10.1137/S1052623499354564
[4] Xiao L. Dual averaging methods for regularized stochastic learning and online optimization. J Mach Learn Res. 2010;11:2543–2596.
[5] Gurbuzbalaban M, Ozdaglar A, Parrilo PA. On the convergence rate of incremental aggregated gradient algorithms. SIAM J Optim. 2017;27(2):1035–1048. doi: 10.1137/15M1049695
[6] Wei Z, He QH. Nonsmooth steepest descent method by proximal subdifferentials in Hilbert spaces. J Optim Theory Appl. 2014;161(2):465–477. doi: 10.1007/s10957-013-0444-z
[7] Bauschke HH, Bolte J, Teboulle M. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math Oper Res. 2017;42(2):330–348. doi: 10.1287/moor.2016.0817
[8] Beck A, Teboulle M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper Res Lett. 2003;31(3):167–175. doi: 10.1016/S0167-6377(02)00231-6
[9] Nemirovskii AS, Yudin DB. Problem complexity and method efficiency in optimization. Hoboken (NJ): Wiley; 1983.
[10] Nesterov Y. Primal-dual subgradient methods for convex problems. Math Program. 2009;120(1):221–259. doi: 10.1007/s10107-007-0149-x
[11] Zhou Z, Mertikopoulos P, Bambos N, et al. Mirror descent in non-convex stochastic programming. arXiv:1706.05681; 2017.
[12] Shalev-Shwartz S. Online learning and online convex optimization. Found Trends Mach Learn. 2012;4(2):107–194. doi: 10.1561/2200000018
[13] Bolte J, Teboulle M. Barrier operators and associated gradient-like dynamical systems for constrained minimization problems. SIAM J Control Optim. 2003;42(4):1266–1292. doi: 10.1137/S0363012902410861
[14] Combettes PL, Pesquet J-C. Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J Optim. 2015;25(2):1221–1248. doi: 10.1137/140971233
[15] Cruz JYB. On proximal subgradient splitting method for minimizing the sum of two nonsmooth convex functions. Set-Valued Var Anal. 2017;25(2):245–263. doi: 10.1007/s11228-016-0376-5