An incremental mirror descent subgradient algorithm with random sweeping and proximal step

ABSTRACT

We investigate the convergence properties of incremental mirror descent type subgradient algorithms for minimizing the sum of convex functions. In each step, we only evaluate the subgradient of a single component function and mirror it back to the feasible domain, which makes iterations very cheap to compute. The analysis is made for a randomized selection of the component functions, which yields the deterministic algorithm as a special case. Under supplementary differentiability assumptions on the function which induces the mirror map, we are also able to deal with the presence of another term in the objective function, which is evaluated via a proximal type step. In both cases, we derive convergence rates of $O (1 / \sqrt{k})$ in expectation for the kth best objective function value and illustrate our theoretical findings by numerical experiments in positron emission tomography and machine learning.

1. Introduction

We consider the problem of minimizing the sum of nonsmooth convex functions $\min_{x \in C} \sum_{i = 1}^{m} f_{i} (x),$ (1) where $C \subseteq R^{n}$ is a nonempty, convex and closed set and, for every $i = 1, \dots, m$, the so-called component functions $f_{i} : R^{n} \to \bar{R} := R \cup \{\pm \infty\}$ are assumed to be proper and convex and will be evaluated via their respective subgradients. We implicitly assume that m is large, so that it is very costly to evaluate all component functions in each iteration. Consequently, we examine algorithms which use the subgradient of only a single component function in each iteration. These so-called incremental algorithms, see [Citation1,Citation2], have been applied to large-scale problems arising in tomography [Citation3], generalized assignment problems [Citation2] and machine learning [Citation4]. We refer also to [Citation5] for a slightly different approach, where in the spirit of incremental algorithms only the gradient of one of the component functions is evaluated in each step, but gradients at old iterates are used for the other components. Both subgradient algorithms and incremental methods usually require decreasing stepsizes in order to converge, which makes them slow near an optimal solution. However, they provide, within a very small number of iterations, an optimal value of low accuracy and possess a rate of convergence which is almost independent of the dimension of the problem. We refer the reader to [Citation6] for a subgradient algorithm designed for the minimization of a nonsmooth nonconvex function by making use of proximal subgradients.

When solving optimization problems of type (1), one might want to capture in the formulation of the iterative scheme the geometry of the feasible set C. This can be done by a so-called mirror map, which mirrors each iterate onto the feasible set. The Bregman distance associated with the function that induces the mirror map plays an essential role in the convergence analysis and in the formulation of convergence rate results (see [Citation7,Citation8]). So-called mirror descent algorithms were first discussed in [Citation9] and more recently in [Citation8,Citation10,Citation11] in a very general framework, in [Citation4,Citation12] from a statistical learning point of view and in [Citation13] for the case of dynamical systems. The mirror map can be viewed as a generalization of the ordinary orthogonal projection onto C in Hilbert spaces (see Example 2.4), but also allows for a more careful consideration of the structure of the problem, as is the case when the objective function is subdifferentiable only on the relative interior of the feasible set. In such a setting, one can design a mirror map which maps not onto the entire feasible set but only onto a subset of it where the objective function is subdifferentiable (see Example 2.5).

There already exists a rich literature on incremental algorithms dealing with similar problems. In [Citation1,Citation2] incremental subgradient methods with a random selection of the component functions and even projections onto a feasible set are considered, but no mirror descent. Incremental subgradient algorithms utilizing mirror descent techniques are investigated in [Citation3]; however, an additional projection onto the feasible set is required, which excludes the case where $dom f ⊉ C$ (this is taken care of in our case by the weak assumption that $im (\nabla H^{*}) \subseteq dom f$). Furthermore, the results appearing in Section 4 discussing Bregman proximal steps appear to be completely novel for this kind of problem and are only known from a forward–backward setting [Citation7].

The basic concepts in the formulation of mirror descent algorithms are recalled in Section 2. We also provide some illustrating examples, which present some special cases, as the general setting might not be immediately intuitive. In Section 3 we formulate an incremental mirror descent subgradient algorithm with random sweeping of the component functions which we show to have a convergence rate of $O (1 / \sqrt{k})$ in expectation for the kth best objective function value. In Section 4 we ask additionally for differentiability of the function which induces the mirror map and are then able to add another nonsmooth convex function to the objective function which is evaluated in the iterative scheme by a proximal type step. For the resulting algorithm, we show a similar convergence result. In the last section, we illustrate the theoretical findings by numerical experiments in positron emission tomography (PET) and machine learning.

2. Elements of convex analysis and the mirror descent algorithm

Throughout the paper, we assume that $R^{n}$ is endowed with the Euclidean inner product $⟨ \cdot, \cdot ⟩$ and corresponding norm $∥ \cdot ∥ = \sqrt{⟨ \cdot, \cdot ⟩}$. For a nonempty convex set $C \subseteq R^{n}$ we denote by $ri C$ its relative interior, which is the interior of C relative to its affine hull. For a convex function $f : R^{n} \to \bar{R}$ we denote by $dom f := \{x \in R^{n} : f (x) < + \infty\}$ its effective domain and say that f is proper, if $f > - \infty$ and $dom f \neq \emptyset$. The subdifferential of f at $x \in R^{n}$ is defined as $\partial f (x) := \{p \in R^{n} : f (y) \geq f (x) + ⟨ p, y - x ⟩ \ \forall y \in R^{n}\}$, for $f (x) \in R$, and as $\partial f (x) := \emptyset$, otherwise. We will write $f^{'} (x)$ for an arbitrary subgradient, i.e. an element of the subdifferential $\partial f (x)$.

Problem 2.1

Consider the optimization problem $\min_{x \in C} f (x),$ (2) where $C \subseteq R^{n}$ is a nonempty, convex and closed set, $f : R^{n} \to \bar{R}$ is a proper and convex function, and $H : R^{n} \to \bar{R}$ is a proper, lower semicontinuous and σ-strongly convex function such that $C = \overline{dom H}$ and $im (\nabla H^{*}) \subseteq int (dom f)$.

We say that $H : R^{n} \to \bar{R}$ is σ-strongly convex for $σ > 0$, if for every $x, x^{'} \in R^{n}$ and every $λ \in [0, 1]$ it holds $(σ / 2) λ (1 - λ) ∥ x - x^{'} ∥^{2} + H (λ x + (1 - λ) x^{'}) \leq λ H (x) + (1 - λ) H (x^{'})$. It is well known that, when H is proper, lower semicontinuous and σ-strongly convex, then its conjugate function $H^{*} : R^{n} \to \bar{R}, H^{*} (y) = \sup_{x \in R^{n}} \{⟨ y, x ⟩ - H (x)\}$, is Fréchet differentiable (thus it has full domain) and its gradient $\nabla H^{*}$ is $1 / σ$-Lipschitz continuous or, equivalently, $H^{*}$ is Fréchet differentiable and its gradient $\nabla H^{*}$ is σ-cocoercive, which means that for every $y, y^{'} \in R^{n}$ it holds $σ ∥ \nabla H^{*} (y) - \nabla H^{*} (y^{'}) ∥^{2} \leq ⟨ y - y^{'}, \nabla H^{*} (y) - \nabla H^{*} (y^{'}) ⟩$. Recall that $im (\nabla H^{*}) := \{\nabla H^{*} (y) : y \in R^{n}\}$.

The following mirror descent algorithm has been introduced in [Citation10] under the name dual averaging.

Algorithm 2.2

Consider for some initial values $x_{0} \in int (dom f), y_{0} \in R^{n}$ and sequence of positive stepsizes $(t_{k})_{k \geq 0}$ the following iterative scheme: $(\forall k \geq 0) \left[ \begin{array}{l} y_{k + 1} = y_{k} - t_{k} f^{'} (x_{k}) \\ x_{k + 1} = \nabla H^{*} (y_{k + 1}) . \end{array} \right.$

We notice that, since the sequence $(x_{k})_{k \geq 0}$ is contained in the interior of the effective domain of f, the algorithm is well defined. The assumptions concerning the function H, which induces the mirror map $\nabla H^{*}$ , are not consistent in the literature. Sometimes H is assumed to be a Legendre function as in [Citation7], or strongly convex and differentiable as in [Citation8,Citation11]. In the following section, we will only assume that H is proper, lower semicontinuous and strongly convex.

Example 2.3

For $H = \frac{1}{2} ∥ \cdot ∥^{2}$ we have that $H^{*} = \frac{1}{2} ∥ \cdot ∥^{2}$ and thus $\nabla H^{*}$ is the identity operator on $R^{n}$ . Consequently, Algorithm 2.2 reduces to the classical subgradient method: $(\forall k \geq 0) x_{k + 1} = x_{k} - t_{k} f^{'} (x_{k}) .$

Example 2.4

For $C \subseteq R^{n}$ a nonempty, convex and closed set, take $H (x) = \frac{1}{2} ∥ x ∥^{2}$, for $x \in C$, and $H (x) = + \infty$, otherwise. Then $\nabla H^{*} = P_{C}$, where $P_{C}$ denotes the orthogonal projection onto C. In this setting, Algorithm 2.2 becomes: $(\forall k \geq 0) \left[ \begin{array}{l} y_{k + 1} = y_{k} - t_{k} f^{'} (x_{k}) \\ x_{k + 1} = P_{C} (y_{k + 1}) . \end{array} \right.$ This iterative scheme is similar to, but different from, the well-known subgradient projection algorithm, which reads: $(\forall k \geq 0) \left[ \begin{array}{l} y_{k + 1} = x_{k} - t_{k} f^{'} (x_{k}) \\ x_{k + 1} = P_{C} (y_{k + 1}) . \end{array} \right.$
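The difference between the two schemes of Example 2.4 can be made concrete with a small sketch. The choices below are illustrative assumptions and not taken from the paper: $C = [0, 1] \subseteq R$, $f (x) = | x - 0.9 |$, a constant stepsize, and the hypothetical helper names `project_interval`, `subgrad_abs`, `dual_averaging` and `projected_subgradient`.

```python
def project_interval(v, lo=0.0, hi=1.0):
    """Orthogonal projection P_C onto the interval C = [lo, hi]."""
    return min(max(v, lo), hi)


def subgrad_abs(x, c):
    """One subgradient of f(x) = |x - c| (0 is chosen at the kink)."""
    return (x > c) - (x < c)


def dual_averaging(x0, y0, c, steps, t):
    """Algorithm 2.2 with mirror map P_C: the dual variable y is never reset."""
    x, y, xs = x0, y0, [x0]
    for _ in range(steps):
        y = y - t * subgrad_abs(x, c)   # y_{k+1} = y_k - t_k f'(x_k)
        x = project_interval(y)         # x_{k+1} = P_C(y_{k+1})
        xs.append(x)
    return xs


def projected_subgradient(x0, c, steps, t):
    """Classical scheme: the step starts from the projected point x_k."""
    x, xs = x0, [x0]
    for _ in range(steps):
        x = project_interval(x - t * subgrad_abs(x, c))
        xs.append(x)
    return xs


if __name__ == "__main__":
    print(dual_averaging(0.5, 0.5, 0.9, 3, 1.0))
    print(projected_subgradient(0.5, 0.9, 3, 1.0))
```

In this toy run the two sequences agree until the projection becomes active; afterwards the dual averaging scheme continues from the unprojected point $y_{k}$, whereas the projection scheme continues from $x_{k}$, so the iterates differ.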

Example 2.5

When considering numerical experiments in PET, one often minimizes over the unit simplex $Δ := \{x = (x_{1}, \dots, x_{n})^{T} \in R^{n} : \sum_{j = 1}^{n} x_{j} = 1, x \geq 0\}$. An appropriate choice for the function H is $H (x) = \sum_{j = 1}^{n} x_{j} \log (x_{j})$ for $x \in Δ$, where $0 \log (0) = 0$, and $H (x) = + \infty$, if $x \notin Δ$. In this case $\nabla H^{*}$ is given for every $y \in R^{n}$ by $\nabla H^{*} (y) = \frac{1}{\sum_{i = 1}^{n} \exp (y_{i})} (\exp (y_{1}), \exp (y_{2}), \dots, \exp (y_{n}))^{T},$ and maps into the relative interior of Δ.
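The mirror map of Example 2.5 is the familiar softmax function. A minimal sketch of how it could be evaluated follows; the function name `entropy_mirror_map` is hypothetical, and subtracting $\max_{i} y_{i}$ before exponentiating is an implementation choice made here to avoid overflow (it does not change the value of $\nabla H^{*} (y)$, since the common factor cancels).

```python
import math


def entropy_mirror_map(y):
    """Evaluate grad H^*(y) = exp(y) / sum_i exp(y_i) for the entropy on the simplex.

    Shifting by max(y) leaves the result unchanged but keeps exp() bounded by 1.
    """
    shift = max(y)
    weights = [math.exp(v - shift) for v in y]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    x = entropy_mirror_map([1.0, 2.0, 3.0])
    # Every component is strictly positive and the components sum to one,
    # i.e. the point lies in the relative interior of the simplex.
    print(x, sum(x))
```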

The following result will play an important role in the convergence analysis that we will carry out in the next sections.

Lemma 2.6

Let $H : R^{n} \to \bar{R}$ be a proper, lower semicontinuous and σ-strongly convex function, for $σ > 0$ , $x \in R^{n}$ and $y \in \partial H (x)$ . Then it holds $H (x) + ⟨ y, x^{'} - x ⟩ + \frac{σ}{2} ∥ x^{'} - x ∥^{2} \leq H (x^{'}) \forall x^{'} \in R^{n} .$

Proof.

The function $\tilde{H} (\cdot) := H (\cdot) - (σ / 2) ∥ \cdot ∥^{2}$ is convex and $y - σ x \in \partial \tilde{H} (x)$. Thus $\tilde{H} (x) + ⟨ y - σ x, \tilde{x} - x ⟩ \leq \tilde{H} (\tilde{x}) \ \forall \tilde{x} \in R^{n}$ or, equivalently, $H (x) - \frac{σ}{2} ∥ x ∥^{2} + ⟨ y - σ x, \tilde{x} - x ⟩ \leq H (\tilde{x}) - \frac{σ}{2} ∥ \tilde{x} ∥^{2} \ \forall \tilde{x} \in R^{n} .$ Rearranging the terms leads to the desired conclusion.
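For completeness, the rearrangement in the last step of the proof can be spelled out as follows; this is only an expansion of the quadratic terms and uses nothing beyond the inequality stated in the proof.

```latex
\begin{aligned}
H(\tilde{x})
  &\geq H(x) + \langle y, \tilde{x} - x \rangle
     + \frac{\sigma}{2} \|\tilde{x}\|^{2} - \frac{\sigma}{2} \|x\|^{2}
     - \sigma \langle x, \tilde{x} - x \rangle \\
  &= H(x) + \langle y, \tilde{x} - x \rangle
     + \frac{\sigma}{2} \left( \|\tilde{x}\|^{2} - 2 \langle x, \tilde{x} \rangle + \|x\|^{2} \right) \\
  &= H(x) + \langle y, \tilde{x} - x \rangle + \frac{\sigma}{2} \|\tilde{x} - x\|^{2} .
\end{aligned}
```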

3. A stochastic incremental mirror descent algorithm

In this section, we will address the following optimization problem.

Problem 3.1

Consider the optimization problem $\min_{x \in C} \sum_{i = 1}^{m} f_{i} (x),$ (3) where $C \subseteq R^{n}$ is a nonempty, convex and closed set, for every $i = 1, \dots, m$, the functions $f_{i} : R^{n} \to \bar{R}$ are proper and convex, and $H : R^{n} \to \bar{R}$ is a proper, lower semicontinuous and σ-strongly convex function such that $C = \overline{dom H}$ and $im (\nabla H^{*}) \subseteq int (⋂_{i = 1}^{m} dom f_{i})$.

In this section, we apply the dual averaging approach described in Algorithm 2.2 to the optimization problem (3) by using the subgradient of only one component function at a time. This incremental approach (see also [Citation1,Citation2]) is similar to but slightly different from the extension suggested in [Citation8]. Furthermore, we introduce a stochastic sweeping of the component functions. For a similar strategy applied to the random selection of coordinates, we refer to [Citation14].

Algorithm 3.2

Consider for some initial values $x_{0} \in int (⋂_{i = 1}^{m} dom f_{i}), y_{m, - 1} \in R^{n}$ and sequence of positive stepsizes $(t_{k})_{k \geq 0}$ the following iterative scheme: $(\forall k \geq 0) \left[ \begin{array}{l} ψ_{0, k} = x_{k} \\ y_{0, k} = y_{m, k - 1} \\ \text{for } i = 1, \dots, m \\ \quad y_{i, k} = y_{i - 1, k} - ϵ_{i, k} \frac{t_{k}}{p_{i}} f_{i}^{'} (ψ_{i - 1, k}) \\ \quad ψ_{i, k} = \nabla H^{*} (y_{i, k}) \\ \text{end} \\ x_{k + 1} = ψ_{m, k}, \end{array} \right.$ where $ϵ_{i, k}$ is a $\{0, 1\}$-valued random variable for every $i = 1, \dots, m$ and $k \geq 0$, such that $ϵ_{i, k}$ is independent of $ψ_{i - 1, k}$ and $P (ϵ_{i, k} = 1) = p_{i}$ for every $i = 1, \dots, m$ and $k \geq 0$.

One can notice that in the above iterative scheme $y_{i, k} \in \partial H (ψ_{i, k})$ for every $i = 1, \dots, m$ and $k \geq 0$ .
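A minimal sketch of Algorithm 3.2 follows, under stated assumptions. The function name `incremental_mirror_descent` and its parameters are hypothetical; the subgradient oracles and the mirror map $\nabla H^{*}$ are passed in as callables; and, as a simplification not required by the algorithm, the starting point is taken to be $x_{0} = \nabla H^{*} (y_{m, - 1})$ so that only the dual variable has to be supplied.

```python
import random


def incremental_mirror_descent(subgrads, probs, mirror_map, y_init, stepsizes, seed=0):
    """Sketch of Algorithm 3.2 (random sweeping over the component functions).

    subgrads   -- list of callables, subgrads[i](psi) returns a subgradient of f_i at psi
    probs      -- list of probabilities p_i in (0, 1]
    mirror_map -- callable y -> grad H^*(y)
    y_init     -- dual starting point y_{m,-1}; assumption: x_0 = mirror_map(y_init)
    stepsizes  -- iterable of positive stepsizes t_0, t_1, ...
    """
    rng = random.Random(seed)
    y = list(y_init)
    psi = mirror_map(y)
    iterates = [psi]                      # x_0
    for t in stepsizes:
        for subgrad_i, p_i in zip(subgrads, probs):
            # eps_{i,k} is drawn independently of the current point psi_{i-1,k}.
            if rng.random() < p_i:
                g = subgrad_i(psi)
                y = [yj - (t / p_i) * gj for yj, gj in zip(y, g)]
                psi = mirror_map(y)
            # If eps_{i,k} = 0, both y and psi stay unchanged.
        iterates.append(psi)              # x_{k+1} = psi_{m,k}
    return iterates


if __name__ == "__main__":
    # Toy run: two linear components on R, identity mirror map, p_i = 1.
    xs = incremental_mirror_descent(
        [lambda x: [1.0], lambda x: [2.0]], [1.0, 1.0], lambda y: list(y),
        [0.0], [0.5, 0.25])
    print(xs)
```

With $p_{i} = 1$ for all i every component is visited in each sweep, which corresponds to the deterministic special case mentioned in the abstract.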

In the convergence analysis of Algorithm 3.2 we will make use of the following Bregman-distance-like function associated to the proper and convex function $H : R^{n} \to \bar{R}$ and defined as $d_{H} : R^{n} \times dom H \times R^{n} \to \bar{R}, d_{H} (x, y, z) := H (x) - H (y) - ⟨ z, x - y ⟩ .$ (4) We notice that $d_{H} (x, y, z) \geq 0$ for every $(x, y) \in R^{n} \times dom H$ and every $z \in \partial H (y)$, due to the subgradient inequality.

The function $d_{H}$ is an extension of the Bregman distance (see [Citation4,Citation7,Citation11]), which is associated to a proper and convex function $H : R^{n} \to \bar{R}$ fulfilling $dom \nabla H := \{x \in R^{n} : H \text{ is differentiable at } x\} \neq \emptyset$ and defined as $D_{H} : R^{n} \times dom \nabla H \to \bar{R}, D_{H} (x, y) = H (x) - H (y) - ⟨ \nabla H (y), x - y ⟩ .$ (5)

Theorem 3.3

In the setting of Problem 3.1, assume that the functions $f_{i}$ are $L_{f_{i}}$-Lipschitz continuous on $im (\nabla H^{*})$ for $i = 1, \dots, m$. Let $(x_{k})_{k \geq 0}$ be a sequence generated by Algorithm 3.2. Then for every $N \geq 1$ and every $y \in R^{n}$ it holds $\begin{aligned} & E (\min_{0 \leq k \leq N - 1} \sum_{i = 1}^{m} f_{i} (x_{k}) - \sum_{i = 1}^{m} f_{i} (y)) \\ & \quad \leq \frac{d_{H} (y, x_{0}, y_{0, 0}) + \frac{1}{σ} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1) \sum_{k = 0}^{N - 1} t_{k}^{2}}{\sum_{k = 0}^{N - 1} t_{k}} . \end{aligned}$

Proof.

Let $y \in ⋂_{i = 1}^{m} dom f_{i} \cap dom H$ be fixed. For y outside this set the conclusion follows automatically.

For every $i = 1, \dots, m$ and every $k \geq 0$ it holds $\begin{aligned} d_{H} (y, ψ_{i, k}, y_{i, k}) & = H (y) - H (ψ_{i, k}) - ⟨ y_{i, k}, y - ψ_{i, k} ⟩ \\ & = H (y) - H (ψ_{i, k}) - ⟨ y_{i - 1, k} - \frac{t_{k}}{p_{i}} ϵ_{i, k} f_{i}^{'} (ψ_{i - 1, k}), y - ψ_{i, k} ⟩ . \end{aligned}$
Rearranging the terms, this yields for every $i = 1, \dots, m$ and every $k \geq 0$ $\begin{aligned} d_{H} (y, ψ_{i, k}, y_{i, k}) & = d_{H} (y, ψ_{i - 1, k}, y_{i - 1, k}) - d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k}) + \frac{t_{k}}{p_{i}} ϵ_{i, k} ⟨ f_{i}^{'} (ψ_{i - 1, k}), y - ψ_{i, k} ⟩ \\ & = d_{H} (y, ψ_{i - 1, k}, y_{i - 1, k}) - d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k}) + \frac{t_{k}}{p_{i}} ϵ_{i, k} ⟨ f_{i}^{'} (ψ_{i - 1, k}), y - ψ_{i - 1, k} ⟩ \\ & \quad - \frac{t_{k}}{p_{i}} ϵ_{i, k} ⟨ f_{i}^{'} (ψ_{i - 1, k}), ψ_{i, k} - ψ_{i - 1, k} ⟩ \\ & \leq d_{H} (y, ψ_{i - 1, k}, y_{i - 1, k}) - d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k}) + \frac{t_{k}}{p_{i}} ϵ_{i, k} (f_{i} (y) - f_{i} (ψ_{i - 1, k})) \\ & \quad + \frac{t_{k}}{p_{i}} ϵ_{i, k} ∥ f_{i}^{'} (ψ_{i - 1, k}) ∥ ∥ ψ_{i - 1, k} - ψ_{i, k} ∥ . \end{aligned}$
From here we get, by Young's inequality $a b \leq \frac{1}{σ} a^{2} + \frac{σ}{4} b^{2}$, for every $i = 1, \dots, m$ and every $k \geq 0$ $\begin{aligned} d_{H} (y, ψ_{i, k}, y_{i, k}) & \leq d_{H} (y, ψ_{i - 1, k}, y_{i - 1, k}) - d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k}) + \frac{t_{k}}{p_{i}} ϵ_{i, k} (f_{i} (y) - f_{i} (ψ_{i - 1, k})) \\ & \quad + \frac{1}{σ} t_{k}^{2} \frac{1}{p_{i}^{2}} ϵ_{i, k}^{2} ∥ f_{i}^{'} (ψ_{i - 1, k}) ∥^{2} + \frac{σ}{4} ∥ ψ_{i - 1, k} - ψ_{i, k} ∥^{2} , \end{aligned}$ which, by using that H is σ-strongly convex, Lemma 2.6 and $ϵ_{i, k}^{2} = ϵ_{i, k}$, yields $\begin{aligned} d_{H} (y, ψ_{i, k}, y_{i, k}) & \leq d_{H} (y, ψ_{i - 1, k}, y_{i - 1, k}) - d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k}) + \frac{t_{k}}{p_{i}} ϵ_{i, k} (f_{i} (y) - f_{i} (ψ_{i - 1, k})) \\ & \quad + \frac{1}{σ} t_{k}^{2} \frac{1}{p_{i}^{2}} ϵ_{i, k} ∥ f_{i}^{'} (ψ_{i - 1, k}) ∥^{2} + \frac{1}{2} d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k}) \\ & = d_{H} (y, ψ_{i - 1, k}, y_{i - 1, k}) + \frac{t_{k}}{p_{i}} ϵ_{i, k} (f_{i} (y) - f_{i} (ψ_{i - 1, k})) + \frac{1}{σ} t_{k}^{2} \frac{1}{p_{i}^{2}} ϵ_{i, k} ∥ f_{i}^{'} (ψ_{i - 1, k}) ∥^{2} \\ & \quad - \frac{1}{2} d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k}) . \end{aligned}$
Using the fact that $f_{i}$ is $L_{f_{i}}$-Lipschitz continuous, it follows that $∥ f_{i}^{'} (ψ_{i - 1, k}) ∥ \leq L_{f_{i}}$ for every $i = 1, \dots, m$ and every $k \geq 0$, thus $\begin{aligned} d_{H} (y, ψ_{i, k}, y_{i, k}) & \leq d_{H} (y, ψ_{i - 1, k}, y_{i - 1, k}) + \frac{t_{k}}{p_{i}} ϵ_{i, k} (f_{i} (y) - f_{i} (ψ_{i - 1, k})) + \frac{1}{σ} t_{k}^{2} \frac{1}{p_{i}^{2}} ϵ_{i, k} L_{f_{i}}^{2} \\ & \quad - \frac{1}{2} d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k}) . \end{aligned}$
Since all the involved functions are measurable, we can take the expected value on both sides of the above inequality and get, due to the assumed independence of $ϵ_{i, k}$ and $ψ_{i - 1, k}$, for every $i = 1, \dots, m$ and every $k \geq 0$ $\begin{aligned} E (d_{H} (y, ψ_{i, k}, y_{i, k})) & \leq E (d_{H} (y, ψ_{i - 1, k}, y_{i - 1, k})) + E (\frac{t_{k}}{p_{i}} (f_{i} (y) - f_{i} (ψ_{i - 1, k}))) E (ϵ_{i, k}) \\ & \quad + \frac{1}{σ} t_{k}^{2} \frac{1}{p_{i}^{2}} L_{f_{i}}^{2} E (ϵ_{i, k}) - E (\frac{1}{2} d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k})) . \end{aligned}$
Since $E (ϵ_{i, k}) = p_{i}$, we get for every $i = 1, \dots, m$ and every $k \geq 0$ $\begin{aligned} E (d_{H} (y, ψ_{i, k}, y_{i, k})) & \leq E (d_{H} (y, ψ_{i - 1, k}, y_{i - 1, k})) + E (t_{k} (f_{i} (y) - f_{i} (ψ_{i - 1, k}))) \\ & \quad + \frac{1}{σ} t_{k}^{2} \frac{1}{p_{i}} L_{f_{i}}^{2} - E (\frac{1}{2} d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k})) . \end{aligned}$
Summing the above inequality for $i = 1, \dots, m$ and using that, by the Cauchy–Schwarz inequality, $\sum_{i = 1}^{m} L_{f_{i}}^{2} \frac{1}{p_{i}} \leq {(\sum_{i = 1}^{m} L_{f_{i}}^{4})}^{1 / 2} {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} \leq {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2},$ it yields for every $k \geq 0$ $\begin{aligned} E (d_{H} (y, ψ_{m, k}, y_{m, k})) & \leq E (d_{H} (y, x_{k}, y_{0, k})) + E (t_{k} \sum_{i = 1}^{m} (f_{i} (y) - f_{i} (ψ_{i - 1, k}))) \\ & \quad + \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} - E (\sum_{i = 1}^{m} \frac{1}{2} d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k})) \end{aligned}$ or, equivalently, $\begin{aligned} E (d_{H} (y, ψ_{m, k}, y_{m, k})) & \leq E (d_{H} (y, x_{k}, y_{0, k})) + E (t_{k} \sum_{i = 1}^{m} (f_{i} (y) - f_{i} (x_{k}) + f_{i} (x_{k}) - f_{i} (ψ_{i - 1, k}))) \\ & \quad + \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} - E (\sum_{i = 1}^{m} \frac{1}{2} d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k})) . \end{aligned}$
Thus, for every $k \geq 0$, $\begin{aligned} E (d_{H} (y, ψ_{m, k}, y_{m, k})) & \leq E (d_{H} (y, x_{k}, y_{0, k})) + t_{k} E (\sum_{i = 1}^{m} f_{i} (y) - \sum_{i = 1}^{m} f_{i} (x_{k})) \\ & \quad + \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} - E (\sum_{i = 1}^{m} \frac{1}{2} d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k})) \\ & \quad + E (t_{k} \sum_{i = 1}^{m} (f_{i} (x_{k}) - f_{i} (ψ_{i - 1, k}))) . \end{aligned}$ (6)
On the other hand, by using the Lipschitz continuity of $\nabla H^{*}$ it yields for every $k \geq 0$ $\begin{aligned} \sum_{i = 1}^{m} (f_{i} (x_{k}) - f_{i} (ψ_{i - 1, k})) & = \sum_{i = 2}^{m} \sum_{j = 1}^{i - 1} (f_{i} (ψ_{j - 1, k}) - f_{i} (ψ_{j, k})) \\ & \leq \sum_{i = 2}^{m} \sum_{j = 1}^{i - 1} L_{f_{i}} ∥ ψ_{j - 1, k} - ψ_{j, k} ∥ \leq (\sum_{l = 1}^{m} L_{f_{l}}) \sum_{i = 2}^{m} ∥ ψ_{i - 1, k} - ψ_{i, k} ∥ \\ & \leq (\sum_{l = 1}^{m} L_{f_{l}}) \sum_{i = 2}^{m} ∥ \nabla H^{*} (y_{i - 1, k}) - \nabla H^{*} (y_{i, k}) ∥ \\ & \leq \frac{1}{σ} (\sum_{l = 1}^{m} L_{f_{l}}) \sum_{i = 2}^{m} ∥ y_{i - 1, k} - y_{i, k} ∥ \\ & = \frac{1}{σ} (\sum_{l = 1}^{m} L_{f_{l}}) \sum_{i = 2}^{m} ∥ ϵ_{i, k} \frac{t_{k}}{p_{i}} f_{i}^{'} (ψ_{i - 1, k}) ∥ \\ & \leq \frac{1}{σ} t_{k} (\sum_{l = 1}^{m} L_{f_{l}}) (\sum_{i = 1}^{m} \frac{ϵ_{i, k}}{p_{i}} L_{f_{i}}) . \end{aligned}$
Therefore, for every $k \geq 0$, $\begin{aligned} E (t_{k} \sum_{i = 1}^{m} (f_{i} (x_{k}) - f_{i} (ψ_{i - 1, k}))) & \leq \frac{1}{σ} t_{k}^{2} (\sum_{l = 1}^{m} L_{f_{l}}) E (\sum_{i = 1}^{m} \frac{ϵ_{i, k}}{p_{i}} L_{f_{i}}) \\ & \leq \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} . \end{aligned}$ (7)
Combining (6) and (7) gives for every $k \geq 0$ $\begin{aligned} E (d_{H} (y, ψ_{m, k}, y_{m, k})) & \leq E (d_{H} (y, x_{k}, y_{0, k})) + t_{k} E (\sum_{i = 1}^{m} f_{i} (y) - \sum_{i = 1}^{m} f_{i} (x_{k})) \\ & \quad + \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} - E (\sum_{i = 1}^{m} \frac{1}{2} d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k})) \\ & \quad + \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} . \end{aligned}$ (8)
Since $ψ_{m, k} = x_{k + 1}$ and $y_{m, k} = y_{0, k + 1}$, and since the terms $d_{H} (ψ_{i, k}, ψ_{i - 1, k}, y_{i - 1, k})$ are nonnegative, we get for every $k \geq 0$ that $\begin{aligned} E (d_{H} (y, x_{k + 1}, y_{0, k + 1})) & \leq E (d_{H} (y, x_{k}, y_{0, k})) + t_{k} E (\sum_{i = 1}^{m} f_{i} (y) - \sum_{i = 1}^{m} f_{i} (x_{k})) \\ & \quad + \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1) . \end{aligned}$
By summing up this inequality from $k = 0$ to $N - 1$, where $N \geq 1$, we get $\begin{aligned} & \sum_{k = 0}^{N - 1} t_{k} E (\sum_{i = 1}^{m} f_{i} (x_{k}) - \sum_{i = 1}^{m} f_{i} (y)) + E (d_{H} (y, x_{N}, y_{0, N})) \\ & \quad \leq E (d_{H} (y, x_{0}, y_{0, 0})) + \sum_{k = 0}^{N - 1} \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1) . \end{aligned}$
Since $d_{H} (y, x_{N}, y_{0, N}) \geq 0$, as $y_{0, N} \in \partial H (x_{N})$, we get $\begin{aligned} & E (\min_{0 \leq k \leq N - 1} \sum_{i = 1}^{m} f_{i} (x_{k}) - \sum_{i = 1}^{m} f_{i} (y)) \\ & \quad \leq \frac{d_{H} (y, x_{0}, y_{0, 0}) + \frac{1}{σ} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1) \sum_{k = 0}^{N - 1} t_{k}^{2}}{\sum_{k = 0}^{N - 1} t_{k}} \end{aligned}$ and this finishes the proof.

Remark 3.4

The set from which the variable y is chosen in the previous theorem might seem restrictive; however, we would like to recall that in many applications $dom H$ is the set of feasible solutions of the optimization problem (3). Since $im (\nabla H^{*}) = dom \partial H := \{x \in R^{n} : \partial H (x) \neq \emptyset\} \subseteq dom H$, the inequality in Theorem 3.3 is fulfilled for every $y \in im (\nabla H^{*})$.

Remark 3.5

Note furthermore that so far we have not made any assumptions about the stepsizes in Theorem 3.3. It is, however, clear from the statement that in the case where $y = x^{*}$ for an optimal solution $x^{*}$ and the stepsizes $(t_{k})_{k \geq 0}$ fulfil the classical conditions $\sum_{k = 0}^{\infty} t_{k} = + \infty$ and $\sum_{k = 0}^{\infty} t_{k}^{2} < + \infty$, it follows that $\lim_{N \to \infty} E (\min_{0 \leq k \leq N - 1} \sum_{i = 1}^{m} f_{i} (x_{k}) - \sum_{i = 1}^{m} f_{i} (x^{*})) = 0$.

The optimal stepsize choice, which we provide in the following corollary, is a consequence of [Citation8, Proposition 4.1], which states that the function $z \mapsto \frac{c + (2 σ)^{- 1} z^{T} D z}{b^{T} z},$ where $c > 0, b \in R_{+ +}^{d} := \{(z_{1}, \dots, z_{d})^{T} \in R^{d} : z_{i} > 0, i = 1, \dots, d\}$ and $D \in R^{d \times d}$ is a symmetric positive definite matrix, attains its minimum on $R_{+ +}^{d}$ at $z^{*} = \sqrt{2 c σ / (b^{T} D^{- 1} b)} \, D^{- 1} b$ and this provides $\sqrt{2 c / (σ b^{T} D^{- 1} b)}$ as optimal objective value.
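The closed-form minimizer quoted above can be sanity-checked numerically. The sketch below is restricted, as a simplifying assumption, to a diagonal matrix D (stored as the list `d_diag`), and it reads the two formulas as $z^{*} = \sqrt{2 c σ / (b^{T} D^{- 1} b)} \, D^{- 1} b$ with optimal value $\sqrt{2 c / (σ \, b^{T} D^{- 1} b)}$; the function names are hypothetical.

```python
import math


def objective(z, c, sigma, b, d_diag):
    """Evaluate (c + z^T D z / (2 sigma)) / (b^T z) for D = diag(d_diag)."""
    quad = sum(d * zi * zi for d, zi in zip(d_diag, z))
    lin = sum(bi * zi for bi, zi in zip(b, z))
    return (c + quad / (2.0 * sigma)) / lin


def closed_form_minimizer(c, sigma, b, d_diag):
    """Return (z_star, optimal_value) according to the closed-form expressions."""
    s = sum(bi * bi / d for bi, d in zip(b, d_diag))   # b^T D^{-1} b
    scale = math.sqrt(2.0 * c * sigma / s)
    z_star = [scale * bi / d for bi, d in zip(b, d_diag)]
    return z_star, math.sqrt(2.0 * c / (sigma * s))


if __name__ == "__main__":
    z_star, value = closed_form_minimizer(1.0, 2.0, [1.0, 2.0], [1.0, 4.0])
    # The objective evaluated at z_star should agree with the closed-form value.
    print(objective(z_star, 1.0, 2.0, [1.0, 2.0], [1.0, 4.0]), value)
```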

Corollary 3.6

In the setting of Problem 3.1, assume that the functions $f_{i}$ are $L_{f_{i}}$-Lipschitz continuous on $im (\nabla H^{*})$ for $i = 1, \dots, m$. Let $x^{*} \in dom H$ be an optimal solution of (3) and $(x_{k})_{k \geq 0}$ be a sequence generated by Algorithm 3.2 with optimal stepsize $t_{k} := \frac{1}{\sum_{i = 1}^{m} L_{f_{i}}} \sqrt{\frac{d_{H} (x^{*}, x_{0}, y_{0, 0})}{{(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1}} \frac{1}{\sqrt{k}} \ \forall k \geq 0.$ Then for every $N \geq 1$ it holds $E (\min_{0 \leq k \leq N - 1} \sum_{i = 1}^{m} f_{i} (x_{k}) - \sum_{i = 1}^{m} f_{i} (x^{*})) \leq 2 (\sum_{i = 1}^{m} L_{f_{i}}) \sqrt{\frac{d_{H} (x^{*}, x_{0}, y_{0, 0}) ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1)}{σ}} \frac{1}{\sqrt{N}} .$

Remark 3.7

In the last step of the proof of Theorem 3.3, one could have chosen to use the inequality $(\sum_{k = 0}^{N - 1} t_{k}) E (\sum_{i = 1}^{m} f_{i} (\frac{\sum_{k = 0}^{N - 1} t_{k} x_{k}}{\sum_{k = 0}^{N - 1} t_{k}}) - \sum_{i = 1}^{m} f_{i} (y)) \leq \sum_{k = 0}^{N - 1} t_{k} E (\sum_{i = 1}^{m} f_{i} (x_{k}) - \sum_{i = 1}^{m} f_{i} (y)),$ given by the convexity of $\sum_{i = 1}^{m} f_{i} (\cdot)$, in order to prove convergence of the function values for the ergodic sequence ${\bar{x}}_{k} := (1 / \sum_{i = 0}^{k} t_{i}) \sum_{i = 0}^{k} t_{i} x_{i}$ for all $k \geq 0$. This would lead for every $N \geq 1$ and every $y \in R^{n}$ to $E (\sum_{i = 1}^{m} f_{i} ({\bar{x}}_{N - 1}) - \sum_{i = 1}^{m} f_{i} (y)) \leq \frac{d_{H} (y, x_{0}, y_{0, 0}) + \frac{1}{σ} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1) \sum_{k = 0}^{N - 1} t_{k}^{2}}{\sum_{k = 0}^{N - 1} t_{k}}$ and, for the optimal stepsize choice from Corollary 3.6, to $E (\sum_{i = 1}^{m} f_{i} ({\bar{x}}_{N - 1}) - \sum_{i = 1}^{m} f_{i} (x^{*})) \leq 2 (\sum_{i = 1}^{m} L_{f_{i}}) \sqrt{\frac{d_{H} (x^{*}, x_{0}, y_{0, 0}) ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1)}{σ}} \frac{1}{\sqrt{N}} .$ This variant might be beneficial, as it does not require the computation of objective function values, which are expensive to compute under our implicit assumption that m is large.
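The ergodic sequence from Remark 3.7 can be maintained with a running update, so that neither past iterates nor objective values have to be stored. The following sketch assumes the iterates and stepsizes are available as plain Python lists; the function name `ergodic_averages` is hypothetical.

```python
def ergodic_averages(iterates, stepsizes):
    """Return the stepsize-weighted averages x_bar_0, x_bar_1, ... of the iterates.

    Uses x_bar_k = x_bar_{k-1} + (t_k / T_k) (x_k - x_bar_{k-1}) with T_k = t_0 + ... + t_k,
    which is algebraically the same as (sum_i t_i x_i) / (sum_i t_i).
    """
    averages = []
    avg = None
    total = 0.0
    for x, t in zip(iterates, stepsizes):
        total += t
        if avg is None:
            avg = list(x)                  # x_bar_0 = x_0
        else:
            avg = [a + (t / total) * (xi - a) for a, xi in zip(avg, x)]
        averages.append(list(avg))
    return averages


if __name__ == "__main__":
    print(ergodic_averages([[0.0], [4.0], [8.0]], [1.0, 1.0, 2.0]))
```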

4. A stochastic incremental mirror descent algorithm with Bregman proximal step

In this section, we add another nonsmooth convex function to the objective function of the optimization problem (3) and provide an extension of Algorithm 3.2, which evaluates the new summand by a proximal type step. However, this asks for a supplementary differentiability assumption on the function inducing the mirror map.

Problem 4.1

Consider the optimization problem $\min_{x \in C} \sum_{i = 1}^{m} f_{i} (x) + g (x),$ (9) where $C \subseteq R^{n}$ is a nonempty, convex and closed set, for every $i = 1, \dots, m$, the functions $f_{i} : R^{n} \to \bar{R}$ are proper and convex, $g : R^{n} \to \bar{R}$ is a proper, convex and lower semicontinuous function, and $H : R^{n} \to \bar{R}$ is a proper, σ-strongly convex and lower semicontinuous function such that $C = \overline{dom H}$, H is continuously differentiable on $int (dom H)$, $im (\nabla H^{*}) \subseteq int (⋂_{i = 1}^{m} dom f_{i}) \cap int (dom H)$ and $int (dom H) \cap dom g \neq \emptyset$.

For a proper, convex, lower semicontinuous function $h : R^{n} \to \bar{R}$ we define its Bregman-proximal operator with respect to the proper, σ-strongly convex and lower semicontinuous function $H : R^{n} \to \bar{R}$ as ${prox}_{h}^{H} : dom \nabla H \to R^{n}, {prox}_{h}^{H} (x) := \underset{u \in R^{n}}{\arg \min} \{h (u) + D_{H} (u, x)\} .$ Due to the strong convexity of H, the Bregman-proximal operator is well defined. For $H = \frac{1}{2} ∥ \cdot ∥^{2}$ it coincides with the classical proximal operator.

We are now in a position to formulate the iterative scheme we would like to propose for solving (9). In case $g = 0$, this algorithm gives exactly the incremental version of the iterative method in [Citation8], which was in fact suggested by the two authors of that paper.

Algorithm 4.2

Consider for some initial value $x_{0} \in im (\nabla H^{*})$ and sequence of positive stepsizes $(t_{k})_{k \geq 0}$ the following iterative scheme: $(\forall k \geq 0) \left[ \begin{array}{l} ψ_{0, k} = x_{k} \\ \text{for } i = 1, \dots, m \\ \quad ψ_{i, k} = \nabla H^{*} (\nabla H (ψ_{i - 1, k}) - ϵ_{i, k} \frac{t_{k}}{p_{i}} f_{i}^{'} (ψ_{i - 1, k})) \\ \text{end} \\ x_{k + 1} = {prox}_{t_{k} g}^{H} (ψ_{m, k}), \end{array} \right.$ where $ϵ_{i, k}$ is a $\{0, 1\}$-valued random variable for every $i = 1, \dots, m$ and $k \geq 0$, such that $ϵ_{i, k}$ is independent of $ψ_{i - 1, k}$ and $P (ϵ_{i, k} = 1) = p_{i}$ for every $i = 1, \dots, m$ and $k \geq 0$.

Lemma 4.3

In the setting of Problem 4.1, Algorithm 4.2 is well defined.

Proof.

As $im (\nabla H^{*}) \subseteq int (⋂_{i = 1}^{m} dom f_{i})$ , it follows for every $i = 2, \dots, m$ and every $k \geq 0$ immediately that $ψ_{i - 1, k} \in int dom f_{i}$ , thus a subgradient of $f_{i}$ at $ψ_{i - 1, k}$ exists.

In what follows we prove that this is the case also for $ψ_{0, k}$ , for every $k \geq 0$ . To this aim, it is enough to show that $x_{k} \in im (\nabla H^{*})$ for every $k \geq 0$ . For k=0 this statement is true by the choice of the initial value. For every $k \geq 0$ we have that $0 \in \partial (t_{k} g + H - ⟨ \nabla H (ψ_{m, k}), \cdot ⟩) (x_{k + 1}),$ which, according to $int (dom H) \cap dom g \neq \emptyset$ , is equivalent to $0 \in t_{k} \partial g (x_{k + 1}) + \partial H (x_{k + 1}) - \nabla H (ψ_{m, k}) .$ Thus, $x_{k + 1} \in dom \partial H = im (\nabla H^{*})$ for every $k \geq 0$ and this concludes the proof.

Example 4.4

Consider the case when $m = 1$, $ϵ_{1, k} = 1$ for every $k \geq 0$ and $H (x) = \frac{1}{2} ∥ x ∥^{2}$ for $x \in C$, while $H (x) = + \infty$ for $x \notin C$, where $C \subseteq R^{n}$ is a nonempty, convex and closed set. In this setting, $\nabla H^{*}$ is equal to the orthogonal projection $P_{C}$ onto the set C. Algorithm 4.2 yields the following iterative scheme, which basically minimizes the sum $f_{1} + g$ over the set C: $(\forall k \geq 0) \ x_{k + 1} = {prox}_{t_{k} g}^{H} (P_{C} (x_{k} - t_{k} f_{1}^{'} (x_{k}))) .$ (10) The difficulty in this example, assuming that it is reasonably possible to project onto the set C, lies in evaluating ${prox}_{t_{k} g}^{H}$, for every $k \geq 0$, as this is itself a constrained optimization problem: ${prox}_{t_{k} g}^{H} (x) = \underset{u \in C}{\arg \min} \{t_{k} g (u) + \frac{1}{2} ∥ x - u ∥^{2}\} .$ When $C = R^{n}$, the iterative scheme (10) becomes the proximal subgradient algorithm investigated in [Citation15].
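For the unconstrained case $C = R^{n}$ of Example 4.4, the scheme can be sketched in a few lines once the proximal operator of g is available in closed form. The sketch below assumes, purely for illustration, $g = λ ∥ \cdot ∥_{1}$, whose classical proximal operator is the componentwise soft-thresholding; the names `soft_threshold`, `proximal_subgradient` and `lam` are hypothetical.

```python
import math


def soft_threshold(v, tau):
    """Classical prox of tau * ||.||_1: shrink each component towards zero by tau."""
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]


def proximal_subgradient(subgrad_f, lam, x0, stepsizes):
    """Iterate x_{k+1} = prox_{t_k g}(x_k - t_k f'(x_k)) with g = lam * ||.||_1.

    subgrad_f -- callable returning one subgradient of f_1 at x (assumed given)
    """
    x = list(x0)
    for t in stepsizes:
        g = subgrad_f(x)
        forward = [xi - t * gi for xi, gi in zip(x, g)]   # subgradient step
        x = soft_threshold(forward, t * lam)              # proximal step
    return x


if __name__ == "__main__":
    # One step for the linear function f_1(x) = x_1 starting from (2, 0.25).
    print(proximal_subgradient(lambda x: [1.0, 0.0], 1.0, [2.0, 0.25], [0.5]))
```

For a general closed convex set C the proximal step is itself a constrained problem, as noted in the example, and no such closed form should be expected.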

Theorem 4.5

In the setting of Problem 4.1, assume that the functions $f_{i}$ are $L_{f_{i}}$ -Lipschitz continuous on $im (\nabla H^{*})$ for $i = 1, \dots, m$ . Let $(x_{k})_{k \geq 0}$ be a sequence generated by Algorithm 4.2. Then for every $N \geq 1$ and every $y \in R^{n}$ it holds $\begin{aligned} E (min_{0 \leq k \leq N - 1} (\sum_{i = 1}^{m} f_{i} + g) (x_{k + 1}) - (\sum_{i = 1}^{m} f_{i} + g) (y)) \\ \leq \frac{2 σ D_{H} (y, x_{0}) + (2 {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 3 + 2 m) {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} \sum_{k = 0}^{N - 1} t_{k}^{2}}{2 σ \sum_{k = 0}^{N - 1} t_{k}} . \end{aligned}$

Proof.

Let $y \in ⋂_{i = 1}^{m} dom f_{i} \cap dom g \cap dom H$ be fixed. For y outside this set the conclusion follows automatically.

As in the first part of the proof of Theorem 3.3, we obtain instead of (8) the following inequality which holds for every $i = 1, \dots, m$ and every $k \geq 0$ $\begin{aligned} E (D_{H} (y, ψ_{m, k})) & \leq E (D_{H} (y, x_{k})) + t_{k} E (\sum_{i = 1}^{m} f_{i} (y) - \sum_{i = 1}^{m} f_{i} (x_{k})) \\ + \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1) - E (\sum_{i = 1}^{m} \frac{1}{2} D_{H} (ψ_{i, k}, ψ_{i - 1, k})) . \end{aligned}$ (11)

As pointed out in the proof of Lemma 4.3, for every $k \geq 0$ we have $0 \in t_{k} \partial g (x_{k + 1}) + \nabla H (x_{k + 1}) - \nabla H (ψ_{m, k}),$ thus $t_{k} (g (y) - g (x_{k + 1})) \geq ⟨ \nabla H (ψ_{m, k}) - \nabla H (x_{k + 1}), y - x_{k + 1} ⟩ .$ The three point identity leads to $t_{k} (g (y) - g (x_{k + 1})) \geq - (D_{H} (y, ψ_{m, k}) - D_{H} (y, x_{k + 1}) - D_{H} (x_{k + 1}, ψ_{m, k}))$ or, equivalently, $t_{k} (g (x_{k + 1}) - g (y)) + D_{H} (y, x_{k + 1}) \leq D_{H} (y, ψ_{m, k}) - D_{H} (x_{k + 1}, ψ_{m, k})$ for every $k \geq 0$ .

Since the involved functions are measurable, we can take the expected value on both sides and obtain for every $k \geq 0$ $t_{k} E ((g (x_{k + 1}) - g (y))) + E (D_{H} (y, x_{k + 1})) \leq E (D_{H} (y, ψ_{m, k})) - E (D_{H} (x_{k + 1}, ψ_{m, k})) .$ (12)

Combining (11) and (12) gives for every $k \geq 0$ $\begin{aligned} t_{k} E ((g (x_{k + 1}) - g (y))) + t_{k} E (\sum_{i = 1}^{m} f_{i} (x_{k}) - \sum_{i = 1}^{m} f_{i} (y)) + E (D_{H} (y, x_{k + 1})) \\ \leq E (D_{H} (y, x_{k})) + \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1) \\ - E (D_{H} (x_{k + 1}, ψ_{m, k})) - \sum_{i = 1}^{m} \frac{1}{2} E (D_{H} (ψ_{i, k}, ψ_{i - 1, k})) . \end{aligned}$

By adding and subtracting $E (\sum_{i = 1}^{m} f_{i} (x_{k + 1}))$ and by using afterwards the Lipschitz continuity of $\sum_{i = 1}^{m} f_{i}$ , we get for every $k \geq 0$ $\begin{aligned} t_{k} E ((\sum_{i = 1}^{m} f_{i} + g) (x_{k + 1}) - (\sum_{i = 1}^{m} f_{i} + g) (y)) \\ - t_{k} (\sum_{i = 1}^{m} L_{f_{i}}) E (∥ x_{k} - x_{k + 1} ∥) + E (D_{H} (y, x_{k + 1})) \\ \leq E (D_{H} (y, x_{k})) + \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1) \\ - E (D_{H} (x_{k + 1}, ψ_{m, k})) - \sum_{i = 1}^{m} \frac{1}{2} E (D_{H} (ψ_{i, k}, ψ_{i - 1, k})) . \end{aligned}$

By the triangle inequality, we obtain for every $k \geq 0$ $\begin{aligned} t_{k} E ((\sum_{i = 1}^{m} f_{i} + g) (x_{k + 1}) - (\sum_{i = 1}^{m} f_{i} + g) (y)) + E (D_{H} (y, x_{k + 1})) \\ \leq E (D_{H} (y, x_{k})) + \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1) - E (D_{H} (x_{k + 1}, ψ_{m, k})) \\ + t_{k} (\sum_{i = 1}^{m} L_{f_{i}}) E (∥ x_{k} - ψ_{m, k} ∥) + t_{k} (\sum_{i = 1}^{m} L_{f_{i}}) E (∥ ψ_{m, k} - x_{k + 1} ∥) \\ - \sum_{i = 1}^{m} \frac{1}{2} E (D_{H} (ψ_{i, k}, ψ_{i - 1, k})), \end{aligned}$ which, due to Young's inequality and the strong convexity of H, leads to $\begin{aligned} t_{k} E ((\sum_{i = 1}^{m} f_{i} + g) (x_{k + 1}) - (\sum_{i = 1}^{m} f_{i} + g) (y)) + E (D_{H} (y, x_{k + 1})) \\ \leq E (D_{H} (y, x_{k})) + \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} ({(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 1) - E (D_{H} (x_{k + 1}, ψ_{m, k})) \\ + t_{k} (\sum_{i = 1}^{m} L_{f_{i}}) E (∥ x_{k} - ψ_{m, k} ∥) + \frac{1}{2 σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} \\ + E (D_{H} (x_{k + 1}, ψ_{m, k})) - \sum_{i = 1}^{m} \frac{1}{2} E (D_{H} (ψ_{i, k}, ψ_{i - 1, k})) . \end{aligned}$

Since $∥ x_{k} - ψ_{m, k} ∥ = ∥\sum_{i = 1}^{m} (ψ_{i - 1, k} - ψ_{i, k})∥ \leq \sum_{i = 1}^{m} ∥ ψ_{i - 1, k} - ψ_{i, k} ∥,$ we get for every $k \geq 0$ that $\begin{aligned} t_{k} E ((\sum_{i = 1}^{m} f_{i} + g) (x_{k + 1}) - (\sum_{i = 1}^{m} f_{i} + g) (y)) + E (D_{H} (y, x_{k + 1})) \\ \leq E (D_{H} (y, x_{k})) + \frac{1}{2 σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} (2 {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 3) \\ + t_{k} (\sum_{i = 1}^{m} L_{f_{i}}) E (\sum_{i = 1}^{m} ∥ ψ_{i - 1, k} - ψ_{i, k} ∥) - \sum_{i = 1}^{m} \frac{1}{2} E (D_{H} (ψ_{i, k}, ψ_{i - 1, k})) . \end{aligned}$

Young's inequality and the strong convexity of H imply that for every $i = 1, \dots, m$ and every $k \geq 0$ $\begin{aligned} t_{k} (\sum_{i = 1}^{m} L_{f_{i}}) ∥ ψ_{i - 1, k} - ψ_{i, k} ∥ & \leq \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} + \frac{σ}{4} ∥ ψ_{i - 1, k} - ψ_{i, k} ∥^{2} \\ \leq \frac{1}{σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} + \frac{1}{2} D_{H} (ψ_{i, k}, ψ_{i - 1, k}) \end{aligned}$ and thus $\begin{aligned} t_{k} E ((\sum_{i = 1}^{m} f_{i} + g) (x_{k + 1}) - (\sum_{i = 1}^{m} f_{i} + g) (y)) + E (D_{H} (y, x_{k + 1})) \\ \leq E (D_{H} (y, x_{k})) + \frac{1}{2 σ} t_{k}^{2} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} (2 {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 3 + 2 m) . \end{aligned}$

Summing up this inequality from k=0 to N−1, for $N \geq 1$ , we get $\begin{aligned} \sum_{k = 0}^{N - 1} t_{k} E ((\sum_{i = 1}^{m} f_{i} + g) (x_{k + 1}) - (\sum_{i = 1}^{m} f_{i} + g) (y)) + E (D_{H} (y, x_{N})) \\ \leq E (D_{H} (y, x_{0})) + \frac{1}{2 σ} {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} (2 {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 3 + 2 m) \sum_{k = 0}^{N - 1} t_{k}^{2} . \end{aligned}$ This shows that $\begin{aligned} E (min_{0 \leq k \leq N - 1} (\sum_{i = 1}^{m} f_{i} + g) (x_{k + 1}) - (\sum_{i = 1}^{m} f_{i} + g) (y)) \\ \leq \frac{2 σ D_{H} (y, x_{0}) + (2 {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 3 + 2 m) {(\sum_{i = 1}^{m} L_{f_{i}})}^{2} \sum_{k = 0}^{N - 1} t_{k}^{2}}{2 σ \sum_{k = 0}^{N - 1} t_{k}} \end{aligned}$ and therefore finishes the proof.
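To get a feeling for the estimate in Theorem 4.5, its right-hand side can be evaluated numerically. The sketch below is only an illustration; the function name `theorem_4_5_bound`, its argument names and the sample values are assumptions made here, not part of the paper.

```python
import numpy as np

def theorem_4_5_bound(D0, sigma, L, p, steps):
    """Right-hand side of the estimate in Theorem 4.5.

    D0    : Bregman distance D_H(y, x_0)
    sigma : strong convexity modulus of H
    L     : Lipschitz constants L_{f_i}, i = 1, ..., m
    p     : selection probabilities p_i, i = 1, ..., m
    steps : stepsizes t_0, ..., t_{N-1}
    """
    L, p, t = (np.asarray(v, dtype=float) for v in (L, p, steps))
    m = len(L)
    const = 2.0 * np.sqrt(np.sum(1.0 / p**2)) + 3.0 + 2.0 * m
    numerator = 2.0 * sigma * D0 + const * L.sum() ** 2 * np.sum(t**2)
    return numerator / (2.0 * sigma * t.sum())

# Sample values (assumptions): one component, stepsizes t_k = 1 / sqrt(k + 1).
def diminishing(N):
    return 1.0 / np.sqrt(np.arange(1, N + 1))

bound_small = theorem_4_5_bound(1.0, 1.0, [1.0], [1.0], diminishing(100))
bound_large = theorem_4_5_bound(1.0, 1.0, [1.0], [1.0], diminishing(10000))
```

With these diminishing stepsizes the sum of the $t_{k}$ grows like $\sqrt{N}$ while the sum of the $t_{k}^{2}$ grows only logarithmically, so the bound decreases as N increases.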

The following result is again a consequence of [Citation8, Proposition 4.1].

Corollary 4.6

In the setting of Problem 4.1, assume that the functions $f_{i}$ are $L_{f_{i}}$ -Lipschitz continuous on $im (\nabla H^{*})$ for $i = 1, \dots, m$ . Let $x^{*} \in dom H$ be an optimal solution of (9) and $(x_{k})_{k \geq 0}$ be a sequence generated by Algorithm 4.2 with optimal stepsize $t_{k} := \frac{1}{\sum_{i = 1}^{m} L_{f_{i}}} \sqrt{\frac{2 D_{H} (x^{*}, x_{0})}{2 {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 3 + 2 m}} \frac{1}{\sqrt{k}} \forall k \geq 0.$ Then for every $N \geq 1$ it holds $\begin{aligned} E (min_{0 \leq k \leq N - 1} (\sum_{i = 1}^{m} f_{i} + g) (x_{k}) - (\sum_{i = 1}^{m} f_{i} + g) (x^{*})) \\ \leq (\sum_{i = 1}^{m} L_{f_{i}}) \sqrt{\frac{2 D_{H} (x^{*}, x_{0}) (2 {(\sum_{i = 1}^{m} \frac{1}{p_{i}^{2}})}^{1 / 2} + 3 + 2 m)}{σ}} \frac{1}{\sqrt{N}} . \end{aligned}$

Remark 4.7

The same considerations as in Remark 3.7 about ergodic convergence are applicable also for the rates provided in Theorem 4.5 and Corollary 4.6.

Remark 4.8

Note the straightforward dependence of the optimal stepsizes as well as the right-hand side of the convergence statement on the data, i.e. the distance of the initial point to optimality, the Lipschitz constants $L_{f_{i}}$ and the probabilities $p_{i}$ . This backs up the intuition that evaluating subgradients less often, i.e. choosing smaller $p_{i}$ , does not come for free but at the cost of a worse constant in the convergence rate.
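To make this dependence concrete, consider as an illustration the uniform choice $p_{i} = p$ for all $i = 1, \dots, m$ (an assumption made here only for this worked example). The constant appearing in Theorem 4.5 then simplifies as follows:

```latex
\Bigl(\sum_{i=1}^{m} \frac{1}{p_i^{2}}\Bigr)^{1/2} = \frac{\sqrt{m}}{p},
\qquad
2 \Bigl(\sum_{i=1}^{m} \frac{1}{p_i^{2}}\Bigr)^{1/2} + 3 + 2m
  = \frac{2\sqrt{m}}{p} + 3 + 2m .
```

For $p = 1$ the constant equals $2 \sqrt{m} + 3 + 2 m$, while for $p = 1 / 5$ its first term grows to $10 \sqrt{m}$; the remaining terms do not depend on p.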

5. Applications

In the numerical experiments carried out in this section, we will compare three versions of the provided algorithms. The first is the non-incremental version, which takes full subgradient steps with respect to the sum of all component functions instead of every single one individually. This can be viewed as a special case of the algorithms given, when m=1 and $ϵ_{1, k} = 1$ for all $k \geq 0$ . Secondly, we discuss the non-stochastic incremental version, which uses the subgradient of every single component function in every iteration and thus corresponds to the case when $ϵ_{i, k} = 1$ for every i=1,...,m and every $k \geq 0$ . Lastly, we apply the algorithms as intended by evaluating the subgradients of the respective component functions incrementally, each with a probability different from 1.

5.1. Tomography

This application can be found in [Citation3] and arises in the reconstruction of images in PET. We consider the following problem (13) $min_{x \in Δ} - \sum_{i = 1}^{m} y_{i} \log (\sum_{j = 1}^{n} r_{i j} x_{j}),$ (13) where $Δ := {x \in R^{n} : \sum_{j = 1}^{n} x_{j} = 1, x \geq 0}$ and $r_{i j}$ denotes for $i = 1, \dots, m$ and $j = 1, \dots, n$ the entry of the matrix $R \in R^{m \times n}$ in the i-th row and j-th column and all of these are assumed to be strictly positive. Furthermore, $y_{i}$ denotes for $i = 1, \dots, m$ the positive number of photons measured in the i-th bin. As discussed in Example 2.5 this can be incorporated into our framework with the mirror map $H (x) = \sum_{i = 1}^{n} x_{i} \log (x_{i})$ for $x \in Δ$ and $H (x) = + \infty$ , otherwise. As initial value, we use the all ones vector divided by the dimension n.
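The following sketch illustrates how a stochastic incremental sweep could look for problem (13) with the entropy mirror map, for which $\nabla H^{*}$ is the normalized exponential (softmax). It is a minimal sketch under stated assumptions and not the authors' implementation: the dual-variable update with the scaling $t_{k} / p_{i}$ is our reading of Algorithm 3.2, which is not restated in this section, and the synthetic data `R`, `y_obs`, the probabilities `p` and the stepsizes are hypothetical.

```python
import numpy as np

def softmax(y):
    """Mirror map for the negative entropy on the unit simplex."""
    z = np.exp(y - y.max())  # shift for numerical stability
    return z / z.sum()

def pet_objective(x, R, y_obs):
    """Objective of problem (13): -sum_i y_i log(<r_i, x>)."""
    return -np.sum(y_obs * np.log(R @ x))

def sweep(y_dual, R, y_obs, t, p, rng):
    """One outer iteration: visit the components in order, each with probability p[i]."""
    for i in range(R.shape[0]):
        if rng.random() < p[i]:
            x = softmax(y_dual)
            grad_i = -y_obs[i] * R[i] / (R[i] @ x)  # gradient of -y_i log(<r_i, x>)
            y_dual = y_dual - (t / p[i]) * grad_i   # assumed 1/p_i scaling
    return y_dual

rng = np.random.default_rng(0)
m, n = 40, 10
R = rng.uniform(0.1, 1.0, size=(m, n))  # strictly positive entries (synthetic)
y_obs = R @ rng.dirichlet(np.ones(n))   # synthetic positive measurements
p = np.full(m, 0.2)                     # each component used with probability 1/5

y_dual = np.zeros(n)                    # softmax(0) is the all ones vector divided by n
x0 = softmax(y_dual)
best = pet_objective(x0, R, y_obs)
for k in range(300):
    y_dual = sweep(y_dual, R, y_obs, 0.05 / np.sqrt(k + 1), p, rng)
    best = min(best, pet_objective(softmax(y_dual), R, y_obs))
```

Because the iterates are obtained through the softmax, they remain in the relative interior of Δ, so the logarithms in (13) stay finite without any explicit projection.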

We also want to point out that a similar example given in [Citation8], namely the minimization of a convex function over the unit simplex Δ, does not quite match the assumptions made throughout that paper, as the interior of Δ is empty and the function H can therefore not be continuously differentiable in the classical sense. However, with the setting of Section 3 we are able to tackle this problem.

The poor performance of the deterministic incremental version of Algorithm 3.2, see Figure 1, can be explained by the fact that many more evaluations of the mirror map are needed, which increases the overall computation time dramatically. The stochastic version, however, performs rather well after evaluating only roughly a fifth of the total number of component functions, see Table 1.

Figure 1. Results for the optimization problem (13). A plot of $(f_{N} - f (x_{b e s t})) / (f (x_{0}) - f (x_{b e s t}))$ , where $f_{N} := min_{0 \leq k \leq N} f (x_{k})$ , as a function of time, i.e. $x_{N}$ is the last iterate computed before a given point in time.


Table 1. Results for the optimization problem (13), where NI denotes the non-incremental, DI the deterministic incremental and SI the stochastic incremental version of Algorithm 3.2.


5.2. Support vector machines

We deal with the classic machine learning problem of binary classification based on the well-known MNIST dataset, which contains 28 by 28 images of handwritten digits on a grey-scale pixel map. For each of the digits, the dataset comprises around 6000 training images and roughly 1000 test images. In line with [Citation4], we train a classifier to distinguish the digits 6 and 7 by solving the following optimization problem $min_{w \in R^{784}} \sum_{i = 1}^{m} max {0, 1 - y_{i} ⟨ w, x_{i} ⟩} + λ ∥ w ∥_{1},$ (14) where for $i = 1, \dots, m$ , $x_{i} \in {0, 1, \dots, 255}^{784}$ denotes the i-th training image and $y_{i} \in {- 1, 1}$ denotes the label of the i-th training image. The 1-norm serves as a regularization term and $λ > 0$ balances the two objectives of minimizing the classification error and reducing the 1-norm of the classifier w. To incorporate this problem into our framework, we set $H = \frac{1}{2} ∥ \cdot ∥^{2}$ , which leaves us with the identity as mirror map, as this problem is unconstrained. The results comparing the three versions of Algorithm 4.2 discussed at the beginning of this section are illustrated in Figure 2. As initial value we simply use the all ones vector. All three versions show classical first-order behaviour, giving a fast decrease in objective function value first but then slowing down dramatically. More information about the performance can be seen in Table 2. All three algorithms result in a significant decrease in objective function value after being run for only 4 s each. However, from a machine learning point of view, only the misclassification rate is of actual importance. In both regards, the stochastic incremental version clearly outperforms the other two implementations. It is also interesting to note that it needs only a small fraction of the number of subgradient evaluations in comparison to the full non-incremental algorithm.
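A stochastic incremental iteration for problem (14) can be sketched as follows: hinge-loss subgradient steps on randomly selected components, followed by the proximal step of $t_{k} λ ∥ \cdot ∥_{1}$, which is soft-thresholding since the mirror map is the identity. This is a minimal sketch under stated assumptions and not the authors' implementation: the scaling $t_{k} / p$ is our reading of Algorithm 4.2, and the synthetic data (used in place of MNIST), the common probability `p` and the stepsizes are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(w, X, y, lam):
    """Objective of problem (14): summed hinge losses plus lam * ||w||_1."""
    return np.sum(np.maximum(0.0, 1.0 - y * (X @ w))) + lam * np.sum(np.abs(w))

def outer_iteration(w, X, y, lam, t, p, rng):
    """Incremental hinge-loss subgradient steps, then the prox of t * lam * ||.||_1."""
    psi = w.copy()
    for i in range(X.shape[0]):
        if rng.random() < p:                 # component i selected with probability p
            if y[i] * (X[i] @ psi) < 1.0:    # hinge active: a subgradient is -y_i x_i
                psi += (t / p) * y[i] * X[i]
    return soft_threshold(psi, t * lam)

rng = np.random.default_rng(0)
m, d = 200, 20
X = rng.normal(size=(m, d))                  # synthetic features, not MNIST
y = np.sign(X @ rng.normal(size=d))          # synthetic, linearly separable labels
lam, p = 0.01, 0.1

w = np.ones(d)                               # all ones initial value, as in the text
f0 = objective(w, X, y, lam)
best = f0
for k in range(200):
    w = outer_iteration(w, X, y, lam, 0.01 / np.sqrt(k + 1), p, rng)
    best = min(best, objective(w, X, y, lam))
```

With `p = 0.1`, each outer iteration evaluates on average a tenth of the component subgradients, which is the source of the savings discussed above.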

Figure 2. Numerical results for the optimization problem (14) with $λ = 0.01$ . The plot shows $min_{0 \leq k \leq N} f (x_{k})$ as a function of time, i.e. $x_{N}$ is the last iterate computed before a given point in time.


Table 2. Numerical results for the optimization problem (14), where NI denotes the non-incremental, DI the deterministic incremental and SI the stochastic incremental version of Algorithm 4.2.


6. Conclusion

In this paper, we present two algorithms to solve nonsmooth convex optimization problems where the objective function is a sum of many functions which are evaluated by their respective subgradients under the implicit presence of a constraint set which is dealt with by a so-called mirror map. By allowing for a random selection of each component function to evaluate in each iteration, the proposed methods become suitable even for very large-scale problems. We prove a convergence order of $O (1 / \sqrt{k})$ in expectation for the kth best objective function value, which is standard for subgradient methods. However, even for the case where all the objective functions are differentiable, it is not clear if better theoretical estimates can be achieved, due to the need of using diminishing stepsizes in order to obtain convergence in incremental algorithms. Future work could comprise the investigation of different stepsizes, such as constant or dynamic stepsizes as in [Citation2]. Another possible extension of this would be to use different selection procedures such as random subsets of fixed size. Our framework, however, does not provide the right setting for such a batch approach as it would leave $ϵ_{i, k}$ and $ϵ_{j, k}$ dependent.

Disclosure statement

No potential conflict of interest was reported by the authors.

ORCID

Radu Ioan Boţ http://orcid.org/0000-0002-4469-314X

Additional information

Funding

The research of Radu Ioan Boţ has been partially supported by FWF (Austrian Science Fund) [project I 2419-N32]. The research of Axel Böhm has been supported by the doctoral programme Vienna Graduate School on Computational Optimization (VGSCO), FWF (Austrian Science Fund) [project W 1260].

References

[1] Bertsekas DP. Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In: Sra S, Nowozin S, Wright SJ, editors. Optimization for machine learning, Neural Information Processing Series. Cambridge (MA): MIT Press; 2012. p. 85–120.
[2] Nedic A, Bertsekas DP. Incremental subgradient methods for nondifferentiable optimization. SIAM J Optim. 2001;12(1):109–138. doi: 10.1137/S1052623499362111
[3] Ben-Tal A, Margalit T, Nemirovski A. The ordered subsets mirror descent optimization method with applications to tomography. SIAM J Optim. 2001;12(1):79–108. doi: 10.1137/S1052623499354564
[4] Xiao L. Dual averaging methods for regularized stochastic learning and online optimization. J Mach Learn Res. 2010;11:2543–2596.
[5] Gurbuzbalaban M, Ozdaglar A, Parrilo PA. On the convergence rate of incremental aggregated gradient algorithms. SIAM J Optim. 2017;27(2):1035–1048. doi: 10.1137/15M1049695
[6] Wei Z, He QH. Nonsmooth steepest descent method by proximal subdifferentials in Hilbert spaces. J Optim Theory Appl. 2014;161(2):465–477. doi: 10.1007/s10957-013-0444-z
[7] Bauschke HH, Bolte J, Teboulle M. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math Oper Res. 2017;42(2):330–348. doi: 10.1287/moor.2016.0817
[8] Beck A, Teboulle M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper Res Lett. 2003;31(3):167–175. doi: 10.1016/S0167-6377(02)00231-6
[9] Nemirovskii AS, Yudin DB. Problem complexity and method efficiency in optimization. Hoboken (NJ): Wiley; 1983.
[10] Nesterov Y. Primal-dual subgradient methods for convex problems. Math Program. 2009;120(1):221–259. doi: 10.1007/s10107-007-0149-x
[11] Zhou Z, Mertikopoulos P, Bambos N, et al. Mirror descent in non-convex stochastic programming. arXiv:1706.05681, 2017.
[12] Shalev-Shwartz S. Online learning and online convex optimization. Found Trends Mach Learn. 2012;4(2):107–194. doi: 10.1561/2200000018
[13] Bolte J, Teboulle M. Barrier operators and associated gradient-like dynamical systems for constrained minimization problems. SIAM J Control Optim. 2003;42(4):1266–1292. doi: 10.1137/S0363012902410861
[14] Combettes PL, Pesquet J-C. Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J Optim. 2015;25(2):1221–1248. doi: 10.1137/140971233
[15] Cruz JYB. On proximal subgradient splitting method for minimizing the sum of two nonsmooth convex functions. Set-Valued Var Anal. 2017;25(2):245–263. doi: 10.1007/s11228-016-0376-5
