Coordinate optimization for generalized fused Lasso


Abstract

Fused Lasso is an extension of Lasso that shrinks differences between parameters. We focus on a general form of it called generalized fused Lasso (GFL). The optimization problem for GFL can be reduced to that for generalized Lasso and can be solved via a path algorithm for generalized Lasso, which is implemented in the genlasso package in R. However, the genlasso package has some computational problems. We therefore apply a coordinate descent algorithm (CDA) to solve the optimization problem for GFL. We give the update equations of the CDA in closed form, without considering the Karush-Kuhn-Tucker conditions. Furthermore, we show an application of the CDA to a real data analysis.


1. Introduction

Suppose that there are $m$ groups and that $n_{j}$ pairs of data $\{y_{j, i}, x_{j, i}\}$ $(i = 1, \dots, n_{j})$ are observed for group $j \in \{1, \dots, m\}$, where $y_{j, i}$ is a response variable and $x_{j, i}$ is a $p$-dimensional vector of nonstochastic explanatory variables. Let $y_{j}$ be an $n_{j}$-dimensional vector defined by $y_{j} = (y_{j, 1}, \dots, y_{j, n_{j}})'$ and $X_{j}$ be an $n_{j} \times p$ matrix defined by $X_{j} = (x_{j, 1}, \dots, x_{j, n_{j}})'$. For group $j$, we consider the following linear regression model: $y_{j} = X_{j} β + μ_{j} 1_{n_{j}} + ε_{j}$ (1), where $β$ is a $p$-dimensional vector of regression coefficients, $μ_{j}$ is a location parameter for group $j$, $1_{n}$ is an $n$-dimensional vector of ones, and $ε_{j}$ is an $n_{j}$-dimensional vector of error variables from a distribution with mean 0 and variance $σ^{2}$. In addition, we assume that the vectors $ε_{1}, \dots, ε_{m}$ are independent. Let $y$ be an $n$-dimensional vector defined by $y = (y_{1}', \dots, y_{m}')'$, $X$ be an $n \times p$ matrix defined by $X = (X_{1}', \dots, X_{m}')'$, $μ$ be an $m$-dimensional vector defined by $μ = (μ_{1}, \dots, μ_{m})'$, $R$ be an $n \times m$ block diagonal matrix defined by $R = \operatorname{diag}(1_{n_{1}}, \dots, 1_{n_{m}})$, and $ε$ be an $n$-dimensional vector defined by $ε = (ε_{1}', \dots, ε_{m}')'$, where $n = \sum_{j = 1}^{m} n_{j}$. Then, the $m$ models in Equation (1) are expressed as $y = Xβ + Rμ + ε$.
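As a small illustration of the stacked model, the following Python sketch builds the block diagonal matrix $R$ and the product $Rμ$ for three hypothetical groups. The function names, group sizes, and values of $μ$ are assumptions made for this example and do not come from the paper.

```python
def build_group_indicator(group_sizes):
    """Return R as an n x m list of lists: the row of an observation
    in group j has a 1 in column j and 0 elsewhere."""
    m = len(group_sizes)
    rows = []
    for j, n_j in enumerate(group_sizes):
        for _ in range(n_j):
            row = [0] * m
            row[j] = 1
            rows.append(row)
    return rows


def matvec(A, v):
    """Plain matrix-vector product for lists of lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


# Hypothetical example: m = 3 groups with n_1 = 2, n_2 = 3, n_3 = 1.
group_sizes = [2, 3, 1]
mu = [0.5, -1.0, 2.0]

R = build_group_indicator(group_sizes)
R_mu = matvec(R, mu)  # every observation receives the effect of its own group
print(R_mu)
```

Each row of $R$ selects exactly one group effect, which is why $Rμ$ simply repeats $μ_{j}$ over the $n_{j}$ observations of group $j$.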

In this paper, although we deal with a linear regression model, we consider the case in which the location parameters differ across groups; that is, the location parameters can be regarded as expressing the individual effects of the groups. We call $μ_{j}$ the group effect for group $j$. In such a case, one point of interest is which group effects are equal. We approach this question by using a penalized least-squares method. That is, $μ$ is estimated by minimizing the following penalized residual sum of squares (PRSS): $‖ y - Xβ - Rμ ‖_{2}^{2} + λ \sum_{j = 1}^{m} \sum_{ℓ \in D_{j}} w_{j ℓ} | μ_{j} - μ_{ℓ} |$ (2), where $λ$ is a positive tuning parameter, $D_{j} \subset \{1, \dots, m\} \setminus \{j\}$ is an index set, and $w_{j ℓ}$ is a positive weight satisfying $w_{j ℓ} = w_{ℓ j}$. This is called generalized fused Lasso (GFL). The original form of GFL was proposed by Tibshirani et al. (2005) and is called fused Lasso (FL). When $D_{1} = \emptyset$, $D_{j} = \{j - 1\}$ $(j = 2, \dots, m)$, and $w_{j ℓ} = 1$, GFL coincides with FL. GFL shrinks $| μ_{j} - μ_{ℓ} |$ toward 0, and the estimates of $μ_{j}$ and $μ_{ℓ}$ are often exactly equal. Hence, using GFL, we can identify groups with equal group effects.

To obtain the GFL estimator of $μ$, we need an optimization method to minimize Equation (2). For FL, Friedman et al. (2007) proposed a coordinate descent algorithm (CDA). In other cases, the optimization problem can be reduced to that of the ordinary Lasso (Tibshirani 1996) under some conditions (e.g., Sun, Wang, and Fuentes 2016; Li and Sang 2019). In this paper, we deal with the case in which the above methods cannot be applied. That is, we assume that $\sum_{j = 1}^{m} \#(D_{j}) > m$. Fortunately, GFL can be expressed as generalized Lasso (Tibshirani and Taylor 2011). The generalized Lasso penalty is given by $‖ Dμ ‖_{1}$, where $D$ is a penalty matrix. For FL, $D$ is an $(m - 1) \times m$ matrix given by $D = \begin{pmatrix} -1 & 1 & 0 & \dots & 0 \\ 0 & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \dots & 0 & -1 & 1 \end{pmatrix}$.
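To make the generalized Lasso form concrete, the sketch below builds the FL penalty matrix $D$ and checks that $‖ Dμ ‖_{1}$ equals the sum of absolute successive differences. The function names and the example vector are illustrative assumptions.

```python
def fused_lasso_matrix(m):
    """(m - 1) x m first-difference matrix with rows (..., -1, 1, ...)."""
    D = [[0] * m for _ in range(m - 1)]
    for r in range(m - 1):
        D[r][r] = -1
        D[r][r + 1] = 1
    return D


def l1_norm_of_product(D, mu):
    """Compute ||D mu||_1 for a matrix stored as a list of rows."""
    return sum(abs(sum(d * x for d, x in zip(row, mu))) for row in D)


# Hypothetical example with m = 4.
mu = [1.0, 1.0, 3.0, 2.0]
D = fused_lasso_matrix(len(mu))

penalty = l1_norm_of_product(D, mu)
direct = sum(abs(mu[j] - mu[j - 1]) for j in range(1, len(mu)))
print(penalty, direct)
```

For a general GFL penalty, $D$ would contain one such difference row per penalized pair, which is why its size grows with the number of adjacencies.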

In addition, Tibshirani and Taylor (2011) proposed an optimization algorithm for generalized Lasso, and the algorithm is implemented in the genlasso package (e.g., Arnold and Tibshirani 2019) in R (e.g., R Core Team 2019). However, optimization with the genlasso package has a high computational cost and cannot practically be executed for large-sample data. Moreover, there are issues of numerical error, and estimates cannot be exactly equal. Accordingly, we focus on a CDA.

To accurately calculate the GFL estimator of $μ$, even for large-sample data, we give the update equations of the CDA in closed form. Our algorithm has advantages over several existing algorithms, for example, those of Friedman et al. (2007) and Tibshirani and Taylor (2011). Their algorithms work via the Lagrange dual problem and must consider several conditions, for example, the Karush-Kuhn-Tucker conditions and optimality. In contrast, with our algorithm, the optimization problem for GFL can be solved by elementary mathematics, without considering these conditions. Moreover, although our algorithm reduces the search range of the minimizer along the $j$th coordinate direction to $2 \#(D_{j}) + 1$ discrete points that are given in closed form, a numerical search, such as comparing the objective values at every point in the reduced search range, is not required. By checking simple conditions, the unique minimizer along a coordinate direction can be determined. This unique minimizer is given in closed form and guarantees minimization along the coordinate direction.

The remainder of the paper is organized as follows: In Section 2, we give closed-form update equations of the CDA for GFL. Numerical examples are discussed in Section 3. Section 4 concludes the paper. Technical details are provided in the Appendix.

2. Main results

We derive update equations of the CDA for obtaining the GFL estimator of $μ$. A CDA updates a solution along coordinate directions. The CDA for FL consists of a descent cycle and a fusion cycle. The descent cycle successively minimizes along coordinate directions. The ordinary CDA consists only of the descent cycle. However, when the ordinary CDA is applied to FL, some estimates become equal and the solution then gets stuck, failing to reach the minimum. To avoid this problem, Friedman et al. (2007) introduced the fusion cycle. Since this problem can also occur for GFL, we give update equations for both the descent cycle and the fusion cycle. In this section, we regard $β$ as fixed at $β = \hat{β}$ and minimize the following PRSS: $‖ \tilde{y} - Rμ ‖_{2}^{2} + λ \sum_{j = 1}^{m} \sum_{ℓ \in D_{j}} w_{j ℓ} | μ_{j} - μ_{ℓ} |$ (3), where $\tilde{y} = y - X \hat{β}$.

2.1. Descent cycle

The descent cycle minimizes along coordinate directions. That is, Equation (3) is minimized with respect to $μ_{i}$, and this is repeated for $i = 1, \dots, m$. We fix $μ_{j}$ $(j \in \{1, \dots, m\} \setminus \{i\})$ at $μ_{j} = \hat{μ}_{j}$. Then, Equation (3) is rewritten as a function of $μ_{i}$ by the following lemma (the proof is given in Appendix A):

Lemma 2.1.

Equation (3) can be expressed as the following function of $μ_{i}$ $(i \in \{1, \dots, m\})$: $n_{i} μ_{i}^{2} - 2 \tilde{y}_{i}' 1_{n_{i}} μ_{i} + 2 λ \sum_{j \in D_{i}} w_{i j} | μ_{i} - \hat{μ}_{j} | + u_{i}$ (4), where $\tilde{y}_{i}$ is the $i$th block of $\tilde{y}$, that is, $\tilde{y}_{i} = y_{i} - X_{i} \hat{β}$, and $u_{i}$ is the term that does not depend on $μ_{i}$.
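Lemma 2.1 can be checked numerically: the difference between the PRSS in Equation (3) and the $μ_{i}$-dependent part of Equation (4) should be the same constant $u_{i}$ for every value of $μ_{i}$. The sketch below does this for a small hypothetical example; the data, weights, and function names are assumptions, and the index sets are taken to be symmetric ($i \in D_{j}$ whenever $j \in D_{i}$), as the derivation in Appendix A requires.

```python
def prss(groups, mu, D, w, lam):
    """Equation (3): groups[j] holds the entries of y~_j,
    D[j] lists the indices in D_j, and w[(j, l)] is the weight w_{jl}."""
    rss = sum((y - mu[j]) ** 2 for j, ys in enumerate(groups) for y in ys)
    penalty = sum(w[(j, l)] * abs(mu[j] - mu[l]) for j in D for l in D[j])
    return rss + lam * penalty


def coordinate_part(groups, mu_hat, D, w, lam, i, x):
    """Equation (4) without u_i, evaluated at mu_i = x."""
    n_i = len(groups[i])
    s_i = sum(groups[i])  # y~_i' 1_{n_i}
    fused = sum(w[(i, j)] * abs(x - mu_hat[j]) for j in D[i])
    return n_i * x * x - 2.0 * s_i * x + 2.0 * lam * fused


# Hypothetical example with m = 3 groups and symmetric index sets.
groups = [[1.0, 3.0], [2.0], [0.0, 0.5]]
D = {0: [1, 2], 1: [0], 2: [0]}
w = {(0, 1): 1.0, (1, 0): 1.0, (0, 2): 2.0, (2, 0): 2.0}
mu_hat = [2.0, 1.5, 0.0]
lam = 0.7
i = 0

gaps = []
for x in [-1.0, 0.0, 0.75, 1.5, 4.0]:
    mu = list(mu_hat)
    mu[i] = x
    gaps.append(prss(groups, mu, D, w, lam)
                - coordinate_part(groups, mu_hat, D, w, lam, i, x))
print(gaps)  # all entries agree up to rounding; their common value is u_i
```

The factor 2 in front of $λ$ in Equation (4) appears because each pair $(i, j)$ is counted twice in the double sum of Equation (3).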

From Lemma 2.1, essentially, it is sufficient to minimize the following function: $ϕ (x) = c_{2} x^{2} - 2 c_{1} x + 2 λ \sum_{j = 1}^{r} w_{j} | x - b_{j} |$ $(c_{2}, w_{1}, \dots, w_{r} > 0)$ (5).

Let $t_{0} = - \infty$ and let $t_{1}, \dots, t_{r}$ be the order statistics of $b_{1}, \dots, b_{r}$, that is, $t_{j} = \begin{cases} \min \{b_{1}, \dots, b_{r}\} & (j = 1) \\ \min \{\{b_{1}, \dots, b_{r}\} \setminus \{t_{1}, \dots, t_{j - 1}\}\} & (j = 2, \dots, r) \end{cases}$. Let $J_{a}^{+}$ and $J_{a}^{-}$ be index sets for $a \in \{0, 1, \dots, r\}$ defined by $J_{a}^{+} = \{j \in \{1, \dots, r\} \mid b_{j} \leq t_{a}\}$ and $J_{a}^{-} = \{j \in \{1, \dots, r\} \mid b_{j} > t_{a}\}$, and let $R_{a}$ be a range defined by $R_{a} = \begin{cases} (t_{a}, t_{a + 1}] & (a = 0, 1, \dots, r - 1) \\ (t_{r}, \infty) & (a = r) \end{cases}$.

Moreover, we define $\tilde{w}_{a}$ and $v_{a}$ for $a \in \{0, 1, \dots, r\}$ as $\tilde{w}_{a} = \sum_{j \in J_{a}^{+}} w_{j} - \sum_{j \in J_{a}^{-}} w_{j}$ and $v_{a} = \frac{c_{1} - λ \tilde{w}_{a}}{c_{2}}$.

Then, the minimizer of Equation (5) is given by the following theorem (the proof is given in Appendix B):

Theorem 2.2.

Let $a^{*}$ and $a^{⋆}$ be nonnegative integers defined by $a^{*} \in \{0, 1, \dots, r\}$ s.t. $v_{a^{*}} \in R_{a^{*}}$ and $a^{⋆} \in \{1, \dots, r\}$ s.t. $t_{a^{⋆}} \in [v_{a^{⋆}}, v_{a^{⋆} - 1})$.

Then, either $a^{*}$ or $a^{⋆}$ exists uniquely, and the minimizer of $ϕ (x)$ in Equation (5) is given by $\hat{x} = \begin{cases} v_{a^{*}} & (a^{*} \text{ exists}) \\ t_{a^{⋆}} & (a^{⋆} \text{ exists}) \end{cases}$.

Theorem 2.2 gives the unique minimizer of $μ_{i}$ in Equation (4). Hence, by applying Theorem 2.2 to Equation (4) for $i = 1, \dots, m$, we can obtain the solution of $μ$ in the descent cycle. In Theorem 2.2, when $a^{⋆}$ exists, the estimate of $μ_{i}$ is equal to $\hat{μ}_{j}$ (satisfying $\hat{μ}_{j} = t_{a^{⋆}}$), which means that the group effects of groups $i$ and $j$ are equal.
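The closed-form rule of Theorem 2.2 can be sketched in a few lines of Python. This is an illustrative reading of the theorem, not the authors' code: the function name is hypothetical, and tied values among $b_{1}, \dots, b_{r}$ are merged by summing their weights, which is an implementation choice made here so that the knots are strictly increasing.

```python
import math


def minimize_phi(c2, c1, lam, b, w):
    """Minimizer of phi(x) = c2*x^2 - 2*c1*x + 2*lam*sum_j w[j]*|x - b[j]|,
    assuming c2 > 0 and all w[j] > 0."""
    # Merge tied knots (implementation choice): t_1 < ... < t_r.
    merged = {}
    for b_j, w_j in zip(b, w):
        merged[b_j] = merged.get(b_j, 0.0) + w_j
    t = sorted(merged)
    r = len(t)

    w_tilde = -sum(merged.values())  # w~_0: J_0^+ is empty since t_0 = -inf
    v_prev = None
    for a in range(r + 1):
        if a >= 1:
            w_tilde += 2.0 * merged[t[a - 1]]  # knot t_a moves from J^- to J^+
        v_a = (c1 - lam * w_tilde) / c2
        t_a = t[a - 1] if a >= 1 else -math.inf
        t_next = t[a] if a < r else math.inf
        if a >= 1 and v_a <= t_a < v_prev:  # t_a lies in [v_a, v_{a-1})
            return t_a
        if t_a < v_a <= t_next:             # v_a lies in R_a
            return v_a
        v_prev = v_a
    raise ArithmeticError("no minimizer found; check c2 > 0 and w > 0")


# Vertex case: the penalty is constant on (0, 3], so the minimizer is v_1 = 1.
x_vertex = minimize_phi(2.0, 2.0, 1.0, [0.0, 3.0], [1.0, 1.0])
# Knot case: the minimizer is pulled exactly onto b_1 = 0.5.
x_knot = minimize_phi(2.0, 2.0, 2.0, [0.5], [1.0])
print(x_vertex, x_knot)
```

In the knot case the returned value is the neighbouring estimate itself, so equality of two group effects holds exactly rather than up to numerical tolerance.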

2.2. Fusion cycle

The fusion cycle prevents a solution from getting stuck when some estimates are equal in the descent cycle. Suppose that we obtain $\hat{μ}_{j} = \hat{μ}_{ℓ}$ as estimates of $μ_{j}$ and $μ_{ℓ}$ $(j \neq ℓ)$ in the descent cycle. Then, to prevent the solutions of $μ_{j}$ and $μ_{ℓ}$ from getting stuck, we set $η_{i} = μ_{j} = μ_{ℓ}$ and minimize Equation (3) along the $η_{i}$-axis direction.

After the descent cycle, suppose that we obtain $\hat{μ}_{1}, \dots, \hat{μ}_{m}$ as estimates of $μ_{1}, \dots, μ_{m}$. Then, let $\hat{η}_{1}, \dots, \hat{η}_{b}$ $(b \leq m)$ be the distinct values of $\hat{μ}_{1}, \dots, \hat{μ}_{m}$, and define index sets $E_{1}, \dots, E_{b}$ as $E_{j} = \{ℓ \in \{1, \dots, m\} \mid \hat{μ}_{ℓ} = \hat{η}_{j}\}$, where $E_{j} \neq \emptyset$ and $E_{j} \cap E_{ℓ} = \emptyset$ $(j \neq ℓ)$. If $b < m$, the fusion cycle is executed. In the fusion cycle, Equation (3) is minimized with respect to $η_{i}$, and this is repeated for $i \in \{j \in \{1, \dots, b\} \mid \#(E_{j}) \geq 2\}$. We fix $η_{j}$ $(j \in \{1, \dots, b\} \setminus \{i\})$ at $η_{j} = \hat{η}_{j}$. Moreover, we define $D_{j}^{*} \subseteq \{1, \dots, b\} \setminus \{j\}$ $(j \in \{1, \dots, b\})$ and $w_{j ℓ}^{*}$ $(j \in \{1, \dots, b\}; ℓ \in D_{j}^{*})$ as $D_{j}^{*} = \{ℓ \in \{1, \dots, b\} \setminus \{j\} \mid E_{ℓ} \cap F_{j} \neq \emptyset\}$, $F_{j} = \bigcup_{ℓ \in E_{j}} D_{ℓ} \setminus E_{j}$, $w_{j ℓ}^{*} = \sum_{(i, s) \in J_{j ℓ}} w_{i s}$, and $J_{j ℓ} = \bigcup_{i \in E_{j}} \{i\} \times (E_{ℓ} \cap D_{i})$.
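The sets $E_{j}$, $D_{j}^{*}$ and the aggregated weights $w_{j ℓ}^{*}$ can be computed directly from the current estimates. The following sketch does so for a hypothetical graph of four groups; the function name, the 0-based indexing, and the example values are assumptions for illustration.

```python
def fusion_structure(mu_hat, D, w):
    """Return (eta_hat, E, D_star, w_star) as defined for the fusion cycle.
    D[i] lists the indices in D_i and w[(i, s)] is the weight w_{is}."""
    eta_hat = sorted(set(mu_hat))  # distinct values of the estimates
    E = [[l for l, v in enumerate(mu_hat) if v == eta] for eta in eta_hat]
    cluster_of = {l: j for j, members in enumerate(E) for l in members}

    # w_star[j][l] sums w_{is} over i in E_j and s in E_l that lies in D_i.
    w_star = [dict() for _ in E]
    for j, members in enumerate(E):
        for i in members:
            for s in D[i]:
                l = cluster_of[s]
                if l != j:  # links inside E_j do not enter the penalty
                    w_star[j][l] = w_star[j].get(l, 0.0) + w[(i, s)]
    D_star = [sorted(ws) for ws in w_star]
    return eta_hat, E, D_star, w_star


# Hypothetical example: groups 0 and 1 were fused by the descent cycle.
D = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
w = {(i, s): 1.0 for i in D for s in D[i]}
mu_hat = [1.0, 1.0, 2.0, 3.0]

eta_hat, E, D_star, w_star = fusion_structure(mu_hat, D, w)
print(E, D_star, w_star)
```

Here the fused cluster $\{0, 1\}$ is linked to group 2 through two original edges, so its aggregated weight toward that group is 2.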

Then, Equation (3) is rewritten as a function of $η_{i}$ by the following lemma (the proof is given in Appendix C):

Lemma 2.3.

Equation (3) can be expressed as the following function of $η_{i}$ $(i \in \{1, \dots, b\})$: $\sum_{j \in E_{i}} n_{j} η_{i}^{2} - 2 \sum_{j \in E_{i}} \tilde{y}_{j}' 1_{n_{j}} η_{i} + 2 λ \sum_{ℓ \in D_{i}^{*}} w_{i ℓ}^{*} | η_{i} - \hat{η}_{ℓ} | + u_{i}^{*}$ (6), where $u_{i}^{*}$ is the term that does not depend on $η_{i}$.

From Lemma 2.3, we find that the minimization of Equation (6) is essentially the same as that of Equation (5). Hence, by applying Theorem 2.2 to Equation (6) for $i \in \{j \in \{1, \dots, b\} \mid \#(E_{j}) \geq 2\}$, we can obtain the solution of $μ$ in the fusion cycle.

2.3. Coordinate descent algorithm for GFL

In the previous subsections, we described the descent and fusion cycles and showed that the solution for each step is given by Theorem 2.2. From the results, the CDA for GFL is summarized as follows:

Coordinate Descent Algorithm for GFL

Input: $λ$ and an initial vector of $μ$

Output: The optimal solution of $μ$

Step 1. Run the descent cycle:
Update $\hat{μ}_{i}$ by applying Theorem 2.2 to (4) for $i \in \{1, \dots, m\}$ and define $b$. If $b < m$, go to Step 2. If not, go to Step 3.
Step 2. Run the fusion cycle if $b < m$:
Update $\hat{η}_{i}$ by applying Theorem 2.2 to (6) for $i \in \{j \in \{1, \dots, b\} \mid \#(E_{j}) \geq 2\}$.
Step 3. Check convergence:
If $\hat{μ}$ converges, the algorithm completes. If not, return to Step 1.
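The three steps can be put together in a compact Python sketch. It is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the inputs are the group sums $\tilde{y}_{j}' 1_{n_{j}}$ and sizes $n_{j}$, tied knots are merged inside the one-dimensional solver, and the stopping rule is the relative-change criterion with tolerance $1 / 10{,}000$ used in Section 3.

```python
import math


def minimize_phi(c2, c1, lam, b, w):
    """Theorem 2.2 for phi(x) = c2*x^2 - 2*c1*x + 2*lam*sum_j w[j]*|x - b[j]|."""
    merged = {}
    for b_j, w_j in zip(b, w):  # merge tied knots (implementation choice)
        merged[b_j] = merged.get(b_j, 0.0) + w_j
    t = sorted(merged)
    w_tilde, v_prev = -sum(merged.values()), None
    for a in range(len(t) + 1):
        if a >= 1:
            w_tilde += 2.0 * merged[t[a - 1]]
        v_a = (c1 - lam * w_tilde) / c2
        t_a = t[a - 1] if a >= 1 else -math.inf
        t_next = t[a] if a < len(t) else math.inf
        if a >= 1 and v_a <= t_a < v_prev:
            return t_a
        if t_a < v_a <= t_next:
            return v_a
        v_prev = v_a
    raise ArithmeticError("no minimizer found")


def cda_gfl(sums, sizes, D, w, lam, mu_init, tol=1e-4, max_iter=1000):
    """sums[i] = y~_i' 1, sizes[i] = n_i, D[i] = indices in D_i, w[(i, j)] = w_ij."""
    mu = list(mu_init)
    m = len(mu)
    for _ in range(max_iter):
        old = list(mu)
        # Step 1: descent cycle, Theorem 2.2 applied to Equation (4).
        for i in range(m):
            mu[i] = minimize_phi(sizes[i], sums[i], lam,
                                 [mu[j] for j in D[i]],
                                 [w[(i, j)] for j in D[i]])
        # Step 2: fusion cycle, Theorem 2.2 applied to Equation (6).
        clusters = {}
        for i, value in enumerate(mu):
            clusters.setdefault(value, []).append(i)
        if len(clusters) < m:
            for members in list(clusters.values()):
                if len(members) < 2:
                    continue
                inside = set(members)
                knots, weights = [], []
                for i in members:
                    for j in D[i]:
                        if j not in inside:  # only links leaving the cluster
                            knots.append(mu[j])
                            weights.append(w[(i, j)])
                eta = minimize_phi(sum(sizes[i] for i in members),
                                   sum(sums[i] for i in members),
                                   lam, knots, weights)
                for i in members:
                    mu[i] = eta
        # Step 3: relative-change convergence check.
        change = max((a - b) ** 2 for a, b in zip(mu, old))
        if change <= tol * max(b * b for b in old):
            break
    return mu


# Hypothetical chain of three groups with one observation each.
sums, sizes = [0.0, 0.1, 5.0], [1, 1, 1]
D = {0: [1], 1: [0, 2], 2: [1]}
w = {(i, j): 1.0 for i in D for j in D[i]}
init = [s / n for s, n in zip(sums, sizes)]  # least-squares start
mu_hat = cda_gfl(sums, sizes, D, w, 0.2, init)
print(mu_hat)
```

In this example the descent cycle alone stalls with groups 0 and 1 tied at 0.1, and it is the fusion cycle that moves the tied pair jointly to its fused optimum near 0.15.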

When using this algorithm, the selection of $λ$ is very important because $λ$ adjusts the strength of the penalty term; that is, the value of $λ$ changes the estimate of $μ$. If $λ = 0$, the GFL estimator of $μ$ is equal to the least-squares estimator of $μ$. As $λ$ increases, some estimates become equal; that is, $b$ becomes smaller. For a sufficiently large $λ$, denoted by $λ_{\max}$, all estimates are equal. Hence, the optimal $λ$ should be searched for in the range $[0, λ_{\max}]$. When all estimates are equal, that is, $λ = λ_{\max}$, the estimates are given by $\hat{μ}_{1} = \dots = \hat{μ}_{m} = \hat{μ}_{\infty} = \tilde{y}' 1_{n} / n$. Then, from the update equation of the descent cycle, $λ_{\max}$ is given by $λ_{\max} = \max \left\{ \max_{j \in \{1, \dots, m\}} \frac{\hat{μ}_{\infty} n_{j} - \tilde{y}_{j}' 1_{n_{j}}}{\sum_{ℓ \in D_{j}} w_{j ℓ}}, \max_{j \in \{1, \dots, m\}} \frac{\tilde{y}_{j}' 1_{n_{j}} - \hat{μ}_{\infty} n_{j}}{\sum_{ℓ \in D_{j}} w_{j ℓ}} \right\}$ (7).
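Equation (7) is straightforward to evaluate, since the maximum of the two signed ratios is the maximum absolute ratio. The sketch below computes $\hat{μ}_{\infty}$ and $λ_{\max}$ for a hypothetical example; the function name and data are assumptions, and every $D_{j}$ is assumed to be nonempty so that the denominators are positive.

```python
def lambda_max(sums, sizes, D, w):
    """Equation (7): sums[j] = y~_j' 1, sizes[j] = n_j,
    D[j] = indices in D_j (assumed nonempty), w[(j, l)] = w_jl."""
    n = sum(sizes)
    mu_inf = sum(sums) / n  # common estimate when all groups are fused
    ratios = []
    for j in range(len(sizes)):
        weight_total = sum(w[(j, l)] for l in D[j])
        # max of the two signed ratios in (7) equals the absolute ratio
        ratios.append(abs(mu_inf * sizes[j] - sums[j]) / weight_total)
    return mu_inf, max(ratios)


# Hypothetical chain of three groups with one observation each.
sums, sizes = [0.0, 0.1, 5.0], [1, 1, 1]
D = {0: [1], 1: [0, 2], 2: [1]}
w = {(j, l): 1.0 for j in D for l in D[j]}

mu_inf, lam_max = lambda_max(sums, sizes, D, w)
print(mu_inf, lam_max)
```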

Thus, for several values of $λ \in [0, λ_{\max}]$, $μ$ is estimated via the CDA for GFL, and the optimal $λ$ is selected by, for example, minimizing a model selection criterion.

3. Numerical studies

In this section, we present numerical simulations, discuss estimation accuracy, and consider an illustrative application to an actual data set. The numerical calculation programs were run in R (ver. 3.6.0) on a computer with a Windows 10 Pro operating system, an Intel (R) Core (TM) i7-7700 processor, and 16 GB of RAM.

In these studies, the explanatory variables include dummy variables with 3 or more categories. We therefore split $X$ and $β$ as $X = (A_{1}, \dots, A_{k})$ and $β = (β_{1}', \dots, β_{k}')'$, where $k$ $(\leq p)$ is the number of explanatory variables, $A_{ℓ}$ is an $n \times p_{ℓ}$ matrix of the $ℓ$th explanatory variable, $β_{ℓ}$ is a $p_{ℓ}$-dimensional vector of regression coefficients for $A_{ℓ}$, and $p_{ℓ}$ satisfies $p_{ℓ} \geq 1$ and $p = \sum_{ℓ = 1}^{k} p_{ℓ}$. In particular, when $p_{ℓ} = 1$, we write $A_{ℓ}$ as the vector $a_{ℓ}$ and treat $β_{ℓ}$ as a scalar. Assume that $X$ is scaled. Then, the following equations hold: $‖ a_{ℓ} ‖_{2} = 1$ $(ℓ \text{ s.t. } p_{ℓ} = 1)$ and $A_{ℓ}' A_{ℓ} = I_{p_{ℓ}}$ $(ℓ \text{ s.t. } p_{ℓ} \geq 2)$.

While executing the CDA for GFL, we also select explanatory variables. The explanatory variables are selected by the CDA for group Lasso (Yuan and Lin 2006). Then, the estimators of $β$ and $μ$ are given by $\hat{β}_{λ_{1}} = \arg \min_{β} \left\{ ‖ y - Xβ - R \hat{μ} ‖_{2}^{2} + λ_{1} \sum_{j = 1}^{k} w_{1, j} ‖ β_{j} ‖_{2} \right\}$ and $\hat{μ}_{λ_{2}} = \arg \min_{μ} \left\{ ‖ y - X \hat{β} - Rμ ‖_{2}^{2} + λ_{2} \sum_{j = 1}^{m} \sum_{ℓ \in D_{j}} w_{2, j ℓ} | μ_{j} - μ_{ℓ} | \right\}$, where $λ_{1}$ and $λ_{2}$ are positive tuning parameters. The update equation of the CDA for group Lasso is given by $\hat{β}_{λ_{1}, j} = \left( 1 - \frac{λ_{1} w_{1, j}}{2 ‖ c_{j} ‖_{2}} \right)_{+} c_{j}$, where $c_{j} = A_{j}' (y - R \hat{μ} - \sum_{ℓ \neq j}^{k} A_{ℓ} \hat{β}_{ℓ})$ and $(x)_{+} = \max \{0, x\}$. In particular, when $p_{j} = 1$, the update equation is given by $\hat{β}_{λ_{1}, j} = S (c_{j}, λ_{1} w_{1, j} / 2)$, where $S(x, a)$ is a soft-thresholding operator (e.g., Donoho and Johnstone 1994), that is, $S (x, a) = \operatorname{sign}(x) (| x | - a)_{+}$. The inverse of the $ℓ_{2}$-norm (absolute value) of an estimator is often used as the penalty weight (e.g., Zou 2006). Hence, we use the following weights: $w_{1, j} = \frac{1}{‖ \tilde{β}_{j} ‖_{2}}$ $(j \in \{1, \dots, k\})$ and $w_{2, j ℓ} = \frac{1}{| \tilde{μ}_{j} - \tilde{μ}_{ℓ} |}$ $(j \in \{1, \dots, m\}, ℓ \in D_{j})$, where $\tilde{β}_{j}$ and $\tilde{μ}_{j}$ are the least-squares estimators of $β_{j}$ and $μ_{j}$, respectively. The estimators $\hat{β}_{λ_{1}}$ and $\hat{μ}_{λ_{2}}$ are obtained by the following algorithm:

Alternate Optimization Algorithm

Input: Initial vectors of $β$ and $μ$

Output: The optimal solutions of $β$ and $μ$

Step 1. Optimize λ₁ and $β$ :
1-1. Decide searching points of λ₁.
1-2. For fixed λ₁, calculate ${\hat{β}}_{λ_{1}}$ by the CDA for group Lasso.
1-3. Repeat 1-2 for all λ₁ and select the optimal λ₁ based on minimizing a model selection criterion.

Step 2. Optimize λ₂ and $μ$ :
2-1. Decide searching points of λ₂.
2-2. For fixed λ₂, calculate ${\hat{μ}}_{λ_{2}}$ by the CDA for GFL or the genlasso package.
2-3. Repeat 2-2 for all λ₂ and select the optimal λ₂ based on minimizing a model selection criterion.

Step 3. Check convergence:
If $\hat{β}$ and $\hat{μ}$ have both converged, the algorithm completes. If not, return to Step 1.
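The group Lasso update used in Step 1-2 and its scalar special case can be sketched as follows. The function names are hypothetical, and the sketch assumes that $c_{j}$ has already been computed from the scaled design as described above.

```python
import math


def soft_threshold(x, a):
    """S(x, a) = sign(x) * (|x| - a)_+ ."""
    return math.copysign(max(abs(x) - a, 0.0), x)


def group_lasso_update(c_j, lam1, w1_j):
    """(1 - lam1 * w1_j / (2 * ||c_j||_2))_+ * c_j for one block of coefficients."""
    norm = math.sqrt(sum(c * c for c in c_j))
    if norm == 0.0:
        return [0.0] * len(c_j)  # a zero block stays at zero
    scale = max(1.0 - lam1 * w1_j / (2.0 * norm), 0.0)
    return [scale * c for c in c_j]


# Hypothetical block with ||c_j||_2 = 5: the block is shrunk by the factor 0.8.
beta_block = group_lasso_update([3.0, 4.0], lam1=2.0, w1_j=1.0)

# For a block of size one, the update agrees with soft-thresholding.
beta_scalar = group_lasso_update([3.0], lam1=2.0, w1_j=1.0)[0]
beta_soft = soft_threshold(3.0, 2.0 * 1.0 / 2.0)
print(beta_block, beta_scalar, beta_soft)
```

A whole block is set to zero as soon as $‖ c_{j} ‖_{2} \leq λ_{1} w_{1, j} / 2$, which is how the group Lasso removes an explanatory variable together with all of its dummy columns.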

The tuning parameters $λ_{1}$ and $λ_{2}$ are each searched over the 100 points given by $λ_{\max} (3 / 4)^{j - 1}$ $(j = 1, \dots, 100)$, where $λ_{\max}$ is given by $λ_{\max} = \begin{cases} \max_{ℓ \in \{1, \dots, k\}} \frac{2 ‖ A_{ℓ}' (y - R \hat{μ}) ‖_{2}}{w_{1, ℓ}} & (\text{for } λ_{1}) \\ \text{given by (7)} & (\text{for } λ_{2}) \end{cases}$.
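The geometric search grid is easy to generate. In the sketch below, the function name and the example value of $λ_{\max}$ are assumptions.

```python
def lambda_grid(lam_max, ratio=0.75, n_points=100):
    """Search points lam_max * ratio**(j - 1) for j = 1, ..., n_points."""
    return [lam_max * ratio ** j for j in range(n_points)]


grid = lambda_grid(3.3)  # 3.3 is a hypothetical value of lam_max
print(grid[0], grid[1], grid[-1])
```

The grid starts at $λ_{\max}$, where all estimates are equal, and decreases geometrically toward 0, where the solution approaches the least-squares estimator.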

The optimal tuning parameters are selected by minimizing the extended GCV (EGCV) criterion (Ohishi, Yanagihara, and Fujikoshi 2020) defined by $\mathrm{EGCV} = \frac{(\text{residual sum of squares}) / n}{\{1 - (\text{degrees of freedom}) / n\}^{α}}$, where $α$ is a positive value expressing the strength of the penalty for model complexity. We use the EGCV criterion with $α = \log n$. For an $m$-dimensional vector $θ^{(i)}$, we regard $θ^{(i)}$ as having converged when the following inequality holds: $\frac{\max_{j \in \{1, \dots, m\}} (θ_{j}^{(i + 1)} - θ_{j}^{(i)})^{2}}{\max_{j \in \{1, \dots, m\}} (θ_{j}^{(i)})^{2}} \leq \frac{1}{10{,}000}$, where $(i)$ is the iteration number and $θ_{j}^{(i)}$ is the $j$th element of $θ^{(i)}$.
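Both the criterion and the stopping rule translate directly into code. In the sketch below, the function names are hypothetical, and the degrees of freedom are passed in as a number because their definition is not given in this section.

```python
import math


def egcv(rss, df, n, alpha=None):
    """EGCV = (rss / n) / (1 - df / n)**alpha, with alpha = log(n) by default."""
    if not 0 <= df < n:
        raise ValueError("degrees of freedom must satisfy 0 <= df < n")
    if alpha is None:
        alpha = math.log(n)
    return (rss / n) / (1.0 - df / n) ** alpha


def has_converged(theta_new, theta_old, tol=1e-4):
    """max_j (new_j - old_j)^2 / max_j old_j^2 <= tol, written without
    division so that an all-zero theta_old does not raise an error."""
    change = max((a - b) ** 2 for a, b in zip(theta_new, theta_old))
    scale = max(b * b for b in theta_old)
    return change <= tol * scale


# Hypothetical values: a larger model must reduce the residual sum of
# squares enough to offset the complexity penalty.
small_model = egcv(rss=120.0, df=5, n=100)
large_model = egcv(rss=110.0, df=20, n=100)
print(small_model, large_model)
print(has_converged([1.0, 2.01], [1.0, 2.0]), has_converged([1.0, 2.1], [1.0, 2.0]))
```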

3.1. Simulation

In this subsection, we compare the estimation accuracies of the following Method 1 and Method 2 by simulation.

Method 1: The Alternate Optimization Algorithm using the CDA for GFL to optimize $μ .$
Method 2: The Alternate Optimization Algorithm using the genlasso package (ver. 1.4) to optimize $μ .$

The number of groups is $m = 10, 20$, the correlation between explanatory variables is $ρ = 0.5, 0.8$, and the sample sizes of the groups are $n_{1} = \dots = n_{m} = n_{0}$. Then, the total sample size is $n = m n_{0}$, and we use $n_{0} = 100, 200, 500, 1000$. Figure 1 shows the simulation groups when $m = 10, 20$, with adjacent relationships indicated by lines. The figure means, for example, $D_{1} = \{2, 3, 4, 5, 6\}$ and $D_{2} = \{1, 3, 6, 7, 8\}$ when $m = 10$. We generated data from the simulation model $N_{n} (Xβ + Rμ, I_{n})$ with the following $X$: $X = (a_{1}, \dots, a_{8}, A_{9}, \dots, A_{13})$, where the column vectors $a_{1}, \dots, a_{8}$ and block matrices $A_{9}, \dots, A_{13}$ are constructed by the following procedure. Let $u_{1}, \dots, u_{14}$ be independent $n$-dimensional vectors whose elements are independently and identically distributed according to $U(0, 1)$, and let $v_{1}, \dots, v_{13}$ be $n$-dimensional vectors defined by $v_{j} = ω u_{14} + (1 - ω) u_{j}$, where $ω$ is the parameter that sets the correlation of $v_{i}$ and $v_{j}$ $(i \neq j)$ to $ρ$, defined by $ω = \begin{cases} \frac{ρ \pm \sqrt{ρ^{2} - ρ (2 ρ - 1)}}{2 ρ - 1} & (ρ \neq 1 / 2) \\ \frac{1}{2} & (ρ = 1 / 2) \end{cases}$.
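The formula for $ω$ follows from $\operatorname{Cor}(v_{i}, v_{j}) = ω^{2} / \{ω^{2} + (1 - ω)^{2}\}$, which gives the quadratic equation $(2 ρ - 1) ω^{2} - 2 ρ ω + ρ = 0$. The sketch below solves it and checks the implied correlation. The function names are hypothetical, and choosing the root that lies in $[0, 1]$ is an assumption made here, since the text leaves the sign open.

```python
import math


def mixing_weight(rho):
    """Solve (2*rho - 1)*w^2 - 2*rho*w + rho = 0 for the weight omega,
    taking the root in [0, 1] (an assumption of this sketch)."""
    if rho == 0.5:
        return 0.5
    disc = math.sqrt(rho * rho - rho * (2.0 * rho - 1.0))
    roots = [(rho + sign * disc) / (2.0 * rho - 1.0) for sign in (1.0, -1.0)]
    return next(r for r in roots if 0.0 <= r <= 1.0)


def implied_correlation(omega):
    """Correlation of v_i and v_j (i != j) when u_1, ..., u_14 are i.i.d."""
    return omega ** 2 / (omega ** 2 + (1.0 - omega) ** 2)


for rho in (0.5, 0.8):
    omega = mixing_weight(rho)
    print(rho, omega, implied_correlation(omega))
```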

Figure 1. Simulation groups and adjacent relationship. (a) m = 10. (b) m = 20.

By using these vectors $v_{1}, \dots, v_{13}$, we define the blocks in $X$ as follows: Let $a_{j} = v_{j}$ for $j = 1, \dots, 5$; let $a_{j}$ $(j = 6, 7, 8)$ be dummy variables that take the value 1 or 0, defined by $a_{j, i} = \begin{cases} 1 & (v_{j, i} > 0.6) \\ 0 & (v_{j, i} \leq 0.6) \end{cases}$ $(i \in \{1, \dots, n\})$; and let $A_{j}$ $(j = 9, \dots, 13)$ be $(j - 7)$-dimensional categorical dummy variables, defined by $(\text{the } i\text{th row vector of } A_{j}) = \begin{cases} e_{j - 7, ℓ} & (v_{j, i} \in Q_{j - 6, ℓ}, ℓ \neq j - 6) \\ 0_{j - 7} & (v_{j, i} \in Q_{j - 6, j - 6}) \end{cases}$ $(i = 1, \dots, n)$, where $e_{j, ℓ}$ is a $j$-dimensional vector in which the $ℓ$th element is 1 and the others are 0, and $Q_{j, ℓ}$ is the $ℓ$th range when $[0, 1]$ is split into $j$ ranges. The following 2 cases are used as $β$ and $μ$.

Case 1: Let the number of true explanatory variables be $k_{*} = 9$ and the number of true joins of groups be $m_{*} = \begin{cases} 3 & (m = 10) \\ 6 & (m = 20) \end{cases}$, and we use the following $β$ and $μ$: $β = (1, 2, 3, 0, 0, 1, 1, 2, 1_{2}', 0_{3}', 2 \times 1_{4}', 0_{5}', 3 \times 1_{6}')'$ and $\forall j \in E_{ℓ}, μ_{j} = ℓ$ $(ℓ = 1, \dots, m_{*})$.

Case 2: Let the number of true explanatory variables be $k_{*} = 3$ and the number of true joins of groups be $m_{*} = \begin{cases} 6 & (m = 10) \\ 12 & (m = 20) \end{cases}$, and we use the following $β$ and $μ$: $β = (1, 0, 0, 0, 0, 1, 0, 0, 0_{2}', 0_{3}', 2 \times 1_{4}', 0_{5}', 0_{6}')'$ and $\forall j \in E_{ℓ}, μ_{j} = ℓ$ $(ℓ = 1, \dots, m_{*})$.

Figures 2 and 3 show the true joins of groups when $m = 10$ and $m = 20$, respectively. Estimation accuracy is evaluated by the selection probabilities of true variables and true joins, computed by Monte Carlo simulation with 1000 iterations. Tables 1 and 2 show the selection probabilities (SP) of true variables and true joins and the running times (RT) of the programs in Cases 1 and 2, respectively. The SP is displayed as the combined probability (C-SP) and as separate probabilities for variables and joins. From the tables, since the SP of Method 1 approaches 100% as the sample size increases, we find that Method 1 has high estimation accuracy. In contrast, Method 2 struggles to select the true variables and true joins, and its SP is at most 7.4% (when $m = 10$, $ρ = 0.8$, and $n = 5000$ in Case 1). In particular, it struggles to select the true joins, for which its SP is at most 7.8% (when $m = 10$, $ρ = 0.8$, and $n = 5000$ in Case 1). Moreover, in terms of running time, Method 1 is up to about 134 times faster than Method 2 (when $m = 20$, $ρ = 0.5$, and $n = 20{,}000$ in Case 1).

Figure 2. True joins when m = 10. (a) Case 1. (b) Case 2.

Figure 3. True joins when m = 20. (a) Case 1. (b) Case 2.

Table 1. Selection probabilities and running times in Case 1.


Table 2. Selection probabilities and running times in Case 2.


3.2. A real data example

In this subsection, we present an illustrative application of the proposed method (Method 1 in Subsection 3.1) to an actual data set. Method 2 in Subsection 3.1 cannot be run on these data because the genlasso package runs out of memory owing to the large sample size. This is the motivation of this paper. The data pertain to studio apartment rents and environmental conditions in Tokyo’s 23 wards and were collected by Tokyo Kantei Co., Ltd. Here, $n = 61{,}999$, and all data were collected between April 2014 and April 2015. In this application, the response variable is monthly rent, and the remaining items are used as explanatory variables. Using the proposed method, we estimate regional effects for the 852 areas into which Tokyo’s 23 wards are split. Figure 4 shows the split of Tokyo’s 23 wards into 852 areas. Figure 5 shows the estimates and the clustering result of the regional effects in the form of choropleth maps. According to the clustering result, the 852 areas in Tokyo’s 23 wards are clustered into 190 areas. Estimates of the regression coefficients were obtained together with the regional effects; as a result of variable selection, B5 and C1 are not selected. Figure 6 shows residual plots for the quantitative variables. The coefficient of determination ($R^{2}$), the median error rate (MER), and the running time were also recorded. From the results, $R^{2}$ is more than 0.8, MER is less than 10%, and the residual plots are unproblematic. Thus, the proposed method performs well. In this application, we have applied the CDA for GFL to large-sample data. For such data, the genlasso package ran out of memory, whereas the CDA completed the computation in only about 2 minutes. This is our main contribution.

Figure 4. The 852 areas in Tokyo’s 23 wards.

Figure 5. Regional effects estimation results. (a) Estimates. (b) Clustering.

Figure 6. Residual plots. (a) Floor area (A). (b) Age (C3). (c) Interaction (C4). (d) Walking time (C5).

4. Conclusion

In this paper, we addressed coordinate optimization for GFL and derived the update equations for the descent cycle and the fusion cycle in closed form. The update equations can be obtained using only elementary mathematics, and complex conditions (e.g., the Karush-Kuhn-Tucker conditions) are not required. In the numerical studies, we found that our proposed method gives an accurate solution more rapidly than the existing method (which uses the genlasso package) and can be applied even to large-sample data for which the genlasso package runs out of memory.

Here, our proposed method was derived for a linear regression model. However, the method can also be applied to models that are not linear. For example, for generalized linear models, by approximating the negative log-likelihood function by a linear function, the GFL-penalized objective function takes the same form as Equation (3). That is, our proposed method can be applied even to generalized linear models. On the other hand, the method does not work for extensions of the GFL penalty. For example, when a group-GFL penalty is used instead of the GFL penalty (that is, when $μ_{j}$ is a vector), the method cannot be applied.

Acknowledgments

The authors thank Dr. Ryoya Oda and Mr. Yuya Suzuki of Hiroshima University, Mr. Yuhei Sato and Mr. Yu Matsumura of Tokyo Kantei Co., Ltd., and Koki Kirishima of Computer Management Co., Ltd., for helpful comments. Moreover, the authors also thank the associate editor and the reviewers for their valuable comments.

Additional information

Funding

The first author’s research was partially supported by JSPS KAKENHI Grant Numbers JP20H04151 and JP21K13834. The second author’s research was partially supported by JSPS KAKENHI Grant Numbers JP17K15842, JP18K10068, JP19H01076, JP20H04151, and JP21K17288. The last author’s research was partially supported by JSPS KAKENHI Grant Numbers JP18K03415 and JP20H0415.

References

Arnold, T., and R. Tibshirani. 2019. genlasso: Path Algorithm for Generalized Lasso Problems. R package version 1.4. https://CRAN.R-project.org/package=genlasso.
Donoho, D. L., and I. M. Johnstone. 1994. Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 (3):425–55. doi: https://doi.org/10.1093/biomet/81.3.425.
Friedman, J., T. Hastie, H. Höfling, and R. Tibshirani. 2007. Pathwise coordinate optimization. The Annals of Applied Statistics 1 (2):302–32. doi: https://doi.org/10.1214/07-AOAS131.
Li, F., and H. Sang. 2019. Spatial homogeneity pursuit of regression coefficients for large datasets. Journal of the American Statistical Association 114 (527):1050–62. doi: https://doi.org/10.1080/01621459.2018.1529595.
Ohishi, M., H. Yanagihara, and Y. Fujikoshi. 2020. A fast algorithm for optimizing ridge parameters in a generalized ridge regression by minimizing a model selection criterion. Journal of Statistical Planning and Inference 204:187–205. doi: https://doi.org/10.1016/j.jspi.2019.04.010.
R Core Team. 2019. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sun, Y., H. J. Wang, and M. Fuentes. 2016. Fused adaptive Lasso for spatial and temporal quantile function estimation. Technometrics 58 (1):127–37. doi: https://doi.org/10.1080/00401706.2015.1017115.
Tibshirani, R. 1996. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1):267–88. doi: https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.
Tibshirani, R., and J. Taylor. 2011. The solution path of the generalized Lasso. The Annals of Statistics 39:1335–71.
Tibshirani, R., M. Saunders, S. Rosset, J. Zhu, and K. Knight. 2005. Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1):91–108. doi: https://doi.org/10.1111/j.1467-9868.2005.00490.x.
Yuan, M., and Y. Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1):49–67. doi: https://doi.org/10.1111/j.1467-9868.2005.00532.x.
Zou, H. 2006. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101 (476):1418–29. doi: https://doi.org/10.1198/016214506000000735.

Appendix A. Proof of Lemma 2.1

We partition Equation (3) into terms that do and do not depend on $μ_{i}$. The first term in Equation (3) can be partitioned as follows: $\begin{aligned} ‖ \tilde{y} - Rμ ‖_{2}^{2} &= ‖ \tilde{y} ‖_{2}^{2} - 2 \tilde{y}' Rμ + μ' R' Rμ \\ &= ‖ \tilde{y} ‖_{2}^{2} - 2 \left( \sum_{j \neq i}^{m} \hat{μ}_{j} \tilde{y}_{j}' 1_{n_{j}} + μ_{i} \tilde{y}_{i}' 1_{n_{i}} \right) + \sum_{j \neq i}^{m} n_{j} \hat{μ}_{j}^{2} + n_{i} μ_{i}^{2} \\ &= n_{i} μ_{i}^{2} - 2 \tilde{y}_{i}' 1_{n_{i}} μ_{i} + \sum_{j \neq i}^{m} (n_{j} \hat{μ}_{j}^{2} - 2 \tilde{y}_{j}' 1_{n_{j}} \hat{μ}_{j}) + ‖ \tilde{y} ‖_{2}^{2} . \end{aligned}$ (A1)

Moreover, since $w_{j ℓ} = w_{ℓ j},$ the second term in Equation (3), $‖ \tilde{y} - Rμ ‖_{2}^{2} + λ \sum_{j = 1}^{m} \sum_{ℓ \in D_{j}} w_{j ℓ} | μ_{j} - μ_{ℓ} |,$ can be partitioned as follows: $\begin{matrix} \sum_{j = 1}^{m} \sum_{ℓ \in D_{j}} w_{j ℓ} | μ_{j} - μ_{ℓ} | = \sum_{j \neq i}^{m} \sum_{ℓ \in D_{j}} w_{j ℓ} | μ_{j} - μ_{ℓ} | + \sum_{ℓ \in D_{i}} w_{i ℓ} | μ_{i} - μ_{ℓ} | \\ = \sum_{j \neq i}^{m} \sum_{ℓ \in D_{j} \setminus \{i\}} w_{j ℓ} | {\hat{μ}}_{j} - {\hat{μ}}_{ℓ} | + 2 \sum_{ℓ \in D_{i}} w_{i ℓ} | μ_{i} - {\hat{μ}}_{ℓ} | . \end{matrix}$

Consequently, Lemma 2.1 is proved, where u_i is given by $u_{i} = \sum_{j \neq i}^{m} (n_{j} {\hat{μ}}_{j}^{2} - 2 {\tilde{y}}_{j}^{'} 1_{n_{j}} {\hat{μ}}_{j}) + ‖ \tilde{y} ‖_{2}^{2} + λ \sum_{j \neq i}^{m} \sum_{ℓ \in D_{j} \setminus \{i\}} w_{j ℓ} | {\hat{μ}}_{j} - {\hat{μ}}_{ℓ} | .$
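The decomposition above can be checked numerically. The following Python sketch is an illustration rather than the authors' code. It assumes that the adjacency sets D_j are symmetric and are encoded as the keys of a nested dictionary `w` of symmetric weights, that groups are indexed from 0, and that the data and the names `gfl_objective`, `coordinate_part`, and `residual_constant` are hypothetical. It evaluates the objective in Equation (3) with μ_i set to a trial value x, subtracts the terms that depend on μ_i, and returns the remainder, which should equal the constant u_i for every x.

```python
def gfl_objective(y_tilde, mu, w, lam):
    """Equation (3): squared error plus lam * sum_j sum_{l in D_j} w_jl |mu_j - mu_l|."""
    fit = sum((y - mu[j]) ** 2 for j, ys in enumerate(y_tilde) for y in ys)
    pen = sum(w_jl * abs(mu[j] - mu[l]) for j in w for l, w_jl in w[j].items())
    return fit + lam * pen


def coordinate_part(y_tilde, mu_hat, w, lam, i, x):
    """Terms of Equation (3) that depend on mu_i = x, as isolated in Appendix A."""
    n_i = len(y_tilde[i])          # group size n_i
    s_i = sum(y_tilde[i])          # inner product of y_tilde_i with the vector of ones
    penalty = sum(w_il * abs(x - mu_hat[l]) for l, w_il in w[i].items())
    return n_i * x * x - 2 * s_i * x + 2 * lam * penalty


def residual_constant(y_tilde, mu_hat, w, lam, i, x):
    """Objective with mu_i = x minus the mu_i-dependent part; should not vary with x."""
    mu = list(mu_hat)
    mu[i] = x
    return gfl_objective(y_tilde, mu, w, lam) - coordinate_part(y_tilde, mu_hat, w, lam, i, x)


# Hypothetical toy data: three groups on a chain 0 - 1 - 2 with symmetric weights.
y_tilde = [[1.0, 2.0], [0.5], [3.0, 2.5, 2.0]]
w = {0: {1: 1.0}, 1: {0: 1.0, 2: 2.0}, 2: {1: 2.0}}
mu_hat = [1.2, 0.7, 2.4]
constants = [residual_constant(y_tilde, mu_hat, w, 0.3, 1, x) for x in (-2.0, 0.0, 0.7, 5.0)]
```

Up to floating-point rounding, all entries of `constants` coincide, which is the numerical counterpart of the statement that u_i does not depend on μ_i.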

Appendix B. Proof of Theorem 2.2

First, we rewrite Equation (5), $ϕ (x) = c_{2} x^{2} - 2 c_{1} x + 2 λ \sum_{j = 1}^{r} w_{j} | x - b_{j} | \ (c_{2}, w_{1}, \dots, w_{r} > 0),$ in a form without absolute values. Since $t_{1}, \dots, t_{r}$ are the order statistics of $b_{1}, \dots, b_{r},$ the following equation holds when $x \in R_{a}$: $\begin{matrix} \sum_{j = 1}^{r} w_{j} | x - b_{j} | = \sum_{j \in J_{a}^{+}} w_{j} (x - b_{j}) + \sum_{j \in J_{a}^{-}} w_{j} (b_{j} - x) \\ = {\tilde{w}}_{a} x - (\sum_{j \in J_{a}^{+}} w_{j} b_{j} - \sum_{j \in J_{a}^{-}} w_{j} b_{j}) . \end{matrix}$

Thus, we have the following form of Equation (5) without absolute values: $ϕ (x) = ϕ_{a} (x) = c_{2} x^{2} - 2 (c_{1} - λ {\tilde{w}}_{a}) x - 2 λ (\sum_{j \in J_{a}^{+}} w_{j} b_{j} - \sum_{j \in J_{a}^{-}} w_{j} b_{j}) \ (x \in R_{a}; a = 0, 1, \dots, r) .$

From the above equation, we see that v_a is the x-coordinate of the vertex of the quadratic function $ϕ_{a} (x)$ and that the minimizer of $ϕ (x)$ is included in $\{v_{0}, v_{1}, \dots, v_{r}, t_{1}, \dots, t_{r}\}$. The piecewise function $ϕ (x) = ϕ_{a} (x) \ (x \in R_{a})$ has the following properties (the proof is given in Appendix D.1):

Lemma B.1.

The function $ϕ (x)$ is continuous in $x \in \mathbb{R}$, that is, $ϕ_{a} (t_{a + 1}) = ϕ_{a + 1} (t_{a + 1}) \ (a = 0, 1, \dots, r - 1)$, and v_a is a monotonically decreasing sequence with respect to a.

From this lemma, we have the following lemma (the proof is given in Appendix D.2):

Lemma B.2.

Exactly one of $a^{*}$ and $a^{⋆}$ exists, and it is unique.

Lemma B.2 says that whichever of $v_{a^{*}}$ and $t_{a^{⋆}}$ exists is a local minimizer. In addition, because $ϕ (x)$ is continuous and, as the sum of a quadratic with $c_{2} > 0$ and convex absolute-value terms, strictly convex, this local minimizer is the unique minimizer of $ϕ (x) .$ Consequently, Theorem 2.2 is proved.
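The case analysis behind Theorem 2.2 can be turned into a short search over a = 0, 1, …, r. The following Python sketch is an illustration, not the authors' implementation. It assumes, in line with the proofs in Appendices B and D, that $R_{0} = (- \infty, t_{1}]$, $R_{a} = (t_{a}, t_{a + 1}]$, and $R_{r} = (t_{r}, \infty)$, that ${\tilde{w}}_{a}$ is the sum of the weights of the a smallest b_j minus the sum of the remaining weights, and that $v_{a} = (c_{1} - λ {\tilde{w}}_{a}) / c_{2}$. The function names `phi` and `minimize_phi` are hypothetical.

```python
def phi(x, c2, c1, lam, w, b):
    """Evaluate phi(x) = c2*x^2 - 2*c1*x + 2*lam*sum_j w[j]*|x - b[j]|."""
    return c2 * x * x - 2 * c1 * x + 2 * lam * sum(wj * abs(x - bj) for wj, bj in zip(w, b))


def minimize_phi(c2, c1, lam, w, b):
    """Return the minimizer of phi, which is either a vertex v_a or a knot t_a."""
    pairs = sorted(zip(b, w))            # order statistics t_1 <= ... <= t_r with their weights
    t = [p[0] for p in pairs]
    ws = [p[1] for p in pairs]
    r = len(t)
    w_tilde = -sum(ws)                   # w_tilde_0: every b_j lies to the right of R_0
    v_prev = None
    for a in range(r + 1):
        v = (c1 - lam * w_tilde) / c2    # vertex v_a of the quadratic phi_a
        # Knot case: t_a lies in [v_a, v_{a-1}), so the minimizer is t_a.
        if a >= 1 and v <= t[a - 1] < v_prev:
            return t[a - 1]
        lower = t[a - 1] if a >= 1 else float("-inf")
        upper = t[a] if a < r else float("inf")
        # Vertex case: v_a lies in its own interval R_a = (t_a, t_{a+1}].
        if lower < v <= upper:
            return v
        if a < r:
            w_tilde += 2 * ws[a]         # w_tilde_{a+1} = w_tilde_a + 2 * w_{j*}
        v_prev = v
    # Not reached for c2 > 0, lam >= 0 and positive weights (Lemma B.2).
    raise ArithmeticError("no minimizer found")
```

For instance, with c2 = 1, c1 = 0, lam = 1, and a single term with w = [1], the call with b = [3] returns the vertex v_0 = 1, whereas b = [0.5] returns the knot 0.5.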

Appendix C. Proof of Lemma 2.3

From Equation (1), and by the same expansion as in (A1) with $μ_{j} = η_{i}$ for $j \in E_{i},$ we have $‖ \tilde{y} - Rμ ‖_{2}^{2} = \sum_{j \in E_{i}} n_{j} η_{i}^{2} - 2 \sum_{j \in E_{i}} {\tilde{y}}_{j}' 1_{n_{j}} η_{i} + \sum_{j \notin E_{i}} (n_{j} {\hat{μ}}_{j}^{2} - 2 {\tilde{y}}_{j}^{'} 1_{n_{j}} {\hat{μ}}_{j}) + ‖ \tilde{y} ‖_{2}^{2} .$

Moreover, the second term in Equation (3) can be partitioned as follows: $\begin{matrix} \sum_{j = 1}^{m} \sum_{ℓ \in D_{j}} w_{j ℓ} | μ_{j} - μ_{ℓ} | = \sum_{j \notin E_{i}} \sum_{ℓ \in D_{j}} w_{j ℓ} | μ_{j} - μ_{ℓ} | + \sum_{j \in E_{i}} \sum_{ℓ \in D_{j}} w_{j ℓ} | μ_{j} - μ_{ℓ} | \\ = \sum_{j \notin E_{i}} \sum_{ℓ \in D_{j} \setminus E_{i}} w_{j ℓ} | {\hat{μ}}_{j} - {\hat{μ}}_{ℓ} | + \sum_{j \notin E_{i}} \sum_{ℓ \in D_{j} \cap E_{i}} w_{j ℓ} | {\hat{μ}}_{j} - μ_{ℓ} | \\ + \sum_{j \in E_{i}} \sum_{ℓ \in D_{j} \setminus E_{i}} w_{j ℓ} | μ_{j} - {\hat{μ}}_{ℓ} | + \sum_{j \in E_{i}} \sum_{ℓ \in D_{j} \cap E_{i}} w_{j ℓ} | μ_{j} - μ_{ℓ} | \\ = \sum_{j \notin E_{i}} \sum_{ℓ \in D_{j} \setminus E_{i}} w_{j ℓ} | {\hat{μ}}_{j} - {\hat{μ}}_{ℓ} | + \sum_{j \notin E_{i}} \sum_{ℓ \in D_{j} \cap E_{i}} w_{j ℓ} | {\hat{μ}}_{j} - η_{i} | \\ + \sum_{j \in E_{i}} \sum_{ℓ \in D_{j} \setminus E_{i}} w_{j ℓ} | η_{i} - {\hat{μ}}_{ℓ} | + \sum_{j \in E_{i}} \sum_{ℓ \in D_{j} \cap E_{i}} w_{j ℓ} | η_{i} - η_{i} | \\ = \sum_{j \notin E_{i}} \sum_{ℓ \in D_{j} \setminus E_{i}} w_{j ℓ} | {\hat{μ}}_{j} - {\hat{μ}}_{ℓ} | + 2 \sum_{j \in E_{i}} \sum_{ℓ \in D_{j} \setminus E_{i}} w_{j ℓ} | η_{i} - {\hat{μ}}_{ℓ} | . \end{matrix}$

All pairs $(j, ℓ)$ in the second term of the above equation are expressed as the following set: $\underset{j \in E_{i}}{\cup} \{j\} \times (D_{j} \setminus E_{i}) .$

Regarding this set, we have the following lemma (the proof is given in Appendix D.3):

Lemma C.1.

The following two sets are equal: $\underset{j \in E_{i}}{\cup} \{j\} \times (D_{j} \setminus E_{i}) = \underset{ℓ \in D_{i}^{*}}{\cup} J_{i ℓ} .$

This lemma gives $\sum_{j \in E_{i}} \sum_{ℓ \in D_{j} \setminus E_{i}} w_{j ℓ} | η_{i} - {\hat{μ}}_{ℓ} | = \sum_{ℓ \in D_{i}^{*}} \sum_{(j, s) \in J_{i ℓ}} w_{j s} | η_{i} - {\hat{μ}}_{s} | = \sum_{ℓ \in D_{i}^{*}} w_{i ℓ}^{*} | η_{i} - {\hat{η}}_{ℓ} | .$

Consequently, Lemma 2.3 is proved, where $u_{i}^{*}$ is given by $u_{i}^{*} = \sum_{j \notin E_{i}} (n_{j} {\hat{μ}}_{j}^{2} - 2 {\tilde{y}}_{j}^{'} 1_{n_{j}} {\hat{μ}}_{j}) + ‖ \tilde{y} ‖_{2}^{2} + λ \sum_{j \notin E_{i}} \sum_{ℓ \in D_{j} \setminus E_{i}} w_{j ℓ} | {\hat{μ}}_{j} - {\hat{μ}}_{ℓ} | .$
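The regrouping of weights used in this proof can be sketched in code. The following Python illustration is not taken from the paper. It assumes that $w_{i ℓ}^{*}$ is the sum of $w_{j s}$ over the pairs $(j, s)$ with $j \in E_{i}$ and $s \in D_{j} \cap E_{ℓ}$, that the clusters $E_{1}, \dots, E_{b}$ partition the groups, and that the sets D_j are the keys of a nested dictionary of symmetric weights. Indices start at 0, and the function name `fused_weights` and the toy data are hypothetical.

```python
def fused_weights(E, w, i):
    """Return {l: w*_il} for every cluster l != i adjacent to cluster i.

    E is a list of disjoint sets of group indices (the clusters E_l), and
    w[j][s] is the symmetric weight of the pair (j, s) with s in D_j.
    """
    cluster_of = {s: l for l, E_l in enumerate(E) for s in E_l}
    w_star = {}
    for j in E[i]:
        for s, w_js in w[j].items():
            if s in E[i]:
                continue                 # within-cluster pair: |eta_i - eta_i| = 0
            l = cluster_of[s]            # the unique cluster containing s
            w_star[l] = w_star.get(l, 0.0) + w_js
    return w_star


# Hypothetical example: groups 0..3, edges 0-1, 0-2, 1-2, 2-3; groups 0 and 1 are fused.
w = {
    0: {1: 1.0, 2: 0.5},
    1: {0: 1.0, 2: 2.0},
    2: {0: 0.5, 1: 2.0, 3: 1.5},
    3: {2: 1.5},
}
E = [{0, 1}, {2}, {3}]
w_star_0 = fused_weights(E, w, 0)        # only cluster 1 is adjacent to cluster 0
w_star_1 = fused_weights(E, w, 1)        # clusters 0 and 2 are adjacent to cluster 1
```

In this example, the two edges 0-2 and 1-2 that join cluster 0 to cluster 1 are merged into a single weight 0.5 + 2.0 = 2.5, so the double sum over pairs and the single sum over neighboring clusters agree for any value of η_i.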

Appendix D. Proofs of the lemmas in the appendices

D.1. Proof of Lemma B.1

First, we prove that v_a is a monotonically decreasing sequence. The following equation holds: ${\tilde{w}}_{a} = \sum_{j \in J_{a}^{+}} w_{j} - \sum_{j \in J_{a}^{-}} w_{j} = (\sum_{j \in J_{a + 1}^{+}} w_{j} - w_{j_{*}}) - (\sum_{j \in J_{a + 1}^{-}} w_{j} + w_{j_{*}}) = {\tilde{w}}_{a + 1} - 2 w_{j_{*}},$ where $j_{*} = \arg \min_{j \in J_{a}^{-}} b_{j} = \arg \max_{j \in J_{a + 1}^{+}} b_{j} .$ Since $w_{j} > 0,$ ${\tilde{w}}_{a}$ is a monotonically increasing sequence with respect to a. Thus, because v_a is the vertex $(c_{1} - λ {\tilde{w}}_{a}) / c_{2}$ of $ϕ_{a} (x),$ v_a is a monotonically decreasing sequence with respect to a.

Next, we prove that $ϕ (x)$ is a continuous function. For the part of $ϕ_{a} (t_{a + 1})$ that depends on a, the following equation holds, where $b_{j_{*}} = t_{a + 1}$: $\begin{matrix} 2 λ {\tilde{w}}_{a} t_{a + 1} - 2 λ (\sum_{j \in J_{a}^{+}} w_{j} b_{j} - \sum_{j \in J_{a}^{-}} w_{j} b_{j}) \\ = 2 λ {\tilde{w}}_{a + 1} t_{a + 1} - 4 λ w_{j_{*}} t_{a + 1} - 2 λ (\sum_{j \in J_{a + 1}^{+}} w_{j} b_{j} - \sum_{j \in J_{a + 1}^{-}} w_{j} b_{j} - 2 w_{j_{*}} t_{a + 1}) \\ = 2 λ {\tilde{w}}_{a + 1} t_{a + 1} - 2 λ (\sum_{j \in J_{a + 1}^{+}} w_{j} b_{j} - \sum_{j \in J_{a + 1}^{-}} w_{j} b_{j}) . \end{matrix}$

Thus, we have $ϕ_{a} (t_{a + 1}) = ϕ_{a + 1} (t_{a + 1}) .$

Consequently, Lemma B.1 is proved.

D.2. Proof of Lemma B.2

The following chain of equivalences holds, where each brace lists alternative cases: $\begin{matrix} a^{⋆} \ \text{does not exist} \Leftrightarrow \forall a \in \{1, \dots, r\}, \ t_{a} \notin [v_{a}, v_{a - 1}) \\ \Leftrightarrow \begin{cases} \forall a \in \{1, \dots, r\}, \ v_{a - 1} \leq t_{a} \\ \forall a \in \{1, \dots, r\}, \ t_{a} < v_{a} \\ \exists! a_{0} \in \{1, \dots, r - 1\} \ \text{s.t.} \ t_{a} < v_{a_{0}} \ (a \leq a_{0}) \ \text{and} \ v_{a_{0}} \leq t_{a} \ (a_{0} + 1 \leq a) \end{cases} \\ \Leftrightarrow \begin{cases} v_{0} \in R_{0} & (a^{*} = 0) \\ v_{r} \in R_{r} & (a^{*} = r) \\ v_{a_{0}} \in R_{a_{0}} & (a^{*} = a_{0} \in \{1, \dots, r - 1\}) \end{cases} \\ \Leftrightarrow \exists a^{*} \in \{0, 1, \dots, r\} \ \text{s.t.} \ v_{a^{*}} \in R_{a^{*}} \\ \Leftrightarrow a^{*} \ \text{exists} . \end{matrix}$

This shows that exactly one of $a^{*}$ and $a^{⋆}$ exists.

Regarding the uniqueness of $a^{*},$ assume that $a_{1} \ (\in \{0, 1, \dots, r\})$ and $a_{2} \ (\in \{0, 1, \dots, r\})$ exist such that $v_{a_{1}} \in R_{a_{1}}$ and $v_{a_{2}} \in R_{a_{2}},$ and that they satisfy $a_{1} + 1 \leq a_{2}$ without loss of generality. Then $t_{a_{1}} < v_{a_{1}} \leq t_{a_{1} + 1} \leq t_{a_{2}} < v_{a_{2}} \leq t_{a_{2} + 1}$ holds, which contradicts $v_{a_{2}} < v_{a_{1}} .$ Thus, we have $a_{1} = a_{2} .$

Regarding the uniqueness of $a^{⋆},$ assume that $a_{1} \ (\in \{1, \dots, r\})$ and $a_{2} \ (\in \{1, \dots, r\})$ exist such that $t_{a_{1}} \in [v_{a_{1}}, v_{a_{1} - 1})$ and $t_{a_{2}} \in [v_{a_{2}}, v_{a_{2} - 1}),$ and that they satisfy $a_{1} + 1 \leq a_{2}$ without loss of generality. Then $v_{a_{1}} \leq t_{a_{1}} \leq t_{a_{2}} < v_{a_{2} - 1}$ holds, which contradicts $v_{a_{2} - 1} \leq v_{a_{1}} .$ Thus, we have $a_{1} = a_{2} .$

Consequently, Lemma B.2 is proved.

D.3. Proof of Lemma C.1

First, we show that $\underset{j \in E_{i}}{\cup} \{j\} \times (D_{j} \setminus E_{i}) \subset \underset{ℓ \in D_{i}^{*}}{\cup} J_{i ℓ} .$

Let $(ℓ, s)$ be an element of the above LHS. Then, the following statement is true: $\begin{matrix} (ℓ, s) \in \underset{j \in E_{i}}{\cup} \{j\} \times (D_{j} \setminus E_{i}) \Leftrightarrow \exists j_{0} \in E_{i} \ \text{s.t.} \ (ℓ, s) \in \{j_{0}\} \times (D_{j_{0}} \setminus E_{i}) \\ \Leftrightarrow \exists j_{0} \in E_{i} \ \text{s.t.} \ ℓ = j_{0} \land s \in D_{j_{0}} \setminus E_{i} . \end{matrix}$

The condition $s \in D_{j_{0}} \setminus E_{i}$ implies $s \in F_{i}$ and $s \in D_{j_{0}} \land s \notin E_{i} \Leftrightarrow s \in D_{j_{0}} \land \exists! ℓ_{0} \in \{1, \dots, b\} \setminus \{i\} \ \text{s.t.} \ s \in E_{ℓ_{0}} .$

These results imply $s \in E_{ℓ_{0}} \cap F_{i}$ and hence $ℓ_{0} \in D_{i}^{*} .$ Notice that $(ℓ, s) \in \{j_{0}\} \times (E_{ℓ_{0}} \cap D_{j_{0}}) .$ Hence, we have $(ℓ, s) \in \underset{ℓ \in D_{i}^{*}}{\cup} J_{i ℓ} .$

Next, we show that $\underset{j \in E_{i}}{\cup} \{j\} \times (D_{j} \setminus E_{i}) \supset \underset{ℓ \in D_{i}^{*}}{\cup} J_{i ℓ} .$

Let $(ℓ, s)$ be an element of the above RHS. Then, the following statement is true: $\begin{matrix} (ℓ, s) \in \underset{ℓ \in D_{i}^{*}}{\cup} J_{i ℓ} \Leftrightarrow \exists ℓ_{0} \in D_{i}^{*} \ \text{s.t.} \ (\exists j_{0} \in E_{i} \ \text{s.t.} \ (ℓ, s) \in \{j_{0}\} \times (E_{ℓ_{0}} \cap D_{j_{0}})) \\ \Rightarrow ℓ = j_{0} \land s \in E_{ℓ_{0}} \cap D_{j_{0}} . \end{matrix}$

Moreover, we see that $s \notin E_{i}$ because $ℓ_{0} \in D_{i}^{*} .$ Hence, we have $(ℓ, s) \in \underset{j \in E_{i}}{\cup} \{j\} \times (D_{j} \setminus E_{i}) .$

Consequently, Lemma C.1 is proved.
