
Coordinate optimization for generalized fused Lasso

Pages 5955-5973 | Received 17 Dec 2020, Accepted 12 May 2021, Published online: 16 Jul 2021

Abstract

The fused Lasso is an extension of the Lasso that shrinks differences between parameters. We focus on a general form of it called the generalized fused Lasso (GFL). The optimization problem for GFL can be reduced to that for the generalized Lasso and can be solved via a path algorithm for the generalized Lasso. Moreover, the path algorithm is implemented in the genlasso package in R. However, the genlasso package has some computational problems. We therefore apply a coordinate descent algorithm (CDA) to solve the optimization problem for GFL. We give the update equations of the CDA in closed form, without considering the Karush-Kuhn-Tucker conditions. Furthermore, we show an application of the CDA to a real data analysis.

1. Introduction

Suppose that there are m groups and $n_j$ pairs of data $\{y_{j,i}, x_{j,i}\}\ (i=1,\ldots,n_j)$ are observed for group $j\in\{1,\ldots,m\}$, where $y_{j,i}$ is a response variable and $x_{j,i}$ is a p-dimensional vector of non-stochastic explanatory variables. Let $y_j$ be an $n_j$-dimensional vector defined by $y_j=(y_{j,1},\ldots,y_{j,n_j})'$ and $X_j$ be an $n_j\times p$ matrix defined by $X_j=(x_{j,1},\ldots,x_{j,n_j})'$. For group j, we consider the following linear regression model: (1) $y_j=X_j\beta+\mu_j 1_{n_j}+\varepsilon_j$, where $\beta$ is a p-dimensional vector of regression coefficients, $\mu_j$ is a location parameter for group j, $1_n$ is an n-dimensional vector of ones, and $\varepsilon_j$ is an $n_j$-dimensional vector of error variables from a distribution with mean 0 and variance $\sigma^2$. In addition, we assume that the vectors $\varepsilon_1,\ldots,\varepsilon_m$ are independent. Let y be an n-dimensional vector defined by $y=(y_1',\ldots,y_m')'$, X be an $n\times p$ matrix defined by $X=(X_1',\ldots,X_m')'$, $\mu$ be an m-dimensional vector defined by $\mu=(\mu_1,\ldots,\mu_m)'$, R be an $n\times m$ block diagonal matrix defined by $R=\mathrm{diag}(1_{n_1},\ldots,1_{n_m})$, and $\varepsilon$ be an n-dimensional vector defined by $\varepsilon=(\varepsilon_1',\ldots,\varepsilon_m')'$, where $n=\sum_{j=1}^m n_j$. Then, the m models in (1) are expressed as $y=X\beta+R\mu+\varepsilon$.

In this paper, although we deal with a linear regression model, we consider the case in which the location parameters of the model differ across groups; that is, the location parameters can be regarded as expressing individual effects of the groups. We call $\mu_j$ the group effect of group j. In this case, one point of interest is which group effects are equal. Hence, we approach this by a penalized least squares method. That is, $\mu$ is estimated by minimizing the following penalized residual sum of squares (PRSS): (2) $\|y-X\beta-R\mu\|_2^2+\lambda\sum_{j=1}^m\sum_{\ell\in D_j}w_{j\ell}|\mu_j-\mu_\ell|$, where $\lambda$ is a positive tuning parameter, $D_j\subseteq\{1,\ldots,m\}\setminus\{j\}$ is an index set, and $w_{j\ell}$ is a positive weight satisfying $w_{j\ell}=w_{\ell j}$. This is called the generalized fused Lasso (GFL). The original form of GFL was proposed by Tibshirani et al. (Citation2005) and is called the fused Lasso (FL). When $D_1=\emptyset$, $D_j=\{j-1\}\ (j=2,\ldots,m)$, and $w_{j\ell}=1$, GFL coincides with FL. GFL shrinks $|\mu_j-\mu_\ell|$ toward 0, so the estimates of $\mu_j$ and $\mu_\ell$ are often exactly equal. Hence, using GFL, we can identify groups with equal group effects.
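For concreteness, the PRSS in (2) can be evaluated as in the following minimal R sketch (not the authors' code). It assumes that the index sets $D_j$ are stored as a list D of integer vectors and the weights $w_{j\ell}$ as a symmetric m x m matrix w; these names are illustrative only.

gfl_prss <- function(y, X, R, beta, mu, lambda, D, w) {
  rss <- sum((y - X %*% beta - R %*% mu)^2)     # ||y - X beta - R mu||_2^2
  pen <- 0
  for (j in seq_along(mu)) {
    for (l in D[[j]]) pen <- pen + w[j, l] * abs(mu[j] - mu[l])
  }
  rss + lambda * pen
}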

To obtain the GFL estimator of $\mu$, we need an optimization method for minimizing (2). For FL, Friedman et al. (Citation2007) proposed a coordinate descent algorithm (CDA). In other cases, the optimization problem can be reduced to that of the ordinary Lasso (Tibshirani Citation1996) under some conditions (e.g., Sun, Wang, and Fuentes Citation2016; Li and Sang Citation2019). In this paper, we deal with the case in which the above methods cannot be applied. That is, we assume that $\sum_{j=1}^m\#(D_j)>m$. Fortunately, GFL can be expressed as a generalized Lasso (Tibshirani and Taylor Citation2011). The generalized Lasso type penalty is given by $\|D\mu\|_1$, where D is a penalty matrix. For FL, D is the $(m-1)\times m$ matrix given by $D=\begin{pmatrix}1&-1&0&\cdots&0\\ \vdots&\ddots&\ddots&\ddots&\vdots\\ 0&\cdots&0&1&-1\end{pmatrix}$.
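The FL penalty matrix above can be built directly; the following small R sketch (illustrative only) constructs it and shows its shape for m = 4.

make_fl_D <- function(m) {
  D <- matrix(0, m - 1, m)
  for (j in seq_len(m - 1)) {
    D[j, j]     <- 1      # coefficient of mu_j
    D[j, j + 1] <- -1     # coefficient of mu_{j+1}
  }
  D
}
make_fl_D(4)
#      [,1] [,2] [,3] [,4]
# [1,]    1   -1    0    0
# [2,]    0    1   -1    0
# [3,]    0    0    1   -1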

In addition, Tibshirani and Taylor (Citation2011) proposed an optimization algorithm for the generalized Lasso, and the algorithm is implemented in the genlasso package (e.g., Arnold and Tibshirani Citation2019) in R (e.g., R Core Team Citation2019). However, optimization with the genlasso package has a high computational cost and cannot be practically executed for large samples. Moreover, because of numerical error, estimates that should coincide are not exactly equal. Accordingly, we focus on a CDA.

To calculate the GFL estimator of $\mu$ accurately, even for large samples, we give the update equations of the CDA in closed form. Our algorithm has advantages over several existing algorithms, for example, those of Friedman et al. (Citation2007) and Tibshirani and Taylor (Citation2011). Their algorithms work via the Lagrange dual problem and must check several conditions, for example, the Karush-Kuhn-Tucker conditions and optimality conditions. In contrast, with our algorithm, the optimization problem for GFL can be solved by elementary mathematics without considering these conditions. Moreover, although our algorithm reduces the search range of the minimizer for the jth coordinate direction to $2\#(D_j)+1$ discrete points given in closed form, a numerical search, such as comparing the objective values at each point in the reduced search range, is not required. By checking simple conditions, the unique minimizer for a coordinate direction can be determined. This minimizer is given in closed form and guarantees minimization along the coordinate direction.

The remainder of the paper is organized as follows: In Section 2, we give closed-form update equations of the CDA for GFL. Numerical examples are discussed in Section 3. Section 4 concludes the paper. Technical details are provided in the Appendix.

2. Main results

We derive the update equations of the CDA for obtaining the GFL estimator of $\mu$. A CDA updates a solution along coordinate directions. The CDA for FL consists of a descent cycle and a fusion cycle. The descent cycle successively minimizes along coordinate directions. The ordinary CDA consists only of the descent cycle. However, when the ordinary CDA is applied to FL, some estimates become equal and then the solution gets stuck, failing to reach the minimum. To avoid this problem, Friedman et al. (Citation2007) introduced the fusion cycle. Since this problem can also occur for GFL, we give update equations for both the descent cycle and the fusion cycle. In this section, we regard $\beta$ as fixed at $\beta=\hat\beta$ and minimize the following PRSS: (3) $\|\tilde y-R\mu\|_2^2+\lambda\sum_{j=1}^m\sum_{\ell\in D_j}w_{j\ell}|\mu_j-\mu_\ell|$, where $\tilde y=y-X\hat\beta$.

2.1. Descent cycle

The descent cycle minimizes along coordinate directions. That is, (3) is minimized with respect to $\mu_i$, and this is repeated for $i=1,\ldots,m$. We fix $\mu_j\ (j\in\{1,\ldots,m\}\setminus\{i\})$ at $\mu_j=\hat\mu_j$. Then, (3) can be rewritten as a function of $\mu_i$ by the following lemma (the proof is given in Appendix A):

Lemma 2.1.

Equation (3) can be expressed as the following function of $\mu_i\ (i\in\{1,\ldots,m\})$: (4) $n_i\mu_i^2-2\tilde y_i'1_{n_i}\mu_i+2\lambda\sum_{j\in D_i}w_{ij}|\mu_i-\hat\mu_j|+u_i$, where $\tilde y_i$ is the ith block of $\tilde y$, that is, $\tilde y_i=y_i-X_i\hat\beta$, and $u_i$ is a term that does not depend on $\mu_i$.

From Lemma 2.1, essentially, it is sufficient to minimize the following function: (5) $\phi(x)=c_2x^2-2c_1x+2\lambda\sum_{j=1}^r w_j|x-b_j|\quad(c_2,w_1,\ldots,w_r>0)$.

Let $t_0=-\infty$ and let $t_1,\ldots,t_r$ be the order statistics of $b_1,\ldots,b_r$, that is, $t_1=\min\{b_1,\ldots,b_r\}$ and $t_j=\min\{\{b_1,\ldots,b_r\}\setminus\{t_1,\ldots,t_{j-1}\}\}\ (j=2,\ldots,r)$; let $J_a^+$ and $J_a^-$ be index sets for $a\in\{0,1,\ldots,r\}$ defined by $J_a^+=\{j\in\{1,\ldots,r\}\mid b_j\le t_a\}$ and $J_a^-=\{j\in\{1,\ldots,r\}\mid b_j>t_a\}$; and let $R_a$ be the range defined by $R_a=(t_a,t_{a+1}]\ (a=0,1,\ldots,r-1)$ and $R_r=(t_r,\infty)$.

Moreover, we define $\tilde w_a$ and $v_a$ for $a\in\{0,1,\ldots,r\}$ as $\tilde w_a=\sum_{j\in J_a^+}w_j-\sum_{j\in J_a^-}w_j$ and $v_a=(c_1-\lambda\tilde w_a)/c_2$.

Then, the minimizer of (5) is given by the following theorem (the proof is given in Appendix B):

Theorem 2.2.

Let $a^*$ and $a^{**}$ be non-negative integers defined by $a^*\in\{0,1,\ldots,r\}$ s.t. $v_{a^*}\in R_{a^*}$, and $a^{**}\in\{1,\ldots,r\}$ s.t. $t_{a^{**}}\in[v_{a^{**}},v_{a^{**}-1})$.

Then, exactly one of $a^*$ and $a^{**}$ exists, and the minimizer of $\phi(x)$ in (5) is given by $\hat x=v_{a^*}$ if $a^*$ exists and $\hat x=t_{a^{**}}$ if $a^{**}$ exists.

Theorem 2.2 gives the unique minimizer of $\mu_i$ in (4). Hence, by applying Theorem 2.2 to (4) for $i=1,\ldots,m$, we can obtain the solution of $\mu$ in the descent cycle. In Theorem 2.2, when $a^{**}$ exists, the estimate of $\mu_i$ is equal to $\hat\mu_j$ (the $\hat\mu_j$ satisfying $\hat\mu_j=t_{a^{**}}$), which means that the group effects of groups i and j are equal.
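The following minimal R sketch (not the authors' code) implements the closed-form minimizer of Theorem 2.2 for $\phi(x)=c_2x^2-2c_1x+2\lambda\sum_{j=1}^r w_j|x-b_j|$; the function name phi_min and the argument names are illustrative only.

phi_min <- function(c1, c2, lambda, w, b) {
  r    <- length(b)
  tvec <- sort(b)                          # order statistics t_1 <= ... <= t_r
  wt <- sapply(0:r, function(a) {          # w_tilde_a for a = 0, 1, ..., r
    ta <- if (a == 0) -Inf else tvec[a]
    sum(w[b <= ta]) - sum(w[b > ta])
  })
  v  <- (c1 - lambda * wt) / c2            # v_0, v_1, ..., v_r
  tt <- c(-Inf, tvec, Inf)                 # t_0 = -Inf and a sentinel t_{r+1} = Inf
  for (a in 0:r) {                         # case 1: v_a lies in R_a = (t_a, t_{a+1}]
    if (v[a + 1] > tt[a + 1] && v[a + 1] <= tt[a + 2]) return(v[a + 1])
  }
  for (a in 1:r) {                         # case 2: t_a lies in [v_a, v_{a-1})
    if (tvec[a] >= v[a + 1] && tvec[a] < v[a]) return(tvec[a])
  }
  stop("no minimizer found")               # cannot happen by Theorem 2.2
}

In the descent cycle for coordinate i, one would call it with c2 = $n_i$, c1 = the sum of the residuals in the ith block (that is, $\tilde y_i'1_{n_i}$), w = the weights $w_{i\ell}\ (\ell\in D_i)$, and b = the current estimates $\hat\mu_\ell\ (\ell\in D_i)$.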

2.2. Fusion cycle

The fusion cycle prevents the solution from getting stuck when some estimates become equal in the descent cycle. Suppose that we obtain $\hat\mu_j=\hat\mu_\ell$ as the estimates of $\mu_j$ and $\mu_\ell\ (\ell\ne j)$ in the descent cycle. Then, to avoid the solutions of $\mu_j$ and $\mu_\ell$ getting stuck, we set $\eta_i=\mu_j=\mu_\ell$ and minimize (3) along the $\eta_i$-axis direction.

After the descent cycle, suppose that we obtain $\hat\mu_1,\ldots,\hat\mu_m$ as the estimates of $\mu_1,\ldots,\mu_m$. Then, let $\hat\eta_1,\ldots,\hat\eta_b\ (b\le m)$ be the distinct values of $\hat\mu_1,\ldots,\hat\mu_m$ and define the index sets $E_1,\ldots,E_b$ as $E_j=\{\ell\in\{1,\ldots,m\}\mid\hat\mu_\ell=\hat\eta_j\}$, where $E_j\ne\emptyset$ and $E_j\cap E_\ell=\emptyset\ (j\ne\ell)$. If b < m, the fusion cycle is executed. In the fusion cycle, (3) is minimized with respect to $\eta_i$, and this is repeated for $i\in\{j\in\{1,\ldots,b\}\mid\#(E_j)\ge 2\}$. We fix $\eta_j\ (j\in\{1,\ldots,b\}\setminus\{i\})$ at $\eta_j=\hat\eta_j$. Moreover, we define $D_j^*\subseteq\{1,\ldots,b\}\setminus\{j\}\ (j\in\{1,\ldots,b\})$ and $w_{j\ell}^*\ (j\in\{1,\ldots,b\};\ \ell\in D_j^*)$ as $D_j^*=\{\ell\in\{1,\ldots,b\}\setminus\{j\}\mid E_\ell\cap F_j\ne\emptyset\}$, $F_j=\bigcup_{\ell\in E_j}D_\ell\setminus E_j$, $w_{j\ell}^*=\sum_{(i,s)\in J_{j\ell}}w_{is}$, and $J_{j\ell}=\bigcup_{i\in E_j}\{i\}\times(E_\ell\cap D_i)$.
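Forming the sets $E_1,\ldots,E_b$ is straightforward because the fused estimates produced by Theorem 2.2 coincide exactly, not merely up to numerical error. A minimal R sketch (with the illustrative name mu_hat for the current estimates) is:

eta_hat <- unique(mu_hat)                              # distinct values eta_hat_1, ..., eta_hat_b
E       <- lapply(eta_hat, function(e) which(mu_hat == e))
b       <- length(E)
# the fusion cycle is run only when b < length(mu_hat), i.e. when at least
# one E_j contains two or more groups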

Then, (3) can be rewritten as a function of $\eta_i$ by the following lemma (the proof is given in Appendix C):

Lemma 2.3.

Equation (3) can be expressed as the following function of $\eta_i\ (i\in\{1,\ldots,b\})$: (6) $\left(\sum_{j\in E_i}n_j\right)\eta_i^2-2\left(\sum_{j\in E_i}\tilde y_j'1_{n_j}\right)\eta_i+2\lambda\sum_{\ell\in D_i^*}w_{i\ell}^*|\eta_i-\hat\eta_\ell|+u_i$, where $u_i$ is a term that does not depend on $\eta_i$.

From Lemma 2.3, we find that the minimization of (6) is essentially the same as that of (5). Hence, by applying Theorem 2.2 to (6) for $i\in\{j\in\{1,\ldots,b\}\mid\#(E_j)\ge 2\}$, we can obtain the solution of $\mu$ in the fusion cycle.

2.3. Coordinate descent algorithm for GFL

In the previous subsections, we described the descent and fusion cycles and showed that the solution at each step is given by Theorem 2.2. From these results, the CDA for GFL is summarized as follows (a condensed R sketch of the descent cycle is given after the steps):

  • Coordinate Descent Algorithm for GFL

Input: λ and an initial vector of μ

Output: The optimal solution of μ

  • Step 1. Run the descent cycle:

  • Update $\hat\mu_i$ by applying Theorem 2.2 to (4) for $i\in\{1,\ldots,m\}$ and define b. If b < m, go to Step 2. If not, go to Step 3.

  • Step 2. Run the fusion cycle if b < m:

  • Update $\hat\eta_i$ by applying Theorem 2.2 to (6) for $i\in\{j\in\{1,\ldots,b\}\mid\#(E_j)\ge 2\}$.

  • Step 3. Check convergence:

  • If μ̂ converges, the algorithm completes. If not, return to Step 1.
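As an illustration, the following condensed R sketch (not the authors' code) runs the descent cycle together with a convergence check, calling the phi_min function sketched in Subsection 2.1; the fusion cycle of Subsection 2.2 is omitted for brevity. Here ytil_sum[i] denotes $\tilde y_i'1_{n_i}$, n_g the vector of group sizes, and D and w are as in the earlier sketches; all names are illustrative.

gfl_descent <- function(ytil_sum, n_g, lambda, D, w, mu_hat,
                        tol = 1 / 10000, maxit = 500) {
  m <- length(mu_hat)
  for (it in seq_len(maxit)) {
    mu_old <- mu_hat
    for (i in seq_len(m)) {                     # descent cycle: coordinate-wise update
      mu_hat[i] <- phi_min(c1 = ytil_sum[i], c2 = n_g[i], lambda = lambda,
                           w = w[i, D[[i]]], b = mu_hat[D[[i]]])
    }
    # convergence check (the criterion described in Section 3 is used here)
    if (max((mu_hat - mu_old)^2) <= tol * max(mu_old^2)) break
  }
  mu_hat
}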

When we use this algorithm, the selection of λ is very important because λ adjusts the strength of the penalty term; that is, the value of λ changes the estimate of $\mu$. If λ = 0, the GFL estimator of $\mu$ is equal to the least squares estimator of $\mu$. As λ increases, more estimates become equal, that is, b becomes smaller. For a sufficiently large λ, denoted by $\lambda_{\max}$, all estimates are equal. Hence, the optimal λ should be searched for in the range $[0,\lambda_{\max}]$. When all estimates are equal, that is, $\lambda=\lambda_{\max}$, the estimates are given by $\hat\mu_1=\cdots=\hat\mu_m=\bar\mu=\tilde y'1_n/n$. Then, from the update equation of the descent cycle, $\lambda_{\max}$ is given by (7) $\lambda_{\max}=\max\left\{\max_{j\in\{1,\ldots,m\}}\dfrac{\bar\mu n_j-\tilde y_j'1_{n_j}}{\sum_{\ell\in D_j}w_{j\ell}},\ \max_{j\in\{1,\ldots,m\}}\dfrac{\tilde y_j'1_{n_j}-\bar\mu n_j}{\sum_{\ell\in D_j}w_{j\ell}}\right\}$.
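A minimal R sketch of (7), using the same illustrative names as above, is:

gfl_lambda_max <- function(ytil_sum, n_g, D, w) {
  mu_bar <- sum(ytil_sum) / sum(n_g)            # common value when everything is fused
  wsum   <- sapply(seq_along(n_g), function(j) sum(w[j, D[[j]]]))
  max((mu_bar * n_g - ytil_sum) / wsum,
      (ytil_sum - mu_bar * n_g) / wsum)
}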

Thus, for each $\lambda\in[0,\lambda_{\max}]$, $\mu$ is estimated via the CDA for GFL, and the optimal λ is selected by, for example, minimizing a model selection criterion.

3. Numerical studies

In this section, we present numerical simulations, discuss estimation accuracy, and consider an illustrative application to an actual data set. The numerical calculation programs are executed in R (ver. 3.6.0) on a computer with a Windows 10 Pro operating system, an Intel(R) Core(TM) i7-7700 processor, and 16 GB of RAM.

In these studies, the explanatory variables include dummy variables with three or more categories. We therefore split X and $\beta$ as $X=(A_1,\ldots,A_k)$ and $\beta=(\beta_1',\ldots,\beta_k')'$, where $k\ (\le p)$ is the number of explanatory variables, $A_\ell$ is an $n\times p_\ell$ matrix for the $\ell$th explanatory variable, $\beta_\ell$ is a $p_\ell$-dimensional vector of regression coefficients for $A_\ell$, and $p_\ell$ satisfies $p_\ell\ge 1$ and $p=\sum_{\ell=1}^k p_\ell$. In particular, when $p_\ell=1$, we write $A_\ell=a_\ell$ and treat $\beta_\ell$ as a scalar. Assume that X is scaled. Then, the following equations hold: $\|a_\ell\|_2=1$ (for $\ell$ s.t. $p_\ell=1$) and $A_\ell'A_\ell=I_{p_\ell}$ (for $\ell$ s.t. $p_\ell\ge 2$).

Although we execute the CDA for GFL, we also select explanatory variables at the same time. The explanatory variables are selected by the CDA for the group Lasso (Yuan and Lin Citation2006). Then, the estimators of $\beta$ and $\mu$ are given by $\hat\beta_{\lambda_1}=\arg\min_\beta\{\|y-X\beta-R\hat\mu\|_2^2+\lambda_1\sum_{j=1}^k w_{1,j}\|\beta_j\|_2\}$ and $\hat\mu_{\lambda_2}=\arg\min_\mu\{\|y-X\hat\beta-R\mu\|_2^2+\lambda_2\sum_{j=1}^m\sum_{\ell\in D_j}w_{2,j\ell}|\mu_j-\mu_\ell|\}$, where $\lambda_1$ and $\lambda_2$ are positive tuning parameters. The update equation of the CDA for the group Lasso is given by $\hat\beta_{\lambda_1,j}=\left(1-\frac{\lambda_1 w_{1,j}}{2\|c_j\|_2}\right)_+c_j$, where $c_j=A_j'(y-R\hat\mu-\sum_{\ell\ne j}A_\ell\hat\beta_\ell)$ and $(x)_+=\max\{0,x\}$. In particular, when $p_j=1$, the update equation is given by $\hat\beta_{\lambda_1,j}=S(c_j,\lambda_1 w_{1,j}/2)$, where S(x, a) is the soft-thresholding operator (e.g., Donoho and Johnstone Citation1994), that is, $S(x,a)=\mathrm{sign}(x)(|x|-a)_+$. For the penalty weights, the inverse of the $\ell_2$-norm (absolute value) of an estimator is often used (e.g., Zou Citation2006). Hence, we use the following weights: $w_{1,j}=1/\|\tilde\beta_j\|_2\ (j\in\{1,\ldots,k\})$ and $w_{2,j\ell}=1/|\tilde\mu_j-\tilde\mu_\ell|\ (j\in\{1,\ldots,m\},\ \ell\in D_j)$, where $\tilde\beta_j$ and $\tilde\mu_j$ are the least squares estimators of $\beta_j$ and $\mu_j$, respectively. These estimators are obtained by the following algorithm (an R sketch of the group Lasso block update used in Step 1-2 is given after the algorithm):

  • Alternate Optimization Algorithm

Input: Initial vectors of β and μ

Output: The optimal solutions of β and μ

  • Step 1. Optimize λ1 and β:

    1-1. Decide searching points of λ1.

  • 1-2. For fixed λ1, calculate β̂λ1 by the CDA for group Lasso.

  • 1-3. Repeat 1-2 for all λ1 and select the optimal λ1 based on minimizing a model selection criterion.

  • Step 2. Optimize λ2 and μ:

    2-1. Decide searching points of λ2.

  • 2-2. For fixed λ2, calculate μ̂λ2 by the CDA for GFL or the genlasso package.

  • 2-3. Repeat 2-2 for all λ2 and select the optimal λ2 based on minimizing a model selection criterion.

  • Step 3: Check convergence:

If $\hat\beta$ and $\hat\mu$ both converge, the algorithm completes. If not, return to Step 1.
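The group Lasso block update in Step 1-2 can be sketched in R as follows (illustrative only, under the scaling $A_j'A_j=I_{p_j}$ assumed above). Here partial_resid stands for $y-R\hat\mu-\sum_{\ell\ne j}A_\ell\hat\beta_\ell$ and is assumed to be computed by the caller.

soft_thresh <- function(x, a) sign(x) * pmax(abs(x) - a, 0)   # S(x, a)

update_beta_j <- function(Aj, partial_resid, lambda1, w1j) {
  cj <- crossprod(Aj, partial_resid)    # c_j = A_j'(y - R mu_hat - sum_{l != j} A_l beta_l)
  if (ncol(Aj) == 1) {
    soft_thresh(drop(cj), lambda1 * w1j / 2)                        # p_j = 1
  } else {
    drop(max(0, 1 - lambda1 * w1j / (2 * sqrt(sum(cj^2)))) * cj)    # p_j >= 2
  }
}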

The tuning parameters $\lambda_1$ and $\lambda_2$ are each searched over 100 points given by $\lambda_{\max}(3/4)^{j-1}\ (j=1,\ldots,100)$, where $\lambda_{\max}$ is given by $\lambda_{\max}=\max_{\ell\in\{1,\ldots,k\}}2\|A_\ell'(y-R\hat\mu)\|_2/w_{1,\ell}$ for $\lambda_1$ and by (7) for $\lambda_2$.

The optimal tuning parameters are selected by minimizing the extended GCV (EGCV) criterion (Ohishi, Yanagihara, and Fujikoshi Citation2020), defined by $\mathrm{EGCV}=\dfrac{(\text{residual sum of squares})/n}{\{1-(\text{degrees of freedom})/n\}^{\alpha}}$, where $\alpha$ is a positive value expressing the strength of the penalty for model complexity. We use the EGCV criterion with $\alpha=\log n$. For an m-dimensional vector $\theta^{(i)}$, we regard $\theta^{(i)}$ as having converged when the following inequality holds: $\max_{j\in\{1,\ldots,m\}}(\theta_j^{(i+1)}-\theta_j^{(i)})^2\big/\max_{j\in\{1,\ldots,m\}}(\theta_j^{(i)})^2\le 1/10{,}000$, where i is the iteration number and $\theta_j^{(i)}$ is the jth element of $\theta^{(i)}$.
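Small R sketches of these two rules (illustrative only; rss, df, and the vectors theta_new and theta_old are assumed to be supplied by the caller) are:

egcv <- function(rss, df, n, alpha = log(n)) {
  (rss / n) / (1 - df / n)^alpha              # EGCV criterion with alpha = log(n)
}

has_converged <- function(theta_new, theta_old) {
  max((theta_new - theta_old)^2) / max(theta_old^2) <= 1 / 10000
}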

3.1. Simulation

In this subsection, we compare the estimation accuracies of the following Method 1 and Method 2 by simulation.

  • Method 1: The Alternate Optimization Algorithm using the CDA for GFL to optimize μ.

  • Method 2: The Alternate Optimization Algorithm using the genlasso package (ver. 1.4) to optimize μ.

The number of groups is m = 10, 20, the correlation between explanatory variables is ρ = 0.5, 0.8, and the sample sizes of the groups are $n_1=\cdots=n_m=n_0$. Then, the total sample size is $n=mn_0$ and we use $n_0=100, 200, 500, 1000$. Figure 1 shows the simulation groups when m = 10, 20, with adjacency relationships indicated by lines. The figure means, for example, that $D_1=\{2,3,4,5,6\}$ and $D_2=\{1,3,6,7,8\}$ when m = 10. We generated data from the simulation model $N_n(X\beta+R\mu,\ I_n)$ with the following X: $X=(a_1,\ldots,a_8,A_9,\ldots,A_{13})$, where the column vectors $a_1,\ldots,a_8$ and block matrices $A_9,\ldots,A_{13}$ are constructed by the following procedure. Let $u_1,\ldots,u_{14}$ be independent n-dimensional vectors whose elements are independently and identically distributed according to U(0, 1), and let $v_1,\ldots,v_{13}$ be n-dimensional vectors defined by $v_j=\omega u_{14}+(1-\omega)u_j$, where $\omega$ is the parameter setting the correlation of $v_i$ and $v_j\ (i\ne j)$ to ρ, defined by $\omega=\dfrac{\rho\pm\sqrt{\rho^2-\rho(2\rho-1)}}{2\rho-1}$ for $\rho\ne 1/2$ and $\omega=1/2$ for $\rho=1/2$.
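A minimal R sketch of this data-generating step (not the authors' code) is given below; it picks the root of the quadratic equation for ω that lies in [0, 1].

make_v <- function(n, rho) {
  u <- matrix(runif(n * 14), n, 14)           # u_1, ..., u_14 with U(0, 1) elements
  omega <- if (rho == 1/2) {
    1/2
  } else {
    (rho - sqrt(rho^2 - rho * (2 * rho - 1))) / (2 * rho - 1)
  }
  sapply(1:13, function(j) omega * u[, 14] + (1 - omega) * u[, j])   # v_1, ..., v_13
}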

Figure 1. Simulation groups and adjacent relationship. (a) m = 10. (b) m = 20.


Using these vectors $v_1,\ldots,v_{13}$, we define the blocks of X as follows. Let $a_j=v_j$ for $j=1,\ldots,5$; let $a_j\ (j=6,7,8)$ be dummy variables taking the value 1 or 0, defined by $a_{j,i}=1$ if $v_{j,i}>0.6$ and $a_{j,i}=0$ if $v_{j,i}\le 0.6$ $(i\in\{1,\ldots,n\})$; and let $A_j\ (j=9,\ldots,13)$ be $(j-7)$-dimensional dummy-variable blocks for categorical variables, defined by (the ith row vector of $A_j$) $=e_{j-7,\ell}$ if $v_{j,i}\in Q_{j-6,\ell}\ (\ell=1,\ldots,j-7)$ and $=0_{j-7}$ if $v_{j,i}\in Q_{j-6,j-6}$ $(i=1,\ldots,n)$, where $e_{j,\ell}$ is a j-dimensional vector whose $\ell$th element is 1 and whose other elements are 0, and $Q_{j,\ell}$ is the $\ell$th range when [0, 1] is split into j ranges. The following two cases are used for $\beta$ and $\mu$.

  • Case 1: Let the number of true explanatory variables be $k^*=9$ and the number of true joins of groups be $m^*=3$ (m = 10) or $m^*=6$ (m = 20), and we use the following $\beta$ and $\mu$: $\beta=(1,2,3,0,0,1,1,2,1_2',0_3',2\times 1_4',0_5',3\times 1_6')'$ and $\mu_j=\ell$ for $j\in E_\ell\ (\ell=1,\ldots,m^*)$, where $E_\ell$ denotes the $\ell$th set of truly joined groups.

  • Case 2: Let the number of true explanatory variables be $k^*=3$ and the number of true joins of groups be $m^*=6$ (m = 10) or $m^*=12$ (m = 20), and we use the following $\beta$ and $\mu$: $\beta=(1,0,0,0,0,1,0,0,0_2',0_3',2\times 1_4',0_5',0_6')'$ and $\mu_j=\ell$ for $j\in E_\ell\ (\ell=1,\ldots,m^*)$.

Figures 2 and 3 show the true joins of groups when m = 10 and m = 20, respectively. Estimation accuracy is evaluated by the selection probabilities of the true variables and true joins in a Monte Carlo simulation with 1000 iterations. Tables 1 and 2 show the selection probabilities (SP) of the true variables and true joins and the running times (RT) of the programs in Cases 1 and 2, respectively. The SP is displayed as the combined probability (C-SP) and as separate probabilities for variables and joins. From the tables, since the SP of Method 1 approaches 100% as the sample size increases, we find that Method 1 has high estimation accuracy. On the other hand, Method 2 struggles to select the true variables and true joins, and its SP is at most 7.4% (when m = 10, ρ = 0.8, and n = 5000 in Case 1). In particular, it struggles to select the true joins, for which its SP is at most 7.8% (when m = 10, ρ = 0.8, and n = 5000 in Case 1). Moreover, in terms of running time, Method 1 is up to about 134 times faster than Method 2 (when m = 20, ρ = 0.5, and n = 20,000 in Case 1).

Figure 2. True joins when m = 10. (a) Case 1. (b) Case 2.


Figure 3. True joins when m = 20. (a) Case 1. (b) Case 2.


Table 1. Selection probabilities and running times in Case 1.

Table 2. Selection probabilities and running times in Case 2.

3.2. A real data example

In this subsection, we present an illustrative application of the proposed method (Method 1 in Subsection 3.1) to an actual data set. Method 2 in Subsection 3.1 cannot be run on these data because the genlasso package runs out of memory for such a large sample; this is the motivation for this paper. The data pertain to studio apartment rents and environmental conditions in Tokyo’s 23 wards collected by Tokyo Kantei Co., Ltd. Here, n = 61,999 and all data were collected between April 2014 and April 2015 (Table 3). In this application, the response variable is the monthly rent and the remaining items are the explanatory variables. We estimate regional effects for 852 areas into which Tokyo’s 23 wards are split, using the proposed method. Figure 4 shows the split of Tokyo’s 23 wards into the 852 areas. Figure 5 shows the estimates and the clustering result of the regional effects in the form of choropleth maps; the 852 areas in Tokyo’s 23 wards are clustered into 190 areas. Table 4 summarizes the estimates of the regression coefficients. As a result of variable selection, B5 and C1 are not selected. Figure 6 shows residual plots for the quantitative variables. Table 5 provides the coefficient of determination (R²), the median error rate (MER), and the running time. From the results, R² is more than 0.8, MER is less than 10%, and the residual plots are unproblematic. Thus, the proposed method performs well. In this application, we applied the CDA for GFL to large sample data for which the genlasso package ran out of memory; nevertheless, the CDA completed in only about two minutes. This is a major contribution of this work.

Figure 4. The 852 areas in Tokyo’s 23 wards.


Figure 5. Regional effects estimation results. (a) Estimates. (b) Clustering.


Figure 6. Residual plots. (a) Floor area (A). (b) Age (C3). (c) Interaction (C4). (d) Walking time (C5).


Table 3. Data items.

Table 4. Regression coefficient estimates.

Table 5. Model fit and run time.

4. Conclusion

In this paper, we addressed coordinate optimization for GFL and derived closed-form update equations for the descent cycle and the fusion cycle. The update equations are obtained using only elementary mathematics, and complex conditions (e.g., the Karush-Kuhn-Tucker conditions) are not required. In the numerical studies, we found that the proposed method gives accurate solutions more rapidly than the existing method based on the genlasso package and can be applied even to large sample data for which the genlasso package runs out of memory.

Our proposed method was derived for a linear regression model. However, the method can also be applied to models that are not linear. For example, for generalized linear models, by approximating the negative log-likelihood function by a quadratic function, the GFL-penalized objective function falls into the same class as (3). That is, our proposed method can be applied even to generalized linear models. On the other hand, the method does not work for extensions of the GFL penalty. For example, when a group-GFL penalty is used instead of the GFL penalty (that is, when $\mu_j$ is a vector), the method cannot be applied.

Acknowledgments

The authors thank Dr. Ryoya Oda and Mr. Yuya Suzuki of Hiroshima University, Mr. Yuhei Sato and Mr. Yu Matsumura of Tokyo Kantei Co., Ltd., and Koki Kirishima of Computer Management Co., Ltd., for helpful comments. Moreover, the authors also thank the associate editor and the reviewers for their valuable comments.

Additional information

Funding

The first author’s research was partially supported by JSPS KAKENHI Grant Numbers JP20H04151 and JP21K13834. The second author’s research was partially supported by JSPS KAKENHI Grant Numbers JP17K15842, JP18K10068, JP19H01076, JP20H04151, and JP21K17288. The last author’s research was partially supported by JSPS KAKENHI Grant Numbers JP18K03415 and JP20H0415.

References

Appendix A. Proof of Lemma 2.1

We partition (3) into terms that do and do not depend on $\mu_i$. The first term of (3) can be partitioned as follows: (A1) $\|\tilde y-R\mu\|_2^2=\|\tilde y\|_2^2-2\tilde y'R\mu+\mu'R'R\mu=\|\tilde y\|_2^2-2\left(\sum_{j\ne i}\hat\mu_j\tilde y_j'1_{n_j}+\mu_i\tilde y_i'1_{n_i}\right)+\sum_{j\ne i}n_j\hat\mu_j^2+n_i\mu_i^2=n_i\mu_i^2-2\tilde y_i'1_{n_i}\mu_i+\sum_{j\ne i}(n_j\hat\mu_j^2-2\tilde y_j'1_{n_j}\hat\mu_j)+\|\tilde y\|_2^2$.

Moreover, since $w_{j\ell}=w_{\ell j}$, the second term of (3) can be partitioned as follows: $\sum_{j=1}^m\sum_{\ell\in D_j}w_{j\ell}|\mu_j-\mu_\ell|=\sum_{j\ne i}\sum_{\ell\in D_j}w_{j\ell}|\mu_j-\mu_\ell|+\sum_{\ell\in D_i}w_{i\ell}|\mu_i-\mu_\ell|=\sum_{j\ne i}\sum_{\ell\in D_j\setminus\{i\}}w_{j\ell}|\hat\mu_j-\hat\mu_\ell|+2\sum_{\ell\in D_i}w_{i\ell}|\mu_i-\hat\mu_\ell|$.

Consequently, Lemma 2.1 is proved, where $u_i$ is given by $u_i=\sum_{j\ne i}(n_j\hat\mu_j^2-2\tilde y_j'1_{n_j}\hat\mu_j)+\|\tilde y\|_2^2+\lambda\sum_{j\ne i}\sum_{\ell\in D_j\setminus\{i\}}w_{j\ell}|\hat\mu_j-\hat\mu_\ell|$.

Appendix B. Proof of Theorem 2.2

First, we rewrite (5) in a form without absolute values. Since $t_1,\ldots,t_r$ are the order statistics of $b_1,\ldots,b_r$, the following equation holds when $x\in R_a$: $\sum_{j=1}^r w_j|x-b_j|=\sum_{j\in J_a^+}w_j(x-b_j)+\sum_{j\in J_a^-}w_j(b_j-x)=\tilde w_ax-\left(\sum_{j\in J_a^+}w_jb_j-\sum_{j\in J_a^-}w_jb_j\right)$.

Thus, we have the following form of (5) without absolute values: $\phi(x)=\phi_a(x)=c_2x^2-2(c_1-\lambda\tilde w_a)x-2\lambda\left(\sum_{j\in J_a^+}w_jb_j-\sum_{j\in J_a^-}w_jb_j\right)\quad(x\in R_a;\ a=0,1,\ldots,r)$.

From the above equation, we find that $v_a$ is the x-coordinate of the vertex of the quadratic function $\phi_a(x)$ and that the minimizer of $\phi(x)$ is included in $\{v_0,v_1,\ldots,v_r,t_1,\ldots,t_r\}$. The piecewise function $\phi(x)=\phi_a(x)\ (x\in R_a)$ has the following properties (the proof is given in Appendix D.1):

Lemma B.1.

The function $\phi(x)$ is continuous on $x\in\mathbb{R}$, that is, $\phi_a(t_{a+1})=\phi_{a+1}(t_{a+1})\ (a=0,1,\ldots,r-1)$, and $v_a$ is monotonically decreasing with respect to a.

From this lemma, we have the following lemma (the proof is given in Appendix D.2):

Lemma B.2.

Let $a^*$ and $a^{**}$ be non-negative integers defined by $a^*\in\{0,1,\ldots,r\}$ s.t. $v_{a^*}\in R_{a^*}$, and $a^{**}\in\{1,\ldots,r\}$ s.t. $t_{a^{**}}\in[v_{a^{**}},v_{a^{**}-1})$.

Then, exactly one of $a^*$ and $a^{**}$ exists.

Lemma B.2 says that the existing $v_{a^*}$ or $t_{a^{**}}$ is a local minimizer. In addition, because $\phi(x)$ is a continuous convex function, the local minimizer is the global minimizer of $\phi(x)$. Consequently, Theorem 2.2 is proved.

Appendix C. Proof of Lemma 2.3

From (A1), we have $\|\tilde y-R\mu\|_2^2=\left(\sum_{j\in E_i}n_j\right)\eta_i^2-2\left(\sum_{j\in E_i}\tilde y_j'1_{n_j}\right)\eta_i+\sum_{j\notin E_i}(n_j\hat\mu_j^2-2\tilde y_j'1_{n_j}\hat\mu_j)+\|\tilde y\|_2^2$.

Moreover, the second term of (3) can be partitioned as follows: $\sum_{j=1}^m\sum_{\ell\in D_j}w_{j\ell}|\mu_j-\mu_\ell|=\sum_{j\notin E_i}\sum_{\ell\in D_j}w_{j\ell}|\mu_j-\mu_\ell|+\sum_{j\in E_i}\sum_{\ell\in D_j}w_{j\ell}|\mu_j-\mu_\ell|=\sum_{j\notin E_i}\sum_{\ell\in D_j\setminus E_i}w_{j\ell}|\hat\mu_j-\hat\mu_\ell|+\sum_{j\notin E_i}\sum_{\ell\in D_j\cap E_i}w_{j\ell}|\hat\mu_j-\mu_\ell|+\sum_{j\in E_i}\sum_{\ell\in D_j\setminus E_i}w_{j\ell}|\mu_j-\hat\mu_\ell|+\sum_{j\in E_i}\sum_{\ell\in D_j\cap E_i}w_{j\ell}|\mu_j-\mu_\ell|=\sum_{j\notin E_i}\sum_{\ell\in D_j\setminus E_i}w_{j\ell}|\hat\mu_j-\hat\mu_\ell|+\sum_{j\notin E_i}\sum_{\ell\in D_j\cap E_i}w_{j\ell}|\hat\mu_j-\eta_i|+\sum_{j\in E_i}\sum_{\ell\in D_j\setminus E_i}w_{j\ell}|\eta_i-\hat\mu_\ell|+\sum_{j\in E_i}\sum_{\ell\in D_j\cap E_i}w_{j\ell}|\eta_i-\eta_i|=\sum_{j\notin E_i}\sum_{\ell\in D_j\setminus E_i}w_{j\ell}|\hat\mu_j-\hat\mu_\ell|+2\sum_{j\in E_i}\sum_{\ell\in D_j\setminus E_i}w_{j\ell}|\eta_i-\hat\mu_\ell|$.

All pairs (j, ℓ) appearing in the second term of the above equation form the following set: $\bigcup_{j\in E_i}\{j\}\times(D_j\setminus E_i)$.

Regarding this set, we have the following lemma (the proof is given in Appendix D.3):

Lemma C.1.

The following two sets are equal: $\bigcup_{j\in E_i}\{j\}\times(D_j\setminus E_i)=\bigcup_{\ell\in D_i^*}J_{i\ell}$.

This lemma gives $\sum_{j\in E_i}\sum_{\ell\in D_j\setminus E_i}w_{j\ell}|\eta_i-\hat\mu_\ell|=\sum_{\ell\in D_i^*}\sum_{(j,s)\in J_{i\ell}}w_{js}|\eta_i-\hat\mu_s|=\sum_{\ell\in D_i^*}w_{i\ell}^*|\eta_i-\hat\eta_\ell|$.

Consequently, Lemma 2.3 is proved, where $u_i$ is given by $u_i=\sum_{j\notin E_i}(n_j\hat\mu_j^2-2\tilde y_j'1_{n_j}\hat\mu_j)+\|\tilde y\|_2^2+\lambda\sum_{j\notin E_i}\sum_{\ell\in D_j\setminus E_i}w_{j\ell}|\hat\mu_j-\hat\mu_\ell|$.

Appendix D. Proofs of lemmas in the appendix

Proof of Lemma B.1

First, we prove that $v_a$ is monotonically decreasing. The following equation holds: $\tilde w_a=\sum_{j\in J_a^+}w_j-\sum_{j\in J_a^-}w_j=\left(\sum_{j\in J_{a+1}^+}w_j-w_{j^*}\right)-\left(\sum_{j\in J_{a+1}^-}w_j+w_{j^*}\right)=\tilde w_{a+1}-2w_{j^*}$, where $j^*=\arg\min_{j\in J_a^-}b_j=\arg\max_{j\in J_{a+1}^+}b_j$. Since $w_{j^*}>0$, $\tilde w_a$ is monotonically increasing with respect to a. Thus, $v_a$ is monotonically decreasing with respect to a.

Next, we prove that $\phi(x)$ is a continuous function. The following equation holds for the part of $\phi_a(t_{a+1})$ that depends on a: $2\lambda\tilde w_at_{a+1}-2\lambda\left(\sum_{j\in J_a^+}w_jb_j-\sum_{j\in J_a^-}w_jb_j\right)=2\lambda\tilde w_{a+1}t_{a+1}-4\lambda w_{j^*}t_{a+1}-2\lambda\left(\sum_{j\in J_{a+1}^+}w_jb_j-\sum_{j\in J_{a+1}^-}w_jb_j-2w_{j^*}t_{a+1}\right)=2\lambda\tilde w_{a+1}t_{a+1}-2\lambda\left(\sum_{j\in J_{a+1}^+}w_jb_j-\sum_{j\in J_{a+1}^-}w_jb_j\right)$.

Thus, we have $\phi_a(t_{a+1})=\phi_{a+1}(t_{a+1})$.

Consequently, Lemma B.1 is proved.

Proof of Lemma B.2

The following chain of equivalences holds: $a^{**}$ does not exist $\Leftrightarrow$ $\forall a\in\{1,\ldots,r\},\ t_a\notin[v_a,v_{a-1})$ $\Leftrightarrow$ $\{\forall a\in\{1,\ldots,r\},\ v_{a-1}\le t_a\}$ or $\{\forall a\in\{1,\ldots,r\},\ t_a<v_a\}$ or $\{\exists!a_0\in\{1,\ldots,r-1\}$ s.t. $t_a<v_{a_0}\ (a\le a_0)$ and $v_{a_0}\le t_a\ (a_0<a)\}$ $\Leftrightarrow$ $\{v_0\in R_0\ (a_0=0)\}$ or $\{v_r\in R_r\ (a_0=r)\}$ or $\{v_{a_0}\in R_{a_0}\ (a_0\in\{1,\ldots,r-1\})\}$ $\Leftrightarrow$ $\exists a\in\{0,1,\ldots,r\}$ s.t. $v_a\in R_a$ $\Leftrightarrow$ $a^*$ exists.

This fact says that one and only one of $a^*$ and $a^{**}$ exists.

Regarding the uniqueness of $a^*$, assume that $a_1\ (\in\{0,1,\ldots,r\})$ and $a_2\ (\in\{0,1,\ldots,r\})$ exist such that $v_{a_1}\in R_{a_1}$ and $v_{a_2}\in R_{a_2}$, and that they satisfy $a_1+1\le a_2$ without loss of generality. Then, $t_{a_1}<v_{a_1}\le t_{a_1+1}\le t_{a_2}<v_{a_2}\le t_{a_2+1}$ holds, which conflicts with $v_{a_2}<v_{a_1}$. Thus, we have $a_1=a_2$.

Regarding the uniqueness of $a^{**}$, assume that $a_1\ (\in\{1,\ldots,r\})$ and $a_2\ (\in\{1,\ldots,r\})$ exist such that $t_{a_1}\in[v_{a_1},v_{a_1-1})$ and $t_{a_2}\in[v_{a_2},v_{a_2-1})$, and that they satisfy $a_1+1\le a_2$ without loss of generality. Then, $v_{a_1}\le t_{a_1}\le t_{a_2}<v_{a_2-1}$ holds, which conflicts with $v_{a_2-1}\le v_{a_1}$. Thus, we have $a_1=a_2$.

Consequently, Lemma B.2 is proved.

Proof of Lemma C.1

First, we show that $\bigcup_{j\in E_i}\{j\}\times(D_j\setminus E_i)\subseteq\bigcup_{\ell\in D_i^*}J_{i\ell}$.

Let (ℓ, s) be an element of the left-hand side above. Then, the following statement is true: $(\ell,s)\in\bigcup_{j\in E_i}\{j\}\times(D_j\setminus E_i)\ \Rightarrow\ \exists j_0\in E_i$ s.t. $(\ell,s)\in\{j_0\}\times(D_{j_0}\setminus E_i)\ \Rightarrow\ \exists j_0\in E_i$ s.t. $\ell=j_0$ and $s\in D_{j_0}\setminus E_i$.

The relation $s\in D_{j_0}\setminus E_i$ leads to $s\in F_i$, and $s\in D_{j_0}\wedge s\notin E_i\ \Rightarrow\ \exists!\ell_0\in\{1,\ldots,b\}\setminus\{i\}$ s.t. $s\in E_{\ell_0}$.

These results say that $s\in E_{\ell_0}\cap F_i\ne\emptyset$ and hence $\ell_0\in D_i^*$. Notice that $(\ell,s)\in\{j_0\}\times(E_{\ell_0}\cap D_{j_0})\subseteq J_{i\ell_0}$. Hence, we have $(\ell,s)\in\bigcup_{\ell'\in D_i^*}J_{i\ell'}$.

Next, we show that $\bigcup_{j\in E_i}\{j\}\times(D_j\setminus E_i)\supseteq\bigcup_{\ell\in D_i^*}J_{i\ell}$.

Let (ℓ, s) be an element of the right-hand side above. Then, the following statement is true: $(\ell,s)\in\bigcup_{\ell'\in D_i^*}J_{i\ell'}\ \Rightarrow\ \exists\ell_0\in D_i^*$ s.t. $\left(\exists j_0\in E_i\ \text{s.t.}\ (\ell,s)\in\{j_0\}\times(E_{\ell_0}\cap D_{j_0})\right)\ \Rightarrow\ \ell=j_0$ and $s\in E_{\ell_0}\cap D_{j_0}$.

Moreover, we find that $s\notin E_i$ because $\ell_0\in D_i^*$. Hence, we have $(\ell,s)\in\bigcup_{j\in E_i}\{j\}\times(D_j\setminus E_i)$.

Consequently, Lemma C.1 is proved.