Abstract
We present a reformulation of optimization problems over the Stiefel manifold by using a Cayley-type transform, named the generalized left-localized Cayley transform, for the Stiefel manifold. The reformulated optimization problem is defined over a vector space, whereby we can directly apply powerful computational arts designed for optimization over a vector space. The proposed Cayley-type transform enjoys several key properties which are useful to (i) study relations between the original problem and the proposed problem; (ii) check the conditions that guarantee the global convergence of optimization algorithms. Numerical experiments demonstrate that the proposed algorithm outperforms the standard algorithms designed with a retraction on the Stiefel manifold.
1. Introduction
The Stiefel manifold St(p, N) := {U ∈ ℝ^{N×p} : UᵀU = I_p} is defined for N, p ∈ ℕ with p ≤ N, where I_p is the p × p identity matrix (see Appendix 1 for basic facts on St(p, N)).
We consider an orthogonal constraint optimization problem formulated as:
Problem 1.1
For a given continuous function f,
(1)
where the existence of a minimizer in (1) is automatically guaranteed by the compactness of the Stiefel manifold and the continuity of f over the Np-dimensional Euclidean space.
This problem belongs to the so-called Riemannian optimization problems (see [Citation1] and Appendix 2), and has rich applications, in particular in data sciences including signal processing and machine learning, as remarked recently in [Citation2,Citation3]. These applications include, e.g. the nearest low-rank correlation matrix problem [Citation4–6], nonlinear eigenvalue problems [Citation7–9], sparse principal component analysis [Citation10–12], 1-bit compressed sensing [Citation13,Citation14], the joint diagonalization problem for independent component analysis [Citation15–17] and enhancement of the generalization performance of deep neural networks [Citation18,Citation19]. However, Problem 1.1 has inherent difficulties regarding the severe nonlinearity of
as an instance of general nonlinear Riemannian manifolds.
Minimization of a continuous function over the orthogonal group
is a special instance of Problem 1.1 with p = N. This problem can be separated into two optimization problems over the special orthogonal group
as
(2)
(2) and, with an arbitrarily chosen
,
because
is the disjoint union of
and
. For the problem in (2), the Cayley transform
(3)
(3) and its inversion mappingFootnote1
(4)
(4) have been utilized in [Citation18,Citation20,Citation21] because φ translates a subset
[see (A3)]) of
into the vector space
of all skew-symmetric matrices, where
is called, in this paper, the singular-point set of φ. More precisely, this is because φ is a diffeomorphism between the dense subsetFootnote2
of
and
.
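Since the displays (3) and (4) themselves are not reproduced above but are referenced repeatedly below, we recall for convenience a standard textbook form of the Cayley transform pair; this expression is consistent with Footnote 1 and Fact A.2, although the sign convention may differ slightly from the one used in (3) and (4):

φ(U) = (I − U)(I + U)^{-1},  φ^{-1}(V) = (I − V)(I + V)^{-1} = (I + V)^{-1}(I − V),

where I + V is always nonsingular because every eigenvalue of a skew-symmetric V is purely imaginary.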
The Cayley transform pair φ and φ⁻¹ can be modified with an arbitrarily chosen
as
(5)
(5) and
(6)
(6) where
is the singular-point set of
. These mappings are also diffeomorphisms between their domains and images. With the aid of
Footnote3 with
, the following Problem 1.2 was considered in [Citation20] as a relaxation of the problem in (2).
Problem 1.2
For a given continuous function , choose
, and
arbitrarily. Then,
(7)
(7)
Remark 1.3
(The existence of
in Problem 1.2). The existence of
satisfying (7) is guaranteed because
is a dense subset of
for any
[Citation20] (see Fact A.3) and
is continuous.
(Left-localized Cayley transform). We call
in (5) the left-localized Cayley transform centred at
because
is multiplied from the left of
in (6), and
. Although
in (5) is the common set
for all
, we distinguish
for each
as the domain of parametrization
for a particular subset
.
We note that Problem 1.2 is a realistic relaxation of the problem in (2) as long as our target is approximation of a solution to (2) algorithmically because
is dense in
. In reality with a digital computer, we can handle just a small subset of the rational numbers
, which is dense in
, due to the limitation of the numerical precision. This situation implies that it is reasonable to consider an approximation of
within its dense subset
.
For Problem 1.2, we can enjoy various arts of optimization over a vector space, e.g. the gradient descent method and Newton's method, because is a vector space. Thanks to the homeomorphism of
, we can estimate a solution to the problem in (2) by applying
to a solution of Problem 1.2 with a sufficiently small
. We call this strategy via Problem 1.2 a Cayley parametrization (CP) strategy for the problem in (2). The CP strategy has a notable advantage over the standard optimization strategies [Citation1], called the retraction-based strategies, in that many powerful computational arts designed for optimization over a single vector space can be directly plugged into the CP strategy. We will discuss the details in Remark 3.6.
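To give a concrete picture of the CP strategy in the p = N case, the following MATLAB sketch minimizes the reformulated cost g = f ∘ Φ_S^{-1} over the vector space of skew-symmetric matrices with a plain gradient descent step. The cost f, the target Ut, the centre point S and the finite-difference gradient are all illustrative assumptions introduced only for this sketch; Φ_S^{-1}(V) = S(I + V)^{-1}(I − V) is the left-localized form suggested by (4) and (6).

% Minimal MATLAB sketch of the CP strategy for p = N (Problem 1.2).
% f, Ut, S and the finite-difference gradient are illustrative assumptions.
N  = 20;
Ut = eye(N); Ut(1:2, 1:2) = [0 -1; 1 0];            % assumed target in SO(N)
f  = @(U) 0.5*norm(U - Ut, 'fro')^2;                % placeholder smooth cost
S  = eye(N);                                        % centre point (assumed)
phi_inv = @(V) S*((eye(N) + V)\(eye(N) - V));       % (I+V)^{-1}(I-V), V skew-symmetric
g  = @(V) f(phi_inv(V));                            % reformulated cost over Q_N

V = zeros(N); h = 1e-6; step = 0.1;                 % iterate in the vector space Q_N
for n = 1:500
    G = zeros(N);                                   % finite-difference gradient of g
    for i = 1:N
        for j = i+1:N
            E = zeros(N); E(i,j) = 1; E(j,i) = -1;  % skew-symmetric basis direction
            G = G + ((g(V + h*E) - g(V))/h)*(E/2);  % divide by ||E||_F^2 = 2
        end
    end
    V = V - step*G;                                 % any vector-space update rule T
end
U_est = phi_inv(V);                                 % estimate in O(N): U_est'*U_est ≈ I

The inner double loop is only for illustration; in practice the gradient of the reformulated cost would be computed via the chain-rule expression given later in (20).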
In this paper, we address a natural question regarding a possible extension of the CP strategy to Problem 1.1 for general p<N: can we parameterize a dense subset of the Stiefel manifold even with p<N in terms of a single vector space? To answer this question positively, we propose a Generalized Left-Localized Cayley Transform (G-LCT):
with
, as an extension of the left-localized Cayley transform
in (5), where
and
are determined with a centre point
(see (11) and (12) in Definition 2.1). The set
is called the singular-point set of
(see the notation at the end of this section), and
is a linear subspace of
(see (9)). For any
, we will show several key properties, e.g. (i)
is a diffeomorphism between
and the vector space
with the inversion mapping
(see Proposition 2.2); (ii)
is a dense subset of
for p<N (see Theorem 2.3(b)). Therefore, the proposed
and
have inherent properties desired for applications in the CP strategy to Problem 1.1.
To extend the CP strategy to Problem 1.1 for p<N, we consider Problem 1.4 below, which can be seen as an extension of Problem 1.2. For the same reason as in Remark 1.3(a), the existence of achieving (8) is guaranteed by the denseness of
in
(see Lemma 2.6).
Problem 1.4
For a given continuous function with p<N, choose
, and
arbitrarily. Then,
(8)
(8)
Under a smoothness assumption on general f, a realistic goal for Problem 1.1 is to find a stationary point of f because Problem 1.1 is a non-convex optimization problem (see, e.g. [Citation1,Citation22,Citation23]) and any local minimizer must be a stationary point [Citation22,Citation23]. In Lemma 3.4, we present a characterization of a stationary point
of f over
, with
satisfying
, in terms of a stationary point
of
over the vector space
, i.e.
. To approximate a stationary point of f over
, we also consider the following problem:
Problem 1.5
For a continuously differentiable function with p<N, choose
and
arbitrarily. Then,
For Problem 1.5, we can apply many powerful arts for searching a stationary point of a non-convex function over a vector space.
Numerical experiments in Section 4 demonstrate that the proposed CP strategy outperforms the standard algorithms designed with a retraction on (see Appendix 2) in the scenario of a certain eigenbasis extraction problem.
Notation. and
denote the set of all positive integers and the set of all real numbers respectively. For general
, we use
for the identity matrix in
, but for simplicity, we use
for the identity matrix in
. For
,
denotes the matrix of the first p columns of
. For a matrix
,
denotes the
entry of
, and
denotes the transpose of
. For a square matrix
we use the notation
for
. For
, the matrices
and
respectively denote the upper and the lower block matrices of
. For
, the matrices
and
respectively denote the left and right block matrices of
. For a matrix
,
denotes the skew-symmetric component of
. For square matrices
,
denotes the block diagonal matrix with diagonal blocks
. For a given matrix,
and
denote the spectral norm and the Frobenius norm respectively. The functions
and
denote respectively the largest and the nonnegative smallest singular values of a given matrix. The function
denotes the largest eigenvalue of a given symmetric matrix. For a vector space
of matrices,
denotes an open ball centred at
with radius
. To distinguish from the symbol for the orthogonal group
, the symbol
is used in place of the standard big O notation for computational complexity.
2. Generalized left-localized Cayley transform (G-LCT)
2.1. Definition and properties of G-LCT
In this subsection, we introduce the Generalized Left-Localized Cayley Transform (G-LCT) for the parametrization of
as a natural extension of
in (5). Indeed, the G-LCT inherits key properties satisfied by
(see Proposition 2.2 and Theorem 2.3).
Definition 2.1
Generalized left-localized Cayley transform
For satisfying
, let
,
, and
(9)
(9) The generalized left-localized Cayley transform centred at
is defined by
(10)
(10) with
(11)
(11)
(12)
(12) where we call
the centre point of
, and
the singular-point set of
.
Proposition 2.2
Inversion of G-LCT
The mapping with
is a diffeomorphism between a subset
and
.Footnote4 The inversion mapping is given, in terms of
in (6), by
(13)
(13) where
. Moreover, for
, we have the following expressions
(14)
(14)
(15)
(15) where
is the Schur complement matrix of
(see Fact A.6).
Proof.
See Appendix 3.
Theorem 2.3
Denseness
Let and p<N. Then, the following hold:
, i.e.
, where Ξ is defined as in Proposition 2.2.
is an open dense subset of
(see Fact A.1(a) for the topology of
).
For
, the subset
is a nonempty open dense subset of
.
Let
. Then, g is a positive-valued function and
. Conversely, if a sequence
satisfies
, then
.
Proof.
See Appendix 4.
Proposition 2.4
Properties of G-LCT in view of the manifold theory
(Chart). For
, the ordered pair
is a chart of
, i.e. (i)
is an open subset of
; (ii)
is a homeomorphism between
and the
dimensional Euclidean space
.
(Smooth atlas). The set
is a smooth atlas of
, i.e. (i)
; (ii) for every pair
,
is smooth over
, where
has been defined in Theorem 2.3(c).
Proof.
(a) (i) See Theorem 2.3(b). (ii) From Proposition 2.2, is a homeomorphism between
and
. Clearly the dimension of the vector space
is
.
(b) (i) Recall that . (ii) See Proposition 2.2.
Remark 2.5
(Relation between the Cayley transform-based retraction and
). By using
in (4), the Cayley transform-based retraction has been utilized for Problem 1.1, e.g. [Citation22,Citation24,Citation25] (see Appendix 2 for the retraction-based strategy). The Cayley transform-based retraction can be expressed by using the proposed
(see (31)). In Section 3.4, we will clarify a diffeomorphic property of this retraction through
.
(Parametrization of
with
). By
, for a given pair of
and
, the inclusion
is not guaranteed in general. However, Proposition 2.4(b) ensures the existence of
satisfying
. Indeed, we can construct such
by using a singular value decomposition of
as shown later in Theorem 2.7. This fact tells us that the availability of general
can realize overall parametrization of
with
. We note that a naive idea for using
, i.e. a special case of
with
, in optimization over
has been reported shortly in [Citation26], which can be seen as an extension of the Cayley parametrization in [Citation20] for optimization over
.
(On the choice of
for
in Proposition 2.2). Since Ξ defined in Proposition 2.2 selects the first p column vectors from an orthogonal matrix,
for
can be regarded as the matrix of the first p column vectors selected from an orthogonal matrix
. Proposition 2.2 guarantees that the matrix
of the first p column vectors of
does not overlap in
. Although there are many other selection rules
with
of p column vectors from
,
can not necessarily parameterize
without any overlap as shown below. For simplicity, assume 2p<N. Consider
satisfying
(
is such a typical instance). Then, we can verify that
is not an injection on
(see Appendix 5). Note that an idea for using
only with
has been considered in [Citation26]. However, for parametrization of
, it seems to suggest the special selection
, which corresponds to
.
By using Theorem 2.3, we deduce Lemma 2.6, which guarantees the existence of a solution to Problem 1.4 for any . Theorem 2.3 will also be used in Lemma 3.5 to ensure the existence of a solution to Problem 1.5.
Lemma 2.6
Let be continuous with p<N and
. Then, it holds
(16)
(16)
Proof.
The second equality in (16) is verified from the homeomorphism of
. Let
be a global minimizer of f over
, i.e.
. From the denseness of
in
(see Theorem 2.3(b)), there exists a sequence
satisfying
. The continuity of f yields
, i.e.
.
2.2. Computational complexities for the G-LCT and its inversion
From the expressions in (10)–(12) and (15), both
and
with general
require
flops (FLoating-point OPerationS [not ‘FLoating point Operations Per Second’]), which are dominated by the matrix multiplications
in (12) and
in (15) respectively. However, if we employ a special centre point
(17)
(17) then the complexities for
and
can be reduced to
flops. Indeed, for
and
, we have
and
. Hence
requires
flops due to
Moreover, for
and
, it follows from
and (15) that
requires
flops.
For a given , Theorem 2.7 below presents a way to select
satisfying
, where
is designed with a singular value decomposition of
, requiring thus at most
flops.
Theorem 2.7
Parametrization of the Stiefel manifold by the G-LCT with a constructed centre point
Let , and
be a singular value decomposition of
, where
and
is a diagonal matrix with non-negative entries. Define
with
. Then, the following hold:
and
.
(18)
(18) where
.
Proof.
(a) By , it holds
, which implies
by Definition 2.1.
(b) Substituting and
into (11) and (12), we obtain (18). From (12),
is bounded above as
(19)
(19)
Remark 2.8
Comparisons to commonly used retractions of the Stiefel manifold
The computational complexity flops for
with
is competitive with that for commonly used retractions, which map a tangent vector to a point in
(for the retraction-based strategy, see Appendix 2). Indeed, retractions based on QR decomposition, the polar decomposition [Citation1] and the Cayley transform [Citation22] require respectively
flops,
flops and
flops [Citation24, Table ].
Table 1. Performance of each algorithm applied to Problem 4.1.
2.3. Gradient of function after the Cayley parametrization
For the applications of the G-LCT with
to Problems 1.4 and 1.5, we present an expression of the gradient of
denoted by
(Proposition 2.9) and its useful properties (Proposition 2.10, Remark 2.11 and also Proposition A.11).
Proposition 2.9
Gradient of function after the Cayley parametrization
For a differentiable function and
, the function
is differentiable with
(20)
(20) where
(21)
(21) and
(22)
(22)
(23)
(23) in terms of
and
. In particular, by
in (13),
(24)
(24)
Proof.
See Appendix 6.
Proposition 2.10
Transformation formula for gradients of function
For , suppose that
and
satisfy
. Then, for a differentiable function
, the following hold:
is guaranteed. Moreover, by using
we have
(25)
(25)
.
if and only if
.
Proof.
See Appendix 7.
Remark 2.11
(Computational complexity for
with
in (17)). Let
with
and
. From (21) and (23), computation of
requires at most
flops due to
where
and
.
(Relation of gradients after Cayley parametrization). Proposition 2.10 illustrates the relations of two gradients after Cayley parameterization with different centre points. These relations will be used in Lemmas 3.4 and 3.5 to characterize the first-order optimality condition with the proposed Cayley parametrization.
(Useful properties of the gradient after Cayley parametrization). In Appendix 8, we present useful properties ((i) Lipschitz continuity; (ii) boundedness; (iii) bounded variance) of
for the minimization of
over
. These properties have been exploited in distributed optimization and stochastic optimization, e.g. [Citation27–32].
3. Optimization over the Stiefel manifold with the Cayley parametrization
3.1. Optimality condition via the Cayley parametrization
We present simple characterizations of (i) a local minimizer, and (ii) a stationary point, of a real-valued function over in terms of
.
Let be a vector space of matrices. A point
is said to be a local minimizer of
over
if there exists
satisfying
for all
. Under the smoothness assumption on J,
is said to be a stationary point of J over the vector space
if
. For a smooth function
,
is said to be a stationary point of f over
[Citation22,Citation23] if
satisfies the following conditions:
(26)
(26) The above conditions
and (26) are called the first-order optimality conditions because they are respectively necessary conditions for
to be a local minimizer of J over
(see, e.g. [Citation33, Theorem 2.2]), and for
to be a local minimizer of f over
(see [Citation23, Definition 2.1, Remark 2.3] and [Citation22, Lemma 1]).
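For orientation, a commonly used explicit form of this first-order optimality condition over the Stiefel manifold (see [Citation22, Lemma 1]) is

∇f(U) − U ∇f(U)ᵀ U = 0 together with UᵀU = I_p;

the condition (26) above may be arranged differently but is equivalent.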
In Lemma 3.1 below, we characterize a local minimizer of f over as a local minimizer of
with a certain
over the vector space
.
Lemma 3.1
Equivalence of local minimizers in the two senses
Let be continuous. Let
and
satisfy
. Then,
is a local minimizer of f over
if and only if
is a local minimizer of
over the vector space
.
Proof.
Let be a local minimizer of f over
and
satisfy
for all
. From the homeomorphism of
in Proposition 2.2,
is a nonempty open subset of
containing
. Then, there exists
satisfying
. Since
, we obtain
for all
, implying thus
is a local minimizer of
over
. In a similar way, we can prove its converse.
Under a special assumption on f in Theorem 3.2 below, which is nevertheless commonly found in many data science scenarios (see Remark 3.3), we can characterize a global minimizer of Problem 1.1 via with any
. In this case, a global minimizer
of
is guaranteed to exist in the unit ball
.
Theorem 3.2
Let . Assume that
is continuous and right orthogonal invariant, i.e.
for
and
. Then, there exists a global minimizer
of
achieving
,
and
.
Proof.
Let be a global minimizer of f over
, and
be a singular value decomposition with
and nonnegative-valued diagonal matrix
. Then, we obtain
with
by
. The right orthogonal invariance of f ensures
with
.
Substituting into (11) and (12), we obtain
and
. In a similar manner to (19), the last equality implies
. The last statement is verified by
Remark 3.3
Right orthogonal invariance
Under the right orthogonal invariance of f, Problem 1.1 arises in, e.g. low-rank matrix completion [Citation34,Citation35], eigenvalue problems [Citation1,Citation22,Citation24,Citation36], and optimal model reduction [Citation3,Citation37]. These applications can be formulated as optimization problems over the Grassmann manifold
, which is the set of all p-dimensional subspaces of
. Practically,
is represented numerically by
, where
is an equivalence class, because the column space of
equals that of
for all
. Since the value of the right orthogonal invariant f depends only on the equivalence class
, Problem 1.1 of such f can be regarded as an optimization problem over
.
In Lemma 3.4 below, we characterize a stationary point of f over by a stationary point of
with a certain
over the vector space
. Moreover, Lemma 3.5 ensures the existence of solutions to Problem 1.5 with any
. Therefore, we can approximate a stationary point of f over
by solving Problem 1.5 with a sufficiently small
.
Lemma 3.4
First-order optimality condition
Let be differentiable. Let
and
satisfy
. Then, the first-order optimality condition in (26) can be stated equivalently as
(27)
(27) where
.
Proof.
Let satisfy
. Then, we have
. For
satisfying
and
, i.e.
, Proposition 2.10(c) asserts that
if and only if
. To prove the equivalence between (Equation26
(26)
(26) ) and (Equation27
(27)
(27) ), it is sufficient to show the equivalence between the condition in (Equation26
(26)
(26) ) and
. By (Equation24
(24)
(24) ), we have
which yields
if and only if the second condition in (Equation26
(26)
(26) ) holds true.
In the following, we show the equivalence of and
. By noting
, the equality
implies
. Conversely,
implies
.
Lemma 3.5
Let be continuously differentiable with p<N and
. Then,
.
Proof.
Let be a global minimizer of f over
, and
satisfy
. Then,
is a stationary point of f over
, and we have
with
from Lemma 3.4.
Theorem 2.3(c) ensures the denseness of in
. Then, we obtain a sequence
of
converging to
. Let
and
be sequences of
and
. The continuity of
yields
, implying the boundedness of
. From
and Proposition 2.10(b), we have
. The right-hand side of the above inequality converges to zero from the boundedness of
and
. Therefore, we have
, which completes the proof.
3.2. Basic framework to incorporate optimization techniques designed over a vector space with the Cayley parametrization
We illustrate a general scheme of the Cayley parametrization strategy in Algorithm 1,Footnote5 where is an initial estimate for a solution to Problem 1.1 with p<N,
is a centre point for parametrization of a dense subset
in terms of the vector space
, and a mapping
is a certain update rule for decreasing the value of
. In principle, we can employ any optimization update scheme over a vector space as
, which is a notable advantage of the proposed strategy over the standard strategy (see Remark 3.6). As a simplest example, we will employ, in Section 4, a gradient descent-type update scheme
with a stepsize
determined by a certain line-search algorithm (see, e.g. [Citation33]).
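To make the update rule concrete, the following is a minimal MATLAB sketch of a gradient descent-type update with an Armijo backtracking stepsize over the vector space. The handles g and grad_g and the parameters gamma_init, rho and c are illustrative assumptions, and the paper's Algorithm 2 may differ in detail.

% Minimal sketch of a gradient descent-type update T with Armijo backtracking.
% g and grad_g are handles for the reformulated cost and its gradient over the
% vector space; gamma_init, rho (0 < rho < 1) and c (0 < c < 1) are assumptions.
function V_next = cp_gradient_update(g, grad_g, V, gamma_init, rho, c)
    G     = grad_g(V);                               % gradient at the current iterate
    gamma = gamma_init;
    % Backtrack until the Armijo (sufficient decrease) condition is met.
    while g(V - gamma*G) > g(V) - c*gamma*norm(G, 'fro')^2
        gamma = rho*gamma;                           % shrink the stepsize
    end
    V_next = V - gamma*G;                            % update rule T(V)
end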
To parameterize by
,
must be chosen to satisfy
. An example of selection of such
for a given
is
by using a singular value decomposition
with
and a diagonal matrix
with non-negative entries (see Theorem 2.7).
Remark 3.6
Comparison to the retraction-based strategy
As reported in [Citation1,Citation3,Citation22,Citation24,Citation25,Citation41–52], Problem 1.1 has been tackled with a retraction (see, e.g. [Citation1]) by exploiting only a local diffeomorphismFootnote6 of each
between a sufficiently small neighbourhood of
in the tangent space
, at
to
, and its image in
(see Appendix 2 for its basic idea). At the nth iteration, these retraction-based strategies decrease the time-varying function
at
over the time-varying vector space
, where
is the nth estimate for a solution. Many computational mechanisms for finding a descent direction
in the tangent space
have been motivated by standard ideas for optimization over a fixed vector space. To achieve fast convergence in optimization over a vector space, many researchers have been trying to utilize the past updating directions for estimating a current descent direction, e.g. the conjugate gradient method, quasi-Newton's method and Nesterov accelerated gradient method [Citation27,Citation28,Citation33,Citation53]. However, in the retraction-based strategy, since the past updating directions
no longer live in the current tangent space
, we cannot directly utilize
for estimating a new descent direction
. For the past updating directions to be exploited with a retraction, they must be translated into the current tangent space with certain mappings, e.g. a vector transport [Citation1] and the inversion mapping of retractions [Citation25].
On the other hand, Algorithm 1 decreases the fixed cost function with a fixed
over the fixed vector space
during the process of Algorithm 1 by exploiting the diffeomorphism of
between
and an open dense subset
of
(see Proposition 2.2 and Theorem 2.3(b)). Since every past updating direction lives in the same vector space
, we can utilize the past updating directions without requiring any additional computation such as a vector transport and the inversion mapping of retractions. Therefore, we can transplant powerful computational arts, e.g. [Citation27–33], designed for optimization over a vector space, into the proposed strategy. For many such algorithms, Proposition A.11 must be useful for checking whether conditions, regarding the cost function, for a global convergence of optimization techniques hold true or not.
3.3. Singular-point issue in the Cayley parametrization strategy
Numerical performance of Algorithm 1 heavily depends on tuning in general. If we choose
such that a minimizer
of Problem 1.1 is close to the singular-point set
, then a risk of a slow convergence of Algorithm 1 arises due to an insensitivity of
to the change around
in the vector space
. In a case where p = N, this risk has been reported by [Citation20,Citation21]. We can see this insensitivity of
via Proposition 3.7 below.
Proposition 3.7
The mobility of the inversion mapping of the G-LCT
Let satisfy p<N,
,
, and
satisfy
. Then, we have
where
(28)
(28) We call
the mobility of
, which is bounded as
(29)
(29) where the equality holds when
.
Proof.
See Appendix 9.
To interpret the result in Proposition 3.7, we consider two simple examples. Under the condition , we observe from (29) that the mobility
becomes small when
increases. On the other hand, because
is achieved by
from (28),
around zero does not lead to a small
.
These tendencies can be observed numerically in Figure , where the plot shows the norm on the horizontal axis versus the values
and
, with randomly chosen
satisfying
, on the vertical axis for each
and p = 10. From this figure, we observe that the mobility
decreases and
becomes insensitive as
increases.
Figure 1. The average values of the change ‖Φ_S^{-1}(V+E) − Φ_S^{-1}(V)‖_F and the mobility r(V) for each ‖[[V]]_{21}‖_2 over 10 trials in the case N ∈ {500, 1000, 2000} and p = 10. In each trial, we generate Ṽ, Ẽ ∈ ℝ^{N×N} of which each entry is uniformly chosen from [−0.5, 0.5] except for the (N−p)-by-(N−p) right lower block matrix. Then, with E := Skew(Ẽ)/‖Skew(Ẽ)‖_F ∈ Q_{N,p} satisfying ‖E‖_F = 1, we evaluate ‖Φ_S^{-1}(V+E) − Φ_S^{-1}(V)‖_F and r(V) at V ∈ Q_{N,p} with [[V]]_{11} = [[Skew(Ṽ)]]_{11} and [[V]]_{21} = c[[Skew(Ṽ)]]_{21} by changing c ∈ [0, 5/‖[[Skew(Ṽ)]]_{21}‖_2].
This insensitivity of , at points distant from zero, causes a risk of slow convergence of Algorithm 1 even if the current estimate
is not sufficiently close to a solution
of Problem 1.4 or Problem 1.5. Since Theorem 2.3(d) implies that
increases as
approaches
, the risk of the slow convergence, say a singular-point issue, can arise in a case where a global minimizer
stays around
. In Section 4.2, we will see that the numerical performance of Algorithm 1 employing the gradient descent-type method tends to deteriorate as
approaches
.
To remedy the singular-point issue in Algorithm 1, it is advisable to use such that
is close to zero in
. Although we cannot determine for a given
whether
is close to zero or not in advance of minimization for general f, Theorem 3.2 guarantees, under the right orthogonal invariance of f, the existence of a global minimizer
satisfying
for every
. In this case, by
in (Equation29
(29)
(29) ) and the continuity of r, the mobility r of
can be maintained in a neighbourhood of
to which a point sequence
generated by Algorithm 1 is desired to approach. Therefore, we do not need to be nervous about the influence by the singular-point set around
.
For general f, to remedy the singular-point issue, we reported briefly in [Citation38,Citation39] that this issue can be avoided by a Cayley parametrization-type strategy, for Problem 3.8 below, by updating not only but also a preferable centre point
strategically. Owing to the length of the required discussion, we will present the fully detailed treatment on another occasion.
Problem 3.8
For a given continuous function , choose
arbitrarily. Then,
3.4. Relation between the Cayley transform-based retraction and the G-LCT
The proposed can be regarded as another form of the Cayley transform-based retraction for
. By using the inversion
in (4), the Cayley transform-based retraction
was introduced explicitly in [Citation22,Citation24], where the tangent bundle
is defined with the tangent space
to
at
(see Fact A.1(d)). For
,
can be expressed with
as
(30)
(30) By passing through the linear mapping
with
satisfying
, we have the following relation
(31)
(31) This relation can be verified specially with
by
and
Through the relation in (31), we obtain a diffeomorphic property of
in the following. The linear mapping
is a bijection between
and
with its inversion mapping
. From
, (31) and Proposition 2.2,
is a diffeomorphism between
and a subset
of
. Clearly, the inversion mapping of
is given by
.
We present an explicit formula for . From Definition 2.1, we have
(32)
(32) From (11) and (12), each term in (32) is evaluated as
By substituting these equalities into (32), we have
(33)
(33) Although the expression (33) of
has been given by [Citation25,Citation54], our discussion via (31) presents much more comprehensive information about
. In [Citation25,Citation54], it has been reported that a certain restriction of
to a sufficiently small open neighbourhood of
is invertible with
. Meanwhile, we clarify that
is invertible on
entirely by passing through
. The following proposition summarizes the above discussion.
Proposition 3.9
For , let
satisfy
. Then, the Cayley transform-based retraction
in (30) [Citation22,Citation24] is diffeomorphic between
and
, and its inversion mapping
is given by (33), where
. In addition, for p<N, the image
of
is an open dense subset of
(see Theorem 2.3(b)).
Remark 3.10
Minimization with a fixed point via the Cayley transform-based retraction
By using the Cayley transform-based retraction , the Cayley parametrization strategy in Algorithm 1 can be modified to the minimization of
with a fixed
over
. The explicit formula for the gradient of
is given in Appendix 10. Compared to the minimization of
over
, advantages of the minimization of
with
over
are as follows.
The complexity
flops of
with
is more efficient than
flops of
(see Remark 2.8). In a case where we employ the gradient descent-type method for the minimization of
and
, the difference in constant factors affects the run time of the algorithm in practice because
and
are used to estimate a stepsize many times within a line-search algorithm, e.g. the backtracking algorithm (Algorithm 2), in each iteration (see, e.g. [Citation33]).
has been exploited with the aid of the Sherman-Morrison-Woodbury formula (see Fact A.7) to reduce the complexity for matrix inversion, which can induce the deterioration of the orthogonal feasibility due to the numerical instability of its formula [Citation22]. On the other hand,
does not use the formula, and thus is more numerically stable than
. This will be demonstrated numerically in Section 4. Indeed, for
, the condition number
of
in (15) is upper bounded byFootnote7
, implying that
hardly becomes ill-conditioned whenever
is not very large (this is the usual case, e.g. in applications of the G-LCT to optimization of right orthogonal invariant functions [see Theorem 3.2]).
4. Numerical experiments
We illustrate the performance of the proposed CP strategy in Algorithm 1 by numerical experiments. To demonstrate the effectiveness of the proposed formulation in Problem 1.4 in a simple situation, we implemented Algorithm 1 with a gradient descent-type update scheme in MATLAB, where
. In
for a given
, we use a stepsize
, satisfying the so-called Armijo rule, generated by the backtracking algorithm (see, e.g. [Citation33]) with predetermined
and
(see Algorithm 2). The Armijo rule has been widely utilized to design a stepsize that decreases the function value sufficiently in numerical optimization. All the experiments were performed on a MacBook Pro (13-inch, 2017) with an Intel Core i5-7360U and 16 GB of RAM.
4.1. Comparison to the retraction-based strategy
We compared Algorithm 1+ (abbreviated as GDM+CP) and three retraction-based strategies [Citation1] with the steepest descent solver implemented in Manopt [Citation55] in the scenario of eigenbasis extraction problem below. Since the Cayley transform-based retraction
in (30) can be utilized for a parametrization of a subset of
(see Section 3.4 and Proposition 3.9), to see differences in performance between
and
, we also compared the proposed GDM+CP and its modified version with replacement of
by
(abbreviated by GDM+CP-retraction) illustrated in Algorithm 3+
for the minimization of
with a fixed
over
.
Problem 4.1
Eigenbasis extraction problem (e.g. [Citation1,Citation22,Citation24])
For a given symmetric matrix ,
(34)
(34) Any solution
of Problem 4.1 is an orthonormal eigenbasis associated with the p largest eigenvalues of
. In our experiment, we used
with randomly chosen
of which each entry is sampled from the standard normal distribution
. Note that f is right orthogonal invariant, and thus we can exploit Theorem 3.2 for GDM+CP.
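For reference, a minimal MATLAB sketch of the ground truth used later as 'fval-optimal' is given here: an orthonormal eigenbasis associated with the p largest eigenvalues of the symmetric test matrix, obtained by eigenvalue decomposition. The symmetrization used to build A from the random matrix B is an assumption made only for this sketch (the paper's exact construction is not reproduced above).

% Sketch of the reference solution of Problem 4.1 via eigenvalue decomposition.
% The construction of A from B below is an illustrative assumption.
N = 1000; p = 10;
B = randn(N);                        % entries from the standard normal distribution
A = (B + B')/2;                      % assumed symmetrization of the random matrix
[Q, D]   = eig(A);
[~, idx] = sort(diag(D), 'descend'); % order eigenvalues from largest to smallest
U_star   = Q(:, idx(1:p));           % eigenbasis of the p largest eigenvalues of A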
For the retraction-based strategies, we employed three retractions: (i) Cayley transform-based (abbreviated by GDM+Cayley) [Citation22]; (ii) QR decomposition-based (abbreviated by GDM+QR) [Citation1]; (iii) polar decomposition-based (abbreviated by GDM+polar) [Citation1]. In the steepest descent solver in Manopt, we calculated a stepsize for the current estimate with Algorithm 2 after replacement of the criterion
by
(see, e.g. [Citation3, Algorithm 3.1]), where
is the Riemannian gradient of f at
(for the projection mapping
, see Fact A.1(d)).
For an initial point , we used a centre point for GDM+CP as
by using a singular value decomposition of
with
and a nonnegative-valued diagonal matrix
(see Theorem 2.7). For GDM+CP-retraction, we used a fixed
for the minimization of
. We note that the choice of
is reasonable because the procedure of Algorithm 3(
), which tries to decrease
from the initial point
, is the same as the procedure of GDM+Cayley in the first iteration. The explicit formula for the gradient of
is given in Appendix 10.
For the five algorithms, we used the default parameters and
, for Algorithm 2, in Manopt. We employed several initial stepsizes
. We generated an initial point
by using ‘orth(rand
)’ in MATLAB.
For each algorithm, we stopped the update at the nth iteration when it achieved the following conditions (used in [Citation25]) with ,
,
:
(35)
(35) Table illustrates average results for 10 trials of each algorithm employing the initial stepsize
with the shortest CPU time to reach the stopping criteria in the scenario of Problem 4.1 with
. In the table, ‘fval’ means the value
at the output
, ‘fval-optimal’ means
with the global minimizer
obtained by the eigenvalue decomposition of
, ‘feasi’ means the feasibility
, ‘nrmg’ means the norm
,
or
, ‘itr’ means the number of iterations, and ‘time’ means the CPU time (s). Figure shows the convergence history of algorithms for each problem size respectively. The plots show CPU time on the horizontal axis versus the value
on the vertical axis.
Figure 2. Convergence histories of each algorithm applied to Problem 4.1 regarding the value f(U) − f(U⋆) at CPU time for each problem size. Markers are put at every 250 iterations.
We observe that the proposed GDM+CP reaches the stopping criteria with the shortest CPU time among all five algorithms for every problem size. Possible reasons for the superiority of the proposed Cayley parametrization strategy to the retraction-based strategy are as follows.
The Cayley parametrization strategy exploits the diffeomorphic property of
between a vector space and an open dense subset of
while the retraction-based strategy exploits only a local diffeomorphic property around
of retractions (see Remark 3.6).
For Problem 4.1, a global minimizer
of
existsFootnote8 within the unit ball
due to the right orthogonal invariance of f in (34) and Theorem 3.2. In comparison, the existence of a global minimizer, say
, of
over
is not guaranteed for a general retraction R. Even if such a
exists, it is not guaranteed that
is a global minimizer of f over
because
is not necessarily dense in
.
As shown in Propositions 2.2 and 3.9, both and
can parameterize respectively open dense subsets of
. However, we observe that (i) the proposed GDM+CP for the minimization of
has faster convergence speed than GDM+CP-retraction for the minimization of
; (ii) the orthogonal feasibility in GDM+CP-retraction deteriorates compared with GDM+CP. We believe that these performance differences arise respectively because (i) the computational complexity for
is more efficient than that of
, and by (ii) calculations of
and
require the Sherman-Morrison-Woodbury formula for matrix inversions in order to achieve comparable computational complexities, and this formula is known to have a numerical instability [Citation22] (see Remark 3.10).
Moreover, although GDM+CP-retraction reaches the stopping criteria without achieving the same level of the final cost value as the others,Footnote9 GDM+CP-retraction has the same or better performance than GDM+Cayley in view of the convergence history in Figure at every point in time. This indicates the efficacy of the parametrization strategy of in the vector space reformulation for Problem 1.1 because GDM+CP-retraction and GDM+Cayley use the same Cayley transform-based retraction.
Finally, we remark that if is set too large, the numerical performance of the proposed GDM+CP can deteriorate because a generated sequence
can go away from
quickly, which induces the insensitivity of
(see Section 3.3). This tendency can be observed from Figure , which illustrates average convergence histories for 10 trials of GDM+CP with each stepsize
in the scenario of Problem 4.1. Figure shows that GDM+CP with
has the best performance among the four algorithms. This observation indicates that we need not set
so large for GDM+CP. Not surprisingly, we also see that too small a
causes slow convergence of GDM+CP because it moves only a little along
at each iteration.
Figure 3. Convergence histories of GDM+CP with each γ_initial applied to Problem 4.1 (N = 1000, p = 10) regarding the value f(U) − f(U⋆) at CPU time for each problem size. Markers are put at every 250 iterations.
4.2. Singular-point issue
In this subsection, we tested how much the singular points influence the performance of the proposed CP strategy. As we mentioned in Section 3.3, a risk of slow convergence of Algorithm 1 can arise in a case where a global minimizer of Problem 1.1 is close to the singular-point set
. To see such an influence, we compared CP strategies with several centre points
on a toy instance of Problem 1.1 for the minimization of
with a given
. Clearly, its solution is
.
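For completeness, the toy cost referred to above (also visible in the caption of Figure 4) and its Euclidean gradient are

f(U) := (1/2)‖U − U⋆‖_F²,  ∇f(U) = U − U⋆,

so the unique global minimizer over the Stiefel manifold is U⋆ itself.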
In this experiment, we used centre points , the global minimizer
and an initial point
, where
is a rotation matrix. From
, we have
. Therefore,
approaches
as
, and
is farthest from
.
We used the stopping criteria (35), and parameters
,
, and
for Algorithm 2 to determine a stepsize
.
Table illustrates average results for 10 trials of each algorithm with N = 1000 and p = 10 in this scenario. Figure shows the convergence history of algorithms. The plot shows CPU time on the horizontal axis versus the value on the vertical axis.
Figure 4. Convergence histories of each algorithm applied to Problem 1.1 with f(U) := (1/2)‖U − U⋆‖_F², and N = 1000, p = 10, regarding the value f(U) at CPU time. Markers are put at every 250 iterations.
Table 2. Performance of each algorithm applied to Problem 1.1 with .
From Figure , we observe that GDM+CP with is the fastest among all algorithms. On the other hand,
generated by GDM+CP with
does not approach a global minimizer
. This implies that the convergence speed of GDM+CP tends to become slower as
, or equivalently as
approaches the singular-point set.
From these observations, the performance of the proposed Algorithm 1 depends heavily on tuning as mentioned in Section 3.3. Since we cannot tell whether a solution
is distant from
or not in advance of running the algorithms, it is desirable to circumvent the influence of this singular-point issue. In [Citation38,Citation39], we presented preliminary reports for a CP strategy with an adaptive changing centre point scheme to avoid the singular-point issue by considering Problem 3.8 instead of Problem 1.4.
5. Conclusion
We presented a generalization of the Cayley transform for the Stiefel manifold to parameterize a dense subset of the Stiefel manifold in terms of a single vector space. The proposed Cayley transform is a diffeomorphism between a dense subset of the Stiefel manifold and a vector space. Thanks to this diffeomorphic property, we proposed a new reformulation of the optimization problem over the Stiefel manifold that allows optimization techniques designed over a vector space to be transplanted directly. Numerical experiments have shown that the proposed algorithm outperforms the standard algorithms designed with a retraction on the Stiefel manifold in a simple setting.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Additional information
Funding
Notes
1 is well-defined over
because all eigenvalues of
are pure imaginary. For the second expression in (Equation4
(4)
(4) ), see the beginning of Appendix 3.
2 The closure of is equal to
. For every
, we can approximate it by some sequence
of
with any accuracy, i.e.
.
3 The domain of with
is a subset
of
.
4 As in (Equation9(9)
(9) ),
is the common set
for every
. However, we distinguish
for each
as a parametrization of the particular subset
of
(see also Remark 1.3(b)).
5 Algorithm 1 can serve as a central building block in our further advanced Cayley parametrization strategies, reported partially in [Citation38–40].
6 The local diffeomorphism of around
can be verified with the inverse function theorem and the condition (ii) in Definition B.1.
7 Let be the eigenvalue decomposition with
and a nonnegative-valued diagonal matrix
. From (I2) in Appendix 9, we have
. Thus, we have
.
8 From the relation in Lemma 2.6,
is also a global minimizer of f over
.
9 We note that this early stopping of GDM+CP-retraction can be caused by the instability [Citation22] of the Sherman-Morrison-Woodbury formula used in and
.
10 The subspace is an orthogonal complement to the subspace
with the inner product
. The tangent space
can be decomposed as
with the direct sum ⊕. In view of the orthogonal decomposition, the first term and the second term in the right-hand side of (EquationA1
(A1)
(A1) ) can be regarded respectively as the orthogonal projection of
onto
and
.
11 The exponential mapping at
is defined as a mapping that assigns a given direction
to a point on the geodesic of
with the initial velocity
. The exponential mapping is also a special instance of retractions of
. However, due to its high computational complexity, computationally simpler retractions have been used extensively for Problem 1.1 [Citation1].
References
- Absil PA, Mahony R, Sepulchre R. Optimization algorithms on matrix manifolds. Princeton (NJ): Princeton University Press; 2008.
- Manton JH. Geometry, manifolds, and nonconvex optimization: how geometry can help optimization. IEEE Signal Process Mag. 2020;37(5):109–119.
- Sato H. Riemannian optimization and its applications. Switzerland: Springer International Publishing; 2021.
- Pietersz R, Groenen PJF. Rank reduction of correlation matrices by majorization. Quant Finance. 2004;4(6):649–662.
- Grubišić I, Pietersz R. Efficient rank reduction of correlation matrices. Linear Algebra Appl. 2007;422(2–3):629–653.
- Zhu X. A feasible filter method for the nearest low-rank correlation matrix problem. Numer Algorithms. 2015;69(4):763–784.
- Bai Z, Sleijpen G, van der Vorst H. Nonlinear eigenvalue problems. In: Bai Z, Demmel J, Dongarra J, Ruhe A, van der Vorst H, editors, Templates for the solution of algebraic eigenvalue problems. Philadelphia (PA): SIAM; 2000. p. 281–314.
- Yang C, Meza JC, Wang LW. A constrained optimization algorithm for total energy minimization in electronic structure calculations. J Comput Phys. 2006;217(2):709–721.
- Zhao Z, Bai Z, Jin X. A Riemannian Newton algorithm for nonlinear eigenvalue problems. Comput Optim Appl. 2015;36(2):752–774.
- Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. J Comput Graph Stat. 2006;15(2):265–286.
- Journée M, Nesterov Y, Richtárik P, et al. Generalized power method for sparse principal component analysis. J Mach Learn Res. 2010;11(15):517–553.
- Lu Z, Zhang Y. An augmented Lagrangian approach for sparse principal component analysis. Math Program. 2012;135(1–2):149–193.
- Boufounos PT, Baraniuk RG. 1-bit compressive sensing. In: Annual Conference on Information Sciences and Systems. IEEE; 2008. p. 16–21.
- Laska JN, Wen Z, Yin W, et al. Trust, but verify: fast and accurate signal recovery from 1-bit compressive measurements. IEEE Trans Signal Process. 2011;59(11):5289–5301.
- Joho M, Mathis H. Joint diagonalization of correlation matrices by using gradient methods with application to blind signal separation. In: Sensor Array and Multichannel Signal Processing Workshop Proceedings. IEEE; 2002. p. 273–277.
- Theis FJ, Cason TP, Absil PA. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In: International Symposium on Independent Component Analysis and Blind Signal Separation. Springer; 2009. p. 354–361.
- Sato H. Riemannian Newton-type methods for joint diagonalization on the Stiefel manifold with application to independent component analysis. Optimization. 2017;66(12):2211–2231.
- Helfrich K, Willmott D, Ye Q. Orthogonal recurrent neural networks with scaled Cayley transform. In: International Conference on Machine Learning. PMLR; Vol. 80; 2018. p. 1969–1978.
- Bansal N, Chen X, Wang Z. Can we gain more from orthogonality regularizations in training deep networks? In: Advances in neural information processing systems. Curran Associates Inc.; 2018. p. 4266–4276.
- Yamada I, Ezaki T. An orthogonal matrix optimization by dual Cayley parametrization technique. In: 4th International Symposium on Independent Component Analysis and Blind, Signal Separation; 2003. p. 35–40.
- Hori G, Tanaka T. Pivoting in Cayley transform-based optimization on orthogonal groups. In: Proceedings of the Second APSIPA Annual Summit and Conference; 2010. p. 181–184.
- Wen Z, Yin W. A feasible method for optimization with orthogonality constraints. Math Program. 2013;142(1–2):397–434.
- Gao B, Liu X, Chen X, et al. A new first-order algorithmic framework for optimization problems with orthogonality constraints. SIAM J Optim. 2018;28(1):302–332.
- Zhu X. A Riemannian conjugate gradient method for optimization on the Stiefel manifold. Comput Optim Appl. 2017;67(1):73–110.
- Zhu X, Sato H. Riemannian conjugate gradient methods with inverse retraction. Comput Optim Appl. 2020;77(3):779–810.
- Fraikin C, Hüper K, Dooren PV. Optimization over the Stiefel manifold. In: Proceedings in Applied Mathematics and Mechanics. Wiley; Vol. 7; 2007.
- Reddi SJ, Hefny A, Sra S, et al. Stochastic variance reduction for nonconvex optimization. In: International Conference on Machine Learning. PMLR; Vol. 48; 2016. p. 314–323.
- Ghadimi S, Lan G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math Program. 2016;156:59–99.
- Allen-Zhu Z. Natasha 2: faster non-convex optimization than SGD. In: Advances in neural information processing systems. Curran Associates Inc.; 2018. p. 2680–2691.
- Ward R, Wu X, Bottou L. AdaGrad stepsizes: sharp convergence over nonconvex landscapes. In: International Conference on Machine Learning. PMLR; Vol. 97; 2019. p. 6677–6686.
- Chen X, Liu S, Sun R, et al. On the convergence of a class of adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941, 2018.
- Tatarenko T, Touri B. Non-convex distributed optimization. IEEE Trans Automat Control. 2017;62(8):3744–3757.
- Nocedal J, Wright S. Numerical optimization. 2nd ed. New York (NY): Springer; 2006.
- Boumal N, Absil PA. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra Appl. 2015;475:200–239.
- Pitaval RA, Dai W, Tirkkonen O. Convergence of gradient descent for low-rank matrix approximation. IEEE Trans Inf Theory. 2015;61(8):4451–4457.
- Sato H, Iwai T. Optimization algorithms on the Grassmann manifold with application to matrix eigenvalue problems. Jpn J Ind Appl Math. 2014;31(2):355–400.
- Xu Y, Zeng T. Fast optimal H2 model reduction algorithms based on Grassmann manifold optimization. Int J Numer Anal Model. 2013;10(4):972–991.
- Kume K, Yamada I. Adaptive localized Cayley parametrization technique for smooth optimization over the Stiefel manifold. In: European Signal Processing Conference. EURASIP; 2019. p. 500–504.
- Kume K, Yamada I. A Nesterov-type acceleration with adaptive localized Cayley parametrization for optimization over the Stiefel manifold. In: European Signal Processing Conference. EURASIP; 2020. p. 2105–2109.
- Kume K, Yamada I. A global Cayley parametrization of Stiefel manifold for direct utilization of optimization mechanisms over vector spaces. In: International Conference on Acoustics, Speech, and Signal Processing. IEEE; 2021. p. 5554–5558.
- Edelman A, Arias TA, Smith ST. The geometry of algorithms with orthogonality constraints. SIAM J Matrix Anal Appl. 1998;20(2):303–353.
- Nikpour M, Manton JH, Hori G. Algorithms on the Stiefel manifold for joint diagonalisation. In: International Conference on Acoustics, Speech, and Signal Processing. IEEE; Vol. 2. 2002. p. 1481–1484.
- Nishimori Y, Akaho S. Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing. 2005;67:106–135.
- Absil PA, Baker CG, Gallivan KA. Trust-region methods on Riemannian manifolds. Found Comut Math. 2007;7(3):303–330.
- Abrudan TE, Eriksson J, Koivunen V. Steepest descent algorithms for optimization under unitary matrix constraint. IEEE Trans Signal Process. 2008;56(3):1134–1147.
- Absil PA, Malick J. Projection-like retractions on matrix manifolds. SIAM J Optim. 2012;22(1):135–158.
- Ring W, Wirth B. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J Optim. 2012;22(2):596–627.
- Huang W, Gallivan KA, Absil PA. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J Optim. 2015;25(3):1660–1685.
- Jiang B, Dai YH. A framework of constraint preserving update schemes for optimization on Stiefel manifold. Math Program. 2015;153(2):535–575.
- Manton JH. A framework for generalising the Newton method and other iterative methods from Euclidean space to manifolds. Numer Math. 2015;129:91–125.
- Sato H, Iwai T. A new, globally convergent Riemannian conjugate gradient method. Optimization. 2015;64(4):1011–1031.
- Kasai H, Mishra B. Inexact trust-region algorithms on Riemannian manifolds. In: Advances in neural information processing systems. Curran Associates Inc.; 2018. p. 4254–4265.
- Nesterov Y. A method for solving the convex programming problem with convergence rate o(1/k2). Dokl Akad Nauk SSSR. 1983;269:543–547.
- Siegel JW. Accelerated optimization with orthogonality constraints. J Comput Math. 2020;39(2):207–226.
- Boumal N, Mishra B, Absil PA, et al. Manopt, a Matlab toolbox for optimization on manifolds. J Mach Learn Res. 2014;15:1455–1459.
- Satake I. Linear algebra. New York (NY): Marcel Dekker Inc.; 1975.
- Van den Bos A. Parameter estimation for scientists and engineers. New York (NY): Wiley; 2007.
- Horn RA, Johnson CR. Matrix analysis. 2nd ed. Cambridge (MA): Cambridge University Press; 2012.
Appendix 1.
Basic facts on the Stiefel manifold, the Cayley transform and tools for matrix analysis
In this section, we summarize basic properties on and the Cayley transform together with elementary tools for matrix analysis.
Fact A.1
Stiefel manifold [Citation1,Citation41]
The Stiefel manifold
is an embedded submanifold of
. The topology
, the family of all open subsets, of
is defined as any union of sets in
.
The dimension of
is
, i.e. every point
has an open neighbourhood
such that there exists a homeomorphism
between
and some open subset of
.
The Stiefel manifold
is compact. Moreover,
with p<N is connected while
is a disconnected union of connected subsets
and
.
The tangent space to
at
is expressed as
in terms of an arbitrarily chosen
satisfying
(see, e.g. [Citation1, Example 3.5.2]). The projection mapping
onto
is given byFootnote10 (see, e.g. [Citation1, Example 3.6.2])
(A1)
(A1)
Fact A.2
Commutativity of the Cayley transform pair, e.g. [Citation56]
The Cayley transform φ in (Equation3(3)
(3) ) and its inversion
in (Equation4
(4)
(4) ) can be expressed as
(A2)
(A2)
Fact A.3
Denseness (see [Citation20])
For , define
, i.e.
Then, for
and
defined just after (Equation6
(6)
(6) ),
is a dense subset of
, i.e. the closure of
is
.
Proof.
It suffices to show for that there exists a sequence
such that
.
Let . Then,
can be expressed as
with some
and
(A3)
(A3) (see [Citation56, IV. §5]), where
satisfy
, and
. The relation
ensures
, thus the number
must be even. Define
, where
is given by replacing each diagonal block matrix
in
[in (EquationA3
(A3)
(A3) )] with
. From
for
, we have
and
, which implies
is dense in
.
Lemma A.4
Matrix norms
For
and
, it holds
and
.
For
, we have
,
and
, where
stands for the ith largest singular value of a given matrix.
For
,
.
for all
.
For
, it holds
.
Proof.
(a) Let be the ith column vector of
. Then, it holds
where
stands for the Euclidean norm for vectors. Thus, we have
. By taking the transpose of
in the previous inequality, we have
.
(b) For , let
be the ith largest eigenvalue of a symmetric matrix
. Then, we have the expression
, which asserts
and
.
(c) By (a) and (b), .
(d) The nonsingularity of (see (b)) yields
, and
. Since
is continuous and
is connected,
is positive-valued.
(e) Let be a singular value decomposition with
and a nonnegative diagonal matrix
. Then, we obtain
, implying thus
by (d). Moreover by (b), we have
and
.
Fact A.5
Derivative of matrix functions (see, e.g. [Citation57, Appendix D])
Let be an open interval. Then, the following hold:
Let
and
be differentiable on D. Then,
Let
be differentiable and invertible on D. Then,
Fact A.6
The Schur complement formula [Citation58, Sec. 0.8.5]
Let be a nonsingular block matrix of
. Define a Schur complement matrix of
by
. Then,
is nonsingular if and only if
is nonsingular, and the inversion
can be expressed as
Moreover, it holds
.
Fact A.7
The Sherman-Morrison-Woodbury formula [Citation58, Sec. 0.7.4]
For nonsingular matrices ,
, and rectangular matrices
,
, let
. If
and
are nonsingular, then
Appendix 2.
Retraction-based strategy for optimization over the Stiefel manifold
We summarize a standard strategy for optimization over .
Definition A.8
Retraction [Citation1]
The set of mappings defined at each
is called a retraction of
if it satisfies (i)
; (ii)
for all
and
.
Retractions serve as certain approximations of the exponential mapping Footnote11. Many examples of retractions for
are known, e.g. with QR decomposition, with polar decomposition and with the Euclidean projection [Citation1,Citation45,Citation46] as well as with the Cayley transform [Citation22,Citation45].
In view of the fact that is a Riemannian manifold, Problem 1.1 has been tackled with retractions as an application of the standard strategies for optimization defined over Riemannian manifolds. In such a strategy for
based on a retraction [Citation1,Citation22,Citation24,Citation41–52], the computation for updating the estimate
to
at the nth iteration is decomposed into two steps: (i) determine a search direction
; (ii) assign
to a new estimate
. Along this strategy, optimization algorithms designed originally over a single vector space have been extended to those designed over tangent spaces, to
, by using additional tools, e.g. a vector transport [Citation1] and the inversion mapping of retractions [Citation25], if necessary. Such extensions have been made for many schemes, e.g. the gradient descent method [Citation41–43,Citation45], the conjugate gradient method [Citation24,Citation25,Citation47,Citation51], Newton's method [Citation41,Citation50], quasi-Newton's method [Citation47,Citation48], the Barzilai–Borwein method [Citation22,Citation49] and the trust-region method [Citation44,Citation52].
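For illustration only, the next sketch instantiates the two-step update (i) and (ii) above with the projected Euclidean gradient as the search direction and the QR retraction sketched earlier, applied to the hypothetical objective f(U) = -trace(U^T A U); none of these choices, nor the step size, are prescribed by the paper.

```python
import numpy as np

# Illustrative retraction-based gradient descent on St(p, N) for the hypothetical
# objective f(U) = -trace(U^T A U), whose minimizers span the p leading eigenvectors
# of the symmetric matrix A; the step size and iteration count are ad hoc.
rng = np.random.default_rng(6)
N, p, steps, alpha = 20, 3, 500, 0.02
A = rng.standard_normal((N, N)); A = (A + A.T) / 2

U, _ = np.linalg.qr(rng.standard_normal((N, p)))   # initial feasible point
for _ in range(steps):
    egrad = -2 * A @ U                             # Euclidean gradient of f
    S = U.T @ egrad
    rgrad = egrad - U @ ((S + S.T) / 2)            # (i) search direction: minus the projected gradient
    Q, R = np.linalg.qr(U - alpha * rgrad)         # (ii) retract the step back onto St(p, N)
    U = Q * np.sign(np.diag(R))

# trace(U^T A U) should approach the sum of the p largest eigenvalues of A.
print(np.trace(U.T @ A @ U), np.linalg.eigvalsh(A)[-p:].sum())
```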
Appendix 3.
Proof of Proposition 2.2
The second equality in (14) is verified by
. Fact A.6 and
guarantee the non-singularity of
and
(A4) which implies
and the expressions of
in (15).
In the following, we will show on
by dividing the proof into four steps.
(I) Proof of . For every
, (A2) ensures
thus
.
is confirmed by the expression in (15), i.e.
(A5) and
.
(II) Proof of . Let
and
in (10). Then, by
we deduce with (15)
(III) Proof of . Let
and
with
. It suffices to show
and
. Then, by the definition of
in (10), (11) and (12), and by
each block matrix in (10) can be evaluated as
which implies
.
(IV) Proof of diffeomorphism of and
. From (II) and (III), we have seen
, and both
and
are homeomorphic between their domains and images, and consist of finitely many matrix additions, matrix multiplications and matrix inversions, which are all smooth. Therefore,
and
are diffeomorphic between their domains and images.
Appendix 4.
Proof of Theorem 2.3
(a) From the definition of in (6),
is the restriction of
to
, which implies
. Thus, it suffices to show for every
that there exists
satisfying
, which is verified by the following lemma.
Lemma A.9
Let and
with
,
, and
. Define
(A6) Then,
and
Proof.
From the skew-symmetries of and
, we have
, thus
.
By letting , Fact A.6 yields
from the non-singularities of
and
(see Lemma A.4(b)). The expressions in (6) and (4) assert that
On the other hand, from (15), we obtain
Clearly to get
, it suffices to show
because
holds automatically by the definition of
in (A6). The equation
is verified by
and by
(b) (Openness) By the continuity of , the preimage
is closed in
. Since
is closed in
,
is open in
.
(Denseness) It suffices to show, for every , there exists a sequence
such that
. Let
with
, where
satisfies
. Then,
(see (a)) ensures
. By using the denseness of
in
(see Fact A.3), we can construct a sequence
such that
. Moreover by defining
, the continuity of Ξ yields
.
(c) are open dense subsets of
from Theorem 2.3(b). The openness of
is clear. To show the denseness of
in
, choose
and
arbitrarily. By the open denseness of
, there exist
and
satisfying
, where
. The denseness of
in
yields the existence of
, from which we obtain
.
(d) From (A5), we have
for
, where
is the Schur complement matrix of
. Fact A.6 yields
due to
. Lemma A.4(d) ensures
. By Lemma A.4(e), we have
as
, implying thus
.
Assume that satisfies
. By
in Lemma A.4(e), we have
. The assumption asserts
as
.
Appendix 5.
On the choice of
for
in Proposition 2.2
For 2p<N, let and
satisfy
, and
. From
, we have
(A7) From
, Theorem 2.3(a) ensures
.
In the following, let us consider the case of to show that
is not injective on
. Since
does not depend on
, we can assume, without loss of generality,
,
and
with
and
satisfying
. We have
(A8)
Now, by using , define
as
and
, where
is guaranteed by
and 0<N−2p. Then,
and (A8) with
yield
(A9) where
,
and
Finally, by applying Lemma A.9 to (A9) and (A7), we deduce
for all
. This implies that infinitely many
achieve
, and clearly
is not injective.
Appendix 6.
Proof of Proposition 2.9
The differentiability of is verified by the differentiabilities of f and
. Let
. From the chain rule, we obtain
Moreover, by
and Fact A.5, we deduce
Therefore, we have
where
is defined in (22). Furthermore, we have
, where the first equality follows by
for any symmetric matrix
and the second equality follows by (21) and
. Therefore, we obtain
(A10)
On the other hand, by letting be the gradient of
at
, it follows
(A11) By noting
, (A10) and (A11) imply
. By applying (A4) to (22), the expression (23) is derived as
By substituting into (22), and by
, we deduce
and
Appendix 7.
Proof of Proposition 2.10
(I) Proof of Proposition 2.10(a). We need the following lemma to show Proposition 2.10(a). Figure A1 illustrates the relation between the following lemma and Proposition 2.10(a).
Figure A1. A flow chart representing the overview of the proof of Proposition 2.10(a). The goal is to derive a transformation formula from ∇f_{S_2}(V_2) to ∇f_{S_1}(V_1) under Φ_{S_1}^{-1}(V_1) = Φ_{φ_{S_1}^{-1}(V_1)}^{-1}(0) = Φ_{φ_{S_2}^{-1}(V_2)}^{-1}(0) = Φ_{S_2}^{-1}(V_2).
Lemma A.10
Let be a differentiable function, and let
,
and
in (6), implying thus
. Then, the following hold:
For
in (21) and
in (22), we have
and
(A13)
The gradients of
and
satisfy
(A14) and
(A15)
If
satisfy
, i.e.
in (13), then we have
(A16) where
.
Proof.
(a) Combining and
, we obtain
(A17) The relation (A13) is obtained by substituting (A17) into an alternative expression of (21):
(b) (A14) is confirmed by applying (A13) to (20) as
To derive (A15) from (A14), first let
and apply (A4) with
as
(A18) where
satisfies
. By substituting (A18) into (A14), we obtain
and
from which we obtain
(c) From , and
we see
and
by
.
Thus, it follows from and
that
We now return to the proof of Proposition 2.10(a). Let and
. Since
, Lemma A.10(c) implies
. Moreover, from Lemma A.10, we have the relations
Finally, by substituting the second and last relations into the first relation, we complete the proof.
(II) Proof of Proposition 2.10(b) and (c). From Proposition 2.10(a), Lemma A.4(a) and (b), we obtain
where the last inequality is derived by
from the fact that the eigenvalues of a triangular matrix are its diagonal entries. Finally, by applying Lemma A.4(b) again, we obtain Proposition 2.10(b), which implies Proposition 2.10(c).
Appendix 8.
Useful properties of
for optimization
The properties of in the following Proposition A.11 are useful in transplanting powerful computational arts designed for optimization over a vector space into the minimization of
over
. Indeed, the Lipschitz continuity of the gradient is one of the commonly used assumptions in optimization over a vector space (see, e.g. [Citation27–32]). The boundedness of the gradient is a key property for distributed optimization and stochastic optimization over a vector space (see, e.g. [Citation30–32]). The variance boundedness of the gradient is also commonly assumed in stochastic optimization over a vector space (see, e.g. [Citation28–30]).
Proposition A.11
Bounds for the gradient after Cayley parametrization
Let be continuously differentiable. Then, for any
, the following hold:
(Lipschitz continuity). If
and
, then the gradient of
satisfies
(A19)
(Boundedness).
(A20)
(Variance boundedness). Suppose
is indexed by realizations of a random variable ξ and is continuously differentiable for each realization. If there exists
and f satisfies
then we have
(A21)
Proof.
The existence of the maximum of over
is guaranteed by the compactness of
and the continuities of
and
. We divide the proof of (a)–(c) as follows. Recall that
and
for
were respectively defined as (22) and (21), and we have
(see Proposition 2.9). In the following, we use properties of
; (i)
for
; (ii) the linearity of
.
(I) Proof of Proposition A.11(a). First, we introduce a useful inequality.
Lemma A.12
Lipschitz continuity of
For every ,
is Lipschitz continuous over
with Lipschitz constant 2, i.e.
Proof.
From (14) and Lemma A.4(a) and (c), we have
We now return to the proof of Proposition A.11(a). From (20) and (21) in Proposition 2.9, we have
(A22) Moreover, from (22), for all
with
, we deduce
(A23) The first term on the right-hand side of (A23) can be bounded as
Similarly, the last term in (A23) can be bounded above by
. The second term in (A23) can be evaluated as
Therefore, the left-hand side of (A23) is bounded as
which is combined with (A22) to get (A19).
(II) Proof of Proposition A.11(b). From (20) and (21) in Proposition 2.9, we have
By using Lemma A.4(a) and (b), we get
which implies (A20).
(III) Proof of Proposition A.11(c). From (20) and (21) in Proposition 2.9, we obtain, for each ξ,
By taking the expectation of both sides, we get (A21).
Appendix 9.
Proof of Proposition 3.7
Application of (14) to
yields
(for the second equality, see the proof of Lemma A.4(c)) and
where
, the second-to-last inequality is derived from Lemma A.4(b), and the last inequality is derived from (A4).
To evaluate the norm , let
be the eigenvalue decomposition, where
is an orthogonal matrix and
is a diagonal matrix whose entries are non-negative. Then, we have
The norm
is bounded above as
(A25) where the last inequality is derived from the skew-symmetry of
and Lemma A.4(b). Moreover, by
, we have
. Furthermore, from the definition of the spectral norm, we have
. By substituting these relations into (A24), we complete the proof of (28). Equation (29) is verified by
.
Appendix 10.
Gradient of
Proposition A.13
Let and
satisfy
. For a differentiable function
, the Cayley transform-based retraction
in (30), and
, the function
is differentiable with
where
and
. The matrix
can be expressed as
with
and
.
Proof.
Let . From the chain rule and Fact A.5, we obtain
(A26) due to
(see (30) and (4)) and
For
, we have
because
is skew-symmetric. Fact A.1(d) yields
.
On the other hand, we obtain
(A27) From (A26), (A27) and
, it holds
.
In the following, let us consider the expression of following the discussion in [Citation22, Lemma 4]. From
, we have
. Then, applying the Sherman-Morrison-Woodbury formula (see Fact A.7) to
, we obtain
.