Time Series and Longitudinal Data Analysis

Smooth and Probabilistic PARAFAC Model with Auxiliary Covariates

Pages 538-550 | Received 15 Aug 2022, Accepted 13 Aug 2023, Published online: 07 Nov 2023

Abstract

As immunological and clinical studies become more complex, there is an increasing need to analyze temporal immunophenotypes alongside demographic and clinical covariates, where each subject contributes a matrix-valued time series of potentially high-dimensional longitudinal features, as well as other static characterizations. Researchers aim to find low-dimensional embeddings of subjects from these matrix-valued time series observations and to investigate relationships between static clinical responses and the embeddings. However, constructing such embeddings can be challenging due to high dimensionality, sparsity, and irregularity in sample collection over time. In addition, incorporating static auxiliary covariates is frequently desired during such a construction. To address these issues, we propose a smoothed probabilistic PARAFAC model with covariates (SPACO) that uses auxiliary covariates of interest. We provide extensive simulations to test different aspects of SPACO and demonstrate its application to an immunological dataset from patients with SARS-CoV-2 infection. Supplementary materials for this article are available online.

1 Introduction

Sparsely observed multivariate time series data are now common in immunological studies. For each subject or participant $i$ ($i=1,\dots,I$), we can collect multiple measurements on $J$ features over time, but often at $n_i$ different time stamps $\{t_{i,1},\dots,t_{i,n_i}\}$. For example, immune profiles are measured for hundreds of markers at irregular sampling times in Lucas et al. (Citation2020) and Rendeiro et al. (Citation2021). Let $X_i\in\mathbb{R}^{n_i\times J}$ be the longitudinal measurements for subject $i$, and $\mathbf{X}\in\mathbb{R}^{I\times T\times J}$ be the sparse three-way tensor collecting $X_i$ for all $I$ subjects, where $T$ is the number of unique time points across all subjects, that is, $T = |\cup_i\{t_{i,1},\dots,t_{i,n_i}\}|$. A major computational task in practice is to construct low-dimensional embeddings that capture data variability in $\mathbf{X}$ across subjects. Tensor decomposition is a widely used technique for achieving this goal. Figure 1 demonstrates the PARAFAC/CP tensor decomposition and the utilization of the decomposition results (Harshman and Lundy Citation1994), where we approximate a tensor by the sum of rank-one tensors. Apart from tensor reconstruction and denoising, the decomposition of the three-way tensor $\mathbf{X}$ provides decomposed factors $U$, $V$, and $\Phi$, which aim to capture major variations along the subject, feature, and time directions, respectively. These factors can be used for further downstream analysis. For instance, the factor matrix $U$ (also denoted as the subject scores) serves as the aforementioned subject embedding, which can be used to investigate the relationship between immune profiles and various clinical/demographic covariates.

Fig. 1 Illustration of the PARAFAC/CP decomposition method for immunoprofiles measured longitudinally. The white color represents missing data in the time direction. The three-way tensor is approximated as the sum of $K$ rank-one tensors $F_1,\dots,F_K$. Each rank-one tensor $F_k$ ($k=1,\dots,K$) can be expressed using a factor along the subject direction ($U_k$), a factor along the feature direction ($V_k$), and a factor along the time direction ($\Phi_k$). The factors associated with the tensor decomposition can be used in downstream analysis.

The immune profiles are usually sparsely observed along the time dimension; that is, the sets $\{t_{i,1},\dots,t_{i,n_i}\}$ for different subjects $i$ tend to be small in size and have low overlap, resulting in a high missing rate along the time dimension of $\mathbf{X}$ (Figure 1). In addition, researchers may have access to nontemporal covariates for each subject, such as medical conditions and demographic information, which may partially explain the variation in the temporal measurements of $\mathbf{X}$ across subjects. As an example, the IMPACT study (Lucas et al. Citation2020) analyzed immune profiles of 98 COVID-19 infected individuals. Researchers measured immune profiles for more than 30 days after symptom onset, but with only an average of 1.84 measured time points per subject. Auxiliary demographic information, including sex and age, and several preexisting medical conditions were also collected. To better estimate $U$ in this example, we need a tensor decomposition tool that can properly model the longitudinal behavior of the factors and incorporate available auxiliary covariates.

In this article, we propose SPACO (smooth and probabilistic PARAFAC model with auxiliary covariates) to adapt to the sparsity along the time dimension in $\mathbf{X}$ and use the auxiliary covariates $Z$. SPACO assumes that $\mathbf{X}$ is a noisy realization of some low-rank signal tensor whose time components are smooth and whose subject scores have a potential dependence on the auxiliary covariates $Z$:
$$x_{itj} = \sum_{k=1}^K u_{ik}\phi_{tk}v_{jk} + \epsilon_{itj},\quad \epsilon_{itj}\sim N(0,\sigma_j^2),$$
$$u_i = (u_{ik})_{k=1}^K \sim N(\eta_i,\Lambda_f),\quad \eta_i = \beta^\top z_i.$$

Here, (a) $z_i$ is the $i$th row of $Z$ and $\beta\in\mathbb{R}^{q\times K}$ describes the dependence of the expected subject score for subject $i$ on $z_i$; (b) $u_{ik}$, $\phi_{tk}$, $v_{jk}$ are the subject score, trajectory value, and feature loading for factor $k$ and the observation indexed by $(i,t,j)$ in the PARAFAC model, where $u_i$ has a normal prior $N(\eta_i,\Lambda_f)$. We impose smoothness on the time trajectories $(\phi_{tk})_{t=1}^T$ and sparsity on $\beta$ to deal with the irregular sampling along the time dimension and the potentially high dimensionality of $Z$.
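To make the generative model concrete, the following minimal numpy sketch (our own illustration; the variable names and dimensions are ours and not part of the SPACO package) simulates one fully observed draw from the model above:

```python
import numpy as np

rng = np.random.default_rng(0)
I, T, J, K, q = 100, 30, 10, 3, 5          # subjects, time points, features, rank, covariates

Z = rng.normal(size=(I, q))                 # auxiliary covariates
beta = rng.normal(size=(q, K))              # covariate effects on subject scores
eta = Z @ beta                              # prior means: eta_i = beta^T z_i
s2 = np.array([1.0, 0.5, 0.25])             # prior variances s_k^2 (diagonal of Lambda_f)
U = eta + rng.normal(size=(I, K)) * np.sqrt(s2)   # u_i ~ N(eta_i, Lambda_f)

Phi = rng.normal(size=(T, K))               # time trajectories (smooth in the real model)
V = rng.normal(size=(J, K))                 # feature loadings
sigma2 = rng.uniform(0.5, 1.5, size=J)      # feature-specific noise variances sigma_j^2

F = np.einsum('ik,tk,jk->itj', U, Phi, V)   # signal tensor: sum_k U_k o Phi_k o V_k
X = F + rng.normal(size=(I, T, J)) * np.sqrt(sigma2)  # noisy observations x_itj
```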

Alongside the proposal of the model, we will delve into several issues pertinent to SPACO, including model initialization, auto-tuning of the smoothness and sparsity in $\beta$, and hypothesis testing on $\beta$. Effectively addressing these concerns is crucial for applying SPACO in practice. In the remainder of the article, we summarize work closely related to our study in Section 1.1. We describe the SPACO model in Section 2 and discuss model parameter estimation with predetermined tuning parameters in Section 3. Section 4 is dedicated to the examination of the aforementioned issues that bear practical significance. A comparative analysis of SPACO with existing methods on synthetic data is presented in Section 5. Lastly, in Section 6, we employ SPACO on a highly sparse tensor containing immunological measurements from SARS-CoV-2 infected patients. We also provide a Python package, SPACO, for researchers who are interested in using the proposed method.

1.1 Related Work

Matrix-valued observations have become increasingly important in modern applications, particularly in the context of dimension reduction or matrix-valued data clustering (Chang Citation1983; Vichi, Rocci, and Kiers Citation2007; Viroli Citation2011; Anderlucci and Viroli Citation2015).

Tensor decomposition is a useful technique for modeling matrix-valued observations, and has been applied in various fields (Acar and Yener Citation2008; Kolda and Bader Citation2009; Sidiropoulos et al. Citation2017). In the study of multivariate longitudinal data, researchers in economics have combined tensor decomposition with auto-cross-covariance estimation and autoregressive models (Fan, Fan, and Lv Citation2008; Lam, Yao, and Bathia Citation2011; Fan, Liao, and Mincheva Citation2011; Bai and Wang Citation2015, Citation2016; Wang, Liu, and Chen Citation2019; Wang et al. Citation2021). However, these approaches are either not compatible with highly sparse data or do not scale well with the feature dimensions, both of which are important for medical applications.

Functional PCA (Besse and Ramsay Citation1986; Yao, Müller, and Wang Citation2005) is often used for modeling sparse longitudinal data in the matrix form. It uses the smoothness of time trajectories to handle sparsity in longitudinal observations and estimates the eigenvectors and factor scores under this smoothness assumption. Yokota, Zhao, and Cichocki (Citation2016) and Imaizumi and Hayashi (Citation2017) introduced smoothness into tensor decomposition and estimated the parameters by iteratively solving penalized regression problems. However, these methods do not consider auxiliary covariates.

It has been previously discussed that including auxiliary information could potentially improve estimation. For example, Li, Shen, and Huang (Citation2016) proposed the supervised sparse and functional principal component (SupSFPC) method and observed that auxiliary covariates improve signal estimation quality in the matrix setting for modeling multivariate longitudinal observations. In Chen, Tsay, and Chen (Citation2020), the authors observed that the inclusion of auxiliary constraints improves the quality of tensor decomposition. Lock and Li (Citation2018) proposed SupCP, a supervised multiway factorization model for completely observed tensors that employs a probabilistic tensor model (Tipping and Bishop Citation1999; Mnih and Salakhutdinov Citation2007; Hinrich and Mørup Citation2019). Although an extension to sparse tensors is straightforward, SupCP does not model smoothness and can be strongly affected by severe missingness along the time dimension.

To address these limitations, we propose SPACO, an extension of functional PCA and SupSFPC to the setting of three-way tensor decomposition using the parallel factor (PARAFAC) model (Carroll, Pruzansky, and Kruskal Citation1980; Harshman and Lundy Citation1994). SPACO jointly models smooth longitudinal data with potentially high-dimensional static covariates Z using a probabilistic model. When Z is unavailable, we refer to the SPACO model as SPACO-. SPACO- itself is also an attractive alternative to existing tensor decomposition implementations with probabilistic modeling, smoothness regularization, and automatic parameter tuning. In our numerical experiments, we demonstrate the advantages of SPACO and SPACO- over several state-of-the-art tensor decomposition methods and the improvement of SPACO over SPACO- by using Z.

2 SPACO Model

2.1 Notations

We use (a) bolded uppercase letters to represent three-way tensors, (b) regular uppercase letters to represent matrices, and (c) bolded lowercase letters to represent vectors. Following this convention, let $\mathbf{X}\in\mathbb{R}^{I\times T\times J}$ be the tensor for some sparse multivariate longitudinal observations, where $I$ is the number of subjects, $J$ is the number of features, and $T$ is the number of total unique time points. Let
$$X^I := (X_{:,:,1}\ \cdots\ X_{:,:,J})\in\mathbb{R}^{I\times(TJ)},\quad X^T := (X_{:,:,1}^\top\ \cdots\ X_{:,:,J}^\top)\in\mathbb{R}^{T\times(IJ)},\quad X^J := (X_{:,1,:}^\top\ \cdots\ X_{:,T,:}^\top)\in\mathbb{R}^{J\times(IT)}$$
be the matrix unfoldings of $\mathbf{X}$ in the subject/time/feature dimensions, respectively. For any matrix $A$, we let $A_{i:}$/$A_{:i}$ denote its $i$th row/column, and often write $A_{:i}$ as $A_i$ for the $i$th column for convenience. We also define:

Tensor product $\circ$: for $a\in\mathbb{R}^I$, $b\in\mathbb{R}^T$, $c\in\mathbb{R}^J$, $A = a\circ b\circ c\in\mathbb{R}^{I\times T\times J}$ with $A_{itj} = a_ib_tc_j$.

Kronecker product $\otimes$: for $A\in\mathbb{R}^{I_1\times K_1}$, $B\in\mathbb{R}^{I_2\times K_2}$, $C = A\otimes B\in\mathbb{R}^{(I_1I_2)\times(K_1K_2)}$ is the Kronecker product of $A$ and $B$:
$$C = A\otimes B = \begin{pmatrix} A_{11}B & \cdots & A_{1K_1}B\\ \vdots & & \vdots\\ A_{I_11}B & \cdots & A_{I_1K_1}B \end{pmatrix}\in\mathbb{R}^{(I_1I_2)\times(K_1K_2)}.$$

Column-wise Khatri-Rao product $\odot$: for $A\in\mathbb{R}^{I_1\times K}$, $B\in\mathbb{R}^{I_2\times K}$, $C = A\odot B\in\mathbb{R}^{(I_1I_2)\times K}$ with $C_{:,k} = A_{:,k}\otimes B_{:,k}$ for $k=1,\dots,K$.

Element-wise multiplication $\cdot$: for $A,B\in\mathbb{R}^{I\times K}$, $C = A\cdot B\in\mathbb{R}^{I\times K}$ with $C_{ik} = A_{ik}B_{ik}$; for $b\in\mathbb{R}^K$, $C = A\cdot b = A\,\mathrm{diag}\{b_1,\dots,b_K\}$; for $b\in\mathbb{R}^I$, $C = b\cdot A = \mathrm{diag}\{b_1,\dots,b_I\}A$.
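For readers who prefer code, the numpy equivalents of these four operations are shown below (a quick reference of our own, not part of the SPACO package):

```python
import numpy as np

a, b, c = np.ones(4), np.arange(3.0), np.array([1.0, -1.0])
A, B = np.random.rand(4, 3), np.random.rand(5, 3)

# Tensor (outer) product: (a o b o c)_{itj} = a_i b_t c_j
T3 = np.einsum('i,t,j->itj', a, b, c)               # shape (4, 3, 2)

# Kronecker product: blocks A_{ik} * B
C_kron = np.kron(A, B)                              # shape (20, 9)

# Column-wise Khatri-Rao product: column k is A[:, k] Kronecker B[:, k]
C_kr = np.einsum('ik,jk->ijk', A, B).reshape(-1, A.shape[1])  # shape (20, 3)

# Element-wise multiplication, with broadcasting over columns or rows
b_cols = np.array([1.0, 2.0, 3.0])
C1 = A * b_cols                                     # A . b = A diag(b_1, ..., b_K)
b_rows = np.arange(4.0)
C2 = b_rows[:, None] * A                            # b . A = diag(b_1, ..., b_I) A
```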

2.2 Smooth and Probabilistic PARAFAC Model with Covariates

We assume $\mathbf{X}$ to be a noisy realization of an underlying signal array $\mathbf{F}$, which is a rank-$K$ tensor with $U$/$\Phi$/$V$ being the decomposed factor matrices; that is, $\mathbf{F} = \sum_{k=1}^K U_k\circ\Phi_k\circ V_k$, where $U_k$/$\Phi_k$/$V_k$ are the $k$th columns of $U$/$\Phi$/$V$. We denote the rows of $U$/$\Phi$/$V$ by $u_i$/$\phi_t$/$v_j$, and their entries by $u_{ik}$/$\phi_{tk}$/$v_{jk}$. We let $x_{itj}$ denote the $(i,t,j)$-entry of $\mathbf{X}$. Then,
$$x_{itj} = \sum_{k=1}^K u_{ik}\phi_{tk}v_{jk} + \varepsilon_{itj},\quad u_i\sim N(\eta_i,\Lambda_f),\quad \varepsilon_{itj}\sim N(0,\sigma_j^2),\tag{1}$$
where $\Lambda_f = \mathrm{diag}\{s_1^2,\dots,s_K^2\}$ is a $K\times K$ diagonal covariance matrix and $\eta_i$ is a mean vector of length $K$. In practice, although the observed tensor $\mathbf{X}$ is almost always of high rank, it is often assumed that $\mathbf{F}$ is (at least approximately) low-rank with $K$ small.

We set $\eta_i = 0$ in the absence of auxiliary covariates $Z\in\mathbb{R}^{I\times q}$. In the setting where we are interested in explaining the heterogeneity of $\eta_i$ across subjects using $Z$, we may model $\eta_i$ as a function of $z_i := Z_{i,:}$. Here, we consider a linear model $\eta_{ik} = z_i^\top\beta_k$ for $k=1,\dots,K$, where $\eta_{ik}$ is the $k$th entry of $\eta_i$ and $\beta\in\mathbb{R}^{q\times K}$ is the coefficient matrix describing the dependence of the subject scores on $Z$. To avoid confusion, we will always call $\mathbf{X}$ the "features", and $Z$ the "covariates" or "variables".

Recalling that $X^I\in\mathbb{R}^{I\times(TJ)}$ is the unfolding of $\mathbf{X}$ in the subject direction, we write $\mathcal{O}_i$ for the indices of observed values in the $i$th row of $X^I$, and $X^I_{\mathcal{O}_i}$ for the vector of these observed values. Each such observed value $x_{itj}$ has noise variance $\sigma_j^2$, and we write $\Lambda_{\mathcal{O}_i}$ for the diagonal covariance matrix whose diagonal values $\sigma_j^2$ are the variances of $\varepsilon_{itj}$ at the indices in $\mathcal{O}_i$. Similarly, we define $\{\mathcal{O}_t, X^T_{\mathcal{O}_t}, \Lambda_{\mathcal{O}_t}\}$ for the unfolding $X^T\in\mathbb{R}^{T\times(IJ)}$, and $\{\mathcal{O}_j, X^J_{\mathcal{O}_j}, \Lambda_{\mathcal{O}_j}\}$ for the observed indices, the associated observed vector, and the diagonal covariance matrix for the $j$th row of $X^J\in\mathbb{R}^{J\times(IT)}$. We set $\Theta = \{V,\Phi,\beta,(\sigma_j^2, j=1,\dots,J),(s_k^2, k=1,\dots,K)\}$ to denote all model parameters. Set $H = V\odot\Phi\in\mathbb{R}^{(TJ)\times K}$ and $f_i = X^I_{\mathcal{O}_i} - H_{\mathcal{O}_i,:}\,\eta_i$ as the residual vector after removing mean signals from the observations for subject $i$. If $U$ were observed, the complete log-likelihood would be
$$\mathcal{L}(\mathbf{X},U\,|\,\Theta) = -\frac{1}{2}\sum_i\Big((X^I_{\mathcal{O}_i} - H_{\mathcal{O}_i,:}u_i)^\top\Lambda_{\mathcal{O}_i}^{-1}(X^I_{\mathcal{O}_i} - H_{\mathcal{O}_i,:}u_i) + (u_i-\eta_i)^\top\Lambda_f^{-1}(u_i-\eta_i) + \log|\Lambda_{\mathcal{O}_i}| + \log|\Lambda_f|\Big).\tag{2}$$

Set $\tilde\Sigma_i = \Lambda_{\mathcal{O}_i} + H_{\mathcal{O}_i,:}\Lambda_f H_{\mathcal{O}_i,:}^\top$. The marginalized log-likelihood integrating out the randomness in $U$ enjoys a closed form (Lock and Li Citation2018):
$$\mathcal{L}(\mathbf{X}\,|\,\Theta) \propto -\frac{1}{2}\sum_i\Big(f_i^\top\tilde\Sigma_i^{-1}f_i + \log|\tilde\Sigma_i|\Big).\tag{3}$$
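For intuition, the per-subject term of (3) can be evaluated directly. The sketch below is our own illustration, assuming complete observations for subject $i$ so that $\mathcal{O}_i$ covers all $TJ$ entries:

```python
import numpy as np

def marginal_loglik_subject(x_i, H, eta_i, Lambda, Lambda_f):
    """Per-subject contribution to (3), up to an additive constant.
    x_i: observed vector in R^{TJ}; H = V khatri-rao Phi in R^{TJ x K};
    Lambda: diagonal noise covariance (as a matrix); Lambda_f: prior covariance of u_i."""
    f_i = x_i - H @ eta_i                       # residual after removing the mean signal
    Sigma = Lambda + H @ Lambda_f @ H.T         # tilde-Sigma_i
    _, logdet = np.linalg.slogdet(Sigma)
    quad = f_i @ np.linalg.solve(Sigma, f_i)    # f_i^T tilde-Sigma_i^{-1} f_i
    return -0.5 * (quad + logdet)
```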

Model parameters in (3) are not identifiable due to (a) rescaling of the parameters from $(\Phi_k, V_k, \beta_k, s_k^2)$ to $(c_1\Phi_k, c_2V_k, c_3\beta_k, c_3^2s_k^2)$ for any $c_1c_2c_3 = 1$, and (b) reordering of the components $k=1,\dots,K$. More discussion of model identifiability can be found in Lock and Li (Citation2018). Hence, adopting similar rules from Lock and Li (Citation2018), we require

(C.1) $\|V_k\|_2^2 = 1$, $\|\Phi_k\|_2^2 = T$.

(C.2) The latent components are in decreasing order based on their overall variances $\Lambda_{f,kk} + \|Z\beta_k\|_2^2/I$, and the first nonzero entries of $V_k$ and $\Phi_k$ are positive, that is, $v_{1k} > 0$ and $\phi_{1k} > 0$ if they are nonzero.

In the immunological application considered here, $U$ represents the subject scores: latent variables characterizing differences across subjects. $V$ represents the feature loadings, revealing the composition of the factors in terms of the original features and aiding downstream interpretation. $\Phi$ represents the time trajectories, interpreted as function values sampled from underlying smooth functions $\phi_k(t)$ at a set of discrete time points, where the sampling in the time direction is often highly sparse.

Algorithm 1:

SPACO with fixed penalties

Data: X, Ω, λ1, λ2, K

Result: Estimated $V$, $\Phi$, $\beta$, $s^2$, $\sigma^2$, the posterior distribution of $U|\Theta,\mathbf{X}$, and the marginalized density $P(\mathbf{X}|\Theta)$.

  1. Initialization of $V$, $\Phi$, $\beta$, $s^2$, $\sigma^2$ and the posterior distribution of $U$;

  2. while not converged do

  3. for $k=1,\dots,K$ do

  4. $\beta_k \leftarrow \arg\min_{\beta_k}\{-\mathcal{L}(\mathbf{X}|\Theta) + \lambda_{2k}\|\beta_k\|_1\}$

  5. $V_k \leftarrow \arg\min_{V_k}[J(\Theta;\Theta_0) + \nu_V\|V_k\|_2^2]$, where $\nu_V$ is the largest value leading to the minimizer having $\|V_k\|_2^2 = 1$

  6. $\Phi_k \leftarrow \arg\min_{\Phi_k}[J(\Theta;\Theta_0) + \nu_\Phi\|\Phi_k\|_2^2]$, where $\nu_\Phi$ is the largest value leading to the minimizer having $\|\Phi_k\|_2^2 = T$

  7. $s_k^2 \leftarrow \arg\max_{s_k^2}\mathbb{E}_{U|\Theta_0}\mathcal{L}(\mathbf{X},U|\Theta)$.

  8. end

  9. For $j=1,\dots,J$: $\sigma_j^2 \leftarrow \arg\max_{\sigma_j^2}\mathbb{E}_{U|\Theta_0}\mathcal{L}(\mathbf{X},U|\Theta)$.

  10. Update the posterior distribution (posterior mean and covariance) of $U$.

  11. end

We address the challenge of sparse sampling by utilizing the smoothness assumption on $\phi_k(t)$, which we encourage by directly penalizing the function values via a penalty term $\sum_k\lambda_{1k}\Phi_k^\top\Omega\Phi_k$. This article considers Laplacian smoothing (Sorkine et al. Citation2004) with a weighted adjacency matrix $\Gamma$. Let $T[t]$ be the time associated with the $t$th slice along the time dimension, that is, the time associated with $X_{:,t,:}$. We define $\Omega$ and $\Gamma$ as
$$\Omega = \Gamma\Gamma^\top,\qquad \Gamma = \begin{pmatrix} \frac{1}{T[2]-T[1]} & 0 & \cdots & 0\\ -\frac{1}{T[2]-T[1]} & \frac{1}{T[3]-T[2]} & \cdots & 0\\ 0 & -\frac{1}{T[3]-T[2]} & \ddots & \vdots\\ \vdots & & \ddots & \frac{1}{T[T]-T[T-1]}\\ 0 & \cdots & 0 & -\frac{1}{T[T]-T[T-1]} \end{pmatrix}\in\mathbb{R}^{T\times(T-1)}.$$

The Laplacian smoothing discourages abrupt trajectory changes in observations from close-by time points, which is usually a reasonable assumption biologically. Practitioners may choose other forms for Ω based on their applications. For example, if practitioners want ϕk(t) to have slowly varying derivatives, they can also use a penalty matrix that penalizes changes in gradients over time.
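For concreteness, a small helper (ours, not the package API) that builds $\Gamma$ and $\Omega$ from the observed time stamps under this definition:

```python
import numpy as np

def laplacian_penalty(times):
    """Gamma in R^{T x (T-1)} of scaled first differences; returns Omega = Gamma Gamma^T.
    Phi_k^T Omega Phi_k then sums the squared slopes between consecutive time points."""
    times = np.asarray(times, dtype=float)
    T = len(times)
    Gamma = np.zeros((T, T - 1))
    for t in range(T - 1):
        w = 1.0 / (times[t + 1] - times[t])
        Gamma[t, t] = w          # +1/(T[t+1]-T[t]) on the diagonal
        Gamma[t + 1, t] = -w     # -1/(T[t+1]-T[t]) below it
    return Gamma @ Gamma.T

Omega = laplacian_penalty([0, 1, 3, 7, 14])   # irregular sampling times
phi = np.array([0.0, 0.5, 1.0, 1.2, 1.3])
penalty = phi @ Omega @ phi                   # small for slowly varying trajectories
```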

Further, when the number of covariates $q$ in $Z$ is moderately large, we may wish to impose sparsity on the $\beta$ parameter. We encourage such sparsity by including a lasso penalty (Tibshirani Citation2011) in the model. In summary, our goal is to find parameters maximizing the expected penalized log-likelihood, or equivalently minimizing the penalized expected deviance loss $J(\Theta)$, under norm constraints:
$$\min_\Theta\ J(\Theta) := -\mathcal{L}(\mathbf{X}|\Theta) + \sum_{k=1}^K\lambda_{1k}\Phi_k^\top\Omega\Phi_k + \sum_{k=1}^K\lambda_{2k}\|\beta_k\|_1\quad \text{s.t. } \|V_k\|_2^2 = 1,\ \|\Phi_k\|_2^2 = T \text{ for all } k=1,\dots,K.\tag{4}$$

The objective only incorporates the identifiability constraint (C.1). By changing the signs in V, Φ, β and reordering them, we can always ensure (C.2) without altering the attained objective value.

Equation (4) characterizes a nonconvex problem. We obtain locally optimal solutions via an alternating update procedure: (a) we update $\beta$ through lasso regressions while keeping the other parameters fixed, and (b) we update the other model parameters using the EM algorithm while keeping $\beta$ fixed. See Section 3 for more details.

3 Model Parameter Estimation

Given the model rank $K$ and penalty parameters $\lambda_{1k},\lambda_{2k}$, we alternately update the parameters $\beta$, $V$, $\Phi$, $s^2$, and $\sigma^2$, together with the posterior distribution of $U$, using the mixed EM procedure described in Algorithm 1. We briefly explain the updating steps here:

  1. Given other parameters, we find β to directly minimize the objective by solving a least-squares regression problem with lasso penalty.

  2. Fixing β, we update the other parameters using an EM procedure. Denote the current parameters as Θ0. At the M-step, we minimize the penalized expected negative log-likelihood

    $$J(\Theta;\Theta_0) := \mathbb{E}_{U|\Theta_0}\{-\mathcal{L}(\mathbf{X},U|\Theta)\} + \sum_k\lambda_{1k}\Phi_k^\top\Omega\Phi_k + \sum_k\lambda_{2k}\|\beta_k\|_1,\tag{5}$$

    under the current posterior distribution U|Θ0,X. We adopt a block-wise parameter updating scheme where we update Vk, Φk, Λf and σj2 sequentially.

Algorithm 1 describes the high-level ideas of our updating scheme. The posterior distribution of $u_i$ for each row of $U$ is Gaussian, with posterior covariance $\Sigma_i$ and posterior mean $\mu_i$ given below:
$$\Sigma_i = (H_{\mathcal{O}_i,:}^\top\Lambda_{\mathcal{O}_i}^{-1}H_{\mathcal{O}_i,:} + \Lambda_f^{-1})^{-1},\tag{6}$$
$$\mu_i = \Sigma_i(\Lambda_f^{-1}\beta^\top z_i + H_{\mathcal{O}_i,:}^\top\Lambda_{\mathcal{O}_i}^{-1}X^I_{\mathcal{O}_i}).\tag{7}$$
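In code, the E-step posterior in (6) and (7) for a single subject might look as follows (a sketch under our notation, assuming complete observations and a diagonal noise covariance passed as a vector):

```python
import numpy as np

def posterior_u(x_i, H, z_i, beta, lam_inv, Lambda_f):
    """Posterior N(mu_i, Sigma_i) of u_i, following (6) and (7).
    lam_inv: vector of 1/sigma_j^2 values aligned with the rows of H."""
    HtLi = H.T * lam_inv                        # H_i^T Lambda_i^{-1}
    Lf_inv = np.linalg.inv(Lambda_f)
    Sigma_i = np.linalg.inv(HtLi @ H + Lf_inv)  # eq. (6)
    mu_i = Sigma_i @ (Lf_inv @ (beta.T @ z_i) + HtLi @ x_i)  # eq. (7)
    return mu_i, Sigma_i
```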

In lines 5 and 6, we guarantee the norm constraints on $V_k$ and $\Phi_k$ by adding an additional quadratic term and setting the coefficient $\nu$ to meet the norm requirement. Although the problem is not convex, our proposed approach yields optimal solutions for each parameter block in the subroutines, and the penalized (marginalized) deviance loss is nonincreasing over the iterations.

Theorem 3.1.

In Algorithm 1, let $\Theta^l$ and $\Theta^{l+1}$ be the estimated parameters at the beginning and end of the $l$th iteration of the outer while loop. We have $J(\Theta^l)\geq J(\Theta^{l+1})$.

The proof of Theorem 3.1 is given in Appendix B.1. Derivations and explicit steps for carrying out the subroutines are deferred to Appendix A.1.

Algorithm 2:

Initialization of SPACO

  1. Let $V'$ be the top $K_3$ left singular vectors of $X^J$, using only the observed columns.

  2. Set $Y^{(k)} = (Y^{(k)}_1,\dots,Y^{(k)}_T)\in\mathbb{R}^{I\times T}$, where $Y^{(k)}_t = X_{:,t,:}V'_k\in\mathbb{R}^I$.

  3. Let $\Phi'$ be the top $K_2$ eigenvectors from functional PCA on the row aggregation of the matrices $Y^{(k)}$, $k=1,\dots,K_3$ (see Appendix A.2 for details on functional PCA).

  4. Let $\tilde U=\arg\min_U\{\|X^I - U(V'\otimes\Phi')^\top\|_F^2 + \delta\|U\|_F^2\}\in\mathbb{R}^{I\times K_2K_3}$, where $\delta$ is a small regularization parameter to avoid severe over-fitting to the noise.

  5. Let $U'$ be the top $K$ left singular vectors of $\tilde U$, and $\tilde G = U'^\top\tilde U\in\mathbb{R}^{K\times K_2K_3}$. Let $\mathbf{G}\in\mathbb{R}^{K\times K_2\times K_3}$ be the estimated core array obtained by rearranging $\tilde G$.

  6. Let $\sum_{k=1}^K A_k\circ B_k\circ C_k$ be the rank-$K$ CP approximation of $\mathbf{G}$. Stack these as the columns of $A\in\mathbb{R}^{K\times K}$, $B\in\mathbb{R}^{K_2\times K}$, $C\in\mathbb{R}^{K_3\times K}$, and set $[U,\Phi,V] = [U'A,\Phi'B,V'C]$.

  7. For each $k=1,\dots,K$, rescale the initializers to satisfy the constraints on $V$ and $\Phi$.

4 Initialization, Tuning and Testing

4.1 Initialization

One popular initialization approach for the PARAFAC decomposition is through a Tucker decomposition $[U,\Phi,V;\mathbf{G}]$ of $\mathbf{X}$ using HOSVD/MLSVD (De Lathauwer, De Moor, and Vandewalle Citation2000), where $\mathbf{G}\in\mathbb{R}^{K_1\times K_2\times K_3}$ is the core tensor and $U\in\mathbb{R}^{I\times K_1}$, $\Phi\in\mathbb{R}^{T\times K_2}$, $V\in\mathbb{R}^{J\times K_3}$ are unitary matrices multiplied with the core tensor along the subject, time, and feature directions, respectively ($K_1$/$K_2$/$K_3$ is the minimum of $K$ and $I$/$T$/$J$), followed by a PARAFAC decomposition of the small core tensor $\mathbf{G}$ (Bro and Andersson Citation1998; Phan, Tichavskỳ, and Cichocki Citation2013). Here, we propose to initialize SPACO with Algorithm 2, which combines this approach with functional PCA (Yao, Müller, and Wang Citation2005) to handle sparse longitudinal data. Algorithm 2 proceeds as follows: (a) perform SVD on $X^J$ to get $V'$; (b) project $X^J$ onto each column of $V'$ and perform functional PCA to estimate $\Phi'$; (c) run a ridge-penalized regression of the rows of $X^I$ on $V'\otimes\Phi'$, and estimate $U$ and $\mathbf{G}$ from the regression coefficients.

In a noiseless model with complete temporal observations, one may replace the functional PCA step of Algorithm 2 with standard PCA; the outputs $[U,\Phi,V]$ then form a PARAFAC decomposition of $\frac{1}{1+\delta}\mathbf{X}$.

Lemma 4.1.

Suppose $\mathbf{X} = \sum_{k=1}^K U_k^*\circ\Phi_k^*\circ V_k^*$ and is completely observed. Replace $\Phi'$ in Algorithm 2 by the top $K$ eigenvectors of $W = \frac{1}{I}\sum_{k=1}^K Y^{(k)\top}Y^{(k)}$. Then, the outputs $U,\Phi,V$ of Algorithm 2 satisfy $\mathbf{X} = (1+\delta)\sum_{k=1}^K U_k\circ\Phi_k\circ V_k$.

By default, we set $\delta = \frac{1}{J\times T}$. The proof of Lemma 4.1 is given in Appendix B.2.

4.2 Auto-Selection of Tuning Parameters

Selection of regularizers λ1 and λ2: One possible approach to select the tuning parameters λ1k and λ2k is through cross-validation. However, this can be computationally expensive even when tuning each parameter sequentially. Additionally, determining a suitable set of candidate values for the parameters before running the SPACO algorithm can be challenging. To address these issues, we adopt nested cross-validation, which has been empirically demonstrated to be useful (Huang, Shen, and Buja Citation2008; Li, Shen, and Huang Citation2016). Specifically, we tune the parameters within their corresponding subroutines as follows:

  1. In the update for Φk, we conduct column-wise leave-one-out cross-validation to select λ1k, where all observations from a single time point are left out (See Appendix A.3.)

  2. In the update for βk, we perform K-fold cross-validation to select λ2k.

Rank selection: Rank selection can be performed through cross-validation, as suggested in SupCP (Lock and Li Citation2018). To reduce computational costs, we adopt two strategies (see Appendix A.4 for more details):

1. We initialize the cross-validation parameters with estimates obtained from running a full SPACO/SPACO- analysis and only carry out a limited number of iterations to update the parameters. We find that using 5 or 10 iterations is sufficient in our experiments (the default maximum number of iterations is 10).

2. We start the analysis with the smallest possible rank and gradually increase it. We terminate the analysis when either the cross-validated marginal log-likelihood stops improving or when we reach the maximum rank that we are willing to consider.

Algorithm 3:

Randomization test for Zj

1 for $k=1,\dots,K$ do

2 Construct responses and features as described in Lemma 4.2.

3 Define $\hat\beta_k$ by $\hat\beta_k = \arg\min_{\beta_k:\,\beta_{k,j}=0}\big\{\sum_{i=1}^I\frac{1}{w_i}(z_i^\top\beta_k - \tilde y_i)^2 + \lambda_{2k}\|\beta_k\|_1\big\}$.

4 Compute the designed test statistic $T$ using $(Z_j, Z_{j^c}, \tilde y, \hat\beta_{j^c,k})$.

5 Compute randomized statistics $\bar T^b$ using $(\bar Z_j^b, Z_{j^c}, \tilde y, \hat\beta_{j^c,k})$, where $\bar Z_j^b$ for $b=1,\dots,B$ are (conditionally or marginally) independent copies of $Z_j$.

6 Let $\hat G(\cdot)$ be the empirical estimate of the CDF of $T$ under $H_0$ using $\{\bar T^1,\dots,\bar T^B, T\}$, and return the two-sided p-value $p = [1-\hat G(|T|)] + \hat G(-|T|)$.

7 end


4.3 Covariate Importance

In Section 5 of our synthetic experiments, we found that incorporating $Z$ could improve the estimation of subject scores when $Z$ strongly influenced them. This raises the question of how to determine the significance of such covariates when they are present. Here, we consider the construction of approximate p-values from partial/marginal independence tests between $Z_j$ and $\eta_k$:
$$H_{0k}^{\mathrm{partial}}: Z_j \perp\!\!\!\perp \eta_k \,|\, Z_{j^c},\qquad H_{0k}^{\mathrm{margin}}: Z_j \perp\!\!\!\perp \eta_k,$$
both of which are of practical interest.

Recap on randomization-based hypothesis testing: Before introducing our proposal, we will revisit concepts of hypothesis testing via randomization in the traditional setting where we investigate the relationship between a response variable y and a covariate variable z.

First, we investigate the marginal independence between $y$ and $z$, with the null hypothesis $H_0^{\mathrm{margin}}: z\perp\!\!\!\perp y$. The randomization test, widely employed for independence testing, offers a valid p-value without assuming a specific dependence structure between $y$ and $z$ (Fisher Citation1936; Pitman Citation1937; Efron et al. Citation2001; Ernst Citation2004). Here, we outline the randomization test for marginal independence. Let $t(\mathbf{z},\mathbf{y})$ be a test statistic, where $\mathbf{z}$ and $\mathbf{y}$ are the observation vectors for $z$ and the response $y$. For instance, one may set $t(\mathbf{z},\mathbf{y}) = |\mathrm{cor}(\mathbf{z},\mathbf{y})|$. Let $T := t(\mathbf{z},\mathbf{y})$ and $\bar T^b := t(\bar{\mathbf{z}}^b,\mathbf{y})$ for $b=1,\dots,B$, where $\bar{\mathbf{z}}^b$ is either a permutation of $\mathbf{z}$ or a vector of independent realizations from the marginal distribution of $z$. There are subtle differences between using $\bar{\mathbf{z}}^b$ generated from a permutation and using one generated from the marginal distribution (Onghena Citation2017); however, we will not distinguish them in this context. Under the null hypothesis $H_0^{\mathrm{margin}}$, the statistics $T,\bar T^1,\dots,\bar T^B$ are exchangeable. Thus, for any $\alpha\in(0,1)$, we have $P(T > t^*_{1-\alpha}\,|\,\mathbf{y})\leq\alpha$, where $t^*_{1-\alpha}$ is the $(1-\alpha)$-percentile of the empirical distribution constituted by $(T,\bar T^1,\dots,\bar T^B)$.
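As a self-contained illustration of the marginal randomization test just described (our own sketch, using the absolute correlation statistic and the standard permutation p-value with the +1 correction):

```python
import numpy as np

def marginal_randomization_pvalue(z, y, B=2000, seed=0):
    """Permutation p-value for H0: z independent of y, with t(z, y) = |cor(z, y)|."""
    rng = np.random.default_rng(seed)
    T_obs = abs(np.corrcoef(z, y)[0, 1])
    T_null = np.array([abs(np.corrcoef(rng.permutation(z), y)[0, 1])
                       for _ in range(B)])
    # T and the randomized copies are exchangeable under H0
    return (1 + np.sum(T_null >= T_obs)) / (B + 1)
```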

In addition to marginal independence, many statistical inquiries focus on the conditional or partial independence between $y$ and $z$ given additional variable(s) $z^o$, with the null hypothesis $H_0^{\mathrm{partial}}: z\perp\!\!\!\perp y\,|\,z^o$. For illustration, one may consider a linear regression model expressed by the following equation:
$$y = z\beta + z^o\gamma + \zeta,$$
where $\zeta$ is a mean-zero noise term independent of $z$ and $z^o$. In recent years, the conditional randomization test has emerged as a popular method for testing the partial independence between $z$ and $y$ given $z^o$. If the conditional distribution of $z|z^o$ is known, the conditional randomization test generates new copies of $z$ from the distribution $z|z^o$, allowing the construction of randomized test statistics that are exchangeable with the original one under $H_0^{\mathrm{partial}}$. This approach leverages the known conditional distribution to construct valid testing procedures without imposing additional constraints on the relationship between $y$ and $(z,z^o)$ (Candes et al. Citation2018; Berrett et al. Citation2020). To be specific, we define the test statistic $t(\mathbf{z},\mathbf{z}^o,\mathbf{y})$, where $\mathbf{z}^o$ represents the observations for $z^o$. The test statistic can take any functional form; for example, we can set $t(\mathbf{z},\mathbf{z}^o,\mathbf{y}) = |\mathrm{cor}(\mathbf{z},\mathbf{y})|$. Let $T := t(\mathbf{z},\mathbf{z}^o,\mathbf{y})$ and $\bar T^b := t(\bar{\mathbf{z}}^b,\mathbf{z}^o,\mathbf{y})$, where $\bar{\mathbf{z}}^b$ is an independent copy of the observation vector generated from the distribution of $z|z^o$, for $b=1,\dots,B$. Under the null hypothesis $H_0^{\mathrm{partial}}$, the test statistics $T$ and $\bar T^b$ have the same conditional distribution given $\mathbf{z}^o$ and $\mathbf{y}$. Hence, $P(T > t^*_{1-\alpha}\,|\,\mathbf{z}^o,\mathbf{y})\leq\alpha$, where $t^*_{1-\alpha}$ is the $(1-\alpha)$-percentile of the conditional distribution of $(T,\bar T^1,\dots,\bar T^B)$ (Candes et al. Citation2018).

In this article, we propose to adapt the idea of randomization tests to construct robust p-values for testing independence/conditional independence between factors and auxiliary covariates in SPACO, where the responses are not given beforehand.

Oracle randomization test in SPACO: Returning to SPACO, we first consider the ideal scenario where V, Φ, s2, and σ2 are given. Lemma 4.2 forms the basis of our proposal.

Lemma 4.2.

Given $V$, $\Phi$, $s^2$, $\sigma^2$, let $\beta_l^*$ be the true regression coefficients on the covariates $Z$ for $\eta_l$, $l=1,\dots,K$. For any $k=1,\dots,K$, set
$$\bar\Sigma_i = \big((V\odot\Phi)_{\mathcal{O}_i,:}^\top\Lambda_{\mathcal{O}_i}^{-1}(V\odot\Phi)_{\mathcal{O}_i,:}\big)^{-1},\quad w_i = s_k^2 + (\bar\Sigma_i)_{kk},\quad y_i = (\bar\Sigma_i)_{k,:}H_{\mathcal{O}_i,:}^\top\Lambda_{\mathcal{O}_i}^{-1}X^I_{\mathcal{O}_i}.$$

Then, we have $y_i = z_i^\top\beta_k^* + \xi_i$, where $\xi_i\sim N(0, w_i)$ is independent of the auxiliary covariates.

Lemma 4.2 is proved via direct calculation in Appendix B.3. Readers may recognize that $y$ and $Z$ match a standard regression model, potentially allowing the application of standard randomization techniques. Proposition 4.3 details our construction of both the observed and randomized test statistics, whose validity follows from Lemma 4.2 and Lemma F.1 in Candes et al. (Citation2018). A proof is provided in Appendix B.4 for completeness.

Proposition 4.3.

Set
$$T^{\mathrm{partial}} = \frac{\sum_i\frac{1}{w_i}(y_i - Z_{i,j^c}\hat\beta_{j^c,k})Z_{ij}}{\sum_i\frac{1}{w_i}Z_{ij}^2}$$
for $\hat\beta_{j^c,k}$ estimated without the $j$th covariate $Z_j$, and
$$T^{\mathrm{margin}} = \frac{\sum_i\frac{1}{w_i}y_iZ_{ij}}{\sum_i\frac{1}{w_i}Z_{ij}^2}.$$
Replacing $Z_j$ with the properly randomized $\bar Z_j$ to create $\bar T^{\mathrm{margin}}$ and $\bar T^{\mathrm{partial}}$, where $\bar Z_j$ is a permutation of $Z_j$ for $\bar T^{\mathrm{margin}}$ and $\bar Z_j$ is independently generated from $Z_j|Z_{j^c}$ for $\bar T^{\mathrm{partial}}$, we have
$$T^{\mathrm{margin}}\,|\,y \overset{d}{=} \bar T^{\mathrm{margin}}\,|\,y,\quad\text{under } H_{0k}^{\mathrm{margin}},\tag{8}$$
$$T^{\mathrm{partial}}\,|\,(y, Z_{j^c}) \overset{d}{=} \bar T^{\mathrm{partial}}\,|\,(y, Z_{j^c}),\quad\text{under } H_{0k}^{\mathrm{partial}}.\tag{9}$$
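Under this construction, $T^{\mathrm{margin}}$ and its randomized copies reduce to weighted sums; the sketch below (our own illustration, for a single factor $k$ and marginal testing) also computes the two-sided p-value of Algorithm 3:

```python
import numpy as np

def T_margin(Zj, y, w):
    """T^margin = (sum_i y_i Z_ij / w_i) / (sum_i Z_ij^2 / w_i)."""
    return np.sum(y * Zj / w) / np.sum(Zj ** 2 / w)

def margin_pvalue(Zj, y, w, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    T_obs = T_margin(Zj, y, w)
    T_bar = np.array([T_margin(rng.permutation(Zj), y, w) for _ in range(B)])
    # empirical CDF of T under H0 using {T_bar_1, ..., T_bar_B, T}, as in Algorithm 3
    pool = np.append(T_bar, T_obs)
    G = lambda t: np.mean(pool <= t)
    return (1 - G(abs(T_obs))) + G(-abs(T_obs))
```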

Algorithm 3 outlines our proposal. The conditional randomization test involves generating $\bar Z_j$ from the conditional distribution $Z_j|Z_{j^c}$, which is estimated using an appropriate exponential family distribution. Further details on the generation of $\bar Z_j$ can be found in Appendix A.5.

Approximate p-value construction with estimated model parameters: The true model parameters $V$, $\Phi$, $s^2$, and $\sigma^2$ are unknown, so we substitute them with their empirical estimates. However, the empirical estimates from a full SPACO fit tend to suffer from fitting toward the noise. To mitigate such influence, we use $V$, $\Phi$ from the cross-validation described in Section 4.2. Specifically, for $i\in\mathcal{V}_m$, the index set for fold $m$, we construct $\tilde y_i$ using the estimates $V^{-m}$, $\Phi^{-m}$ and the prior covariance estimate $\Lambda_f^{-m}$ from the other folds, by setting $\tilde y_i = (\bar\Sigma_i)_{k,:}(V^{-m}\odot\Phi^{-m})_{\mathcal{O}_i,:}^\top\Lambda_{\mathcal{O}_i}^{-1}X^I_{\mathcal{O}_i}$, where $\bar\Sigma_i$ is also estimated using $V^{-m}$, $\Phi^{-m}$, and $\Lambda_f^{-m}$. This procedure, referred to as cross-fit, employs data from the other folds to estimate certain parameters, and forms the final quantity of interest using them together with data from the current fold. Additionally, initializing each fold at the global solution enhances the comparability of the constructed $\tilde y_i$ across different folds. In our empirical experiments, this cross-fit approach yields improved Type I error control compared to naively plugging in the full estimates.

5 Numerical Studies with Synthetic Data

This section presents an evaluation of SPACO using synthetic Gaussian data. The noise variance is fixed at 1, and the true rank is set to $K = 3$. The simulated data consist of $(\mathbf{X}, Z)$ with dimensions $(I,T,J,q)\in\{(100,30,10,100),(100,30,500,100)\}$. The observed rate (i.e., 1 minus the missing rate) along the time dimension is set to $r\in\{100\%, 50\%, 10\%\}$, with observed time stamps chosen randomly for each subject. To generate the data, we first draw $v_{jk}\overset{iid}{\sim}N(0,\frac{1}{J})$ and $z_{il}\overset{iid}{\sim}N(0,1)$ for $i=1,\dots,I$, $j=1,\dots,J$, and $l=1,\dots,q$. Then, we set $\phi_1(t)=\theta_1$, $\phi_2(t)=\theta_2(1-(\frac{t}{T})^2)$, and $\phi_3(t)=\theta_3\cos(\frac{4\pi t}{T})$ with random parameters $\theta_1,\theta_2,\theta_3\sim c_1\cdot N(0,\frac{\log J+\log T}{rIT})$ for $c_1\in\{1,3,5\}$. We also set $\beta_{l,k}\sim c_2\cdot N(0,\frac{\log q}{I})$ with $c_2\in\{0,3,10\}$ for the first $l=1,2,3$, and set $\beta_{l,k}=0$ otherwise. Here, $c_1$ controls the signal-to-noise ratio in the observed tensor $\mathbf{X}$ and is referred to as SNR1; $c_2$ captures the signal-to-noise ratio in the auxiliary covariates $Z$ and is referred to as SNR2. Each $U_k$ is standardized to have mean 0 and variance 1 after generation. This yields 54 different simulation setups in total.
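For reproducibility, the three trajectory shapes can be generated as in the following sketch (our reading of the setup above; the amplitude scaling follows the stated normal distribution):

```python
import numpy as np

def make_trajectories(T, r, I, J, c1, seed=0):
    """Constant, quadratic, and cosine trajectories with amplitudes
    theta_k ~ c1 * N(0, (log J + log T) / (r I T))."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    sd = np.sqrt((np.log(J) + np.log(T)) / (r * I * T))
    theta = c1 * rng.normal(0.0, sd, size=3)
    return np.column_stack([
        np.full(T, theta[0]),                  # phi_1(t) = theta_1
        theta[1] * (1 - (t / T) ** 2),         # phi_2(t) = theta_2 (1 - (t/T)^2)
        theta[2] * np.cos(4 * np.pi * t / T),  # phi_3(t) = theta_3 cos(4 pi t / T)
    ])
```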

5.1 Reconstruction Quality Evaluation

We compare SPACO, SPACO-, plain CP from the Python package tensorly, and a proxy to SupCP obtained by setting $\lambda_{1k}=\lambda_{2k}=10^{-2}$ in SPACO as small fixed values (small penalties enhance numerical stability when dealing with large $q$ and a high missing rate). Although this is not exactly SupCP, it bears high resemblance; we refer to this special case of SPACO as SupCP, and include it to assess the gains from smoothness regularization on the time trajectory. Note that SPACO, SPACO-, and SupCP all employ our proposed initialization, improving stability and performance at high missing rates. An analysis contrasting SupCP with random and proposed initializations is presented in Appendix C.1. In our experiments, we used the true rank for all methods. The accuracy of the rank estimation procedure proposed in Section 4.2 is assessed in Appendix C.3: the procedure closely approximates the true model rank when the signal is strong, with a tendency to underestimate in cases of weaker signal. This underestimation does not, however, lead to significant increases in reconstruction loss.

Figure 2 shows the achieved correlation between the reconstructed tensors and the underlying signal tensors across various setups and 20 random repetitions. The correlation between two tensors $\mathbf{F}$ and $\hat{\mathbf{F}}$ is defined as $\langle \mathbf{F}_{\mathrm{demean}}, \hat{\mathbf{F}}_{\mathrm{demean}}\rangle/(\|\mathbf{F}_{\mathrm{demean}}\|_2\|\hat{\mathbf{F}}_{\mathrm{demean}}\|_2)$, where $\mathbf{F}_{\mathrm{demean}}$ and $\hat{\mathbf{F}}_{\mathrm{demean}}$ are the demeaned versions of the original and estimated tensors, respectively, and $\langle\cdot,\cdot\rangle$ is the tensor inner product. Each subplot corresponds to a different signal-to-noise ratio combination (SNR1, SNR2), as indicated by its row and column labels. The y-axis represents the achieved correlation, while the x-axis shows different combinations of $J$ and observation rate; for example, the x-axis label J10_r0.1 means the feature dimension is 10 and 10% of entries are observed. The "Raw" method, which correlates the true signal with the empirical observations on non-missing entries, is included to demonstrate the signal level in different simulation setups, even though it is not directly comparable to the others in the presence of missing data. A comparison of reconstruction quality on missing entries is provided in Appendix C.2.
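The reported correlation metric is straightforward to compute; a sketch of our own, ignoring the handling of missing entries for brevity:

```python
import numpy as np

def tensor_correlation(F, F_hat):
    """Correlation between demeaned tensors, as defined in the text."""
    a = F - F.mean()
    b = F_hat - F_hat.mean()
    return np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b))
```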

Fig. 2 Reconstruction evaluation by the correlation between the estimates and the true signal tensor. In each subplot, the x-axis label indicates different J and observing rate, the y-axis is the achieved correlation, and the box colors represent different methods. The corresponding subplot column/row name represents the signal-to-noise ratio SNR1/SNR2.


SPACO achieves better reconstruction than SPACO- when the subject scores $U$ depend strongly on $Z$, likely due to improved quality in estimating $U$. To confirm this, we evaluate the estimation quality of $U$ at $J = 10$ and SNR2 = 10, measured by the $R^2$ from regressing the true subject scores on the estimated ones. In Figure 3, we show the achieved $(1-R^2)$ for SPACO and SPACO- (smaller is better), where the x-axis label represents the observing rate and column names represent the component, for example, Comp1 for $U_1$.

Fig. 3 Comparison of SPACO and SPACO- for reconstructing U at J = 10, SNR2 = 10. In each subplot, the x-axis label indicates different components and observing rates, the y-axis is the achieved $(1-R^2)$, and the box colors represent different methods. The corresponding subplot column/row name represents the signal-to-noise ratio SNR1/component.


SPACO and SPACO- are both top performers for our smooth tensor decomposition problem and, by using the smoothness of the time trajectory, achieve significantly better performance than CP and SupCP when the signal is weak and when the missing rate is high. To see this, we compared the estimation quality of the time trajectories using SPACO and SupCP. In Figure 4, we show the achieved $(1-R^2)$ for SPACO and SupCP at $J = 10$. The x-axis label represents different trajectories and observing rates; for example, C1_r1.0 represents estimation of $\phi_1(t)$ at observing rate $r = 1.0$. When the signal is weak (SNR1 = 1), SPACO can approximate the constant trajectory component (C1), and it begins to estimate the other trajectories successfully as the signal increases. SPACO achieves significantly better estimation of the true underlying trajectories than SupCP across various signal-to-noise ratios.

Fig. 4 Comparison of SPACO and SupCP for reconstructing Φ at J = 10. In each subplot, the x-axis label indicates different components and observing rates, the y-axis is the achieved $(1-R^2)$, and the box colors represent different methods. The corresponding subplot column/row name represents the signal-to-noise ratio SNR1/SNR2.


5.2 Variable Importance and Hypothesis Testing

We investigate the approximate p-values based on cross-fit for testing the partial and marginal associations of $Z$ with $U$ under the same simulation setups. Since the variables in $Z$ are generated independently, the two null hypotheses coincide; nevertheless, the two tests have different powers at the same p-value cutoff as a result of the different test statistics adopted. The proposed randomization tests for SPACO achieve reasonable Type I error control, and using cross-fit is important for maintaining it. In Appendix C.5, we present qq-plots comparing p-values using cross-fitted $V$ and $\Phi$ to the naive construction; the latter exhibits noticeable deviations from the uniform distribution when the signal-to-noise ratio is low. Figures 5 and 6 show the achieved Type I error and power with p-value cutoffs at $\alpha = 0.01, 0.05$ and with an observing rate $r = 0.5$. The Type I errors are also well controlled for $r\in\{1.0, 0.1\}$ (Appendix C.4).

Fig. 5 Achieved Type I errors at observing rate r = 0.5. In each subplot, the x-axis label indicates different combinations of feature dimension J and targeted level $\alpha\in\{0.01,0.05\}$, while the y-axis represents the achieved Type I errors. Different bar colors represent different tests (partial or marginal). The two dashed horizontal lines correspond to levels 0.01 and 0.05.


Fig. 6 Achieved power at observing rate r = 0.5. In each subplot, the x-axis label indicates different combinations of feature dimension J and targeted level $\alpha\in\{0.01,0.05\}$, and the y-axis indicates the achieved power. Different bar colors represent different tests (partial or marginal).


6 Case Study

SPACO was employed to analyze a longitudinal immunological dataset on COVID-19 obtained from the IMPACT study (Lucas et al. Citation2020). The original dataset comprised 180 samples collected from 98 COVID-19 infected participants, measuring 135 immunological features. Features with more than 20% missing observations were excluded, and the remaining missing values were imputed using MOFA (Argelaguet et al. Citation2018) (see Appendix D for more details), resulting in a complete matrix of 111 features and 180 samples. This matrix was organized into a sparsely observed tensor of size $(I,T,J) = (98,35,111)$, where $T$ is the number of unique days from symptom onset (DFSO). The average number of observed time points per subject was 1.84. Additionally, we had access to auxiliary covariates $Z$ containing eight risk factors (COVID_risk1 - COVID_risk5, age, sex, BMI), as well as four symptom measures (ICU, Clinical score, Length of Stay, Coagulopathy). We ran SPACO with $\mathbf{X}$ and $Z$, and SPACO- with $\mathbf{X}$ only, selecting a model rank of $K = 3$ based on 5-fold cross-validation. The three estimated components are denoted as C1/C2/C3. Integrating the static covariates $Z$ with the longitudinal measurements can sometimes improve the estimation quality of subject scores compared to SPACO-, as demonstrated in our synthetic data examples. While the true subject scores are unobtainable in the real dataset, the clinical relevance of the estimated subject scores can still be assessed by comparing them with clinical responses.

SPACO and SPACO- yielded highly similar estimates overall, as depicted in Figure 7, with the second component exhibiting the largest discrepancy in subject scores, yet maintaining a correlation above 0.98. Despite this similarity, we observed a noticeable increase in the correlations between C2 from SPACO and ICU/Clinical score/Length of Stay. The permutation p-values from tests of whether the associations of these clinical responses are stronger with C2 from SPACO than with C2 from SPACO- are 0.006, 0.008, and 0.002 for ICU, Clinical score, and Length of Stay, respectively, as shown in Figure 7(B).

Fig. 7 Comparisons between subject scores estimated from SPACO and SPACO- as well as the static risk factors. Panel A displays the high concordance in the correlations between estimated subject scores from SPACO and SPACO-. Panel B shows the correlations with clinical responses (row label) using subject scores from the most distinct component, C2, estimated with SPACO and SPACO-. The associated permutation p-values, which assess the improvement in correlations with the four clinical responses using SPACO, are shown beneath the row label. Panel C shows the correlation between clinical responses and the two most significant risk factors identified through conditional independence testing, COVIDRISK_3 and BMI.


Through the use of the randomization test, we can assess the contribution of each $Z_j$ to C2. Table 1 provides the p-values and adjusted p-values from the conditional independence test, with the number of randomizations $B = 2000$. The top associated risk factor is COVIDRISK_3 (hypertension), with a p-value of 0.001 (adjusted p-value around 0.01). BMI is also weakly associated with C2, with a p-value of 0.07. Both risk factors displayed much weaker associations with the symptom measures when analyzed separately. By including these risk factors in SPACO, the method not only outperforms SPACO-, but also achieves a stronger association with the clinical responses than the risk factors themselves (Figure 7(C)).

Table 1 Results from randomization test for the second component (C2).

7 Discussion

We propose SPACO to jointly model sparse multivariate longitudinal data and auxiliary covariates. The smoothness regularization used in SPACO can lead to a significant improvement in estimation quality, particularly when the missing rate is high. The inclusion of informative auxiliary covariates can also enhance the estimation of subject scores. We applied the proposed pipeline to a COVID-19 dataset and demonstrated its effectiveness in identifying components with subject scores that are closely associated with clinical outcomes of interest. Moreover, SPACO can identify static covariates that may contribute to severe symptoms. In the future, we plan to extend SPACO to model multi-omics data, which are characterized by differing data types, scales, and potentially different measurement times. Such an extension will require a tailored model design that carefully integrates the different omics data, rather than naively pooling them together.

Supplementary Materials

Online Appendices: (Online Appendices.pdf, pdf file) Provides additional details of the algorithms, further results from the numerical experiments, and technical proofs.

Experiment Code: (SPACOexperiments.zip, zip file) Contains code for reproducing the results in the synthetic and real data experiments, together with the organized real dataset. A README.md document is included with detailed instructions on how to reproduce the results presented in this article. The contents of this zip file can also be found at https://github.com/LeyingGuan/SPACOexperments.

Python package for SPACO: (SPACO.zip, zip file) Contains the Python implementation of the SPACO package. Readers can also find and install SPACO from GitHub at https://github.com/LeyingGuan/SPACO.


Acknowledgments

We express our gratitude to the editor, associate editor, and two reviewers for their insightful feedback, which has substantially enhanced the article.

Disclosure Statement

The author reports there are no competing interests to declare.

Additional information

Funding

L.G. was supported in part by the NSF award DMS-2310836.

References

  • Acar, E., and Yener, B. (2008), “Unsupervised Multiway Data Analysis: A Literature Survey,” IEEE Transactions on Knowledge and Data Engineering, 21, 6–20. DOI: 10.1109/TKDE.2008.112.
  • Anderlucci, L., and Viroli, C. (2015), “Covariance Pattern Mixture Models for the Analysis of Multivariate Heterogeneous Longitudinal Data,” The Annals of Applied Statistics, 9, 777–800. DOI: 10.1214/15-AOAS816.
  • Argelaguet, R., Velten, B., Arnol, D., Dietrich, S., Zenz, T., Marioni, J. C., Buettner, F., Huber, W., and Stegle, O. (2018), “Multi-Omics Factor Analysis–A Framework for Unsupervised Integration of Multi-Omics Data Sets,” Molecular Systems Biology, 14, e8124. DOI: 10.15252/msb.20178124.
  • Bai, J., and Wang, P. (2015), “Identification and Bayesian Estimation of Dynamic Factor Models,” Journal of Business & Economic Statistics, 33, 221–240. DOI: 10.1080/07350015.2014.941467.
  • Bai, J., and Wang, P. (2016), “Econometric Analysis of Large Factor Models,” Annual Review of Economics, 8, 53–80. DOI: 10.1146/annurev-economics-080315-015356.
  • Berrett, T. B., Wang, Y., Barber, R. F., and Samworth, R. J. (2020), “The Conditional Permutation Test for Independence While Controlling for Confounders,” Journal of the Royal Statistical Society, Series B, 82, 175–197. DOI: 10.1111/rssb.12340.
  • Besse, P., and Ramsay, J. O. (1986), “Principal Components Analysis of Sampled Functions,” Psychometrika, 51, 285–311. DOI: 10.1007/BF02293986.
  • Bro, R., and Andersson, C. A. (1998), “Improving the Speed of Multiway Algorithms: Part II: Compression,” Chemometrics and Intelligent Laboratory Systems, 42, 105–113. DOI: 10.1016/S0169-7439(98)00011-2.
  • Candes, E., Fan, Y., Janson, L., and Lv, J. (2018), “Panning for Gold: ‘model-x’ Knockoffs for High Dimensional Controlled Variable Selection,” Journal of the Royal Statistical Society, Series B, 80, 551–577. DOI: 10.1111/rssb.12265.
  • Carroll, J. D., Pruzansky, S., and Kruskal, J. B. (1980), “Candelinc: A General Approach to Multidimensional Analysis of Many-Way Arrays with Linear Constraints on Parameters,” Psychometrika, 45, 3–24. DOI: 10.1007/BF02293596.
  • Chang, W.-C. (1983), “On Using Principal Components before Separating a Mixture of Two Multivariate Normal Distributions,” Journal of the Royal Statistical Society, Series C, 32, 267–275. DOI: 10.2307/2347949.
  • Chen, E. Y., Tsay, R. S., and Chen, R. (2020), “Constrained Factor Models for High-Dimensional Matrix-Variate Time Series,” Journal of the American Statistical Association, 115, 775–793. DOI: 10.1080/01621459.2019.1584899.
  • De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000), “A Multilinear Singular Value Decomposition,” SIAM Journal on Matrix Analysis and Applications, 21, 1253–1278. DOI: 10.1137/S0895479896305696.
  • Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001), “Empirical Bayes Analysis of a Microarray Experiment,” Journal of the American Statistical Association, 96, 1151–1160. DOI: 10.1198/016214501753382129.
  • Ernst, M. D. (2004), “Permutation Methods: A Basis for Exact Inference,” Statistical Science, 19, 676–685. DOI: 10.1214/088342304000000396.
  • Fan, J., Fan, Y., and Lv, J. (2008), “High Dimensional Covariance Matrix Estimation Using a Factor Model,” Journal of Econometrics, 147, 186–197. DOI: 10.1016/j.jeconom.2008.09.017.
  • Fan, J., Liao, Y., and Mincheva, M. (2011), “High Dimensional Covariance Matrix Estimation in Approximate Factor Models,” Annals of Statistics, 39, 3320–3356. DOI: 10.1214/11-AOS944.
  • Fisher, R. A. (1936), “Design of Experiments,” British Medical Journal, 1, 554. DOI: 10.1136/bmj.1.3923.554-a.
  • Harshman, R. A., and Lundy, M. E. (1994), “Parafac: Parallel Factor Analysis,” Computational Statistics & Data Analysis, 18, 39–72. DOI: 10.1016/0167-9473(94)90132-5.
  • Hinrich, J. L., and Mørup, M. (2019), “Probabilistic Tensor Train Decomposition,” in 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1–5, IEEE. DOI: 10.23919/EUSIPCO.2019.8903177.
  • Huang, J. Z., Shen, H., and Buja, A. (2008), “Functional Principal Components Analysis via Penalized Rank One Approximation,” Electronic Journal of Statistics, 2, 678–695. DOI: 10.1214/08-EJS218.
  • Imaizumi, M., and Hayashi, K. (2017), “Tensor Decomposition with Smoothness,” in International Conference on Machine Learning, pp. 1597–1606, PMLR.
  • Kolda, T. G., and Bader, B. W. (2009), “Tensor Decompositions and Applications,” SIAM Review, 51, 455–500. DOI: 10.1137/07070111X.
  • Lam, C., Yao, Q., and Bathia, N. (2011), “Estimation of Latent Factors for High-Dimensional Time Series,” Biometrika, 98, 901–918. DOI: 10.1093/biomet/asr048.
  • Li, G., Shen, H., and Huang, J. Z. (2016), “Supervised Sparse and Functional Principal Component Analysis,” Journal of Computational and Graphical Statistics, 25, 859–878. DOI: 10.1080/10618600.2015.1064434.
  • Lock, E. F., and Li, G. (2018), “Supervised Multiway Factorization,” Electronic Journal of Statistics, 12, 1150–1180. DOI: 10.1214/18-EJS1421.
  • Lucas, C., Wong, P., Klein, J., Castro, T. B., Silva, J., Sundaram, M., Ellingson, M. K., Mao, T., Oh, J. E., Israelow, B., et al. (2020), “Longitudinal Analyses Reveal Immunological Misfiring in Severe Covid-19,” Nature, 584, 463–469. DOI: 10.1038/s41586-020-2588-y.
  • Mnih, A., and Salakhutdinov, R. R. (2007), “Probabilistic Matrix Factorization,” in Advances in Neural Information Processing Systems (Vol. 20), pp. 1257–1264.
  • Onghena, P. (2017), “Randomization Tests or Permutation Tests? A Historical and Terminological Clarification,” in Randomization, Masking, and Allocation Concealment, pp. 209–228, New York: Chapman and Hall/CRC.
  • Phan, A.-H., Tichavskỳ, P., and Cichocki, A. (2013), “Candecomp/parafac Decomposition of High-Order Tensors through Tensor Reshaping,” IEEE Transactions on Signal Processing, 61, 4847–4860. DOI: 10.1109/TSP.2013.2269046.
  • Pitman, E. J. G. (1937), “Significance Tests Which May be Applied to Samples from any Populations. II. The Correlation Coefficient Test,” Supplement to the Journal of the Royal Statistical Society, 4, 225–232. DOI: 10.2307/2983647.
  • Rendeiro, A. F., Casano, J., Vorkas, C. K., Singh, H., Morales, A., DeSimone, R. A., Ellsworth, G. B., Soave, R., Kapadia, S. N., Saito, K., et al. (2021), “Profiling of Immune Dysfunction in Covid-19 Patients Allows Early Prediction of Disease Progression,” Life Science Alliance, 4, e202000955. DOI: 10.26508/lsa.202000955.
  • Sidiropoulos, N. D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E. E., and Faloutsos, C. (2017), “Tensor Decomposition for Signal Processing and Machine Learning,” IEEE Transactions on Signal Processing, 65, 3551–3582. DOI: 10.1109/TSP.2017.2690524.
  • Sorkine, O., Cohen-Or, D., Lipman, Y., Alexa, M., Rössl, C., and Seidel, H.-P. (2004), “Laplacian Surface Editing,” in Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pp. 175–184.
  • Tibshirani, R. (2011), “Regression Shrinkage and Selection via the Lasso: A Retrospective,” Journal of the Royal Statistical Society, Series B, 73, 273–282. DOI: 10.1111/j.1467-9868.2011.00771.x.
  • Tipping, M. E., and Bishop, C. M. (1999), “Probabilistic Principal Component Analysis,” Journal of the Royal Statistical Society, Series B, 61, 611–622. DOI: 10.1111/1467-9868.00196.
  • Vichi, M., Rocci, R., and Kiers, H. A. (2007), “Simultaneous Component and Clustering Models for Three-Way Data: Within and between Approaches,” Journal of Classification, 24, 71–98. DOI: 10.1007/s00357-007-0006-x.
  • Viroli, C. (2011), “Finite Mixtures of Matrix Normal Distributions for Classifying Three-Way Data,” Statistics and Computing, 21, 511–522. DOI: 10.1007/s11222-010-9188-x.
  • Wang, D., Liu, X., and Chen, R. (2019), “Factor Models for Matrix-Valued High-Dimensional Time Series,” Journal of Econometrics, 208, 231–248. DOI: 10.1016/j.jeconom.2018.09.013.
  • Wang, D., Zheng, Y., Lian, H., and Li, G. (2021), “High-Dimensional Vector Autoregressive Time Series Modeling via Tensor Decomposition,” Journal of the American Statistical Association, 117, 1338–1356. DOI: 10.1080/01621459.2020.1855183.
  • Yao, F., Müller, H.-G., and Wang, J.-L. (2005), “Functional Data Analysis for Sparse Longitudinal Data,” Journal of the American Statistical Association, 100, 577–590. DOI: 10.1198/016214504000001745.
  • Yokota, T., Zhao, Q., and Cichocki, A. (2016), “Smooth Parafac Decomposition for Tensor Completion,” IEEE Transactions on Signal Processing, 64, 5423–5436. DOI: 10.1109/TSP.2016.2586759.