
Factor Modeling for Clustering High-Dimensional Time Series

Pages 1252-1263 | Received 02 Oct 2021, Accepted 15 Feb 2023, Published online: 05 Apr 2023

Abstract

We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any clusters. The consistency, with explicit convergence rates, is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data and a real data example is also reported. As a spin-off, the proposed new approach also significantly advances the statistical inference for the factor model of Lam and Yao. Supplementary materials for this article are available online.

1 Introduction

One of the primary tasks of data mining is clustering. While most clustering methods were originally designed for independent observations, clustering a large number of time series has gained increasing momentum (Esling and Agon, 2012), due to the need to mine large and complex data recorded over time in business, finance, biology, medicine, climate, energy, environment, psychology, multimedia and other areas (Table 1 of Aghabozorgi, Shirkhorshidi, and Wah, 2015). Consequently, the literature on time series clustering is large; see Liao (2005), Aghabozorgi, Shirkhorshidi, and Wah (2015), Maharaj, D’Urso, and Caiado (2019) and the references therein. The basic idea is to develop some relevant similarity or distance measures among time series first, and then to apply standard clustering algorithms such as hierarchical clustering or the k-means method. Most existing similarity/distance measures for time series may be loosely divided into two categories: data-based and feature-based. The data-based approaches define the measures directly on the observed time series using, for example, the L2- or, more generally, Minkowski distance, or various correlation measures. Alonso and Peña (2019) proposed a generalized cross-correlation as a similarity measure, which takes into account cross-correlation over different time lags. Dynamic time warping can be applied beforehand to cope with time deformation due to, for example, shifting holidays over different years (Keogh and Ratanamahatana, 2005). The feature-based approaches extract relevant features from the observed time series first, and then define similarity/distance measures based on the extracted features. The feature extraction can be carried out by various transformations such as Fourier, wavelet or principal component analysis (sec. 2.3 of Roelofsen, 2018). The features from fitted time series models can also be used to define similarity/distance measures (Yao et al., 2000; Frühwirth-Schnatter and Kaufmann, 2008). Attempts have also been made to define the similarity between two time series by measuring the discrepancy between the two underlying stochastic processes (Kakizawa, Shumway, and Taniguchi, 1998; Khaleghi et al., 2016). Other approaches include Zhang (2013), which clusters time series based on the parallelism of their trend functions, and Ando and Bai (2017), which represents the latent clusters in terms of a factor model. So-called “subsequence clustering” occurs frequently in the literature on time series clustering; see Keogh and Lin (2005) and Zolhavarieh, Aghabozorgi, and Teh (2014). It refers to clustering the segments of a single long time series, which is not considered in this article.

The goal of this study is to propose a new factor-model-based approach to cluster a large number of time series into different and unknown clusters such that the members within each cluster share a similar dynamic structure, while the number of clusters and their sizes are all unknown. We represent the dynamic structures by latent common and cluster-specific factors, which are both unknown and are identified by the difference in factor strength. The strength of a factor is measured by the number of time series which influence and/or are influenced by the factor (Chamberlain, 1983). The common factors are strong factors as each of them carries the information on most (if not all) time series concerned. The cluster-specific factors are weak factors as they only affect the time series in a specific cluster.

Though our factor model is similar to that of Ando and Bai (2017), our approach is radically different. First, we estimate the strong factors and all the weaker factors in one pass, and then recover the latent clusters based on the estimated weak factor loadings; Ando and Bai (2017) adopted an iterative least squares algorithm to estimate the factors/factor loadings and the latent cluster structure recursively. Second, our setting allows the flexibility that some time series do not belong to any clusters, which is often the case in practice. Third, our setting allows dependence between the common factors and the cluster-specific factors, while Ando and Bai (2017) imposed an orthogonality condition between the two; see Remark 1(iv) in Section 2.

The methods used for estimating factors and factor loadings are adapted from Lam and Yao (2012). Nevertheless, substantial advances have been made even within the context of Lam and Yao (2012): (i) we remove the artificial condition that the factor loading spaces for strong and weak factors are perpendicular to each other, (ii) we allow weak serial correlations in the idiosyncratic components of the model, which were assumed to be vector white noise by Lam and Yao (2012), and, more significantly, (iii) we propose a new and consistent ratio-based estimator for the number of factors (see Step 1 and also Remark 3(iii) in Section 3).

The rest of the article is organized as follows. Our factor model and the relevant conditions are presented in Section 2, where we elaborate explicitly why it is natural to identify the latent clusters in terms of the factor strength. The new clustering algorithm is presented in Section 3. The clustering is based on the factor loadings on all the weak factors: a K-means algorithm is applied with a correlation-type similarity measure defined in terms of those loadings. The asymptotic properties of the estimation of the factors and factor loadings are collected in Section 4. Section 5 establishes the consistency of the proposed factor-based clustering algorithm with error rates. Numerical illustration with both simulated data and a real data example is reported in Section 6. We offer some further comments in Section 7. All technical proofs are presented in the supplementary materials.

All vectors are column vectors. Let $\|a\|$ denote the Euclidean norm of a vector $a$. For any matrix $G \equiv (g_{i,j})$, let $\mathcal{M}(G)$ denote the linear space spanned by the columns of $G$, $\|G\|$ the square root of the largest eigenvalue of $G^\top G$, $\|G\|_{\min}$ the square root of the smallest eigenvalue of $G^\top G$, and $|G|$ the matrix with $|g_{i,j}|$ as its $(i,j)$th element. We write $a \asymp b$ if $a = O(b)$ and $b = O(a)$. We use $C, C_0$ to denote generic constants independent of $p$ and $n$, which may take different values at different places.

2 Models

Let $\{y_t\}_{1 \le t \le n}$ be a weakly stationary $p \times 1$ vector time series, that is, $E y_t$ is a constant vector independent of $t$, and all elements of $\mathrm{cov}(y_{t+k}, y_t)$ are finite and depend on $k$ only. Suppose that $y_t$ consists of $d + 1$ latent segments, that is,
$$y_t = (y_{t,1}^\top, \ldots, y_{t,d}^\top, y_{t,d+1}^\top)^\top, \tag{1}$$
where $y_{t,1}, \ldots, y_{t,d+1}$ are, respectively, $p_1, \ldots, p_{d+1}$-vector time series with $p_1, \ldots, p_d \ge 1$, $p_{d+1} \ge 0$, and $p_1 + \cdots + p_d = p_0$, $p_0 + p_{d+1} = p$.

Furthermore, we assume the following latent factor model with $d$ clusters:
$$y_t = A x_t + \begin{pmatrix} B \\ 0 \end{pmatrix} z_t + \varepsilon_t, \quad B = \mathrm{diag}(B_1, \ldots, B_d), \quad z_t = (z_{t,1}^\top, \ldots, z_{t,d}^\top)^\top, \tag{2}$$
where $A$ is a $p \times r_0$ matrix with rank $r_0$, $x_t$ is an $r_0$-vector time series representing the $r_0$ common factors with $|\mathrm{var}(x_t)| \ne 0$, $B_j$ is a $p_j \times r_j$ matrix with rank $r_j$, $z_{t,j}$ is an $r_j$-vector time series representing the $r_j$ factors for $y_{t,j}$ only with $|\mathrm{var}(z_{t,j})| \ne 0$, $0$ stands for a $p_{d+1} \times r$ matrix with all elements equal to 0, $r = r_1 + \cdots + r_d$, and $\varepsilon_t$ is an idiosyncratic component in the sense of Chamberlain (1983) and Chamberlain and Rothschild (1983) (see below). Note that in the model above we only observe a permuted $y_t$ (i.e., the order of the components of $y_t$ is unknown), while all the terms on the RHS of (2) are unknown.

By (2), the $p_0$ components of $y_t$ are grouped into the $d$ clusters $y_{t,1}, \ldots, y_{t,d}$, while the $p_{d+1}$ components of $y_{t,d+1}$ do not belong to any clusters. The $j$th cluster $y_{t,j}$ is characterized by the cluster-specific factor $z_{t,j}$, in addition to its dependence on the common factor $x_t$. The goal is to identify those $d$ latent clusters from the observations $y_1, \ldots, y_n$. Note that all $p_j$, $r_j$, and $d$ are also unknown.

We always assume that the number of common factors and the number of cluster-specific factors for each cluster remain bounded as the number of time series $p$ diverges. This reflects the fact that factor models are only appealing when the numbers of factors are much smaller than the number of time series concerned. Furthermore, we assume that the number of time series $p_i$ in each cluster diverges at a lower order than $p$, and that the number of clusters $d$ diverges as well. See Assumption 1.

Assumption 1.

$\max_{0 \le i \le d} r_i < C < \infty$, $r \asymp d = O(p^{\delta})$, and $p_i \asymp p^{1-\delta}$ for $i = 1, \ldots, d$, where $C > 0$ and $\delta \in (0,1)$ are constants independent of $n$ and $p$.

The strength of a factor is measured by the number of time series which influence and/or are influenced by the factor. Each component of $x_t$ is a common factor: it is related to most, if not all, components of $y_t$ in the sense that most elements of the corresponding column of $A$ (i.e., the factor loadings) are nonzero. Hence, it is reasonable to assume
$$\|a_j\|^2 \asymp p, \quad j = 1, \ldots, r_0, \tag{3}$$
where $a_j$ is the $j$th column of $A$. This is in the same spirit as the definition of the common factors by Chamberlain and Rothschild (1983). Denote by $b_{ij}$ the $i$th column of the $p_j \times r_j$ matrix $B_j$. In the same vein, we assume that
$$\|b_{ij}\|^2 \asymp p^{1-\delta}, \quad i = 1, \ldots, r_j \ \text{and} \ j = 1, \ldots, d, \tag{4}$$
as each cluster-specific factor for the $j$th cluster is related to most of the $p_j \asymp p^{1-\delta}$ (Assumption 1) time series in the cluster. Note that the factor strength can be measured by the constant $\delta \in [0,1]$: $\delta > 0$ in (4) indicates that the factors $z_t = (z_{t,1}^\top, \ldots, z_{t,d}^\top)^\top$ are weaker than the factors $x_t$, which correspond to $\delta = 0$; see (3).

Conditions (3) and (4) are imposed under the assumption that all the factors remain unchanged as $p$ diverges and that all the entries of the covariance matrices
$$\Sigma_x(k) = \mathrm{cov}(x_{t+k}, x_t), \quad \Sigma_z(k) = \mathrm{cov}(z_{t+k}, z_t), \quad \Sigma_{x,z}(k) = \mathrm{cov}(x_{t+k}, z_t), \quad \Sigma_{z,x}(k) = \mathrm{cov}(z_{t+k}, x_t)$$
are bounded and, furthermore, that $\Sigma_x(k)$ and $\Sigma_z(k)$ are of full rank for $k = 0, 1, \ldots, k_0$, where $k_0 \ge 1$ is an integer. Then conditions (3) and (4) are equivalent to (6) and (7) in Assumption 3 after the orthogonal normalization introduced now in order to make model (2) partially identifiable and operationally tractable.

In model (2), $A$, $B$, $x_t$, and $z_t$ are not uniquely defined as, for example, $(A, x_t)$ can be replaced by $(AH, H^{-1}x_t)$ for any $r_0 \times r_0$ invertible matrix $H$. We argue that this lack of uniqueness gives us the flexibility to choose appropriate $A$ and $B$ to facilitate our estimation more readily. Assumption 2 specifies both $A$ and $B$ to be half-orthogonal in the sense that the columns of $A$ or $B$ are orthonormal, which can be fulfilled by, for example, replacing the original $(A, x_t)$ by $(H, Vx_t)$, where $A = HV$ is a QR decomposition of $A$. Even under Assumption 2, $A$ and $B$ are still not unique; in fact, only the factor loading spaces $\mathcal{M}(A)$ and $\mathcal{M}(B_i)$ are uniquely defined by (2). Hence $AA^\top = A(A^\top A)^{-1}A^\top$, that is, the projection matrix onto $\mathcal{M}(A)$, is also unique.
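For example, this QR-based normalization is straightforward to compute. Below is a minimal numpy sketch (all names are our own, not from the authors' code) that replaces a randomly drawn pair $(A, x_t)$ by its half-orthogonal version $(H, Vx_t)$ without changing the common component $Ax_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r0, n = 100, 2, 400

A = rng.uniform(-1.0, 1.0, size=(p, r0))  # raw loading matrix, columns not orthonormal
x = rng.standard_normal((r0, n))          # r0 common factor series, one row per factor

# QR decomposition A = H V with H half-orthogonal (H^T H = I_{r0});
# (A, x_t) and (H, V x_t) generate the identical common component A x_t.
H, V = np.linalg.qr(A)
x_new = V @ x

assert np.allclose(H.T @ H, np.eye(r0))   # columns of H are orthonormal
assert np.allclose(A @ x, H @ x_new)      # the common component is unchanged
```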

Assumption 2.

$A^\top A = I_{r_0}$ and $B_j^\top B_j = I_{r_j}$ for $1 \le j \le d$, and it holds for a constant $q_0 \in (0,1)$ that
$$\left\| A^\top \begin{pmatrix} B \\ 0 \end{pmatrix} \right\| \le q_0. \tag{5}$$

Furthermore, for $j = 1, \ldots, d$, $r_p(B_j)\{r_p(B_j)\}^\top$ cannot be written as a block diagonal matrix with at least two blocks, where $r_p(B_j)$ denotes any row permutation of $B_j$.

Condition (5) implies that the columns of $(B^\top, 0)^\top$ do not fall entirely into the space $\mathcal{M}(A)$, as otherwise one could not distinguish $z_t$ from $x_t$. It is automatically fulfilled if $A^\top(B^\top, 0)^\top = 0$, which is a condition imposed in Lam and Yao (2012). Finally, the last condition in Assumption 2 ensures that the number of clusters $d$ is uniquely defined.

Assumption 3.

Let $y_t$, $x_t$, and $z_t$ be strictly stationary with finite fourth moments. As $p \to \infty$, it holds for $k = 0, 1, \ldots, k_0$ that
$$\|\Sigma_x(k)\| \asymp p \asymp \|\Sigma_x(k)\|_{\min}, \tag{6}$$
$$\|\Sigma_z(k)\| \asymp p^{1-\delta} \asymp \|\Sigma_z(k)\|_{\min}, \tag{7}$$
$$\|\Sigma_x(k)^{-1/2}\, \Sigma_{x,z}(k)\, \Sigma_z(k)^{-1/2}\| \le q_0 < 1, \quad \|\Sigma_z(k)^{-1/2}\, \Sigma_{z,x}(k)\, \Sigma_x(k)^{-1/2}\| \le q_0 < 1, \tag{8}$$
$$\|\Sigma_{x,z}(k)\| = O(p^{1-\delta/2}), \quad \|\Sigma_{z,x}(k)\| = O(p^{1-\delta/2}). \tag{9}$$

Furthermore, $y_t$ is $\psi$-mixing with mixing coefficients satisfying $\sum_{t \ge 1} t\, \psi(t)^{1/2} < \infty$, and $\mathrm{cov}(x_t, \varepsilon_s) = 0$ and $\mathrm{cov}(z_t, \varepsilon_s) = 0$ for any $t$ and $s$.

Remark 1.

(i) The factor strength is defined in terms of the orders of the factor loadings in (3) and (4). Due to the orthogonalization specified in Assumption 2, these are transformed into the orders of the covariance matrices in (6) and (7); see also Remark 1 in Lam and Yao (2012). Nevertheless, the factor strength is still measured by the constant $\delta \in [0,1]$: the smaller $\delta$ is, the stronger a factor is. The common factors in $x_t$ are the strongest with $\delta = 0$, and the cluster-specific factors in $z_t$ are weaker with $\delta \in (0,1)$. In (2), $\varepsilon_t$ represents the idiosyncratic component of $y_t$ in the sense that each component of $\varepsilon_t$ only affects the corresponding component and a few other components of $y_t$ (i.e., $\delta = 1$), which is implied by Assumption 4. Hence, the strength of $\varepsilon_t$ is the weakest. The differences in factor strength make $x_t$, $z_t$, and $\varepsilon_t$ on the RHS of (2) (asymptotically) identifiable. To simplify the presentation, we assume that all the components of $z_t$ are of the same strength (i.e., all $p_i$ are of the same order). See the real data example in Section 6.2 for how to handle cluster-specific factors of different strengths.

(ii) Model (2) is similar to that of Ando and Bai (2017). However, we do not require that the common factors $x_t$ and the cluster-specific factors $z_t$ are orthogonal to each other in the sense that $\frac{1}{n}\sum_{t=1}^{n} x_t z_t^\top = 0$, which is imposed by Ando and Bai (2017). Furthermore, we allow the idiosyncratic term $\varepsilon_t$ to exhibit weak autocorrelations (Assumption 4), instead of complete independence as in Ando and Bai (2017).

We now impose some structural assumptions on the idiosyncratic term $\varepsilon_t$ in model (2).

Assumption 4.

Let $\varepsilon_t = G e_t$, where $G$ is a $p \times p$ constant matrix with $\|G\|$ bounded from above by a positive constant independent of $p$. Furthermore, one of the following two conditions holds.

  1. $e_t$ is MA($\infty$), that is, $e_t = \sum_{s=0}^{\infty} \phi_s \eta_{t-s}$, where $\sum_{s=0}^{\infty} |\phi_s| < \infty$, $\eta_t = (\eta_{t,1}, \ldots, \eta_{t,p})^\top$, and the $\eta_{t,i}$ are iid across $t$ and $i$ with mean 0, variance 1, and $E(\eta_{t,i}^4) < \infty$.

  2. $e_t = (e_{t,1}, \ldots, e_{t,p})^\top$ consists of $p$ independent weakly stationary univariate time series with $E(e_t) = 0$ and $\min_{1 \le i \le p} E e_{t,i}^2 > 0$. Furthermore, $\tilde e_i = (e_{1,i}, \ldots, e_{n,i})^\top$ satisfies
$$\max_{\beta \ge 1,\ 1 \le i \le p,\ \|a\| = 1} \beta^{-1/2}\, \{E |\tilde e_i^\top a|^{\beta}\}^{1/\beta} \le C_0. \tag{10}$$

Remark 2.

In Assumption 4, condition 1 assumes that $e_t$ is a linear process with the same serial correlation structure across all components. Condition 2 allows some nonlinear serial dependence, and the dependence structures of different components may differ; but then the sub-Gaussian condition (10) is required.

3 A Clustering Algorithm

With the observations $y_1, \ldots, y_n$ available, we propose below an algorithm (in five steps) to identify the $d$ latent clusters. To this end, we first introduce some notation. Let $\bar y = \frac{1}{n}\sum_{t=1}^{n} y_t$ and
$$\hat\Sigma_y(k) = \frac{1}{n}\sum_{t=1}^{n-k} (y_{t+k} - \bar y)(y_t - \bar y)^\top, \quad \hat M = \sum_{k=0}^{k_0} \hat\Sigma_y(k)\, \hat\Sigma_y(k)^\top, \tag{11}$$
where $k_0 \ge 1$ is the pre-specified integer in Assumption 3.
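The quantities in (11) translate directly into code. A minimal numpy sketch (the function names are our own) for an $n \times p$ data matrix whose rows are the observations $y_1^\top, \ldots, y_n^\top$:

```python
import numpy as np

def lagged_autocov(y, k):
    """Sample autocovariance at lag k in (11): (1/n) * sum_t (y_{t+k} - ybar)(y_t - ybar)^T,
    for an (n, p) data matrix y whose rows are time points."""
    n = y.shape[0]
    yc = y - y.mean(axis=0)
    return yc[k:].T @ yc[:n - k] / n

def m_hat(y, k0):
    """M_hat = sum_{k=0}^{k0} Sigma_hat_y(k) Sigma_hat_y(k)^T in (11).
    Also returns the list of lagged autocovariance matrices for later reuse."""
    S_list = [lagged_autocov(y, k) for k in range(k0 + 1)]
    return sum(S @ S.T for S in S_list), S_list
```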

  • Step 1 (Estimation of the numbers of factors.) For $0 \le k \le k_0$, let $\hat\lambda_{k,1} \ge \cdots \ge \hat\lambda_{k,p} \ge 0$ be the eigenvalues of the matrix $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$. For a pre-specified positive integer $J_0 < p$, put $\hat R_0 = 1$ and
$$\hat R_j = \sum_{k=0}^{k_0} \hat\lambda_{k,j} \Big/ \sum_{k=0}^{k_0} \hat\lambda_{k,j+1}, \quad 1 \le j \le J_0. \tag{12}$$
We say that $\hat R_s$ attains a local maximum if $\hat R_s > \max\{\hat R_{s-1}, \hat R_{s+1}\}$. Let $\hat R_{\hat\tau_1}$ and $\hat R_{\hat\tau_2}$ be the two largest local maxima among $\hat R_1, \ldots, \hat R_{J_0-1}$. The estimators for the numbers of factors are then defined as
$$\hat r_0 = \min\{\hat\tau_1, \hat\tau_2\}, \quad \hat r_0 + \hat r = \max\{\hat\tau_1, \hat\tau_2\}. \tag{13}$$

  • Step 2 (Estimation of the loadings of the common factors.) Let $\hat\gamma_1, \ldots, \hat\gamma_p$ be the orthonormal eigenvectors of the matrix $\hat M$, arranged according to the descending order of the corresponding eigenvalues. The estimated loading matrix for the common factors is
$$\hat A = (\hat\gamma_1, \ldots, \hat\gamma_{\hat r_0}). \tag{14}$$

  • Step 3 (Estimation of the loadings of the cluster-specific factors.) Replace $y_t$ by $(I_p - \hat A\hat A^\top) y_t$ in (11), and repeat the eigenanalysis of Step 2, now denoting the resulting orthonormal eigenvectors by $\hat\zeta_1, \ldots, \hat\zeta_p$. The estimated loading matrix for the cluster-specific factors is
$$\hat B = (\hat\zeta_1, \ldots, \hat\zeta_{\hat r}). \tag{15}$$

  • Step 4 (Identification of the components not belonging to any clusters.) Let $\hat b_1, \ldots, \hat b_p$ denote the row vectors of $\hat B$. The identified index set for the components of $y_t$ not belonging to any clusters is
$$\hat J_{d+1} = \{\, j : 1 \le j \le p,\ \|\hat b_j\| \le \omega_p \,\}, \tag{16}$$
where $\omega_p > 0$ is a constant satisfying the conditions $\omega_p = o(p^{\delta/2 - 1/2})$, $p^{-1/2}\omega_p^{-1} = o(1)$, and $p^{\delta} n^{-1} r + p^{-\delta} p^{-(1-\delta)} \omega_p^{-2} = o(1)$.

  • Step 5 (K-means clustering.) Denote by $\hat d$ the number of eigenvalues of $|\hat B\hat B^\top|$ greater than $1 - \log^{-1} n$, which is taken as an upper bound on the number of clusters. Let $\hat p_0 = p - |\hat J_{d+1}|$, and let $\hat F$ be the $\hat p_0 \times \hat r$ matrix obtained from $\hat B$ by removing the rows with indices in $\hat J_{d+1}$. Let $\hat f_1, \ldots, \hat f_{\hat p_0}$ denote the $\hat p_0$ rows of $\hat F$, and let $\hat R$ be the $\hat p_0 \times \hat p_0$ matrix with $(l,m)$th element
$$\hat\rho_{l,m} = |\hat f_l^\top \hat f_m| \big/ (\hat f_l^\top \hat f_l \cdot \hat f_m^\top \hat f_m)^{1/2}, \quad 1 \le l, m \le \hat p_0.$$

Perform the K-means clustering (with the $L_2$-distance) on the $\hat p_0$ rows of $\hat R$ to form $d$ clusters, where $d \le \hat d$ is chosen such that the within-cluster sum of $L_2$-distances (to the cluster centers) is stabilized; a code sketch of Steps 2–5 follows.
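A compact sketch of Steps 2–5 (our own names throughout) under simplifying choices: the number of clusters $d$ is passed in rather than chosen by the elbow search, the threshold is the $\omega_{p2}$ choice used in Section 6.1, and scikit-learn's KMeans stands in for the K-means step:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_series(y, r0_hat, r_hat, k0, d):
    """Steps 2-5 for an (n, p) data matrix y, with r0_hat, r_hat and the
    number of clusters d supplied (the elbow search of Step 5 is omitted)."""
    n, p = y.shape

    def m_hat(z):                               # M_hat of (11)
        zc = z - z.mean(axis=0)
        S = [zc[k:].T @ zc[:n - k] / n for k in range(k0 + 1)]
        return sum(Sk @ Sk.T for Sk in S)

    # Step 2: leading eigenvectors of M_hat estimate the strong loadings A
    w, v = np.linalg.eigh(m_hat(y))
    A_hat = v[:, np.argsort(w)[::-1][:r0_hat]]

    # Step 3: project the strong factors out, then repeat the eigenanalysis
    y_resid = y - y @ A_hat @ A_hat.T
    w2, v2 = np.linalg.eigh(m_hat(y_resid))
    B_hat = v2[:, np.argsort(w2)[::-1][:r_hat]]

    # Step 4: rows of B_hat with small norms belong to no cluster
    omega = np.sqrt(r_hat / (p * np.log(p)))    # the omega_p2 choice of Section 6.1
    in_cluster = np.linalg.norm(B_hat, axis=1) > omega

    # Step 5: correlation-type similarity between the retained rows of B_hat,
    # then K-means (with the L2-distance) on the rows of R_hat
    F = B_hat[in_cluster]
    norms = np.linalg.norm(F, axis=1)
    R_hat = np.abs(F @ F.T) / np.outer(norms, norms)
    labels = KMeans(n_clusters=d, n_init=10).fit_predict(R_hat)
    return in_cluster, labels
```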

Remark 3.

(i) The ratio-based estimation in Step 1 is new. By Theorem 3 in Section 4, it holds that $\hat r_0 \to r_0$ and $\hat r \to r$ in probability. The existing approaches use the ratios of the ordered eigenvalues of the matrix $\hat M$ instead (Lam and Yao, 2012; Chang, Gao, and Yao, 2015; Li, Wang, and Yao, 2017), leading to an estimator which may not be consistent; see Example 1. Note that Lam and Yao (2012) only show that their estimator $\tilde r_0$ fulfills the relation $P(\tilde r_0 \ge r_0) \to 1$.

(ii) The intuition behind the estimators in (12) is that the eigenvalues $\lambda_{k,1} \ge \cdots \ge \lambda_{k,p}\ (\ge 0)$ of the matrix $\Sigma_y(k)\Sigma_y(k)^\top$, where $\Sigma_y(k) = \mathrm{cov}(y_{t+k}, y_t)$, satisfy the conditions
$$\lambda_{k,i}^{-1} = o(\lambda_{k,j}^{-1}) \quad \text{and} \quad \lambda_{k,j}^{-1} = o(\lambda_{k,l}^{-1}) \qquad \text{for } 1 \le i \le r_0,\ r_0 < j \le r_0 + r,\ \text{and } l > r_0 + r.$$

This is implied by the differences in strength among the common factors $x_t$, the cluster-specific factors $z_{t,i}$, and the idiosyncratic component $\varepsilon_t$; see Theorem 3. Note that we use the ratios of the cumulative eigenvalues in (12) in order to add together the information from different lags $k$. In practice, we set $k_0$ to be a small integer such as $k_0 \le 5$, as significant autocorrelation typically occurs at small lags; the results do not vary much with respect to the value of $k_0$ (see the simulation results in Section 6.1). We truncate the sequence $\{\hat R_j\}$ at $J_0$ to alleviate the impact of ratios of the form “0/0”. In practice, we may set $J_0 = [p/4]$ or $[p/3]$.
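The estimator (12)–(13) is equally direct to transcribe. The sketch below (our own names) takes the lagged autocovariance matrices $\hat\Sigma_y(0), \ldots, \hat\Sigma_y(k_0)$ as input, e.g., as returned by the sketch after (11); the eigenvalues of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$ are computed as the squared singular values of $\hat\Sigma_y(k)$:

```python
import numpy as np

def estimate_num_factors(S_list, J0):
    """Step 1: ratio-based estimators of r0 and r0 + r from the lagged sample
    autocovariance matrices S_list = [Sigma_hat_y(0), ..., Sigma_hat_y(k0)]."""
    # eigenvalues of Sigma(k) Sigma(k)^T = squared singular values of Sigma(k)
    lam = np.array([np.linalg.svd(S, compute_uv=False) ** 2 for S in S_list])
    csum = lam.sum(axis=0)                      # cumulate over the lags k = 0, ..., k0
    R = csum[:J0] / csum[1:J0 + 1]              # R[j] equals R_hat_{j+1} (0-based)
    # local maxima among R_hat_2, ..., R_hat_{J0-1}: R_hat_s > max(R_hat_{s-1}, R_hat_{s+1})
    peaks = [j for j in range(1, J0 - 1) if R[j] > max(R[j - 1], R[j + 1])]
    top2 = sorted(peaks, key=lambda j: R[j])[-2:]   # assumes >= 2 local maxima exist
    tau1, tau2 = sorted(j + 1 for j in top2)        # back to 1-based indices
    return tau1, tau2                                # (r0_hat, r0_hat + r_hat)
```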

(iii) Step 3 removes the common factors first before estimating $B$, as Lam and Yao (2012) showed that weak factors can be estimated more accurately after the strong factors are removed from the data.

(iv) Once the numbers of strong and weak factors are correctly specified, the factor loading spaces are relatively easy to identify. In fact, $\mathcal{M}(\hat A)$ is a consistent estimator for $\mathcal{M}(A)$. However, $\mathcal{M}(\hat B)$ is a consistent estimator for $\mathcal{M}\{(I_p - AA^\top)(B^\top, 0)^\top\}$ instead of $\mathcal{M}\{(B^\top, 0)^\top\}$; see Theorem 2 in Section 4. Furthermore, the last $p_{d+1}$ rows of $(I_p - AA^\top)(B^\top, 0)^\top$ are no longer 0. Nevertheless, when the elements in $AA^\top$ and $BB^\top$ have different orders, those $p_{d+1}$ zero rows can be recovered from $\hat B$ in Step 4; see Theorem 5 in Section 5.

(v) Given the block diagonal structure of $B$ in (2), the $d$ clusters could be identified easily by taking the $(i,j)$th element of $|BB^\top|$ as the similarity measure between the $i$th and $j$th components, or by simply applying the K-means method to the rows of $|BB^\top|$. However, applying the K-means method directly to the rows of $B$ will not do. Theorems 2 and 4 indicate that the block diagonal structure, though masked by asymptotically diminishing “noise”, is still present in $\hat B\hat B^\top$ via a latent row permutation of $\hat B$. Accordingly, the cluster analysis in Step 5 is based on the absolute values of the correlation-type measures derived from $\hat F\hat F^\top$, which is an estimator for $BB^\top$.

(vi) In Step 5, we search for the number of clusters $d$ by the “elbow method”, which is the most frequently used method in K-means clustering. Nevertheless, $\hat d$ provides an upper bound for $d$; see Theorem 6. Our empirical experience indicates that $\hat d = d$ holds often, especially when the $r_j$, $1 \le j \le d$, are small; see Tables 7 and 8 in Section 6.1. Note that $BB^\top$ is a block diagonal matrix with $d$ blocks and with all its nonzero eigenvalues equal to 1. Therefore, the dominant eigenvalue of each of the $d$ latent blocks in $\hat B\hat B^\top$ is greater than, or at least very close to, 1. Moreover, by the Perron-Frobenius theorem, the largest eigenvalue of $|B_j B_j^\top|$, the so-called Perron-Frobenius eigenvalue, is strictly greater than the other eigenvalues of $|B_j B_j^\top|$ under the last condition in Assumption 2. This is the intuition behind the definition of $\hat d$.

Example 1.

Consider a simple model of the form (2) in which $\varepsilon_t \equiv 0$, $r_0 = 1$, $r = 2$, $A^\top(B^\top, 0)^\top = 0$, and
$$x_t = p^{1/2}(u_{1,t} + a_1 u_{1,t-1} + u_{2,t} + a_2 u_{2,t-1}), \quad z_{1,t} = p^{1/2 - \delta/2}(u_{2,t} + a_2 u_{2,t-1}), \quad z_{2,t} = p^{1/2 - \delta/2}(u_{3,t} + a_3 u_{3,t-1}),$$
where $a_1, a_2, a_3$ are constants, and the $u_{i,t}$, for different $i, t$, are independent $N(0,1)$ variables. Let $M = \sum_{0 \le k \le 1} \Sigma_y(k)\Sigma_y(k)^\top$, and let $\lambda_1 \ge \lambda_2 \ge \lambda_3$ be the three largest eigenvalues of $M$. It can be shown that $\lambda_1 \asymp p^2$, $\lambda_3 = p^{2-2\delta}\{(1 + a_3^2)^2 + a_3^2\}$, and $\lambda_2 \asymp p^{2-\delta}$, provided $(a_1 - a_2)^2(1 - a_1 a_2) \ne 0$. Hence, $\lambda_1/\lambda_2 \asymp \lambda_2/\lambda_3 \asymp p^{\delta}$. This shows that neither $r_0\ (= 1)$ nor $r\ (= 2)$ can be estimated stably based on the ratios of the eigenvalues of $\hat M$ for this example. In fact, let the two $p \times 3$ matrices $U_0$ and $U_1$ consist of the eigenvectors of $\Sigma_y(0)\Sigma_y(0)^\top$ and $\Sigma_y(1)\Sigma_y(1)^\top$. When $(a_1 - a_2)^2(1 - a_1 a_2) \ne 0$, $U_0$ and $U_1$ are different while $U_0 U_0^\top = U_1 U_1^\top$.

4 Asymptotic Properties on Estimation for Factors

Theorem 1 and Remark 4 show that, in the absence of the weak factors $z_t$, the estimation of the strong factor loading space $\mathcal{M}(A)$ achieves the root-$n$ convergence rate in spite of the diverging $p$. Since only the factor loading space $\mathcal{M}(A)$ is uniquely defined by (2) (see the discussion below Assumption 2), we measure the estimation error in terms of its (unique) projection matrix $AA^\top$.

Theorem 1.

Let Assumptions 1–4 hold, and let $p, n \to \infty$ with $n = O(p)$ and $p^{\delta} = o(n)$. Then it holds that
$$\|\hat A\hat A^\top - AA^\top\| = O_p(n^{-1/2} + p^{-\delta/2}). \tag{17}$$

Assumption 2 ensures that the rank of the matrix $B^* \equiv (I_p - AA^\top)(B^\top, 0)^\top$ is $r$. Denote by $P_{AB} = B^*(B^{*\top}B^*)^{-1}B^{*\top}$ the projection matrix onto $\mathcal{M}\{(I_p - AA^\top)(B^\top, 0)^\top\}$, of which $\mathcal{M}(\hat B)$ is a consistent estimator; see Theorem 2 and also Remark 3(iv).

Theorem 2.

Let Assumptions 1–4 hold, and let $p, n \to \infty$ with $n = O(p)$ and $p^{\delta} r = o(n)$. Then it holds that
$$\|\hat B\hat B^\top - P_{AB}\| = O_p(p^{\delta/2} n^{-1/2} + p^{-\delta/2}) \tag{18}$$

and
$$\|\hat B\hat B^\top - P_{AB}\|_F = O_p(p^{\delta/2} n^{-1/2} r^{1/2} + p^{-\delta/2}). \tag{19}$$

Theorem 3 specifies the asymptotic behavior of the ratios of the cumulated eigenvalues used for estimating the numbers of factors in Step 1 of Section 3. It implies that $\hat r_0 \to r_0$ and $\hat r \to r$ in probability, provided that $J_0 > r_0 + r$ is fixed.

Theorem 3.

Let Assumptions 1–4 hold, and let $p, n \to \infty$ with $n = O(p)$ and $p^{\delta} r = o(n)$. For $\hat R_j$ defined in (12), it holds for some constant $C > 0$ that
$$\lim_{n,p\to\infty} P(\hat R_j < C) = 1 \quad \text{for } j = 1, \ldots, r_0 - 1, \tag{20}$$
$$\hat R_{r_0}^{-1} = O_p(p^{-2\delta}), \quad \hat R_{r_0+r}^{-1} = O_p(n^{-2} p^{2\delta}), \tag{21}$$
$$\lim_{n,p\to\infty} P(\hat R_j < C) = 1 \quad \text{for } j = r_0 + 1, \ldots, r_0 + r - 1, \quad \text{and} \tag{22}$$
$$\hat R_j = O_p(1) \quad \text{for } j = r_0 + r + 1, \ldots, r_0 + r + s, \tag{23}$$
where $s$ is a positive fixed integer.

Remark 4.

It is worth pointing out that the block diagonal structure of $B$ is not required for Theorems 1–3. On the other hand, if $A^\top(B^\top, 0)^\top = 0$ and $\{x_t\}$ and $\{z_t\}$ are independent, the term $p^{-\delta/2}$ on the RHS of (17)–(19) disappears.

5 Asymptotic Properties on Clustering

Assumption 5.

The elements of $AA^\top$ are of the order $O(p^{-1})$, and $\|b_i\|^2 \asymp p^{\delta - 1}$ for $1 \le i \le p_0$, where $b_i$ denotes the $i$th row of the matrix $B$.

The normalization $A^\top A = I_{r_0}$ implies that the average of the squared elements of $A$ is of the order $p^{-1}$. Since $r_0$ is finite, it is reasonable to assume that the elements of $AA^\top$ are $O(p^{-1})$. As $B$ is a block diagonal matrix with blocks $\{B_j\}$ and $B_j^\top B_j = I_{r_j}$, the squared elements of $B_j$ are of the order $p_j^{-1} \asymp p^{-(1-\delta)}$ on average. As $r_j$ is bounded, it is reasonable to assume $\|b_i\|^2 \asymp p^{\delta - 1}$. Assumption 5 ensures that $(I_p - AA^\top)(B^\top, 0)^\top$ is asymptotically a block diagonal matrix; see Theorem 4. This enables us to recover the block diagonal structure of $B$ based on $\hat B$, which provides a consistent estimator for the space $\mathcal{M}\{(I_p - AA^\top)(B^\top, 0)^\top\}$ (Theorem 2), and also to separate out the components of $y_t$ not belonging to any clusters. See Theorems 5–7.

Theorem 4.

Let Assumptions 1–2 and 5 hold. Divide the matrix $(B^\top, 0)^\top(B^\top, 0) - P_{AB}$ into $(d+1) \times (d+1)$ blocks, and denote its $(i,j)$th block, of size $p_i \times p_j$, by $V_{i,j}$. Then, as $p \to \infty$, $\|V_{i,j}\|_F = O_p(p^{-1} p_i^{1/2} p_j^{1/2}) = O_p(p^{-\delta})$.

Theorem 5.

Let the conditions of Theorem 2 and Assumption 5 hold. For $\hat J_{d+1}$ defined in (16), it holds that
$$|J_{d+1}^c \cap \hat J_{d+1}|\, p^{-(1-\delta)} = O_p(n^{-1} r p^{\delta} + p^{-\delta}), \quad \text{and} \tag{24}$$
$$\frac{|J_{d+1} \cap \hat J_{d+1}|}{|J_{d+1}|} = 1 + O_p(p^{\delta} r n^{-1} + p^{-\delta} p_{d+1}^{-1} \omega_p^{-2}), \tag{25}$$
where $\omega_p$ is given in (16). Furthermore, $p^{\delta} r n^{-1} + p^{-\delta} p_{d+1}^{-1} \omega_p^{-2} = o(1)$ provided that $p^{1-\delta}/p_{d+1} = O(1)$.

Theorem 6.

Let the conditions of Theorem 5 hold, and let $p^{\delta} r \log^2 n = o(n)$. Then $P(\hat d \ge d) \to 1$ as $n, p \to \infty$.

Remark 5.

Theorem 5 shows that most of the components belonging to the $d$ clusters will not be classified as belonging to no cluster (see (24)), and furthermore that most of the components not belonging to any clusters will be correctly identified (see (25)). Theorem 6 shows that the probability of underestimating $d$ converges to 0.

To investigate the errors of the K-means clustering, let $R = (|r_{l,m}|)$ be the $p_0 \times p_0$ matrix with
$$r_{l,m} = b_l^\top b_m \big/ (b_l^\top b_l \cdot b_m^\top b_m)^{1/2}, \quad 1 \le l, m \le p_0.$$

We assume that $d$ is known. Let $\mathcal{O}_d$ be the set consisting of all $p_0 \times p_0$ matrices with $d$ distinct rows. Put
$$D_0 = \arg\min_{D \in \mathcal{O}_d} \|R - D\|_F^2. \tag{26}$$

For any $p_0$-vector $g$ with elements taking integer values between 1 and $d$, let
$$\mathcal{O}_d(g) = \{D \in \mathcal{O}_d : \text{two rows of } D \text{ are the same if and only if the corresponding two elements of } g \text{ are the same}\}.$$

Note that the $d$ distinct rows of $D_0$ would be the centers of the $d$ clusters identified by the K-means method based on the rows of $R$. However, $R$ is unknown, and we identify the $d$ clusters based on its estimator $\hat R$; see Step 5 of the algorithm in Section 3. The clustering based on $\hat R$ can only be successful if that based on $R$ is successful, that is, $D_0 \in \mathcal{O}_d(g_0)$, where $g_0$ is the $p_0$-vector with 1 as its first $p_1$ elements, 2 as its next $p_2$ elements, $\ldots$, and $d$ as its last $p_d$ elements. Given the block diagonal structure of $B$, the condition $D_0 \in \mathcal{O}_d(g_0)$ is likely to hold.

For any $p_0$-vector $g$ with elements taking integer values between 1 and $d$, $g$ partitions $\{1, \ldots, p_0\}$ into $d$ subsets. Let $\tau(g)$ denote the number of components misclassified by the partition $g$.

Assumption 6.

$D_0 \in \mathcal{O}_d(g_0)$, and for some constant $c > 0$ and any $p_0$-vector $g$ with elements taking integer values between 1 and $d$,
$$\min_{D \in \mathcal{O}_d(g)} \|R - D\|_F^2 \ge \|R - D_0\|_F^2 + c\, \tau(g)\, p^{1-\delta}.$$

Theorem 7.

Let the conditions of Theorem 5 and Assumption 6 hold, and assume that the number of clusters $d$ is known. Denote by $\hat\tau$ the number of components of $y_t$ misclassified by the K-means clustering in Step 5. Then, as $n, p \to \infty$,
$$\hat\tau/p = O_p(p^{-\delta/2}). \tag{27}$$

Remark 6.

Theorem 7 implies that the misclassification rate of the K-means method converges to 0, though the convergence rate is slow (see (27)). However, a faster rate is attained when $A^\top(B^\top, 0)^\top = 0$ and $\{x_t\}$ and $\{z_t\}$ are independent, as then $\hat\tau/p = O_p(n^{-1/2} r^{1/2})$; see also Remark 4. Assumption 6 requires that $\|R - D\|_F^2$ increases as the number $\tau(g)$ of misplaced members of the partition becomes large, which is necessary for the K-means method.

6 Numerical Properties

6.1 Simulation

We illustrate the proposed methodology through a simulation study with model (2). We draw the elements of $A$ and the $B_j$ independently from $U(-1,1)$. All component series of $x_t$ and $z_t$ are independent AR(1) and MA(1) processes, respectively, with Gaussian innovations. All components of $\varepsilon_t$ are independent MA(1) with $N(0, 0.25)$ innovations. All the AR and MA coefficients are drawn randomly from $U\{(-0.95, -0.4) \cup (0.4, 0.95)\}$. The standard deviations of the components of $x_t$ and $z_t$ are drawn randomly from $U(1, 2)$.

We consider the following two scenarios with $r_0 = r_1 = \cdots = r_d = 2$ and $p_1 = \cdots = p_d$ (a data-generating sketch is given after the list):

  • Scenario I: $n = 400$, $d = 5$, and $p_{d+1} = p_1$. Hence, $r = 10$ and $p = 6p_1$.

  • Scenario II: $n = 800$, $d = 10$, and $p_{d+1} = 5p_1$. Hence, $r = 20$ and $p = 15p_1$.
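A data-generating sketch for Scenario I, assuming $p_1 = 20$ for concreteness (the simulation varies $p_1$); the rescaling of each series to a target standard deviation below is a crude stand-in for the exact innovation specification:

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1(n, phi, sd):                  # Gaussian AR(1) path, rescaled to std dev sd
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return sd * x / x.std()

def ma1(n, theta, sd):                # Gaussian MA(1) path, rescaled to std dev sd
    e = rng.standard_normal(n + 1)
    x = e[1:] + theta * e[:-1]
    return sd * x / x.std()

def coef():                           # coefficient from U{(-0.95, -0.4) U (0.4, 0.95)}
    c = rng.uniform(0.4, 0.95)
    return c if rng.random() < 0.5 else -c

# Scenario I: n = 400, d = 5, r0 = r1 = ... = rd = 2, p_{d+1} = p1, p = 6 * p1
n, d, r0, rj, p1 = 400, 5, 2, 2, 20
p = 6 * p1

x = np.stack([ar1(n, coef(), rng.uniform(1, 2)) for _ in range(r0)])        # common
z = np.stack([ma1(n, coef(), rng.uniform(1, 2)) for _ in range(d * rj)])    # cluster
eps = np.stack([ma1(n, coef(), 0.5) for _ in range(p)])                     # idiosyncratic

A = rng.uniform(-1, 1, size=(p, r0))
B = np.zeros((p, d * rj))
for j in range(d):                    # block diagonal cluster-specific loadings
    B[j * p1:(j + 1) * p1, j * rj:(j + 1) * rj] = rng.uniform(-1, 1, (p1, rj))

y = (A @ x + B @ z + eps).T           # (n, p); the last p1 series are in no cluster
```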

The numbers of factors $r_0$ and $r$ are estimated based on the ratios $\hat R_j$ in (12), using (13) with $k_0 = 1, \ldots, 5$ and $J_0 = [p/4]$. For comparison purposes, we also report the estimates based on the ratios of the eigenvalues of $\hat M$ in (11), also with $k_0 = 0, \ldots, 5$; this is the standard method in the literature and is defined as in (13) but with $\hat R_j = \tilde\lambda_j/\tilde\lambda_{j+1}$ instead, where $\tilde\lambda_1 \ge \cdots \ge \tilde\lambda_p \ge 0$ are the eigenvalues of $\hat M$; see, for example, Lam and Yao (2012). For each setting, we replicate the experiment 1000 times.

The relative frequencies of $\hat r_0 = r_0$ and $\hat r_0 + \hat r = r_0 + r$ are reported in Tables 1 and 2. Overall, the method based on the ratios of the cumulative eigenvalues $\hat R_j$ provides accurate and robust performance and is not sensitive to the choice of $k_0$. The estimation based on the eigenvalues of $\hat M$ with $k_0 \ge 1$ is competitive for $r_0$, but is considerably poorer for $r_0 + r$ in Scenario II. Using $\hat M$ with $k_0 = 0$ leads to weaker estimates of $r_0$ in Scenario I.

Table 1 The relative frequencies of $\hat r_0 = r_0$ and $\hat r_0 + \hat r = r_0 + r$ in a simulation for Scenario I with 1000 replications, where $\hat r_0$ and $\hat r$ are estimated by the $\hat R_j$-based method (13) and by the ratios of the eigenvalues of $\hat M$.

Table 2 The relative frequencies of $\hat r_0 = r_0$ and $\hat r_0 + \hat r = r_0 + r$ in a simulation for Scenario II with 1000 replications, where $\hat r_0$ and $\hat r$ are estimated by the $\hat R_j$-based method (13) and by the ratios of the eigenvalues of $\hat M$.

It is noticeable that the estimation of the number of common factors $r_0$ performs better in Scenario II than in Scenario I. This is due to the fact that the difference in factor strength between the common factors $x_t$ and the cluster-specific factors $z_t$ is larger in Scenario II than in Scenario I. In contrast, the results hardly change with different values of $p_1$.

Recall that $P_{AB}$ is the projection matrix onto the space $\mathcal{M}\{(I_p - AA^\top)(B^\top, 0)^\top\}$; see Theorem 2 and also Remark 3(iv). Tables 3 and 4 contain the means and standard deviations of the estimation errors $\|\hat A\hat A^\top - AA^\top\|_F$ and $\|\hat B\hat B^\top - P_{AB}\|_F$ for the factor loading spaces, where $\hat A$ is estimated by the eigenvectors of the matrix $\hat M$ in (11) with $k_0 = 1, \ldots, 5$; see Step 2 of the algorithm stated in Section 3, and also Step 3 there for the similar procedure for estimating $B$. For comparison purposes, we also include the estimates obtained with $\hat M$ replaced by $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$ for $k = 0, 1, \ldots, 5$. Tables 3 and 4 show clearly that the estimation based on $\hat M$ is accurate and robust with respect to the different values of $k_0$, and that using a single-lagged covariance matrix for estimating the factor loading spaces is not recommendable. The error $\|\hat A\hat A^\top - AA^\top\|_F$ is larger in Scenario I than in Scenario II, due to the larger sample size $n$ in Scenario II; see Theorem 1. In contrast, Theorem 2 shows that the error rate of $\|\hat B\hat B^\top - P_{AB}\|$ contains the term $p^{\delta/2} n^{-1/2}$: while $n$ is larger in Scenario II, so is $p^{\delta}$. This explains why the error $\|\hat B\hat B^\top - P_{AB}\|_F$ in Scenario II is also larger than that in Scenario I.

Table 3 The means and standard deviations (in parentheses) of $\|\hat A\hat A^\top - AA^\top\|_F$ and $\|\hat B\hat B^\top - P_{AB}\|_F$ in a simulation for Scenario I with 1000 replications, where $\hat A$ is estimated by the eigenvectors of $\hat M$ in (11) (with $k_0 = 1, \ldots, 5$) or by those of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$ (for $k = 0, 1, \ldots, 5$), and $\hat B$ is estimated in the similar manner.

Table 4 The means and standard deviations (in parentheses) of $\|\hat A\hat A^\top - AA^\top\|_F$ and $\|\hat B\hat B^\top - P_{AB}\|_F$ in a simulation for Scenario II with 1000 replications, where $\hat A$ is estimated by the eigenvectors of $\hat M$ in (11) (with $k_0 = 1, \ldots, 5$) or by those of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$ (for $k = 0, 1, \ldots, 5$), and $\hat B$ is estimated in the similar manner.

In the sequel, we only report the results with $\hat r_0$ and $\hat r$ estimated by (13), and with the factor loading spaces estimated by the eigenvectors of $\hat M$; we always set $k_0 = 5$. We now examine the effectiveness of Step 4 of the algorithm. The indices of the components not belonging to any clusters are identified as those in $\hat J_{d+1}$ in (16), which is defined in terms of a threshold $\omega_p = o(p^{\delta/2 - 1/2})$. We experiment with three choices of this tuning parameter, namely $\omega_{p1} = (\hat r/p)^{1/2}/\ln p$, $\omega_{p2} = \{\hat r/(p \ln p)\}^{1/2}$, and $\omega_{p3} = \{\hat r/(p \ln \ln p)\}^{1/2}$. Recall that $J_{d+1}^c$ contains all the indices of the components of $y_t$ belonging to one of the $d$ clusters. The means and standard deviations of the two types of misclassification errors $E_1 = |J_{d+1}^c \cap \hat J_{d+1}|/|J_{d+1}^c|$ and $E_2 = |J_{d+1} \cap \hat J_{d+1}^c|/|J_{d+1}|$ over the 1000 replications are reported in Tables 5 and 6. Among the three choices, $\omega_{p2}$ appears to work best, as the two types of errors are both small. The increase in the errors due to the estimation of $r_0$ and $r$ is not significant.

Table 5 The means and standard deviations (in parentheses) of the error rates $E_1 = |J_{d+1}^c \cap \hat J_{d+1}|/|J_{d+1}^c|$ and $E_2 = |J_{d+1} \cap \hat J_{d+1}^c|/|J_{d+1}|$ in a simulation for Scenario I with 1000 replications, with the three possible choices of the threshold $\omega_p$ in (16), and with the numbers of factors $r_0$ and $r$ either known or estimated.

Table 6 The means and standard deviations (in parentheses) of the error rates $E_1 = |J_{d+1}^c \cap \hat J_{d+1}|/|J_{d+1}^c|$ and $E_2 = |J_{d+1} \cap \hat J_{d+1}^c|/|J_{d+1}|$ in a simulation for Scenario II with 1000 replications, with the three possible choices of the threshold $\omega_p$ in (16), and with the numbers of factors $r_0$ and $r$ either known or estimated.

In the sequel, we only report the results with $\omega_{p2} = \{\hat r/(p \ln p)\}^{1/2}$. In Step 5, we estimate $\hat d$ as an upper bound for $d$. As $r_j = 2$ for $j = 1, \ldots, d$, $\hat d = d$ occurs almost always in our simulation; see Tables 7 and 8. The $\hat d$ clusters are then obtained by performing the K-means clustering on the $\hat p_0$ rows of $\hat R$, where $\hat p_0 = p - |\hat J_{d+1}|$. As the error rates in estimating $J_{d+1}^c$ have already been reported in Tables 5 and 6, we now concentrate on the components of $y_t$ with indices in $\hat J_{d+1}^c \cap J_{d+1}^c$, and count the number $\hat\tau$ of them misplaced by the K-means clustering. Both the means and the standard deviations of the error rates $\hat\tau/|\hat J_{d+1}^c \cap J_{d+1}^c|$ over the 1000 replications are reported in Tables 7 and 8, together with the relative frequencies of $\hat d = d$. Tables 7 and 8 show clearly that the K-means clustering identifies the latent clusters very accurately, and that the difference in performance due to estimating $(r_0, r)$ is also small.

Table 7 The means and standard deviations (STD) of the error rates $\hat\tau/|\hat J_{d+1}^c \cap J_{d+1}^c|$ and the relative frequencies of $\hat d = d$ in a simulation for Scenario I with 1000 replications, with the numbers of factors $r_0$ and $r$ either known or estimated.

Table 8 The means and standard deviations (STD) of the error rates $\hat\tau/|\hat J_{d+1}^c \cap J_{d+1}^c|$ and the relative frequencies of $\hat d = d$ in a simulation for Scenario II with 1000 replications, with the numbers of factors $r_0$ and $r$ either known or estimated.

6.2 Real Data Illustration

We consider the daily returns of the stocks included in the S&P 500 from December 31, 2014 to December 31, 2019. After removing those not traded on every trading day during this period, there remain $p = 477$ stocks traded on $n = 1259$ trading days. Those stocks come from 11 industry sectors: Communication Services, Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Real Estate, and Utilities.

The conventional wisdom suggests that the companies in the same industry sector share some common features. We apply the proposed 5-step algorithm in Section 3 to the return series to cluster those 477 stocks into different groups.

Step 1 estimates the numbers of the strong factors and the cluster-specific weak factors. To this end, we calculate $\hat R_j$ as in (12) with $k_0 = 5$. It turns out that $\hat R_1 = 32.53$ is much larger than all the others, while $\hat R_j$ for $j \ge 2$ are plotted in Figure 1. By (13), $\hat r_0 = 1$ and $\hat r_0 + \hat r = 4$. Note that the estimates for $\hat r_0$ and $\hat r_0 + \hat r$ are unchanged for $k_0 = 1, \ldots, 4$. While the existence of $\hat r_0 = 1$ strong common factor is reasonable, it is most unlikely that there are merely $\hat r = 3$ cluster-specific weak factors. Note that the estimators in (13) are derived under the assumption that all the $r$ cluster-specific (i.e., weak) factors are of the same factor strength; see Remark 1(ii) in Section 2. In practice, weak factors may have different degrees of strength, implying that we should also take into account the 3rd, the 4th, and the 5th largest local maxima of $\hat R_j$. Hence, we take $\hat r_0 + \hat r = 16$ (or perhaps also 10 or 13), as Figure 1 suggests that there are 3 factors with factor strength $\delta_1 > 0$, and a further 12 factors with strength $\delta_2 \in (\delta_1, 1)$.

Fig. 1 Plot of $\hat R_j$ against $j$ for $2 \le j \le 20$.

With $\hat r_0 = 1$ and $\hat r = 15$, we proceed to Steps 2 and 3 of Section 3 and obtain the estimator $\hat B$ as in (15). Setting $\omega_p = \{\hat r/(p \ln p)\}^{1/2}$, we obtain $|\hat J_{d+1}| = 11$; that is, 11 stocks do not appear to belong to any clusters, where $\hat J_{d+1}$ is defined as in (16) in Step 4. Leaving those 11 stocks out, we perform Step 5, that is, the K-means clustering for the $\hat p_0 = 477 - 11 = 466$ rows of the matrix $\hat R$. From Figure 2 we choose $\hat d = 9$, but we also consider $d = 11$ and $d = 3$ as two further examples.

Fig. 2 The eigenvalues of $|\hat B\hat B^\top|$ when $\hat r_0 = 1$ and $\hat r = 15$. The red line is $1 - \log^{-1} n$.

To present the identified $d$ clusters, we define an $11 \times d$ matrix with $n_{ij}/n_i$ as its $(i,j)$th element, where $n_i$ is the number of stocks in the $i$th industry sector, and $n_{ij}$ is the number of stocks in the $i$th industry sector allocated to the $j$th cluster. Thus, $n_{ij}/n_i \in [0,1]$ and $\sum_j n_{ij}/n_i = 1$.
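Given the sector labels and the cluster labels returned by Step 5, this composition matrix is immediate to compute; a minimal pandas sketch (names are our own):

```python
import pandas as pd

def composition(sectors, labels):
    """Rows are the proportions n_ij / n_i: sectors is a length-p0_hat list of
    sector names, labels the cluster labels from the K-means step."""
    tab = pd.crosstab(pd.Series(sectors, name="sector"),
                      pd.Series(labels, name="cluster"))
    return tab.div(tab.sum(axis=1), axis=0)   # divide the ith row by n_i
```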

The heat-maps of this $11 \times d$ matrix for $d = \hat d = 9$ are presented in Figure 3. The first cluster consists mainly of the companies in Consumer Staples and Utilities; Clusters 2 and 3 contain the companies in, respectively, Health Care and Financials; Cluster 4 contains mainly some companies in Communication Services and Information Technology; Cluster 5 consists of the companies in Industrials and Materials; Cluster 6 contains mainly the companies in Consumer Discretionary; Cluster 7 contains mainly the companies in Real Estate; Cluster 8 is mainly the companies from Information Technology; and Cluster 9 contains almost all the companies in Energy together with a small number of companies from each of 5 or 6 different sectors. To examine how stable the clustering is, we also include the results for $d = 11$ and $d = 3$ in Figure 3. When $d$ is increased from 9 to 11, the original Cluster 1 is divided into the new Clusters 1 and 11, with the former consisting of Consumer Staples and the latter of Utilities. Furthermore, the original Cluster 4 splits into the new Clusters 4 and 10, while the other 7 original clusters are hardly changed. With $d = 3$, most companies in each of the 11 sectors stay in one cluster; for example, most companies in Financials are always in a separate group.

Fig. 3 Heat-maps of the distributions of the stocks in each of the 11 industry sectors (corresponding to the 11 rows) over the $d$ clusters (corresponding to the $d$ columns), with $d = 9, 11$, and 3. The estimated numbers of the common and cluster-specific factors are, respectively, $\hat r_0 = 1$ and $\hat r = 15$.

If we take $\hat r_0 = 1$ and $\hat r = 9$, then $|\hat J_{d+1}| = 12$ and $\hat p_0 = 477 - 12 = 465$. The clustering results with $d = 9, 11$, and 3 are presented in Figure 4. The first cluster consists mainly of the companies in Consumer Staples, Real Estate, and Utilities; Clusters 2 and 3 contain the companies in, respectively, Health Care and Financials; Cluster 4 contains mainly some companies in Communication Services and Information Technology; Cluster 5 consists of the companies in Industrials and Materials; Cluster 6 contains mainly the companies in Consumer Discretionary; Cluster 7 is a mixture of a small number of companies from each of 5 or 6 different sectors; Cluster 8 is mainly the companies from Information Technology; and Cluster 9 contains almost all the companies in Energy. To examine how stable the clustering is, we also include the results for $d = 11$ and $d = 3$ in Figure 4. When $d$ is increased from 9 to 11, the original Cluster 1 is divided into the new Clusters 1 and 11, with the former consisting of the Consumer Staples and Utilities sectors and the latter of the Real Estate sector. Furthermore, the original Cluster 7 splits into the new Clusters 7 and 10, while the other 7 original clusters are hardly changed. With $d = 3$, most companies in each of the 11 sectors stay in one cluster.

Fig. 4 Heat-maps of the distributions of the stocks in each of the 11 industry sectors (corresponding to the 11 rows) over the $d$ clusters (corresponding to the $d$ columns), with $d = 9, 11$, and 3. The estimated numbers of the common and cluster-specific factors are, respectively, $\hat r_0 = 1$ and $\hat r = 9$.

If we take $\hat r_0 = 1$ and $\hat r = 12$, the estimated $\hat J_{d+1}$ is unchanged. The clustering results with $d = 9, 11$, and 3 are presented in Figure 5. Compared with Figures 3 and 4, there are some striking similarities. First, the clusterings with $d = 3$ are almost identical. For $d = 9$, the profiles of Clusters 2, …, 6, 8, and 9 are not significantly changed, while Clusters 1 and 7 of Figure 3 are somewhat mixed together in Figure 5. With $d = 11$, the profiles of Clusters 2–6 and 8–10 in the two figures are about the same, while Clusters 7 and 11 are mixed up across the two figures.

Fig. 5 Heat-maps of the distributions of the stocks in each of the 11 industry sectors (corresponding to the 11 rows) over the $d$ clusters (corresponding to the $d$ columns), with $d = 9, 11$, and 3. The estimated numbers of the common and cluster-specific factors are, respectively, $\hat r_0 = 1$ and $\hat r = 12$.

The analysis above indicates that companies in the same industry sector tend to share a similar dynamic structure in the sense that they are driven by the same cluster-specific factors. Our analysis is reasonably stable, as most of the clusters do not change substantially when the number of weak factors is set to the different values $\hat r = 9$, $\hat r = 12$, or $\hat r = 15$.

We also apply the method of Ando and Bai (2017) to this dataset; it leads to the same estimate $\hat r_0 = 1$, but to the smaller estimates $\hat r = 4$ and $\hat d = 4$. The clustering results for $\hat d = 3$ and $\hat d = 4$ are presented in Figure 6; they are similar to the $d = 3$ panels of Figures 3–5, though the method of Ando and Bai (2017) puts the energy companies in a separate group whereas our method puts the financial companies in a separate group. Note that classical papers in finance (e.g., Berger and Ofek, 1995; Denis, Denis, and Yost, 2002; Lemmon, Roberts, and Zender, 2008) often eliminate financial companies from their samples.

Fig. 6 Heat-maps of the distributions of the stocks in each of the 11 industry sectors (corresponding to the 11 rows) over the $d$ clusters (corresponding to the $d$ columns), based on the method of Ando and Bai (2017).

7 Miscellaneous Comments

Robustness

We identify and distinguish the common factors and the cluster-specific factors by their different factor strengths; that is, common factors are strong with $\delta = 0$, and cluster-specific factors are weak with $\delta > 0$. If, however, one of the common factors has the same strength as the cluster-specific factors, the number of strong factors is then $r_0 - 1$ and the number of weak factors is $r + 1$. In this case, the estimates of $r_0$, $r$, and the factor loading spaces will all be wrong. Nevertheless, a common factor has nonzero loadings on most components of $y_t$; hence those loadings must be extremely small for it to be a weak factor. Therefore, its impact on the estimation of the number of clusters and on the misclassification rates is minor. Simulation results in the supplementary materials support this assertion.

Heterogeneous factor strength

We assume two factor strengths: $p$ for the common factors and $p^{1-\delta}$ for the cluster-specific factors. If there are $s$ different strengths $p^{1-\delta_1}, \ldots, p^{1-\delta_s}$ among the cluster-specific factors, we can search for the $s$ largest local maxima among $\hat R_1, \ldots, \hat R_{J_0-1}$ in Step 1. Moreover, Step 2 should be repeated $s$ times to estimate the $s$ factor loading matrices corresponding to the different factor strengths. While the asymptotic results can be extended accordingly, small values such as $s \le 3$ are sufficient for most practical applications.
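In code, this amounts to collecting the top local maxima of the ratio sequence as candidate cut points between groups of factors of different strengths; how many to keep depends on the convention above, so we leave the count as an argument in the sketch below (our own names):

```python
import numpy as np

def top_local_maxima(R, m):
    """Return the (1-based) indices of the m largest local maxima of the ratio
    sequence R = [R_hat_1, ..., R_hat_{J0}], i.e. the candidate cut points."""
    peaks = [j for j in range(1, len(R) - 1)
             if R[j] > max(R[j - 1], R[j + 1])]
    peaks.sort(key=lambda j: R[j], reverse=True)
    return sorted(j + 1 for j in peaks[:m])
```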

Estimation for $r_j$ and $B_j$

When the time series are clustered correctly in Steps 4–5, we obtain the estimator for $B_j$ from $\hat B$ directly. We can also run Step 1 on each cluster to estimate $r_j$; Theorem 3 ensures the consistency of those estimates. Although Theorem 7 ensures that most of the time series can be clustered correctly, a cluster obtained in Step 5 may not be accurate enough. It remains an open problem to evaluate how the clustering error propagates into the estimation of $r_j$ and $B_j$.

Supplementary Materials

All the technical proofs are presented in the online supplementary materials, which also contain additional simulation results, as well as the dataset and the code used in Section 6.


Acknowledgments

We would like to thank the three reviewers, the associate editor and the editor for their comments and suggestions. We would like to thank Professor Tomohiro Ando and Professor Jushan Bai for kindly providing us with the computer code for the simulations in Ando and Bai (2017).

Disclosure Statement

There are no relevant competing interests to declare.

Additional information

Funding

Bo Zhang is partially supported by National Natural Science Funds of China No.12001517 & 72091212, National Key R&D Program of China-2022YFA1008000, USTC Research Funds of the Double First-Class Initiative YD2040002005 and The Fundamental Research Funds for the Central Universities WK2040000026 & WK2040000027. Guangming Pan is partially supported by MOE Tier 2 grant 2018-T2-2-112 and MOE Tier 1 grant RG76/21 at the Nanyang Technological University, Singapore. Qiwei Yao is partially supported by EPSRC (UK) Research grant EP/V007556/1. Wang Zhou is partially supported by a grant A-8000440-00-00 at the National University of Singapore.

References

  • Aghabozorgi, S., Shirkhorshidi, A. S., and Wah, T. Y. (2015), “Time-Series Clustering – A Decade Review,” Information Systems, 53, 16–38. DOI: 10.1016/j.is.2015.04.007.
  • Alonso, A. M., and Peña, D. (2019), “Clustering Time Series by Linear Dependency,” Statistics and Computing, 29, 655–676. DOI: 10.1007/s11222-018-9830-6.
  • Ando, T., and Bai, J. (2017), “Clustering Huge Number of Financial Time Series: A Panel Data Approach with High-Dimensional Predictors and Factor Structures,” Journal of the American Statistical Association, 112, 1182–1198. DOI: 10.1080/01621459.2016.1195743.
  • Berger, P. G., and Ofek, E. (1995), “Diversification’s Effect on Firm Value,” Journal of Financial Economics, 37, 39–65. DOI: 10.1016/0304-405X(94)00798-6.
  • Chamberlain, G. (1983), “Funds, Factors, and Diversification in Arbitrage Pricing Models,” Econometrica, 51, 1305–1323. DOI: 10.2307/1912276.
  • Chamberlain, G., and Rothschild, M. (1983), “Arbitrage, Factor Structure, and Mean-Variance Analysis on Large Asset Markets,” Econometrica, 51, 1281–1304. DOI: 10.2307/1912275.
  • Chang, J., Gao, B., and Yao, Q. (2015), “High Dimensional Stochastic Regression with Latent Factors, Endogeneity and Nonlinearity,” Journal of Econometrics, 189, 297–312. DOI: 10.1016/j.jeconom.2015.03.024.
  • Denis, D. J., Denis, D. K., and Yost, K. (2002), “Global Diversification, Industrial Diversification, and Firm Value,” Journal of Finance, 57, 1951–1979. DOI: 10.1111/0022-1082.00485.
  • Esling, P., and Agon, C. (2012), “Time-Series Data Mining,” ACM Computing Surveys, 45, Article 12. DOI: 10.1145/2379776.2379788.
  • Frühwirth-Schnatter, S., and Kaufmann, S. (2008), “Model-based Clustering of Multiple Time Series,” Journal of Business & Economic Statistics, 26, 78–89. DOI: 10.1198/073500107000000106.
  • Kakizawa, Y., Shumway, R. H., and Taniguchi, M. (1998), “Discrimination and Clustering for Multivariate Time Series,” Journal of the American Statistical Association, 93, 328–340. DOI: 10.1080/01621459.1998.10474114.
  • Keogh, E., and Lin, J. (2005), “Clustering of Time-Series Subsequences is Meaningless: Implications for Previous and Future Research,” Knowledge and Information Systems, 8, 154–177. DOI: 10.1007/s10115-004-0172-7.
  • Keogh, E., and Ratanamahatana, C. A. (2005), “Exact Indexing of Dynamic Time Warping,” Knowledge and Information Systems, 7, 358–386. DOI: 10.1007/s10115-004-0154-9.
  • Khaleghi, A., Ryabko, D., Mary, J., and Preux, P. (2016), “Consistent Algorithms for Clustering Time Series,” Journal of Machine Learning Research, 17, 1–32.
  • Lam, C., and Yao, Q. (2012), “Factor Modelling for High-Dimensional Time Series: Inference for the Number of Factors,” The Annals of Statistics, 40, 694–726. DOI: 10.1214/12-AOS970.
  • Lemmon, M. L., Roberts, M. R., and Zender, J. F. (2008), “Back to the Beginning: Persistence and the Cross-Section of Corporate Capital Structure,” Journal of Finance, 63, 1575–1608. DOI: 10.1111/j.1540-6261.2008.01369.x.
  • Li, Z., Wang, Q., and Yao, J. (2017), “Identifying the Number of Factors from Singular Values of a Large Sample Auto-Covariance Matrix,” The Annals of Statistics, 45, 257–288. DOI: 10.1214/16-AOS1452.
  • Liao, T. W. (2005), “Clustering of Time Series Data – A Survey,” Pattern Recognition, 38, 1857–1874.
  • Maharaj, E. A., D’Urso, P., and Caiado, J. (2019), Time Series Clustering and Classification, Boca Raton, FL: Chapman and Hall/CRC.
  • Roelofsen, P. (2018), “Time Series Clustering,” Vrije Universiteit Amsterdam. Available at https://www.math.vu.nl/~sbhulai/papers/thesis-roelofsen.pdf.
  • Yao, Q., Tong, H., Finkenstädt, B., and Stenseth, N. C. (2000), “Common Structure in Panels of Short Ecological Time Series,” Proceedings of the Royal Society of London, Series B, 267, 2457–2467.
  • Zhang, T. (2013), “Clustering High-Dimensional Time Series based on Parallelism,” Journal of the American Statistical Association, 108, 577–588. DOI: 10.1080/01621459.2012.760458.
  • Zolhavarieh, S., Aghabozorgi, S., and Teh, Y. W. (2014), “A Review of Subsequence Time Series Clustering,” The Scientific World Journal, 2014, 312521. DOI: 10.1155/2014/312521.