
Nonparametric homogeneity pursuit in functional-coefficient models

Pages 387-416 | Received 25 Mar 2020, Accepted 28 Jun 2021, Published online: 14 Jul 2021

ABSTRACT

This paper explores the homogeneity of coefficient functions in nonlinear models with functional coefficients and identifies the underlying semiparametric modelling structure. With initial kernel estimates, we combine the classic hierarchical clustering method with a generalised version of the information criterion to estimate the number of clusters, each of which has a common functional coefficient, and determine the membership of each cluster. To identify a possible semi-varying coefficient modelling framework, we further introduce a penalised local least squares method to determine zero coefficients, non-zero constant coefficients and functional coefficients which vary with an index variable. Through the nonparametric kernel-based cluster analysis and the penalised approach, we can substantially reduce the number of unknown parametric and nonparametric components in the models, thereby achieving the aim of dimension reduction. Under some regularity conditions, we establish the asymptotic properties for the proposed methods including the consistency of the homogeneity pursuit. Numerical studies, including Monte-Carlo experiments and two empirical applications, are given to demonstrate the finite-sample performance of our methods.


1. Introduction

We consider the functional-coefficient model defined by
$$Y_t = X_t^\top \beta_0(U_t) + \varepsilon_t, \quad t = 1, \ldots, n, \qquad (1)$$
where $Y_t$ is a response variable, $X_t = (X_{t1}, \ldots, X_{tp})^\top$ is a $p$-dimensional vector of random covariates, $\beta_0(\cdot) = [\beta_{10}(\cdot), \ldots, \beta_{p0}(\cdot)]^\top$ is a $p$-dimensional vector of functional coefficients, $U_t$ is a univariate index variable, and $\varepsilon_t$ is an independent and identically distributed (i.i.d.) error term. The functional-coefficient model is a natural extension of the classic linear regression model: it allows the regression coefficients to vary with a certain index variable and thus captures a flexible dynamic relationship between the response and covariates. In recent years, there have been extensive studies on estimation and model selection for model (1) and its various generalised versions (see, e.g. Fan and Zhang 1999; Cai, Fan, and Yao 2000; Xia, Zhang, and Tong 2004; Fan and Zhang 2008; Wang and Xia 2009; Kai, Li, and Zou 2011; Park, Mammen, Lee, and Lee 2015).

However, when the number of functional coefficients is large or moderately large, it is well known that a direct nonparametric estimation of the potentially $p$ different coefficient functions in model (1) would be unstable. To address this issue, there have been some extensive studies in the literature on selecting significant variables in functional-coefficient models (Fan, Ma, and Dai 2014; Liu, Li, and Wu 2014) or exploring certain rank-reduced structures in functional coefficients (Jiang, Wang, Xia, and Jiang 2013; Chen, Li, and Xia 2019), both of which aim to reduce the dimension of unknown functional coefficients and improve estimation efficiency. In this paper, we consider a different approach: we assume that there is a homogeneity structure on model (1), so that individual functional coefficients can be grouped into a number of clusters and coefficients within each cluster have the same functional pattern. Throughout the paper, we assume that the dimension $p$ may depend on the sample size $n$ and can be divergent with $n$, but the number of unknown clusters is fixed and much smaller than $p$. It is easy to see that dimension reduction through homogeneity pursuit is more general than the commonly used sparsity assumption in high-dimensional functional-coefficient models (cf. Fan et al. 2014; Liu et al. 2014; Li, Ke, and Zhang 2015; Lee and Mammen 2016), as the latter can be seen as a special case of the former with a very large group of zero coefficients. Specifically, we assume the following homogeneity structure on model (1): there exists a partition of $\{1, 2, \ldots, p\}$, denoted by $\mathcal{C}_0 = \{C_1^0, \ldots, C_{K_0}^0\}$, such that
$$\beta_{j0}(\cdot) = \alpha_{k0}(\cdot) \ \text{for } j \in C_k^0, \quad \text{and} \quad C_{k_1}^0 \cap C_{k_2}^0 = \emptyset \ \text{for } 1 \le k_1 \ne k_2 \le K_0, \qquad (2)$$
where the Lebesgue measure of $\{u \in \mathcal{U} : \alpha_{k_1 0}(u) - \alpha_{k_2 0}(u) \ne 0\}$ is positive and bounded away from zero for any $1 \le k_1 \ne k_2 \le K_0$, and $\mathcal{U}$ is the compact support of the index variable $U_t$. Furthermore, some of the functional coefficients $\alpha_{k0}(\cdot)$ are allowed to have constant values, in which case model (1) is semiparametric with a combination of constant and functional coefficients. Our aim is to (i) explore homogeneity structure (2) by estimating the unknown number of clusters $K_0$ and identifying the members of the clusters $C_1^0, \ldots, C_{K_0}^0$; and (ii) identify the clusters of constant coefficients and those of coefficients varying with $U_t$, and estimate their unknown values.

The topic investigated in our paper has two close relatives in the existing literature. On the one hand, functional-coefficient regression with a homogeneity structure is a natural extension of linear regression with a homogeneity structure, which has received increasing attention in recent years. For example, Tibshirani, Saunders, Rosset, Zhu, and Knight (2005) introduce the so-called fused LASSO method to study slope homogeneity; Bondell and Reich (2008) propose the OSCAR penalised method for grouping pursuit; Shen and Huang (2010) use a truncated $L_1$ penalised method to extract the latent grouping structure; and Ke, Fan, and Wu (2015) propose the CARDS method to identify the homogeneity structure and estimate the parameters simultaneously. On the other hand, this paper is also relevant to some recent literature on longitudinal/panel data model classification. For example, Ke, Li, and Zhang (2016) and Su, Shi, and Phillips (2016) consider identifying the latent group structure in linear longitudinal data models by using binary segmentation and shrinkage methods, respectively; Vogt and Linton (2017) introduce a kernel-based classification of univariate nonparametric regression functions in longitudinal data; and Su, Wang, and Jin (2019) propose a penalised sieve estimation method to identify the latent grouping structure in time-varying coefficient longitudinal data models. The methodology of nonparametric homogeneity pursuit developed in this paper is substantially different from those in the aforementioned literature.

In this paper, we first estimate each functional coefficient in model (1) by using the kernel smoothing method while ignoring homogeneity structure (2), and calculate the $L_1$-distance between the estimated functional coefficients. Then, we combine the classic hierarchical clustering method and a generalised version of the information criterion to explore homogeneity structure (2), i.e. to estimate $K_0$ and the members of $C_k^0$, $k = 1, \ldots, K_0$. Under some mild conditions, we show that the developed estimators for the number $K_0$ and the index sets $C_k^0$, $k = 1, \ldots, K_0$, are consistent. After estimating structure (2), we further estimate a semi-varying coefficient modelling framework by determining the zero coefficients, the non-zero constant coefficients and the functional coefficients varying with the index variable. This is done by using a penalised local least squares method, where the penalty function is the weighted LASSO with the weights defined as derivatives of the well-known SCAD penalty introduced by Fan and Li (2001). With the nonparametric cluster analysis and the penalised approach, we can reduce the number of unknown components in model (1) from $p$ to $K_0 - 1$ (if a cluster of zero coefficients exists in the model). Furthermore, the choice of the tuning parameters in the proposed estimation approach and the computational algorithm is also discussed. The simulation studies show that the proposed methods have reliable finite-sample numerical performance. We finally apply the model and methodology to analyse the Boston house price data and the plasma beta-carotene level data, and find that the original nonparametric functional-coefficient models can be simplified and the number of unknown components involved can be reduced. In particular, the out-of-sample mean absolute prediction errors of our approach are usually much smaller than those using the naive kernel method which ignores the latent homogeneity structure.

The rest of the paper is organised as follows. Section 2 introduces the clustering method, information criterion and penalised method to determine the unknown clusters and estimate the unknown components. Section 3 establishes the asymptotic theory for the proposed clustering and estimation methods. Section 4 discusses the choice of the tuning parameters and introduces an algorithm for computing the penalised estimates. Section 5 reports Monte-Carlo simulation studies. Section 6 gives the empirical applications to the Boston house price data and the plasma beta-carotene level data. Section 7 concludes the paper. The proofs of the main asymptotic theorems are given in a supplemental document.

2. Methodology

In this section, we first introduce a clustering method for kernel estimated functional coefficients in Section 2.1, followed by a generalised information criterion for determining the number of clusters in Section 2.2, and finally, propose a penalised local linear estimation approach to identify the semi-varying coefficient modelling structure in Section 2.3.

2.1. Kernel-based clustering method

Assuming that the coefficient functions have continuous second-order derivatives, we can use the kernel smoothing method (cf. Wand and Jones 1995) to obtain preliminary estimates of $\beta_{j0}(\cdot)$, $j = 1, \ldots, p$, which are denoted by $\tilde\beta_j(\cdot)$, $j = 1, \ldots, p$. Let $\mathcal{Y}_n = (Y_1, \ldots, Y_n)^\top$, $\mathcal{X}_n = (X_1, \ldots, X_n)^\top$ and $W_n(u) = \mathrm{diag}\{K_h(U_1, u), \ldots, K_h(U_n, u)\}$ with $K_h(U_t, u) = K((U_t - u)/h)$, where $K(\cdot)$ is a kernel function and $h$ is a bandwidth which tends to zero as the sample size $n$ diverges to infinity. Then the kernel estimate $\tilde\beta(u_0)$ can be expressed as follows:
$$\tilde\beta(u_0) = [\tilde\beta_1(u_0), \ldots, \tilde\beta_p(u_0)]^\top = \Big[\sum_{t=1}^n X_t X_t^\top K_h(U_t, u_0)\Big]^{-1} \Big[\sum_{t=1}^n X_t Y_t K_h(U_t, u_0)\Big] = [\mathcal{X}_n^\top W_n(u_0) \mathcal{X}_n]^{-1} [\mathcal{X}_n^\top W_n(u_0) \mathcal{Y}_n], \qquad (3)$$
where $u_0$ is a point on the support of the index variable. Note that other commonly used nonparametric estimation methods such as the local polynomial method (Fan and Gijbels 1996) and the B-spline method (Green and Silverman 1994) are also applicable to obtain the preliminary estimates.
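To make the estimator concrete, the following is a minimal sketch of (3) in Python, assuming numpy arrays `X` (n × p), `Y` and `U` (length n); the Epanechnikov kernel and the function names are our own illustrative choices, not part of the paper.

```python
import numpy as np

def epanechnikov(z):
    """Epanechnikov kernel K(z) = 0.75 (1 - z^2) on [-1, 1]."""
    return 0.75 * np.maximum(1.0 - z ** 2, 0.0)

def kernel_beta(u0, X, Y, U, h):
    """Preliminary kernel estimate beta~(u0) in (3): a kernel-weighted
    least-squares fit of Y on X, localised around the point u0."""
    w = epanechnikov((U - u0) / h)       # K_h(U_t, u0), t = 1, ..., n
    Xw = X * w[:, None]                  # rows X_t scaled by their weights
    return np.linalg.solve(Xw.T @ X, Xw.T @ Y)   # p-vector beta~(u0)
```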

Without loss of generality, we let $\mathcal{U} = [0, 1]$ be the compact support of the index variable $U_t$. Define
$$\tilde\Delta_{ij} = \frac{1}{n} \sum_{t=1}^n |\tilde\beta_i(U_t) - \tilde\beta_j(U_t)|\, I(U_t \in \mathcal{U}_h), \qquad (4)$$
where $\tilde\beta_i(\cdot)$ is defined in (3), $I(\cdot)$ is the indicator function and $\mathcal{U}_h = [h, 1-h]$. The aim of truncating the observations outside $\mathcal{U}_h$ is to overcome the so-called boundary effect in the kernel estimation. Noting that $h \to 0$, the set $\mathcal{U}_h$ can be sufficiently close to $\mathcal{U}$, and thus the information loss is negligible. In fact, $\tilde\Delta_{ij}$ can be viewed as a natural estimate of
$$\Delta_{ij}^0 = \int_{\mathcal{U}_h} |\beta_{i0}(u) - \beta_{j0}(u)| f_U(u)\, du, \qquad (5)$$
where $f_U(\cdot)$ is the density function of $U_t$. Under some smoothness conditions on $\beta_{i0}(\cdot)$ and $f_U(\cdot)$, we may show that $\Delta_{ij}^0 \to \int_{\mathcal{U}} |\beta_{i0}(u) - \beta_{j0}(u)| f_U(u)\, du$ as $n \to \infty$. From (2) and (5), we have $\Delta_{ij}^0 = 0$ for $i, j \in C_k^0$, and $\Delta_{ij}^0 \ne 0$ for $i \in C_{k_1}^0$ and $j \in C_{k_2}^0$ with $k_1 \ne k_2$. Then, we define a distance matrix among the functional coefficients, denoted by $\Delta_0$, whose $(i, j)$-entry is $\Delta_{ij}^0$. The corresponding estimated distance matrix, denoted by $\tilde\Delta_n$, has entries $\tilde\Delta_{ij}$ defined in (4). It is obvious that both $\Delta_0$ and $\tilde\Delta_n$ are $p \times p$ symmetric matrices with zero main diagonal elements.
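The estimated distance matrix $\tilde\Delta_n$ can then be computed as follows; this sketch reuses `kernel_beta` from above and evaluates (4) by summing over the observations with $U_t \in [h, 1-h]$.

```python
def distance_matrix(X, Y, U, h):
    """Estimated L1 distances Delta~_ij in (4) between the preliminary
    kernel-estimated coefficient curves, truncated to U_t in [h, 1-h]."""
    n, p = X.shape
    inside = (U >= h) & (U <= 1.0 - h)
    # p x n_h array: column s holds beta~(U_s) for an interior point U_s
    B = np.column_stack([kernel_beta(u, X, Y, U, h) for u in U[inside]])
    D = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            D[i, j] = D[j, i] = np.abs(B[i] - B[j]).sum() / n
    return D
```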

We next use the well-known agglomerative hierarchical clustering method to explore the homogeneity among the functional coefficients. This clustering method starts with $p$ singleton clusters, corresponding to the $p$ functional coefficients. At each stage, the two clusters with the smallest distance are merged into a new cluster. This continues until we end with only one full cluster. Such a clustering approach has been widely studied in the literature on cluster analysis (cf. Everitt, Landau, Leese, and Stahl 2011; Rencher and Christensen 2012). However, to the best of our knowledge, there is virtually no work combining the agglomerative hierarchical clustering method with the kernel smoothing of functional coefficients in nonparametric homogeneity pursuit. This paper fills this gap. Specifically, the algorithm is described as follows (a code sketch using an off-the-shelf clustering routine is given after the list), where the number of clusters $K_0$ is assumed to be known. Section 2.2 will introduce an information criterion to determine the number $K_0$.

  1. Start with $p$ clusters, each containing one functional coefficient, and search for the smallest distance among the off-diagonal elements of $\tilde\Delta_n$.

  2. Merge the two clusters with the smallest distance, and then re-calculate the distance between clusters and update the distance matrix. Here the distance between two clusters A and B is defined as the farthest distance between a point in A and a point in B, which is called the complete linkage.

  3. Repeat steps 1 and 2 until the number of clusters reaches K0.
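Steps 1–3 coincide with standard complete-linkage agglomerative clustering, so an off-the-shelf routine can be applied to the matrix returned by `distance_matrix`; a sketch with scipy (our choice of tool, not the paper's) is:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_coefficients(D, K):
    """Complete-linkage agglomerative clustering on the estimated distance
    matrix D (steps 1-3), stopped when K clusters remain; returns a label
    in {1, ..., K} for each of the p coefficients."""
    Z = linkage(squareform(D, checks=False), method='complete')
    return fcluster(Z, t=K, criterion='maxclust')
```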

Let $\tilde C_1, \ldots, \tilde C_{K_0}$ be the estimated clusters obtained via the above algorithm when the true number of clusters is known a priori. More generally, if the number of clusters is assumed to be $K$ with $1 \le K \le p$, we stop the above algorithm when the number of clusters reaches $K$, and let $\tilde C_{1|K}, \ldots, \tilde C_{K|K}$ be the estimated clusters.

2.2. Estimation of the cluster number

In practice, the true number of clusters is usually unknown and needs to be estimated. When the number of clusters is assumed to be $K$, we define the post-clustering kernel estimate of the functional coefficients:
$$\tilde\alpha_K(u_0) = [\tilde\alpha_{1|K}(u_0), \ldots, \tilde\alpha_{K|K}(u_0)]^\top = \Big[\sum_{t=1}^n \tilde X_{t,K} \tilde X_{t,K}^\top K_h(U_t, u_0)\Big]^{-1} \Big[\sum_{t=1}^n \tilde X_{t,K} Y_t K_h(U_t, u_0)\Big], \qquad (6)$$
where $\tilde X_{t,K} = (\tilde X_{t,1|K}, \ldots, \tilde X_{t,K|K})^\top$ with $\tilde X_{t,k|K} = \sum_{j \in \tilde C_{k|K}} X_{tj}$, and $\tilde C_{k|K}$ is defined as in Section 2.1. When the number $K$ is larger than $K_0$, $\tilde\alpha_K(\cdot)$ is still a uniformly consistent kernel estimate of the functional coefficients (cf. the proof of Theorem 3.2 in the supplemental document); but when $K$ is smaller than $K_0$, the clustering approach in Section 2.1 results in a misspecified functional-coefficient model, and $\tilde\alpha_K(\cdot)$ constructed in (6) can be viewed as the kernel estimate of the 'quasi' functional coefficients which will be defined in (14).

We define the following objective function:
$$\mathrm{IC}(K) = \log[\tilde\sigma_n^2(K)] + K \Big[\frac{\log(n_h)}{n_h}\Big]^\rho \qquad (7)$$
with $0 < \rho < 1$,
$$\tilde\sigma_n^2(K) = \frac{1}{n_h} \sum_{t=1}^n [Y_t - \tilde X_{t,K}^\top \tilde\alpha_K(U_t)]^2 I(U_t \in \mathcal{U}_h) \quad \text{and} \quad n_h = \sum_{t=1}^n I(U_t \in \mathcal{U}_h),$$
and determine the number of clusters through
$$\tilde K = \arg\min_{1 \le K \le \bar K} \mathrm{IC}(K), \qquad (8)$$
where $\bar K$ is a pre-specified finite positive integer which is larger than $K_0$. In practical applications, $\bar K$ can be chosen to be the same as the dimension of the covariates, $p$, if the latter is either fixed or moderately large. If we choose $\rho$ close to 1 and treat $n_h$ as the 'effective' sample size, the above criterion is similar to the classic Bayesian information criterion introduced by Schwarz (1978). Su et al. (2016) use a similar information criterion to determine the group number in linear longitudinal data models. The Bayesian information criterion has been extended to the nonparametric framework in recent years (cf. Wang and Xia 2009).
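A sketch of (7)–(8), reusing `kernel_beta` and `cluster_coefficients` from Section 2.1; the exponent `rho` is a user choice in (0, 1), and the helper names are ours.

```python
def information_criterion(K, labels, X, Y, U, h, rho=0.5):
    """IC(K) in (7): log of the post-clustering residual variance plus the
    complexity penalty K [log(n_h)/n_h]^rho. `labels` assigns each of the
    p covariates to a cluster in {1, ..., K}."""
    n, p = X.shape
    # X~_{t,k|K}: sum the covariates within each estimated cluster
    Xc = np.column_stack([X[:, labels == k].sum(axis=1)
                          for k in range(1, K + 1)])
    inside = np.where((U >= h) & (U <= 1.0 - h))[0]
    n_h = len(inside)
    rss = sum((Y[t] - Xc[t] @ kernel_beta(U[t], Xc, Y, U, h)) ** 2
              for t in inside)                  # post-clustering fit (6)
    return np.log(rss / n_h) + K * (np.log(n_h) / n_h) ** rho

def select_K(D, X, Y, U, h, K_bar, rho=0.5):
    """K~ in (8): minimise IC(K) over K = 1, ..., K_bar."""
    ics = [information_criterion(K, cluster_coefficients(D, K),
                                 X, Y, U, h, rho)
           for K in range(1, K_bar + 1)]
    return int(np.argmin(ics)) + 1
```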

2.3. Penalised local linear estimation

We next introduce a penalised approach to further identify the clusters with non-zero constant coefficients and the cluster with zero coefficients. For notational simplicity, we let $\tilde X_t = \tilde X_{t,\tilde K}$, and let $\tilde\alpha(u_0) = [\tilde\alpha_1(u_0), \ldots, \tilde\alpha_{\tilde K}(u_0)]^\top$ be defined similarly to $\tilde\alpha_K(u_0)$ in (6) with $K = \tilde K$. Throughout the paper, we call $\tilde\alpha(\cdot)$ the post-clustering kernel estimator. It is obvious that identifying the constant coefficients is equivalent to identifying the functional coefficients such that either their derivatives are zero or the deviations of the functional coefficients, $D_k^0$, are zero (cf. Li et al. 2015), where
$$D_k^0 = \Big\{\sum_{t=1}^n [\alpha_{k0}(U_t) - \bar\alpha_k]^2\Big\}^{1/2}, \quad \bar\alpha_k = \frac{1}{n} \sum_{s=1}^n \alpha_{k0}(U_s).$$
In practice, we may estimate the deviation of the functional coefficients by
$$\tilde D_k = \Big\{\sum_{t=1}^n \Big[\tilde\alpha_k(U_t) - \frac{1}{n} \sum_{s=1}^n \tilde\alpha_k(U_s)\Big]^2\Big\}^{1/2}, \quad k = 1, \ldots, \tilde K.$$
Let
$$A = (a_1, \ldots, a_n)^\top, \quad a_t = (a_{t1}, \ldots, a_{t\tilde K})^\top; \quad B = (b_1, \ldots, b_n)^\top, \quad b_t = (b_{t1}, \ldots, b_{t\tilde K})^\top;$$
$$A_k = (a_{1k}, \ldots, a_{nk})^\top, \quad B_k = (b_{1k}, \ldots, b_{nk})^\top.$$
As in Li et al. (2015), we define the penalised objective function as follows:
$$Q_n(A, B) = L_n(A, B) + P_{n1}(A) + P_{n2}(B), \qquad (9)$$
where
$$L_n(A, B) = \sum_{s=1}^n L_n(a_s, b_s) = \frac{1}{n} \sum_{s=1}^n \sum_{t=1}^n [Y_t - \tilde X_t^\top a_s - \tilde X_t^\top b_s (U_t - U_s)]^2 K_h(U_t, U_s),$$
$$P_{n1}(A) = \sum_{k=1}^{\tilde K} p'_{\lambda_1}(\|\tilde A_k\|) \|A_k\|, \quad P_{n2}(B) = \sum_{k=1}^{\tilde K} p'_{\lambda_2}(\tilde D_k)\, h \|B_k\|,$$
in which $\tilde A_k = [\tilde\alpha_k(U_1), \ldots, \tilde\alpha_k(U_n)]^\top$, $\|\cdot\|$ denotes the Euclidean norm, $\lambda_1$ and $\lambda_2$ are two tuning parameters, and $p'_\lambda(\cdot)$ is the derivative of the SCAD penalty function (Fan and Li 2001):
$$p'_\lambda(z) = \lambda \Big[I(z \le \lambda) + \frac{(a\lambda - z)_+}{(a-1)\lambda} I(z > \lambda)\Big].$$
Following Fan and Li's (2001) recommendation, we choose $a = 3.7$ in this paper. Let
$$\hat A_k = [\hat\alpha_k(U_1), \ldots, \hat\alpha_k(U_n)]^\top \quad \text{and} \quad \hat B_k = [\hat\alpha'_k(U_1), \ldots, \hat\alpha'_k(U_n)]^\top, \quad k = 1, \ldots, \tilde K, \qquad (10)$$
be the minimiser of the objective function $Q_n(A, B)$ defined in (9). Through the penalisation, we would expect $\hat A_k = 0$ when $\tilde C_{k|\tilde K}$ is the estimated cluster with zero coefficients, and $\hat B_k = 0$ when $\tilde C_{k|\tilde K}$ is an estimated cluster with a non-zero constant coefficient (see (20) in Theorem 3.3). Hence, if $\hat A_k = 0$, the corresponding covariates are not significant and should be removed from functional-coefficient model (1); and if $\hat B_k = 0$, the functional coefficient has a constant value and can be consistently estimated by
$$\hat\alpha_k = \frac{1}{n} \sum_{t=1}^n \hat\alpha_k(U_t). \qquad (11)$$
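The SCAD derivative $p'_\lambda(\cdot)$ that weights the two penalty terms has a simple closed form; a direct transcription:

```python
def scad_derivative(z, lam, a=3.7):
    """Derivative p'_lambda(z) of the SCAD penalty (Fan and Li, 2001),
    with a = 3.7 as recommended in the paper."""
    z = np.asarray(z, dtype=float)
    return lam * ((z <= lam) + (z > lam) * np.maximum(a * lam - z, 0.0)
                  / ((a - 1.0) * lam))
```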

Implementation of the proposed methods in Sections 2.1–2.3 can be summarised as a three-step procedure: preliminary kernel estimation of the functional coefficients, kernel-based clustering combined with the information criterion to recover the homogeneity structure, and penalised local linear estimation to identify the semi-varying coefficient structure.

3. Asymptotic theorems

In this section, we give the asymptotic theorems for the proposed clustering and semiparametric penalised methods. We start with some regularity conditions, some of which might be weakened at the expense of more lengthy proofs.

Assumption 3.1

The kernel function $K(\cdot)$ is a Lipschitz-continuous and symmetric probability density function with compact support $[-1, 1]$.

Assumption 3.2

(i)

The density function of the index variable $U_t$, $f_U(\cdot)$, has a continuous second-order derivative and is bounded away from zero and infinity on the support.

(ii)

The functional coefficients $\beta_0(\cdot)$ and $\alpha_0(\cdot) = [\alpha_{10}(\cdot), \ldots, \alpha_{K_0 0}(\cdot)]^\top$ have continuous second-order derivatives.

Assumption 3.3

(i)

The $p \times p$ matrix $\Sigma(u) := E(X_t X_t^\top | U_t = u)$ is twice continuously differentiable and positive definite for any $u \in [0, 1]$. Furthermore,
$$0 < \inf_{u \in [0,1]} \lambda_{\min}(\Sigma(u)) \le \sup_{u \in [0,1]} \lambda_{\max}(\Sigma(u)) < \infty,$$
where $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the smallest and largest eigenvalues, respectively.

(ii)

Let $(U_t, X_t, \varepsilon_t)$, $t = 1, \ldots, n$, be i.i.d. Furthermore, the error $\varepsilon_t$ is independent of $(U_t, X_t)$ with $E[\varepsilon_t] = 0$ and $0 < \sigma^2 = E[\varepsilon_t^2] < \infty$, and there exists $0 < \iota_1 < \infty$ such that
$$E(|\varepsilon_t|^{2+\iota_1}) + \max_{1 \le i \le p} E(|X_{ti}|^{2(2+\iota_1)}) < \infty.$$

Assumption 3.4

(i)

Let the bandwidth $h$ and the dimension $p$ satisfy
$$p(\epsilon_n + h^2) = o(1), \quad n^{2\iota_2 - 1} h \to \infty,$$
where $\epsilon_n = \sqrt{\log h^{-1} / (nh)}$ and $\iota_2 < 1 - 1/(2 + \iota_1)$.

(ii)

Let
$$p^{1/2}(\epsilon_n + h^2) = o(\delta_n), \quad n^{1/2} \delta_n / (\log n)^{1/2} \to \infty,$$
where $\delta_n = \min_{1 \le k_1 \ne k_2 \le K_0} \delta_{k_1 k_2}$ and $\delta_{k_1 k_2} = \int_{\mathcal{U}_h} |\alpha_{k_1 0}(u) - \alpha_{k_2 0}(u)| f_U(u)\, du$.

Remark 3.1

Assumptions 3.1–3.3 are commonly used conditions in the kernel estimation of functional-coefficient models. The strong moment condition on $\varepsilon_t$ and $X_t$ in Assumption 3.3(ii) is required when applying the uniform asymptotics of some kernel-based quantities. The independence condition between $\varepsilon_t$ and $(U_t, X_t)$ seems restrictive, but may be replaced by the following heteroscedastic error structure: $\varepsilon_t = \sigma(U_t, X_t) \eta_t$, where $\eta_t$ is independent of $(U_t, X_t)$ and $\sigma^2(\cdot, \cdot)$ is a conditional volatility function. By slightly modifying our proofs, the asymptotic properties continue to hold under this relaxed error condition. Assumption 3.4(i) restricts the divergence rate of the regressor dimension and the convergence rate of the bandwidth. In particular, if $\iota_1$ is sufficiently large (i.e. the moment conditions in Assumption 3.3(ii) become stronger), the condition $n^{2\iota_2 - 1} h \to \infty$ could be close to the conventional condition $nh \to \infty$. Assumption 3.4(ii) indicates that the difference between two functional coefficients (in different clusters) may converge to zero at a certain polynomial rate. In particular, when $p$ is fixed, $h = c_h n^{-1/5}$ with $0 < c_h < \infty$, and $\delta_n = n^{-\delta_0}$ with $0 \le \delta_0 < 2/5$, Assumption 3.4(ii) is automatically satisfied. On the other hand, letting $h = c_h n^{-1/5}$ and $\delta_n = n^{-1/5} (\log n)^{1/4}$, it follows from Assumption 3.4(i)(ii) that
$$p = o\big(\min\{n^{2/5} (\log n)^{-1/2},\ n^{4/5} \delta_n^2 (\log n)^{-1}\}\big) = o\big(n^{2/5} (\log n)^{-1/2}\big),$$
indicating that the dimension $p$ may diverge to infinity at a polynomial rate of $n$.

Theorem 3.1

Suppose that Assumptions 3.1–3.4 are satisfied and $K_0$ is known a priori. Then, we have
$$P\big(\{\tilde C_k,\ k = 1, \ldots, K_0\} \ne \{C_k^0,\ k = 1, \ldots, K_0\}\big) = o(1) \qquad (12)$$
when the sample size $n$ is sufficiently large, where $\tilde C_k$ is defined in Section 2.1 and $C_k^0$ is defined in (2).

Remark 3.2

The above theorem shows the consistency of the agglomerative hierarchical clustering method proposed in Section 2.1 when the number of clusters is known a priori, i.e. with probability approaching one, the $K_0$ clusters can be correctly specified. It is similar to Theorem 3.1 in Vogt and Linton (2017), which gives the consistency of the classification of univariate nonparametric regression functions in the longitudinal data setting by using a nonparametric segmentation method.

We next derive the consistency of the information criterion for estimating the number of clusters, which is usually unknown in practice. Some further notation and assumptions are needed. Define
$$X_{t,K_0} = (X_{t,1|K_0}, \ldots, X_{t,K_0|K_0})^\top \quad \text{with} \quad X_{t,k|K_0} = \sum_{j \in C_k^0} X_{tj},$$
and $\Sigma_{X|K_0}(u) = E[X_{t,K_0} X_{t,K_0}^\top | U_t = u]$, $u \in [0, 1]$. Similarly, we can define $\Sigma_{X|K}(u)$ when $K > K_0$ and there are further splits of at least one of $C_k^0$, $k = 1, \ldots, K_0$. Define the event
$$\mathcal{C}_n(K_0) = \big\{[\tilde C_k,\ k = 1, \ldots, K_0] = [C_k^0,\ k = 1, \ldots, K_0]\big\}. \qquad (13)$$
From (12) in Theorem 3.1, we have $P(\mathcal{C}_n(K_0)) \to 1$ as $n \to \infty$. Conditional on the event $\mathcal{C}_n(K_0)$, when the number of clusters $K$ is smaller than $K_0$, two or more of the clusters $C_k^0$, $k = 1, \ldots, K_0$, are falsely merged, which results in $K$ clusters denoted by $C_{1|K}, \ldots, C_{K|K}$, $1 \le K \le K_0 - 1$. With such a clustering result, the group-specific functional coefficients cannot be consistently estimated by the kernel smoothing method, as the model is misspecified. However, we may define the 'quasi' functional coefficients by
$$\alpha_K(u) = [\alpha_{1|K}(u), \ldots, \alpha_{K|K}(u)]^\top = [\Sigma_{X|K}(u)]^{-1} \Sigma_{XY|K}(u), \qquad (14)$$
where $1 \le K \le K_0 - 1$,
$$\Sigma_{X|K}(u) = E[X_{t,K} X_{t,K}^\top | U_t = u], \quad \Sigma_{XY|K}(u) = E[X_{t,K} Y_t | U_t = u], \qquad (15)$$
and $X_{t,K} = (X_{t,1|K}, \ldots, X_{t,K|K})^\top$ with $X_{t,k|K} = \sum_{j \in C_{k|K}} X_{tj}$, given $C_{1|K}, \ldots, C_{K|K}$. When $K = K_0$, it is easy to find that the quasi-functional coefficients become the 'genuine' functional coefficients conditional on the event $\mathcal{C}_n(K_0)$. Define $\varepsilon_{t,K} = Y_t - X_{t,K}^\top \alpha_K(U_t)$ and $\varepsilon_{t1,K} = X_{t,K} \varepsilon_{t,K}$. By (14), it is easy to show that
$$E[\varepsilon_{t1,K} | U_t] = 0 \quad \text{a.s.}, \qquad (16)$$
where $0$ is a null vector whose dimension might change from line to line. A natural nonparametric estimate of $\alpha_K(\cdot)$ would be $\tilde\alpha_K(\cdot)$ defined in (6) of Section 2.2, where the order of the elements may have to be re-arranged if necessary. Result (16) and some smoothness conditions on $\alpha_K(\cdot)$ ensure the uniform consistency of the quasi-kernel estimation (see the proof of Theorem 3.2 in the supplemental document).

Let $\mathcal{A}(K_0)$ be the set of $K_0$-dimensional twice continuously differentiable functions $\alpha(u) = [\alpha_1(u), \ldots, \alpha_{K_0}(u)]^\top$ such that at least two elements of $\alpha(u)$ are identical functions over $u \in [0, 1]$. The following additional assumptions are needed for proving the consistency of the information criterion proposed in Section 2.2.

Assumption 3.5

There exists a positive constant $c_\alpha$ such that
$$\inf_{\alpha(\cdot) \in \mathcal{A}(K_0)} \int_0^1 [\alpha_0(u) - \alpha(u)]^\top \Sigma_{X|K_0}(u) [\alpha_0(u) - \alpha(u)] f_U(u)\, du > c_\alpha. \qquad (17)$$

Assumption 3.6

(i)

For any $1 \le K \le \bar K$ and given $C_{1|K}, \ldots, C_{K|K}$, the $K \times K$ matrix $\Sigma_{X|K}(u)$ defined in (15) is positive definite for $u \in [0, 1]$.

(ii)

For any $1 \le K \le K_0 - 1$ and given $C_{1|K}, \ldots, C_{K|K}$, the quasi-functional coefficient $\alpha_K(\cdot)$ has continuous second-order derivatives.

Assumption 3.7

The bandwidth $h$ and the dimension $p$ satisfy $p h^2 = O(\epsilon_n)$, $n h^6 = o(1)$ and $p = o\big(\min\{\epsilon_n^{(\rho-1)/2}, \epsilon_n^{-1/3}\}\big)$, where $\rho$ is defined in (7).

Remark 3.3

Assumptions 3.5 and 3.6 are mainly used when deriving the asymptotic lower bound of $\tilde\sigma_n^2(K)$, which is involved in the definition of $\mathrm{IC}(K)$, when $K$ is smaller than $K_0$. Restriction (17) in Assumption 3.5 indicates that the $K_0$ functional elements in $\alpha_0(\cdot)$ need to be 'sufficiently' distinct. We may show that (17) is satisfied if $\inf_{1 \le K \le K_0} \inf_{u \in [0,1]} \lambda_{\min}(\Sigma_{X|K}(u)) > c_1 > 0$ and the Lebesgue measure of $\{u \in \mathcal{U} : |\alpha_{k_1 0}(u) - \alpha_{k_2 0}(u)| > c_2 > 0\}$ is positive for any $k_1 \ne k_2$. Assumption 3.6 is required to prove the uniform consistency of the kernel estimation of the quasi-functional coefficients. Assumption 3.7 gives some further restrictions on $h$ and $p$, and indicates that the dimension of the covariates can diverge to infinity at a slow polynomial rate of the sample size $n$. For example, letting $h = n^{-1/4}$ (i.e. under-smoothing in the kernel estimation), $\rho = 1/3$ and $p = n^{\delta_1}$ with $0 \le \delta_1 < 1/8$, we may verify the conditions in Assumption 3.7.

Theorem 3.2 shows that the estimated number of clusters which minimises the IC objective function defined in (7) is consistent.

Theorem 3.2

Suppose that Assumptions 3.1–3.7 are satisfied. Then, we have
$$P(\tilde K = K_0) \to 1 \qquad (18)$$
as $n \to \infty$, where $\tilde K$ is defined in (8).

A combination of (12) and (18) shows that the latent homogeneity structure can be consistently estimated. Define
$$A_k^0 = [\alpha_{k0}(U_1), \ldots, \alpha_{k0}(U_n)]^\top, \quad B_k^0 = [\alpha'_{k0}(U_1), \ldots, \alpha'_{k0}(U_n)]^\top,$$
$$\hat A_k = [\hat\alpha_k(U_1), \ldots, \hat\alpha_k(U_n)]^\top, \quad \hat B_k = [\hat\alpha'_k(U_1), \ldots, \hat\alpha'_k(U_n)]^\top.$$
Without loss of generality, conditional on $\mathcal{C}_n(K_0)$ and $\tilde K = K_0$, we assume that $\tilde C_1 = C_1^0, \ldots, \tilde C_{K_0} = C_{K_0}^0$; otherwise we only need to re-arrange the order of the elements in $\alpha_0(\cdot) = [\alpha_{10}(\cdot), \ldots, \alpha_{K_0 0}(\cdot)]^\top$. For notational simplicity, we also assume that $\alpha_{K_0 0}(\cdot) \equiv 0$ and $\alpha_{k0}(\cdot) \equiv \alpha_{k0}$ for $k = K_*, \ldots, K_0 - 1$ with $1 < K_* < K_0$, where the $\alpha_{k0}$ are non-zero constant coefficients (non-zero constant coefficients do not exist when $K_* = K_0$, and all of the functional coefficients are constant when $K_* = 1$). For simplicity, we next assume that all observations of the index variable, $U_t$, $t = 1, \ldots, n$, are in the set $\mathcal{U}_h$, to avoid the boundary effect of the kernel estimation; this assumption can be removed if an appropriate truncation technique, such as those discussed in Sections 2.1 and 2.2, is applied to the penalised local linear estimation. Some additional conditions are needed for deriving the sparsity property of the penalised estimation proposed in Section 2.3.

Assumption 3.8

For any $k = 1, \ldots, K_0 - 1$, there exists a positive constant $c_A$ such that $\|A_k^0\| \ge c_A \sqrt{n}$ with probability approaching one. For $k = 1, \ldots, K_* - 1$ (with $K_* \ge 2$), there exists a positive constant $c_D$ such that $D_k^0 \ge c_D \sqrt{n}$ with probability approaching one.

Assumption 3.9

Let $p^2 n h^5 = O(1)$, and let the tuning parameter $\lambda_1$ satisfy
$$\lambda_1 = o(n^{1/2}), \quad n^{1/2} p^2 h^2 + n^{1/2} p \epsilon_n + p^4 h^{1/2} = o(\lambda_1). \qquad (19)$$
Condition (19) is also satisfied when $\lambda_1$ is replaced by $\lambda_2$.

Remark 3.4

Assumption 3.8 is a key condition for proving that $\|\tilde A_k\| / \sqrt{n}$ and $\tilde D_k / \sqrt{n}$ are bounded away from zero with probability approaching one, which, together with the definition of the SCAD derivative and $\lambda_1 + \lambda_2 = o(n^{1/2})$ in Assumption 3.9, indicates that when the functional coefficients or their deviations are significant, the influence of the penalty terms in (9) can be asymptotically ignored. For the case when $p$ is fixed and $h = c_h n^{-1/5}$ as discussed in Remark 3.1, if we choose $\lambda_1 = \lambda_2 = n^\delta$ with $0.1 < \delta < 0.5$, condition (19) in Assumption 3.9 is satisfied. On the other hand, as discussed in Remarks 3.1 and 3.3, the dimension $p$ is allowed to diverge to infinity.

Theorem 3.3

Suppose that Assumptions 3.1–3.9 are satisfied. Then, we have
$$P\big(\hat A_{K_0} = 0,\ \hat B_k = 0,\ k = K_*, \ldots, K_0\big) \to 1 \qquad (20)$$
as $n \to \infty$, where $\hat A_k$ and $\hat B_k$ are defined in (10).

The above sparsity result for the penalised local linear estimation shows that the zero coefficient and non-zero constant coefficients in the model can be identified asymptotically.

4. Practical issues in the estimation procedure

In this section, we first discuss how to choose the bandwidth in the kernel estimation and the tuning parameters in the penalised local least-squares estimation; and then introduce an easy-to-implement computational algorithm for the penalised approach in Section 2.3.

4.1. Choice of tuning parameters

The nonparametric kernel estimation may be sensitive to the value of the bandwidth $h$. Therefore, choosing an appropriate bandwidth is an important issue when applying our kernel-based clustering and estimation methods. A commonly used bandwidth selection method is the so-called cross-validation criterion. Specifically, for the preliminary (or pre-clustering) kernel estimation, the objective function for leave-one-out cross-validation is defined by
$$\mathrm{CV}(h) = \frac{1}{n} \sum_{t=1}^n [Y_t - X_t^\top \tilde\beta_{-t}(U_t | h)]^2,$$
where $\tilde\beta_{-t}(\cdot | h)$ is the preliminary kernel estimator of $\beta_0(\cdot)$ in model (1) using the bandwidth $h$ and all observations except the $t$th observation. Then we determine the optimal bandwidth $\hat h_{\mathrm{opt}}$ by minimising $\mathrm{CV}(h)$ with respect to $h$. The cross-validation criterion for bandwidth selection in the post-clustering kernel estimation $\tilde\alpha(\cdot)$ can be defined in exactly the same way.
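A minimal sketch of the leave-one-out criterion, reusing `kernel_beta`; `grid` is a user-supplied set of candidate bandwidths.

```python
def cv_bandwidth(X, Y, U, grid):
    """Leave-one-out cross-validation CV(h) for the preliminary kernel
    estimator; returns the bandwidth in `grid` with the smallest CV."""
    n = len(Y)
    def cv(h):
        keep = np.ones(n, dtype=bool)
        err = 0.0
        for t in range(n):
            keep[t] = False            # drop the t-th observation
            b = kernel_beta(U[t], X[keep], Y[keep], U[keep], h)
            keep[t] = True
            err += (Y[t] - X[t] @ b) ** 2
        return err / n
    return min(grid, key=cv)
```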

For the choice of the tuning parameters $\lambda_1$ and $\lambda_2$ in the penalised local least-squares method, we use the generalised information criterion (GIC) proposed by Fan and Tang (2013), which is briefly described as follows. Let $\lambda = (\lambda_1, \lambda_2)^\top$, and denote by $\mathcal{M}_1(\lambda)$ and $\mathcal{M}_2(\lambda)$ the index sets of nonparametric functional coefficients and non-zero constant coefficients, respectively (after implementing the kernel-based clustering analysis and penalised estimation with the tuning parameter vector $\lambda$). As Cheng, Zhang, and Chen (2009) suggest that an unknown functional parameter (varying with the index variable) amounts to $m_0 h^{-1}$ unknown constant parameters with $m_0 = 1.028571$ when the Epanechnikov kernel is used, we construct the following GIC objective function:
$$\mathrm{GIC}(\lambda) = \sum_{t=1}^n \Big[Y_t - \sum_{k \in \mathcal{M}_1(\lambda)} \tilde X_{t,k|\tilde K} \hat\alpha_{k,\lambda}(U_t) - \sum_{k \in \mathcal{M}_2(\lambda)} \tilde X_{t,k|\tilde K} \hat\alpha_{k,\lambda}\Big]^2 + 2 \log[\log(n)] \log(m_0 h^{-1}) \big(|\mathcal{M}_2(\lambda)| + |\mathcal{M}_1(\lambda)| m_0 h^{-1}\big),$$
where $\hat\alpha_{k,\lambda}(\cdot)$ and $\hat\alpha_{k,\lambda}$ are the penalised estimates from Section 2.3 using the tuning parameter vector $\lambda$, $|\mathcal{M}|$ denotes the cardinality of the set $\mathcal{M}$, and the bandwidth $h$ can be determined by leave-one-out cross-validation. The optimal value of $\lambda$ is found by minimising $\mathrm{GIC}(\lambda)$ with respect to $\lambda$.
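Once the penalised fit at a candidate $\lambda$ is available, GIC($\lambda$) itself is a short formula; a sketch, where `resid` holds the in-sample residuals of the penalised fit and `n_func`, `n_const` are the sizes of $\mathcal{M}_1(\lambda)$ and $\mathcal{M}_2(\lambda)$ (the argument layout is our own):

```python
def gic(resid, n_func, n_const, h, n, m0=1.028571):
    """GIC(lambda) of Section 4.1: residual sum of squares plus a penalty
    in which each functional coefficient counts as m0/h constant
    parameters (Epanechnikov kernel)."""
    df = n_const + n_func * m0 / h
    return np.sum(resid ** 2) + 2.0 * np.log(np.log(n)) * np.log(m0 / h) * df
```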

4.2. Computational algorithm for penalised estimation

Let $\tilde X_t = \tilde X_{t,\tilde K} = (\tilde X_{t,1|\tilde K}, \ldots, \tilde X_{t,\tilde K|\tilde K})^\top$ and define $\tilde\Omega_{nk}(j) = \mathrm{diag}\{\tilde\Omega_{nk,1}(j), \ldots, \tilde\Omega_{nk,n}(j)\}$ with
$$\tilde\Omega_{nk,s}(j) = \frac{2}{nh} \sum_{t=1}^n \tilde X_{t,k|\tilde K}^2 [(U_t - U_s)/h]^j K_h(U_t, U_s).$$
We next introduce an iterative procedure to compute the penalised local least-squares estimates of the functional coefficients proposed in Section 2.3 (cf. Li et al. 2015); a code sketch of the Step 2 update is given after the list. It can be viewed as a nonparametric extension of the coordinate descent algorithm, a commonly used optimisation algorithm that finds the minimum of a function by successively minimising along coordinate directions.

  1. Find initial estimates of Ak0 and Bk0, which are denoted by A^k(0)=[α^k(0)(U1),,α^k(0)(Un)]andB^k(0)=[α^k(0)(U1),,α^k(0)(Un)], respectively. These initial estimates can be obtained by using the conventional (non-penalised) local linear estimation method.

  2. Let $\hat A_k^{(j)}$ and $\hat B_k^{(j)}$ be the estimates after the $j$th iteration. We next update the $l$th functional coefficient starting from $l = 1$. Let
$$\hat\alpha_{-l}^{(j)}(U_s) = [\hat\alpha_1^{(j+1)}(U_s), \ldots, \hat\alpha_{l-1}^{(j+1)}(U_s), 0, \hat\alpha_{l+1}^{(j)}(U_s), \ldots, \hat\alpha_{\tilde K}^{(j)}(U_s)]^\top,$$
$$\hat\alpha'^{(j)}(U_s) = [\hat\alpha_1'^{(j)}(U_s), \ldots, \hat\alpha_{\tilde K}'^{(j)}(U_s)]^\top,$$
$$\hat Y_{t,l}^{(j)} = Y_t - \tilde X_t^\top \hat\alpha_{-l}^{(j)}(U_s) - \tilde X_t^\top \hat\alpha'^{(j)}(U_s)(U_t - U_s),$$
$$\tilde E_{nl} = (\tilde E_{nl,1}, \ldots, \tilde E_{nl,n})^\top, \quad \tilde E_{nl,s} = \frac{2}{nh} \sum_{t=1}^n \tilde X_{t,l|\tilde K} \hat Y_{t,l}^{(j)} K_h(U_t, U_s).$$
If $\|\tilde E_{nl}\| < p'_{\lambda_1}(\|\tilde A_l\|)$, we update $\hat A_l^{(j+1)} = 0$; otherwise,
$$\hat A_l^{(j+1)} = \big[\tilde\Omega_{nl}(0) + p'_{\lambda_1}(\|\tilde A_l\|) I_n / c_l\big]^{-1} \tilde E_{nl},$$
where $I_n$ is the $n \times n$ identity matrix, $c_l = \|\hat A_l^{(j)}\|$ if $\hat A_l^{(j)} \ne 0$, and $c_l = \max_{k \ne l} \|\hat A_k^{(j)}\|$ if $\hat A_l^{(j)} = 0$.

  3. Update the derivative of the $l$th functional coefficient starting from $l = 1$. Let
$$\hat\alpha^{(j+1)}(U_s) = [\hat\alpha_1^{(j+1)}(U_s), \ldots, \hat\alpha_{\tilde K}^{(j+1)}(U_s)]^\top,$$
$$\hat\alpha'^{(j)}_{-l}(U_s) = [\hat\alpha_1'^{(j+1)}(U_s), \ldots, \hat\alpha_{l-1}'^{(j+1)}(U_s), 0, \hat\alpha_{l+1}'^{(j)}(U_s), \ldots, \hat\alpha_{\tilde K}'^{(j)}(U_s)]^\top,$$
$$\check Y_{t,l}^{(j)} = Y_t - \tilde X_t^\top \hat\alpha^{(j+1)}(U_s) - \tilde X_t^\top \hat\alpha'^{(j)}_{-l}(U_s)(U_t - U_s),$$
$$\check E_{nl} = (\check E_{nl,1}, \ldots, \check E_{nl,n})^\top, \quad \check E_{nl,s} = \frac{2}{nh} \sum_{t=1}^n \tilde X_{t,l|\tilde K} \check Y_{t,l}^{(j)} [(U_t - U_s)/h] K_h(U_t, U_s).$$
If $\|\check E_{nl}\| < p'_{\lambda_2}(\tilde D_l)$, we update $\hat B_l^{(j+1)} = 0$; otherwise,
$$h \hat B_l^{(j+1)} = \big[\tilde\Omega_{nl}(2) + p'_{\lambda_2}(\tilde D_l) I_n / d_l\big]^{-1} \check E_{nl},$$
where $d_l = h \|\hat B_l^{(j)}\|$ if $\hat B_l^{(j)} \ne 0$, and $d_l = \max_{k \ne l} h \|\hat B_k^{(j)}\|$ if $\hat B_l^{(j)} = 0$.

  4. Repeat Steps 2 and 3 until convergence of the estimates is achieved.
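Because $\tilde\Omega_{nl}(0)$ is diagonal, the Step 2 update reduces to a thresholding rule followed by an elementwise division. Below is a sketch of that single update, with `E_nl` and the diagonal of $\tilde\Omega_{nl}(0)$ precomputed as in the list above; the helper name and argument layout are ours, not the paper's.

```python
def update_A_l(E_nl, Omega0_diag, A_tilde_l_norm, c_l, lam1):
    """One Step-2 update of the l-th coefficient vector: set it to zero if
    ||E~_nl|| falls below the SCAD-derivative threshold, otherwise solve
    the diagonal ridge-type system [Omega~_nl(0) + p' I_n / c_l]^{-1} E~_nl."""
    thr = float(scad_derivative(A_tilde_l_norm, lam1))
    if np.linalg.norm(E_nl) < thr:
        return np.zeros_like(E_nl)           # cluster l estimated as zero
    return E_nl / (Omega0_diag + thr / c_l)  # elementwise: Omega is diagonal
```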

Our numerical studies in Sections 5 and 6 show that the above iterative procedure has reasonably good finite-sample performance.

5. Monte-Carlo simulation

In this section, we conduct Monte-Carlo simulation studies to evaluate the finite-sample performance of the proposed methods.

Example 5.1

Consider the following functional-coefficient model:
$$Y_t = \sum_{j=1}^p \beta_{j0}(U_t) X_{tj} + \sigma \varepsilon_t, \quad t = 1, \ldots, n, \qquad (21)$$
where the random covariate vector $X_t = (X_{t1}, \ldots, X_{tp})^\top$, with $p = 20$ or 60, is independently generated from a multivariate normal distribution with zero means, unit variances and pairwise correlation coefficient $\varrho$ equal to either 0 or 0.25; the univariate index variable $U_t$ is independently generated from the uniform distribution U[0,1]; the random error $\varepsilon_t$ is independently generated from the standard normal distribution; and $\sigma = 0.5$. The homogeneity structure on model (21) is defined as follows:
$$\beta_{(k-1)\ell + j, 0}(\cdot) = \alpha_{k0}(\cdot), \quad k = 1, 2, \quad j = 1, \ldots, \ell, \quad \ell = p/5,$$
$$\beta_{(k-1)\ell + j, 0}(\cdot) = \alpha_{k0}(\cdot) \equiv \alpha_{k0}, \quad k = 3, 4, 5, \quad j = 1, \ldots, \ell, \quad \ell = p/5,$$
$$\alpha_{10}(u) = \sin(2\pi u), \quad \alpha_{20}(u) = (1 + \delta) \sin(2\pi u), \quad \alpha_{30} = 0.5, \quad \alpha_{40} = 0.5 + \delta, \quad \alpha_{50} = 0,$$
where $\delta = 0.2, 0.4, 0.8$. Hence there are five clusters of coefficients: two varying with $U_t$ and three constant. The clusters in this example all have the same size, $\ell = p/5$. Figure 1 plots the five cluster-specific coefficient functions for $\delta = 0.4$ and $\delta = 0.8$. The larger the value of $\delta$, the further apart these functions are, and hence the easier it is to identify the clusters. A code sketch generating one sample from this design is given after Figure 1.

Figure 1. Plots of the cluster-specific coefficient functions. Left panel: δ=0.4; right panel: δ=0.8.

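For reproducibility, here is a sketch that generates one replication from this design (the function name and the use of numpy's default generator are our own choices):

```python
def simulate_example_5_1(n, p=20, corr=0.25, delta=0.4, sigma=0.5, seed=None):
    """One sample (X, Y, U) from model (21) under the five-cluster
    homogeneity structure of Example 5.1."""
    rng = np.random.default_rng(seed)
    ell = p // 5
    Sigma = np.full((p, p), corr)
    np.fill_diagonal(Sigma, 1.0)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    U = rng.uniform(0.0, 1.0, size=n)
    alpha = [np.sin(2 * np.pi * U), (1 + delta) * np.sin(2 * np.pi * U),
             np.full(n, 0.5), np.full(n, 0.5 + delta), np.zeros(n)]
    # beta_{(k-1)l+j,0} = alpha_k0: repeat each cluster coefficient l times
    beta = np.column_stack([alpha[k] for k in range(5) for _ in range(ell)])
    Y = (X * beta).sum(axis=1) + sigma * rng.standard_normal(n)
    return X, Y, U
```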

The sample size $n$ is set to 200, 400 or 600, and the number of replications is $N = 500$. We first use the kernel method to obtain preliminary nonparametric estimates of the functional coefficients $\beta_{j0}(\cdot)$, $j = 1, \ldots, p$, with the Epanechnikov kernel $K(z) = \frac{3}{4}(1 - z^2)_+$ and the optimal bandwidth selected by the cross-validation method in Section 4.1. The homogeneity and semi-varying coefficient structure in model (21) is ignored in this preliminary estimation. A combination of the kernel-based clustering method in Section 2.1 and the generalised information criterion in Section 2.2 is then used to estimate the homogeneity structure. In order to evaluate the clustering performance, we consider two commonly used measures: Normalised Mutual Information (NMI) and Purity, both of which examine how close the estimated set of clusters is to the true set of clusters. Letting $\mathcal{C}^1 = \{C_1^1, \ldots, C_{K_1}^1\}$ and $\mathcal{C}^2 = \{C_1^2, \ldots, C_{K_2}^2\}$ be two sets of disjoint clusters of $\{1, 2, \ldots, p\}$, the NMI between $\mathcal{C}^1$ and $\mathcal{C}^2$ is defined as
$$\mathrm{NMI}(\mathcal{C}^1, \mathcal{C}^2) = \frac{I(\mathcal{C}^1, \mathcal{C}^2)}{[H(\mathcal{C}^1) + H(\mathcal{C}^2)]/2},$$
where $H(\mathcal{C}^1)$ and $H(\mathcal{C}^2)$ are the entropies of $\mathcal{C}^1$ and $\mathcal{C}^2$, respectively, and $I(\mathcal{C}^1, \mathcal{C}^2)$ is the mutual information between $\mathcal{C}^1$ and $\mathcal{C}^2$ defined as
$$I(\mathcal{C}^1, \mathcal{C}^2) = \sum_{k=1}^{K_1} \sum_{j=1}^{K_2} \frac{|C_k^1 \cap C_j^2|}{p} \log_2 \Big(\frac{p\, |C_k^1 \cap C_j^2|}{|C_k^1|\, |C_j^2|}\Big).$$
The NMI measure takes values between 0 and 1, with a larger value indicating that the two sets of clusters are closer. The Purity measure is defined by
$$\mathrm{Purity}(\mathcal{C}^1, \mathcal{C}^2) = \frac{1}{p} \sum_{k=1}^{K_1} \max_{1 \le j \le K_2} |C_k^1 \cap C_j^2|. \qquad (22)$$
It is easy to see that the Purity measure also takes values between 0 and 1, and if $\mathcal{C}^1$ and $\mathcal{C}^2$ are equal, then $\mathrm{Purity}(\mathcal{C}^1, \mathcal{C}^2) = 1$. However, the Purity measure does not trade off the quality of the clustering against the number of clusters; for example, a Purity value of 1 is achieved if $\mathcal{C}^1$ consists of $p$ singleton clusters. The NMI, by contrast, allows for this trade-off.
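Both measures are straightforward to compute from two label vectors of length $p$; a sketch (our helper):

```python
def nmi_and_purity(labels1, labels2):
    """NMI and Purity (22) between two clusterings given as label vectors."""
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)
    p = len(labels1)
    ks1, ks2 = np.unique(labels1), np.unique(labels2)
    # joint[k, j] = |C_k^1 intersect C_j^2| / p
    joint = np.array([[np.sum((labels1 == a) & (labels2 == b)) for b in ks2]
                      for a in ks1]) / p
    pk, pj = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log2(joint[nz] / np.outer(pk, pj)[nz]))
    ent = lambda q: -np.sum(q * np.log2(q))
    nmi = mi / ((ent(pk) + ent(pj)) / 2.0)
    purity = joint.max(axis=1).sum()
    return nmi, purity
```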

We finally identify the clusters with zero coefficients and non-zero constant coefficients using the penalised method introduced in Section 2.3. The tuning parameters in the penalty terms are chosen by the GIC detailed in Section 4.1. In order to measure the accuracy of the estimates of the coefficients $\beta_{j0}(\cdot)$, $1 \le j \le p$, we compute the Mean Absolute Estimation Error (MAEE), which, for the preliminary (pre-clustering) kernel estimates $\tilde\beta_j(\cdot)$, $1 \le j \le p$, is defined as
$$\mathrm{MAEE(PreC\text{-}Kernel)} = \frac{1}{np} \sum_{t=1}^n \sum_{j=1}^p |\tilde\beta_j(U_t) - \beta_{j0}(U_t)|,$$
and, for the post-clustering kernel estimates,
$$\mathrm{MAEE(PostC\text{-}Kernel)} = \frac{1}{np} \sum_{t=1}^n \sum_{j=1}^p |\tilde\beta_j^\ast(U_t) - \beta_{j0}(U_t)|,$$
where $\tilde\beta_j^\ast(\cdot) = \tilde\alpha_k(\cdot)$ if $j \in \tilde C_{k|\tilde K}$, $1 \le k \le \tilde K$, and $\tilde\alpha_k(\cdot) = \tilde\alpha_{k|\tilde K}(\cdot)$, $1 \le k \le \tilde K$, are the post-clustering kernel estimates of the cluster-specific functional coefficients defined in (6). Let $\hat\beta_j(\cdot) = \hat\alpha_k(\cdot)$ if $j \in \tilde C_{k|\tilde K}$, $1 \le k \le \tilde K$, where $\hat\alpha_k(\cdot)$, $1 \le k \le \tilde K$, are the penalised estimates of the cluster-specific functional coefficients obtained by minimising (9). The MAEE of the penalised estimates is defined as
$$\mathrm{MAEE(Penalised)} = \frac{1}{np} \sum_{t=1}^n \sum_{j=1}^p |\hat\beta_j(U_t) - \beta_{j0}(U_t)|.$$
The main purpose of considering the MAEE of the post-clustering kernel and penalised estimates for $\beta_{j0}(\cdot)$, $1 \le j \le p$, rather than for $\alpha_{k0}(\cdot)$, $1 \le k \le K_0$, is to avoid having to order the estimated clusters and match each of them to one of the true clusters (as there is no natural way to do this).

Tables 1–3 give the simulation results for the case where the dimension of $X_t$ is 20. Table 1 presents the frequency (over 500 replications) at which a number between 1 and 10 is selected as the number of clusters by the information criterion detailed in Section 2.2. Table 2 gives the average values and standard deviations (in parentheses) of the NMI and Purity measures over 500 replications. Table 3 reports the average MAEEs and standard deviations (in parentheses) over 500 replications for the pre-clustering kernel estimation, the post-clustering kernel estimation and the semiparametric penalised estimation. From Table 1, we can see that when $\delta = 0.4$ and the covariates are uncorrelated, the number of clusters is correctly estimated in about 80% of the replications even when $n = 200$, and when $\delta$ increases to 0.8, this percentage increases to almost 98%. As the sample size increases to 400, the information criterion selects the correct number of clusters in almost all replications. When the correlation coefficient between the covariates is 0.25, the number of clusters is correctly estimated in only 34% of the replications when $n = 200$ and $\delta = 0.4$, and in over 70% of the replications when $\delta = 0.8$. As the sample size increases to 400, this percentage rises to over 98%. However, when $\delta = 0.2$, the distances between different coefficient functions become smaller and the number of clusters is often underestimated as 3 or 4, even when the covariates are uncorrelated. When the covariates are correlated, this underestimation becomes worse. In all of the specifications, the estimated number of clusters rarely goes below 3 or above 7. Table 2 shows that when there is no correlation among the covariates and the different coefficient functions are moderately distanced (i.e. $\delta = 0.4$ or 0.8), the NMI and Purity values are close to 1 even when the sample size is as small as 200. Increasing the correlation coefficient between the covariates to 0.25 or decreasing $\delta$ to 0.2 makes the clustering less accurate. Finally, the results in Table 3 show that, after identifying the homogeneity and semi-varying coefficient structure, the average MAEE values of the semiparametric penalised estimation are smaller than those of the post-clustering kernel estimation, which in turn are much smaller than those of the pre-clustering kernel estimation. In addition, all three estimation methods improve (with decreasing average MAEE values) as the sample size increases, and their performance becomes slightly worse when the correlation between the covariates increases to 0.25.

Table 1. Results on estimation of cluster number for Example 5.1 with p = 20.

Table 2. Average NMI and Purity for Example 5.1 with p = 20.

Table 3. Average MAEE for Example 5.1 with p = 20.

Tables 4–6 give the results for p = 60. Comparing these results with those for p = 20, we can see that as the dimension of the covariates increases, the estimation becomes poorer. However, the overall pattern as $\delta$, $\varrho$ or $n$ changes is similar: as $\delta$ increases, the estimation becomes more accurate because the clusters are further apart; as $\varrho$ increases, the results become poorer; and as $n$ increases, the results improve.

Table 4. Results on estimation of cluster number for Example 5.1 with p = 60.

Table 5. Average NMI and Purity for Example 5.1 with p = 60.

Table 6. Average MAEE for Example 5.1 with p = 60.

Example 5.2

We consider model (21) with $p = 20$ but with the following homogeneity structure instead:
$$\beta_{10}(\cdot) = \alpha_{10}(\cdot), \quad \beta_{20}(\cdot) = \beta_{30}(\cdot) = \alpha_{20}(\cdot), \quad \beta_{40}(\cdot) = \cdots = \beta_{70}(\cdot) \equiv \alpha_{30},$$
$$\beta_{80}(\cdot) = \cdots = \beta_{13,0}(\cdot) \equiv \alpha_{40}, \quad \beta_{14,0}(\cdot) = \cdots = \beta_{20,0}(\cdot) \equiv \alpha_{50}.$$
The data generating processes for the random covariates $X_t$, the index variable $U_t$ and the error term $\varepsilon_t$ are the same as those in Example 5.1. The definitions of $\alpha_{i0}(\cdot)$ and $\alpha_{i0}$ are also the same as in the previous example. However, the cluster sizes are now unequal: 1, 2, 4, 6 and 7, respectively. To save space, we do not provide results for $p = 60$ for this example.

Tables 7 and 8 report the results for the estimation of the homogeneity structure, and Table 9 reports the average MAEEs and standard deviations (in parentheses) for the pre-clustering kernel estimation, the post-clustering kernel estimation and the penalised estimation over 500 replications. Comparing the results in Table 7 with those in Table 1, we find that when $\delta = 0.2$, the number of clusters is more likely to be underestimated in Example 5.2, where the cluster sizes are unequal. However, as $\delta$ increases, the results for the two examples become more and more comparable. The NMI and Purity values in Table 8 are similar to those in Table 2, while the MAEE values in Table 9 are smaller than those in Table 3. The latter is mainly due to the fact that more coefficient functions (i.e. 17 out of 20) are constant in Example 5.2.

Table 7. Results on estimation of cluster number for Example 5.2 with p = 20.

Table 8. Average NMI and Purity for Example 5.2 with p = 20.

Table 9. Average MAEE for Example 5.2 with p = 20.

6. Empirical applications

In this section, we apply the developed model and methodology to two real data sets: the Boston house price data and the plasma beta-carotene level data. These two data sets have been extensively analysed in existing studies, where functional-coefficient models are usually recommended. However, it is not clear whether a homogeneity structure exists among the functional coefficients. This motivates us to further examine the modelling structure of these two data sets via the kernel-based clustering method and penalised approach introduced in Section 2.

Example 6.1

We first apply the developed model and methodology to the well-known Boston house price data. This data set has been previously analysed in many studies (cf. Fan and Huang 2005; Cai and Xu 2008; Wang and Xia 2009; Leng 2010). To investigate which factors influence house prices, we choose MEDV (the median value of owner-occupied homes in US$1000) as the response variable and the following 13 variables as the explanatory variables: INT (the intercept), CHAS (Charles River dummy variable; = 1 if tract bounds river, 0 otherwise), RAD (index of accessibility to radial highways), CRIM (per-capita crime rate by town), ZN (proportion of residential land zoned for lots over 25,000 sq. ft.), INDUS (proportion of non-retail business acres per town), NOX (nitric oxides concentration in parts per 10 million), RM (average number of rooms per dwelling), AGE (proportion of owner-occupied units built prior to 1940), DIS (weighted distances to five Boston employment centres), TAX (full-value property-tax rate per US$10,000), PTRATIO (pupil–teacher ratio by town), and B ($1000(B_k - 0.63)^2$, where $B_k$ is the proportion of blacks by town). The variable LSTAT (percentage of lower-status population) is chosen as the index variable in the varying-coefficient model, which enables us to investigate the interaction of LSTAT with the explanatory variables. The sample size is $n = 506$. The response variable and all explanatory variables (except the intercept, INT) undergo the Z-score transformation before the model is fitted: i.e. for any variable $x_t$ to be transformed, its Z-score is
$$z_t = \frac{x_t - \bar x}{s(x)}, \quad t = 1, \ldots, 506, \qquad (23)$$
where $\bar x$ and $s(x)$ are the sample mean and sample standard deviation of $x_t$. Furthermore, as shown in the left panel of Figure 2, the index variable, LSTAT, exhibits strong skewness. Hence, we first take the square-root transformation of this variable to alleviate skewness and then apply the min–max normalisation
$$U_t = \frac{U_t - \min(U)}{\max(U) - \min(U)}, \qquad (24)$$
where $\min(U)$ and $\max(U)$ denote the minimum and maximum of the observations of $U$, respectively. After the min–max normalisation, the support of $U_t$ becomes $[0, 1]$, consistent with the assumption made on the index variable in the asymptotic theory. A histogram of the transformed variable is shown in the right panel of Figure 2, and a code sketch of the two transformations follows the figure caption.

Figure 2. Histograms for the original and transformed index variable in Example 6.1. Left panel: original data for LSTAT; right panel: LSTAT after the square-root and min-max transformations.

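The two transformations in (23) and (24) amount to a few lines; a sketch for a generic column `x` and the raw index variable `lstat` (helper name is ours):

```python
def transform_variables(x, lstat):
    """Z-score transformation (23) of a regressor/response column, and the
    square-root plus min-max transformation (24) of the index variable."""
    z = (x - x.mean()) / x.std(ddof=1)       # sample standard deviation
    u = np.sqrt(lstat)                        # alleviate skewness first
    u = (u - u.min()) / (u.max() - u.min())   # map onto [0, 1]
    return z, u
```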

Figure 3 plots the pre-clustering kernel estimates of the functional coefficients with the optimal bandwidth selected via the leave-one-out cross-validation method. The kernel-based clustering method and the generalised information criterion identify six clusters. The membership of these clusters and the characteristics of their functional coefficients are summarised in Table 10. DIS and TAX are found, by the penalised method, to have constant and similar negative effects on the response, while the variables CHAS, ZN and B are found to be insignificant. All the other explanatory variables have varying effects on the response as the value of LSTAT changes. Plots of the post-clustering kernel estimates of the functional coefficients and their penalised local linear estimates are shown in Figures 4 and 5, where, for each $k = 1, \ldots, 6$, $\alpha_k(\cdot)$ denotes the functional coefficient corresponding to the $k$th cluster listed in Table 10. The optimal tuning parameters in the penalised method are chosen, by the GIC, as $\lambda_1 = 10$ and $\lambda_2 = 2.3$.

Figure 3. Pre-clustering estimates of the functional coefficients in Example 6.1.



Figure 4. Post-clustering estimates of the functional coefficients in Example 6.1 with αk(⋅), for each k=1,2,…,6, being the estimated functional coefficient corresponding to the kth cluster listed in Table 10.


Figure 5. Penalised estimates of the functional coefficients in Example 6.1 with αk(⋅), for each k=1,2,…,6, being the estimated functional coefficient corresponding to the kth cluster listed in Table 10.

Table 10. The estimated homogeneity structure in Example 6.1.

We next compare the out-of-sample predictive performance of the pre-clustering (preliminary) kernel method, the post-clustering kernel method and the proposed penalised method. We randomly split the full sample into a training set of size 400 and a testing set of size 106, and repeat this 200 times to reduce the randomness in the results. When calculating out-of-sample predictions for the post-clustering and penalised methods, we use the homogeneity structure (i.e. the clusters and their membership) estimated from the full sample, but estimate the values of the functional coefficients (evaluated at the LSTAT values belonging to the testing set) or the constant coefficients from the training sets. The predictive performance is measured by the Mean Absolute Prediction Error (MAPE), defined by
$$\mathrm{MAPE} = \frac{1}{n_\ast} \sum_{t=1}^{n_\ast} |Y_t - \hat Y_t|, \qquad (25)$$
where $n_\ast$ is the size of the testing set (106 in this example), $Y_t$ is a true value of the response variable in the testing sample, and $\hat Y_t$ is the predicted value of $Y_t$ using the model estimated from the training sample. Table 11 reports the average MAPE values over the 200 replications of random sample splitting. We consider bandwidth values in the range $[0.06, 0.18]$ (with equal increments of 0.02), which covers the optimal bandwidth of 0.168 for the preliminary kernel estimation and the post-clustering kernel estimation. From Table 11, we can see that predictions calculated from the model estimated by the penalised method have the smallest MAPEs over the range of bandwidths considered. Predictions made from the model estimated by the post-clustering kernel method have slightly larger MAPE values, while predictions from the pre-clustering kernel method have the largest MAPE values. This comparison shows that the simplified functional-coefficient models obtained from the developed kernel-based clustering and penalised methods provide more accurate out-of-sample predictions.
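The evaluation protocol can be sketched as follows, shown here with the pre-clustering kernel estimator for brevity (the post-clustering and penalised variants only change the fitting step):

```python
def average_mape(X, Y, U, h, n_train, n_rep=200, seed=None):
    """Average out-of-sample MAPE (25) over repeated random splits into a
    training set of size n_train and a testing set of the remainder."""
    rng = np.random.default_rng(seed)
    n, mapes = len(Y), []
    for _ in range(n_rep):
        idx = rng.permutation(n)
        tr, te = idx[:n_train], idx[n_train:]
        preds = np.array([X[t] @ kernel_beta(U[t], X[tr], Y[tr], U[tr], h)
                          for t in te])
        mapes.append(np.mean(np.abs(Y[te] - preds)))
    return float(np.mean(mapes))
```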

Table 11. Average MAPE over 200 times of random sample splitting in Example 6.1.

Example 6.2

In this example, we use the proposed methods to analyse the plasma beta-carotene level data, which have been previously studied by Nierenberg, Stukel, Baron, Dain, and Greenberg (1989), Wang and Li (2009) and Kai et al. (2011). The data were collected from 315 patients and are downloadable from the StatLib database at http://lib.stat.cmu.edu/datasets/Plasma_Retinol. The primary interest is to investigate the relationship between personal characteristics and dietary factors, on the one hand, and plasma concentrations of beta-carotene, on the other. The response variable is BETAPLASMA (plasma beta-carotene level, ng/ml) and the candidate explanatory variables include INT (the intercept), AGE (years), QUETELET (Quetelet index, weight/height$^2$), CALORIES (number of calories consumed per day), FAT (grams of fat consumed per day), FIBRE (grams of fibre consumed per day), ALCOHOL (number of alcoholic drinks consumed per week) and CHOLESTEROL (cholesterol consumed per day). The data set also contains categorical variables: SEX (1 = male, 2 = female), SMOKSTAT (smoking status, 1 = never, 2 = former, 3 = current smoker) and VITUSE (vitamin use, 1 = yes, fairly often, 2 = yes, not often, 3 = no). We convert these into dummy variables: FEMALE (= 1 if SEX = 2, 0 otherwise), NONSMOKER (= 1 if SMOKSTAT = 1, 0 otherwise), FORMERSMOKER (= 1 if SMOKSTAT = 2, 0 otherwise), FREQVITUSE (= 1 if VITUSE = 1, 0 otherwise) and OCCAVITUSE (= 1 if VITUSE = 2, 0 otherwise), and also include them as explanatory variables. As in Kai et al. (2011), the index variable is chosen as BETADIET (dietary beta-carotene consumed, mcg per day). We again transform the response and explanatory variables (except the intercept, INT) by the Z-score method defined in (23). As can be seen from the left panel of Figure 6, the index variable BETADIET also exhibits high skewness, so we first transform it by the square-root operator and then by the min–max operator in (24). Histograms of the original and transformed BETADIET data are given in Figure 6.

Figure 6. Histograms for the original and transformed index variable in Example 6.2. Left panel: original data for BETADIET, right panel: BETADIET after the square-root and min-max transformations.


We again consider a functional-coefficient model. In the preliminary kernel estimation, the Epanechnikov kernel $K(z) = \frac{3}{4}(1 - z^2)_+$ is used and the optimal bandwidth is determined via the cross-validation method in Section 4.1. We combine the kernel-based clustering method and the penalised local linear estimation (with the tuning parameters $\lambda_1 = 6.5$ and $\lambda_2 = 3$ chosen by the GIC method) to explore the homogeneity structure among the functional coefficients. Three distinct clusters are identified. The membership of each cluster and the characteristics of the corresponding coefficient function are summarised in Table 12. The pre-clustering estimates of all functional coefficients, and the post-clustering and penalised estimates of the cluster-specific functional coefficients, are plotted in Figures 7–9.

Figure 7. Pre-clustering estimates of the functional coefficients in Example 6.2.



Figure 8. Post-clustering estimates of the functional coefficients in Example 6.2 with αk(⋅), for each k = 1, 2, 3, being the estimated functional coefficient corresponding to the kth cluster listed in Table 12.


Figure 9. Penalised estimates of the functional coefficients in Example 6.2 with αk(⋅), for each k = 1, 2, 3, being the estimated functional coefficient corresponding to the kth cluster listed in Table 12.

Table 12. The estimated homogeneity structure in Example 6.2.

The kernel clustering and shrinkage estimation results show that FIBRE, NONSMOKER, FORMERSMOKER, FREQVITUSE form a cluster and their effects on the response variable, the beta-carotene level, are positive, which implies that higher fibre intake, no smoking and frequent vitamin use are helpful for increasing beta-carotene levels. The variables INT (intercept), AGE, CALORIES, ALCOHOL, CHOLESTEROL, FEMALE, and OCCAVITUSE are found to be insignificant, while QUETELET and FAT are found to have negative effects on beta-carotene levels.

As in Example 6.1, we further compare the out-of-sample predictive performance of the preliminary kernel, post-clustering kernel and penalised methods. We randomly divide the full sample (315 observations) into a training set of size 250 and a testing set of size 65, repeat the random sample splitting 200 times and compute the average MAPE values. The predictions are calculated in the same way as in Example 6.1. The range of bandwidth values considered is between 0.20 and 0.32 with an increment of 0.02. The results are reported in Table 13. From the table, we find that the penalised and post-clustering kernel methods provide more accurate out-of-sample predictions, in terms of the MAPE defined in (25), than the preliminary kernel method, with the penalised method slightly outperforming the post-clustering kernel method when the bandwidth is smaller.

Table 13. Average MAPE over 200 times of random sample splitting in Example 6.2.

7. Conclusion

In this paper, we have developed a kernel-based hierarchical clustering method and a generalised version of the information criterion to uncover the latent homogeneity structure in functional-coefficient models. Furthermore, a penalised local linear estimation approach is used to separate out the zero-coefficient cluster, the non-zero constant-coefficient clusters and the functional-coefficient clusters. The asymptotic theory in Section 3 shows that the estimation of the true number of clusters and of the true set of clusters is consistent in large samples. In the simulation study, we find that the proposed estimation methodology outperforms the direct nonparametric kernel estimation which ignores the latent structure in the model. In the empirical applications to the Boston house price data and the plasma beta-carotene level data, we show that the nonparametric functional-coefficient model can be substantially simplified, with reduced numbers of unknown parametric and nonparametric components. As a result, the out-of-sample mean absolute prediction errors using the developed approach are significantly smaller than those using the naive kernel method which ignores the latent homogeneity structure among the functional coefficients.

Supplemental material

CLWZ-Supp-27-April-2021.pdf


Acknowledgments

The authors thank the Editor-in-Chief, an Associate Editor and two reviewers for their valuable comments, which improved an earlier version of the paper.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

Chen's research was partially supported by Grant 65617357 from the Economic and Social Research Council of the United Kingdom.

References

  • Bondell, H.D., and Reich, B.J. (2008), ‘Simultaneous Regression Shrinkage, Variable Selection and Supervised Clustering of Predictors with OSCAR’, Biometrics, 64(1), 115–123.
  • Cai, Z., Fan, J., and Yao, Q. (2000), ‘Functional-Coefficient Regression Models for Nonlinear Time Series’, Journal of the American Statistical Association, 95(451), 941–956.
  • Cai, Z., and Xu, X. (2008), ‘Nonparametric Quantile Estimations for Dynamic Smooth Coefficient Models’, Journal of the American Statistical Association, 103(484), 1595–1608.
  • Chen, J., Li, D., and Xia, Y. (2019), ‘Estimation of a Rank-Reduced Functional-Coefficient Panel Data Model in Presence of Serial Correlation’, Journal of Multivariate Analysis, 173, 456–479.
  • Cheng, M., Zhang, W., and Chen, L. (2009), ‘Statistical Estimation in Generalized Multiparameter Likelihood Models’, Journal of the American Statistical Association, 104(487), 1179–1191.
  • Everitt, B.S., Landau, S., Leese, M., and Stahl, D. (2011), Cluster Analysis (5th ed.), Wiley Series in Probability and Statistics. Wiley.
  • Fan, J., and Gijbels, I. (1996), Local Polynomial Modelling and Its Applications, London: Chapman and Hall.
  • Fan, J., and Huang, T. (2005), ‘Profile Likelihood Inferences on Semiparametric Varying-Coefficient Partially Linear Models’, Bernoulli, 11(6), 1031–1057.
  • Fan, J., and Li, R. (2001), ‘Variable Selection Via Nonconcave Penalized Likelihood and Its Oracle Properties’, Journal of the American Statistical Association, 96(456), 1348–1360.
  • Fan, J., Ma, Y., and Dai, W. (2014), ‘Nonparametric Independence Screening in Sparse Ultra-High Dimensional Varying Coefficient Models’, Journal of the American Statistical Association, 109(507), 1270–1284.
  • Fan, Y., and Tang, C.Y. (2013), ‘Tuning Parameter Selection in High Dimensional Penalized Likelihood’, Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75(3), 531–552.
  • Fan, J., and Zhang, W. (1999), ‘Statistical Estimation in Varying Coefficient Models’, The Annals of Statistics, 27(5), 1491–1518.
  • Fan, J., and Zhang, W. (2008), ‘Statistical Methods with Varying Coefficient Models’, Statistics and Its Interface, 1(1), 179–195.
  • Green, P., and Silverman, B. (1994), Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, London: Chapman and Hall/CRC.
  • Jiang, Q., Wang, H., Xia, Y., and Jiang, G. (2013), ‘On a Principal Varying Coefficient Model’, Journal of the American Statistical Association, 108(501), 228–236.
  • Kai, B., Li, R., and Zou, H. (2011), ‘New Efficient Estimation and Variable Selection Methods for Semiparametric Varying-Coefficient Partially Linear Models’, The Annals of Statistics, 39(1), 305–332.
  • Ke, Z., Fan, J., and Wu, Y. (2015), ‘Homogeneity Pursuit’, Journal of the American Statistical Association, 110(509), 175–194.
  • Ke, Y., Li, J., and Zhang, W. (2016), ‘Structure Identification in Panel Data Analysis’, The Annals of Statistics, 44(3), 1193–1233.
  • Lee, E.R., and Mammen, E. (2016), ‘Local Linear Smoothing for Sparse High Dimensional Varying Coefficient Models’, Electronic Journal of Statistics, 10(1), 855–894.
  • Leng, C. (2010), ‘Variable Selection and Coefficient Estimation Via Regularized Rank Regression’, Statistica Sinica, 20, 167–181.
  • Li, D., Ke, Y., and Zhang, W. (2015), ‘Model Selection and Structure Specification in Ultra-High Dimensional Generalised Semi-Varying Coefficient Models’, The Annals of Statistics, 43(6), 2676–2705.
  • Liu, J., Li, R., and Wu, R. (2014), ‘Feature Selection for Varying Coefficient Models with Ultrahigh Dimensional Covariates’, Journal of the American Statistical Association, 109(505), 266–274.
  • Nierenberg, D., Stukel, T., Baron, J., Dain, B., and Greenberg, E. (1989), ‘Determinants of Plasma Levels of Beta-Carotene and Retinol’, American Journal of Epidemiology, 130(3), 511–521.
  • Park, B.U., Mammen, E., Lee, Y.K., and Lee, E.R. (2015), ‘Varying Coefficient Regression Models: A Review and New Developments’, International Statistical Review, 83(1), 36–64.
  • Rencher, A.C., and Christensen, W.F. (2012), Methods of Multivariate Analysis (3rd ed.), Wiley Series in Probability and Statistics. Wiley.
  • Schwarz, G. (1978), ‘Estimating the Dimension of a Model’, The Annals of Statistics, 6(2), 461–464.
  • Shen, X., and Huang, H.C. (2010), ‘Group Pursuit Through a Regularization Solution Surface’, Journal of the American Statistical Association, 105(490), 727–739.
  • Su, L., Shi, Z., and Phillips, P.C.B. (2016), ‘Identifying Latent Structures in Panel Data’, Econometrica, 84(6), 2215–2264.
  • Su, L., Wang, X., and Jin, S. (2019), ‘Sieve Estimation of Time-Varying Panel Data Models with Latent Structures’, Journal of Business and Economic Statistics, 37(2), 334–349.
  • Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005), ‘Sparsity and Smoothness Via the Fused Lasso’, Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(1), 91–108.
  • Vogt, M., and Linton, O. (2017), ‘Classification of Nonparametric Regression Functions in Longitudinal Data Models’, Journal of the Royal Statistical Society, Series B (Statistical Methodology), 79(1), 5–27.
  • Wand, M.P., and Jones, M.C. (1995), Kernel Smoothing, London: Chapman and Hall.
  • Wang, L., and Li, R. (2009), ‘Weighted Wilcoxon-Type Smoothly Clipped Absolute Deviation Method’, Biometrics, 65(2), 564–571.
  • Wang, H., and Xia, Y. (2009), ‘Shrinkage Estimation of the Varying-Coefficient Model’, Journal of the American Statistical Association, 104(486), 747–757.
  • Xia, Y., Zhang, W., and Tong, H. (2004), ‘Efficient Estimation for Semivarying-Coefficient Models’, Biometrika, 91(3), 661–681.