Theory and Methods

Partially Linear Additive Regression with a General Hilbertian Response

Pages 942-956 | Received 15 Sep 2021, Accepted 13 Nov 2022, Published online: 20 Jan 2023

ABSTRACT

In this article we develop semiparametric regression techniques for fitting partially linear additive models. The methods are for a general Hilbert-space-valued response. They use a powerful technique of additive regression in profiling out the additive nonparametric components of the models, which necessarily involves additive regression of the nonadditive effects of covariates. We show that the estimators of the parametric components are √n-consistent and asymptotically Gaussian under weak conditions. We also prove that the estimators of the nonparametric components, which are random elements taking values in a space of Hilbert-space-valued maps, achieve the univariate rate of convergence regardless of the dimension of covariates. We present some numerical evidence for the success of the proposed method and discuss real data applications. Supplementary materials for this article are available online.

1 Introduction

We study useful semiparametric regression techniques that can be used for analyzing a finite or infinite dimensional response variable. The response variable takes values in a general separable Hilbert space. The model consists of a parametric and a nonparametric part. The parametric part is linear in a covariate (predictor) vector, say X = (X1, …, Xp) for p ≥ 1, and the nonparametric part is to model the effect of another covariate vector, say Z = (Z1, …, Zd) for d ≥ 1. To avoid the curse of dimensionality in estimating the nonparametric part, it is assumed that the nonparametric effect is additive in Z, that is, it adds the unknown nonparametric effects of the individual covariates Zk. We consider two scenarios for X. One is that both Xj and Zk are real-valued, and the other is that Xj take values in the Hilbert space where the response variable takes values while Zk are real-valued. The new techniques are important extensions of the partially linear additive regression for scalar responses coupled with scalar covariates studied by Yu, Mammen, and Park (2011). In this article, we develop powerful techniques of estimating the models and provide sound theory that supports the methodology.

Our framework of Hilbert-space-valued (henceforth, Hilbertian) responses can be specialized to various data types. Its coverage is broad enough to include random variables, random vectors, random functions, random densities, compositional random vectors, infinite sequence random vectors, and compositional random functions, etc. All these types are commonly encountered in today’s data environments. The present work also provides a base toward semiparametric regression for other response spaces, such as Riemannian manifolds and Lie groups, along the way paved by Lin, Müller, and Park (2022). The latter work pioneered a link connecting additive regression for manifold-valued responses to Hilbertian additive regression (Jeon and Park 2020) via the Riemannian logarithmic map. Despite its importance, semiparametric regression with Hilbertian responses remains unexplored. To the best of our knowledge, this is the first attempt to study a semiparametric regression model for general Hilbertian responses.

We prove that our estimators of the parametric effects of Xj are √n-consistent in case Xj are real-valued. We also derive the joint asymptotic distribution of the parametric estimators. For Hilbertian Xj, we show that the corresponding parametric estimators are still √n-consistent if the Hilbert space where Xj take values is finite-dimensional, while they achieve a slightly slower rate if the Hilbert space is infinite-dimensional. Furthermore, we show that our methodology of estimating the nonparametric effect of the covariate Z is free from the curse of dimensionality, that is, it affords a univariate rate of convergence regardless of the dimension of Z. This is in contrast with the partially linear modeling approach that does not have an additivity structure in the nonparametric part. The latter suffers from the dimensionality problem. Our approach improves the estimation of the parametric part as well, not only that of the nonparametric part. Indeed, we demonstrate, theoretically and numerically, that using the additivity structure in the nonparametric part leads to an efficiency gain in the estimation of the parametric part. It turns out that the gain is larger if E(Xj|Z = ·) are farther from being additive.

The present work is not a straightforward extension of Yu, Mammen, and Park (2011). Dealing with a general Hilbert space, instead of the conventional R, as the space of the values of the response variable requires a number of innovations in developing the relevant methodology and theory. In case Xj takes values in R, the theory requires assessing the stochastic magnitude of terms of the form n^{−1}∑_{i=1}^n Ui · ĝ(Zij) for some Ui, which is a known function of (Xi, Zi), and for some stochastic (random) map ĝ, which is a random element taking values in the space of Hilbertian maps g : S → H for a Hilbert space H and a compact subset S of R. The usual way of analyzing such a ĝ-weighted average is to consider a large set G to which ĝ belongs with high probability, and then to derive the maximal stochastic size of n^{−1}∑_{i=1}^n Ui · g(Zij) over g ∈ G using an entropy bound for the set G. The set G embodies Hilbertian maps g : S → H. In case H is finite-dimensional, as Euclidean spaces are, the entropy of G is finite although the dimension of G may be infinite, particularly when G is a class of nonparametric maps g. In case H is infinite-dimensional, however, the entropy of G is infinite, so that it is not feasible to apply empirical process theory directly to the ĝ-weighted average. To resolve this difficulty we take the Hilbertian norm of the ĝ-weighted average, and convert its maximal stochastic size to that of a g̃-weighted average, where g̃ is a stochastic map from S to R. The conversion, with the associated calculation of the sizes of various stochastic terms, is one of the challenges we tackle in this article. For Hilbertian Xj taking values in H, however, a similar idea leads to dealing with n^{−1}∑_{i=1}^n η̂(Vi, Zij), where Vi is an H-valued function of (Xi, Zi) and η̂ is a stochastic map from H × S to R. It turns out that the class embodying the latter stochastic map with high probability does not necessarily have finite entropy. Thus, the case of Hilbertian Xj needs a different treatment, which is another difficulty we resolve in this work.

Our proposals are related to Jeon and Park (2020), which developed a smooth backfitting technique for additive regression with Hilbertian responses. The latter, however, is for additive models without a parametric component X, and for additive regression of an additive effect, that is, for estimating the “assumed” additive effect of Z. It does not cover cases where the additivity model assumption is violated in additive regression. In contrast, our work is for models with a linear effect of X in addition to the additive nonparametric effect of the covariate vector Z. Dealing with the additional linear effect necessarily involves additive regression of nonadditive effects. In this respect, our theory for additive regression is more general than the one in Jeon and Park (2020). Moreover, our modeling approach allows for discrete type covariates (ordinal or nominal), which arise in a variety of statistical problems.

In the case of a functional response, say Y(·) : T → R for a domain T ⊂ R^q, which is a special case of a Hilbertian response, one might think of applying a technique for scalar responses, such as the one in Yu, Mammen, and Park (2011), in a pointwise manner to Y(t) for each t ∈ T and then combining the pointwise results Ŷ(t) to construct Ŷ(·). However, this naive approach is problematic since the resulting Ŷ(·) is not guaranteed to take values in the space where Y(·) comes from. This is particularly the case when Y(·) has constraints, for example, Y(·) ≥ 0 and ∫_T Y(t) dt = 1, like a random density. Similarly, for a compositional response Y ≡ (Y1, …, Yk) with 0 < Yj < 1 and ∑_{j=1}^k Yj = 1, component-wise regression with Yj for each j does not give a simplex-valued Ŷ as a whole. Our approach does not have these drawbacks when specialized to functional and compositional data since it applies directly to the observations of the functional Y(·) and of the compositional vector Y, respectively, as data objects.

A few past works on semiparametric regression for real-valued responses include the estimation of partially linear models without additivity structure in the nonparametric part (Bhattacharya and Zhao 1997; Liang 2006), and of partially linear additive models (Opsomer and Ruppert 1999; Liang et al. 2008; Yu, Mammen, and Park 2011; Lee, Han, and Park 2018). Among the latter four studying partially linear additive models, Opsomer and Ruppert (1999) and Liang et al. (2008) employed the ordinary backfitting technique (Buja, Hastie, and Tibshirani 1989) to estimate the additive nonparametric part, while Yu, Mammen, and Park (2011) and Lee, Han, and Park (2018) used the smooth backfitting technique (Mammen, Linton, and Nielsen 1999). The smooth backfitting method has been proved to be successful in various structured nonparametric models under weak conditions (Yu, Park, and Mammen 2008; Linton, Sperlich, and van Keilegom 2008; Lee, Mammen, and Park 2010, 2012; Zhang, Park, and Wang 2013; Bissantz et al. 2016; Han and Park 2018; Han, Müller, and Park 2020). On the contrary, the ordinary backfitting is known to work under stronger conditions on covariates (Opsomer and Ruppert 1997).

2 Additive Regression

Our approach to the semiparametric regression requires the estimation of “best” approximations of E(W|Z = ·), for W being a general Hilbertian random element or a real-valued random variable, in the respective spaces of “additive” maps. This section is devoted to characterizing the best approximations and their estimation. We assume that Z is supported on [0,1]^d.

2.1 Projection on Additive Function Space

We consider a general separable Hilbert space, denoted by H. Euclidean spaces, L2 spaces, Bayes-Hilbert spaces and simplices are special cases of H. Here and below, we denote H-valued maps, random elements taking values in H and their values in H, by bold-faced symbols. Note that we also use bold-faced symbols to denote random vectors and their realizations. Let 0 be the zero vector in H, ⟨·,·⟩ the inner product of H and ‖·‖ the associated norm. We denote by ⊕ and ⊙ the operations of vector addition and of scalar multiplication in H, respectively. Let ⊖ denote the subtraction operation defined by a ⊖ b = a ⊕ (−1) ⊙ b.

We let Hadd(H) denote the space of additive maps g : [0,1]^d → H such that E(‖g(Z)‖²) < ∞ and g(z) = g1(z1) ⊕ ⋯ ⊕ gd(zd) for some gj : [0,1] → H. For a random element W taking values in H with E(‖W‖²) < ∞, let mW,+ = mW,1 ⊕ ⋯ ⊕ mW,d denote the projection of the mean regression map mW := E(W|Z = ·) onto Hadd(H). That is,
(2.1) mW,+ = argmin_{g+ ∈ Hadd(H)} ∫_{[0,1]^d} ‖mW(z) ⊖ g+(z)‖² f(z) dz,
where f is the joint density of Z. For the definition of the conditional expectation for Hilbertian W, we refer to Bosq (2000). The minimizer mW,+ with mW,+(z) = mW,1(z1) ⊕ ⋯ ⊕ mW,d(zd) satisfies
(2.2) 0 = ∫_{[0,1]^{d−1}} f(z)/fj(zj) ⊙ (mW(z) ⊖ mW,1(z1) ⊖ ⋯ ⊖ mW,d(zd)) dz−j = E(W|Zj = zj) ⊖ ⊕_{k≠j} ∫_0^1 fjk(zj,zk)/fj(zj) ⊙ mW,k(zk) dzk ⊖ mW,j(zj)
for each 1 ≤ j ≤ d and for all zj with fj(zj) > 0. Here and below, z−j for a d-vector z is (z1, …, zj−1, zj+1, …, zd), and fj and fjk are the densities of Zj and (Zj, Zk), respectively. The integrals in (2.2) are special cases of the so-called “Bochner integral” (Jeon and Park 2020), the latter being for Banach-space-valued maps. From (2.2), we see that the additive map mW,+ is characterized as the solution of the following system of Hilbertian integral equations:
(2.3) mW,j(zj) = E(W|Zj = zj) ⊖ ⊕_{k≠j} ∫_0^1 fjk(zj,zk)/fj(zj) ⊙ mW,k(zk) dzk, 1 ≤ j ≤ d.

We note that (2.3) does not define a component tuple (mW,j : 1 ≤ j ≤ d), but only the sum of its entries, mW,+ = mW,1 ⊕ ⋯ ⊕ mW,d. Obviously, if (mW,j : 1 ≤ j ≤ d) satisfies (2.3), then (mW,1 ⊕ c, mW,2 ⊖ c, mW,3, …, mW,d) also satisfies (2.3) for any constant c ∈ H. We also note that mW,+ ∈ Hadd(H) is different from the regression map mW unless mW belongs to Hadd(H). Thus, the problem of estimating mW,+, which we present in Section 2.2, is different from that of estimating mW under the assumption mW ∈ Hadd(H), the latter having been studied by Jeon and Park (2020).

2.2 Estimation of Additive Projection

We now discuss the estimation of mW,+. The basic idea is to replace the marginal regression maps E(W|Zj=·) and the densities fj and fjk in (2.3) by the corresponding kernel estimators and then to solve the resulting system of equations. We note that our method requires the estimation of mXj,+ for real-valued Xj and of mXj,+ for Hilbertian Xj, as well. Below, we describe the method for W taking values in a general Hilbert space, since specialization to the case of random variables is immediate from the general treatment.

Suppose that we have n observations, (Zi, Wi), 1 ≤ i ≤ n, with Zi = (Zi1, …, Zid). We estimate the marginal regression map MW,j(zj) := E(W|Zj = zj) for zj ∈ [0,1] by
(2.4) M̂W,j(zj) = f̂j(zj)^{−1} n^{−1} ⊙ ⊕_{i=1}^n (Khj(zj, Zij) ⊙ Wi),
where f̂j(zj) = n^{−1}∑_{i=1}^n Khj(zj, Zij) is a kernel estimator of fj. We also estimate fjk by f̂jk(zj, zk) = n^{−1}∑_{i=1}^n Khj(zj, Zij) Khk(zk, Zik). Here and throughout the article, Kh(·,·) is a normalized kernel function defined by Kh(u,v) = Kh(u − v)/∫_0^1 Kh(t − v) dt, u, v ∈ [0,1], where h > 0 is the bandwidth, Kh(w) = K(w/h)/h and K(·) ≥ 0 is the baseline kernel function. This type of normalized kernel has been used in the literature (Yu, Park, and Mammen 2008; Lee, Mammen, and Park 2010, 2012; Jeon and Park 2020). It has the property that ∫_0^1 Kh(u,v) du = 1 for all v ∈ [0,1], which gives ∫_0^1 f̂j(zj) dzj = 1 and ∫_0^1 f̂jk(zj,zk) dzk = f̂j(zj) for all zj ∈ [0,1]. Plugging these estimators into (2.3) gives the following estimated system of Hilbertian integral equations:
(2.5) m̂W,j(zj) = M̂W,j(zj) ⊖ ⊕_{k≠j} ∫_0^1 f̂jk(zj,zk)/f̂j(zj) ⊙ m̂W,k(zk) dzk, 1 ≤ j ≤ d.

In the supplement S.7, we will show that the system of equations (2.5) has a unique solution, which we denote by m̂W,+. Just as the system of equations at (2.3) determines only mW,+ = mW,1 ⊕ ⋯ ⊕ mW,d, the equation (2.5) by itself defines only m̂W,+ = m̂W,1 ⊕ ⋯ ⊕ m̂W,d, not the individual components m̂W,j. The solution m̂W,+ does not have a closed form, but can be computed by iteration starting with an initial tuple (m̂W,j^[0] : 1 ≤ j ≤ d). We may simply take m̂W,j^[0] ≡ 0, for example. In the rth (r ≥ 1) cycle of the iteration, we update m̂W,j^[r−1] by
(2.6) m̂W,j^[r](zj) = M̂W,j(zj) ⊖ ⊕_{k=1}^{j−1} ∫_0^1 f̂jk(zj,zk)/f̂j(zj) ⊙ m̂W,k^[r](zk) dzk ⊖ ⊕_{k=j+1}^d ∫_0^1 f̂jk(zj,zk)/f̂j(zj) ⊙ m̂W,k^[r−1](zk) dzk, 1 ≤ j ≤ d.

In the supplement S.7, we will show that m̂W,+^[r] := m̂W,1^[r] ⊕ ⋯ ⊕ m̂W,d^[r] converges to m̂W,+ as r → ∞; see Proposition S.1. In the iteration, we can evaluate the Bochner integrals in (2.6) by conventional Lebesgue integrals, provided that the m̂W,j^[0] are linear in Wi, that is, m̂W,j^[0](zj) = n^{−1} ⊙ ⊕_{i=1}^n (wij(zj) ⊙ Wi) for some real-valued functions wij : [0,1] → R. This is because the marginals M̂W,j are also linear in Wi, so that all the subsequent updates m̂W,j^[r] for r ≥ 1 are linear in Wi. For a linear form g(u) = n^{−1} ⊙ ⊕_{i=1}^n (vi(u) ⊙ Wi) with real-valued weight functions vi : [0,1] → R, it holds that ∫_0^1 f̂jk(zj,u) ⊙ g(u) du = n^{−1} ⊙ ⊕_{i=1}^n (∫_0^1 vi(u) · f̂jk(zj,u) du) ⊙ Wi, where the integral on the left side is in the Bochner sense while the one on the right hand side is a Lebesgue integral.
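To make the iteration concrete, the following is a minimal Python sketch of the backfitting scheme (2.5)–(2.6) for a response W represented by coordinates in a finite-dimensional subspace of H (an (n, m) array), with the normalized kernel defined above and trapezoidal integration on a grid. All names (trapz, epanechnikov, normalized_kernel, additive_projection) are ours, not the authors'; the Epanechnikov baseline kernel and the fixed number of cycles are illustrative assumptions, not the article's actual implementation.

```python
import numpy as np

def trapz(y, x, axis):
    """Trapezoidal rule along `axis` for grid points x."""
    dx = np.diff(x)
    sl0 = [slice(None)] * y.ndim; sl0[axis] = slice(0, -1)
    sl1 = [slice(None)] * y.ndim; sl1[axis] = slice(1, None)
    shape = [1] * y.ndim; shape[axis] = len(dx)
    return ((y[tuple(sl0)] + y[tuple(sl1)]) / 2 * dx.reshape(shape)).sum(axis=axis)

def epanechnikov(u):
    return 0.75 * np.maximum(1.0 - u ** 2, 0.0)

def normalized_kernel(u, v, h, grid):
    """K_h(u, v) = K_h(u - v) / int_0^1 K_h(t - v) dt, so that it integrates to one in u."""
    raw = epanechnikov((u - v) / h) / h
    denom = trapz(epanechnikov((grid[:, None] - v) / h) / h, grid, axis=0)
    return raw / denom

def additive_projection(Z, W, bandwidths, grid, n_iter=50):
    """Backfitting solution of the estimated system (2.5) via the updates (2.6).
       Z: (n, d) covariates in [0, 1]; W: (n, m) coordinates of the responses in H."""
    n, d = Z.shape
    G, m = len(grid), W.shape[1]
    K = [normalized_kernel(grid[:, None], Z[None, :, j], bandwidths[j], grid) for j in range(d)]
    f_hat = [K[j].mean(axis=1) for j in range(d)]                      # \hat f_j on the grid
    fjk = [[K[j] @ K[k].T / n for k in range(d)] for j in range(d)]    # \hat f_{jk} on grid x grid
    M_hat = [(K[j] @ W) / n / f_hat[j][:, None] for j in range(d)]     # \hat M_{W,j}, see (2.4)
    mhat = [np.zeros((G, m)) for _ in range(d)]                        # initial tuple: all zero maps
    for _ in range(n_iter):
        for j in range(d):
            corr = np.zeros((G, m))
            for k in range(d):
                if k != j:
                    # \int_0^1 \hat f_{jk}(z_j, z_k) / \hat f_j(z_j) * \hat m_{W,k}(z_k) dz_k
                    corr += trapz(fjk[j][k][:, :, None] * mhat[k][None, :, :], grid, axis=1) / f_hat[j][:, None]
            mhat[j] = M_hat[j] - corr
    return mhat, f_hat
```

As noted above, only the sum of the returned components is identified; a centering step such as (3.8) below is needed to pin down the individual maps.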

The regression map mW is not assumed to be additive, and thus estimating the additive map mW,+ is considered additive regression of a nonadditive effect, which is different from additive regression of an additive effect, developed by Mammen, Linton, and Nielsen (1999) and Jeon and Park (2020). The very core of the matter in the present case regarding additive regression is that, when we write W = mW,+(Z) ⊕ ϱW with ϱW := W ⊖ mW,+(Z) as an error term, E(ϱW|Z) = mW(Z) ⊖ mW,+(Z) ≠ 0. Instead, by the definition of mW,+, we have E(⟨ϱW, δ+(Z)⟩) = 0 for all δ+ ∈ Hadd(H). This implies that, for all 1 ≤ k ≤ d and for all δk : [0,1] → H with E(‖δk(Zk)‖²) < ∞, 0 = E(⟨ϱW, δk(Zk)⟩) = E(⟨E(ϱW|Zk), δk(Zk)⟩). This gives
(2.7) E(ϱW|Zk) = 0, 1 ≤ k ≤ d.

We show that the latter is enough for the theory of m̂W,+, see the supplement S.7 for details.

3 Hilbertian PLAM with Scalar Covariates

We introduce a partially linear additive model for an H-valued response Y and covariate vectors X ≡ (X1, …, Xp) and Z ≡ (Z1, …, Zd) taking values in R^p and R^d, respectively, and present a way of estimating the model. We allow the entries of X to be discrete (nominal or ordinal) while assuming that all entries of Z are of continuous type supported on [0,1]. We assume that E(Xj²) < ∞ for all 1 ≤ j ≤ p.

3.1 The Model

The Hilbertian partially linear additive model (HPLAM) is given by
(3.1) Y = β0 ⊕ ⊕_{j=1}^p (Xj ⊙ βj) ⊕ ⊕_{k=1}^d mk(Zk) ⊕ ε,

where βj are unknown constants in H, mk : [0,1] → H are unknown component maps with E(‖mk(Zk)‖²) < ∞ satisfying the constraints
(3.2) E(mk(Zk)) = 0, 1 ≤ k ≤ d,
and ε is a Hilbertian error term such that E(ε|X,Z) = 0 and E(‖ε‖²) < ∞. Below, we give a proposition that establishes the identifiability of βj and mk in the model (3.1). Let PU denote the distribution of a random vector or element U. We make the following weak assumption.

(A0) The joint distribution PX,Z has a density with respect to the product measure PX ⊗ PZ, which is bounded away from zero on the support of PX,Z. The marginal distribution PZ has a density f with respect to the Lebesgue measure such that f(z) > 0 for all z ∈ [0,1]^d, and the support of PX is not contained in a hyperplane in R^p.

Proposition 3.1.

Let αj ∈ H for 0 ≤ j ≤ p be Hilbertian constants and gk : [0,1] → H for 1 ≤ k ≤ d be Hilbertian maps with E(‖gk(Zk)‖²) < ∞ and E(gk(Zk)) = 0. Assume the condition (A0). Then,
(3.3) E(‖α0 ⊕ ⊕_{j=1}^p (Xj ⊙ αj) ⊕ ⊕_{k=1}^d gk(Zk)‖²) = 0
implies αj = 0 and gk(Zk) = 0 a.s. for all 0 ≤ j ≤ p and 1 ≤ k ≤ d.

A proof of Proposition 3.1 is given in the supplement S.1. With (3.2), we have β0 = E(Y) ⊖ ⊕_{j=1}^p (E(Xj) ⊙ βj), so that by plugging the expression into (3.1) we may rewrite the model (3.1) as Yc = ⊕_{j=1}^p (Xjc ⊙ βj) ⊕ ⊕_{k=1}^d mk(Zk) ⊕ ε, where Yc = Y ⊖ E(Y) and Xjc = Xj − E(Xj).

3.2 Estimation of Parametric Components

Here, we discuss the estimation of βj using the profiling technique (Severini and Wong 1992) in conjunction with the estimation of additive projections presented in Section 2.2. Let m+ = m1 ⊕ ⋯ ⊕ md ∈ Hadd(H) and put β = (β1, …, βp). Note that β does not involve β0. Then, under the model specification (3.1),
(3.4) (β, m+) = argmin_{b ∈ H^p, g+ ∈ Hadd(H)} E(‖Yc ⊖ ((Xc)⊤ ⊙ b) ⊖ g+(Z)‖²),
where Xc = (X1c, …, Xpc). Here and below, c⊤ ⊙ b = ⊕_{j=1}^p (cj ⊙ bj) for c = (c1, …, cp) ∈ R^p and b = (b1, …, bp) ∈ H^p. Considering mYc⊖((Xc)⊤⊙b),+ for each b ∈ H^p, which solves the system of equations (2.3) with W = Yc ⊖ ((Xc)⊤ ⊙ b), and noting that mYc⊖((Xc)⊤⊙β),+ = m+, we get β = argmin_{b ∈ H^p} E(‖Yc ⊖ ((Xc)⊤ ⊙ b) ⊖ mYc⊖((Xc)⊤⊙b),+(Z)‖²).

According to the arguments leading to Corollary S.1 in the supplement S.7, mW,+ is linear in W in the sense that m(W⊕W′),+ = mW,+ ⊕ mW′,+ and mW⊖c,+ = mW,+ ⊖ c for all Hilbertian variables W, W′ and any Hilbertian constant c. Also, it holds that m(W1⊙b1)⊕(W2⊙b2),+ = (mW1,+ ⊙ b1) ⊕ (mW2,+ ⊙ b2) for all Hilbertian constants bj ∈ H and random variables Wj. Thus, mYc⊖((Xc)⊤⊙b),+ = mYc,+ ⊖ (mXc,+⊤ ⊙ b). Furthermore, we also get Yc ⊖ mYc,+(Z) = (Y ⊖ E(Y)) ⊖ (mY,+(Z) ⊖ E(Y)) = Y ⊖ mY,+(Z), and likewise Xc − mXc,+(Z) = X − mX,+(Z). This gives
(3.5) β = argmin_{b ∈ H^p} E(‖(Y ⊖ mY,+(Z)) ⊖ (X − mX,+(Z))⊤ ⊙ b‖²).

We estimate β by minimizing an empirical version of the objective functional at (3.5). Define X̃ij = Xij − m̂Xj,+(Zi), Ỹi = Yi ⊖ m̂Y,+(Zi), where m̂Xj,+ for 1 ≤ j ≤ p and m̂Y,+, respectively, are the solutions of (2.5) for W = Xj and W = Y. Put m̂X,+ = (m̂X1,+, …, m̂Xp,+). Then, X̃i := (X̃i1, …, X̃ip) = Xi − m̂X,+(Zi). We propose to estimate β by
(3.6) β̂ = argmin_{b ∈ H^p} n^{−1}∑_{i=1}^n ‖Ỹi ⊖ X̃i⊤ ⊙ b‖².

We present an explicit form of β̂ defined at (3.6). The Gâteaux derivative of the objective functional in (3.6) at β̂ in an arbitrary direction δ ∈ H^p equals
−2 n^{−1}∑_{i=1}^n ⟨X̃i⊤ ⊙ δ, Ỹi ⊖ X̃i⊤ ⊙ β̂⟩ = −2 n^{−1}∑_{j=1}^p ∑_{i=1}^n ⟨δj, X̃ij ⊙ (Ỹi ⊖ X̃i⊤ ⊙ β̂)⟩ = −2 n^{−1}∑_{j=1}^p ⟨δj, ⊕_{i=1}^n X̃ij ⊙ (Ỹi ⊖ X̃i⊤ ⊙ β̂)⟩.

Equating the above derivative with zero for all δ ∈ H^p gives ⊕_{i=1}^n X̃ij ⊙ (Ỹi ⊖ X̃i⊤ ⊙ β̂) = 0, 1 ≤ j ≤ p.

From this we get that, whenever the p × p matrix ∑_{i=1}^n X̃i X̃i⊤ is invertible,
(3.7) β̂ = (∑_{i=1}^n X̃i X̃i⊤)^{−1} ⊙ (⊕_{i=1}^n X̃i ⊙ Ỹi),
where c ⊙ v0 = (c1 ⊙ v0, …, cp ⊙ v0) ∈ H^p for v0 ∈ H and c = (c1, …, cp) ∈ R^p, and A ⊙ b = (⊕_{j=1}^p a1j ⊙ bj, …, ⊕_{j=1}^p apj ⊙ bj) for a p × p real matrix A = (aij) and b = (b1, …, bp) ∈ H^p. With β̂ = (β̂1, …, β̂p) defined at (3.7) we estimate β0 in the model (3.1) by β̂0 = Ȳ ⊖ ⊕_{j=1}^p X̄j ⊙ β̂j, where X̄j = n^{−1}∑_{i=1}^n Xij and Ȳ = n^{−1} ⊙ (⊕_{i=1}^n Yi). In Section 7.1, we will show that n^{−1}∑_{i=1}^n X̃i X̃i⊤ is invertible with probability tending to one, that β̂ is √n-consistent and that √n ⊙ (β̂ ⊖ β) converges to a mean zero Gaussian random element taking values in H^p.
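A minimal sketch of the closed form (3.7), assuming the additive projections m̂X,+(Zi) and m̂Y,+(Zi) have already been computed (e.g., by the backfitting sketch in Section 2.2) and that the responses are stored as coordinate vectors of a finite-dimensional representation of H; the function and argument names are hypothetical.

```python
import numpy as np

def hplam_beta_scalar_X(X, Y, mX_plus_at_Z, mY_plus_at_Z):
    """Profiling estimator (3.7) with real-valued covariates.
       X: (n, p); Y: (n, m) coordinates of Y_i in H;
       mX_plus_at_Z: (n, p) values of \hat m_{X,+}(Z_i); mY_plus_at_Z: (n, m) values of \hat m_{Y,+}(Z_i)."""
    X_tilde = X - mX_plus_at_Z                                             # \tilde X_i
    Y_tilde = Y - mY_plus_at_Z                                             # coordinates of \tilde Y_i
    beta_hat = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y_tilde)   # (p, m): row j holds \hat beta_j
    beta0_hat = Y.mean(axis=0) - X.mean(axis=0) @ beta_hat                 # \hat beta_0 = \bar Y (-) sum_j \bar X_j (.) \hat beta_j
    return beta0_hat, beta_hat
```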

3.3 Estimation of Nonparametric Components

We estimate m+ = ⊕_{k=1}^d mk in the model (3.1) by m̂+, which solves the system of equations at (2.5) with Wi = Yi ⊖ β̂0 ⊖ (Xi⊤ ⊙ β̂). From the linearity of m̂W,+ in (W1, …, Wn) (Corollary S.1 in the supplement S.7), it holds that m̂+(z) = m̂Y,+(z) ⊖ β̂0 ⊖ (m̂X,+(z)⊤ ⊙ β̂), where m̂X,+(z) = (m̂X1,+(z), …, m̂Xp,+(z)). To estimate the individual component maps mk satisfying the constraints E(mk(Zk)) = 0, we put the following constraints on m̂k:
(3.8) ∫_0^1 f̂k(u) ⊙ m̂k(u) du = 0, 1 ≤ k ≤ d,
where the integrals are in the Bochner sense. Then, the constraints (3.8) identify the individual m̂k uniquely such that m̂+ = m̂1 ⊕ ⋯ ⊕ m̂d.

To see how the constraints identify a unique set of m̂k, let (m˜1,,m˜d) be an arbitrary tuple such that m̂+=m˜1    m˜d. Choose m̂k=m˜k  01f̂k(zk)m˜k(zk) dzk. Obviously, each m̂k satisfies (3.8). It also holds (3.9) k=1d01f̂k(zk)m˜k(zk) dzk=0,(3.9) which gives m̂+=m̂1    m̂d. To see (3.9), we note that, since m̂+=m˜1    m˜d, the tuple (m˜1,,m˜d) satisfies f̂k(zk)m˜k(zk)=f̂k(zk)M̂Yβ̂0(Xβ̂),k(zk)  lk01f̂kl(zk,zl)m˜l(zl) dzl for all k. Integrating both sides over zk[0,1] then gives (zl)l=1d01f̂l(zl)m˜l(zl) dzlk01f̂k(zk)M̂Yβ̂0(Xβ̂),k(zk) dzk=n1i=1n(Yi  β̂0  Xiβ̂)=0.

The first equality in the above equations follows from 01f̂kl(zk,zl) dzk=f̂l(zl), and the second from 01Khk(zk,Zik) dzk1 in view of (2.4).
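In a coordinate representation, the centering (3.8) amounts to subtracting an f̂k-weighted average from each fitted component map; a small sketch under the same conventions as the earlier snippets, with names of our own choosing.

```python
import numpy as np

def center_components(mhat, f_hat, grid):
    """Impose (3.8): subtract the \hat f_k-weighted average from each fitted component.
       mhat: list of (G, m) arrays; f_hat: list of (G,) arrays; grid: (G,) increasing points in [0, 1]."""
    w = np.empty_like(grid)
    dg = np.diff(grid)
    w[0], w[-1] = dg[0] / 2, dg[-1] / 2
    w[1:-1] = (dg[:-1] + dg[1:]) / 2             # trapezoidal quadrature weights on the grid
    centered = []
    for mk, fk in zip(mhat, f_hat):
        avg = (w * fk) @ mk                       # \int_0^1 \hat f_k(u) (.) \hat m_k(u) du
        centered.append(mk - avg)                 # \hat m_k <- \hat m_k (-) avg
    return centered
```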

3.4 Hilbertian PLM

We discuss briefly the estimation of the Hilbertian partially linear model (HPLM): Y = β0 ⊕ ⊕_{j=1}^p (Xj ⊙ βj) ⊕ m(Z) ⊕ ε, where m is allowed to be nonadditive, that is, m may not belong to Hadd(H). It deserves being discussed in this article since, to the best of our knowledge, there has been no study of this model for Hilbertian responses, and it helps to understand our numerical results presented in Section 5.

Let m̂Y be a multivariate kernel estimator of mY := E(Y|Z = ·) defined by
(3.10) m̂Y(z) = (∑_{i=1}^n ∏_{j=1}^d Khj(zj, Zij))^{−1} ⊙ ⊕_{i=1}^n ((∏_{j=1}^d Khj(zj, Zij)) ⊙ Yi).

Likewise, let m̂X be a multivariate kernel estimator of mX := E(X|Z = ·), which is obtained by changing Yi in (3.10) to Xi and the vector operations for H to those for R^p. Let ϵY = Y ⊖ mY(Z), which should be differentiated from ε = Y ⊖ E(Y|X,Z). Put ϵX = X − mX(Z). Also, let ϵY,i = Yi ⊖ mY(Zi) and ϵX,i = Xi − mX(Zi). Define ϵ̂Y,i = Yi ⊖ m̂Y(Zi), ϵ̂X,i = Xi − m̂X(Zi).

Applying the idea of profiling, explored in Section 3.2 for the HPLAM, now to the HPLM, we may estimate β = (β1, …, βp) by
(3.11) β̂PLM = (∑_{i=1}^n ϵ̂X,i ϵ̂X,i⊤)^{−1} ⊙ (⊕_{i=1}^n ϵ̂X,i ⊙ ϵ̂Y,i),
and β0 by β̂0PLM = Ȳ ⊖ ⊕_{j=1}^p X̄j ⊙ β̂jPLM, where β̂jPLM is the jth entry of β̂PLM. We note that β̂PLM takes the same form as β̂ at (3.7) with ϵ̂X,i and ϵ̂Y,i taking the roles of X̃i and Ỹi, respectively.
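For comparison, the following is a sketch of the full-dimensional profiling estimator (3.10)–(3.11). For brevity it uses a plain (unnormalized) Epanechnikov product kernel, whereas the article uses the normalized kernels of Section 2.2; all function and variable names are hypothetical.

```python
import numpy as np

def epan(u):
    return 0.75 * np.maximum(1.0 - u ** 2, 0.0)

def nw_full_dim(Z, V, z0, bandwidths):
    """Product-kernel Nadaraya-Watson estimate of E(V | Z = z0), cf. (3.10).
       Z: (n, d); V: (n, m) coordinates in H (or (n, p) for the covariate vector); z0: (d,)."""
    w = np.ones(Z.shape[0])
    for j in range(Z.shape[1]):
        w *= epan((z0[j] - Z[:, j]) / bandwidths[j]) / bandwidths[j]
    return (w[:, None] * V).sum(axis=0) / w.sum()

def hplm_beta(Z, X, Y, bandwidths):
    """Profiling estimator (3.11): regress Y-residuals on X-residuals from full-dimensional smoothing."""
    eps_X = X - np.array([nw_full_dim(Z, X, z, bandwidths) for z in Z])   # \hat eps_{X,i}
    eps_Y = Y - np.array([nw_full_dim(Z, Y, z, bandwidths) for z in Z])   # \hat eps_{Y,i}
    beta_PLM = np.linalg.solve(eps_X.T @ eps_X, eps_X.T @ eps_Y)          # (p, m)
    beta0_PLM = Y.mean(axis=0) - X.mean(axis=0) @ beta_PLM
    return beta0_PLM, beta_PLM
```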

Let ϱX = X − mX,+(Z). Note that ϱX = ϵX if all E(Xj|Z = ·) are additive maps, that is, belong to Hadd(R). According to the theory in Section 7.1, under the HPLAM model (3.1), the asymptotic variances of β̂jPLM are equal to or larger than those of the respective β̂j; see Equation (7.2). The gains by β̂j against β̂jPLM get larger as the diagonal entries of [E(ϵX ϵX⊤)]^{−1} grow farther away from the respective diagonal entries of [E(ϱX ϱX⊤)]^{−1}.

4 Hilbertian PLAM with Hilbertian Covariates

In this section we consider the case where the covariates in the parametric part are also Hilbertian. We continue to use the notation X ≡ (X1, …, Xp) for Hilbertian Xj. Assume that E(‖Xj‖²) < ∞ for all 1 ≤ j ≤ p.

4.1 The Model

Here, we assume that the Hilbert spaces for Xj are the same as H. In Section 4.3, we discuss the case where the spaces for Xj are different from each other or from H. We study the following Hilbertian partially linear additive model
(4.1) Y = β0 ⊕ ⊕_{j=1}^p (βj ⊙ Xj) ⊕ ⊕_{k=1}^d mk(Zk) ⊕ ε,
where β0 is an unknown constant in H but βj for 1 ≤ j ≤ p are now unknown constants in R. The component maps mk and the error term ε are as in the model (3.1) with the constraints (3.2). As in Section 3.1, we find β0 = E(Y) ⊖ ⊕_{j=1}^p (βj ⊙ E(Xj)) and thus may rewrite the model (4.1) as
(4.2) Yc = ⊕_{j=1}^p (βj ⊙ Xjc) ⊕ ⊕_{k=1}^d mk(Zk) ⊕ ε,
where Yc = Y ⊖ E(Y) and Xjc = Xj ⊖ E(Xj).

Under a weak assumption, the constraints (3.2) also entail the identifiability of the model (4.1). To state a proposition for the identifiability, we introduce new symbols for matrices and vectors of Hilbertian elements. Let ⟨a,b⟩ for a = (a1, …, ap) ∈ H^p and b = (b1, …, bp) ∈ H^p denote the p × p matrix whose (j, k) element equals ⟨aj, bk⟩. Also, let ⟨a,c⟩ and ⟨c,a⟩ for a ∈ H^p and c ∈ H denote p×1 and 1×p vectors, respectively, whose jth entries are ⟨aj,c⟩ and ⟨c,aj⟩. In this notation, E(⟨Xc,Xc⟩) denotes the p × p matrix whose (j, k)th element equals E(⟨Xjc, Xkc⟩). We make the following assumption.

(B0) The joint distribution PX,Z has a density with respect to the product measure PX ⊗ PZ, which is bounded away from zero on the support of PX,Z. The marginal distribution PZ has a density f with respect to the Lebesgue measure such that f(z) > 0 for all z ∈ [0,1]^d, and the matrix E(⟨Xc,Xc⟩) is positive-definite.

Proposition 4.1.

Let α0 ∈ H and αj ∈ R for 1 ≤ j ≤ p be constants and let gk : [0,1] → H for 1 ≤ k ≤ d satisfy E(‖gk(Zk)‖²) < ∞ and E(gk(Zk)) = 0. Assume the condition (B0). Then,
(4.3) E(‖α0 ⊕ ⊕_{j=1}^p (αj ⊙ Xj) ⊕ ⊕_{k=1}^d gk(Zk)‖²) = 0
implies α0 = 0, αj = 0 and gk(Zk) = 0 a.s. for all 1 ≤ j ≤ p and 1 ≤ k ≤ d.

A proof of Proposition 4.1 can be found in the supplement S.2.

4.2 Estimation of the Model

Put β = (β1, …, βp) ∈ R^p. From (4.2), it holds that (β, m+) = argmin_{b ∈ R^p, g+ ∈ Hadd(H)} E(‖Yc ⊖ (⊕_{j=1}^p bj ⊙ Xjc) ⊖ g+(Z)‖²), where b = (b1, …, bp) ∈ R^p. Noting that m+ = mU,+ with U = Yc ⊖ ⊕_{j=1}^p (βj ⊙ Xjc), and using that mW,+ is linear in W for any Hilbertian W, we get β = argmin_{b ∈ R^p} E(‖(Y ⊖ mY,+(Z)) ⊖ ⊕_{j=1}^p bj ⊙ (Xj ⊖ mXj,+(Z))‖²).

Let Ỹi = Yi ⊖ m̂Y,+(Zi) and X̃ij = Xij ⊖ m̂Xj,+(Zi). We estimate β by
(4.4) β̂ = argmin_{b ∈ R^p} n^{−1}∑_{i=1}^n ‖Ỹi ⊖ (⊕_{j=1}^p bj ⊙ X̃ij)‖².

Put X̃i = (X̃i1, …, X̃ip) and let ⟨X̃i, X̃i⟩ denote the p × p matrix whose (j, k)th element is given by ⟨X̃ij, X̃ik⟩. Also, let ⟨X̃i, Ỹi⟩ denote the p-vector whose jth element is given by ⟨X̃ij, Ỹi⟩. By taking the Gâteaux derivative of the objective functional in (4.4), we find that
(4.5) β̂ = (∑_{i=1}^n ⟨X̃i, X̃i⟩)^{−1} · ∑_{i=1}^n ⟨X̃i, Ỹi⟩,
provided that ∑_{i=1}^n ⟨X̃i, X̃i⟩ is invertible. In Section 7.2, we show that n^{−1}∑_{i=1}^n ⟨X̃i, X̃i⟩ is invertible with probability tending to one. With β̂ = (β̂1, …, β̂p) defined at (4.5) we estimate β0 in the model (4.1) by β̂0 = Ȳ ⊖ (⊕_{j=1}^p β̂j ⊙ X̄j), where X̄j = n^{−1} ⊙ ⊕_{i=1}^n Xij and Ȳ = n^{−1} ⊙ ⊕_{i=1}^n Yi.
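A sketch of (4.5) when the Hilbertian covariates and the residualized response are stored as coordinate arrays: the einsum contractions below form the Gram matrix of inner products and the right-hand-side vector. The names are ours and the coordinate representation is an assumption for illustration.

```python
import numpy as np

def hplam_beta_hilbert_X(X_tilde, Y_tilde):
    """Estimator (4.5) with Hilbertian covariates.
       X_tilde: (n, p, m) coordinates of X_{ij} (-) \hat m_{X_j,+}(Z_i);
       Y_tilde: (n, m) coordinates of Y_i (-) \hat m_{Y,+}(Z_i)."""
    gram = np.einsum('ijm,ikm->jk', X_tilde, X_tilde)   # sum_i <X~_{ij}, X~_{ik}>
    rhs = np.einsum('ijm,im->j', X_tilde, Y_tilde)      # sum_i <X~_{ij}, Y~_i>
    return np.linalg.solve(gram, rhs)                   # \hat beta in R^p
```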

We also estimate m+ by the solution m̂+ of the system of equations (2.5) with Wi = Yi ⊖ β̂0 ⊖ ⊕_{j=1}^p (β̂j ⊙ Xij).

By arguing as in Section 3.3, we may show that m̂+ is uniquely decomposed into m̂+ = ⊕_{j=1}^d m̂j with the constraints at (3.8).

4.3 Discussion

One may wonder if the linear part of the model (4.1) covers the standard function-on-function linear regression model when Xj and Y are functional variables of infinite-dimension. In fact, such a functional linear model reduces to the model (3.1), not to the one at (4.1). To see this, consider the case with p = 1 for simplicity, that is, in the linear part of the model, there is only one covariate, say X, taking values in L2(S) with SR. Let the response variable YY(·) take values in L2(T) for some TR. The standard function-on-function linear regression model (Ramsay and Silverman Citation2005; Yao, Müller, and Wang Citation2005; Benatia, Carrasco, and Florens Citation2017) with the covariate XX(·) is given by Lθ:L2(S)L2(T) such that (4.6) Lθ(x)(v)=β0(v)+Sθ(u,v)x(u) du,xx(·)L2(S),(4.6) where β0L2(T) and θL2(S×T) are considered as parameters. Let {ϕj:j1} and {ψk:k1} be known bases of L2(S) and L2(T), respectively. An approach in, for example, Ramsay and Silverman (Citation2005) is to take, as the space for θ(·,·), a tensor product of finite-dimensional subspaces of L2(S) and L2(T). In this approach θ is represented as (4.7) θ(u,v)=j=1L1k=1L2θjkϕj(u)ψk(v),θjkR(4.7) for some L1,L21. In our discussion, we allow L2= while we assume L1 is finite and fixed. For the functional covariate X(·), put ξj=Sϕj(u)X(u) du,βj(v)=k=1L2θjkψk(v).

Then, plugging the representation at (4.7) into the model (4.6) we get (4.8) Lθ(X)(v)=β0(v)+j=1L1ξj·βj(v),vT.(4.8)

Letting ξj and βj(·)L2(T) take the respective roles of Xj and βj in the model (3.1), the model (4.8) reduces to a special case of the linear part of the model (3.1) in Section 3.1. The above observation manifests that our model at (3.1) accommodates even infinite-dimensional θ in the standard function-on-function linear regression model (4.6), and the theory developed in Section 7.1 applies directly to this case.
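As an illustration of the reduction (4.8), the scores ξij can be obtained by numerically integrating the observed curves against a fixed, known basis; the sketch below uses a trapezoidal quadrature, and all names (basis_scores, X_curves, basis_vals) are hypothetical.

```python
import numpy as np

def basis_scores(X_curves, basis_vals, u_grid):
    """Scores xi_{ij} = \int_S phi_j(u) X_i(u) du computed by the trapezoidal rule.
       X_curves: (n, G) discretized curves X_i(u); basis_vals: (L1, G) values of phi_1, ..., phi_{L1}; u_grid: (G,)."""
    w = np.empty_like(u_grid)
    du = np.diff(u_grid)
    w[0], w[-1] = du[0] / 2, du[-1] / 2
    w[1:-1] = (du[:-1] + du[1:]) / 2        # trapezoidal quadrature weights
    return (X_curves * w) @ basis_vals.T    # (n, L1) matrix of scores xi_{ij}
```

With these scores in hand, the parametric part can then be fitted exactly as in Section 3.2, with ξj playing the role of Xj.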

One may think of an alternative basis generating θ, based on the functional principal components of X(·) and Y(·). In this way, θ is represented as at (4.7) but now with ϕj and ψk being the eigenfunctions of the respective covariance operators of X(·) and Y(·). The main difference from the previous approach is that ϕj and ψk are unknown. Thus, incorporating the unknown basis into the model at (4.6), ξj at (4.8) are not observed but need to be estimated. Put ξ̂ij=Sϕ̂j(u)Xi(u) du,1jL1,where ϕ̂j are the eigenfunctions of the sample covariance operator based on {Xi(·):1in}. Then, the estimation of the model (3.1), which incorporates (4.8) into the linear part, boils down to an errors-in-variables problem with “small” errors in observing the true covariate values ξij:=Sϕj(u)Xi(u) du. Certainly, the “measurement errors,” ξ̂ijξij, affect the estimation of βj=βj(·) and mk in the model (3.1). However, it is different from a typical errors-in-variables problem since the measurement errors are vanishing as n. Elaborating more on this, we let L1L1,n now diverge as n. If E(||X||L2(S)4)< and the eigenvalues λj corresponding to ϕj satisfy the standard separation condition that λjλj+1Cjς for some constants C > 0 and ς>2 as in, for example, Hall and Horowitz (Citation2007), then Lemma 2.3 in Horváth and Kokoszka (Citation2012) assures that max1jLn||ϕ̂j(·)ϕj(·)||L2(S)=Op(n1/2·L1,nς). The latter implies that max1jLn|ξ̂ijξij|=Op(n1/2·L1,nς) for each i. If we further assume that E(exp(||X||L2(S)/C0))< for some constant C0>0, then we may show that max1inmax1jLn|ξ̂ijξij|=Op(n1/2·L1,nς·logn). Following the arguments as in, for example, Jeon and Van Bever (Citation2022), we expect that this would add an extra error of size Op(n1/2·L1,nς·logn) to the errors of β̂j and m̂k presented in Section 7.1.

The model (4.1), as it stands, assumes that the spaces for Xj are the same as H, the space for Y. However, the limitation is not essential. Basically, there is an isometric isomorphism that maps the space for Xj, say Hj, to H, provided that dim(Hj)=dim(H). Let it be called gj:HjH. Then, one may let Y depend on Xj, 1jp, via gj, so that one may postulate (4.9) Y=β0  j=1p(βjgj(Xj))  k=1dmk(Zk)  ε.(4.9)

The methodology of estimating the above model (4.9) and the associated theory follow directly from those for the model (4.1) by letting gj(Xj) in the model (4.9) take the roles of Xj in the model (4.1).

One may be also interested in the case where the covariates in the nonparametric part, now denoted by Zk for 1kd, are also Hilbertian. Here, we briefly discuss a method designed for finite-dimensional Zk. Dealing with infinite-dimensional Zk in our framework is infeasible since then one needs to estimate nonparametric functions defined on infinite-dimensional domains. Let Zk take values in a compact subset Ck of a qk-dimensional Hilbert space Hk. Then, one may consider a version of the model (3.1) or of the model (4.1) with mk now being defined as a map from Ck to H. For a general Hilbertian W, one may also obtain the following version of the system of equations at (2.5): (4.10) m̂W,j(zj)=M̂W,j(zj)kjCkf̂jk(zj,zk)f̂j(zj)m̂W,k(zk) dμk(zk),1jd.(4.10)

Here, μk is the pushforward measure induced by the qk-dimensional Lebesgue measure Lebqk given by μk(A)=Lebqk(ηk(A)) for AHk, and ηk:HkRqk is an isometric isomorphism. Also, M̂W,j,f̂j and f̂jk are kernel-based estimators of E(W|Zj=·), the density fj of Zj and fjk of (Zj,Zk), respectively. We refer to Jeon, Park, and Van Keilegom (Citation2021) for concrete forms of these estimators. Then, one is able to estimate the model with Hilbertian Zk in the same way as we estimate the models (3.1) and (4.1), solving the system of Equationequations (4.10) for various choices of W.

5 Simulation Studies

5.1 Bandwidth Selection

The construction of m̂Xj,+ or m̂Xj,+ for 1 ≤ j ≤ p and m̂Y,+ in computing β̂ requires choosing a set of bandwidths (hk : 1 ≤ k ≤ d). There is another set of bandwidths we need to select to construct m̂+, solving (2.5) with Wi = Yi ⊖ β̂0 ⊖ (Xi⊤ ⊙ β̂). In our theoretical development in Section 7, we allow these bandwidth sets to be different from each other. In our numerical study, however, we simply took m̂+(z) = m̂Y,+(z) ⊖ β̂0 ⊖ (m̂X,+(z)⊤ ⊙ β̂) from the already calculated β̂0, β̂, m̂Y,+ and m̂X,+ based on a set of bandwidths (hk : 1 ≤ k ≤ d). To choose the single bandwidth set, we used the CBS (Coordinate-wise Bandwidth Selection) algorithm introduced in Jeon and Park (2020), instead of a full-dimensional grid search. The algorithm is reproduced in the supplement S.3. However, for the approach of the HPLM without additive modeling introduced in Section 3.4, we used the bandwidths obtained from a full-dimensional grid search. The CBS algorithm is based on a cross-validation criterion. In our simulation study discussed below within Section 5, we used a 10-fold cross-validation, while in the real data applications presented in Section 6, we employed a 5-fold cross-validation. For numerical integration in the implementation of the iterative algorithm at (2.6), we used a trapezoidal rule with 101 equally spaced grid points.
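For reference, a single candidate bandwidth vector can be scored by a K-fold cross-validation criterion of the following kind; the CBS algorithm in the supplement S.3 then searches over such scores coordinate by coordinate. The snippet is a generic sketch: the fit_predict callable (e.g., the backfitting sketch of Section 2.2 combined with grid interpolation) and all names are our own assumptions.

```python
import numpy as np

def kfold_cv_score(Z, W, bandwidths, fit_predict, n_folds=10, seed=0):
    """K-fold cross-validation score for one candidate bandwidth vector.
       fit_predict(Z_tr, W_tr, bandwidths, Z_te) must return predictions of W at the rows of Z_te."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(Z.shape[0]), n_folds)
    score = 0.0
    for idx in folds:
        mask = np.ones(Z.shape[0], dtype=bool)
        mask[idx] = False
        pred = fit_predict(Z[mask], W[mask], bandwidths, Z[idx])
        score += np.sum((W[idx] - pred) ** 2)     # squared Hilbert norm, in coordinates
    return score / Z.shape[0]
```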

5.2 Data Generating Models

We conducted simulation studies for the model (3.1) with a density response Y ≡ Y(·) taking values in the Bayes-Hilbert space B([−1/2, 1/2]) defined in the supplement S.4. We considered several cases with p = 2 and d = 2 or 3. For the covariates Zj in the nonparametric part, we took Zj = Φ(Uj), 1 ≤ j ≤ 3, where Φ is the cumulative distribution function of N(0, 1) and U ≡ (U1, U2, U3) is a multivariate normal random vector with E(Uj) = 0, var(Uj) = 1 and cov(Uj, Uk) = ρ for all 1 ≤ j ≠ k ≤ 3. This allows for dependence among the Zj when ρ ≠ 0. For X1 in the parametric part, we took X1 = X1,α with X1,α = α·η1A(Z1,Z2) + √(1−α²)·η1B(Z1,Z2) + ϵX1, 0 ≤ α ≤ 1.

Here, ϵX1 is N(0,1/2) independent of Z and η1A(z1,z2)=72(z112)(z212),η1B(z1,z2)=3(z1+z2),(z1,z2)[0,1]2.

We note that η1A is nonadditive while η1B is additive such that (5.1) [0,1]2(η1A(z1,z2))2 dz1 dz2=[0,1]2(η1B(z1,z2))2 dz1 dz2=1/2,(5.1) (5.2) [0,1]2η1A(z1,z2)g+(z1,z2) dz1 dz2=0(5.2) for all g+:[0,1]2R with g+(z1,z2)=g1(z1)+g2(z2) for some gj. For X2, we set X2=η2(Z1,Z2)+ϵX2, where ϵX2 is N(0,1/2) independent of Z and ϵX1, and η2 is given by η2(z1,z2)=(2918)1/2(|z112|·|z212|14),(z1,z2)[0,1]2,

so that it is nonadditive with (5.3) [0,1]2η2(z1,z2)2 dz1 dz2=1/2.(5.3)

We note that, if α = 0, then E(X1,α|(Z1,Z2) = ·) = η1B = mX1,α,+ ∈ Hadd(R). Because of (5.2), η1A is perpendicular to Hadd(R) when ρ = 0, in which case Z1 and Z2 are independent with densities f1 = f2 = I[0,1](·), in the sense that E(η1A(Z1,Z2) g+(Z1,Z2)) = 0 for all g+ ∈ Hadd(R). This implies that mX1,α,+ = √(1−α²)·η1B when ρ = 0, and thus E(X1,α|(Z1,Z2) = ·) = α·η1A + √(1−α²)·η1B gets farther away from mX1,α,+ as α increases, so that α controls the degree of departure of E(X1,α|(Z1,Z2) = ·) from mX1,α,+. We also note that (5.1) and (5.3) give
(5.4) E(X1,α²) = E(X2²) = 1 when ρ = 0.
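A minimal sketch of the covariate generation described above, with the η functions passed in as callables since their normalizing constants are specified earlier in this subsection; the function names and the seed handling are our own choices, not the authors' code.

```python
import numpy as np
from scipy.stats import norm

def draw_covariates(n, alpha, rho, eta1A, eta1B, eta2, seed=0):
    """Draw (Z, X1, X2): Z_j = Phi(U_j) with equicorrelated standard normal U,
       X1 = alpha*eta1A(Z1,Z2) + sqrt(1-alpha^2)*eta1B(Z1,Z2) + N(0, 1/2) noise,
       X2 = eta2(Z1,Z2) + N(0, 1/2) noise."""
    rng = np.random.default_rng(seed)
    cov = np.full((3, 3), rho)
    np.fill_diagonal(cov, 1.0)
    U = rng.multivariate_normal(np.zeros(3), cov, size=n)
    Z = norm.cdf(U)                                   # Z_j = Phi(U_j), marginally uniform on [0, 1]
    eps1 = rng.normal(0.0, np.sqrt(0.5), size=n)
    eps2 = rng.normal(0.0, np.sqrt(0.5), size=n)
    X1 = alpha * eta1A(Z[:, 0], Z[:, 1]) + np.sqrt(1 - alpha ** 2) * eta1B(Z[:, 0], Z[:, 1]) + eps1
    X2 = eta2(Z[:, 0], Z[:, 1]) + eps2
    return Z, X1, X2
```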

We considered four models to generate Yi(·). The first model is given by
Model 1: Y(·) = (X1,α ⊙ β1(·)) ⊕ (X2 ⊙ β2(·)) ⊕ ⊕_{k=1}^2 mk(Zk, ·) ⊕ ε(·), ρ = 0.

For ε(·) in Model 1 and others below, we took ε(t)=ω(t)/1/21/2ω(u) du where ω(·) is a linear interpolation of ω(ti) for ti=1/2+(i1)/100,1i101 and ω(ti) are iid Lognormal(0,1), independent of (ϵX1,ϵX2) and (Z1,Z2,Z3). Put W(t)=logω(t). Then, ||ε(·)||B2=1/21/2(logω(t)1/21/2logω(u) du)2 dt1100i=1100W(ti)2(1100i=1100W(ti))2,where ||·||B is the norm for the space B defined through (S.6) in the supplementary materials. Since W(ti) are iid N(0, 1), we get E(||ε(·)||B2)99/1001. The coefficients βj(·), which themselves are densities on [1/2,1/2], are chosen as follows: β1(t)=[1/21/2(1+25sin(4uπ))C1 du]1(1+25sin(4tπ))C1·I(|t|1/2),β2(t)=[1/21/2(6595(|u|16)+)C2 du]1(6595(|t|16)+)C2·I(|t|1/2), where a+=max{a,0} and Cj are chosen so that ||βj(·)||B=1. Because of (5.4), it holds that E(||X1,αβ1(·)||B2)=E(||X2β2(·)||B2)=1whenρ=0.

The components m1 and m2 are also normalized. Specifically, m1(z1,t)=(1/21/2(2exp(z1)|u|)c1 du)1(2exp(z1)|t|)c1·I(|t|1/2),m2(z2,t)=(1/21/2cos(uπ/2)c2·(z2+z23) du)1cos(tπ/2)c2·(z2+z23)·I(|t|1/2),where c1 and c2 are chosen so that E(||mj(Zj,·)||B2)=1 for j = 1, 2.

As discussed briefly in Section 3.4 and demonstrated by our theory in Section 7.1, our estimator β̂ gets more stable than β̂PLM at (3.11), which ignores the additivity structure in the nonparametric part of the model, if ϱX − ϵX = mX(Z) − mX,+(Z) grows farther away from the zero vector. Put
(5.5) ΔPLAM = [E(ϱX ϱX⊤)]^{−1}, ΔPLM = [E(ϵX ϵX⊤)]^{−1}.

In Model 1, where ρ = 0, we may find exact formulas for ΔPLAM and ΔPLM. Indeed, when ρ = 0, we get E(ϱX1,α|Z) = α·η1A(Z1,Z2) from (5.2) and E(ϱX2|Z) = η2(Z1,Z2) − η2,+(Z1,Z2), where η2,+ is the projection of η2 onto Hadd(R). Since ϵX1 and ϵX2 are independent, ϱX1,α and ϱX2 are also independent conditional on Z. From these, we may find that the ith diagonal entries Δi,iPLAM and Δi,iPLM of ΔPLAM and ΔPLM, respectively, are:
(5.6) Δ1,1PLAM = 2/(2α²E(η1A(Z)²) + 1), Δ2,2PLAM = 2/(2E[(η2(Z) − η2,+(Z))²] + 1), Δ1,1PLM = 2, Δ2,2PLM = 2.

From the formula (5.6) and according to Theorem 7.1 with the discussion that follows, we get that, under Model 1, the asymptotic gains by β̂j against β̂jPLM are
(5.7) n^{−1}·(Δj,jPLM − Δj,jPLAM)·E(‖ε(·)‖B²) ≈ n^{−1}·(Δj,jPLM − Δj,jPLAM).

Through the simulation with various values of α under Model 1, we may see how these theoretical gains by β̂j take effect empirically.

For the second model we specialized X1,α to X1,1, that is, we took α = 1, and also fixed ρ = 1/2 in the generation of Zi. Other than that, it is the same as Model 1. The third and fourth models have the same parametric part as the second one. The third model has a nonadditive map in the nonparametric part, while the fourth has a three dimensional additive map. Specifically,
Model 2: Y(·) = (X1,1 ⊙ β1(·)) ⊕ (X2 ⊙ β2(·)) ⊕ ⊕_{k=1}^2 mk(Zk, ·) ⊕ ε(·), ρ = 1/2;
Model 3: Y(·) = (X1,1 ⊙ β1(·)) ⊕ (X2 ⊙ β2(·)) ⊕ m1,2(Z1,Z2, ·) ⊕ ε(·), ρ = 1/2;
Model 4: Y(·) = (X1,1 ⊙ β1(·)) ⊕ (X2 ⊙ β2(·)) ⊕ ⊕_{k=1}^3 mk(Zk, ·) ⊕ ε(·), ρ = 1/2.

Here, the density-valued maps m3 and m1,2 are given by m3(z3,t)=(1/21/2(1+u)c3·sin(2πz3) du)1(1+t)c3·sin(2πz3)·I(|t|1/2),m1,2(z1,z2,t)=(1/21/2exp(c1,2·u2log(1+z1/2+z2/2)) du)1·exp(c1,2·t2log(1+z1/2+z2/2))·I(|t|1/2),where c3 and c1,2 are chosen so that E(||m1,2(Z1,Z2,·)||B2)=E(||m3(Z3,·)||B2)=1. We considered Model 3 to see the sensitivity of our approach to the violation of additivity in the nonparametric part, and Model 4 to learn the effect of increased dimension of Z, in comparison with Model 2.

5.3 Simulation Results

We generated N = 500 pseudo samples of sizes n = 200 and 400 according to the four models specified in Section 5.2. We computed Monte Carlo approximations of
(5.8) MSE(β̂j) := E(‖β̂j(·) ⊖ βj(·)‖B²) = SB(β̂j) + var(β̂j), SB(β̂j) := ‖E(β̂j(·)) ⊖ βj(·)‖B², var(β̂j) := E(‖β̂j(·) ⊖ E(β̂j(·))‖B²)
and those for β̂jPLM. For the estimators of the nonparametric parts, we approximated
(5.9) MISE(m̂+) := ∫ E(‖m̂+(z,·) ⊖ m+(z,·)‖B²) dz = ISB(m̂+) + IV(m̂+), ISB(m̂+) := ∫ ‖E(m̂+(z,·)) ⊖ m+(z,·)‖B² dz, IV(m̂+) := ∫ E(‖m̂+(z,·) ⊖ E(m̂+(z,·))‖B²) dz
and those for the PLM, based on the 500 pseudo samples. The target m+ in this computation was the centered version of m1 ⊕ m2 or m1 ⊕ m2 ⊕ m3, that is, for Models 1 and 2, for example,
(5.10) m+(z,·) = [m1(z1,·) ⊖ E(m1(Z1,·))] ⊕ [m2(z2,·) ⊖ E(m2(Z2,·))].

The centering introduces the coefficient β0 = m0, which for Models 1 and 2 is given by m0(·) = E(m1(Z1,·)) ⊕ E(m2(Z2,·)).

Table 1 reports the values of the measures for Model 1. For Model 1, the theoretical gain by β̂2 against β̂2PLM does not depend on α, as shown by (5.6) and (5.7). This is confirmed by the MSE values in the column of β2 in the table. For β̂1 against β̂1PLM, however, we get from (5.1) and (5.7) that n^{−1}·(Δ1,1PLM − Δ1,1PLAM) = n^{−1}·2α²/(α²+1). The MSE values in the column of β1 are roughly in accordance with this formula. For example, in case α = 1 and n = 200, the theoretical value equals 5×10^{−3} while the corresponding empirical value is (10.63 − 5.17)×10^{−3} = 5.46×10^{−3}. The results in the table also demonstrate that our approach leads to more accurate estimation for the nonparametric part as well.

Table 1 Mean squared error (MSE), squared bias (SB), and variance of β̂j, as defined at (5.8), and mean integrated squared error (MISE), integrated squared bias (ISB), and integrated variance (IV), as defined at (5.9), under Model 1, multiplied by 103.

Comparing the values for different sample sizes, we see that the results in Table 1 obey the theoretical rates of convergence. The rate for the parametric part is n^{−1} for both the PLAM and the PLM, while those for the nonparametric part equal n^{−4/5} for our method and n^{−2/3} for the PLM, provided that the corresponding optimal bandwidth sizes are used. We find that, in estimating m0 and βj, the ratios of the MSE values for n = 400 against n = 200 roughly coincide with the theoretical value (400/200)^{−1} = 1/2. In estimating the nonparametric part m+, the ratios of the MISE values are nearly identical to the theoretical values (400/200)^{−4/5} ≈ 0.574 and (400/200)^{−2/3} ≈ 0.630, respectively, for our estimator and for the PLM. For instance, in case α = 0, the empirical ratio 26.06/44.11 ≈ 0.590 for our estimator is roughly the same as 0.574, while 42.66/68.10 ≈ 0.626 for the PLM approximates well the theoretical value 0.630.

Table 2 reports the values of the measures at (5.8) and (5.9) for Models 2–4. Recall that, in these models, X1 is the same as the one under Model 1 when α = 1, but Zj = Φ(Uj) are correlated to each other via the correlation ρ = 1/2 among the Uj. From the table, we first learn that our proposal continues to dominate the PLM when the covariates in the nonparametric part are correlated, when their effects are nonadditive, or when their number increases. The comparison between the values for Model 2 and those for Model 1 with α = 1 may reveal the effect of correlation among covariates. We observe that the presence of the correlation increased the values of MSE and MISE slightly for our estimators. For the PLM, in estimating βj, the asymptotic theory at (7.1) in Section 7.1 tells us that introducing correlation between the Zj does not affect the asymptotic MSE properties of β̂jPLM because ϵX1 and ϵX2 are independent of Z in our simulation setting. This is confirmed empirically by comparing the values corresponding to the PLM in the columns of βj for Model 2 in Table 2 with those for Model 1 (α = 1) in Table 1.

Table 2 Values of measures of performance for β̂j and m̂+ (m̂1,2) under Models 2, 3, and 4 where Zj are correlated, multiplied by 103.

Now, we note that Models 3 and 4 differ from Model 2 only in the nonparametric part. In the computation of the values for Model 3, the target m1,2 was actually the centered version, as at (5.10) for m+. The MSE values in the columns of βj for Model 3, in comparison with those for Model 2, demonstrate empirically the insensitivity of both β̂j and β̂jPLM to non-additivity in the nonparametric part, as asserted by Theorem 7.1 and the discussion that follows. Indeed, our theory tells us that ϱX = X − mX,+(Z) and ϵX = X − mX(Z) determine the asymptotic distributions of β̂ and β̂PLM, respectively, and these have nothing to do with the structure of the nonparametric part. As for the estimation of the nonparametric part, the MISE values of both our estimator and the PLM for Model 3 are smaller than the corresponding ones for Model 2. This is due to the fact that m1,2 is easier to estimate than m1 ⊕ m2. Comparing the values for Models 2 and 4, we see that the increased dimension of Z does not affect the precision of the estimators of βj. It increases the MISE values of the nonparametric part, moderately for our estimator but substantially for the PLM. This illustrates the dimensionality problem with the PLM approach. Finally, we find that the results in Table 2, like those in Table 1, are in accordance with the corresponding theoretical rates of convergence.

Tables 1 and 2 also report the average computing time, given a set of bandwidths chosen by a cross-validation criterion. The results indicate that the calculation of the Hilbertian PLAM estimators takes two to three times longer than the computation of the Hilbertian PLM estimators, except in the case of Model 4, where the covariate dimension in the nonparametric part is higher than in the other models. The comparison of computing time demonstrates how much extra computational time one needs to perform the smooth backfitting iteration, which the Hilbertian PLAM requires while the Hilbertian PLM does not.

6 Application to U.S. Presidential Election Data

We present a real data application for the model (4.1). Another real data application, for the model (3.1), can be found in the supplement S.6. It is believed that the underlying political orientation and population characteristics of a region affect an election result for the region. To validate this belief in the United States, we analyzed the 2020 U.S. presidential election data. We put the observed proportions of votes earned by the Democratic Party (Yi1), the Republican Party (Yi2) and the rest (Yi3) for the ith state into a vector Yi = (Yi1, Yi2, Yi3), and took the compositional vectors as the observations of the response Y. We considered these three categories since the first two parties received the major spotlight in the election. We note that Y satisfies the two constraints 0 < Yj < 1 and ∑_{j=1}^3 Yj = 1, and thus it takes values in the simplex S13 = {(a1,a2,a3) ∈ R^3 : 0 < aj < 1, ∑_{j=1}^3 aj = 1}, which is a two-dimensional Hilbert space as detailed in the supplement S.5. As a covariate, we considered a compositional vector X1 that comprises the proportions of votes earned by the same parties in the 2016 U.S. presidential election. We added three other covariates in the model building, which are the proportion of people who have a bachelor's or higher degree (Z1), per capita income (Z2) and median age (Z3). The observations on these socio-demographic variables Zj were obtained from https://www.census.gov/acs/www/data/data-tables-and-tools/data-profiles/2019/. The observed value Xi1 of the Hilbertian covariate X1 is considered to represent the political orientation of the ith state, and the values of the socio-demographic covariates, Zij for 1 ≤ j ≤ 3, measure the state's education, wealth and age levels, respectively.

We applied the Hilbertian partially linear additive regression approach based on the model (4.1) with the covariates X1, Z1, Z2 and Z3. In doing so, we applied trapezoidal rules based on two different grids to approximate the integrals in the iterative algorithm at (2.6): one with 101 equally spaced points (Dense) and the other with 21 (Sparse). We also used two different bandwidth selectors: one was the CBS bandwidth introduced in Section 5.1, and the other an optimally chosen bandwidth for density estimation (KDE) according to the R function "h.amise" in the R package "kedd (version 1.0.3)." Based on these options we implemented the three combinations: Dense-CBS, Sparse-CBS, and Dense-KDE. As a competing approach, we also chose the Hilbertian additive regression approach of Jeon and Park (2020), which uses only the scalar covariates (Z1, Z2, Z3), with the Dense-CBS scheme. We compared the prediction performance of these approaches via the leave-one-out average squared prediction error (ASPE) defined by n^{−1}∑_{i=1}^n ‖Yi ⊖ Ŷi^{(−i)}‖² = n^{−1}∑_{i=1}^n (2×3)^{−1}∑_{j=1}^3 ∑_{k=1}^3 (log(Yij/Yik) − log(Ŷij^{(−i)}/Ŷik^{(−i)}))², where n = 51 and Ŷi^{(−i)} ≡ (Ŷi1^{(−i)}, Ŷi2^{(−i)}, Ŷi3^{(−i)}) is the predicted value of Yi obtained without the ith observation. The above definition of the ASPE is based on the geometry of S13 given in the supplement S.5. The values of the ASPE for our approach with the Dense-CBS, Sparse-CBS, and Dense-KDE schemes were respectively 0.063, 0.063, and 0.070, while the application of Jeon and Park (2020) gave 0.198. Thus, our partially linear additive modeling approach incorporating the compositional covariate improved the prediction performance greatly. It turned out that the size of the grid in numerical integration does not have significant effects on the prediction performance. However, it turned out that the CBS scheme, which takes into account the values of Y, outperforms the scheme based on the KDE bandwidth.
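The ASPE above is an Aitchison-type squared distance on the simplex; a short sketch of its computation for given predictions, with a function name of our own choosing:

```python
import numpy as np

def aspe(Y, Y_pred):
    """Average squared prediction error of Section 6 for three-part compositions.
       Y, Y_pred: (n, 3) arrays of compositions with positive entries summing to one."""
    lr = lambda A: np.log(A[:, :, None] / A[:, None, :])       # log(A_j / A_k), shape (n, 3, 3)
    diff = lr(Y) - lr(Y_pred)
    return np.mean(np.sum(diff ** 2, axis=(1, 2)) / (2 * 3))
```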

To see the effects of the covariates, we estimated the model (4.1) using the whole dataset with the Dense-CBS scheme. We obtained β̂1 = 2.454. This large positive value demonstrates that the party supporting tendency in the 2020 U.S. election is tied strongly with that in the 2016 election. Thus, the empirical result provides strong evidence that the underlying political orientation is an important determinant for the presidential election. The estimated component maps are depicted in Figure 1, which visualizes the effects of the socio-demographic variables on the election results. A clear lesson from the estimated maps is that the approval rate for the Democratic candidate remains unchanged as education level or income level increases, while it decreases for the Republican candidate and increases for the group of other candidates. The effects of age do not seem to have a monotone pattern.

Fig. 1 The estimated component maps m̂k for the population characteristics based on the proposed method applied to the U.S. election data.


7 Theoretical Development

7.1 Asymptotic Theory: Scalar Covariates

Here, we discuss the case where Xi and Zi take values in R^p and [0,1]^d, respectively. We assume that (Xi, Zi, εi) for 1 ≤ i ≤ n are iid copies of (X, Z, ε) such that E(ε|X,Z) = 0. Recall that mXj,+, for each 1 ≤ j ≤ p, is the solution of (2.3) with W = Xj, that mX,+ = (mX1,+, …, mXp,+) : [0,1]^d → R^p, ϱXj = Xj − mXj,+(Z) and ϱX = (ϱX1, …, ϱXp). We let (hk : 1 ≤ k ≤ d) denote the bandwidth set we use in the construction of m̂X,+ and m̂Y,+ for β̂ and β̂0. Recall m̂+ = m̂Y⊖β̂0⊖(X⊤⊙β̂),+, which solves (2.5) with Wi = Yi ⊖ β̂0 ⊖ (Xi⊤ ⊙ β̂). We let (bk : 1 ≤ k ≤ d) denote the bandwidth set we use in the construction of m̂+ with β̂ and β̂0 at hand.

7.1.1 Technical Assumptions and Invertibility of n^{−1}∑_{i=1}^n X̃i X̃i⊤

We collect the technical assumptions we use for our theoretical development.

  • (A1) The joint density f is bounded away from zero and infinity on its support [0,1]^d and all fjk are continuously differentiable on [0,1]².

  • (A2) The real-valued additive functions mXj,+ are twice continuously differentiable on [0,1]^d and the H-valued mk are twice continuously Fréchet differentiable on [0,1].

  • (A3) The Hilbertian error variable ε satisfies E(‖ε‖^α) < ∞ for some α > 12/5 when H is finite-dimensional and α > 5/2 when H is infinite-dimensional. Also, there exists a constant 0 < C1 < ∞ such that E(‖ε‖²|X,Z) ≤ C1 a.s.

  • (A4) The p × p matrix E(ϱX ϱX⊤) is invertible.

  • (A5) There exist constants 0 < L, C2 < ∞ such that max_{1≤j≤p} E(exp(|Xj|/L)|Z) ≤ C2 a.s.

  • (A6) The baseline kernel K is symmetric and positive on (−1,1), vanishes on R∖(−1,1), and its first derivative is Lipschitz continuous.

  • (A7) The bandwidths hk satisfy hk ≍ n^{−γ} for some γ with 1/6 < γ < min{1/3, (α−2)/α} when H is finite-dimensional, and with 1/5 < γ < min{1/3, (α−2)/α} when H is infinite-dimensional.

  • (A8) The bandwidths bk satisfy bk → 0 and liminf_{n→∞} n^{ck} bk > 0 for some constants ck < (α−2)/α for all 1 ≤ k ≤ d, and (n bj bk)^{−1} log n → 0 for all 1 ≤ j ≠ k ≤ d.

The conditions (A1) and (A2) for mk are typically assumed in the smooth backfitting literature; see Jeon and Park (2020). The moment conditions at (A3) ensure (α−2)/α > 1/6 for finite-dimensional H and (α−2)/α > 1/5 for infinite-dimensional H, so that the bandwidth ranges in (A7) make sense. The range for infinite-dimensional H does not cover the optimal size for univariate nonparametric smoothing, which is n^{−1/5}, but we note that the bandwidths hk are for estimating the parametric part of the model, not for the nonparametric part. For the latter, we use the bk, for which we assume (A8). The condition (A4) ensures that the matrix n^{−1}∑_{i=1}^n X̃i X̃i⊤ is invertible with probability tending to one, which we detail below. The exponential moment condition (A5) is assumed in order to use empirical process theory (e.g., van de Geer 2000) in the technical arguments for β̂j. The condition (A6) is stronger than the typical one in kernel smoothing that K itself is Lipschitz continuous. The stronger condition is required to make our estimator m̂+ smoother, so that it belongs to an H-valued function class with a proper entropy.

We now state a proposition that demonstrates the invertibility of n1i=1nX˜iX˜i.

Proposition 7.1.

Assume (A1), (A2) for mXj,+ and (A4)–(A7). Then, n^{−1}∑_{i=1}^n X̃i X̃i⊤ = E(ϱX ϱX⊤) + Op(n^{−2γ} + n^{−(1−γ)/2} log n), so that n^{−1}∑_{i=1}^n X̃i X̃i⊤ is invertible with probability tending to one.

7.1.2 Estimation of Parametric Components

We present the limit distribution of β̂ defined at (3.7). For this we need to introduce several terminologies. First, we let H^p be equipped with an inner product ⟨·,·⟩tp defined by ⟨a,b⟩tp = ∑_{j=1}^p ⟨aj,bj⟩ for a = (a1, …, ap) and b = (b1, …, bp) in H^p. Let ‖·‖tp be the associated norm such that ‖a‖tp² = ∑_{j=1}^p ‖aj‖². For e ∈ H, define e ⊗ e : H^p → H^p, an outer product in H^p, by (e ⊗ e)(a) = (⟨e,a1⟩ ⊙ e, …, ⟨e,ap⟩ ⊙ e).

For a general H^p-valued random element V, a linear operator S : H^p → H^p such that cov(⟨V,a⟩tp, ⟨V,b⟩tp) = ⟨S(a), b⟩tp, a, b ∈ H^p, is called the covariance operator of V. For a random vector U taking values in R^p, cov(⟨U ⊙ ε, a⟩tp, ⟨U ⊙ ε, b⟩tp) = ⟨E[U U⊤ ⊙ (ε ⊗ ε)(a)], b⟩tp, so that E[U U⊤ ⊙ (ε ⊗ ε)] : H^p → H^p, defined by E[U U⊤ ⊙ (ε ⊗ ε)](a) = E[U U⊤ ⊙ (ε ⊗ ε)(a)], is the covariance operator of U ⊙ ε. Let G(0p, S) denote a Gaussian random element taking values in H^p with mean 0p = (0, …, 0) ∈ H^p and covariance operator S, that is, the real-valued random variable ⟨G(0p, S), a⟩tp for any a ∈ H^p is normally distributed with mean 0 and variance ⟨S(a), a⟩tp. Let Σ = E[(E ϱX ϱX⊤)^{−1}·ϱX ϱX⊤·(E ϱX ϱX⊤)^{−1} ⊙ (ε ⊗ ε)], which is the covariance operator of (E ϱX ϱX⊤)^{−1} ϱX ⊙ ε. Note that, in case ε is independent of (X,Z), Σ reduces to (E ϱX ϱX⊤)^{−1} ⊙ E(ε ⊗ ε).

Theorem 7.1.

Under the assumptions (A1)–(A7), it holds that √n ⊙ (β̂ ⊖ β) = n^{−1/2} ⊙ (⊕_{i=1}^n (E ϱX ϱX⊤)^{−1} ϱX,i ⊙ εi) + op(1) →d G(0p, Σ).

The following corollary for β̂0 = Ȳ ⊖ (⊕_{j=1}^p X̄j ⊙ β̂j) is an immediate consequence of Theorem 7.1. Recall that β0 = E(Y) ⊖ (⊕_{j=1}^p E(Xj) ⊙ βj).

Corollary 7.1.

Under the assumptions (A1)–(A7), it holds that ‖β̂0 ⊖ β0‖ = Op(n^{−1/2}).

7.1.3 Comparison with HPLM

We make a theoretical comparison of β̂ and β̂PLM, the latter of which is defined at (3.11). Let ΣPLM = E[(E ϵX ϵX⊤)^{−1}·ϵX ϵX⊤·(E ϵX ϵX⊤)^{−1} ⊙ (ε ⊗ ε)]. Then, along the lines of the proof of Theorem 7.1 in the supplement S.11, we may prove that, under suitable conditions,
(7.1) √n ⊙ (β̂PLM ⊖ β) →d G(0p, ΣPLM).

A special case of (7.1) for H = R and d = 1 has been derived by Liang (2006). The result (7.1) for β̂PLM is valid under the PLM and thus remains true under the PLAM as well. In case ε is independent of (X,Z), ΣPLM reduces to (E ϵX ϵX⊤)^{−1} ⊙ E(ε ⊗ ε). The following definition is an extension of the classical notion of “asymptotic efficiency” for real-valued parameters to that for Hilbertian parameters.

Definition 7.1.

Let θ̂1 and θ̂2 be estimators of a parameter θ in a separable Hilbert space H such that sn(θ̂j  θ)G(0,Σj) for a sequence sn and covariance operators Σj for j = 1, 2. We say that θ̂1 is asymptotically more efficient than θ̂2 if Σ1Σ2 is a nonnegative definite operator, that is, (Σ1Σ2)(h),h0 for all hH.

We note that (E ϵX ϵX⊤)^{−1} − (E ϱX ϱX⊤)^{−1} is nonnegative definite and it is zero only if mX = mX,+. A direct computation shows that
(7.2) ⟨(ΣPLM − Σ)(a), a⟩tp = E[(⟨ε,a1⟩, …, ⟨ε,ap⟩)·((E ϵX ϵX⊤)^{−1} − (E ϱX ϱX⊤)^{−1})·(⟨ε,a1⟩, …, ⟨ε,ap⟩)⊤],
which implies that β̂ is asymptotically more efficient than β̂PLM in the sense of Definition 7.1. Let dj denote the jth diagonal entry of (E ϵX ϵX⊤)^{−1} − (E ϱX ϱX⊤)^{−1}. Then, under the PLAM (3.1), it holds that n·(‖β̂jPLM ⊖ βj‖² − ‖β̂j ⊖ βj‖²) = Snj + op(1) with E(Snj) = dj·E(‖ε‖²), which is the gain by β̂j against β̂jPLM under (3.1).

7.1.4 Estimation of Nonparametric Components

As we discussed in Section 3.3, m̂+ is uniquely decomposed into m̂+ = ⊕_{j=1}^d m̂j with the constraints at (3.8). We study the error rates for m̂+ and m̂k. The asymptotic results presented here do not follow immediately from the results in Jeon and Park (2020) since Yi ⊖ β̂0 ⊖ (Xi⊤ ⊙ β̂) ≠ m+(Zi) ⊕ ε. Furthermore, as seen in the assumption (A8), our work allows for a flexible range for the bandwidths bk, instead of assuming bk ≍ n^{−1/5} as in Jeon and Park (2020). Put κk(τ) = bk^τ + (nbk)^{−1/2} log n for τ = 1, 2 and 1 ≤ k ≤ d. Also, let δk = max_{l≠k} bl² + n^{−1/2}·(min_{l≠k} bl)^{−1/2}.

Theorem 7.2.

Assume (A1)–(A8). Then, it holds that, for each $1\le k\le d$,
$\sup_{z_k\in[2b_k,\,1-2b_k]}\|\hat m_k(z_k)-m_k(z_k)\|=O_p(\kappa_k(2)+\delta_k)$, $\quad\sup_{z_k\in[0,1]}\|\hat m_k(z_k)-m_k(z_k)\|=O_p(\kappa_k(1)+\delta_k)$.

Consequently,
$\sup_{z\in[2b_1,\,1-2b_1]\times\cdots\times[2b_d,\,1-2b_d]}\|\hat m_+(z)-m_+(z)\|=O_p(\max_{1\le k\le d}\kappa_k(2))$, $\quad\sup_{z\in[0,1]^d}\|\hat m_+(z)-m_+(z)\|=O_p(\max_{1\le k\le d}\kappa_k(1))$.

In the case where $b_k\asymp n^{-1/5}$ for all $1\le k\le d$, the rates in Theorem 7.2 reduce to the usual ones for univariate smoothing. Indeed, for each $1\le k\le d$,
$\sup_{z_k\in[2b_k,\,1-2b_k]}\|\hat m_k(z_k)-m_k(z_k)\|=O_p(n^{-2/5}\log n)$, $\quad\sup_{z_k\in[0,1]}\|\hat m_k(z_k)-m_k(z_k)\|=O_p(n^{-1/5})$.
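
This reduction can be checked directly from the definitions of $\kappa_k(\tau)$ and $\delta_k$ (a routine calculation, not an additional result): with $b_k=c_kn^{-1/5}$ for constants $c_k>0$,
$$\kappa_k(2)=c_k^2n^{-2/5}+c_k^{-1/2}n^{-2/5}\log n=O(n^{-2/5}\log n),\qquad \kappa_k(1)=c_kn^{-1/5}+c_k^{-1/2}n^{-2/5}\log n=O(n^{-1/5}),$$
$$\delta_k=\max_{l\ne k}c_l^2\,n^{-2/5}+\big(\min_{l\ne k}c_l\big)^{-1/2}n^{-2/5}=O(n^{-2/5}),$$
so that $\kappa_k(2)+\delta_k=O(n^{-2/5}\log n)$ and $\kappa_k(1)+\delta_k=O(n^{-1/5})$, as displayed above.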

We also derive the asymptotic distributions of m̂+ and m̂k, which we defer to the supplement S.15.

7.2 Asymptotic Theory: Hilbertian Covariates

Here, we discuss the case where the covariates in the linear part of the model, which we denoted by $X_j$ for $1\le j\le p$, take values in $H$. Define now $\varrho_{X_j}=X_j-m_{X_j,+}(Z)$. Recall that $\mathrm{E}(\langle\varrho_X,\varrho_X\rangle)$ is the $p\times p$ matrix whose $(j,k)$ element equals $\mathrm{E}(\langle\varrho_{X_j},\varrho_{X_k}\rangle)$. We first introduce some technical assumptions for Hilbertian $X_j$. We note that the assumptions (A1), (A6), and (A8) in Section 7.1 still apply in the present case, which we call (B1), (B6), and (B8), respectively, here. The corresponding versions of the others are given below.

  • (B2) The $H$-valued $m_{X_j,+}$ are twice continuously Fréchet differentiable on $[0,1]^d$ and the $H$-valued $m_k$ at (4.1) are twice continuously Fréchet differentiable on $[0,1]$.

  • (B3) The Hilbertian error variable $\varepsilon$ at (4.1) satisfies $\mathrm{E}(\|\varepsilon\|^\alpha)<\infty$ for some $\alpha>12/5$. Also, there exists a constant $0<C_1<\infty$ such that $\mathrm{E}(\|\varepsilon\|^2\,|\,X,Z)\le C_1$ a.s.

  • (B4) The $p\times p$ matrix $\mathrm{E}(\langle\varrho_X,\varrho_X\rangle)$ is invertible.

  • (B5) There exist constants $0<L,C_2<\infty$ such that $\max_{1\le j\le p}\mathrm{E}(\exp(\|X_j\|/L)\,|\,Z)\le C_2$ a.s.

  • (B7) The bandwidths $h_k$ satisfy $h_k\asymp n^{-\gamma}$ for some $\gamma$ with $1/6<\gamma<\min\{1/3,(\alpha-2)/\alpha\}$.

The conditions (B3) and (B7) are the same as (A3) and (A7), respectively, for finite-dimensional $H$. In fact, for infinite-dimensional $H$, we are able to derive a slower rate for $\hat\beta$ and $\hat\beta_0$; see the second parts of Theorem 7.3 and Corollary 7.2 below. For such a rate of convergence, we find that (B3) and (B7) are sufficient. The following proposition is an analogue of Proposition 7.1 for Hilbertian $X_j$.
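
A small arithmetic remark (not an additional assumption) clarifies how (B3) and (B7) fit together: since $1/3>1/6$, the range for $\gamma$ in (B7) is nonempty exactly when $(\alpha-2)/\alpha>1/6$, that is,
$$\frac{\alpha-2}{\alpha}>\frac{1}{6}\iff 6(\alpha-2)>\alpha\iff\alpha>\frac{12}{5},$$
which matches the moment condition in (B3); for instance, $\alpha=4$ gives the admissible range $1/6<\gamma<1/3$.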

Proposition 7.2.

Assume (B1), (B2) for $m_{X_j,+}$, and (B4)–(B7). Then, $n^{-1}\sum_{i=1}^n\langle\tilde X_i,\tilde X_i\rangle=\mathrm{E}(\langle\varrho_X,\varrho_X\rangle)+O_p(n^{-2\gamma}+n^{-(1-\gamma)/2}\log n)$, so that $n^{-1}\sum_{i=1}^n\langle\tilde X_i,\tilde X_i\rangle$ is invertible with probability tending to one.

We now derive the asymptotic properties of $\hat\beta$. Note that $\mathrm{E}(\langle\varrho_X,\varepsilon\rangle\langle\varepsilon,\varrho_X\rangle)$ is the $p\times p$ matrix whose $(j,k)$ element is given by $\mathrm{E}(\langle\varrho_{X_j},\varepsilon\rangle\langle\varrho_{X_k},\varepsilon\rangle)$. Define $\Sigma_p=(\mathrm{E}\langle\varrho_X,\varrho_X\rangle)^{-1}\cdot\mathrm{E}(\langle\varrho_X,\varepsilon\rangle\langle\varepsilon,\varrho_X\rangle)\cdot(\mathrm{E}\langle\varrho_X,\varrho_X\rangle)^{-1}$.

Let $N_p(0_p,\Sigma_p)$ denote the multivariate normal distribution with mean $0_p=(0,\ldots,0)^\top\in\mathbb{R}^p$ and covariance matrix $\Sigma_p$.
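
A plug-in estimate of $\Sigma_p$ involves only the inner products appearing in these definitions. The sketch below (an illustration with hypothetical array names, not code from the article) assumes each $\varrho_{X_j,i}$ and each residual $\varepsilon_i$ is represented by a coefficient vector in $\mathbb{R}^q$, for example after truncating a basis expansion of $H$.

import numpy as np

def plug_in_Sigma_p(rho_X, eps):
    # rho_X: (n, p, q) array of coefficient vectors of the estimated rho_{X_j,i};
    # eps: (n, q) array of coefficient vectors of the residuals. Returns p x p.
    n = rho_X.shape[0]
    G = np.einsum('ijq,ikq->jk', rho_X, rho_X) / n   # matrix of E<rho_Xj, rho_Xk>
    v = np.einsum('ijq,iq->ij', rho_X, eps)          # entries <rho_{Xj,i}, eps_i>
    V = v.T @ v / n                                  # matrix of E<rho_Xj,eps><rho_Xk,eps>
    G_inv = np.linalg.inv(G)
    return G_inv @ V @ G_inv                         # plug-in estimate of Sigma_p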

Theorem 7.3.

Assume (B1)–(B7). If $H$ is finite-dimensional, then it holds that $\sqrt{n}(\hat\beta-\beta)=n^{-1/2}\cdot\sum_{i=1}^n(\mathrm{E}\langle\varrho_X,\varrho_X\rangle)^{-1}\langle\varrho_{X,i},\varepsilon_i\rangle+o_p(1)\;\overset{d}{\to}\;N_p(0_p,\Sigma_p)$.

If $H$ is infinite-dimensional, then $\hat\beta-\beta=O_p(n^{-2\gamma}\log n+n^{-(1-\gamma)/2}(\log n)^{3/2})$.

The following corollary for $\hat\beta_0=\bar Y-\sum_{j=1}^p\hat\beta_j\bar X_j$ is an immediate consequence of Theorem 7.3. Recall that $\beta_0=\mathrm{E}(Y)-\sum_{j=1}^p\beta_j\mathrm{E}(X_j)$.

Corollary 7.2.

Assume (B1)–(B7). If $H$ is finite-dimensional, then $\|\hat\beta_0-\beta_0\|=O_p(n^{-1/2})$. If $H$ is infinite-dimensional, then $\|\hat\beta_0-\beta_0\|=O_p(n^{-2\gamma}\log n+n^{-(1-\gamma)/2}(\log n)^{3/2})$.

Next, we present the asymptotic properties of $\hat m_+$ and $\hat m_k$ satisfying the constraints at (3.8). As in Section 7.1, we let $(b_k:1\le k\le d)$ denote the set of bandwidths for defining (2.5) with $W_i=Y_i-\hat\beta_0-\sum_{j=1}^p(\hat\beta_jX_{ij})$. Recall $\kappa_k(\tau)=b_k^\tau+(nb_k)^{-1/2}\log n$ and $\delta_k=\max_{l\ne k}b_l^2+n^{-1/2}\cdot(\min_{l\ne k}b_l)^{-1/2}$. Put $\tilde\delta_k=\delta_k$ if $H$ is finite-dimensional, and $\tilde\delta_k=\delta_k+n^{-2\gamma}(\log n)^2+n^{-(1-\gamma)/2}(\log n)^{5/2}$ if $H$ is infinite-dimensional.

Theorem 7.4.

Assume (B1)–(B8). Then, Theorem 7.2 continues to hold with $\delta_k$ replaced by $\tilde\delta_k$.

We also derive the asymptotic distributions of m̂+ and m̂k in the case of Hilbertian Xj, which we defer to the supplement S.18.

Supplementary Materials

The supplementary material contains the description of the CBS algorithm, the geometries of Bayes-Hilbert spaces and simplices, and an additional real data application. It also includes two additional theorems for the asymptotic distributions of the estimators of the nonparametric components of the partially linear additive models, and all technical proofs.


Acknowledgments

The authors thank an Associate Editor and two referees for giving thoughtful and constructive comments on the earlier versions of the article.

Additional information

Funding

Byeong U. Park’s research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2019R1A2C3007355). Kyusang Yu’s research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2021R1A4A5032622).

References

  • Benatia, D., Carrasco, M., and Florens, J.-P. (2017), “Functional Linear Regression with Functional Response,” Journal of Econometrics, 201, 269–291. DOI: 10.1016/j.jeconom.2017.08.008.
  • Bhattacharya, P. K., and Zhao, P.-L. (1997), “Semiparametric Inference in a Partial Linear Model,” The Annals of Statistics, 25, 244–262. DOI: 10.1214/aos/1034276628.
  • Bissantz, N., Dette, H., Hildebrandt, T., and Bissantz, K. (2016), “Smooth Backfitting in Additive Inverse Regression,” Annals of the Institute of Statistical Mathematics, 68, 827–853. DOI: 10.1007/s10463-015-0517-x.
  • Bosq, D. (2000), Linear Processes in Function Spaces, New York: Springer.
  • Buja, A., Hastie, T., and Tibshirani, R. (1989), “Linear Smoothers and Additive Models,” (with discussion), The Annals of Statistics, 17, 453–555. DOI: 10.1214/aos/1176347115.
  • Hall, P., and Horowitz, J. L. (2007), “Methodology and Convergence Rates for Functional Linear Regression,” The Annals of Statistics, 35, 70–91. DOI: 10.1214/009053606000000957.
  • Han, K., and Park, B. U. (2018), “Smooth Backfitting for Errors-in-Variables Additive Models,” The Annals of Statistics, 46, 2216–2250. DOI: 10.1214/17-AOS1617.
  • Han, K., Müller, H.-G., and Park, B. U. (2020), “Additive Functional Regression for Densities as Responses,” Journal of the American Statistical Association, 115, 997–1010. DOI: 10.1080/01621459.2019.1604365.
  • Horváth, L., and Kokoszka, P. (2012), Inference for Functional Data with Applications, New York: Springer.
  • Jeon, J. M., and Park, B. U. (2020), “Additive Regression with Hilbertian Responses,” The Annals of Statistics, 48, 2671–2697. DOI: 10.1214/19-AOS1902.
  • Jeon, J. M., Park, B. U., and Van Keilegom, I. (2021), “Additive Regression with Non-Euclidean Responses and Predictors,” The Annals of Statistics, 49, 2611–2641. DOI: 10.1214/21-AOS2048.
  • Jeon, J. M., and Van Bever, G. (2022), “Additive Regression with General Imperfect Variables,” arXiv:2212.05745.
  • Lee, E. R., Han, K., and Park, B. U. (2018), “Estimation of Errors-in-Variables Partially Linear Additive Models,” Statistica Sinica, 28, 2353–2373.
  • Lee, Y. K., Mammen, E., and Park, B. U. (2010), “Backfitting and Smooth Backfitting for Additive Quantile Models,” The Annals of Statistics, 38, 2857–2883. DOI: 10.1214/10-AOS808.
  • Lee, Y. K., Mammen, E., and Park, B. U. (2012), “Flexible Generalized Varying Coefficient Regression Models,” The Annals of Statistics, 40, 1906–1933.
  • Liang, H. (2006), “Estimation in Partially Linear Models and Numerical Comparisons,” Computational Statistics & Data Analysis, 50, 675–687. DOI: 10.1016/j.csda.2004.10.007.
  • Liang, H., Thurston, S. W., Ruppert, D., Apanasovich, T., and Hauser, R. (2008), “Additive Partially Linear Models with Measurement Errors,” Biometrika, 95, 667–678. DOI: 10.1093/biomet/asn024.
  • Lin, Z., Müller, H.-G., and Park, B. U. (2022), “Additive Models for Symmetric Positive-Definite Matrices and Lie Groups,” Biometrika (to appear). DOI: 10.1093/biomet/asac055.
  • Linton, O., Sperlich, S., and van Keilegom, I. (2008), “Estimation of a Semiparametric Transformation Model,” The Annals of Statistics, 36, 686–718. DOI: 10.1214/009053607000000848.
  • Mammen, E., Linton, O., and Nielsen, J. P. (1999), “The Existence and Asymptotic Properties of a Backfitting Projection Algorithm under Weak Conditions,” The Annals of Statistics, 27, 1443–1490. DOI: 10.1214/aos/1017939138.
  • Opsomer, J. D., and Ruppert, D. (1997), “Fitting a Bivariate Additive Model by Local Polynomial Regression,” The Annals of Statistics, 25, 186–211. DOI: 10.1214/aos/1034276626.
  • Opsomer, J. D., and Ruppert, D. (1999), “A Root-n Consistent Backfitting Estimator for Semiparametric Additive Modeling,” Journal of Computational and Graphical Statistics, 8, 715–732.
  • Ramsay, J. O., and Silverman, B. W. (2005), Functional Data Analysis, New York: Springer.
  • Severini, T., and Wong, W. (1992), “Profile Likelihood and Conditionally Parametric Models,” The Annals of Statistics, 20, 1768–1802. DOI: 10.1214/aos/1176348889.
  • van de Geer, S. (2000), Empirical Processes in M-Estimation, Cambridge: Cambridge University Press.
  • Yao, F., Müller, H.-G., and Wang, J.-L. (2005), “Functional Linear Regression Analysis for Longitudinal Data,” The Annals of Statistics, 33, 2873–2903. DOI: 10.1214/009053605000000660.
  • Yu, K., Mammen, E., and Park, B. U. (2011), “Semi-Parametric Regression: Efficiency Gains from Modeling the Nonparametric Part,” Bernoulli, 17, 736–748. DOI: 10.3150/10-BEJ296.
  • Yu, K., Park, B. U., and Mammen, E. (2008), “Smooth Backfitting in Generalized Additive Models,” The Annals of Statistics, 36, 228–260. DOI: 10.1214/009053607000000596.
  • Zhang, X., Park, B. U., and Wang, J.-L. (2013), “Time-Varying Additive Models for Longitudinal Data,” Journal of the American Statistical Association, 108, 983–998. DOI: 10.1080/01621459.2013.778776.