Abstract
Data analysis in modern scientific research and practice has shifted from analysing a single dataset to coupling several datasets. We propose and study a kernel regression method that can handle the challenge of heterogeneous populations. It greatly extends the constrained kernel regression [Dai, C.-S., & Shao, J. (2023). Kernel regression utilizing external information as constraints. Statistica Sinica, 33, in press], which requires that the different datasets come from a homogeneous population. The asymptotic normality of the proposed estimators is established under some conditions, and simulation results are presented to confirm our theory and to quantify the improvements gained from datasets with heterogeneous populations.
1. Introduction
With advanced technologies in data collection and storage, in modern statistical analyses we have not only a primary random sample from a population of interest, which results in a dataset referred to as the internal dataset, but also some independent external datasets from sources such as past investigations and publicly available datasets. In this paper, we consider nonparametric kernel regression (Bierens, 1987; Wand & Jones, 1994; Wasserman, 2006) between a univariate response Y and a covariate vector (X, Z) from a sampled subject, using the internal dataset with help from independent external datasets. Specifically, we consider kernel estimation of the conditional expectation (regression function) of Y given (X, Z) = (x, z) under the internal data population,
$$\mu_1(x, z) = E(Y \mid X = x, Z = z, D = 1), \qquad (1)$$
where D = 1 indicates the internal population and (x, z) is a fixed point in the range of (X, Z). The indicator D can be either random or deterministic. The subscript 1 in $\mu_1$ emphasizes that it is for the internal data population (D = 1), which may be different from $E(Y \mid X = x, Z = z)$, a mixture of quantities from the internal and external data populations.
When external datasets also have measurements of Y, X, and Z, we may simply combine the internal and external datasets if the populations for internal and external data are identical (homogeneous). However, heterogeneity typically exists among the populations for different datasets, especially when there are multiple external datasets collected in different ways and/or different time periods. In Section 2, we propose a method to handle heterogeneity among different populations and derive a kernel regression estimator more efficient than the one using internal data alone. The result is also a crucial building block for the more complicated case in Section 3, where external datasets contain fewer measured covariates, as described next.
In applications, it often occurs that an external dataset has measured Y and X from each subject but not Z, i.e., some components of the covariate vector are not measured due to high measurement cost, the progress of technology, and/or scientific relevance. With Z unmeasured, the external dataset cannot be directly used to estimate $\mu_1(x, z)$ in (1), since conditioning on the entire (X, Z) is involved. To solve this problem, Dai and Shao (2023) propose a two-step kernel regression using external information as a constraint to improve kernel regression based on internal data alone, following the idea of using constraints in Chatterjee et al. (2016) and H. Zhang et al. (2020). However, these three cited papers mainly assume that the internal and external datasets share the same population, which may be unrealistic. The challenge in dealing with heterogeneity among different populations is similar to the difficulty in handling nonignorable missing data if the unmeasured Z is treated as missing data, although in missing data problems we usually want to estimate $E(Y \mid X = x, Z = z)$ rather than $\mu_1(x, z)$ in (1).
In Section 3, we develop a methodology to handle population heterogeneity for internal and external datasets, which extends the procedure in Dai and Shao (2023) to heterogeneous populations and greatly widens its application scope.
Under each scenario, we derive in Section 4 the asymptotic normality of the proposed kernel estimators and obtain explicit asymptotic variances, which is important for large-sample inference. Some simulation results are presented in Section 5 to compare the finite-sample performance of several estimators. Discussions of extensions and of handling high-dimensional covariates are given in Section 6. All technical details are in the Appendix.
Our research fits into the general framework of data integration (Kim et al., 2021; Lohr & Raghunathan, 2017; Merkouris, 2004; Rao, 2021; Yang & Kim, 2020; Y. Zhang et al., 2017).
2. Efficient kernel estimation by combining datasets
The internal dataset contains observations $(Y_i, X_i, Z_i)$, $i = 1, \dots, n$, independent and identically distributed (iid) from $P_1$, the internal population of (Y, X, Z), where Y is the response and (X, Z) is a p-dimensional covariate vector associated with Y. We are interested in the estimation of the conditional expectation $\mu_1(x, z)$ in (1). The standard kernel regression estimator of $\mu_1(x, z)$ based on the internal dataset alone is
$$\hat\mu(x, z) = \frac{\sum_{i=1}^{n} Y_i\,\kappa_{b,i}(x, z)}{\sum_{i=1}^{n} \kappa_{b,i}(x, z)}, \qquad (2)$$
where $\kappa_{b,i}(x, z) = \kappa\{(x - X_i)/b,\ (z - Z_i)/b\}$, κ is a given kernel function on the range of (X, Z), and b>0 is a bandwidth depending on n. We assume that (X, Z) is standardized so that the same bandwidth b is used for every component of (X, Z) in kernel regression. Because of the well-known curse of dimensionality for kernel-type methods, we focus on a low dimension p not varying with n. A discussion of handling a large-dimensional covariate vector is given in Section 6.
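To make (2) concrete, the following is a minimal sketch of the Nadaraya–Watson estimator with a Gaussian product kernel; the function name, the kernel choice, and the toy data are illustrative assumptions of this sketch, not part of the paper.

```python
import numpy as np

def nw_kernel_regression(y, v, v0, b):
    """Nadaraya-Watson estimator (2): y is the (n,) response vector, v is the
    (n, p) matrix of standardized covariates (X, Z), v0 is the (p,) point
    (x, z), and b > 0 is the common bandwidth."""
    u = (v0 - v) / b                           # scaled differences, (n, p)
    k = np.exp(-0.5 * np.sum(u ** 2, axis=1))  # Gaussian product kernel
    return np.sum(y * k) / np.sum(k)           # locally weighted mean of Y

# Toy internal data with p = 2 standardized covariates.
rng = np.random.default_rng(0)
v = rng.normal(size=(200, 2))
y = np.sin(v[:, 0]) + v[:, 1] ** 2 + rng.normal(size=200)
print(nw_kernel_regression(y, v, np.array([0.0, 0.0]), b=0.4))
```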
We consider the case with one external dataset, independent of the internal dataset. Extension to multiple external datasets is straightforward and discussed in Section 6.
In this section, we consider the situation where the external dataset contains iid observations $(Y_i, X_i, Z_i)$, $i = n+1, \dots, N$, from $P_0$, the external population of (Y, X, Z).
2.1. Combining data from homogeneous populations
If we assume that the two populations $P_1$ and $P_0$ are identical, then we can simply combine the two datasets to obtain the kernel estimator
$$\hat\mu^{(1)}(x, z) = \frac{\sum_{i=1}^{N} Y_i\,\kappa_{b,i}(x, z)}{\sum_{i=1}^{N} \kappa_{b,i}(x, z)}, \qquad (3)$$
which is obviously more efficient than $\hat\mu(x, z)$ in (2), as the sample size is increased from n to N>n. The estimator $\hat\mu^{(1)}(x, z)$ in (3), however, is not correct (i.e., it is biased) when the populations $P_1$ and $P_0$ are different, because $E(Y \mid X = x, Z = z, D = 0)$ for the external population may be different from $\mu_1(x, z)$ for the internal population.
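A toy simulation illustrating this bias: pooling helps when $P_1 = P_0$ but pulls the estimate toward the external regression function when the populations differ. The data-generating mechanism below is invented purely for illustration.

```python
import numpy as np
rng = np.random.default_rng(1)

def nw(y, v, v0, b):  # Nadaraya-Watson as in (2)/(3)
    k = np.exp(-0.5 * np.sum(((v0 - v) / b) ** 2, axis=1))
    return np.sum(y * k) / np.sum(k)

n, N = 300, 1200
v_int = rng.normal(size=(n, 1))
v_ext = rng.normal(size=(N - n, 1))
y_int = v_int[:, 0] ** 2 + rng.normal(size=n)            # internal: mu_1(x) = x^2
y_ext = v_ext[:, 0] ** 2 + 1.0 + rng.normal(size=N - n)  # external mean shifted: heterogeneity

v0 = np.array([0.0])  # true mu_1(0) = 0
print(nw(y_int, v_int, v0, b=0.3))                       # internal-only estimate
print(nw(np.r_[y_int, y_ext], np.r_[v_int, v_ext], v0, b=0.3))  # pooled, pulled upward
```

Increasing N − n makes the pooled estimate more precise but does not remove the bias, which is why the correction in Section 2.2 is needed.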
2.2. Combining data from heterogeneous populations
We now derive a kernel estimator that uses the two datasets and is asymptotically correct regardless of whether $P_1$ and $P_0$ are the same or not. Let $p_d(y \mid x, z)$ be the conditional density of Y given (X, Z) = (x, z) and D = d, d = 1 or 0 (for the internal or external population). Then
$$\mu_1(x, z) = \int y\, p_1(y \mid x, z)\, dy = \int y\, r(y, x, z)\, p_0(y \mid x, z)\, dy, \quad r(y, x, z) = \frac{p_1(y \mid x, z)}{p_0(y \mid x, z)}. \qquad (4)$$
The ratio r links the internal and external populations so that we can overcome the difficulty in utilizing the external data under heterogeneous populations.
If we can construct an estimator of $p_d(y \mid x, z)$ for every y, (x, z), and D = 0 or 1, then we can modify the estimator in (3) by replacing every $Y_i$ with i>n by the constructed response $\hat r(Y_i, X_i, Z_i)\,Y_i$. The resulting kernel estimator is
$$\hat\mu^{(2)}(x, z) = \frac{\sum_{i=1}^{n} Y_i\,\kappa_{b,i}(x, z) + \sum_{i=n+1}^{N} \hat r(Y_i, X_i, Z_i)\,Y_i\,\kappa_{b,i}(x, z)}{\sum_{i=1}^{N} \kappa_{b,i}(x, z)}. \qquad (5)$$
Note that we use the internal data $(Y_i, X_i, Z_i)$, $i = 1, \dots, n$, to obtain an estimator $\hat p_1$ and the external data $(Y_i, X_i, Z_i)$, $i = n+1, \dots, N$, to construct an estimator $\hat p_0$. Applying kernel estimation, we obtain that
$$\hat r(y, x, z) = \frac{\hat p_1(y \mid x, z)}{\hat p_0(y \mid x, z)}, \quad \hat p_d(y \mid x, z) = \frac{\hat f_d(y, x, z)}{\hat f_d(x, z)}, \qquad (6)$$
where $\hat f_1(y, x, z) = (nb_1^{p+1})^{-1}\sum_{i=1}^{n}\lambda\{(y - Y_i)/b_1, (x - X_i)/b_1, (z - Z_i)/b_1\}$ and $\hat f_1(x, z) = (nb_2^{p})^{-1}\sum_{i=1}^{n}\omega\{(x - X_i)/b_2, (z - Z_i)/b_2\}$, $\hat f_0$ is defined analogously with sums over $i = n+1, \dots, N$ and n replaced by N − n, and λ and ω are kernels with dimensions p + 1 and p and bandwidths $b_1$ and $b_2$, respectively. The estimator in (5) is asymptotically valid under some regularity conditions on the kernels and bandwidths, summarized in Theorem 4.1 of Section 4.
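A sketch of (5): external responses are multiplied by an estimated density ratio before pooling. For brevity the sketch takes the ratio estimator as a user-supplied function r_hat (e.g., built from (6), or, in Section 2.3, from (9) with an estimated γ); all names are assumptions of this sketch.

```python
import numpy as np

def mu_hat_2(y_int, v_int, y_ext, v_ext, r_hat, v0, b):
    """Estimator (5): pooled Nadaraya-Watson regression in which each
    external response Y_i (i > n) is replaced by r_hat(Y_i, V_i) * Y_i.
    r_hat must be vectorized over observations."""
    def k(v):  # Gaussian product kernel weights at v0
        return np.exp(-0.5 * np.sum(((v0 - v) / b) ** 2, axis=1))
    k_int, k_ext = k(v_int), k(v_ext)
    y_adj = r_hat(y_ext, v_ext) * y_ext   # constructed external responses
    num = np.sum(y_int * k_int) + np.sum(y_adj * k_ext)
    return num / (np.sum(k_int) + np.sum(k_ext))
```

Setting r_hat to the constant function 1 recovers the pooled estimator (3).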
2.3. Combining data from heterogeneous populations with additional information
If additional information exists, then the approach in Section 2.2 can be improved. Assume that the internal and external datasets are formed according to a random binary indicator D such that $(Y_i, X_i, Z_i, D_i)$, $i = 1, \dots, N$, are iid, where $(Y_i, X_i, Z_i)$ with $D_i = 1$ are observed internal data, $(Y_i, X_i, Z_i)$ with $D_i = 0$ are observed external data, and N is still the known total sample size for internal and external data. In this situation, the internal and external sample sizes are $n = \sum_{i=1}^{N} D_i$ and N−n, respectively, both of which are random. In most applications, the assumption of random D is not a substantial restriction. From the identity
$$r(y, x, z) = \frac{P(D = 1 \mid Y = y, X = x, Z = z)}{P(D = 0 \mid Y = y, X = x, Z = z)}\cdot\frac{P(D = 0 \mid X = x, Z = z)}{P(D = 1 \mid X = x, Z = z)}, \qquad (7)$$
we just need to estimate $P(D = 1 \mid Y = y, X = x, Z = z)$ and $P(D = 1 \mid X = x, Z = z)$ for every (y, x, z), constructed using, for example, the nonparametric estimators in Fan et al. (1998) for binary response. For each estimator, both internal and external data on (Y, X, Z) and the indicator D are used.
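Both conditional probabilities in (7) can be estimated by kernel smoothing of the binary indicator D; the paper points to the local likelihood estimators of Fan et al. (1998), while the sketch below uses plain Nadaraya–Watson smoothing of D for brevity, so it is an assumption-laden stand-in rather than the cited method.

```python
import numpy as np

def prob_d_given(u, d, u0, b):
    """Kernel estimate of P(D = 1 | U = u0) by smoothing the binary D over U,
    where U is (Y, X, Z) for the first factor in (7) and (X, Z) for the second."""
    k = np.exp(-0.5 * np.sum(((u0 - u) / b) ** 2, axis=1))
    return np.sum(d * k) / np.sum(k)

def r_hat_via_7(y0, v0, y, v, d, b1, b2):
    """Plug-in estimator of r(y, x, z) based on identity (7)."""
    u = np.column_stack([y, v])                  # (Y, X, Z), dimension p + 1
    p1 = prob_d_given(u, d, np.r_[y0, v0], b1)   # P(D = 1 | Y, X, Z)
    p2 = prob_d_given(v, d, v0, b2)              # P(D = 1 | X, Z)
    p1, p2 = np.clip(p1, 1e-6, 1 - 1e-6), np.clip(p2, 1e-6, 1 - 1e-6)  # guard odds
    return (p1 / (1 - p1)) * ((1 - p2) / p2)
```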
A further improvement can be made if the following semi-parametric model holds:
$$P(D = 1 \mid Y = y, X = x, Z = z) = \frac{\exp\{g(x, z) + \gamma y\}}{1 + \exp\{g(x, z) + \gamma y\}}, \qquad (8)$$
where g is an unspecified unknown function and γ is an unknown parameter. From (7)–(8),
$$r(y, x, z) = \frac{e^{\gamma y}}{E(e^{\gamma Y} \mid X = x, Z = z, D = 0)}. \qquad (9)$$
If $\gamma = 0$, then $r \equiv 1$ and the estimator $\hat\mu^{(1)}(x, z)$ in (3) is correct. Under (9) with $\gamma \neq 0$, we just need to derive an estimator $\hat\gamma$ of γ and apply kernel estimation to estimate $E(e^{\gamma Y} \mid X = x, Z = z, D = 0)$ as a function of (x, z). Note that we do not need to estimate the unspecified function g in (8), which is a nice feature of the semi-parametric model (8).
We now derive an estimator $\hat\gamma$. Applying (7)–(8) to (4), we obtain that
$$\mu_1(x, z) = \int y\,\frac{e^{\gamma y}}{E(e^{\gamma Y} \mid X = x, Z = z, D = 0)}\,p_0(y \mid x, z)\,dy = \frac{E(Y e^{\gamma Y} \mid X = x, Z = z, D = 0)}{E(e^{\gamma Y} \mid X = x, Z = z, D = 0)},$$
where the equalities follow from (8)–(9). For every real number t, define
$$h_t(x, z) = \frac{E(Y e^{tY} \mid X = x, Z = z, D = 0)}{E(e^{tY} \mid X = x, Z = z, D = 0)}.$$
Its estimator by kernel regression is
$$\hat h_t(x, z) = \frac{\sum_{i=n+1}^{N} Y_i e^{tY_i}\,\kappa_0\{(x - X_i)/c,\ (z - Z_i)/c\}}{\sum_{i=n+1}^{N} e^{tY_i}\,\kappa_0\{(x - X_i)/c,\ (z - Z_i)/c\}}, \qquad (10)$$
where $\kappa_0$ is a kernel and c is a bandwidth. Then, we estimate γ by
$$\hat\gamma = \mathop{\arg\min}_{t}\ \sum_{i=1}^{n}\{Y_i - \hat h_t(X_i, Z_i)\}^2, \qquad (11)$$
motivated by the fact that the objective function for minimization in (11) approximates $E[\{Y - h_t(X, Z)\}^2 \mid D = 1]$ and, for any t,
$$E[\{Y - h_t(X, Z)\}^2 \mid D = 1] \geq E[\{Y - h_\gamma(X, Z)\}^2 \mid D = 1],$$
because $h_\gamma(x, z) = \mu_1(x, z) = E(Y \mid X = x, Z = z, D = 1)$.
Once $\hat\gamma$ is obtained, our estimator of $\mu_1(x, z)$ is
$$\hat\mu^{(3)}(x, z) = \frac{\sum_{i=1}^{n} Y_i\,\kappa_{b,i}(x, z) + \sum_{i=n+1}^{N} \hat r(Y_i, X_i, Z_i)\,Y_i\,\kappa_{b,i}(x, z)}{\sum_{i=1}^{N} \kappa_{b,i}(x, z)}, \qquad (12)$$
with $\hat r(y, x, z) = e^{\hat\gamma y}\big/\hat E(e^{\hat\gamma Y} \mid X = x, Z = z, D = 0)$, where the denominator is the kernel estimator $\sum_{i=n+1}^{N} e^{\hat\gamma Y_i}\kappa_0\{(x - X_i)/c, (z - Z_i)/c\}\big/\sum_{i=n+1}^{N}\kappa_0\{(x - X_i)/c, (z - Z_i)/c\}$, in view of (9).
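A sketch of the profile estimation of γ in (10)–(11): for each candidate t, the external data give the kernel fit $\hat h_t$, and t is chosen to minimize the internal-data prediction error. The search bounds and the Gaussian kernel are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def h_hat(t, v0, y_ext, v_ext, c):
    """Kernel estimator (10) of h_t(v0) from external data."""
    k = np.exp(-0.5 * np.sum(((v0 - v_ext) / c) ** 2, axis=1))
    w = np.exp(t * (y_ext - y_ext.mean())) * k  # constant shift in the exponent
    return np.sum(y_ext * w) / np.sum(w)        # cancels in the ratio; adds stability

def gamma_hat(y_int, v_int, y_ext, v_ext, c):
    """Estimator (11): minimize the internal prediction error over t."""
    def objective(t):
        fits = np.array([h_hat(t, v, y_ext, v_ext, c) for v in v_int])
        return np.sum((y_int - fits) ** 2)
    return minimize_scalar(objective, bounds=(-2.0, 2.0), method="bounded").x
```

Given $\hat\gamma$, the ratio in (12) is $e^{\hat\gamma y}$ divided by the kernel estimate of $E(e^{\hat\gamma Y} \mid X = x, Z = z, D = 0)$, so $\hat\mu^{(3)}$ can be computed with the mu_hat_2 sketch above.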
In applications, we need to choose bandwidths for the given sample sizes n and N−n. We can apply k-fold cross-validation as described in Györfi et al. (2002). Requirements on the rates of the bandwidths are described in the theorems in Section 4.
3. Constrained kernel regression with unmeasured covariates
We still consider the case with one external dataset, independent of the internal dataset. In this section, the external dataset contains iid observations $(Y_i, X_i)$, $i = n+1, \dots, N$, from the external population $P_0$, where X is a q-dimensional sub-vector of the covariate vector (X, Z) with q<p.
Since the external dataset has only X, not the entire (X, Z), we cannot apply the method in Section 2 when q<p. Instead, we consider kernel regression using external information in a constraint. First, we consider the estimation of the n-dimensional vector $\boldsymbol\mu = \{\mu_1(X_1, Z_1), \dots, \mu_1(X_n, Z_n)\}^\top$, where $A^\top$ denotes the transpose of a vector or matrix A throughout. Note that the standard kernel regression (2) estimates $\boldsymbol\mu$ as the minimizer of
$$\sum_{i=1}^{n}\sum_{j=1}^{n}(Y_j - \theta_i)^2\,\kappa_{b,j}(X_i, Z_i)$$
over $\boldsymbol\theta = (\theta_1, \dots, \theta_n)^\top$. Taking partial derivatives with respect to the $\theta_i$'s, we obtain that
$$\tilde\theta_i = \frac{\sum_{j=1}^{n} Y_j\,\kappa_{b,j}(X_i, Z_i)}{\sum_{j=1}^{n}\kappa_{b,j}(X_i, Z_i)} = \hat\mu(X_i, Z_i), \quad i = 1, \dots, n. \qquad (13)$$
We improve $\tilde{\boldsymbol\theta} = (\tilde\theta_1, \dots, \tilde\theta_n)^\top$ by the following constrained minimization,
$$\hat{\boldsymbol\theta} = \mathop{\arg\min}_{\boldsymbol\theta}\ \sum_{i=1}^{n}\sum_{j=1}^{n}(Y_j - \theta_i)^2\,\kappa\{(X_i - X_j)/l,\ (Z_i - Z_j)/l\} \qquad (14)$$
subject to
$$\frac{\sum_{i=1}^{n}\theta_i\,\phi\{(X_i - X_k)/l\}}{\sum_{i=1}^{n}\phi\{(X_i - X_k)/l\}} = \hat\zeta_j(X_k), \quad k = 1, \dots, n, \qquad (15)$$
where φ is a kernel on the range of X, l in (14)–(15) is a bandwidth that may be different from b in (2) or (13), and $\hat\zeta_j$ is the kernel estimator of $\zeta(x) = E(Y \mid X = x, D = 1)$ using the jth of the three methods described in Section 2, j = 1, 2, 3. Specifically, $\hat\zeta_1$ is given by (3), $\hat\zeta_2$ is given by (5), and $\hat\zeta_3$ is given by (12), with (Y, X, Z) replaced by (Y, X), and with kernels and bandwidths suitably adjusted as the dimensions of (X, Z) and X are different. Note that $\hat\zeta_j$ can be computed as both internal and external datasets have measured X's.
It turns out that $\hat{\boldsymbol\theta}$ in (14) has an explicit form,
$$\hat{\boldsymbol\theta} = \tilde{\boldsymbol\theta}_l + \Lambda^{-1}W^\top\big(W\Lambda^{-1}W^\top\big)^{-1}\big(\hat{\boldsymbol\zeta}_j - W\tilde{\boldsymbol\theta}_l\big),$$
where $\tilde{\boldsymbol\theta}_l$ is the vector in (13) with b replaced by l, Λ is the n × n diagonal matrix whose ith diagonal element is $\sum_{j=1}^{n}\kappa\{(X_i - X_j)/l, (Z_i - Z_j)/l\}$, W is the n × n matrix whose (i, k)th entry is the normalized kernel weight $\phi\{(X_k - X_i)/l\}\big/\sum_{k'=1}^{n}\phi\{(X_{k'} - X_i)/l\}$, and $\hat{\boldsymbol\zeta}_j$ is the n-dimensional vector whose ith component is $\hat\zeta_j(X_i)$. Constraint (15) is an empirical analog of the theoretical constraint $E\{\mu_1(X, Z) \mid X, D = 1\} = \zeta(X)$ (based on internal data), as the left-hand side of (15) is a kernel-smoothed average of the $\theta_i$'s given the X-values. Thus, if $\hat\zeta_j$ is a good estimator of ζ, then $\hat{\boldsymbol\theta}$ in (14) is more accurate than the unconstrained $\tilde{\boldsymbol\theta}$ in (13).
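A sketch of the constrained fit (13)–(15) in its explicit form, shown for univariate X and Z (q = 1, p = 2); the Gaussian kernel is illustrative, and lstsq is used instead of explicit inverses for numerical stability.

```python
import numpy as np

def constrained_theta(y, x, z, zeta_hat, l):
    """Explicit form of (14)-(15). y, x, z are (n,) internal data arrays and
    zeta_hat is the (n,) vector with ith component zeta_hat_j(X_i)."""
    gauss = lambda u: np.exp(-0.5 * u ** 2)
    # (13): unconstrained Nadaraya-Watson fits at the internal points, bandwidth l
    K = gauss((x[:, None] - x[None, :]) / l) * gauss((z[:, None] - z[None, :]) / l)
    lam = K.sum(axis=1)                    # diagonal of Lambda
    theta_tilde = (K @ y) / lam
    # X-direction smoothing matrix W (rows sum to 1) appearing in (15)
    P = gauss((x[:, None] - x[None, :]) / l)
    W = P / P.sum(axis=1, keepdims=True)
    # Lagrange correction enforcing W @ theta = zeta_hat
    A = W / lam[None, :]                   # = W @ diag(1 / lam)
    eta = np.linalg.lstsq(W @ A.T, zeta_hat - W @ theta_tilde, rcond=None)[0]
    return theta_tilde + A.T @ eta         # theta_tilde + Lambda^{-1} W' eta
```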
To obtain an improved estimator of the entire regression function $\mu_1$ in (1), not just the function at $(X_i, Z_i)$, $i = 1, \dots, n$, we apply the standard kernel regression with the response vector $(Y_1, \dots, Y_n)^\top$ replaced by $\hat{\boldsymbol\theta}$ in (14), which results in the following three estimators of $\mu_1(x, z)$:
$$\tilde\mu^{(j)}(x, z) = \frac{\sum_{i=1}^{n} \hat\theta_{j,i}\,\kappa_{b,i}(x, z)}{\sum_{i=1}^{n}\kappa_{b,i}(x, z)}, \quad j = 1, 2, 3, \qquad (16)$$
where $\hat\theta_{j,i}$ is the ith component of $\hat{\boldsymbol\theta}$ in (14) when $\hat\zeta_j$ is used in (15), and b is the same bandwidth as in (2). The first estimator $\tilde\mu^{(1)}$ is simple, but can be incorrect when the populations $P_1$ and $P_0$ are different. The asymptotic validity of $\tilde\mu^{(2)}$ and $\tilde\mu^{(3)}$ is established in the next section.
4. Asymptotic normality
We now establish the asymptotic normality of $\hat\mu^{(2)}(x, z)$ and $\tilde\mu^{(2)}(x, z)$ for a fixed (x, z), as the sample size of the internal dataset increases to infinity. All technical proofs are given in the Appendix.
The first result is about $\hat\mu^{(2)}(x, z)$ in (5). The result is also applicable to $\hat\mu^{(1)}(x, z)$ in (3) with the added condition that $P_1 = P_0$.
Theorem 4.1
Assume the following conditions.
(B1) The densities $f_1$ and $f_0$ of (X, Z) given D = 1 and D = 0, respectively, are continuous and positive at the fixed point (x, z).
(B2) The conditional variances $\sigma_k^2(x, z) = \mathrm{var}(Y \mid X = x, Z = z, D = k)$, k = 0, 1, are continuous at (x, z), and $E(Y^2 \mid X, Z, D = k)$ is bounded in a neighborhood of (x, z).
(B3) The kernel κ is second order, i.e., $\int\kappa(u)\,du = 1$, $\int u\,\kappa(u)\,du = 0$, and $\int u u^\top\kappa(u)\,du$ is finite.
(B4) The bandwidth b satisfies $b \to 0$ and $nb^p \to \infty$ as $n \to \infty$.
(B5) The kernels λ and ω in (6) and their bandwidths $b_1$ and $b_2$ satisfy the conditions of Lemma 8.10 in Newey and McFadden (1994), so that (18) below holds.
Then, for any fixed (x, z) with $f_1(x, z) > 0$ and $f_0(x, z) > 0$ and $\hat\mu^{(2)}(x, z)$ in (5),
$$\sqrt{nb^p}\,\big\{\hat\mu^{(2)}(x, z) - \mu_1(x, z) - B_a(x, z)\big\} \to_d N\big\{0,\ \sigma_a^2(x, z)\big\}, \qquad (17)$$
where $\to_d$ denotes convergence in distribution as $n \to \infty$, $a = \lim_{n\to\infty}(N - n)/n$ is assumed to exist, and the bias $B_a(x, z)$ and variance $\sigma_a^2(x, z)$ have explicit forms given in the Appendix.
Conditions (B1)–(B4) are typically assumed for kernel estimation (Bierens, 1987). Condition (B5) is a sufficient condition for
$$\max_{n<i\le N}\big|\hat r(Y_i, X_i, Z_i) - r(Y_i, X_i, Z_i)\big| = o_p\big\{(nb^p)^{-1/2}\big\} \qquad (18)$$
(Lemma 8.10 in Newey & McFadden, 1994), where $o_p$ denotes a term tending to 0 in probability. Result (18) implies that the estimation of the ratio r does not affect the asymptotic distribution of $\hat\mu^{(2)}(x, z)$ in (5).
Note that both the squared bias and the variance $\sigma_a^2(x, z)$ in (17) are decreasing in the limit $a = \lim_{n\to\infty}(N - n)/n$, a quantity reflecting how much external data we have. In the extreme case of a = 0, i.e., the size of the external dataset is negligible compared with the size of the internal dataset, result (17) reduces to the well-known asymptotic normality for the standard kernel estimator $\hat\mu(x, z)$ in (2) (Bierens, 1987). In the other extreme case of $a = \infty$, on the other hand, $\sigma_a^2(x, z) \to 0$ and, hence, $\hat\mu^{(2)}(x, z)$ has a convergence rate tending to 0 faster than $(nb^p)^{-1/2}$, the convergence rate of the standard kernel estimator $\hat\mu(x, z)$.
The next result is about $\tilde\mu^{(2)}(x, z)$ in (16) as described in Section 3.
Theorem 4.2
Assume (B1)–(B5) with (X, Z) and p replaced by X and q, respectively, and the following conditions, where the densities $f_k$ and variances $\sigma_k^2$, k = 0, 1, are defined in (B1)–(B2).
(C1) The range of X is a compact subset of $\mathbb{R}^q$.
(C2) The functions $\mu_1(x, z)$, $\zeta(x) = E(Y \mid X = x, D = 1)$, and $f_k$, k = 0, 1, are twice continuously differentiable with bounded second-order derivatives.
(C3) All kernel functions are positive, bounded, and Lipschitz continuous with mean zero and finite sixth moments.
(C4) The bandwidth l satisfies $l \to 0$ and $nl^q/\log n \to \infty$ as $n \to \infty$.
(C5) The densities of X given D = 1 and D = 0 are bounded away from zero on the range of X.
Then, for any fixed (x, z) and $\tilde\mu^{(2)}(x, z)$ in (16),
$$\sqrt{nb^p}\,\big\{\tilde\mu^{(2)}(x, z) - \mu_1(x, z) - \tilde B(x, z)\big\} \to_d N\big\{0,\ \tilde\sigma^2(x, z)\big\}, \qquad (19)$$
where the bias $\tilde B(x, z)$ and variance $\tilde\sigma^2(x, z)$ have explicit forms given in the Appendix, and a matrix involved in $\tilde\sigma^2(x, z)$ is assumed to be positive definite without loss of generality.
The next result is about $\hat\gamma$ in (11).
Theorem 4.3
Suppose that (8) holds for the binary random D indicating internal and external data. Assume also the following conditions.
(D1) The kernel $\kappa_0$ in (10) is bounded, Lipschitz continuous, and second order.
(D2) The bandwidth c in (10) satisfies $c \to 0$, $Nc^{2p} \to \infty$, and $Nc^4 \to 0$ as $N \to \infty$.
(D3) γ in (8) is an interior point of a known compact interval Θ, and $E\{(1 + Y^2)e^{2tY} \mid X, Z, D = 0\}$ is bounded uniformly in $t \in \Theta$ and (X, Z).
(D4) (X, Z) has a bounded support, and the density of (X, Z) given D = 0 is bounded away from zero on its support.
(D5) The function $Q(t) = E[D\{Y - h_t(X, Z)\}^2]$ is twice continuously differentiable on Θ, with second derivative ψ satisfying $\psi(\gamma) > 0$.
Then, as the total sample size of internal and external datasets $N \to \infty$,
$$\sqrt{N}\,(\hat\gamma - \gamma) \to_d N\big(0, \sigma_\gamma^2\big), \qquad (20)$$
where $\sigma_\gamma^2 = V/\{\psi(\gamma)\}^2$ with V the asymptotic variance of the score, given in the Appendix.
Conditions (D1)–(D5) are technical assumptions discussed in Lemmas 8.11 and 8.12 in Newey and McFadden (1994). As discussed by Newey and McFadden (1994), the condition that (X, Z) has a bounded support can be relaxed, as it is imposed for a simple proof.
Combining Theorems 4.1–4.3, we obtain the following result for $\hat\mu^{(3)}(x, z)$ in (12) or $\tilde\mu^{(3)}(x, z)$ in (16).
Corollary 4.1
Suppose that (8) holds for the binary random D indicating internal and external data.
(i) Under (B1)–(B4) and (D1)–(D5), result (17) holds with $\hat\mu^{(2)}(x, z)$ replaced by $\hat\mu^{(3)}(x, z)$ in (12).
(ii) Under (C1)–(C4) and (D1)–(D5) with (X, Z) and p replaced by X and q, respectively, result (19) holds with $\tilde\mu^{(2)}(x, z)$ replaced by $\tilde\mu^{(3)}(x, z)$ in (16).
5. Simulation results
5.1. The performance of $\tilde\mu^{(j)}$ given by (16)
We first present simulation results to examine and compare the performance of the standard kernel estimator $\hat\mu$ in (2) without using external information and our proposed estimator (16) with three variations, $\tilde\mu^{(1)}$, $\tilde\mu^{(2)}$, and $\tilde\mu^{(3)}$, as described at the end of Section 3. We consider the estimation of $\mu_1(x, z) = E(Y \mid X = x, Z = z, D = 1)$ with univariate covariates X and Z, where Z is unmeasured in the external dataset (p = 2 and q = 1). The covariates are generated in two ways:
normal covariates: (X, Z) is bivariate normal with means 0, variances 1, and correlation 0.5;
bounded covariates: $X = BU_1 + (1 - B)U_2$ and $Z = BU_1 + (1 - B)U_3$, where $U_1$, $U_2$, and $U_3$ are identically distributed as uniform on [0, 1], B is uniform on {0, 1}, and $U_1$, $U_2$, $U_3$, and B are independent.
Conditioned on (X, Z) = (x, z), the response Y is normal with mean $\mu(x, z)$ and variance 1, where $\mu(x, z)$ follows one of four models, (M1)–(M4). All four models are nonlinear in (x, z); (M1)–(M2) are additive in x and z, while (M3)–(M4) are non-additive.
A total of N = 1,200 data points are generated from the population of (X, Z, Y) as previously described. A data point is treated as internal or external according to a random binary D whose conditional probability $P(D = 1 \mid Y = y, X = x, Z = z)$ follows model (8), where γ = 0 or 1/2 and g is set to one of two constants. Under the two choices of g, the unconditional P(D = 1) is around 13% or 50%.
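A sketch of this data-generating mechanism for the normal-covariate case. Since the exact mean functions (M1)–(M4) are specified in the paper's display, the nonlinear mean below is only a stand-in, and the intercept g0 is a tuning value assumed here to target a small P(D = 1).

```python
import numpy as np
rng = np.random.default_rng(2023)

N, gamma, g0 = 1200, 0.5, -2.0   # gamma = 0 would give homogeneous populations

# Bivariate normal covariates with correlation 0.5
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
xz = rng.multivariate_normal([0.0, 0.0], cov, size=N)
mu = np.sin(xz[:, 0]) + xz[:, 1] ** 2   # illustrative stand-in for (M1)-(M4)
y = mu + rng.normal(size=N)

# Selection into the internal dataset follows the logistic model (8), g constant
p_int = 1.0 / (1.0 + np.exp(-(g0 + gamma * y)))
d = rng.binomial(1, p_int)
print(d.mean())   # empirical unconditional P(D = 1) under these illustrative choices
```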
The simulation studies the performance of the kernel estimators in terms of the mean integrated squared error (MISE). The following measure is calculated by simulation with S replications:
$$\mathrm{MISE} = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{T}\sum_{t=1}^{T}\big\{\breve\mu_s(x_t, z_t) - \mu_1(x_t, z_t)\big\}^2, \qquad (21)$$
where $(x_t, z_t)$, $t = 1, \dots, T$, are test data for each simulation replication s, the simulation is repeated independently for $s = 1, \dots, S$, and $\breve\mu_s$ is one of $\hat\mu$, $\tilde\mu^{(1)}$, $\tilde\mu^{(2)}$, and $\tilde\mu^{(3)}$ computed in the sth replication, independent of the test data. We consider two ways of generating the test data $(x_t, z_t)$'s. The first one is to use T = 121 fixed grid points (an 11 × 11 equally spaced grid). The second one is to take a random sample of T = 121 points without replacement from the covariate values $(X_i, Z_i)$ of the internal dataset, for each fixed s and independently across s.
To show the benefit of using external information, we calculate the improvement in efficiency defined as follows:
$$\mathrm{IMP} = \frac{\min_b \mathrm{MISE}(\hat\mu) - \min_b \mathrm{MISE}(\breve\mu)}{\min_b \mathrm{MISE}(\hat\mu)}, \qquad (22)$$
where the minimum is over the pool of candidate bandwidths and $\breve\mu$ is one of $\tilde\mu^{(1)}$, $\tilde\mu^{(2)}$, and $\tilde\mu^{(3)}$.
In all cases, we use the Gaussian kernel. The bandwidths b and l affect the performance of kernel methods, and we consider two types of bandwidths in the simulation. The first is 'the best bandwidth': for each method, we evaluate the MISE over a pool of bandwidths and display the one attaining the minimal MISE. This shows the best we can achieve in terms of bandwidth, but it cannot be used in applications. The second is to select the bandwidth from a pool of candidates via 10-fold cross-validation (Györfi et al., 2002), which produces a decent bandwidth that can be applied to real data.
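The quantities (21)–(22) and the cross-validated bandwidth reduce to a few lines; the array shapes, the fold count, and the candidate grid are assumptions of this sketch.

```python
import numpy as np

def mise(fits, truth):
    """(21): fits and truth are (S, T) arrays of mu_breve_s(x_t, z_t) and
    mu_1(x_t, z_t) over T test points and S replications."""
    return np.mean((fits - truth) ** 2)

def imp(mise_standard, mise_proposed):
    """(22): relative improvement over the standard estimator (2)."""
    return (mise_standard - mise_proposed) / mise_standard

def cv_bandwidth(y, v, bandwidths, n_folds=10, seed=0):
    """10-fold cross-validation choice of b for the estimator (2)."""
    def nw(y_tr, v_tr, v0, b):
        k = np.exp(-0.5 * np.sum(((v0 - v_tr) / b) ** 2, axis=1))
        return np.sum(y_tr * k) / np.sum(k)
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = []
    for b in bandwidths:
        err = 0.0
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            preds = np.array([nw(y[train], v[train], v[i], b) for i in fold])
            err += np.sum((y[fold] - preds) ** 2)
        errors.append(err)
    return bandwidths[int(np.argmin(errors))]
```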
The simulated MISE values based on S = 200 replications are shown in Tables 1–4.
Table 1. Simulated MISE (21) and IMP (22) when the external dataset contains only X, with S = 200, under γ = 0 and P(D = 1) ≈ 13%.
Table 2. Simulated MISE (21) and IMP (22) when the external dataset contains only X, with S = 200, under γ = 0 and P(D = 1) ≈ 50%.
Table 3. Simulated MISE (21) and IMP (22) when the external dataset contains only X, with S = 200, under γ = 1/2 and P(D = 1) ≈ 13%.
Table 4. Simulated MISE (21) and IMP (22) when the external dataset contains only X, with S = 200, under γ = 1/2 and P(D = 1) ≈ 50%.
Consider first the results in Tables 1 and 2. Since γ = 0, all three estimators, $\tilde\mu^{(1)}$, $\tilde\mu^{(2)}$, and $\tilde\mu^{(3)}$, are correct and more efficient than the standard estimator $\hat\mu$ in (2) without using external information. The estimator $\tilde\mu^{(1)}$ is the best, as it uses the correct information that the populations are homogeneous (γ = 0) and is simpler than $\tilde\mu^{(2)}$ and $\tilde\mu^{(3)}$.
Next, the results in Tables 3 and 4 for γ = 1/2 indicate that the estimator $\tilde\mu^{(2)}$ or $\tilde\mu^{(3)}$ using a correct constraint is better than the estimator $\tilde\mu^{(1)}$ using an incorrect constraint or the estimator $\hat\mu$ not using external information. Since $\tilde\mu^{(3)}$ uses more information, it is in general better than $\tilde\mu^{(2)}$. Furthermore, with an incorrect constraint, $\tilde\mu^{(1)}$ can be much worse than $\hat\mu$ without using external information.
5.2. The performance of $\hat\mu^{(j)}$ given by (3), (5), or (12)
Under the same simulation setting as described in Section 5.1 but with covariate Z measured in both internal and external datasets, we compare the performance of the three estimators, $\hat\mu^{(1)}$, $\hat\mu^{(2)}$, and $\hat\mu^{(3)}$, given by (3), (5), and (12), respectively, with the standard kernel estimator $\hat\mu$ in (2) without using external information. The mean integrated squared error (MISE) and improvement (IMP) are calculated using formulas (21) and (22), respectively, with $\breve\mu$ one of $\hat\mu$, $\hat\mu^{(1)}$, $\hat\mu^{(2)}$, and $\hat\mu^{(3)}$.
Tables 5–8 present the simulation results. The relative performance of $\hat\mu$, $\hat\mu^{(1)}$, $\hat\mu^{(2)}$, and $\hat\mu^{(3)}$ follows the same pattern as that of $\hat\mu$, $\tilde\mu^{(1)}$, $\tilde\mu^{(2)}$, and $\tilde\mu^{(3)}$ in Section 5.1.
Table 5. Simulated MISE (21) and IMP (22) when the external dataset contains both X and Z, with S = 200, under γ = 0 and P(D = 1) ≈ 13%.
Table 6. Simulated MISE (21) and IMP (22) when the external dataset contains both X and Z, with S = 200, under γ = 0 and P(D = 1) ≈ 50%.
Table 7. Simulated MISE (21) and IMP (22) when the external dataset contains both X and Z, with S = 200, under γ = 1/2 and P(D = 1) ≈ 13%.
Table 8. Simulated MISE (21) and IMP (22) when the external dataset contains both X and Z, with S = 200, under γ = 1/2 and P(D = 1) ≈ 50%.
The only difference between the results here and those in Section 5.1 is that the use of more external data (a smaller n/N) results in a better performance of $\hat\mu^{(2)}$ or $\hat\mu^{(3)}$ (or $\hat\mu^{(1)}$ when it is correct). This is consistent with our theoretical result, Theorem 4.1 in Section 4, which shows that both the squared bias and the variance in (17) are decreasing in the limit $a = \lim_{n\to\infty}(N - n)/n$. On the other hand, the simulation results in Section 5.1 and Theorem 4.2 in Section 4 do not show a clear indication that using more external data produces better estimators. The main reason is that, when Z is not observed in the external dataset, the constrained estimator relies more on internal data to recover the loss of Z from the external dataset in a complicated way.
5.3. The performance of $\tilde\mu^{(j)}$ given by (16) with q = 2
We re-consider the simulation in Section 5.1 but with the dimension of X being q = 2, i.e., the covariates are $(X_1, X_2, Z)$ with $X = (X_1, X_2)$ measured in the external dataset. We only consider normally distributed covariates with means 0, variances 1, and the correlations between $X_1$ and $X_2$, between $X_1$ and Z, and between $X_2$ and Z being 0.5, 0.5, and 0.25, respectively. Given $(X_1, X_2, Z) = (x_1, x_2, z)$, the response Y is normally distributed with mean $\mu(x_1, x_2, z)$ and variance 1. The remaining settings are the same as in Section 5.1. In calculating MISE (21), we only use a random sample of test data with T = 121, not fixed grid points. Also, we only evaluate the performance of the constrained estimators $\tilde\mu^{(j)}$ in (16).
The results are shown in Table 9. Compared with the results in Tables 1–4 for the case of q = 1, the MISE values here are larger due to having more covariates (q = 2), but the relative performance of the estimators is the same as that shown in Tables 1–4.
Table 9. Simulated MISE (21) and IMP (22) when the external dataset contains only the normally distributed $X = (X_1, X_2)$, with S = 200.
6. Discussion
The curse of dimensionality is a well-known problem for nonparametric methods. Thus, the proposed method in Section 2 is intended for a low-dimensional covariate vector, i.e., p is small. If p is not small, then we should reduce the dimension of the covariates prior to applying the constrained kernel (CK) regression, or any kernel method. For example, consider a single-index model assumption (K.-C. Li, 1991), i.e., $\mu_1(x, z)$ in (1) is assumed to be
$$\mu_1(x, z) = \eta\big\{\beta^\top (x^\top, z^\top)^\top\big\}, \qquad (23)$$
where η is an unknown univariate function and β is an unknown p-dimensional vector. The well-known sliced inverse regression (SIR) technique (K.-C. Li, 1991) can be applied to obtain a consistent and asymptotically normal estimator $\hat\beta$ of β in (23). Once β is replaced by $\hat\beta$, the kernel method can be applied with (X, Z) replaced by the one-dimensional 'covariate' $\hat\beta^\top(X^\top, Z^\top)^\top$. We can also apply other dimension-reduction techniques developed under assumptions weaker than (23) (Cook & Weisberg, 1991; B. Li & Wang, 2007; Ma & Zhu, 2012; Y. Shao et al., 2007; Xia et al., 2002).
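A compact sketch of SIR for recovering the index direction in (23); the slice count and the toy single-index model are assumptions of this sketch.

```python
import numpy as np

def sir_direction(y, v, n_slices=10):
    """Estimate the single-index direction in (23) by sliced inverse regression:
    standardize the covariates, average them within slices of sorted Y, and take
    the top eigenvector of the weighted covariance of the slice means."""
    n, p = v.shape
    mean, cov = v.mean(axis=0), np.cov(v, rowvar=False)
    root_inv = np.linalg.inv(np.linalg.cholesky(cov))
    z = (v - mean) @ root_inv.T                 # standardized covariates
    slices = np.array_split(np.argsort(y), n_slices)
    m = np.zeros((p, p))
    for s in slices:
        zbar = z[s].mean(axis=0)
        m += (len(s) / n) * np.outer(zbar, zbar)
    _, eigvecs = np.linalg.eigh(m)
    beta = root_inv.T @ eigvecs[:, -1]          # back-transform top direction
    return beta / np.linalg.norm(beta)

# Toy check: single-index model with direction proportional to (1, -1)
rng = np.random.default_rng(7)
v = rng.normal(size=(2000, 2))
y = np.tanh(v[:, 0] - v[:, 1]) + 0.1 * rng.normal(size=2000)
print(sir_direction(y, v))                      # approx +/-(0.707, -0.707)
```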
We turn to the dimension of X in the external dataset. When the dimension q of X is high, we may consider the following approach. Instead of using constraint (15), we use the component-wise constraints
$$\frac{\sum_{i=1}^{n}\theta_i\,\phi\{(X_{k,i} - X_{k,k'})/l\}}{\sum_{i=1}^{n}\phi\{(X_{k,i} - X_{k,k'})/l\}} = \hat\zeta_{j,k}(X_{k,k'}), \quad k' = 1, \dots, n,\ k = 1, \dots, q, \qquad (24)$$
where $X_{k,i}$ is the kth component of $X_i$, $k = 1, \dots, q$, and $\hat\zeta_{j,k}$ is an estimator of $E(Y \mid X_k, D = 1)$ using the methods described in Section 2. More constraints are involved in (24), but the estimation only involves the one-dimensional $X_k$, $k = 1, \dots, q$.
The kernel κ we adopted in (2) and (16) is a second-order kernel, so that the convergence rate of $\hat\mu(x, z)$ is $O_p\{n^{-2/(p+4)}\}$ with an optimally chosen bandwidth. An mth-order kernel with m>2 as defined by Bierens (1987) may be used to achieve the convergence rate $O_p\{n^{-m/(2m+p)}\}$. Alternatively, we may apply other nonparametric smoothing techniques such as local polynomials (Fan et al., 1997) to achieve the convergence rate $O_p\{n^{-m/(2m+p)}\}$ with an mth-order local polynomial.
Our results can be extended to scenarios where several external datasets are available. Since each external source may provide different covariate variables, we may need to apply the component-wise constraints (24) by estimating $E(Y \mid X_k, D = 1)$ via combining all the external sources that collect the covariate $X_k$. If the populations of the external datasets are different, then we may have to apply a combination of the methods described in Section 2.
Acknowledgments
The authors would like to thank two anonymous referees for helpful comments and suggestions.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- Bierens, H. J. (1987). Kernel estimators of regression functions. In Advances in Econometrics: Fifth World Congress (Vol. 1, pp. 99–144). Cambridge University Press.
- Chatterjee, N., Chen, Y. H., Maas, P., & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513), 107–117. https://doi.org/10.1080/01621459.2015.1123157
- Cook, R. D., & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414), 328–332. https://doi.org/10.2307/2290564
- Dai, C.-S., & Shao, J. (2023). Kernel regression utilizing external information as constraints. Statistica Sinica, 33, in press. https://doi.org/10.5705/ss.202021.0446
- Fan, J., Farmen, M., & Gijbels, I. (1998). Local maximum likelihood estimation and inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(3), 591–608. https://doi.org/10.1111/1467-9868.00142
- Fan, J., Gasser, T., Gijbels, I., Brockmann, M., & Engel, J. (1997). Local polynomial regression: optimal kernels and asymptotic minimax efficiency. Annals of the Institute of Statistical Mathematics, 49(1), 79–99. https://doi.org/10.1023/A:1003162622169
- Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A distribution-free theory of nonparametric regression. Springer.
- Kim, H. J., Wang, Z., & Kim, J. K. (2021). Survey data integration for regression analysis using model calibration. arXiv:2107.06448.
- Li, B., & Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479), 997–1008. https://doi.org/10.1198/016214507000000536
- Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327. https://doi.org/10.1080/01621459.1991.10475035
- Lohr, S. L., & Raghunathan, T. E. (2017). Combining survey data with other data sources. Statistical Science, 32(2), 293–312. https://doi.org/10.1214/16-STS584
- Ma, Y., & Zhu, L. (2012). A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107(497), 168–179. https://doi.org/10.1080/01621459.2011.646925
- Merkouris, T. (2004). Combining independent regression estimators from multiple surveys. Journal of the American Statistical Association, 99(468), 1131–1139. https://doi.org/10.1198/016214504000000601
- Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1), 141–142. https://doi.org/10.1137/1109020
- Newey, W. K. (1994). Kernel estimation of partial means and a general variance estimator. Econometric Theory, 10(2), 1–21. https://doi.org/10.1017/S0266466600008409.
- Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245. https://doi.org/10.1016/S1573-4412(05)80005-4
- Rao, J. (2021). On making valid inferences by integrating data from surveys and other sources. Sankhya B, 83(1), 242–272. https://doi.org/10.1007/s13571-020-00227-w
- Shao, J. (2003). Mathematical statistics (2nd ed.). Springer.
- Shao, Y., Cook, R. D., & Weisberg, S. (2007). Marginal tests with sliced average variance estimation. Biometrika, 94(2), 285–296. https://doi.org/10.1093/biomet/asm021
- Wand, M. P., & Jones, M. C. (1994). Kernel smoothing (Monographs on Statistics & Applied Probability 60). Chapman & Hall/CRC.
- Wasserman, L. (2006). All of nonparametric statistics. Springer.
- Xia, Y., Tong, H., Li, W. K., & Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3), 363–410. https://doi.org/10.1111/1467-9868.03411
- Yang, S., & Kim, J. K. (2020). Statistical data integration in survey sampling: a review. Japanese Journal of Statistics and Data Science, 3(2), 625–650. https://doi.org/10.1007/s42081-020-00093-w
- Zhang, H., Deng, L., Schiffman, M., Qin, J., & Yu, K. (2020). Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika, 107(3), 689–703. https://doi.org/10.1093/biomet/asaa014
- Zhang, Y., Ouyang, Z., & Zhao, H. (2017). A statistical framework for data integration through graphical models with application to cancer genomics. The Annals of Applied Statistics, 11(1), 161–184. https://doi.org/10.1214/16-AOAS998
Appendix
Proof of Theorem 4.1.
Let $\hat\mu^{*}(x, z)$ denote the estimator in (5) with $\hat r$ replaced by the true ratio r, and let $f_N(x, z) = (Nb^p)^{-1}\sum_{i=1}^{N}\kappa_{b,i}(x, z)$ denote its normalized denominator. Under (B3)–(B4), Theorem 2 in Nadaraya (1964) shows that $f_N(x, z)$ converges in probability to a mixture of the densities $f_1$ and $f_0$ in (B1). Under (B1)–(B4), the centered and normalized numerator sums over the internal and the external data are asymptotically normal. Then (17) holds for $\hat\mu^{*}(x, z)$, by Slutsky's theorem, the independence between the internal and external datasets, and the definition of a. The desired result (17) follows from the fact that $|\hat\mu^{(2)}(x, z) - \hat\mu^{*}(x, z)|$ is bounded by
$$\max_{n<i\le N}\big|\hat r(Y_i, X_i, Z_i) - r(Y_i, X_i, Z_i)\big|\times\frac{\sum_{i=n+1}^{N}|Y_i|\,\kappa_{b,i}(x, z)}{\sum_{i=1}^{N}\kappa_{b,i}(x, z)}, \qquad \text{(A1)}$$
which is $o_p\{(nb^p)^{-1/2}\}$ by result (18) under condition (B5).
Proof of Theorem 4.2.
Write
$$\tilde\mu^{(2)}(x, z) - \mu_1(x, z) = T_1 + T_2 + T_3, \qquad \text{(A2)}$$
where $T_1$, $T_2$, and $T_3$ are obtained by substituting the explicit form of $\hat{\boldsymbol\theta}$ (Section 3) into (16), and are expressed in terms of $I_n$, the identity matrix of order n; $\mathbf{1}_n$, the n-vector with all components being 1; the n × n diagonal matrix Λ whose ith diagonal element is $\sum_{j=1}^{n}\kappa\{(X_i - X_j)/l, (Z_i - Z_j)/l\}$; the n × n matrix W whose (i, j)th entry is the normalized kernel weight in (15); the n-dimensional vector $\hat{\boldsymbol\zeta}_2$ whose ith component is $\hat\zeta_2(X_i)$; and the kernels and bandwidths defined in Sections 2–3.
We first show that $T_1$ in (A2) is asymptotically normal with mean 0 and variance $\tilde\sigma^2(x, z)$ defined in Theorem 4.2. Consider a further decomposition $T_1 = V_n + R_n$, where $V_n$ is a V-statistic of order two. The variance of $V_n$ involves the functions given in condition (C2), and its computation follows from changes of variables in the two integrals involved. From the continuity of the functions in (C2), the variance of $\sqrt{nb^p}\,V_n$ converges to $\tilde\sigma^2(x, z)$. Therefore, by the theory for asymptotic normality of V-statistics (e.g., Theorem 3.16 in J. Shao, 2003), $\sqrt{nb^p}\,V_n \to_d N\{0, \tilde\sigma^2(x, z)\}$.
Conditioned on the covariates, $R_n$ has mean 0 and a conditional variance of smaller order than $(nb^p)^{-1}$. This proves that $\sqrt{nb^p}\,T_1 \to_d N\{0, \tilde\sigma^2(x, z)\}$. Note that $T_2$ is bounded by a multiple of $\max_{1\le i\le n}|\hat\zeta_2(X_i) - \zeta(X_i)|$. Therefore, under the assumed condition (C5) that the densities of X are bounded away from zero, Lemma 3 in Dai and Shao (2023) implies $T_2 = o_p\{(nb^p)^{-1/2}\}$. Conditioned on the covariates, $T_3$ has mean 0 and a conditional variance that is $o_p\{(nb^p)^{-1}\}$ because, under condition (C5), Lemma 3 in Dai and Shao (2023) applies to the kernel weights involved. Thus, $T_3 = o_p\{(nb^p)^{-1/2}\}$. Consequently, $\tilde\mu^{(2)}(x, z) - \mu_1(x, z)$ has the same asymptotic distribution as $T_1$, the claimed result.
From Lemma 4 in Dai and Shao (2023) and (C4), the asymptotic bias $\tilde B(x, z)$ has the stated form; the second moments involved are controlled using (A4) and Lemmas 3–4 in Dai and Shao (2023), with a final step following from Lemma 2 in Dai and Shao (2023) and the continuity in (C2). The other remainder terms are handled by Lemmas 3–4 in Dai and Shao (2023) and the law of large numbers. Under (B1)–(B5) with (X, Z) and p replaced by X and q, and (C5), Lemma 8.10 in Newey and McFadden (1994) implies that
$$\max_{1\le i\le n}\big|\hat\zeta_2(X_i) - \zeta(X_i)\big| = o_p\{(nb^p)^{-1/2}\}, \qquad \text{(A3)}$$
and, hence, the estimation error of $\hat\zeta_2$ is asymptotically negligible. From Lemma 3 in Dai and Shao (2023) and the central limit theorem, the remaining normalized sum is asymptotically normal. Combining these results, we obtain $\sqrt{nb^p}\{\tilde\mu^{(2)}(x, z) - \mu_1(x, z) - \tilde B(x, z)\} \to_d N\{0, \tilde\sigma^2(x, z)\}$. This completes the proof.
Proof of Theorem 4.3.
Define
$$Q(t) = E\big[D\{Y - h_t(X, Z)\}^2\big] \quad\text{and}\quad Q_N(t) = \frac{1}{N}\sum_{i=1}^{N} D_i\{Y_i - \hat h_t(X_i, Z_i)\}^2,$$
so that $\hat\gamma$ in (11) minimizes $Q_N$ and, under (8), γ is the unique minimizer of Q. Taking derivatives with respect to t, we obtain the first derivative $\dot Q$ and the second derivative $\ddot Q = \psi$, where ψ is given in (D5). Note that $\dot Q(\gamma) = 0$ and $\psi(\gamma) > 0$. We establish the asymptotic normality of $\hat\gamma$ in the following four steps.
Step 1: Since γ is the unique minimizer of Q, from Theorem 2.1 in Newey and McFadden (1994), it suffices to prove that $\sup_{t\in\Theta}|Q_N(t) - Q(t)| = o_p(1)$. From (D3), the summands of the version of $Q_N$ with $\hat h_t$ replaced by $h_t$ are bounded by an integrable envelope times a constant c, and hence Lemma 2.4 in Newey and McFadden (1994) implies a uniform law of large numbers for that version. Based on Lemma B.3 in Newey (1994), conditions (D1)–(D4) imply that $\hat h_t$ converges to $h_t$ uniformly for all $t \in \Theta$. As a result, by a similar argument to the proof of Lemma B.3 in Newey (1994), we obtain uniform convergence of $Q_N$. Since the denominators involved are bounded away from zero and the numerator and denominator are Lipschitz continuous functions with respect to t, these results together imply that $\sup_{t\in\Theta}|Q_N(t) - Q(t)| = o_p(1)$ and, hence, $\hat\gamma \to_p \gamma$.
Step 2: Conditions (D1)–(D5) ensure that Lemma 8.11 in Newey and McFadden (1994) holds and, hence, $\sqrt{N}\,\dot Q_N(\gamma) \to_d N(0, V)$ with V the asymptotic variance of the score.
Step 3: Note that $\ddot Q_N(t)$ can be decomposed into three terms, $S_1 + S_2 + S_3$. The law of large numbers guarantees that $S_1 \to_p \psi(\gamma)$ at $t = \gamma$. A similar argument to Step 1 shows that $S_2 = o_p(1)$. For $S_3$, under (D3), the kernel estimators involved and their first two derivatives in t converge uniformly for all $t \in \Theta$ as $N \to \infty$ and, thus, $S_3 = o_p(1)$ because $\dot Q(\gamma) = 0$. This shows that $\ddot Q_N(\bar t\,) \to_p \psi(\gamma)$ for any $\bar t \to_p \gamma$.
Step 4: By Taylor's expansion, $0 = \dot Q_N(\hat\gamma) = \dot Q_N(\gamma) + \ddot Q_N(\bar t\,)(\hat\gamma - \gamma)$ for some $\bar t$ between $\hat\gamma$ and γ. From the results in Steps 1–3, $\sqrt{N}(\hat\gamma - \gamma) = -\{\psi(\gamma)\}^{-1}\sqrt{N}\,\dot Q_N(\gamma) + o_p(1) \to_d N\big(0, V/\{\psi(\gamma)\}^2\big)$. This completes the proof of (20).
Proof of Corollary 4.1.
From Theorem 4.3, (20) shows that $\hat\gamma - \gamma = O_p(N^{-1/2})$. Furthermore, Lemma 8.10 in Newey and McFadden (1994) shows that the kernel estimator in (12) satisfies
$$\sup_{(x, z)}\Big|\hat E\big(e^{\gamma Y} \mid X = x, Z = z, D = 0\big) - E\big(e^{\gamma Y} \mid X = x, Z = z, D = 0\big)\Big| = o_p\big\{(nb^p)^{-1/2}\big\}, \qquad \text{(A4)}$$
under the conditions (D1)–(D2) on the kernel $\kappa_0$ and bandwidth c. Since $\hat\gamma$ converges faster than the rate in (A4), (18) holds. As a result, (17) holds with $\hat\mu^{(2)}(x, z)$ replaced by $\hat\mu^{(3)}(x, z)$ under (B1)–(B4) and (D1)–(D5).
Under (D1)–(D5) with (X, Z) replaced by X and p replaced by q, Lemma 8.10 in Newey and McFadden (1994) implies the same uniform convergence for the kernel estimators based on X alone. From the asymptotic normality of $\hat\gamma$ in (20), $\hat\gamma - \gamma = O_p(N^{-1/2})$, which converges to 0 faster than the kernel estimation error. Hence (A3) holds when ζ is estimated by $\hat\zeta_3$. Then, the rest of the proof of the second claim follows the argument in the proof of Theorem 4.2.