Statistics
A Journal of Theoretical and Applied Statistics
Volume 57, 2023 - Issue 1
Research Article

A permutation approach to goodness-of-fit testing in regression models

Pages 123-149 | Received 14 Mar 2022, Accepted 16 Jan 2023, Published online: 02 Feb 2023

Abstract

Model checking plays an important role in parametric regression as model misspecification seriously affects the validity and efficiency of regression analysis. Model checks can be performed by constructing an empirical process from the model's fitted values and residuals. Owing to the complex covariance function of this process, however, obtaining the exact distribution of the test statistic is intractable. Several solutions to overcome this have been proposed. Simulation- and bootstrap-based approaches have been shown to be asymptotically valid; however, we show by simulation that their rate of convergence can be slow. We therefore propose to estimate the null distribution using a novel permutation-based procedure. We prove, under some mild assumptions, that this yields consistent tests under the null and some alternative hypotheses. Small sample properties of the proposed approach are studied in an extensive Monte Carlo simulation study, and a real data illustration is also provided.

1. Introduction

Regression is a fundamental statistical tool, used in many applications, for analysing the association between a dependent variable, $Y$, and a set of potentially transformed independent variables, $x \in \mathbb{R}^m$. Let $(Y_i, x_i)$, $i=1,2,\dots,n$, denote independent replications of $(Y,x)$, such that $Y$ and $x$ satisfy

(1) $Y = m(x) + \sigma(x)^{1/2}\epsilon,$

where $m(x) = E(Y\,|\,x)$ and $\sigma(x) = \operatorname{var}(Y\,|\,x) > 0$, $x \in \mathbb{R}^m$, are some unknown functions and $\epsilon$ is a random error independent of $x$ with mean zero and unit variance.

Within a parametric framework, one assumes that the true function $m(\cdot)$ belongs to a given family of functions $\mathcal{G} = \{g_\theta : \theta \in \Theta\}$, parameterized by some unknown parameter $\theta$, $\Theta \subseteq \mathbb{R}^d$, and the regression model is written as $Y_i = g_\theta(x_i) + \sigma(x_i)^{1/2}\epsilon_i$, where $x_i = (x_{i1},\dots,x_{im})^T \in \mathbb{R}^m$ is the $i$th row of the matrix $X \in \mathbb{R}^{n\times m}$ and $\epsilon_i$, $i=1,2,\dots,n$, are independent replications of $\epsilon$. Important examples are the widely used linear model $g_\theta(x_i) = x_i^T\theta$ and the logistic model $g_\theta(x_i) = \theta_0/(1+\theta_1\exp(x_i^T\theta_2))$. We are interested in testing

(2) $H_0: m \in \mathcal{G}$ against $H_1: m \notin \mathcal{G}$.

Assuming (2) is tantamount to saying that the function $m$ is equal to a function $g_{\theta_0}$ for some unknown true parameter $\theta_0$.

In parametric regression, model misspecification seriously affects the validity and efficiency of regression analysis, therefore model checking plays an important role. There is a plethora of goodness-of-fit tests for testing (2) (see González-Manteiga and Crujeiras [1] for a review). Out of these, the tests based on empirical regression processes [2], which we consider here, are particularly appealing due to their relative simplicity and informative visual representation, which can provide a useful hint on how to correct the misspecification [3]. Namely, as argued by Solari et al. [4], in practice, if we find evidence against the model, we would also like to know why the model does not fit. Inspecting the plots of empirical regression processes can provide hints for answering this question (see, e.g., Lin et al. [3]), whereas this is not as straightforward when using, for example, the smoothing-based tests [5,6].

The approach that we consider is based on the model's residuals, $\hat e = (\hat e_1,\dots,\hat e_n)^T$, where $\hat e_i = Y_i - g_{\hat\theta}(x_i)$, and $\hat\theta$ is, under (2), some consistent estimator of $\theta_0$. In particular, our approach is based on the standardized residuals $\hat\epsilon_i = \hat\sigma(x_i)^{-1/2}\hat e_i$, where $\hat\sigma$ is some consistent estimator of $\sigma$.

These standardized residuals are ordered by their respective fitted values, $\hat Y_i = g_{\hat\theta}(x_i)$, and the empirical process based on a cumulative sum of (standardized and ordered) residuals,

(3) $\hat W_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n \hat\epsilon_i\, 1(\hat Y_i \le t), \quad t \in \mathbb{R},$

where $1(\cdot)$ is the indicator function, is constructed.

The test statistic, $T$, used for testing $H_0$ can then be any function of $\hat W_n(t)$ such that large (or small) values correspond to evidence against $H_0$. The proposed approach will have good power against alternatives which cause a systematic pattern in the successive numbers of positive versus negative ordered residuals. The process $\hat W_n(t)$ can also be displayed graphically, which can provide a useful hint on how to correct the model [3].
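
To make the construction concrete, the following minimal R sketch builds the process in Equation (3) from a fitted linear model. It assumes a constant variance function (estimated here by the mean squared residual); the function name cusum_process is ours and not part of the authors' gofLM package.

```r
## A minimal sketch of the cumulative-sum process in Equation (3) for a
## fitted linear model, assuming a constant variance function.
cusum_process <- function(fit) {
  e    <- residuals(fit)
  yhat <- fitted(fit)
  eps  <- e / sqrt(mean(e^2))                    # standardized residuals
  ord  <- order(yhat)                            # order by fitted values
  list(t = yhat[ord],
       W = cumsum(eps[ord]) / sqrt(length(e)))   # W_n at the ordered Yhat_i
}

W <- cusum_process(lm(dist ~ speed, data = cars))$W  # base-R example data
max(abs(W))   # a KS-type statistic; plotting W against t gives the display
```

Plotting W against t yields the graphical display referred to above; a systematic drift away from zero is the visual signature of misspecification.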

Exact distributions of the two most commonly used types of test statistics, the Kolmogorov–Smirnov (KS) type and the Cramér–von Mises (CvM) type, for the Brownian motion and Brownian bridge stochastic processes rely on the assumption of independence [7,8]. Hence, because the model's residuals are not independent, these results cannot be used in our context. Given the complex correlation structure of the residuals, the exact distribution of the KS and CvM test statistics is intractable.

We propose to overcome this issue by using permutations to approximate the null distribution of the test statistic. Let $T^*$ be a test statistic obtained by the permutation procedure described later. The proposed permutation procedure can be based on all possible permutations or on a large number of random permutations [9–11]. The p-value is then defined in the usual way as the proportion of $T^*$ which are as or more extreme than $T$, including the identity permutation (corresponding to the original test statistic) in the case of random permutations [10–13].

Using permutations with linear regression models is common (see DiCiccio and Romano [14] and references therein), and two different permutation procedures are used. One is the raw data permutation procedure proposed by Manly [15], where the outcome variable is permuted. The other is the permutation of residuals proposed by ter Braak [16], which for homoscedastic random errors is equivalent to the permutation procedure proposed by Freedman and Lane [17] (see Lepage and Podgórski [18] for a theoretical justification of this approach in the setting with homoscedastic random errors). Both procedures were shown to provide valid inference in the context of tests for partial regression coefficients in a linear model [14,19,20]. In this paper, we modify the permutation of residuals procedure so that it can be used in goodness-of-fit testing problems in a general setting with potentially heteroscedastic random errors. To achieve this, we rely on some consistent estimator of $\sigma$. This differs from the approach of DiCiccio and Romano [14], which also accommodates heteroscedastic errors by using appropriately studentized test statistics; in our context, however, obtaining the heteroscedasticity-robust standard error needed to appropriately studentize the test statistic would be very challenging. We show, under some mild assumptions, that, under $H_0$ and conditionally on the data, the proposed approach correctly approximates the distribution of the constructed random process and therefore provides valid inference in the context of goodness-of-fit testing. We also show the consistency of our approach under some alternative hypotheses, illustrating situations where our approach can be powerful.

Small sample properties of the proposed approach are studied in an extensive Monte Carlo simulation study. It is shown that our approach performs well even when the errors are heteroscedastic or non-normal and the sample size is small. An example of the nonlinear case is also given. The approach is illustrated with data from a cross-sectional study of the dependence in daily activities.

1.1. Existing approaches based on empirical processes

Several approaches based on a cumulative sum process similar to (3) have been proposed [3,21–26]. They are mainly defined as

(4) $\tilde W_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n \hat e_i\, 1(x_i \le t),$

where $t = (t_1,\dots,t_m)^T \in \mathbb{R}^m$, $1(x_i \le t) = 1(x_{i1}\le t_1)\cdots 1(x_{im}\le t_m)$, and the test statistics are then KS or CvM statistics based on $\tilde W_n(t)$.

In contrast to our approach, the residuals in (4) are not standardized. Also, in (4) a multivariate ordering procedure is used, which can be problematic in higher dimensions [27]; therefore Lin et al. [3] and Stute and Zhu [23] ordered the residuals in (4) by the fitted values as in (3). We will use $\tilde W_n(t)$, $t\in\mathbb{R}$, to denote that the ordering is by the fitted values and $\tilde W_n(t)$, $t\in\mathbb{R}^m$, to denote the use of the multivariate ordering procedure. Lin et al. [3] also considered taking the sum only within a window specified by some positive constant $b$, i.e., using $1(t-b < x_{ij} \le t)$ in (4). This is done because the process $\tilde W_n(t)$ tends to be dominated by the residuals with small covariate values. These different ways of ordering the residuals could also be used in our proposed approach, but ordering the residuals by the fitted values is the most convenient option when one wants to test for a possible lack-of-fit of the entire fitted model. We show later how to define different orderings, avoiding the use of multivariate ordering, to investigate the lack-of-fit for selected parts of the model.

The existing approaches differ in how the null distribution of the test statistics based on (4) is obtained (approximated). The approximation proposed in Stute [2] depended on the asymptotic distribution of $\tilde W_n(t)$ but did not yield satisfactory results with small sample sizes [24]. Stute et al. [25] transformed a process similar to (4) to its innovation part to obtain a distribution-free CvM test statistic, but this result can only be applied to a situation with a single regressor. This approach has later been extended to a situation with more regressors and can be applied to generalized linear models [23]. For the homoscedastic linear model (see Stute and Zhu [23] for a more general case), the test statistic based on the transformation, $T_n$, of the process $\tilde W_n(t)$, say $T_n\tilde W_n(t)$, is, for some $x_0 < \max(\hat Y_1,\dots,\hat Y_n)$, defined as

$KS_Z = \frac{1}{n\,\psi_n(x_0)^2}\sum_{i=1}^n 1(\hat Y_i\le x_0)\,\hat\sigma(\hat Y_i)\,\big[T_n\tilde W_n(\hat Y_i)\big]^2,$

where $\psi_n(u) = \frac1n\sum_{i=1}^n 1(\hat Y_i\le u)\,\hat e_i^2$ and $\hat\sigma(u)$ is, for example, given by the Nadaraya–Watson estimator,

(5) $\hat\sigma(u) = \frac{\sum_{i=1}^n \hat e_i^2\, K_{h_n}(u-\hat Y_i)}{\sum_{i=1}^n K_{h_n}(u-\hat Y_i)},$

where $K_c$ is some kernel with bandwidth $c$. The transformed process is obtained from

$T_n\tilde W_n(t) = \tilde W_n(t) - \frac{1}{n^{3/2}}\sum_{i=1}^n 1(\hat Y_i\le t)\, r_n^T(\hat Y_i)\, A_n(\hat Y_i)^{-1}\sum_{j=1}^n 1(\hat Y_j\ge\hat Y_i)\,\hat\sigma^{-1}(\hat Y_j)\, r_n(\hat Y_j)\,\hat e_j,$

where the $m\times m$ matrix $A_n(u)$ and the $m$-vector $r_n(u)$ are given by

$A_n(u) = \frac1n\sum_{i=1}^n 1(\hat Y_i\ge u)\,\hat\sigma^{-1}(\hat Y_i)\, r_n(\hat Y_i)\, r_n^T(\hat Y_i)$ and $r_n(u) = \frac{\sum_{i=1}^n x_i\, K_{h_n}(u-\hat Y_i)}{\sum_{i=1}^n K_{h_n}(u-\hat Y_i)},$

respectively. Critical values and p-values for $KS_Z$ can be obtained from $\int_0^1 W(u)^2\,du$, where $W$ is the standard Brownian motion [23].
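
The variance estimator in Equation (5) is straightforward to compute directly. The following R sketch, with the hypothetical name nw_var, uses a Gaussian kernel; the common 1/h factor of K_h cancels between numerator and denominator, so it is omitted.

```r
## A sketch of the Nadaraya-Watson variance estimator in Equation (5);
## 'h' is the bandwidth (the paper uses 0.5/n or 1e8 in the simulations).
nw_var <- function(u, yhat, e, h) {
  w <- dnorm((u - yhat) / h)   # kernel weights K_h(u - Yhat_i), up to 1/h
  sum(w * e^2) / sum(w)
}
## e.g. sigma_hat <- sapply(yhat, nw_var, yhat = yhat, e = e, h = 0.5 / n)
```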

Su and Wei [26] and Lin et al. [3] used a simulation approach to obtain the p-values, which relies on approximating $\tilde W_n(t)$ with a random process, which for the linear model (see Lin et al. [3] for a more general case) is defined as

(6) $\tilde W^*_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n\Big[1(\hat Y_i\le t) - \Big(\sum_{k=1}^n x_k^T\, 1(\hat Y_k\le t)\Big)\Big[\sum_{k=1}^n x_k x_k^T\Big]^{-1} x_i\Big]\hat e_i Z_i,$

where $(Z_1,\dots,Z_n)$ are independent standard normal variables (for the process $\tilde W_n(t)$, $t\in\mathbb{R}^m$, $1(\hat Y_i\le t)$ would be replaced by $1(x_i\le t)$; see Su and Wei [26] and Lin et al. [3] for more details). $\tilde W^*_n$ can equivalently be defined as

(7) $\tilde W^*_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n \hat e^*_i\, 1(\hat Y_i\le t),$

where $\hat e^*_1,\dots,\hat e^*_n$ are the residuals obtained when regressing $y^*$ on $X$, where $y^* = (Y^*_1,\dots,Y^*_n)^T$ and

(8) $Y^*_i = \hat Y_i + \hat e_i Z_i, \quad i=1,\dots,n.$

Stute et al. [24] used the bootstrap [28–30] to obtain $\hat e^*_1,\dots,\hat e^*_n$. They considered the classical, smooth, wild and residual bootstrap. In the wild bootstrap, $Z_i$ in (8) is replaced by $V_i$, where $(V_1,\dots,V_n)$ are independent and identically distributed such that $E(V_i)=0$, $\operatorname{var}(V_i)=1$ and $|V_i|\le c<\infty$ for some finite $c$. In the residual bootstrap, $\hat e_i Z_i$ in (8) are replaced by an i.i.d. sample from the empirical distribution function of the (centred) $\hat e_i$'s. The approaches proposed by Stute et al. [25], Su and Wei [26] and Stute and Zhu [23] were shown to be asymptotically valid [23–25] but, as shown later, they do not attain the nominal level with a small sample size.
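
For reference, one null realization of the process in Equations (7)–(8) can be sketched in R as follows; simulate_null_process is a hypothetical name, and X is assumed to contain the intercept column.

```r
## A sketch of one null realization via Equations (7)-(8): perturb the
## fitted values with e_i * Z_i, refit on the same design, and rebuild the
## cumulative-sum process ordered by the original fitted values.
simulate_null_process <- function(fit, X) {
  yhat  <- fitted(fit)
  n     <- length(yhat)
  ystar <- yhat + residuals(fit) * rnorm(n)  # wild bootstrap: replace rnorm
                                             # by Rademacher variables V_i
  estar <- lm.fit(X, ystar)$residuals
  cumsum(estar[order(yhat)]) / sqrt(n)
}
```

Repeating this many times and computing the KS or CvM statistic of each realization yields the simulated null distribution used by these approaches.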

Hattab and Christensen [31] proposed several tests based on partial sums of residuals, where the test statistics are based on sums of a subset of the (ordered and standardized) residuals [27,32]. Defining partial sums over a subset of the residuals, as opposed to cumulative sums over all residuals as in (3) or (4), makes it possible to determine the asymptotic distribution when the (appropriately standardized) test statistic is based on the largest of the absolute values of the partial sums [27] or the largest partial sum of absolute values of the (ordered and standardized) residuals [31]. However, the rate of convergence is very slow, and using only a subset of the residuals can lead to a loss of power when the ordering is not appropriately chosen [31]. Determining the number of residuals included in the subset for which the test statistic is computed can also be problematic in practice. To solve the issue with the slow rate of convergence, Hattab and Christensen [31] proposed a Monte Carlo simulation scheme. In the case when the partial sums are based on all residuals, this approach is identical to the approach proposed in Su and Wei [26] and Lin et al. [3]. While the tests based on partial sums can, if the residuals are ordered appropriately, be more powerful under some alternatives than the tests based on cumulative sums, they are not considered here in more detail, since they lack the informative visual representation which is one of the main strengths of the approach considered here [3]. The permutation approach proposed here could, however, easily be used instead of the Monte Carlo procedure used by Hattab and Christensen [31] also for tests based on partial model checks.

2. The proposed approach and its asymptotic convergence

Here we first give more details about the proposed approach, followed by an asymptotic study of the convergence of the constructed random processes under the null hypothesis. Finally, the asymptotic convergence of the constructed random processes under a particular alternative hypothesis is studied.

2.1. The proposed approach for obtaining the null distribution

The null distribution of the proposed empirical process (see Equation (3)) is obtained with (random) permutations as described next. Let $\hat\epsilon_{\pi(i)}$ be the $i$th element of $\Pi\hat\epsilon$, where $\Pi \in \mathbb{R}^{n\times n}$ is some (random) permutation matrix, i.e., a square matrix that has exactly one entry 1 in each row and each column and 0s elsewhere. Define the modified outcome, $y^* = (Y^*_1,\dots,Y^*_n)^T$, as $Y^*_i = \hat Y_i + \hat\sigma(x_i)^{1/2}\hat\epsilon_{\pi(i)}$. The empirical process based on the modified outcome is then defined as

(9) $\hat W^*_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n \hat\epsilon^*_i\, 1(\hat Y^*_i \le t), \quad t\in\mathbb{R},$

where $\hat\epsilon^*_i$ and $\hat Y^*_i$ are obtained by regressing $y^*$ on $X$. Note that Equations (3) and (9) are very similar: the processes $\hat W_n$ and $\hat W^*_n$ are defined by the same formula but on different data.

The function $\sigma$ in (9) is re-estimated for each version of the modified outcome. This re-estimate, denoted by $\hat\sigma^*$, is then used to define the residuals $\hat\epsilon^*_i$, as will be explained in Section 2.2. When $\sigma$ is constant, $Y^*_i = \hat Y_i + \hat e_{\pi(i)}$, where $\hat e_{\pi(i)}$ is the $i$th element of $\Pi\hat e$.

The test statistics based on $\hat W_n$ considered here were the KS-type statistic $T_S = \sup_{t\in\mathbb{R}}|\hat W_n(t)|$ and the CvM-type statistic $T_C = \int_{\mathbb{R}} \hat W_n(t)^2\, F_n(dt)$, where $F_n(\cdot)$ is the empirical distribution function of $\hat Y$. The test statistics based on $\hat W^*_n$ are obtained similarly and are denoted by $T^*_S$ and $T^*_C$. The p-values are then estimated as the proportion of $T^*_S$ ($T^*_C$) which are as large as or larger than $T_S$ ($T_C$). Since we were mainly considering situations where it was computationally infeasible to calculate the permutation p-value based on the whole permutation group, i.e., all $n!$ possible permutations, the p-value was calculated based on a large number of random permutations [9–11], including the identity permutation [12,13].

We can do this because, as will be shown in Section 2.3, under some assumptions the stochastic process $\hat W^*_n$ converges weakly, conditionally on the data, to the same limit as the stochastic process $\hat W_n$. Moreover, because we use the supremum norm when assessing the convergence of stochastic processes, the test statistics $T_S$ and $T^*_S$ are equal to the norms of the stochastic processes. Since the norm on a metric space is continuous, the KS test statistics converge weakly to the same limit. A similar result can also be established for the CvM-type statistics. More on random permutations and empirical stochastic processes can be found in van der Vaart and Wellner [33].
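
Concretely, the whole procedure in the homoscedastic special case (where the modified outcome reduces to $Y^*_i = \hat Y_i + \hat e_{\pi(i)}$) can be summarized by the following R sketch using the CvM-type statistic; perm_gof_test is a hypothetical name, and this is an illustration, not the authors' gofLM implementation.

```r
## A minimal sketch of the proposed permutation test for a homoscedastic
## linear model, using the CvM-type statistic and B random permutations.
perm_gof_test <- function(y, X, B = 1000) {
  X   <- cbind(1, X)                 # design matrix with intercept
  n   <- length(y)
  cvm <- function(e, yhat) {         # T_C = int W_n(t)^2 dF_n(t)
    eps <- e / sqrt(mean(e^2))       # standardized residuals
    mean((cumsum(eps[order(yhat)]) / sqrt(n))^2)
  }
  fit  <- lm.fit(X, y)
  yhat <- y - fit$residuals
  T0   <- cvm(fit$residuals, yhat)
  Tstar <- replicate(B, {
    ystar <- yhat + sample(fit$residuals)  # permute residuals, then refit
    rfit  <- lm.fit(X, ystar)
    cvm(rfit$residuals, ystar - rfit$residuals)
  })
  mean(c(T0, Tstar) >= T0)           # p-value, identity permutation included
}
```

For example, perm_gof_test(cars$dist, cars$speed) tests the fit of a simple linear model on the base-R cars data.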

To test the lack-of-fit for a single covariate, say covariate $j$, the process can be modified by ordering the residuals by the respective covariate, i.e., using $1(x_{ij}\le t)$ instead of $1(\hat Y_i\le t)$ and $1(\hat Y^*_i\le t)$ in (3) and (9), respectively. For the linear case, it is also straightforward to test the lack-of-fit for a set of covariates, by using $1(\sum_j x_{ij}\hat\theta_j\le t)$ and $1(\sum_j x_{ij}\hat\theta^*_j\le t)$ in (3) and (9), respectively, where the sum is taken only over the defined set of covariates. Considering a set of specified covariates enables efficient detection of the lack-of-fit due to the omission of an interaction effect. For example, if the p-value obtained by defining some subset of two variables is significant at some level $\alpha$ while the two p-values obtained by defining a subset containing only one of the two variables are both insignificant at level $\alpha$, this suggests that an important interaction effect between the two variables was omitted. Furthermore, if the p-value obtained by defining a subset which includes a variable that was not present when fitting the model is significant at level $\alpha$, this suggests that the variable should be added to the model. Such targeted tests are also important when the model includes, besides the linear component of a covariate, its nonlinear transformation, and one wants to test the adequacy of the assumed nonlinear relation with the outcome.
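
As an illustration, a targeted version of the CvM-type statistic only changes the ordering of the residuals; a minimal sketch (with the hypothetical name cvm_target) for a single covariate is:

```r
## A sketch of a targeted lack-of-fit check: the same CvM-type statistic,
## but with the residuals ordered by a single covariate x_j rather than by
## the fitted values.
cvm_target <- function(e, xj) {
  eps <- e / sqrt(mean(e^2))
  mean((cumsum(eps[order(xj)]) / sqrt(length(e)))^2)
}
```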

One could be tempted to estimate the null distribution of the test statistic by constructing the random process directly from the permuted residuals, $\hat\epsilon_{\pi(i)}$, without refitting the model. Such a test will, however, be conservative, as this overestimates the variance of the constructed random process. When the residuals are sorted by the fitted values, neighbouring residuals are negatively correlated (especially residuals close to each other in this ordering). When the residuals are randomly permuted without refitting, some negative correlation is preserved, but neighbouring residuals are less negatively correlated than in the original ordering, which overestimates the variance of the random process. When $n$ is large the negative correlation becomes smaller, but the constructed stochastic process depends on many residuals that are negatively correlated, and all these small negative correlations combined remain important. Therefore, a new model is fitted after the permutation to ensure that this negative correlation of residuals that are close with respect to the ordering by the fitted values is retained.

As the model is refitted, it is necessary to use standardized residuals in (3) and (9). After permutation and refitting, the residuals are less variable (this is especially noticeable with more covariates or when the sample size is small); hence, the variability of the random processes constructed after permutation is smaller and the tests would otherwise reject the true null hypothesis too often.

2.2. Additional notation and definitions

We use $\operatorname{span}M$, $M\in\mathbb{R}^{m_1\times m_2}$, to denote the column span or the image of the matrix $M$. Let $\mathcal{X}\subseteq\mathbb{R}^m$ denote the set of all possible outcomes of the random vector $x$ that generates the rows of the matrix $X$. The stochastic processes that we will use will be indexed by the set of functions $\mathcal{F}$, defined as $\mathcal{F} = \{f_t,\ t\in\mathbb{R},\ f_t:\mathcal{X}\to\mathbb{R},\ f_t(x) = 1(g_\theta(x)\le t)\}$. Rewrite the regression model in matrix form as $y = g_\theta(X) + e$, where $y = (Y_1,\dots,Y_n)^T\in\mathbb{R}^n$, $g:\mathbb{R}^m\to\mathbb{R}$ is some function dependent on the unknown parameter $\theta\in\Theta\subseteq\mathbb{R}^d$, and $g_\theta(X)$ indicates that the function $g_\theta$ is applied to every row of the matrix $X$. Here $e = (e_1,\dots,e_n)^T\in\mathbb{R}^n$ is an unknown random vector of errors with independent components, distributed as $e_i = \sigma(x_i)^{1/2}\epsilon_i$. The estimators of the parameters $\theta$ and $\sigma$, as well as any quantity that uses these estimators in place of the parameters they are estimating, will be denoted by $\hat{\ }$. For example, $\hat\theta$ denotes the estimator of $\theta$ and $\hat\sigma$ denotes the estimator of $\sigma$.

Let the vectors $\phi_i$ be defined by the equation $\sqrt n(\hat\theta - \theta) = \frac{1}{\sqrt n}\sum_{i=1}^n \phi_i e_i + o_P(1)$, and let the matrix $\Phi_n$ be defined as $\Phi_n = \frac1n(\phi_1 \cdots \phi_n)$. Let the function $h_\theta$ denote the derivative, stored as a row vector, of the function $\eta\mapsto g_\eta$ at $\theta$, and define the matrix $H_n = (h_\theta(x_1)^T \cdots h_\theta(x_n)^T)^T$. For a permutation matrix $\Pi$, define the vector of modified outcomes $y^*$ as $y^* = \hat y + \hat\Sigma_n^{1/2}\Pi\hat\Sigma_n^{-1/2}\hat e$, where the diagonal matrix $\hat\Sigma_n$ has the elements $\hat\sigma(x_1),\dots,\hat\sigma(x_n)$ on the diagonal. Let $\Sigma_n$ be the diagonal matrix that has the elements $\sigma(x_1),\dots,\sigma(x_n)$ on the diagonal. Let $\hat\theta^*$ and $\hat\sigma^*$ denote the estimates of the parameter vector $\theta$ and the function $\sigma$, respectively, obtained from the outcome vector $y^*$, and let $\hat\Sigma^*_n$ be the diagonal matrix with diagonal elements $\hat\sigma^*(x_1),\dots,\hat\sigma^*(x_n)$. The empirical process obtained using the modified outcomes, re-estimating $\theta$ and $\sigma$, can then be written as $\hat W^*_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n \hat\epsilon^*_i\, 1(g_{\hat\theta^*}(x_i)\le t)$, where, for $i=1,\dots,n$, $\hat\epsilon^*_i = \hat\sigma^*(x_i)^{-1/2}\hat e^*_i$ and $\hat e^*_i = y^*_i - g_{\hat\theta^*}(x_i)$.

2.3. Asymptotic behaviour under H0

The following assumptions are made about the data.

(A1)

Random vectors $x_i$, $i=1,2,\dots$, are i.i.d., have bounded variances and are independent of the i.i.d. random variables $\epsilon_i$, which have zero mean, variance 1, and for which there exists some $\delta>0$ such that $E(|\epsilon_1|^{2+\delta})<\infty$.

(A2)

Assume that the function $\theta\mapsto g_\theta(x)$ is differentiable in some neighbourhood of $\theta_0$ for every $x\in\mathbb{R}^m$. Denote its derivative at $\theta_0$ by $h_{\theta_0}$, so that $h_{\theta_0}(x)\in\mathbb{R}^{1\times d}$, and assume that the function $x\mapsto h_{\theta_0}(x)$ is bounded uniformly in $x$.

Assume also that the function $\theta\mapsto h_\theta$ is differentiable in some neighbourhood of $\theta_0$ and that its derivatives are bounded uniformly in $x$.

We assume that the following holds for the estimators $\hat\theta$ and $\hat\Sigma_n$.

(B1)

The expression $\sqrt n(\hat\theta-\theta_0)$ satisfies $\sqrt n(\hat\theta-\theta_0) = \frac{1}{\sqrt n}\sum_{i=1}^n \phi_i e_i + o_P(1)$, where $\phi_i\in\mathbb{R}^d$ is some vector that is independent of $e_i$. We additionally assume that the elements of the vectors $\phi_i$ all have absolute values between $\phi_{\min}$ and $\phi_{\max}$, where $0<\phi_{\min}<\phi_{\max}<\infty$.

Assume also that there exists a constant $C<\infty$ such that $\|x_i\|\,\|\phi_i\| < C$ for every $i$ almost surely.

(B2)

We assume that our data and the estimator $\hat\sigma$ are such that there exists some number $r>0$ such that $\sup_{x\in\mathcal{X}}|\hat\sigma(x)-\sigma(x)| = O_P(n^{-r})$. There exist some $\sigma_{\min}>0$ and $\sigma_{\max}$, $\sigma_{\min}<\sigma_{\max}<\infty$, so that for every $i$, $\sigma_{\min}\le|\sigma(x_i)|\le\sigma_{\max}$ and $\sigma_{\min}\le|\hat\sigma(x_i)|\le\sigma_{\max}$ almost surely.

The following assumptions are made regarding the permutations and the estimators $\hat\theta^*$ and $\hat\Sigma^*_n$.

(C1)

Assume that the permutation matrix $\Pi$ is independent of $X$ and $\epsilon$.

(C2)

Assume that $\sqrt n(\hat\theta^*-\theta_0) = \frac{1}{\sqrt n}\sum_{i=1}^n \phi_i e^*_i + o_P(1)$, where $e^*_i = y^*_i - g_{\theta_0}(x_i)$.

(C3)

Assume that $\sup_{x_i\in\mathcal{X}}|\hat\sigma^*(x_i)-\sigma(x_i)| = O_P(n^{-r})$. Assume also that $\sigma_{\min}\le|\hat\sigma^*(x_i)|\le\sigma_{\max}$ for every $i$.

(C4)

As $n\to\infty$, it holds that

(10) $\big\|(I-H_n\Phi_n)\,\Sigma_n^{1/2}\mathbf{1}\big\| = o_P(1),$

for every choice of the permutation matrix $\Pi$.

The assumptions, apart from Assumption (C4), are standard in this context. Assumptions (B1) and (C2) imply that the estimators are consistent and converge to the fixed and unknown true parameter $\theta_0$ fast enough, as in Remark 1 of Stute et al. [25]. Assumption (C4), on the other hand, is specific to using permutations and represents an important necessary condition for the proposed permutation procedure to work. For example, assume that the function $g$ is equal to $g_\theta(X) = X\theta$ and that we use the classic ordinary least squares (OLS) estimator of the parameter $\theta$. The vectors $\phi_i$ are then equal to $\phi_i = n(X^TX)^{-1}x_i$. Observe that in this case the function $h$ is equal to $h(x_i) = x_i^T$, hence $H_n\Phi_n = X(X^TX)^{-1}X^T$. In the case that $\Sigma_n$ is such that $(\sigma(x_1)^{1/2},\dots,\sigma(x_n)^{1/2})^T\in\operatorname{span}X$, as for example in our simulated example in Section 3.1.2, it holds that $(I-H_n\Phi_n)\Sigma_n^{1/2}\mathbf{1} = (I_n - X(X^TX)^{-1}X^T)\Sigma_n^{1/2}\mathbf{1} = 0$. This holds because the matrix $I - X(X^TX)^{-1}X^T$ is the projection matrix onto the orthogonal complement of $\operatorname{span}X$. Assumption (C4) therefore holds in this case. It clearly follows that the above condition, in the case that $\Sigma_n = \sigma^2 I$, implies that $\mathbf{1}\in\operatorname{span}X$ must hold. This then implies that

(11) $\sum_{i=1}^n \hat e_i = 0.$

It can be shown that, when using the least squares estimator of the parameter $\theta$ and assuming that $\Sigma_n = \sigma^2 I$, Equality (11) implies that Assumption (C4) holds.
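
The OLS example can be verified numerically. The following short R check (our illustration, under the stated assumptions of an intercept column and constant variance) confirms that $(I-H_n\Phi_n)\Sigma_n^{1/2}\mathbf{1}$ vanishes and that the residuals sum to zero as in Equation (11).

```r
## Numerical check of the OLS case: with an intercept column, 1 lies in
## span(X), so (I - X(X'X)^{-1}X') %*% 1 = 0 and the residuals sum to zero.
set.seed(1)
n <- 100
X <- cbind(1, runif(n), runif(n))
P <- X %*% solve(crossprod(X), t(X))    # H_n Phi_n = X(X'X)^{-1}X'
max(abs((diag(n) - P) %*% rep(1, n)))   # ~ 0 up to floating point error
sum(lm.fit(X, rnorm(n))$residuals)      # Equation (11), ~ 0 as well
```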

Because a large part of the theory that follows is concerned with the asymptotic behaviour of random variables, we will use the little-o, $o_P(\cdot)$, and big-O, $O_P(\cdot)$, notation. Recall that for a sequence of random variables $X_n$, $n=1,2,\dots$, the statement $X_n = o_P(1)$ is equivalent to the statement that the sequence $X_n$ converges to zero in probability. This notation is explained in more detail in Section 14.2 of Bishop et al. [34] and in Section 2.2 of van der Vaart [35].

Recall also that the statement that the sequence of random variables $X_n$, $n=1,2,\dots$, is such that $X_n = O_P(1)$ is equivalent to the statement that $X_n$ is uniformly tight, that is, for every $\epsilon>0$ there exists some $M>0$ such that $\sup_n P(|X_n|>M)<\epsilon$. In addition, recall that for sequences of random variables $X_n, Y_n, R_n$, $n=1,2,\dots$, it holds that

  1. the sequence of random variables $X_n = o_P(R_n)$ if and only if $X_n = Y_n R_n$, where $Y_n = o_P(1)$;

  2. the sequence of random variables $X_n = O_P(R_n)$ if and only if $X_n = Y_n R_n$, where $Y_n = O_P(1)$.

The sequence $R_n$ will be equal to some deterministic function of $n$ in our case.

As in van der Vaart and Wellner [33], our theory is based on random elements; in our case, the random elements are equal to $(x,\epsilon)$. We will use $P$ to denote their law. In the proofs, we will also use $P_x$ and $P_\epsilon$ to denote the laws of the random vector $x$ and of the random variable $\epsilon$, which is independent of $x$.

The following proposition shows the convergence of the process $W_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n \epsilon_i\, 1(g_{\theta_0}(x_i)\le t)$, conditionally on the design matrix $X$.

Proposition 2.1

If Assumptions (A1) and (A2) hold, the process $W_n$ converges weakly to some tight stochastic process, conditionally on the random vectors $x_i$, $i=1,2,\dots$.

Proof.

Define the family of functions $\mathcal{F}$ as $\mathcal{F} = \{f_t,\ t\in\mathbb{R},\ f_t:\mathcal{X}\to\mathbb{R},\ f_t(x) = 1(g_{\theta_0}(x)\le t)\}$. It is not difficult to show that the family $\mathcal{F}$ is Donsker; one way of doing this is with the help of VC-classes and entropy numbers. Observe also that in the special case m = 1 we are dealing with simple indicator functions, which are the classic example of a Donsker class of functions.

Since the family of functions $\mathcal{F}$ is Donsker, the process $t\mapsto\mathbb{G}f_t$ converges weakly to a tight, Borel measurable element in $\ell^\infty(\mathcal{F})$. By the multiplier central limit theorem, Theorem 2.9.6 in van der Vaart and Wellner [33], it holds that, conditionally on $x_i$, $i=1,2,\dots$, the process $t\mapsto\frac{1}{\sqrt n}\sum_{i=1}^n\epsilon_i\big(f_t(x_i)-P_x f_t\big)$ converges weakly to a tight stochastic process. This is equivalent to the joint weak convergence of every finite number of marginals $f_{t_1},\dots,f_{t_k}$ together with the existence of a semimetric $\rho$ such that $\mathcal{F}$ is totally bounded and the sequence of processes is asymptotically $\rho$-equicontinuous.

Now define the new semimetric $\bar\rho$ on $\mathcal{F}$ as $\bar\rho(f_t,f_s) = \max\big(\rho(f_t,f_s), L_1(f_t,f_s)\big)$, where $L_1$ denotes the classic $L_1$ norm. Observe that the space $\mathcal{F}$ is totally bounded with respect to the semimetric $\bar\rho$. For any $\delta>0$, let $\mathcal{F}_\delta$ be defined as $\mathcal{F}_\delta = \{f-g:\ f,g\in\mathcal{F},\ \bar\rho(f,g)<\delta\}$. For any $\eta>0$ there exists a $\delta>0$ so that

$E_\epsilon\|W_n\|_{\mathcal{F}_\delta} = E_\epsilon\sup_{t,s:\,\bar\rho(f_t,f_s)<\delta}|W_n(t)-W_n(s)| \le E_\epsilon\sup_{t,s:\,\bar\rho(f_t,f_s)<\delta}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n\epsilon_i\big(f_t(x_i)-f_s(x_i)\big)\Big| + E_\epsilon\sup_{t,s:\,\bar\rho(f_t,f_s)<\delta}\big|P_x f_t - P_x f_s\big| \le 2\eta.$

By Lemma 2.3.11 in van der Vaart and Wellner [33], the process $W_n$ converges to a tight limit.

In order for the proposed approach to be useful in practice, the unknown quantities in $W_n$ need to be replaced by their respective estimators. The following theorem shows the convergence of the resulting process, $\hat W_n$, conditionally on $X$.

Theorem 2.1

If Assumptions (A1), (A2), (B1) and (B2) hold, the process $\hat W_n$ converges weakly to some tight stochastic process, conditionally on the random vectors $x_i$, $i=1,2,\dots$.

The limit is a Gaussian process with mean zero and covariance function equal to

$K(t,s) = \lim_{n\to\infty}\frac1n\, 1(g_{\theta_0}(X)\le t)^T\,\Sigma_n^{-\frac12}(I-H_n\Phi_n)\,\Sigma_n\,(I-H_n\Phi_n)^T\Sigma_n^{-\frac12}\, 1(g_{\theta_0}(X)\le s).$

Proof.

Observe that from (B2) and the fact that the square root function is concave, it follows that

(12) $\left|\frac{1}{\sqrt{\hat\sigma(x_i)}}-\frac{1}{\sqrt{\sigma(x_i)}}\right| = \left|\frac{\sqrt{\sigma(x_i)}-\sqrt{\hat\sigma(x_i)}}{\sqrt{\hat\sigma(x_i)}\sqrt{\sigma(x_i)}}\right| \le \frac{\big|\sqrt{\sigma(x_i)}-\sqrt{\hat\sigma(x_i)}\big|}{\sigma_{\min}} \le \frac{\sqrt{|\sigma(x_i)-\hat\sigma(x_i)|}}{\sigma_{\min}} = \frac{O_P(n^{-\frac r2})}{\sigma_{\min}} = O_P\big(n^{-\frac r2}\big).$

It holds that

(13) $\hat W_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n\frac{1}{\sqrt{\sigma(x_i)}}\hat e_i\, 1(g_{\hat\theta}(x_i)\le t) + \frac{1}{\sqrt n}\sum_{i=1}^n\Big(\frac{1}{\sqrt{\hat\sigma(x_i)}}-\frac{1}{\sqrt{\sigma(x_i)}}\Big)\hat e_i\, 1(g_{\hat\theta}(x_i)\le t) = \frac{1}{\sqrt n}\sum_{i=1}^n\epsilon_i\, 1(g_{\hat\theta}(x_i)\le t) + \frac{1}{\sqrt n}\sum_{i=1}^n\frac{1}{\sqrt{\sigma(x_i)}}(\hat e_i-e_i)\, 1(g_{\hat\theta}(x_i)\le t) + \frac{1}{\sqrt n}\sum_{i=1}^n\Big(\frac{1}{\sqrt{\hat\sigma(x_i)}}-\frac{1}{\sqrt{\sigma(x_i)}}\Big)e_i\, 1(g_{\hat\theta}(x_i)\le t) + \frac{1}{\sqrt n}\sum_{i=1}^n\Big(\frac{1}{\sqrt{\hat\sigma(x_i)}}-\frac{1}{\sqrt{\sigma(x_i)}}\Big)(\hat e_i-e_i)\, 1(g_{\hat\theta}(x_i)\le t).$

From (12) it follows that the third term in the above sum is equal to

(14) $\frac{1}{\sqrt n}\sum_{i=1}^n\Big(\frac{1}{\sqrt{\hat\sigma(x_i)}}-\frac{1}{\sqrt{\sigma(x_i)}}\Big)e_i\, 1(g_{\hat\theta}(x_i)\le t) = \frac{1}{\sqrt n}\sum_{i=1}^n O_P\big(n^{-\frac r2}\big)\, e_i\, 1(g_{\hat\theta}(x_i)\le t).$

Let $s_n = \sum_{i=1}^n\sigma(x_i)$. For every $\delta>0$ it holds that

$0 \le \lim_{n\to\infty}\frac{1}{s_n}\sum_{i=1}^n E\big(e_i^2\, 1(|e_i|\ge\delta\sqrt{s_n})\big) \le \lim_{n\to\infty}\frac{1}{n\,\sigma_{\min}}\sum_{i=1}^n E\Big(e_i^2\, 1\big(|\epsilon_i|\ge\delta\sqrt{n\,\sigma_{\min}/\sigma_{\max}}\big)\Big) = 0,$

because $\operatorname{var}(\epsilon_i)=1$. Therefore we can use Lindeberg's central limit theorem to get that (14) is $o_P(1)$.

From (12) and (B1) it follows that the fourth term in (13) can be written as

$\frac{1}{\sqrt n}\sum_{i=1}^n\Big(\frac{1}{\sqrt{\hat\sigma(x_i)}}-\frac{1}{\sqrt{\sigma(x_i)}}\Big)(\hat e_i-e_i)\, 1(g_{\hat\theta}(x_i)\le t) = \frac{1}{\sqrt n}\sum_{i=1}^n O_P\big(n^{-\frac r2}\big)\big(g_{\theta_0}(x_i)-g_{\hat\theta}(x_i)\big)\, 1(g_{\hat\theta}(x_i)\le t) = \frac{1}{\sqrt n}\,O_P\big(n^{-\frac r2}\big)\sum_{i=1}^n h_{\theta_0}(x_i)(\theta_0-\hat\theta)\, 1(g_{\hat\theta}(x_i)\le t) + o_P(1) = O_P\big(n^{-\frac r2}\big)\Big(\frac1n\sum_{i=1}^n 1(g_{\hat\theta}(x_i)\le t)\, h_{\theta_0}(x_i)\Big)\Big(\frac{1}{\sqrt n}\sum_{i=1}^n\phi_i\sqrt{\sigma(x_i)}\,\epsilon_i\Big) + o_P(1).$

We can use the strong law of large numbers on the second and Lindeberg's central limit theorem, as before, on the third factor in the last line. Hence it follows that the fourth term in (13) is also $o_P(1)$, and therefore

(15) $\hat W_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n\epsilon_i\, 1(g_{\hat\theta}(x_i)\le t) + \frac{1}{\sqrt n}\sum_{i=1}^n\frac{1}{\sqrt{\sigma(x_i)}}(\hat e_i-e_i)\, 1(g_{\hat\theta}(x_i)\le t) + o_P(1).$

By Assumption (B1), it follows that

$\hat W_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n\epsilon_i\, 1(g_{\hat\theta}(x_i)\le t) - \Big(\frac1n\sum_{i=1}^n\frac{1}{\sqrt{\sigma(x_i)}}h_{\theta_0}(x_i)\, 1(g_{\hat\theta}(x_i)\le t)\Big)\Big(\frac{1}{\sqrt n}\sum_{i=1}^n\phi_i\sqrt{\sigma(x_i)}\,\epsilon_i\Big) + o_P(1).$

From Proposition 2.1, which implies the equicontinuity of the first summand, and by (A2), it follows that

$\hat W_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n\epsilon_i\, 1(g_{\theta_0}(x_i)\le t) - \Big(\frac1n\sum_{i=1}^n\frac{1}{\sqrt{\sigma(x_i)}}h_{\theta_0}(x_i)\, 1(g_{\theta_0}(x_i)\le t)\Big)\Big(\frac{1}{\sqrt n}\sum_{i=1}^n\phi_i\sqrt{\sigma(x_i)}\,\epsilon_i\Big) + o_P(1) = \frac{1}{\sqrt n}\, 1(g_{\theta_0}(X)\le t)^T\Sigma_n^{-\frac12}e - \frac{1}{\sqrt n}\, 1(g_{\theta_0}(X)\le t)^T\Sigma_n^{-\frac12}H_n(\Phi_n e) + o_P(1) = \frac{1}{\sqrt n}\, 1(g_{\theta_0}(X)\le t)^T\Sigma_n^{-\frac12}(I-H_n\Phi_n)\,e + o_P(1).$

Now we can, as in the proof of Theorem 19.23 in van der Vaart [35], with the help of Proposition 2.1, prove that the process $\hat W_n$ converges to a tight stochastic process conditionally on $x_i$, $i=1,2,\dots$.

Observe that the covariance function of the process $\hat W_n$, conditionally on $X$, is equal to

$\operatorname{cov}\big(\hat W_n(t),\hat W_n(s)\,\big|\,X\big) = \frac1n\, 1(g_{\theta_0}(X)\le t)^T\Sigma_n^{-\frac12}(I-H_n\Phi_n)\,\Sigma_n\,(I-H_n\Phi_n)^T\Sigma_n^{-\frac12}\, 1(g_{\theta_0}(X)\le s) + o_P(1).$

The last part of the previous proof could also have been proved in a more direct way, as in the proof of Proposition 2.1.

Theorem 2.2 then shows that the processes based on the modified outcomes, $\hat W^*_n$, converge weakly conditionally on the data and that their limits are equal to the limits of the processes based on the original outcomes. The proof borrows some steps from the proofs of Theorems 3.7.1 and 3.7.2 in van der Vaart and Wellner [33]. Since the permutations are in our case restricted to permutations of $\hat\epsilon_i$, $i=1,\dots,n$, and since the sums in the permuted processes in our method have length $n$, we could not use the theorems from van der Vaart and Wellner [33] directly.

Theorem 2.2

Assume that (A1), (A2), (B1), (B2) and (C1)–(C4) hold. The processes $\hat W^*_n$ converge, conditionally on $x_i,\epsilon_i$, $i=1,2,\dots$, to a tight stochastic process that is equal to the limit of the processes $\hat W_n$ conditionally on $x_i$, $i=1,2,\dots$.

Proof.

Recall first that, by definition, $y^* = \hat y + \hat\Sigma_n^{\frac12}\Pi\hat\Sigma_n^{-\frac12}\hat e$. We can now use Assumptions (C2) and (C3), as in (12), to prove, as in the first steps of the proof of Theorem 2.1, that

(16) $\hat W^*_n(t) = \frac{1}{\sqrt n}\sum_{i=1}^n\hat\epsilon^*_i\, 1(g_{\hat\theta^*}(x_i)\le t) = \frac{1}{\sqrt n}\, 1(g_{\hat\theta^*}(X)\le t)^T\hat\epsilon^* = \frac{1}{\sqrt n}\, 1(g_{\theta_0}(X)\le t)^T\Sigma_n^{-\frac12}(I-H_n\Phi_n)\,\Sigma_n^{\frac12}\,\Pi\hat\epsilon + o_P(1).$

Observe that

(17) $\operatorname{var}(\Pi\hat\epsilon\,|\,X,\epsilon) = C_1 I + C_2\mathbf{1}\mathbf{1}^T,$

where $C_1 = \frac1n\sum_{i=1}^n\hat\epsilon_i^2$ and $C_2 = \frac{1}{n(n-1)}\sum_{i\ne j}\hat\epsilon_i\hat\epsilon_j$, and that $C_1 = 1+o_P(1)$. From Equations (16), (17) and Assumption (C4) it therefore follows that the covariance function of the process $\hat W^*_n$ is equal to

$\operatorname{cov}\big(\hat W^*_n(t),\hat W^*_n(s)\,\big|\,X,\epsilon\big) = \frac1n\, 1(g_{\theta_0}(X)\le t)^T\Sigma_n^{-\frac12}(I-H_n\Phi_n)\,\Sigma_n\,(I-H_n\Phi_n)^T\Sigma_n^{-\frac12}\, 1(g_{\theta_0}(X)\le s) + o_P(1).$

The covariance functions of the processes $\hat W_n$ and $\hat W^*_n$ are therefore asymptotically equal.

Therefore the Cramér–Wold device and the central limit theorem for non-identically distributed random variables, for example Lindeberg's CLT, imply that the marginals of these processes converge to the same limits.

For the asymptotic $\rho$-equicontinuity, first observe that

$\big|\hat W^*_n(s) - \hat W^*_n(t)\big| = \Big|\frac{1}{\sqrt n}\sum_{i=1}^n\epsilon_{\pi(i)}\big(1(g_{\theta_0}(x_i)\le s) - 1(g_{\theta_0}(x_i)\le t)\big)\Big| + o_P(1).$

Hence we can prove the asymptotic $\rho$-equicontinuity of the process $\hat W^*_n$ in almost the same way as it is proven in the second half of the proofs of Theorems 3.7.1 and 3.7.2 in van der Vaart and Wellner [33]. First note that, since $X$ is fixed in our setting, we cannot use the theory of empirical stochastic processes for i.i.d. random variables; but since we are dealing with indicator functions, we are in fact dealing with sampling without replacement. Therefore we can use Hoeffding's inequality to bound the above expression from above by the same expression where the $\epsilon_i$ are sampled with replacement. Then we can bound the resulting expression by a constant times the same expression containing Poissonization. After this, we can use the conditional multiplier theorem in van der Vaart and Wellner [33], which implies the desired result.

2.4. Power

In this section, we briefly examine the power of our approach. As before, assume that we use the data $X$ and the function $g_\theta$ to model our data and that $\hat\theta$ converges to some limit $\theta'$. Assume that the data $y$ are generated so that $m(x) = g_{\theta'}(x) + r(x)$, where $r$ is some unknown function. As before, we assume that $\hat\sigma$ converges to some limit $\sigma'$.

Assume that the following assumption holds.

(P1)

The functions $r$ and $g$ are such that there exists an interval $T$ and an infinite number of $x_i$ with $g_{\theta'}(x_i)\in T$, such that either $r(x_i)>0$ for every such $x_i$ or $r(x_i)<0$ for every such $x_i$.

If we assume further that Assumptions (A1), (A2), (B1), (B2) and (C1)–(C4) hold and, in addition, that $E(\hat\epsilon_i)=0$ holds for every $i$, then our method rejects the null hypothesis. The reason for this is that the mean function of the process $\hat W_n$ does not converge on $T$, while the mean function of the processes $\hat W^*_n$ is zero, as before.

For example, let $x_{i1}\sim U(0,1)$ and $x_{i2}\sim U(0,1)$ be i.i.d., where $U(a,b)$ denotes the uniform distribution on $[a,b]$, and let $x_{i1}$ and $x_{i2}$ be independent of each other and of $\epsilon_i$.

If $m(x) = 1+\theta x_1+x_1^2$ and $g_\theta(x) = 1+\theta x_1$, then the function $r$ is equal to $r(x) = x_1^2 + (\theta-\theta')x_1$, and it can be shown that Assumption (P1) holds.

If, on the other hand, $m(x) = 1+\theta x_1$ and $g_\theta(x) = 1+\theta_1 x_1+\theta_2 x_2$, then the function $r$ is equal to $r(x) = (\theta-\theta'_1)x_1 - \theta'_2 x_2$. Because the sequences $\{x_{i1}\}_{i\in\mathbb{N}}$ and $\{x_{i2}\}_{i\in\mathbb{N}}$ are independent, Assumption (P1) does not hold.

3. Simulation results

To check the size of our proposed approach and to illustrate its power against several alternatives, we conducted a simulation study with two predictor variables and an intercept term. We compare our approach with the approaches proposed by Stute and Zhu [23], Lin et al. [3] and Stute et al. [24]. The fitted model is always $\hat y = \hat\theta_0 + \hat\theta_1 x_1 + \hat\theta_2 x_2$, where $\hat\theta_j$, j = 0, 1, 2, were obtained by OLS unless stated otherwise. The predictor variables were simulated independently from the uniform [0,1] distribution for n = 50, 100, 200 and 500 subjects. In each step of the simulation 1000 random permutations were performed. To estimate the variance function, we used the estimator defined in Equation (5) with a Gaussian kernel and the bandwidth set to $0.5/n$ [25] or to $10^8$. When calculating the test statistic for the approach proposed by Stute and Zhu [23] we used, as in Stute et al. [25], a Gaussian kernel with bandwidth $0.5/n$, and we set $x_0$ to the 99th centile of the predicted values [23,25]; the p-values were obtained by simulating $\int_0^1 W(u)^2\,du$. We used the Rademacher distribution (i.e., sign-flipping) for the approach proposed by Stute et al. [24] and the standard normal distribution for the approach proposed by Lin et al. [3], simulating 1000 null realizations (see Equation (7)) of the respective processes. In all approaches, the residuals were ordered by the fitted values. Each step of the simulation was repeated 10,000 times (simulation margins of error are ±0.004 and ±0.006 for $\alpha = 0.05$ and 0.1, respectively). All analyses were performed with the R language for statistical computing, version 3.6.0 [36]. All methods were programmed in R by the authors and are available in the R package gofLM (available upon request from the authors). The results when using the CvM or the KS test statistic were similar in terms of size; using the CvM test statistic was, however, more powerful (data not shown), hence we only show the results for the CvM test statistic.
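
For concreteness, a single replicate of the size simulation in the homoscedastic case can be sketched as follows, reusing the hypothetical perm_gof_test function sketched in Section 2.1.

```r
## One replicate of the size simulation of Section 3.1 (homoscedastic case).
set.seed(123)
n  <- 100
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 1 * x1 + 0.25 * x2 + rnorm(n)    # theta_1 = 1, sigma^2 = 1
perm_gof_test(y, cbind(x1, x2), B = 1000)  # repeat 10,000 times for the size
```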

3.1. Size of the test

The size was verified with a simulated example where the outcome variable was simulated from $y_i = 1+\theta_1 x_{i1} + 0.25 x_{i2} + e_i$.

3.1.1. Normal homoscedastic random errors.

Here $e_i\sim N(0,\sigma^2)$, where $N(\cdot)$ denotes the normal distribution. Different values of $\theta_1 = 0.5$, 1 and 2 and $\sigma^2 = 0.5$, 1 and 2 were considered.

The results for different $\theta_1$ and $\sigma^2$ were very similar (data not shown), hence Figure 1 shows the difference between the empirical rejection rate and the nominal level, $\alpha$ (choosing $\alpha = 0.05$, 0.1 and 0.95 to give an overview over a wide range of $\alpha$s), only for $\theta_1 = 1$ and $\sigma^2 = 1$. The performance of our proposed approach was good regardless of the choice of the bandwidth, with the difference between the empirical rejection rate and the nominal level not exceeding the simulation margins of error for any sample size $n$. In contrast, the performance of the other approaches was poor, with the difference between the empirical rejection rate and the nominal level exceeding the simulation margins of error unless the sample size was large. Note that the results presented in Figure 1 imply that the distribution of the p-values when using the approaches proposed by Stute and Zhu [23], Lin et al. [3] and Stute et al. [24] is not uniform: the first approach is too conservative, with the null hypothesis being rejected less often than the nominal level, and the latter two approaches are too liberal, rejecting the null hypothesis too often.

Figure 1. The difference between the empirical rejection rate and the nominal level, $\alpha$, for some chosen values of $\alpha = 0.05$, 0.1 and 0.95 (columns) for the example with normal homoscedastic random errors using $\theta_1 = 1$ and $\sigma^2 = 1$. Shaded areas are simulation margins of error. Different colours represent different approaches.


3.1.2. Normal heteroscedastic random errors.

Here we illustrate how the approaches perform in the presence of heteroscedasticity, with $e_i\sim N(0,\sigma_i^2)$, where $\sigma_i^2 = (1+\psi x_{i1})^2$. Different values of $\psi = -0.5$, $-0.25$, 0, 0.25 and 0.5 were considered, using $\theta_1 = 1$.

The results when using different values of $\psi$ were similar (data not shown), hence Figure 2 shows only the results for $\psi = -0.5$ and 0.5. The results were very similar to the example with normal homoscedastic random errors, with our proposed approach with the bandwidth set to $0.5/n$ attaining the nominal level, while the other approaches achieved this only when the sample size was large, being too conservative (the approach proposed by Stute and Zhu [23]) or too liberal (the approaches proposed by Lin et al. [3] and Stute et al. [24]) otherwise. As expected, setting the bandwidth to $10^8$ yielded slightly worse results than using $0.5/n$, with the former being slightly conservative for some values of $\psi$; the differences between the empirical rejection rate and the nominal level were, however, not substantial.

Figure 2. The difference between the empirical rejection rate and the nominal level, $\alpha$, for some chosen values of $\alpha = 0.05$, 0.1 and 0.95 (columns) for the example with normal heteroscedastic random errors using different values of $\psi = -0.5$ and 0.5 (rows). Shaded areas are simulation margins of error. Different colours represent different approaches.


3.1.3. Non-normal homoscedastic random errors.

The size with non-normal random errors was checked with an example where $e_i = \epsilon_i - as$, $\epsilon_i\sim\gamma(a,s)$, and $a$ and $s$ are the shape and the scale parameters of the gamma distribution ($\gamma(\cdot)$), respectively, using $\theta_1 = 1$. The scale parameter was set to s = 1, while different values of the shape parameter were considered: a = 2, 4, 6, 8 and 10. Larger values of $a$ mean that the distribution of the error term is more symmetric but also more variable.

Figure 3 shows only the results for a = 6 (the trends observed for other values of $a$ were similar, data not shown). Our approach with the bandwidth set to $10^8$ performed well, attaining the nominal level; setting the bandwidth to $0.5/n$ also performed well, with a few exceptions (the approach was slightly conservative with n < 200). The performance of the other approaches was worse, with the tests not attaining the nominal level for all values of $\alpha$, especially with a smaller sample size (as in the previous examples, the approach proposed by Stute and Zhu [23] was too conservative and the approaches proposed by Lin et al. [3] and Stute et al. [24] were too liberal).

Figure 3. The difference between the empirical rejection rate and the nominal level, $\alpha$, for some chosen values of $\alpha = 0.05$, 0.1 and 0.95 (columns) for the example with non-normal homoscedastic random errors using a = 6. Shaded areas are simulation margins of error. Different colours represent different approaches.


3.2. Omission of a quadratic effect or an interaction term

Here we illustrate the power to detect the lack-of-fit due to omitting a quadratic effect or an interaction term. The outcome variable was simulated from $y_i = 1+x_{i1}+0.25x_{i2}+\theta_3 x_{i3}+e_i$, where $x_{i3}=x_{i1}^2$ or $x_{i3}=x_{i1}x_{i2}$ for the examples where the quadratic and the interaction effects are omitted, respectively, and $e_i\sim N(0,0.1)$. Different values of $\theta_3 = 0$, 0.25, 0.50, 0.75 and 1 were considered. Results for $\alpha = 0.05$ and for different effect and sample sizes are reported in Figure 4 (power against the omission of the quadratic effect in the lower panels, power against the omission of the interaction effect in the upper panels; the results for n = 200 are omitted for brevity of presentation).
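
A single replicate under the omitted-interaction alternative, again using the hypothetical perm_gof_test sketched in Section 2.1, can be written as follows.

```r
## Data generation for the omitted-interaction alternative, theta_3 = 0.5
## (error variance 0.1); the fitted model omits the x1 * x2 term, so the
## test should reject increasingly often as theta_3 and n grow.
set.seed(123)
n  <- 100
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + x1 + 0.25 * x2 + 0.5 * x1 * x2 + rnorm(n, sd = sqrt(0.1))
perm_gof_test(y, cbind(x1, x2), B = 1000)
```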

Figure 4. Fraction rejected at $\alpha = 0.05$ for different approaches under the null hypothesis ($\theta_3 = 0$) and under different alternatives (omission of a quadratic term, omission of an interaction term; $\theta_3 > 0$) for various sample sizes (n = 50, 100 and 500; columns). Shaded areas are simulation margins of error. Different colours represent different approaches.


Judging from Figure 4, our approach performs well. As expected, the approach was more powerful when the effect size ($\theta_3$) and the sample size were larger. Setting the bandwidth to $10^8$ was slightly more powerful than using $0.5/n$, but the differences were not substantial (results for $0.5/n$ are not shown). The proposed approach was more powerful than the test proposed by Stute and Zhu [23] and had similar power as the test proposed by Lin et al. [3], while the approach proposed by Stute et al. [24] was, with a smaller sample size, slightly more powerful, but the differences were not substantial. Recall, however, that the approaches proposed by Lin et al. [3] and Stute et al. [24] are too liberal, rejecting the null hypothesis too often (see the results presented in the previous sections).

3.3. Nonlinear regression

Here we illustrate how our approach performs when applied to nonlinear regression. We only compare our approach with the approach proposed by Stute et al. [24], since it performed better than the other approaches in the previous examples. The outcome variable was simulated from the power model $y_i = 2+15x_{i1}^{1/3}+e_i$, where $e_i\sim N(0,1)$. The fitted model was the same as the simulated model. The R function nls [37] was used to fit the model, using the true values of the parameters as the starting values. The results are presented in Figure 5.

Figure 5. The difference between the empirical rejection rate and the nominal level, $\alpha$, for some chosen values of $\alpha = 0.05$, 0.1 and 0.95 (columns) for the nonlinear regression example where the fitted model was the same as the simulated model. Shaded areas are simulation margins of error. Different colours represent different approaches.


With the bandwidth set to $0.5/n$ our approach was too conservative; however, when setting the bandwidth to $10^8$, the empirical rejection rates were within the simulation margins of error. The approach proposed by Stute et al. [24] was again too liberal with a smaller sample size.

To check the power of the proposed approach and compare it with the approach proposed by Stute et al. [24], the outcome variable was simulated from the Michaelis–Menten model $y_i = 2+15x_{i1}/(0.3+x_{i1})+e_i$, where $e_i\sim N(0,1)$. The fitted model, however, assumed that the data were generated from the power model. Results are presented in Figure 6.

Figure 6. Fraction rejected at $\alpha = 0.05$ for different approaches for the nonlinear regression example where the fitted and simulated models were different. Shaded areas are simulation margins of error. Different colours represent different approaches.


With the bandwidth set to $10^8$, the power of our approach was larger than when using $0.5/n$, and similar to the power obtained with the approach proposed by Stute et al. [24].

4. Application

Our approach was applied to a dataset from a cross-sectional study of the dependence in daily activities among family practice non-attenders in Slovenia. We reanalysed a subset of data for 423 adult patients (aged 18 or over) requiring some level of assistance in their daily activities. We modelled the level of dependence in daily activities, determined by a questionnaire through eight items, each on a 4-point Likert scale (dependence), and exhibiting a positively skewed distribution, using two covariates: the patient's age in years (age) and the assessment of pain intensity as determined by a questionnaire on a 10-point Likert scale (pain). First, we fitted a simple additive linear model using the OLS estimator; the results are shown in Figure 7, panels (A), (B) and (C). Our approach based on the CvM-type test statistic, using a Gaussian kernel with the bandwidth set to $0.5/n$, returned a p-value of 0.027, and we can conclude that the fit of the model is poor (Figure 7, panel (A)). The tests using the CvM-type test statistic targeting age and pain returned p-values equal to 0.041 and 0.042, respectively. From the plots of the constructed random process we can conclude that the lack-of-fit due to the covariate age is due to the omission of the quadratic term (Figure 7, panel (B), black curve). Based on Figure 7, panel (C), we then also modelled the covariate pain as a restricted cubic spline using the truncated power basis with 3 knots. When modelling pain as a restricted cubic spline and age as a quadratic term, the p-value for the full model check was 0.185 and the null hypothesis was not rejected at $\alpha = 0.05$ (Figure 7, panel (D)). The test using the CvM-type test statistic targeting age and its square returned a p-value of 0.560, and the test targeting pain and its expanded terms returned a p-value of 0.386 (Figure 7, panels (E) and (F), respectively), with the plots of the constructed process revealing an improved fit of the model.

Figure 7. The constructed random process (black curves) and 1000 randomly selected random processes obtained using the proposed permutation procedure (grey curves) for the dependence in daily activities dataset. The p-values are for the CvM-type test statistic. Panel (A) refers to the model which includes only the main effects. Panels (B) and (C) are tests targeting age and pain, respectively, for the model which includes only the main effects. Panel (D) refers to the model which includes also the quadratic term of age and models pain as a restricted cubic spline with 3 knots. Panel (E) refers to the test targeting age and its square, and panel (F) to the test targeting pain and its expanded terms.


5. Discussion

In practice, graphical procedures are mainly used to evaluate the goodness-of-fit of a regression model. While these procedures are useful, they can be to a large extent subjective. They can be formalized by forming an empirical process based on the residuals. Several goodness-of-fit tests based on such processes have been proposed. They are, however, very computationally expensive [24], target a certain fixed alternative [3,4], rely on the asymptotic independence of the residuals [22,27] or rely on some asymptotic properties of the constructed random process [3,23,26], and can therefore yield poor results with a small sample size. We propose an approach that does not depend on the distribution of the errors and has the correct size also with a small sample size. This is achieved by using a novel permutation procedure to obtain the p-values. We prove that, conditionally on the design matrix $X$, this approach is asymptotically consistent under the null and some alternative hypotheses.

In our approach, a consistent estimator of the function $\sigma$ is required. In our simulation study, we used the standard Nadaraya–Watson estimator, for which a kernel needs to be chosen and a bandwidth needs to be specified. It is well known that the choice of the kernel is not problematic [38]. We compared two options for setting the bandwidth: one used regularly in similar problems and one representing the extreme case where $\hat\sigma$ is practically a constant. Both options gave very similar results in the linear case (even with non-constant variance), showing some robustness of our approach against the choice of the bandwidth. For the nonlinear case, however, using $0.5/n$ did not yield good results, since our approach was too conservative and had small power. In our particular example, this choice of the bandwidth led to a poor estimate of $\sigma$. With a larger bandwidth, $\sigma$ was estimated better and, consequently, our approach performed well, attaining the correct size while being as powerful as the approach proposed by Stute et al. [24] (which, with a smaller sample size, was too liberal in all our simulation examples). The question of how to consistently estimate the function $\sigma$ in a general regression setting is, however, beyond the scope of this paper; see, e.g., Fan and Yao [39].

One potential disadvantage of using (any kind of) permutation (or bootstrap-based) testing could be the large computational burden. In the homoscedastic case, this was not an issue, as it is possible to implement the tests very efficiently. For example, a typical simulated data example with two covariates, 1000 samples and 10,000 permutations took 2.6 s using our R implementation. In comparison, the tests proposed by Lin et al. [3] using the gof package (no longer available on CRAN) with 10,000 simulations took 28.6 s for the same example. For the heteroscedastic case, the computational burden is increased by the calculation of $\hat\sigma$ and $\hat\sigma^*$, the latter of which must be performed many times. For example, when using 1000 permutations, the R implementation of the method takes about 10 s for data with n = 400 on a modern computer. In comparison, an implementation of our method written in Julia takes less than 40 ms on such data, and the R implementation for homoscedastic data takes around 130 ms. With n = 1000, the R implementation for heteroscedastic data takes about a minute, while the Julia implementation takes less than 200 ms. With n = 10,000, the Julia implementation is still manageable, taking about 20 s, while the R implementation is much slower, taking about 2.5 hours. Neither of these implementations is currently close to optimal, but we do not expect this to be an important issue in practical research.

Acknowledgments

The dependence in daily activities data was provided by Zalika Klemenc Ketiš and Antonija Poplas Susič from the Health Centre Ljubljana. The dependence in daily activities data cannot be shared for confidentiality reasons. The constructive comments of two Reviewers are gratefully acknowledged.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

Jakob Peterlin is a young researcher funded by the Slovene Research Agency (ARRS). The authors acknowledge the financial support by ARRS (Methodology for data analysis in medical sciences, P3-0154, and projects J3-1761 and N1-0035).

References