Research Article

Fast Calculation of Gaussian Process Multiple-Fold Cross-Validation Residuals and their Covariances

Received 25 May 2022, Accepted 03 May 2024, Published online: 17 Jun 2024

Abstract

We generalize fast Gaussian process leave-one-out formulas to multiple-fold cross-validation, highlighting in turn the covariance structure of cross-validation residuals in simple and universal kriging frameworks. We illustrate how resulting covariances affect model diagnostics. We further establish in the case of noiseless observations that correcting for covariances between residuals in cross-validation-based estimation of the scale parameter leads back to maximum likelihood estimation. Also, we highlight in broader settings how differences between pseudo-likelihood and likelihood methods boil down to accounting or not for residual covariances. The proposed fast calculation of cross-validation residuals is implemented and benchmarked against a naive implementation, all in R. Numerical experiments highlight the substantial speed-ups that our approach enables. However, as supported by a discussion on main drivers of computational costs and by a numerical benchmark, speed-ups steeply decline as the number of folds (say, all sharing the same size) decreases. An application to a contaminant localization test case illustrates that the way of grouping observations in folds may affect model assessment and parameter fitting compared to leave-one-out. Overall, our results enable fast multiple-fold cross-validation, have consequences in model diagnostics, and pave the way to future work on hyperparameter fitting as well as on goal-oriented fold design. Supplementary materials for this article are available online.

1 Introduction and Notation

Gaussian process (GP) models are at the heart of a number of prominent methods in spatial statistics, machine learning, and beyond. Properly validating such models and fitting underlying hyperparameters thus appear as crucial and impactful endeavors. It is often the case that, for reasons of data scarcity or otherwise, the latter need to be conducted based on a unique dataset. Cross-validation is commonly used in such a context, not only in GP modeling but also as a general approach to assess statistical models and tune parameters in a vast class of prediction algorithms. Fast leave-one-out cross-validation formulas are known for some models and approaches including GP prediction. However, leave-one-out is known to suffer from some pitfalls, and multiple-fold cross-validation is increasingly preferred over it in broader contexts, notably in the realm of machine learning.

We focus here on cross-validation in the context of GP modeling, and more specifically on adapting fast formulas for leave-one-out (LOO) to multiple-fold cross-validation and exploring how these could be efficiently exploited in topics such as model diagnostics and covariance hyperparameter fitting. One of the essential bricks of the proposed approaches turns out to be closed-form formulas for the covariance structure of cross-validation residuals. Our main results are presented both in the simple kriging and universal kriging frameworks, also with links to ridge regression. They contribute to a lively stream of research on cross-validation that pertains to GP modeling, and also to further related classes of statistical models.

Leave-one-out (LOO) and multiple-fold cross-validation in the GP framework were tackled as early as in Dubrule (Citation1983), and some further references as well as a more recent account of their use in machine learning are given in chap. 5 of Rasmussen and Williams (Citation2006). On a different note, cross-validation has led to a number of investigations in the context of spline modeling, for instance in Golub, Heath, and Wahba (Citation1979), where generalized cross-validation was used for the selection of a ridge (regularization) parameter. In the domain of GP modeling for computer experiments, cross-validation was tackled as early as in Currin et al. (Citation1988) (with a pioneering discussion on links to MLE), and leave-one-out was investigated in Bachoc (Citation2013) and found to deliver a valuable alternative to MLE in misspecified cases. A general survey of cross-validation procedures for model selection is presented in Arlot and Celisse (Citation2010). Cross-validation for model selection is further tackled in Zhang and Yang (Citation2015). Le Gratiet, Cannamela, and Iooss (Citation2015) proposed cokriging-based sequential design strategies using fast cross-validation for multi-fidelity computer codes. Fischer (Citation2016) studied model selection for GP regression by Approximation Set Coding. Martino, Laparra, and Camps-Valls (Citation2017) suggested new probabilistic cross-validation estimators for Gaussian process regression. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure were investigated in Roberts et al. (Citation2017). New prediction error estimators were defined in Rabinowicz and Rosset (Citation2020a), where novel model selection criteria in the same spirit as AIC and Mallow's Cp were suggested. LOO cross-validation for Bayesian model comparison in large data was recently tackled in Magnusson et al. (Citation2020). Kaminsky et al.
(Citation2021) introduced an efficient batch multiple-fold cross-validation Voronoi adaptive sampling technique for global surrogate modeling. Rabinowicz and Rosset (Citation2020b) introduced a bias-corrected cross-validation estimator for correlated data. Bates, Hastie, and Tibshirani (Citation2023) tackle the question of what one actually estimates in cross-validation. In contrast with the random design settings considered in the latter, as in many theoretical works pertaining to cross-validation, our baseline settings here will assume a fixed design.

LOO cross-validation consists in predicting at each of the observation locations when (temporarily) removing the corresponding observation from the dataset, and then comparing the obtained predictions to the left-out observations. As a first example, we consider a case of GP prediction for a one-dimensional test function (whose definition is recalled in Section A.2 of the supplementary material) observed at a regular design, here a 10-point subdivision of [0,1], and a GP model assuming a Matérn 5/2 stationary covariance kernel. The test function is represented in black in Figure 1, along with the GP predictor (dashed blue line) and LOO predictions at the design locations (red stars). Modeling is performed with the DiceKriging package (Roustant, Ginsbourger, and Deville Citation2012) and cross-validation relies on a fast implementation using the cv function newly available in the DiceKriging package (version 1.6.0). This simple example in a misspecified case illustrates how LOO residuals may be more representative of actual prediction errors than the built-in GP prediction standard deviation, which in this case does not depend on the actual observations and is calculated under a questionable stationarity assumption.
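The fast LOO mechanics alluded to here can be sketched numerically. The following is a minimal numpy sketch (the paper's implementation is in R; the Matérn 5/2 length-scale, the synthetic draw, and the grid below are illustrative assumptions, not the article's fitted model), checking the classical identity $e_i = (\Sigma^{-1}z)_i / (\Sigma^{-1})_{ii}$ against naive refitting in a noiseless, zero-mean simple kriging setting:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)  # regular 10-point design on [0, 1]

def matern52(a, b, ell=0.2):
    # Matern 5/2 kernel with unit variance; ell is an illustrative length-scale
    d = np.abs(a[:, None] - b[None, :]) / ell
    s = np.sqrt(5.0) * d
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

K = matern52(x, x) + 1e-10 * np.eye(10)          # jitter for numerical stability
z = np.linalg.cholesky(K) @ rng.standard_normal(10)  # synthetic GP draw

# fast LOO residuals via the precision matrix
Q = np.linalg.inv(K)
e_fast = (Q @ z) / np.diag(Q)

# naive LOO: refit without point i each time
e_naive = np.empty(10)
for i in range(10):
    j = np.delete(np.arange(10), i)
    e_naive[i] = z[i] - K[i, j] @ np.linalg.solve(K[np.ix_(j, j)], z[j])

print(np.max(np.abs(e_fast - e_naive)))  # ≈ 0 up to numerical error
```

The naive loop re-solves an (n-1)-dimensional system per point, whereas the fast route requires a single matrix inversion.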

Fig. 1 On the upper panel, GP mean predictor (dashed blue line) of the test function (black line) defined by (A.4) based on 10 evaluations at a regular grid, LOO cross-validation predictions (red stars). Lower panel: absolute prediction errors associated with GP (black line) and LOO (red stars) predictions, and GP prediction standard deviation (dashed blue line).


It has in fact been established in the seminal paper Dubrule (Citation1983) that the LOO prediction errors and variances can be calculated quite elegantly and efficiently by relying on the inverse of a certain matrix (the main focus in Dubrule (Citation1983) being on a variogram matrix extended by monomial basis functions) and quantities derived therefrom. We work here in terms of covariances and related matrices as in Bachoc (Citation2013), where fast LOO formulas were obtained for simple kriging in the square-integrable case.

An elegant by-product formula for the square norm of LOO residuals has been leveraged in Bachoc (Citation2013) for hyperparameter estimation in the stationary case, namely by minimizing this quantity as a function of covariance parameters (excluding the "variance" parameter). In particular, numerical experiments conducted in Bachoc (Citation2013) suggested that leave-one-out error minimization is more robust than maximum likelihood estimation (MLE) in the case of model misspecification, a common situation in practice. Yet, depending on the experimental design, the errors obtained by leaving one point out at a time may be far from representative of generalization errors, as illustrated in Section 6 and in Section A.2.1 of the supplementary material. While remaining with a prescribed design, multiple-fold cross-validation allows one to mitigate shortcomings of leave-one-out by considering situations where larger parts of the design (i.e., several points) are removed at a time.

Our primary aim in the present paper is to generalize fast leave-one-out formulas to multiple-fold cross-validation (MFCV), hence facilitating their use in both contexts of model diagnostics and hyperparameter optimization. We derive such formulas and highlight in turn the covariance structure of cross-validation residuals, a crucial ingredient in the investigated diagnostics and in approaches pertaining to parameter fitting. Furthermore, following ideas already sketched in Dubrule (Citation1983), we present extensions of our results to the case of universal kriging. The obtained results clarify some links between cross-validation and MLE, and also open new perspectives regarding the design of folds and ultimately also of the underlying experimental design points.

Section 2 presents the considered class of statistical models and recalls the basics of simple and universal kriging under Gaussian process assumptions, with some brief detours in Section 2.1 on how Gaussian linear and ridge regression may fit into this framework, as well as on Bayesian interpretations of universal kriging and ridge regression. Prediction equations are revisited in Section 2.2 in transductive settings thanks to Schur complement approaches, hence setting the stage for fast cross-validation. Section 3 is dedicated to fast multiple-fold cross-validation, starting in Section 3.1 with generic results that apply in Gaussian vector conditioning. We then present in Section 3.2 our main results on the fast calculation and the joint probability distribution of cross-validation residuals in the simple kriging case. The latter are then extended to universal kriging settings in Section 3.3, notably retrieving as a particular case fast cross-validation results for linear regression presented in Zhang (Citation1993). In Section 4 we present consequences of the main results in the context of model fitting. Section 4.1 focuses on graphical diagnostics via pivotal statistics for cross-validation residuals, while the two subsequent sections deal with cross-validation-based covariance parameter estimation. More specifically, Section 4.2 is devoted to the estimation of the scale parameter; Section 4.3 then pertains to the estimation of further hyperparameters building upon either the norm or the (pseudo-)likelihood of multiple-fold cross-validation residuals. Section 5 mostly consists of numerical experiments illustrating the accuracy and speed-ups offered by the closed-form formulas (Section 5.1), and some considerations regarding associated computational complexities (Section 5.2).
Section 6 presents an application to a contaminant localization test case and illustrates that grouping observations in folds may lead to improved model assessment and parameter fitting compared to LOO. The concluding Section 7 is a discussion opening perspectives on how the established results could be further exploited to improve diagnostics and parameter estimation procedures in GP modeling.

2 Models and Transductive Prediction Settings

2.1 Of Universal Kriging and Ridge Regression

Throughout the article we consider statistical models of the form
(1) $Z_i = \xi(x_i) + \varepsilon_i \quad (x_i \in D,\ i \in \{1,\dots,n\})$,
with $\xi(x) = \mu(x) + \eta(x)$ ($x \in D$), where $\eta$ is a centered GP with covariance kernel $k$, $\mu$ is a trend function, and $\varepsilon = (\varepsilon_1,\dots,\varepsilon_n)^\top \sim \mathcal{N}(0,\Sigma_\varepsilon)$ independently of $\eta$. In some cases, $\mu$ is known up to some coefficients: an unknown constant (ordinary kriging settings) or a linear combination of basis functions with unknown coefficients (universal kriging settings). When $\mu$ is known, one speaks of simple kriging settings. While several variants of kriging can be defined without distributional assumptions (and without even assuming square-integrability of $\eta$), we restrict the exposition here to the broadly used Gaussian case. However, we do not require $k$ to be stationary. Let us remark that throughout the following exposition on kriging/GP prediction, the kernel (including hyperparameters) is considered as given. In practice, the equations are also employed with plugged-in estimates of hyperparameters. Full Bayesian approaches where the uncertainty on kernel hyperparameters is further propagated to the predictions are beyond the considered scope. However, in Sections 4 and 6 we tackle the issue of model validation and hyperparameter fitting via the established cross-validation results.

We consider a linear trend case that encompasses the above settings, and also has some interesting connections to linear (ridge) regression as we will see below. We hence assume $\mu$ to be defined by
(2) $\mu(x) = \sum_{j=1}^{p} \beta_j f_j(x) \quad (x \in D)$,
where the $f_j$ ($1 \le j \le p$) are prescribed basis functions and the $\beta_j$ ($1 \le j \le p$) are real-valued coefficients. Simple kriging can be retrieved by taking $p = 1$ with $\beta_1 = 1$ and $f_1$ arbitrary (possibly null), and universal kriging with $p \ge 1$ and $\beta = (\beta_1,\dots,\beta_p)^\top$ unknown (ordinary kriging is a special case with $p = 1$ and $f_1 \equiv 1$). In simple kriging, a predictor of $\xi$ of the form $\hat\xi(x) = \mu(x) + \lambda(x)^\top Z$ ($x \in D$) is sought, where $Z = (Z_1,\dots,Z_n)^\top$. $\lambda(x)$ is determined by minimizing the residual variance
(3) $\mathrm{var}[\xi(x) - \hat\xi(x)] = k(x,x) + \lambda(x)^\top (K + \Sigma_\varepsilon)\lambda(x) - 2\,k(x)^\top \lambda(x)$,

where $K = (k(x_i,x_j))_{i,j \in \{1,\dots,n\}}$ and $k(x) = (k(x,x_j))_{j \in \{1,\dots,n\}}$. In universal kriging, the predictor is of the form $\hat\xi(x) = \lambda(x)^\top Z$ ($x \in D$), and $p$ linear constraints $f_j(x) = \sum_{i=1}^{n} \lambda_i(x) f_j(x_i)$ for $j \in \{1,\dots,p\}$ are added to the minimization of (3) to ensure unbiasedness of $\hat\xi(x)$. Minimizing (3) under these unbiasedness constraints can be addressed by Lagrange duality, leading to a linear problem featuring for each $x \in D$ a $p$-dimensional Lagrange multiplier $\ell(x)$:
(4) $\begin{pmatrix} K + \Sigma_\varepsilon & F \\ F^\top & 0 \end{pmatrix} \begin{pmatrix} \lambda(x) \\ \ell(x) \end{pmatrix} = \begin{pmatrix} k(x) \\ f(x) \end{pmatrix}$,
where $F = (f_j(x_i))_{1\le i\le n,\,1\le j\le p} \in \mathbb{R}^{n\times p}$ and $f(x) = (f_1(x),\dots,f_p(x))^\top \in \mathbb{R}^p$. Assuming that $\Sigma = K + \Sigma_\varepsilon$ and $\mathcal{I} = F^\top \Sigma^{-1} F$ are invertible, solving for $\lambda(x)$ delivers the universal kriging predictor, which can ultimately be written, as further detailed in Section A.3 of the supplementary material, as
(5) $\hat\xi(x) = f(x)^\top \hat\beta + k(x)^\top \Sigma^{-1}(Z - F\hat\beta) \quad (x \in D)$,
where $\hat\beta = \mathcal{I}^{-1} F^\top \Sigma^{-1} Z$. In simple kriging with $\mu(x) = f(x)^\top \beta$, the predictor has the same form as (5) yet with $\beta$ instead of $\hat\beta$. Furthermore, the universal kriging residual covariance writes, for arbitrary $x, x' \in D$:
(6) $\mathrm{cov}[\xi(x) - \hat\xi(x),\, \xi(x') - \hat\xi(x')] = k(x,x') - k(x)^\top \Sigma^{-1} k(x') + (f(x) - F^\top \Sigma^{-1} k(x))^\top \mathcal{I}^{-1} (f(x') - F^\top \Sigma^{-1} k(x'))$.

The simple kriging residual covariance boils down to the first two terms of the right-hand side of (6). On a different note, it is worth noting that when setting $\eta \equiv 0$ (and hence $\xi(x) = \sum_{j=1}^{p}\beta_j f_j(x)$) one retrieves thereby the equations of (generalized) least squares.
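The equivalence between the constrained system (4) and the closed form (5) can be verified numerically. Below is a small numpy sketch where $\Sigma$, $F$, $k(x)$, and $f(x)$ are generic randomly generated stand-ins (purely illustrative, not tied to any specific kernel):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 8, 2

A0 = rng.standard_normal((n, n))
Sigma = A0 @ A0.T + n * np.eye(n)  # stand-in for Sigma = K + Sigma_eps (SPD)
F = rng.standard_normal((n, p))    # basis functions at the design points
z = rng.standard_normal(n)         # observations
kx = rng.standard_normal(n)        # stand-in for k(x) at a prediction point
fx = rng.standard_normal(p)        # stand-in for f(x) at the same point

# bordered (saddle-point) system (4): solve for (lambda(x), ell(x))
M = np.block([[Sigma, F], [F.T, np.zeros((p, p))]])
sol = np.linalg.solve(M, np.concatenate([kx, fx]))
pred_via_lambda = sol[:n] @ z      # lambda(x)^T Z

# closed form (5) with the GLS estimate beta_hat
Si = np.linalg.inv(Sigma)
beta = np.linalg.solve(F.T @ Si @ F, F.T @ Si @ z)
pred_closed = fx @ beta + kx @ Si @ (z - F @ beta)

print(abs(pred_via_lambda - pred_closed))  # ≈ 0 up to numerical error
```

Solving the bordered system once per prediction point and eliminating the multiplier both lead to the same value, as the Lagrange-duality argument asserts.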

We now review the Bayesian approach to universal kriging (Omre Citation1987; Handcock and Stein Citation1993; Helbert, Dupuy, and Carraro Citation2009), which will in turn give us an occasion to incorporate ridge regression into the discussion and will prove practical in further developments throughout the article. Let us assume to this end that $\beta$ is endowed with a Gaussian prior distribution $\mathcal{N}(0,\Sigma_\beta)$ where $\Sigma_\beta$ is an invertible covariance matrix. Let us first consider $\Sigma_\beta$ as fixed. The posterior distribution of $\beta$ knowing $Z$ is Gaussian with $\mathbb{E}[\beta\,|\,Z] = \hat\beta_{\mathrm{MAP}} := (\Sigma_\beta^{-1} + F^\top \Sigma^{-1} F)^{-1} F^\top \Sigma^{-1} Z$ and
(7) $\mathrm{cov}[\beta\,|\,Z] = (\Sigma_\beta^{-1} + F^\top \Sigma^{-1} F)^{-1}$,

whereby, for $x, x' \in D$, $\mathbb{E}[\xi(x)\,|\,Z] = f(x)^\top \hat\beta_{\mathrm{MAP}} + k(x)^\top \Sigma^{-1}(Z - F\hat\beta_{\mathrm{MAP}})$ and
(8) $\mathrm{cov}[\xi(x), \xi(x')\,|\,Z] = k(x,x') - k(x)^\top \Sigma^{-1} k(x') + (f(x) - F^\top \Sigma^{-1} k(x))^\top\, \mathrm{cov}[\beta\,|\,Z]\, (f(x') - F^\top \Sigma^{-1} k(x'))$.

With $\Sigma_\beta^{-1} \to 0$, we hence have that $\mathrm{cov}[\beta\,|\,Z] \to \mathcal{I}^{-1}$, and so $\mathbb{E}[\xi(x)\,|\,Z] \to \hat\xi(x)$ and $\mathrm{cov}[\xi(x),\xi(x')\,|\,Z] \to \mathrm{cov}[\xi(x) - \hat\xi(x),\, \xi(x') - \hat\xi(x')]$.

We now turn to linear (ridge) regression and present how the resulting equations can be obtained in similar settings. In linear ridge regression, say as classically with $\Sigma_\varepsilon = \sigma^2 I_n$, one estimates $\beta$ via $\hat\beta_\lambda = (F^\top F + \lambda I_p)^{-1} F^\top Z$, where $\lambda > 0$ is a regularization parameter and $I_n$ stands for the identity matrix $(\delta_{ij})_{i,j\in\{1,\dots,n\}} \in \mathbb{R}^{n\times n}$. The canonical way to introduce this estimator is by replacing the ordinary least squares minimization by the minimization of its penalized counterpart $\beta \mapsto \|Z - F\beta\|_2^2 + \lambda \|\beta\|_2^2$.

Yet it is well known that $\hat\beta_\lambda$ can also be seen as a maximum a posteriori (MAP) estimator in a Bayesian framework, under the prior distribution $\beta \sim \mathcal{N}(0, \gamma^2 I_p)$, where $\gamma^2 = \sigma^2/\lambda$. In such a framework, one has indeed $Z \sim \mathcal{N}(0, \gamma^2 F F^\top + \sigma^2 I_n)$ and $\beta\,|\,Z = z \sim \mathcal{N}\big(F^\top (F F^\top + \lambda I_n)^{-1} z,\; \gamma^2 I_p - \gamma^2 F^\top (F F^\top + \lambda I_n)^{-1} F\big)$. From there one can obtain, for instance by using the Woodbury formula, that $F^\top (F F^\top + \lambda I_n)^{-1} = (F^\top F + \lambda I_p)^{-1} F^\top$, whereby $\mathbb{E}[\beta\,|\,Z] = (F^\top F + \lambda I_p)^{-1} F^\top Z = \hat\beta_\lambda$, which coincides in the considered Gaussian case with the MAP estimator. For $F$ of full column rank and $\gamma \to \infty$, the posterior distribution remains proper and one recovers as posterior expectation and MAP the ordinary least squares estimator $\hat\beta = (F^\top F)^{-1} F^\top Z$.

2.2 Transductive GP Prediction via Schur Complements

One speaks of transductive settings in cases where the point(s) at which one wishes to predict are fixed in advance. We consider in this section the case where the $\xi(x_i)$ (resp. $Z_i$) are to be predicted for $m \le n-1$ distinct values of $i$ ($m \le n-p$ in the case of universal kriging) from the remaining $n-m$ observations $Z_j$. Without loss of generality, we will assume that the indices at which the predictions are to be made are $\mathbf{i}_o = (1,\dots,m)$ and denote by $\mathbf{j}_o = (m+1,\dots,n)$ the remaining ones. For any vector $v \in \mathbb{R}^n$, matrix $A \in \mathbb{R}^{n\times n}$, and arbitrary vectors of ordered indices $\mathbf{i}, \mathbf{j}$, we denote here and in the following by $v[\mathbf{i}]$ the subvector of $v$ with indices in $\mathbf{i}$ and by $A[\mathbf{i},\mathbf{j}]$ the block extracted from $A$ with corresponding indices, using in turn the convention $A[\mathbf{i}] = A[\mathbf{i},\mathbf{i}]$. Similarly, we use the convention $k(x_{\mathbf{i}}) = k(x_{\mathbf{i}}, x_{\mathbf{i}})$ for covariance matrices between point subsets.

In simple kriging settings (here $\mu \equiv 0$), it is common knowledge that $\hat\xi(x_{\mathbf{i}_o})$ and the associated residual covariance matrix $\mathrm{cov}(\xi(x_{\mathbf{i}_o}) - \hat\xi(x_{\mathbf{i}_o}))$ can be obtained elegantly based on manipulating
(9) $\tilde\Sigma = \begin{pmatrix} k(x_{\mathbf{i}_o}) & k(x_{\mathbf{i}_o}, x_{\mathbf{j}_o}) \\ k(x_{\mathbf{j}_o}, x_{\mathbf{i}_o}) & k(x_{\mathbf{j}_o}) + \Sigma_\varepsilon[\mathbf{j}_o] \end{pmatrix}$.

Assuming indeed that $\tilde\Sigma$ is invertible, we obtain by the block inversion formula (see Section A.1 in the supplementary material, in particular Proposition A.1) that $\tilde\Sigma^{-1}[\mathbf{i}_o] = (k(x_{\mathbf{i}_o}) - k(x_{\mathbf{i}_o}, x_{\mathbf{j}_o}) \Sigma[\mathbf{j}_o]^{-1} k(x_{\mathbf{j}_o}, x_{\mathbf{i}_o}))^{-1}$, whereby
(10) $\mathrm{cov}(\xi(x_{\mathbf{i}_o}) - \hat\xi(x_{\mathbf{i}_o})) = (\tilde\Sigma^{-1}[\mathbf{i}_o])^{-1}$,
where $\tilde\Sigma^{-1}[\mathbf{i}_o]$ stands for the $m \times m$ upper left block of $\tilde\Sigma^{-1}$. Using now the upper right block $\tilde\Sigma^{-1}[\mathbf{i}_o,\mathbf{j}_o]$ of $\tilde\Sigma^{-1}$ and (10), we obtain similarly that $\tilde\Sigma^{-1}[\mathbf{i}_o,\mathbf{j}_o] = -\tilde\Sigma^{-1}[\mathbf{i}_o]\, k(x_{\mathbf{i}_o}, x_{\mathbf{j}_o})\, \Sigma[\mathbf{j}_o]^{-1}$, so
(11) $\hat\xi(x_{\mathbf{i}_o}) = -(\tilde\Sigma^{-1}[\mathbf{i}_o])^{-1}\, \tilde\Sigma^{-1}[\mathbf{i}_o,\mathbf{j}_o]\, Z[\mathbf{j}_o]$.

Applying now under invertibility assumptions a similar block inversion to
(12) $\Sigma = \begin{pmatrix} \Sigma[\mathbf{i}_o] & \Sigma[\mathbf{i}_o,\mathbf{j}_o] \\ \Sigma[\mathbf{j}_o,\mathbf{i}_o] & \Sigma[\mathbf{j}_o] \end{pmatrix} = \begin{pmatrix} k(x_{\mathbf{i}_o}) + \Sigma_\varepsilon[\mathbf{i}_o] & k(x_{\mathbf{i}_o},x_{\mathbf{j}_o}) + \Sigma_\varepsilon[\mathbf{i}_o,\mathbf{j}_o] \\ k(x_{\mathbf{j}_o},x_{\mathbf{i}_o}) + \Sigma_\varepsilon[\mathbf{j}_o,\mathbf{i}_o] & k(x_{\mathbf{j}_o}) + \Sigma_\varepsilon[\mathbf{j}_o] \end{pmatrix}$,
we get $\Sigma^{-1}[\mathbf{i}_o] = (\Sigma[\mathbf{i}_o] - \Sigma[\mathbf{i}_o,\mathbf{j}_o]\Sigma[\mathbf{j}_o]^{-1}\Sigma[\mathbf{j}_o,\mathbf{i}_o])^{-1}$ and $\Sigma^{-1}[\mathbf{i}_o,\mathbf{j}_o] = -\Sigma^{-1}[\mathbf{i}_o]\,\Sigma[\mathbf{i}_o,\mathbf{j}_o]\Sigma[\mathbf{j}_o]^{-1}$, whereby the BLUP of $Z[\mathbf{i}_o]$ given $Z[\mathbf{j}_o]$, denoted here and in the following by $\hat Z_{-\mathbf{i}_o}[\mathbf{i}_o]$, is given by
(13) $\hat Z_{-\mathbf{i}_o}[\mathbf{i}_o] = -(\Sigma^{-1}[\mathbf{i}_o])^{-1}\,\Sigma^{-1}[\mathbf{i}_o,\mathbf{j}_o]\,Z[\mathbf{j}_o]$.

Considering now the prediction residual $E_{\mathbf{i}_o} := Z[\mathbf{i}_o] - \hat Z_{-\mathbf{i}_o}[\mathbf{i}_o]$, we get from the latter
(14) $E_{\mathbf{i}_o} = (\Sigma^{-1}[\mathbf{i}_o])^{-1}(\Sigma^{-1}Z)[\mathbf{i}_o] = (Q[\mathbf{i}_o])^{-1}(QZ)[\mathbf{i}_o]$,
with $Q = \Sigma^{-1}$. Let us stress that $\hat Z_{-\mathbf{i}_o}[\mathbf{i}_o]$ departs from the most common simple kriging predictor via the treatment of noise, in the sense that in the considered transductive settings it can be relevant to predict the noisy $Z[\mathbf{i}_o]$ instead of $\xi(x_{\mathbf{i}_o})$, and since a general noise covariance matrix is assumed for the sake of generality, this implies a straightforward adaptation of the blocks entering into play (i.e., blocks of $\Sigma$ instead of those of $\tilde\Sigma$). Of course, in case $\tilde\Sigma = \Sigma$ (e.g., under a null observation noise, as often encountered in computer experiments), the two approaches above coincide.

We will now see that under invertibility conditions, the universal kriging predictor and residual covariance too can be retrieved based on the inverse of a matrix, namely of
(15) $\tilde M = \begin{pmatrix} k(x_{\mathbf{i}_o}) & k(x_{\mathbf{i}_o},x_{\mathbf{j}_o}) & F[\mathbf{i}_o,] \\ k(x_{\mathbf{j}_o},x_{\mathbf{i}_o}) & k(x_{\mathbf{j}_o}) + \Sigma_\varepsilon[\mathbf{j}_o] & F[\mathbf{j}_o,] \\ F[\mathbf{i}_o,]^\top & F[\mathbf{j}_o,]^\top & 0 \end{pmatrix}$,
where $F[\mathbf{i}_o,]$ is the $m \times p$ matrix of basis functions evaluated at the prediction points. As we prove below, assuming invertibility of $\tilde M$ and of $\tilde M[-\mathbf{i}_o] = \tilde M[m+1{:}n+p]$, the universal kriging residual covariance matrix can then be plainly obtained as follows:
(16) $\mathrm{cov}(\xi(x_{\mathbf{i}_o}) - \hat\xi(x_{\mathbf{i}_o})) = (\tilde M^{-1}[\mathbf{i}_o])^{-1}$.

Furthermore, the universal kriging predictor too writes similarly as in the simple kriging case:
(17) $\hat\xi(x_{\mathbf{i}_o}) = -(\tilde M^{-1}[\mathbf{i}_o])^{-1}\, \tilde M^{-1}[\mathbf{i}_o,\mathbf{j}_o]\, Z[\mathbf{j}_o]$.

In fact, in the two last equations, what changed between simple and universal kriging settings is simply the replacement of $\tilde\Sigma$ by $\tilde M$. Let us now present how this works. Assuming indeed that $\tilde M$ and $\tilde M[-\mathbf{i}_o]$ are invertible, we first apply Proposition A.1 and obtain
(18) $(\tilde M^{-1}[\mathbf{i}_o])^{-1} = k(x_{\mathbf{i}_o}) - [\,k(x_{\mathbf{i}_o},x_{\mathbf{j}_o}),\ F[\mathbf{i}_o,]\,]\; \tilde M[-\mathbf{i}_o]^{-1}\; [\,k(x_{\mathbf{i}_o},x_{\mathbf{j}_o}),\ F[\mathbf{i}_o,]\,]^\top$.

Now, provided $\Sigma[\mathbf{j}_o] = k(x_{\mathbf{j}_o}) + \Sigma_\varepsilon[\mathbf{j}_o]$ and $\mathcal{I}_{-\mathbf{i}_o} = F[\mathbf{j}_o,]^\top \Sigma[\mathbf{j}_o]^{-1} F[\mathbf{j}_o,]$ are invertible, we get using a variation of the block inversion formula (detailed in Remark A.3):
(19) $\tilde M[-\mathbf{i}_o]^{-1} = \begin{pmatrix} \Sigma[\mathbf{j}_o]^{-1} - \Sigma[\mathbf{j}_o]^{-1} F[\mathbf{j}_o,]\, \mathcal{I}_{-\mathbf{i}_o}^{-1} F[\mathbf{j}_o,]^\top \Sigma[\mathbf{j}_o]^{-1} & \Sigma[\mathbf{j}_o]^{-1} F[\mathbf{j}_o,]\, \mathcal{I}_{-\mathbf{i}_o}^{-1} \\ \mathcal{I}_{-\mathbf{i}_o}^{-1} F[\mathbf{j}_o,]^\top \Sigma[\mathbf{j}_o]^{-1} & -\mathcal{I}_{-\mathbf{i}_o}^{-1} \end{pmatrix}$.

Expanding the product on the right-hand side of (18) using (19) then delivers
(20) $[\,k(x_{\mathbf{i}_o},x_{\mathbf{j}_o}),\ F[\mathbf{i}_o,]\,]\, \tilde M[-\mathbf{i}_o]^{-1}\, [\,k(x_{\mathbf{i}_o},x_{\mathbf{j}_o}),\ F[\mathbf{i}_o,]\,]^\top = k(x_{\mathbf{i}_o},x_{\mathbf{j}_o}) \Sigma[\mathbf{j}_o]^{-1} k(x_{\mathbf{j}_o},x_{\mathbf{i}_o}) - (F[\mathbf{i}_o,] - k(x_{\mathbf{i}_o},x_{\mathbf{j}_o})\Sigma[\mathbf{j}_o]^{-1}F[\mathbf{j}_o,])\, \mathcal{I}_{-\mathbf{i}_o}^{-1}\, (F[\mathbf{i}_o,] - k(x_{\mathbf{i}_o},x_{\mathbf{j}_o})\Sigma[\mathbf{j}_o]^{-1}F[\mathbf{j}_o,])^\top$,
which establishes (16). As for (17), it then follows from Proposition A.2 and (19) that $\tilde M^{-1}[\mathbf{i}_o,\mathbf{j}_o]\, Z[\mathbf{j}_o] = -\tilde M^{-1}[\mathbf{i}_o]\,[\,k(x_{\mathbf{i}_o},x_{\mathbf{j}_o}),\ F[\mathbf{i}_o,]\,]\, \tilde M[-\mathbf{i}_o]^{-1}\, \tilde Z[-\mathbf{i}_o]$, whereby $-(\tilde M^{-1}[\mathbf{i}_o])^{-1}\,\tilde M^{-1}[\mathbf{i}_o,\mathbf{j}_o]\, Z[\mathbf{j}_o] = k(x_{\mathbf{i}_o},x_{\mathbf{j}_o})\Sigma[\mathbf{j}_o]^{-1}(Z[\mathbf{j}_o] - F[\mathbf{j}_o,]\hat\beta_{-\mathbf{i}_o}) + F[\mathbf{i}_o,]\hat\beta_{-\mathbf{i}_o}$, with $\tilde Z = (Z^\top, 0)^\top \in \mathbb{R}^{n+p}$ and $\hat\beta_{-\mathbf{i}_o}$ the GLS estimator of $\beta$ based on $Z[\mathbf{j}_o]$.

Applying now under suitable invertibility assumptions a similar block inversion to
(21) $M = \begin{pmatrix} \Sigma & F \\ F^\top & 0 \end{pmatrix}$,
we get the BLUP $\hat Z_{-\mathbf{i}_o}[\mathbf{i}_o]$ of $Z[\mathbf{i}_o]$ given $Z[\mathbf{j}_o]$ under universal kriging settings as
(22) $\hat Z_{-\mathbf{i}_o}[\mathbf{i}_o] = -(M^{-1}[\mathbf{i}_o])^{-1}\, M^{-1}[\mathbf{i}_o,\mathbf{j}_o]\, Z[\mathbf{j}_o]$.

Considering further the associated prediction residual $E_{\mathbf{i}_o} := Z[\mathbf{i}_o] - \hat Z_{-\mathbf{i}_o}[\mathbf{i}_o]$, we finally get
(23) $E_{\mathbf{i}_o} = (M^{-1}[\mathbf{i}_o])^{-1}(M^{-1}[1{:}n]\,Z)[\mathbf{i}_o] = (\tilde Q[\mathbf{i}_o])^{-1}(\tilde Q Z)[\mathbf{i}_o]$,
with $\tilde Q = Q - QF(F^\top Q F)^{-1}F^\top Q$.

3 Fast Multiple-Fold Cross-Validation

In this section we leverage transductive GP equations in settings where subsets of the $n$ observations are left out, delivering in turn the joint distribution of multiple-fold cross-validation residuals together with a fast computation approach. The key point is that the previous results (14) and (23) carry over to arbitrary ways of splitting the dataset into left-out and remaining observations (i.e., between test and training sets). Throughout the following we consider vectors of strictly increasing indices from $\{1,\dots,n\}$, and denote by $\mathcal{S}$ the set of all such index vectors. Since these index vectors are completely characterized by non-empty subsets of $\{1,\dots,n\}$, there are $2^n - 1$ elements in $\mathcal{S}$. Elements $\mathbf{i} \in \mathcal{S}$, which may also be thought of as subsets of $\{1,\dots,n\}$, are called folds.

3.1 The Generic Gaussian Vector Case

We now consider generic Gaussian cross-validation settings and deterministic functions arising in this context. Let $m \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n\times n}$ stand for a vector and a symmetric positive definite matrix, respectively. For any $\mathbf{i} \in \mathcal{S}$, we define the following cross-validation residual mapping:
(24) $e_{\mathbf{i}}(\cdot\,; m, A):\; v \in \mathbb{R}^n \mapsto e_{\mathbf{i}}(v; m, A) = v[\mathbf{i}] - m[\mathbf{i}] - A[\mathbf{i},-\mathbf{i}]\,A[-\mathbf{i}]^{-1}(v[-\mathbf{i}] - m[-\mathbf{i}])$,
delivering for $v \in \mathbb{R}^n$ the deterministic counterpart of the cross-validation residual obtained when predicting $V[\mathbf{i}]$ by $\hat V_{-\mathbf{i}}[\mathbf{i}] = m[\mathbf{i}] + A[\mathbf{i},-\mathbf{i}]\,A[-\mathbf{i}]^{-1}(V[-\mathbf{i}] - m[-\mathbf{i}])$, where $V \sim \mathcal{N}(m, A)$. The notation $-\mathbf{i}$ refers again to the extraction of sub-vectors/matrices by removing components with indices in $\mathbf{i}$.

Proposition 1

(Fast calculation of cross-validation residuals). Let $\mathbf{i} \in \mathcal{S}$ and $P = A^{-1}$. Then,
(25) $e_{\mathbf{i}}(v; m, A) = P[\mathbf{i}]^{-1}\,(P(v - m))[\mathbf{i}] \quad (v \in \mathbb{R}^n)$.

Consequently, for $V \sim \mathcal{N}(m, A)$, the $E_{\mathbf{i}} = e_{\mathbf{i}}(V; m, A)$ ($\mathbf{i} \in \mathcal{S}$) are jointly Gaussian, centered, and
(26) $\mathrm{cov}(E_{\mathbf{i}}, E_{\mathbf{j}}) = P[\mathbf{i}]^{-1}\, P[\mathbf{i},\mathbf{j}]\, P[\mathbf{j}]^{-1} \quad (\mathbf{i},\mathbf{j} \in \mathcal{S})$.

For the case of folds $\mathbf{i}_1,\dots,\mathbf{i}_q$ whose concatenation gives $(1,\dots,n)$, $E_{\mathbf{i}_1,\dots,\mathbf{i}_q} := [E_{\mathbf{i}_1}^\top,\dots,E_{\mathbf{i}_q}^\top]^\top$ has a centered $n$-dimensional Gaussian distribution with covariance matrix
(27) $\mathrm{cov}(E_{\mathbf{i}_1,\dots,\mathbf{i}_q}) = B^{-1} P B^{-1}$,
where $B = \mathrm{blockdiag}(P[\mathbf{i}_1],\dots,P[\mathbf{i}_q])$.

Proof.

Equation (25) follows from suitably expressing blocks of $A$'s inverse using the Schur complement approach of Equations (A.1) and (A.2), delivering, respectively,
$A^{-1}[\mathbf{i}] = (A[\mathbf{i}] - A[\mathbf{i},-\mathbf{i}]A[-\mathbf{i}]^{-1}A[-\mathbf{i},\mathbf{i}])^{-1}$, and
$A^{-1}[\mathbf{i},-\mathbf{i}] = -(A[\mathbf{i}] - A[\mathbf{i},-\mathbf{i}]A[-\mathbf{i}]^{-1}A[-\mathbf{i},\mathbf{i}])^{-1}\, A[\mathbf{i},-\mathbf{i}]A[-\mathbf{i}]^{-1}$.

Considering the rows indexed by $\mathbf{i}$ of the product $P(v-m) = A^{-1}(v-m)$, we hence get that $(P(v-m))[\mathbf{i}] = P[\mathbf{i}](v[\mathbf{i}]-m[\mathbf{i}]) - P[\mathbf{i}]A[\mathbf{i},-\mathbf{i}]A[-\mathbf{i}]^{-1}(v[-\mathbf{i}]-m[-\mathbf{i}])$, whereby $e_{\mathbf{i}}(v;m,A) = (P[\mathbf{i}])^{-1}(P(v-m))[\mathbf{i}] = P[\mathbf{i}]^{-1} I_n[\mathbf{i},]\,P(v-m)$, where $I_n$ denotes the $n \times n$ identity matrix. The joint Gaussianity of the $E_{\mathbf{i}}$'s and the covariance structure follow from the affine dependence of $E_{\mathbf{i}} = (P[\mathbf{i}])^{-1} I_n[\mathbf{i},]\,P(V-m)$ on $V$, so that concatenating the $q \ge 1$ random vectors $E_{\mathbf{i}_1},\dots,E_{\mathbf{i}_q}$ leads to a Gaussian vector by left multiplication of $V-m$ by a deterministic matrix. As for the covariance matrix of $E_{\mathbf{i}_1,\dots,\mathbf{i}_q}$, it follows from
$[E_{\mathbf{i}_1}^\top,\dots,E_{\mathbf{i}_q}^\top]^\top = \mathrm{blockdiag}(P[\mathbf{i}_1]^{-1},\dots,P[\mathbf{i}_q]^{-1})\,\begin{pmatrix} I_n[\mathbf{i}_1,] \\ \vdots \\ I_n[\mathbf{i}_q,] \end{pmatrix}\, P(V-m) = B^{-1} I_n P(V-m) = B^{-1}P(V-m)$,
hence $\mathrm{cov}(E_{\mathbf{i}_1,\dots,\mathbf{i}_q}) = B^{-1}PAPB^{-1} = B^{-1}PB^{-1}$. □
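Proposition 1 lends itself to a direct numerical check. The following numpy sketch (with a generic SPD matrix $A$, an arbitrary mean, and three equal-size folds, all illustrative) compares the fast residuals (25) with naive fold-by-fold conditioning and assembles the joint covariance (27):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 9
A0 = rng.standard_normal((n, n))
A = A0 @ A0.T + n * np.eye(n)        # symmetric positive definite A
m = rng.standard_normal(n)           # mean vector
v = rng.standard_normal(n)           # observed vector
folds = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]  # a partition

P = np.linalg.inv(A)
Pv = P @ (v - m)

# fast residuals, eq. (25): e_i = P[i]^{-1} (P (v - m))[i]
e_fast = np.concatenate([np.linalg.solve(P[np.ix_(i, i)], Pv[i]) for i in folds])

# naive residuals: condition V[i] on V[-i] for each fold
e_naive = []
for i in folds:
    j = np.setdiff1d(np.arange(n), i)
    pred = m[i] + A[np.ix_(i, j)] @ np.linalg.solve(A[np.ix_(j, j)], v[j] - m[j])
    e_naive.append(v[i] - pred)
e_naive = np.concatenate(e_naive)

# joint covariance, eq. (27): B^{-1} P B^{-1}, B = blockdiag(P[i_1], ..., P[i_q])
B = np.zeros((n, n))
for i in folds:
    B[np.ix_(i, i)] = P[np.ix_(i, i)]
Binv = np.linalg.inv(B)
cov_folds = Binv @ P @ Binv

print(np.max(np.abs(e_fast - e_naive)))  # ≈ 0 up to numerical error
```

A single inversion of $A$ serves all folds at once, which is the source of the speed-ups benchmarked later in the article.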

Remark 1.

The first part of Proposition 1 is purely deterministic, and hence applies to any function approximation method that boils down to Gaussian conditioning equations (e.g., regularized quadratic risk minimization in Reproducing Kernel Hilbert Spaces).

Remark 2.

Note that for arbitrary $\mathbf{i}_1,\dots,\mathbf{i}_q \in \mathcal{S}$, that is without imposing ordering between them or that they form a partition, we would have a similar result yet without the above simplification, that is, $\mathrm{cov}(E_{\mathbf{i}_1,\dots,\mathbf{i}_q}) = B^{-1}\Delta P \Delta^\top B^{-1}$ with $\Delta = [I_n[\mathbf{i}_1,]^\top,\dots,I_n[\mathbf{i}_q,]^\top]^\top$. An extreme case would be to consider all possible non-empty subsets of $\{1,\dots,n\}$, leading to $q = 2^n - 1$ and $n2^{n-1}$ rows for $\Delta$.

Remark 3.

In the extreme case where $q = n$ and the $\mathbf{i}_j$'s are set to $(j)$ ($1 \le j \le n$), one recovers on the other hand fast leave-one-out cross-validation formulas, and we obtain as a by-product the covariance matrix of leave-one-out errors
(28) $\mathrm{cov}(E_{(1),\dots,(n)}) = \mathrm{diag}(P_{11}^{-1},\dots,P_{nn}^{-1})\, P\, \mathrm{diag}(P_{11}^{-1},\dots,P_{nn}^{-1})$.

3.2 The Simple Kriging Case

We now come back to GP prediction settings and first focus on simple kriging, following the observation model $Z_i = \xi(x_i) + \varepsilon_i$ ($x_i \in D$, $i \in \{1,\dots,n\}$) from the previous section, where we recall that $\xi(x) = \mu(x) + \eta(x)$ ($x \in D$), with $\eta$ a centered GP with covariance kernel $k$, $\mu$ a trend function, and $\varepsilon \sim \mathcal{N}(0,\Sigma_\varepsilon)$ independently of $\eta$. $\mu$ is first assumed null, an assumption that we will relax later on. As in the exposition of the previous section, we denote by $Z$ the observation vector and, for any $\mathbf{i} \in \mathcal{S}$, $E_{\mathbf{i}} = Z[\mathbf{i}] - \hat Z_{-\mathbf{i}}[\mathbf{i}]$, where $\hat Z_{-\mathbf{i}}[\mathbf{i}]$ is the BLUP of $Z[\mathbf{i}]$ based on $Z[-\mathbf{i}]$. We assume throughout that $\Sigma = K + \Sigma_\varepsilon$ is invertible, with $K = (k(x_i,x_j))_{i,j \in \{1,\dots,n\}}$.

Corollary 1.

The $E_{\mathbf{i}}$ ($\mathbf{i} \in \mathcal{S}$) are jointly Gaussian, centered, and
(29) $\mathrm{cov}(E_{\mathbf{i}}, E_{\mathbf{j}}) = Q[\mathbf{i}]^{-1}\, Q[\mathbf{i},\mathbf{j}]\, Q[\mathbf{j}]^{-1} \quad (\mathbf{i},\mathbf{j} \in \mathcal{S})$,
where $Q = \Sigma^{-1}$. For the case of folds $\mathbf{i}_1,\dots,\mathbf{i}_q$ whose concatenation gives $(1,\dots,n)$, $E_{\mathbf{i}_1,\dots,\mathbf{i}_q} := [E_{\mathbf{i}_1}^\top,\dots,E_{\mathbf{i}_q}^\top]^\top$ has a centered $n$-dimensional Gaussian distribution with covariance matrix
(30) $\mathrm{cov}(E_{\mathbf{i}_1,\dots,\mathbf{i}_q}) = B^{-1} Q B^{-1}$,
where $B = \mathrm{blockdiag}(Q[\mathbf{i}_1],\dots,Q[\mathbf{i}_q])$.

Proof.

Directly follows from Proposition 1 using $E_{\mathbf{i}} = e_{\mathbf{i}}(Z; 0, \Sigma)$ with $\Sigma = K + \Sigma_\varepsilon$. □

Remark 4.

Extending these results to the case of simple kriging with a known trend function $\mu$ directly follows from $E_{\mathbf{i}} = e_{\mathbf{i}}(Z; \mu, \Sigma)$ with $\mu = (\mu(x_i))_{1 \le i \le n}$.

Remark 5.

In the case of Remark 3, we then obtain $\mathrm{cov}(E_{(i)}, E_{(j)}) = Q_{ii}^{-1} Q_{ij} Q_{jj}^{-1}$, whereby $\mathrm{corr}(E_{(i)}, E_{(j)}) = Q_{ii}^{-1/2} Q_{ij} Q_{jj}^{-1/2}$, leading in particular to a correlation matrix of LOO residuals
(31) $\mathrm{corr}(E_{(1),\dots,(n)}) = \mathrm{diag}(Q_{11}^{-1/2},\dots,Q_{nn}^{-1/2})\, Q\, \mathrm{diag}(Q_{11}^{-1/2},\dots,Q_{nn}^{-1/2}) = B^{-1/2} Q B^{-1/2}$.
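Equation (31) can be sketched as follows in numpy; the Matérn 5/2 kernel and its length-scale below are illustrative assumptions, not the fitted model behind the article's figures:

```python
import numpy as np

def matern52(a, b, ell=0.2):
    # Matern 5/2 kernel with unit variance; ell is an illustrative length-scale
    d = np.abs(a[:, None] - b[None, :]) / ell
    s = np.sqrt(5.0) * d
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

x = np.linspace(0.0, 1.0, 10)                    # regular 10-point design
Q = np.linalg.inv(matern52(x, x) + 1e-10 * np.eye(10))  # precision matrix

# eq. (31): corr_{ij} = Q_ij / sqrt(Q_ii Q_jj)
d = 1.0 / np.sqrt(np.diag(Q))
corr = d[:, None] * Q * d[None, :]

print(np.round(corr[0, :4], 3))  # correlations of the first LOO residual
```

The resulting matrix is a genuine correlation matrix (unit diagonal, entries in [-1, 1]) and is the quantity that a corrplot-style display visualizes.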

As a by-product of Proposition 1 (in the limiting case of LOO, see Remark 3), we represent in Figure 2 the correlations between the LOO prediction residual at the leftmost location and the LOO residuals at the other locations, in the settings of the opening example from Figure 1.

Fig. 2 Correlations between LOO residuals at design points in the example displayed in Figure 1. This figure was realized with the R package corrplot (Wei and Simko Citation2021).


A first interesting point that we would like to stress here is that the correlation between LOO prediction residuals at two successive locations is negative, with a value below -0.5 (let us note that in Figure 2, the radii of the disks represent the magnitudes of the correlations, while their signs, respectively positive and negative, are represented via the black and white colors). An illustration of this effect can be seen on the upper panel of Figure 1; see in particular the fourth and fifth locations. Coming back to Figure 2, after the second location the correlation goes back to positive but with a smaller magnitude, and subsequently appears to continue with a damped oscillation until settling around zero. As we will see in Section 4, accounting for these correlations is instrumental in producing suitable Q-Q plots from cross-validation residuals. We extend the analytical example in Section 3.3, with a focus on multiple-fold cross-validation.

3.3 Extension to Universal Kriging Case and More

Let us now formulate a corollary to Proposition 1 for the case of universal kriging such as presented throughout Section 2 in general and transductive settings.

Corollary 2.

For any $\mathbf{i} \in \mathcal{S}$ such that $F[-\mathbf{i},]$ is of full column rank, and using the notation $\hat Z_{-\mathbf{i}}[\mathbf{i}]$ to denote the BLUP of $Z[\mathbf{i}]$ based on $Z[-\mathbf{i}]$ in universal kriging settings, the residual $E_{\mathbf{i}} = Z[\mathbf{i}] - \hat Z_{-\mathbf{i}}[\mathbf{i}]$ writes
(32) $E_{\mathbf{i}} = (M^{-1}[\mathbf{i}])^{-1}(M^{-1}[1{:}n]\,Z)[\mathbf{i}] = (\tilde Q[\mathbf{i}])^{-1}(\tilde Q Z)[\mathbf{i}]$,
where $M$ is defined in (21), and $\tilde Q = Q - QF(F^\top Q F)^{-1}F^\top Q$ with $Q = \Sigma^{-1}$. Consequently, for any $q > 1$ and $\mathbf{i}_1,\dots,\mathbf{i}_q \in \mathcal{S}$ such that the $F[-\mathbf{i}_j,]$'s are of full column rank, the $E_{\mathbf{i}_j}$ ($1 \le j \le q$) are jointly Gaussian, centered, and with covariance structure given by
(33) $\mathrm{cov}(E_{\mathbf{i}}, E_{\mathbf{j}}) = \tilde Q[\mathbf{i}]^{-1}\, \tilde Q[\mathbf{i},\mathbf{j}]\, \tilde Q[\mathbf{j}]^{-1} \quad (\mathbf{i},\mathbf{j} \in \{\mathbf{i}_1,\dots,\mathbf{i}_q\})$.

In particular, for the case of index vectors $\mathbf{i}_1,\dots,\mathbf{i}_q$ forming a partition of $\{1,\dots,n\}$ and such that the concatenation of $\mathbf{i}_1,\dots,\mathbf{i}_q$ gives $(1,\dots,n)$, the concatenated vector of cross-validation residuals $E_{\mathbf{i}_1,\dots,\mathbf{i}_q} := [E_{\mathbf{i}_1}^\top,\dots,E_{\mathbf{i}_q}^\top]^\top$ has an $n$-dimensional centered Gaussian distribution with covariance matrix
(34) $\mathrm{cov}(E_{\mathbf{i}_1,\dots,\mathbf{i}_q}) = \tilde B^{-1}\, \tilde Q\, \tilde B^{-1}$,
where $\tilde B = \mathrm{blockdiag}(\tilde Q[\mathbf{i}_1],\dots,\tilde Q[\mathbf{i}_q])$.

Proof.

A direct way to prove this property is to first generalize (23) to the case of generic folds $\mathbf{i} \in \mathcal{S}$ such that $F[-\mathbf{i},]$ is of full column rank, which establishes (32), and then to follow the steps of the proof of Proposition 1 regarding the covariance structure and the joint Gaussian distribution of cross-validation residuals. Another proof taking a Bayesian route is sketched below. □

Remark 6.

As we have recalled in Section 2, it is known that universal kriging equations can be retrieved in a Bayesian setting by endowing $\beta$ with a Gaussian prior distribution $\mathcal{N}(0,\Sigma_\beta)$ with invertible $\Sigma_\beta$ and letting $\Sigma_\beta^{-1}$ tend toward $0 \in \mathbb{R}^{p\times p}$. Under this model (at fixed invertible $\Sigma_\beta$), the random vector $Z$ would be centered Gaussian with covariance matrix $\Sigma + F\Sigma_\beta F^\top$. Then one may use Proposition 1 with $m = 0$ and $P_{\Sigma_\beta} = (\Sigma + F\Sigma_\beta F^\top)^{-1}$ to find that the cross-validation errors $Z[\mathbf{i}] - \hat Z^{\Sigma_\beta}_{-\mathbf{i}}[\mathbf{i}]$ are jointly Gaussian with covariance structure $\mathrm{cov}(Z[\mathbf{i}] - \hat Z^{\Sigma_\beta}_{-\mathbf{i}}[\mathbf{i}],\, Z[\mathbf{j}] - \hat Z^{\Sigma_\beta}_{-\mathbf{j}}[\mathbf{j}]) = P_{\Sigma_\beta}[\mathbf{i}]^{-1}\, P_{\Sigma_\beta}[\mathbf{i},\mathbf{j}]\, P_{\Sigma_\beta}[\mathbf{j}]^{-1}$, where $\hat Z^{\Sigma_\beta}_{-\mathbf{i}}[\mathbf{i}]$ denotes the BLUP (conditional expectation) of $Z[\mathbf{i}]$ based on $Z[-\mathbf{i}]$ under the prior distribution $\mathcal{N}(0,\Sigma_\beta)$ on $\beta$. Now by block inversion and the Woodbury formula, we get $P_{\Sigma_\beta} = \begin{pmatrix} \Sigma & F \\ F^\top & -\Sigma_\beta^{-1} \end{pmatrix}^{-1}[1{:}n] = Q - QF(\Sigma_\beta^{-1} + F^\top Q F)^{-1}F^\top Q$.

From there we can conclude since, for $\Sigma_\beta^{-1} \to 0$, $P_{\Sigma_\beta} \to \tilde Q$.
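The Woodbury expression for $P_{\Sigma_\beta}$ and its limit toward $\tilde Q$ lend themselves to a quick numerical check; all matrices below are randomly generated, illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 7, 2
A0 = rng.standard_normal((n, n))
Sigma = A0 @ A0.T + n * np.eye(n)   # stand-in for Sigma = K + Sigma_eps
F = rng.standard_normal((n, p))
Sigma_beta = 0.5 * np.eye(p)        # a fixed, invertible prior covariance

Q = np.linalg.inv(Sigma)

# direct inverse of the marginal covariance of Z under the Gaussian prior on beta
P_direct = np.linalg.inv(Sigma + F @ Sigma_beta @ F.T)
# Woodbury form: P = Q - Q F (Sigma_beta^{-1} + F^T Q F)^{-1} F^T Q
P_wood = Q - Q @ F @ np.linalg.solve(np.linalg.inv(Sigma_beta) + F.T @ Q @ F, F.T @ Q)
print(np.max(np.abs(P_direct - P_wood)))  # ≈ 0 up to numerical error

# as Sigma_beta^{-1} -> 0, P approaches Q_tilde = Q - Q F (F^T Q F)^{-1} F^T Q
Qt = Q - Q @ F @ np.linalg.solve(F.T @ Q @ F, F.T @ Q)
P_lim = Q - Q @ F @ np.linalg.solve(1e-8 * np.eye(p) + F.T @ Q @ F, F.T @ Q)
print(np.max(np.abs(P_lim - Qt)))         # small
```

Working with the Woodbury form keeps the vague-prior limit numerically stable, since only the small $p \times p$ inner matrix is perturbed.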

Remark 7.

The latter corollary enables retrieving the fast formulas for multiple-fold cross-validation residuals of linear regression models of Zhang (Citation1993), as well as the associated covariances. In fact, putting $K = 0$ and $\Sigma_\varepsilon = \tau^2 I_n$ (where $\tau > 0$) delivers $\tau^2 \tilde Q = I_n - H$ with $H = F(F^\top F)^{-1}F^\top$, whereby $E_{\mathbf{i}} = (I_n - H)[\mathbf{i}]^{-1}\,((I_n - H)Z)[\mathbf{i}]$ and $\mathrm{cov}(E_{\mathbf{i}}, E_{\mathbf{j}}) = \tau^2 (I_n - H)[\mathbf{i}]^{-1}\,(I_n - H)[\mathbf{i},\mathbf{j}]\,(I_n - H)[\mathbf{j}]^{-1}$ ($\mathbf{i},\mathbf{j} \in \mathcal{S}$).

Similarly, in ridge regression, we obtain by putting $\Sigma_{\varepsilon} = \tau^2 I_n$ and $K = \lambda I_n$,
$$E_{\mathbf{i},\lambda} = ((I_n - H_{\lambda})[\mathbf{i}])^{-1}\,(Z - H_{\lambda}Z)[\mathbf{i}], \tag{35}$$
where $H_{\lambda} = F(F^{\top}F + \lambda I_p)^{-1}F^{\top}$. Consequently, for any $q > 1$ and $\mathbf{i}_1,\dots,\mathbf{i}_q \in S$, the $E_{\mathbf{i}_j,\lambda}$ $(1 \le j \le q)$ are jointly Gaussian, centered, and with covariance structure given by
$$\operatorname{cov}(E_{\mathbf{i},\lambda}, E_{\mathbf{j},\lambda}) = \tau^2\,((I_n - H_{\lambda})[\mathbf{i}])^{-1}\,(I_n - H_{\lambda})[\mathbf{i},\mathbf{j}]\,((I_n - H_{\lambda})[\mathbf{j}])^{-1} \quad (\mathbf{i},\mathbf{j} \in S). \tag{36}$$
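To illustrate Remark 7, here is a small check (with a hypothetical random design matrix and an arbitrary penalty value, both illustrative) that the hat-matrix-based fold residuals of (35) match those obtained by naively refitting the ridge estimator without each fold:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
F = rng.standard_normal((n, p))                 # hypothetical design matrix
Z = F @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(n)

lam = 0.7                                       # illustrative ridge penalty
H = F @ np.linalg.solve(F.T @ F + lam * np.eye(p), F.T)   # ridge hat matrix
M = np.eye(n) - H

def fast_cv_residual(i):
    # (35): E_i = ((I - H)[i])^{-1} ((I - H) Z)[i]
    return np.linalg.solve(M[np.ix_(i, i)], (M @ Z)[i])

def naive_cv_residual(i):
    # refit ridge without fold i, then predict the held-out responses
    m = np.setdiff1d(np.arange(n), i)
    bh = np.linalg.solve(F[m].T @ F[m] + lam * np.eye(p), F[m].T @ Z[m])
    return Z[i] - F[i] @ bh

for i in (np.array([0, 1, 2, 3]), np.array([4, 9, 14, 19])):
    assert np.allclose(fast_cv_residual(i), naive_cv_residual(i))
```

Setting `lam = 0.0` recovers the ordinary least squares case discussed just above.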

Many other models could enjoy similar results, encompassing kernel ridge regression and least squares support vector machines; see An, Liu, and Venkatesh (2007), where similar fast computations have been obtained for these model classes, yet without investigating the covariances and joint distribution of CV residuals. In the following, however, we stick to GP as it is our main model of interest, the principles being readily applicable to other predictors falling under the settings of Proposition 1.

4 Some Consequences in GP Model Fitting

4.1 Graphical Diagnostics and Pivotal Statistics

We now wish to further explore the potential of the established formulas for diagnosing the quality of GP models. For simplicity, we will in what follows either remain generic or stick to simple kriging settings. A first immediate consequence of Proposition 1 and Corollary 2 for model fitting is that, under the hypothesis that observations are generated from the assumed model, cross-validation residuals can be transformed via appropriate matrix multiplications into a random vector with standard multivariate normal distribution. Assuming for simplicity that $\operatorname{cov}(E_{\mathbf{i}_1,\dots,\mathbf{i}_q})$ has full rank and denoting $t = \sum_{j=1}^{q} \#\mathbf{i}_j$, it follows that
$$\operatorname{cov}(E_{\mathbf{i}_1,\dots,\mathbf{i}_q})^{-1/2}\, E_{\mathbf{i}_1,\dots,\mathbf{i}_q} \sim \mathcal{N}(0, I_t).$$

In fact, any matrix $S \in \mathbb{R}^{t \times t}$ such that $S B^{-1}\Sigma^{-1}B^{-1}S^{\top} = I_n$ (remaining for simplicity under the final assumptions of Corollary 1, where $\Delta = I_n$) does the job. More specifically, one gets with $S = \Sigma^{1/2}B$ that
$$S\,E_{\mathbf{i}_1,\dots,\mathbf{i}_q} = \Sigma^{-1/2}Z \sim \mathcal{N}(0, I_n),$$
and hence the hypothesis that the model is correct can be questioned using any standard means relying on such a pivotal quantity with multivariate Gaussian distribution (e.g., by means of a $\chi^2$ test statistic or graphical diagnostics such as Q-Q plots). Similar considerations apply to the UK settings of Corollary 2.

Remark 8.

If $\Delta$ is not $I_n$ but still an invertible matrix, then the above can be adapted with $S = \Sigma^{1/2}\Delta^{-1}B$. In case $\Delta$ is not invertible, analogous approaches can be employed relying on the Moore-Penrose pseudo-inverse of $B^{-1}\Delta\Sigma^{-1}\Delta^{\top}B^{-1}$.

Coming back to our first one-dimensional example, Figure 3 represents Q-Q plots of the distribution of standardized LOO residuals (left panel) versus "transformed" LOO residuals (right panel). One may notice in particular, going from standardized to transformed LOO residuals, how the second point (corresponding to the second smallest theoretical quantile) switches from below the diagonal to above it.

Fig. 3 On the effect of accounting and correcting for correlation in Q-Q plots based on LOO residuals. Left panel: Q-Q plot against N(0,1) of LOO residuals merely divided by corresponding LOO standard deviations. Right panel: Q-Q plot against N(0,1) of duly transformed LOO residuals.


We now turn to the topic of covariance parameter estimation relying on cross-validation residuals. As we will see below, incorporating covariances between cross-validation residuals within such approaches leads to interesting developments.

4.2 Some Implications in Scale Parameter Estimation

Assuming a kernel of the form $k(x, x') = \sigma^2 r(x, x')$ with unknown $\sigma^2$, and $\Sigma = K$ (noiseless observations being, for instance, commonly encountered in the realm of computer experiments), the problem of estimating $\sigma^2$ by maximum likelihood versus cross-validation has been considered in earlier works, leading notably to results concerning the superiority of MLE in the well-specified case, but also to situations where CV turned out to perform better than MLE in the misspecified case (Bachoc 2013). Denoting by $R$ the correlation matrix associated with the observation points $x_1,\dots,x_n$, assumed full-rank, it is well known (see, for instance, Santner, Williams, and Notz 2003) that the MLE of $\sigma^2$ can be written in closed form as
$$\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n} Z^{\top} R^{-1} Z. \tag{37}$$

In contrast, the leave-one-out-based estimator of $\sigma^2$ investigated in Bachoc (2013) reads
$$\hat{\sigma}^2_{\mathrm{LOO}} = \frac{1}{n} Z^{\top} R^{-1} \left(\operatorname{diag}(R^{-1})\right)^{-1} R^{-1} Z, \tag{38}$$
and originates from the idea, traced back by Bachoc (2013) to Cressie (1993), that the criterion (with notation $C_{\mathrm{LOO}}$ inherited from Bachoc 2013) defined by
$$C^{(1)}_{\mathrm{LOO}}(\sigma^2) = \frac{1}{n} \sum_{i=1}^{n} \frac{E_i^2}{\sigma^2 c_i^2} \tag{39}$$
should take a value close to one, where $c_i^2 = s_i^2/\sigma^2$ does not depend on $\sigma^2$, leading to $\hat{\sigma}^2_{\mathrm{LOO}} = \frac{1}{n}\sum_{i=1}^{n} E_i^2/c_i^2$ and ultimately to (38). This estimator turns out to be unbiased, as $\hat{\sigma}^2_{\mathrm{ML}}$, yet with greater variance (see Bachoc 2013):
$$\operatorname{var}(\hat{\sigma}^2_{\mathrm{LOO}}) = \frac{2\sigma^4}{n^2} \operatorname{tr}\!\left( \left(R^{-1}(\operatorname{diag}(R^{-1}))^{-1}\right)^2 \right) \ge \frac{2\sigma^4}{n} = \operatorname{var}(\hat{\sigma}^2_{\mathrm{ML}}). \tag{40}$$

In light of the former considerations on the joint distribution of cross-validation residuals, it appears natural to revise (39) in order to correct for covariances between LOO residuals, resulting in
$$\widetilde{C}^{(1)}_{\mathrm{LOO}}(\sigma^2) = \frac{1}{n} E^{\top} (B^{-1} Q B^{-1})^{-1} E = \frac{1}{n\sigma^2} E^{\top} \operatorname{diag}(R^{-1})\, R\, \operatorname{diag}(R^{-1})\, E = \frac{1}{n\sigma^2} Z^{\top} R^{-1} Z, \tag{41}$$
so that setting this modified criterion to 1 would plainly result in $\hat{\sigma}^2_{\widetilde{\mathrm{LOO}}} = \frac{1}{n} Z^{\top} R^{-1} Z = \hat{\sigma}^2_{\mathrm{ML}}$. Hence, the advantages found for $\hat{\sigma}^2_{\mathrm{LOO}}$ in Bachoc (2013) in the case of model misspecification come at the price of neglecting covariances, and hence possible redundancies, between leave-one-out residuals. Yet, attempting to address this issue by naturally accounting for those covariances within the criterion leads to the MLE, known to enjoy optimality properties in the well-specified case of the considered settings but to be potentially suboptimal otherwise. We will see next that this is not the only instance where modifying LOO in order to account for residual covariances leads back to MLE.

4.3 On the Estimation of Further Covariance Parameters

Cross-validation has also been used to estimate covariance parameters beyond the scale parameter $\sigma^2$. We will now denote by $\theta$ those additional parameters (and by $\psi = (\sigma^2, \theta)$ the concatenation of all covariance parameters) and review some approaches that have been used to estimate them via cross-validation. An important point, compared to the last section on $\sigma^2$'s estimation, is that kriging predictors and CV residuals generally depend on $\theta$, which we will now stress by writing $E(\theta)$. Still in noiseless settings, the approach followed by Santner, Williams, and Notz (2003) and Bachoc (2013) to estimate $\theta$ based on leave-one-out cross-validation residuals is to minimize
$$C^{(2)}_{\mathrm{LOO}}(\theta) = \sum_{i=1}^{n} E_i(\theta)^2. \tag{42}$$

Following Bachoc (2013), the latter criterion can in turn be expressed in closed form as
$$C^{(2)}_{\mathrm{LOO}}(\theta) = Z^{\top} R_{\theta}^{-1} \operatorname{diag}(R_{\theta}^{-1})^{-2} R_{\theta}^{-1} Z, \tag{43}$$
where $R_{\theta}$ stands for $Z$'s correlation matrix under correlation parameter $\theta$. Based on our results, a natural extension of this criterion to multiple-fold cross-validation can be similarly obtained. Considering $q$-fold settings as in the final part of Corollary 1, $C^{(2)}_{\mathrm{LOO}}$ becomes indeed
$$C^{(2)}_{\mathrm{CV}}(\theta) = \sum_{j=1}^{q} \left\| E_{\mathbf{i}_j}(\theta) \right\|^2 = \left\| E_{\mathbf{i}_1,\dots,\mathbf{i}_q}(\theta) \right\|^2. \tag{44}$$

Then, building upon our main theorem, we obtain that
$$C^{(2)}_{\mathrm{CV}}(\theta) = Z^{\top} R_{\theta}^{-1} \operatorname{blockdiag}\!\left( (R_{\theta}^{-1}[\mathbf{i}_1])^{-2}, \dots, (R_{\theta}^{-1}[\mathbf{i}_q])^{-2} \right) R_{\theta}^{-1} Z, \tag{45}$$
generalizing indeed (43). Note that applying a sphering transformation to the CV residuals prior to taking norms in (42) and (44) would lead to a criterion that boils down to $\theta \mapsto Z^{\top} R_{\theta}^{-1} Z$, and hence appears as one of the two building blocks of the log-likelihood criterion.
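As a check on (45) (again with an assumed exponential correlation and equal-size folds, both illustrative choices), the fold-wise sum of squared fast residual norms can be compared with the single closed-form quadratic form:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 12
x = np.sort(rng.uniform(0.0, 1.0, n))
Rt = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.4) + 1e-9 * np.eye(n)
Z = np.linalg.cholesky(Rt) @ rng.standard_normal(n)
Ri = np.linalg.inv(Rt)

folds = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

# (44): sum of squared fold residual norms, residuals via the fast formula
c_direct = sum(
    float(np.sum(np.linalg.solve(Ri[np.ix_(i, i)], (Ri @ Z)[i]) ** 2))
    for i in folds
)

# (45): single quadratic form with block-diagonal squared inverse blocks
M = np.zeros((n, n))
for i in folds:
    Bi = np.linalg.inv(Ri[np.ix_(i, i)])
    M[np.ix_(i, i)] = Bi @ Bi
c_closed = float(Z @ Ri @ M @ Ri @ Z)

assert np.allclose(c_direct, c_closed)
```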

On a different note, and relaxing now the assumption of noiseless observations, (42) is only one among several possible ways to construct a criterion by applying loss functions to leave-one-out residuals, as exposed in Rasmussen and Williams (2006). While (42) corresponds to the squared error loss, using instead the "negative log validation density loss" (after the terminology of Rasmussen and Williams 2006, sec. 5.4.2) leads to the negative of the "LOO log predictive probability," also sometimes called pseudo-likelihood:
$$C^{(3)}_{\mathrm{LOO}}(\psi) = -\sum_{j=1}^{n} \log\left( p_{Z[j]\,|\,Z[-j]}(z_j \,|\, z[-j]; \psi) \right), \tag{46}$$
where $p_{Z[j]\,|\,Z[-j]}(z_j \,|\, z[-j]; \psi)$ denotes the conditional density of $Z[j]$ at the value $z_j$ knowing that $Z[-j] = z[-j]$. Further scoring rules could be used as a basis to define further criteria relying on cross-validation residuals, as illustrated, for instance, with the CRPS score in Petit et al. (2020) and Petit (2022).

Let us now focus on an extension of $C^{(3)}_{\mathrm{LOO}}(\psi)$ to multiple-fold cross-validation, and on the exploration of some links between the resulting class of criteria and the log-likelihood. First, $C^{(3)}_{\mathrm{LOO}}(\psi)$ is straightforwardly adapted into
$$C^{(3)}_{\mathrm{CV}}(\psi) = -\sum_{j=1}^{q} \log\left( p_{Z[\mathbf{i}_j]\,|\,Z[-\mathbf{i}_j]}(z[\mathbf{i}_j] \,|\, z[-\mathbf{i}_j]; \psi) \right). \tag{47}$$

Reformulating the sum of log terms as the log of a product, we see that
$$C^{(3)}_{\mathrm{CV}}(\psi) = -\log\left( \prod_{j=1}^{q} p_{Z[\mathbf{i}_j]\,|\,Z[-\mathbf{i}_j]}(z[\mathbf{i}_j] \,|\, z[-\mathbf{i}_j]; \psi) \right) = -\log\left( \prod_{j=1}^{q} p_{E_{\mathbf{i}_j}}(e_{\mathbf{i}_j}; \psi) \right), \tag{48}$$
so that this approach indeed amounts, in some sense, to improperly assuming independence between the cross-validation residuals $E_{\mathbf{i}_j}$ $(1 \le j \le q)$.

We now examine in more detail (still in the settings of the final part of Corollary 1) when such cross-validation residuals are independent, and show that in such a case $C^{(3)}_{\mathrm{CV}}(\psi)$ coincides with the negative log-likelihood criterion based on $Z$.

Corollary 3.

The following propositions are equivalent:

  1. The cross-validation residuals $E_{\mathbf{i}_j}$ $(1 \le j \le q)$ are mutually independent,

  2. B=Q,

  3. the subvectors $Z[\mathbf{i}_j]$ $(1 \le j \le q)$ are mutually independent.

If they hold, then $C^{(3)}_{\mathrm{CV}}(\psi) = -\log(p_Z(z; \psi))$ for all $\psi \in \Psi$.

Proof.

As we know from Corollary 1 that the cross-validation residuals $E_{\mathbf{i}_j}$ $(1 \le j \le q)$ form a Gaussian vector, they are mutually independent if and only if their cross-covariance matrices are null. Yet we also know, by (34) from the same theorem, that $\operatorname{cov}(E_{\mathbf{i}}, E_{\mathbf{j}}) = (Q[\mathbf{i}])^{-1} Q[\mathbf{i},\mathbf{j}] (Q[\mathbf{j}])^{-1}$ $(\mathbf{i},\mathbf{j} \in \{\mathbf{i}_1,\dots,\mathbf{i}_q\})$. It follows that a) is equivalent to $Q[\mathbf{i}_j, \mathbf{i}_{j'}] = 0$ for $j \ne j'$ $(j, j' \in \{1,\dots,q\})$, which coincides in turn with b). Looking finally at b) from the perspective of $\Sigma = Q^{-1}$ being block-diagonal, we obtain similarly that b) is equivalent to c). If these propositions hold, then $E_{\mathbf{i}_j} = Z[\mathbf{i}_j]$ $(1 \le j \le q)$ from (25), and thus, for $\psi \in \Psi$:
$$C^{(3)}_{\mathrm{CV}}(\psi) = -\log\left( \prod_{j=1}^{q} p_{E_{\mathbf{i}_j}}(e_{\mathbf{i}_j}; \psi) \right) = -\log\left( \prod_{j=1}^{q} p_{Z[\mathbf{i}_j]}(z[\mathbf{i}_j]; \psi) \right) = -\log(p_Z(z; \psi)).$$

Also, one could have derived it from $p_{Z[\mathbf{i}_j]\,|\,Z[-\mathbf{i}_j]}(z[\mathbf{i}_j] \,|\, z[-\mathbf{i}_j]; \psi) = p_{Z[\mathbf{i}_j]}(z[\mathbf{i}_j]; \psi)$. □
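A small numerical illustration of Corollary 3 (with hypothetical $2 \times 2$ covariance blocks): when $\Sigma$ is block-diagonal with respect to the folds, fold residuals reduce to the observations themselves, and the CV criterion equals the negative log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(6)
# hypothetical block-diagonal covariance: fold subvectors are independent
blocks = [np.array([[2.0, 0.6], [0.6, 1.0]]), np.array([[1.5, -0.4], [-0.4, 0.8]])]
Sigma = np.zeros((4, 4))
Sigma[:2, :2], Sigma[2:, 2:] = blocks
Z = np.linalg.cholesky(Sigma) @ rng.standard_normal(4)
Q = np.linalg.inv(Sigma)
folds = [np.arange(0, 2), np.arange(2, 4)]

def logpdf(v, C):
    # centered multivariate normal log-density
    k = len(v)
    _, ld = np.linalg.slogdet(C)
    return -0.5 * (k * np.log(2 * np.pi) + ld + v @ np.linalg.solve(C, v))

# with independent blocks, fold residuals reduce to the observations themselves
for i in folds:
    E_i = np.linalg.solve(Q[np.ix_(i, i)], (Q @ Z)[i])
    assert np.allclose(E_i, Z[i])

# ...and the CV pseudo-likelihood criterion equals the negative log-likelihood
c_cv = -sum(logpdf(Z[i], Sigma[np.ix_(i, i)]) for i in folds)
assert np.allclose(c_cv, -logpdf(Z, Sigma))
```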

5 Computational Aspects

5.1 Accuracy and Speed-Up Checks for Fast Formulas

We have seen in previous sections that block matrix inversion enabled us to reformulate the equations of (multiple-fold) cross-validation residuals and associated covariances in compact form, from simple kriging to universal kriging settings. Here and in the supplementary material, we briefly illustrate that our Schur-complement-based formulae do indeed reproduce results obtained by a straightforward ("naive") computational approach, also keeping an eye on computational speed-ups and by-products. We focus below on how efficiency gains evolve depending on the number of folds. The next section examines in more detail the respective computational complexities of the two approaches. The settings are analogous to those presented in Section A.2. Accuracy and speed-up checks for LOO are presented in Section A.4 (also in supplementary material).

In the following experimental results and theoretical considerations, we focus on the particular case of homogeneous fold sizes and denote by $r \ge 1$ their common number of elements, so that $n = r \times q$. For convenience, we consider in experiments a design of size $n = 2^{10} = 1024$, and let the number of folds decrease from $q = 1024$ to $q = 2$ by successive halvings. This amounts to folds of sizes $r$ ranging from 1 to 512, correspondingly. For each value of $r$, 50 random replications of the folds are considered; this is done by permuting the indices at random prior to arranging them in regularly constructed folds of $r$ successive elements. The resulting speed-up distributions are represented in Figure 5, as a function of $r$. With speed-up medians ranging from 1027.82 down to 1.50 (means from 1429.63 to 1.28, all truncated after the second digit), we hence observed a speed-up whatever the value of $r$, yet, as could be expected, with a substantial decrease as $r$ increases compared to the LOO situation.

Fig. 4 Relative errors on CV means and covariances (between naive and fast methods, with the naive method as reference) measured in similar settings as on Figure 3.


Additionally, relative errors on cross-validation residuals and associated variances calculated using the fast versus the naive approach are represented in Figure 4, also as a function of $r$. It is remarkable that, while the relative error on the calculation of residuals increases moderately with $r$, from a median of around $3.5 \times 10^{-14}$ to values nearing $4 \times 10^{-14}$, the variation is more pronounced in the case of the covariances, with an increase from approximately $2 \times 10^{-11}$ to $1.2 \times 10^{-10}$. Further analysis may shed light on the underlying phenomena, yet one can state that, with a maximum relative error of magnitude $10^{-10}$, the observed differences remain of a negligible extent.

Fig. 5 Speed-up (ratio between times required to run the naive and fast methods) measured for q-fold CV, where q decreases from 1024 to 2 and 50 seeds are used that affect here both model fitting and the folds.


5.2 More on Complexity and Efficiency Gains

We now turn to a more specific examination of how our closed-form formulas affect computational speed compared to a naive implementation of multiple-fold cross-validation. As we will now develop, a first-order reasoning on the computational costs associated with naive versus closed-form multiple-fold cross-validation highlights the existence of two regimes depending on $q$: for large $q$ it is faster to use the closed-form formulas, yet for very small $q$ one had better employ the naive approach. The rationale is as follows. In a naive implementation, the main computational cost lies in inverting $q$ covariance sub-matrices, all belonging to $\mathbb{R}^{(n-r) \times (n-r)}$. Denoting this (approximate) cost by $\kappa_{\mathrm{naive}}$ and assuming a cubic cost for matrix inversion (with a multiplicative constant $\gamma > 0$), we hence obtain
$$\kappa_{\mathrm{naive}} = q \times \gamma (n - r)^3. \tag{49}$$

On the other hand, the closed-form approach requires one inversion of an $n \times n$ matrix and $q$ inversions of $r \times r$ matrices, leading to an approximate cost
$$\kappa_{\mathrm{closed}} = \gamma (n^3 + q r^3). \tag{50}$$

We thus obtain an approximate speed-up of
$$\frac{\kappa_{\mathrm{naive}}}{\kappa_{\mathrm{closed}}} = \frac{(q-1)^3}{q^2 + 1},$$
implying that the closed-form approach would be advantageous when $(q-1)^3 \ge q^2 + 1$. As $q$ is a positive integer, the latter condition can be proved to be equivalent to $q \ge 4$, since $q \mapsto (q-1)^3 - (q^2 + 1)$ possesses a unique real root, lying between 3 and 4, takes negative values for $q \in \{1, 2, 3\}$, and tends to $+\infty$ as $q \to +\infty$. In our implementation, however, the cost of the fast approach turns out to be smaller than $\gamma(n^3 + q r^3)$ since the Cholesky factor is already pre-calculated. Modeling this with a damping factor $\alpha \in (0, 1]$ in front of the $\gamma n^3$ term, we end up with an approximate cost of $\kappa_{\mathrm{closed}} = \gamma(\alpha n^3 + q r^3)$, leading via analogous calculations to an acceleration whenever $(q-1)^3 \ge \alpha q^2 + 1$. Again, the polynomial involved possesses a single real-valued root, whose value shrinks toward 2 as $\alpha$ tends to 0.

In practice, speed-ups are already observed here from $q = 2$, which may be attributed to various reasons pertaining notably to implementation specifics and to arbitrary settings used in the numerical experiments. Also, let us stress again that the reasoning above is to be taken as an approximation, as it does not account for matrix-vector multiplication costs, storage and communication overheads, and further auxiliary operations that could influence actual running times.

On a different note, let us point out that the computational benefits highlighted above in the case of multiple-fold cross-validation with a unique partitioning do carry over, and even further grow, in the case of $N \ge 2$ replications with varying partitions. Sticking for simplicity to the previously considered settings ($\alpha = 1$) but replicating the partitions, we find that a naive approach has a cost driven by
$$\kappa^{\mathrm{rep}}_{\mathrm{naive}} = N q \gamma (n - r)^3, \tag{51}$$
while a similar approach using the fast formulas would have a cost driven by
$$\kappa^{\mathrm{rep}}_{\mathrm{closed}} = \gamma (n^3 + N q r^3). \tag{52}$$

Then we obtain that $\kappa^{\mathrm{rep}}_{\mathrm{closed}} \le \kappa^{\mathrm{rep}}_{\mathrm{naive}}$ as soon as $(q-1)^3 \ge q^2/N + 1$, which occurs already from $q = 3$.
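The break-even points of this first-order cost model can be checked in a few lines (the constant $\gamma$ and the fold size $r$ cancel in the ratios; the damping value below is an arbitrary illustration):

```python
# first-order cost model of Section 5.2, with n = q * r (gamma and r cancel)
def speedup(q, alpha=1.0):
    # kappa_naive / kappa_closed = (q - 1)^3 / (alpha * q^2 + 1)
    return (q - 1) ** 3 / (alpha * q ** 2 + 1)

# the closed form wins from q = 4 onward when alpha = 1
assert min(q for q in range(2, 100) if speedup(q) >= 1) == 4

# with a pre-computed Cholesky factor (small alpha), break-even moves down to q = 3
assert min(q for q in range(2, 100) if speedup(q, alpha=0.05) >= 1) == 3

def speedup_rep(q, N):
    # replicated partitions: N*q*(q-1)^3*r^3 / (q^3*r^3 + N*q*r^3), r^3 cancels
    return N * q * (q - 1) ** 3 / (q ** 3 + N * q)

# with N = 2 replicated partitions, the closed form already wins from q = 3
assert min(q for q in range(2, 100) if speedup_rep(q, 2) >= 1) == 3
```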

6 Application Test Case

We now illustrate and investigate some applications of multiple-fold cross-validation on a test case motivated by a contaminant source localization problem from hydrogeology (Pirot et al. 2019). There, the goal is to minimize an objective function quantifying the discrepancy between given contaminant concentrations at monitoring wells and analogous quantities obtained under varying scenarios regarding the potential source underlying this contamination. Solving this difficult optimization problem in order to localize the source is performed in Pirot et al. (2019) via Bayesian optimization, and Gaussian process models are thus used to surrogate the objective function. An instance of such an objective function is presented in Figure 6. Our focus is on fitting a GP model to this function under a clustered experimental design, and on highlighting notable differences arising here between leave-one-out and multiple-fold cross-validation using the cluster structure.

Fig. 6 Contaminant localization test function designed by summing dissimilarities between given concentrations at monitoring wells and corresponding simulation results when varying the candidate source localization (See Pirot et al. (Citation2019) for more detail). The filled disks stand for the experimental design, consisting of 25 clover-shaped clusters of size 5. The small green circle stands for the actual (prescribed) contaminant source, and the red square indicates the grid point with the smallest misfit.


We wish to illustrate some differences between CV in the LOO versus "leave-one-cluster-out" settings (i.e., 25-fold CV with folds given by the clusters). Since we can evaluate the actual function on a fine grid, we can easily calculate the absolute prediction errors and compare them to absolute cross-validation residuals. The mean absolute error on the considered 2601-element grid is $2.27 \times 10^{-2}$. The mean absolute error of the 125 LOO residuals is $0.42 \times 10^{-2}$. In contrast, the mean absolute error of the 125 multiple-fold CV residuals is $3.49 \times 10^{-2}$. Thus, while absolute multiple-fold CV residuals appear to give a rather over-pessimistic picture of the prediction situation (as CV predictions are made only at points in left-out clusters), analogous LOO residuals give a very over-optimistic one (as CV predictions are made only at left-out points within clusters including four points that are not left out).

Figure 7 represents LOO (upper panel) versus MFCV (middle panel) residual norms as a function of the range parameter, and compares them to the true reconstruction error (lower panel). This example illustrates that the design of folds may crucially affect results. In the present case, it can be observed that, compared to the curve associated with LOO residuals, the curve associated with MFCV residuals is more similar to the actual (out-of-sample) reconstruction error. In contrast, the top curve associated with LOO features a flatter region near the optimum, resulting in turn in an optimal range associated with a larger reconstruction error. Replications of the example with further realizations of the clustered designs delivered quite variable results (overall, neither the considered MFCV nor LOO stood out as evidently preferable to the other). Yet some recurring features emerged, in particular with the curves associated with LOO appearing more prone to decreasing to a plateau, quite similarly as in Figure 7. This calls for more research on the impact of fold design on resulting indicators and estimators.

Fig. 7 Log square norm of LOO residuals (top), of MFCV residuals (center), and of reconstruction error (bottom) as a function of the range hyperparameter.


7 Discussion

Fast leave-one-out formulas for Gaussian process prediction, which have been used for model diagnostics and hyperparameter fitting, do carry over well to multiple-fold cross-validation, as our theoretical and numerical results in both simple and universal kriging settings suggest. The resulting formulas were found to allow substantial speed-ups, yet with the most favorable scenario in the leave-one-out case and decreasing advantages for fewer folds of larger observation subsets. A first-order analysis in terms of the matrix inversion complexities involved corroborated the main trends observed in numerical experiments, although some room remains to analyze actual computation times more precisely depending on implementation specifics. In addition, the established formulas enabled a closed-form calculation of the covariance structure of cross-validation residuals, eventually allowing to correct for an improper assumption of independence silently underlying Q-Q plots based on standardized LOO residuals.

As we also established, these formulas lend themselves well to generalizing existing covariance hyperparameter estimation approaches based on cross-validation residuals. Looking first at scale parameter estimation in the LOO case, we found that modifying a criterion underlying the scale estimation approach used in Santner, Williams, and Notz (2003) and Bachoc (2013), by correcting for covariances between LOO residuals, led back to the maximum likelihood estimator of the scale. On a different note, and with more general covariance hyperparameters in view, optimizing the pseudo-likelihood criterion mentioned in Rasmussen and Williams (2006) was found, as established in Corollary 3, to coincide with MLE in the case of independent cross-validation residuals.

It is interesting to note, as a potential starting point for future work on cross-validation-based parameter estimation, that going beyond the independence assumption of Corollary 3 and accounting for the calculated residual covariances results in fact in a departure from plain maximum likelihood estimation. Still assuming indeed that $\Delta = I_n$, and replacing the product in $C^{(3)}_{\mathrm{CV}}(\psi)$ by the joint density of the cross-validation residuals, we end up with
$$\begin{aligned} \widetilde{C}^{(3)}_{\mathrm{CV}}(\psi) &= -\log\left( p_{(E_{\mathbf{i}_1},\dots,E_{\mathbf{i}_q})}\big((e_{\mathbf{i}_1},\dots,e_{\mathbf{i}_q}); \psi\big) \right) = -\log\left( p_{B^{-1}QZ}(B^{-1}Qz; \psi) \right) \\ &= -\log\left( p_Z(z; \psi)\,\left|\det(B^{-1}Q)\right|^{-1} \right) = -\log(p_Z(z; \psi)) - \log(\det(B)) + \log(\det(Q)), \end{aligned} \tag{53}$$
and we see that optimizing $\widetilde{C}^{(3)}_{\mathrm{CV}}(\psi)$ generally departs indeed from MLE, as $Q$ and $B$ are functions of $\psi$, yet coincides with it when $\det(B) = \det(Q)$. While studying this new criterion further is out of the scope of the present article, we believe that using the full distribution of cross-validation residuals could be helpful in designing novel criteria and procedures for covariance parameter estimation and more. Beyond this, efficient adjoint computation of gradients could be useful to speed up covariance hyperparameter estimation procedures such as those studied here, but also variations thereof relying on further scoring rules (Petit et al. 2020), not only in LOO settings but also potentially in the case of multiple-fold cross-validation.

It is important to stress that, while the presented results do not come with procedures for the design of folds, they might be of interest for creating objective functions guiding their choice. Looking at the covariance matrix between cross-validation residuals might be a relevant ingredient of such procedures. Of course, designing the folds is not completely unrelated to choosing the $x_i$'s, and we expect interesting challenges to arise at this interface. Also, stochastic optimization under random folds is yet another approach of interest, which could be eased thanks to the fast computation of cross-validation residuals. Last but not least, as fast multiple-fold cross-validation carries over well to linear regression, and as GP modeling shares foundations with several other paradigms, it would be desirable to generalize the presented results to broader model classes (see, for instance, Rabinowicz and Rosset (2020b), where mixed models are tackled).

Supplemental material

JCGS-22-145_Suppl.pdf

Download PDF (401.3 KB)

Acknowledgments

Part of DG’s contributions have taken place within the Swiss National Science Foundation project number 178858, and he would like to thank Sam Allen, Athénaïs Gautier, Fabian Guignard, Philip Stange, Cédric Travelletti, Riccardo Turin as well as many attendees of events where this work was presented for stimulating scientific exchanges. Yves Deville should be warmly thanked too for insightful comments on an earlier version of this article. Warm thanks as well to the two referees and the members of the editorial team, whose comments have enabled improving the paper substantially. Calculations were performed on UBELIX, the High Performance Computing (HPC) cluster of the University of Bern. DG would also like to acknowledge support of Idiap Research Institute, his primary affiliation during earlier stages of this work.

Supplementary Materials

In the supplement, we provide extra detail about block inversion via Schur complements, as well as about the basics of universal kriging. Also, we give detail on the opening 1-dimensional example and present additional numerical experiments and results. The latter pertain on the one hand to the effect on cross-validation diagnostics of a change of design (when points are “doubled” by adding close neighbors and thus creating clusters) and, on the other hand, to LOO and MFCV speed-ups and relative errors depending on the size of considered designs.

Disclosure Statement

The authors report there are no competing interests to declare.

References

  • An, S., Liu, W., and Venkatesh, S. (2007), “Fast Cross-validation Algorithms for Least Squares Support Vector Machine and Kernel Ridge Regression,” Pattern Recognition, 40, 2154–2162. DOI: 10.1016/j.patcog.2006.12.015.
  • Arlot, S., and Celisse, A. (2010), “A Survey of Cross-Validation Procedures for Model Selection,” Statistical Surveys, 4, 40–79.
  • Bachoc, F. (2013), “Cross Validation and Maximum Likelihood Estimation of Hyper-Parameters of Gaussian Processes with Model Misspecification,” Computational Statistics and Data Analysis, 66, 55–69. DOI: 10.1016/j.csda.2013.03.016.
  • Bates, S., Hastie, T., and Tibshirani, R. (2023), “Cross-validation: What Does It Estimate and How Well Does It Do It?” Journal of the American Statistical Association. DOI: 10.1080/01621459.2023.2197686.
  • Cressie, N. A. C. (1993), Statistics for Spatial Data, London: Wiley.
  • Currin, C., Mitchell, T., Morris, M., and Ylvisaker, D. (1988), “A Bayesian Approach to the Design and Analysis of Computer Experiments,” Technical Report 6498, ORNL.
  • Dubrule, O. (1983), “Cross Validation of Kriging in a Unique Neighborhood,” Journal of the International Association for Mathematical Geology, 15, 687–699. DOI: 10.1007/BF01033232.
  • Fischer, B. (2016), “Model Selection for Gaussian Process Regression by Approximation Set Coding,” Master’s thesis, ETH Zürich.
  • Golub, G. H., Heath, M., and Wahba, G. (1979), “Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter,” Technometrics, 21, 215–223. DOI: 10.1080/00401706.1979.10489751.
  • Handcock, M. S., and Stein, M. L. (1993), “A Bayesian Analysis of Kriging,” Technometrics, 35, 403–410. DOI: 10.1080/00401706.1993.10485354.
  • Helbert, C., Dupuy, D., and Carraro, L. (2009), “Assessment of Uncertainty in Computer Experiments from Universal to Bayesian Kriging,” Applied Stochastic Models in Business and Industry, 25, 99–113. DOI: 10.1002/asmb.743.
  • Kaminsky, A. L., Wang, Y., Pant, K., Hashii, W. N., and Atachbarian, A. (2021), “An Efficient Batch k-fold Cross-Validation Voronoi Adaptive Sampling Technique for Global Surrogate Modeling,” Journal of Mechanical Design, 143, 011706. DOI: 10.1115/1.4047155.
  • Le Gratiet, L., Cannamela, C., and Iooss, B. (2015), “Cokriging-based Sequential Design Strategies Using Fast Cross-Validation for Multi-Fidelity Computer Codes,” Technometrics, 57, 418–427. DOI: 10.1080/00401706.2014.928233.
  • Magnusson, M., Andersen, M. R., Jonasson, J., and Vehtari, A. (2020), “Leave-One-Out Cross-Validation for Bayesian Model Comparison in Large Data,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS).
  • Martino, L., Laparra, V., and Camps-Valls, G. (2017), “Probabilistic Cross-Validation Estimators for Gaussian Process Regression,” in 2017 25th European Signal Processing Conference (EUSIPCO), pp. 823–827. DOI: 10.23919/EUSIPCO.2017.8081322.
  • Omre, H. (1987), “Bayesian Kriging–Merging Observations and Qualified Guesses in Kriging,” Mathematical Geology, 19, 25–39. DOI: 10.1007/BF01275432.
  • Petit, S. (2022), “Improved Gaussian Process Modeling: Application to Bayesian Optimization,” PhD thesis, Université Paris-Saclay.
  • Petit, S. J., Bect, J., Da Veiga, S., Feliot, P., and Vazquez, E. (2020), “Towards New Cross-Validation-based Estimators for Gaussian Process Regression: Efficient Adjoint Computation of Gradients,” in 52èmes Journées de Statistique (JdS 2020), pp. 633–638. https://arxiv.org/pdf/2002.11543
  • Pirot, G., Krityakierne, T., Ginsbourger, D., and Renard, P. (2019), “Contaminant Source Localization via Bayesian Global Optimization,” Hydrology and Earth System Sciences, 23, 351–369. DOI: 10.5194/hess-23-351-2019.
  • Rabinowicz, A., and Rosset, S. (2020a), “Assessing Prediction Error at Interpolation and Extrapolation Points,” Electronic Journal of Statistics, 14, 272–301. DOI: 10.1214/19-EJS1666.
  • ——— (2020b), “Cross-Validation for Correlated Data,” Journal of the American Statistical Association, 117, 718–731. DOI: 10.1080/01621459.2020.1801451.
  • Rasmussen, C. E., and Williams, C. K. I. (2006), Gaussian Processes for Machine Learning, Cambridge, MA: The MIT Press.
  • Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J. J., Schröder, B., Thuiller, W., Warton, D. I., Wintle, B. A., Hartig, F., and Dormann, C. F. (2017), “Cross-Validation Strategies for Data with Temporal, Spatial, Hierarchical, or Phylogenetic Structure,” Ecography, 40, 913–929. DOI: 10.1111/ecog.02881.
  • Roustant, O., Ginsbourger, D., and Deville, Y. (2012), “DiceKriging, DiceOptim: Two R Packages for the Analysis of Computer Experiments by Kriging-based Metamodelling and Optimization,” Journal of Statistical Software, 51, 1–55. DOI: 10.18637/jss.v051.i01.
  • Santner, T. J., Williams, B. J., and Notz, W. (2003), The Design and Analysis of Computer Experiments, New York: Springer.
  • Wei, T., and Simko, V. (2021), R package ’corrplot’: Visualization of a Correlation Matrix, available at https://github.com/taiyun/corrplot.
  • Zhang, P. (1993), “Model Selection via Multifold Cross Validation,” Annals of Statistics, 21, 299–313.
  • Zhang, Y., and Yang, Y. (2015), “Cross-Validation for Selecting a Model Selection Procedure,” Journal of Econometrics, 187, 95–112. DOI: 10.1016/j.jeconom.2015.02.006.