Abstract
We generalize fast Gaussian process leave-one-out formulas to multiple-fold cross-validation, highlighting in turn the covariance structure of cross-validation residuals in simple and universal kriging frameworks. We illustrate how resulting covariances affect model diagnostics. We further establish in the case of noiseless observations that correcting for covariances between residuals in cross-validation-based estimation of the scale parameter leads back to maximum likelihood estimation. Also, we highlight in broader settings how differences between pseudo-likelihood and likelihood methods boil down to accounting or not for residual covariances. The proposed fast calculation of cross-validation residuals is implemented and benchmarked against a naive implementation, all in R. Numerical experiments highlight the substantial speed-ups that our approach enables. However, as supported by a discussion on main drivers of computational costs and by a numerical benchmark, speed-ups steeply decline as the number of folds (say, all sharing the same size) decreases. An application to a contaminant localization test case illustrates that the way of grouping observations in folds may affect model assessment and parameter fitting compared to leave-one-out. Overall, our results enable fast multiple-fold cross-validation, have consequences in model diagnostics, and pave the way to future work on hyperparameter fitting as well as on goal-oriented fold design. Supplementary materials for this article are available online.
1 Introduction and Notation
Gaussian process (GP) models are at the heart of a number of prominent methods in spatial statistics, machine learning, and beyond. Properly validating such models and fitting underlying hyperparameters thus appear as crucial and impactful endeavors. It is often the case that, for reasons of data scarcity or other, the latter need to be conducted based on a unique dataset. Cross-validation is commonly used in such a context, not only in GP modeling but also as a general approach to assess statistical models and tune parameters in a vast class of prediction algorithms. Fast leave-one-out cross-validation formulas are known for some models and approaches including GP prediction. However, leave-one-out is known to suffer from some pitfalls, and multiple-fold cross-validation is increasingly preferred over it in broader contexts, notably in the realm of machine learning.
We focus here on cross-validation in the context of GP modeling, and more specifically on adapting fast formulas for leave-one-out (LOO) to multiple-fold cross-validation and exploring how these could be efficiently exploited in topics such as model diagnostics and covariance hyperparameter fitting. One of the essential bricks of the proposed approaches turns out to be closed-form formulas for the covariance structure of cross-validation residuals. Our main results are presented both in the simple kriging and universal kriging frameworks, also with links to ridge regression. They contribute to a lively stream of research on cross-validation that pertains to GP modeling, and also to further related classes of statistical models.
Leave-one-out (LOO) and multiple-fold cross-validation in the GP framework were tackled as early as in Dubrule (Citation1983), and some further references as well as a more recent account of their use in machine learning are given in chap. 5 of Rasmussen and Williams (Citation2006). On a different note, cross-validation has led to a number of investigations in the context of spline modeling, for instance in Golub, Heath, and Wahba (Citation1979) where generalized cross-validation was used for the selection of a ridge (regularization) parameter. In the domain of GP modeling for computer experiments, cross-validation was tackled as early as in Currin et al. (Citation1988) (with a pioneering discussion on links to MLE), and leave-one-out was investigated in Bachoc (Citation2013) and found to deliver a valuable alternative to MLE in misspecified cases. A general survey of cross-validation procedures for model selection is presented in Arlot and Celisse (Citation2010). Cross-validation for model selection is further tackled in Zhang and Yang (Citation2015). Le Gratiet, Cannamela, and Iooss (Citation2015) proposed cokriging-based sequential design strategies using fast cross-validation for multi-fidelity computer codes. Fischer (Citation2016) studied model selection for GP regression by Approximation Set Coding. Martino, Laparra, and Camps-Valls (Citation2017) suggested new probabilistic cross-validation estimators for Gaussian process regression. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure were investigated in Roberts et al. (Citation2017). New prediction error estimators were defined in Rabinowicz and Rosset (Citation2020a), where novel model selection criteria in the same spirit as AIC and Mallow's Cp were suggested. LOO cross-validation for Bayesian model comparison in large data was recently tackled in Magnusson et al. (Citation2020). Kaminsky et al. (Citation2021) introduced an efficient batch multiple-fold cross-validation Voronoi adaptive sampling technique for global surrogate modeling. Rabinowicz and Rosset (Citation2020b) introduced a bias-corrected cross-validation estimator for correlated data. Bates, Hastie, and Tibshirani (Citation2023) tackle the question of what one actually does estimate in cross-validation. In contrast with the random design settings considered in the latter as in many theoretical works pertaining to cross-validation, our baseline settings here will assume a fixed design.
LOO cross-validation consists in predicting at each of the observation locations when (temporarily) removing the corresponding observation from the dataset, and then comparing obtained predictions to left-out observations. As a first example, we consider a case of GP prediction for a one-dimensional test function (whose definition is recalled in Section A.2 of the supplementary material) observed at a regular design, here a 10-point subdivision of the input interval, and a GP model assuming a Matérn 5/2 stationary covariance kernel. The test function is represented in black in Figure 1, along with the GP predictor (dashed blue line) and LOO predictions at the design locations (red stars). Modeling is performed with the DiceKriging package of Roustant, Ginsbourger, and Deville (Citation2012) and cross-validation relies on a fast implementation using the cv function newly available in the DiceKriging package (version 1.6.0). This simple example in a misspecified case illustrates how LOO residuals may be more representative of actual prediction errors than the built-in GP prediction standard deviation, which in this case does not depend on the actual observations and is calculated under a questionable stationarity assumption.
Fig. 1 On the upper panel, GP mean predictor (dashed blue line) of the test function (black line) defined by (A.4) based on 10 evaluations at a regular grid, LOO cross-validation predictions (red stars). Lower panel: absolute prediction errors associated with GP (black line) and LOO (red stars) predictions, and GP prediction standard deviation (dashed blue line).
![Fig. 1 On the upper panel, GP mean predictor (dashed blue line) of the test function (black line) defined by (A.4) based on 10 evaluations at a regular grid, LOO cross-validation predictions (red stars). Lower panel: absolute prediction errors associated with GP (black line) and LOO (red stars) predictions, and GP prediction standard deviation (dashed blue line).](/cms/asset/2d4e047d-93dc-46f4-a83d-236b4a3c6770/ucgs_a_2353633_f0001_c.jpg)
It has in fact been established in the seminal paper Dubrule (Citation1983) that the LOO prediction errors and variance could be calculated quite elegantly and efficiently by relying on the inverse of a certain matrix (the main focus in Dubrule (Citation1983) being on a variogram matrix extended by monomial basis functions) and quantities derived thereof. We work here in terms of covariances and related matrices as in Bachoc (Citation2013), where fast LOO formulas were obtained for simple kriging in the square-integrable case.
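For concreteness, the fast LOO identity just alluded to can be sketched numerically. The snippet below is our own Python/NumPy illustration (the paper's implementation relies on R and DiceKriging; function names here are hypothetical): it checks that the residuals $e_i = (\Sigma^{-1}Z)_i / (\Sigma^{-1})_{ii}$ obtained from a single matrix inversion match those of a naive loop refitting the predictor with each point left out.

```python
import numpy as np

def loo_residuals_fast(Sigma, z):
    """All n LOO residuals from a single inversion of Sigma (O(n^3) total)."""
    B = np.linalg.inv(Sigma)
    return (B @ z) / np.diag(B)

def loo_residuals_naive(Sigma, z):
    """Reference implementation: one Gaussian conditioning per left-out point."""
    n = len(z)
    out = np.empty(n)
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        # BLUP of z[i] from the remaining observations
        pred = Sigma[i, keep] @ np.linalg.solve(Sigma[np.ix_(keep, keep)], z[keep])
        out[i] = z[i] - pred
    return out
```

The naive loop costs one factorization per point, whereas the fast version reuses a single inverse, which is the essence of the speed-ups benchmarked later in the article.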
An elegant by-product formula for the square norm of LOO residuals has been leveraged in Bachoc (Citation2013) for hyperparameter estimation in the stationary case, namely by minimizing this quantity as a function of covariance parameters (excluding the "variance" parameter). In particular, numerical experiments conducted in Bachoc (Citation2013) suggested that leave-one-out error minimization is more robust than maximum likelihood estimation (MLE) in the case of model misspecification, a common situation in practice. Yet, depending on the experimental design, the errors obtained by leaving one point out at a time may be far from representative of generalization errors, as illustrated in Section 6 and in Section A.2.1 of the supplementary material. While remaining with a prescribed design, multiple-fold cross-validation makes it possible to mitigate shortcomings of leave-one-out by considering situations where larger parts of the design (i.e., several points) are removed at a time.
Our primary aim in the present paper is to generalize fast leave-one-out formulas to multiple-fold cross-validation (MFCV), hence facilitating their use in both contexts of model diagnostics and hyperparameter optimization. We derive such formulas and highlight in turn the covariance structure of cross-validation residuals, a crucial ingredient in investigated diagnostics and in approaches pertaining to parameter fitting. Furthermore, following ideas already sketched in Dubrule (Citation1983), we present extensions of our results to the case of universal kriging. The obtained results clarify some links between cross-validation and MLE, and also open new perspectives regarding the design of folds and ultimately also of the underlying experimental design points.
Section 2 presents the considered class of statistical models and recalls the basics of simple and universal kriging under Gaussian process assumptions, with some brief detours in Section 2.1 on how Gaussian linear and ridge regression may fit into this framework as well as on Bayesian interpretations of universal kriging and ridge regression. Prediction equations are revisited in Section 2.2 in transductive settings thanks to Schur complement approaches, hence, setting the decor for fast cross-validation. Section 3 is dedicated to fast multiple-fold cross-validation, starting in Section 3.1 with generic results that apply in Gaussian vector conditioning. We then present in Section 3.2 our main results on the fast calculation and the joint probability distribution of cross-validation residuals in the simple kriging case. The latter are then extended to universal kriging settings in Section 3.3, notably retrieving as a particular case fast cross-validation results for linear regression presented in Zhang (Citation1993). In Section 4 we present consequences of the main results in the context of model fitting. Section 4.1 focuses on graphical diagnostics via pivotal statistics for cross-validation residuals, while the two subsequent sections deal with cross-validation-based covariance parameter estimation. More specifically, Section 4.2 is devoted to the estimation of the scale parameter; Section 4.3 then pertains to the estimation of further hyperparameters building upon either the norm or the (pseudo-)likelihood of multiple-fold cross-validation residuals. Section 5 mostly consists in numerical experiments illustrating the accuracy and speed-ups offered by the closed-form formulas (Section 5.1), and some considerations regarding associated computational complexities (Section 5.2). 
Section 6 presents an application to a contaminant localization test case and illustrates that grouping observations in folds may lead to improved model assessment and parameter fitting compared to LOO. The final Section 7 is a discussion opening perspectives on how the established results could be further exploited to improve diagnostics and parameter estimation procedures in GP modeling.
2 Models and Transductive Prediction Settings
2.1 Of Universal Kriging and Ridge Regression
Throughout the article we consider statistical models of the form
(1) $Z_i = \mu(\mathbf{x}_i) + \eta(\mathbf{x}_i) + \varepsilon_i \quad (1 \le i \le n),$
with observation locations $\mathbf{x}_1,\dots,\mathbf{x}_n \in D$, where $\eta$ is a centered GP with covariance kernel k, μ is a trend function, and $\boldsymbol{\varepsilon} = (\varepsilon_1,\dots,\varepsilon_n)^\top \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Lambda})$ independently of $\eta$. In some cases, μ is known up to some coefficients: unknown constant (ordinary kriging settings) or linear combination of basis functions with unknown coefficients (universal kriging settings). When μ is known, one speaks of simple kriging settings. While several variants of kriging can be defined without distributional assumptions (and without even assuming square-integrability of $\eta$), we restrict the exposition here to the broadly used Gaussian case. However, we do not require k to be stationary. Let us remark that throughout the following exposition on kriging/GP prediction, the kernel (including hyperparameters) is considered as given. In practice, the equations are also employed with plugged in estimates of hyperparameters. Full Bayesian approaches where the uncertainty on kernel hyperparameters is further propagated to the predictions are beyond the considered scope. However, in Sections 4 and 6 we tackle the issue of model validation and hyperparameter fitting via the established cross-validation results.
We consider a linear trend case that encompasses the above settings, and also has some interesting connections to linear (ridge) regression as we will see below. We hence assume μ to be defined by
(2) $\mu(\cdot) = \sum_{j=1}^{p} \beta_j f_j(\cdot),$
where the $f_j$ ($1 \le j \le p$) are prescribed basis functions and the $\beta_j$ ($1 \le j \le p$) are real-valued coefficients. Simple kriging can be retrieved by taking p = 1 with $\beta_1$ known and $f_1$ arbitrary (possibly null), and universal kriging with $\boldsymbol{\beta} = (\beta_1,\dots,\beta_p)^\top$ unknown (ordinary kriging is a special case with p = 1 and $f_1 \equiv 1$). In simple kriging, a predictor of $Y(\mathbf{x}) = \mu(\mathbf{x}) + \eta(\mathbf{x})$ of the form $\lambda(\mathbf{x})^\top Z$ is sought, where $\lambda(\mathbf{x}) \in \mathbb{R}^n$. $\lambda(\mathbf{x})$ is determined by minimizing the residual variance
(3) $\mathrm{Var}\left[ Y(\mathbf{x}) - \lambda(\mathbf{x})^\top Z \right] = k(\mathbf{x},\mathbf{x}) - 2\,\lambda(\mathbf{x})^\top \mathbf{k}(\mathbf{x}) + \lambda(\mathbf{x})^\top \Sigma\, \lambda(\mathbf{x}),$
where $\mathbf{k}(\mathbf{x}) = (k(\mathbf{x}_1,\mathbf{x}),\dots,k(\mathbf{x}_n,\mathbf{x}))^\top$ and $\Sigma = K + \boldsymbol{\Lambda}$ with $K = (k(\mathbf{x}_i,\mathbf{x}_j))_{1\le i,j\le n}$.
In universal kriging, the predictor is of the form $\lambda(\mathbf{x})^\top Z$ and p linear constraints $\sum_{i=1}^{n}\lambda_i(\mathbf{x})f_j(\mathbf{x}_i) = f_j(\mathbf{x})$ for $1 \le j \le p$ are added to the minimization of (3) to ensure unbiasedness of $\lambda(\mathbf{x})^\top Z$. Minimizing (3) under these unbiasedness constraints can be addressed by Lagrange duality, leading to a linear problem featuring for each $\mathbf{x}$ a p-dimensional Lagrange multiplier $\boldsymbol{\ell}(\mathbf{x})$:
(4) $\begin{pmatrix} \Sigma & F \\ F^\top & \mathbf{0}_{p\times p} \end{pmatrix} \begin{pmatrix} \lambda(\mathbf{x}) \\ \boldsymbol{\ell}(\mathbf{x}) \end{pmatrix} = \begin{pmatrix} \mathbf{k}(\mathbf{x}) \\ \mathbf{f}(\mathbf{x}) \end{pmatrix},$
where $F = (f_j(\mathbf{x}_i))_{1\le i\le n,\,1\le j\le p}$ and $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}),\dots,f_p(\mathbf{x}))^\top$. Assuming that $\Sigma$ and $F^\top\Sigma^{-1}F$ are invertible, solving for $\lambda(\mathbf{x})$ delivers the universal kriging predictor, that can ultimately be written, as further detailed in Section A.3 of the supplementary material, as
(5) $m_{UK}(\mathbf{x}) = \mathbf{f}(\mathbf{x})^\top\widehat{\boldsymbol{\beta}} + \mathbf{k}(\mathbf{x})^\top\Sigma^{-1}\left(Z - F\widehat{\boldsymbol{\beta}}\right),$
where $\widehat{\boldsymbol{\beta}} = (F^\top\Sigma^{-1}F)^{-1}F^\top\Sigma^{-1}Z$. In simple kriging with known trend μ, the predictor has the same form as (5) yet with $\mu(\mathbf{x})$ and $\boldsymbol{\mu} = (\mu(\mathbf{x}_1),\dots,\mu(\mathbf{x}_n))^\top$ instead of $\mathbf{f}(\mathbf{x})^\top\widehat{\boldsymbol{\beta}}$ and $F\widehat{\boldsymbol{\beta}}$. Furthermore, the universal kriging residual covariance writes, for arbitrary $\mathbf{x},\mathbf{x}' \in D$:
(6) $c_{UK}(\mathbf{x},\mathbf{x}') = k(\mathbf{x},\mathbf{x}') - \mathbf{k}(\mathbf{x})^\top\Sigma^{-1}\mathbf{k}(\mathbf{x}')$
$\qquad + \left(\mathbf{f}(\mathbf{x}) - F^\top\Sigma^{-1}\mathbf{k}(\mathbf{x})\right)^\top \left(F^\top\Sigma^{-1}F\right)^{-1} \left(\mathbf{f}(\mathbf{x}') - F^\top\Sigma^{-1}\mathbf{k}(\mathbf{x}')\right).$
The simple kriging residual covariance boils down to the right-hand side of the first line of (6). On a different note, it is worth noting that when setting $k \equiv 0$ (and hence $\Sigma = \boldsymbol{\Lambda}$) one retrieves thereby the equations of (generalized) least squares.
We now review the Bayesian approach to universal kriging Omre (Citation1987), Handcock and Stein (Citation1993), and Helbert, Dupuy, and Carraro (Citation2009), that will in turn give us an occasion to incorporate ridge regression into the discussion and will prove practical in further developments throughout the article. Let us assume to this end that $\boldsymbol{\beta}$ is endowed with a Gaussian prior distribution $\mathcal{N}(\mathbf{0},\Gamma)$ where $\Gamma$ is an invertible covariance matrix. Let us first consider $\Gamma$ as fixed. The posterior distribution of $\boldsymbol{\beta}$ knowing Z is Gaussian with
(7) $\mathrm{E}[\boldsymbol{\beta}\mid Z] = \left(\Gamma^{-1} + F^\top\Sigma^{-1}F\right)^{-1}F^\top\Sigma^{-1}Z \quad \text{and} \quad \mathrm{Cov}[\boldsymbol{\beta}\mid Z] = \left(\Gamma^{-1} + F^\top\Sigma^{-1}F\right)^{-1},$
whereby, for $\mathbf{x},\mathbf{x}' \in D$ and $\overline{\boldsymbol{\beta}}_\Gamma = \mathrm{E}[\boldsymbol{\beta}\mid Z]$,
(8) $m_{\Gamma}(\mathbf{x}) = \mathbf{f}(\mathbf{x})^\top\overline{\boldsymbol{\beta}}_\Gamma + \mathbf{k}(\mathbf{x})^\top\Sigma^{-1}\left(Z - F\overline{\boldsymbol{\beta}}_\Gamma\right),$
$\qquad c_{\Gamma}(\mathbf{x},\mathbf{x}') = k(\mathbf{x},\mathbf{x}') - \mathbf{k}(\mathbf{x})^\top\Sigma^{-1}\mathbf{k}(\mathbf{x}') + \left(\mathbf{f}(\mathbf{x}) - F^\top\Sigma^{-1}\mathbf{k}(\mathbf{x})\right)^\top \left(\Gamma^{-1} + F^\top\Sigma^{-1}F\right)^{-1} \left(\mathbf{f}(\mathbf{x}') - F^\top\Sigma^{-1}\mathbf{k}(\mathbf{x}')\right).$
With $\Gamma^{-1} \to \mathbf{0}$, we hence have that $\overline{\boldsymbol{\beta}}_\Gamma \to \widehat{\boldsymbol{\beta}}$ and so $m_{\Gamma} \to m_{UK}$ and $c_{\Gamma} \to c_{UK}$.
We now turn to linear (ridge) regression and present how the resulting equations can be obtained in similar settings. In linear ridge regression, say as classically with $Z = F\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ and $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0},\sigma^2 I_n)$, one estimates $\boldsymbol{\beta}$ via $\widehat{\boldsymbol{\beta}}_\lambda = (F^\top F + \lambda I_p)^{-1}F^\top Z$, where $\lambda > 0$ is a regularization parameter and $I_n$ stands for the identity matrix of $\mathbb{R}^{n\times n}$. The canonical way to introduce this estimator is by replacing the ordinary least squares minimization by the minimization of its penalized counterpart $\|Z - F\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2$. Yet it is well-known that $\widehat{\boldsymbol{\beta}}_\lambda$ can also be seen as maximum a posteriori estimator in a Bayesian framework, under the prior distribution $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0},\Gamma)$, where $\Gamma = (\sigma^2/\lambda)\,I_p$. In such a framework, one has indeed $\mathrm{Cov}[\boldsymbol{\beta}\mid Z] = \left(\Gamma^{-1} + \sigma^{-2}F^\top F\right)^{-1}$ and $\mathrm{E}[\boldsymbol{\beta}\mid Z] = \sigma^{-2}\left(\Gamma^{-1} + \sigma^{-2}F^\top F\right)^{-1}F^\top Z$. From there one can obtain, for instance by using the Woodbury formula, that $\sigma^{-2}\left(\Gamma^{-1} + \sigma^{-2}F^\top F\right)^{-1}F^\top = \left(F^\top F + \lambda I_p\right)^{-1}F^\top$, whereby $\mathrm{E}[\boldsymbol{\beta}\mid Z] = \widehat{\boldsymbol{\beta}}_\lambda$, which coincides in the considered Gaussian case with the MAP estimator. For F full column-ranked and $\lambda \to 0$, the posterior distribution is proper and one recovers as posterior expectation and MAP the ordinary least squares estimator $\widehat{\boldsymbol{\beta}} = (F^\top F)^{-1}F^\top Z$.
2.2 Transductive GP Prediction via Schur Complements
One speaks of transductive settings in cases where the point(s) at which one wishes to predict are fixed in advance. We consider in this section the case where the $Y(\mathbf{x}_i)$ (resp. the $Z_i$) are to be predicted for $m \ge 1$ distinct values of i ($m \le n - p$ in the case of universal kriging) from the remaining n−m observations $Z_j$. Without loss of generality, we will assume that the indices at which the predictions are to be made are $\mathbf{i} = (1,\dots,m)$ and denote by $-\mathbf{i} = (m+1,\dots,n)$ the remaining ones. For any vector $v \in \mathbb{R}^n$, matrix $A \in \mathbb{R}^{n\times n}$, and arbitrary vectors of ordered indices $\mathbf{i}, \mathbf{j}$, we denote here and in the following by $v_{\mathbf{i}}$ the subvector of v with indices in i and by $A_{\mathbf{i},\mathbf{j}}$ the block extracted from A with corresponding indices, using in turn the convention $A_{\mathbf{i}} = A_{\mathbf{i},\mathbf{i}}$. Similarly, we use the convention $F_{\mathbf{i}}$ for the submatrix of F with rows indexed by $\mathbf{i}$.
In simple kriging settings (here with $\mu = 0$), it is common knowledge that the BLUP of $Z_{\mathbf{i}}$ based on $Z_{-\mathbf{i}}$ and the associated residual covariance matrix can be obtained elegantly based on manipulating the partitioned covariance matrix
(9) $\Sigma = \begin{pmatrix} \Sigma_{\mathbf{i}} & \Sigma_{\mathbf{i},-\mathbf{i}} \\ \Sigma_{-\mathbf{i},\mathbf{i}} & \Sigma_{-\mathbf{i}} \end{pmatrix}.$
Assuming indeed that $\Sigma$ is invertible, we obtain by block inversion formula (See Section A.1 in supplementary material, in particular Proposition A.1) that $(\Sigma^{-1})_{\mathbf{i}} = \left(\Sigma_{\mathbf{i}} - \Sigma_{\mathbf{i},-\mathbf{i}}\Sigma_{-\mathbf{i}}^{-1}\Sigma_{-\mathbf{i},\mathbf{i}}\right)^{-1}$, whereby
(10) $\left[(\Sigma^{-1})_{\mathbf{i}}\right]^{-1} = \Sigma_{\mathbf{i}} - \Sigma_{\mathbf{i},-\mathbf{i}}\Sigma_{-\mathbf{i}}^{-1}\Sigma_{-\mathbf{i},\mathbf{i}},$
where $(\Sigma^{-1})_{\mathbf{i}}$ stands for the m × m upper left block of $\Sigma^{-1}$. Using now the upper right block $(\Sigma^{-1})_{\mathbf{i},-\mathbf{i}}$ of $\Sigma^{-1}$ and (10), we obtain similarly that $(\Sigma^{-1})_{\mathbf{i},-\mathbf{i}} = -(\Sigma^{-1})_{\mathbf{i}}\,\Sigma_{\mathbf{i},-\mathbf{i}}\Sigma_{-\mathbf{i}}^{-1}$, so
(11) $\Sigma_{\mathbf{i},-\mathbf{i}}\Sigma_{-\mathbf{i}}^{-1} = -\left[(\Sigma^{-1})_{\mathbf{i}}\right]^{-1}(\Sigma^{-1})_{\mathbf{i},-\mathbf{i}}.$
Applying now under invertibility assumption a similar block inversion to the correspondingly partitioned precision matrix
(12) $\Sigma^{-1} = \begin{pmatrix} (\Sigma^{-1})_{\mathbf{i}} & (\Sigma^{-1})_{\mathbf{i},-\mathbf{i}} \\ (\Sigma^{-1})_{-\mathbf{i},\mathbf{i}} & (\Sigma^{-1})_{-\mathbf{i}} \end{pmatrix},$
we get $\mathrm{E}[Z_{\mathbf{i}}\mid Z_{-\mathbf{i}}] = \Sigma_{\mathbf{i},-\mathbf{i}}\Sigma_{-\mathbf{i}}^{-1}Z_{-\mathbf{i}}$ and, using (11), $\Sigma_{\mathbf{i},-\mathbf{i}}\Sigma_{-\mathbf{i}}^{-1} = -\left[(\Sigma^{-1})_{\mathbf{i}}\right]^{-1}(\Sigma^{-1})_{\mathbf{i},-\mathbf{i}}$, whereby the BLUP of $Z_{\mathbf{i}}$ given $Z_{-\mathbf{i}}$, denoted here and in the following by $\widehat{Z}_{\mathbf{i}}$, is given by
(13) $\widehat{Z}_{\mathbf{i}} = -\left[(\Sigma^{-1})_{\mathbf{i}}\right]^{-1}(\Sigma^{-1})_{\mathbf{i},-\mathbf{i}}\,Z_{-\mathbf{i}}.$
Considering now the prediction residual $E_{\mathbf{i}} = Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}$, we get from the latter
(14) $E_{\mathbf{i}} = \left[(\Sigma^{-1})_{\mathbf{i}}\right]^{-1}\left(\Sigma^{-1}Z\right)_{\mathbf{i}},$
with $\mathrm{Cov}\left[E_{\mathbf{i}}\right] = \left[(\Sigma^{-1})_{\mathbf{i}}\right]^{-1}$. Let us stress that $\widehat{Z}_{\mathbf{i}}$ departs from the most common simple kriging predictor via the treatment of noise, in the sense that in the considered transductive settings it can be relevant to predict the noisy $Z_{\mathbf{i}}$ instead of the corresponding $Y(\mathbf{x}_i)$'s, and since a general noise covariance matrix is assumed for the sake of generality, this implies a straightforward adaptation of the blocks entering into play (i.e., blocks of Σ instead of those of $K$). Of course, in case $\boldsymbol{\Lambda} = \mathbf{0}$ (e.g., under a null observation noise, as often encountered in computer experiments), the two approaches above coincide.
We will now see that under invertibility conditions, the universal kriging predictor and residual covariance too can be retrieved based on the inverse of a matrix, namely of
(15) $M = \begin{pmatrix} \Sigma & F \\ F^\top & \mathbf{0}_{p\times p} \end{pmatrix},$
where the corresponding block $F_{\mathbf{i}}$ of F is the m × p matrix of basis functions evaluated at the prediction points. As we prove below, assuming invertibility of $\Sigma$ and of $F^\top\Sigma^{-1}F$, the universal kriging residual covariance matrix can then be plainly obtained as follows:
(16) $\mathrm{Cov}\left[Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}^{UK}\right] = \left[(M^{-1})_{\mathbf{i}}\right]^{-1}.$
Furthermore, the universal kriging predictor too writes similarly as in the simple kriging case:
(17) $\widehat{Z}_{\mathbf{i}}^{UK} = Z_{\mathbf{i}} - \left[(M^{-1})_{\mathbf{i}}\right]^{-1}\left(M^{-1}\widetilde{Z}\right)_{\mathbf{i}}, \quad \text{with } \widetilde{Z} = (Z^\top, \mathbf{0}_p^\top)^\top.$
In fact, in the two last equations, what changed between simple and universal kriging settings is simply the replacement of $\Sigma$ by $M$. Let us now present how this works. Assuming indeed that $\Sigma$ and $F^\top\Sigma^{-1}F$ are invertible, we first apply Proposition A.1 and obtain
(18) $\left[(M^{-1})_{\mathbf{i}}\right]^{-1} = \Sigma_{\mathbf{i}} - \begin{pmatrix}\Sigma_{\mathbf{i},-\mathbf{i}} & F_{\mathbf{i}}\end{pmatrix} \begin{pmatrix} \Sigma_{-\mathbf{i}} & F_{-\mathbf{i}} \\ F_{-\mathbf{i}}^\top & \mathbf{0} \end{pmatrix}^{-1} \begin{pmatrix}\Sigma_{-\mathbf{i},\mathbf{i}} \\ F_{\mathbf{i}}^\top\end{pmatrix}.$
Now, provided $\Sigma_{-\mathbf{i}}$ and $Q_{-\mathbf{i}} = F_{-\mathbf{i}}^\top\Sigma_{-\mathbf{i}}^{-1}F_{-\mathbf{i}}$ are invertible, we get using a variation of the block inversion formula (detailed in Remark A.3):
(19) $\begin{pmatrix} \Sigma_{-\mathbf{i}} & F_{-\mathbf{i}} \\ F_{-\mathbf{i}}^\top & \mathbf{0} \end{pmatrix}^{-1} = \begin{pmatrix} \Sigma_{-\mathbf{i}}^{-1} - \Sigma_{-\mathbf{i}}^{-1}F_{-\mathbf{i}}Q_{-\mathbf{i}}^{-1}F_{-\mathbf{i}}^\top\Sigma_{-\mathbf{i}}^{-1} & \Sigma_{-\mathbf{i}}^{-1}F_{-\mathbf{i}}Q_{-\mathbf{i}}^{-1} \\ Q_{-\mathbf{i}}^{-1}F_{-\mathbf{i}}^\top\Sigma_{-\mathbf{i}}^{-1} & -Q_{-\mathbf{i}}^{-1} \end{pmatrix}.$
Expanding the product on the right-hand side of (18) using (19) then delivers
(20) $\left[(M^{-1})_{\mathbf{i}}\right]^{-1} = \Sigma_{\mathbf{i}} - \Sigma_{\mathbf{i},-\mathbf{i}}\Sigma_{-\mathbf{i}}^{-1}\Sigma_{-\mathbf{i},\mathbf{i}} + \left(F_{\mathbf{i}} - \Sigma_{\mathbf{i},-\mathbf{i}}\Sigma_{-\mathbf{i}}^{-1}F_{-\mathbf{i}}\right) Q_{-\mathbf{i}}^{-1} \left(F_{\mathbf{i}} - \Sigma_{\mathbf{i},-\mathbf{i}}\Sigma_{-\mathbf{i}}^{-1}F_{-\mathbf{i}}\right)^\top,$
which establishes (16). As for (17), it then follows from Proposition A.2 and (19) that
$\widehat{Z}_{\mathbf{i}}^{UK} = F_{\mathbf{i}}\widehat{\boldsymbol{\beta}}_{-\mathbf{i}} + \Sigma_{\mathbf{i},-\mathbf{i}}\Sigma_{-\mathbf{i}}^{-1}\left(Z_{-\mathbf{i}} - F_{-\mathbf{i}}\widehat{\boldsymbol{\beta}}_{-\mathbf{i}}\right),$
with $\widehat{\boldsymbol{\beta}}_{-\mathbf{i}} = Q_{-\mathbf{i}}^{-1}F_{-\mathbf{i}}^\top\Sigma_{-\mathbf{i}}^{-1}Z_{-\mathbf{i}}$ the GLS estimator of $\boldsymbol{\beta}$ based on $Z_{-\mathbf{i}}$.
Applying now under suitable invertibility assumptions a similar block inversion to the correspondingly partitioned
(21) $M^{-1} = \begin{pmatrix} (M^{-1})_{\mathbf{i}} & (M^{-1})_{\mathbf{i},\mathbf{c}} \\ (M^{-1})_{\mathbf{c},\mathbf{i}} & (M^{-1})_{\mathbf{c}} \end{pmatrix}, \quad \text{with } \mathbf{c} = (m+1,\dots,n+p),$
we get the BLUP $\widehat{Z}_{\mathbf{i}}^{UK}$ of $Z_{\mathbf{i}}$ given $Z_{-\mathbf{i}}$ under universal kriging settings as
(22) $\widehat{Z}_{\mathbf{i}}^{UK} = -\left[(M^{-1})_{\mathbf{i}}\right]^{-1}(M^{-1})_{\mathbf{i},\mathbf{c}}\,\widetilde{Z}_{\mathbf{c}}.$
Considering further the associated prediction residual $E_{\mathbf{i}}^{UK} = Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}^{UK}$, we finally get
(23) $E_{\mathbf{i}}^{UK} = \left[(M^{-1})_{\mathbf{i}}\right]^{-1}\left(M^{-1}\widetilde{Z}\right)_{\mathbf{i}},$
with $\mathrm{Cov}\left[E_{\mathbf{i}}^{UK}\right] = \left[(M^{-1})_{\mathbf{i}}\right]^{-1}$.
3 Fast Multiple-Fold Cross-Validation
In this section we leverage transductive GP equations in settings when subsets of the n observations are left out, delivering in turn the joint distribution of multiple-fold cross-validation residuals together with a fast computation approach. The key is that the previous results of (14) and (23) carry over to arbitrary ways of separating the dataset into left-out and remaining observations (i.e., between test and learning sets). Throughout the following we consider vectors of strictly ordered indices from $\{1,\dots,n\}$, and denote by $\mathcal{I}$ the set of all such index vectors. Since these index vectors are completely characterized by non-empty subsets of $\{1,\dots,n\}$, there are $2^n - 1$ elements in $\mathcal{I}$. Elements $\mathbf{i} \in \mathcal{I}$, that may also be thought of as subsets of $\{1,\dots,n\}$, are called folds.
3.1 The Generic Gaussian Vector Case
We now consider generic Gaussian cross-validation settings and deterministic functions arising in this context. $v \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n\times n}$ stand for a vector and a symmetric positive definite matrix, respectively. For any $\mathbf{i} \in \mathcal{I}$, we define the following cross-validation residual mapping:
(24) $e_{\mathbf{i}}: v \in \mathbb{R}^n \longmapsto v_{\mathbf{i}} - A_{\mathbf{i},-\mathbf{i}}\,A_{-\mathbf{i}}^{-1}\,v_{-\mathbf{i}},$
delivering for $V \sim \mathcal{N}(\mathbf{0}, A)$ the deterministic counterpart of the cross-validation residual obtained when predicting $V_{\mathbf{i}}$ by $\mathrm{E}[V_{\mathbf{i}}\mid V_{-\mathbf{i}}] = A_{\mathbf{i},-\mathbf{i}}A_{-\mathbf{i}}^{-1}V_{-\mathbf{i}}$, where $-\mathbf{i}$ denotes the vector of the indices of $\{1,\dots,n\}\setminus\mathbf{i}$ in increasing order. The notation $v_{-\mathbf{i}}$ (resp. $A_{-\mathbf{i}}$) refers again to the extraction of sub-vectors/matrices by removing components with indices in i.
Proposition 1
(Fast calculation of cross-validation residuals). Let $\mathbf{i} \in \mathcal{I}$ and $B = A^{-1}$. Then,
(25) $e_{\mathbf{i}}(v) = B_{\mathbf{i}}^{-1}\left(Bv\right)_{\mathbf{i}} \quad (v \in \mathbb{R}^n).$
Consequently, for $V \sim \mathcal{N}(\mathbf{0}, A)$, the $E_{\mathbf{i}} = e_{\mathbf{i}}(V)$ ($\mathbf{i} \in \mathcal{I}$) are jointly Gaussian, centered, and
(26) $\mathrm{Cov}\left[E_{\mathbf{i}}, E_{\mathbf{j}}\right] = B_{\mathbf{i}}^{-1}\,B_{\mathbf{i},\mathbf{j}}\,B_{\mathbf{j}}^{-1} \quad (\mathbf{i},\mathbf{j} \in \mathcal{I}).$
For the case of folds $\mathbf{i}_1,\dots,\mathbf{i}_q$ for which concatenation gives $(1,\dots,n)$, $E = (E_{\mathbf{i}_1}^\top,\dots,E_{\mathbf{i}_q}^\top)^\top$ has centered n-dimensional Gaussian distribution with covariance matrix
(27) $\mathrm{Cov}[E] = \Delta^{-1}\,B\,\Delta^{-1},$
where $\Delta = \mathrm{blockdiag}\left(B_{\mathbf{i}_1},\dots,B_{\mathbf{i}_q}\right)$.
Proof.
Equation (25) follows from suitably expressing blocks of A's inverse using the Schur complement approach of Equations (A.1) and (A.2), delivering, respectively,
$B_{\mathbf{i}} = \left(A_{\mathbf{i}} - A_{\mathbf{i},-\mathbf{i}}A_{-\mathbf{i}}^{-1}A_{-\mathbf{i},\mathbf{i}}\right)^{-1} \quad \text{and} \quad B_{\mathbf{i},-\mathbf{i}} = -B_{\mathbf{i}}\,A_{\mathbf{i},-\mathbf{i}}A_{-\mathbf{i}}^{-1}.$
Considering the rows indexed by i of the product $Bv$, we hence get that
$(Bv)_{\mathbf{i}} = B_{\mathbf{i}}v_{\mathbf{i}} + B_{\mathbf{i},-\mathbf{i}}v_{-\mathbf{i}} = B_{\mathbf{i}}\left(v_{\mathbf{i}} - A_{\mathbf{i},-\mathbf{i}}A_{-\mathbf{i}}^{-1}v_{-\mathbf{i}}\right) = B_{\mathbf{i}}\,e_{\mathbf{i}}(v),$
whereby $e_{\mathbf{i}}(v) = B_{\mathbf{i}}^{-1}(Bv)_{\mathbf{i}}$. The joint Gaussianity of the $E_{\mathbf{i}}$'s and the covariance structure follow from the affine dependence of $e_{\mathbf{i}}(V)$ on V, so that concatenating the $q$ random vectors $E_{\mathbf{i}_1},\dots,E_{\mathbf{i}_q}$ leads to a Gaussian vector by left multiplication of $V$ by a deterministic matrix. As for the covariance matrix of $E$, it follows from $E = \Delta^{-1}BV$, hence $\mathrm{Cov}[E] = \Delta^{-1}B\,A\,B\,\Delta^{-1} = \Delta^{-1}B\Delta^{-1}$ since $BA = I_n$, where $I_n$ denotes the n × n identity matrix. □
Remark 1.
The first part of Proposition 1 is purely deterministic, and hence applies to any function approximation method that boils down to Gaussian conditioning equations (e.g., regularized quadratic risk minimization in Reproducing Kernel Hilbert Spaces).
Remark 2.
Note that for arbitrary $\mathbf{i}_1,\dots,\mathbf{i}_q \in \mathcal{I}$, that is without imposing ordering between them or that they form a partition, we would have a similar result yet without the above simplification, that is,
$\mathrm{Cov}[E] = \Delta^{-1}\left(B_{\mathbf{i}_\ell,\mathbf{i}_{\ell'}}\right)_{1\le \ell,\ell'\le q}\Delta^{-1},$
with $\Delta = \mathrm{blockdiag}\left(B_{\mathbf{i}_1},\dots,B_{\mathbf{i}_q}\right)$. An extreme case would be to consider all possible non-empty subsets of $\{1,\dots,n\}$, leading to $q = 2^n - 1$ and $\sum_{\ell=1}^{q}|\mathbf{i}_\ell| = n\,2^{n-1}$ lines for Δ.
Remark 3.
In the extreme case where q = n and the $\mathbf{i}_j$'s are set to (j) ($1 \le j \le n$), one recovers on the other hand fast leave-one-out cross-validation formulas, and we obtain as a by-product the covariance matrix of leave-one-out errors
(28) $\mathrm{Cov}\left[E^{\mathrm{LOO}}\right] = D^{-1}\,B\,D^{-1}, \quad \text{where } D = \mathrm{diag}\left(B_{11},\dots,B_{nn}\right).$
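Proposition 1 lends itself to a compact numerical sketch. The Python/NumPy code below (our own illustration with assumed notation, not the paper's R implementation) computes all fold residuals from a single inversion of A, together with the closed-form cross-fold covariances, and compares them to naive refitting:

```python
import numpy as np

def mfcv_residuals_fast(A, v, folds):
    """Fold residuals e_i(v) = B_ii^{-1} (B v)_i with B = A^{-1} (Prop. 1)."""
    B = np.linalg.inv(A)
    Bv = B @ v
    return [np.linalg.solve(B[np.ix_(i, i)], Bv[i]) for i in folds]

def mfcv_residuals_naive(A, v, folds):
    """Reference: one Gaussian conditioning per fold."""
    n = len(v)
    out = []
    for i in folds:
        rest = [j for j in range(n) if j not in i]
        pred = A[np.ix_(i, rest)] @ np.linalg.solve(A[np.ix_(rest, rest)], v[rest])
        out.append(v[i] - pred)
    return out

def mfcv_cross_cov(A, i, j):
    """Cov(E_i, E_j) = B_ii^{-1} B_ij B_jj^{-1}, closed form from (26)."""
    B = np.linalg.inv(A)
    Bii = np.linalg.inv(B[np.ix_(i, i)])
    Bjj = np.linalg.inv(B[np.ix_(j, j)])
    return Bii @ B[np.ix_(i, j)] @ Bjj
```

For i = j, the cross-covariance reduces to the within-fold residual covariance, i.e., the Schur complement of (10).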
3.2 The Simple Kriging Case
We now come back to GP prediction settings and first focus on simple kriging, following the observation model from the previous section, where we recall that $Z_i = \mu(\mathbf{x}_i) + \eta(\mathbf{x}_i) + \varepsilon_i$ ($1 \le i \le n$), where $\eta$ is a centered GP with covariance kernel k and μ is a trend function, and $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Lambda})$ independently of $\eta$. μ is first assumed null, an assumption that we will relax later on. Like in the exposition of the previous section, we denote Z the observation vector and, for any $\mathbf{i} \in \mathcal{I}$, $E_{\mathbf{i}} = Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}$ where $\widehat{Z}_{\mathbf{i}}$ is the BLUP of $Z_{\mathbf{i}}$ based on $Z_{-\mathbf{i}}$. We assume throughout that $\Sigma$ is invertible, with $B = \Sigma^{-1}$.
Corollary 1.
The $E_{\mathbf{i}}$ ($\mathbf{i} \in \mathcal{I}$) are jointly Gaussian, centered, and
(29) $\mathrm{Cov}\left[E_{\mathbf{i}}, E_{\mathbf{j}}\right] = B_{\mathbf{i}}^{-1}\,B_{\mathbf{i},\mathbf{j}}\,B_{\mathbf{j}}^{-1} \quad (\mathbf{i},\mathbf{j} \in \mathcal{I}),$
where $B = \Sigma^{-1}$. For the case of folds $\mathbf{i}_1,\dots,\mathbf{i}_q$ for which concatenation gives $(1,\dots,n)$, $E = (E_{\mathbf{i}_1}^\top,\dots,E_{\mathbf{i}_q}^\top)^\top$ has centered n-dimensional Gaussian distribution with covariance matrix
(30) $\mathrm{Cov}[E] = \Delta^{-1}\,B\,\Delta^{-1},$
where $\Delta = \mathrm{blockdiag}\left(B_{\mathbf{i}_1},\dots,B_{\mathbf{i}_q}\right)$.
Proof.
Directly follows from Proposition 1 using $V = Z$ with $A = \Sigma$. □
Remark 4.
Extending these results to the case of simple kriging with a known trend function μ directly follows from applying Corollary 1 to the centered observation vector $Z - \boldsymbol{\mu}$, with $\boldsymbol{\mu} = (\mu(\mathbf{x}_1),\dots,\mu(\mathbf{x}_n))^\top$.
Remark 5.
In the case of Remark 3, we then obtain $E^{\mathrm{LOO}} = D^{-1}BZ$ whereby $\mathrm{Cov}\left[E^{\mathrm{LOO}}\right] = D^{-1}BD^{-1}$, leading in particular to a correlation matrix of LOO residuals with entries
(31) $\mathrm{Corr}\left[E_i^{\mathrm{LOO}}, E_j^{\mathrm{LOO}}\right] = \frac{B_{ij}}{\sqrt{B_{ii}\,B_{jj}}} \quad (1 \le i,j \le n).$
As a by-product of Proposition 1 (in the limiting case of LOO, see Remark 3), we represent on Figure 2 the correlations between the LOO prediction residual at the leftmost location versus the LOO residuals at the other locations in the settings of the opening example from Figure 1.
Fig. 2 Correlations between LOO residuals at design points in the example displayed on Figure 1. This figure was realized with the R package corrplot Wei and Simko (Citation2021).
![Fig. 2 Correlations between LOO residuals at design points in the example displayed on Figure 1. This figure was realized with the R package corrplot Wei and Simko (Citation2021).](/cms/asset/f5c6c23e-b127-4203-9f46-0a17821914cd/ucgs_a_2353633_f0002_c.jpg)
A first interesting thing that we would like to stress here is that the correlation between LOO prediction residuals at two successive locations is negative, with a value below -0.5 (let us note that on Figure 2, the radii of the disks represent the magnitude of correlations, while their signs, respectively positive and negative, are represented via the black and white colors). An illustration of this effect can be seen on the upper panel of Figure 1, see in particular the fourth and fifth locations. Coming back to Figure 2, after the second location the correlation goes back to positive but with a smaller magnitude, and subsequently appears to continue with a damped oscillation until stationing around zero. As we will see in Section 4, accounting for these correlations is instrumental in producing suitable Q-Q plots from cross-validation residuals. We extend the analytical example in Section 3.3, with a focus on multiple-fold cross-validation.
3.3 Extension to Universal Kriging Case and More
Let us now formulate a corollary to Proposition 1 for the case of universal kriging such as presented throughout Section 2 in general and transductive settings.
Corollary 2.
For any $\mathbf{i} \in \mathcal{I}$ such that $F_{-\mathbf{i}}$ is full column-ranked, using the notation $\widehat{Z}_{\mathbf{i}}^{UK}$ to denote the BLUP of $Z_{\mathbf{i}}$ based on $Z_{-\mathbf{i}}$ in universal kriging settings, the residual $E_{\mathbf{i}}^{UK} = Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}^{UK}$ writes
(32) $E_{\mathbf{i}}^{UK} = \widetilde{B}_{\mathbf{i}}^{-1}\left(\widetilde{B}Z\right)_{\mathbf{i}},$
where M is defined in (15), and $\widetilde{B}$ denotes the upper left n × n block of $M^{-1}$, with $\widetilde{B} = \Sigma^{-1} - \Sigma^{-1}F\left(F^\top\Sigma^{-1}F\right)^{-1}F^\top\Sigma^{-1}$. Consequently, for any q > 1 and $\mathbf{i}_1,\dots,\mathbf{i}_q \in \mathcal{I}$ such that the $F_{-\mathbf{i}_\ell}$'s are full column-ranked, the $E_{\mathbf{i}_\ell}^{UK}$ are jointly Gaussian, centered, and with covariance structure given by
(33) $\mathrm{Cov}\left[E_{\mathbf{i}_\ell}^{UK}, E_{\mathbf{i}_{\ell'}}^{UK}\right] = \widetilde{B}_{\mathbf{i}_\ell}^{-1}\,\widetilde{B}_{\mathbf{i}_\ell,\mathbf{i}_{\ell'}}\,\widetilde{B}_{\mathbf{i}_{\ell'}}^{-1}.$
In particular, for the case of index vectors $\mathbf{i}_1,\dots,\mathbf{i}_q$ forming a partition of $\{1,\dots,n\}$ and such that the concatenation of $\mathbf{i}_1,\dots,\mathbf{i}_q$ gives $(1,\dots,n)$, then the concatenated vector of cross-validation residuals $E^{UK} = ((E_{\mathbf{i}_1}^{UK})^\top,\dots,(E_{\mathbf{i}_q}^{UK})^\top)^\top$ has a n-dimensional centered Gaussian distribution with covariance matrix
(34) $\mathrm{Cov}\left[E^{UK}\right] = \widetilde{\Delta}^{-1}\,\widetilde{B}\,\widetilde{\Delta}^{-1},$
where $\widetilde{\Delta} = \mathrm{blockdiag}\left(\widetilde{B}_{\mathbf{i}_1},\dots,\widetilde{B}_{\mathbf{i}_q}\right)$.
Proof.
A direct way to prove this property is to first generalize (23) to the case of generic folds $\mathbf{i}$ such that $F_{-\mathbf{i}}$ is full column-ranked, which establishes (32), and then to follow the steps of Proposition 1 regarding the covariance structure and the joint Gaussian distribution of cross-validation residuals. Another proof taking a Bayesian route is sketched below. □
Remark 6.
As we have recalled in Section 2, it is known that universal kriging equations can be retrieved in a Bayesian setting by endowing $\boldsymbol{\beta}$ with a Gaussian prior distribution $\mathcal{N}(\mathbf{0},\Gamma)$ with invertible $\Gamma$ and letting $\Gamma^{-1}$ tend toward $\mathbf{0}$. Under this model (at fixed invertible $\Gamma$), the random vector Z would be centered Gaussian with a covariance matrix $\Sigma_\Gamma = \Sigma + F\Gamma F^\top$. Then one may use Proposition 1 with $V = Z$ and $A = \Sigma_\Gamma$ to find that the cross-validation errors $E_{\mathbf{i}}^{\Gamma} = Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}^{\Gamma}$ are jointly Gaussian with covariance structure
$\mathrm{Cov}\left[E_{\mathbf{i}}^{\Gamma}, E_{\mathbf{j}}^{\Gamma}\right] = (B_\Gamma)_{\mathbf{i}}^{-1}\,(B_\Gamma)_{\mathbf{i},\mathbf{j}}\,(B_\Gamma)_{\mathbf{j}}^{-1},$
where $\widehat{Z}_{\mathbf{i}}^{\Gamma}$ denotes the BLUP (conditional expectation) of $Z_{\mathbf{i}}$ based on $Z_{-\mathbf{i}}$ under the prior distribution $\mathcal{N}(\mathbf{0},\Gamma)$ on $\boldsymbol{\beta}$, and $B_\Gamma = \Sigma_\Gamma^{-1}$. Now by block inversion and Woodbury formula, we get
$B_\Gamma = \Sigma^{-1} - \Sigma^{-1}F\left(\Gamma^{-1} + F^\top\Sigma^{-1}F\right)^{-1}F^\top\Sigma^{-1}.$
From there we can conclude since, for $\Gamma^{-1} \to \mathbf{0}$, $B_\Gamma \to \widetilde{B}$.
Remark 7.
The latter corollary enables retrieving fast formulas for multiple-fold cross-validation residuals of linear regression models of Zhang (Citation1993) as well as associated covariances. In fact, putting $k \equiv 0$ and $\boldsymbol{\Lambda} = \sigma^2 I_n$ (where $\sigma^2 > 0$) delivers $\widetilde{B} = \sigma^{-2}(I_n - H)$ with $H = F(F^\top F)^{-1}F^\top$, whereby
$E_{\mathbf{i}} = \left[(I_n - H)_{\mathbf{i}}\right]^{-1}\left((I_n - H)Z\right)_{\mathbf{i}}$
and
$\mathrm{Cov}\left[E_{\mathbf{i}}, E_{\mathbf{j}}\right] = \sigma^2\left[(I_n - H)_{\mathbf{i}}\right]^{-1}(I_n - H)_{\mathbf{i},\mathbf{j}}\left[(I_n - H)_{\mathbf{j}}\right]^{-1}.$
Similarly, in ridge regression, we obtain by putting $k \equiv 0$, $\boldsymbol{\Lambda} = \sigma^2 I_n$, and $\Gamma = (\sigma^2/\lambda)I_p$,
(35) $E_{\mathbf{i}}^{\lambda} = \left[(I_n - H_\lambda)_{\mathbf{i}}\right]^{-1}\left((I_n - H_\lambda)Z\right)_{\mathbf{i}},$
where $H_\lambda = F\left(F^\top F + \lambda I_p\right)^{-1}F^\top$. Consequently, for any q > 1 and $\mathbf{i}_1,\dots,\mathbf{i}_q \in \mathcal{I}$, the $E_{\mathbf{i}_\ell}^{\lambda}$ are jointly Gaussian, centered, and with covariance structure given by
(36) $\mathrm{Cov}\left[E_{\mathbf{i}_\ell}^{\lambda}, E_{\mathbf{i}_{\ell'}}^{\lambda}\right] = \sigma^2\left[(I_n - H_\lambda)_{\mathbf{i}_\ell}\right]^{-1}(I_n - H_\lambda)_{\mathbf{i}_\ell,\mathbf{i}_{\ell'}}\left[(I_n - H_\lambda)_{\mathbf{i}_{\ell'}}\right]^{-1}.$
Many other models could enjoy similar results (encompassing kernel ridge regression and least squares support vector machines, see An, Liu, and Venkatesh (Citation2007) where similar fast computations have been obtained for these model classes, yet without investigating the covariances and joint distribution of CV residuals) but in the following we will stick to GPs as they are our main model of interest and the principles are readily applicable to other predictors falling under the settings of Proposition 1.
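In the single-point fold case, the ridge formulas above reduce to the classical hat-matrix shortcut $e_i = (y_i - (H_\lambda y)_i)/(1 - (H_\lambda)_{ii})$. The following Python/NumPy sketch (our own illustration; function names are hypothetical) checks this against naive refitting:

```python
import numpy as np

def ridge_loo_fast(X, y, lam):
    """All n ridge LOO residuals from the hat matrix H = X (X'X + lam I)^{-1} X'."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return (y - H @ y) / (1.0 - np.diag(H))

def ridge_loo_naive(X, y, lam):
    """Reference: refit the ridge estimator n times, one point left out each time."""
    n, p = X.shape
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(p),
                               X[keep].T @ y[keep])
        out[i] = y[i] - X[i] @ beta
    return out
```

The shortcut is exact for ridge regression with fixed λ, as a rank-one (Sherman-Morrison) update argument shows.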
4 Some Consequences in GP Model Fitting
4.1 Graphical Diagnostics and Pivotal Statistics
We now wish to further explore the potential of the established formulas for diagnosing the quality of GP models. For simplicity, we will in what follows either remain generic or stick to simple kriging settings. A first immediate consequence of Proposition 1 and Corollary 2 for model fitting is that, under the hypothesis that observations are generated from the assumed model, cross-validation residuals can be transformed via appropriate matrix multiplications into a random vector with standard multivariate normal distribution. Assuming for simplicity that $\Sigma_E = \mathrm{Cov}[E] = \Delta^{-1}B\Delta^{-1}$ is full-ranked and denoting by $\Sigma_E^{1/2}$ any square root thereof, it follows that
$\Sigma_E^{-1/2}E \sim \mathcal{N}\left(\mathbf{0}, I_n\right).$
In fact, any matrix $T$ such that $T\,\Sigma_E\,T^\top = I_n$ (remaining for simplicity under the final assumptions of Corollary 1 where $\Sigma_E = \Delta^{-1}B\Delta^{-1}$) does the job. More specifically, one gets indeed with $T = C^\top\Delta$, where $\Sigma = CC^\top$ stands for a square-root decomposition of $\Sigma$ (e.g., with C a Cholesky factor), that $TE = C^\top\Sigma^{-1}Z \sim \mathcal{N}(\mathbf{0}, I_n)$, and hence the hypothesis that the model is correct can be questioned using any standard means relying on such a pivotal quantity with multivariate Gaussian distribution (e.g., by means of a $\chi^2$ test statistic or graphical diagnostics such as Q-Q plots). Similar considerations apply to the UK settings of Corollary 2.
Remark 8.
If Δ is not $I_n$ but still an invertible matrix, then the above can be adapted with $T = C^\top\Delta$. In case Δ is not invertible, analogous approaches can be employed relying on the Moore-Penrose pseudo-inverse of $\mathrm{Cov}[E]$.
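Such a whitening transformation is easy to sketch in the LOO case. The Python/NumPy code below (our own illustration under assumed notation) builds the covariance of LOO residuals and a matrix mapping observations to whitened residuals, and one can check that the resulting map indeed yields unit covariance:

```python
import numpy as np

def whitened_loo_map(Sigma):
    """Return T such that T @ z gives whitened LOO residuals under the model
    z ~ N(0, Sigma): LOO residuals are E = D^{-1} B z with B = Sigma^{-1},
    D = diag(B), and Cov(E) = D^{-1} B D^{-1} is undone by a Cholesky factor."""
    B = np.linalg.inv(Sigma)
    Dinv = np.diag(1.0 / np.diag(B))
    cov_E = Dinv @ B @ Dinv                  # covariance of LOO residuals
    L = np.linalg.cholesky(cov_E)            # cov_E = L L^T
    return np.linalg.solve(L, Dinv @ B)      # L^{-1} D^{-1} B
```

Applying this map before drawing a Q-Q plot against $\mathcal{N}(0,1)$ is precisely the kind of correction illustrated in Figure 3.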
Coming back to our first one-dimensional example, Figure 3 represents Q-Q plots of the distribution of standardized LOO residuals (left panel) versus "transformed" LOO residuals (right panel). One may notice in particular, going from standardized to transformed LOO residuals, how the second point (corresponding to the second smallest theoretical quantile) switches from below the diagonal to above it.
Fig. 3 On the effect of accounting and correcting for correlation in Q-Q plots based on LOO residuals. Left panel: Q-Q plot against $\mathcal{N}(0,1)$ of LOO residuals merely divided by corresponding LOO standard deviations. Right panel: Q-Q plot against $\mathcal{N}(0,1)$ of duly transformed LOO residuals.
![Fig. 3 On the effect of accounting and correcting for correlation in Q-Q plots based on LOO residuals. Left panel: Q-Q plot against N(0,1) of LOO residuals merely divided by corresponding LOO standard deviations. Right panel: Q-Q plot against N(0,1) of duly transformed LOO residuals.](/cms/asset/986eb01b-1a2e-47eb-b4cf-3d2cdbbc802d/ucgs_a_2353633_f0003_b.jpg)
We now turn to the topic of covariance parameter estimation relying on cross-validation residuals. As we will see below, incorporating covariances between cross-validation residuals within such approaches leads to interesting developments.
4.2 Some Implications in Scale Parameter Estimation
Assuming a kernel of the form $k = \sigma^2 r$ with unknown $\sigma^2 > 0$ and $\boldsymbol{\Lambda} = \mathbf{0}$ (noiseless observations being for instance commonly encountered in the realm of computer experiments), the problem of estimating $\sigma^2$ by maximum likelihood versus cross-validation has been considered in earlier works, leading notably to results concerning the superiority of MLE in the well-specified case but also to situations when CV turned out to perform better than MLE in the misspecified case Bachoc (Citation2013). Denoting by R the correlation matrix $(r(\mathbf{x}_i,\mathbf{x}_j))_{1\le i,j\le n}$ associated with the observation points, assumed full-rank, it is well-known (see for instance Santner, Williams, and Notz Citation2003) that the MLE of $\sigma^2$ can be written in closed form as
(37) $\widehat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\,Z^\top R^{-1} Z.$
In contrast, the leave-one-out-based estimator of $\sigma^2$ investigated in Bachoc (Citation2013) reads
(38) $\widehat{\sigma}^2_{\mathrm{CV}} = \frac{1}{n}\,Z^\top R^{-1}\,\mathrm{diag}\left(R^{-1}\right)^{-1}R^{-1} Z,$
and originates from the idea, traced back by Bachoc (Citation2013) to Cressie (Citation1993), that the criterion (with notation $C_{\mathrm{LOO}}$ inherited from Bachoc Citation2013) defined by
(39) $C_{\mathrm{LOO}}\left(\sigma^2\right) = \frac{1}{n}\sum_{i=1}^{n}\frac{\left(Z_i - \widehat{Z}_{i,-i}\right)^2}{\sigma^2\,\widetilde{s}_i^2}$
should take a value close to one, where $\widetilde{s}_i^2 = 1/(R^{-1})_{ii}$ denotes the LOO prediction variance at $\mathbf{x}_i$ calculated with unit scale, leading to $C_{\mathrm{LOO}}(\widehat{\sigma}^2_{\mathrm{CV}}) = 1$ and ultimately to (38). This estimator turns out to be unbiased as $\mathrm{E}\left[\widehat{\sigma}^2_{\mathrm{CV}}\right] = \sigma^2$, yet with greater variance (see Bachoc Citation2013):
(40) $\mathrm{Var}\left[\widehat{\sigma}^2_{\mathrm{CV}}\right] = \frac{2\sigma^4}{n^2}\,\mathrm{tr}\left[\left(R^{-1}\,\mathrm{diag}\left(R^{-1}\right)^{-1}\right)^2\right] \;\ge\; \frac{2\sigma^4}{n} = \mathrm{Var}\left[\widehat{\sigma}^2_{\mathrm{MLE}}\right].$
In light of former considerations on the joint distribution of cross-validation residuals, it appears natural to revise $C_{\mathrm{LOO}}$ in order to correct for covariances between LOO residuals. Writing $E$ for the vector of LOO residuals and $D = \operatorname{diag}(R^{-1})$, so that $E$'s covariance matrix at unit scale is $D^{-1} R^{-1} D^{-1}$, this results in
(41) $\widetilde{C}_{\mathrm{LOO}}(\sigma^2) = \frac{1}{n \sigma^2}\, E^{\top} \big( D^{-1} R^{-1} D^{-1} \big)^{-1} E = \frac{1}{n \sigma^2}\, Z^{\top} R^{-1} Z$
so that setting this modified criterion to 1 would plainly result in $\hat{\sigma}^2_{\mathrm{ML}}$. Hence, the advantages found for $\hat{\sigma}^2_{\mathrm{CV}}$ in Bachoc (Citation2013) in the case of model misspecification come at the price of neglecting covariances, and hence possible redundancies, between leave-one-out residuals. Yet attempting to address this issue by naturally accounting for those covariances within the criterion leads back to the MLE, known to enjoy optimality properties in the well-specified case of the considered settings but to be potentially suboptimal otherwise. We will see next that this is not the only instance where modifying LOO in order to account for residual covariances leads back to MLE.
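The covariance-correction argument above can be probed numerically. Below is a small illustrative sketch in Python/NumPy (the paper's accompanying implementation is in R, and all variable names here are ours), comparing the MLE (37), the LOO-based estimator (38) obtained via fast LOO formulas, and the covariance-corrected criterion (41) set to one.

```python
import numpy as np

# Zero-mean simple kriging with kernel sigma^2 * r and noiseless data:
# check that correcting the LOO-based scale criterion for residual
# covariances recovers the MLE of sigma^2.
rng = np.random.default_rng(0)
n = 40
x = np.sort(rng.uniform(0, 1, n))
R = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.3)  # exponential correlation
z = np.linalg.cholesky(2.0 * R) @ rng.standard_normal(n)  # true sigma^2 = 2

Rinv = np.linalg.inv(R)
d = np.diag(Rinv)
e = (Rinv @ z) / d        # fast LOO residuals (Dubrule-type formulas)
c = 1.0 / d               # unit-scale LOO prediction variances

sigma2_mle = z @ Rinv @ z / n        # MLE, cf. (37)
sigma2_loo = np.mean(e**2 / c)       # LOO-based estimator, cf. (38)

# Covariance of the LOO residual vector at unit scale is D^{-1} R^{-1} D^{-1}
# with D = diag(R^{-1}); plugging it into the criterion and setting the
# latter to one yields the scale estimate below.
D = np.diag(d)
cov_e = np.linalg.inv(D) @ Rinv @ np.linalg.inv(D)
sigma2_corr = e @ np.linalg.solve(cov_e, e) / n

# Accounting for LOO residual covariances leads back to the MLE:
assert np.allclose(sigma2_corr, sigma2_mle)
```

The final assertion holds for any dataset, since $E^{\top}(D^{-1}R^{-1}D^{-1})^{-1}E$ simplifies algebraically to $Z^{\top}R^{-1}Z$.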
4.3 On the Estimation of Further Covariance Parameters
Cross-validation has also been used to estimate covariance parameters beyond the scale parameter $\sigma^2$. We will now denote by θ those additional parameters (and by $\psi = (\sigma^2, \theta)$ the concatenation of all covariance parameters) and review some approaches that have been used to estimate them via cross-validation. An important point compared to the last section on $\sigma^2$'s estimation is that kriging predictors and CV residuals generally depend on θ, so that we will now stress it by writing $\hat{Z}_{i,-i}(\theta)$. Still in noiseless settings, the approach followed by Santner, Williams, and Notz (Citation2003) and Bachoc (Citation2013) to estimate θ based on leave-one-out cross-validation residuals is to minimize
(42) $\sum_{i=1}^{n} \big( Z_i - \hat{Z}_{i,-i}(\theta) \big)^2$
Following Bachoc (Citation2013), the latter criterion can in turn be expressed in closed form as
(43) $\sum_{i=1}^{n} \big( Z_i - \hat{Z}_{i,-i}(\theta) \big)^2 = Z^{\top} R_{\theta}^{-1} \operatorname{diag}\big(R_{\theta}^{-1}\big)^{-2} R_{\theta}^{-1} Z$
where $R_{\theta}$ stands for Z's correlation matrix under correlation parameter θ. Based on our results, a natural extension of this criterion to multiple-fold cross-validation can be similarly obtained. Considering q-fold settings such as in the final part of Corollary 1, and denoting by $E_j(\theta)$ the vector of cross-validation residuals associated with the jth fold, the criterion becomes indeed
(44) $\sum_{j=1}^{q} \big\| E_j(\theta) \big\|^2$
Then, building up upon our main theorem, we obtain that
(45) $\sum_{j=1}^{q} \big\| E_j(\theta) \big\|^2 = \sum_{j=1}^{q} \big(R_{\theta}^{-1} Z\big)_j^{\top} \big( [R_{\theta}^{-1}]_{jj} \big)^{-2} \big(R_{\theta}^{-1} Z\big)_j$
where $(R_{\theta}^{-1} Z)_j$ and $[R_{\theta}^{-1}]_{jj}$ denote the subvector of $R_{\theta}^{-1} Z$ and the diagonal block of $R_{\theta}^{-1}$ associated with the jth fold, generalizing indeed (43). Note that applying a sphering transformation to the CV residuals prior to taking norms in (42) and (44) would lead to a criterion that boils down to $Z^{\top} R_{\theta}^{-1} Z$ and hence appears as one of the two building blocks of the log-likelihood criterion.
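As a sanity check of the closed form above, the following Python/NumPy sketch (the paper's implementation is in R; all names here are ours) compares the brute-force sum of squared LOO residuals with the single-inversion expression (43) in a zero-mean simple kriging setting.

```python
import numpy as np

# The sum of squared LOO residuals of a zero-mean simple kriging model
# admits the closed form z^T R^{-1} diag(R^{-1})^{-2} R^{-1} z, requiring
# a single inversion of R instead of n leave-one-out refits.
rng = np.random.default_rng(1)
n = 30
x = np.sort(rng.uniform(0, 1, n))
R = np.exp(-((x[:, None] - x[None, :]) / 0.2) ** 2) + 1e-6 * np.eye(n)
z = np.linalg.cholesky(R) @ rng.standard_normal(n)

# Brute force: refit each leave-one-out model explicitly.
e_naive = np.empty(n)
for i in range(n):
    m = np.delete(np.arange(n), i)
    pred = R[i, m] @ np.linalg.solve(R[np.ix_(m, m)], z[m])
    e_naive[i] = z[i] - pred

# Closed form via one inversion of R.
Rinv = np.linalg.inv(R)
crit = z @ Rinv @ np.diag(np.diag(Rinv) ** -2.0) @ Rinv @ z

assert np.allclose(np.sum(e_naive**2), crit)
```

The small diagonal jitter is added purely for numerical stability of the squared-exponential correlation matrix and is used consistently by both routes.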
On a different note, and relaxing now the assumption of noiseless observations, (42) is only one among several possible ways to construct a criterion based on applying loss functions to leave-one-out residuals, as exposed in Rasmussen and Williams (Citation2006). While (42) corresponds to the squared error loss, using instead the “negative log validation density loss” (after the terminology of Rasmussen and Williams (Citation2006), sec. 5.4.2) leads for instance to the “LOO log predictive probability,” also sometimes called pseudo-likelihood:
(46) $L_{\mathrm{LOO}}(\theta) = \sum_{i=1}^{n} \log p_{\theta}\big(z_i \mid z_{-i}\big)$
where $p_{\theta}(z_i \mid z_{-i})$ denotes the conditional density of $Z_i$ at the value $z_i$ knowing that $Z_{-i} = z_{-i}$. Further scoring rules could be used as a basis to define further criteria relying on cross-validation residuals, as illustrated for instance with the CRPS score in Petit et al. (Citation2020) and Petit (Citation2022).
Let us now focus on an extension of $L_{\mathrm{LOO}}$ to multiple-fold cross-validation, and on the exploration of some links between the resulting class of criteria and the log-likelihood. First, $L_{\mathrm{LOO}}$ is straightforwardly adapted into
(47) $L_{\mathrm{MF}}(\theta) = \sum_{j=1}^{q} \log p_{\theta}\big(z_{(j)} \mid z_{(-j)}\big)$
where $z_{(j)}$ denotes the subvector of observations in the jth fold and $z_{(-j)}$ the remaining ones. Reformulating the sum of log terms as the log of a product, we see that
(48) $L_{\mathrm{MF}}(\theta) = \log \prod_{j=1}^{q} p_{\theta}\big(z_{(j)} \mid z_{(-j)}\big)$
and so it appears that this approach amounts indeed, in some sense, to improperly assuming independence between the cross-validation residuals of the different folds.
We now examine in more detail (still in the settings of the final part of Corollary 1) when such cross-validation residuals are independent, and show that in such a case the multiple-fold pseudo-likelihood $L_{\mathrm{MF}}$ coincides with the log-likelihood criterion based on Z.
Corollary 3.
The following propositions are equivalent:
a) the cross-validation residuals $E_1, \dots, E_q$ are mutually independent,
b) the off-diagonal blocks $[K^{-1}]_{jj'}$ ($j \neq j'$) of the inverse covariance matrix are null,
c) the subvectors $Z_{(1)}, \dots, Z_{(q)}$ associated with the folds are mutually independent.
If they hold, then $L_{\mathrm{MF}}(\theta)$ coincides with the log-likelihood of z.
Proof.
As we know from Corollary 1 that the cross-validation residuals $E_1, \dots, E_q$ form a Gaussian vector, they are mutually independent if and only if their cross-covariance matrices are null. Yet we also know by (34) from the same theorem that these cross-covariances are given, up to invertible diagonal-block factors, by the corresponding off-diagonal blocks of $K^{-1}$. It follows that a) is equivalent to $[K^{-1}]_{jj'} = 0$ for $j \neq j'$ ($j, j' \in \{1, \dots, q\}$), which coincides in turn with b). Looking finally at b) from the perspective of $K^{-1}$ (and hence K) being block-diagonal, we obtain similarly that b) is equivalent to c). If these propositions hold, then it follows from (25) that $p_{\theta}(z_{(j)} \mid z_{(-j)}) = p_{\theta}(z_{(j)})$ for $j = 1, \dots, q$, and thus the multiple-fold pseudo-likelihood reduces to $\sum_{j=1}^{q} \log p_{\theta}(z_{(j)}) = \log p_{\theta}(z)$, the log-likelihood of z. Also, one could have derived it from (48). □
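Corollary 3 can be illustrated numerically. The following Python/NumPy sketch (illustrative only; the paper's implementation is in R, and the helper names below are ours) computes the multiple-fold pseudo-likelihood via generic Gaussian conditioning and checks that it matches the full log-likelihood when the covariance matrix is block-diagonal with respect to the folds.

```python
import numpy as np

# When K is block-diagonal w.r.t. the folds, the sum of conditional
# log-densities of each fold given the others (criterion (47)) coincides
# with the full log-likelihood.

def gauss_logpdf(v, C):
    """Log-density of a centered Gaussian N(0, C) evaluated at v."""
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (len(v) * np.log(2 * np.pi) + logdet + v @ np.linalg.solve(C, v))

def fold_cond_logpdf(K, z, fold):
    """log p(z_fold | z_rest) for z ~ N(0, K), via Gaussian conditioning."""
    rest = np.setdiff1d(np.arange(len(z)), fold)
    S = np.linalg.solve(K[np.ix_(rest, rest)], K[np.ix_(rest, fold)]).T
    mu = S @ z[rest]                                       # conditional mean
    C = K[np.ix_(fold, fold)] - S @ K[np.ix_(rest, fold)]  # Schur complement
    return gauss_logpdf(z[fold] - mu, C)

rng = np.random.default_rng(2)
B1 = np.array([[2.0, 0.5], [0.5, 1.0]])
B2 = np.array([[1.5, -0.3], [-0.3, 1.2]])
K = np.block([[B1, np.zeros((2, 2))], [np.zeros((2, 2)), B2]])
z = np.linalg.cholesky(K) @ rng.standard_normal(4)
folds = [np.array([0, 1]), np.array([2, 3])]

pseudo = sum(fold_cond_logpdf(K, z, f) for f in folds)  # pseudo-likelihood
loglik = gauss_logpdf(z, K)                             # full log-likelihood
assert np.allclose(pseudo, loglik)  # holds since K is fold-block-diagonal
```

With non-null off-diagonal blocks in K, the two quantities generally differ, in line with the discussion following (48).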
5 Computational Aspects
5.1 Accuracy and Speed-Up Checks for Fast Formulas
We have seen in previous sections that block matrix inversion enabled us to reformulate the equations of (multiple-fold) cross-validation residuals and associated covariances in compact form, from simple kriging to universal kriging settings. Here and in the supplementary material, we briefly illustrate that our Schur-complement-based formulae do indeed reproduce results obtained by a straightforward (“naive”) computational approach, also keeping an eye on computational speed-ups and by-products. We focus below on how efficiency gains evolve depending on the number of folds. The next section focuses in more detail on the respective computational complexities of the two approaches. The settings are analogous to those presented in Section A.2. Accuracy and speed-up checks for LOO are presented in Section A.4 (also in supplementary material).
In the following experimental results and theoretical considerations, we focus on the particular case of homogeneous fold sizes and denote by r their common number of elements, so that $n = qr$. For convenience, we consider in experiments a design of size n = 1024, and let the number of folds decrease from q = 1024 to q = 2 by successive halvings. This amounts to folds of sizes r ranging from 1 to 512, correspondingly. For each value of r, 50 random replications of the folds are considered; this is done by permuting the indices at random prior to arranging them in regularly constructed folds of r successive elements. The resulting speed-up distributions, represented as a function of r, have medians ranging from 1027.82 to 1.50 (means from 1429.63 to 1.28, all truncated after the second digit); we hence observed a speed-up whatever the value of r, yet with a substantial decrease as r increases compared with the LOO situation, as could be expected.
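To make the two compared approaches concrete, here is a minimal Python/NumPy sketch (the benchmarked implementation of the paper is in R, and all names here are ours) of naive versus fast multiple-fold CV residuals in a zero-mean simple kriging setting, relying on the Schur-complement-based identity for fold residuals.

```python
import numpy as np

# Naive vs fast multiple-fold CV residuals with q folds of common size r.
rng = np.random.default_rng(3)
n, q = 32, 8
r = n // q
x = np.sort(rng.uniform(0, 1, n))
K = 2.0 * np.exp(-np.abs(x[:, None] - x[None, :]) / 0.4)
z = np.linalg.cholesky(K) @ rng.standard_normal(n)
folds = np.arange(n).reshape(q, r)

# Naive route: one (n - r) x (n - r) solve per fold.
e_naive = np.empty(n)
for fold in folds:
    rest = np.setdiff1d(np.arange(n), fold)
    pred = K[np.ix_(fold, rest)] @ np.linalg.solve(K[np.ix_(rest, rest)], z[rest])
    e_naive[fold] = z[fold] - pred

# Fast route: one n x n inversion, then one r x r solve per fold, via the
# Schur-complement identity E_j = [K^{-1}]_{jj}^{-1} (K^{-1} z)_j.
Kinv = np.linalg.inv(K)
Kinv_z = Kinv @ z
e_fast = np.empty(n)
for fold in folds:
    e_fast[fold] = np.linalg.solve(Kinv[np.ix_(fold, fold)], Kinv_z[fold])

assert np.allclose(e_naive, e_fast)
```

In a benchmark, the fast route additionally amortizes the single n × n inversion (or a pre-computed Cholesky factor) across all folds and fold replications.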
Fig. 4 Relative errors on CV means and covariances (between naive and fast methods, with the naive method as reference) measured in similar settings as on Figure 3.
![Fig. 4 Relative errors on CV means and covariances (between naive and fast methods, with the naive method as reference) measured in similar settings as on Figure 3.](/cms/asset/487641f9-a424-4355-acaf-5ff035ca9afa/ucgs_a_2353633_f0004_c.jpg)
Additionally, relative errors on cross-validation residuals and associated covariances calculated using the fast versus naive approach are represented in Figure 4, also as a function of r. It is remarkable that, while the relative error on the calculation of residuals increases only moderately with r, the variation is more pronounced in the case of the covariances. Further analysis may shed light on the underlying phenomena, yet with maximum relative errors of such small magnitude, the observed differences remain of a negligible extent.
5.2 More on Complexity and Efficiency Gains
We now turn to a more specific examination on how our closed form formulas affect computational speed compared to a naive implementation of multiple fold cross-validation. As we will now develop, a first order reasoning on computational costs associated with naive versus closed multiple-fold cross-validation highlights the existence of two regimes depending on q: for large q it is faster to use close form formula yet for very small q one should better employ the naive approach. The rationale is as follows; in a naive implementation, the main computational cost is on inverting q covariance sub-matrices, all belonging to
. Denoting this (approximate) cost by
and assuming a cubic cost for matrix inversion (with a multiplicative constant
), we hence obtain that
(49)
(49)
On the other hand, the closed form approach requires one inversion of a n × n matrix and q inversions of r × r matrices, leading to an approximate cost
(50)
(50)
We thus obtain an approximate speed-up of $C_{\mathrm{naive}} / C_{\mathrm{fast}} = q (1 - 1/q)^3 / (1 + q^{-2})$ (using $r = n/q$), implying that the closed-form approach would be advantageous when $q (1 - 1/q)^3 > 1 + q^{-2}$. As q is a positive integer, the latter condition can be proved to be equivalent to $q \geq 4$, as the polynomial $q^3 - 4 q^2 + 3 q - 2$ possesses a unique real root between 3 and 4, takes negative values for $q \leq 3$, and tends to $+\infty$ when $q \to +\infty$. In our implementation, however, the cost of the fast approach turns out to be smaller than $\kappa (n^3 + q r^3)$ since the Cholesky factor is already pre-calculated. Modeling this with a damping factor of $\alpha \in (0, 1]$ in front of the $n^3$ term, we end up with an approximate cost of $\kappa (\alpha n^3 + q r^3)$, leading via analogous calculations to an acceleration whenever $q^3 - (3 + \alpha) q^2 + 3 q - 2 > 0$. Again, the polynomial involved possesses a single real-valued root, with a value shrinking toward 2 as α tends to 0.
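The first-order cost model above is easy to probe numerically; the short Python sketch below (function name ours) encodes the speed-up ratio for α = 1 and checks the q ≥ 4 threshold.

```python
# First-order cost model: naive cost ~ q (n - r)^3 and fast cost
# ~ n^3 + q r^3 with r = n / q give a speed-up ratio of
# q (1 - 1/q)^3 / (1 + q^{-2}), exceeding one exactly when
# q^3 - 4 q^2 + 3 q - 2 > 0, i.e. from q = 4 on.
def speedup(q):
    return q * (1 - 1 / q) ** 3 / (1 + q ** -2)

assert all(speedup(q) > 1 for q in range(4, 100))   # fast formulas win
assert all(speedup(q) < 1 for q in (2, 3))          # naive approach wins
```

In the near-LOO regime (q close to n), the predicted speed-up grows roughly linearly in q, consistently with the large accelerations observed experimentally for small fold sizes.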
In practice, speed-ups are already observed here from q = 2 on, which may be attributed to various causes, notably implementation specifics and the particular settings used in the numerical experiments. Also, let us stress again that the reasoning above is to be taken as an approximation, as it does not account for matrix-vector multiplication costs, storage and communication overheads, and further auxiliary operations that could influence actual running times.
On a different note, let us point out that the computational benefits highlighted above in the case of multiple-fold cross-validation with a unique partitioning do carry over, and even further grow, in the case of replications with varying partitions. Sticking for simplicity to the previously considered settings (α = 1) but replicating the partitions, say N times, we find that a naive approach has a cost driven by
(51) $C_{\mathrm{naive}}^{(N)} \approx N \kappa\, q\, (n - r)^3$
while a similar approach using the fast formulas (with the n × n inversion performed only once) would have a cost driven by
(52) $C_{\mathrm{fast}}^{(N)} \approx \kappa\, \big(n^3 + N q r^3\big)$
Then we obtain that $C_{\mathrm{fast}}^{(N)} < C_{\mathrm{naive}}^{(N)}$ as soon as $N (q - 1)^3 > q^2 + N$, occurring already from q = 3 (for any $N \geq 2$).
6 Application Test Case
We now illustrate and investigate some applications of multiple-fold cross-validation on a test case motivated by a contaminant source localization problem from hydrogeology (Pirot et al. Citation2019). There, the goal is to minimize an objective function quantifying the discrepancy between given contaminant concentrations at monitoring wells and analogous quantities obtained under varying scenarios regarding the potential source underlying this contamination. Solving this difficult optimization problem in order to localize the source is performed in Pirot et al. (Citation2019) via Bayesian optimization, and Gaussian process models are thus used to surrogate the objective function. An instance of such an objective function is presented in Figure 6. Our focus is on fitting a GP model to this function under a clustered experimental design, and on highlighting notable differences arising here between leave-one-out and multiple-fold cross-validation using the cluster structure.
Fig. 6 Contaminant localization test function designed by summing dissimilarities between given concentrations at monitoring wells and corresponding simulation results when varying the candidate source localization (See Pirot et al. (Citation2019) for more detail). The filled disks stand for the experimental design, consisting of 25 clover-shaped clusters of size 5. The small green circle stands for the actual (prescribed) contaminant source, and the red square indicates the grid point with the smallest misfit.
![Fig. 6 Contaminant localization test function designed by summing dissimilarities between given concentrations at monitoring wells and corresponding simulation results when varying the candidate source localization (See Pirot et al. (Citation2019) for more detail). The filled disks stand for the experimental design, consisting of 25 clover-shaped clusters of size 5. The small green circle stands for the actual (prescribed) contaminant source, and the red square indicates the grid point with the smallest misfit.](/cms/asset/6dc27b79-4f9d-4659-912b-12b4abe0c4dd/ucgs_a_2353633_f0006_c.jpg)
We wish to illustrate some differences between CV in the LOO versus in “leave-one-cluster-out” settings (i.e., 25-fold CV with folds given by the clusters). Since we can evaluate the actual function on a fine grid, we can easily calculate the absolute prediction errors on the considered 2601-element grid and compare them to absolute cross-validation residuals. It turns out that the mean of the 125 absolute multiple-fold CV residuals exceeds the mean absolute prediction error on the grid, while the mean of the 125 absolute LOO residuals falls substantially below it. Thus, while absolute multiple-fold CV residuals appear to give a rather over-pessimistic picture of the prediction situation (as CV predictions are made only at points in left-out clusters), analogous LOO residuals give a very over-optimistic one (as LOO predictions are made only at left-out points within clusters including four points that are not left out).
LOO (upper panel) versus MFCV (middle panel) residual norms are represented as a function of the range parameter and compared to the true reconstruction error (lower panel). This example illustrates that the design of folds may crucially affect results. In the present case, it can be observed that, compared with the curve associated with LOO residuals, the curve associated with MFCV residuals is more similar to the actual (out-of-sample) reconstruction error. In contrast, the top curve associated with LOO features a flatter region near the optimum, resulting in turn in an optimal range associated with a larger reconstruction error. Replications of the example with further realizations of the clustered designs delivered quite variable results (overall, neither the considered MFCV nor LOO stood out as evidently preferable to the other). Yet some recurring features emerged, in particular the curves associated with LOO appearing more prone to decreasing to a plateau, quite similarly as observed here. This calls for more research on the impact of fold design on resulting indicators and estimators.
7 Discussion
Fast leave-one-out formulas for Gaussian process prediction that have been used for model diagnostics and hyperparameter fitting do carry over well to multiple-fold cross-validation, as our theoretical and numerical results in both simple and universal kriging settings suggest. The resulting formulas were found to allow substantial speed-ups, yet with the most favorable scenario in the leave-one-out case and decreasing advantages in cases with fewer folds of larger observation subsets. A first-order analysis in terms of involved matrix inversion complexities corroborated the main trends observed in numerical experiments, although some room remains to analyze actual computation times more precisely depending on implementation specifics. In addition, the established formulas enabled a closed-form calculation of the covariance structure of cross-validation residuals, and eventually made it possible to correct for an improper assumption of independence silently underlying Q-Q plots with standardized LOO residuals.
The established formulas also lend themselves well to generalizing existing covariance hyperparameter estimation approaches that are based on cross-validation residuals. Looking first at scale parameter estimation in the LOO case, we found that modifying the criterion underlying the scale estimation approach used in Santner, Williams, and Notz (Citation2003) and Bachoc (Citation2013), so as to correct for covariances between LOO residuals, led back to the maximum likelihood estimator of the scale parameter. On a different note, and with more general covariance hyperparameters in view, maximizing the pseudo-likelihood criterion mentioned in Rasmussen and Williams (Citation2006) was found, as established in Corollary 3, to coincide with MLE in the case of independent cross-validation residuals.
It is interesting to note, as a potential starting point for future work on cross-validation-based parameter estimation, that going beyond the independence assumption of Corollary 3 and accounting for the calculated residual covariances results in fact in a departure from basic maximum likelihood estimation. Replacing the product in (48) by the joint density of cross-validation residuals, we end up indeed with
(53)
and we see that maximizing the resulting criterion generally departs from MLE, as Q and B are functions of ψ, yet coincides with it when the equivalent conditions of Corollary 3 hold. While studying this new criterion further is beyond the scope of the present article, we believe that using the full distribution of cross-validation residuals could be helpful in designing novel criteria and procedures for covariance parameter estimation and more. Beyond this, efficient adjoint computation of gradients could be useful to speed up covariance hyperparameter estimation procedures such as those studied here, but also variations thereof relying on further scoring rules (Petit et al. Citation2020), not only in LOO settings but also potentially in the case of multiple-fold cross-validation.
It is important to stress that while the presented results do not come with procedures for the design of folds, they might be of interest for creating objective functions guiding their choice. Looking at the covariance matrix between cross-validation residuals might be a relevant ingredient of such procedures. Of course, designing the folds is not completely unrelated to choosing the design points themselves, and we expect interesting challenges to arise at this interface. Also, stochastic optimization under random folds is yet another approach of interest, which could be eased thanks to fast computation of cross-validation residuals. Last but not least, as fast multiple-fold cross-validation carries over well to linear regression and GP modeling shares foundations with several other paradigms, it would be desirable to generalize the presented results to broader model classes (see for instance Rabinowicz and Rosset (Citation2020b), where mixed models are tackled).
Acknowledgments
Part of DG’s contributions have taken place within the Swiss National Science Foundation project number 178858, and he would like to thank Sam Allen, Athénaïs Gautier, Fabian Guignard, Philip Stange, Cédric Travelletti, Riccardo Turin as well as many attendees of events where this work was presented for stimulating scientific exchanges. Yves Deville should be warmly thanked too for insightful comments on an earlier version of this article. Warm thanks as well to the two referees and the members of the editorial team, whose comments have enabled improving the paper substantially. Calculations were performed on UBELIX, the High Performance Computing (HPC) cluster of the University of Bern. DG would also like to acknowledge support of Idiap Research Institute, his primary affiliation during earlier stages of this work.
Supplementary Materials
In the supplement, we provide extra detail about block inversion via Schur complements, as well as about the basics of universal kriging. Also, we give detail on the opening 1-dimensional example and present additional numerical experiments and results. The latter pertain on the one hand to the effect on cross-validation diagnostics of a change of design (when points are “doubled” by adding close neighbors and thus creating clusters) and, on the other hand, to LOO and MFCV speed-ups and relative errors depending on the size of considered designs.
Disclosure Statement
The authors report there are no competing interests to declare.
References
- An, S., Liu, W., and Venkatesh, S. (2007), “Fast Cross-validation Algorithms for Least Squares Support Vector Machine and Kernel Ridge Regression,” Pattern Recognition, 40, 2154–2162. DOI: 10.1016/j.patcog.2006.12.015.
- Arlot, S., and Celisse, A. (2010), “A Survey of Cross-Validation Procedures for Model Selection,” Statistical Surveys, 4, 40–79.
- Bachoc, F. (2013), “Cross Validation and Maximum Likelihood Estimation of Hyper-Parameters of Gaussian Processes with Model Misspecification,” Computational Statistics and Data Analysis, 66, 55–69. DOI: 10.1016/j.csda.2013.03.016.
- Bates, S., Hastie, T., and Tibshirani, R. (2023), “Cross-validation: What Does It Estimate and How Well Does It Do It?” Journal of the American Statistical Association. DOI: 10.1080/01621459.2023.2197686.
- Cressie, N. A. C. (1993), Statistics for Spatial Data, London: Wiley.
- Currin, C., Mitchell, T., Morris, M., and Ylvisaker, D. (1988), “A Bayesian Approach to the Design and Analysis of Computer Experiments,” Technical Report 6498, ORNL.
- Dubrule, O. (1983), “Cross Validation of Kriging in a Unique Neighborhood,” Journal of the International Association for Mathematical Geology, 15, 687–699. DOI: 10.1007/BF01033232.
- Fischer, B. (2016), “Model Selection for Gaussian Process Regression by Approximation Set Coding,” Master’s thesis, ETH Zürich.
- Golub, G. H., Heath, M., and Wahba, G. (1979), “Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter,” Technometrics, 21, 215–223. DOI: 10.1080/00401706.1979.10489751.
- Handcock, M. S., and Stein, M. L. (1993), “A Bayesian Analysis of Kriging,” Technometrics, 35, 403–410. DOI: 10.1080/00401706.1993.10485354.
- Helbert, C., Dupuy, D., and Carraro, L. (2009), “Assessment of Uncertainty in Computer Experiments from Universal to Bayesian Kriging,” Applied Stochastic Models in Business and Industry, 25, 99–113. DOI: 10.1002/asmb.743.
- Kaminsky, A. L., Wang, Y., Pant, K., Hashii, W. N., and Atachbarian, A. (2021), “An Efficient Batch k-fold Cross-Validation Voronoi Adaptive Sampling Technique for Global Surrogate Modeling,” Journal of Mechanical Design, 143, 011706. DOI: 10.1115/1.4047155.
- Le Gratiet, L., Cannamela, C., and Iooss, B. (2015), “Cokriging-based Sequential Design Strategies Using Fast Cross-Validation for Multi-Fidelity Computer Codes,” Technometrics, 57, 418–427. DOI: 10.1080/00401706.2014.928233.
- Magnusson, M., Andersen, M. R., Jonasson, J., and Vehtari, A. (2020), “Leave-One-Out Cross-Validation for Bayesian Model Comparison in Large Data,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS).
- Martino, L., Laparra, V., and Camps-Valls, G. (2017), “Probabilistic Cross-Validation Estimators for Gaussian Process Regression,” in 2017 25th European Signal Processing Conference (EUSIPCO), pp. 823–827. DOI: 10.23919/EUSIPCO.2017.8081322.
- Omre, H. (1987), “Bayesian Kriging–Merging Observations and Qualified Guesses in Kriging,” Mathematical Geology, 19, 25–39. DOI: 10.1007/BF01275432.
- Petit, S. (2022), “Improved Gaussian Process Modeling: Application to Bayesian Optimization,” PhD thesis, Université Paris-Saclay.
- Petit, S. J., Bect, J., Da Veiga, S., Feliot, P., and Vazquez, E. (2020), “Towards New Cross-Validation-based Estimators for Gaussian Process Regression: Efficient Adjoint Computation of Gradients,” in 52èmes Journées de Statistique (JdS 2020), pp. 633–638. https://arxiv.org/pdf/2002.11543
- Pirot, G., Krityakierne, T., Ginsbourger, D., and Renard, P. (2019), “Contaminant Source Localization via Bayesian Global Optimization,” Hydrology and Earth System Sciences, 23, 351–369. DOI: 10.5194/hess-23-351-2019.
- Rabinowicz, A., and Rosset, S. (2020a), “Assessing Prediction Error at Interpolation and Extrapolation Points,” Electronic Journal of Statistics, 14, 272–301. DOI: 10.1214/19-EJS1666.
- ——— (2020b), “Cross-Validation for Correlated Data,” Journal of the American Statistical Association, 117, 718–731. DOI: 10.1080/01621459.2020.1801451.
- Rasmussen, C. E., and Williams, C. K. I. (2006), Gaussian Processes for Machine Learning, Cambridge, MA: The MIT Press.
- Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J. J., Schröder, B., Thuiller, W., Warton, D. I., Wintle, B. A., Hartig, F., and Dormann, C. F. (2017), “Cross-Validation Strategies for Data with Temporal, Spatial, Hierarchical, or Phylogenetic Structure,” Ecography, 40, 913–929. DOI: 10.1111/ecog.02881.
- Roustant, O., Ginsbourger, D., and Deville, Y. (2012), “DiceKriging, DiceOptim: Two R Packages for the Analysis of Computer Experiments by Kriging-based Metamodelling and Optimization,” Journal of Statistical Software, 51, 1–55. DOI: 10.18637/jss.v051.i01.
- Santner, T. J., Williams, B. J., and Notz, W. (2003), The Design and Analysis of Computer Experiments, New York: Springer.
- Wei, T., and Simko, V. (2021), R package ’corrplot’: Visualization of a Correlation Matrix, available at https://github.com/taiyun/corrplot.
- Zhang, P. (1993), “Model Selection via Multifold Cross Validation,” Annals of Statistics, 21, 299–313.
- Zhang, Y., and Yang, Y. (2015), “Cross-Validation for Selecting a Model Selection Procedure,” Journal of Econometrics, 187, 95–112. DOI: 10.1016/j.jeconom.2015.02.006.