Interpreting uninterpretable predictors: kernel methods, Shtarkov solutions, and random forests

Pages 10-28 | Received 18 Mar 2020, Accepted 08 Aug 2021, Published online: 08 Sep 2021

Abstract

Many of the best predictors for complex problems are typically regarded as hard to interpret physically. These include kernel methods, Shtarkov solutions, and random forests. We show that, despite the inability to interpret these three predictors to infinite precision, they can be asymptotically approximated and admit conceptual interpretations in terms of their mathematical/statistical properties. The resulting expressions can be in terms of polynomials, basis elements, or other functions that an analyst may regard as interpretable.

1. Introduction

Fundamentally, point prediction is an input–output relation. Given a pair of related sequences $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$, where each $y_i$ is an outcome of some $Y_i$, the predicted value for $y_i$ is $\hat{y}_i = F_i(x_i)$, in which $F_i$ is typically chosen using the earlier $x_i$'s and $y_i$'s, i.e., $x_{i-1}, \ldots, x_1$ and $y_{i-1}, \ldots, y_1$. An extra set of burn-in data may be used to choose $F_i$ and there may be added complexities from side information. We may put many sorts of desiderata on the $F_i$'s – low predictive error, simplicity, even insisting each $F_i \in \mathcal{F}_i$ for some set of functions $\mathcal{F}_i$. However, the point predictor $F_i$ is merely a function – a way to convert an input $x$ to an output $y$.

Interval predictors are somewhat more complex: they give a prediction interval, say $I_i$, with a pre-assigned probability that the event $\{Y_i \in I_i\}$ will occur. Thus they have the same input but a different output. Regardless of any further desiderata we might impose, such as optimality criteria, interval predictors (or their generalization to regional predictors) remain input–output relations. This may be regarded as an example of conformal prediction, see Vovk et al. (2005), as we discuss later in Section 5.

By contrast, modelling is a conceptually different process. In principle, a statistical modeler proposes a model, say $Y = F(x) + \epsilon$ for simplicity, to be true and has a collection of terms, say $t_j(x)$ for $j = 1, \ldots, m$, that may be part of an additive model. The modeler uses the data to choose terms and the end result is a model something like $Y(x) = \sum_{j=1}^{q} \hat{t}_j(x) + \epsilon$, where the $\hat{t}_j$'s are the same as the $t_j$'s apart from estimating some coefficients. The modeler then asserts that the model reflects reality by ensuring that each component has a correlate in reality: each $t_j$ means something physical, there is a reason that the terms are added, and it has been verified that the error $\epsilon$ represents intrinsic variability rather than small terms that have been ignored, i.e., bias. In this case, the model gives the point predictor $\hat{Y}(x) = \sum_{j=1}^{q} \hat{t}_j(x)$ and it is assumed that any estimates in $\hat{t}_j$ are sufficiently close to their true values that the model is ‘good’ – not readily falsifiable. Note that even though $\hat{Y}$ is an estimator of $F$, we focus on how well it predicts $Y$.

What are the differences between these two approaches? First, a predictor is just a mathematical construct to match the output from a data generator (DG). It has no greater significance. All that matters about $\hat{Y}$ is how close $\hat{Y}(x)$ is to $Y = y$, i.e., how well using $\hat{Y}$ lets someone predict $Y$. The quality of the prediction is usually measured formally, e.g., by some cumulative error such as the prediction sum of squares
$$\mathrm{PRESS}_n = \sum_{i=1}^{n}\left(y_i - \hat{Y}_{-i}(x_i)\right)^2,$$
where the subscript $-i$ indicates that data point $(x_i, y_i)$ was not used to form $\hat{Y}_{-i}$. However, the point remains: $\hat{Y}$ predicts well, adequately, or poorly according to how we assess predictive performance. While there is nothing more that must necessarily be said, there is much that can be said – how the predictor was found, what its properties are, etc. – and some aspects of these will be discussed for the predictors here.
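
For concreteness, here is a minimal Python sketch of this leave-one-out computation; the linear predictor and the simulated data are illustrative assumptions only, not part of the predictors studied below.

```python
# Minimal sketch: computing PRESS_n by leave-one-out refitting.
# The linear predictor here is illustrative; any learner with
# fit/predict methods could be substituted.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(50, 1))
y = 2.0 * x[:, 0] + rng.normal(scale=0.1, size=50)

press = 0.0
for i in range(len(y)):
    mask = np.arange(len(y)) != i            # drop data point i
    model = LinearRegression().fit(x[mask], y[mask])
    y_hat_i = model.predict(x[i:i + 1])[0]   # predict the held-out point
    press += (y[i] - y_hat_i) ** 2

print(f"PRESS_n = {press:.4f}")
```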

A clear statement of the basic problem studied here can be found in Geisser (1975), who focused on ‘low structure’ data and treated it strictly empirically, i.e., with minimal discussion of abstract constructs such as distributions. This is analogous to what we call complex data – data about which it is hard to make any strong assumptions. Geisser's (1975) treatment was prescient in that he clearly expressed many of the ideas here in elementary settings such as ANOVA, regression, and posterior means. Our work goes beyond this by examining techniques that were not available in 1975 and we do so from a contemporary conceptual standpoint, not just empirically (although that is clearly no less important).

The notion of ‘interpretability’ used here is the same as that used in Le and Clarke (2020). Briefly, a model $M$ consists of $K$ components, say $M = \{c_1, \ldots, c_K\}$. The $c_k$'s are the components that go into the formulation of a model, such as variables, parameters, and rules for how they are to be combined to produce a model. The model $M$ is interpretable if and only if each $c_k$ has a physical correlate, i.e., they correspond to some identifiable and measurable feature of the DG. We say a model is valid if and only if it is interpretable and correct at least to the degree that its predictions and future outcomes are sufficiently close.

In this setting, the key question in modelling is how well the terms in $\hat{Y}$ encapsulate the components of the DG. That is, what aspect (or $c_k$) of the DG does a specific $t_j$ represent and how accurately? In short, the subcomponents of a model matter because it is hoped they have physical correlates. The model is meant to match reality in that its components can be interpreted physically in the context of the DG. Otherwise put, a model can be falsified by falsifying one of its components. Thus, it makes sense to ask if a model is ‘true’ – it being understood that ‘true’ may only be in a provisional sense. A better model may be found that discredits the earlier model, and science proceeds by sequentially falsifying ever better models, hopefully arriving at a model that is either non-falsifiable or so close to true (in the absolute sense) that it isn't worth the trouble to falsify. In either case, there is the idea of a model being true that has no genuine analog for predictors. The closest analog would be for a predictor to be optimal (within a class) but this is not part of measurable, objective reality.

The main link between predictors and models is that models regularly provide predictors while predictors do not in general lead to models – at least not directly. (Indeed, models that do not make measurable predictions are not valid models as they are not falsifiable.) Thus, we may speak of model-induced predictors and non-model-induced predictors. Loosely, the class of predictors is very much larger than the class of models. So, one way to find good models is to find a good predictor and determine a model that performs almost as well in terms of prediction. That way, there is a physical interpretation even if some predictive power is lost.

Ideally, we want a model that is not falsified (or at least is very hard to falsify) and that gives an extremely good predictor. Sometimes this occurs, but often it does not, especially for complex problems. That is, a really good predictor often outperforms the predictor generated from a provisionally true model and does so by a substantial margin. In earlier work, we gave examples of this; see Le and Clarke (2020). That is, we showed how certain interpretable predictors, linear models in particular, could be modified to give improved predictions. The modifications were chiefly to introduce non-interpretable features into the model-induced predictor. The end result was a predictor that had some interpretable and some non-interpretable components, and gave demonstrably improved prediction over the model-induced predictor. We called the difference between the error of using the partially interpretable predictors and that of the model-induced predictors the cost of modelling. Intuitively, the flexibility from the loss of full interpretability enabled improved prediction.

Here we examine the same problem but from the reverse direction. That is, we start with uninterpretable predictors and find interpretable models that are close to them. The interpretable models do not in general perform as well as the uninterpretable predictors since the latter are the result of optimization. Thus, non-interpretable (but optimal) predictors can lead to approximate models that may be examined to see how they relate to physical components of a DG. This is another way to formalize the cost of modelling, or interpretability, in terms of the loss of predictive accuracy because the approximate model may still say something useful about the DG.

One implication of this reasoning is that the principle of falsification may have to be reconsidered. Falsifiability is the assertion that any conjectured model must be disprovable before it can become accepted. The problem is that if the predictions from essentially every model for a DG can be improved, then essentially every model is flawed and can be discredited. We are not actually uncovering truth. Accordingly, the principle of falsification may itself be ‘false’ in the sense that, since every model is disprovable, disprovability can only be used to discriminate better models from worse ones, not to arrive at a model that can be generally accepted – at least not often enough that it is useful as a foundational philosophical principle.

A further complication is the concept of M-closed, -complete, and -open problems; see Bernardo and Smith (2000) for the original definitions. These are defined by the location of a class of proposed models or predictors relative to the DG. Here we say that an M-closed problem is one in which the DG is exactly one of a finite list of explicit candidate models. The problem is therefore merely selecting the one that matches the data generator. In the Bayesian case, this also means that the prior is well-defined in the sense that the prior probability of a model represents the pre-experimental belief that the model is true.

An M-complete problem is one in which the DG has a true model but it is inaccessible in some sense. For instance, it might be too complicated to formulate. There may not be any closed form expression for it, so it can only be approximated numerically. The true model might be so complicated that it is unrealistic to learn much about it from data that might realistically be gathered. Indeed, the true model may be so complex that no approximation to it is adequate even if a serviceable one can be found under restrictions. The main point is the model exists so that, for instance, expectations and convergences are well defined but any properties of it are problematic and uncertain. In this case, the prior probabilities are not that a model is true but rather that it is close to the true model given the model list. (Interpreting the prior as weights on actions in a decision theory problem is also possible; see Le and Clarke (2016b) for a brief discussion of this.)

The two problem classes contrast sharply with the M-open problem class in which no model for the data generator can be assumed to exist. Hence, expectations and expressions related to the form of a model, e.g., modes of convergence, do not make sense and the meaning of a prior is unclear unless it is taken as a belief that a given predictor, possibly model-based, will perform better than another predictor within the class of predictors under study. In the M-open problem class, modelling makes little sense; we are essentially left only with predictors and their properties. Thus, a model for a DG in the class really means the predictor the model generates because the model has no necessary meaning. A predictor may be examined to learn something about the DG, such as the relevance or irrelevance of a variable, but detailed knowledge in the sense of a complete set of correlates for a DG is unobtainable.

A key point is that the existence of M-open problems undermines the principle of falsifiability. If there is no true model and predictors can only be evaluated in terms of how well they perform on a relative basis, then the principle of falsifiability is irrelevant to many modern complex problems. That is, in many settings, all models are wrong and hence already falsified. They are not useful either except insofar as they give a good predictor that may or may not say anything about the DG. So, falsifiability per se is often merely a distraction from good prediction.

Indeed, many of the most important data sets currently being or recently gathered were not generated by a DG that admits a model, or, more precisely, were not generated by a mechanism that has anything stable or identifiable enough to model effectively. However, as long as there is something to measure we can make a guess as to its next value. Moreover, for these situations, predictors derived from model averages are often found to be better than their individual components, regardless of whether the models in the average are ‘true’ or merely regarded as the predictors they generate. There are other classes of predictors that also do not generally admit physical interpretations yet have clear predictive utility. Here, we study three classes of such predictors.

Our goal is to show that it is possible to provide interpretations for predictors that are generally uninterpretable. Specifically, kernel methods, the Bayes Shtarkov solution, and random forests are predictors for complex (M-complete or M-open) problems that are typically regarded as hard or impossible to interpret. We develop two types of interpretation for each of them. One is an interpretation in the usual sense of finding physical correlates, usually approximately, for components of the predictors. The other is a theoretical characterization of these predictors in terms of concepts to help guide their use.

In Section 5 we summarize our findings by stating what we call the prediction principle, which we propose should be added to the falsification principle. This is separate from and in addition to the celebrated prequential principle, see Dawid (1984, 1992, 2010) and Dawid and Vovk (1999), which we regard as foundational.

The structure of this paper is as follows. In Section 2 we provide an interpretation for kernel methods such as relevance vector machines (RVM's) and support vector machines (SVM's) using the eigenfunctions of kernels, with a consistency result so the interpretations will be valid. In Section 3 we present a Bayes version of the Shtarkov predictor and indicate how to interpret it as a mixing of Beta distributions or in terms of a Pearson distribution. In Section 4 we show that a random forest is asymptotically equivalent to boosting, so a random forest builds an additive logistic regression model. Random forests can also be approximated in a regression sense. Some concluding remarks are made in Section 5. Longer technical proofs are in Appendix 1 and some further discussion of interpretability versus complexity can be found in Appendix 2.

2. RVM's and SVM's for regression

For a regression problem with a training data set $D = D_n = \{(y_i, x_i), i = 1, \ldots, n\}$, where $y$ is the response variable and $x$ is a covariate of dimension $p$, the goal is to find a function $f(x)$ to predict the responses $y$ in a test set. This can be viewed as a regularization problem of the form
$$\min_{f \in \mathcal{H}_K}\left[\frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda\|f\|^2_{\mathcal{H}_K}\right], \tag{1}$$
where $\mathcal{H}_K$ is a reproducing kernel Hilbert space (RKHS) with kernel $K$ and norm $\|\cdot\|_{\mathcal{H}_K}$, $L$ is a loss function, and $\lambda > 0$ is the smoothing parameter. It can be shown that (1) has a solution of the form
$$\hat{f}_\lambda(x) = \sum_{i=1}^{n}\alpha_i K(x_i, x); \tag{2}$$
see Kimeldorf and Wahba (1971). The solution $\hat{f}_\lambda$ is not a conventional model because the evaluations of $K$ do not have any necessary physical correlates and, in addition to its dependence on the parameters $\alpha_i$, $\hat{f}_\lambda$ depends explicitly on the $x_i$'s to define the functions in the sum, the number of which depends on $n$. Indeed, the optimal values of the $\alpha_i$'s are $\alpha_i = \alpha_i(D)$. Estimating the $\alpha_i$'s means finding a data-driven approximation $\hat{\alpha}_i(D)$ even though the ‘true’ value depends on $D$. That is, the data is used once to define the ‘true’ $\alpha_i$'s from the optimization (1) and then again to obtain good estimates for them. It is easiest to regard the latter estimates as converging in the sense of real numbers rather than stochastically.
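
To make the double use of the data concrete, here is a minimal Python sketch of (1)–(2) for the special case of squared-error loss, in which the $\alpha_i$'s solve the linear system $(K + n\lambda I)\alpha = y$; the Gaussian kernel, the value of $\lambda$, and the simulated data are illustrative assumptions.

```python
# Minimal sketch of the representer-theorem predictor (2)-(3) for
# squared-error loss, where alpha solves (K + n*lambda*I) alpha = y.
import numpy as np

def gaussian_kernel(a, b, w=0.5):
    # K(x, y) = exp(-(x - y)^2 / (2 w^2)) evaluated on all pairs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * w ** 2))

rng = np.random.default_rng(1)
n, lam = 40, 0.1
x = np.sort(rng.uniform(-2, 2, n))
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)

K = gaussian_kernel(x, x)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)  # alpha_i = alpha_i(D)

def y_rep(x_new):
    # Y_rep(x) = sum_i alpha_i K(x_i, x)
    return gaussian_kernel(np.atleast_1d(x_new), x) @ alpha

print(y_rep(0.3))
```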

Conditional on $D$, the (uninterpretable) representer theorem predictor is
$$\hat{Y}_{\mathrm{rep}}(x) = \sum_{i=1}^{n}\alpha_i K(x_i, x). \tag{3}$$
It is well known that RVM's and SVM's are of the form (3). Moreover, it is seen that (3) is the mode of a posterior where $L(y_i, f(x_i))$ is the exponent in an exponential family and $\lambda\|f\|^2_{\mathcal{H}_K}$ is the log of the prior on $f$. The representation (2) of $f$ is of special interest when the number of covariates $p$ is much larger than the sample size $n$, and predictors such as $\hat{Y}_{\mathrm{rep}}(\cdot)$ are often used in M-complete (and M-open) settings.

We start with a consistency result so the approximate interpretations we present will be asymptotically valid.

2.1. Consistency of the representer theorem predictor

Consider an M-complete problem and let
$$\hat{Q}_n(f) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda\|f\|^2_{\mathcal{H}_K}. \tag{4}$$
The population version of (4) is
$$Q_0(f) = E_{(X,Y)} L(Y, f(X)) + \lambda\|f\|^2_{\mathcal{H}_K}. \tag{5}$$
We can assume that, for each $i$, the $\hat{\alpha}_i = \hat{\alpha}_i(D)$'s are known from the empirical optimization of $\hat{Q}_n$ and that the true values for the $\alpha_i$'s given $D$ are fixed with limits, assuming they exist, due to the optimization of $Q_0$.

Theorem 2.1

Assume (i) $Q_0(f)$ is uniquely minimized at $f_0$, (ii) $Q_0(f)$ is continuous in $f$, and (iii) $\hat{Q}_n(f)$ converges uniformly in probability to $Q_0(f)$, i.e., $\sup_{f\in\mathcal{H}_K}|\hat{Q}_n(f) - Q_0(f)| \stackrel{P}{\to} 0$. Then
$$\hat{Y}_{\mathrm{rep}} \stackrel{P}{\to} f_0,$$
where the convergence is over independent outcomes from the distribution of $(X, Y)$.

Proof.

This is a modification of Theorem 2.1 in Newey and McFadden (1994). Since $\hat{Q}_n(\hat{Y}_{\mathrm{rep}}) \le \hat{Q}_n(f)$ for any $f$ by the optimality of $\hat{Y}_{\mathrm{rep}}$, we have that for any $\epsilon > 0$, $\hat{Q}_n(\hat{Y}_{\mathrm{rep}}) \le \hat{Q}_n(f_0) + \epsilon/3$. Also, for any $\epsilon > 0$, Assumption (iii) gives
$$Q_0(\hat{Y}_{\mathrm{rep}}) < \hat{Q}_n(\hat{Y}_{\mathrm{rep}}) + \frac{\epsilon}{3}, \qquad \hat{Q}_n(f_0) < Q_0(f_0) + \frac{\epsilon}{3},$$
with probability approaching one (w.p.a.1) as $n \to \infty$. Therefore, w.p.a.1,
$$Q_0(\hat{Y}_{\mathrm{rep}}) < \hat{Q}_n(\hat{Y}_{\mathrm{rep}}) + \frac{\epsilon}{3} < \hat{Q}_n(f_0) + \frac{2\epsilon}{3} < Q_0(f_0) + \epsilon. \tag{6}$$
For any $\delta > 0$, let
$$B(f_0, \delta) = \{f \in \mathcal{H}_K : \|f - f_0\|_{\mathcal{H}_K} < \delta\}.$$
Since $B(f_0, \delta)^c$ is closed, Assumptions (i) and (ii) give
$$\inf_{f \in B(f_0,\delta)^c} Q_0(f) = Q_0(f^*) > Q_0(f_0)$$
for some $f^* \in B(f_0, \delta)^c$.

Choosing $\epsilon = \inf_{f\in B(f_0,\delta)^c} Q_0(f) - Q_0(f_0) = Q_0(f^*) - Q_0(f_0)$, expression (6) implies
$$Q_0(\hat{Y}_{\mathrm{rep}}) < Q_0(f^*) = \inf_{f\in B(f_0,\delta)^c} Q_0(f) \quad \text{w.p.a.1},$$
and hence $\hat{Y}_{\mathrm{rep}} \in B(f_0, \delta)$. Letting $\delta \to 0$ completes the proof.

Some discussion of what Theorem 2.1 means and does not mean is important here. The mode of convergence is stochastic, in the Hilbert space norm. The objects converging are functions of the form (3) in which the $x_i$'s appear as arguments in the kernel evaluations and the $(x_i, y_i)$'s appear implicitly in the definition of the $\alpha_i$'s. The whole function (3) converges to the minimizer $f_0$. The function itself depends on $D$ through the values of the $x_i$'s and the $\alpha_i$'s. This means that the connection between (3) for one data set $D$ and another data set $D'$ is unclear. The $x_i$'s, $y_i$'s, and the sample sizes, say $n$ and $n'$, may be different. So, there is no necessary relationship between $\alpha_i(D)$ and $\alpha_i(D')$ even when $i \le \min(n, n')$ and the first $\min(n, n')$ pairs $(x_i, y_i)$ are the same for $D$ and $D'$. It may be easiest to regard increasing data sets as a sequence of problems corresponding to the accumulation of data and the convergence in the joint distribution of $(X, Y)$ as summarizing the effect of replications over the entirety of all the countably infinite data sequences. There may be further structure in the convergence of the $\alpha_i$'s and their effect on the convergence of (3) to $f_0$, but we do not treat this here.

The convergence in Theorem 2.1 is in probability. The mode can be improved to $L^2$ with some extra hypotheses, as seen in the following.

Corollary 2.1

Assume the conditions in Theorem 2.1 hold. In addition, assume:

  (i) Let $X$ be a generic random variable representing any $X_i$. Then there is an $\epsilon > 0$ so that for all $x \in \overline{\mathrm{supp}(X)}$, $E[K^{2+\epsilon}(X, x)] < \infty$;

  (ii) The sum of squares of the $\alpha_i$'s is bounded, with a tail rate, i.e., $\exists M$ so that for any $D$, $\sum_{i=1}^{\infty}\alpha_i^2(D) < M$, and $\exists N = N(n)$ so that $\sum_{i=N}^{n}\alpha_i^2 = o_P(1/(n-N))$.

Then, as $n \to \infty$, for any $x$ we have $\hat{Y}_{\mathrm{rep}}(x) \stackrel{L^2}{\to} f_0(x)$.

Proof.

To establish the result, it is enough to show that, as $n \to \infty$,
$$E\left[\hat{Y}_{\mathrm{rep}}(x) - f_0(x)\right]^2 \to 0. \tag{7}$$
This is done in Appendix A.

Again, some discussion of this result is worthwhile. First, Assumption (i) is easy to verify for most kernels $K$. However, Assumption (ii) is asymptotic and therefore hard to verify. The boundedness clause, while intuitive, may have to be enforced by restricting $(X, Y)$ to a compact set and renormalizing $P_{(X,Y)}$. The compact set can then be allowed to increase slowly while still preserving the result. The second clause of the assumption, the rate, is harder to deal with. Nevertheless, the rate assumption (and the boundedness assumption) are nearly always satisfied, at least approximately, in practice. While not a verification, the second clause can be checked by seeing how the $\alpha_j$'s perform using bootstrap samples from a given $D$. If the clause is satisfied for bootstrap samples and a range of finite $n$, then it may be reasonable to take it as true. In practice, with RVM's few of the $\alpha_i$'s are non-zero, so as a practical matter Assumption (ii) usually appears to be satisfied. Indeed, this can be seen in Tipping (2001).

A counterfactual may make Assumption (ii) less unpalatable. If the representer theorem solution were a Fourier expansion, the two clauses of Assumption (ii) would seem fairly reasonable. The first clause of Assumption (ii) would only mean that $\sum_i \alpha_i K(x_i, \cdot)$ is in the Hilbert space, because Bessel's inequality gives that the sum of squared Fourier coefficients is less than the norm of the function. (This is true for any orthonormal basis.) Also, the rate $\sum_{i=N}^{n}\alpha_i^2 = o_P(1/(n-N))$ as $n$ and $N(n)$ increase imposes a sparsity condition. It limits the collection of functions that can be well approximated because, as $n$ increases, the last $\alpha_i$'s can't be too large. That is, the true function is only being approximated by $N(n)$ evaluations of the kernel. Otherwise put, this method is only effective for functions $f_0$ that are sufficiently sparse in terms of the $K(x_i, x)$'s required to express them. The rate clause in Assumption (ii) bounds how far $f_0$ can be from the approximations $\hat{Y}_{\mathrm{rep}}$, in $L^2$, for consistency – as opposed to merely optimal approximation – to hold.

The mode of convergence in Corollary 2.1 is in $L^2$, pointwise in $x$. If a distribution $P$ is assigned to $X$, then Egoroff's theorem can be applied. It strengthens the result by giving $\hat{Y}_{\mathrm{rep}}(X) - f_0(X) \to 0$ in $L^2$ uniformly for $X \in A$, where $A$ has arbitrarily large probability under $P$. That is, the convergence is almost uniform over most of the sample space of $X$.

To complete our treatment of the consistency of $\hat{Y}_{\mathrm{rep}}$, we note that Assumption (iii) in Theorem 2.1 is hard to verify. So, we provide sufficient conditions for it.

Theorem 2.2

Assume (i) the loss function $L$ is continuous in $f$, (ii) there exists $\delta > 0$ so that for any $f \in \mathcal{H}_K$, $E_{(X,Y)}[\sup_{f' \in B(f,\delta)} L(Y, f'(X))] < \infty$, where $B(f, \delta) = \{f' \in \mathcal{H}_K : \|f' - f\|_{\mathcal{H}_K} < \delta\}$, and (iii) there exists an increasing sequence of compact subsets of $\mathcal{H}_K$, $\{D_j\}_{j=1}^{\infty}$, converging to $\mathcal{H}_K$ such that, for each fixed $n$, $\lim_{j\to\infty}\sup_{f\in D_j}|\hat{Q}_n(f) - Q_0(f)| = \sup_{f\in\mathcal{H}_K}|\hat{Q}_n(f) - Q_0(f)|$. Then
$$\sup_{f\in\mathcal{H}_K}|\hat{Q}_n(f) - Q_0(f)| \stackrel{P}{\to} 0.$$

Proof.

The proof is adapted from Theorem 6.10, the uniform weak law of large numbers, in Bierens (2005). The details are in Appendix A.

Remark

Uniform laws of large numbers emerge from empirical process theory; see Van de Geer (2000), Chapter 2, for instance. Often these results have weaker hypotheses that are harder to verify. We have used extensions of the classical law of large numbers since our goal is predictor interpretability, not the weakest conditions.

2.2. Theoretical interpretation of the representer theorem solution predictors

Under Mercer's conditions, see Mercer's Theorem in Scholkopf and Smola (2002), the kernel $K$ can be decomposed as
$$K(x_i, x) = \sum_{m=1}^{\infty}\lambda_m g_m(x_i) g_m(x), \tag{8}$$
where $\{g_m \mid m = 1, 2, \ldots\}$ is an orthonormal set of eigenfunctions of $K$ with $\int K(x, y) g_m(y)\,dy = \lambda_m g_m(x)$, $m = 1, 2, \ldots$. Thus, different kernels correspond to different orthonormal bases.
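
When no closed form for the $g_m$'s is available, they can be approximated numerically. The following Python sketch uses the Nyström method – eigendecomposing the kernel matrix on a sample and extending the eigenvectors to functions – which is a standard device we add here for illustration, not part of the development above; the Gaussian kernel and the sampling distribution are likewise assumptions.

```python
# A hedged numerical sketch: approximating the Mercer eigenfunctions
# g_m of a kernel via the Nystrom method.
import numpy as np

def kern(a, b, w=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * w ** 2))

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)   # sample from the distribution defining the integral operator

K = kern(x, x)
evals, evecs = np.linalg.eigh(K)
order = np.argsort(evals)[::-1]            # largest eigenvalues first
evals, evecs = evals[order], evecs[:, order]

def g(m, x_new):
    # Nystrom extension of the m-th eigenvector to an eigenfunction estimate
    return np.sqrt(n) / evals[m] * kern(np.atleast_1d(x_new), x) @ evecs[:, m]

print(g(0, 0.0))   # approximate leading eigenfunction evaluated at x = 0
```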

We can now use the $g_m$'s in a nonparametric regression expansion for $f_0$. Write the projection of $f_0$ onto the span of $\{g_1, \ldots, g_M\}$ as
$$r(x) = \sum_{m=1}^{M}\beta_m g_m(x) \tag{9}$$
so that estimating the $\beta_m$'s is equivalent to estimating the projection. Since the $g_m$'s are orthonormal, the optimal $\beta_m$'s in an $L^2$ projection sense are $\langle g_m, f_0\rangle$ and these can be estimated by $\hat{\beta}_m = (1/n)\sum_{i=1}^{n} y_i g_m(x_i)$ for $m = 1, \ldots, M$. For any reasonable choice of joint distribution for $(Y, X)$, the central limit theorem gives that for each $m$, there is a $\sigma_m$ so that
$$\hat{\beta}_m \approx N\left(\beta_m, \frac{\sigma_m^2}{n}\right), \tag{10}$$
asymptotically in $n$, assuming the second moments of the $\hat{\beta}_m$'s exist. The integrated squared bias of $r$ as an estimator for $f_0$ is
$$B_M = B(f_0, r) = \int (f_0(x) - r(x))^2\,dx = \sum_{m=M+1}^{\infty}\beta_m^2. \tag{11}$$
Since both $r$ and $f_0$ are in a Hilbert space, $B_M$ is finite for any $M$ and the latter $\beta_m$'s vanish, so $B_M \to 0$ as $M \to \infty$. Now, having controlled the variance of the $\hat{\beta}_m$'s and the bias of $r$, we see that
$$\hat{r}_M(x) = \sum_{m=1}^{M}\hat{\beta}_m g_m(x) \tag{12}$$
converges to $r$ as $n \to \infty$ and to $f_0$ with bias roughly $B_M$. We can let $M = M_n \to \infty$ so slowly as $n \to \infty$ that the variance remains controlled while $B_{M_n} \to 0$ as well, possibly slowly with $n$. Now, $\hat{r}$ converges to $f_0$ pointwise in $x$, i.e.,
$$\hat{r}_{M_n}(x) \stackrel{P}{\to} f_0(x) \tag{13}$$
for any $x$. More explicit formal conditions for (13) and for versions of (13) in stronger modes of convergence can be given, but that is beyond our present scope. However, we note that this argument holds for any orthonormal basis but that the $g_m$'s, being derived from $K$, are natural for this problem. Of course, if the basis elements in any orthonormal basis have a physical interpretation for a given $K$ that is more compelling than the $g_m$'s, that basis would be preferred.
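
A minimal Python sketch of the estimators $\hat{\beta}_m$ and $\hat{r}_M$ in (12) follows; the cosine basis, the uniform design (which makes the empirical averages estimate the $L^2$ inner products), and the test function are illustrative stand-ins for the $g_m$'s and $f_0$.

```python
# Minimal sketch of the orthonormal-basis predictor (12):
# beta_hat_m = (1/n) sum_i y_i g_m(x_i), r_hat_M = sum_m beta_hat_m g_m.
import numpy as np

def g(m, x):
    # cosine basis, orthonormal in L2[0, 1]
    return np.ones_like(x) if m == 0 else np.sqrt(2) * np.cos(m * np.pi * x)

rng = np.random.default_rng(3)
n, M = 500, 8
x = rng.uniform(0, 1, n)
f0 = lambda t: t ** 2                      # stand-in for the unknown f_0
y = f0(x) + rng.normal(scale=0.1, size=n)

beta_hat = np.array([np.mean(y * g(m, x)) for m in range(M)])

def r_hat(x_new):
    return sum(beta_hat[m] * g(m, x_new) for m in range(M))

print(r_hat(np.array([0.25, 0.75])), f0(np.array([0.25, 0.75])))
```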

From (13) and Theorem 2.1, we have the following.

Theorem 2.3

Assume (13) holds and that the hypotheses of Theorem 2.1 are satisfied. For M-closed and M-complete problems, for any $x$, the orthonormal basis predictor $\hat{r}_M$ in (12) is asymptotically equivalent to the representer theorem solution predictor $\hat{Y}_{\mathrm{rep}}(x)$, i.e., as $n \to \infty$ and consequently $M_n \to \infty$ at an appropriate rate,
$$\hat{r}_{M_n}(x) - \hat{Y}_{\mathrm{rep}}(x) \stackrel{P}{\to} 0.$$

From Theorem 2.3 we are justified in regarding $\hat{r}$ as an interpretation of $\hat{Y}_{\mathrm{rep}}(x)$ on the grounds that the $g_m$'s (or another orthonormal basis) admit a physical interpretation relevant to the DG. The cost of this interpretation is asymptotically zero but for finite $n$ depends on $B_{M_n}$ in (11) and the rate for the $\hat{\beta}_m$'s in (10).

Here we give some examples of the eigenfunctions $g_m$ for common choices of kernel to show that the interpretability is non-trivial. (Other orthonormal bases may be easier.)

Example 2.1

Consider the kernel $K(x, y) = e^{-xy}$ on $(0, \infty)$.

By the definition of the Gamma function, for $\alpha > 0$,
$$\int_0^\infty t^{\alpha-1} e^{-xt}\,dt = \frac{\Gamma(\alpha)}{x^{\alpha}}.$$
Changing $\alpha$ to $1-\alpha$, for $\alpha < 1$,
$$\int_0^\infty t^{-\alpha} e^{-xt}\,dt = \Gamma(1-\alpha)\,x^{\alpha-1}.$$
So, for $0 < \alpha < 1$,
$$\int_0^\infty \left(\frac{1}{\sqrt{\Gamma(\alpha)}}t^{\alpha-1} + \frac{1}{\sqrt{\Gamma(1-\alpha)}}t^{-\alpha}\right)e^{-xt}\,dt = \sqrt{\Gamma(\alpha)\Gamma(1-\alpha)}\left(\frac{1}{\sqrt{\Gamma(\alpha)}}x^{\alpha-1} + \frac{1}{\sqrt{\Gamma(1-\alpha)}}x^{-\alpha}\right).$$
Therefore, by definition, the eigenfunctions of this kernel are
$$g_\alpha(x) = \frac{1}{\sqrt{\Gamma(\alpha)}}x^{\alpha-1} + \frac{1}{\sqrt{\Gamma(1-\alpha)}}x^{-\alpha}, \quad \text{for } 0 < \alpha < 1.$$

Example 2.2

Polynomial kernel

The non-homogeneous version of the polynomial kernel of degree $d$ is defined by $K(x, y) = (c + \langle x, y\rangle)^d$, where $c$ is a constant and $\langle\cdot,\cdot\rangle$ is the inner product.

For $p = 2$, say, let $x = (x_1, x_2)$, $c = 0$, $d = 3$, and suppose that $(x_1, x_2) \sim 0.5\,N((3,1), I_2) + 0.5\,N((2,1), I_2)$. Then the eigenfunctions of this kernel are, see Liyang and Lee (2013),
$$g_1(x) = \frac{1}{\sqrt{1862.615}}\left(0.848x_1^3 - 0.791x_1^2x_2 + 0.437x_1x_2^2 - 0.097x_2^3\right),$$
$$g_2(x) = \frac{1}{\sqrt{343.748}}\left(0.518x_1^3 - 1.073x_1^2x_2 + 0.862x_1x_2^2 - 0.317x_2^3\right),$$
$$g_3(x) = \frac{1}{\sqrt{59.266}}\left(0.112x_1^3 - 1.079x_1^2x_2 - 0.929x_1x_2^2 + 0.559x_2^3\right),$$
$$g_4(x) = \frac{1}{\sqrt{1862.615}}\left(0.848x_1^3 - 0.791x_1^2x_2 + 0.437x_1x_2^2 - 0.097x_2^3\right).$$

Example 2.3

Exponential kernel

Consider the exponential kernel $K(x, y) = \exp(-|x - y|/w)$ for the uniform distribution on the interval $[-1, 1]$. In Diaconis et al. (2008) it was shown that the eigenfunctions of this kernel can be written as $\cos(bx)$ or $\sin(bx)$ inside the interval $[-1, 1]$, for appropriately chosen values of $b$, and decay exponentially away from it.

Example 2.4

Gaussian kernel

Consider the Gaussian kernel $K(x, y) = \exp\left(-\frac{(x-y)^2}{2w^2}\right)$ for the normal distribution $N(\mu, \sigma^2)$. Let $\beta = 2\sigma^2/w^2$ and let $H_i(x)$ be the $i$th-order Hermite polynomial. Shi et al. (2008) provided the eigenfunctions of this kernel:
$$g_i(x) = \frac{(1+2\beta)^{1/8}}{\sqrt{2^{i-1}(i-1)!}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\cdot\frac{\sqrt{1+2\beta}-1}{2}\right) H_{i-1}\left(\left(\frac{1}{4}+\frac{\beta}{2}\right)^{1/4}\frac{x-\mu}{\sigma}\right),$$
for $i = 1, 2, \ldots$. In particular, the first eigenfunction is
$$g_1(x) = \left(1 + \frac{4\sigma^2}{w^2}\right)^{1/8}\exp\left(-\frac{(x-\mu)^2}{4\sigma^2}\left(\sqrt{1 + \frac{4\sigma^2}{w^2}} - 1\right)\right).$$

Other examples may be developed, but the key points are (i) the bases used for the orthonormal basis predictor should be chosen in view of the DG to ensure interpretability, and (ii) for finite $n$, an interpretation of $\hat{Y}_{\mathrm{rep}}$ will not in general be as good a predictor (in, say, $\mathrm{PRESS}_n$), and this is the cost of interpretability.

2.3. A more empirical interpretation

If the $g_m$'s are not interpretable, and no obvious orthonormal basis can be identified, we might be led to default to the coordinates of the $x_i$'s since they, presumably, were the quantities measured. In the $\dim(x) = 1$ case we can write Taylor expansions
$$g_m(x) = \beta_0 + \beta_1 x + \cdots + \beta_k x^k + \cdots.$$
If each Taylor expansion converges, i.e., each $g_m(x)$ is analytic, then we get an analogous result for projections of the form of $r(x)$ in (9). For ease of exposition, each Taylor series can be represented as a finite sum of orthonormal polynomials. A common choice is the Hermite polynomials, often denoted $H_0, H_1, \ldots$. Then the difference between using $k$th-order Taylor expansions and expansions using the first $k$ Hermite polynomial basis elements is simply a matter of identifying the linear transformation between the bases.

Following Section 2.2, we can form $r$ as in (9) and $\hat{r}$ as in (12) using Hermite polynomials, and obtain a variant on Theorem 2.3 so that $\hat{r}$ provides a quantifiably good approximation to $f_0$. The result can be left in Hermite polynomials or converted back to the Taylor expansion of each $g_m$ in $r$. The point of converting back to the polynomial basis used for Taylor expansions is that functions of the form $x^j$ are usually easier to interpret physically in the context of a DG than Hermite polynomials are, simply because it was $x$ that was measured.

Thus, using linear models, which are commonly regarded as interpretable, to approximate the $g_m$'s is asymptotically equivalent to approximating $K$ in (8) directly. In practice, we suggest it will be easier to use Hermite polynomials (or any other orthonormal basis) on the terms on the right side of (8), and convert them to the polynomials used in Taylor expansions, than to approximate $K$ directly. Again, the finite sample discrepancy between the approximation we just described and (3) quantifies the cost of passing from an optimal predictor to an interpretable predictor. An entirely analogous argument holds when $\dim(x) \ge 2$, provided that orthonormal bases for polynomial spaces in $\dim(x)$ variables are used.
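
As a small illustration of the change of basis just described, the following Python sketch fits a function in the probabilists' Hermite basis and converts the coefficients to the ordinary powers $x^j$; the target function and the degree are illustrative assumptions.

```python
# A hedged sketch of Section 2.3: approximate a function in the
# (probabilists') Hermite basis, then convert the fit back to the
# ordinary polynomial basis x^j via the exact linear change of basis.
import numpy as np
from numpy.polynomial import hermite_e as H

x = np.linspace(-2, 2, 200)
f = np.exp(-x)                           # stand-in for an eigenfunction g_m

coef_herm = H.hermefit(x, f, deg=5)      # coefficients in He_0,...,He_5
coef_poly = H.herme2poly(coef_herm)      # same function, coefficients of x^j

# Both expansions agree because the conversion is a linear map
print(np.allclose(H.hermeval(x, coef_herm),
                  np.polynomial.polynomial.polyval(x, coef_poly)))
```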

3. Bayes Shtarkov predictors in M-open settings

Consider the online prediction of arbitrary sequences $y_1, y_2, \ldots$ drawn from a finite set $\mathcal{Y}$. In M-open settings, interest focusses on the case that no probability distribution can be assumed for a sequence of length $n$, say $y^n = (y_1, y_2, \ldots, y_n)$. This is the paradigm M-open statistical prediction problem for strings of values.

This problem can be regarded as a sequential game between Nature, N, and a Forecaster, F, permitting F to access a collection of experts indexed by $\theta \in \Theta \subset \mathbb{R}^k$ for some $k$. In the special case of log-loss, each round of the game proceeds as follows. Each expert announces a density, say $p_\theta$. Given this, F announces a density $q(\cdot)$ that will be used to predict the value N issues. Finally, N issues $y$ and pays F $\log q(y)$. If this number is negative, it is the amount of money F pays N, and this concludes the round. See Shtarkov (1987) and Cesa-Bianchi and Lugosi (2006) for details of this game and its properties.

Now suppose $n$ independent rounds of this game are to be played. Prior to the first round, each expert $\theta$ announces a density $p(\cdot\mid\theta)$ for $y^n$. F receives these $p_\theta$'s and chooses the density $q(y^n)$ by trying to match the performance of the best expert $\theta$ for predicting $y^n$. Then, N reveals $y^n$ and F incurs the loss (or gain) $\log q(y^n)$. The question remains how F should use the $p_\theta$'s to choose $q$. Obviously, the best expert will incur the loss $\min_\theta \log 1/p(y^n\mid\theta)$ to F, where $\theta$ ranges over the experts.

3.1. The Bayesian version

In the Bayes version of the game, F has access to experts that are weighted by a prior $w(\theta)$. (If $w \equiv 1$, this reduces to the frequentist version.) In this case, F would want to choose $q$ to minimize the maximum regret
$$\sup_{y^n}\left[\log\frac{1}{q(y^n)} - \inf_\theta \log\frac{1}{w(\theta)p(y^n\mid\theta)}\right] = \sup_{y^n}\left[\sup_\theta \log\frac{w(\theta)p(y^n\mid\theta)}{q(y^n)}\right]. \tag{14}$$
More formally, the solution $q_{\mathrm{opt}}$ to (14), which we henceforth call the Bayes Shtarkov predictor (for the discrete case), is, see Le and Clarke (2016a),
$$q_{\mathrm{opt}}(y^n) = \arg\left[\inf_{q\in\mathcal{P}}\left(\sup_{y^n}\sup_\theta \log\frac{w(\theta)p(y^n\mid\theta)}{q(y^n)}\right)\right] = \frac{w(\tilde\theta(y^n))\,p(y^n\mid\tilde\theta(y^n))}{\sum_{y^n} w(\tilde\theta(y^n))\,p(y^n\mid\tilde\theta(y^n))}, \tag{15}$$
where $\theta$ ranges over the ‘parameter space’ indexing the experts, $\mathcal{P}$ is the collection of all densities for $Y^n$ with respect to counting measure, and $\tilde\theta$ is the posterior mode. (Since $q$ is in the denominator of (15), $q(y^n) \neq 0$ for any $y^n \in \mathcal{Y}^n$.)

In the continuous case, the sum becomes an integral over a subset of a real space and $\mathcal{P}$ becomes a class of densities with respect to Lebesgue measure, so (15) becomes
$$q_{\mathrm{opt}}(y^n) = \frac{w(\tilde\theta(y^n))\,p(y^n\mid\tilde\theta(y^n))}{\int w(\tilde\theta(y^n))\,p(y^n\mid\tilde\theta(y^n))\,dy^n}; \tag{16}$$
see Clarke (2007, Sec. 5.2) for discussion of (15) and (16).

The solution $q_{\mathrm{opt}}(y^n)$ does not factor into a product of $q(y_i)$'s and so does not correspond to a stochastic process. Nevertheless, regardless of how $q_{\mathrm{opt}}(y^n)$ is computed, univariate Bayes Shtarkov predictive densities, when they exist, are of the form
$$q_{\mathrm{opt}}(y_{n+1}\mid y^n) = \frac{q_{\mathrm{opt}}(y^{n+1})}{q_{\mathrm{opt}}(y^n)}, \tag{17}$$
and can be used prequentially, i.e., to generate sequential predictions.
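
To fix ideas, here is a minimal Python sketch of the discrete Bayes Shtarkov predictor (15) for short binary strings, using a finite grid of Bernoulli experts and a uniform prior; all of these choices are illustrative assumptions.

```python
# Minimal sketch of the discrete Bayes Shtarkov predictor (15):
# enumerate all y^n, take sup_theta w(theta)p(y^n|theta), normalize.
import itertools
import numpy as np

thetas = np.linspace(0.1, 0.9, 9)               # the experts
w = np.full(len(thetas), 1.0 / len(thetas))     # uniform prior
n = 5

def best_weighted_lik(seq):
    # sup_theta w(theta) p(y^n | theta): the posterior-mode expert for y^n
    k = sum(seq)
    liks = w * thetas ** k * (1 - thetas) ** (n - k)
    return liks.max()

seqs = list(itertools.product([0, 1], repeat=n))
Z = sum(best_weighted_lik(s) for s in seqs)     # Shtarkov normalizer

def q_opt(seq):
    return best_weighted_lik(seq) / Z

print(q_opt((1, 1, 0, 1, 0)))
```

Conditionals as in (17) then follow by evaluating this $q_{\mathrm{opt}}$ at strings of lengths $n+1$ and $n$ and taking the ratio.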

The foregoing generalizes directly to the case where side information, i.e., a value $x_i$, is associated with each $y_i$. So, let us write the corresponding $q_{\mathrm{opt}}$ as $q_{\mathrm{shk}}(y\mid x^{n+1}, y^n)$, namely the Bayes Shtarkov predictive density for $Y$, where $x^{n+1} = (x_1, \ldots, x_{n+1})$ and $y^n = (y_1, \ldots, y_n)$. Now, there are two ways to generate predictions from $q_{\mathrm{shk}}$: (i) use $q_{\mathrm{shk}}$ as a density to generate interval predictors, and (ii) convert $q_{\mathrm{shk}}$ to a point predictor.

For the first, recall that under various regularity conditions, $q_{\mathrm{shk}}$ can be approximated by the Bayesian's marginal density for the data, i.e., $q_{\mathrm{shk}}(y^n) \approx m(y^n)$ in terms of regret. That is, (14) leads to
$$\sup_{y^n}\left[\log\frac{w(\tilde\theta)p(y^n\mid\tilde\theta)}{q_{\mathrm{shk}}(y^n)\sqrt{\hat{I}(\tilde\theta)}}\right] = \sup_{y^n}\left[\log\frac{w(\tilde\theta)p(y^n\mid\tilde\theta)}{m(y^n)\sqrt{\hat{I}(\tilde\theta)}}\right] = \frac{1}{2}\log\frac{n}{2\pi} + o(1), \tag{18}$$
where $\hat{I}(\cdot)$ is the empirical Fisher information from $p(y\mid\theta)$; see Xie and Barron (2000) and Clarke (2007). This means a stochastic process, the Bayesian's mixture, is approximating a density $q_{\mathrm{shk}}$ that does not correspond to a stochastic process. Hence, even in M-open problems, there may be good – new – predictors that resemble familiar predictors. If computing $q_{\mathrm{shk}}$ is onerous, $m(\cdot)$ may be a good predictor. Likewise, conditionals from $m(\cdot)$ such as $m(y_{n+1}\mid y^n)$ may be a good approximation for (17).

More important for the present, if the Bayesian's marginal for the data is ‘interpretable’ – perhaps because the densities proposed by the experts are – (18) provides an asymptotic interpretation of $q_{\mathrm{shk}}$ in terms of $m(\cdot)$. The difference between $m(\cdot)$ and $q_{\mathrm{shk}}$ represents the degree to which $m(\cdot)$ is predictively suboptimal to $q_{\mathrm{shk}}$ – in regret under log-loss – and the degree of suboptimality decreases as $n \to \infty$. Thus, $100(1-\alpha)\%$ predictive intervals under $m(\cdot)$ and $q_{\mathrm{shk}}$ are equivalent in the limit even though for all finite $n$ $q_{\mathrm{shk}}$ is better in terms of regret.

By contrast, if a generic ‘interpretable’ predictor $r(y^n)$ is used, the regret usually becomes far worse. For instance,
$$\sup_{y^n}\left[\log\frac{w(\tilde\theta)p(y^n\mid\tilde\theta)}{r(y^n)\sqrt{\hat{I}(\tilde\theta)}}\right] = \sup_{y^n}\left[\log\frac{w(\tilde\theta)p(y^n\mid\tilde\theta)}{q_{\mathrm{shk}}(y^n)\sqrt{\hat{I}(\tilde\theta)}} + n\left(\frac{1}{n}\sum_{i=1}^{n}\log\frac{q(y_i\mid y^{i-1})}{r(y_i\mid y^{i-1})}\right)\right] \approx \frac{1}{2}\log\frac{n}{2\pi} + n\sup_{y^n} D(q_i\|r_i) + o(1), \tag{19}$$
in which $D(q_i\|r_i)$ is defined by (19) and typically dominates, resulting in a much larger loss for F. In part, this is an artifact of using the log-loss of a density ratio, which is much more sensitive to tail behaviour than other functions. Statements analogous to (17), (18), and (19) hold for sequences of discrete outcomes $y_i$. In either case, this suggests that interpretability per se, without any direct relationship to the minimum regret in a logarithmic sense, gives much larger costs.

For the second, if we consider pointwise prediction, we want a point predictor from $q_{\mathrm{shk}}$ that represents where the mass of $q_{\mathrm{shk}}$ is, say $E_{q_{\mathrm{shk}}}(Y)$. One such predictor is
$$\hat{Y}_{\mathrm{shk}}(x) = E_{q_{\mathrm{shk}}}(Y\mid x^{n+1}, y^n). \tag{20}$$
Others include $\mathrm{med}(Y\mid x^{n+1}, y^n)$, $\mathrm{mode}(Y\mid x^{n+1}, y^n)$, etc. In addition, because of (18) we are led to approximate each of these point predictors by expectations with respect to the conditional $m(\cdot\mid x^{n+1}, y^n)$ from the mixture.

Assume $p_j(y)$ is the proposed density for $Y$ from expert $j$, $j = 1, \ldots, J$, i.e., we assume finitely many experts or that a continuum of experts can be approximated by a weighted sum of finitely many. Even though there is no distribution associated with $Y$ because the problem is M-open, we can still take expectations with respect to $q_{\mathrm{shk}}$ and the $p_j$'s. We can write
$$\left|Y - E_{\sum_{j=1}^{J}\gamma_j p_j}(Y)\right| \le \left|Y - E_{q_{\mathrm{shk}}}(Y)\right| + \left|E_{q_{\mathrm{shk}}}(Y) - E_{\sum_{j=1}^{J}\gamma_j p_j}(Y)\right| = \left|Y - \hat{Y}_{\mathrm{shk}}\right| + \left|\hat{Y}_{\mathrm{shk}} - E_{\sum_{j=1}^{J}\gamma_j p_j}(Y)\right| \tag{21}$$
where $\gamma_j > 0$, $\sum_{j=1}^{J}\gamma_j = 1$, and subscripts on $E$ indicate the density with respect to which the expectation is taken. Since the Bayes Shtarkov predictor is best in log-loss, the first term in (21) is likely to be small. Thus, we only need to find $\gamma_j$'s such that the second term in (21) is as small as possible if not zero. If this is done, then the first term represents the minimal cost of prediction and the second term represents the additional cost of interpretation, if the $p_j$'s are interpretable.

We can modify (21) by adding and subtracting expressions involving expectations with respect to $m(y_{n+1}\mid x^{n+1}, y^n)$ as a way to try to identify the components of the error meant by (21) in terms of the uninterpretable $q_{\mathrm{shk}}$, its distance from a mixture (possibly regarded as interpretable), and sums of weighted densities of the experts (if they are interpretable); e.g., we can add and subtract $E_{m(\cdot\mid y^n)}(Y)$ in the last term of (21). In practice, the last terms would represent the cost of interpretability while the first term on the right in (21) represents the minimal cost of prediction.

3.2. A theoretical interpretation for Bayes Shtarkov predictors

Separate from approximating $q_{\mathrm{shk}}$ by mixture densities or other expressions, in some cases we can characterize $q_{\mathrm{shk}}$ as a mixture itself. The hypotheses are rather strong and the mixing density is somewhat artificial, but in some examples this characterization may be useful. We have the following.

Theorem 3.1

Assume $q_{\mathrm{shk}}$ is $m$-monotone over $(0, \infty)$, i.e., $(-1)^k q_{\mathrm{shk}}^{(k)}(|x|) \ge 0$ for $k = 0, \ldots, m-1$, where $q_{\mathrm{shk}}^{(k)}$ is the $k$th derivative of $q_{\mathrm{shk}}$ and $q_{\mathrm{shk}}^{(0)} = q_{\mathrm{shk}}$. Then $q_{\mathrm{shk}}$ can be represented as the following mixture for any integer $k$, $1 \le k \le m$:
$$q_{\mathrm{shk}}(y) = \int_0^\infty\left[\frac{k}{s}\left(1 - \frac{|y|}{s}\right)_+^{k-1}\right]g(s)\,ds, \tag{22}$$
where $a_+ = \max\{a, 0\}$ and the mixing density $g(s)$ is
$$g(s) = \frac{1}{k}\sum_{j=0}^{k-1}\frac{(-1)^{j+1}}{j!}\left[j s^j q_{\mathrm{shk}}^{(j)}(s) + s^{j+1} q_{\mathrm{shk}}^{(j+1)}(s)\right].$$

Proof.

The proof of Theorem 1 in Polson et al. (2014), based on Williamson (1956), holds for all positive values of $y$. For negative values of $y$, define $f(y)$ on $(0, \infty)$ by $f(y) = q_{\mathrm{shk}}(-y)$; then we have the same result for $f(y)$:
$$f(y) = \int_0^\infty\left[\frac{k}{s}\left(1 - \frac{y}{s}\right)_+^{k-1}\right]g(s)\,ds.$$
Hence, for negative values of $y$,
$$q_{\mathrm{shk}}(y) = f(-y) = \int_0^\infty\left[\frac{k}{s}\left(1 - \frac{|y|}{s}\right)_+^{k-1}\right]g(s)\,ds.$$
Thus, for all $y$, $q_{\mathrm{shk}}$ has the representation (22).

We do not have sufficient conditions for $q_{\mathrm{shk}}$ to be $m$-monotone. However, examples of $q_{\mathrm{shk}}$ in Le and Clarke (2016a) look like graphs of $1/y$ or $e^{-y}$, which are $m$-monotone.

One of the implications of Theorem 3.1 is that if $X \sim \mathrm{Beta}(1, k)$ and $S \sim g(\cdot)$ then $Y = SX$ will have density (22). For instance,

  1. if $k = 1$, $g(s) = -s\,q'_{\mathrm{shk}}(s)$ and $X \sim \mathrm{Beta}(1,1) = \mathrm{Uniform}(0,1)$,

  2. if $k = 2$, $g(s) = \frac{s^2}{2!}\,q''_{\mathrm{shk}}(s)$ and $X \sim \mathrm{Beta}(1,2)$,

  3. if $k = 3$, $g(s) = -\frac{s^3}{3!}\,q'''_{\mathrm{shk}}(s)$ and $X \sim \mathrm{Beta}(1,3)$.

Thus, while $g$ remains uninterpretable, $X$ is recognizable and $Y$ is a product. A limitation of this result is that it is only for univariate $y$. However, as suggested by the relationship between $m(\cdot)$ and $q_{\mathrm{shk}}$ in (18), extensions to multivariate $y$ may be possible under some M-open analog of stationarity – provided there is the right sort of dependence so that the middle term of order $n$ in (19) can be avoided.
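
A small numerical check of this product representation is possible. In the sketch below, the stand-in density $q(y) = e^{-y}$ on $(0,\infty)$ plays the role of a monotone $q_{\mathrm{shk}}$; with $k = 1$, $g(s) = -sq'(s) = se^{-s}$ is a Gamma(2,1) density, so $Y = SX$ with $S \sim g$ and $X \sim \mathrm{Uniform}(0,1)$ should recover $q$.

```python
# Hedged numerical check of Theorem 3.1 with k = 1 and q(y) = e^{-y}:
# S ~ g = Gamma(2, 1), X ~ Beta(1, 1) = Uniform(0, 1), Y = S X.
import numpy as np

rng = np.random.default_rng(4)
s = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # S ~ g
x = rng.uniform(size=100_000)                        # X ~ Beta(1, 1)
y = s * x

# Compare the empirical mean of Y with the mean of Exp(1), which is 1
print(y.mean())   # should be close to 1.0
```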

3.3. An empirical interpretation for Bayes Shtarkov predictors

We can approximate the univariate density $q_{\mathrm{shk}}(y)$ by finding the member of a large family of densities closest to it. For this we want a relatively large family of candidate densities that are parametrized in some way that reflects our understanding of the shapes of densities. One such family consists of the Pearson distributions, first introduced by Pearson (1895). There are at least 7 useful subtypes within the Pearson family; Pearson himself ultimately identified 12. Overall, this family is characterized by five parameters: a location parameter $a$ (often interpretable as a mode), a location parameter $\lambda$ (often interpretable as a mean, $\mu_1$), a variance $\mu_2$, a skewness $\gamma_1$ (this enters as $\beta_1 = \gamma_1^2$), and a kurtosis $\beta_2$. While this family is only for univariate densities, there are proposed generalizations to the multivariate case, see Steyn (1960) amongst others, although they are not well developed and there is little recent work on them.

Formally, a Pearson density $p$ is any solution to the differential equation
$$\frac{p'(y)}{p(y)} + \frac{a + (y - \lambda)}{b_0 + b_1(y-\lambda) + b_2(y-\lambda)^2} = 0, \tag{23}$$
where
$$b_0 = \frac{4\beta_2 - 3\beta_1}{10\beta_2 - 12\beta_1 - 18}\,\mu_2, \quad b_1 = a = \sqrt{\mu_2\beta_1}\,\frac{\beta_2 + 3}{10\beta_2 - 12\beta_1 - 18}, \quad b_2 = \frac{2\beta_2 - 3\beta_1 - 6}{10\beta_2 - 12\beta_1 - 18}.$$
Now, in principle, values of $(a, \lambda, \mu_1, \beta_1, \beta_2)$ yielding the Pearson density closest to a given $q_{\mathrm{shk}}$ density can be found. This Pearson density may be regarded as an interpretation of $q_{\mathrm{shk}}$, but it cannot be as good as $q_{\mathrm{shk}}$ in terms of regret, see (14). The increase in regret from using the Pearson density closest to $q_{\mathrm{shk}}$, rather than $q_{\mathrm{shk}}$ itself, is the cost of interpretation.

The solution of (23) is given by the indefinite integral
$$p_{\mathrm{prs}}(y) \propto \exp\left(-\int\frac{y - a}{b_2y^2 + b_1y + b_0}\,dy\right).$$
For the sake of completeness, we look at an example, namely the two cases of the Pearson type IV distributions based on whether $b_1^2 - 4b_0b_2$ is negative or non-negative. (The term discriminant arises from the use of the quadratic root formula.) They are:

Case I: if $b_1^2 - 4b_0b_2 < 0$, then
$$p_{\mathrm{prs}}(y) \propto \left[1 + \left(\frac{y-\lambda}{\alpha}\right)^2\right]^{-m}\exp\left[-\nu\arctan\left(\frac{y-\lambda}{\alpha}\right)\right], \tag{24}$$
where
$$\alpha = \frac{\sqrt{4b_0b_2 - b_1^2}}{2b_2}, \quad \nu = \frac{2b_2a + b_1}{2b_2^2\alpha}, \quad m = \frac{1}{2b_2}.$$
Case II: if $b_1^2 - 4b_0b_2 \ge 0$, then
$$p_{\mathrm{prs}}(y) \propto \left(1 - \frac{y}{a_1}\right)^{-\nu(a_1-a)}\left(1 - \frac{y}{a_2}\right)^{\nu(a_2-a)}, \tag{25}$$
where
$$a_1 = \frac{-b_1 - \sqrt{b_1^2 - 4b_0b_2}}{2b_2}, \quad a_2 = \frac{-b_1 + \sqrt{b_1^2 - 4b_0b_2}}{2b_2}, \quad \nu = \frac{1}{b_2(a_1 - a_2)}.$$
Let $d$ be a distance between densities. We can find $\hat{v}_1 = (\lambda_{\min}, \alpha_{\min}, \nu_{\min}, m_{\min})$ and $\hat{v}_2 = (a_{\min}, a_{1,\min}, a_{2,\min}, \nu_{\min})$ that achieve
$$\arg\left(\min d(q_{\mathrm{shk}}, p_{\mathrm{prs}})\right), \tag{26}$$
where the minimum is over the parameters in (24) or (25), respectively. Naturally, we would choose the Pearson density corresponding to whichever of $\hat{v}_1$ and $\hat{v}_2$ gave a lower value of the minimum in (26). In principle, this can be done over the other types of Pearson densities (or any parametrized class of densities) so that the parameters giving the overall minimum for the distance in (26) can be found. The result is a Pearson distribution whose density approximates $q_{\mathrm{shk}}$ as closely as possible within the family and hence has an interpretation based on the shape of the approximating density. If the minimum is not small enough given the choice of $d$, we may be led to consider richer families of densities than Pearson.
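
In practice the minimization (26) can be carried out numerically. The Python sketch below fits the Type IV form (24) to a target density by minimizing an $L^2$ distance on a grid; the standard normal target (a stand-in for $q_{\mathrm{shk}}$), the optimizer, and the starting values are illustrative assumptions.

```python
# Hedged sketch of (26): fit Pearson Type IV (24) to a target density
# by minimizing an L2 distance on a grid.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

grid = np.linspace(-6, 6, 601)
q_target = norm.pdf(grid)              # stand-in for q_shk

def pearson_iv(y, lam, alpha, nu, m):
    z = (y - lam) / alpha
    p = (1 + z ** 2) ** (-m) * np.exp(-nu * np.arctan(z))
    return p / np.trapz(p, y)          # normalize numerically on the grid

def loss(v):
    lam, log_alpha, nu, log_m = v      # log-parametrize alpha, m > 0
    p = pearson_iv(grid, lam, np.exp(log_alpha), nu, np.exp(log_m))
    return np.trapz((p - q_target) ** 2, grid)

res = minimize(loss, x0=[0.0, 0.0, 0.0, 1.0], method="Nelder-Mead")
print(res.x, res.fun)                  # fitted parameters, achieved distance
```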

As a final point for this section, recall that it is well known that if all moments of a distribution exist then they characterize the distribution, and that the more closely the moments of two distributions match, the closer the two distributions are. Thus, as long as the moments of the distribution are meaningful, the distribution can be regarded as interpretable, but, as we have seen, interpretation has a cost.

4. Random forest predictors

Random forest (RF) predictors were introduced by Breiman (2001a). The main idea is to use bootstrap aggregation on trees with binary splits or, more formally, binary recursive partitioning models, and add one extra step to reduce the correlation between any pair of trees in the forest. The extra step is random selection from the explanatory variables. That is, when growing a tree on a bootstrapped sample, before each split, choose $m \le p$ of the explanatory variables at random to be candidates for splitting. Values for $m$ typically range from $\log_2 p$ to $\sqrt{p}$. Despite the de-correlation step, RF's have many of the properties of bagged trees. Here we focus on RF's because they are one of the most successful model averaging methods for binary classification and often have benefits other methods, including boosting, do not.
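
For reference, the de-correlation step corresponds to the max_features argument in standard implementations; a minimal scikit-learn sketch with illustrative data and settings follows.

```python
# Minimal sketch of the de-correlation step: a random forest that
# considers only m of the p features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # B, the number of bootstrapped trees
    max_features="sqrt",   # m = sqrt(p) candidate features per split
    random_state=0,
).fit(X, y)

print(rf.predict(X[:5]))
```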

Here we will relate RF's to boosting, see Schapire (1990), to show that the main point of this paper – that interpretable methods have a performance cost relative to optimal predictive methods – holds for classification as well as regression. This is a different perspective from Wyner et al. (2017), who argue that RF's and boosting work well because both are interpolating and averaging. Our point has to do with interpreting classifiers, not understanding why they work.

4.1. Interpreting RF's in terms of boosting

Our point in this subsection can be stated succinctly as follows. Let $RF(x)$ be the random forest classifier based on $x$. Consider a data set $D = \{(y_i, x_i), i = 1, \ldots, n\}$ where the pairs $(y_i, x_i)$ are independent over $i$, $y_i \in \mathcal{Y} = \{-1, 1\}$ and $x_i \in \mathbb{R}^p$. Write $RF(x) = RF_{B,\rho,\tau}(x)$ where $\rho$ is a splitting rule, $\tau$ is a stop-splitting rule, and $B$ is the number of trees used to form the random forest. Thus,
$$RF(x) = \frac{1}{B}\sum_{b=1}^{B} T_{b,\rho,\tau}(x),$$
where $T_b = T_{b,\rho,\tau}$ is the classification tree formed from the $b$th bootstrap sample and the decorrelation used to form $T_b$ is suppressed in the notation. Let $BST_C^J(x) = BST_C(x)$ be the boosted classifier using $J$ iterations of the boosting procedure starting with the initial tree-based classifier $C(x)$, assumed to be ‘weak’. That is, $P(Y \neq C(X)) < .5$, but not by much. Now we can write
$$RF(x) = (RF(x) - BST_C(x)) + BST_C(x). \tag{27}$$
The idea is that $RF(\cdot)$ is uninterpretable but $BST_C(\cdot)$ has a limited interpretation (due to Friedman et al. (2000), see below), so the term $(RF(x) - BST_C(x))$ represents the cost of that interpretation. We assume that RF's perform a little better than boosted trees because they generally do unless the boosting classifier is sufficiently well-calibrated and does not overfit. That is, RF's are nearly automatic and hence more robust. Moreover, a majority of successful classifiers are random forests or variants on random forests; see Caruana et al. (2008) for a definitive report emphasizing high dimensional problems. As a generality, boosting also does not generalize well beyond binary classification.

To make this more precise, recall that while there are a variety of boosting algorithms, and variants on boosting algorithms including gradient boosting, AdaBoost, due to Freund and Schapire (1997), is arguably the most popular. The basic idea of boosting is to improve a weak classifier iteratively by averaging the reweighted improvements, i.e., to help the weak classifier learn from its mistakes. To generate the improved classifiers, the boosting procedure re-uses the data like bagging or RF's but builds iterates rather than starting anew with each iteration.

Let $C(x)$ be a fixed initial classifier for $Y$ and assume that, as a function, $C(x)$ is representable as a tree. Denote the iterates of $C$ under the boosting algorithm by $\hat{C}_j$, $j = 1, \ldots, J$. The iterates are also classifiers. They are ensembled into a final classifier by weighted majority voting to yield
$$BST_C(x) = \mathrm{sign}\left(\sum_{j=1}^{J}\beta_j\hat{C}_j(x)\right), \tag{28}$$
where the weights $\beta_1, \ldots, \beta_J$ are computed by the AdaBoost algorithm; see Freund and Schapire (1997) for details. The central intuition in AdaBoost is that increasing the penalty for misclassified data points forces successive $\hat{C}_j$'s to make fewer errors. There is a performance criterion that is satisfied by most versions of boosting, see Freund and Schapire (1997) Theorem 6. This result leaves open the possibility that a boosted classifier could be perfect in a limiting sense but does not actually give convergence of $BST_C$ to a limit.

Even though AdaBoost was a novel approach to classification by ensembles, Friedman et al. (2000) showed it was equivalent to forward stagewise additive logistic regression under exponential loss. This provides an interpretation of boosting because the logistic regression gives an explicit expression for $P(Y = 1\mid D)$ in terms of specifically constructed functions of $x$ (the $\hat{C}_j$'s below); see Friedman et al. (2000, Sec. 3.3) and Hastie et al. (2009, Secs. 10.4 and 10.5) for details. Since the $\hat{C}_j$'s are individual trees, they admit interpretations in terms of the explanatory variables. This result holds in a limiting sense as the terms in the logistic regression increase and as $n \to \infty$. So, for each finite step, the boosted classifier is suboptimal even though it is Bayes optimal in the limit; we use this below in the proof of Corollary 4.1.

One key step in the AdaBoost procedure is choosing how the iterates $\hat{C}_j$ are to be generated. There are various choices; the most popular are probably naive Bayes classifiers or trees with a maximum number of splits. Here we use the latter. So, we start by taking $C$ to be a tree classifier and want our iterates to be trees as well. Specifically, the criterion the iterates must satisfy is
$$\hat{C}_{j+1}(x) = \arg\min_{h\in G_j}\sum_{i=1}^{n} D_j(i)\,\mathbf{1}\{y_i \neq h(x_i)\} \tag{29}$$
where $D_j = (D_j(1), \ldots, D_j(n))$ for $j \ge 1$ is the ‘empirical’ distribution on the $n$ data points given by
$$D_j(i) = \frac{D_{j-1}(i)\,e^{\beta_{j-1}\mathbf{1}\{y_i \neq \hat{C}_{j-1}(x_i)\}}}{N_{j-1}} \tag{30}$$
in which $\hat{C}_0 = C$, $N_{j-1}$ is a normalization constant, the $\beta_{j-1}$'s are found by an auxiliary procedure, and the sequence of distributions $D_j$ is initialized by $D_0 = (1/n, \ldots, 1/n)$.

In (29), the classifiers $h$ at step $j$ vary over a set of classifiers $G_j$. As noted, $C$ is a tree and we want the iterates to be trees. So, each $G_j$ must be a class of trees. If $G_j$ is too small a class, e.g., trees with exactly one split (often called ‘stumps’), then the range of boosted tree-based classifiers will be too small. For instance, if $G_j$ only contained stumps, it would not include trees that allowed interactions between entries in $x$. On the other hand, if $G_j$ is too big, $\hat{C}_j$ will fit the data perfectly even though such a classifier often has poor generalization error. So, we have to choose a reasonable value for the number of splits to allow in the classifiers in $G_j$.

Chapter 10, Sec. 11 in Hastie et al. (2009) recommends using trees with four to eight splits. Three splits allows for interactions between explanatory variables, but often not enough, so starting with four splits and working up to eight is often a good overall procedure. Hastie et al. (2009) comment that 10 or more splits are rarely required for good performance, e.g., low generalization error.
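
A minimal scikit-learn sketch of boosting over such depth-limited trees follows; the data, the five-leaf (four-split) base trees, and the number of iterations are illustrative choices.

```python
# Hedged sketch of boosting with small trees as in (28)-(29): AdaBoost
# over trees with at most four splits (five leaves). Note: older
# scikit-learn versions name the argument base_estimator, not estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

bst = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_leaf_nodes=5),
    n_estimators=100,      # J, the number of boosting iterations
    random_state=1,
).fit(X, y)

print(bst.estimator_weights_[:5])   # the beta_j's of (28)
```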

To see that under these circumstances (28) is a tree, it is enough to show that, as functions of $x$, a linear combination of trees is a tree. First, a real constant times a tree function is a tree, so it is enough to show that the sum of two trees is again a tree. Let $T_1$ and $T_2$ be trees. To see that $T(x) = T_1(x) + T_2(x)$ is also a tree, observe that if each leaf node of $T_1$ is taken as the root node of $T_2$, the result is a tree $T$ of twice the depth for which each input vector of explanatory variables ends up in exactly one leaf. However, some of the nodes or leaves may be void. This does not make $T$ invalid, just artificial, and it may be collapsible into a much smaller tree. Nevertheless, in a trivial sense, a linear combination of trees is a tree.

Less artificially, write
$$T_k(x) = \sum_{\ell=1}^{u_k}\alpha_\ell I_{R_\ell(k)}(x) \tag{31}$$
for $k = 1, 2$, where the $R_\ell(k)$ are disjoint and exhaustive regions in $\mathbb{R}^p$ assumed to have edges parallel to the axes of the real space. In the special case of $p = 1$, $\mathbb{R}$ has an ordering so it is easy to see that assigning the right constant to each intersection $R_\ell(1) \cap R_{\ell'}(2)$ gives a function that can be expressed as constants times indicator functions for intervals in $\mathbb{R}$. Since intervals can be defined by splits on the single real variable, the sum $T(x) = T_1(x) + T_2(x)$ has the same form as (31) and arises from a tree structure using binary splits on the explanatory variables.

The same sort of argument applies to $\mathbb{R}^2$. It is easy to see that in the two dimensional case $T(x) = T_1(x) + T_2(x)$ can be written in the form (31). What is harder to see is that the regions defined by the $R_\ell(1) \cap R_{\ell'}(2)$ arise from binary splits on the entries of $x$. While harder, this is not hard: in the real plane, label the coordinates as $x_1$ and $x_2$ and let the tree structures of $T_1$ and $T_2$ be represented as partitions of $\mathbb{R}^2$. Pick the root node of, say, $T_1$. Without loss of generality, assume it is a split of the form $x_1 < c_1$ versus $x_1 \ge c_1$. Consider the left branch. The region $x_1 < c_1$ will be partitioned by horizontal lines, i.e., by ranges of $x_2$. Choose the largest cutoff for $x_2$, say $c_2$. Thus, splitting on $x_2 < c_2$ versus $x_2 \ge c_2$ will give us two daughter nodes on the left branch. We can then repeat this for each cutpoint on the left branch, splitting at $c_3$, $c_4$, etc. The issue arises when the horizontal band represented by the split is itself split by some value of $x_1$. If this happens at, say, $c_4$, then this simply adds another split that has to be carried over to the other splits on the left, i.e., for $x_2 < c_5$ versus $x_2 \ge c_5$, $x_2 < c_6$ versus $x_2 \ge c_6$, etc., as far down the left side of the tree as there are splits on $x_2$. If there are further splits, they can be accommodated in the same way and the argument can be applied analogously to the right branch of the tree. The same argument can be applied in three or more dimensions; it is simply a matter of considering splits on each dimension in turn and all the splits that may be performed on it using the other explanatory variables, in all possible sequences.

Now, if $C(\cdot)$ is a tree then its iterates $\hat{C}_1, \hat{C}_2, \ldots$ can be assumed to be trees, as can the final output $BST_C(x)$. It is reasonable to expect the boosted classifier to be Bayes optimal. Indeed, Theorem 6 in Freund and Schapire (1997) gives that the probability of misclassification by a $J$-step boosted classifier, $P(Y \neq C_{\mathrm{Boost},J}(X))$, can only decrease as $J$ increases. Separately, Biau et al. (2008) gives conditions under which some randomized RF-like and majority vote averaging classifiers achieve the minimal Bayes risk. (These are not pure RF's because they ignore the decorrelation and choose split points randomly.) Moreover, since we are using trees here, and trees are a very rich class of nonparametric function estimators (in this case classification functions), it is safe to assume that there are trees that are arbitrarily close to the Bayes classifier even if they are not the same as studied in Biau et al. (2008).

Suppose a Bayes optimal classifier $C_B(x)$ exists, i.e., there is a classifier $C_B$ that achieves the minimal misclassification error, $\arg\min_{C\in S} P(Y \neq C(X))$, where $S$ is the set of essentially all classifiers. Then,
$$C_B(x) = \arg\max_{r\in\{-1,1\}} P(Y = r\mid X = x)$$
for each $x$. We begin with a result that initiates a boosting procedure with a Bayes optimal classifier. Of course, we find that the boosting procedure is unable to improve an optimal classifier. This is no surprise. Indeed, it is an artificial hypothesis – why boost an optimal classifier? However, we will use this result in Corollary 4.1 below and remove the hypothesis.

Theorem 4.1

Suppose a Bayes classifier is used as $C_0$ in a boosting procedure. Then, on average, in the limit of increasing $n$, the weights $\beta_j$ in (30) are identical positive constants, for appropriately chosen $G_j$'s.

Proof.

It is easy to see that the first step of the boosting procedure gives
$$\mathrm{err}_1 = \frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{C}_1(x_i)) \to E_{(X,Y)}\left(I(Y \neq \hat{C}_1(X))\right) \to P(Y \neq C_B(X))$$
because of the minimum in (29), provided $n \to \infty$ and $G_1 = G_{1,n}$ is increasing, by invoking Theorem 6 of Freund and Schapire (1997). So, as $n \to \infty$,
$$\beta_1 = \log\left(\frac{1 - \mathrm{err}_1}{\mathrm{err}_1}\right) \to \log\left(\frac{1 - P(Y \neq C_B(X))}{P(Y \neq C_B(X))}\right).$$
Now $w_i \to e^{\beta_1}$ if $y_i \neq \hat{C}_1(x_i)$ and $w_i \to 1$ if $y_i = \hat{C}_1(x_i)$ for $i = 1, \ldots, n$. Also, since $P(Y \neq C_B(X)) < 1/2$, we have $e^{\beta_1} > 1$. So, the first iteration $w_i$'s are derived from the $\beta_1$'s and the number and indices of the misclassifications. Out of $n$ data points there will be asymptotically $nP(Y \neq C_B(X))$ misclassifications and they will occur randomly over the $n$ data points.

Thus, from examining (30), any instance of the distribution $D_1$ randomly permutes the locations of the $e^{\beta_1}$ and ‘1’ entries, but the number of each type will be asymptotically constant. So, the first step optimization will again, on average, lead to the Bayes classifier: the misclassifications of the Bayes classifier are randomly located and spread uniformly over the occurrences of the $e^{\beta_1}$ and ‘1’ entries. Therefore the output $\hat{C}_1$ is on average the same as the initial classifier $C_0(x) = C_B(x)$. The only way another classifier could improve on $C_B$ would be to have fewer misclassifications on the indices $i$ that had weight $e^{\beta_1}$ rather than ‘1’, which is impossible (on average) because the locations are random.

The same reasoning applies to step j = 2. We get that the same proportion of observations have the weights \(w_i\to e^{\beta_1}\) and '1'. Hence, at the end of this step we still get that, as \(n\to\infty\),
\[
\beta_2=\log\Big(\frac{1-\mathrm{err}_2}{\mathrm{err}_2}\Big)\rightarrow \log\Big(\frac{1-P(Y\neq C_B(X))}{P(Y\neq C_B(X))}\Big)
\]
on average, with \(w_i\to e^{\beta_2}=e^{\beta_1}\) if \(y_i\neq \hat{C}_2(x_i)\), and \(w_i\to 1\) if \(y_i=\hat{C}_2(x_i)\) for \(i=1,\ldots,n\). Again, \(e^{\beta_2}>1\) with the number of misclassifications the same as before and the locations of the misclassifications permuted randomly. Hence \(D_2\) is unchanged on average and \(\hat{C}_2\) is essentially \(C_B\), as before.

If we continue this process for steps \(j=3,\ldots,J\), we have in the limit
\[
\beta_1,\beta_2,\ldots,\beta_J\rightarrow \log\Big(\frac{1-P(Y\neq C_B(X))}{P(Y\neq C_B(X))}\Big)>0,
\]
as \(n\to\infty\), on average in \(P_{X,Y}\).
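As a sanity check on this limit, the following small simulation (ours, not from the paper) draws data from a DG whose Bayes classifier and Bayes error \(p=P(Y\neq C_B(X))\) are known, and confirms that the AdaBoost weight computed from the empirical error of \(C_B\) is close to the constant \(\log((1-p)/p)\); the DG and the value p = 0.2 are hypothetical.

```python
# Simulation of the Theorem 4.1 limit: with C_B as the classifier, the
# empirical error tends to p < 1/2, so beta = log((1 - err)/err) tends to
# the positive constant log((1 - p)/p). The DG below is hypothetical.
import numpy as np

rng = np.random.default_rng(1)
p, n = 0.2, 100_000            # assumed Bayes error and sample size

x = rng.normal(size=n)
bayes_pred = np.sign(x)        # C_B(x) = sign(x) for this DG
flip = rng.random(n) < p       # misclassifications at random locations
y = np.where(flip, -bayes_pred, bayes_pred)

err = np.mean(y != bayes_pred)       # empirical err_1 for C_1 = C_B
beta = np.log((1 - err) / err)       # AdaBoost weight
print(beta, np.log((1 - p) / p))     # both approximately 1.386
```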

Now, we get a corollary by initializing a boosting procedure with a ‘weak’ classifier.

Corollary 4.1

Let \(C_0(\cdot)\) be a weak classifier with \(P(Y\neq C_0(X))<1/2\) (i.e., as a classifier, \(C_0(X)\) is better than a coin toss). Then, as \(n\to\infty\), the output of the AdaBoost algorithm is a majority vote of a sum of trees, i.e., a 'forest'.

Remark

The output of AdaBoost is only a forest, a collection of trees, not a random forest. In fact, the output of boosting is the sign of a weighted sum of trees. As seen above, this is a tree. That is, as a weighted sum of functions, the majority vote of the individual trees from boosting is the same as the output of a single tree. In the same spirit, a random forest is a weighted sum of trees and therefore can be represented as a single tree if desired. Since AdaBoost and RF's are good – essentially Bayes classifiers – we expect the two to be close to each other as functions of their inputs. Even though the tree from boosting is an RF-like classifier, not actually an RF, we can still compare it to an actual RF classifier as in (27).

Proof.

Recall that the boosting classifier is asymptotically Bayes optimal because it is a greedy approximation to the relative classification rate; see Theorem 3.5 of Le and Clarke (2018), cf. Theorem 6 in Freund and Schapire (1997). So, for J large enough, the iterates from the boosting procedure are asymptotically Bayes optimal. Thus, by Theorem 4.1, the \(\beta_j\)'s converge to the same positive constant and, for large enough J, the latter terms in BSTC will dominate to give the limit. That is,
\[
\mathrm{BSTC}(x)=\mathrm{sign}\Big(\sum_{j=1}^J \beta_j \hat{C}_j(x)\Big)\approx \mathrm{sign}\Big(\sum_{j=1}^J \hat{C}_j(x)\Big)=\text{majority vote of } \{\hat{C}_j(x)\}_{j=1}^J=\mathrm{RF}(x),
\]
where RF is a sum of trees and the approximation improves as \(n\to\infty\).

In view of (27), the Corollary means that if BSTC is a good approximation to RF's asymptotically, then RF's will be well approximated by an additive logistic regression model under the exponential loss. This agrees with the result in Le and Clarke (2018) showing that the risks of RF's and boosting converge to the minimal Bayes risk. Furthermore, we agree with Mease and Wyner (2008) and offer an argument supporting their claim that boosting does not overfit, since RF's are asymptotically equivalent to boosting, at least in a misclassification sense, and RF's do not overfit, see Breiman (2001b). The new part here is using BSTC as an interpretation for RF's and noting the cost of interpretation. A small empirical check of the Corollary is sketched below.
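The following rough check (ours; the data set and settings are hypothetical) compares, on synthetic data, the AdaBoost classifier (a weighted vote of trees), the unweighted majority vote over the same trees (the 'forest' of Corollary 4.1), and an actual RF; when all three classify well, the agreement rates should be high.

```python
# Empirical sketch of Corollary 4.1: weighted vs unweighted votes of the
# boosted trees, and both vs a genuine random forest, on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
y = 2 * y - 1                                   # relabel classes to {-1, +1}

boost = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Unweighted majority vote over the boosted trees.
votes = np.array([tree.predict(X) for tree in boost.estimators_])
majority = np.where(votes.sum(axis=0) >= 0, 1, -1)

print(np.mean(boost.predict(X) == majority))       # weighted vs unweighted vote
print(np.mean(boost.predict(X) == rf.predict(X)))  # boosting vs an actual RF
```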

4.2. A more empirical interpretation for RF's

By construction, RF's give a function \(f_{RF}(x_1,\ldots,x_p)\). So, suppose we also have a collection of functions \(f_1,\ldots,f_K\) that we want to use as a way to 'interpret' \(f_{RF}\) by regression. Conditional on the data and the \(f_k\)'s, solving
\[
\hat{w}=\operatorname*{arg\,min}_{w=(w_1,\ldots,w_K)}\sum_{i=1}^n\Big(f_{RF}(x_i)-\sum_{k=1}^K w_k f_k(x_i)\Big)^2 \tag{32}
\]
gives an approximation of \(f_{RF}(x)\),
\[
\hat{Y}_{RF}=\sum_{k=1}^K \hat{w}_k f_k(x). \tag{33}
\]
The \(\hat{w}_k\)'s may be found by using a standard least squares approach treating the \(f_{RF}(x_i)\)'s as the \(Y_i\)'s and the \(f_k(x_i)\)'s as K explanatory variables. More generally, since \(x_i=(x_{i1},\ldots,x_{ip})\), the \(f_k\)'s can be regarded as the leading terms in a basis expansion, e.g., a Taylor expansion in p dimensions, cf. Section 2.2. It is easy to see that replacing BSTC in (27) by the right side of (33) gives the cost of interpreting \(f_{RF}\) in terms of its regression function (using the \(f_k\)'s) by the residuals from (33). Whether the residuals are satisfactorily small can be assessed by a variety of established techniques. A sketch of this procedure is given below.
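A minimal sketch of (32)–(33) (our illustration; the DG, the RF settings, and the basis \(f_k\)'s – an intercept, the coordinates, and their squares – are hypothetical choices):

```python
# Sketch of (32)-(33): regress the fitted RF function onto interpretable
# basis functions f_k and use the residuals as the cost of interpretation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 1000, 3
X = rng.uniform(-1, 1, size=(n, p))
y = X[:, 0] ** 2 + np.sin(np.pi * X[:, 1]) + rng.normal(scale=0.1, size=n)

f_RF = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
z = f_RF.predict(X)                    # treat f_RF(x_i) as the responses

# Design matrix whose columns are f_1(x_i), ..., f_K(x_i).
F = np.column_stack([np.ones(n), X, X ** 2])
w_hat, *_ = np.linalg.lstsq(F, z, rcond=None)    # solves (32)
residuals = z - F @ w_hat                        # residuals from (33)
print(np.sqrt(np.mean(residuals ** 2)))          # size of the interpretation cost
```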

Expression (32) can also be phrased in terms of logistic regression, which may be more appropriate for a classifier; just replace the inner sum in (32) by the corresponding expression from a logit in a selection of variables such as the \(f_k\)'s and again minimize the error over choices of the parameters. This is not hard, but we have not carried it out, for ease of exposition. The logistic variant can be sketched as follows.
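A sketch of that variant, under the same hypothetical choices of the \(f_k\)'s:

```python
# Logistic-regression variant: interpret an RF classifier by fitting a logit
# of its predicted labels on the f_k's; the disagreement rate is the cost.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 1000, 3
X = rng.uniform(-1, 1, size=(n, p))
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)   # hypothetical DG

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
rf_labels = rf.predict(X)

F = np.column_stack([X, X ** 2])                 # the f_k's, as before
logit = LogisticRegression(max_iter=1000).fit(F, rf_labels)
print(np.mean(logit.predict(F) != rf_labels))    # cost of the logistic interpretation
```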

5. Discussion

In this paper we have examined three classes of predictors – kernel methods, Shtarkov solutions, and random forests – and shown that, despite the inability to interpret them, they can be asymptotically approximated both theoretically and pragmatically by interpretable expressions. In each case we have given an expression that quantifies the cost of approximating the ideal but uninterpretable predictor by its interpretable expression. Consequently, up to approximation error, the hitherto uninterpretable expressions that for the most part did not permit physical inference have been manipulated into forms from which physical inference may be possible.

For the sake of completeness, it is important to discuss conformal prediction; see Shafer and Vovk (2008) for a comprehensive overview based on Vovk et al. (2005). Applications to exponential families and generalized linear models are given in Eck and Crawford (2019) and recent computational progress is given in Vovk et al. (2020). Essentially, conformal prediction assumes sequential data and that future data will resemble past data, i.e., the DG has enough stability that prediction is feasible.

Conformal prediction can also be regarded as an extension of Geisser (1975) by including the assumptions necessary for future data to look like past data and taking a probabilistic or stochastic processes approach to data analysis. At root is a non-conformity measure used to assess how close data points are; different non-conformity measures lead to different prediction regions. Thus, this framework has much in common with calibration and prequentialism, see Dawid (1984), and is more general than time series (Box–Jenkins or state space models). On the other hand, much contemporary sequential data will not fit into this framework. In any event, our work here is broadly consistent with conformal prediction even though our emphasis is on interpretability more than the stochastic properties of the DG. A minimal sketch of the split-conformal construction is given below.
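For concreteness, here is a minimal split-conformal sketch (a standard construction, not specific to this paper), using absolute residuals as the non-conformity measure and a hypothetical DG; it wraps an arbitrary point predictor into an interval predictor with approximate \(1-\alpha\) coverage under exchangeability.

```python
# Split-conformal prediction: calibrate absolute residuals on held-out data
# to turn a point predictor into an interval predictor (~90% coverage).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.2, size=500)

train, calib = slice(0, 250), slice(250, 500)
F = RandomForestRegressor(random_state=0).fit(X[train], y[train])

scores = np.abs(y[calib] - F.predict(X[calib]))  # non-conformity scores
alpha = 0.1
q = np.quantile(scores, np.ceil((1 - alpha) * (250 + 1)) / 250)

x_new = np.array([[0.2, -0.4]])
pred = F.predict(x_new)[0]
print(pred - q, pred + q)                        # prediction interval I(x_new)
```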

Recall that our starting point was interpretability, and in Section 1 we proposed a definition for it. We did not distinguish between constructing an interpretable model pre-data and deriving an interpretable model post-data, although it is clear the latter will generally be better justified if it is derived from an input–output relation that predicts well. Then, we took linear combinations of variables as the paradigm case for interpretable quantities. We did this in Sections 2.3 and 4.2. In Section 3.3 we allowed a broader interpretation – the interactions of variables and parameters were not linear but were closely enough related that the effects could be readily seen. That is, in all three cases, we implicitly identified the \(c_k\)'s with variables and parameters whose combination was explicit and whose properties could be queried.

An anonymous referee asked (i) whether deep neural networks (DNN's) are interpretable and, if so, whether they could be used in place of the combinations of variables and parameters in Sections 2.3 and 4.2, and (ii) whether reinforcement learning might provide an alternative interpretation to that in Section 3.3 for the Shtarkov solution. The referee also observed that there are cases where DNN's may perform better than kernel methods and RF's, and implicitly recognized that the Shtarkov solution, possibly with side information, is similar to reinforcement learning. Thus, if DNN's and reinforcement learning are interpretable, why not simply start with them and ignore these other methods?

First, the interpretability of DNN's is problematic. Defining a clear physical correlate for nodes, layers of nodes, types of layers of nodes, and connectivity will often be elusive because DNN's are usually overparametrized and may be mathematically distinct even as they have very similar numerical properties as input–output relations. Some theorists have suggested that a DNN can be partitioned so that modules within it may have physical meaning or that reducing the number of neurons layer by layer might be akin to finding summary statistics. However, these suggestions remain conjectures. In short, it is not clear that DNN's are interpretable according to the definition used here even if sensitivity analyses can be used to understand the effect of parameters and variables on each other – a difficult task in practice for all but the smallest neural networks. If this holds then the kind of analysis done earlier, for kernel methods, say, to derive an interpretable approximation would have to be done for DNN's in order to assess what the DNN might have to say about a DG. That is, we are led to infer that the better performance of DNN's over other methods could be a result of their flexibility and associated non-interpretability, at least partially. Often, as a model becomes more complex or more general it becomes less interpretable. This does not contradict our basic assertion that interpretation has a cost in terms of prediction.

As to reinforcement learning, this is usually regarded as a sequential decision process in the context of discrete time, discrete space Markov processes. While transitions and actions may be interpretable, the analogy between reinforcement learning and the Shtarkov solution is not tight: The Shtarkov solution does not assume any distribution on the sequence of data points and arises from a minimization of regret, while reinforcement learning finds an optimal action for each transition. It is reasonable to conjecture that some version of reinforcement learning will provide an approximation to the Shtarkov solution in some settings, but investigating this is beyond our present scope.

Finally, we draw three implications from our results. First, from a pragmatic standpoint, we are arguing that, as a generality, models are at best only approximately true and the degree of approximation is usually unassessable. Proposed models can routinely be discredited by searching a more general class of predictors to get measurably better prediction. How can one assert a model is ‘true’ if its predictions can be improved? The consequence is that the other inferences from models taken as ‘true’ must be seen as unreliable absent further validation and assessment of their degree of mis-specification. In particular, physical interpretations are tentative at best.

Model mis-specification is an extensively studied topic, see Walker (2013), especially Sections 5 and 6, and the ensuing discussion. These authors generally focus on what we have called M-complete problems and take this as being the typical setting for modelling and analysis. Accordingly, these authors try to characterize the inferential difference between whatever model is used and the unknown but true model. Two recurring questions are: (1) Given that the true model class is unobtainable, what is it that we are making inferences about? and (2) Since the degree of mis-specification is important, how should we compare proposed models? These are addressed in a variety of ways by O'Hagan (2013), Hoff and Wakefield (2013) and De Blasi (2013).

Second, here we offer answers, perhaps unsatisfying, to these two questions. We argue that inferences should be about the next outcome, i.e., prediction, and that we should compare proposed models by how close the predictors they give are to the best predictor we can find. That is, prediction is the paradigm statistical problem, not estimation, interpretability, or other goals. Unless we have achieved good prediction there is no particular reason to trust other inferences. After all, from the falsification principle, it is unclear how to discredit estimates or the result of tests except by repeating an experiment, which is rarely done. The current term for this longstanding problem is the ‘replicability crisis’. At root, requiring optimal prediction is a solution to problems with replicability: No predictive validation implies no valid inferences of any other sort.

Finally, we suggest as a default that experimenters achieve good prediction via optimal uninterpretable methods and then adapt the results (as much as they dare) to make them interpretable. The loss of predictive power as a consequence of constructing a physical interpretation can then be quantified and its cost assessed. In this way a simplified, and possibly interpretable, model may be validated up to a measurable degree of predictive loss. The cost of interpretability is then a limit on the validity of the model much as a standard error is a limit on the certainty of an estimate. We conclude with what might be called the prediction principle: The degree of predictive success a method has determines the reliability of any interpretation that rests on it.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

Appendices

Appendix 1.

Details of some proofs

Proof of Corollary 2.1.

To see (7), recall Theorem 2.1 gives
\[
[\hat{Y}_{\mathrm{rep}}(x)-f_0(x)]^2\xrightarrow{P}0 \quad \text{as } n\to\infty. \tag{A1}
\]
More explicitly, given \(D\), using (3) the left-hand side of (A1) is
\[
\Big[\sum_{i=1}^n \alpha_i K(X_i,x)-f_0(x)\Big]^2, \tag{A2}
\]
where \(\alpha_i=\alpha_i(D)\). Cauchy–Schwarz gives that (A2) is bounded by
\[
2\Big[\sum_{i=1}^n \alpha_i K(X_i,x)\Big]^2+2f_0^2(x)
\le 4\Big[\sum_{i=N}^n \alpha_i K(X_i,x)\Big]^2+4\Big[\sum_{i=1}^{N-1}\alpha_i K(X_i,x)\Big]^2+2f_0^2(x)
\]
\[
\le 4(n-N+1)\Big(\sum_{i=N}^n \alpha_i^2\Big)\Big[\frac{1}{n-N+1}\sum_{i=N}^n K^2(X_i,x)\Big]+4\Big[\sum_{i=1}^{N-1}\alpha_i K(X_i,x)\Big]^2+2f_0^2(x). \tag{A3}
\]
Next, we show that the term \((1/(n-N+1))\sum_{i=N}^n K^2(X_i,x)\) on the right-hand side of (A3) is uniformly integrable. Begin by noting that Jensen's inequality gives
\[
\Big(\sum_{i=N}^n a_i\Big)^{1+\epsilon}\le (n-N+1)^{\epsilon}\Big(\sum_{i=N}^n a_i^{1+\epsilon}\Big), \tag{A4}
\]
for any \(\epsilon>0\) and \(a_i\ge 0\), \(i=N,\ldots,n\). (Set \(\varphi(x)=x^{1+\epsilon}\).) Now, by (A4),
\[
\sup_n E\Big[\frac{1}{n-N+1}\sum_{i=N}^n K^2(X_i,x)\Big]^{1+\epsilon}
\le \sup_n \frac{1}{(n-N+1)^{1+\epsilon}}\, E\Big[(n-N+1)^{\epsilon}\sum_{i=N}^n K^{2(1+\epsilon)}(X_i,x)\Big]
\le E\big[K^{2(1+\epsilon)}(X,x)\big]<\infty,
\]
by Assumption (i). Thus, \((1/(n-N+1))\sum_{i=N}^n K^2(X_i,x)\) is uniformly integrable.

Assumption (ii) gives that, as \(n\to\infty\), \((n-N+1)\sum_{i=N}^n \alpha_i^2=o_P(1)\) and is bounded. So, the right-hand side of (A3) is uniformly integrable in \(P_{(X,Y)}\). This implies
\[
[\hat{Y}_{\mathrm{rep}}(x)-f_0(x)]^2
\]
is uniformly integrable for any x and by (A1) has limit zero in probability. Therefore,
\[
E[\hat{Y}_{\mathrm{rep}}(x)-f_0(x)]^2\to 0,
\]
as \(n\to\infty\), by Theorem 25.12 in Billingsley (2012), concluding the proof.

Proof of Theorem 2.2.

From (4), (5), and using the fact that
\[
\sup_x |f(x)|\le \max\{|\sup_x f(x)|,|\inf_x f(x)|\}\le |\sup_x f(x)|+|\inf_x f(x)|,
\]
we have, writing g for a generic element of the ball \(B(f,\delta)\),
\[
\sup_{g\in B(f,\delta)}|\hat{Q}_n(g)-Q_0(g)|
=\sup_{g\in B(f,\delta)}\Big|\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big|
\]
\[
\le \Big|\sup_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}\Big|
+\Big|\inf_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}\Big|. \tag{A5}
\]
To bound the terms on the right-hand side of (A5), since \(E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))<\infty\), note that
\[
\sup_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}
\le \frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-\inf_{g\in B(f,\delta)}E_{(X,Y)}L(Y,g(X))
\]
\[
\le \frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))
\]
\[
\le \Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
+E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))
\]
\[
\le \Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
+\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
\]
\[
+E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X)). \tag{A6}
\]

Similarly, we have
\[
\inf_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}
\ge \frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-\sup_{g\in B(f,\delta)}E_{(X,Y)}L(Y,g(X))
\]
\[
\ge \frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))
\]
\[
\ge -\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
+E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))
\]
\[
\ge -\Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
-\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
\]
\[
+E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X)). \tag{A7}
\]
Therefore, from (A6) and (A7), the first term on the right-hand side of (A5) is bounded by
\[
\Big|\sup_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}\Big|
\le \Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
\]
\[
+\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
+E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X)), \tag{A8}
\]
and similarly the second term on the right-hand side of (A5) is bounded by
\[
\Big|\inf_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}\Big|
\le \Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
\]
\[
+\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
+E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X)). \tag{A9}
\]
Combining (A5), (A8), and (A9), we get
\[
\sup_{g\in B(f,\delta)}\Big|\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big|
\le 2\Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
\]
\[
+2\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
+2\Big[E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big]. \tag{A10}
\]
It follows from the continuity of L in f and the dominated convergence theorem that the third term on the right-hand side of (A9) satisfies
\[
\lim_{\delta\to 0}\sup_{f\in\mathcal{H}_K}E_{(X,Y)}\Big[\sup_{g\in B(f,\delta)}L(Y,g(X))-\inf_{g\in B(f,\delta)}L(Y,g(X))\Big]
\le \lim_{\delta\to 0}E_{(X,Y)}\sup_{f\in\mathcal{H}_K}\Big[\sup_{g\in B(f,\delta)}L(Y,g(X))-\inf_{g\in B(f,\delta)}L(Y,g(X))\Big]=0,
\]
and hence we can choose \(\delta\) so small that
\[
\sup_{f\in\mathcal{H}_K}E_{(X,Y)}\Big[\sup_{g\in B(f,\delta)}L(Y,g(X))-\inf_{g\in B(f,\delta)}L(Y,g(X))\Big]<\epsilon. \tag{A11}
\]
Furthermore, by the compactness of \(D_j\), there exist finitely many f's, say \(f_1,\ldots,f_{N(\delta)}\), such that \(D_j\subseteq \bigcup_{i=1}^{N(\delta)}B(f_i,\delta)\). Hence, by the union of events bound,
\[
P\Big(\sup_{f\in D_j}\Big|\frac{1}{n}\sum_{i=1}^n L(y_i,f(x_i))-E_{(X,Y)}L(Y,f(X))\Big|>\epsilon\Big)
\le P\Big(\max_{1\le i\le N(\delta)}\sup_{g\in B(f_i,\delta)}\Big|\frac{1}{n}\sum_{j=1}^n L(y_j,g(x_j))-E_{(X,Y)}L(Y,g(X))\Big|>\epsilon\Big)
\]
\[
\le \sum_{i=1}^{N(\delta)}P\Big(\sup_{g\in B(f_i,\delta)}\Big|\frac{1}{n}\sum_{j=1}^n L(y_j,g(x_j))-E_{(X,Y)}L(Y,g(X))\Big|>\epsilon\Big).
\]
Using (A10) and (A11), this expression is bounded by
\[
\sum_{i=1}^{N(\delta)}P\Big(\Big|\frac{1}{n}\sum_{j=1}^n \sup_{g\in B(f_i,\delta)}L(y_j,g(x_j))-E_{(X,Y)}\sup_{g\in B(f_i,\delta)}L(Y,g(X))\Big|
+\Big|\frac{1}{n}\sum_{j=1}^n \inf_{g\in B(f_i,\delta)}L(y_j,g(x_j))-E_{(X,Y)}\inf_{g\in B(f_i,\delta)}L(Y,g(X))\Big|>\frac{\epsilon}{2}\Big)
\]
\[
\le \sum_{i=1}^{N(\delta)}P\Big(\Big|\frac{1}{n}\sum_{j=1}^n \sup_{g\in B(f_i,\delta)}L(y_j,g(x_j))-E_{(X,Y)}\sup_{g\in B(f_i,\delta)}L(Y,g(X))\Big|>\frac{\epsilon}{4}\Big)
+\sum_{i=1}^{N(\delta)}P\Big(\Big|\frac{1}{n}\sum_{j=1}^n \inf_{g\in B(f_i,\delta)}L(y_j,g(x_j))-E_{(X,Y)}\inf_{g\in B(f_i,\delta)}L(Y,g(X))\Big|>\frac{\epsilon}{4}\Big),
\]
which goes to 0 as \(n\to\infty\) by the weak law of large numbers, concluding the proof by Assumption (iii).

Appendix 2.

Interpretability versus complexity

Interpretability is a different concept from complexity. We have implicitly assumed that the best predictors (and models) are highly complex, but we regard this as the most common case in current practice, not a priori true. In point of fact, an interpretable model may be simple or complex and an uninterpretable model may be simple or complex. Otherwise put, all pairs of (interpretability, complexity) can occur. As a generality, M-closed problems are less complex than M-complete problems and they in turn are less complex than M-open problems. However, this ordering does not in general preclude the existence of an interpretable predictor or model for any problem.

Here, we use the notion of interpretability in Le and Clarke (2020). On the other hand, here complexity refers to how many components are required for good prediction or modelling. This is different from other notions of complexity, such as VC dimension or code length, since these do not require any components of a predictor or model to have physical correlates.

Nevertheless, as an empirical observation, we have noted that the more interpretability one demands of a model, the more complex it will typically be and, often, the more complex the true model is, the higher the model mis-specification will be. Correspondingly, the less interpretability one requires, the smaller the error of the predictor can be, but it is not clear what effect this has on complexity. Empirically, the predictions from an interpretable model are likely to be worse than the predictions from a well-chosen non-interpretable model or predictor. This arises for the intuitive reason that restricting model classes to only those that are interpretable is likely to increase bias, because real-world phenomena are rarely captured to infinite precision by what we think are physically meaningful models. Because interpretable models may be simplifications of the real phenomena, we expect some of the bias to be reflected in increased variance as well. The exception is when the model actually is valid to infinite precision, or at least to a precision higher than that achieved by other models; this can happen but is atypical. These observations are neither new nor surprising. The issue is to quantify them, to ascertain how much of an interpretation one can derive from an uninterpretable model without losing too much accuracy of prediction.