Interpreting uninterpretable predictors: kernel methods, Shtarkov solutions, and random forests

Pages 10-28 | Received 18 Mar 2020, Accepted 08 Aug 2021, Published online: 08 Sep 2021

Abstract

Many of the best predictors for complex problems are typically regarded as hard to interpret physically. These include kernel methods, Shtarkov solutions, and random forests. We show that, despite the inability to interpret these three predictors to infinite precision, they can be asymptotically approximated and admit conceptual interpretations in terms of their mathematical/statistical properties. The resulting expressions can be in terms of polynomials, basis elements, or other functions that an analyst may regard as interpretable.

1. Introduction

Fundamentally, point prediction is an input–output relation. Given a pair of related sequences $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$, where each $y_i$ is an outcome of some $Y_i$, the predicted value for $y_i$ is $\hat{y}_i = F_i(x_i)$, in which $F_i$ is typically chosen using the earlier $x_i$'s and $y_i$'s, i.e., $x_{i-1}, \ldots, x_1$ and $y_{i-1}, \ldots, y_1$. An extra set of burn-in data may be used to choose $F_i$ and there may be added complexities from side information. We may put many sorts of desiderata on the $F_i$'s – low predictive error, simplicity, even insisting each $F_i \in \mathcal{F}_i$ for some set of functions $\mathcal{F}_i$. However, the point predictor $F_i$ is merely a function – a way to convert an input $x$ to an output $y$.

Interval predictors are somewhat more complex: they give a prediction interval, say $I_i$, with a pre-assigned probability that the event $\{Y_i \in I_i\}$ will occur. Thus they have the same input but a different output. Regardless of any further desiderata we might impose, such as optimality criteria, interval predictors (or their generalization to regional predictors) remain input–output relations. This may be regarded as an example of conformal prediction, see Vovk et al. (2005), as we discuss later in Section 5.

By contrast, modelling is a conceptually different process. In principle, a statistical modeler proposes a model, say $Y = F(x) + \epsilon$ for simplicity, to be true and has a collection of terms, say $t_j(x)$ for $j = 1, \ldots, m$, that may be part of an additive model. The modeler uses the data to choose terms and the end result is a model something like $Y(x) = \sum_{j=1}^{q} \hat{t}_j(x) + \epsilon$, where the $\hat{t}_j$'s are the same as the $t_j$'s apart from estimating some coefficients. The modeler then asserts that the model reflects reality by ensuring that each component has a correlate in reality: each $t_j$ means something physical, there is a reason that the terms are added, and it has been verified that the error $\epsilon$ represents intrinsic variability rather than small terms that have been ignored, i.e., bias. In this case, the model gives the point predictor $\hat{Y}(x) = \sum_{j=1}^{q} \hat{t}_j(x)$ and it is assumed that any estimates in $\hat{t}_j$ are sufficiently close to their true values that the model is ‘good’ – not readily falsifiable. Note that even though $\hat{Y}$ is an estimator of $F$, we focus on how well it predicts $Y$.

What are the differences between these two approaches? First, a predictor is just a mathematical construct to match the output from a data generator (DG). It has no greater significance. All that matters about $\hat{Y}$ is how close $\hat{Y}(x)$ is to $Y = y$, i.e., how well using $\hat{Y}$ lets someone predict $Y$. The quality of the prediction is usually measured formally, e.g., by some cumulative error such as the prediction sum of squares
$$\mathrm{PRESS}_n = \sum_{i=1}^{n}\left(y_i - \hat{Y}_{-i}(x_i)\right)^2,$$
where the subscript $-i$ indicates that data point $(x_i, y_i)$ was not used to form $\hat{Y}_{-i}$. However, the point remains: $\hat{Y}$ predicts well, adequately, or poorly according to how we assess predictive performance. While there is nothing more that must necessarily be said, there is much that can be said – how the predictor was found, what its properties are, etc. – and some aspects of these will be discussed for the predictors here.
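
For concreteness, here is a minimal Python sketch of this leave-one-out computation; the linear predictor and the simulated data are illustrative assumptions only, not part of the predictors studied below.

```python
# Minimal sketch: computing PRESS_n by leave-one-out refitting.
# The linear predictor here is illustrative; any learner with
# fit/predict methods could be substituted.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(50, 1))
y = 2.0 * x[:, 0] + rng.normal(scale=0.1, size=50)

press = 0.0
for i in range(len(y)):
    mask = np.arange(len(y)) != i            # drop data point i
    model = LinearRegression().fit(x[mask], y[mask])
    y_hat_i = model.predict(x[i:i + 1])[0]   # predict the held-out point
    press += (y[i] - y_hat_i) ** 2

print(f"PRESS_n = {press:.4f}")
```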

A clear statement of the basic problem studied here can be found in Geisser (1975), who focused on ‘low structure’ data and treated it strictly empirically, i.e., with minimal discussion of abstract constructs such as distributions. This is analogous to what we call complex data – data about which it is hard to make any strong assumptions. Geisser's (1975) treatment was prescient in that he clearly expressed many of the ideas here in elementary settings such as ANOVA, regression, and posterior means. Our work goes beyond this by examining techniques that were not available in 1975 and we do so from a contemporary conceptual standpoint, not just empirically (although that is clearly no less important).

The notion of ‘interpretability’ used here is the same as that used in Le and Clarke (2020). Briefly, a model $M$ consists of $K$ components, say $M = \{c_1, \ldots, c_K\}$. The $c_k$'s are the components that go into the formulation of a model, such as variables, parameters, and rules for how they are to be combined to produce a model. The model $M$ is interpretable if and only if each $c_k$ has a physical correlate, i.e., they correspond to some identifiable and measurable feature of the DG. We say a model is valid if and only if it is interpretable and correct at least to the degree that its predictions and future outcomes are sufficiently close.

In this setting, the key question in modelling is how well the terms in $\hat{Y}$ encapsulate the components of the DG. That is, what aspect (or $c_k$) of the DG does a specific $t_j$ represent and how accurately? In short, the subcomponents of a model matter because it is hoped they have physical correlates. The model is meant to match reality in that its components can be interpreted physically in the context of the DG. Otherwise put, a model can be falsified by falsifying one of its components. Thus, it makes sense to ask if a model is ‘true’ – it being understood that ‘true’ may only be in a provisional sense. A better model may be found that discredits the earlier model, and science proceeds by sequentially falsifying ever better models, hopefully arriving at a model that is either non-falsifiable or so close to true (in the absolute sense) that it isn't worth the trouble to falsify. In either case, there is the idea of a model being true that has no genuine analog for predictors. The closest analog would be for a predictor to be optimal (within a class) but this is not part of measurable, objective reality.

The main link between predictors and models is that models regularly provide predictors while predictors do not in general lead to models – at least not directly. (Indeed, models that do not make measurable predictions are not valid models as they are not falsifiable.) Thus, we may speak of model-induced predictors and non-model-induced predictors. Loosely, the class of predictors is very much larger than the class of models. So, one way to find good models is to find a good predictor and determine a model that performs almost as well in terms of prediction. That way, there is a physical interpretation even if some predictive power is lost.

Ideally, we want a model that is not falsified (or at least is very hard to falsify) and that gives an extremely good predictor. Sometimes this occurs, but often it does not, especially for complex problems. That is, a really good predictor often outperforms the predictor generated from a provisionally true model and does so by a substantial margin. In earlier work, we gave examples of this; see Le and Clarke (2020). That is, we showed how certain interpretable predictors, linear models in particular, could be modified to give improved predictions. The modifications were chiefly to introduce non-interpretable features into the model-induced predictor. The end result was a predictor that had some interpretable and some non-interpretable components, and gave demonstrably improved prediction over the model-induced predictor. We called the difference between the error of using the partially interpretable predictors and that of the model-induced predictors the cost of modelling. Intuitively, the flexibility from the loss of full interpretability enabled improved prediction.

Here we examine the same problem but from the reverse direction. That is, we start with uninterpretable predictors and find interpretable models that are close to them. The interpretable models do not in general perform as well as the uninterpretable predictors since the latter are the result of optimization. Thus, non-interpretable (but optimal) predictors can lead to approximate models that may be examined to see how they relate to physical components of a DG. This is another way to formalize the cost of modelling, or interpretability, in terms of the loss of predictive accuracy because the approximate model may still say something useful about the DG.

One implication of this reasoning is that the principle of falsification may have to be reconsidered. Falsifiability is the assertion that any conjectured model must be disprovable before it can become accepted. The problem is that if the predictions from essentially every model for a DG can be improved, then essentially every model is flawed and can be discredited. We are not actually uncovering truth. Accordingly, the principle of falsification may itself be ‘false’ in the sense that, since every model is disprovable, disprovability can only be used to discriminate better models from worse ones, not to arrive at a model that can be generally accepted – at least not often enough that it is useful as a foundational philosophical principle.

A further complication is the concept of M-closed, -complete, and -open problems; see Bernardo and Smith (2000) for the original definitions. These are defined by the location of a class of proposed models or predictors relative to the DG. Here we say that an M-closed problem is one in which the DG is exactly one of a finite list of explicit candidate models. The problem is therefore merely selecting the one that matches the data generator. In the Bayesian case, this also means that the prior is well-defined in the sense that the prior probability of a model represents the pre-experimental belief that the model is true.

An M-complete problem is one in which the DG has a true model but it is inaccessible in some sense. For instance, it might be too complicated to formulate. There may not be any closed form expression for it, so it can only be approximated numerically. The true model might be so complicated that it is unrealistic to learn much about it from data that might realistically be gathered. Indeed, the true model may be so complex that no approximation to it is adequate even if a serviceable one can be found under restrictions. The main point is the model exists so that, for instance, expectations and convergences are well defined but any properties of it are problematic and uncertain. In this case, the prior probabilities are not that a model is true but rather that it is close to the true model given the model list. (Interpreting the prior as weights on actions in a decision theory problem is also possible; see Le and Clarke (2016b) for a brief discussion of this.)

The two problem classes contrast sharply with the M-open problem class in which no model for the data generator can be assumed to exist. Hence, expectations and expressions related to the form of a model, e.g., modes of convergence, do not make sense and the meaning of a prior is unclear unless it is taken as a belief that a given predictor, possibly model-based, will perform better than another predictor within the class of predictors under study. In the M-open problem class, modelling makes little sense; we are essentially left only with predictors and their properties. Thus, a model for a DG in the class really means the predictor the model generates because the model has no necessary meaning. A predictor may be examined to learn something about the DG, such as the relevance or irrelevance of a variable, but detailed knowledge in the sense of a complete set of correlates for a DG is unobtainable.

A key point is that the existence of M-open problems undermines the principle of falsifiability. If there is no true model and predictors can only be evaluated in terms of how well they perform on a relative basis, then the principle of falsifiability is irrelevant to many modern complex problems. That is, in many settings, all models are wrong and hence already falsified. They are not useful either except insofar as they give a good predictor that may or may not say anything about the DG. So, falsifiability per se is often merely a distraction from good prediction.

Indeed, many of the most important data sets currently being or recently gathered were not generated by a DG that admits a model, or, more precisely, were not generated by a mechanism that has anything stable or identifiable enough to model effectively. However, as long as there is something to measure we can make a guess as to its next value. Moreover, for these situations, predictors derived from model averages are often found to be better than their individual components, regardless of whether the models in the average are ‘true’ or merely regarded as the predictors they generate. There are other classes of predictors that also do not generally admit physical interpretations yet have clear predictive utility. Here, we study three classes of such predictors.

Our goal is to show that it is possible to provide interpretations for predictors that are generally uninterpretable. Specifically, kernel methods, the Bayes Shtarkov solution, and random forests are predictors for complex (M-complete or M-open) problems that are typically regarded as hard or impossible to interpret. We develop two types of interpretation for each of them. One is an interpretation in the usual sense of finding physical correlates, usually approximately, for components of the predictors. The other is a theoretical characterization of these predictors in terms of concepts to help guide their use.

In Section 5 we summarize our findings by stating what we call the prediction principle, which we propose should be added to the falsification principle. This is separate from and in addition to the celebrated prequential principle, see Dawid (1984, 1992, 2010) and Dawid and Vovk (1999), which we regard as foundational.

The structure of this paper is as follows. In Section 2 we provide an interpretation for kernel methods such as relevance vector machines (RVM's) and support vector machines (SVM's) using the eigenfunctions of kernels, with a consistency result so the interpretations will be valid. In Section 3 we present a Bayes version of the Shtarkov predictor and indicate how to interpret it as a mixing of Beta distributions or in terms of a Pearson distribution. In Section 4 we show that a random forest is asymptotically equivalent to boosting, so a random forest builds an additive logistic regression model. Random forests can also be approximated in a regression sense. Some concluding remarks are made in Section 5. Longer technical proofs are in Appendix 1 and some further discussion of interpretability versus complexity can be found in Appendix 2.

2. RVM's and SVM's for regression

For a regression problem with a training data set $D = D_n = \{(y_i, x_i), i = 1, \ldots, n\}$, where $y$ is the response variable and $x$ is a covariate of dimension $p$, the goal is to find a function $f(x)$ to predict the responses $y$ in a test set. This can be viewed as a regularization problem of the form
$$\min_{f \in \mathcal{H}_K}\left[\frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda\|f\|^2_{\mathcal{H}_K}\right], \tag{1}$$
where $\mathcal{H}_K$ is a reproducing kernel Hilbert space (RKHS) with kernel $K$ and norm $\|\cdot\|_{\mathcal{H}_K}$, $L$ is a loss function, and $\lambda > 0$ is the smoothing parameter. It can be shown that (1) has a solution of the form
$$\hat{f}_\lambda(x) = \sum_{i=1}^{n}\alpha_i K(x_i, x); \tag{2}$$
see Kimeldorf and Wahba (1971). The solution $\hat{f}_\lambda$ is not a conventional model because the evaluations of $K$ do not have any necessary physical correlates and, in addition to its dependence on the parameters $\alpha_i$, $\hat{f}_\lambda$ depends explicitly on the $x_i$'s to define the functions in the sum, the number of which depends on $n$. Indeed, the optimal values of the $\alpha_i$'s are $\alpha_i = \alpha_i(D)$. Estimating the $\alpha_i$'s means finding a data-driven approximation $\hat{\alpha}_i(D)$ even though the ‘true’ value depends on $D$. That is, the data is used once to define the ‘true’ $\alpha_i$'s from the optimization (1) and then again to obtain good estimates for them. It is easiest to regard the latter estimates as converging in the sense of real numbers rather than stochastically.
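
To make the double use of the data concrete, here is a minimal Python sketch of (1)–(2) for the special case of squared-error loss, in which the $\alpha_i$'s solve the linear system $(K + n\lambda I)\alpha = y$; the Gaussian kernel, the value of $\lambda$, and the simulated data are illustrative assumptions.

```python
# Minimal sketch of the representer-theorem predictor (2)-(3) for
# squared-error loss, where alpha solves (K + n*lambda*I) alpha = y.
import numpy as np

def gaussian_kernel(a, b, w=0.5):
    # K(x, y) = exp(-(x - y)^2 / (2 w^2)) evaluated on all pairs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * w ** 2))

rng = np.random.default_rng(1)
n, lam = 40, 0.1
x = np.sort(rng.uniform(-2, 2, n))
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)

K = gaussian_kernel(x, x)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)  # alpha_i = alpha_i(D)

def y_rep(x_new):
    # Y_rep(x) = sum_i alpha_i K(x_i, x)
    return gaussian_kernel(np.atleast_1d(x_new), x) @ alpha

print(y_rep(0.3))
```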

Conditional on $D$, the (uninterpretable) representer theorem predictor is
$$\hat{Y}_{\mathrm{rep}}(x) = \sum_{i=1}^{n}\alpha_i K(x_i, x). \tag{3}$$
It is well known that RVM's and SVM's are of the form (3). Moreover, it is seen that (3) is the mode of a posterior where $L(y_i, f(x_i))$ is the exponent in an exponential family and $\lambda\|f\|^2_{\mathcal{H}_K}$ is the log of the prior on $f$. The representation (2) of $f$ is of special interest when the number of covariates $p$ is much larger than the sample size $n$, and predictors such as $\hat{Y}_{\mathrm{rep}}(\cdot)$ are often used in M-complete (and M-open) settings.

We start with a consistency result so the approximate interpretations we present will be asymptotically valid.

2.1. Consistency of the representer theorem predictor

Consider an M-complete problem and let
$$\hat{Q}_n(f) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda\|f\|^2_{\mathcal{H}_K}. \tag{4}$$
The population version of (4) is
$$Q_0(f) = E_{(X,Y)} L(Y, f(X)) + \lambda\|f\|^2_{\mathcal{H}_K}. \tag{5}$$
We can assume that, for each $i$, the $\hat{\alpha}_i = \hat{\alpha}_i(D)$'s are known from the empirical optimization of $\hat{Q}_n$ and that the true values for the $\alpha_i$'s given $D$ are fixed with limits, assuming they exist, due to the optimization of $Q_0$.

Theorem 2.1

Assume (i) $Q_0(f)$ is uniquely minimized at $f_0$, (ii) $Q_0(f)$ is continuous in $f$, and (iii) $\hat{Q}_n(f)$ converges uniformly in probability to $Q_0(f)$, i.e., $\sup_{f\in\mathcal{H}_K}|\hat{Q}_n(f) - Q_0(f)| \stackrel{P}{\to} 0$. Then
$$\hat{Y}_{\mathrm{rep}} \stackrel{P}{\to} f_0,$$
where the convergence is over independent outcomes from the distribution of $(X, Y)$.

Proof.

This is a modification of Theorem 2.1 in Newey and McFadden (1994). Since $\hat{Q}_n(\hat{Y}_{\mathrm{rep}}) \le \hat{Q}_n(f)$ for any $f$ by the optimality of $\hat{Y}_{\mathrm{rep}}$, we have that for any $\epsilon > 0$, $\hat{Q}_n(\hat{Y}_{\mathrm{rep}}) \le \hat{Q}_n(f_0) + \epsilon/3$. Also, for any $\epsilon > 0$, Assumption (iii) gives
$$Q_0(\hat{Y}_{\mathrm{rep}}) < \hat{Q}_n(\hat{Y}_{\mathrm{rep}}) + \frac{\epsilon}{3}, \qquad \hat{Q}_n(f_0) < Q_0(f_0) + \frac{\epsilon}{3},$$
with probability approaching one (w.p.a.1) as $n \to \infty$. Therefore, w.p.a.1,
$$Q_0(\hat{Y}_{\mathrm{rep}}) < \hat{Q}_n(\hat{Y}_{\mathrm{rep}}) + \frac{\epsilon}{3} < \hat{Q}_n(f_0) + \frac{2\epsilon}{3} < Q_0(f_0) + \epsilon. \tag{6}$$
For any $\delta > 0$, let
$$B(f_0, \delta) = \{f \in \mathcal{H}_K : \|f - f_0\|_{\mathcal{H}_K} < \delta\}.$$
Since $B(f_0, \delta)^c$ is closed, Assumptions (i) and (ii) give
$$\inf_{f \in B(f_0,\delta)^c} Q_0(f) = Q_0(f^*) > Q_0(f_0)$$
for some $f^* \in B(f_0, \delta)^c$.

Choosing $\epsilon = \inf_{f\in B(f_0,\delta)^c} Q_0(f) - Q_0(f_0) = Q_0(f^*) - Q_0(f_0)$, expression (6) implies
$$Q_0(\hat{Y}_{\mathrm{rep}}) < Q_0(f^*) = \inf_{f\in B(f_0,\delta)^c} Q_0(f) \quad \text{w.p.a.1},$$
and hence $\hat{Y}_{\mathrm{rep}} \in B(f_0, \delta)$. Letting $\delta \to 0$ completes the proof.

Some discussion of what Theorem 2.1 means and does not mean is important here. The mode of convergence is stochastic, in the Hilbert space norm. The objects converging are functions of the form (3) in which the $x_i$'s appear as arguments in the kernel evaluations and the $(x_i, y_i)$'s appear implicitly in the definition of the $\alpha_i$'s. The whole function (3) converges to the minimizer $f_0$. The function itself depends on $D$ through the values of the $x_i$'s and the $\alpha_i$'s. This means that the connection between (3) for one data set $D$ and another data set $D'$ is unclear. The $x_i$'s, $y_i$'s, and the sample sizes, say $n$ and $n'$, may be different. So, there is no necessary relationship between $\alpha_i(D)$ and $\alpha_i(D')$ even when $i \le \min(n, n')$ and the first $\min(n, n')$ pairs $(x_i, y_i)$ are the same for $D$ and $D'$. It may be easiest to regard increasing data sets as a sequence of problems corresponding to the accumulation of data and the convergence in the joint distribution of $(X, Y)$ as summarizing the effect of replications over the entirety of all the countably infinite data sequences. There may be further structure in the convergence of the $\alpha_i$'s and their effect on the convergence of (3) to $f_0$, but we do not treat this here.

The convergence in Theorem 2.1 is in probability. The mode can be improved to $L^2$ with some extra hypotheses, as seen in the following.

Corollary 2.1

Assume the conditions in Theorem 2.1 hold. In addition, assume:

  (i) Let $X$ be a generic random variable representing any $X_i$. Then there is an $\epsilon > 0$ so that for all $x \in \overline{\mathrm{supp}(X)}$, $E[K^{2+\epsilon}(X, x)] < \infty$;

  (ii) The sum of squares of the $\alpha_i$'s is bounded, with a tail rate, i.e., $\exists M$ so that for any $D$, $\sum_{i=1}^{\infty}\alpha_i^2(D) < M$, and $\exists N = N(n)$ so that $\sum_{i=N}^{n}\alpha_i^2 = o_P(1/(n-N))$.

Then, as $n \to \infty$, for any $x$ we have $\hat{Y}_{\mathrm{rep}}(x) \stackrel{L^2}{\to} f_0(x)$.

Proof.

To establish the result, it is enough to show that, as $n \to \infty$,
$$E\left[\hat{Y}_{\mathrm{rep}}(x) - f_0(x)\right]^2 \to 0. \tag{7}$$
This is done in Appendix A.

Again, some discussion of this result is worthwhile. First, Assumption (i) is easy to verify for most kernels $K$. However, Assumption (ii) is asymptotic and therefore hard to verify. The boundedness clause, while intuitive, may have to be enforced by restricting $(X, Y)$ to a compact set and renormalizing $P_{(X,Y)}$. The compact set can then be allowed to increase slowly while still preserving the result. The second clause of the assumption, the rate, is harder to deal with. Nevertheless, the rate assumption (and the boundedness assumption) are nearly always satisfied, at least approximately, in practice. While not a verification, the second clause can be checked by seeing how the $\alpha_j$'s perform using bootstrap samples from a given $D$. If the clause is satisfied for bootstrap samples and a range of finite $n$, then it may be reasonable to take it as true. In practice, with RVM's few of the $\alpha_i$'s are non-zero, so as a practical matter Assumption (ii) usually appears to be satisfied. Indeed, this can be seen in Tipping (2001).

A counterfactual may make Assumption (ii) less unpalatable. If the representer theorem solution were a Fourier expansion, the two clauses of Assumption (ii) would seem fairly reasonable. The first clause of Assumption (ii) would only mean that $\sum_i \alpha_i K(x_i, \cdot)$ is in the Hilbert space, because Bessel's inequality gives that the sum of squared Fourier coefficients is less than the norm of the function. (This is true for any orthonormal basis.) Also, the rate $\sum_{i=N}^{n}\alpha_i^2 = o_P(1/(n-N))$ as $n$ and $N(n)$ increase imposes a sparsity condition. It limits the collection of functions that can be well approximated because, as $n$ increases, the last $\alpha_i$'s can't be too large. That is, the true function is only being approximated by $N(n)$ evaluations of the kernel. Otherwise put, this method is only effective for functions $f_0$ that are sufficiently sparse in terms of the $K(x_i, x)$'s required to express them. The rate clause in Assumption (ii) bounds how far $f_0$ can be from the approximations $\hat{Y}_{\mathrm{rep}}$, in $L^2$, for consistency – as opposed to merely optimal approximation – to hold.

The mode of convergence in Corollary 2.1 is in $L^2$, pointwise in $x$. If a distribution $P$ is assigned to $X$, then Egoroff's theorem can be applied. It strengthens the result by giving $\hat{Y}_{\mathrm{rep}}(X) - f_0(X) \to 0$ in $L^2$ uniformly for $X \in A$, where $A$ has arbitrarily large probability under $P$. That is, the convergence is almost uniform over most of the sample space of $X$.

To complete our treatment of the consistency of $\hat{Y}_{\mathrm{rep}}$, we note that Assumption (iii) in Theorem 2.1 is hard to verify. So, we provide sufficient conditions for it.

Theorem 2.2

Assume (i) the loss function $L$ is continuous in $f$, (ii) there exists $\delta > 0$ so that for any $f \in \mathcal{H}_K$, $E_{(X,Y)}[\sup_{f' \in B(f,\delta)} L(Y, f'(X))] < \infty$, where $B(f, \delta) = \{f' \in \mathcal{H}_K : \|f' - f\|_{\mathcal{H}_K} < \delta\}$, and (iii) there exists an increasing sequence of compact subsets of $\mathcal{H}_K$, $\{D_j\}_{j=1}^{\infty}$, converging to $\mathcal{H}_K$ such that, for each fixed $n$, $\lim_{j\to\infty}\sup_{f\in D_j}|\hat{Q}_n(f) - Q_0(f)| = \sup_{f\in\mathcal{H}_K}|\hat{Q}_n(f) - Q_0(f)|$. Then
$$\sup_{f\in\mathcal{H}_K}|\hat{Q}_n(f) - Q_0(f)| \stackrel{P}{\to} 0.$$

Proof.

The proof is adapted from Theorem 6.10, the uniform weak law of large numbers, in Bierens (2005). The details are in Appendix A.

Remark

Uniform laws of large numbers emerge from empirical process theory; see Van de Geer (2000), Chapter 2, for instance. Often these results have weaker hypotheses that are harder to verify. We have used extensions of the classical law of large numbers since our goal is predictor interpretability, not the weakest conditions.

2.2. Theoretical interpretation of the representer theorem solution predictors

Under Mercer's conditions, see Mercer's Theorem in Scholkopf and Smola (2002), the kernel $K$ can be decomposed as
$$K(x_i, x) = \sum_{m=1}^{\infty}\lambda_m g_m(x_i) g_m(x), \tag{8}$$
where $\{g_m \mid m = 1, 2, \ldots\}$ is an orthonormal set of eigenfunctions of $K$ with $\int K(x, y) g_m(y)\,dy = \lambda_m g_m(x)$, $m = 1, 2, \ldots$. Thus, different kernels correspond to different orthonormal bases.
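
When no closed form for the $g_m$'s is available, they can be approximated numerically. The following Python sketch uses the Nyström method – eigendecomposing the kernel matrix on a sample and extending the eigenvectors to functions – which is a standard device we add here for illustration, not part of the development above; the Gaussian kernel and the sampling distribution are likewise assumptions.

```python
# A hedged numerical sketch: approximating the Mercer eigenfunctions
# g_m of a kernel via the Nystrom method.
import numpy as np

def kern(a, b, w=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * w ** 2))

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)   # sample from the distribution defining the integral operator

K = kern(x, x)
evals, evecs = np.linalg.eigh(K)
order = np.argsort(evals)[::-1]            # largest eigenvalues first
evals, evecs = evals[order], evecs[:, order]

def g(m, x_new):
    # Nystrom extension of the m-th eigenvector to an eigenfunction estimate
    return np.sqrt(n) / evals[m] * kern(np.atleast_1d(x_new), x) @ evecs[:, m]

print(g(0, 0.0))   # approximate leading eigenfunction evaluated at x = 0
```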

We can now use the $g_m$'s in a nonparametric regression expansion for $f_0$. Write the projection of $f_0$ onto the span of $\{g_1, \ldots, g_M\}$ as
$$r(x) = \sum_{m=1}^{M}\beta_m g_m(x) \tag{9}$$
so that estimating the $\beta_m$'s is equivalent to estimating the projection. Since the $g_m$'s are orthonormal, the optimal $\beta_m$'s in an $L^2$ projection sense are $\langle g_m, f_0\rangle$ and these can be estimated by $\hat{\beta}_m = (1/n)\sum_{i=1}^{n} y_i g_m(x_i)$ for $m = 1, \ldots, M$. For any reasonable choice of joint distribution for $(Y, X)$, the central limit theorem gives that for each $m$, there is a $\sigma_m$ so that
$$\hat{\beta}_m \approx N\left(\beta_m, \frac{\sigma_m^2}{n}\right), \tag{10}$$
asymptotically in $n$, assuming the second moments of the $\hat{\beta}_m$'s exist. The integrated squared bias of $r$ as an estimator for $f_0$ is
$$B_M = B(f_0, r) = \int (f_0(x) - r(x))^2\,dx = \sum_{m=M+1}^{\infty}\beta_m^2. \tag{11}$$
Since both $r$ and $f_0$ are in a Hilbert space, $B_M$ is finite for any $M$ and the latter $\beta_m$'s vanish, so $B_M \to 0$ as $M \to \infty$. Now, having controlled the variance of the $\hat{\beta}_m$'s and the bias of $r$, we see that
$$\hat{r}_M(x) = \sum_{m=1}^{M}\hat{\beta}_m g_m(x) \tag{12}$$
converges to $r$ as $n \to \infty$ and to $f_0$ with bias roughly $B_M$. We can let $M = M_n \to \infty$ so slowly as $n \to \infty$ that the variance remains controlled while $B_{M_n} \to 0$ as well, possibly slowly with $n$. Now, $\hat{r}$ converges to $f_0$ pointwise in $x$, i.e.,
$$\hat{r}_{M_n}(x) \stackrel{P}{\to} f_0(x) \tag{13}$$
for any $x$. More explicit formal conditions for (13) and for versions of (13) in stronger modes of convergence can be given, but that is beyond our present scope. However, we note that this argument holds for any orthonormal basis but that the $g_m$'s, being derived from $K$, are natural for this problem. Of course, if the basis elements in any orthonormal basis have a physical interpretation for a given $K$ that is more compelling than the $g_m$'s, that basis would be preferred.
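
A minimal Python sketch of the estimators $\hat{\beta}_m$ and $\hat{r}_M$ in (12) follows; the cosine basis, the uniform design (which makes the empirical averages estimate the $L^2$ inner products), and the test function are illustrative stand-ins for the $g_m$'s and $f_0$.

```python
# Minimal sketch of the orthonormal-basis predictor (12):
# beta_hat_m = (1/n) sum_i y_i g_m(x_i), r_hat_M = sum_m beta_hat_m g_m.
import numpy as np

def g(m, x):
    # cosine basis, orthonormal in L2[0, 1]
    return np.ones_like(x) if m == 0 else np.sqrt(2) * np.cos(m * np.pi * x)

rng = np.random.default_rng(3)
n, M = 500, 8
x = rng.uniform(0, 1, n)
f0 = lambda t: t ** 2                      # stand-in for the unknown f_0
y = f0(x) + rng.normal(scale=0.1, size=n)

beta_hat = np.array([np.mean(y * g(m, x)) for m in range(M)])

def r_hat(x_new):
    return sum(beta_hat[m] * g(m, x_new) for m in range(M))

print(r_hat(np.array([0.25, 0.75])), f0(np.array([0.25, 0.75])))
```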

From (13) and Theorem 2.1, we have the following.

Theorem 2.3

Assume (13) holds and that the hypotheses of Theorem 2.1 are satisfied. For M-closed and M-complete problems, for any $x$, the orthonormal basis predictor $\hat{r}_M$ in (12) is asymptotically equivalent to the representer theorem solution predictor $\hat{Y}_{\mathrm{rep}}(x)$, i.e., as $n \to \infty$ and consequently $M_n \to \infty$ at an appropriate rate,
$$\hat{r}_{M_n}(x) - \hat{Y}_{\mathrm{rep}}(x) \stackrel{P}{\to} 0.$$

From Theorem 2.3 we are justified in regarding $\hat{r}$ as an interpretation of $\hat{Y}_{\mathrm{rep}}(x)$ on the grounds that the $g_m$'s (or another orthonormal basis) admit a physical interpretation relevant to the DG. The cost of this interpretation is asymptotically zero but for finite $n$ depends on $B_{M_n}$ in (11) and the rate for the $\hat{\beta}_m$'s in (10).

Here we give some examples of the eigenfunctions $g_m$ for common choices of kernel to show that the interpretability is non-trivial. (Other orthonormal bases may be easier.)

Example 2.1

Consider the kernel $K(x, y) = e^{-xy}$ on $(0, \infty)$.

By the definition of the Gamma function, for $\alpha > 0$,
$$\int_0^\infty t^{\alpha-1} e^{-xt}\,dt = \frac{\Gamma(\alpha)}{x^{\alpha}}.$$
Changing $\alpha$ to $1-\alpha$, for $\alpha < 1$,
$$\int_0^\infty t^{-\alpha} e^{-xt}\,dt = \Gamma(1-\alpha)\,x^{\alpha-1}.$$
So, for $0 < \alpha < 1$,
$$\int_0^\infty \left(\frac{1}{\sqrt{\Gamma(\alpha)}}t^{\alpha-1} + \frac{1}{\sqrt{\Gamma(1-\alpha)}}t^{-\alpha}\right)e^{-xt}\,dt = \sqrt{\Gamma(\alpha)\Gamma(1-\alpha)}\left(\frac{1}{\sqrt{\Gamma(\alpha)}}x^{\alpha-1} + \frac{1}{\sqrt{\Gamma(1-\alpha)}}x^{-\alpha}\right).$$
Therefore, by definition, the eigenfunctions of this kernel are
$$g_\alpha(x) = \frac{1}{\sqrt{\Gamma(\alpha)}}x^{\alpha-1} + \frac{1}{\sqrt{\Gamma(1-\alpha)}}x^{-\alpha}, \quad \text{for } 0 < \alpha < 1.$$

Example 2.2

Polynomial kernel

The non-homogeneous version of the polynomial kernel of degree $d$ is defined by $K(x, y) = (c + \langle x, y\rangle)^d$, where $c$ is a constant and $\langle\cdot,\cdot\rangle$ is the inner product.

For $p = 2$, say, let $x = (x_1, x_2)$, $c = 0$, $d = 3$, and suppose that $(x_1, x_2) \sim 0.5\,N((3,1), I_2) + 0.5\,N((2,1), I_2)$. Then the eigenfunctions of this kernel are, see Liyang and Lee (2013),
$$g_1(x) = \frac{1}{\sqrt{1862.615}}\left(0.848x_1^3 - 0.791x_1^2x_2 + 0.437x_1x_2^2 - 0.097x_2^3\right),$$
$$g_2(x) = \frac{1}{\sqrt{343.748}}\left(0.518x_1^3 - 1.073x_1^2x_2 + 0.862x_1x_2^2 - 0.317x_2^3\right),$$
$$g_3(x) = \frac{1}{\sqrt{59.266}}\left(0.112x_1^3 - 1.079x_1^2x_2 - 0.929x_1x_2^2 + 0.559x_2^3\right),$$
$$g_4(x) = \frac{1}{\sqrt{1862.615}}\left(0.848x_1^3 - 0.791x_1^2x_2 + 0.437x_1x_2^2 - 0.097x_2^3\right).$$

Example 2.3

Exponential kernel

Consider the exponential kernel $K(x, y) = \exp(-|x - y|/w)$ for the uniform distribution on the interval $[-1, 1]$. In Diaconis et al. (2008) it was shown that the eigenfunctions of this kernel can be written as $\cos(bx)$ or $\sin(bx)$ inside the interval $[-1, 1]$, for appropriately chosen values of $b$, and decay exponentially away from it.

Example 2.4

Gaussian kernel

Consider the Gaussian kernel $K(x, y) = \exp\left(-\frac{(x-y)^2}{2w^2}\right)$ for the normal distribution $N(\mu, \sigma^2)$. Let $\beta = 2\sigma^2/w^2$ and let $H_i(x)$ be the $i$th-order Hermite polynomial. Shi et al. (2008) provided the eigenfunctions of this kernel:
$$g_i(x) = \frac{(1+2\beta)^{1/8}}{\sqrt{2^{i-1}(i-1)!}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\cdot\frac{\sqrt{1+2\beta}-1}{2}\right) H_{i-1}\left(\left(\frac{1}{4}+\frac{\beta}{2}\right)^{1/4}\frac{x-\mu}{\sigma}\right),$$
for $i = 1, 2, \ldots$. In particular, the first eigenfunction is
$$g_1(x) = \left(1 + \frac{4\sigma^2}{w^2}\right)^{1/8}\exp\left(-\frac{(x-\mu)^2}{4\sigma^2}\left(\sqrt{1 + \frac{4\sigma^2}{w^2}} - 1\right)\right).$$

Other examples may be developed, but the key points are (i) the bases used for the orthonormal basis predictor should be chosen in view of the DG to ensure interpretability, and (ii) for finite $n$, an interpretation of $\hat{Y}_{\mathrm{rep}}$ will not in general be as good a predictor (in, say, $\mathrm{PRESS}_n$), and this is the cost of interpretability.

2.3. A more empirical interpretation

If the $g_m$'s are not interpretable, and no obvious orthonormal basis can be identified, we might be led to default to the coordinates of the $x_i$'s since they, presumably, were the quantities measured. In the $\dim(x) = 1$ case we can write Taylor expansions
$$g_m(x) = \beta_0 + \beta_1 x + \cdots + \beta_k x^k + \cdots.$$
If each Taylor expansion converges, i.e., each $g_m(x)$ is analytic, then we get an analogous result for projections of the form of $r(x)$ in (9). For ease of exposition, each Taylor series can be represented as a finite sum of orthonormal polynomials. A common choice is the Hermite polynomials, often denoted $H_0, H_1, \ldots$. Then the difference between using $k$th-order Taylor expansions and expansions using the first $k$ Hermite polynomial basis elements is simply a matter of identifying the linear transformation between the bases.

Following Section 2.2, we can form $r$ as in (9) and $\hat{r}$ as in (12) using Hermite polynomials, and obtain a variant on Theorem 2.3 so that $\hat{r}$ provides a quantifiably good approximation to $f_0$. The result can be left in Hermite polynomials or converted back to the Taylor expansion of each $g_m$ in $r$. The point of converting back to the polynomial basis used for Taylor expansions is that functions of the form $x^j$ are usually easier to interpret physically in the context of a DG than Hermite polynomials are, simply because it was $x$ that was measured.

Thus, using linear models, which are commonly regarded as interpretable, to approximate the $g_m$'s is asymptotically equivalent to approximating $K$ in (8) directly. In practice, we suggest it will be easier to use Hermite polynomials (or any other orthonormal basis) on the terms on the right side of (8), and convert them to the polynomials used in Taylor expansions, than to approximate $K$ directly. Again, the finite sample discrepancy between the approximation we just described and (3) quantifies the cost of passing from an optimal predictor to an interpretable predictor. An entirely analogous argument holds when $\dim(x) \ge 2$, provided that orthonormal bases for polynomial spaces in $\dim(x)$ variables are used.
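
As a small illustration of the change of basis just described, the following Python sketch fits a function in the probabilists' Hermite basis and converts the coefficients to the ordinary powers $x^j$; the target function and the degree are illustrative assumptions.

```python
# A hedged sketch of Section 2.3: approximate a function in the
# (probabilists') Hermite basis, then convert the fit back to the
# ordinary polynomial basis x^j via the exact linear change of basis.
import numpy as np
from numpy.polynomial import hermite_e as H

x = np.linspace(-2, 2, 200)
f = np.exp(-x)                           # stand-in for an eigenfunction g_m

coef_herm = H.hermefit(x, f, deg=5)      # coefficients in He_0,...,He_5
coef_poly = H.herme2poly(coef_herm)      # same function, coefficients of x^j

# Both expansions agree because the conversion is a linear map
print(np.allclose(H.hermeval(x, coef_herm),
                  np.polynomial.polynomial.polyval(x, coef_poly)))
```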

3. Bayes Shtarkov predictors in M-open settings

Consider the online prediction of arbitrary sequences $y_1, y_2, \ldots$ drawn from a finite set $\mathcal{Y}$. In M-open settings, interest focusses on the case that no probability distribution can be assumed for a sequence of length $n$, say $y^n = (y_1, y_2, \ldots, y_n)$. This is the paradigm M-open statistical prediction problem for strings of values.

This problem can be regarded as a sequential game between Nature, N, and a Forecaster, F, permitting F to access a collection of experts indexed by $\theta \in \Theta \subset \mathbb{R}^k$ for some $k$. In the special case of log-loss, each round of the game proceeds as follows. Each expert announces a density, say $p_\theta$. Given this, F announces a density $q(\cdot)$ that will be used to predict the value N issues. Finally, N issues $y$ and pays F $\log q(y)$. If this number is negative, it is the amount of money F pays N, and this concludes the round. See Shtarkov (1987) and Cesa-Bianchi and Lugosi (2006) for details of this game and its properties.

Now suppose $n$ independent rounds of this game are to be played. Prior to the first round, each expert $\theta$ announces a density $p(\cdot\mid\theta)$ for $y^n$. F receives these $p_\theta$'s and chooses the density $q(y^n)$ by trying to match the performance of the best expert $\theta$ for predicting $y^n$. Then, N reveals $y^n$ and F incurs the loss (or gain) $\log q(y^n)$. The question remains how F should use the $p_\theta$'s to choose $q$. Obviously, the best expert will incur the loss $\min_\theta \log 1/p(y^n\mid\theta)$ to F, where $\theta$ ranges over the experts.

3.1. The Bayesian version

In the Bayes version of the game, F has access to experts that are weighted by a prior $w(\theta)$. (If $w \equiv 1$, this reduces to the frequentist version.) In this case, F would want to choose $q$ to minimize the maximum regret
$$\sup_{y^n}\left[\log\frac{1}{q(y^n)} - \inf_\theta \log\frac{1}{w(\theta)p(y^n\mid\theta)}\right] = \sup_{y^n}\left[\sup_\theta \log\frac{w(\theta)p(y^n\mid\theta)}{q(y^n)}\right]. \tag{14}$$
More formally, the solution $q_{\mathrm{opt}}$ to (14), which we henceforth call the Bayes Shtarkov predictor (for the discrete case), is, see Le and Clarke (2016a),
$$q_{\mathrm{opt}}(y^n) = \arg\left[\inf_{q\in\mathcal{P}}\left(\sup_{y^n}\sup_\theta \log\frac{w(\theta)p(y^n\mid\theta)}{q(y^n)}\right)\right] = \frac{w(\tilde\theta(y^n))\,p(y^n\mid\tilde\theta(y^n))}{\sum_{y^n} w(\tilde\theta(y^n))\,p(y^n\mid\tilde\theta(y^n))}, \tag{15}$$
where $\theta$ ranges over the ‘parameter space’ indexing the experts, $\mathcal{P}$ is the collection of all densities for $Y^n$ with respect to counting measure, and $\tilde\theta$ is the posterior mode. (Since $q$ is in the denominator of (15), $q(y^n) \neq 0$ for any $y^n \in \mathcal{Y}^n$.)

In the continuous case, the sum becomes an integral over a subset of a real space and $\mathcal{P}$ becomes a class of densities with respect to Lebesgue measure, so (15) becomes
$$q_{\mathrm{opt}}(y^n) = \frac{w(\tilde\theta(y^n))\,p(y^n\mid\tilde\theta(y^n))}{\int w(\tilde\theta(y^n))\,p(y^n\mid\tilde\theta(y^n))\,dy^n}; \tag{16}$$
see Clarke (2007, Sec. 5.2) for discussion of (15) and (16).

The solution $q_{\mathrm{opt}}(y^n)$ does not factor into a product of $q(y_i)$'s and so does not correspond to a stochastic process. Nevertheless, regardless of how $q_{\mathrm{opt}}(y^n)$ is computed, univariate Bayes Shtarkov predictive densities, when they exist, are of the form
$$q_{\mathrm{opt}}(y_{n+1}\mid y^n) = \frac{q_{\mathrm{opt}}(y^{n+1})}{q_{\mathrm{opt}}(y^n)}, \tag{17}$$
and can be used prequentially, i.e., to generate sequential predictions.
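
To fix ideas, here is a minimal Python sketch of the discrete Bayes Shtarkov predictor (15) for short binary strings, using a finite grid of Bernoulli experts and a uniform prior; all of these choices are illustrative assumptions.

```python
# Minimal sketch of the discrete Bayes Shtarkov predictor (15):
# enumerate all y^n, take sup_theta w(theta)p(y^n|theta), normalize.
import itertools
import numpy as np

thetas = np.linspace(0.1, 0.9, 9)               # the experts
w = np.full(len(thetas), 1.0 / len(thetas))     # uniform prior
n = 5

def best_weighted_lik(seq):
    # sup_theta w(theta) p(y^n | theta): the posterior-mode expert for y^n
    k = sum(seq)
    liks = w * thetas ** k * (1 - thetas) ** (n - k)
    return liks.max()

seqs = list(itertools.product([0, 1], repeat=n))
Z = sum(best_weighted_lik(s) for s in seqs)     # Shtarkov normalizer

def q_opt(seq):
    return best_weighted_lik(seq) / Z

print(q_opt((1, 1, 0, 1, 0)))
```

Conditionals as in (17) then follow by evaluating this $q_{\mathrm{opt}}$ at strings of lengths $n+1$ and $n$ and taking the ratio.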

The foregoing generalizes directly to the case where side information, i.e., a value $x_i$, is associated with each $y_i$. So, let us write the corresponding $q_{\mathrm{opt}}$ as $q_{\mathrm{shk}}(y\mid x^{n+1}, y^n)$, namely the Bayes Shtarkov predictive density for $Y$, where $x^{n+1} = (x_1, \ldots, x_{n+1})$ and $y^n = (y_1, \ldots, y_n)$. Now, there are two ways to generate predictions from $q_{\mathrm{shk}}$: (i) use $q_{\mathrm{shk}}$ as a density to generate interval predictors, and (ii) convert $q_{\mathrm{shk}}$ to a point predictor.

For the first, recall that under various regularity conditions, $q_{\mathrm{shk}}$ can be approximated by the Bayesian's marginal density for the data, i.e., $q_{\mathrm{shk}}(y^n) \approx m(y^n)$ in terms of regret. That is, (14) leads to
$$\sup_{y^n}\left[\log\frac{w(\tilde\theta)p(y^n\mid\tilde\theta)}{q_{\mathrm{shk}}(y^n)\sqrt{\hat{I}(\tilde\theta)}}\right] = \sup_{y^n}\left[\log\frac{w(\tilde\theta)p(y^n\mid\tilde\theta)}{m(y^n)\sqrt{\hat{I}(\tilde\theta)}}\right] = \frac{1}{2}\log\frac{n}{2\pi} + o(1), \tag{18}$$
where $\hat{I}(\cdot)$ is the empirical Fisher information from $p(y\mid\theta)$; see Xie and Barron (2000) and Clarke (2007). This means a stochastic process, the Bayesian's mixture, is approximating a density $q_{\mathrm{shk}}$ that does not correspond to a stochastic process. Hence, even in M-open problems, there may be good – new – predictors that resemble familiar predictors. If computing $q_{\mathrm{shk}}$ is onerous, $m(\cdot)$ may be a good predictor. Likewise, conditionals from $m(\cdot)$ such as $m(y_{n+1}\mid y^n)$ may be a good approximation for (17).

More important for the present, if the Bayesian's marginal for the data is ‘interpretable’ – perhaps because the densities proposed by the experts are – (18) provides an asymptotic interpretation of $q_{\mathrm{shk}}$ in terms of $m(\cdot)$. The difference between $m(\cdot)$ and $q_{\mathrm{shk}}$ represents the degree to which $m(\cdot)$ is predictively suboptimal to $q_{\mathrm{shk}}$ – in regret under log-loss – and the degree of suboptimality decreases as $n \to \infty$. Thus, $100(1-\alpha)\%$ predictive intervals under $m(\cdot)$ and $q_{\mathrm{shk}}$ are equivalent in the limit even though for all finite $n$ $q_{\mathrm{shk}}$ is better in terms of regret.

By contrast, if a generic ‘interpretable’ predictor $r(y^n)$ is used, the regret usually becomes far worse. For instance,
$$\sup_{y^n}\left[\log\frac{w(\tilde\theta)p(y^n\mid\tilde\theta)}{r(y^n)\sqrt{\hat{I}(\tilde\theta)}}\right] = \sup_{y^n}\left[\log\frac{w(\tilde\theta)p(y^n\mid\tilde\theta)}{q_{\mathrm{shk}}(y^n)\sqrt{\hat{I}(\tilde\theta)}} + n\left(\frac{1}{n}\sum_{i=1}^{n}\log\frac{q(y_i\mid y^{i-1})}{r(y_i\mid y^{i-1})}\right)\right] \approx \frac{1}{2}\log\frac{n}{2\pi} + n\sup_{y^n} D(q_i\|r_i) + o(1), \tag{19}$$
in which $D(q_i\|r_i)$ is defined by (19) and typically dominates, resulting in a much larger loss for F. In part, this is an artifact of using the log-loss of a density ratio, which is much more sensitive to tail behaviour than other functions. Statements analogous to (17), (18), and (19) hold for sequences of discrete outcomes $y_i$. In either case, this suggests that interpretability per se, without any direct relationship to the minimum regret in a logarithmic sense, gives much larger costs.

For the second, if we consider pointwise prediction, we want a point predictor from $q_{\mathrm{shk}}$ that represents where the mass of $q_{\mathrm{shk}}$ is, say $E_{q_{\mathrm{shk}}}(Y)$. One such predictor is
$$\hat{Y}_{\mathrm{shk}}(x) = E_{q_{\mathrm{shk}}}(Y\mid x^{n+1}, y^n). \tag{20}$$
Others include $\mathrm{med}(Y\mid x^{n+1}, y^n)$, $\mathrm{mode}(Y\mid x^{n+1}, y^n)$, etc. In addition, because of (18) we are led to approximate each of these point predictors by expectations with respect to the conditional $m(\cdot\mid x^{n+1}, y^n)$ from the mixture.

Assume $p_j(y)$ is the proposed density for $Y$ from expert $j$, $j = 1, \ldots, J$, i.e., we assume finitely many experts or that a continuum of experts can be approximated by a weighted sum of finitely many. Even though there is no distribution associated with $Y$ because the problem is M-open, we can still take expectations with respect to $q_{\mathrm{shk}}$ and the $p_j$'s. We can write
$$\left|Y - E_{\sum_{j=1}^{J}\gamma_j p_j}(Y)\right| \le \left|Y - E_{q_{\mathrm{shk}}}(Y)\right| + \left|E_{q_{\mathrm{shk}}}(Y) - E_{\sum_{j=1}^{J}\gamma_j p_j}(Y)\right| = \left|Y - \hat{Y}_{\mathrm{shk}}\right| + \left|\hat{Y}_{\mathrm{shk}} - E_{\sum_{j=1}^{J}\gamma_j p_j}(Y)\right| \tag{21}$$
where $\gamma_j > 0$, $\sum_{j=1}^{J}\gamma_j = 1$, and subscripts on $E$ indicate the density with respect to which the expectation is taken. Since the Bayes Shtarkov predictor is best in log-loss, the first term in (21) is likely to be small. Thus, we only need to find $\gamma_j$'s such that the second term in (21) is as small as possible if not zero. If this is done, then the first term represents the minimal cost of prediction and the second term represents the additional cost of interpretation, if the $p_j$'s are interpretable.

We can modify (21) by adding and subtracting expressions involving expectations with respect to $m(y_{n+1}\mid x^{n+1}, y^n)$ as a way to try to identify the components of the error meant by (21) in terms of the uninterpretable $q_{\mathrm{shk}}$, its distance from a mixture (possibly regarded as interpretable), and sums of weighted densities of the experts (if they are interpretable); e.g., we can add and subtract $E_{m(\cdot\mid y^n)}(Y)$ in the last term of (21). In practice, the last terms would represent the cost of interpretability while the first term on the right in (21) represents the minimal cost of prediction.

3.2. A theoretical interpretation for Bayes Shtarkov predictors

Separate from approximating $q_{\mathrm{shk}}$ by mixture densities or other expressions, in some cases we can characterize $q_{\mathrm{shk}}$ as a mixture itself. The hypotheses are rather strong and the mixing density is somewhat artificial, but in some examples this characterization may be useful. We have the following.

Theorem 3.1

Assume $q_{\mathrm{shk}}$ is $m$-monotone over $(0, \infty)$, i.e., $(-1)^k q_{\mathrm{shk}}^{(k)}(|x|) \ge 0$ for $k = 0, \ldots, m-1$, where $q_{\mathrm{shk}}^{(k)}$ is the $k$th derivative of $q_{\mathrm{shk}}$ and $q_{\mathrm{shk}}^{(0)} = q_{\mathrm{shk}}$. Then $q_{\mathrm{shk}}$ can be represented as the following mixture for any integer $k$, $1 \le k \le m$:
$$q_{\mathrm{shk}}(y) = \int_0^\infty\left[\frac{k}{s}\left(1 - \frac{|y|}{s}\right)_+^{k-1}\right]g(s)\,ds, \tag{22}$$
where $a_+ = \max\{a, 0\}$ and the mixing density $g(s)$ is
$$g(s) = \frac{1}{k}\sum_{j=0}^{k-1}\frac{(-1)^{j+1}}{j!}\left[j s^j q_{\mathrm{shk}}^{(j)}(s) + s^{j+1} q_{\mathrm{shk}}^{(j+1)}(s)\right].$$

Proof.

The proof of Theorem 1 in Polson et al. (2014), based on Williamson (1956), holds for all positive values of $y$. For negative values of $y$, define $f(y)$ on $(0, \infty)$ by $f(y) = q_{\mathrm{shk}}(-y)$; then we have the same result for $f(y)$:
$$f(y) = \int_0^\infty\left[\frac{k}{s}\left(1 - \frac{y}{s}\right)_+^{k-1}\right]g(s)\,ds.$$
Hence, for negative values of $y$,
$$q_{\mathrm{shk}}(y) = f(-y) = \int_0^\infty\left[\frac{k}{s}\left(1 - \frac{|y|}{s}\right)_+^{k-1}\right]g(s)\,ds.$$
Thus, for all $y$, $q_{\mathrm{shk}}$ has the representation (22).

We do not have sufficient conditions for $q_{\mathrm{shk}}$ to be $m$-monotone. However, examples of $q_{\mathrm{shk}}$ in Le and Clarke (2016a) look like graphs of $1/y$ or $e^{-y}$, which are $m$-monotone.

One of the implications of Theorem 3.1 is that if $X \sim \mathrm{Beta}(1, k)$ and $S \sim g(\cdot)$ then $Y = SX$ will have density (22). For instance,

  1. if $k = 1$, $g(s) = -s\,q'_{\mathrm{shk}}(s)$ and $X \sim \mathrm{Beta}(1,1) = \mathrm{Uniform}(0,1)$,

  2. if $k = 2$, $g(s) = \frac{s^2}{2!}\,q''_{\mathrm{shk}}(s)$ and $X \sim \mathrm{Beta}(1,2)$,

  3. if $k = 3$, $g(s) = -\frac{s^3}{3!}\,q'''_{\mathrm{shk}}(s)$ and $X \sim \mathrm{Beta}(1,3)$.

Thus, while $g$ remains uninterpretable, $X$ is recognizable and $Y$ is a product. A limitation of this result is that it is only for univariate $y$. However, as suggested by the relationship between $m(\cdot)$ and $q_{\mathrm{shk}}$ in (18), extensions to multivariate $y$ may be possible under some M-open analog of stationarity – provided there is the right sort of dependence so that the middle term of order $n$ in (19) can be avoided.
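
A small numerical check of this product representation is possible. In the sketch below, the stand-in density $q(y) = e^{-y}$ on $(0,\infty)$ plays the role of a monotone $q_{\mathrm{shk}}$; with $k = 1$, $g(s) = -sq'(s) = se^{-s}$ is a Gamma(2,1) density, so $Y = SX$ with $S \sim g$ and $X \sim \mathrm{Uniform}(0,1)$ should recover $q$.

```python
# Hedged numerical check of Theorem 3.1 with k = 1 and q(y) = e^{-y}:
# S ~ g = Gamma(2, 1), X ~ Beta(1, 1) = Uniform(0, 1), Y = S X.
import numpy as np

rng = np.random.default_rng(4)
s = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # S ~ g
x = rng.uniform(size=100_000)                        # X ~ Beta(1, 1)
y = s * x

# Compare the empirical mean of Y with the mean of Exp(1), which is 1
print(y.mean())   # should be close to 1.0
```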

3.3. An empirical interpretation for Bayes Shtarkov predictors

We can approximate the univariate density $q_{\mathrm{shk}}(y)$ by finding the member of a large family of densities closest to it. For this we want a relatively large family of candidate densities that are parametrized in some way that reflects our understanding of the shapes of densities. One such family consists of the Pearson distributions, first introduced by Pearson (1895). There are at least 7 useful subtypes within the Pearson family; Pearson himself ultimately identified 12. Overall, this family is characterized by five parameters: a location parameter $a$ (often interpretable as a mode), a location parameter $\lambda$ (often interpretable as a mean, $\mu_1$), a variance $\mu_2$, a skewness $\gamma_1$ (this enters as $\beta_1 = \gamma_1^2$), and a kurtosis $\beta_2$. While this family is only for univariate densities, there are proposed generalizations to the multivariate case, see Steyn (1960) amongst others, although they are not well developed and there is little recent work on them.

Formally, a Pearson density $p$ is any solution to the differential equation
$$\frac{p'(y)}{p(y)} + \frac{a + (y - \lambda)}{b_0 + b_1(y-\lambda) + b_2(y-\lambda)^2} = 0, \tag{23}$$
where
$$b_0 = \frac{4\beta_2 - 3\beta_1}{10\beta_2 - 12\beta_1 - 18}\,\mu_2, \quad b_1 = a = \sqrt{\mu_2\beta_1}\,\frac{\beta_2 + 3}{10\beta_2 - 12\beta_1 - 18}, \quad b_2 = \frac{2\beta_2 - 3\beta_1 - 6}{10\beta_2 - 12\beta_1 - 18}.$$
Now, in principle, values of $(a, \lambda, \mu_1, \beta_1, \beta_2)$ yielding the Pearson density closest to a given $q_{\mathrm{shk}}$ density can be found. This Pearson density may be regarded as an interpretation of $q_{\mathrm{shk}}$, but it cannot be as good as $q_{\mathrm{shk}}$ in terms of regret, see (14). The increase in regret from using the Pearson density closest to $q_{\mathrm{shk}}$, rather than $q_{\mathrm{shk}}$ itself, is the cost of interpretation.

The solution of (23) is given by the indefinite integral
$$p_{\mathrm{prs}}(y) \propto \exp\left(-\int\frac{y - a}{b_2y^2 + b_1y + b_0}\,dy\right).$$
For the sake of completeness, we look at an example, namely the two cases of the Pearson type IV distributions based on whether $b_1^2 - 4b_0b_2$ is negative or non-negative. (The term discriminant arises from the use of the quadratic root formula.) They are:

Case I: if $b_1^2 - 4b_0b_2 < 0$, then
$$p_{\mathrm{prs}}(y) \propto \left[1 + \left(\frac{y-\lambda}{\alpha}\right)^2\right]^{-m}\exp\left[-\nu\arctan\left(\frac{y-\lambda}{\alpha}\right)\right], \tag{24}$$
where
$$\alpha = \frac{\sqrt{4b_0b_2 - b_1^2}}{2b_2}, \quad \nu = \frac{2b_2a + b_1}{2b_2^2\alpha}, \quad m = \frac{1}{2b_2}.$$
Case II: if $b_1^2 - 4b_0b_2 \ge 0$, then
$$p_{\mathrm{prs}}(y) \propto \left(1 - \frac{y}{a_1}\right)^{-\nu(a_1-a)}\left(1 - \frac{y}{a_2}\right)^{\nu(a_2-a)}, \tag{25}$$
where
$$a_1 = \frac{-b_1 - \sqrt{b_1^2 - 4b_0b_2}}{2b_2}, \quad a_2 = \frac{-b_1 + \sqrt{b_1^2 - 4b_0b_2}}{2b_2}, \quad \nu = \frac{1}{b_2(a_1 - a_2)}.$$
Let $d$ be a distance between densities. We can find $\hat{v}_1 = (\lambda_{\min}, \alpha_{\min}, \nu_{\min}, m_{\min})$ and $\hat{v}_2 = (a_{\min}, a_{1,\min}, a_{2,\min}, \nu_{\min})$ that achieve
$$\arg\left(\min d(q_{\mathrm{shk}}, p_{\mathrm{prs}})\right), \tag{26}$$
where the minimum is over the parameters in (24) or (25), respectively. Naturally, we would choose the Pearson density corresponding to whichever of $\hat{v}_1$ and $\hat{v}_2$ gave a lower value of the minimum in (26). In principle, this can be done over the other types of Pearson densities (or any parametrized class of densities) so that the parameters giving the overall minimum for the distance in (26) can be found. The result is a Pearson distribution whose density approximates $q_{\mathrm{shk}}$ as closely as possible within the family and hence has an interpretation based on the shape of the approximating density. If the minimum is not small enough given the choice of $d$, we may be led to consider richer families of densities than Pearson.
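
In practice the minimization (26) can be carried out numerically. The Python sketch below fits the Type IV form (24) to a target density by minimizing an $L^2$ distance on a grid; the standard normal target (a stand-in for $q_{\mathrm{shk}}$), the optimizer, and the starting values are illustrative assumptions.

```python
# Hedged sketch of (26): fit Pearson Type IV (24) to a target density
# by minimizing an L2 distance on a grid.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

grid = np.linspace(-6, 6, 601)
q_target = norm.pdf(grid)              # stand-in for q_shk

def pearson_iv(y, lam, alpha, nu, m):
    z = (y - lam) / alpha
    p = (1 + z ** 2) ** (-m) * np.exp(-nu * np.arctan(z))
    return p / np.trapz(p, y)          # normalize numerically on the grid

def loss(v):
    lam, log_alpha, nu, log_m = v      # log-parametrize alpha, m > 0
    p = pearson_iv(grid, lam, np.exp(log_alpha), nu, np.exp(log_m))
    return np.trapz((p - q_target) ** 2, grid)

res = minimize(loss, x0=[0.0, 0.0, 0.0, 1.0], method="Nelder-Mead")
print(res.x, res.fun)                  # fitted parameters, achieved distance
```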

As a final point for this section, recall that it is well known that if all moments of a distribution exist then they characterize the distribution, and that the more closely the moments of two distributions match, the closer the two distributions are. Thus, as long as the moments of the distribution are meaningful, the distribution can be regarded as interpretable, but, as we have seen, interpretation has a cost.

4. Random forest predictors

Random forest (RF) predictors were introduced by Breiman (2001a). The main idea is to use bootstrap aggregation on trees with binary splits or, more formally, binary recursive partitioning models, and add one extra step to reduce the correlation between any pair of trees in the forest. The extra step is random selection from the explanatory variables. That is, when growing a tree on a bootstrapped sample, before each split, choose $m \le p$ of the explanatory variables at random to be candidates for splitting. Values for $m$ typically range from $\log_2 p$ to $\sqrt{p}$. Despite the de-correlation step, RF's have many of the properties of bagged trees. Here we focus on RF's because they are one of the most successful model averaging methods for binary classification and often have benefits other methods, including boosting, do not.
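
For reference, the de-correlation step corresponds to the max_features argument in standard implementations; a minimal scikit-learn sketch with illustrative data and settings follows.

```python
# Minimal sketch of the de-correlation step: a random forest that
# considers only m of the p features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # B, the number of bootstrapped trees
    max_features="sqrt",   # m = sqrt(p) candidate features per split
    random_state=0,
).fit(X, y)

print(rf.predict(X[:5]))
```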

Here we will relate RF's to boosting, see Schapire (1990), to show that the main point of this paper – that interpretable methods have a performance cost relative to optimal predictive methods – holds for classification as well as regression. This is a different perspective from Wyner et al. (2017), who argue that RF's and boosting work well because both are interpolating and averaging. Our point has to do with interpreting classifiers, not understanding why they work.

4.1. Interpreting RF's in terms of boosting

Our point in this subsection can be stated succinctly as follows. Let $RF(x)$ be the random forest classifier based on $x$. Consider a data set $D = \{(y_i, x_i), i = 1, \ldots, n\}$ where the pairs $(y_i, x_i)$ are independent over $i$, $y_i \in \mathcal{Y} = \{-1, 1\}$ and $x_i \in \mathbb{R}^p$. Write $RF(x) = RF_{B,\rho,\tau}(x)$ where $\rho$ is a splitting rule, $\tau$ is a stop-splitting rule, and $B$ is the number of trees used to form the random forest. Thus,
$$RF(x) = \frac{1}{B}\sum_{b=1}^{B} T_{b,\rho,\tau}(x),$$
where $T_b = T_{b,\rho,\tau}$ is the classification tree formed from the $b$th bootstrap sample and the decorrelation used to form $T_b$ is suppressed in the notation. Let $BST_C^J(x) = BST_C(x)$ be the boosted classifier using $J$ iterations of the boosting procedure starting with the initial tree-based classifier $C(x)$, assumed to be ‘weak’. That is, $P(Y \neq C(X)) < .5$, but not by much. Now we can write
$$RF(x) = (RF(x) - BST_C(x)) + BST_C(x). \tag{27}$$
The idea is that $RF(\cdot)$ is uninterpretable but $BST_C(\cdot)$ has a limited interpretation (due to Friedman et al. (2000), see below), so the term $(RF(x) - BST_C(x))$ represents the cost of that interpretation. We assume that RF's perform a little better than boosted trees because they generally do unless the boosting classifier is sufficiently well-calibrated and does not overfit. That is, RF's are nearly automatic and hence more robust. Moreover, a majority of successful classifiers are random forests or variants on random forests; see Caruana et al. (2008) for a definitive report emphasizing high dimensional problems. As a generality, boosting also does not generalize well beyond binary classification.

To make this more precise, recall that while there are a variety of boosting algorithms, and variants on boosting algorithms including gradient boosting, AdaBoost, due to Freund and Schapire (1997), is arguably the most popular. The basic idea of boosting is to improve a weak classifier iteratively by averaging the reweighted improvements, i.e., to help the weak classifier learn from its mistakes. To generate the improved classifiers, the boosting procedure re-uses the data like bagging or RF's but builds iterates rather than starting anew with each iteration.

Let $C(x)$ be a fixed initial classifier for $Y$ and assume that, as a function, $C(x)$ is representable as a tree. Denote the iterates of $C$ under the boosting algorithm by $\hat{C}_j$, $j = 1, \ldots, J$. The iterates are also classifiers. They are ensembled into a final classifier by weighted majority voting to yield
$$BST_C(x) = \mathrm{sign}\left(\sum_{j=1}^{J}\beta_j\hat{C}_j(x)\right), \tag{28}$$
where the weights $\beta_1, \ldots, \beta_J$ are computed by the AdaBoost algorithm; see Freund and Schapire (1997) for details. The central intuition in AdaBoost is that increasing the penalty for misclassified data points forces successive $\hat{C}_j$'s to make fewer errors. There is a performance criterion that is satisfied by most versions of boosting, see Freund and Schapire (1997) Theorem 6. This result leaves open the possibility that a boosted classifier could be perfect in a limiting sense but does not actually give convergence of $BST_C$ to a limit.

Even though AdaBoost was a novel approach to classification by ensembles, Friedman et al. (2000) showed it was equivalent to forward stagewise additive logistic regression under exponential loss. This provides an interpretation of boosting because the logistic regression gives an explicit expression for $P(Y = 1\mid D)$ in terms of specifically constructed functions of $x$ (the $\hat{C}_j$'s below); see Friedman et al. (2000, Sec. 3.3) and Hastie et al. (2009, Secs. 10.4 and 10.5) for details. Since the $\hat{C}_j$'s are individual trees, they admit interpretations in terms of the explanatory variables. This result holds in a limiting sense as the terms in the logistic regression increase and as $n \to \infty$. So, for each finite step, the boosted classifier is suboptimal even though it is Bayes optimal in the limit; we use this below in the proof of Corollary 4.1.

One key step in the AdaBoost procedure is choosing how the iterates $\hat{C}_j$ are to be generated. There are various choices; the most popular are probably naive Bayes classifiers or trees with a maximum number of splits. Here we use the latter. So, we start by taking $C$ to be a tree classifier and want our iterates to be trees as well. Specifically, the criterion the iterates must satisfy is
$$\hat{C}_{j+1}(x) = \arg\min_{h\in G_j}\sum_{i=1}^{n} D_j(i)\,\mathbf{1}\{y_i \neq h(x_i)\} \tag{29}$$
where $D_j = (D_j(1), \ldots, D_j(n))$ for $j \ge 1$ is the ‘empirical’ distribution on the $n$ data points given by
$$D_j(i) = \frac{D_{j-1}(i)\,e^{\beta_{j-1}\mathbf{1}\{y_i \neq \hat{C}_{j-1}(x_i)\}}}{N_{j-1}} \tag{30}$$
in which $\hat{C}_0 = C$, $N_{j-1}$ is a normalization constant, the $\beta_{j-1}$'s are found by an auxiliary procedure, and the sequence of distributions $D_j$ is initialized by $D_0 = (1/n, \ldots, 1/n)$.

In (29), the classifiers $h$ at step $j$ vary over a set of classifiers $G_j$. As noted, $C$ is a tree and we want the iterates to be trees. So, each $G_j$ must be a class of trees. If $G_j$ is too small a class, e.g., trees with exactly one split (often called ‘stumps’), then the range of boosted tree-based classifiers will be too small. For instance, if $G_j$ only contained stumps, it would not include trees that allowed interactions between entries in $x$. On the other hand, if $G_j$ is too big, $\hat{C}_j$ will fit the data perfectly even though such a classifier often has poor generalization error. So, we have to choose a reasonable value for the number of splits to allow in the classifiers in $G_j$.

Chapter 10, Sec. 11 in Hastie et al. (2009) recommends using trees with four to eight splits. Three splits allows for interactions between explanatory variables, but often not enough, so starting with four splits and working up to eight is often a good overall procedure. Hastie et al. (2009) comment that 10 or more splits are rarely required for good performance, e.g., low generalization error.
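
A minimal scikit-learn sketch of boosting over such depth-limited trees follows; the data, the five-leaf (four-split) base trees, and the number of iterations are illustrative choices.

```python
# Hedged sketch of boosting with small trees as in (28)-(29): AdaBoost
# over trees with at most four splits (five leaves). Note: older
# scikit-learn versions name the argument base_estimator, not estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

bst = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_leaf_nodes=5),
    n_estimators=100,      # J, the number of boosting iterations
    random_state=1,
).fit(X, y)

print(bst.estimator_weights_[:5])   # the beta_j's of (28)
```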

To see that under these circumstances (28) is a tree, it is enough to show that, as functions of $x$, a linear combination of trees is a tree. First, a real constant times a tree function is a tree, so it is enough to show that the sum of two trees is again a tree. Let $T_1$ and $T_2$ be trees. To see that $T(x) = T_1(x) + T_2(x)$ is also a tree, observe that if each leaf node of $T_1$ is taken as the root node of $T_2$, the result is a tree $T$ of twice the depth for which each input vector of explanatory variables ends up in exactly one leaf. However, some of the nodes or leaves may be void. This does not make $T$ invalid, just artificial, and it may be collapsible into a much smaller tree. Nevertheless, in a trivial sense, a linear combination of trees is a tree.

Less artificially, write
$$T_k(x) = \sum_{\ell=1}^{u_k}\alpha_\ell I_{R_\ell(k)}(x) \tag{31}$$
for $k = 1, 2$, where the $R_\ell(k)$ are disjoint and exhaustive regions in $\mathbb{R}^p$ assumed to have edges parallel to the axes of the real space. In the special case of $p = 1$, $\mathbb{R}$ has an ordering so it is easy to see that assigning the right constant to each intersection $R_\ell(1) \cap R_{\ell'}(2)$ gives a function that can be expressed as constants times indicator functions for intervals in $\mathbb{R}$. Since intervals can be defined by splits on the single real variable, the sum $T(x) = T_1(x) + T_2(x)$ has the same form as (31) and arises from a tree structure using binary splits on the explanatory variables.

The same sort of argument applies to $\mathbb{R}^2$. It is easy to see that in the two dimensional case $T(x) = T_1(x) + T_2(x)$ can be written in the form (31). What is harder to see is that the regions defined by the $R_\ell(1) \cap R_{\ell'}(2)$ arise from binary splits on the entries of $x$. While harder, this is not hard: in the real plane, label the coordinates as $x_1$ and $x_2$ and let the tree structures of $T_1$ and $T_2$ be represented as partitions of $\mathbb{R}^2$. Pick the root node of, say, $T_1$. Without loss of generality, assume it is a split of the form $x_1 < c_1$ versus $x_1 \ge c_1$. Consider the left branch. The region $x_1 < c_1$ will be partitioned by horizontal lines, i.e., by ranges of $x_2$. Choose the largest cutoff for $x_2$, say $c_2$. Thus, splitting on $x_2 < c_2$ versus $x_2 \ge c_2$ will give us two daughter nodes on the left branch. We can then repeat this for each cutpoint on the left branch, splitting at $c_3$, $c_4$, etc. The issue arises when the horizontal band represented by the split is itself split by some value of $x_1$. If this happens at, say, $c_4$, then this simply adds another split that has to be carried over to the other splits on the left, i.e., for $x_2 < c_5$ versus $x_2 \ge c_5$, $x_2 < c_6$ versus $x_2 \ge c_6$, etc., as far down the left side of the tree as there are splits on $x_2$. If there are further splits, they can be accommodated in the same way and the argument can be applied analogously to the right branch of the tree. The same argument can be applied in three or more dimensions; it is simply a matter of considering splits on each dimension in turn and all the splits that may be performed on it using the other explanatory variables, in all possible sequences.

Now, if $C(\cdot)$ is a tree then its iterates $\hat{C}_1, \hat{C}_2, \ldots$ can be assumed to be trees, as can the final output $BST_C(x)$. It is reasonable to expect the boosted classifier to be Bayes optimal. Indeed, Theorem 6 in Freund and Schapire (1997) gives that the probability of misclassification by a $J$-step boosted classifier, $P(Y \neq C_{\mathrm{Boost},J}(X))$, can only decrease as $J$ increases. Separately, Biau et al. (2008) gives conditions under which some randomized RF-like and majority vote averaging classifiers achieve the minimal Bayes risk. (These are not pure RF's because they ignore the decorrelation and choose split points randomly.) Moreover, since we are using trees here, and trees are a very rich class of nonparametric function estimators (in this case classification functions), it is safe to assume that there are trees that are arbitrarily close to the Bayes classifier even if they are not the same as studied in Biau et al. (2008).

Suppose a Bayes optimal classifier $C_B(x)$ exists, i.e., there is a classifier $C_B$ that achieves the minimal misclassification error, $\arg\min_{C\in S} P(Y \neq C(X))$, where $S$ is the set of essentially all classifiers. Then,
$$C_B(x) = \arg\max_{r\in\{-1,1\}} P(Y = r\mid X = x)$$
for each $x$. We begin with a result that initiates a boosting procedure with a Bayes optimal classifier. Of course, we find that the boosting procedure is unable to improve an optimal classifier. This is no surprise. Indeed, it is an artificial hypothesis – why boost an optimal classifier? However, we will use this result in Corollary 4.1 below and remove the hypothesis.

Theorem 4.1

Suppose a Bayes classifier is used as $C_0$ in a boosting procedure. Then, on average, in the limit of increasing $n$, the weights $\beta_j$ in (30) are identical positive constants, for appropriately chosen $G_j$'s.

Proof.

It is easy to see that the first step of the boosting procedure gives
$$\mathrm{err}_1 = \frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{C}_1(x_i)) \to E_{(X,Y)}\left(I(Y \neq \hat{C}_1(X))\right) \to P(Y \neq C_B(X))$$
because of the minimum in (29), provided $n \to \infty$ and $G_1 = G_{1,n}$ is increasing, by invoking Theorem 6 of Freund and Schapire (1997). So, as $n \to \infty$,
$$\beta_1 = \log\left(\frac{1 - \mathrm{err}_1}{\mathrm{err}_1}\right) \to \log\left(\frac{1 - P(Y \neq C_B(X))}{P(Y \neq C_B(X))}\right).$$
Now $w_i \to e^{\beta_1}$ if $y_i \neq \hat{C}_1(x_i)$ and $w_i \to 1$ if $y_i = \hat{C}_1(x_i)$ for $i = 1, \ldots, n$. Also, since $P(Y \neq C_B(X)) < 1/2$, we have $e^{\beta_1} > 1$. So, the first iteration $w_i$'s are derived from the $\beta_1$'s and the number and indices of the misclassifications. Out of $n$ data points there will be asymptotically $nP(Y \neq C_B(X))$ misclassifications and they will occur randomly over the $n$ data points.

Thus, from examining (30), any instance of the distribution $D_1$ randomly permutes the locations of the $e^{\beta_1}$ and ‘1’ entries, but the number of each type will be asymptotically constant. So, the first step optimization will again, on average, lead to the Bayes classifier: the misclassifications of the Bayes classifier are randomly located and spread uniformly over the occurrences of the $e^{\beta_1}$ and ‘1’ entries. Therefore the output $\hat{C}_1$ is on average the same as the initial classifier $C_0(x) = C_B(x)$. The only way another classifier could improve on $C_B$ would be to have fewer misclassifications on the indices $i$ that had weight $e^{\beta_1}$ rather than ‘1’, which is impossible (on average) because the locations are random.

The same reasoning applies to step j = 2. We get that the same proportion of observations have the weights \(w_i\to e^{\beta_1}\) and '1'. Hence, at the end of this step we still get that, as \(n\to\infty\),
\[
\beta_2=\log\Big(\frac{1-\mathrm{err}_2}{\mathrm{err}_2}\Big)\rightarrow \log\Big(\frac{1-P(Y\neq C_B(X))}{P(Y\neq C_B(X))}\Big)
\]
on average, with \(w_i\to e^{\beta_2}=e^{\beta_1}\) if \(y_i\neq \hat{C}_2(x_i)\), and \(w_i\to 1\) if \(y_i=\hat{C}_2(x_i)\) for \(i=1,\ldots,n\). Again, \(e^{\beta_2}>1\) with the number of misclassifications the same as before and the locations of the misclassifications permuted randomly. Hence \(D_2\) is unchanged on average and \(\hat{C}_2\) is essentially \(C_B\), as before.

If we continue this process for steps \(j=3,\ldots,J\), we have in the limit
\[
\beta_1,\beta_2,\ldots,\beta_J\rightarrow \log\Big(\frac{1-P(Y\neq C_B(X))}{P(Y\neq C_B(X))}\Big)>0,
\]
as \(n\to\infty\), on average in \(P_{X,Y}\).
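As a sanity check on this limit, the following small simulation (ours, not from the paper) draws data from a DG whose Bayes classifier and Bayes error \(p=P(Y\neq C_B(X))\) are known, and confirms that the AdaBoost weight computed from the empirical error of \(C_B\) is close to the constant \(\log((1-p)/p)\); the DG and the value p = 0.2 are hypothetical.

```python
# Simulation of the Theorem 4.1 limit: with C_B as the classifier, the
# empirical error tends to p < 1/2, so beta = log((1 - err)/err) tends to
# the positive constant log((1 - p)/p). The DG below is hypothetical.
import numpy as np

rng = np.random.default_rng(1)
p, n = 0.2, 100_000            # assumed Bayes error and sample size

x = rng.normal(size=n)
bayes_pred = np.sign(x)        # C_B(x) = sign(x) for this DG
flip = rng.random(n) < p       # misclassifications at random locations
y = np.where(flip, -bayes_pred, bayes_pred)

err = np.mean(y != bayes_pred)       # empirical err_1 for C_1 = C_B
beta = np.log((1 - err) / err)       # AdaBoost weight
print(beta, np.log((1 - p) / p))     # both approximately 1.386
```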

Now, we get a corollary by initializing a boosting procedure with a ‘weak’ classifier.

Corollary 4.1

Let \(C_0(\cdot)\) be a weak classifier with \(P(Y\neq C_0(X))<1/2\) (i.e., as a classifier, \(C_0(X)\) is better than a coin toss). Then, as \(n\to\infty\), the output of the AdaBoost algorithm is a majority vote of a sum of trees, i.e., a 'forest'.

Remark

The output of AdaBoost is only a forest, a collection of trees, not a random forest. In fact, the output of boosting is the sign of a weighted sum of trees. As seen above, this is a tree. That is, as a weighted sum of functions, the majority vote of the individual trees from boosting is the same as the output of a single tree. In the same spirit, a random forest is a weighted sum of trees and therefore can be represented as a single tree if desired. Since AdaBoost and RF's are good – essentially Bayes classifiers – we expect the two to be close to each other as functions of their inputs. Even though the tree from boosting is an RF-like classifier, not actually an RF, we can still compare it to an actual RF classifier as in (27).

Proof.

Recall that the boosting classifier is asymptotically Bayes optimal because it is a greedy approximation to the relative classification rate; see Theorem 3.5 of Le and Clarke (2018), cf. Theorem 6 in Freund and Schapire (1997). So, for J large enough, the iterates from the boosting procedure are asymptotically Bayes optimal. Thus, by Theorem 4.1, the \(\beta_j\)'s converge to the same positive constant and, for large enough J, the latter terms in BSTC will dominate to give the limit. That is,
\[
\mathrm{BSTC}(x)=\mathrm{sign}\Big(\sum_{j=1}^J \beta_j \hat{C}_j(x)\Big)\approx \mathrm{sign}\Big(\sum_{j=1}^J \hat{C}_j(x)\Big)=\text{majority vote of } \{\hat{C}_j(x)\}_{j=1}^J=\mathrm{RF}(x),
\]
where RF is a sum of trees and the approximation improves as \(n\to\infty\).

In view of (27), the Corollary means that if BSTC is a good approximation to RF's asymptotically, then RF's will be well approximated by an additive logistic regression model under the exponential loss. This agrees with the result in Le and Clarke (2018) showing that the risks of RF's and boosting converge to the minimal Bayes risk. Furthermore, we agree with Mease and Wyner (2008) and offer an argument supporting their claim that boosting does not overfit, since RF's are asymptotically equivalent to boosting, at least in a misclassification sense, and RF's do not overfit, see Breiman (2001b). The new part here is using BSTC as an interpretation for RF's and noting the cost of interpretation. A small empirical check of the Corollary is sketched below.
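The following rough check (ours; the data set and settings are hypothetical) compares, on synthetic data, the AdaBoost classifier (a weighted vote of trees), the unweighted majority vote over the same trees (the 'forest' of Corollary 4.1), and an actual RF; when all three classify well, the agreement rates should be high.

```python
# Empirical sketch of Corollary 4.1: weighted vs unweighted votes of the
# boosted trees, and both vs a genuine random forest, on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
y = 2 * y - 1                                   # relabel classes to {-1, +1}

boost = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Unweighted majority vote over the boosted trees.
votes = np.array([tree.predict(X) for tree in boost.estimators_])
majority = np.where(votes.sum(axis=0) >= 0, 1, -1)

print(np.mean(boost.predict(X) == majority))       # weighted vs unweighted vote
print(np.mean(boost.predict(X) == rf.predict(X)))  # boosting vs an actual RF
```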

4.2. A more empirical interpretation for RF's

By construction, RF's give a function \(f_{RF}(x_1,\ldots,x_p)\). So, suppose we also have a collection of functions \(f_1,\ldots,f_K\) that we want to use as a way to 'interpret' \(f_{RF}\) by regression. Conditional on the data and the \(f_k\)'s, solving
\[
\hat{w}=\operatorname*{arg\,min}_{w=(w_1,\ldots,w_K)}\sum_{i=1}^n\Big(f_{RF}(x_i)-\sum_{k=1}^K w_k f_k(x_i)\Big)^2 \tag{32}
\]
gives an approximation of \(f_{RF}(x)\),
\[
\hat{Y}_{RF}=\sum_{k=1}^K \hat{w}_k f_k(x). \tag{33}
\]
The \(\hat{w}_k\)'s may be found by using a standard least squares approach treating the \(f_{RF}(x_i)\)'s as the \(Y_i\)'s and the \(f_k(x_i)\)'s as K explanatory variables. More generally, since \(x_i=(x_{i1},\ldots,x_{ip})\), the \(f_k\)'s can be regarded as the leading terms in a basis expansion, e.g., a Taylor expansion in p dimensions, cf. Section 2.2. It is easy to see that replacing BSTC in (27) by the right side of (33) gives the cost of interpreting \(f_{RF}\) in terms of its regression function (using the \(f_k\)'s) by the residuals from (33). Whether the residuals are satisfactorily small can be assessed by a variety of established techniques. A sketch of this procedure is given below.
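A minimal sketch of (32)–(33) (our illustration; the DG, the RF settings, and the basis \(f_k\)'s – an intercept, the coordinates, and their squares – are hypothetical choices):

```python
# Sketch of (32)-(33): regress the fitted RF function onto interpretable
# basis functions f_k and use the residuals as the cost of interpretation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 1000, 3
X = rng.uniform(-1, 1, size=(n, p))
y = X[:, 0] ** 2 + np.sin(np.pi * X[:, 1]) + rng.normal(scale=0.1, size=n)

f_RF = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
z = f_RF.predict(X)                    # treat f_RF(x_i) as the responses

# Design matrix whose columns are f_1(x_i), ..., f_K(x_i).
F = np.column_stack([np.ones(n), X, X ** 2])
w_hat, *_ = np.linalg.lstsq(F, z, rcond=None)    # solves (32)
residuals = z - F @ w_hat                        # residuals from (33)
print(np.sqrt(np.mean(residuals ** 2)))          # size of the interpretation cost
```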

Expression (32) can also be phrased in terms of logistic regression, which may be more appropriate for a classifier; just replace the inner sum in (32) by the corresponding expression from a logit in a selection of variables such as the \(f_k\)'s and again minimize the error over choices of the parameters. This is not hard, but we have not carried it out, for ease of exposition. The logistic variant can be sketched as follows.
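A sketch of that variant, under the same hypothetical choices of the \(f_k\)'s:

```python
# Logistic-regression variant: interpret an RF classifier by fitting a logit
# of its predicted labels on the f_k's; the disagreement rate is the cost.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 1000, 3
X = rng.uniform(-1, 1, size=(n, p))
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)   # hypothetical DG

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
rf_labels = rf.predict(X)

F = np.column_stack([X, X ** 2])                 # the f_k's, as before
logit = LogisticRegression(max_iter=1000).fit(F, rf_labels)
print(np.mean(logit.predict(F) != rf_labels))    # cost of the logistic interpretation
```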

5. Discussion

In this paper we have examined three classes of predictors – kernel methods, Shtarkov solutions, and random forests – and shown that, despite the inability to interpret them, they can be asymptotically approximated both theoretically and pragmatically by interpretable expressions. In each case we have given an expression that quantifies the cost of approximating the ideal but uninterpretable predictor by its interpretable expression. Consequently, up to approximation error, the hitherto uninterpretable expressions that for the most part did not permit physical inference have been manipulated into forms from which physical inference may be possible.

For the sake of completeness, it is important to discuss conformal prediction; see Shafer and Vovk (2008) for a comprehensive overview based on Vovk et al. (2005). Applications to exponential families and generalized linear models are given in Eck and Crawford (2019) and recent computational progress is given in Vovk et al. (2020). Essentially, conformal prediction assumes sequential data and that future data will resemble past data, i.e., the DG has enough stability that prediction is feasible.

Conformal prediction can also be regarded as an extension of Geisser (1975) by including the assumptions necessary for future data to look like past data and taking a probabilistic or stochastic processes approach to data analysis. At root is a non-conformity measure used to assess how close data points are; different non-conformity measures lead to different prediction regions. Thus, this framework has much in common with calibration and prequentialism, see Dawid (1984), and is more general than time series (Box–Jenkins or state space models). On the other hand, much contemporary sequential data will not fit into this framework. In any event, our work here is broadly consistent with conformal prediction even though our emphasis is on interpretability more than the stochastic properties of the DG. A minimal sketch of the split-conformal construction is given below.
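For concreteness, here is a minimal split-conformal sketch (a standard construction, not specific to this paper), using absolute residuals as the non-conformity measure and a hypothetical DG; it wraps an arbitrary point predictor into an interval predictor with approximate \(1-\alpha\) coverage under exchangeability.

```python
# Split-conformal prediction: calibrate absolute residuals on held-out data
# to turn a point predictor into an interval predictor (~90% coverage).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.2, size=500)

train, calib = slice(0, 250), slice(250, 500)
F = RandomForestRegressor(random_state=0).fit(X[train], y[train])

scores = np.abs(y[calib] - F.predict(X[calib]))  # non-conformity scores
alpha = 0.1
q = np.quantile(scores, np.ceil((1 - alpha) * (250 + 1)) / 250)

x_new = np.array([[0.2, -0.4]])
pred = F.predict(x_new)[0]
print(pred - q, pred + q)                        # prediction interval I(x_new)
```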

Recall that our starting point was interpretability, and in Section 1 we proposed a definition for it. We did not distinguish between constructing an interpretable model pre-data and deriving an interpretable model post-data, although it is clear the latter will generally be better justified if it is derived from an input–output relation that predicts well. Then, we took linear combinations of variables as the paradigm case for interpretable quantities. We did this in Sections 2.3 and 4.2. In Section 3.3 we allowed a broader interpretation – the interactions of variables and parameters were not linear but were closely enough related that the effects could be readily seen. That is, in all three cases, we implicitly identified the \(c_k\)'s with variables and parameters whose combination was explicit and whose properties could be queried.

An anonymous referee asked (i) whether deep neural networks (DNN's) are interpretable and, if so, whether they could be used in place of the combinations of variables and parameters in Sections 2.3 and 4.2, and (ii) whether reinforcement learning might provide an alternative interpretation to that in Section 3.3 for the Shtarkov solution. The referee also observed that there are cases where DNN's may perform better than kernel methods and RF's, and implicitly recognized that the Shtarkov solution, possibly with side information, is similar to reinforcement learning. Thus, if DNN's and reinforcement learning are interpretable, why not simply start with them and ignore these other methods?

First, the interpretability of DNN's is problematic. Defining a clear physical correlate for nodes, layers of nodes, types of layers of nodes, and connectivity will often be elusive because DNN's are usually overparametrized and may be mathematically distinct even as they have very similar numerical properties as input–output relations. Some theorists have suggested that a DNN can be partitioned so that modules within it may have physical meaning or that reducing the number of neurons layer by layer might be akin to finding summary statistics. However, these suggestions remain conjectures. In short, it is not clear that DNN's are interpretable according to the definition used here even if sensitivity analyses can be used to understand the effect of parameters and variables on each other – a difficult task in practice for all but the smallest neural networks. If this holds then the kind of analysis done earlier, for kernel methods, say, to derive an interpretable approximation would have to be done for DNN's in order to assess what the DNN might have to say about a DG. That is, we are led to infer that the better performance of DNN's over other methods could be a result of their flexibility and associated non-interpretability, at least partially. Often, as a model becomes more complex or more general it becomes less interpretable. This does not contradict our basic assertion that interpretation has a cost in terms of prediction.

As to reinforcement learning, this is usually regarded as a sequential decision process in the context of discrete time, discrete space Markov processes. While transitions and actions may be interpretable, the analogy between reinforcement learning and the Shtarkov solution is not tight: The Shtarkov solution does not assume any distribution on the sequence of data points and arises from a minimization of regret, while reinforcement learning finds an optimal action for each transition. It is reasonable to conjecture that some version of reinforcement learning will provide an approximation to the Shtarkov solution in some settings, but investigating this is beyond our present scope.

Finally, we draw three implications from our results. First, from a pragmatic standpoint, we are arguing that, as a generality, models are at best only approximately true and the degree of approximation is usually unassessable. Proposed models can routinely be discredited by searching a more general class of predictors to get measurably better prediction. How can one assert a model is ‘true’ if its predictions can be improved? The consequence is that the other inferences from models taken as ‘true’ must be seen as unreliable absent further validation and assessment of their degree of mis-specification. In particular, physical interpretations are tentative at best.

Model mis-specification is an extensively studied topic, see Walker (2013), especially Sections 5 and 6, and the ensuing discussion. These authors generally focus on what we have called M-complete problems and take this as being the typical setting for modelling and analysis. Accordingly, these authors try to characterize the inferential difference between whatever model is used and the unknown but true model. Two recurring questions are: (1) Given that the true model class is unobtainable, what is it that we are making inferences about? and (2) Since the degree of mis-specification is important, how should we compare proposed models? These are addressed in a variety of ways by O'Hagan (2013), Hoff and Wakefield (2013) and De Blasi (2013).

Second, here we offer answers, perhaps unsatisfying, to these two questions. We argue that inferences should be about the next outcome, i.e., prediction, and that we should compare proposed models by how close the predictors they give are to the best predictor we can find. That is, prediction is the paradigm statistical problem, not estimation, interpretability, or other goals. Unless we have achieved good prediction there is no particular reason to trust other inferences. After all, from the falsification principle, it is unclear how to discredit estimates or the result of tests except by repeating an experiment, which is rarely done. The current term for this longstanding problem is the ‘replicability crisis’. At root, requiring optimal prediction is a solution to problems with replicability: No predictive validation implies no valid inferences of any other sort.

Finally, we suggest as a default that experimenters achieve good prediction via optimal uninterpretable methods and then adapt the results (as much as they dare) to make them interpretable. The loss of predictive power as a consequence of constructing a physical interpretation can then be quantified and its cost assessed. In this way a simplified, and possibly interpretable, model may be validated up to a measurable degree of predictive loss. The cost of interpretability is then a limit on the validity of the model much as a standard error is a limit on the certainty of an estimate. We conclude with what might be called the prediction principle: The degree of predictive success a method has determines the reliability of any interpretation that rests on it.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

Appendices

Appendix 1.

Details of some proofs

Proof of Corollary 2.1.

To see (7), recall Theorem 2.1 gives
\[
[\hat{Y}_{\mathrm{rep}}(x)-f_0(x)]^2\xrightarrow{P}0 \quad \text{as } n\to\infty. \tag{A1}
\]
More explicitly, given \(D\), using (3) the left-hand side of (A1) is
\[
\Big[\sum_{i=1}^n \alpha_i K(X_i,x)-f_0(x)\Big]^2, \tag{A2}
\]
where \(\alpha_i=\alpha_i(D)\). Cauchy–Schwarz gives that (A2) is bounded by
\[
2\Big[\sum_{i=1}^n \alpha_i K(X_i,x)\Big]^2+2f_0^2(x)
\le 4\Big[\sum_{i=N}^n \alpha_i K(X_i,x)\Big]^2+4\Big[\sum_{i=1}^{N-1}\alpha_i K(X_i,x)\Big]^2+2f_0^2(x)
\]
\[
\le 4(n-N+1)\Big(\sum_{i=N}^n \alpha_i^2\Big)\Big[\frac{1}{n-N+1}\sum_{i=N}^n K^2(X_i,x)\Big]+4\Big[\sum_{i=1}^{N-1}\alpha_i K(X_i,x)\Big]^2+2f_0^2(x). \tag{A3}
\]
Next, we show that the term \((1/(n-N+1))\sum_{i=N}^n K^2(X_i,x)\) on the right-hand side of (A3) is uniformly integrable. Begin by noting that Jensen's inequality gives
\[
\Big(\sum_{i=N}^n a_i\Big)^{1+\epsilon}\le (n-N+1)^{\epsilon}\Big(\sum_{i=N}^n a_i^{1+\epsilon}\Big), \tag{A4}
\]
for any \(\epsilon>0\) and \(a_i\ge 0\), \(i=N,\ldots,n\). (Set \(\varphi(x)=x^{1+\epsilon}\).) Now, by (A4),
\[
\sup_n E\Big[\frac{1}{n-N+1}\sum_{i=N}^n K^2(X_i,x)\Big]^{1+\epsilon}
\le \sup_n \frac{1}{(n-N+1)^{1+\epsilon}}\, E\Big[(n-N+1)^{\epsilon}\sum_{i=N}^n K^{2(1+\epsilon)}(X_i,x)\Big]
\le E\big[K^{2(1+\epsilon)}(X,x)\big]<\infty,
\]
by Assumption (i). Thus, \((1/(n-N+1))\sum_{i=N}^n K^2(X_i,x)\) is uniformly integrable.

Assumption (ii) gives that, as \(n\to\infty\), \((n-N+1)\sum_{i=N}^n \alpha_i^2=o_P(1)\) and is bounded. So, the right-hand side of (A3) is uniformly integrable in \(P_{(X,Y)}\). This implies
\[
[\hat{Y}_{\mathrm{rep}}(x)-f_0(x)]^2
\]
is uniformly integrable for any x and by (A1) has limit zero in probability. Therefore,
\[
E[\hat{Y}_{\mathrm{rep}}(x)-f_0(x)]^2\to 0,
\]
as \(n\to\infty\), by Theorem 25.12 in Billingsley (2012), concluding the proof.

Proof of Theorem 2.2.

From (4), (5), and using the fact that
\[
\sup_x |f(x)|\le \max\{|\sup_x f(x)|,|\inf_x f(x)|\}\le |\sup_x f(x)|+|\inf_x f(x)|,
\]
we have, writing g for a generic element of the ball \(B(f,\delta)\),
\[
\sup_{g\in B(f,\delta)}|\hat{Q}_n(g)-Q_0(g)|
=\sup_{g\in B(f,\delta)}\Big|\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big|
\]
\[
\le \Big|\sup_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}\Big|
+\Big|\inf_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}\Big|. \tag{A5}
\]
To bound the terms on the right-hand side of (A5), since \(E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))<\infty\), note that
\[
\sup_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}
\le \frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-\inf_{g\in B(f,\delta)}E_{(X,Y)}L(Y,g(X))
\]
\[
\le \frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))
\]
\[
\le \Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
+E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))
\]
\[
\le \Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
+\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
\]
\[
+E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X)). \tag{A6}
\]

Similarly, we have
\[
\inf_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}
\ge \frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-\sup_{g\in B(f,\delta)}E_{(X,Y)}L(Y,g(X))
\]
\[
\ge \frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))
\]
\[
\ge -\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
+E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))
\]
\[
\ge -\Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
-\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
\]
\[
+E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X)). \tag{A7}
\]
Therefore, from (A6) and (A7), the first term on the right-hand side of (A5) is bounded by
\[
\Big|\sup_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}\Big|
\le \Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
\]
\[
+\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
+E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X)), \tag{A8}
\]
and similarly the second term on the right-hand side of (A5) is bounded by
\[
\Big|\inf_{g\in B(f,\delta)}\Big\{\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big\}\Big|
\le \Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
\]
\[
+\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
+E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X)). \tag{A9}
\]
Combining (A5), (A8), and (A9), we get
\[
\sup_{g\in B(f,\delta)}\Big|\frac{1}{n}\sum_{i=1}^n L(y_i,g(x_i))-E_{(X,Y)}L(Y,g(X))\Big|
\le 2\Big|\frac{1}{n}\sum_{i=1}^n \sup_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))\Big|
\]
\[
+2\Big|\frac{1}{n}\sum_{i=1}^n \inf_{g\in B(f,\delta)}L(y_i,g(x_i))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big|
+2\Big[E_{(X,Y)}\sup_{g\in B(f,\delta)}L(Y,g(X))-E_{(X,Y)}\inf_{g\in B(f,\delta)}L(Y,g(X))\Big]. \tag{A10}
\]
It follows from the continuity of L in f and the dominated convergence theorem that the third term on the right-hand side of (A9) satisfies
\[
\lim_{\delta\to 0}\sup_{f\in\mathcal{H}_K}E_{(X,Y)}\Big[\sup_{g\in B(f,\delta)}L(Y,g(X))-\inf_{g\in B(f,\delta)}L(Y,g(X))\Big]
\le \lim_{\delta\to 0}E_{(X,Y)}\sup_{f\in\mathcal{H}_K}\Big[\sup_{g\in B(f,\delta)}L(Y,g(X))-\inf_{g\in B(f,\delta)}L(Y,g(X))\Big]=0,
\]
and hence we can choose \(\delta\) so small that
\[
\sup_{f\in\mathcal{H}_K}E_{(X,Y)}\Big[\sup_{g\in B(f,\delta)}L(Y,g(X))-\inf_{g\in B(f,\delta)}L(Y,g(X))\Big]<\epsilon. \tag{A11}
\]
Furthermore, by the compactness of \(D_j\), there exist finitely many f's, say \(f_1,\ldots,f_{N(\delta)}\), such that \(D_j\subseteq \bigcup_{i=1}^{N(\delta)}B(f_i,\delta)\). Hence, by the union of events bound,
\[
P\Big(\sup_{f\in D_j}\Big|\frac{1}{n}\sum_{i=1}^n L(y_i,f(x_i))-E_{(X,Y)}L(Y,f(X))\Big|>\epsilon\Big)
\le P\Big(\max_{1\le i\le N(\delta)}\sup_{g\in B(f_i,\delta)}\Big|\frac{1}{n}\sum_{j=1}^n L(y_j,g(x_j))-E_{(X,Y)}L(Y,g(X))\Big|>\epsilon\Big)
\]
\[
\le \sum_{i=1}^{N(\delta)}P\Big(\sup_{g\in B(f_i,\delta)}\Big|\frac{1}{n}\sum_{j=1}^n L(y_j,g(x_j))-E_{(X,Y)}L(Y,g(X))\Big|>\epsilon\Big).
\]
Using (A10) and (A11), this expression is bounded by
\[
\sum_{i=1}^{N(\delta)}P\Big(\Big|\frac{1}{n}\sum_{j=1}^n \sup_{g\in B(f_i,\delta)}L(y_j,g(x_j))-E_{(X,Y)}\sup_{g\in B(f_i,\delta)}L(Y,g(X))\Big|
+\Big|\frac{1}{n}\sum_{j=1}^n \inf_{g\in B(f_i,\delta)}L(y_j,g(x_j))-E_{(X,Y)}\inf_{g\in B(f_i,\delta)}L(Y,g(X))\Big|>\frac{\epsilon}{2}\Big)
\]
\[
\le \sum_{i=1}^{N(\delta)}P\Big(\Big|\frac{1}{n}\sum_{j=1}^n \sup_{g\in B(f_i,\delta)}L(y_j,g(x_j))-E_{(X,Y)}\sup_{g\in B(f_i,\delta)}L(Y,g(X))\Big|>\frac{\epsilon}{4}\Big)
+\sum_{i=1}^{N(\delta)}P\Big(\Big|\frac{1}{n}\sum_{j=1}^n \inf_{g\in B(f_i,\delta)}L(y_j,g(x_j))-E_{(X,Y)}\inf_{g\in B(f_i,\delta)}L(Y,g(X))\Big|>\frac{\epsilon}{4}\Big),
\]
which goes to 0 as \(n\to\infty\) by the weak law of large numbers, concluding the proof by Assumption (iii).

Appendix 2.

Interpretability versus complexity

Interpretability is a different concept from complexity. We have implicitly assumed that the best predictors (and models) are highly complex, but we regard this as the most common case in current practice, not a priori true. In point of fact, an interpretable model may be simple or complex and an uninterpretable model may be simple or complex. Otherwise put, all pairs of (interpretability, complexity) can occur. As a generality, M-closed problems are less complex than M-complete problems and they in turn are less complex than M-open problems. However, this ordering does not in general preclude the existence of an interpretable predictor or model for any problem.

Here, we use the notion of interpretability in Le and Clarke (2020). On the other hand, here complexity refers to how many components are required for good prediction or modelling. This is different from other notions of complexity, such as VC dimension or code length, since these do not require any components of a predictor or model to have physical correlates.

Nevertheless, as an empirical observation, we have noted that the more interpretability one demands of a model, the more complex it will typically be and, often, the more complex the true model is, the higher the model mis-specification will be. Correspondingly, the less interpretability one requires, the smaller the error of the predictor can be, but it is not clear what effect this has on complexity. Empirically, the predictions from an interpretable model are likely to be worse than the predictions from a well-chosen non-interpretable model or predictor. This arises for the intuitive reason that restricting model classes to only those that are interpretable is likely to increase bias, because real-world phenomena are rarely captured to infinite precision by what we think are physically meaningful models. Because interpretable models may be simplifications of the real phenomena, we expect some of the bias to be reflected in increased variance as well. The exception is when the model actually is valid to infinite precision, or at least to a precision higher than that achieved by other models; this can happen but is atypical. These observations are neither new nor surprising. The issue is to quantify them, to ascertain how much of an interpretation one can derive from an uninterpretable model without losing too much accuracy of prediction.