Discussion of Professor Bradley Efron’s Article on “Prediction, Estimation, and Attribution”

Pages 667-671 | Received 09 Apr 2020, Accepted 18 Apr 2020, Published online: 04 Jun 2020

1 Introduction

Noting the rapidly growing trend of “pure prediction algorithms,” Professor Efron compares and bridges the statistics of the 20th Century (estimation and attribution) with the fast-growing developments of the 21st Century (prediction). The outstanding discussion offers many deep-rooted insights and comments. As did his forward-thinking article on Fisher’s influence on modern statistics (Efron 1998), which helped shape many recent developments in statistical inference (including our own work on confidence distributions (Singh, Xie, and Strawderman 2005; Xie and Singh 2013)), this equally inspiring article by Professor Efron will certainly galvanize many contemporary and powerful developments for modern statistics and for the foundations of data science.

In this note, we echo and provide additional support to two important points made by Professor Efron: (1) prediction is “an easier task than either attribution or estimation”; (2) the IID assumption (e.g., random splitting of training and testing datasets) is crucial in the current developments on prediction, but we also need to do more for the case when the IID assumption is not met. Based on our own research, we provide additional evidence to support these discussions. We discover that prediction has a homeostasis property and works well under the IID setting even if the learning model used is completely wrong. We also highlight the importance of good modeling and inference practice: a good learning model with good estimation is important for improving prediction efficiency in the IID case, and it becomes essential for maintaining validity in the non-IID case. The message remains: we still need to make an effort to build a good learning model and estimation algorithm for prediction, even if prediction is an easier task than estimation.

From the outset, we would like to point out that it is not a straw-man argument to consider non-IID testing data. On the contrary, such data are prevalent in data science. In addition to the examples provided by Professor Efron that showed “drift,” we can easily imagine non-IID examples in many typical applications. For instance, a predictive algorithm is trained on a database of patient medical records, and we would like to predict the potential outcomes of a treatment for a new patient with more severe symptoms than the average patient shows. The new patient with more severe symptoms is not a typical IID draw from the general patient population. Similarly, in the finance sector, one is often interested in predicting the financial performance of a particular company. If a predictive model is trained on data from all institutions, then the testing data (of the specific company of interest) are unlikely to be IID draws from the same general population as the training data. The limitation of the IID assumption, in our opinion, has hampered our efforts to take full advantage of fast-developing machine learning methodologies (e.g., deep neural network models, tree-based methods, etc.) in many real-world applications.

Our discussions in this note are based on a so-called conformal prediction procedure, an attractive new prediction framework that is error (or model) distribution free (see, e.g., Vovk, Gammerman, and Shafer 2005; Shafer and Vovk 2008). We discover a homeostasis phenomenon: the expected bias caused by using a wrong model is largely offset by the corresponding negatively shifted predictive errors under the IID setting. Thus, the predictive conclusion is always valid even if the model used to train the data is completely wrong. This robustness result clearly supports the claim that prediction is an easier task than modeling and estimation. However, the use of a wrong training (learning) model has at least two undesirable impacts on prediction: (a) a prediction based on a wrong model typically produces much wider predictive intervals than those based on a correct model; (b) although the IID case enjoys a nice homeostatic cancellation of the bias (in the fitted model) and the shift (in the associated predictive errors) when a wrong learning model is used, in the non-IID case this cancellation is often no longer effective, resulting in invalid predictions. The use of a correct learning model can help mitigate, and sometimes solve, the problem of invalid prediction for non-IID (e.g., drifted or individual-specific) testing data.

Section 2 reviews a conformal predictive procedure and shows that the prediction is valid under the IID setting, even if the learning model is completely wrong. Section 3 is a numerical study using a neural network model to demonstrate the impact of a wrong learning model and estimation on prediction in both the IID and non-IID cases. Section 4 offers concluding remarks. A more detailed discussion, including an introduction of the predictive curve (to represent predictive intervals of all levels) and an elaborated study of linear models, is in Xie and Zheng (2020).

2 Prediction, Testing Data, and Learning Models

As in Equation (6.4) of Professor Efron’s article, we assume that a training (observed) dataset of size $n$, say $\mathcal{D}_{\mathrm{obs}} = \{(x_i, y_i), i = 1, \ldots, n\}$, is available, where $(x_i, y_i), i = 1, \ldots, n$, are IID random samples from an unknown population $F$. For a given $x_{\mathrm{new}}$, we would like to predict what $y_{\mathrm{new}}$ would be. We first use the typical assumption that $(x_{\mathrm{new}}, y_{\mathrm{new}})$ is also an IID draw from $F$. Later we relax this requirement and only assume that $y_{\mathrm{new}}|x_{\mathrm{new}}$ relates to $x_{\mathrm{new}}$ in the same way as $y_i|x_i$ relates to $x_i$, but $x_{\mathrm{new}}$ follows a marginal distribution that is different from that of $x_i$.

For notational convenience, we consider $(x_{\mathrm{new}}, y_{\mathrm{new}})$ as the $(n+1)$th observation and introduce the index $n+1$, with $x_{n+1} = x_{\mathrm{new}}$ and $y_{n+1}$ a potential value of the unobserved $y_{\mathrm{new}}$. Unless specified otherwise, the index “$n+1$” and the index “new” are used interchangeably throughout the note.

2.1 Conformal Prediction Inference With Quantified Confidence Levels

The conformal prediction method has attracted increasing attention in the learning communities in recent years (see, e.g., Vovk, Gammerman, and Shafer 2005; Shafer and Vovk 2008; Lei et al. 2018; Barber et al. 2019a, 2019b). The idea is straightforward. To make a prediction of the unknown $y_{\mathrm{new}}$ given $x_{n+1} = x_{\mathrm{new}}$, we examine a potential value $y_{n+1}$ and see how “conformal” the pair $(x_{n+1}, y_{n+1})$ is among the observed $n$ pairs of IID data points $(x_i, y_i), i = 1, \ldots, n$. The higher the “conformality,” the more likely $y_{\mathrm{new}}$ takes the potential value $y_{n+1}$. Frequently, a learning model, say $y_i \approx \mu(x_i)$ for $i = 1, \ldots, n, n+1$, is used to assist prediction. However, the learning model is not essential. As we will see later, even if $\mu(\cdot)$ is totally wrong or does not exist, conformal prediction can still provide us with a valid prediction, as long as the IID assumption holds.

To be specific, this note employs a conformal prediction procedure known as the Jackknife-plus method (see, e.g., Barber et al. 2019b). Consider a combined collection of both the training and testing data but with the unknown $y_{\mathrm{new}}$ replaced by a potential value $y_{n+1}$: $\mathcal{A} = \mathcal{D}_{\mathrm{obs}} \cup \{(x_{\mathrm{new}}, y_{n+1})\} = \{(x_i, y_i), i = 1, \ldots, n, n+1\}$. We define conformal residuals $$R_{ij} = y_i - \hat{y}_i^{(-i,-j)}, \quad \text{for } i \neq j \text{ and } i, j = 1, \ldots, n, n+1,$$ where $\hat{y}_i^{(-i,-j)}$ is the prediction of $y_i$ based on the leave-two-out dataset $\mathcal{A}^{(-i,-j)} = \mathcal{A} \setminus \{(x_i, y_i), (x_j, y_j)\}$. If a working model $\mu(\cdot)$ is used, for instance, the model is first fit on the leave-two-out dataset $\mathcal{A}^{(-i,-j)}$ and the point prediction is set to be $\hat{y}_i^{(-i,-j)} = \hat{\mu}(x_i; \mathcal{A}^{(-i,-j)})$, where $\hat{\mu}(\cdot; \mathcal{A}^{(-i,-j)})$ is the fitted (trained) model using $\mathcal{A}^{(-i,-j)}$.

For each given $y_{n+1}$ (a potential value of $y_{\mathrm{new}}$), we define
$$Q_n(y_{n+1}) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{R_{n+1,i} \leq R_{i,n+1}\}, \qquad (1)$$
which relates to the degree of “conformity” of the residual value $R_{n+1,i} = y_{n+1} - \hat{y}_{n+1}^{(-i,-(n+1))}$ among the conformal residuals $R_{i,n+1} = y_i - \hat{y}_i^{(-i,-(n+1))}$, $i = 1, \ldots, n$. (Here, the $R_{i,n+1}$ are in fact the leave-one-out residuals from using the training dataset $\mathcal{D}_{\mathrm{obs}}$.) If $Q_n(y_{n+1}) \approx \frac{1}{2}$, then $R_{n+1,i}$ is around the middle of the training data residuals $R_{i,n+1}$ and thus “most conformal.” When $Q_n(y_{n+1}) \approx 0$ or $1$, $R_{n+1,i}$ is at the extreme ends of the training data residuals $R_{i,n+1}$ and thus “least conformal.” This intuition leads us to define a conformal predictive interval of $y_{\mathrm{new}}$ as
$$C_\alpha = \{y: Q_n(y) \geq \tfrac{\alpha}{2}\} \cap \{y: 1 - Q_n(y) \geq \tfrac{\alpha}{2}\} = \Big[q_{\alpha/2}\big(\{\hat{y}_{n+1}^{(-i,-(n+1))} + R_{i,n+1}\}_{i=1}^n\big),\ q_{1-\alpha/2}\big(\{\hat{y}_{n+1}^{(-i,-(n+1))} + R_{i,n+1}\}_{i=1}^n\big)\Big], \qquad (2)$$
where $q_\alpha(\{a_i\}_{i=1}^n)$ is the $\alpha$th quantile of $a_1, \ldots, a_n$. The interval (2) is a variant of the Jackknife-plus predictive interval proposed by Barber et al. (2019b), in which $R_{i,n+1}$ is replaced by $|R_{i,n+1}| = |y_i - \hat{y}_i^{(-i,-(n+1))}|$ instead. The following proposition states that, under the IID assumption, $C_\alpha$ defined in (2) is a predictive set for $y_{\mathrm{new}}$ with guaranteed level $(1 - 2\alpha)$.

Proposition 1.

If $(x_i, y_i), (x_{\mathrm{new}}, y_{\mathrm{new}}) \overset{\mathrm{iid}}{\sim} F$, for $i = 1, \ldots, n$, then $P(y_{\mathrm{new}} \in C_\alpha) \geq 1 - 2\alpha$.

A proof of the proposition, which holds for finite $n$, can be found in Xie and Zheng (2020). Barber et al. (2019b) pointed out that, empirically, intervals like $C_\alpha$ typically have a coverage rate close to $1 - \alpha$. In the rest of the note, we treat $C_\alpha$ as an approximate level-$(1-\alpha)$ predictive interval.
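To make the construction concrete, the following is a minimal R sketch of the interval in (2). It uses the fact that, since the $(n+1)$th point is always left out, $\hat{y}_{n+1}^{(-i,-(n+1))}$ and $R_{i,n+1}$ only require leave-one-out fits on the training data. The helper names fit_fun and pred_fun are hypothetical placeholders for an arbitrary learning algorithm and are not part of the article.

  # Minimal sketch of the conformal predictive interval (2);
  # fit_fun(x, y) and pred_fun(model, x) are user-supplied placeholders.
  conformal_interval <- function(x, y, x_new, fit_fun, pred_fun, alpha = 0.05) {
    n <- length(y)
    loo_pred_new <- numeric(n)   # predictions of y_new: \hat{y}_{n+1}^{(-i,-(n+1))}
    loo_resid    <- numeric(n)   # leave-one-out residuals: R_{i,n+1}
    for (i in seq_len(n)) {
      m_i             <- fit_fun(x[-i, , drop = FALSE], y[-i])   # leave observation i out
      loo_pred_new[i] <- pred_fun(m_i, x_new)
      loo_resid[i]    <- y[i] - pred_fun(m_i, x[i, , drop = FALSE])
    }
    bounds <- loo_pred_new + loo_resid   # the n values \hat{y}_{n+1}^{(-i,-(n+1))} + R_{i,n+1}
    c(lower = unname(quantile(bounds, alpha / 2)),
      upper = unname(quantile(bounds, 1 - alpha / 2)))
  }

For instance, with a linear working model and x stored as an n-by-p matrix, one could take fit_fun <- function(x, y) lm.fit(cbind(1, x), y)$coefficients and pred_fun <- function(m, x) drop(cbind(1, x) %*% m), with x_new supplied as a one-row matrix.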

Note that the function $Q_n(y)$ defined in (1) is in essence a predictive distribution function of $y_{\mathrm{new}}$ (see, e.g., Shen, Liu, and Xie 2018; Vovk et al. 2019). The corresponding predictive curve of $y_{\mathrm{new}}$ is $\mathrm{PV}_n(y) = 2\min\{Q_n(y), 1 - Q_n(y)\}$.

Clearly, $C_\alpha = \{y: \mathrm{PV}_n(y) \geq \alpha\}$. A plot of the predictive curve function $\mathrm{PV}_n(y)$ provides a full picture of the conformal predictive intervals of all levels. Analogous to the confidence distribution and Birnbaum’s confidence curve, the predictive function $Q_n(y)$ has a confidence interpretation as the p-value function of a one-sided test of $H_0: y_{\mathrm{new}} = y$, and the predictive curve $\mathrm{PV}_n(y)$ has the same interpretation for the corresponding two-sided test (see, e.g., Xie and Zheng 2020, sec. 2.2). A formal definition of the conformal predictive function is in Vovk et al. (2019).
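Continuing the sketch above (and assuming the leave-one-out quantities loo_pred_new and loo_resid have been computed and returned, as inside that sketch), $Q_n(y)$ and $\mathrm{PV}_n(y)$ can be evaluated on a grid of candidate values; note that the event $\{R_{n+1,i} \leq R_{i,n+1}\}$ is the same as $\{y \leq \hat{y}_{n+1}^{(-i,-(n+1))} + R_{i,n+1}\}$.

  # Empirical predictive distribution (1) and predictive curve, on a grid of candidate y values
  Qn  <- function(y) mean(y <= loo_pred_new + loo_resid)   # Q_n(y)
  PVn <- function(y) 2 * min(Qn(y), 1 - Qn(y))             # PV_n(y)
  y_grid <- seq(min(loo_pred_new + loo_resid) - 1,
                max(loo_pred_new + loo_resid) + 1, length.out = 400)
  plot(y_grid, sapply(y_grid, PVn), type = "l", xlab = "candidate y", ylab = "PV_n(y)")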

A striking result is that Proposition 1 holds even if the learning model $\mu(\cdot)$ used to obtain the prediction is completely wrong, as long as the IID assumption holds. This robustness against model misspecification is highly touted in the machine learning community. It lends support to the sentiment of using “black box” algorithms, in which the role of model fitting is reduced to an afterthought, although we will also provide arguments to counter this sentiment.

2.2 IID Versus Non-IID: Efficiency and Validity Under a Wrong Model

Although the validity of prediction is robust against wrong learning models in the IID case, there is no free lunch. The predictive intervals obtained under a wrong model are typically wider. For instance, suppose that the true model is $y = \mu_0(x) + \epsilon$, but a wrong model $y = \mu_1(x) + e$ is used. Since $y = \mu_0(x) + \epsilon = \mu_1(x) + \{\mu_0(x) - \mu_1(x)\} + \epsilon$, we have $e = \{\mu_0(x) - \mu_1(x)\} + \epsilon$. So, when $\epsilon$ is independent of $x$, $\mathrm{var}(e) = \mathrm{var}\{\mu_0(x) - \mu_1(x)\} + \mathrm{var}(\epsilon) \geq \mathrm{var}(\epsilon)$, and the equality holds only when $\mu_1(x) = \mu_0(x)$. Thus, the error term $e$ under a wrong model has a larger variance than the error term $\epsilon$ under the true model. The larger the variance $\mathrm{var}\{\mu_0(x) - \mu_1(x)\}$ is (i.e., the more discrepant $\mu_1(x)$ and $\mu_0(x)$ are), the larger the variance of the error term $e$ is. A larger error translates to less accurate estimation and prediction. See also Proposition 2 of Xie and Zheng (2020) for a formal statement regarding the predictive interval lengths in linear models.
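For a toy illustration of this variance inflation (our own numerical example, not one from the article), take $\mu_0(x) = x$, $\mu_1(x) \equiv 0$, $x \sim N(0,1)$, and $\mathrm{var}(\epsilon) = 1$. Then
$$\mathrm{var}(e) = \mathrm{var}\{\mu_0(x) - \mu_1(x)\} + \mathrm{var}(\epsilon) = 1 + 1 = 2,$$
so the residuals under the wrong model have twice the variance of those under the true model, and quantile-based predictive intervals built from them are roughly $\sqrt{2}$ times as wide.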

We offer an explanation of why a conformal predictive algorithm can still provide valid prediction under a totally wrong learning model in the IID case. Specifically, when we use a wrong model $\mu_1(x)$, the corresponding point predictor is biased by the magnitude of $\mu_1(x_{\mathrm{new}}) - \mu_0(x_{\mathrm{new}})$, but at the same time the error term $e$ absorbs the bias and produces residuals that are shifted by the magnitude of $\mu_0(x_i) - \mu_1(x_i) = -\{\mu_1(x_i) - \mu_0(x_i)\}$. In the conformal predictive interval (2), the quantiles of the residuals are added back to the point prediction to form the interval bounds. If the IID assumption holds, the bias is offset by the shift. See also Xie and Zheng (2020), in which an explicit mathematical expression of this cancellation in linear models is derived. Along with the greater residual variance, this offsetting ensures the validity of the conformal prediction in the IID case. We call this self-balancing tendency to maintain validity a homeostasis phenomenon.
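A small simulation sketch (our own toy linear setup, not Example 1 below) makes the homeostasis point tangible: with IID test draws, the interval (2) keeps roughly the nominal coverage even under a grossly wrong intercept-only model, only at the price of wider intervals. The sketch reuses the conformal_interval function outlined in Section 2.1.

  # Toy homeostasis check: wrong (intercept-only) vs. correct (linear) working model
  set.seed(1)
  n <- 100; reps <- 100; alpha <- 0.1
  cover_wrong <- cover_true <- len_wrong <- len_true <- numeric(reps)
  for (r in seq_len(reps)) {
    x <- matrix(rnorm(n), ncol = 1)
    y <- 2 * x[, 1] + rnorm(n)                     # assumed truth: y = 2x + epsilon
    x_new <- matrix(rnorm(1), ncol = 1)            # IID test draw
    y_new <- 2 * x_new[1, 1] + rnorm(1)
    ci_w <- conformal_interval(x, y, x_new, alpha = alpha,
                               fit_fun  = function(x, y) mean(y),   # wrong model: a constant
                               pred_fun = function(m, x) m)
    ci_t <- conformal_interval(x, y, x_new, alpha = alpha,
                               fit_fun  = function(x, y) lm.fit(cbind(1, x), y)$coefficients,
                               pred_fun = function(m, x) drop(cbind(1, x) %*% m))
    cover_wrong[r] <- (ci_w["lower"] <= y_new) && (y_new <= ci_w["upper"])
    cover_true[r]  <- (ci_t["lower"] <= y_new) && (y_new <= ci_t["upper"])
    len_wrong[r]   <- ci_w["upper"] - ci_w["lower"]
    len_true[r]    <- ci_t["upper"] - ci_t["lower"]
  }
  c(coverage_wrong = mean(cover_wrong), coverage_true = mean(cover_true))  # both near 1 - alpha
  c(length_wrong = mean(len_wrong), length_true = mean(len_true))          # wrong model is wider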

The IID assumption is a crucial condition to ensure the validity of a prediction under a wrong model. If the IID assumption does not hold for the testing data, the prediction based on a wrong learning model (or a correct model with a wrong parameter estimation) is often invalid with large errors, as we see in our case studies. We think this IID assumption also explains why deep neural networks and other machine learning methods work so well in academic research settings (where a random split of the data into training and testing sets is a common practice) but fail to produce “killer applications” that make predictions for a given patient or company whose $x_{\mathrm{new}}$ is often not close to the center of the training data. The good news is that, if we use a correct model for training and can get good model estimates, it is still possible to get a valid prediction for a specific $x_{\mathrm{new}}$. Modeling and estimation remain relevant and often crucial for prediction in both the IID and non-IID cases.

3 Case Study: Prediction Under Neural Network Models

We use a neural network model and a simulation study to provide empirical support for our discussion. In current neural network development, model-fitting algorithms do not pay much attention to correctly estimating the model parameters. We find that, in addition to a correct model specification, the estimation of model parameters also plays an important role in prediction.

Example 1.

Suppose our training data $(y_i, x_i), i = 1, \ldots, n$, are IID samples from the model
$$y_i = \mu_0(x_i) + \epsilon_i = \max\big\{0,\ \max\{0,\ z_{i1} + z_{i2}\} - \max\{0,\ w_i\}\big\} + \epsilon_i, \quad \epsilon_i \overset{\mathrm{iid}}{\sim} N(0, \sigma^2), \qquad (3)$$
where $x_i = (z_{i1}, z_{i2}, w_i)^T \overset{\mathrm{iid}}{\sim} N(\mu_x, \Sigma_x)$, and $\epsilon_i$ and $x_i$ are independent. Here, $\mu_x = (0,0,0)^T$, the $(k,k')$-element of $\Sigma_x$ is $0.5^{|k-k'|/2}$, for $k, k' \in \{1,2,3\}$, $\sigma^2 = 1$ and $n = 300$. Model (3) is in fact a neural network model (with a diagram presented in Figure 1) and we can re-express $\mu_0(x_i)$ as
$$\mu_0(x_i) = f(A_2 f(A_1 x_i)). \qquad (4)$$

Fig. 1 Diagrams of four neural network models: (a) true $\mu_0(\cdot)$; (b) partial $\mu_1(\cdot)$; and (c, d) over-parameterized $\mu_2(\cdot)$ and $\mu_3(\cdot)$ with (20 nodes in each layer) $\times$ $L$ layers, for $L = 20$ and 100, respectively.


Here, $f(x) = \max(x, 0)$ is the ReLU activation function, and $A_1 = \begin{pmatrix} a^{(1)}_{11} & a^{(1)}_{12} & a^{(1)}_{13} \\ a^{(1)}_{21} & a^{(1)}_{22} & a^{(1)}_{23} \end{pmatrix}$ and $A_2 = (a^{(2)}_1, a^{(2)}_2)$ are the model parameters. Corresponding to (3), the true model parameter values are $a^{(1)}_{11} = a^{(1)}_{12} = a^{(1)}_{23} = 1$, $a^{(1)}_{13} = a^{(1)}_{21} = a^{(1)}_{22} = 0$ and $(a^{(2)}_1, a^{(2)}_2) = (1, -1)$. In our analysis, we assume that we know the model form (4) but do not know the values of the model parameters $A_1$ and $A_2$.

For the testing data, we consider two scenarios: (i) [IID case] $x_{\mathrm{new}} \overset{\mathrm{iid}}{\sim} N(\mu_x, \Sigma_x)$ and, given $x_{\mathrm{new}}$, $(x_{\mathrm{new}}, y_{\mathrm{new}})$ follows (3); (ii) [non-IID case] the marginal distribution of $x_{\mathrm{new}}$ is that of $(T_1, T_2, T_3)^T$ and, given $x_{\mathrm{new}}$, $(x_{\mathrm{new}}, y_{\mathrm{new}})$ follows (3). Here, $T_1, T_2, T_3$ are IID random variables from the t-distribution with 3 degrees of freedom and noncentrality parameter 1.
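As a reference for the setup, the following is a short R sketch of how the training data and the two testing scenarios might be generated under our reading of model (3); the random seed and the exact covariance constant are illustrative assumptions.

  # Simulate training data from model (3) and one test point under each scenario
  library(MASS)                                     # for mvrnorm
  set.seed(2020)
  mu0 <- function(x) pmax(0, pmax(0, x[, 1] + x[, 2]) - pmax(0, x[, 3]))  # true mean in (3)
  n <- 300
  Sigma_x <- 0.5 ^ (abs(outer(1:3, 1:3, "-")) / 2)  # (k,k')-entry 0.5^{|k-k'|/2} (our reading)
  x_train <- mvrnorm(n, mu = rep(0, 3), Sigma = Sigma_x)
  y_train <- mu0(x_train) + rnorm(n)                # sigma^2 = 1
  x_new_iid   <- mvrnorm(1, mu = rep(0, 3), Sigma = Sigma_x)  # scenario (i): IID draw
  x_new_drift <- rt(3, df = 3, ncp = 1)                       # scenario (ii): drifted covariates
  y_new_iid   <- mu0(matrix(x_new_iid,   nrow = 1)) + rnorm(1)
  y_new_drift <- mu0(matrix(x_new_drift, nrow = 1)) + rnorm(1)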

In addition to (a) the true model $\mu_0(\cdot)$, four wrong learning models are considered:
(b) $\mu_1(x_i) = f(B z_i)$ (partially correct neural network model, missing $w_i$);
(c) $\mu_2(x_i) = f(C_{20} f(C_{19} \cdots f(C_1 x) \cdots))$ (deep neural network model with 20 layers);
(d) $\mu_3(x_i) = f(D_{100} f(D_{99} \cdots f(D_1 x) \cdots))$ (deep neural network model with 100 layers);
(e) $\mu_4(x_i) = \eta_0$ (without any covariates),
where $z_i = (z_{i1}, z_{i2})^T$, $B = (b_1, b_2)$, $C_1, D_1 \in \mathbb{R}^{20 \times 3}$, $C_{20}, D_{100} \in \mathbb{R}^{1 \times 20}$, and $C_i, D_j \in \mathbb{R}^{20 \times 20}$ for $2 \leq i \leq 19$, $2 \leq j \leq 99$. In our analysis, the neural network models $\mu_0(\cdot)$–$\mu_3(\cdot)$ are fit using the neuralnet package (cran.r-project.org/web/packages/neuralnet/).

The neuralnet package is an off-the-shelf machine learning tool. Its emphasis is on learning and not on model parameter estimation. Even under the true model $\mu_0(\cdot)$, the estimates of the model parameters from neuralnet are not very accurate (see Table 1). In the table, “Opt-MSE” refers to code that we wrote by directly minimizing $\mathrm{MSE} = \sum_{j=1}^n \{y_j - \mu_0(x_j)\}^2$, which can be implemented when the neural network is small. The calculation in the table is based on 20 repeated runs, each with a training dataset of size $n = 300$ from model (3).

Table 1 Mean square error of each parameter in $\mu_0$ (training data $n = 300$; repetition = 10).
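As an illustration of what an “Opt-MSE”-type fit could look like (our own sketch, not the authors’ code), the eight free parameters in (4) can be estimated by handing the training MSE directly to a general-purpose optimizer, using the simulated x_train and y_train from the sketch above; because the ReLU kinks make the objective only piecewise smooth, several random starting values are advisable.

  # Rough "Opt-MSE" sketch: minimize the training MSE of the small network (4) over (A1, A2)
  relu <- function(u) pmax(u, 0)
  mu_net <- function(theta, X) {               # theta = (entries of A1 by row, then A2); X is n x 3
    A1 <- matrix(theta[1:6], nrow = 2, byrow = TRUE)
    A2 <- theta[7:8]
    relu(drop(relu(X %*% t(A1)) %*% A2))       # f(A2 f(A1 x)), applied row by row
  }
  mse <- function(theta, X, y) sum((y - mu_net(theta, X))^2)
  fit <- optim(par = rnorm(8, sd = 0.5), fn = mse, X = x_train, y = y_train,
               method = "BFGS", control = list(maxit = 500))
  theta_hat <- fit$par                         # first six entries estimate A1; last two estimate A2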

Reported in Table 2 are the coverage rates and average interval lengths of the predictive intervals computed under $10 = 5 \times 2$ settings, with five different learning models $\mu_k(\cdot), k = 0, 1, \ldots, 4$, in two scenarios. The analysis is repeated 10 times with 10 simulated training datasets from model (3). We use 10 repetitions and not a greater number because it takes a long time to fit a neural network model. However, for each of the 10 training datasets, 20 pairs of $(y_{\mathrm{new}}, x_{\mathrm{new}})$ are used. So each reported value is computed using $10 \times 20 = 200$ pairs of $(y_{\mathrm{new}}, x_{\mathrm{new}})$. For the true neural network model $\mu_0(\cdot)$, Opt-MSE is also used to fit the model. As we can see in Table 2, under the IID scenario, all predictive intervals are valid with a correct coverage. The best one, with the shortest interval length, is the one that uses the correct model and the Opt-MSE estimation method. In the non-IID case, only the shallow neural network models provide valid predictions, and among them, Opt-MSE gives predictive intervals with half the width. Indeed, when a wrong learning model is used, the IID assumption is essential for prediction validity, and the use of a wrong model often results in wider intervals. Furthermore, the estimation of the model parameters seems to also have a big impact on prediction.

Table 2 Performance of 95% predictive intervals under five different learning models and in two scenarios: coverage rates (before brackets) and average interval lengths (inside brackets) (training data size = 300; testing data size = 20; repetition = 10).

To get a full picture of the predictive intervals at all levels, we plot in Figure 2 the predictive curves of $y_{\mathrm{new}}$. The plots are based on the first training dataset, making predictions for (a) the IID case with the realization $x_{\mathrm{new}} = (0.909, 1.149, 0.771)$, and (b) the non-IID case with the realization $x_{\mathrm{new}} = (3.653, 1.748, 1.063)$. The realized value of $\mu_0(x_{\mathrm{new}})$ is 0 and 4.338 in (a) and (b), respectively. From Figure 2, we see that the use of a wrong model $\mu_1(\cdot)$–$\mu_4(\cdot)$ results in a wider predictive curve (and wider predictive intervals at all levels $1 - \alpha \in (0,1)$) in both the IID and non-IID cases. Although the shallow neural network models $\mu_0(\cdot)$ and $\mu_1(\cdot)$ can provide good coverage rates, their predictive curves in the non-IID case are much wider than those of other approaches. This peculiar phenomenon occurs even when we assume that the true model structure $\mu_0(\cdot)$ is known, indicating the importance of estimating the model parameters accurately. Furthermore, in the non-IID case, there are large shifts when we use the deep neural network models $\mu_2(\cdot)$ and $\mu_3(\cdot)$, leading to invalid predictions. The best prediction result is obtained by using the correct learning model $\mu_0(\cdot)$ with the more accurate parameter estimation method Opt-MSE. The message is the same as what we have learned from Table 2, and it exactly mirrors what is found in the case study of linear models in Xie and Zheng (2020).

Fig. 2 Plots of predictive curves for (a) $x_{\mathrm{new}} \overset{\mathrm{iid}}{\sim} x_i$ and (b) $x_{\mathrm{new}} \nsim x_i$. In each plot, the red solid curve is the target (oracle) predictive curve $\mathrm{PV}_n(y) = 2\min\{\Phi(y - \mu_{\mathrm{new}}), 1 - \Phi(y - \mu_{\mathrm{new}})\}$, obtained assuming that the distribution $y_{\mathrm{new}} \sim N(\mu_{\mathrm{new}}, 1)$ is completely known. The two predictive curves obtained using $\mu_0(\cdot)$ are in black (solid line for Opt-MSE; dashed line for neuralnet). The other predictive curves (all in dashed or broken lines and in various colors) are obtained using the other four wrong working models.


4 Conclusion

Professor Efron pointed out that “the 21st Century has seen the rise of a new breed of what can be called ‘pure prediction algorithms’.” We are fully in agreement with Professor Efron’s discussion that the prediction algorithms “can be stunningly successful,” and that “the emperor has nice clothes but they’re not suitable for every occasion.” Along the same line and under the setting of conformal prediction, we have demonstrated and explained how and why a prediction method can be successful under the IID assumption, even if the learning model is completely wrong. More importantly, we have also demonstrated that it is still meaningful, and often crucial, to build our prediction algorithms based on a good practice of modeling, estimation and inference. We fully anticipate and believe that “the most powerful ideas of Twentieth Century statistics”—modeling, estimation, and inference, will play a pivotal role in building the mathematical foundation of modern data science and in fully realizing its potential for real-world applications.

Funding

The research is supported in part by research grants from NSF.

References

  • Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani, R. J. (2019a), “The Limits of Distribution-Free Conditional Predictive Inference,” arXiv no. 1903.04684.
  • ——— (2019b), “Predictive Inference With the Jackknife+,” arXiv no. 1905.02928.
  • Efron, B. (1998), “R. A. Fisher in the 21st Century (Invited paper Presented at the 1996 R. A. Fisher Lecture),” Statistical Science, 13, 95–122. DOI: 10.1214/ss/1028905930.
  • Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018), “Distribution-Free Predictive Inference for Regression,” Journal of the American Statistical Association, 113, 1094–1111. DOI: 10.1080/01621459.2017.1307116.
  • Shafer, G., and Vovk, V. (2008), “A Tutorial on Conformal Prediction,” Journal of Machine Learning Research, 9, 371–421.
  • Shen, J., Liu, R., and Xie, M. (2018), “Prediction With Confidence—A General Framework for Predictive Inference,” Journal of Statistical Planning and Inference, 195, 126–140. DOI: 10.1016/j.jspi.2017.09.012.
  • Singh, K., Xie, M., and Strawderman, W. E. (2005), “Combining Information From Independent Sources Through Confidence Distributions,” The Annals of Statistics, 33, 159–183. DOI: 10.1214/009053604000001084.
  • Vovk, V., Gammerman, A., and Shafer, G. (2005), Algorithmic Learning in a Random World, New York: Springer.
  • Vovk, V., Shen, J., Manokhin, V., and Xie, M. (2019), “Nonparametric Predictive Distributions by Conformal Prediction,” Machine Learning, 108, 445–474. DOI: 10.1007/s10994-018-5755-8.
  • Xie, M., and Singh, K. (2013), “Confidence Distribution, the Frequentist Distribution Estimator of a Parameter” (with discussion), International Statistical Review, 81, 3–39. DOI: 10.1111/insr.12000.
  • Xie, M., and Zheng, Z. (2020), “Homeostasis Phenomenon in Predictive Inference When Using a Wrong Learning Model: A Tale of Random Split of Data Into Training and Test Sets,” arXiv no. 2003.08989.
