Theory and Methods

Empirical Bayes Mean Estimation With Nonparametric Errors Via Order Statistic Regression on Replicated Data

Pages 987-999 | Received 21 Jan 2020, Accepted 06 Aug 2021, Published online: 24 Sep 2021

Abstract

We study empirical Bayes estimation of the effect sizes of N units from K noisy observations on each unit. We show that it is possible to achieve near-Bayes optimal mean squared error, without any assumptions or knowledge about the effect size distribution or the noise. The noise distribution can be heteroscedastic and vary arbitrarily from unit to unit. Our proposal, which we call Aurora, leverages the replication inherent in the K observations per unit and recasts the effect size estimation problem as a general regression problem. Aurora with linear regression provably matches the performance of a wide array of estimators including the sample mean, the trimmed mean, the sample median, as well as James-Stein shrunk versions thereof. Aurora automates effect size estimation for Internet-scale datasets, as we demonstrate on data from a large technology firm.

1 Introduction

Empirical Bayes (EB) (Efron Citation2012; Robbins Citation1956, Citation1964) and related shrinkage methods are the de facto standard for estimating effect sizes in many disciplines. In genomics, EB is used to detect differentially expressed genes when the number of samples is small (Smyth Citation2004; Love, Huber, and Anders Citation2014). In survey sampling, EB improves noisy estimates of quantities, like the average income, for small communities (Rao and Molina Citation2015). The key insight of EB is that one can often estimate unit-level quantities better by sharing information across units, rather than analyzing each unit separately.

Formally, EB models the observed data $Z = (Z_1, \ldots, Z_N)$ as arising from the following generative process:
(1) $\quad \mu_i \sim G, \qquad Z_i \mid \mu_i \sim F(\cdot \mid \mu_i), \qquad i = 1, \ldots, N.$

The goal here is to estimate the mean parameters, $\mu_i := E_F[Z_i \mid \mu_i]$ for $i = 1, \ldots, N$, from the observed data $Z$. If G and F are fully specified, then the optimal estimator (in the sense of mean squared error) is the posterior mean $E_{G,F}[\mu_i \mid Z_i]$, which achieves the Bayes risk. EB deals with the case where F or G is unknown, so the Bayes rule cannot be calculated. Most modern EB methods (Jiang and Zhang 2009; Brown and Greenshtein 2009; Muralidharan 2012; Saha and Guntuboyina 2020) assume that F is known (say, $F(\cdot \mid \mu_i) = \mathcal{N}(\mu_i, 1)$) and construct estimators $\hat{\mu}_i$ that asymptotically match the risk of the unknown Bayes rule, without making any assumptions about the unknown prior G.

We examine the same problem of estimating the μi s when the likelihood F is also unknown. Indeed, knowledge of F is an assumption that requires substantial domain expertise. For example, it took many years for the genomics community to agree on an EB model for detecting differences in gene expression based on microarray data (Baldi and Long Citation2001; Lönnstedt and Speed Citation2002; Smyth Citation2004). Then, once this technology was superseded by bulk RNA-Seq, the community had to devise a new model from scratch, eventually settling on the negative binomial likelihood (Love, Huber, and Anders Citation2014; Gierliński et al. Citation2015).

Unfortunately, there is no way to avoid making such strong assumptions when there is no information besides the single $Z_i$ per $\mu_i$. If F is even slightly underspecified, then it becomes hopeless to disentangle F from G. To appreciate the problem, consider the Normal–Normal model:
(2) $\quad G = \mathcal{N}(0, A), \qquad F(\cdot \mid \mu_i) = \mathcal{N}(\mu_i, \sigma^2).$

Here, $Z_i$ is marginally distributed as $\mathcal{N}(0, A + \sigma^2)$, and the observations $Z_i$ only provide information about $A + \sigma^2$. Now, when $\sigma^2$ is known, A can be estimated by first estimating the marginal variance and subtracting $\sigma^2$. Indeed, Efron and Morris (1973) showed that by plugging a particular estimate of A into the Bayes rule $E_{G,F}[\mu_i \mid Z_i] = \left(1 - \frac{\sigma^2}{\sigma^2 + A}\right) Z_i$, one recovers the celebrated James-Stein estimator (James and Stein 1961). Yet, as soon as $\sigma^2$ is unknown, A (and hence, G) is unidentified, and there is no hope of approximating the unknown Bayes rule.

However, as any student of random effects knows, the Normal–Normal model (2) becomes identifiable if we simply have independent replicates Zij for each unit i. The driving force behind this work is an analogous observation in the context of empirical Bayes estimation: replication makes it possible to estimate μi with essentially no assumptions on F or G. The method we propose, described in the next section, performs well in practice and nearly matches the risk of the Bayes rule, which depends on the unknown F and G.

2 The Aurora Method

First, we formally specify the EB model when replicates $Z_{ij} \in \mathbb{R}$ are available:
(3) $\quad (\mu_i, \alpha_i) \overset{\text{iid}}{\sim} G, \quad i = 1, \ldots, N; \qquad Z_{ij} \mid (\mu_i, \alpha_i) \overset{\text{iid}}{\sim} F(\cdot \mid \mu_i, \alpha_i), \quad j = 1, \ldots, K.$

Again, the quantity of interest is the mean parameter $\mu_i := E_F[Z_{ij} \mid \mu_i, \alpha_i]$. The additional parameter $\alpha_i$ is a nuisance parameter that allows for heterogeneity across the units, while preserving exchangeability (Galvao and Kato 2014; Okui and Yanagi 2020). For example, $\alpha_i$ is commonly taken to be the conditional variance $\sigma_i^2 := \mathrm{var}_F[Z_{ij} \mid \mu_i, \alpha_i]$ to allow for heteroscedasticity. However, $\alpha_i$ could even be infinite-dimensional, for instance a random element from a space of distributions. The $\alpha_i$ have no impact on our estimation strategy and are purely a technical device.

Given data from model (3), one approach would be to collapse the replicates into a single observation per unit (say, by taking their mean), which would bring us back to the setting of model (1). We could then appeal to the central limit theorem to justify treating the likelihood as Normal. An important message of this paper is that we can do better by using the replicates.

2.1 Proposed Method

Since we have replicates, we can split the $Z_{ij}$ into two groups for each i. First, consider the case of K = 2 replicates, so that we can write $(X_i, Y_i)$ for $(Z_{i1}, Z_{i2})$. Now, $X_i$ and $Y_i$ are conditionally independent given $(\mu_i, \alpha_i)$. Figure 1 illustrates the relationship between $X_i$ and $Y_i$ under two different settings. The key insight is that the conditional mean $E_{G,F}[Y_i \mid X_i]$ is (almost surely) identical to the posterior mean $E_{G,F}[\mu_i \mid X_i]$, by the following calculation (Krutchkoff 1967). (For convenience, we suppress the dependence of the expected values on G, F.)
(4) $\quad E[Y_i \mid X_i] = E[E[Y_i \mid \mu_i, \alpha_i, X_i] \mid X_i] = E[E[Y_i \mid \mu_i, \alpha_i] \mid X_i] = E[\mu_i \mid X_i].$

Fig. 1 EB with replicates: Two simulations with N = 1000 and K = 2. First, we draw μi∼G, where G=N(0.5,4) in the left panel and G is the uniform distribution on the discrete set {−3, 0, 3} in the right panel. Then, for each i, we draw Xi,Yi | μi ∼iid N(μi,1) and plot the points (Xi , Y i ). The line shows the posterior mean E[μi | Xi], which in light of (4), is identical to the conditional mean E[Yi | Xi].

This suggests that we can estimate the Bayes rule based on Xi (i.e., the posterior mean E[μi | Xi]) by simply regressing Yi on Xi using any black-box predictive model, such as a local averaging smoother. Let m̂(·) be the fitted regression function; our estimate of each μi is then just μ̂i=m̂(Xi).
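To make the K = 2 recipe concrete, here is a minimal sketch in Julia (our own illustration, not code from the paper's repository) that simulates the setting of the left panel of Figure 1 and regresses $Y_i$ on $X_i$ with a simple k-nearest-neighbor smoother. In this Normal–Normal example the posterior mean is available in closed form, so we can check that the fitted regression tracks it; the specific values of N, A, and k are illustrative choices.

```julia
using Statistics, Random

Random.seed!(1)
N = 1000
A, sigma2 = 4.0, 1.0                          # prior variance and noise variance (as in Fig. 1, left)
mu = 0.5 .+ sqrt(A) .* randn(N)               # mu_i ~ N(0.5, 4)
X  = mu .+ randn(N)                           # X_i | mu_i ~ N(mu_i, 1)
Y  = mu .+ randn(N)                           # Y_i | mu_i ~ N(mu_i, 1), independent of X_i given mu_i

# local-averaging smoother: mean of Y over the k nearest X-neighbors of each point
knn_smooth(X, Y, k) = [mean(Y[partialsortperm(abs.(X .- x), 1:k)]) for x in X]
muhat = knn_smooth(X, Y, 50)

# closed-form posterior mean E[mu_i | X_i] for the Normal-Normal model, for comparison
post = 0.5 .+ (A / (A + sigma2)) .* (X .- 0.5)
mean((muhat .- post) .^ 2)                    # typically much smaller than the Bayes risk A/(A+1)
```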

To extend this method to K > 2, we can again split the replicates $Z_i$ into two parts:
(5) $\quad X_i := X_i^{(j)} := (Z_{i1}, \ldots, Z_{i(j-1)}, Z_{i(j+1)}, \ldots, Z_{iK}), \qquad Y_i := Y_i^{(j)} := Z_{ij},$
where $j \in \{1, \ldots, K\}$ is an arbitrary fixed index for now. (We suppress j in the notation below.) Now, one approach is to summarize the vector $X_i$ by the mean of its values $\bar{X}_i$ and regress $Y_i$ on $\bar{X}_i$ to learn $E[\mu_i \mid \bar{X}_i]$ (Coey and Cunningham 2019). This works for essentially the same reason as the K = 2 case. However, if $\bar{X}_i$ is not sufficient for $(\mu_i, \alpha_i)$ in model (3) with K − 1 replicates, then $E[\mu_i \mid \bar{X}_i]$ may be different from, and suboptimal to, $E[\mu_i \mid X_i]$.

Instead, we propose learning E[μi | Xi] directly. The rationale is contained in the following result.

Proposition 1.

Let $Z_{ij}$ be generated according to Equation (3) and assume that $E[|\mu_i|] < \infty$, $E[|Z_{ij}|] < \infty$. Define $X_i$ and $Y_i$ as in Equation (5). Let $X_{i(\cdot)}$ be the vector of order statistics of $X_i$: $X_{i(\cdot)} := (X_{i(1)}, \ldots, X_{i(K-1)})$, where $X_{i(1)} \le \cdots \le X_{i(K-1)}$.

That is, $X_{i(\cdot)}$ is simply a sorted version of $X_i$. Then, almost surely,
(6) $\quad E[Y_i \mid X_{i(\cdot)}] = E[\mu_i \mid X_i].$

Proof.

The same argument as in Equation (4) shows that $E[Y_i \mid X_{i(\cdot)}] = E[\mu_i \mid X_{i(\cdot)}]$ almost surely. Now, under exchangeable sampling, the order statistics $X_{i(\cdot)}$ are sufficient for $(\mu_i, \alpha_i)$, and therefore it follows that $E[\mu_i \mid X_{i(\cdot)}] = E[\mu_i \mid X_i]$ almost surely, with no assumptions on F. □

Equation (6) suggests that we should regress $Y_i$ on the order statistics $X_{i(\cdot)}$ to learn a function $\hat{m}(\cdot): \mathbb{R}^{K-1} \to \mathbb{R}$ that approximates the posterior mean. The fitted values $\hat{m}(X_{i(\cdot)})$ can be used to estimate $\mu_i$. There is one more detail worth mentioning. The estimate $\hat{m}(X_{i(\cdot)})$ depends on the arbitrary choice of j in Equation (5). To reduce the variance of the estimate, we average over all choices of $j \in \{1, \ldots, K\}$. This method, summarized in Table 1, is called Aurora, which stands for “Averages of Units by Regressing on Ordered Replicates Adaptively.”

Table 1 A summary of Aurora, which is the proposed method for estimating the means μi when the data come from model (3).
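To make the algorithm concrete, the following is a minimal sketch of Aurora in Julia (the language of the accompanying package), written as our own illustration rather than the package's implementation. The argument fit_predict stands for an arbitrary black-box regression routine that fits on (Xord, Y) and returns in-sample predictions.

```julia
using Statistics

# Minimal Aurora sketch: Z is an N x K matrix of replicates; fit_predict is any
# regression routine (Xord, Y) -> vector of N in-sample predictions.
function aurora(Z::AbstractMatrix, fit_predict)
    N, K = size(Z)
    muhat = zeros(N)
    for j in 1:K
        Y    = Z[:, j]                            # held-out replicate, used as the response
        Xord = sort(Z[:, (1:K) .!= j], dims = 2)  # order statistics of the remaining K - 1 replicates
        muhat .+= fit_predict(Xord, Y)            # estimate of E[mu_i | X_i(.)] for this split
    end
    return muhat ./ K                             # average over the K choices of held-out replicate
end
```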

3 Related Work

The Aurora method is closely related to the extensive literature on empirical Bayes, which we cite throughout this paper. One work that is worth emphasizing is Stigler (Citation1990), who motivated empirical Bayes estimators such as James-Stein, through the lens of regression to the mean; for us, regression is not just a motivation but the estimation strategy itself. We were surprised to find a similar idea to Aurora in a forgotten manuscript, uncited to date, that Johns (Citation1986) contributed to a symposium for Herbert Robbins (Van Ryzin Citation1986). Although Johns (Citation1986) used a fairly complex predictive model (projection pursuit regression, Friedman and Stuetzle Citation1981), we show theoretically (Sections 4 and 5) and empirically (Sections 6 and 7) that the order statistics encode enough structure that linear regression and k-nearest neighbor regression can be used as the predictive model.

Models similar to EquationEquation (3) with replicated noisy measurements of unobservable random quantities have been studied in the context of deconvolution and error-in-variables regression (Devanarayan and Stefanski Citation2002; Schennach Citation2004). In econometrics, panel data with random effects are often modelled as in EquationEquation (3) with the additional potential complication of time dependence, that is, j=1,,K indexes time, while i=1,,N may correspond to different geographic regions (Horowitz and Markatou Citation1996; Hall and Yao Citation2003; Neumann Citation2007; Jochmans and Weidner 2018). Fithian and Ting (Citation2017) used observations from model (3) to learn a low-dimensional smooth parametric model that describes the data well. However, their goal was testing, whereas our goal is mean estimation.

The Aurora method is also related to a recent line of research that leverages black-box prediction methods to solve statistical tasks that are not predictive in nature. For example, Chernozhukov et al. (Citation2018) considered inference for low dimensional causal quantities when high dimensional nuisance components are estimated by machine learning. Boca and Leek (Citation2018) reinterpreted the multiple testing problem in the presence of informative covariates as a regression problem and estimated the proportion of null hypotheses conditionally on the covariates. The estimated conditional proportion of null hypotheses may then be used for downstream multiple testing methods (Ignatiadis and Huber Citation2021). Black-box regression models can also be used to improve empirical Bayes point estimates in the presence of side-information (Ignatiadis and Wager Citation2019).

Finally, a crucial ingredient in Aurora is data splitting, which is a classical idea in statistics (Cox Citation1975), typically used to ensure honest inference for low-dimensional parameters. In the context of simultaneous inference, Rubin, Dudoit, and Van der Laan (Citation2006) and Habiger and Peña (Citation2014) used data-splitting to improve power in multiple testing.

4 Properties of the Aurora Estimator

We provide theoretical guarantees for the general Aurora estimator described in Section 2. Throughout this section, we split $Z_i$ into $X_i^{(j)}$ and $Y_i^{(j)}$ as in Equation (5), and we write $X_{i(\cdot)}^{(j)}$ for the order statistics of $X_i^{(j)}$. We also write $Z$ and $X_{(\cdot)}^{(j)}$ for the concatenation of all the $Z_i$ and $X_{i(\cdot)}^{(j)}$, respectively. As before, we omit j whenever a definition does not depend on the specific choice of j.

4.1 Three Oracle Benchmarks

In this section, we define three benchmarks for assessing the quality of a mean estimator in model (3) (in terms of mean squared error). These benchmarks provide the required context to interpret the theoretical guarantees of Aurora that we establish in Section 4.2.

First, it is impossible to improve on the Bayes rule, so the Bayes risk serves as an oracle. We denote the Bayes risk (based on all K replicates) by
(7) $\quad R_K^*(G,F) := E_{G,F}[(\mu_i - E_{G,F}[\mu_i \mid Z_i])^2] = E_{G,F}[(\mu_i - E_{G,F}[\mu_i \mid X_{i(\cdot)}, Y_i])^2].$

As explained in Proposition 1, the function we seek to mimic (for each j in the loop of the Aurora algorithm in Table 1) is the oracle Bayes rule based on the order statistics of K − 1 replicates,
(8) $\quad m^*(X_{i(\cdot)}) := E_{G,F}[\mu_i \mid X_{i(\cdot)}].$

The risk of $m^*(\cdot)$ is the Bayes risk based on K − 1 replicates,
(9) $\quad R_{K-1}^*(G,F) := E_{G,F}[(\mu_i - E_{G,F}[\mu_i \mid X_{i(\cdot)}])^2] = E_{G,F}[(\mu_i - m^*(X_{i(\cdot)}))^2].$

In the Aurora algorithm, we average over the choice of held-out replicate j. Thus, we also define an oracle rule that averages $m^*(X_{i(\cdot)}^{(j)})$ over all j. We use the following notation for this oracle and its risk:
(10) $\quad \bar{m}^*(Z_i) := \frac{1}{K}\sum_{j=1}^K m^*(X_{i(\cdot)}^{(j)}), \qquad \bar{R}_{K-1}^*(G,F) := E_{G,F}[(\mu_i - \bar{m}^*(Z_i))^2].$

It is immediate that $R_K^*(G,F) \le \bar{R}_{K-1}^*(G,F)$. Also, by the definition of $\bar{m}^*$ in Equation (10) and by Jensen's inequality, it holds that
$E_{G,F}[(\mu_i - \bar{m}^*(Z_i))^2] = E_{G,F}\Big[\Big\{\frac{1}{K}\sum_{j=1}^K \big(\mu_i - m^*(X_{i(\cdot)}^{(j)})\big)\Big\}^2\Big] \le \frac{1}{K}\sum_{j=1}^K E_{G,F}\big[\big(\mu_i - m^*(X_{i(\cdot)}^{(j)})\big)^2\big],$
so $\bar{R}_{K-1}^*(G,F) \le R_{K-1}^*(G,F)$. This bound can be improved through the following insight: $m^*(X_{i(\cdot)}^{(j)})$, $j = 1, \ldots, K$, are jackknife estimates of the posterior mean $E[\mu_i \mid Z_i]$, and $\bar{m}^*(Z_i)$ is their average. By a fundamental result for the jackknife (Efron and Stein 1981, theor. 2), it holds that¹ $\mathrm{var}[\bar{m}^*(Z_i) \mid \mu_i, \alpha_i] \le \mathrm{var}[m^*(X_{i(\cdot)}) \mid \mu_i, \alpha_i] \cdot (K-1)/K$. Armed with this insight, in Supplement A.1 we prove the following proposition.

Proposition 2.

Under model (3) with $E[\mu_i^2] < \infty$, $E[Z_{ij}^2] < \infty$, it holds that
(11) $\quad R_K^*(G,F) \le \bar{R}_{K-1}^*(G,F) \le R_{K-1}^*(G,F) - E[\mathrm{var}[m^*(X_{i(\cdot)}) \mid \mu_i, \alpha_i]]/K.$

Remark 1.

For sufficiently “regular” problems, $R_K^*(G,F)$ will typically be of order $O(1/K)$, while $R_{K-1}^*(G,F) - R_K^*(G,F)$ will be of order $O(1/K^2)$; we make this argument rigorous for location families in Section 5.2 and Supplement E. The correction term on the right-hand side of Equation (11) will also be of order $O(1/K^2)$ in such problems. In our next example, we demonstrate the importance of this additional term.

Example 1

(Normal likelihood with Normal prior). We consider the Normal–Normal model (2) from the introduction with prior variance A and noise variance $\sigma^2 = 1$, and with replicates $Z_{ij}$, $j = 1, \ldots, K$, as in Equation (3). In this case,² we can analytically compute that $R_K^*(G,F) = A/(KA+1)$, and so it follows that $R_{K-1}^*(G,F) - R_K^*(G,F) = \Theta(1/K^2)$. The right-most inequality in (11) is an equality, and $\bar{R}_{K-1}^*(G,F) - R_K^*(G,F) = \Theta(1/K^4)$. We conclude that, in the Normal–Normal model, the risk of the averaged oracle $\bar{m}^*$ is closer to that of the full Bayes estimator than to that of the Bayes estimator based on K − 1 replicates.

4.2 Regret Bound for Aurora

In this section, we derive our main regret bound for Aurora. The basic idea is that the in-sample error of the regression method $\hat{m}$ directly translates into bounds on the estimation error for the effect sizes $\mu_i$, and thus Aurora can leverage black-box predictive models. To this end, we define the in-sample prediction error for estimating the averaged oracle $\bar{m}^*$,
(12) $\quad \overline{\mathrm{Err}}(m^*, \hat{m}) := \frac{1}{N}\sum_{i=1}^N E_{G,F}\Big[\Big\{\bar{m}^*(Z_i) - \frac{1}{K}\sum_{j=1}^K \hat{m}_j(X_{i(\cdot)}^{(j)})\Big\}^2\Big].$

When the predictive mechanism of the Aurora algorithm (Table 1) is the same for all j,³ Jensen's inequality provides the following convenient upper bound on Equation (12):
(13) $\quad \overline{\mathrm{Err}}(m^*, \hat{m}) \le \mathrm{Err}(m^*, \hat{m}) := \frac{1}{N}\sum_{i=1}^N E_{G,F}\big[\big\{m^*(X_{i(\cdot)}) - \hat{m}(X_{i(\cdot)})\big\}^2\big].$

Err(m*,m̂) is the in-sample error from approximating m* by m̂. We are ready to state our main result:

Theorem 3.

The mean squared error of the Aurora estimator $\hat{\mu}_i^{\mathrm{Aur}}$ (described in Table 1) satisfies the following regret bound under model (3) with $E[\mu_i^2] < \infty$, $E[Z_{ij}^2] < \infty$:
$\frac{1}{N}\sum_{i=1}^N E_{G,F}[(\mu_i - \hat{\mu}_i^{\mathrm{Aur}})^2] \;\le\; R_K^*(G,F) \;(\text{irreducible Bayes error}) \;+\; 2\big(\bar{R}_{K-1}^*(G,F) - R_K^*(G,F)\big) \;(\text{error due to data splitting}) \;+\; 2\,\overline{\mathrm{Err}}(m^*, \hat{m}) \;(\text{estimation error}).$

The estimator $\hat{\mu}_{i,j}^{\mathrm{Aur}}$ (i.e., the Aurora estimator based on a single held-out response replicate) satisfies the above regret bound with $\bar{R}_{K-1}^*(G,F)$ and $\overline{\mathrm{Err}}(m^*, \hat{m})$ replaced by $R_{K-1}^*(G,F)$ and $\mathrm{Err}(m^*, \hat{m})$, respectively.

As elaborated in Remark 1, the second error term in the above decomposition will typically be negligible compared to RK*(G,F); this is the price we pay for making no assumptions about F and G. Hence, beyond the irreducible Bayes error, the main source of error depends on how well we can estimate m*(·). Crucially, this error is the in-sample estimation error of m̂, which is often easier to analyze and smaller in magnitude than out-of-sample estimation error (Hastie, Tibshirani, and Friedman Citation2009; Chatterjee Citation2013; Rosset and Tibshirani Citation2020).

4.3 Aurora With k-Nearest-Neighbors is Universally Consistent

Theorem 3 demonstrated that a regression model with small in-sample error translates, through Aurora, into mean estimates with small mean squared error. Now, we combine this result with results from nonparametric regression to prove that, under model (3), it is possible to asymptotically (in N) match the oracle risk (10) and, in view of Proposition 2, to outperform the Bayes risk based on K − 1 replicates.

Theorem 4

(Universal consistency with the k-Nearest-Neighbor (kNN) estimator). Consider model (3) with $E[\mu_i^2] < \infty$, $E[Z_{ij}^2] < \infty$. We estimate $\mu_i$ with the Aurora algorithm where $\hat{m}(\cdot)$ is the k-Nearest-Neighbor (kNN) estimator with $k = k_N \in \mathbb{N}$, that is, the nonparametric regression estimator which predicts⁴
$\hat{m}(x) = \frac{1}{k}\sum_{i \in S_k(x)} Y_i, \quad \text{where } S_k(x) = \Big\{i \in \{1, \ldots, N\} : \sum_{j \ne i} \mathbf{1}\big(\|X_{i(\cdot)} - x\|_2 > \|X_{j(\cdot)} - x\|_2\big) < k\Big\},$

and $\|\cdot\|_2$ is the Euclidean norm. If $k = k_N$ satisfies $k \to \infty$, $k/N \to 0$ as $N \to \infty$, then
$\limsup_{N \to \infty} \frac{1}{N}\sum_{i=1}^N E[(\mu_i - \hat{\mu}_i^{\mathrm{Aur}})^2] = \bar{R}_{K-1}^*(G,F) \le R_{K-1}^*(G,F).$

This result is a consequence of universal consistency in nonparametric regression (Stone Citation1977; Györfi et al. Citation2002). It demonstrates that Aurora can asymptotically nearly match the Bayes risk with substantial generality,5 and suggests the power and expressivity of the Aurora algorithm.

To apply Aurora with kNN, a data-driven choice of k is required. In Supplement B.2 we describe a procedure that chooses the number of nearest neighbors $k_j$ per held-out response replicate j through leave-one-out (LOO) cross-validation on the units, and we study its performance empirically in the simulations of Section 6. Nevertheless, Aurora-kNN tuned by LOO is computationally involved. This motivates the next section, where we study a procedure that is interpretable, easy to implement, and that scales well to large datasets.
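As an illustration only (the LOO tuning of Supplement B.2 is omitted and k is held fixed), a nearest-neighbor regression routine that can be plugged into the aurora sketch of Section 2 could look as follows; the choice k = 100 is arbitrary.

```julia
using Statistics

# Fixed-k nearest-neighbor regression on the order statistics; a sketch only,
# without the leave-one-out choice of k described in Supplement B.2.
function knn_fit_predict(Xord, Y; k = 100)
    N = size(Xord, 1)
    preds = zeros(N)
    for i in 1:N
        d = vec(sum(abs2, Xord .- Xord[i:i, :]; dims = 2))   # squared Euclidean distances to unit i
        preds[i] = mean(Y[partialsortperm(d, 1:k)])          # average response over the k nearest units
    end
    return preds
end

# muhat_knn = aurora(Z, (X, Y) -> knn_fit_predict(X, Y; k = 100))
```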

5 Aurora With Linear Regression

In this section, we analyze the Aurora algorithm when linear regression is used as the predictive model. That is, $\hat{m}$ is a linear function of the order statistics,
(14) $\quad \hat{m}(X_{i(\cdot)}) = \hat{\beta}_0 + \sum_{j=1}^{K-1} \hat{\beta}_j X_{i(j)},$
where $\hat{\beta}$ are the ordinary least squares coefficients of the linear regression of $Y$ on $X_{(\cdot)}$. We call the method Auroral, with the final “l” signifying “linear,” and write $\hat{\mu}_i^{\mathrm{AurL}}$ for the resulting estimates of $\mu_i$. Our main result is that Auroral matches the performance of the best estimator that is linear in the order statistics based on K − 1 replicates. To state this result formally, we first define the minimum risk among estimators in a class $\mathcal{C}$:
(15) $\quad R_{K-1}^{\mathcal{C}}(G,F) := \inf_{m \in \mathcal{C}}\Big\{\frac{1}{N}\sum_{i=1}^N E_{G,F}[(\mu_i - m(X_{i(\cdot)}))^2]\Big\}.$

The class of specific interest to us is the class of estimators linear in the order statistics,
(16) $\quad \mathrm{Lin} := \mathrm{Lin}(\mathbb{R}^{K-1}) := \Big\{m : m(x) = \beta_0 + \sum_{j=1}^{K-1} \beta_j x_{(j)}\Big\}.$

This is a broad class that includes all estimators of $\mu_i$ based on $X_{i(\cdot)}$ that proceed through the following two steps: (i) summarization, where an appropriate summary statistic of the likelihood F, linear in the order statistics, is used; and (ii) linear shrinkage, where the summary statistic is linearly shrunk toward a fixed location. Schematically:
Step 1 (Summarization): $X_i \mapsto T(X_i) \in \mathbb{R}$, where $T(\cdot)$ is, for example, the sample mean, the sample median, or a trimmed mean.
Step 2 (Linear shrinkage): $T(X_i) \mapsto \alpha\, T(X_i) + \gamma$ (e.g., James-Stein shrinkage).

The summarization step can apply nonlinear functions of the original data, such as the median and the trimmed mean, because the input to the linear model is the order statistics rather than the original data. Such linear combinations of the order statistics are known as L-statistics, a class so large that it includes efficient estimators of the mean in any smooth, symmetric location family; see Van der Vaart (2000) and Section 5.2.
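Concretely, Auroral amounts to plugging an ordinary least squares fit of Equation (14) into the generic aurora sketch of Section 2. The following lines are our own illustration of that single step, not the package's implementation.

```julia
using LinearAlgebra

# OLS of the held-out replicate on the order statistics (Equation (14)), returning
# the in-sample fitted values; plugging this into `aurora` yields the Auroral estimates.
function ols_fit_predict(Xord, Y)
    Xd   = hcat(ones(size(Xord, 1)), Xord)   # design matrix with intercept
    beta = Xd \ Y                            # least squares coefficients (beta_0, ..., beta_{K-1})
    return Xd * beta
end

# muhat_auroral = aurora(Z, ols_fit_predict)
```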

We now show that Auroral matches RK1Lin(G,F) asymptotically in N in the setting of Theorem 3.

Theorem 5

(Regret over linear estimators).

1. Assume there exists $C_N > 0$ such that $E[\max_{i=1,\ldots,N} \mathrm{var}[Y_i \mid X_{i(\cdot)}]] \le C_N$;⁶ then:
$\frac{1}{N}\sum_{i=1}^N E[(\mu_i - \hat{\mu}_i^{\mathrm{AurL}})^2] \le R_{K-1}^{\mathrm{Lin}}(G,F) + \frac{C_N K}{N}.$

2. Assume there exists $\Gamma > 0$ such that $E[Y_i^4] \le \Gamma^2$; then:
$\frac{1}{N}\sum_{i=1}^N E[(\mu_i - \hat{\mu}_i^{\mathrm{AurL}})^2] \le R_{K-1}^{\mathrm{Lin}}(G,F) + \Gamma\sqrt{\frac{K}{N}}.$

If $m^* \in \mathrm{Lin}(\mathbb{R}^{K-1})$, then the conclusion of Theorem 3 holds for $\hat{\mu}^{\mathrm{AurL}}$ with the term $\overline{\mathrm{Err}}(m^*, \hat{m})$ bounded by $C_N K/N$ (under Assumption (1)), resp. $\Gamma\sqrt{K/N}$ (under (2)).

In applications, we often encounter $K \ll N$. Then, Theorem 5 implies that Auroral will almost match the risk $R_{K-1}^{\mathrm{Lin}}(G,F)$ of the best L-statistic based on K − 1 replicates. This result is in the spirit of restricted empirical Bayes (Griffin and Krutchkoff 1971; Maritz 1974; Norberg 1980; Robbins 1983), which seeks to find the best estimator within a given class, such as the Bayes linear estimators of Hartigan (1969). In these works, linearity typically refers to linearity in $X_i$ and not $X_{i(\cdot)}$; Lwin (1976), however, used empirical Bayes to learn the best L-statistic from the class Lin when the likelihood takes the form of a known location-scale family.

5.1 Examples of Auroral Estimation

In this section, we give three examples in which Auroral satisfies strong risk guarantees.

Example 2

(Point mass prior). Suppose the prior on μ is a point mass at $\bar{\mu}$; that is, $P_G[\mu_i = \bar{\mu}] = 1$. Then, the Bayes rule based on the order statistics, $m^*(X_{i(\cdot)}) \equiv \bar{\mu}$, has risk 0 and is trivially a member of $\mathrm{Lin}(\mathbb{R}^{K-1})$, so $R_{K-1}^{\mathrm{Lin}}(G,F) = 0$. Therefore, by Theorem 5, provided that $\sigma_i^2 \le C$ almost surely (where $\sigma_i^2 = \mathrm{var}[Z_{ij} \mid \mu_i, \alpha_i]$), the risk of the Auroral estimator satisfies

$\frac{1}{N}\sum_{i=1}^N E[(\mu_i - \hat{\mu}_i^{\mathrm{AurL}})^2] \le \frac{CK}{N}.$

Example 3

(Normal likelihood with Normal prior). We revisit Example 1,⁷ wherein we argued that the averaged oracle $\bar{m}^*(Z_i)$ in (10) has risk equal to the Bayes risk $R_K^*(G,F)$ plus $O(1/K^4)$. Auroral attains this risk, plus the ordinary least squares estimation error, which decays as $O(K/N)$.

In addition to Auroral, we consider two estimators that take advantage of the fact that $\bar{Z}_i := \frac{1}{K}\sum_{j=1}^K Z_{ij}$ is sufficient for $\mu_i$ in the Normal model with K replicates:

  • Coey and Cunningham (2019) (CC-L): We proceed similarly to the Auroral algorithm with one modification. For the jth held-out replicate, we regress $Y_i^{(j)}$ on $\bar{X}_i^{(j)} = \frac{1}{K-1}\sum_{j' \ne j} Z_{ij'}$ using ordinary least squares to obtain $\hat{m}_j^{\mathrm{CC\text{-}L}}$ and $\hat{\mu}_{ij}^{\mathrm{CC\text{-}L}} := \hat{m}_j^{\mathrm{CC\text{-}L}}(\bar{X}_i^{(j)})$. Finally, we estimate $\mu_i$ by $\hat{\mu}_i^{\mathrm{CC\text{-}L}} := \frac{1}{K}\sum_{j=1}^K \hat{\mu}_{ij}^{\mathrm{CC\text{-}L}}$.

  • James and Stein (1961) (JS): We estimate $\mu_i$ by $\hat{\mu}_i^{\mathrm{JS}} = \Big(1 - \frac{N-2}{K\sum_{i'=1}^N \bar{Z}_{i'}^2}\Big)\bar{Z}_i$.

CC-L implicitly uses the assumption of Normal likelihood by reducing the replicates to their mean, which is the sufficient statistic. Similar to Auroral, its risk is equal to RK*(G,F) plus O(1/K4) and the least-square estimation error, which in this case is O(1/N) (only the slope and intercept need to be estimated for each held-out replicate). James-Stein makes full use of the model assumptions (Normal prior, Normal likelihood, and known variance) and is expected to perform best. JS achieves the Bayes risk based on K observations, RK*(G,F) plus an error term that decays as O(1/(NK2)). The price Auroral pays compared to CC-L and JS for using no assumptions whatsoever, when reduction to the mean was possible, is modest.
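For reference, the two baselines of this example can be written down in a few lines. The sketch below is our own illustration under the stated Normal model (with $\sigma^2 = 1$ known for James-Stein); it is not code from the paper's repository.

```julia
using Statistics, LinearAlgebra

# CC-L: for each held-out replicate, simple linear regression of Y_i on the mean of
# the other K - 1 replicates, then averaging the fitted values over the K splits.
function cc_l(Z::AbstractMatrix)
    N, K = size(Z)
    muhat = zeros(N)
    for j in 1:K
        Y    = Z[:, j]
        Xbar = vec(mean(Z[:, (1:K) .!= j], dims = 2))
        Xd   = hcat(ones(N), Xbar)
        muhat .+= Xd * (Xd \ Y)
    end
    return muhat ./ K
end

# James-Stein applied to the per-unit averages, assuming the noise variance is known to be 1.
function james_stein(Z::AbstractMatrix)
    N, K = size(Z)
    Zbar = vec(mean(Z, dims = 2))
    return (1 - (N - 2) / (K * sum(abs2, Zbar))) .* Zbar
end
```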

Example 4

(Exponential families with conjugate priors). Let ν be a σ-finite measure on the Borel sets of $\mathbb{R}$ and let $\mathcal{X}$ be the interior of the convex hull of the support of ν. Let $M(\theta) = \log\big(\int \exp(z \cdot \theta)\, d\nu(z)\big)$. Suppose $\Theta = \{\theta \in \mathbb{R} : M(\theta) < \infty\}$ is open, that both $\mathcal{X}$ and $\Theta$ are nonempty, and that $M(\cdot)$ is differentiable for $\theta \in \Theta$ with strictly increasing derivative $\theta \mapsto M'(\theta)$.

Now, suppose $\mu_i, Z_{ij}$ are generated from model (3) as follows (with G, F implicitly defined). First, $\theta_i$ is drawn from a prior with Lebesgue density proportional to $\exp(K_0(z_0 \cdot \theta - M(\theta)))$ on $\Theta$ for some $K_0 > 0$ and $z_0 \in \mathcal{X}$. Next, $\mu_i = M'(\theta_i)$, and $Z_{ij} \mid \mu_i$ is drawn from the distribution with ν-density equal to $\exp(z \cdot \theta_i - M(\theta_i))$, where $\theta_i = (M')^{-1}(\mu_i)$. Diaconis and Ylvisaker (1979, theor. 2) then prove that
$m^*(X_{i(\cdot)}) = E[\mu_i \mid X_{i(\cdot)}] = \frac{K_0 z_0 + (K-1)\bar{X}_i}{K_0 + K - 1}.$

Thus, m*Lin(RK1) and Auroral matches the risk of m* up to the error term in Theorem 5.

5.2 Auroral Estimation in Location Families

In this section, we provide another example of Auroral estimation. In contrast to the rest of this article, which treats K as fixed, we consider an asymptotic regime in which both K and N tend to infinity (with K growing substantially more slowly than N). In doing so, we hope to provide the following conceptual insights: first, we provide a concrete setting in which Auroral dominates any method that first summarizes $Z_i$ as $\bar{Z}_i$. Second, we elaborate on the expressivity of the class $\mathrm{Lin}(\mathbb{R}^{K-1})$ in (16).

We consider model (3) with $\mu_i \sim G$ for a smooth prior G and $Z_{ij} \overset{\text{iid}}{\sim} F(\cdot \mid \mu_i)$, where $F(\cdot \mid \mu_i)$ has density $f(\cdot - \mu_i)$ with respect to the Lebesgue measure and $f(\cdot)$ is a density that is symmetric around 0.

We proceed with a heuristic discussion that we will make rigorous in the formal statements below. Suppose for now that f(·) is sufficiently regular with finite Fisher information,⁸
(18) $\quad I(f) = \int \frac{f'(x)^2}{f(x)}\, \mathbf{1}(f(x) > 0)\, dx < \infty.$

Then, by classical parametric theory, we expect that $K \cdot R_K^*(G,F) \to I(f)^{-1}$ as $K \to \infty$.⁹ On the other hand, another classical result in the theory of robust statistics (Bennett 1952; Jung 1956; Chernoff, Gastwirth, and Johns 1967; Van der Vaart 2000) states that for smooth location families, there exists an L-statistic (i.e., a linear combination of the order statistics) that is asymptotically efficient. By Theorem 5, we thus anticipate that the risk of Auroral is equal to $(1 + o(1))\, I(f)^{-1}/K$.

In contrast, if we first summarize $Z_i$ by $\bar{Z}_i$, then the best estimator we can possibly use is the posterior mean $E[\mu_i \mid \bar{Z}_i]$. However, for K large (so that the likelihood swamps the prior), $E[\mu_i \mid \bar{Z}_i] \approx \bar{Z}_i$, and so the risk will behave roughly as $\sigma^2/K$, where $\sigma^2 = \int x^2 f(x)\, dx$. We summarize our findings in the corollary below and provide the formal proofs in Supplement E.

Corollary 6 (Smooth location families). Suppose that G satisfies regularity Assumption 1 and f(·) satisfies regularity Assumption 2 (where both assumptions are stated in Supplement E.1). Then, in an asymptotic regime with $K, N \to \infty$ and $K^2/N \to 0$, it holds that
$\frac{1}{N}\sum_{i=1}^N E[(\mu_i - \hat{\mu}_i^{\mathrm{AurL}})^2] \Big/ \frac{I(f)^{-1}}{K} \to 1, \qquad \frac{1}{N}\sum_{i=1}^N E[(\mu_i - \hat{\mu}_i^{\mathrm{Avg}})^2] \Big/ \frac{\sigma^2}{K} \to 1 \quad \text{as } N \to \infty,$
where $\hat{\mu}_i^{\mathrm{Avg}}$ can be either the CC-L estimator $\hat{\mu}_i^{\mathrm{CC\text{-}L}}$ or the Bayes estimator $E[\mu_i \mid \bar{Z}_i]$.

Recalling that $I(f)^{-1} \le \sigma^2$, with equality when f(·) is the Gaussian density, we see that Auroral adapts¹⁰ to the unknown density f(·) and outperforms any estimator that first averages $Z_i$.

What about location families that are not regular? Below we give an example, namely the Rectangular location family with f(·) the uniform density on $[-B, B]$, B > 0, in which a conclusion similar to Corollary 6 holds. The advantage of Auroral is even more pronounced in this case ($O(K^{-2})$ risk for Auroral, versus $O(K^{-1})$ for any method that first averages $Z_i$).

Corollary 7 (Rectangular location family). Suppose f(·) is the uniform density on $[-B, B]$ for some B > 0, that G satisfies regularity Assumption 1 in Supplement E.1, and that $K, N \to \infty$, $K^3/N \to 0$. Then,
$\limsup_{N \to \infty}\Big\{K \cdot \Big(\frac{1}{N}\sum_{i=1}^N E[(\mu_i - \hat{\mu}_i^{\mathrm{AurL}})^2] \Big/ \frac{1}{N}\sum_{i=1}^N E[(\mu_i - \hat{\mu}_i^{\mathrm{Avg}})^2]\Big)\Big\} \le 6.$

6 Empirical Performance in Simulations

In this section, we study the empirical performance of Aurora and competing empirical Bayes algorithms in three scenarios: homoscedastic location families (Section 6.1), heteroscedastic location families (Section 6.2) and a heavy-tailed likelihood (Section 6.3).

6.1 Homoscedastic Location Families

We start by empirically studying the location family problem from Section 5.2 for K = 10 replicates and $N = 10^4$ units. We first generate the means $\mu_i$ from one of two possible priors G, parameterized by a simulation parameter A: the Normal prior $\mathcal{N}(0.5, A)$ and the three-point discrete prior that assigns equal probabilities to $\{-\sqrt{3A/2},\; 0,\; \sqrt{3A/2}\}$. Both prior distributions have variance A.

We then generate the replicates $Z_{ij}$ around each mean $\mu_i$ from one of three location families: Normal, Laplace, or Rectangular. The parameters of these distributions are chosen so that the noise variance is $\sigma^2 = \mathrm{var}[Z_{ij} \mid \mu_i] = 4$.
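As an illustration of this data-generating process (a sketch under the stated setting; the prior variance A = 2 is one point on the x-axis of Figure 2, and only the Normal-prior, Laplace-likelihood combination is shown, with the other settings analogous):

```julia
using Random

Random.seed!(1)
N, K, A = 10_000, 10, 2.0
mu = 0.5 .+ sqrt(A) .* randn(N)                      # mu_i ~ N(0.5, A)
b  = sqrt(2.0)                                       # Laplace scale: variance 2b^2 = 4
laplace(b) = b * (log(rand()) - log(rand()))         # difference of Exp(1) draws is Laplace(0, b)
Z  = [mu[i] + laplace(b) for i in 1:N, j in 1:K]     # N x K matrix of replicates Z_ij
```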

We compare eight estimators of μi :

  • Aurora-type methods: Auroral, Aurora-kNN (Aur-kNN) with k chosen by leave-one-out cross-validation from the set {1,,1000} (as described in Supplement B.2) and CC-L (described in Example 3).

  • Standard estimators of location: The mean Z¯i, the median and the midrange (that is, μ̂i=(maxj{Zij}+minj{Zij})/2).

  • Standard empirical Bayes estimators applied to the averages $\bar{Z}_i$: James-Stein (positive-part) shrinking toward $\sum_{i=1}^N \bar{Z}_i / N$ (which we provide with oracle knowledge of $\sigma^2 = 4$) and the nonparametric maximum likelihood estimator (NPMLE) of Koenker and Mizera (2014) (as implemented in the function “GLmix” of the REBayes package, Koenker and Gu 2017), which is a convex programming formulation of the estimation scheme of Kiefer and Wolfowitz (1956) and Jiang and Zhang (2009). For the NPMLE we estimate the variance for each unit by the sample variance $\hat{\sigma}_i^2$ over its replicates and use the working approximation $\bar{Z}_i \mid \mu_i \sim \mathcal{N}(\mu_i, \hat{\sigma}_i^2/K)$.

The results¹¹ are shown in Figure 2. The standard location estimators have constant mean squared error (MSE) in all panels, since they do not make use of the prior.

Fig. 2 Homoscedastic location families: The MSE of the 8 estimators, as a function of the prior standard deviation. Each column represents a different prior distribution: Normal (left) and a three-point distribution (right). Each row represents a different location family for the likelihood (Normal, Laplace, Rectangular).

We discuss the case of a Normal prior first: here the MSE of all methods is non-decreasing in A. Auroral closely matches the best estimator for every A and likelihood. In the case of the Normal likelihood, James-Stein (with oracle knowledge of $\sigma^2$), CC-L, and Auroral perform best,¹² followed closely by Aurora-kNN and the NPMLE. The standard location estimators are competitive when the prior is relatively uninformative (i.e., A is large). Among these, the mean performs best for the Normal likelihood, the median for the Laplace likelihood, and the midrange for the Rectangular likelihood.

The three-point prior highlights a case in which nonlinear empirical Bayes shrinkage can be helpful. Here, Aurora-kNN performs best across all settings, closely followed by the NPMLE in the case of the Normal likelihood.¹³ The NPMLE is also the second most competitive method for the Laplace likelihood with large prior variance A. The behavior of all other methods tracks closely with their behavior under a Normal prior.

One may at this point wonder: What do the Auroral weights (coefficients) $\hat{\beta}$ in Equation (14) look like? In Figure 3, we show these weights from a single replication (with $N = 2 \cdot 10^5$ and K = 10) for each of the three likelihoods considered above and for two Normal priors. First, we focus on the points in blue, which correspond to an uninformative prior ($G = \mathcal{N}(0.5, 400)$). When the likelihood F is Normal, the Auroral weights are roughly constant and equal to 1/9. In other words, $\hat{\mu}_{i,j}^{\mathrm{AurL}}$, that is, the Auroral estimate with held-out replicate j, is (approximately) the sample mean $\bar{X}_i^{(j)}$ of $X_{i(\cdot)}^{(j)}$. When the likelihood F is Laplace, the Auroral weights pick out the median $X_{i(5)}^{(j)}$ and a few order statistics around it. When the likelihood F is Rectangular, Auroral assigns approximately 1/2 weight each to the minimum and the maximum, and 0 weight to all of the other order statistics. In other words, $\hat{\mu}_{i,j}^{\mathrm{AurL}}$ is (approximately) the midrange of $X_{i(\cdot)}^{(j)}$. Notice that Auroral did not know the likelihood F in any of these examples. Rather, it adaptively learned an appropriate summary from the data.

Fig. 3 The coefficients of the intercept and the order statistics in the linear regression model for m̂ in Auroral: The colors represent different choices of prior G (defined in the legend), while the panels represent different choices of likelihood F. The coefficients shown have been averaged over the held-out replicate j of the Auroral algorithm (Table 1).

Next, we examine the difference between using informative versus uninformative priors G. When the prior is informative ($G = \mathcal{N}(0.5, 0.4)$, orange in Figure 3), Auroral automatically learns a nonzero intercept, which is determined by the prior mean, and the remaining weights are shrunk toward zero.
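For completeness, the averaged coefficients displayed in Figure 3 can be extracted along the following lines (our own sketch: it simply re-runs the OLS step of Auroral and averages the coefficient vectors over the held-out replicates).

```julia
using LinearAlgebra

# Average the Auroral OLS coefficients (intercept plus K - 1 order-statistic weights)
# over the K choices of held-out replicate, as displayed in Figure 3.
function auroral_weights(Z::AbstractMatrix)
    N, K = size(Z)
    beta_avg = zeros(K)                           # length K: intercept + (K - 1) weights
    for j in 1:K
        Y    = Z[:, j]
        Xord = sort(Z[:, (1:K) .!= j], dims = 2)
        Xd   = hcat(ones(N), Xord)
        beta_avg .+= Xd \ Y
    end
    return beta_avg ./ K
end
```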

6.2 Heteroscedastic Location Families

In our second simulation setting, we study location families where σi2 is also random and so we find ourselves in the heteroscedastic location family problem. Again we benchmark Auroral, Aurora-kNN (k chosen by leave-one-out cross-validation from the set {1,,1000}), CC-L, the Gaussian NPMLE and the sample mean. We also consider two estimators which have been proposed specifically for the heteroscedastic Normal problem, the SURE (Stein’s Unbiased Risk Estimate) method of Xie, Kou, and Brown (Citation2012) that shrinks toward the grand mean and the GL (Group-linear) estimator of Weinstein et al. (Citation2018). We apply these estimators to the averages Z¯i. Both of these estimators have been developed under the assumption that the analyst has exact knowledge of σi2; so we provide them with this oracle knowledge (SURE (or.) and GL (or.)—the other methods are not provided this information). Furthermore, we apply the Group-linear method that uses the sample variance σ̂i2 calculated based on the replicates.

We use three simulations, inspired by simulation settings (a), (c), and (f) of Weinstein et al. (2018): in all three simulations we let $N = 10{,}000$ and $K = 10$. First, we draw $\bar{\sigma}_i^2 \sim U[0.1, \bar{\sigma}_{\max}^2]$, where $\bar{\sigma}_{\max}^2$ is a simulation parameter that we vary. Then, for the first setting we draw $\mu_i \sim \mathcal{N}(0, 0.5)$, while for the last two settings we let $\mu_i = \bar{\sigma}_i^2$. Weinstein et al. (2018) used the latter as a model of strong mean–variance dependence; the methods that have access to $\bar{\sigma}_i^2$ can in principle predict perfectly in that case (i.e., the Bayes risk is equal to 0). Finally, we draw $Z_{ij} \mid \mu_i, \sigma_i^2 \sim F(\cdot \mid \mu_i, \sigma_i^2)$, where $\sigma_i^2 = \bar{\sigma}_i^2 \cdot K$ and F is either the Normal location-scale family (first two settings) or the Rectangular location-scale family (last setting).
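As a sketch of this data-generating process (shown for the second setting, the Normal likelihood with strong mean–variance dependence; the value of $\bar{\sigma}_{\max}^2$ below is an arbitrary illustrative point on the x-axis of Figure 4):

```julia
using Random

Random.seed!(1)
N, K = 10_000, 10
sbar2max = 2.0
sbar2 = 0.1 .+ (sbar2max - 0.1) .* rand(N)        # sigma-bar_i^2 ~ U[0.1, sbar2max]
mu    = copy(sbar2)                               # mu_i = sigma-bar_i^2 (strong mean-variance dependence)
sig   = sqrt.(sbar2 .* K)                         # sd of Z_ij, since sigma_i^2 = sigma-bar_i^2 * K
Z     = [mu[i] + sig[i] * randn() for i in 1:N, j in 1:K]
```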

Results from the simulations are shown in Figure 4. The oracle SURE and oracle Group-linear estimators perform best in the first panel, and oracle Group-linear strongly outperforms all other methods in the last two panels. This is not surprising, since oracle Group-linear has access to $\sigma_i^2$ and the method was developed for precisely such settings with a strong mean–variance relationship. Among the other methods, Auroral and Aurora-kNN remain competitive. In the first panel, they match CC-L, while in the last two panels they outperform Group-linear with estimated variances. We point out that Auroral outperforms CC-L in the second panel, despite the Normal likelihood. This is possible because the mean is no longer sufficient in the heteroscedastic problem.

Fig. 4 Heteroscedastic location families: Data are generated as follows: First σ¯i2∼U[0.1,σ¯max2], with σ¯max2 varying on the x-axis. Then μi∼N(0,0.5) (in the left panel) or μi=σ¯i2. Finally Zij|μi,σi2∼F(·|μi,σi2), j=1,…,K where K = 10, σi2=σi¯2K and F(·|μi,σi2) is a Normal location-scale family (first two panels) or Rectangular (last panel). The y-axis shows the mean squared error of the estimation methods.

6.3 A Pareto Example

For our third example, we consider a Pareto likelihood, which is heavy tailed and non-symmetric. Concretely, we let μiG=U[2,μmax] (with μmax a varying simulation parameter) and F(·|μi) is the Pareto distribution with tail index α = 3 and mean μi . We compare Auroral, Aurora-kNN (Aur-kNN) with k chosen by leave-one-out cross-validation from the set {1,,100} (as described in Supplement B.2), CC-L, the sample mean and median, as well as the maximum likelihood estimator for the Pareto distribution (assuming the tail index is unknown). For this example we also vary (K,N)=(20,104),(100,104),(100,105). The results are shown in . Throughout all settings, Auroral performs best, followed by Aurora-kNN. All methods improve as K increases. Auroral and Aurora-kNN also improve as N increases.

Fig. 5 Pareto distribution example: Data are generated as μi∼U[2,μmax], with μmax varying on the x-axis and Zij|μi∼F(·|μi), where F(·|μi) is the Pareto distribution with mean μi and tail index α = 3. The panels correspond to different choices for K and N. The y-axis shows the mean squared error of the estimation methods.

7 Application: Predicting Treatment Effects at Google

In this section, we apply Auroral to a problem encountered at Google and other technology companies—estimating treatment effects at a fine-grained level. All major technology firms run large randomized controlled experiments, often called A/B tests, to study interventions and to evaluate policies (Tang et al. Citation2010; Kohavi et al. Citation2013; Kohavi and Longbotham Citation2017; Athey and Luca Citation2019). Estimation of the average treatment effect from such an experiment (e.g., comparing a metric between treated users and control users) is a well-understood statistical task (Wager et al. Citation2016; Athey and Imbens Citation2017). In the application we consider below, instead, interest lies in estimating treatment effects on fine-grained groups – a separate treatment effect for each of thousands of different online advertisers. In this setting, EB techniques, such as Auroral, can stabilize estimates by sharing information across advertisers.

To apply Auroral to the task of fine-grained estimation of treatment effects, we require replicates. Interestingly, the data from the technology firm we work with is routinely organized and analyzed using “streaming buckets,” that is, the experiment data are divided into K chunks of approximately equal size, called streaming buckets. The buckets correspond to (approximately) disjoint subsets of users, partitioned at random, and so data across different buckets are independent to a sufficient approximation. We refer to Chamandy et al. (2012) for a detailed description of the motivation for using streaming buckets to deal with the data's scale and structure, as well as of the statistical and computational issues involved in the analysis of streaming bucket data. For the purpose of our application, each streaming bucket corresponds to a replicate in Equation (3).

We model the statistical problem of our application as follows. The metric of interest is the cost-per-click (CPC), which is the price a specific advertiser pays for an ad clicked by a user. The goal of the experiment is to estimate the change in CPC induced by the treatment. We have data for each of N advertisers and two treatment arms: control (w = 0) and treated (w = 1). The data are further divided into K buckets. For each advertiser $i = 1, \ldots, N$, bucket $j = 1, \ldots, K$, and treatment arm w = 0, 1, we record the total number of clicks $N_{ijw} \in \mathbb{N}_{>0}$ and the total cost of the clicks $A_{ijw}$. We define $\mathrm{CPC}_{ijw} := A_{ijw}/N_{ijw}$, the empirical cost-per-click for advertiser i in bucket j and treatment arm w.

In this application, the advertisers are the units and the buckets are the replicates. Each observation is $Z_{ij} = \mathrm{CPC}_{ij1} - \mathrm{CPC}_{ij0}$, and the goal is to estimate the treatment effect
(19) $\quad \mu_i := E[\mathrm{CPC}_{ij1} - \mathrm{CPC}_{ij0} \mid \mu_i, \alpha_i].$
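As a minimal sketch of how such data would be assembled for the methods below, the arrays A1, N1, A0, N0 are hypothetical placeholders for the per-advertiser, per-bucket cost and click totals, filled here with synthetic values purely so the snippet runs; the real inputs are the streaming-bucket aggregates described above.

```julia
# Synthetic placeholder data for illustration only: per-advertiser, per-bucket click
# totals and costs for the treated (w = 1) and control (w = 0) arms.
N_adv, K = 1000, 4
N1, N0 = rand(100:10_000, N_adv, K), rand(100:10_000, N_adv, K)                         # clicks
A1, A0 = N1 .* (0.5 .+ 0.1 .* randn(N_adv, K)), N0 .* (0.5 .+ 0.1 .* randn(N_adv, K))   # costs

Z = (A1 ./ N1) .- (A0 ./ N0)     # Z_ij = CPC_ij1 - CPC_ij0, one observation per advertiser and bucket
# The Auroral estimate (strategy 5 in the list below) is then aurora(Z, ols_fit_predict) from Section 5.
```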

μi,αi from model (3) capture advertiser-level idiosyncrasies. We consider the following estimation strategies:

  1. The aggregate estimate: we pool the data in all buckets and then compute the difference in CPCs, that is, $\hat{\mu}_i = \sum_{j=1}^K A_{ij1} \big/ \sum_{j=1}^K N_{ij1} - \sum_{j=1}^K A_{ij0} \big/ \sum_{j=1}^K N_{ij0}$.

  2. The mean of the $Z_{ij}$, that is, $\hat{\mu}_i = \frac{1}{K}\sum_{j=1}^K Z_{ij}$.

  3. The CC-L estimator, wherein for each j we use ordinary least squares to regress $Y_i^{(j)}$ onto the aggregate estimate in the other K − 1 buckets, that is, onto $\sum_{j' \ne j} A_{ij'1} \big/ \sum_{j' \ne j} N_{ij'1} - \sum_{j' \ne j} A_{ij'0} \big/ \sum_{j' \ne j} N_{ij'0}$.

  4. The CC-L estimator, wherein for each j we use ordinary least squares to regress $Y_i^{(j)}$ onto the mean of the other K − 1 buckets, that is, onto $\frac{1}{K-1}\sum_{j' \ne j} Z_{ij'}$.

  5. The Auroral estimator, which (for each j) regresses $Y_i^{(j)}$ on the order statistics $X_{i(\cdot)}^{(j)}$.

We empirically evaluate the methods as follows: we use data from an experiment running at Google for one week, retaining only the top advertisers based on number of clicks, resulting in N > 50,000. The number of replicates is equal to K = 4. As ground truth, we use the aggregate estimate based on experiment data from the 3 preceding and 3 succeeding weeks. Then, we compute the mean squared error of the estimates (calculated from the one week data) against the ground truth and report the percent change compared to the aggregate estimate.

The results are shown in Table 2. The table also shows the results of the same analysis applied to a second metric, the change in click-through rate (CTR), which is the proportion of times that an ad shown to a user by a given advertiser is actually clicked. We observe that the improvement in estimation error through Auroral is substantial. Furthermore, Auroral outperforms both variants of CC-L, which in turn outperform the estimators that do not share information across advertisers (i.e., the aggregate estimate and the mean of the $Z_{ij}$).

Table 2 Empirical performance on the advertiser-level estimation problem: Percent change in mean squared error for estimating change in cost-per-click (ΔCPC) and change in click-through rate (ΔCTR) compared to the aggregate estimate (± standard errors).

8 Conclusion

We have presented a general framework for constructing empirical Bayes estimators from K noisy replicates of each of N units. The basic idea of our method, which we term Aurora, is to leave one replicate out and regress this held-out replicate on the remaining K − 1 replicates. We then repeat this process over all choices of held-out replicate and average the results. We have shown that if the K − 1 replicates are first sorted, then even linear regression produces results that are competitive with the best methods, which usually make parametric assumptions, while our method is fully nonparametric.

We conclude by mentioning some direct extensions of Aurora that are suggested by its connection to regression.

8.1 More Powerful Regression Methods

In this article, we have used linear regression and k-Nearest-Neighbor regression to learn $\hat{m}(\cdot)$. But we can go further; for example, we could use isotonic regression (Guntuboyina and Sen 2018) based on a partial order on $X_{i(\cdot)}$. Or we could combine linear and isotonic regression by considering single index models with a nondecreasing link function (Balabdaoui, Durot, and Jankowski 2019), that is, predictors of the form $\hat{m}(x_{(\cdot)}) = t(\alpha^\top x_{(\cdot)})$, where $\|\alpha\|_2 = 1$ and t is an unknown nondecreasing function. Other possibilities include recursive partitioning (Breiman et al. 1984; Zeileis, Hothorn, and Hornik 2008), in which linear regression is fit on the leaves of a tree, or even random forests aggregated from such trees (Friedberg et al. 2021; Künzel et al. 2019).

8.2 More General Targets

We have only considered estimation of $\mu_i = E[Z_{ij} \mid \mu_i, \alpha_i]$ in Equation (3). As pointed out in Johns (1957, 1986), this naturally extends to parameters $\theta_i = E[h(Z_{ij}) \mid \mu_i, \alpha_i]$, where h is a known function. The only modification needed to estimate $\theta_i$ is that we fit a regression model to learn $E[h(Y_i) \mid X_{i(\cdot)}]$ instead. We may further extend Vernon Johns' observation to arbitrary U-statistics. Concretely, given $r < K$, $r \in \mathbb{N}$, and a fixed function $h: \mathbb{R}^r \to \mathbb{R}$, we can use Aurora to estimate $\theta_i = E[h(Z_{i1}, Z_{i2}, \ldots, Z_{ir}) \mid \mu_i, \alpha_i]$. In this case we need to hold out r replicates to form the response. For example, with r = 2 and $h(z_1, z_2) = (z_1 - z_2)^2/2$, we can estimate the conditional variance $\sigma_i^2 = \mathrm{var}[Z_{ij} \mid \mu_i, \alpha_i]$, as sketched below. Denoising the variance with empirical Bayes is an important problem that has proved essential for the analysis of genomic data (Smyth 2004; Lu and Stephens 2016). However, these articles assumed a parametric form of the likelihood, while Aurora would permit fully nonparametric estimation of the variance parameter.
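The variance example above could be sketched as follows (our own illustration, showing only a single choice of held-out pair; averaging over pairs would proceed as in Table 1).

```julia
# U-statistic extension with r = 2: the response h(Z_i1, Z_i2) = (Z_i1 - Z_i2)^2 / 2 is
# conditionally unbiased for sigma_i^2 given (mu_i, alpha_i); regress it on the order
# statistics of the remaining K - 2 replicates with any black-box fit_predict routine.
function aurora_variance_one_split(Z::AbstractMatrix, fit_predict)
    N, K = size(Z)
    Y    = (Z[:, 1] .- Z[:, 2]) .^ 2 ./ 2     # held-out pair forms the response
    Xord = sort(Z[:, 3:K], dims = 2)          # order statistics of the remaining replicates
    return fit_predict(Xord, Y)
end

# sigma2hat = aurora_variance_one_split(Z, ols_fit_predict)
```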

8.3 External Covariates

Model (3) posits a priori exchangeability of the N units. However, in many applications, domain experts also have access to side-information ζi about each unit. Hence, multiple authors (Fay and Herriot Citation1979; Tan Citation2016; Kou and Yang 2017; Banerjee, Mukherjee, and Sun Citation2020; Coey and Cunningham Citation2019; Ignatiadis and Wager Citation2019) have developed methods that improve mean estimation by utilizing information in the ζi . Aurora can be directly extended to accommodate such side information. Instead of regressing Yi on Xi(·), one regresses Yi on both Xi(·) and ζi .

9 Software

We provide reproducible code for our simulation results in the following Github repository: https://github.com/nignatiadis/AuroraPaper. A package implementing the method is available at https://github.com/nignatiadis/Aurora.jl. The package has been implemented in the Julia programming language (Bezanson et al. Citation2017).


Acknowledgments

We are grateful to Niall Cardin, Michael Sklar, and Stefan Wager for helpful discussions and comments on the article. We would also like to thank the associate editor and the anonymous reviewers for their insightful and helpful suggestions.

Supplementary Material

The supplementary materials contain details and proofs for theoretical claims made in the article and a description of the implementation of Aurora with k-Nearest-Neighbors.

Notes

1 Here we use the fact that the entries of Zi are independent conditionally on μi,αi. This inequality is also known in the theory of U-statistics (Hoeffding Citation1948, theor. 5.2).

2 The details are given in Supplement D.

3 That is, the map $(X_{i(\cdot)}^{(j)}, Y_i^{(j)})_{i=1,\ldots,N} \mapsto \hat{m}_j$ is the same for all j. This does not imply that $\hat{m}_j = \hat{m}_{j'}$ for $j \ne j'$, since the training data changes.

4 The definition below assumes no ties. In the proof we explain how to randomize to deal with ties.

5 Vernon Johns (Citation1957) established a result similar to Theorem 4. We find this remarkable, since Johns (Citation1957) anticipated later developments in nonparametric regression, independently proving universal consistency for partition-based regression estimators as an intermediate step.

6 By the law of total variance, almost surely it holds that
(17) $\quad \mathrm{var}[Y_i \mid X_{i(\cdot)}] = \mathrm{var}[E[Y_i \mid \mu_i, \alpha_i, X_{i(\cdot)}] \mid X_{i(\cdot)}] + E[\mathrm{var}[Y_i \mid \mu_i, \alpha_i, X_{i(\cdot)}] \mid X_{i(\cdot)}] = \mathrm{var}[\mu_i \mid X_{i(\cdot)}] + E[\sigma_i^2 \mid X_{i(\cdot)}].$

Thus, for example, when $\mu_i, \sigma_i^2$ have bounded support, there exists C > 0 such that $\mathrm{var}[Y_i \mid X_{i(\cdot)}] \le C$ almost surely, and so the stated assumption holds with $C_N = C$.

7 See Supplement D for details regarding the regret bounds we discuss here.

8 In location families, the Fisher information is constant as a function of the location parameter μi.

9 Since $K \to \infty$, the likelihood swamps the prior. Furthermore, we will place regularity assumptions on the prior to rule out the possibility of superefficiency.

10 This adaptivity is perhaps expected, in light of existing theory on semiparametric efficiency in location families. For example, it is known that even for N = 1 one can asymptotically (as $K \to \infty$) match the variance of the parametric maximum likelihood estimator in symmetric location families, even without precise knowledge of F (Stein 1956; Bickel et al. 1998). However, the simulations of Section 6.1 demonstrate that Auroral adapts to the unknown likelihood already for K = 10, while semiparametric efficiency results are truly asymptotic in K, requiring an initial nonparametric density estimate.

11 Throughout this section, we calculate the mean squared error by averaging over 100 Monte Carlo replicates.

12 CC-L and James-Stein perform so similarly in the simulations of this subsection that they are indistinguishable in all panels of Figure 2, with the difference of their MSEs smaller than $10^{-3}$ in all cases. In the first panel (Normal likelihood), Auroral is also indistinguishable from CC-L and James-Stein.

13 Even for the Normal likelihood, the assumptions of the implemented NPMLE are not fully satisfied, since we use an estimated variance $\hat{\sigma}_i^2$ for each unit.

References

  • Athey, S., and Imbens, G. W. (2017), “The Econometrics of Randomized Experiments,” in Handbook of Economic Field Experiments, Vol. 1, pp. 73–140. Elsevier.
  • Athey, S., and Luca, M. (2019), “Economists (and Economics) in Tech Companies,” Journal of Economic Perspectives, 33, 209–30. DOI: 10.1257/jep.33.1.209.
  • Balabdaoui, F., Durot, C., and Jankowski, H. (2019), “Least Squares Estimation in the Monotone Single Index Model,” Bernoulli, 25, 3276–3310. DOI: 10.3150/18-BEJ1090.
  • Baldi, P., and Long, A. D. (2001), “A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized t-Test and Statistical Inferences of Gene Changes,” Bioinformatics, 17, 509–519. DOI: 10.1093/bioinformatics/17.6.509.
  • Banerjee, T., Mukherjee, G., and Sun, W. (2020), “Adaptive Sparse Estimation With Side Information,” Journal of the American Statistical Association, 115, 2053–2067. DOI: 10.1080/01621459.2019.1679639.
  • Bennett, C. A. (1952), “Asymptotic Properties of Ideal Linear Estimators,” Unpublished dissertation, University of Michigan.
  • Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017), “Julia: A Fresh Approach to Numerical Computing,” SIAM Review, 59, 65–98. DOI: 10.1137/141000671.
  • Bickel, P. J., Klaassen, C. A., Ritov, Y., and Wellner, J. A. (1998), Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins Series in the Mathematical Sciences. New York: Springer.
  • Boca, S. M., and Leek, J. T. (2018), “A Direct Approach to Estimating False Discovery Rates Conditional on Covariates,” PeerJ, 6, e6035. DOI: 10.7717/peerj.6035.
  • Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984), Classification and Regression Trees, Boca Raton, FL: CRC Press.
  • Brown, L. D., and Greenshtein, E. (2009), “Nonparametric Empirical Bayes and Compound Decision Approaches to Estimation of a High-Dimensional Vector of Normal Means,” The Annals of Statistics, 37, 1685–1704. DOI: 10.1214/08-AOS630.
  • Chamandy, N., Muralidharan, O., Najmi, A., and Naidu, S. (2012), “Estimating Uncertainty for Massive Data Streams,” Technical report, Google. https://research.google/pubs/pub43157/
  • Chatterjee, S. (2013), “Assumptionless Consistency of the Lasso,” arXiv:1303.5817.
  • Chernoff, H., Gastwirth, J. L., and Johns, M. V. (1967), “Asymptotic Distribution of Linear Combinations of Functions of Order Statistics With Applications to Estimation,” The Annals of Mathematical Statistics, 38, 52–72. DOI: 10.1214/aoms/1177699058.
  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018), “Double/Debiased Machine Learning for Treatment and Structural Parameters,” The Econometrics Journal, 21, C1–C68. DOI: 10.1111/ectj.12097.
  • Coey, D., and Cunningham, T. (2019), “Improving Treatment Effect Estimators Through Experiment Splitting,” in The World Wide Web Conference (WWW ’19), pp. 285–295. New York, NY: Association for Computing Machinery. DOI: 10.1145/3308558.3313452.
  • Cox, D. R. (1975), “A Note on Data-Splitting for the Evaluation of Significance Levels,” Biometrika, 62, 441–444. DOI: 10.1093/biomet/62.2.441.
  • Devanarayan, V., and Stefanski, L. A. (2002), “Empirical Simulation Extrapolation for Measurement Error Models With Replicate Measurements,” Statistics & Probability Letters, 59, 219–225.
  • Diaconis, P., and Ylvisaker, D. (1979), “Conjugate Priors for Exponential Families,” The Annals of Statistics, 7, 269–281. DOI: 10.1214/aos/1176344611.
  • Efron, B. (2012), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Vol. 1. Cambridge: Cambridge University Press.
  • Efron, B., and Morris, C. (1973), “Stein’s Estimation Rule and Its Competitors—An Empirical Bayes Approach,” Journal of the American Statistical Association, 68, 117–130.
  • Efron, B., and Stein, C. (1981), “The Jackknife Estimate of Variance,” The Annals of Statistics, 9, 586–596. DOI: 10.1214/aos/1176345462.
  • Fay, R. E., and Herriot, R. A. (1979), “Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data,” Journal of the American Statistical Association, 74, 269–277. DOI: 10.1080/01621459.1979.10482505.
  • Fithian, W., and Ting, D. (2017), “Family Learning: Nonparametric Statistical Inference With Parametric Efficiency,” arXiv:1711.10028.
  • Friedberg, R., Tibshirani, J., Athey, S., and Wager, S. (2021), “Local Linear Forests,” Journal of Computational and Graphical Statistics, 30, 503–517. DOI: 10.1080/10618600.2020.1831930.
  • Friedman, J. H., and Stuetzle, W. (1981), “Projection Pursuit Regression,” Journal of the American statistical Association, 76, 817–823. DOI: 10.1080/01621459.1981.10477729.
  • Galvao, A. F., and Kato, K. (2014), “Estimation and Inference for Linear Panel Data Models Under Misspecification When Both n and t Are Large,” Journal of Business & Economic Statistics, 32, 285–309. DOI: 10.1080/07350015.2013.875473.
  • Gierliński, M., Cole, C., Schofield, P., Schurch, N. J., Sherstnev, A., Singh, V., Wrobel, N., Gharbi, K., Simpson, G., and Owen-Hughes, T. (2015), “Statistical Models for RNA-seq Data Derived From a Two-Condition 48-Replicate Experiment,” Bioinformatics, 31, 3625–3630. DOI: 10.1093/bioinformatics/btv425.
  • Griffin, B. S., and Krutchkoff, R. G. (1971), “Optimal Linear Estimators: An Empirical Bayes Version With Application to the Binomial Distribution,” Biometrika, 58, 195–201. DOI: 10.1093/biomet/58.1.195.
  • Guntuboyina, A., and Sen, B. (2018), “Nonparametric Shape-Restricted Regression,” Statistical Science, 33, 568–594. DOI: 10.1214/18-STS665.
  • Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2002), A Distribution-Free Theory of Nonparametric Regression, New York: Springer, 88.
  • Habiger, J. D., and Peña, E. A. (2014), “Compound p-value Statistics for Multiple Testing Procedures,” Journal of Multivariate Analysis, 126, 153–166. DOI: 10.1016/j.jmva.2014.01.007.
  • Hall, P., and Yao, Q. (2003), “Inference in Components of Variance Models With Low Replication,” The Annals of Statistics, 31, 414–441. DOI: 10.1214/aos/1051027875.
  • Hartigan, J. (1969), “Linear Bayesian Methods,” Journal of the Royal Statistical Society, Series B, 31, 446–454. DOI: 10.1111/j.2517-6161.1969.tb00804.x.
  • Hastie, T., Tibshirani, R., and Friedman, J. H. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.), Springer Series in Statistics. New York: Springer.
  • Hoeffding, W. (1948), “A Class of Statistics With Asymptotically Normal Distribution,” The Annals of Mathematical Statistics, 19, 293–325. DOI: 10.1214/aoms/1177730196.
  • Horowitz, J. L., and Markatou, M. (1996), “Semiparametric Estimation of Regression Models for Panel Data,” The Review of Economic Studies, 63, 145–168. DOI: 10.2307/2298119.
  • Ignatiadis, N., and Huber, W. (2021), “Covariate Powered Cross-Weighted Multiple Testing,” Journal of the Royal Statistical Society, Series B, DOI: 10.1111/rssb.12411.
  • Ignatiadis, N., and Wager, S. (2019), “Covariate-Powered Empirical Bayes Estimation,” in Advances in Neural Information Processing Systems, pp. 9620–9632.
  • James, W., and Stein, C. (1961), “Estimation With Quadratic Loss,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 361–379.
  • Jiang, W., and Zhang, C.-H. (2009), “General Maximum Likelihood Empirical Bayes Estimation of Normal Means,” The Annals of Statistics, 37, 1647–1684. DOI: 10.1214/08-AOS638.
  • Jochmans, K., and Weidner, M. (2018), “Inference on a Distribution From Noisy Draws,” arXiv:1803.04991.
  • Johns, M. V. (1957), “Non-Parametric Empirical Bayes Procedures,” The Annals of Mathematical Statistics, 28, 649–669. DOI: 10.1214/aoms/1177706877.
  • Johns, M. V. (1986), “Fully Nonparametric Empirical Bayes Estimation Via Projection Pursuit,” in Adaptive Statistical Procedures and Related Topics, ed. J. Van Ryzin, pp. 164–178. Hayward, CA: Institute of Mathematical Statistics.
  • Jung, J. (1956), “On Linear Estimates Defined by a Continuous Weight Function,” Arkiv för Matematik, 3, 199–209. DOI: 10.1007/BF02589406.
  • Kiefer, J., and Wolfowitz, J. (1956), “Consistency of the Maximum Likelihood Estimator in the Presence of Infinitely Many Incidental Parameters,” The Annals of Mathematical Statistics, 27, 887–906. DOI: 10.1214/aoms/1177728066.
  • Koenker, R., and Gu, J. (2017), “REBayes: Empirical Bayes Mixture Methods in R,” Journal of Statistical Software, 82, 1–26. DOI: 10.18637/jss.v082.i08.
  • Koenker, R., and Mizera, I. (2014), “Convex Optimization, Shape Constraints, Compound Decisions, and Empirical Bayes Rules,” Journal of the American Statistical Association, 109, 674–685. DOI: 10.1080/01621459.2013.869224.
  • Kohavi, R., and Longbotham, R. (2017), “Online Controlled Experiments and A/B Testing,” Encyclopedia of Machine Learning and Data Mining, 7, 922–929.
  • Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., and Pohlmann, N. (2013), “Online Controlled Experiments at Large Scale,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1168–1176.
  • Kou, S., and Yang, J. J. (2017), “Optimal Shrinkage Estimation in Heteroscedastic Hierarchical Linear Models,” in Big and Complex Data Analysis, ed. S. Ahmed. Cham: Springer. DOI: 10.1007/978-3-319-41573-4.
  • Krutchkoff, R. G. (1967), “A Supplementary Sample Non-Parametric Empirical Bayes Approach to Some Statistical Decision Problems,” Biometrika, 54, 451–458. DOI: 10.1093/biomet/54.3-4.451.
  • Künzel, S. R., Saarinen, T. F., Liu, E. W., and Sekhon, J. S. (2019), “Linear Aggregation in Tree-Based Estimators,” arXiv:1906.06463.
  • Lönnstedt, I., and Speed, T. (2002), “Replicated Microarray Data,” Statistica Sinica, 12, 31–46.
  • Love, M. I., Huber, W., and Anders, S. (2014), “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data With DESeq2,” Genome Biology, 15, 550. DOI: 10.1186/s13059-014-0550-8.
  • Lu, M., and Stephens, M. (2016), “Variance Adaptive Shrinkage (vash): Flexible Empirical Bayes Estimation of Variances,” Bioinformatics, 32, 3428–3434. DOI: 10.1093/bioinformatics/btw483.
  • Lwin, T. (1976), “Optimal Linear Estimators of Location and Scale Parameters Using Order Statistics and Related Empirical Bayes Estimation,” Scandinavian Actuarial Journal, 1976, 79–91. DOI: 10.1080/03461238.1976.10405604.
  • Maritz, J. S. (1974), “Aligning of Estimates: An Alternative to Empirical Bayes Methods,” Australian Journal of Statistics, 16, 135–143. DOI: 10.1111/j.1467-842X.1974.tb00928.x.
  • Muralidharan, O. (2012), “High Dimensional Exponential Family Estimation Via Empirical Bayes,” Statistica Sinica, 22, 1217–1232.
  • Neumann, M. H. (2007), “Deconvolution From Panel Data With Unknown Error Distribution,” Journal of Multivariate Analysis, 98, 1955–1968. DOI: 10.1016/j.jmva.2006.09.012.
  • Norberg, R. (1980), “Empirical Bayes Credibility,” Scandinavian Actuarial Journal, 1980, 177–194. DOI: 10.1080/03461238.1980.10408653.
  • Okui, R., and Yanagi, T. (2020), “Kernel Estimation for Panel Data With Heterogeneous Dynamics,” The Econometrics Journal, 23, 156–175. DOI: 10.1093/ectj/utz019.
  • Rao, J., and Molina, I. (2015), Small Area Estimation, Wiley Series in Survey Methodology. Hoboken, NJ: Wiley.
  • Robbins, H. (1956), “An Empirical Bayes Approach to Statistics,” in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, ed. J. Neyman, pp. 157–163. Berkeley and Los Angeles: University of California Press.
  • Robbins, H. (1964), “The Empirical Bayes Approach to Statistical Decision Problems,” The Annals of Mathematical Statistics, 35, 1–20.
  • Robbins, H. (1983), “Some Thoughts on Empirical Bayes Estimation,” The Annals of Statistics, 11, 713–723.
  • Rosset, S., and Tibshirani, R. J. (2020), “From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation,” Journal of the American Statistical Association, 115, 138–151. DOI: 10.1080/01621459.2018.1424632.
  • Rubin, D., Dudoit, S., and Van der Laan, M. (2006), “A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting,” Statistical Applications in Genetics and Molecular Biology, 5. DOI: 10.2202/1544-6115.1148.
  • Saha, S., and Guntuboyina, A. (2020), “On the Nonparametric Maximum Likelihood Estimator for Gaussian Location Mixture Densities With Application to Gaussian Denoising,” The Annals of Statistics, 48, 738–762.
  • Schennach, S. M. (2004), “Estimation of Nonlinear Models With Measurement Error,” Econometrica, 72, 33–75. DOI: 10.1111/j.1468-0262.2004.00477.x.
  • Smyth, G. K. (2004), “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments,” Statistical Applications in Genetics and Molecular Biology, 3, 1–25. DOI: 10.2202/1544-6115.1027.
  • Stein, C. (1956), “Efficient Nonparametric Testing and Estimation,” in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, ed. J. Neyman, pp. 187–195. Berkeley and Los Angeles: University of California Press.
  • Stigler, S. M. (1990), “The 1988 Neyman Memorial Lecture: A Galtonian Perspective on Shrinkage Estimators,” Statistical Science, 5, 147–155. DOI: 10.1214/ss/1177012274.
  • Stone, C. J. (1977), “Consistent Nonparametric Regression,” The Annals of Statistics, 5, 595–620. DOI: 10.1214/aos/1176343886.
  • Tan, Z. (2016), “Steinized Empirical Bayes Estimation for Heteroscedastic Data,” Statistica Sinica, 26, 1219–1248. DOI: 10.5705/ss.202014.0069.
  • Tang, D., Agarwal, A., O’Brien, D., and Meyer, M. (2010), “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 17–26.
  • Van der Vaart, A. W. (2000), Asymptotic Statistics, Vol. 3. Cambridge: Cambridge University Press.
  • Van Ryzin, J., ed. (1986), Adaptive Statistical Procedures and Related Topics: Proceedings of a Symposium in Honor of Herbert Robbins, June 7–11, 1985, Brookhaven National Laboratory, Upton, New York. Hayward, CA: Institute of Mathematical Statistics.
  • Wager, S., Du, W., Taylor, J., and Tibshirani, R. J. (2016), “High-Dimensional Regression Adjustments in Randomized Experiments,” Proceedings of the National Academy of Sciences, 113, 12673–12678. DOI: 10.1073/pnas.1614732113.
  • Weinstein, A., Ma, Z., Brown, L. D., and Zhang, C.-H. (2018), “Group-Linear Empirical Bayes Estimates for a Heteroscedastic Normal Mean,” Journal of the American Statistical Association, 113, 698–710. DOI: 10.1080/01621459.2017.1280406.
  • Xie, X., Kou, S., and Brown, L. D. (2012), “SURE Estimates for a Heteroscedastic Hierarchical Model,” Journal of the American Statistical Association, 107, 1465–1479. DOI: 10.1080/01621459.2012.728154.
  • Zeileis, A., Hothorn, T., and Hornik, K. (2008), “Model-Based Recursive Partitioning,” Journal of Computational and Graphical Statistics, 17, 492–514. DOI: 10.1198/106186008X319331.