Synthetic likelihood in misspecified models

Abstract

Bayesian synthetic likelihood is a widely used approach for conducting Bayesian analysis in complex models where evaluation of the likelihood is infeasible but simulation from the assumed model is tractable. We analyze the behaviour of the Bayesian synthetic likelihood posterior when the assumed model differs from the actual data generating process. We demonstrate that the Bayesian synthetic likelihood posterior can display a wide range of non-standard behaviours depending on the level of model misspecification, including multimodality and asymptotic non-Gaussianity. Our results suggest that likelihood tempering, a common approach for robust Bayesian inference, fails for synthetic likelihood whilst recently proposed robust synthetic likelihood approaches can ameliorate this behavior and deliver reliable posterior inference under model misspecification. All results are illustrated using a simple running example.

Disclaimer

As a service to authors and researchers we are providing this version of an accepted manuscript (AM). Copyediting, typesetting, and review of the resulting proofs will be undertaken on this manuscript before final publication of the Version of Record (VoR). During production and pre-press, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal relate to these versions also.

1 Introduction

Approximate Bayesian methods, sometimes called likelihood-free methods, have become a common approach to conduct Bayesian inference in situations where the likelihood function is intractable. Two of the most prominent statistical methods in this paradigm are approximate Bayesian computation, see Marin et al. (2012) for a review and Sisson et al. (2018) for a handbook treatment, and the method of synthetic likelihood (Wood, 2010). Synthetic likelihood-based inference is often conducted by placing a prior over the unknown model parameters and using Markov chain Monte Carlo methods to sample the resulting posterior. Throughout the remainder we refer to such methods as Bayesian synthetic likelihood, and refer to Price et al. (2018) for an introduction.

The goal of both approximate Bayesian computation (ABC) and Bayesian synthetic likelihood (BSL) is to conduct inference on the unknown model parameters by simulating summary statistics under the assumed model and matching them against observed summaries calculated from the data. The simulated summary statistics are used to construct an estimate of the likelihood, which is then used to conduct posterior inference on the model unknowns. While ABC implicitly constructs a nonparametric estimate of the likelihood for the summaries, synthetic likelihood uses a Gaussian approximation with an estimated mean and variance.

As these methods have grown in prominence, much research has been conducted to understand the benefits and disadvantages of different approximate Bayesian approaches. In terms of statistical regularity, i.e., large sample behavior, the method of approximate Bayesian computation behaves quite regularly in the context of correct model specification (see, e.g., Li and Fearnhead, 2018, and Frazier et al., 2018), with Frazier et al. (2022) demonstrating that BSL displays similar large sample behavior to ABC, while scaling to higher-dimensional summaries more easily than simple implementations of ABC. For a more in-depth comparison of ABC and BSL, we refer to Section 3.1.3 of Martin et al. (2023).

The goal of both methods is to conduct inference in models that are so complicated that the resulting likelihood is intractable. However, models are only ever an approximation of reality and, thus, correct specification is unlikely. Hence, for a diverse collection of summary statistics, it is unlikely that the assumed model can exactly match all the summaries calculated from the observed data; with this problem likely exacerbated in early phases of model exploration, design, and formulation. When the summaries simulated under the assumed model cannot match the observed summaries for any value of the unknown model parameters, we say that the model is misspecified. This notion of misspecification is consistent with previous analyses of misspecification in ABC (see, e.g., Marin et al., 2014, and Frazier et al., 2020).

Several authors have now discussed the impacts of model misspecification in likelihood-based Bayesian inference (see, e.g., Kleijn and van der Vaart, 2012, Miller and Dunson, 2019, Bhattacharya et al., 2019), and it is known that ABC posteriors display non-standard asymptotic behaviors in misspecified models (Frazier et al., 2020). Given the links between ABC and BSL, it is critical for us to understand the behavior of the BSL posterior in misspecified models. However, we are unaware of any research that rigorously characterises the behavior of BSL under model misspecification.

The need to theoretically examine the behavior of the BSL posterior in misspecified models is also motivated by the empirical analysis carried out in Frazier and Drovandi (2021), where the authors present an empirical example showing that the BSL posterior displays non-standard behavior at a small sample size. Critically, Frazier and Drovandi (2021) did not explore whether this behavior abated as the sample size increased, or whether it was an artefact of their Monte Carlo approximation; nor did the authors explore the mechanism causing this non-standard posterior behavior, or present any theoretical results on the behavior of the BSL posterior in misspecified models.

In this manuscript we make four contributions to the literature on approximate Bayesian methods, and BSL methods more particularly. Our first contribution is to formally demonstrate that the BSL posterior is sensitive to model misspecification: depending on the nature of the model misspecification, we prove that the BSL posterior can display non-standard asymptotic concentration or standard (i.e., Gaussian) concentration. These results deviate from those in correctly specified models, where Frazier et al. (2022) demonstrate that the BSL posterior displays standard concentration.¹

Our second contribution is to categorise the wide range of behaviours that the BSL posterior can present in misspecified models, which we illustrate empirically using a running example, while also verifying the technical conditions needed for our theoretical results in this example. As part of this analysis, we significantly extend the initial findings of Frazier and Drovandi (2021) by empirically demonstrating novel behaviours that the BSL posterior exhibits in misspecified models, including concentration onto a boundary point, multi-modality, and regions of posterior flatness. Critically, our theoretical analysis demonstrates that the non-standard behavior of the BSL posterior is not caused by Monte Carlo approximations, or a small sample issue, but is driven by the asymptotic behavior of the synthetic likelihood.

The third contribution of this work is to highlight the differing behaviour of the BSL and ABC posteriors in misspecified models. In contrast to the case of correct model specification, the ABC and BSL posteriors concentrate onto different points in the parameter space when the model is misspecified. Our final contribution is an in-depth comparison of three possible approaches for dealing with model misspecification when conducting likelihood-free Bayesian inference. In this comparison, we theoretically demonstrate that a popular approach to robust Bayesian inference, likelihood tempering, does not ameliorate the non-standard behavior of the BSL posterior. However, we show empirically and theoretically that certain “robust” BSL approaches deliver reliable approximate Bayesian inference even when the model is misspecified. In particular, we propose a novel modification of the BSL adjustment procedure in Frazier et al. (2022) that can adequately handle model misspecification, and we formally demonstrate that this procedure delivers valid uncertainty quantification in misspecified models.

To motivate our analysis, we first illustrate the sensitivity of the BSL posterior to model misspecification in a simple example.

Running example: moving average model of order one

The researcher believes the observed data $y = {(y_{1}, \dots, y_{n})}^{⊤}$ is generated according to a moving average model of order one, $y_{t} = e_{t} + θ e_{t - 1} \quad (t = 1, \dots, n),$ (1) with $e_{t}$ independent and identically distributed standard normal, and our prior beliefs are uniform over $θ \in [- 1, 1]$. We take as summary statistics the sample autocovariances $γ_{j} (y_{1 : n}) = \frac{1}{n} \sum_{t = 1 + j}^{n} y_{t} y_{t - j}$, for $j \in \{0, 1\}$, and let $S_{n} (y) = {(γ_{0} (y_{1 : n}), γ_{1} (y_{1 : n}))}^{⊤}$ denote the observed summaries we will use to conduct inference on $θ$.

While the assumed model is (1), the data actually evolves according to a stochastic volatility model: for $0 < ρ < 1$, $0 < σ_{v} < 1$, and $u_{t}$, $v_{t}$ independent standard normal errors, $y_{t} = \exp (h_{t} / 2) u_{t}, \quad h_{t} = ω + ρ h_{t - 1} + σ_{v} v_{t} \quad (t = 1, \dots, n).$ (2)

Under the process in (2), the assumed model in (1) is misspecified; however, for any value of $ω, ρ, σ_{v}$ above, the population autocovariances of order one and higher are zero. Therefore, a priori we expect the Bayesian synthetic likelihood posterior for $θ$ to have significant mass near $θ = 0$, as this yields simulated data with no autocorrelation, and would most closely “match” the observed summaries. Throughout the remainder, unless otherwise stated, we use the term posterior to refer to the Bayesian synthetic likelihood posterior.

We generate data from the model in (2) with parameter values $ω = - 0.10, ρ = 0.90$ and $σ_{v} = 0.40$ , which produces a series that displays many of the same features as monthly asset returns, and we consider three different sample sizes: n = 100, 500, 1000. For each sample size and dataset, we plot the exact posterior in Figure 1; we refer to Supplementary Appendix E.1 for further details regarding construction of the exact posterior in this example.
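
The data-generating step and the summaries can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names `simulate_sv`, `autocov` and `summaries`, the burn-in length, the start at the stationary mean of $h_{t}$, and the random seed are all choices made for this sketch.

```python
import math
import random


def simulate_sv(n, omega, rho, sigma_v, rng, burn_in=500):
    """Draw y_1..y_n from y_t = exp(h_t / 2) u_t, h_t = omega + rho h_{t-1} + sigma_v v_t."""
    h = omega / (1.0 - rho)  # start the log-volatility at its stationary mean (assumption)
    y = []
    for t in range(burn_in + n):
        h = omega + rho * h + sigma_v * rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)
        if t >= burn_in:  # discard burn-in draws (an assumption of this sketch)
            y.append(math.exp(h / 2.0) * u)
    return y


def autocov(y, j):
    """Sample autocovariance gamma_j = (1/n) * sum_{t=1+j}^{n} y_t y_{t-j}."""
    n = len(y)
    return sum(y[t] * y[t - j] for t in range(j, n)) / n


def summaries(y):
    """Observed summaries S_n(y) = (gamma_0, gamma_1)."""
    return (autocov(y, 0), autocov(y, 1))


rng = random.Random(0)  # fixed seed, chosen arbitrarily
for n in (100, 500, 1000):
    y = simulate_sv(n, omega=-0.10, rho=0.90, sigma_v=0.40, rng=rng)
    print(n, summaries(y))
```

The printed first-order autocovariance should fluctuate around zero, which is the feature of the data that the MA(1) model can only reproduce at $θ = 0$.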

At a small sample size (n = 100), the posterior appears to be well-behaved with a single mode around θ = 0. However, as the sample size increases the posterior becomes bi-modal with well-separated modes of nearly equal height that both lie in the interior of the parameter space (n = 1000).² The emergence of this bi-modality as the sample size increases signals the presence of a non-standard asymptotic phenomenon, and suggests that the posterior will not concentrate onto a single point. Moreover, in this example the normality of the summary statistics is reasonable: both summaries can be verified to satisfy a central limit theorem under the process in (2). This behavior is surprising, and worrisome, given that the value of θ that (asymptotically) minimizes the Euclidean distance between the observed and simulated summaries is θ = 0. While θ = 0 ensures that the simulated summaries are as close as possible to the observed summaries, the posterior has little mass near this point at large sample sizes. Instead, at larger sample sizes the posterior gives the impression that we require meaningful autocorrelation to match the observed summaries, when in fact the observed data has no autocorrelation.

In the remainder of the paper, we elaborate on the above behavior and characterize the asymptotic behavior of the posterior in misspecified models. The paper is organized as follows. In Section 2, we discuss the relevant concept of model misspecification in synthetic likelihood. In Section 3, we characterize the asymptotic behavior of the BSL posterior in misspecified models and demonstrate that the posterior may be asymptotically non-Gaussian. We also compare this theoretical behavior to what is obtained in the case of ABC. In Section 4, we obtain new insights into approaches aimed at dealing with model misspecification. Section 5 concludes. Proofs of all results are contained in the Supplementary Appendix.

2 Synthetic likelihood and model misspecification

Let $y = {(y_{1}, \dots, y_{n})}^{⊤}$ denote the observed data and define $P_{•}^{(n)}$ as the true distribution of $y$. The observed data is assumed to be generated from a class of parametric models $\{P_{θ}^{(n)} : θ \in Θ \subseteq ℝ^{d_{θ}}\}$ for which the likelihood function is intractable, but from which we can easily simulate pseudo-data $z$ for any $θ \in Θ$. Let $Π$ denote the prior measure for $θ$ and $π (θ)$ its density.

Since the likelihood function is intractable, we conduct inference using approximate Bayesian methods. The main idea is to search for values of θ that produce pseudo-data z which is “close enough” to y, and then retain these values to build an approximation to the posterior. To make the problem computationally practical, the comparison is generally carried out using summaries of the data. Let $S_{n} : ℝ^{n} \to ℝ^{d}, d \geq d_{θ}$ , denote the vector summary statistic mapping used in the analysis. Where there is no confusion, we write S_n for the mapping or its value when evaluated at the observed data y.

The method of BSL approximates the intractable distribution of $S_{n} (y) | θ$ using a Gaussian distribution with mean $b (θ) = E \{S_{n} (z) | θ\}$ and variance $Σ_{n} (θ) = \operatorname{var} \{S_{n} (z) | θ\}$, both of which are calculated under $P_{θ}^{(n)}$. The map $θ \mapsto E \{S_{n} (z) | θ\}$ may technically depend on $n$; however, if the data are iid or weakly dependent, and $S_{n}$ takes the form of an average, then $E \{S_{n} (z) | θ\}$ will not depend on $n$ in any meaningful way. As the majority of summaries used in BSL take the form of averages, it is reasonable to neglect the potential dependence on $n$. We note that the notation $b (θ)$ has also been used in several papers on ABC and BSL; see, e.g., Frazier et al. (2018) and Frazier et al. (2022).

The synthetic likelihood is denoted as $N \{S_{n}; b (θ), Σ_{n} (θ)\}$, where $N (x; μ, Σ)$ is the normal density function evaluated at $x$ with mean $μ$ and covariance matrix $Σ$. In typical applications $b (θ)$ and $Σ_{n} (θ)$ are unknown, and are estimated using the sample mean ${\hat{b}}_{n} (θ)$ and sample variance ${\hat{Σ}}_{n} (θ)$ calculated using $m$ independent simulated datasets. These sample quantities are depicted as $n$-dependent, rather than $m$-dependent, as we later take $m$ to diverge as $n$ diverges.

Wood (2010) and Price et al. (2018) suggest exploring the estimated synthetic likelihood $N \{S_{n}; {\hat{b}}_{n} (θ), {\hat{Σ}}_{n} (θ)\}$ and obtaining point estimates of $θ$ using Markov chain Monte Carlo. Following Andrieu and Roberts (2009), the use of $N \{S_{n}; {\hat{b}}_{n} (θ), {\hat{Σ}}_{n} (θ)\}$ within Markov chain Monte Carlo results in draws from the target posterior $\begin{aligned} \hat{π} (θ | S_{n}) &\propto π (θ) {\hat{g}}_{n} (S_{n} | θ), \\ {\hat{g}}_{n} (S_{n} | θ) &= \int N \{S_{n}; {\hat{b}}_{n} (θ), {\hat{Σ}}_{n} (θ)\} \prod_{i = 1}^{m} d P_{θ}^{(n)} \{S_{n} (z^{i})\} \, d S_{n} (z^{1}) \dots d S_{n} (z^{m}), \end{aligned}$ where ${\hat{g}}_{n} (S_{n} | θ)$ is the expectation of the estimated synthetic likelihood. See Price et al. (2018) and Frazier et al. (2022) for a discussion on the connection between pseudo-marginal methods and BSL. In contrast, if $b (θ)$ and $Σ_{n} (θ)$ were known, we could target the exact posterior $π (θ | S_{n}) \propto π (θ) N \{S_{n}; b (θ), Σ_{n} (θ)\}.$ (3)
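
To make the estimated synthetic likelihood concrete, the following sketch evaluates $\log N \{S_{n}; {\hat{b}}_{n} (θ), {\hat{Σ}}_{n} (θ)\}$ once for the MA(1) running example, using $m$ simulated datasets. It is a minimal illustration rather than a reference implementation: the helper names (`simulate_ma1`, `summaries`, `log_normal_2d`, `estimated_log_sl`) are hypothetical, the 2x2 algebra is written out by hand to stay within the standard library, and a full BSL sampler would call such an evaluation inside each Markov chain Monte Carlo iteration.

```python
import math
import random


def simulate_ma1(n, theta, rng):
    """Simulate pseudo-data z_1..z_n from y_t = e_t + theta * e_{t-1}."""
    e_prev = rng.gauss(0.0, 1.0)
    y = []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        y.append(e + theta * e_prev)
        e_prev = e
    return y


def summaries(y):
    """Summaries (gamma_0, gamma_1) of a series."""
    n = len(y)
    g0 = sum(v * v for v in y) / n
    g1 = sum(y[t] * y[t - 1] for t in range(1, n)) / n
    return (g0, g1)


def log_normal_2d(s, mean, cov):
    """Log density of a bivariate normal with covariance [[a, c], [c, b]]."""
    (a, c), (_, b) = cov
    det = a * b - c * c
    d0, d1 = s[0] - mean[0], s[1] - mean[1]
    quad = (b * d0 * d0 - 2.0 * c * d0 * d1 + a * d1 * d1) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def estimated_log_sl(s_obs, theta, n, m, rng):
    """One noisy evaluation of the log synthetic likelihood from m simulations."""
    sims = [summaries(simulate_ma1(n, theta, rng)) for _ in range(m)]
    mean = [sum(s[k] for s in sims) / m for k in range(2)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in sims) / (m - 1)
            for j in range(2)] for i in range(2)]
    return log_normal_2d(s_obs, mean, cov)


rng = random.Random(0)  # arbitrary seed
print(estimated_log_sl((1.0, 0.0), theta=0.0, n=200, m=200, rng=rng))
```

Because the sample mean and variance are recomputed from fresh simulations at every call, repeated evaluations at the same $θ$ return different values; this is the Monte Carlo noise that the pseudo-marginal argument above accounts for.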

2.1 Model misspecification

While Bayesian synthetic likelihood is based on a likelihood, it is not a likelihood for $y$ but for $S_{n} (y)$, and this “likelihood” is itself a normal approximation of the sampling distribution for the summaries. As such, interpreting the impact of model misspecification requires us to consider the loss of information that results from replacing the data by the summaries, as well as the use of an approximation for the likelihood of the summaries. To cultivate intuition regarding the impact of these approximations when the model is misspecified, we explore these approximations when the mean and variance of the summaries are known. The same general conclusions follow in the case where the synthetic likelihood is estimated, but treating that case does not seem to add further insight.

Let $f_{•}^{(n)}$ denote the density function for the summary statistics $S_{n} (y)$ under $P_{•}^{(n)}$. The Kullback-Leibler divergence between the synthetic likelihood, $N \{\cdot; b (θ), Σ_{n} (θ)\}$, and the density function of the summaries, $f_{•}^{(n)}$, is $\begin{aligned} \mathrm{kl} [f_{•}^{(n)} (\cdot) \,\|\, N \{\cdot; b (θ), Σ_{n} (θ)\}] &= \int \log \left[ \frac{f_{•}^{(n)} (s)}{N \{s; b (θ), Σ_{n} (θ)\}} \right] f_{•}^{(n)} (s) \, d s \\ &= \frac{1}{2} \log | Σ_{n} (θ) | + \frac{1}{2} \int \{s - b (θ)\}^{⊤} Σ_{n} (θ)^{- 1} \{s - b (θ)\} f_{•}^{(n)} (s) \, d s + C, \end{aligned}$ where $C$ is a constant that does not depend on $θ$. For $b_{•} = \int s f_{•}^{(n)} (s) \, d s$, and $V = \int (s - b_{•}) {(s - b_{•})}^{⊤} f_{•}^{(n)} (s) \, d s$, using properties of quadratic forms, $\begin{aligned} \mathrm{kl} [f_{•}^{(n)} (\cdot) \,\|\, N \{\cdot; b (θ), Σ_{n} (θ)\}] &= \frac{1}{2} \log | Σ_{n} (θ) | + \frac{1}{2} \operatorname{tr} \{Σ_{n} (θ)^{- 1} V\} \\ &\quad + \frac{1}{2} \{b (θ) - b_{•}\}^{⊤} Σ_{n} (θ)^{- 1} \{b (θ) - b_{•}\} + C. \end{aligned}$
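
The "properties of quadratic forms" step can be spelled out as follows; this is a standard argument, included here only as a reading aid. Write $s - b (θ) = (s - b_{•}) + \{b_{•} - b (θ)\}$ and expand the quadratic form. The cross term vanishes because $\int (s - b_{•}) f_{•}^{(n)} (s) \, d s = 0$, and the identity $x^{⊤} A x = \operatorname{tr} (A x x^{⊤})$ applied to the first term gives $\begin{aligned} \int \{s - b (θ)\}^{⊤} Σ_{n} (θ)^{- 1} \{s - b (θ)\} f_{•}^{(n)} (s) \, d s &= \operatorname{tr} \left\{ Σ_{n} (θ)^{- 1} \int (s - b_{•}) {(s - b_{•})}^{⊤} f_{•}^{(n)} (s) \, d s \right\} + \{b_{•} - b (θ)\}^{⊤} Σ_{n} (θ)^{- 1} \{b_{•} - b (θ)\} \\ &= \operatorname{tr} \{Σ_{n} (θ)^{- 1} V\} + \{b (θ) - b_{•}\}^{⊤} Σ_{n} (θ)^{- 1} \{b (θ) - b_{•}\}. \end{aligned}$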

Thus, outside of cases where $f_{•}^{(n)}$ is a Gaussian density, the synthetic likelihood is always misspecified. In addition, the asymptotic behavior of the Kullback-Leibler divergence is governed by the term $\{b (θ) - b_{•}\}^{⊤} Σ_{n} (θ)^{- 1} \{b (θ) - b_{•}\}.$ Since the summaries are generally an average, $Σ_{n} (θ)$ is generally of order $n^{- 1}$, and when $Σ_{n} (θ)$ is positive-definite it will be the case that $c_{1} \| \sqrt{n} \{b (θ) - b_{•}\} \|^{2} \leq \| Σ_{n} (θ)^{- 1 / 2} \{b (θ) - b_{•}\} \|^{2} \leq c_{2} \| \sqrt{n} \{b (θ) - b_{•}\} \|^{2},$

for some $0 < c_{1} \leq c_{2} < \infty$. Therefore, if there exists no $θ \in Θ$ such that $b (θ) = b_{•}$ as $n \to \infty$, then $\inf_{θ \in Θ} \mathrm{kl} [f_{•}^{(n)} (\cdot) \,\|\, N \{\cdot; b (θ), Σ_{n} (θ)\}] \to \infty.$

The above shows that the meaningful concept of model misspecification in synthetic likelihood is that there does not exist any $θ \in Θ$ such that $b (θ) = b_{•}$. This condition is called model incompatibility by Marin et al. (2014), and features in the literature on model misspecification in approximate Bayesian computation (Frazier et al., 2020). We then say that the model is misspecified, or incompatible, if $\lim_{n \to \infty} \inf_{θ \in Θ} \{b (θ) - b_{•}\}^{⊤} \{n Σ_{n} (θ)\}^{- 1} \{b (θ) - b_{•}\} > 0.$ (4)

Throughout the remainder, model misspecification is interpreted in terms of equation (4).

2.2 Consequences of model misspecification

We briefly demonstrate the consequences of model misspecification in BSL by returning to the simple running example (see Section 1 for details). Depending on the level of model misspecification, the posterior can display Gaussian-like posterior concentration, bi-modality, and/or concentration onto the boundary of the parameter space.

Example: Moving Average model

The researcher believes $y_{1}, \dots, y_{n}$ is generated according to an MA(1) model, see equation (1), and our prior beliefs are uniform over $[- 1, 1]$. The summary statistics are $γ_{j} (y_{1 : n}) = \frac{1}{n} \sum_{t = 1 + j}^{n} y_{t} y_{t - j}$, for $j \in \{0, 1\}$, and $S_{n} (y) = {(γ_{0} (y_{1 : n}), γ_{1} (y_{1 : n}))}^{⊤}$. In this example, the mean and variance of the summaries can be calculated exactly, with these quantities then used to construct the exact Bayesian synthetic likelihood posterior. The mean of the summaries is $b (θ) = E \{S_{n} (z_{1 : n}) | θ\} = {(1 + θ^{2}, θ)}^{⊤}.$

The variance of the summaries also has a closed form, and can be derived using the results of De Gooijer (1981) on the variance and covariance of sample autocorrelations in autoregressive integrated moving average models. Partitioning $Σ_{n} (θ)$ as $Σ_{n} (θ) = \begin{pmatrix} Σ_{11, n} (θ) & Σ_{12, n} (θ) \\ Σ_{12, n} (θ) & Σ_{22, n} (θ) \end{pmatrix},$ the leading terms in the components of $Σ_{n} (θ)$ are as follows:³ $\begin{aligned} Σ_{11, n} (θ) &= (2 / n^{4}) [n^{3} {(1 + θ^{2})}^{2} + 2 n^{2} (n - 1) θ^{2}] + O (n^{- 2}), \\ Σ_{22, n} (θ) &= (1 / n^{2}) [(n - 1) \{{(1 + θ^{2})}^{2} + θ^{2}\} + 2 (n - 2) θ^{2}] + O (n^{- 2}), \\ Σ_{12, n} (θ) &= (2 / n^{4}) [n^{2} (n - 1) \{2 (1 + θ^{2}) θ\}] + O (n^{- 2}). \end{aligned}$

From this representation, it is clear that each term has a dominant $O (n^{- 1})$ term, and that $n Σ_{n} (θ)$ is positive-definite for all $θ \in [- 1, 1]$ .

Recall that the actual data generating process (DGP) for $y_{1 : n}$ evolves according to the stochastic volatility (SV) model in equation (2). Under this DGP the summaries $S_{n} (y)$ converge in probability towards $b_{•} := {(b_{•, 0}, b_{•, 1})}^{⊤} = {\left( \exp \left( \frac{ω}{1 - ρ} + \frac{1}{2} \frac{σ_{v}^{2}}{1 - ρ^{2}} \right), \; 0 \right)}^{⊤}.$
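
The form of this limit follows from standard moment calculations, sketched here under the assumptions that $h_{t}$ is stationary and that the sequences $u_{t}$ and $v_{t}$ are mutually independent. In that case $h_{t}$ is Gaussian with mean $ω / (1 - ρ)$ and variance $σ_{v}^{2} / (1 - ρ^{2})$, and the lognormal moment formula gives $E (y_{t}^{2}) = E \{\exp (h_{t})\} E (u_{t}^{2}) = \exp \left\{ \frac{ω}{1 - ρ} + \frac{1}{2} \frac{σ_{v}^{2}}{1 - ρ^{2}} \right\},$ while, since $u_{t}$ has mean zero and is independent of $(h_{t}, h_{t - 1}, u_{t - 1})$, $E (y_{t} y_{t - 1}) = E \{\exp (h_{t} / 2 + h_{t - 1} / 2) u_{t - 1}\} E (u_{t}) = 0.$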

Therefore, if for given values of $ω, σ_{v}$ and $ρ$ there does not exist a value of $θ$ such that $\exp \{ω / (1 - ρ) + \frac{1}{2} σ_{v}^{2} / (1 - ρ^{2})\} = 1 + θ^{2},$ we cannot match the first summary, and the assumed model is misspecified. When $0 < b_{•, 0} \leq 1$, the unique minimum of $\| b (θ) - b_{•} \|$ is achieved at $θ = 0$, and it is this value onto which we would hope the BSL posterior would concentrate asymptotically. However, as we have already seen from Figure 1, for certain values of $ω, σ_{v}, ρ$, and depending on the sample size, the posterior can be bi-modal.
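
As a quick numerical illustration of this condition, the sketch below evaluates $b_{•, 0}$ at the parameter values used in Section 1 and checks whether the first summary can be matched by some $θ \in [- 1, 1]$. The function names are hypothetical, and the check covers only the first summary, as in the text.

```python
import math


def limit_first_summary(omega, rho, sigma_v):
    """Limit b_{*,0} of the first summary under the stochastic volatility DGP."""
    return math.exp(omega / (1.0 - rho) + 0.5 * sigma_v ** 2 / (1.0 - rho ** 2))


def first_summary_matchable(b0):
    """True if 1 + theta^2 = b0 has a solution with theta in [-1, 1]."""
    return 1.0 <= b0 <= 2.0


b0 = limit_first_summary(omega=-0.10, rho=0.90, sigma_v=0.40)
print(b0, first_summary_matchable(b0))
```

For these values $b_{•, 0}$ is roughly 0.56, which lies below the smallest attainable value $1 + θ^{2} = 1$, so the first summary cannot be matched.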

To help explain this phenomenon, we now analyze the posterior across various levels of model misspecification by fixing the value of the observed summaries $S_{n} (y)$ at its limit $b_{•} = {(b_{•, 0}, b_{•, 1})}^{⊤}$, and by changing the value of $b_{•, 0}$. To this end, we plot the posteriors for three values of n = 100, 500, 1000, and across six different values of the first summary statistic $b_{•, 0} \in \{0.01, 0.10, 0.25, 0.50, 0.75, 0.99\}$. These values represent a situation of significant misspecification, at $b_{•, 0} = 0.01$, tending towards no misspecification, $b_{•, 0} = 0.99$. We plot the resulting posteriors graphically in Figure 2. The results demonstrate that the behavior of the posterior varies markedly as the level of model misspecification changes.
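
The computation just described can be sketched on a grid as follows. This is an illustration under stated assumptions, not the code behind Figure 2: it uses only the leading terms of $Σ_{n} (θ)$ given above (dropping the $O (n^{- 2})$ remainders), the uniform prior on $[- 1, 1]$, a trapezoid-rule normalization, and hypothetical function names.

```python
import math


def sigma_n(theta, n):
    """Leading terms of the summaries' covariance, returned as (S11, S22, S12)."""
    a = 1.0 + theta ** 2
    s11 = (2.0 / n ** 4) * (n ** 3 * a ** 2 + 2.0 * n ** 2 * (n - 1) * theta ** 2)
    s22 = (1.0 / n ** 2) * ((n - 1) * (a ** 2 + theta ** 2) + 2.0 * (n - 2) * theta ** 2)
    s12 = (2.0 / n ** 4) * (n ** 2 * (n - 1) * 2.0 * a * theta)
    return s11, s22, s12


def log_sl(s, theta, n):
    """Exact log synthetic likelihood log N{s; b(theta), Sigma_n(theta)}."""
    s11, s22, s12 = sigma_n(theta, n)
    det = s11 * s22 - s12 ** 2
    d0 = s[0] - (1.0 + theta ** 2)  # first component of s - b(theta)
    d1 = s[1] - theta               # second component of s - b(theta)
    quad = (s22 * d0 ** 2 - 2.0 * s12 * d0 * d1 + s11 * d1 ** 2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def exact_posterior(b0, n, grid_size=2001):
    """Posterior density on a grid over [-1, 1], with S_n fixed at (b0, 0)."""
    grid = [-1.0 + 2.0 * i / (grid_size - 1) for i in range(grid_size)]
    logp = [log_sl((b0, 0.0), th, n) for th in grid]  # uniform prior is constant
    top = max(logp)
    dens = [math.exp(v - top) for v in logp]
    step = grid[1] - grid[0]
    area = step * (sum(dens) - 0.5 * (dens[0] + dens[-1]))  # trapezoid rule
    return grid, [d / area for d in dens]


for b0 in (0.10, 0.99):
    grid, dens = exact_posterior(b0, n=1000)
    print(b0, grid[dens.index(max(dens))])  # location of the highest grid point
```

Because the posterior is symmetric in $θ$ when the second summary is fixed at zero, a reported maximiser away from the origin corresponds to a pair of modes at $\pm θ$.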

At small sample sizes (n = 100) the posteriors are (nearly all) uni-modal with a mode whose location depends on the level of misspecification. However, for larger levels of misspecification, as the sample size increases the posterior concentrates mass on two distinct modes, with the heights of the two modes varying with the level of misspecification.

The results in Figure 2 are surprising, and show that depending on the level of model misspecification, the posterior can display concentration onto the boundary of the parameter space ($b_{•, 0} = 0.01$); bi-modality with concentration occurring on the interior of the parameter space ($b_{•, 0} \in \{0.10, 0.25\}$); a region of “flatness” ($b_{•, 0} = 0.50$); and approximate Gaussianity ($b_{•, 0} \in \{0.75, 0.99\}$). We speculate that the region of posterior flatness may be the result of two modes on either side of θ = 0, so that in neighbourhoods around θ = 0 the posterior appears flat.⁴ From a practical standpoint, however, whether the posterior is genuinely flat near θ = 0, or has two close modes on either side of θ = 0, is largely irrelevant, as it would require a very large sample size to reliably distinguish between the two cases. In Supplementary Appendix E.3 we expand on the mechanisms causing this posterior behavior, and give additional discussion on the cause of the posterior “flatness” observed in Figure 2.

At larger levels of model misspecification, the values onto which the exact posterior in (3) is concentrating are not at all related to the values of $θ$ under which $\| b (θ) - b_{•} \|$ is small. In comparison, if one were to apply ABC based on $\| \cdot \|$ in the same example, the resulting posterior would be uni-modal and have the majority of its mass near the origin (θ = 0). This is due to the fact that when $0 < b_{•, 0} \leq 1$, the distance $\| b (θ) - b_{•} \|$ is uniquely minimized at θ = 0; hence, following Theorem 1 in Frazier et al. (2020), the ABC posterior would concentrate mass onto θ = 0.

The behavior observed in the left-most panel of Figure 2 is similar to, but distinct from, the behavior documented by Frazier and Drovandi (2021) in the MA(1) model. In that work, for a sample size of n = 100 the authors empirically showed that an importance sampling estimate of the posterior $\hat{π} (θ | S_{n})$ placed nearly all of its mass near the boundaries of the parameter space, i.e., $θ = \pm 1$. In contrast to Frazier and Drovandi (2021), all of the above results pertain to the exact posterior $π (θ | S_{n})$ that results from using the exact mean and variance of the summaries. As such, Figure 2 demonstrates the root cause of the irregular posterior behavior: the behavior is not due to small sample sizes, or Monte Carlo approximations, but is caused by the asymptotic (non-standard) behavior of the exact synthetic likelihood $N \{S_{n}; b (θ), Σ_{n} (θ)\}$. Furthermore, we remark that the behavior observed in this current example is vastly more diverse than the behavior observed by Frazier and Drovandi (2021). It is the diversity of this behavior which we theoretically investigate in the following section.

3 Asymptotic behavior of BSL

We now characterize the behavior of $\hat{π} (θ | S_{n})$ when the assumed model is misspecified. The following notations are used to make the results easier to state and follow. For $x \in ℝ$, $| x |$ denotes the absolute value of $x$, and for $x \in ℝ^{p}$, $\| x \|$ denotes the Euclidean norm of $x$. For $A$ denoting a square matrix, we abuse notation and let $| A |$ denote the determinant of $A$ and $\| A \|$ any convenient matrix norm. The terms $λ_{\max} (A)$ and $λ_{\min} (A)$ denote the maximal and minimal eigenvalues of $A$. Throughout, $C$ denotes a generic positive constant that can change with each usage. For real-valued sequences $\{a_{n}\}_{n \geq 1}$ and $\{d_{n}\}_{n \geq 1}$: $a_{n} ≲ d_{n}$ implies $a_{n} \leq C d_{n}$ for some finite $C > 0$ and all $n$ large; $a_{n} ≍ d_{n}$ implies $a_{n} ≲ d_{n}$ and $d_{n} ≲ a_{n}$. For $x_{n}$ a random variable, $x_{n} = o_{p} (a_{n})$ and $x_{n} = O_{p} (a_{n})$ have their usual definitions. Likewise, $\operatorname{plim}_{n \to \infty} x_{n}$ denotes the probability limit of $x_{n}$. All limits are taken as $n \to \infty$ so that when no confusion will result, we use $\lim$ and $\operatorname{plim}$ to denote $\lim_{n \to \infty}$ and $\operatorname{plim}_{n \to \infty}$, respectively. The notation $\Rightarrow$ denotes weak convergence under $P_{•}^{(n)}$. Proofs of all stated results are given in the Supplementary Appendix.

3.1 Asymptotic behavior: multiple modes

Let $g_{n} (\cdot | θ) := N \{\cdot; b (θ), Σ_{n} (θ)\}$ denote the synthetic likelihood with known mean and variance, and define its score and limit counterpart as $M_{n} (θ) = n^{- 1} \partial \log g_{n} (S_{n} | θ) / \partial θ, \quad M (θ) = \operatorname{plim} M_{n} (θ).$

Likewise, define the Hessian of $n^{- 1} \log g_{n} (\cdot | θ)$ and its limit counterpart as $H_{n} (θ) = n^{- 1} \partial^{2} \log g_{n} (S_{n} | θ) / \partial θ \partial θ^{⊤}, \quad H (θ) = \operatorname{plim} H_{n} (θ).$

The posterior multi-modality in the running example can be traced back to the existence of multiple roots to the score equation $0 = M_{n} (θ)$, e.g., $θ_{1, n}$ and $θ_{2, n}$. In such cases, if $- H_{n} (θ_{j, n})$ is positive-definite, for j = 1, 2, then the posterior will exhibit multiple modes (around $θ_{1, n}$ and $θ_{2, n}$). Let $Int (Θ)$ denote the interior of $Θ$, and let $Θ_{⋆} := \{θ \in Int (Θ) : 0 = M (θ)\}$ denote the collection of asymptotic roots.
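
For the running example, roots of the score equation with negative curvature can be located numerically. The sketch below is one possible way to do so, under several assumptions of its own: it evaluates $n^{- 1} \log g_{n} (S_{n} | θ)$ with $S_{n}$ fixed at $(b_{•, 0}, 0)$ and the leading terms of $Σ_{n} (θ)$ from Section 2.2, approximates $M_{n}$ and $H_{n}$ by finite differences rather than analytic derivatives, and finds roots by a sign scan followed by bisection. All function names and step sizes are hypothetical.

```python
import math


def log_g_over_n(theta, s, n):
    """n^{-1} log N{s; b(theta), Sigma_n(theta)} for the MA(1) running example."""
    a = 1.0 + theta ** 2
    s11 = (2.0 / n ** 4) * (n ** 3 * a ** 2 + 2.0 * n ** 2 * (n - 1) * theta ** 2)
    s22 = (1.0 / n ** 2) * ((n - 1) * (a ** 2 + theta ** 2) + 2.0 * (n - 2) * theta ** 2)
    s12 = (2.0 / n ** 4) * (n ** 2 * (n - 1) * 2.0 * a * theta)
    det = s11 * s22 - s12 ** 2
    d0, d1 = s[0] - a, s[1] - theta
    quad = (s22 * d0 ** 2 - 2.0 * s12 * d0 * d1 + s11 * d1 ** 2) / det
    return (-math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad) / n


def score(theta, s, n, h=1e-5):
    """Central-difference approximation to M_n(theta)."""
    return (log_g_over_n(theta + h, s, n) - log_g_over_n(theta - h, s, n)) / (2.0 * h)


def hessian(theta, s, n, h=1e-4):
    """Second-difference approximation to H_n(theta) (scalar theta)."""
    f = log_g_over_n
    return (f(theta + h, s, n) - 2.0 * f(theta, s, n) + f(theta - h, s, n)) / h ** 2


def interior_modes(s, n, points=200):
    """Score roots with negative curvature, via a sign scan plus bisection."""
    grid = [-0.99 + 1.98 * i / (points - 1) for i in range(points)]
    modes = []
    for lo, hi in zip(grid, grid[1:]):
        if score(lo, s, n) > 0.0 >= score(hi, s, n):  # score crosses from + to -
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                if score(mid, s, n) > 0.0:
                    lo = mid
                else:
                    hi = mid
            root = 0.5 * (lo + hi)
            if hessian(root, s, n) < 0.0:
                modes.append(root)
    return modes


print(interior_modes((0.10, 0.0), n=1000))
print(interior_modes((0.99, 0.0), n=1000))
```

A scan of this kind only detects roots that the grid resolves, so it is a diagnostic for simple one-dimensional cases rather than a general root-finding procedure.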

We maintain the following regularity conditions.

Assumption 1. (i) $Θ \subset ℝ^{d_{θ}}$ is compact; (ii) The map $θ \mapsto \log g_{n} (\cdot | θ)$ is twice continuously differentiable on $Int (Θ)$ .

Assumption 2. There exist $b_{•} \in ℝ^{d}, d \geq d_{θ}$ , and a positive-definite matrix V such that $\sqrt{n} (S_{n} - b_{•}) \Rightarrow N (0, V)$ .

Assumption 3. The set $Θ_{⋆}$ is non-empty and finite. For some $δ > 0$, at least one $θ \in Θ_{⋆}$ satisfies $0 < δ \leq λ_{\min} \{- H (θ)\} \leq 1 / δ$.

Remark 1. The compactness in Assumption 1(i) and the smoothness condition in Assumption 1(ii) ensure the existence of a solution to $0 = M_{n} (θ)$, and are standard regularity conditions employed in the analysis of frequentist point estimators (see, e.g., Jennrich, 1969 for a classical reference, and Chapter 5 of Van der Vaart, 2000 for a textbook treatment). We note, however, that both conditions can be relaxed by requiring stronger conditions on $M_{n} (θ)$ and $H_{n} (θ)$. Assumption 2 is standard in the literature on approximate Bayesian methods, and requires that the observed summary statistics are asymptotically Gaussian with a well-behaved covariance matrix. Assumption 3 restricts $\log g_{n} (\cdot | θ)$ to have a finite collection of local maxima, all of which lie in $Int (Θ)$.⁵ Importantly, as illustrated in the simple running example, there is no reason to suspect that values in $Θ_{⋆}$ deliver small values of $\| b (θ) - b_{•} \|$. Recall that $b (θ)$ denotes the (asymptotic) mean of the simulated summaries under $P_{θ}^{(n)}$, while $b_{•}$ denotes the (asymptotic) mean of the observed summaries under $P_{•}^{(n)}$.

Lemma 1. If Assumptions 1-3 are satisfied, then there exists $θ_{n}$ in $Θ$ that solves $0 = M_{n} (θ)$, and $\| θ_{n} - θ_{⋆} \| = o_{p} (1)$ for some $θ_{⋆} \in Θ_{⋆}$.

Suppose that $M (θ)$ has zeros $θ_{1, ⋆}$ and $θ_{2, ⋆}$, with $- H (θ_{1, ⋆})$ and $- H (θ_{2, ⋆})$ both positive-definite. Since both values satisfy the sufficient conditions in Lemma 1, it follows that $θ_{1, n} = θ_{1, ⋆} + o_{p} (1)$ and $θ_{2, n} = θ_{2, ⋆} + o_{p} (1)$. Consequently, we should expect that the posterior assigns non-vanishing probability mass to both points. Therefore, if we wish to analyze $\hat{π} (θ | S_{n})$ in misspecified models, we must restrict our attention to regions around roots of the synthetic likelihood score equation.

Assumption 4. For any $θ_{⋆} \in Θ_{⋆}$, and some $δ > 0$, for all $\| θ - θ_{⋆} \| \leq δ$, there exists a $K > 0$ such that $\| \partial^{2} \log g_{n} (S_{n} | θ) / \partial θ_{j} \partial θ_{i} - \partial^{2} \log g_{n} (S_{n} | θ_{⋆}) / \partial θ_{j} \partial θ_{i} \| \leq K \| θ - θ_{⋆} \|$ for $i, j \in \{1, \dots, d_{θ}\}$.

Remark 2. The smoothness condition in Assumption 1 is a commonly used sufficient condition to obtain second-order approximations of frequentist criterion functions (see, e.g., Chapter 5 in Van der Vaart, 2000), and ensures that $\log g_{n} (S_{n} | θ)$ admits a valid expansion around each $θ_{⋆} \in Θ_{⋆}$. Assumption 4 gives sufficient regularity to ensure that the remainder term in such an expansion can be appropriately controlled. In this way, Assumptions 1 and 4 resemble commonly employed assumptions in frequentist point-estimation theory, and will be satisfied so long as the mappings $θ \mapsto b (θ)$ and $θ \mapsto Σ_{n} (θ)$ are smooth enough.⁶ However, the possible multi-modality of the limit synthetic likelihood (Assumption 3) requires that these assumptions hold at each $θ_{⋆} \in Θ_{⋆}$. The smoothness conditions in Assumptions 1 and 4 are stronger than those employed by Frazier et al. (2022) to analyze the posterior in correctly specified models. In Supplementary Appendix D.1, we show that the smoothness assumptions in Frazier et al. (2022) are insufficient to deduce the behavior of the posterior in misspecified models, and we discuss why additional smoothness conditions are necessary in misspecified models.

Assumption 5. Let $A_{n} (θ)$ denote either ${\hat{Σ}}_{n} (θ)$ or $Σ_{n} (θ)$ . For some $δ > 0$ , any $θ_{⋆} \in Θ_{⋆}$ , and all $| | θ - θ_{⋆} | | \leq δ$ , the sequence of matrices $A_{n} (θ)$ satisfies: (i) for all n large enough, there exist positive constants $c_{1} \leq c_{2}$ , such that $0 < c_{1} \leq | n A_{n} (θ) | \leq c_{2} < \infty$ ; (ii) there exists a matrix function $Σ (θ)$ that is continuous over Θ, is positive-definite for all $| | θ - θ_{⋆} | | \leq δ$ , and any $θ_{⋆} \in Θ_{⋆}$ , and is such that $\sup_{θ \in Θ} ‖ n A_{n} (θ) - Σ (θ) ‖ = o_{p} (1)$ .

Assumption 6. For any $δ > 0$ , and any $θ_{⋆} \in Θ_{⋆}$ , $π (θ_{⋆}) > 0$ , and $π (\cdot)$ is continuous on $\{θ \in Θ : | | θ - θ_{⋆} | | \leq δ\}$ .

Remark 3. Assumption 5 places sufficient regularity on the covariance matrix used in BSL to ensure the posterior asymptotically concentrates on $Θ_{⋆}$ , and is nearly identical to the regularity conditions for $Σ_{n} (θ)$ used by Frazier et al. (2022) in correctly specified models (see their Assumption 3). For a detailed discussion on Assumption 5, we refer the interested reader to Frazier et al. (2022). Assumption 6 is a standard regularity condition in the theoretical analysis of Bayesian posterior distributions for Euclidean valued parameters.

Assumption 7. For all $θ \in Θ, E {| | S_{n} (z) | |^{4} | θ} < \infty .$

Remark 4. Assumption 7 requires that the simulated summary statistics admit at least a finite fourth moment, which is required to ensure that the posterior $\hat{π} (θ | S_{n})$ exists, and concentrates toward the exact posterior $π (θ | S_{n})$ as $m \to + \infty$ . This condition is substantially weaker than the corresponding condition used in Frazier et al. (2022), which required that the summary statistics have a sub-Gaussian tail. Since the distribution of $S_{n} (z)$ is intractable, analytically verifying Assumption 7 is likely infeasible in complex models. However, it is relatively straightforward to verify Assumptions 1, 4, 5 and 7 using simulation from the assumed model. We refer the interested reader to Supplementary Appendix E.2 for additional discussion of this point, and for verification of Assumptions 1-7 in the running example.

To state our first result, let $Δ = {- H (θ_{⋆})}^{- 1}$ , let $t = \sqrt{n} (θ - θ_{n})$ be a local parameter, and $\hat{π} (t | S_{n}) = \hat{π} (θ_{n} + t / \sqrt{n} | S_{n}) / {\sqrt{n}}^{d_{θ}}$ the posterior for t.

Theorem 1 (Asymptotic shape). If Assumptions 1-7 are satisfied, then for each $θ_{⋆} \in Θ_{⋆}$ the following is satisfied: for any finite $γ > 0$ , and $m \to \infty$ as $n \to \infty$ , $| \int_{| | t | | \leq γ} \hat{π} (t | S_{n}) d t - \int_{| | t | | \leq γ} π_{⋆} (t) d t | = O_{p} (1 / m),$ for some density function $π_{⋆} (t) \propto N {t; 0, Δ}$ .

Consider again that $Θ_{⋆}$ contains only $θ_{1, ⋆}$ and $θ_{2, ⋆}$ ; Theorem 1 then demonstrates that in a shrinking neighborhood of $θ_{1, ⋆}$ (respectively, $θ_{2, ⋆}$ ) the posterior $\hat{π} (t | S_{n})$ is proportional to $N {t; 0, Δ_{1}}$ (respectively, $N {t; 0, Δ_{2}}$ ), where $Δ_{j} = {- H (θ_{j, ⋆})}^{- 1}$ . Thus, $\hat{π} (t | S_{n})$ is not asymptotically Gaussian, but is mixed Gaussian with modes near $θ_{1, ⋆}$ and $θ_{2, ⋆}$ .

Remark 5. While multi-modality of the posterior $\hat{π} (θ | S_{n})$ can result from model misspecification, nothing confines Theorem 1 exclusively to misspecified models. Consider that there are two values of θ, say $θ_{1, ⋆}$ and $θ_{2, ⋆}$ , such that $b (θ_{j, ⋆}) = b_{•}$ . In this case, the model is “correctly specified”, and both $θ_{1, ⋆}$ and $θ_{2, ⋆}$ solve $0 = M (θ)$ . Theorem 1 then implies that the posterior will concentrate around $θ_{1, ⋆}$ and $θ_{2, ⋆}$ . Therefore, if $θ \mapsto b (θ)$ does not identify a unique value of θ such that $b (θ) = b_{•}$ , then the result of Theorem 1 applies.⁷ An example demonstrating the conclusions of Theorem 1 in correctly specified models is given in Supplementary Appendix E.5.

Remark 6. One interpretation of the posterior $\hat{π} (θ | S_{n})$ is that we are conducting generalized Bayesian inference using a scoring rule that assesses “goodness of fit” relative to the first two moments of the distribution for $S_{n} (y)$ . Following Gneiting and Raftery (2007), a score which only depends on first and second moments “is strictly proper relative to any convex class of probability measures characterized by the first two moments.” However, even if the true distribution of $S_{n} (y)$ is asymptotically Gaussian with mean $b_{•}$ and variance V, there may be no value in Θ such that $b (θ) = b_{•}$ and $Σ (θ) = V$ ; since $b (θ)$ does not span $ℝ^{d}$ , nor does $Σ (θ)$ span the space of possible variance matrices. Therefore, there is no guarantee in general that the posterior will concentrate onto a unique element even though our inferences are based on a strictly proper scoring rule.

Remark 7. The results and proof strategies used in Frazier et al. (2022) to deduce the asymptotic behavior of the posterior in correctly specified models are entirely dependent on the existence of a unique $θ \in Θ$ such that $b (θ) = b_{•}$ . In particular, the proof strategy used by Frazier et al. (2022) hinges on the validity of a particular global approximation to $\log g_{n} (S_{n} | θ)$ which explicitly requires that $b (θ) = b_{•}$ for some $θ \in Θ$ . As such, when the model is misspecified, the results and arguments used in Frazier et al. (2022) are invalid, and we require novel arguments to derive the behavior of $\hat{π} (θ | S_{n})$ . For additional discussion of this point, we refer the interested reader to Supplementary Appendix D.1.

3.2 Asymptotic behavior: single mode

Even if the model is misspecified, if there exists a unique solution to $0 = M (θ)$ , then $\hat{π} (θ | S_{n})$ will be asymptotically Gaussian. To deduce such a result, we must restrict the set $Θ_{⋆}$ in Assumption 3 to be a singleton.

Assumption 8. $Θ_{⋆} = \{θ_{⋆}\}$ , and for some $δ > 0$ , $0 < δ \leq λ_{\min} \{- H (θ_{⋆})\} \leq 1 / δ$ .

Define the set $T_{n} = \{t : t = \sqrt{n} (θ - θ_{n}), θ \in Θ\}$ .

Theorem 2 (Asymptotic Shape). If Assumptions 1, 2, and Assumptions 4-8 are satisfied, then, for $m \to \infty$ as $n \to \infty$ , $\int_{T_{n}} | | t | | | \hat{π} (t | S_{n}) - N {t; 0, Δ} | d t = O_{p} (1 / m) .$

Even when $Θ_{⋆}$ is a singleton, the model is still misspecified; there does not exist a $θ \in Θ$ such that $b (θ) = b_{•}$ . Hence, as discussed in Remark 7, the results and arguments presented in Frazier et al. (2022) cannot be used to establish the behavior of $\hat{π} (θ | S_{n})$ in this case; please see Supplementary Appendix D.1 for further details.

When $Θ_{⋆}$ is a singleton, if the score equations $M_{n} (θ_{⋆})$ satisfy a central limit theorem, then the BSL posterior mean will be asymptotically Gaussian.

Assumption 9. For some matrix $W_{⋆}, \sqrt{n} M_{n} (θ_{⋆}) \Rightarrow N (0, W_{⋆})$ .

Corollary 1. For ${\bar{θ}}_{n} = \int_{Θ} θ \hat{π} (θ | S_{n}) d θ$ , if the conditions in Theorem 2 are satisfied and $\sqrt{n} / m = o (1)$ , then $\sqrt{n} ({\bar{θ}}_{n} - θ_{⋆}) \Rightarrow N (0, Δ W_{⋆} Δ^{⊤})$ .

Theorem 2 demonstrates that if the model is misspecified, but the synthetic likelihood has a single mode asymptotically, then the BSL posterior resembles a shrinking Gaussian density in large samples. The posterior shape in Theorem 2 is in stark contrast to that exhibited by the approximate Bayesian computation posterior under model misspecification. Let $θ_{•}$ denote the minimizer of $| | b (θ) - b_{•} | |$ and let $ϵ_{•} = | | b (θ_{•}) - b_{•} | |$ . If the tolerance $ϵ_{n}$ satisfies $\sqrt{n} (ϵ_{n} - ϵ_{•}) \to 0$ from above, then Theorem 2 in Frazier et al. (2020) demonstrates that asymptotically the approximate Bayesian computation posterior is proportional to the following density: $N {{(‖ A {b (θ_{•}) - b_{•}} ‖ ϵ_{•})}^{- 1} [- {A \sqrt{n} (S_{n} - b_{•})}^{⊤} [A {b (θ_{•}) - b_{•}}] ϵ_{•} - x^{⊤} L (θ_{•}) x / 4]; 0, I_{d_{θ}}},$ for $x = n^{1 / 4} (θ - θ_{•}), A = Σ {(θ_{•})}^{- 1 / 2}$ , and $L (θ_{•}) = (\partial^{2} / \partial θ \partial θ^{⊤}) | | b (θ) - b_{•} | | |_{θ = θ_{•}}$ .

It is clear that the ABC and BSL posteriors produce significantly different inferences in misspecified models. Even in the case where the BSL posterior concentrates onto a single mode, the two posteriors will generally concentrate onto different points in Θ. This is an interesting contrast to the case of correctly specified models, where the two posteriors have the same shape asymptotically (see Li and Fearnhead, 2018, Frazier et al., 2018 and Frazier et al., 2022 for details).

Furthermore, the behavior of the approximate Bayesian computation posterior mean in misspecified models is currently unknown. However, the form of the above limiting posterior suggests that its behavior may be non-standard. If this is indeed the case, it would again be a contrast to the case of Bayesian synthetic likelihood, which, from Corollary 1, has a posterior mean that exhibits standard asymptotic behavior when the synthetic likelihood has a single mode asymptotically.

Remark 8. Theorem 2 implies that the width of posterior credible sets is determined by Δ. However, Corollary 1 implies that the asymptotic variance of the BSL posterior mean is $Δ W_{⋆} Δ^{⊤}$ . If $b (θ_{⋆}) \neq b_{•}$ and $Δ \neq W_{⋆}$ , we can immediately conclude that $\int_{T_{n}} t t^{⊤} \hat{π} (t | S_{n}) d t = Δ + o_{p} (1) \neq Var {\sqrt{n} ({\bar{θ}}_{n} - θ_{⋆})} = Δ W_{⋆} Δ^{⊤} .$

Consequently, the BSL posterior does not deliver asymptotically valid uncertainty quantification for $θ_{⋆}$ in misspecified models. Moreover, in Supplementary Appendix C we show that Δ directly depends on the level of model misspecification, via ${b (θ_{⋆}) - b_{•}}$ , and in general satisfies $Δ \neq W_{⋆}$ when $b (θ_{⋆}) \neq b_{•}$ .

Remark 9. The asymptotic covariance matrix of the BSL posterior mean, $Δ W_{⋆} Δ^{⊤}$ , has the same structure as the sandwich covariance matrix for the exact Bayesian posterior mean: the “bread” of the covariance is the inverse Hessian of the synthetic likelihood, Δ, and the “meat” is the variance of the score equations, denoted by $W_{⋆}$ . Since the synthetic likelihood $g_{n} (S_{n} | θ)$ depends on the θ-dependent mean $b (θ)$ and inverse covariance $Σ_{n}^{- 1} (θ)$ , the inverse Hessian in the BSL case, Δ, depends on both the first and second derivatives of $b (θ)$ and $Σ_{n}^{- 1} (θ)$ , and so this covariance matrix has a complicated form. We refer the interested reader to Supplementary Appendix C for further discussion.

4 Robustifying the synthetic likelihood

The asymptotic mixed Gaussianity that can result from applying Bayesian synthetic likelihood methods in misspecified models can produce inaccurate inferences, and irrelevant conclusions. Hence, there is a strong sense in which we should attempt to guard against this behavior when applying these methods. In this section, we compare different approaches for ameliorating the performance of Bayesian synthetic likelihood in misspecified models.

4.1 Tempering the synthetic likelihood

To obtain robustness in misspecified models, several authors, including Bhattacharya et al. (2019), Grünwald et al. (2017), Bissiri et al. (2016), and Miller and Dunson (2019), have proposed tempering the likelihood used in Bayesian inference. It is therefore tempting to apply this strategy to correct the behavior of synthetic likelihood under model misspecification.

For a positive constant $α > 0$ , the standard approach to tempering would be to use a powered version of the assumed model density within an MCMC scheme in order to generate draws from the tempered posterior. Using likelihood tempering in the case of synthetic likelihood would then lead us to use the density $N {S_{n}; {\hat{b}}_{n} (θ), {\hat{Σ}}_{n} (θ)}^{α}$ within the corresponding MCMC scheme. The results of Bhattacharya et al. (2019) suggest that, in the case of a genuine likelihood, so long as $α \in (0, 1)$ , the tempered posterior can still display posterior concentration.

Following arguments in Andrieu and Roberts (2009), as well as those given in Price et al. (2018), using $N {S_{n}; {\hat{b}}_{n} (θ), {\hat{Σ}}_{n} (θ)}^{α}$ within the MCMC scheme results in draws from the target posterior $\begin{matrix} {\hat{π}}_{α} (θ | S_{n}) \propto {\hat{g}}_{n}^{α} (S_{n} | θ) π (θ) \\ {\hat{g}}_{n}^{α} (S_{n} | θ) = \int N {S_{n}; {\hat{b}}_{n} (θ), {\hat{Σ}}_{n} (θ)}^{α} \prod_{i = 1}^{m} d P_{θ}^{(n)} {S_{n} (z^{i})} d S_{n} (z^{1}) \dots d S_{n} (z^{m}) . \end{matrix}$

The posterior ${\hat{π}}_{α} (θ | S_{n})$ does not resemble a tempered posterior, but instead resembles a posterior based on an integrated likelihood.

We now return to the running example and examine the behavior of the tempered posterior ${\hat{π}}_{α} (θ | S_{n})$ . Similar to the posterior $\hat{π} (θ | S_{n})$ , since we can compute the mean and variance of the summaries exactly, it is possible to compute an exact version of the tempered posterior. We apply the tempered version of the exact posterior using a fixed tempering schedule with $α = 1 / 2$ for each value of n. Following the introductory example, we plot the tempered posterior for n = 100, 500, 1000 and compare the results to those obtained in Figure 1.⁸

Figure 3 demonstrates that the tempered posterior displays similar behavior to the exact posterior in Figure 1, and does not lead to any meaningful increase in posterior mass near θ = 0, the point under which $| | b (θ) - b_{•} | |$ is smallest. This result is perhaps unsurprising considering that the synthetic likelihood is Gaussian, and so tempering only changes the scaling of the posterior.

We now formally demonstrate that the tempered posterior ${\hat{π}}_{α} (θ | S_{n})$ produces qualitatively similar behavior to $\hat{π} (θ | S_{n})$ when the results of Theorem 1 are valid. To state this result recall the local parameter $t = \sqrt{n} (θ - θ_{n})$ , let ${\hat{π}}_{α} (t | S_{n}) = {\hat{π}}_{α} (θ_{n} + t / \sqrt{n} | S_{n}) / {\sqrt{n}}^{d_{θ}}$ be the fractional posterior for t, and let $π_{α} (t)$ be a density function that satisfies $π_{α} (t) \propto N {t; 0, Δ}^{α}$ .

Corollary 2 (Fractional Posteriors). If Assumptions 1-7 are satisfied, then for any finite $γ > 0$ , and any fixed $α \in [0, 1]$ , $| \int_{| | t | | \leq γ} {\hat{π}}_{α} (t | S_{n}) d t - \int_{| | t | | \leq γ} π_{α} (t) d t | = O_{p} (1 / m)$ for $m \to \infty$ as $n \to \infty$ .
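
As a short worked step (added here for intuition, and assuming a fixed $α > 0$ and a positive-definite Δ), raising the Gaussian kernel to the power α only rescales its covariance:

$N \{t; 0, Δ\}^{α} \propto \exp \{- α t^{⊤} Δ^{- 1} t / 2\} \propto N \{t; 0, Δ / α\} .$

Hence the local limit around each $θ_{⋆} \in Θ_{⋆}$ remains Gaussian and centered at the same point, with a covariance inflated by the factor $1 / α$ when $α < 1$ . Tempering with a fixed α therefore changes the spread of each local component, but not the location or number of modes.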

4.2 Robust synthetic likelihood

The multi-modality observed in the running example exists because $\hat{π} (θ | S_{n})$ assigns high probability mass to values of θ that ensure $| | {\hat{Σ}}_{n} {(θ)}^{- 1 / 2} {{\hat{b}}_{n} (θ) - S_{n}} | |$ is small. Measuring differences between summary statistics using this relative distance, rather than an absolute distance like $| | {\hat{b}}_{n} (θ) - S_{n} | |$ , means that there can exist values of θ such that $| | {\hat{b}}_{n} (θ) - S_{n} | |$ is large, while $| | {\hat{Σ}}_{n} {(θ)}^{- 1 / 2} {{\hat{b}}_{n} (θ) - S_{n}} | |$ is small. With the above realization, there are several approaches for correcting this behavior. For brevity, we focus on two, leaving a detailed comparison and discussion on alternative approaches for future research.
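
The distinction between the two distances can be illustrated with a minimal Python sketch. All numbers below (the observed summary `s_obs`, the means `b_near` and `b_far`, and the two covariance matrices) are hypothetical values chosen only to display the effect; they are not taken from the running example.

```python
import numpy as np

def absolute_distance(b, s):
    """Euclidean distance ||b - s||."""
    return float(np.linalg.norm(b - s))

def relative_distance(b, sigma, s):
    """Weighted distance sqrt((b - s)' sigma^{-1} (b - s)), which equals
    ||sigma^{-1/2} (b - s)||; computed with a Cholesky solve."""
    chol = np.linalg.cholesky(sigma)
    return float(np.linalg.norm(np.linalg.solve(chol, b - s)))

# Hypothetical observed summary and two candidate parameter values.
s_obs = np.array([1.0, 0.0])
b_near, sigma_near = np.array([1.2, 0.1]), 0.01 * np.eye(2)  # close mean, small variance
b_far, sigma_far = np.array([2.0, 1.0]), 1.0 * np.eye(2)     # distant mean, large variance

for name, b, sigma in [("near", b_near, sigma_near), ("far", b_far, sigma_far)]:
    print(name, absolute_distance(b, s_obs), relative_distance(b, sigma, s_obs))
```

In this constructed case the "far" value has the larger absolute distance but the smaller relative distance, so a criterion based on the relative distance would favor it.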

4.2.1 Robust Bayesian Synthetic Likelihood

The first approach we consider is the robust Bayesian synthetic likelihood approach presented in Frazier and Drovandi (2021).⁹ This approach accounts for model misspecification by altering the covariance matrix used in the synthetic likelihood to ensure that the magnitude of $| | {\hat{b}}_{n} (θ) - S_{n} | |$ is properly taken into account. For $Γ = (γ_{1}, \dots, γ_{d})'$ denoting a d-dimensional random vector with support $G$ , define the regularized covariance matrix ${\hat{Σ}}_{n} (θ, Γ) = {\hat{Σ}}_{n} (θ) + {\hat{Σ}}_{n}^{1 / 2} (θ) \mathrm{diag} \{γ_{1}, \dots, γ_{d}\} {\hat{Σ}}_{n}^{1 / 2} (θ) .$ Let ${\hat{g}}_{n} (S_{n} | θ, Γ) = \int N {S_{n}; {\hat{b}}_{n} (θ), {\hat{Σ}}_{n} (θ, Γ)} \prod_{i = 1}^{m} d P_{θ}^{(n)} {S_{n} (z^{i})} d S_{n} (z^{1}) \dots d S_{n} (z^{m})$ denote the synthetic likelihood based on ${\hat{Σ}}_{n} (θ, Γ)$ .

The parameters Γ allow the variance of the synthetic likelihood to increase so that the weighted norm $| | {\hat{Σ}}_{n} {(θ, Γ)}^{- 1 / 2} {{\hat{b}}_{n} (θ) - S_{n}} | |$ can be made small even if there is no value in Θ under which $| | {\hat{b}}_{n} (θ) - S_{n} | |$ is small. Let $π (Γ)$ denote the prior density of Γ, for which Frazier and Drovandi (2021) use independent exponential priors. The joint posterior is then $\hat{π} (θ, Γ | S_{n}) \propto π (θ) π (Γ) {\hat{g}}_{n} (S_{n} | θ, Γ),$ and Markov chain Monte Carlo methods can be used to sample from $\hat{π} (θ, Γ | S_{n})$ .
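
The effect of Γ on the weighted norm can be sketched numerically. The following Python fragment is only an illustration of the regularized covariance matrix defined above, not the implementation in any package; the function names and all numerical values (`sigma_hat`, `b_hat`, `s_obs`, `gamma`) are hypothetical.

```python
import numpy as np

def sym_sqrt(mat):
    """Symmetric square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(vals)) @ vecs.T

def regularized_cov(sigma_hat, gamma):
    """sigma_hat + sigma_hat^{1/2} diag(gamma) sigma_hat^{1/2}."""
    root = sym_sqrt(sigma_hat)
    return sigma_hat + root @ np.diag(gamma) @ root

def weighted_norm(s, mean, cov):
    """||cov^{-1/2} (mean - s)||, computed with a Cholesky solve."""
    chol = np.linalg.cholesky(cov)
    return float(np.linalg.norm(np.linalg.solve(chol, mean - s)))

def gaussian_logpdf(s, mean, cov):
    """Log density of N(mean, cov) evaluated at s."""
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, s - mean)
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return float(-0.5 * (s.shape[0] * np.log(2.0 * np.pi) + logdet + z @ z))

# Hypothetical inputs: the second summary cannot be matched by the model.
sigma_hat = np.array([[0.02, 0.005], [0.005, 0.01]])
b_hat = np.array([1.0, 0.0])
s_obs = np.array([1.0, 0.8])
gamma = np.array([0.0, 50.0])  # inflate the variance of the second direction only

cov_plain = regularized_cov(sigma_hat, np.zeros(2))
cov_robust = regularized_cov(sigma_hat, gamma)
print(weighted_norm(s_obs, b_hat, cov_plain), weighted_norm(s_obs, b_hat, cov_robust))
print(gaussian_logpdf(s_obs, b_hat, cov_plain), gaussian_logpdf(s_obs, b_hat, cov_robust))
```

With these invented values, a positive $γ_{2}$ shrinks the weighted norm and raises the Gaussian log density even though $| | {\hat{b}}_{n} (θ) - S_{n} | |$ itself is unchanged.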

We now illustrate the behavior of the posterior $\hat{π} (θ, Γ | S_{n})$ under different levels of model misspecification in the running example. Following the analysis in Section 2.2, we consider three sample sizes n = 100, 500, 1000. The posterior is sampled using the robust option in the BSL package (An et al., 2022) under the default prior choice for Γ.¹⁰

We plot the robust posteriors in Figure 4. The results demonstrate that the robust posteriors for θ are Gaussian and concentrating around θ = 0, with the robust posterior seemingly being insensitive to the level of model misspecification. This behavior is due to the regularization of the covariance matrix, which ensures the criterion is globally concave and achieves its maximum at θ = 0. The results in Figure 4 provide convincing evidence that R-BSL yields reliable posteriors in a much broader set of circumstances than originally investigated in Frazier and Drovandi (2021), which considered the analysis of the R-BSL posterior in the MA(1) model but only considered a single misspecified DGP with sample size n = 100.

While Frazier and Drovandi (2021) prove posterior concentration of $\hat{π} (θ, Γ | S_{n})$ in the case of correctly specified models, as with the posterior $\hat{π} (θ | S_{n})$ , these arguments do not extend to the case of misspecified models. Determining the theoretical behavior of $\hat{π} (θ, Γ | S_{n})$ is significantly complicated by the introduction of Γ and the behavior of these components when the model is misspecified. While the authors have observed reliable behavior for $\hat{π} (θ, Γ | S_{n})$ across a multitude of examples, formal results on the asymptotic behavior of $\hat{π} (θ, Γ | S_{n})$ would require specific conditions on the prior $π (Γ)$ , and the construction of novel arguments to deduce the asymptotic results. Given these complications, we leave a comprehensive study on the behavior of $\hat{π} (θ, Γ | S_{n})$ for future work.

4.2.2 A Robust Adjustment Approach

While the robust synthetic likelihood approach delivers reliable inference even in highly misspecified models, it requires conducting posterior inference over $d_{θ} + d$ elements (where $d \geq d_{θ}$ ), which can become cumbersome in cases where either θ or $S_{n}$ is high-dimensional. However, the key insight of Frazier and Drovandi (2021) with regard to misspecification is that it can be handled by sufficiently altering the structure of the synthetic likelihood.

An alternative approach to deal with model misspecification in the case of high-dimensional summaries, or parameters, is to replace the covariance matrix ${\hat{Σ}}_{n} (θ)$ in $g_{n} (S_{n} | θ)$ with a naive but fixed version $A_{n}$ . Replacing ${\hat{Σ}}_{n} (θ)$ by the fixed matrix $A_{n}$ means that $\log g_{n} (S_{n} | θ)$ is roughly a weighted quadratic form based on a fixed covariance matrix, and thus should be well-behaved. As the result of Theorem 2 suggests, if this naive posterior is indeed unimodal, then it will be approximately Gaussian in large samples, but with a covariance matrix that depends on the choice of $A_{n}$ .

An unintended consequence of replacing ${\hat{Σ}}_{n} (θ)$ in $g_{n} (S_{n} | θ)$ by the naive covariance matrix $A_{n}$ is that the resulting posterior will concentrate onto values of θ under which $| | A_{n}^{- 1 / 2} {b (θ) - S_{n}} | |$ is small. Therefore, the specific choice of $A_{n}$ influences the pseudo-true value onto which the posterior will concentrate. To ensure that this pseudo-true value remains meaningful, we suggest setting $A_{n} = I_{d} / n$ , so that the naive posterior asymptotically concentrates onto the minimizer of $| | b (θ) - b_{•} | |$ . This choice ensures that the resulting pseudo-true value remains a meaningful quantity in approximate Bayesian inference.¹¹

While the naive posterior will concentrate onto a meaningful pseudo-true value, the coverage of the resulting posterior will not have the correct level (see Remark 8). However, we can adjust the posterior variance to ensure it attains the correct level of frequentist coverage. In particular, we can follow a similar idea to the adjustment procedure of Frazier et al. (2022) and adjust the posterior draws using the following algorithm.

1. Take ${\hat{Σ}}_{n} (θ) = A_{n}$ , for all $θ \in Θ$ , as the covariance matrix in the synthetic likelihood and obtain the corresponding naive posterior mean, ${\bar{θ}}_{n}$ , and posterior covariance, ${\hat{Δ}}_{n}$ .

2. For $θ^{j}$ , $j = 1, \dots, N$ , a sample from the naive posterior, adjust $θ^{j}$ according to ${\tilde{θ}}^{j} = {\bar{θ}}_{n} + {\hat{Δ}}_{n} {\hat{W}}_{n}^{1 / 2} {\hat{Δ}}_{n}^{- 1 / 2} (θ^{j} - {\bar{θ}}_{n}),$ where ${\hat{W}}_{n}$ is any consistent estimator of the asymptotic variance $W_{⋆} : = \lim_{n} var {\sqrt{n} M_{n} (θ_{⋆})}$ .
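
The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the naive posterior draws are replaced by simulated Gaussian draws, `w_hat` is an arbitrary positive-definite matrix standing in for ${\hat{W}}_{n}$ , and matrix square roots are taken to be symmetric. The sketch applies the displayed formula literally; any normalization by n must be handled consistently with how ${\hat{Δ}}_{n}$ and ${\hat{W}}_{n}$ are defined.

```python
import numpy as np

def sym_power(mat, power):
    """Symmetric matrix power of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * vals ** power) @ vecs.T

def adjust_draws(draws, w_hat):
    """Map naive draws theta^j to
    theta_bar + Delta_hat W_hat^{1/2} Delta_hat^{-1/2} (theta^j - theta_bar)."""
    theta_bar = draws.mean(axis=0)                           # naive posterior mean
    delta_hat = np.atleast_2d(np.cov(draws, rowvar=False))   # naive posterior covariance
    transform = delta_hat @ sym_power(w_hat, 0.5) @ sym_power(delta_hat, -0.5)
    return theta_bar + (draws - theta_bar) @ transform.T

# Hypothetical stand-ins for naive posterior draws and for W_hat.
rng = np.random.default_rng(0)
draws = rng.multivariate_normal([0.1, -0.2], [[0.04, 0.01], [0.01, 0.09]], size=5000)
w_hat = np.array([[2.0, 0.3], [0.3, 1.5]])

adjusted = adjust_draws(draws, w_hat)
delta_hat = np.cov(draws, rowvar=False)
print(np.cov(adjusted, rowvar=False))
print(delta_hat @ w_hat @ delta_hat.T)
```

Because the map is linear, the adjusted draws keep the naive posterior mean, while their sample covariance equals ${\hat{Δ}}_{n} {\hat{W}}_{n} {\hat{Δ}}_{n}^{⊤}$ up to floating-point error, which is the sandwich form targeted by the adjustment.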

While the naive posterior in the first step ensures concentration onto values of θ under which $| | b (θ) - b_{•} | |$ is small, this posterior has unreliable uncertainty quantification, and so the second step adjusts the posterior variance. When the model is misspecified, the most reliable estimator of $W_{⋆} : = \lim_{n} var {\sqrt{n} M_{n} (θ_{⋆})}$ is obtained using bootstrapping where we re-sample the summary statistics and recalculate the synthetic likelihood equations $M_{n} (θ)$ at each bootstrapped summary statistic. In this way, we can interpret the above adjusted posterior as being similar to the “BayesBag” posterior (see Huggins and Miller, 2019), but in the synthetic likelihood context. However, unlike BayesBag our approach does not require re-running any posterior sampling mechanism, and only requires bootstrapping summary statistics and recalculating the synthetic likelihood score equations. Moreover, unlike BayesBag, the following result demonstrates that the adjusted posterior delivers asymptotically valid uncertainty quantification.
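
A block bootstrap estimate of $W_{⋆}$ along these lines might be sketched as below. Several ingredients are assumptions made purely for illustration and are not taken from the text: a scalar θ, summaries given by the uncentered sample variance and lag-one autocovariance, the moving-average mean map $b (θ) = (1 + θ^{2}, θ)^{⊤}$ , the score normalization $M_{n} (θ) = - \{\partial b (θ) / \partial θ\}^{⊤} \{b (θ) - S_{n}\}$ implied by $A_{n} = I_{d} / n$ , and the default block size and number of replications.

```python
import numpy as np

def summaries(y):
    """Uncentered sample variance and lag-one autocovariance (assumed summaries)."""
    return np.array([np.mean(y * y), np.mean(y[1:] * y[:-1])])

def score(theta, s):
    """Assumed score M_n(theta) = -(db/dtheta)' (b(theta) - s) for the
    hypothetical mean map b(theta) = (1 + theta^2, theta)."""
    b = np.array([1.0 + theta ** 2, theta])
    jac = np.array([2.0 * theta, 1.0])
    return float(-jac @ (b - s))

def block_bootstrap_w(y, theta, block_size=5, n_boot=1000, seed=0):
    """Moving-block bootstrap estimate of var{sqrt(n) M_n(theta)}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_blocks = int(np.ceil(n / block_size))
    offsets = np.arange(block_size)
    scores = np.empty(n_boot)
    for r in range(n_boot):
        # Draw block start points, concatenate the blocks, and trim to length n.
        starts = rng.integers(0, n - block_size + 1, size=n_blocks)
        idx = (starts[:, None] + offsets[None, :]).ravel()[:n]
        # Recompute the summaries and the score on the resampled series.
        scores[r] = score(theta, summaries(y[idx]))
    return n * np.var(scores, ddof=1)

# Hypothetical data: independent Gaussian noise standing in for the observed series.
y_obs = np.random.default_rng(1).standard_normal(500)
print(block_bootstrap_w(y_obs, theta=0.0))
```

Only the summaries and the score are recomputed in each replication; no posterior sampling is repeated.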

Let $θ_{n} : = \arg \min_{θ \in Θ} | | b (θ) - S_{n} | |$ , $ϑ : = \sqrt{n} (\tilde{θ} - θ_{n})$ , and $T_{n} = \{ϑ : ϑ = \sqrt{n} (\tilde{θ} - θ_{n}), \tilde{θ} \in Θ\}$ , where $\tilde{θ}$ denotes the transformed version of θ given in the second step of the algorithm.

Corollary 3. Suppose Assumptions 1, 2, and 4-9 are satisfied, and that ${\hat{Δ}}_{n}$ and ${\hat{W}}_{n}$ are consistent estimators for Δ and $W_{⋆}$ , respectively. For $m \to \infty$ as $n \to \infty$ , $\int_{T_{n}} | \hat{π} (ϑ | S_{n}) - N {ϑ; 0, Δ W_{⋆} Δ^{⊤}} | d ϑ = o_{p} (1)$ , and for ${\tilde{θ}}_{n} : = \int_{Θ} \tilde{θ} \cdot \hat{π} (\tilde{θ} | S_{n}) d \tilde{θ}$ , $\sqrt{n} ({\tilde{θ}}_{n} - θ_{•}) \Rightarrow N (0, Δ W_{⋆} Δ^{⊤})$ .

Remark 10. The adjusted BSL posterior concentrates onto the same limiting value as the ABC posterior under model misspecification. However, unlike the ABC posterior the adjusted BSL posterior displays Gaussian posterior concentration, and asymptotically correctly quantifies uncertainty about the pseudo-true value $θ_{•} : = \arg \min_{θ \in Θ} | | b (θ) - b_{•} | |$ . Consequently, if the user is faced with a misspecified model in approximate Bayesian inference, and correct uncertainty quantification is desirable, then we recommend the use of robust BSL methods over ABC methods.

Remark 11. The above adjustment procedure is related to, but distinct from, the procedure discussed in Section 4 of Frazier et al. (2022). In correctly specified models, Frazier et al. (2022) use a similar approach to the second step of the above algorithm to correct the posterior covariance in cases where a computationally convenient covariance matrix is initially used in BSL. We advise against using the approach outlined in Frazier et al. (2022) when the model is misspecified, since the resulting posterior will concentrate onto a point that is determined by the choice of computationally convenient covariance matrix. Due to space limitations, we defer a detailed comparison with the adjustment approach of Frazier et al. (2022) to Appendix D.2.

We now examine the behavior of the adjusted posterior in the running example through a repeated sampling experiment. In particular, we generate five hundred datasets from the model in (2), where the parameter values are $ω = - 0.15, ρ = 0.90$ and $σ_{v} = 0.40$ , and with n = 100, 500, 1000 observations. The adjusted posterior is obtained by calculating the exact naive posterior, setting $A_{n}$ to be the identity covariance matrix, and then adjusting 10,000 samples from the naive posterior. The variance term ${\hat{W}}_{n}$ used in the adjustment is estimated via the block bootstrap with a block size of 5 and using 1,000 bootstrap samples.

In Table 1 we record the posterior mean (multiplied by 10³), variance and Monte Carlo coverage for both the adjusted and naive approaches, and across each of the three sample sizes. The results demonstrate that both approaches give precise estimators for the location of the pseudo-true value, θ = 0, with the adjusted approach having a smaller posterior variance across all sample sizes. In terms of Monte Carlo coverage, both procedures display over-coverage for the unknown pseudo-true value. Therefore, given the tighter posteriors for the adjusted approach, and similar posterior means, we conclude that the adjusted approach is more accurate than the naive approach.

In Supplementary Appendix F we apply the robust procedures discussed in Section 4.2 to analyze a simple model of financial returns. Both methods behave similarly and suggest the presence of significant model misspecification.

5 Discussion

Over the last decade, approximate Bayesian methods like synthetic likelihood have gained acceptance in the statistical community for their ability to produce meaningful inferences in complex models. The ease with which these methods can be applied has also led to their use in diverse fields of research; see Sisson et al. (2018) for examples.

While the initial impetus for these methods was one of practicality, recent research has begun to focus on the theoretical behavior of these methods. In the context of BSL, Frazier et al. (2022) demonstrate that BSL posteriors are well-behaved in large samples, and can deliver inferences that are just as reliable as those obtained by ABC, assuming the model is correctly specified.

However, the important message delivered in this paper is that if the assumed model is misspecified, then synthetic likelihood-based inference can be unreliable, and the resulting posterior can be significantly different to those obtained from other approximate Bayesian methods. Critically, the type of behavior exhibited by the Bayesian synthetic likelihood posterior is intimately related to the form and degree of model misspecification, which cannot be reliably measured without first conducting some form of inference.

While our results have only focused on the most commonly applied variant of the synthetic likelihood posterior, we conjecture that recently proposed variations, such as the semiparametric approach of An et al. (2020) or the whitening approach of Priddle et al. (2021), will exhibit similar behavior. The semiparametric synthetic likelihood approach of An et al. (2020) allows the user to remove the implicit Gaussian assumption for the summary statistic likelihood and estimate the likelihood of the summaries using kernel smoothing methods. However, this more general approach will still suffer from the same issues as the original BSL posterior under model misspecification. Estimating the distribution of the summary statistics does not change the fact that when the model is misspecified the observed summary statistic cannot be matched by the simulated statistics. Consequently, changing the class of approximations used for the synthetic likelihood from the Gaussian to a more flexible class does not address the underlying issue of model misspecification, and so the resulting posterior will behave similarly to those analyzed herein.

Supplementary material. The supplementary material contains proofs of all technical results presented in the paper, additional numerical details and results, and an application of the adjustment procedures discussed in Section 4 to an empirical example on financial returns.

Acknowledgments. We are grateful to the Associate Editor and two referees for their very helpful comments and suggestions that have significantly improved the paper. All remaining errors are our own. Frazier and Drovandi were supported by Australian Research Council funding schemes DE200101070 and FT210100260, respectively. Nott was supported by a Singapore Ministry of Education Academic Research Fund Tier 1 grant.

Notes

¹ We later clarify that even the proof techniques used in Frazier et al. (2022) are not applicable in misspecified models, and that the posterior concentration results in Frazier and Drovandi (2021) do not extend to misspecified models.

² We note that the behavior observed in Figure 1 is in stark contrast to the findings in Frazier and Drovandi (2021), where the authors found that, under a different parameterization of the same model, at small sample sizes (n = 100) the BSL posterior placed most of its mass near the boundary of the parameter space.

³ The precise formulas are too long to state analytically. The interested reader is referred to the supplementary material, where they are given in full detail.

⁴ We thank an anonymous referee for bringing this possible interpretation to our attention.

⁵ The behavior of the posterior when a root is on the boundary of the parameter space can be complicated, and so we leave a detailed study of this situation for future research.

⁶ Given Assumption 1, a sufficient condition for Assumption 4 is that $θ \mapsto b (θ)$ and $θ \mapsto Σ_{n} (θ)$ are twice continuously differentiable.

⁷ In the case of correctly specified models, Δ in Theorem 1 has the form $Δ = {[{\partial b (θ) / \partial θ^{⊤}}^{⊤} Σ {(θ)}^{- 1} {\partial b (θ) / \partial θ^{⊤}}]}^{- 1}$ . However, when the model is misspecified this is not the case, and the general definition of Δ is given in Supplementary Appendix C.

⁸ The results displayed in Figure 3 are not overly sensitive to the choice of α.

⁹ For simplicity, we only focus on the variance adjustment approach detailed in Frazier and Drovandi (2021).

¹⁰ We start the sampler at θ = 0 and retain all resulting draws. In addition, we run the sampler for 50,000 iterations and use 10 synthetic datasets for each replication. These choices are fixed across the different sample size and misspecification combinations. The acceptance rates for the resulting procedure are reasonable, and between 20% and 60% across all combinations.

¹¹ Under the choice of $A_{n} = I_{d} / n$ , the pseudo-true value onto which the posterior would concentrate would be $θ_{•} : = \arg \min_{θ \in Θ} | | b (θ) - b_{•} | |$ , which can be interpreted as the value of $θ \in Θ$ that yields the closest match to the observed summaries in the Euclidean norm, and is the same value onto which the ABC posterior will concentrate if $| | \cdot | |$ were used in ABC.

Table 1 Summary measures of posterior accuracy, calculated as averages across the replications. Mean - posterior mean multiplied by 10³, Variance - posterior variance, Coverage - Monte Carlo coverage. n-BSL refers to naive BSL, and a-BSL refers to adjusted BSL.


Figure 1 Bayesian synthetic likelihood posterior for θ in the misspecified moving average model.

Figure 2 Comparison of the exact (synthetic likelihood) posterior under different levels of model misspecification. The solid line corresponds to n = 100, the dashed line to n = 500 and the dotted line to n = 1000.

Figure 3 Tempered Bayesian synthetic likelihood posteriors for θ in the running example.

Figure 4 r-BSL posteriors for θ in the misspecified MA(1) model across six different levels of model misspecification. The solid line corresponds to n = 100, the dashed line to n = 500 and the dotted line to n = 1000.

Supplemental material


References

An, Z., D. J. Nott, and C. Drovandi (2020). Robust Bayesian synthetic likelihood via a semi-parametric approach. Statistics and Computing 30 (3), 543–557.
An, Z., L. F. South, and C. Drovandi (2022). BSL: An R package for efficient parameter estimation for simulation-based models via Bayesian synthetic likelihood. Journal of Statistical Software 101, 1–33.
Andrieu, C. and G. O. Roberts (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics 37 (2), 697–725.
Bhattacharya, A., D. Pati, and Y. Yang (2019). Bayesian fractional posteriors. The Annals of Statistics 47 (1), 39–66.
Bissiri, P. G., C. C. Holmes, and S. G. Walker (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (5), 1103.
De Gooijer, J. (1981). An investigation of the moments of the sample autocovariances and autocorrelations for general ARMA processes. Journal of Statistical Computation and Simulation 12 (3-4), 175–192.
Frazier, D. T. and C. Drovandi (2021). Robust approximate Bayesian inference with synthetic likelihood. Journal of Computational and Graphical Statistics 30 (4), 958–976.
Frazier, D. T., G. M. Martin, C. P. Robert, and J. Rousseau (2018). Asymptotic properties of approximate Bayesian computation. Biometrika 105 (3), 593–607.
Frazier, D. T., D. J. Nott, C. Drovandi, and R. Kohn (2022, in press). Bayesian inference using synthetic likelihood: asymptotics and adjustments. Journal of the American Statistical Association.
Frazier, D. T., C. P. Robert, and J. Rousseau (2020). Model misspecification in approximate Bayesian computation: consequences and diagnostics. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82 (2), 421–444.
Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), 359–378.
Grünwald, P. and T. Van Ommen (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis 12 (4), 1069–1103.
Huggins, J. H. and J. W. Miller (2019). Using bagged posteriors for robust inference and model criticism. arXiv preprint arXiv:1912.07104.
Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics 40 (2), 633–643.
Kleijn, B. and A. van der Vaart (2012). The Bernstein-von-Mises theorem under misspecification. Electronic Journal of Statistics 6, 354–381.
Li, W. and P. Fearnhead (2018). On the asymptotic efficiency of approximate Bayesian computation estimators. Biometrika 105 (2), 285–299.
Marin, J.-M., N. S. Pillai, C. P. Robert, and J. Rousseau (2014). Relevant statistics for Bayesian model choice. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 833–859.
Marin, J.-M., P. Pudlo, C. P. Robert, and R. J. Ryder (2012). Approximate Bayesian computational methods. Statistics and Computing 22 (6), 1167–1180.
Martin, G. M., D. T. Frazier, and C. P. Robert (2023). Approximating Bayes in the 21st century. Statistical Science 1 (1), 1–26.
Miller, J. W. and D. B. Dunson (2019). Robust Bayesian inference via coarsening. Journal of the American Statistical Association 114 (527), 1113–1125.
Price, L. F., C. C. Drovandi, A. Lee, and D. J. Nott (2018). Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics 27 (1), 1–11.
Priddle, J. W., S. A. Sisson, D. T. Frazier, I. Turner, and C. Drovandi (2021). Efficient Bayesian synthetic likelihood with whitening transformations. Journal of Computational and Graphical Statistics, 1–14.
Sisson, S. A., Y. Fan, and M. Beaumont (2018). Handbook of Approximate Bayesian Computation. New York: Chapman and Hall/CRC.
Van der Vaart, A. W. (2000). Asymptotic Statistics, Volume 3. Cambridge University Press.
Wood, S. N. (2010). Statistical inference for noisy nonlinear ecological dynamic systems. Nature 466 (7310), 1102–1104.
