Scalable Bayesian Transport Maps for High-Dimensional Non-Gaussian Spatial Fields

Abstract

A multivariate distribution can be described by a triangular transport map from the target distribution to a simple reference distribution. We propose Bayesian nonparametric inference on the transport map by modeling its components using Gaussian processes. This enables regularization and uncertainty quantification of the map estimation, while resulting in a closed-form and invertible posterior map. We then focus on inferring the distribution of a nonstationary spatial field from a small number of replicates. We develop specific transport-map priors that are highly flexible and are motivated by the behavior of a large class of stochastic processes. Our approach is scalable to high-dimensional distributions due to data-dependent sparsity and parallel computations. We also discuss extensions, including Dirichlet process mixtures for flexible marginals. We present numerical results to demonstrate the accuracy, scalability, and usefulness of our methods, including statistical emulation of non-Gaussian climate-model output. Supplementary materials for this article are available online.

1 Introduction

Inference on a high-dimensional joint distribution based on a relatively small number of replicates is important in many applications. For example, generative modeling of nonstationary and non-Gaussian spatial distributions is crucial for statistical climate-model emulation (e.g., Castruccio et al. 2014; Nychka et al. 2018; Haugen et al. 2019), for ensemble-based data assimilation (e.g., Houtekamer and Zhang 2016; Katzfuss, Stroud, and Wikle 2016), and for design studies for new satellite observing systems at NASA using observing system simulation experiments (Errico et al. 2013).

Continuous multivariate distributions can be characterized via triangular transport maps (see Marzouk et al. 2016, for a review) that transform the target distribution to a reference distribution (e.g., standard Gaussian), as illustrated in Figure 1. For Gaussian target distributions, such a map is linear and given by the Cholesky factor of the precision matrix; non-Gaussian distributions can be obtained by allowing nonlinearities in the map. Given an invertible transport map, it is straightforward to sample from the target distribution and some of its conditionals, or to transform the non-Gaussian data to the reference space, in which simple linear operations such as regression or interpolation can be applied. Typically, the map is estimated based on training data, often by iteratively expanding a finite-dimensional parameterization of the transport map (e.g., El Moselhy and Marzouk 2012; Bigoni, Spantini, and Marzouk 2016; Marzouk et al. 2016; Parno, Moselhy, and Marzouk 2016; Baptista, Zahm, and Marzouk 2020); subsequent inference is then carried out assuming that the map is known.

Fig. 1 Top panel: Illustration of a transport map $T$ transforming a (bivariate) non-Gaussian distribution $p (y)$ to a standard Gaussian distribution $N (0, I)$. Bottom panel: Equivalently, $T$ converts a realization (here, a spatial field) $y \sim p (y)$ to standard Gaussian coefficients $z = T (y) \sim N (0, I)$. Under maximin ordering (see Figure 2), $z$ can be viewed as scores corresponding to a nonlinear version of principal components, and they decrease in importance and in corresponding spatial scale from left to right. The spatial field is output from a climate model on a grid of size $N = 288 \times 192 = 55{,}296$; we want to learn $T$ characterizing the $N$-dimensional distribution based on an ensemble of $n < 100$ training samples (see Section 6).

We propose an approach for Bayesian inference on a transport map that describes a multivariate continuous distribution and is learned from a limited number of samples from the distribution. We model the map components using nonparametric, conjugate Gaussian-process priors, which probabilistically regularize the map and shrink toward linearity. The resulting generative model is flexible, naturally quantifies uncertainty, and adjusts to the amount of complexity that is discernible from the training data, thus avoiding both over- and under-fitting. The conjugacy results in simple, closed-form inference. Instead of assuming Gaussianity for the multivariate target distribution, our approach is equivalent to a series of conditional GP regression problems that together characterize a non-Gaussian target distribution.

We then focus on learning or emulating structured target distributions corresponding to spatial fields observed at a finite but large number of locations, based on a relatively small number of training replicates. In this setting, our Bayesian transport maps impose sparsity and regularization motivated by the behavior of diffusion-type processes that are encountered in many environmental applications. After applying a so-called maximin ordering of the spatial locations, determining the triangular transport map essentially consists of conditional spatial-prediction problems on an increasingly fine scale. We discuss how this scale decay results in conditional near-Gaussianity for a large class of non-Gaussian stochastic processes associated with quasilinear partial differential equations. Hence, our prior distributions are motivated by the behavior of Gaussian fields with Matérn-type covariance, for which the so-called screening effect leads to a decay of influence that motivates sparse transport maps that only consider nearby observations in the spatial prediction problems, corresponding to assumptions of conditional independence. The degree of shrinkage and sparsity are determined by hyperparameters that are inferred from data. The resulting Bayesian methods require little user input, scale near-linearly in the number of spatial locations, and the main computations are trivially parallel.

We further increase the flexibility in the (continuous) marginal distributions by modeling the GP-regression error terms using Dirichlet process mixtures, which can be fit using a Gibbs sampler. The resulting method lets the data decide the degrees of nonlinearity, nonstationarity, and non-Gaussianity, without manual tuning or model-selection. We also discuss an extension for settings in which Euclidean distance between the locations is not meaningful or in which variables are not identified by spatial locations (e.g., multivariate spatial processes).

Most existing methods for spatial inference are in principle applicable in our emulation setting, but they are often geared toward spatial prediction based on a single training replicate and assume Gaussian processes (GPs) with simple parametric covariance functions (e.g., Cressie 1993; Banerjee et al. 2004), whereas our methodology is not designed for spatial prediction at unobserved locations. Many extensions to nonstationary (e.g., as reviewed by Risser 2016) or nonparametric covariances (e.g., Huang, Hsing, and Cressie 2011; Choi, Li, and Wang 2013; Porcu et al. 2021) have been proposed, but these typically still rely on implicit or explicit assumptions of Gaussianity. This includes locally parametric methods specifically developed for climate-model emulation (Nychka et al. 2018; Wiens, Nychka, and Kleiber 2020; Wiens 2021) that locally fit anisotropic Matérn covariances in small windows and then combine the local fits into a global model. For non-Gaussian spatial data, GPs can be transformed or used as latent building blocks (see, e.g., Gelfand and Schliep 2016; Xu and Genton 2017 and references therein), but relying on a GP’s covariance function limits the types of dependence that can be captured. Parametric non-Gaussian Matérn fields can be constructed using stochastic partial differential equations driven by non-Gaussian noise (Wallin and Bolin 2015; Bolin and Wallin 2020). Models for non-Gaussian spatial data can also be built using copulas; for example, Gräler (2014) proposed vine copulas for spatial fields with extremal behavior, and the factor copula approach of Krupskii, Huser, and Genton (2018) assumes all locations in a homogeneous spatial region to be affected by a common latent factor. Many existing non-Gaussian spatial methods are not scalable to large datasets.

A popular way to achieve scalability for Gaussian spatial fields with parametric covariances is via the Vecchia approximation (e.g., Vecchia 1988; Stein, Chi, and Welty 2004; Datta et al. 2016; Katzfuss and Guinness 2021; Schäfer, Katzfuss, and Owhadi 2021), which implicitly uses a linear transport map given by a sparse inverse Cholesky factor. Kidd and Katzfuss (2022) proposed a Bayesian approach to infer the Cholesky factor nonparametrically. Our (sparse) nonlinear transport maps can be viewed as a Bayesian, nonparametric, and non-Gaussian generalization of Vecchia approximations.

Close relatives of transport maps in machine learning are normalizing flows (see Kobyzev, Prince, and Brubaker 2020 for a review), where triangular layers ensure easy evaluation and inversion of likelihood objectives. Normalizing flows have been used to model point-process intensity functions over the sphere (Ng and Zammit-Mangion 2022, 2023) and random fields in cosmology (Rouhiainen, Giri, and Münchmeyer 2021). Variational autoencoders (VAEs) and generative adversarial networks (GANs) relying on deep neural networks (e.g., Goodfellow, Bengio, and Courville 2016) can be highly expressive and have been employed for climate-model emulation (e.g., Ayala et al. 2021; Besombes et al. 2021). Kovachki et al. (2020) designed GANs with triangular generators that allow conditional sampling. Our approach can be viewed as a Bayesian shallow autoencoder, with the posterior transport map and its inverse acting as the encoder and decoder, respectively. In contrast to our method, deep-learning approaches typically require massive training data, can be expensive to train, and are often highly sensitive to tuning-parameter and network-architecture choices (e.g., Arjovsky and Bottou 2017; Hestness et al. 2017; Mescheder, Geiger, and Nowozin 2018). Hence, in many low-data applications such approaches are only useful when paired with laborious and application-specific techniques, such as data augmentation, transfer learning, or advances in physics-informed machine learning (e.g., Kashinath et al. 2021).

The remainder of this article is organized as follows. In Section 2, we develop Bayesian transport maps. In Section 3, we consider the special case of high-dimensional spatial distributions. In Section 4, we discuss extensions to non-Gaussian errors using Dirichlet process mixtures. Sections 5 and 6 provide comparisons and applications to simulated data and climate-model output, respectively. Section 7 concludes and discusses future work. Appendices A–G in the supplementary materials contain proofs and further details. Fully automated implementations of our methods, along with code to reproduce all results, are available at https://github.com/katzfuss-group/BaTraMaSpa.

2 Bayesian Transport Maps

2.1 Transport Maps and Regression

Consider a continuous random vector $y = {(y_{1}, \dots, y_{N})}^{⊤}$, for example describing a spatial field at $N$ locations as in Figure 1. For simplicity, assume that $y$ has been centered to have mean zero.

For a multivariate Gaussian distribution, $y \sim N_{N} (0, Σ)$ with $Σ^{- 1} = L^{⊤} L$, the (lower-triangular) Cholesky factor $L$ represents a transformation to a standard normal: $z = Ly \sim N_{N} (0, I_{N})$. As a natural extension, we can characterize any continuous $N$-variate distribution $p (y)$ by a potentially nonlinear transport map $T : \mathbb{R}^{N} \to \mathbb{R}^{N}$ (Villani 2009), such that $z = T (y) \sim N_{N} (0, I_{N})$ for $y \sim p (y)$. Like $L$, we can assume without loss of generality that the transport map $T$ is lower-triangular (Rosenblatt 1952; Carlier, Galichon, and Santambrogio 2009),

$$T (y) = \begin{bmatrix} T_{1} (y_{1}) \\ T_{2} (y_{1}, y_{2}) \\ \vdots \\ T_{N} (y_{1}, y_{2}, \dots, y_{N}) \end{bmatrix}, \tag{1}$$

where each $T_{i} (y_{1 : i})$ with $y_{1 : i} = {(y_{1}, \dots, y_{i})}^{⊤}$ is an increasing function of its $i$th argument to ensure that $T$ is invertible and implies a proper density $p (y)$. Letting $N (x \mid μ, σ^{2})$ denote a Gaussian density with parameters $μ$ and $σ^{2}$ evaluated at $x$, we then have

$$p (y) = p_{z} (T (y)) \, | \det \nabla T | = \prod_{i = 1}^{N} \left( N (T_{i} (y_{1 : i}) \mid 0, 1) \left| \frac{\partial T_{i} (y_{1 : i})}{\partial y_{i}} \right| \right), \tag{2}$$

as the triangular $T$ also implies a triangular $\nabla T = {\left( \frac{\partial T_{i} (y_{1 : i})}{\partial y_{j}} \right)}_{i, j = 1, \dots, N}$.
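The Gaussian special case described above can be checked numerically. The following is a minimal NumPy sketch, assuming a small hypothetical covariance matrix (the three locations and the exponential covariance are illustrative choices, not taken from the article). It constructs a lower-triangular $L$ with $Σ^{-1} = L^{⊤} L$ and confirms that $z = Ly$ has identity covariance.

```python
import numpy as np

# Hypothetical 3x3 covariance: exponential covariance on three 1-D locations.
locs = np.array([0.0, 0.3, 1.0])
Sigma = np.exp(-np.abs(locs[:, None] - locs[None, :]))

# If Sigma = C C^T with C lower-triangular, then L = C^{-1} is also
# lower-triangular and satisfies Sigma^{-1} = L^T L, as in the text.
C = np.linalg.cholesky(Sigma)
L = np.linalg.inv(C)

# The covariance of z = L y is L Sigma L^T, which should be the identity.
cov_z = L @ Sigma @ L.T

# Push a few samples y ~ N(0, Sigma) through the linear map.
rng = np.random.default_rng(0)
y = C @ rng.standard_normal((3, 5))  # columns are samples of y
z = L @ y                            # columns are the transformed samples
```

Inverting the Cholesky factor explicitly is fine at this size; for large $N$ one would typically use triangular solves instead.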

Throughout, we assume each $T_{i}$ to be linearly additive in its $i$th argument,

$$T_{i} (y_{1 : i}) = (y_{i} - f_{i} (y_{1 : i - 1})) / d_{i}, \quad i = 1, \dots, N, \tag{3}$$

for some $d_{i} \in \mathbb{R}^{+}$ and $f_{i} : \mathbb{R}^{i - 1} \to \mathbb{R}$ for $i = 2, \dots, N$, with $f_{i} (y_{1 : i - 1}) \equiv 0$ for $i = 1$. Then, $\partial T_{i} (y_{1 : i}) / \partial y_{i} = 1 / d_{i} > 0$, as required. Using (2), it is easy to show that

$$\begin{aligned} p (y) &\propto \prod_{i = 1}^{N} \left( \exp \left( - \frac{1}{2 d_{i}^{2}} {(y_{i} - f_{i} (y_{1 : i - 1}))}^{2} \right) \frac{1}{d_{i}} \right) \\ &\propto \prod_{i = 1}^{N} N (y_{i} \mid f_{i} (y_{1 : i - 1}), d_{i}^{2}) . \end{aligned} \tag{4}$$

Thus, the transport-map approach has turned the difficult problem of inferring the $N$-variate distribution of $y$ into $N$ independent regressions of $y_{i}$ on $y_{1 : i - 1}$ of the form

$$y_{i} = f_{i} (y_{1 : i - 1}) + ϵ_{i}, \quad ϵ_{i} \sim N (0, d_{i}^{2}), \quad i = 1, \dots, N . \tag{5}$$
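To make the structure in (2)–(5) concrete, here is a small pure-Python sketch of a two-variable triangular map. The regression function $f_{2}(y_{1}) = y_{1}^{2}$ and the scales $d_{1}, d_{2}$ are hypothetical values chosen for illustration only. The sketch evaluates the map, inverts it recursively, and computes the log-density via (2).

```python
import math

# Hypothetical example (not from the article): f_1 = 0, f_2(y_1) = y_1^2.
# Python indices are 0-based, so index 0 corresponds to i = 1 in the text.
d = [1.0, 0.5]

def f(i, y_prev):
    """Regression function f_i evaluated at the previous variables y_{1:i-1}."""
    return 0.0 if i == 0 else y_prev[0] ** 2

def forward(y):
    """Apply T as in (3): z_i = (y_i - f_i(y_{1:i-1})) / d_i."""
    return [(y[i] - f(i, y[:i])) / d[i] for i in range(len(y))]

def inverse(z):
    """Invert T recursively: y_i = f_i(y_{1:i-1}) + d_i * z_i."""
    y = []
    for i in range(len(z)):
        y.append(f(i, y) + d[i] * z[i])
    return y

def log_density(y):
    """log p(y) from (2): standard normal log-density of z plus log|det grad T|."""
    z = forward(y)
    log_ref = sum(-0.5 * zi ** 2 - 0.5 * math.log(2.0 * math.pi) for zi in z)
    log_det = -sum(math.log(di) for di in d)
    return log_ref + log_det

y = [0.8, 1.1]
z = forward(y)
y_back = inverse(z)
```

The forward map can be evaluated componentwise in any order, whereas the inverse has to proceed sequentially because each $y_{i}$ depends on the previously recovered $y_{1 : i - 1}$.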

Sparsity in the map $T$ corresponds to conditional independence in the joint distribution $p (y)$ (see Spantini, Bigoni, and Marzouk 2018). Specifically, if we assume $f_{i} (y_{1 : i - 1}) = f_{i} (y_{c_{i}})$ for a subset $c_{i} \subset \{1, \dots, i - 1\}$, then $T$ is sparse in that $T_{i}$ only depends on $y_{j}$ if $j \in c_{i}$ (or if $j = i$). Making such a sparsity assumption for $i = 2, \dots, N$ (and setting $y_{c_{1}} = \emptyset$), we have from (4) that $p (y) = \prod_{i = 1}^{N} p (y_{i} \mid y_{c_{i}})$, meaning that $y_{i}$ is independent of $\{y_{j} : j \notin c_{i}, j < i\}$ conditional on $y_{c_{i}}$. We will exploit this sparsity for computational gain for inferring large non-Gaussian spatial fields in Section 3.

2.2 Modeling the Map Functions Using Gaussian Processes

In the existing transport-map literature (e.g., Marzouk et al. 2016), $f_{i} : \mathbb{R}^{i - 1} \to \mathbb{R}$ and $d_{i} \in \mathbb{R}^{+}$ in (3)–(5) are often assumed to have parametric form, whose parameters are estimated and then assumed known. Instead, we here assume a flexible, nonparametric prior on the map $T$ by specifying independent conjugate Gaussian-process-inverse-Gamma priors for the $f_{i}$ and $d_{i}^{2}$. These prior assumptions induce prior distributions on the map components $T_{i}$ in (3), and thus on the entire map $T$ in (1).

Specifically, for the “noise” variances $d_{i}^{2}$, we assume inverse-Gamma distributions,

$$d_{i}^{2} \overset{\text{ind.}}{\sim} \mathcal{IG} (α_{i}, β_{i}), \quad \text{with } α_{i} > 1, \; β_{i} > 0, \quad i = 1, \dots, N . \tag{6}$$

Conditional on $d_{i}^{2}$, each function $f_{i}$ is modeled as a Gaussian process (GP) with inputs $y_{1 : i - 1}$,

$$f_{i} \mid d_{i} \overset{\text{ind.}}{\sim} \mathcal{GP} (0, d_{i}^{2} K_{i}), \quad i = 1, \dots, N, \tag{7}$$

where $K_{i} (\cdot, \cdot) = C_{i} (\cdot, \cdot) / E (d_{i}^{2})$, $E (d_{i}^{2}) = β_{i} / (α_{i} - 1)$,

$$C_{i} (y_{1 : i - 1}, y_{1 : i - 1}^{'}) = y_{1 : i - 1}^{⊤} Q_{i} y_{1 : i - 1}^{'} + σ_{i}^{2} ρ_{i} (y_{1 : i - 1}, y_{1 : i - 1}^{'}), \quad i = 1, \dots, N, \tag{8}$$

$σ_{i} \in \mathbb{R}_{0}^{+}$, and $ρ_{i}$ is a positive-definite correlation function such that $ρ_{i} (y_{1 : i - 1}, y_{1 : i - 1}) = 1$. This prior on $f_{i}$ is motivated by considering ${\tilde{f}}_{i} \mid b_{i} \sim \mathcal{GP} (b_{i}^{⊤} (\cdot), σ_{i}^{2} ρ_{i} (\cdot, \cdot))$ with inputs $y_{1 : i - 1}$, where $b_{i} \sim N (0, Q_{i})$. Integrating out $b_{i}$, we obtain ${\tilde{f}}_{i} \sim \mathcal{GP} (0, C_{i})$ with $C_{i}$ as in (8), and hence $f_{i} = (d_{i} / \sqrt{E (d_{i}^{2})}) {\tilde{f}}_{i}$ as in (7). The degree of nonlinearity of $f_{i}$ is determined by $σ_{i}^{2}$; if $σ_{i}^{2} = 0$, then $f_{i}$ is a linear function of $y_{1 : i - 1}$. The prior distributions (i.e., $α_{i}$, $β_{i}$, $C_{i}$) may depend on hyperparameters $θ$; see Section 2.4 for more details.
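The covariance function in (8) can be sketched in a few lines of NumPy. The squared-exponential form of $ρ_{i}$, the range parameter `gamma`, and the matrix `Q` below are assumptions made for illustration; the text at this point only requires $ρ_{i}$ to be a positive-definite correlation function. The sketch also illustrates that setting $σ_{i}^{2} = 0$ leaves only the linear part.

```python
import numpy as np

def prior_cov(X1, X2, Q, sigma2, gamma):
    """C_i in (8): linear part x^T Q x' plus sigma_i^2 times a correlation.

    The squared-exponential correlation with range gamma is an assumption
    of this sketch; any correlation with rho_i(x, x) = 1 would fit (8).
    """
    linear = X1 @ Q @ X2.T
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    rho = np.exp(-0.5 * sq_dist / gamma ** 2)
    return linear + sigma2 * rho

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 2))   # n = 6 samples of y_{1:i-1}, with i - 1 = 2
Q = np.diag([1.0, 0.25])          # hypothetical prior covariance of b_i

C_nonlin = prior_cov(X, X, Q, sigma2=0.5, gamma=1.0)
C_lin = prior_cov(X, X, Q, sigma2=0.0, gamma=1.0)

# With sigma_i^2 = 0 the kernel is purely linear, so its rank is at most i - 1 = 2.
rank_lin = np.linalg.matrix_rank(C_lin, tol=1e-8)
```

Dividing `C_nonlin` by $E (d_{i}^{2}) = β_{i} / (α_{i} - 1)$ would give the kernel matrix $K_{i}$ used in (7).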

2.3 The Posterior Map

Now assume that we have observed n independent training samples $y^{(1)}, \dots, y^{(n)}$ from the distribution in Section 2.1 conditional on $f = (f_{1}, \dots, f_{N})$ and $d = (d_{1}, \dots, d_{N})$ , such that $y^{(j)} \overset{iid}{\sim} p (y | f, d)$ with $T (y^{(j)}) | f, d \sim N_{N} (0, I_{N}), j = 1, \dots, n$ . We combine the samples into an n × N data matrix Y whose jth row is given by $y^{(j)}$ . Then, for the regression in (5), the responses $y_{i}$ and the covariates $Y_{1 : i - 1}$ are given by the ith and the first i – 1 columns of Y, respectively. Below, let $y^{⋆}$ denote a new observation sampled from the same distribution, $y^{⋆} \sim p (y | f, d)$ , independently of Y.

Based on the prior distribution for f and d in Section 2.2, we can now determine the posterior map $\tilde{T}$ learned from the training data Y, with f and d integrated out. This map is available in closed form and invertible:

Proposition 1.

The transport map $\tilde{T}$ from $y^{⋆} \sim p (y \mid Y)$ to $z^{⋆} = \tilde{T} (y^{⋆}) \sim N_{N} (0, I_{N})$ is a triangular map with components

$$z_{i}^{⋆} = {\tilde{T}}_{i} (y_{1}^{⋆}, \dots, y_{i}^{⋆}) = Φ^{- 1} \Big( F_{2 {\tilde{α}}_{i}} \Big( {\hat{d}}_{i}^{- 1} {(v_{i} (y_{1 : i - 1}^{⋆}) + 1)}^{- 1 / 2} (y_{i}^{⋆} - {\hat{f}}_{i} (y_{1 : i - 1}^{⋆})) \Big) \Big), \quad i = 1, \dots, N, \tag{9}$$

where ${\tilde{α}}_{i} = α_{i} + n / 2$, ${\tilde{β}}_{i} = β_{i} + y_{i}^{⊤} G_{i}^{- 1} y_{i} / 2$, ${\hat{d}}_{i}^{2} = {\tilde{β}}_{i} / {\tilde{α}}_{i}$, $G_{i} = \mathbf{K}_{i} + I_{n}$, $\mathbf{K}_{i} = K_{i} (Y_{1 : i - 1}, Y_{1 : i - 1}) = {(K_{i} (y_{1 : i - 1}^{(j)}, y_{1 : i - 1}^{(l)}))}_{j, l = 1, \dots, n}$,

$${\hat{f}}_{i} (y_{1 : i - 1}^{⋆}) = K_{i} (y_{1 : i - 1}^{⋆}, Y_{1 : i - 1}) G_{i}^{- 1} y_{i}, \tag{10}$$

$$v_{i} (y_{1 : i - 1}^{⋆}) = K_{i} (y_{1 : i - 1}^{⋆}, y_{1 : i - 1}^{⋆}) - K_{i} (y_{1 : i - 1}^{⋆}, Y_{1 : i - 1}) G_{i}^{- 1} K_{i} (Y_{1 : i - 1}, y_{1 : i - 1}^{⋆}), \tag{11}$$

for $i = 2, \dots, N$, with ${\hat{f}}_{1} = v_{1} = 0$ for $i = 1$; $Φ$ and $F_{κ}$ denote the cumulative distribution functions of the standard normal and the $t$ distribution with $κ$ degrees of freedom, respectively. The inverse map ${\tilde{T}}^{- 1}$ can be evaluated at a given $z^{⋆}$ by solving the nonlinear triangular system $\tilde{T} (y^{⋆}) = z^{⋆}$ for $y^{⋆}$; because $\tilde{T}$ is triangular, the solution can be expressed recursively as

$$y_{i}^{⋆} = {\hat{f}}_{i} (y_{1 : i - 1}^{⋆}) + F_{2 {\tilde{α}}_{i}}^{- 1} (Φ (z_{i}^{⋆})) \, {\hat{d}}_{i} {(v_{i} (y_{1 : i - 1}^{⋆}) + 1)}^{1 / 2}, \quad i = 1, \dots, N . \tag{12}$$

All proofs are provided in Appendix A, supplementary materials. We can write the prior map in a similar form, but this is only useful in the case of highly informative priors.

Determining ${\tilde{T}}_{i}$ requires $O (n^{3} + i n^{2})$ time, mostly for computing and decomposing the $n \times n$ matrix $G_{i}$, for each $i = 1, \dots, N$. However, note that the $N$ rows or components of $\tilde{T}$ can be computed completely in parallel, as in the optimization-based transport-map estimation reviewed in Marzouk et al. (2016). Each application of the transport map or its inverse then consists of the GP prediction in (10)–(11) and only requires $O (n^{2} + i n)$ time for $i = 1, \dots, N$, but the inverse map is evaluated recursively (i.e., not in parallel).

In contrast to existing transport-map approaches, our approach is Bayesian and naturally quantifies uncertainty in the nonlinear transport functions. The GP priors on the $f_{i}$ automatically adapt to the amount of information available, only resulting in strongly nonlinear function estimates when supplied the requisite evidence by the data. As $n$ increases, ${\tilde{α}}_{i}$ increases, $F_{2 {\tilde{α}}_{i}}$ converges to $Φ$, and $v_{i} (y_{1 : i - 1}^{⋆})$ typically converges to zero, and so the map components and their inverses simplify to

$${\tilde{T}}_{i} (y_{1}^{⋆}, \dots, y_{i}^{⋆}) = (y_{i}^{⋆} - {\hat{f}}_{i} (y_{1 : i - 1}^{⋆})) / {\hat{d}}_{i} \quad \text{and} \quad y_{i}^{⋆} = {\hat{f}}_{i} (y_{1 : i - 1}^{⋆}) + z_{i}^{⋆} {\hat{d}}_{i} . \tag{13}$$

When employed for finite n, this simplified version of the map ignores posterior uncertainty in f and d and instead relies on the point estimates ${\hat{f}}_{i} (y_{1 : i - 1})$ and ${\hat{d}}_{i}^{2}$ . If we further assume that $σ_{i} = 0$ in (8) for all $i = 1, \dots, N$ , then all f_i and all ${\tilde{T}}_{i}$ become linear functions; we can think of the resulting linear map $\tilde{T} (y^{⋆}) = L^{⊤} y^{⋆}$ as an inverse Cholesky factor, in the sense that $y^{⋆} | Y \sim N (0, Λ^{- 1})$ with $Λ = L L^{⊤}$ .
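The posterior quantities in Proposition 1 and the simplified map in (13) can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the kernel `kern` (linear plus squared-exponential, used directly as $K_{i}$), the prior values `alpha` and `beta`, and the simulated two-variable data are all hypothetical. The full map in (9) would additionally apply $Φ^{-1}$ and the $t$ distribution function, which is omitted here.

```python
import numpy as np

def kern(X1, X2, sigma2=0.5, gamma=1.0):
    """Hypothetical K_i: linear kernel plus a squared-exponential term."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return X1 @ X2.T + sigma2 * np.exp(-0.5 * sq / gamma ** 2)

def fit_component(X, y, alpha=2.0, beta=1.0):
    """Posterior quantities of Proposition 1 for one map component."""
    n = len(y)
    K = kern(X, X) if X.shape[1] > 0 else np.zeros((n, n))  # f_1 = 0 for i = 1
    G = K + np.eye(n)                       # G_i = K_i + I_n
    G_inv_y = np.linalg.solve(G, y)
    alpha_t = alpha + n / 2.0               # tilde alpha_i
    beta_t = beta + 0.5 * y @ G_inv_y       # tilde beta_i
    return {"X": X, "G": G, "G_inv_y": G_inv_y, "d_hat": np.sqrt(beta_t / alpha_t)}

def f_hat(fit, x_new):
    """Posterior mean (10); rows of x_new are new inputs y*_{1:i-1}."""
    if fit["X"].shape[1] == 0:
        return np.zeros(len(x_new))
    return kern(x_new, fit["X"]) @ fit["G_inv_y"]

def v(fit, x_new):
    """Posterior variance factor (11) at the new inputs."""
    if fit["X"].shape[1] == 0:
        return np.zeros(len(x_new))
    k = kern(x_new, fit["X"])
    prior_var = np.diag(kern(x_new, x_new))
    return prior_var - np.sum(k * np.linalg.solve(fit["G"], k.T).T, axis=1)

def forward(fits, y_new):
    """Simplified map (13): z_i = (y_i - f_hat_i(y_{1:i-1})) / d_hat_i."""
    return np.array([(y_new[i] - f_hat(fit, y_new[None, :i])[0]) / fit["d_hat"]
                     for i, fit in enumerate(fits)])

def inverse(fits, z):
    """Recursive inverse in (13): y_i = f_hat_i(y_{1:i-1}) + z_i * d_hat_i."""
    y = np.zeros(len(z))
    for i, fit in enumerate(fits):
        y[i] = f_hat(fit, y[None, :i])[0] + z[i] * fit["d_hat"]
    return y

# Simulated training data with a nonlinear dependence of y_2 on y_1.
rng = np.random.default_rng(0)
n = 40
y1 = rng.standard_normal(n)
y2 = y1 ** 2 + 0.3 * rng.standard_normal(n)
Y = np.column_stack([y1, y2])
fits = [fit_component(Y[:, :i], Y[:, i]) for i in range(2)]

y_star = np.array([0.5, 0.4])
z_star = forward(fits, y_star)
y_back = inverse(fits, z_star)
v_star = v(fits[1], y_star[None, :1])
```

Each call to `fit_component` is independent of the others, which mirrors the remark that the $N$ map components can be computed in parallel, while `inverse` is necessarily sequential.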

Transport maps can be used for a variety of purposes. For example, we can obtain new samples $y^{⋆}$ from the posterior predictive distribution $p (y \mid Y)$ by sampling $z^{⋆} \sim N_{N} (0, I_{N})$ and computing $y^{⋆} = {\tilde{T}}^{- 1} (z^{⋆})$ using (12). The map $\tilde{T}$ in (9) provides a transformation from a non-Gaussian vector $y^{⋆}$ to the standard Gaussian $z^{⋆} = \tilde{T} (y^{⋆})$; we call $z^{⋆} = (z_{1}^{⋆}, \dots, z_{N}^{⋆})$ the map coefficients corresponding to $y^{⋆}$ (see Figure 1 for an illustration). Because the nonlinear dependencies have been removed, many operations are more meaningful on $z^{⋆}$ than on $y^{⋆}$, including linear regressions, translations using linear shifts, and quantifying similarity using inner products. We can also detect inadequacies of the map $\tilde{T}$ for describing the target distribution by examining the degree of non-Gaussianity and dependence in $z^{⋆}$. These uses of transport maps will be considered further in Section 3.5.

2.4 Hyperparameters

The prior distributions on the $f_{i}$ and $d_{i}$ in Section 2.2 may depend on unknown hyperparameters $θ$. For example, by making inference on hyperparameters in the $σ_{i}$ in (8), we can let the data decide the degree of nonlinearity in the map and thus the non-Gaussianity in the resulting joint target distribution. We can write in closed form the integrated likelihood $p (Y)$, where $f$ and $d$ have been integrated out.

Proposition 2.

The integrated likelihood is

$$p (Y) \propto \prod_{i = 1}^{N} \left( | G_{i} |^{- 1 / 2} \times \frac{β_{i}^{α_{i}}}{{\tilde{β}}_{i}^{{\tilde{α}}_{i}}} \times \frac{Γ ({\tilde{α}}_{i})}{Γ (α_{i})} \right), \tag{14}$$

where $Γ (\cdot)$ denotes the gamma function, and ${\tilde{α}}_{i}, {\tilde{β}}_{i}, G_{i}$ are defined in Proposition 1.
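For numerical work, (14) is usually evaluated on the log scale. The sketch below computes the logarithm of a single factor of (14), up to the additive constant hidden by the proportionality sign, using a log-determinant and log-gamma functions. The function name `log_lik_term`, the prior values, and the simulated data are hypothetical, and the purely linear kernel used for the second component is only one possible choice.

```python
import math
import numpy as np

def log_lik_term(K, y, alpha, beta):
    """Log of the i-th factor in (14), up to an additive constant."""
    n = len(y)
    G = K + np.eye(n)                               # G_i = K_i + I_n
    _, logdet = np.linalg.slogdet(G)
    alpha_t = alpha + n / 2.0                       # tilde alpha_i
    beta_t = beta + 0.5 * y @ np.linalg.solve(G, y) # tilde beta_i
    return (-0.5 * logdet
            + alpha * math.log(beta) - alpha_t * math.log(beta_t)
            + math.lgamma(alpha_t) - math.lgamma(alpha))

# Hypothetical two-variable data set.
rng = np.random.default_rng(0)
y1 = rng.standard_normal(30)
y2 = y1 ** 2 + 0.3 * rng.standard_normal(30)

K1 = np.zeros((30, 30))   # f_1 = 0, so the kernel matrix for i = 1 vanishes
K2 = np.outer(y1, y1)     # purely linear kernel (sigma_2 = 0), for illustration

# The log integrated likelihood is the sum of the N per-component terms.
log_p = log_lik_term(K1, y1, 2.0, 1.0) + log_lik_term(K2, y2, 2.0, 1.0)
```

Because the objective is a sum of independent terms, the hyperparameters entering each kernel and prior can be varied and the sum re-evaluated, which is what the optimization or sampling schemes described next would do repeatedly.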

Now denote by $p_{θ} (Y)$ the integrated likelihood $p (Y)$ computed based on a particular value $θ$ of the hyperparameters. There are two main possibilities for inference on $θ$ . First, an empirical Bayesian approach consists of estimating $θ$ by the value that maximizes $\log p_{θ} (Y)$ , and then regarding $θ$ as fixed and known. As $\log p_{θ} (Y)$ is a sum of N simple terms, it is straightforward to optimize this function using stochastic gradient ascent based on automatic differentiation. Second, we can carry out fully Bayesian inference by specifying a prior $p (θ)$ , and sampling $θ$ from its posterior distribution $p (θ | Y) \propto p_{θ} (Y) p (θ)$ using Metropolis-Hastings; subsequent inference then relies on these posterior draws.

For our numerical results, we employed the empirical Bayesian approach, because it is faster and preserves the closed-form map properties in Section 2.3. In exploratory numerical experiments, we observed no significant decrease in inferential accuracy relative to the fully Bayesian approach, likely due to working with a small number of hyperparameters in $θ$ .

3 Bayesian Transport Maps for Large Spatial Fields

Now assume that $y = {(y_{1}, \dots, y_{N})}^{⊤}$ consists of spatial observations or computer-model output at spatial locations $s_{1}, \dots, s_{N}$ in a region or domain $D \subset \mathbb{R}^{\mathrm{dim}}$. We assume Bayesian transport maps as in Section 2.1, with regressions of the form (5) in $(i - 1)$-dimensional space for $i = 1, \dots, N$. As $N$ is very large in many relevant applications, we will specify prior distributions of the form described in Section 2.2 that induce substantial regularization and sparsity, as a function of hyperparameters $θ = (θ_{σ, 1}, θ_{σ, 2}, θ_{d, 1}, θ_{d, 2}, θ_{γ}, θ_{q})$ to be introduced in Sections 3.2–3.4.

3.1 Maximin Ordering and Nearest Neighbors

A triangular map $T (y)$ as in (1) depends on the ordering of the variables $y_{1}, \dots, y_{N}$. We assume a maximum-minimum-distance (maximin) ordering of the corresponding locations $s_{1}, \dots, s_{N}$ (see Figure 2), in which we sequentially choose each location to maximize the minimum distance to all previously ordered locations. Specifically, the first index $i_{1}$ is chosen arbitrarily (e.g., $i_{1} = 1$), and then the subsequent indices are selected as $i_{j} = \arg\max_{i \notin I_{j}} \min_{k \in I_{j}} ‖ s_{i} - s_{k} ‖$ for $j = 2, \dots, N$, where $I_{j} = \{i_{1}, \dots, i_{j - 1}\}$. For notational simplicity, we assume throughout that $y = (y_{1}, \dots, y_{N})$ follows maximin ordering (i.e., $y_{j} = y_{i_{j}}$). Define $c_{i} (k)$ as the index of the $k$th nearest (previously ordered) neighbor of the $i$th location (and so $s_{c_{i} (1)}, \dots, s_{c_{i} (4)}$ are indicated by $\times$ in Figure 2).

Fig. 2 Maximin ordering (Section 3.1) for locations on a grid (small gray points) of size $N = 60 \times 60 = 3600$ on a unit square, ${[0, 1]}^{\mathrm{dim}}$ with $\mathrm{dim} = 2$. (a)–(c): The $i$th ordered location (+), the previous $i - 1$ locations ($\circ$), including the nearest $m = 4$ neighbors ($\times$) and the distance $l_{i}$ to the nearest neighbor (—). (d): For $i = 1, \dots, N$, the length scales (i.e., minimum distances) decay roughly as $l_{i} \propto i^{- 1 / \mathrm{dim}}$.

The maximin ordering can be interpreted as a multiresolution decomposition into coarse scales early in the ordering and fine scales later in the ordering. In particular, the minimal pairwise distance $l_{i} = ‖ s_{i} - s_{c_{i} (1)} ‖$ among the first $i$ locations of the ordering decays roughly as $l_{i} \propto i^{- 1 / \mathrm{dim}}$, where $\mathrm{dim}$ here is the dimension of the spatial domain (see Figure 2). As a result of the maximin ordering, the $i$th regression in (5) can be viewed as a spatial prediction at location $s_{i}$ based on data at locations $s_{1}, \dots, s_{i - 1}$ that lie roughly on a regular grid with distance (i.e., scale) $l_{i}$.
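The greedy construction of the maximin ordering and the nearest previously ordered neighbors can be sketched in pure Python as follows. This is a simple quadratic-time illustration, assuming Euclidean distance and a small hypothetical $8 \times 8$ grid; the function names are not from the article, and scalable implementations would use more efficient data structures.

```python
import math

def maximin_order(locs, first=0):
    """Greedy maximin ordering: repeatedly pick the location that is
    farthest from all locations already ordered.

    Returns the ordering and the length scales l_i (distance to the nearest
    previously ordered location; math.inf for the first location).
    """
    n = len(locs)
    order = [first]
    scales = [math.inf]
    chosen = {first}
    # min_dist[i]: distance from location i to the nearest ordered location
    min_dist = [math.dist(locs[i], locs[first]) for i in range(n)]
    while len(order) < n:
        nxt = max((i for i in range(n) if i not in chosen),
                  key=lambda i: min_dist[i])
        order.append(nxt)
        scales.append(min_dist[nxt])
        chosen.add(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(locs[i], locs[nxt]))
    return order, scales

def nearest_previous(locs, order, i, m):
    """Positions in the ordering of the m nearest previously ordered
    neighbors of the i-th ordered location, nearest first."""
    here = locs[order[i]]
    return sorted(range(i), key=lambda k: math.dist(here, locs[order[k]]))[:m]

# Hypothetical 8 x 8 grid on the unit square.
grid = [(a / 7, b / 7) for a in range(8) for b in range(8)]
order, scales = maximin_order(grid)
neighbors = nearest_previous(grid, order, i=20, m=4)
```

By construction, the length scales returned by `maximin_order` never increase along the ordering, which is the coarse-to-fine behavior described above.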

When the variables $y_{1}, \dots, y_{N}$ are not associated with spatial locations or when Euclidean distance between the locations is not meaningful (e.g., nonstationary, multivariate, spatio-temporal, or functional data), the maximin and neighbor ordering can be carried out based on other distance metrics, such as ${(1 - | \text{correlation} |)}^{1 / 2}$ based on some guess or estimate of the correlation between variables (Kang and Katzfuss 2023; Kidd and Katzfuss 2022).

3.2 Priors on the Conditional Non-Gaussianity $σ_{i}^{2}$

In (8), $σ_{i}^{2}$ determines the degree of nonlinearity in $f_{i}$; hence, $σ_{i}^{2}, \dots, σ_{N}^{2}$ together determine the conditional non-Gaussianity in the distribution of $y_{i : N}$ given $y_{1 : i - 1}$. A priori, we assume that the degree of nonlinearity decays polynomially with length scale $l_{i}$, namely $σ_{i}^{2} = e^{θ_{σ, 1}} l_{i}^{θ_{σ, 2}}$, which allows the conditional distributions of $y_{i : N}$ given $y_{1 : i - 1}$ to be increasingly Gaussian as $i$ increases, as a function of hyperparameters $θ_{σ, 1}, θ_{σ, 2}$.
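As a small worked illustration of this prior, the following sketch evaluates $σ_{i}^{2} = e^{θ_{σ, 1}} l_{i}^{θ_{σ, 2}}$ for length scales that decay as $i^{- 1 / \mathrm{dim}}$. The hyperparameter values are hypothetical and chosen only so that the result is easy to verify by hand.

```python
import math

def sigma2_prior(l_i, theta_sigma_1, theta_sigma_2):
    """sigma_i^2 = exp(theta_{sigma,1}) * l_i ** theta_{sigma,2}."""
    return math.exp(theta_sigma_1) * l_i ** theta_sigma_2

# Idealized length scales l_i = i^(-1/dim) for dim = 2.
dim = 2
scales = [i ** (-1.0 / dim) for i in range(1, 6)]

# Hypothetical hyperparameter values, for illustration only.
sigma2 = [sigma2_prior(l, theta_sigma_1=0.0, theta_sigma_2=2.0) for l in scales]
```

With these particular values, $σ_{i}^{2} = l_{i}^{2} = 1 / i$, so the prior degree of nonlinearity shrinks toward zero (and the map toward linearity) as the ordering proceeds to finer scales.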

This prior assumption is motivated by the behavior of stochastic processes with quasiquadratic loglikelihoods. A quasiquadratic loglikelihood of order $r$ is the sum of a quadratic leading-order term that depends on the $r$th derivatives of the process, and a nonquadratic term that may only depend on derivatives up to order $r - 1$. Gaussian smoothness priors (with quadratic loglikelihoods) such as the Matérn model (Whittle 1954, 1963) are closely related to linear elliptic PDEs. They can formally be thought of as having log-densities $- \langle u, A u \rangle / 2 - \langle u, b \rangle$ that are maximized by solutions of the linear equation $A u = - b$. Similarly, the maximizers of quasiquadratic log-densities $- \langle u, L (D^{r} u) / 2 \rangle - V (D^{r - 1} u, \dots, u)$ are solutions of quasilinear PDEs $L (D^{r} u) = - \frac{d}{d u} V (D^{r - 1} u, \dots, u)$. A wide range of physical phenomena is governed by quasilinear PDEs. For instance, the Cahn-Hilliard (Cahn and Hilliard 1958) and Allen-Cahn (Allen and Cahn 1972) equations describe phase separation in multi-component systems, the Navier-Stokes equation describes the dynamics of fluids, and the Föppl-von Kármán equations describe the large deformations of thin elastic plates. If the order of a data-generating quasilinear PDE is known, a Matérn model with matching regularity is a sensible choice to model the leading-order behavior. However, most of the time, the observations will not arise from a known PDE model, and so the above arguments primarily motivate us to expect screening and power laws, without quantifying their effects.

In its simplest form, the mechanism relating local conditioning and approximate Gaussianity is captured by the classical Poincaré inequality (Adams and Fournier 2003).

Lemma 1

(Poincaré inequality). Let $Ω \subset \mathbb{R}^{d}$ be a Lipschitz-bounded domain with diameter $l$, let $u$ and its first derivative be square-integrable, and let $u_{Ω} = \frac{1}{| Ω |} \int_{Ω} u (x) \, d x$ be the mean of $u$ over $Ω$. Then, we have

$$‖ u - u_{Ω} ‖_{L^{2} (Ω)} \leq l \, ‖ \nabla u ‖_{L^{2} (Ω)} . \tag{15}$$

The Poincaré inequality directly implies the following corollary.

Corollary 1.

Let $Ω$ be a Lipschitz-bounded domain. Let $τ$ be a partition of $Ω$ into Lipschitz-bounded subdomains with diameter upper-bounded by $l$, and assume that $u$, $v$ and their first derivatives are square-integrable and satisfy $\int_{t} (u - v) \, d x = 0$ for all $t \in τ$. Then,

$$‖ u - v ‖_{L^{2} (Ω)} \leq l \, ‖ \nabla u - \nabla v ‖_{L^{2} (Ω)} . \tag{16}$$

This means that after conditioning a stochastic process on averages over subdomains of diameter $l ≪ 1$, even a minor perturbation in $u$ results in a large change of $\nabla u$. For a quasiquadratic likelihood of order $r = 1$, a perturbation that effects even a minor change in the nonlinear part, which depends only on $u$, must therefore effect a major change in the leading-order quadratic term, assuming the curvature of the latter is bounded from below. Under suitable growth conditions on the nonlinear term, this means that the conditional density of a quasiquadratic likelihood of order $r = 1$ is dominated by the leading-order quadratic term as $l$ approaches zero. As a result, the conditional stochastic process is approximately Gaussian. This is illustrated in a numerical example in Figure 3. Additional details are provided in Appendix B, supplementary materials. Using generalizations of the Poincaré inequality to $r > 1$ and point-wise measurements (e.g., Schäfer, Sullivan, and Owhadi 2021, thm. 5.9(2)), the above argument can be extended to the setting of $r > 1$ and conditioning on point sets with distance $l$ (instead of local averages).

Fig. 3 Samples from the non-Gaussian process in (21) in Appendix B, supplementary materials feature regions of negative (dark) and positive (light) values (first panel). The distribution at a given location is a mixture of these two possibilities and thus non-Gaussian (second panel). By contrast, after conditioning on averages over regions of size $l = 2^{- 1}$ (third panel) or $l = 2^{- 5}$ (fourth panel), the conditional distribution is close to Gaussian, as these averages determine with high probability whether the location is in a positive or negative region. (See Appendix B, supplementary materials for details.)

3.3 Priors on the Conditional Variances $d_{i}^{2}$

As we have argued in Section 3.2, even non-Gaussian stochastic processes with quasiquadratic loglikelihoods exhibit conditional near-Gaussianity on fine scales. Thus, we will now describe prior assumptions for the $d_{i}^{2}$ and f_i in (5) that are motivated by the behavior of a transport map $T$ for a Gaussian target distribution with Matérn covariance (see Figure 4), which is a highly popular assumption in spatial statistics. The Matérn covariance function is also the Green’s function of an elliptic PDE (Whittle 1954, 1963).

Fig. 4 For a Gaussian process with exponential covariance on the grid and with the ordering from Figure 2, expressing the joint distribution $p (y)$ using a transport map as in (1)–(3) results in a series of regressions as in (5) with linear predictors, $f_{i} (y_{1 : i - 1}) = \sum_{k = 1}^{i - 1} y_{c_{i} (k)} b_{i, k}$, where $c_{i} (k)$ indicates the kth nearest (previously ordered) neighbor of the ith location. (For non-Gaussian $p (y)$, the functions f_i are nonlinear.) (a): For n = 100 simulations, the values of $y_{i}$ and its first and fifth nearest neighbor (NN) lie on a low-dimensional manifold; the regression plane (assuming all other variables to be fixed) indicates a stronger influence of the first NN (see the slope of the intersection of the regression plane with the front of the box) than of the fifth NN. (b): The conditional standard deviations decay as a function of the length scale $l_{i}$ (see Figure 2(d)). (c): The squared regression coefficients decay rapidly as a function of neighbor number k.

Schäfer, Sullivan, and Owhadi (2021, thm. 2.3) show that Gaussian processes with covariance functions given by the Green’s function of elliptic PDEs of order r have conditional variance of order $l_{i}^{2 r}$ when conditioned on the first i elements of the maximin ordering (see Figure 4(b)).

Hence, for the noise or conditional variances $d_{i}^{2} \sim \mathcal{IG} (α_{i}, β_{i})$ as in Section 2.2, we set $E (d_{i}^{2}) = β_{i} / (α_{i} - 1) = e^{θ_{d, 1}} l_{i}^{θ_{d, 2}}$. Assuming the prior standard deviation of $d_{i}^{2}$ to be equal to g times the mean, we obtain $α_{i} = 2 + 1 / g^{2}$ and $β_{i} = e^{θ_{d, 1}} l_{i}^{θ_{d, 2}} (1 + 1 / g^{2})$. For our numerical experiments, we chose g = 4 to obtain a relatively vague prior for the $d_{i}^{2}$.
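The moment matching above can be checked numerically. The following is a minimal sketch, assuming the standard inverse-Gamma moments (mean $β / (α - 1)$ and standard deviation $β / ((α - 1) \sqrt{α - 2})$); the function name and the example values of the hyperparameters are hypothetical.

```python
import numpy as np


def inverse_gamma_hyperparameters(length_scales, theta_d1, theta_d2, g=4.0):
    """Return (alpha_i, beta_i) such that the IG prior on d_i^2 has
    mean exp(theta_d1) * l_i**theta_d2 and standard deviation g times the mean."""
    prior_mean = np.exp(theta_d1) * np.asarray(length_scales, dtype=float) ** theta_d2
    alpha = np.full_like(prior_mean, 2.0 + 1.0 / g**2)
    beta = prior_mean * (1.0 + 1.0 / g**2)
    return alpha, beta


# Hypothetical values for illustration only.
alpha, beta = inverse_gamma_hyperparameters([0.5, 0.1], theta_d1=0.0, theta_d2=2.0)
mean = beta / (alpha - 1.0)
sd = beta / ((alpha - 1.0) * np.sqrt(alpha - 2.0))
print(mean, sd / mean)  # prior means and the ratio sd/mean (equal to g)
```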

3.4 Priors on the Regression Functions f_i

The regression functions $f_{i} : R^{i - 1} \to R$ in (5) were specified to be GPs in $(i - 1)$ -dimensional space in Section 2.2. For the covariance function in (8), we assume that $ρ_{i} (y_{1 : i - 1}, y_{1 : i - 1}^{'}) = ρ (h_{i} (y_{1 : i - 1}, y_{1 : i - 1}^{'}) / γ)$ , where $h_{i}^{2} (y_{1 : i - 1}, y_{1 : i - 1}^{'}) = {(y_{1 : i - 1} - y_{1 : i - 1}^{'})}^{⊤} Q_{i} (y_{1 : i - 1} - y_{1 : i - 1}^{'})$ , $γ = \exp (θ_{γ})$ is a range parameter, and ρ is an isotropic correlation function, taken to be Matérn with smoothness 1.5 for our numerical experiments.

To make this potentially high-dimensional regression feasible, we again use the example of a spatial GP with Matérn covariance to motivate regularization and sparsity via the relevance matrix $Q_{i} = \mathrm{diag} (q_{i, 1}^{2}, \dots, q_{i, i - 1}^{2})$. We assume that the relevance of the kth neighbor (see Section 3.1) decays exponentially as a function of k, such that $q_{i, c_{i} (k)}$ decays as $\exp (θ_{q} k)$. This type of behavior, often referred to as the screening effect (e.g., Stein 2011), is illustrated in Figure 4(c), and it has been exploited for covariance estimation of a Gaussian spatial field by Kidd and Katzfuss (2022). Recently, Schäfer, Sullivan, and Owhadi (2021) proved exponential rates of screening for Gaussian processes derived from elliptic boundary-value problems; following the discussion in Section 3.2, we expect similar conditional-independence phenomena to hold on the fine scales of processes with quasiquadratic loglikelihoods. As shown in Figure 9(b), we also observed this behavior for climate data.

Given this exponential decay as a function of the neighbor number k, the relevance will be essentially zero for sufficiently large k, and so we achieve sparsity by setting $q_{i, c_{i} (k)} = \begin{cases} \exp (θ_{q} k), & k \leq m, \\ 0, & k > m, \end{cases}$ (17) where the sparsity parameter $m = \max \{ k : \exp (θ_{q} k) \geq ε \}$ is determined by the data through the hyperparameter θ_q. We used $ε = 0.01$ for our numerical examples, which produced highly accurate inference and usually resulted in m < 10. Assumption (17) induces a sparse transport map, in that f_i (and thus $T_{i}$) depend on $y_{1 : i - 1}$ only through the m nearest neighbors $y_{c_{i} (1)}, \dots, y_{c_{i} (m)}$, where ρ_i is isotropic as a function of the scaled inputs $q_{i, c_{i} (k)} y_{c_{i} (k)}$. Sparsity in the transport map is equivalent to an assumption of ordered conditional independence. Similar ordered-conditional-independence assumptions are also popular for Vecchia approximations of Gaussian fields with parametric covariance functions.
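To make the thresholding in (17) concrete, the following sketch computes the relevances and the implied sparsity parameter m for a given θ_q. The function name is hypothetical, and the cap `m_max = 30` follows the example value used in Algorithm 1; a negative θ_q is assumed so that the relevances actually decay.

```python
import numpy as np


def relevances(theta_q, eps=0.01, m_max=30):
    """Return (q, m): relevances exp(theta_q * k) for k = 1..m_max, set to zero
    where they fall below eps, and the number m of neighbors that remain active."""
    k = np.arange(1, m_max + 1)
    q = np.exp(theta_q * k)
    q[q < eps] = 0.0
    m = int(np.count_nonzero(q))  # equals max{k : exp(theta_q * k) >= eps} for theta_q < 0
    return q, m


q, m = relevances(theta_q=-0.5)
print(m)         # number of active neighbors for this (hypothetical) theta_q
print(q[:m])     # their relevances
```

Because m depends on the data only through θ_q, changing θ_q during optimization changes the number of active neighbors, which is how the training data determine the degree of sparsity.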

Identifying the regression functions f_i in m-dimensional space is further aided by the data approximately concentrating on a lower-dimensional manifold due to the strong dependence between most $y_{c_{i} (k)}$ and $y_{c_{i} (l)}$ for small $k, l \leq m$ (e.g., see Figure 4(a)).

3.5 Inference

Algorithm 1:

Inference for the spatial transport map

1: Order $y_{1}, \dots, y_{N}$ in maximin ordering and compute scales $l_{i}$ and nearest-neighbor indices $c_{i} (1), \dots, c_{i} ( m_{\max} )$ (e.g., $m_{\max} = 30$ ) for each $i = 1, \dots, N$ (see Section 3.1)

2: Compute $\hat{θ} = \arg \max_{θ} \log p (Y)$ via stochastic gradient ascent, where $p (Y) \propto \prod_{i = 1}^{N} \left( | G_{i} |^{- 1 / 2} \times (β_{i}^{α_{i}} / {\tilde{β}}_{i}^{{\tilde{α}}_{i}}) \times Γ ({\tilde{α}}_{i}) / Γ (α_{i}) \right)$, with $θ = (θ_{σ, 1}, θ_{σ, 2}, θ_{d, 1}, θ_{d, 2}, θ_{γ}, θ_{q})$, ${\tilde{α}}_{i} = α_{i} + n / 2$, ${\tilde{β}}_{i} = β_{i} + y_{i}^{⊤} G_{i}^{- 1} y_{i} / 2$, $α_{i} = 2 + 1 / g^{2}$, $β_{i} = e^{θ_{d, 1}} l_{i}^{θ_{d, 2}} (1 + 1 / g^{2})$, $g = 4$, $G_{i} = {(C_{i} (y^{(j)}, y^{(l)}))}_{j, l = 1, \dots, n} / (e^{θ_{d, 1}} l_{i}^{θ_{d, 2}}) + I_{n}$, $C_{i} (y^{(j)}, y^{(l)}) = \sum_{k = 1}^{m} {\tilde{y}}_{c_{i} (k)}^{(j)} {\tilde{y}}_{c_{i} (k)}^{(l)} + σ_{i}^{2} ρ \left( {\left( \sum_{k = 1}^{m} {({\tilde{y}}_{c_{i} (k)}^{(j)} - {\tilde{y}}_{c_{i} (k)}^{(l)})}^{2} \right)}^{1 / 2} / γ \right)$, ${\tilde{y}}_{c_{i} (k)}^{(j)} = y_{c_{i} (k)}^{(j)} e^{θ_{q} k}$, $m = \max \{ k : e^{θ_{q} k} \geq 0.01 \}$, $σ_{i}^{2} = e^{θ_{σ, 1}} l_{i}^{θ_{σ, 2}}$, $γ = e^{θ_{γ}}$, and $ρ (x) = (1 + x \sqrt{3}) \exp (- x \sqrt{3})$

3: Use fitted map as desired. For example, generate a new sample $y^{⋆} = {\tilde{T}}_{\hat{θ}}^{- 1} (z^{⋆})$ using (12) based on $z^{⋆} \sim N_{N} (0, I_{N})$ .
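The objective in step 2 of Algorithm 1 is a product over i of terms that can be evaluated independently. The sketch below evaluates the logarithm of one such term with NumPy, following the definitions listed in step 2. It is an illustrative sketch rather than the authors' implementation: the function names, the argument layout (an n × m_max array whose kth column holds the kth nearest previously ordered neighbor), and the requirement that $l_{i}$ be finite and positive are assumptions, and no gradient or stochastic optimization is included.

```python
import math
import numpy as np


def matern15(x):
    """Matern correlation with smoothness 1.5: (1 + sqrt(3) x) exp(-sqrt(3) x)."""
    return (1.0 + math.sqrt(3.0) * x) * np.exp(-math.sqrt(3.0) * x)


def log_integrated_likelihood_term(y_i, neighbors, l_i, theta, g=4.0, eps=0.01):
    """Log of the ith factor of p(Y) in step 2 of Algorithm 1 (up to a constant).

    y_i: (n,) responses; neighbors: (n, m_max) neighbor values, nearest first;
    theta: (theta_sigma1, theta_sigma2, theta_d1, theta_d2, theta_gamma, theta_q).
    """
    th_s1, th_s2, th_d1, th_d2, th_gam, th_q = theta
    y_i = np.asarray(y_i, dtype=float)
    neighbors = np.asarray(neighbors, dtype=float)
    n = y_i.shape[0]

    # Relevance weights and sparsity parameter m.
    weights = np.exp(th_q * np.arange(1, neighbors.shape[1] + 1))
    m = int(np.count_nonzero(weights >= eps))
    y_tilde = neighbors[:, :m] * weights[:m]

    # Prior quantities that depend on the length scale l_i.
    prior_mean_d = math.exp(th_d1) * l_i**th_d2
    alpha = 2.0 + 1.0 / g**2
    beta = prior_mean_d * (1.0 + 1.0 / g**2)
    sigma2 = math.exp(th_s1) * l_i**th_s2
    gamma = math.exp(th_gam)

    # Linear kernel plus Matern kernel on the scaled inputs.
    diff = y_tilde[:, None, :] - y_tilde[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    C = y_tilde @ y_tilde.T + sigma2 * matern15(dist / gamma)
    G = C / prior_mean_d + np.eye(n)

    # Cholesky factor gives both log|G| and the quadratic form y' G^{-1} y.
    chol = np.linalg.cholesky(G)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    quad = np.sum(np.linalg.solve(chol, y_i) ** 2)

    alpha_post = alpha + n / 2.0
    beta_post = beta + quad / 2.0
    return (-0.5 * log_det + alpha * math.log(beta) - alpha_post * math.log(beta_post)
            + math.lgamma(alpha_post) - math.lgamma(alpha))
```

Summing this quantity over i gives $\log p (Y)$ up to an additive constant; since the terms do not interact, they can be computed in parallel.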

Based on the prior distributions in Sections 3.2–3.4, we can carry out inference and compute the transport map as in Section 2.3. The prior distributions depend on a vector of hyperparameters, $θ = (θ_{σ, 1}, θ_{σ, 2}, θ_{d, 1}, θ_{d, 2}, θ_{γ}, θ_{q})$. When making inference on $θ$ as described in Section 2.4, we effectively let the training data Y decide the degree of sparsity (through θ_q via m) and the degree of nonlinearity (through $θ_{σ, 1}, θ_{σ, 2}$ via σ_i). Algorithm 1 summarizes the inference procedure. Figure 5 illustrates estimation of transport-map components in a simulated example.

Fig. 5 Simulation from a nonlinear map with sine structure in f_i, described as NR900 in Section 5. For n = 100 and i = 80, y_i versus its first and second nearest neighbor (NN): true f_i (a), observations $y_{i}$ (b), together with linear (c) and nonlinear (d) fit (i.e., posterior means) of f_i, with further variables in $y_{1 : i - 1}$ held at their mean levels. The linear map in (c) is estimated under the restriction $σ_{i} = 0$ . In (d), we have a nonlinear regression in 79-dimensional space, with m = 5 active variables in the estimated (via $θ$ ) nonlinear model.

Due to the sparsity assumption in (17), the computational complexity is lower than in Section 2.3; specifically, determining ${\tilde{T}}_{i}$ now only requires $O (n^{3} + m n^{2})$ time, again in parallel for each $i = 1, \dots, N$. Each application of the transport map or its inverse then requires $O (N (n^{2} + m n))$ time. The maximin ordering and nearest neighbors can also be computed in quasilinear time in N (Schäfer, Katzfuss, and Owhadi 2021, Alg. 7).
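For intuition about step 1 of Algorithm 1, the following is a naive $O (N^{2})$ sketch of a maximin (farthest-point) ordering; it is not the quasilinear algorithm cited above. Several details are assumptions made for illustration, since Section 3.1 is not reproduced here: the ordering starts from the location closest to the centroid, the scale $l_{i}$ is taken to be the distance from the ith ordered location to the nearest previously ordered location, and the first scale is set to infinity.

```python
import numpy as np


def maximin_ordering(locations):
    """Greedy maximin ordering: each new location maximizes the minimum
    distance to the locations already selected. Returns (order, scales)."""
    locs = np.asarray(locations, dtype=float)
    n = locs.shape[0]
    # Assumption: start from the location closest to the centroid.
    first = int(np.argmin(np.linalg.norm(locs - locs.mean(axis=0), axis=1)))
    order = [first]
    scales = [np.inf]  # no previously ordered location for the first point
    min_dist = np.linalg.norm(locs - locs[first], axis=1)
    min_dist[first] = -np.inf  # mark as already selected
    for _ in range(1, n):
        nxt = int(np.argmax(min_dist))
        order.append(nxt)
        scales.append(min_dist[nxt])
        # Update each location's distance to its nearest selected location.
        min_dist = np.minimum(min_dist, np.linalg.norm(locs - locs[nxt], axis=1))
        min_dist[nxt] = -np.inf
    return np.array(order), np.array(scales)


# Usage on a small regular grid on the unit square.
side = np.linspace(0.0, 1.0, 5)
grid = np.array([(a, b) for a in side for b in side])
order, scales = maximin_ordering(grid)
print(order[:5], scales[:5])
```

By construction, the scales after the first are non-increasing, which matches the coarse-to-fine interpretation of the ordering used throughout this section.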

In Section 2.3, we discussed using $\tilde{T}$ in (9) to transform the non-Gaussian $y^{⋆}$ to standard Gaussian map coefficients $z^{⋆} = \tilde{T} (y^{⋆})$. This concept, which is illustrated in Figure 6, is especially interesting in our spatial setting. Due to the maximin ordering (Figure 2), the scales $l_{i}$ are arranged in decreasing order, and in our prior the $d_{i}^{2}$ also follow a decreasing stochastic order with $E (d_{i}^{2}) = e^{θ_{d, 1}} l_{i}^{θ_{d, 2}}$ (see, e.g., Figure 4(b)). Thus, we can view the map components as a form of nonlinear principal components (NPCs), with the map coefficients as the corresponding component scores. For Gaussian processes with covariance functions given by the Green’s function of elliptic PDEs, similar to the Matérn family, it can be shown that these principal components based on the maximin ordering are approximately optimal (Schäfer, Sullivan, and Owhadi 2021). For example, as illustrated in Appendix F, supplementary materials, these NPCs can be used for dimension reduction by only storing or modeling the first k, say, map coefficients $z_{1 : k}^{⋆} = {(z_{1}^{⋆}, \dots, z_{k}^{⋆})}^{⊤}$. Note that if we set $z_{k + 1 : N}^{⋆} = 0$, we assume $y_{i}^{⋆} = {\hat{f}}_{i} (y_{1 : i - 1}^{⋆})$ for i > k, which overestimates dependence and underestimates variability; hence, it is preferable to draw $z_{k + 1 : N}^{⋆} \sim N (0, I)$. In addition to reducing storage, we can also use this approach for conditional simulation (Marzouk et al. 2016, Lemma 1), in which we fix the large-scale features of an observed field by fixing the first k map coefficients (see Figure 12 for an illustration). To model a time series of spatial fields, we could assume a linear vector autoregressive model for the NPCs, such that the map coefficients at time t + 1, say $z_{1 : k}^{(t + 1)}$, linearly depend on $z_{1 : k}^{(t)}$. When it is of interest to regress some response on a spatial field, one could also use the first k map coefficients of the field as the covariates, similar to the use of functional principal component scores in regression.

Fig. 6 Illustration of map coefficients $z^{⋆} = \tilde{T} (y^{⋆})$ (see Sections 2.3 and 3.5) for the simulated NR900 data using $\tilde{T}$ inferred from n = 100 training data. (a)–(b): The N = 900 map coefficients corresponding to one test sample are roughly iid Gaussian. (c): For 1000 test samples, $y_{80}^{⋆}$ versus first and second NNs (see Figure 5). (d): When averaging pairs of two map-coefficient vectors in reference space and transforming back to the original space using (12), the sinusoidal relationship between $y_{80}^{⋆}$ and its NNs is preserved in the resulting 500 averages. (e): When averaging test samples directly in original space, the nonlinear structure is lost.

4 Non-Gaussian Errors

So far, we have focused on nonlinear, non-Gaussian dependence structures. The model described in Sections 2 and 3 assumes Gaussian errors in the regressions (5), which implies a marginal Gaussian distribution for y₁, the first variable in the maximin ordering. If this does not hold at least approximately, extensions based on additional marginal (i.e., pointwise) transformations, especially of the first few variables in the ordering, are straightforward. For example, assume that the model from Sections 2 and 3 holds for y, but that we actually observe $\tilde{y} = G (y)$ such that ${\tilde{y}}_{i} = g_{i} (y_{i})$ . If the g_i are one-to-one differentiable functions, the resulting posterior map is a simple extension of that in Proposition 1. The g_i can be pre-determined (see Section 6 for an example with a log transform) or may depend on $θ$ and thus be inferred based on a minor modification of the integrated likelihood in Proposition 2.

To increase flexibility of the marginal distributions, the GP errors $ϵ_{i}^{(j)}$ can be modeled using Bayesian nonparametrics for all $i = 1, \dots, N$. More precisely, we will use Dirichlet process mixtures (DPMs). In (5), we now assume that $f_{i} (\cdot) \sim \mathcal{GP} (0, C_{i})$, and the $ϵ_{i}^{(j)}$ are distributed according to a DPM for $j = 1, \dots, n$: $ϵ_{i}^{(j)} \mid μ_{i}^{(j)}, d_{i}^{(j)} \sim \mathcal{N} (μ_{i}^{(j)}, {(d_{i}^{(j)})}^{2}), \quad (μ_{i}^{(j)}, {(d_{i}^{(j)})}^{2}) \mid F_{i} \sim F_{i}, \quad F_{i} \sim \mathrm{DP} (\mathrm{NIG} (ξ_{i}, η_{i}, α_{i}, β_{i}), ζ_{i}),$ (18) where ζ_i is the concentration parameter, and the base measure $\mathrm{NIG} (ξ_{i}, η_{i}, α_{i}, β_{i})$ is a normal-inverse-Gamma distribution with density $p (x, y) = η_{i}^{1 / 2} {(2 π y)}^{- 1 / 2} \frac{β_{i}^{α_{i}}}{Γ (α_{i})} y^{- α_{i} - 1} \exp (- (2 β_{i} + η_{i} {(x - ξ_{i})}^{2}) / (2 y)),$ where we assume $ξ_{i} = 0$. The degree of non-Gaussianity allowed for the $ϵ_{i}^{(j)}$ is determined by η_i and ζ_i. A small value of ζ_i concentrates the Dirichlet process near the NIG base measure, for which a large value of η_i shrinks the $μ_{i}^{(j)}$ toward zero. Thus, in the limit as $ζ_{i} \to 0$ and $η_{i} \to \infty$, we obtain a model similar to that in Section 2.2 (except that here the $d_{i}^{(j)}$ do not appear in the variance of the GP f_i). Conversely, for large ζ_i (or large n), the posterior of $ϵ_{i}^{(j)}$ will be a Gaussian mixture that may differ substantially from the posterior implied by the model in Section 2.2.
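To illustrate how ζ_i and η_i control the error distribution in (18), the following sketch draws n errors from the DPM prior for a single i using the standard Pólya-urn representation of the Dirichlet process. It only simulates from the prior and is unrelated to the posterior sampler in Appendix C; the function name and the example parameter values are hypothetical.

```python
import numpy as np


def sample_dpm_errors(n, zeta, eta, alpha, beta, xi=0.0, rng=None):
    """Draw n errors from the DPM prior in (18) via a Polya urn.

    Returns (errors, labels), where labels[j] indexes the (mu, d^2) atom
    shared by all errors in the same cluster."""
    rng = np.random.default_rng() if rng is None else rng
    mus, d2s = [], []
    labels = np.empty(n, dtype=int)
    for j in range(n):
        # A new atom is drawn from the base measure with probability zeta / (zeta + j).
        if rng.random() < zeta / (zeta + j):
            d2 = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta)  # d^2 ~ IG(alpha, beta)
            mu = rng.normal(xi, np.sqrt(d2 / eta))               # mu | d^2 ~ N(xi, d^2 / eta)
            mus.append(mu)
            d2s.append(d2)
            labels[j] = len(mus) - 1
        else:
            # Otherwise reuse the atom of a uniformly chosen earlier draw.
            labels[j] = labels[rng.integers(j)]
    mus, d2s = np.array(mus), np.array(d2s)
    errors = rng.normal(mus[labels], np.sqrt(d2s[labels]))
    return errors, labels


rng = np.random.default_rng(1)
for zeta in (1e-6, 1.0, 100.0):
    _, labels = sample_dpm_errors(200, zeta, eta=1.0, alpha=2.0625, beta=1.0, rng=rng)
    print(zeta, len(np.unique(labels)))  # number of mixture components in use
```

With a very small ζ_i, all errors share a single atom, so they are Gaussian given that atom; larger values of ζ_i produce more mixture components.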

For the spatial setting with maximin ordering of Section 3, we can again find a sparse parameterization in terms of hyperparameters $θ = (θ_{σ, 1}, θ_{σ, 2}, θ_{d, 1}, θ_{d, 2}, θ_{γ}, θ_{q}, θ_{ζ, 1}, θ_{ζ, 2}, θ_{η, 1}, θ_{η, 2})$ . We parameterize the α_i, β_i, C_i in terms of the first six hyperparameters as in Sections 3.2–3.4. For the concentration parameter $ζ_{i} = e^{θ_{ζ, 1}} l_{i}^{θ_{ζ, 2}}$ , we allow increasing shrinkage toward Gaussianity for increasing i. We similarly set $η_{i} = e^{θ_{η, 1}} l_{i}^{θ_{η, 2}}$ . For this DPM model, we take a fully Bayesian perspective and assume an improper uniform prior for $θ$ over $R^{10}$ .

The resulting model is fully nonparametric with the exception of the additivity assumption in (3). Specifically, due to the nonparametric nature of the DPM, the universal approximation property of GPs (Micchelli, Xu, and Zhang 2006), and nonzero prior probability for the dense (non-sparse) transport map, the posterior distribution obtained using this model contracts (for $n \to \infty$ and fixed N) to the Kullback-Leibler (KL) projection of the actual distribution of y onto the space of distributions that can be described by a transport map whose components are additive in the ith argument as in (3), due to the KL optimality of the Knothe-Rosenblatt map (Marzouk et al. 2016, sec. 4.1). In other words, as the number of replicates increases, the learned distribution gets as close as possible to the truth under the additivity restriction.

Inference for our DPM model cannot be carried out in closed form anymore and instead relies on a Metropolis-within-Gibbs Markov chain Monte Carlo (MCMC) sampler. We can also compute and draw samples from the posterior predictive distribution $p (y^{⋆} | Y) = \prod_{i = 1}^{N} p (y_{i}^{⋆} | y_{1 : i - 1}^{⋆}, Y),$ for which each $p (y_{i}^{⋆} | y_{1 : i - 1}^{⋆}, Y)$ is approximated as a Gaussian mixture based on the MCMC output. Details for the MCMC procedure and the posterior predictive distribution are given in Appendix C, supplementary materials.

In the spatial setting with sparsity parameter m, each MCMC iteration still has time complexity $O (N (n^{3} + n^{2} m))$ and the computations within each iteration are highly parallel; however, the actual computational cost for this sampler is much higher (typically, roughly two orders of magnitude higher) than for the empirical Bayes approach in Section 2.4 due to the large number of MCMC iterations required. Because of this larger computational expense and the loss of a closed-form transport map for the DPM model, we recommend the empirical Bayes approach (potentially after a pre-transformation $G$ as described above) as the first option in most large-scale applications; the DPM model is most useful for settings in which its computational expense is not crucial, the training size n is sufficiently large to discern non-Gaussian error structure, and only posterior sampling (as opposed to other functions that transport maps can provide) is of interest.

5 Simulation Study

We compared the following methods:

nonlin: Our method with Bayesian uncertainty quantification described in Section 3.

S-nonlin: Simplified version of nonlin ignoring uncertainty in the f_i and d_i, as in (13).

linear: Same as nonlin, but forcing $θ_{σ, 1} = - \infty$ and hence linear f_i.

S-linear: Simplified version of linear ignoring uncertainty in the f_i and d_i as in (13), which results in a joint Gaussian posterior predictive distribution and is similar to the approach proposed and used in numerical comparisons in Kidd and Katzfuss (2022).

DPM: The model with Dirichlet process mixture residuals described in Section 4.

MatCov: Gaussian with zero mean and isotropic Matérn covariance, whose three hyperparameters are inferred via maximum likelihood estimation.

tapSamp: Gaussian with a covariance matrix given by the sample covariance tapered (i.e., element-wise multiplied) by an exponential correlation matrix with range equal to the maximum pairwise distance among the locations.

autoFRK: resolution-adaptive automatic fixed rank kriging (Tzeng and Huang 2018; Tzeng et al. 2021) with approximately $\sqrt{N}$ basis functions.

local: a locally parametric method for climate data (Wiens 2021) that fits anisotropic Matérn covariances in local windows and combines the local fits into a global model.

We also compared to a VAE (Kingma and Welling 2014) and a GAN designed for climate-model output (Besombes et al. 2021), but these deep-learning methods were not competitive in our simulation settings or for the climate data in Section 6 (see Appendix G, supplementary materials).

We considered four simulation scenarios, for which samples are illustrated in the top row of Figure 7, consisting of a Gaussian distribution with an exponential covariance and three non-Gaussian extensions thereof. All scenarios can be characterized via transport maps as in Section 2.1, with d_i as given by a Gaussian with exponential covariance in the form (3):

Fig. 7 Top row: Simulated spatial fields for four simulation scenarios described in Section 5. Bottom row: Corresponding comparisons of KL divergence as a function of ensemble size n (on a log scale) for different methods. The KL divergences for tapSamp in (a)–(d) and for S-nonlin in (b)–(c) were too high and are not visible. DPM is only included in (d), while local is omitted from (c) because it was created for regular grids.

LR900: Linear map (i.e., a Gaussian distribution) with components $f_{i}^{L} (y_{1 : i - 1}) = \sum_{k = 1}^{i - 1} b_{i, k} y_{c_{i} (k)}$ , where the $b_{i, k}$ are based on an exponential covariance with unit variance and range parameter 0.3 on a Regular grid of size $N = 30 \times 30 = 900$ on the unit square.

NR900: Nonlinear extension of LR900 by a sine function of a weighted sum of the nearest two neighbors: $f_{i}^{NL} (y_{1 : i - 1}) = f_{i}^{L} (y_{1 : i - 1}) + 2 \sin (4 (b_{i, 1} y_{c_{i} (1)} + b_{i, 2} y_{c_{i} (2)}))$ (see Figure 5)

NI3600: Same as NR900, but at $N = 3600$ Irregularly spaced locations sampled uniformly at random

NR900B: Same as NR900, but with a Bimodal distribution for the ϵ_i in (5): $ϵ_{i} | μ_{i}, d_{i} \sim N (μ_{i}, d_{i}^{2})$ with μ_i sampled from $\{ - 3.5 d_{i}, 3.5 d_{i} \}$ with equal probability

For computational simplicity, each (true) f_i was assumed to only depend on the nearest 30 previously ordered neighbors, but this gives a highly accurate approximation of a “full” exponential covariance in the LR900 case, as the true fields exhibit strong screening due to being based on the same maximin ordering as our methods. A further ordering-invariant simulation scenario is considered in Appendix D, supplementary materials.

We compared the accuracy of the methods via the Kullback-Leibler (KL) divergence, $E (\log p_{0} (y)) - E (\log p (y | Y)),$ (19) between the true distribution $p_{0} (y)$ and the inferred distribution $p (y | Y)$ implied by the posterior map (see (18) in Appendix A, supplementary materials), where the expectations are taken with respect to the true distribution. We approximated the expectations by averaging over 50 simulated test fields $y^{⋆}$, and so the resulting KL divergence is the difference of the log-scores (e.g., Gneiting and Katzfuss 2014) of the true and inferred distributions.
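The Monte Carlo approximation of (19) can be sketched as follows. In this toy example, both the true and the inferred distributions are zero-mean Gaussians with exponential covariances on a one-dimensional grid, so that the log densities are available in closed form; the function names, the grid, and the two range parameters are assumptions for illustration and do not correspond to any of the compared methods.

```python
import numpy as np


def exponential_covariance(points, range_par):
    """Exponential covariance with unit variance on one-dimensional points."""
    dist = np.abs(points[:, None] - points[None, :])
    return np.exp(-dist / range_par)


def gaussian_logpdf(fields, cov):
    """Log density of N(0, cov) evaluated at each row of fields."""
    chol = np.linalg.cholesky(cov)
    whitened = np.linalg.solve(chol, fields.T)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    dim = fields.shape[1]
    return -0.5 * (np.sum(whitened**2, axis=0) + log_det + dim * np.log(2.0 * np.pi))


def kl_from_log_scores(logp_true, logp_fit):
    """Estimate (19) as the difference of average log densities on test fields."""
    return np.mean(logp_true) - np.mean(logp_fit)


rng = np.random.default_rng(0)
points = np.linspace(0.0, 1.0, 10)
cov_true = exponential_covariance(points, 0.3)
cov_fit = exponential_covariance(points, 0.6)   # deliberately misspecified
test_fields = rng.multivariate_normal(np.zeros(10), cov_true, size=50)
kl_hat = kl_from_log_scores(gaussian_logpdf(test_fields, cov_true),
                            gaussian_logpdf(test_fields, cov_fit))
print(kl_hat)
```

With only 50 test fields, as in the study, the estimate is noisy; because the same test fields enter both averages, much of that noise is shared when comparing methods.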

The results are shown in Figure 7. Whenever nonlinear structure was not discernible from the data (because the true map was linear or because the ensemble size n was too small), nonlin performed similarly to linear and hence did not suffer due to its over-flexibility. For larger ensemble size and nonlinear truths, nonlin at times far outperformed linear. S-linear and S-nonlin were generally less accurate than their counterparts with uncertainty quantification; in the linear LR900 setting, this was only an issue for small ensemble size, but S-nonlin performed extremely poorly when the nonlinear structure was clearly apparent in the data, likely due to overfitting without accounting for uncertainty. tapSamp and autoFRK performed uniformly worst. As MatCov (with smoothness 0.5) is the true model for LR900, it was almost exact in that scenario. The other three scenarios are extensions of a Matérn GP, and so MatCov also performed well for n < 20 or so. The local Matérn method was less accurate than MatCov for LR900 and NR900 but performed well for NR900B. For simulation scenarios that deviate more strongly from a Matérn GP, nonlin was uniformly more accurate than MatCov and local (see Appendix D, supplementary materials).

Estimating $θ$ via stochastic gradient ascent with three epochs and fitting the map based on n = 20 samples took less than 7 sec for the scenarios with N = 900 and less than 44 sec for the larger NI3600 scenario for nonlin, linear, S-linear, and S-nonlin on a single core on a laptop (2.5 GHz Intel Core i7 with 16GB RAM); DPM required a total of around 16 min for 500 MCMC iterations for NR900B.

6 Climate-Data Application

An important application of our methods is the analysis and emulation of output from climate models. Climate models are essentially large sets of computer code describing the behavior of the Earth system (e.g., the atmosphere) via systems of differential equations. Much time and resources have been spent on developing these models, and enormous computational power is required to produce ensembles (i.e., solve the differential equations for different starting conditions) on fine latitude-longitude grids for various scenarios of greenhouse-gas emissions. Of the large amount of data and output that have been generated, only a small fraction has been fully explored or analyzed (e.g., Benestad et al. 2017). Stochastic weather generators infer the distribution of one or more variables, so that relevant summaries or additional samples can be computed more cheaply than via more runs of the computer model.

We considered log-transformed total precipitation rate (in m/s) on a roughly $1^{\circ}$ longitude-latitude global grid of size $N = 288 \times 192 = 55{,}296$ in the middle of the Northern summer (July 1) in 98 consecutive years (the number of years contained in one NetCDF data file), starting in the year 402, from the Community Earth System Model (CESM) Large Ensemble Project (Kay et al. 2015). We obtained precipitation anomalies by standardizing the data at each grid location to mean zero and variance one, shown in Figure 8. For our methods, we used chordal distance to compute the maximin ordering and nearest neighbors.
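The two preprocessing steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the chordal distance is computed on a unit sphere, the standardization uses the sample standard deviation across replicates, and the dense distance matrix is only practical for a modest number of locations (far fewer than the full global grid).

```python
import numpy as np


def chordal_distance_matrix(lat_deg, lon_deg):
    """Pairwise chordal (straight-line) distances between points on a unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Convert latitude/longitude to three-dimensional Cartesian coordinates.
    xyz = np.column_stack([np.cos(lat) * np.cos(lon),
                           np.cos(lat) * np.sin(lon),
                           np.sin(lat)])
    diff = xyz[:, None, :] - xyz[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))


def standardize_anomalies(fields):
    """Standardize each grid location (column) to mean zero and variance one.

    fields: array of shape (number of replicates, number of locations)."""
    fields = np.asarray(fields, dtype=float)
    return (fields - fields.mean(axis=0)) / fields.std(axis=0, ddof=1)


# Usage with three locations and synthetic data (not climate-model output).
dist = chordal_distance_matrix([0.0, 0.0, 90.0], [0.0, 180.0, 0.0])
print(dist)
rng = np.random.default_rng(0)
anomalies = standardize_anomalies(np.log(rng.gamma(2.0, size=(98, 3))))
print(anomalies.mean(axis=0), anomalies.var(axis=0, ddof=1))
```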

Fig. 8 Two members of an ensemble of log-transformed precipitation anomalies produced by a climate model, on a global grid of size $N = 288 \times 192 = 55{,}296$. We want to infer the underlying N-dimensional distribution based on an ensemble of n < 100 training samples.

For ease of comparison and illustration, we first considered a smaller grid of size $N = 37 \times 74 = 2738$ in a subregion covering large parts of the Americas ($45^{\circ}$ S to $45^{\circ}$ N and $130^{\circ}$ W to $30^{\circ}$ W) that contains ocean, land, and mountains. As shown in Figure 9, the precipitation anomalies exhibited features similar to those of our simulated data in Figure 4, with regression data concentrating on lower-dimensional manifolds and weights decaying rapidly as a function of neighbor number.

Fig. 9 The precipitation anomalies (in the Americas subregion) have similar properties as the Gaussian distribution with exponential covariance in Figure 4: (a) Our approach can be viewed as N regressions as in (5) of each y_i on ordered nearest neighbors (NNs), with the regression data on low-dimensional manifolds. (b) For linear regressions with $f_{i} (y_{1 : i - 1}) = \sum_{k = 1}^{i - 1} y_{c_{i} (k)} b_{i, k}$ fitted via Lasso, the squared (estimated) regression coefficients decay rapidly as a function of neighbor number k.

For comparing the methods from Section 5 on the precipitation anomalies, computing the KL divergence as in (19) was not possible, as the true distribution $p_{0} (y)$ was unknown. Hence, we compared the methods using various training data sizes n in terms of log-scores, which approximate the KL divergence up to an additive constant; specifically, these log-scores consist of the second part of (19), $- E (\log p (y | Y))$ , with the expectation approximated by averaging over 18 test replicates and over five random training/test splits.

The comparison for the Americas subregion is shown in Figure 10(a). (A prediction comparison for partially observed test data, provided in Appendix E, supplementary materials, produced similar results.) nonlin outperformed linear, and DPM was even more accurate than nonlin for large n, indicating that the precipitation anomalies exhibit joint and marginal non-Gaussian features. As in Section 5, S-linear and S-nonlin performed poorly due to ignoring uncertainty in the estimated map. local performed similarly to linear but was less accurate than nonlin and DPM for all n. VAE, MatCov, tapSamp, and autoFRK were not competitive for this dataset.

Fig. 10 For precipitation anomalies, comparison of log-score (LS; equal to KL divergence up to an additive constant) for estimated joint distribution as a function of ensemble size n: (a) Americas subregion; S-nonlin, tapSamp, and autoFRK are not shown because their LS were too high. (b) LS for linear and nonlin for precipitation anomalies on the global grid.

We also considered the map coefficients $z^{⋆} = \tilde{T} (y^{⋆})$ discussed in Sections 2.3 and 3.5, using the map obtained by fitting nonlin to the first n = 97 replicates as training data. In Figure 11(a), the map coefficients for a held-out test field appeared roughly iid standard Gaussian, with sample autocorrelations near zero (not shown). Figure 11 illustrates that the map coefficients offer similar properties for non-Gaussian fields as principal-component scores do for Gaussian settings. For example, the medians of the posterior distributions of the d_i (see (17) in Appendix A, supplementary materials) decreased rapidly as a function of i, which means that the map coefficients early in the maximin ordering captured much more (nonlinear) variation than later-ordered coefficients (see, e.g., (12) and (13)). Further, we computed the map coefficients for all 98 replicates for July 2–30 (still based on the posterior map trained on July 1 data), and the lag-1 autocorrelation over time between map coefficients also decreased with i. Specifically, while most of the first 100 autocorrelations were greater than 0.2, many later autocorrelations were negligible; this indicates that a spatio-temporal analysis could proceed by fitting a simple (linear) autoregressive model over time to only the first k, say, map coefficients, while treating the remaining N – k coefficients as independent over time. As shown in Appendix F, supplementary materials, the nonlinear map coefficients strongly outperformed standard linear principal components in terms of dimension reduction and reconstruction of the Americas climate fields.

Fig. 11 Properties of the map coefficients $z^{⋆} = \tilde{T} (y^{⋆})$ for the precipitation anomalies on the grid of size N = 2738 in the Americas subregion. (a): The map coefficients corresponding to the test field in Figure 12(a) in the original data ordering (first by longitude, then latitude) appeared roughly iid standard Gaussian, aside from slightly heavy tails. (b): The posterior medians of the d_i decreased rapidly as a function of i, meaning that the first few map coefficients captured much more variation than later-ordered coefficients. (c): The autocorrelation between consecutive days also decreased with i; while most were greater than 0.2 for i < 100, many autocorrelations for i > 100 were negligible.

To demonstrate scalability to large datasets, we compared linear and nonlin on the entire global precipitation anomaly fields of size $N = 288 \times 192 = 55{,}296$. As shown in , nonlin outperformed linear even more decisively than for the Americas subregion. Even in the largest and most accurate setting (n = 80), the estimated $θ$ for nonlin implied m = 9, meaning that the corresponding transport maps were extremely sparse and hence computationally efficient; estimating $θ$ (4 epochs) and fitting the map for nonlin took only around 6 min on a single core on a laptop (2.5 GHz Intel Core i7 with 16GB RAM) for n = 10. In contrast, MatCov and local (which already took about two hours for the much smaller Americas region) were too computationally demanding for the global data. A Vecchia approximation of MatCov resulted in a log-score above +77,000 and was thus not competitive. Also, for nonlin all but 113 of the $N = 55{,}296$ posterior medians of the $d_i$ were more than 20 times smaller than the largest posterior median (i.e., that of $d_1$), indicating that our approach could be used for massive dimension reduction without losing too much information.
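The dimension-reduction criterion in the last sentence amounts to a simple count. The snippet below is a minimal sketch, assuming the posterior medians are available as a one-dimensional array; the function name `n_dominant` and the geometrically decaying values are hypothetical and are not the fitted values from the climate data.

```python
import numpy as np

def n_dominant(d_median, factor=20.0):
    """Count entries that are at most `factor` times smaller than the largest entry."""
    d_median = np.asarray(d_median, dtype=float)
    return int(np.sum(d_median * factor >= d_median.max()))

# Hypothetical, rapidly decaying posterior medians (illustration only).
d_median = 0.9 ** np.arange(1000)
kept = n_dominant(d_median)   # number of coefficients one might retain
```

Applied to the fitted global map, the same count would correspond to the 113 coefficients mentioned above.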

Finally, the fitted map (or rather, its inverse ${\tilde{T}}^{- 1}$ ) can also be viewed as a stochastic emulator of the climate model. Specifically, we can produce a new precipitation-anomaly sample by drawing $z^{*} \sim N_{N} (0, I_{N})$ and then computing $y^{*} = {\tilde{T}}^{- 1} (z^{*})$ . One such sample (for the full global grid) is shown in Figure 12(d) and appears qualitatively similar to the model output in Figure 12(a); while producing the latter requires a supercomputer, the former can be generated in a few seconds on a laptop. Further, our approach can also be used to draw conditional samples, in which we fix the first i, say, map coefficients, for example at the values corresponding to a given spatial field. Such draws, which maintain the large-scale features in the held-out (98th) test field but allow for newly sampled fine-scale features, are shown in Figure 12(b) and (c). This is related to the supervised conditional sampling ideas in Kovachki et al. (2020), with their inputs given by our first i ordered test observations.
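The sampling recipe above can be illustrated with a toy triangular map. This is a sketch under strong assumptions: the class `ToyTriangularMap`, its `tanh` conditional mean, and the scales `d` are hypothetical stand-ins for the fitted posterior map $\tilde{T}$, which is GP-based and not reproduced here. Only the triangular structure, the sequential inversion, and the idea of fixing the first i coefficients carry over.

```python
import numpy as np

class ToyTriangularMap:
    """Toy map z_i = (y_i - f_i(y_{<i})) / d_i with a hypothetical nonlinear f_i."""

    def __init__(self, d, weights):
        self.d = np.asarray(d, dtype=float)        # positive scales d_i
        self.w = np.asarray(weights, dtype=float)  # strictly lower-triangular weights

    def _f(self, i, y_prev):
        # Hypothetical conditional mean, nonlinear in the previously ordered entries.
        return np.tanh(self.w[i, :i] @ y_prev) if i > 0 else 0.0

    def forward(self, y):
        """Map a field y to its coefficients z, one component at a time."""
        z = np.empty_like(y, dtype=float)
        for i in range(y.size):
            z[i] = (y[i] - self._f(i, y[:i])) / self.d[i]
        return z

    def inverse(self, z):
        """Invert sequentially: y_i depends only on z_i and the already computed y_{<i}."""
        y = np.empty_like(z, dtype=float)
        for i in range(z.size):
            y[i] = self._f(i, y[:i]) + self.d[i] * z[i]
        return y

def conditional_sample(tmap, y_star, i_fixed, rng):
    """Keep the first i_fixed coefficients of y_star, redraw the rest, and invert."""
    z = tmap.forward(y_star)
    z[i_fixed:] = rng.standard_normal(z.size - i_fixed)
    return tmap.inverse(z)

rng = np.random.default_rng(1)
N = 6
tmap = ToyTriangularMap(d=0.8 ** np.arange(N),
                        weights=np.tril(rng.standard_normal((N, N)), k=-1))

z_new = rng.standard_normal(N)
y_new = tmap.inverse(z_new)                      # unconditional draw
y_star = tmap.inverse(rng.standard_normal(N))    # stand-in for a held-out field
y_cond = conditional_sample(tmap, y_star, i_fixed=3, rng=rng)
```

Because the map is triangular, the first three entries of `y_cond` coincide with those of `y_star`, while the later entries are newly sampled.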

Fig. 12 For the global climate data ( $N = 55{,}296$ ), we fitted a stochastic emulator using nonlin based on n = 97 training replicates. Given a held-out test field $y^{⋆}$ in (a), we show conditional simulations based on fixing the first i map coefficients in $z^{⋆} = \tilde{T} (y^{⋆})$ . (b): Only differs in some fine-scale features from (a). (c): Some large-scale features from (a) are preserved. (d): Unconditional simulation (i.e., independent from (a)).

7 Conclusions

We have developed a Bayesian approach to inferring a non-Gaussian target distribution via a transport map from the target to a standard normal distribution. The components of the map are modeled using Gaussian processes. For the distribution of spatial fields, we have developed specific prior assumptions that result in sparse maps and thus scalability to high dimensions. Instead of manually or iteratively expanding a finite-dimensional parameterization of the transport map, our Bayesian approach probabilistically regularizes the map; the resulting approach is flexible and nonparametric, but guards against overfitting and quantifies uncertainty in the estimation of the map. Because our method can be fitted rapidly, is fully automated, and was highly accurate in our numerical comparisons, we recommend it for most spatial emulation tasks, except for applications in which very few replicates are available or for which exploratory analyses have shown that a (Gaussian) parametric approach can provide a good fit. In addition, due to conjugate priors and the resulting closed-form expressions for the posterior map and its inverse, our approach also allows us to convert non-Gaussian data into iid Gaussian map coefficients, which can be thought of as a nonlinear extension of principal components.

As our approach essentially turns estimation of a high-dimensional joint distribution into a series of GP regressions, it is straightforward to include additional covariates and examine their nonlinear, non-Gaussian effect on the distribution. Shrinkage toward a joint Gaussian distribution with a parametric covariance function could be achieved by assuming the mean for the GP regressions to be the one implied by a Vecchia approximation of that covariance function (Kidd and Katzfuss 2022); this could enable meaningful predictions at unobserved spatial locations (see Appendix E, supplementary materials). Extensions to more complicated input domains (e.g., space-time) could be obtained using correlation-based ordering (Section 3.1). Another major avenue of future work would be to use the inferred distribution as the prior of a latent field, which we then update to obtain a posterior given noisy observations; among numerous other applications, this would enable the use of our technique to infer the forecast distribution and account for uncertainty in ensemble-based data assimilation (Boyles and Katzfuss 2021), leading to nonlinear updates for non-Gaussian applications. We are currently pursuing multiple extensions and applications of our methods to climate science, including climate-change detection and attribution, climate-model calibration, and climate-model emulation and interpolation in covariate space (e.g., as a function of CO₂ emissions).

Supplementary Materials

Appendices A–G contain proofs, a discussion of conditional near-Gaussianity for quasiquadratic loglikelihoods, details on the Gibbs sampler for the Dirichlet process mixture model, and additional numerical results and comparisons.

Acknowledgments

We would like to thank Joe Guinness and several reviewers for helpful comments. We are especially grateful to Jian Cao, who wrote a Python implementation, produced timing results, and obtained GAN and VAE results, and to Trevor Harris, who created the VAE implementation for our numerical comparisons.

Disclosure Statement

The authors report there are no competing interests to declare.

Additional information

Funding

Katzfuss was partially supported by National Science Foundation (NSF) grants DMS–1654083, DMS–1953005, and CCF–1934904, and by NASA’s Advanced Information Systems Technology Program (AIST-21). Schäfer gratefully acknowledges support by the Air Force Office of Scientific Research under award number FA9550-18-1-0271, and the Office of Naval Research under award N00014-18-1-2363.

References

Adams, R. A., and Fournier, J. J. F. (2003), Sobolev Spaces, volume 140 of Pure and Applied Mathematics (2nd ed.), Amsterdam: Elsevier/Academic Press.
Allen, S. M., and Cahn, J. W. (1972), “Ground State Structures in Ordered Binary Alloys with Second Neighbor Interactions,” Acta Metallurgica, 20, 423–433. DOI: 10.1016/0001-6160(72)90037-5.
Arjovsky, M., and Bottou, L. (2017), “Towards Principled Methods for Training Generative Adversarial Networks,” in International Conference on Learning Representations.
Ayala, A., Drazic, C., Hutchinson, B., Kravitz, B., and Tebaldi, C. (2021), “Loosely Conditioned Emulation of Global Climate Models with Generative Adversarial Networks,” arXiv:2105.06386.
Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2004), Hierarchical Modeling and Analysis for Spatial Data, London: Chapman & Hall.
Baptista, R., Zahm, O., and Marzouk, Y. (2020), “An Adaptive Transport Framework for Joint and Conditional Density Estimation,” arXiv preprint arXiv:2009.10303.
Benestad, R., Sillmann, J., Thorarinsdottir, T. L., Guttorp, P., Mesquita, M. D., Tye, M. R., Uotila, P., Maule, C. F., Thejll, P., Drews, M., and Parding, K. M. (2017), “New Vigour Involving Statisticians to Overcome Ensemble Fatigue,” Nature Climate Change, 7, 697–703. DOI: 10.1038/nclimate3393.
Besombes, C., Pannekoucke, O., Lapeyre, C., Sanderson, B., and Thual, O. (2021), “Producing Realistic Climate Data with Generative Adversarial Networks,” Nonlinear Processes in Geophysics, 28, 347–370. DOI: 10.5194/npg-28-347-2021.
Bigoni, D., Spantini, A., and Marzouk, Y. M. (2016), “Adaptive Construction of Measure Transports for Bayesian Inference,” in NIPS 2016 Workshop on Advances in Approximate Bayesian Inference.
Bolin, D., and Wallin, J. (2020), “Multivariate Type G Matérn Stochastic Partial Differential Equation Random Fields,” Journal of the Royal Statistical Society, Series B, 82, 215–239. DOI: 10.1111/rssb.12351.
Boyles, W., and Katzfuss, M. (2021), “Ensemble Kalman Filter Updates Based On Regularized Sparse Inverse Cholesky Factors,” Monthly Weather Review, 149, 2231–2238. DOI: 10.1175/MWR-D-20-0299.1.
Cahn, J. W., and Hilliard, J. E. (1958), “Free Energy of a Nonuniform System. I. Interfacial Free Energy,” The Journal of Chemical Physics, 28, 258–267. DOI: 10.1063/1.1744102.
Carlier, G., Galichon, A., and Santambrogio, F. (2009), “From Knothe’s Transport to Brenier’s Map and a Continuation Method for Optimal Transport,” SIAM Journal on Mathematical Analysis, 41, 2554–2576. DOI: 10.1137/080740647.
Castruccio, S., McInerney, D. J., Stein, M. L., Crouch, F. L., Jacob, R. L., and Moyer, E. J. (2014), “Statistical Emulation of Climate Model Projections based on Precomputed GCM Runs,” Journal of Climate, 27, 1829–1844. DOI: 10.1175/JCLI-D-13-00099.1.
Choi, I. K., Li, B., and Wang, X. (2013), “Nonparametric Estimation of Spatial and Space-Time Covariance Function,” Journal of Agricultural, Biological, and Environmental Statistics, 18, 611–630. DOI: 10.1007/s13253-013-0152-z.
Cressie, N. (1993), Statistics for Spatial Data (rev. ed.), New York: Wiley.
Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016), “Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets,” Journal of the American Statistical Association, 111, 800–812. DOI: 10.1080/01621459.2015.1044091.
El Moselhy, T. A., and Marzouk, Y. M. (2012), “Bayesian Inference with Optimal Maps,” Journal of Computational Physics, 231, 7815–7850. DOI: 10.1016/j.jcp.2012.07.022.
Errico, R. M., Yang, R., Privé, N. C., Tai, K. S., Todling, R., Sienkiewicz, M. E., and Guo, J. (2013), “Development and Validation of Observing-System Simulation Experiments at NASA’s Global Modeling and Assimilation Office,” Quarterly Journal of the Royal Meteorological Society, 139, 1162–1178. DOI: 10.1002/qj.2027.
Gelfand, A. E., and Schliep, E. M. (2016), “Spatial Statistics and Gaussian Processes: A Beautiful Marriage,” Spatial Statistics, 18, 86–104. DOI: 10.1016/j.spasta.2016.03.006.
Gneiting, T., and Katzfuss, M. (2014), “Probabilistic Forecasting,” Annual Review of Statistics and Its Application, 1, 125–151. DOI: 10.1146/annurev-statistics-062713-085831.
Goodfellow, I., Bengio, Y., and Courville, A. (2016), Deep Learning, Cambridge, MA: MIT Press.
Gräler, B. (2014), “Modelling Skewed Spatial Random Fields through the Spatial Vine Copula,” Spatial Statistics, 10, 87–102. DOI: 10.1016/j.spasta.2014.01.001.
Haugen, M. A., Stein, M. L., Sriver, R. L., and Moyer, E. J. (2019), “Future Climate Emulations Using Quantile Regressions on Large Ensembles,” Advances in Statistical Climatology, Meteorology, and Oceanography, 5, 37–55. DOI: 10.5194/ascmo-5-37-2019.
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. (2017), “Deep Learning Scaling is Predictable, Empirically,” arXiv:1712.00409.
Houtekamer, P. L., and Zhang, F. (2016), “Review of the Ensemble Kalman Filter for Atmospheric Data Assimilation,” Monthly Weather Review, 144, 4489–4532. DOI: 10.1175/MWR-D-15-0440.1.
Huang, C., Hsing, T., and Cressie, N. (2011), “Nonparametric Estimation of the Variogram and its Spectrum,” Biometrika, 98, 775–789. DOI: 10.1093/biomet/asr056.
Kang, M., and Katzfuss, M. (2023), “Correlation-based Sparse Inverse Cholesky Factorization for Fast Gaussian-Process Inference,” Statistics and Computing, 33, 1–17. DOI: 10.1007/s11222-023-10231-5.
Kashinath, K., Mustafa, M., Albert, A., Wu, J. L., Jiang, C., Esmaeilzadeh, S., Azizzadenesheli, K., Wang, R., Chattopadhyay, A., Singh, A., Manepalli, A., Chirila, D., Yu, R., Walters, R., White, B., Xiao, H., Tchelepi, H. A., Marcus, P., Anandkumar, A., Hassanzadeh, P., and Prabhat (2021), “Physics-Informed Machine Learning: Case Studies for Weather and Climate Modelling,” Philosophical Transactions of the Royal Society A, 379, 20200093. DOI: 10.1098/rsta.2020.0093.
Katzfuss, M., and Guinness, J. (2021), “A General Framework for Vecchia Approximations of Gaussian Processes,” Statistical Science, 36, 124–141. DOI: 10.1214/19-STS755.
Katzfuss, M., Stroud, J. R., and Wikle, C. K. (2016), “Understanding the Ensemble Kalman Filter,” The American Statistician, 70, 350–357. DOI: 10.1080/00031305.2016.1141709.
Kay, J. E., Deser, C., Phillips, A., Mai, A., Hannay, C., Strand, G., Arblaster, J. M., Bates, S. C., Danabasoglu, G., Edwards, J., Holland, M., Kushner, P., Lamarque, J. F., Lawrence, D., Lindsay, K., Middleton, A., Munoz, E., Neale, R., Oleson, K., Polvani, L., and Vertenstein, M. (2015), “The Community Earth System Model (CESM) Large Ensemble Project: A Community Resource for Studying Climate Change in the Presence of Internal Climate Variability,” Bulletin of the American Meteorological Society, 96, 1333–1349. DOI: 10.1175/BAMS-D-13-00255.1.
Kidd, B., and Katzfuss, M. (2022), “Bayesian Nonstationary and Nonparametric Covariance Estimation for Large Spatial Data,” (with Discussion), Bayesian Analysis, 17, 291–351. DOI: 10.1214/21-BA1273.
Kingma, D. P., and Welling, M. (2014), “Auto-Encoding Variational Bayes,” in 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings.
Kobyzev, I., Prince, S., and Brubaker, M. (2020), “Normalizing Flows: An Introduction and Review of Current Methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 3964–3979. DOI: 10.1109/TPAMI.2020.2992934.
Kovachki, N. B., Hosseini, B., Baptista, R., and Marzouk, Y. M. (2020), “Conditional Sampling with Monotone GANs,” arXiv:2006.06755.
Krupskii, P., Huser, R., and Genton, M. G. (2018), “Factor Copula Models for Replicated Spatial Data,” Journal of the American Statistical Association, 113, 467–479. DOI: 10.1080/01621459.2016.1261712.
Marzouk, Y. M., Moselhy, T., Parno, M., and Spantini, A. (2016), “Sampling via Measure Transport: An Introduction,” in Handbook of Uncertainty Quantification, eds. R. Ghanem, D. Higdon, and H. Owhadi, pp. 785–825, Cham: Springer.
Mescheder, L., Geiger, A., and Nowozin, S. (2018), “Which Training Methods for GANs do Actually Converge?” in International Conference on Machine Learning, pp. 3481–3490.
Micchelli, C. A., Xu, Y., and Zhang, H. (2006), “Universal Kernels,” Journal of Machine Learning Research, 7, 2651–2667.
Ng, T. L. J., and Zammit-Mangion, A. (2022), “Spherical Poisson Point Process Intensity Function Modeling and Estimation with Measure Transport,” Spatial Statistics, 50, 100629. DOI: 10.1016/j.spasta.2022.100629.
Ng, T. L. J., and Zammit-Mangion, A. (2023), “Mixture Modeling with Normalizing Flows for Spherical Density Estimation,” arXiv:2301.06404.
Nychka, D. W., Hammerling, D. M., Krock, M., and Wiens, A. (2018), “Modeling and Emulation of Nonstationary Gaussian Fields,” Spatial Statistics, 28, 21–38. DOI: 10.1016/j.spasta.2018.08.006.
Parno, M., Moselhy, T., and Marzouk, Y. (2016), “A Multiscale Strategy for Bayesian Inference Using Transport Maps,” SIAM/ASA Journal on Uncertainty Quantification, 4, 1160–1190. DOI: 10.1137/15M1032478.
Porcu, E., Bissiri, P. G., Tagle, F., and Quintana, F. (2021), “Nonparametric Bayesian Modeling and Estimation of Spatial Correlation Functions for Global Data,” Bayesian Analysis, 16, 845–873. DOI: 10.1214/20-BA1228.
Risser, M. D. (2016), “Review: Nonstationary Spatial Modeling, with Emphasis on Process Convolution and Covariate-Driven Approaches,” arXiv:1610.02447.
Rosenblatt, M. (1952), “Remarks on a Multivariate Transformation,” The Annals of Mathematical Statistics, 23, 470–472. DOI: 10.1214/aoms/1177729394.
Rouhiainen, A., Giri, U., and Münchmeyer, M. (2021), “Normalizing Flows for Random Fields in Cosmology,” arXiv:2105.12024.
Schäfer, F., Katzfuss, M., and Owhadi, H. (2021), “Sparse Cholesky Factorization by Kullback-Leibler Minimization,” SIAM Journal on Scientific Computing, 43, A2019–A2046. DOI: 10.1137/20M1336254.
Schäfer, F., Sullivan, T. J., and Owhadi, H. (2021), “Compression, Inversion, and Approximate PCA of Dense Kernel Matrices at Near-Linear Computational Complexity,” Multiscale Modeling & Simulation, 19, 688–730. DOI: 10.1137/19M129526X.
Spantini, A., Bigoni, D., and Marzouk, Y. M. (2018), “Inference via Low-Dimensional Couplings,” Journal of Machine Learning Research, 19, 1–71.
Stein, M. L. (2011), “2010 Rietz Lecture: When Does the Screening Effect Hold?” The Annals of Statistics, 39, 2795–2819. DOI: 10.1214/11-AOS909.
Stein, M. L., Chi, Z., and Welty, L. (2004), “Approximating Likelihoods for Large Spatial Data Sets,” Journal of the Royal Statistical Society, Series B, 66, 275–296. DOI: 10.1046/j.1369-7412.2003.05512.x.
Tzeng, S. L., and Huang, H.-C. (2018), “Resolution Adaptive Fixed Rank Kriging,” Technometrics, 60, 198–208. DOI: 10.1080/00401706.2017.1345701.
Tzeng, S. L., Huang, H.-C., Wang, W.-T., Nychka, D. W., and Gillespie, C. (2021), “autoFRK: Automatic Fixed Rank Kriging.”
Vecchia, A. (1988), “Estimation and Model Identification for Continuous Spatial Processes,” Journal of the Royal Statistical Society, Series B, 50, 297–312. DOI: 10.1111/j.2517-6161.1988.tb01729.x.
Villani, C. (2009), Optimal Transport: Old and New, Berlin: Springer.
Wallin, J., and Bolin, D. (2015), “Geostatistical Modelling Using non-Gaussian Matérn Fields,” Scandinavian Journal of Statistics, 42, 872–890. DOI: 10.1111/sjos.12141.
Whittle, P. (1954), “On Stationary Processes in the Plane,” Biometrika, 41, 434–449. DOI: 10.1093/biomet/41.3-4.434.
Whittle, P. (1963), “Stochastic Processes in Several Dimensions,” Bulletin of the International Statistical Institute, 40, 974–994.
Wiens, A. (2021), “Nonstationary Covariance Modeling for Gaussian Processes and Gaussian Markov Random Fields,” available at https://github.com/ashtonwiens/nonstationary.
Wiens, A., Nychka, D. W., and Kleiber, W. (2020), “Modeling Spatial Data Using Local Likelihood Estimation and a Matérn to Spatial Autoregressive Translation,” Environmetrics, 31, 1–15. DOI: 10.1002/env.2652.
Xu, G., and Genton, M. G. (2017), “Tukey g-and-h Random Fields,” Journal of the American Statistical Association, 112, 1236–1249. DOI: 10.1080/01621459.2016.1205501.
