564
Views
0
CrossRef citations to date
0
Altmetric
Applications and Case Studies

Efficient Stochastic Generators with Spherical Harmonic Transformation for High-Resolution Global Climate Simulations from CESM2-LENS2

, &
Received 03 Oct 2023, Accepted 22 May 2024, Published online: 08 Jul 2024

Abstract

Earth system models (ESMs) are fundamental for understanding Earth’s complex climate system. However, the computational demands and storage requirements of ESM simulations limit their utility. For the newly published CESM2-LENS2 data, which suffer from this issue, we propose a novel stochastic generator (SG) as a practical complement to the CESM2, capable of rapidly producing emulations closely mirroring training simulations. Our SG leverages the spherical harmonic transformation (SHT) to shift from spatial to spectral domains, enabling efficient low-rank approximations that significantly reduce computational and storage costs. By accounting for axial symmetry and retaining distinct ranks for land and ocean regions, our SG captures intricate nonstationary spatial dependencies. Additionally, a modified Tukey g-and-h (TGH) transformation accommodates non-Gaussianity in high-temporal-resolution data. We apply the proposed SG to generate emulations for surface temperature simulations from the CESM2-LENS2 data across various scales, marking the first attempt of reproducing daily data. These emulations are then meticulously validated against training simulations. This work offers a promising complementary pathway for efficient climate modeling and analysis while overcoming computational and storage limitations. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

1 Introduction

Earth system models (ESMs) are complex mathematical equations that describe and simulate the transformation and interaction of energy and materials within the Earth’s climate system. Their development is grounded in the fundamental laws of physics, fluid dynamics, and chemistry, necessitating the identification and quantification of physical processes represented through mathematical equations (National Oceanic and Atmospheric Administration (NOAA) Citation2022). ESMs are indispensable tools in both scientific research and government policymaking, enabling a deeper understanding and more accurate predictions of Earth’s systems. For example, the Intergovernmental Panel on Climate Change (IPCC) used datasets from the Coupled Model Intercomparison Project’s fifth (CMIP5) and sixth (CMIP6) phases (Taylor, Stouffer, and Meehl Citation2012; Eyring et al. Citation2016) to compile their fifth and sixth synthesis assessment reports on climate change.

Despite their significance, ESMs have limitations as approximations of the Earth’s complex system. They are sensitive to changes in model inputs, such as physics parameters and emission scenarios, which need being assessed. Additionally, distinguishing between model errors and the effects of internal variability, which arises from atmospheric, oceanic, land, and cryospheric processes and their interactions, is challenging with a limited number of simulations (Kay et al. Citation2015). This underscores the need for multiple climate simulations to comprehensively characterize Earth’s system. The Community Earth System Model (CESM, Kay et al. Citation2015) is one such model designed to study climate change while accounting for internal climate variability.

Running ESMs is a time-consuming process, even with the support of powerful supercomputers. Scientists partition the planet into thousands of three-dimensional grids. Within each grid cell, they specify variables and conditions, employ computers to solve equations, propagate results to neighboring cells, and then iterate the above procedure with the updated variables and conditions. The whole process is repeated through time steps of different scales, which is the temporal resolution. The spatial resolution of the model is determined by the size of the grid cell, with smaller sizes yielding higher resolution but requiring more time. Generating multiple climate simulations demands weeks and even months of computational resources, exclusively available to several research institutes globally (Huang et al. Citation2023). Furthermore, while modern computational power allows for the generation of climate simulations, storing them is constrained by technological limitations, resource availability, and cost considerations. For instance, the storage requirement for CMIP5 data exceeds 2.5 Petabytes, incurring significant expenses for institutions like the National Center for Atmospheric Research (NCAR), a prominent U.S. research center for Earth Systems science. Consequently, budget constraints may necessitate the abandonment of partial data or higher resolutions (Guinness and Hammerling Citation2018).

Emulators play a crucial role in providing rapid approximations for computationally demanding model outputs and serve as effective surrogates for the original model. Specifically, emulators are statistical models that undergo proper training using a set of existing simulations and are then employed to approximate simulations for unexplored inputs. Over the past few decades, emulators have found widespread application, including assessing the sensitivity of various physical parameters (Oakley and O’Hagan Citation2002, 2004) and performing model calibration (Kennedy and O’Hagan Citation2001; Chang et al. Citation2014). Within the climate research community, emulators have been extensively developed to evaluate the influence of physics parameters (Sacks, Schiller, and Welch Citation1989), emission scenarios (Castruccio and Stein Citation2013; Castruccio et al. Citation2014), and the internal variability (Jeong et al. Citation2019; Hu and Castruccio Citation2021) on climate model outputs. These emulators enable efficient exploration of model behavior and facilitate a deeper understanding of complex climate systems.

Stochastic generators (SGs, Jeong et al. Citation2018), as a type of emulator for quantifying internal variability uncertainty, align closely with our requirements. SGs are spatio-temporal models adeptly tailored with a limited number of climate simulations, capable of generating numerous annual and monthly simulations across extensive and high-resolution domains at any moment (Jeong et al. Citation2018, Citation2019; Castruccio et al. Citation2019; Tagle et al. Citation2020), yet daily scale remains elusive to this date. Their implementation is affordable and even efficient, albeit at the expense of detailed physical mechanisms. SGs require the storage of only model parameters, enabling them to generate simulations that capture the general behavior of the original climate simulations. It is essential to distinguish SGs from compression methods (Baker et al. Citation2017; Underwood et al. Citation2022), which effectively reduce data volume using compression algorithms. While compression methods store and recover compressed data, they are designed for reconstructing individual data rather than generating new simulations. Moreover, the storage requirements for compression methods tend to increase with additional simulations.

SGs typically use Gaussian processes (GPs) and auto-regressive models to capture the intricate spatio-temporal dependencies among climate simulations driven by underlying physical mechanisms. While SGs offer several advantages, they face notable challenges when emulating ESMs. First, global climate simulations, which need the technique of SGs, involve extensive high-resolution data. Inference with GPs on such data can be computationally prohibitive, often requiring time on the order of data size cubed. Addressing this computational challenge can draw inspiration from existing techniques such as low-rank approximations (Cressie and Johannesson Citation2008; Banerjee et al. Citation2008), composite likelihoods (Vecchia Citation1988; Guinness Citation2021), stochastic partial differential equations (Rue, Martino, and Chopin Citation2009; Bolin, Simas, and Xiong Citation2023), covariance tapering (Kaufman, Schervish, and Nychka Citation2008) and their combinations (Datta et al. Citation2016). Second, global climate simulations frequently exhibit nonstationary spatial dependence. The covariance structure can vary significantly with latitude while may remain relatively stationary along the longitude, which is called axial symmetry (Jones Citation1963) and has been noticed and discussed by Stein (Citation2007), Stein (2008) and Jun and Stein (Citation2008). Additionally, the smoothness and variance of data can change between land and ocean at the same latitude (Castruccio and Guinness Citation2017; Jeong et al. Citation2018). A well-defined covariance function based on geodesic distance rather than chordal distance is needed to model the nonstationary spatial dependence structure. The use of chordal distance may cause physically unrealistic distortions (Jeong, Jun, and Genton Citation2017). Lastly, global climate simulations with high temporal resolutions, such as monthly and daily scales, may deviate from Gaussian assumptions when studying temporal evolution. To address this, the Tukey g-and-h (TGH) transformation has been employed to Gaussianize the data, striking a balance between model flexibility and simplicity (Yan and Genton Citation2019; Jeong et al. Citation2019).

For global surface temperature simulations from the CESM2-LENS2 data, which are newly published in the climate community and face the above-mentioned challenges, we propose a novel SG. Our SG serves as a pragmatic complement to CESM2, offering the capability to rapidly generate an unlimited number of emulations closely resembling simulations. Central to our method is the utilization of the spherical harmonic transformation (SHT)—the well-known spherical counterpart of the Fourier transformation, allowing us to efficiently transition global simulations from the spatial to the spectral domain. Moreover, it unlocks a practical low-rank approximation strategy, using spherical harmonics as available multi-resolution basis functions that are naturally suitable for global data. The modeling in the spectral domain ushers in significant reductions in both computational and storage costs. By judiciously retaining distinct ranks for land and ocean regions and applying a general assumption of axial symmetry, our SG captures the intricate nonstationary spatial dependence among different latitudes and between land and ocean regions. Furthermore, the employment of a modified TGH transformation equips our SG to effectively model the temporal dependence of simulations, accommodating the non-Gaussianity inherent in high-temporal-resolution data. In our case study, we apply these principles to construct SGs and generate emulations for the surface temperature simulations, spanning annual, monthly, and even daily scales. We meticulously illustrate and compare the generated emulations against the training simulations, both through visual inspection and quantitative metrics. Notably, our work marks a significant advancement by being the first to generate, display, and evaluate emulations at a daily scale.

The remainder of the article is structured as follows. In Section 2, we provide an overview of temperature simulations across annual, monthly, and daily scales. Section 3 delves into the intricate procedures involved in constructing an SG using the SHT and generating emulations. The generated emulations, along with comparisons to the training simulations, are showcased in Section 4. Finally, Section 5 encapsulates our conclusions and outlines avenues for future research.

2 CESM2-LENS2 Data

The Community Earth System Model version 2 Large Ensembles (CESM2-LENS2), as detailed in Rodgers et al. (Citation2021), is a recently released extensive ensemble of climate change projections. It originates from the CESM2 and serves as a valuable resource for studying the sensitivity of internal climate fluctuations to greenhouse warming. This ensemble comprises 100 individual members, operating at a spatial resolution of 0.9375°×1.25° (latitude × longitude), spanning the temporal domain from 1850 to 2100, and generated through a combination of diverse oceanic and atmospheric initial states. CESM2-LENS2 draws upon CMIP6 historical forcing, spanning the period from 1850 to 2014, and integrates the Shared Socioeconomic Pathways 370 (SSP370) future radiative forcing (RF, an influential factor in climate system describing the difference between incoming and outgoing sun radiation) from 2015 to 2100. The SSP scenarios, introduced in response to the sixth IPCC report, encompass a range of socioeconomic developments and atmospheric greenhouse gas concentration pathways (O’Neill et al. Citation2014; Riahi et al. Citation2017). Notably, SSP370 represents a medium-to-high emission scenario, projecting an additional RF of 7 Wm2 by the year 2100 under the social development pathway of regional rivalry.

The choice of CESM2-LENS2 in our study is motivated by several factors. First, it provides comprehensive and credible global climate simulations at high resolutions, which often pose storage challenges. Second, its ample ensemble size allows for effective training and evaluation of our SG. Third, despite generating considerable interest within the climate community (Kay et al. Citation2022; Muñoz et al. Citation2023; Dong, Peings, and Magnusdottir Citation2023), CESM2-LENS2 has yet to gain widespread attention in the field of statistics. This article marks the first introduction of an SG, extending to daily scales, to complement CESM2-LENS2. Both CESM2-LENS2 and SSP370 RF are readily accessible from the NCAR and Zenodo websites, respectively. Our analysis focuses on global surface temperatures for the period 2015–2100, encompassing all 192×288=55,296 grid points in space. Existing literature (Jeong et al. Citation2018, Citation2019; Huang et al. Citation2023) has demonstrated that fewer than ten ensembles are sufficient to construct an efficient SG for a climate variable. Therefore, we select training simulations from a subset of 20 members micro-initialized from the year 1251, with each member created through random perturbations of the atmospheric potential temperature field. These members are assumed to be independent from each other. We aggregate the ensembles at annual, monthly, and daily scales, despite their original availability at a three-hourly resolution. Consequently, each annual, monthly, and daily aggregated temperature simulation comprises approximately 4.75 million (192×288×86), 57.07 million, and 1.73 billion data points, respectively.

Let yt(r)(Li,lj) denote the temperature in Celsius at latitude Li, longitude lj, time point t after year 2014, and ensemble r, where i=1,,I (I = 192), j=1,,J (J = 288), t=1,,T, and r=1,,R. The value of T varies depending on the temporal resolution chosen, with possible values of 86, 1032, or 31,390. Similarly, let xt represent the RF at time t for all ensembles, as they are generated with the same external forcing. We use members 11 to 17 (out of 20) to illustrate certain data characteristics. The justification for the number of ensembles will be further discussed in , using an index for goodness-of-fit. Let yA,t(r)(Li,lj) be the annually aggregated temperature at grid point (Li, lj), year t+2014 and ensemble r. Consequently, the ensemble mean and standard deviation at (Li, lj) and t+2014 are denoted as y¯A,t(Li,lj)=R1r=1RyA,t(r)(Li,lj) and yA,tsd(Li,lj)={R1r=1R(yA,t(r)(Li,lj)y¯A,t(Li,lj))2}1/2, respectively. depict maps of y¯A,9(Li,lj) and yA,9sd(Li,lj). The maps align with the intuitive understanding that temperature generally decreases with increasing latitude. The surface temperatures in the Himalayas and Andes ranges are lower than in surrounding regions. Numerical instabilities occur near the two pole regions. Additionally, the surface temperature at the tropical Pacific Ocean exhibits larger variation across ensembles due to the climate phenomenon called El Niño–Southern Oscillation.

Fig. 1 Illustration of simulations with different scales. (a) and (b) are maps of the ensemble mean and standard deviation of the annually aggregated temperature in the year 2023. Two black “×” marks indicate grid points located at coordinates (40.05,43.75) and (30.63,170.00), designating points on land (GL) and ocean (GO), respectively. (c) shows annual temperature time series at GL and GO, where each ensemble is represented by a distinct color. (d) shows the monthly temperature time series at GL and GO, where the black curves are ensemble means. (e)–(g) are histograms of annual, monthly and daily temperature simulations at GL and GO, which have been detrended by ensemble means. Only data of years 2020, 2040, 2060, 2080, and 2100 are used in daily data.

Fig. 1 Illustration of simulations with different scales. (a) and (b) are maps of the ensemble mean and standard deviation of the annually aggregated temperature in the year 2023. Two black “×” marks indicate grid points located at coordinates (40.05,43.75) and (−30.63,170.00), designating points on land (GL) and ocean (GO), respectively. (c) shows annual temperature time series at GL and GO, where each ensemble is represented by a distinct color. (d) shows the monthly temperature time series at GL and GO, where the black curves are ensemble means. (e)–(g) are histograms of annual, monthly and daily temperature simulations at GL and GO, which have been detrended by ensemble means. Only data of years 2020, 2040, 2060, 2080, and 2100 are used in daily data.

Let {yA,t(r)(Li,lj)}t=1T represent the annual temperature trajectory of ensemble r at (Li, lj). illustrates these time series for both GL (land) and GO (ocean) locations, that is, {yA,t(r)(GL)}t=1T and {yA,t(r)(GO)}t=1T, alongside their respective ensemble means {y¯A,t(GL)}t=1T and {y¯A,t(GO)}t=1T. Take {yA,t(r)(GL)}t=1T as an example. All ensembles exhibit a shared increasing trend, influenced by the RF also depicted in . Moreover, each ensemble exhibits its unique shape, signifying the presence of uncertainty among ensembles. These motivate us to develop an SG incorporating both deterministic and stochastic factors. Similarly, provides monthly temperature time series of years 2021–2025 at GL and GO, along with their respective ensemble means. They experience alternating peaks and valleys because they are situated in the northern and southern hemispheres, respectively. In comparison to annual data, exhibits a clear seasonality, which is a common feature in monthly and daily data. The seasonality of surface temperature varies across grid points and remains consistent among all ensembles at a fixed grid point. Notably, from , both the uncertainty and the variation are more pronounced at GL.

demonstrate histograms, skewness and kurtosis of the detrended annual, monthly and daily temperature simulations at GL and GO, respectively. For example, the top panel of displays the histogram of {yA,t(r)(GL)R1r=1RyA,t(r)(GL)}r=1,,R;t=1,,T and its skewness and kurtosis. The close-to-zero skewness and the close-to-three kurtosis enable a Gaussian assumption when we model R time series {yA,t(r)(GL)R1r=1RyA,t(r)(GL)}t=1T, with an auto-regression. In contrast, the histograms in tend to be skewed and heavy-tailed. Figure S1 presents the skewness and kurtosis for time series across all global grid points. The degree of skewness and the presence of heavy tails become more pronounced with increasing temporal resolution. The non-Gaussianity is particularly severe in the band region below latitude 60°, termed as Band, and the North Pole region, where the surface temperature simulations exhibit numerical instabilities. The reason will be discussed in Section 4.2. These observations underscore the necessity for additional transformations and parameters to Gaussianize the monthly and daily simulations, as the first two moments alone are insufficient to characterize data with greater temporal complexity.

3 Stochastic Generator Methodology

In this section, we detail the procedures of constructing an SG and generating emulations for the temperature simulations described in Section 2. Leveraging the SHT, our proposed SG offers efficient and adaptable capabilities for dealing with global climate simulations that exhibit: (a) a substantial size in terms of the number of observations in space and time; (b) nonstationary spatial dependencies among latitudes and land/ocean regions; and (c) non-Gaussian temporal trajectories.

The data characteristics outlined in Section 2 underscore the deterministic chaotic nature inherent in climate models (Lorenz Citation1963; Branstator and Teng Citation2010; Castruccio and Genton 2018), which motivates us to decouple the data into deterministic and stochastic components as follows: (1) yt(r)(Li,lj)=mt(Li,lj)+σ(Li,lj)Zt(r)(Li,lj).(1)

Here, mt and σ are deterministic functions responsible for the mean trend and standard error, respectively, and are shared across all ensembles. Zt(r)(Li,lj) is the stochastic component at grid point (Li, lj), time point t, and ensemble r. Given the expansive parameter space, simultaneously inferring both the deterministic and stochastic components would be computationally infeasible. Therefore, we adopt a two-stage approach, commonly employed in existing work (Castruccio and Genton 2018; Jeong et al. Citation2019; Huang et al. Citation2023). First, we evaluate mt(Li,lj) and σ(Li,lj) at each grid point, making the independence assumption regarding {Zt(r)(Li,lj)}i=1,,I;j=1,,J;t=1,,T;r=1,,R. Subsequently, we analyze the dependence structure of Zt(r)(Li,lj) by detrending and rescaling yt(r)(Li,lj) with the estimate of mt(Li,lj) and σ(Li,lj), respectively.

3.1 Deterministic Component of SG

In the first stage of constructing the SG, we focus on determining the mean trend mt. It is crucial that the mean trend is both simple enough to minimize the storage requirements for parameters and informative enough to capture the physical relationships between simulations and essential covariates. Let yt(r)(Li,lj) represent the annually aggregated temperature data. Previous research (Castruccio et al. Citation2014; Huang et al. Citation2023) has shown its dependence on the RF trajectory and adopted an infinite distributed lag model: mt(Li,lj)=β0(Li,lj)+β1(Li,lj)xt+β2(Li,lj){1ρ(Li,lj)}s=1ρ(Li,lj)s1xts,where β0(Li,lj) is the intercept, β1(Li,lj) and β2(Li,lj) are slopes for the current and past RFs, respectively. Lag weights β2(Li,lj){1ρ(Li,lj)}ρ(Li,lj)s1 decrease the impact of past RFs exponentially by ρ(Li,lj)[0,1]. Moreover, if yt(r)(Li,lj) represents the monthly aggregated temperature data, additional harmonic terms {cos(2πtk/12),sin(2πtk/12)}k=1KM should be included to fit the interannual cycle as shown in . That is, mt(Li,lj)=β0(Li,lj)+β1(Li,lj)xt/12+β2(Li,lj){1ρ(Li,lj)}s=1ρ(Li,lj)s1xt/12s+k=1KM{ak(Li,lj)cos(2πtk12)+bk(Li,lj)sin(2πtk12)}, which would also be applied to the daily aggregated temperature by replacing KM and t/12 with KD and t/365, respectively. The larger the value of KM (or KD), the higher the frequency the harmonic terms can present. For the deterministic component, {β0(Li,lj),β1(Li,lj),β2(Li,lj),ρ(Li,lj),σ(Li,lj), a1(Li,lj),,aK(Li,lj),b1(Li,lj),,bK(Li,lj)}i=1,,I;j=1,,J are all (5+2K)IJ parameters to be evaluated and stored, where K can take values of 0, KM, and KD to represent the annual, monthly and daily temperature, respectively. With the independence assumption, we can efficiently estimate the parameters of the deterministic component in parallel for each grid point. The detailed inferential process is given in Section S3.1 of the supplementary materials.

3.2 Stochastic Component of SG

Now, we sequentially model the spatial and temporal dependence of the stochastic component Zt(r)(Li,lj), which is a general principle for analyzing space-time data (Stein Citation2007; Castruccio and Genton 2018) to bypass the prohibitive computation.

3.2.1 Modeling the Spatial Dependence

The inference of the spatial dependence should consider several crucial factors: (a) The simulations are extensive; (b) The simulations encompass the entire globe, making the use of chordal distance inappropriate (Jeong, Jun, and Genton Citation2017); and (c) The simulations exhibit nonstationarity among different latitudes and between land and ocean regions. Given these considerations, we propose using the SHT (Jones Citation1963; Stein Citation2007, 2008), which involves expanding the stochastic component Zt(r)(Li,lj) with spherical harmonics: (2) Zt(r)(Li,lj)=q=0Q1m=qq(st(r))qmHqm(Li,lj)+εt(r)(Li,lj).(2)

In (2), {Hqm}q=0,1,2,;m=q,,q is a complete set of orthonormal basis functions defined in L2(S2) and indexed by integer degree q and order m, where L2(S2) is the Hilbert space consisting of squared-integrable functions on the sphere S2. The spherical harmonic Hqm is a complex-valued function with a closed form Hqm(Li,lj)=2q+14π(qm)!(q+m)!Pqm(cosθi)exp(ιmψj), where ι2=1, (θi,ψj)=(πLi/180+π/2,πlj/180) is a reparameterization of (Li, lj) and Pqm is the associated Legendre polynomial of degree q and order m, such that Hqm=(1)mHqm¯. Corresponding to the Hqm, (st(r))qm is the complex-valued spherical harmonic coefficient satisfying (st(r))qm=(1)m(st(r))qm¯. The norm of (st(r))qm, that is, ||(st(r))qm||2=R2{(st(r))qm}+I2{(st(r))qm}, indicates the energy gathered at degree q and order m. More details about spherical harmonics and their application on spectrum and differentiability analysis are given in Section S3.2.1 of the supplementary materials. Q is an integer ranging from 0 to Qmax, with Qmax=min{I1,(J+1)/2} derived in Section S3.2.2 and determined by the spatial resolution of the data. The denser the grid points are, the higher the frequency they capture, and the larger the value of Qmax could be. For CESM2-LENS2, Qmax=144. The term εt(r)(Li,lj) accounts for the remaining information and is assumed to be independent and follow N(0,v2(Li,lj)). In a physical sense, (2) translates {Zt(r)(Li,lj)}i=1,,I;j=1,,J from the spatial domain to {(st(r))qm}q=0,,Q1;l=q,,q in the spectral domain. In a statistical sense, with QIJ, (2) provides a low-rank approximation for the stochastic process Zt, where (st)qm serves as a random coefficient and Q trades off the quality of approximation against the computational complexity.

We provide to facilitate the understanding of SHT. shows a set of stochastic components in the spatial domain, that is, {Z9(1)(Li,lj)}i=1,,I;j=1,,J, which is obtained by detrending and rescaling the annually aggregated temperature at year 2023 and ensemble one with ensemble mean and standard error in , respectively. After applying the SHT, presents the pattern of {log10||(s9(1))qm||2}q=0,1,,Qmax;m=q,,q, which indicates the energy allocation in the spectral domain. The energy is more concentrated at lower degrees of lower frequencies, gradually decreasing as the degree increases. The decay rate reflects the differentiability property of the data, as detailed in Section S3.2.1 of the supplementary materials. show the influence of Q on the loss of using SHT for a single set of stochastic components. Section S3.2.3 provides a more comprehensive illustration of the general approximation performance. In essence, a larger value of Q retains more information and provides a better approximation. Additionally, the spatial structure of surface temperature varies significantly between land and ocean.

Fig. 2 Demonstration of SHT. (a) shows a set of stochastic component {Z9(1)(Li,lj)}i=1,,I;j=1,,J in the spatial domain. (b) indicates energy allocation of (a) in spectral domain. (c)–(e) depict the absolute values of {ε9(1)(Li,lj)}i=1,,I;j=1,,J in (2) with Q = 36, 72, and 116, respectively. (f) depicts the absolute values of {ε9(1)(Li,lj)}i=1,,I;j=1,,J obtained by using LatticeKrig with 13,658 bases of 6 levels, where 13,658>1162=13,456.

Fig. 2 Demonstration of SHT. (a) shows a set of stochastic component {Z9(1)(Li,lj)}i=1,…,I;j=1,…,J in the spatial domain. (b) indicates energy allocation of (a) in spectral domain. (c)–(e) depict the absolute values of {ε9(1)(Li,lj)}i=1,…,I;j=1,…,J in (2) with Q = 36, 72, and 116, respectively. (f) depicts the absolute values of {ε9(1)(Li,lj)}i=1,…,I;j=1,…,J obtained by using LatticeKrig with 13,658 bases of 6 levels, where 13,658>1162=13,456.

We compare SHT with two popular bases used in low-rank approximations, empirical orthogonal functions (EOFs) and those in LatticeKrig (Nychka et al. Citation2016), as detailed in Section S3.2.4. EOFs, as data-driven basis functions, lack a closed form and must be stored. Their evaluation involves computing and eigen-decomposing an IJ × IJ matrix, leading to constraints on computational time and resources. illustrate comparable approximation performance of SHT and LatticeKrig on Z9(1)(Li,lj) with similar numbers of bases. However, LatticeKrig entails more time for parameter tuning selection and coefficient calculation, along with greater storage requirements. For a more detailed and comprehensive comparison, readers are referred to Section S3.2.4. We emphasize the advantages of employing the spherical harmonics to represent substantial global climate simulations: (a) Natural basis functions. Spherical harmonics are eigenfunctions of the Laplace-Beltrami operator, making them well-suited for analyzing spherical data of various variables, as well as their spectrum; (b) Easily available multi-resolution basis functions. Spherical harmonics have a closed form, which saves the procedures for getting bases, conserves the memory for storing them, and facilitates the derivation of special forms of covariance under certain assumptions. Their resolutions are automatically multiple, eliminating the need for researchers to manually allocate knots or tuning parameters. Instead, one can simply select a degree Q to control the level of detail; (c) Efficient coefficient calculation. As shown in Section S3.2.2, both the SHT and its inverse can be efficiently calculated with a computational time O(Q3). Moreover, this computational time can be further reduced to O(Q2) by parallel computing. On a MacBook with a 3.2 GHz Apple M1 Pro processor with ten cores, performing the SHT with Q=Qmax in takes about 3 sec, and the inverse SHT with Q = 36, 72, and 144 take no more than 0.06, 0.3, and 2 sec, respectively.

Next, we model the spatial dependence structure, which is assumed to remain consistent over time. The covariance between Zt(r)(Li,lj) and Zt(r)(Li,lj) is (3) c{(Li,lj),(Li,lj)}=H(Li,lj)KH(Li,lj)¯+δi=iδj=jv2(Li,lj),(3) where K=Kt(r)=E{st(r)(st(r))H} is a Q2×Q2 covariance matrix of {(st(r))qm}q=0,1,,Q;m=q,,q, H is the operator of conjugate transpose, H(Li,lj) and st(r) are vectors consisting of basis functions and spherical harmonic coefficients, indexed by (q2+q+m+1). Specifically, the (q2+q+m+1)th elements of H(Li,lj) and st(r) are Hqm(Li,lj) and (st(r))qm, respectively.

Further examining the structure of K, it is evident that the stochastic component in exhibits strong heterogeneity across different latitudes, as observed in previous studies of global climate data (Stein Citation2007; Jeong et al. Citation2018; Huang et al. Citation2023). Consequently, the commonly used isotropic assumption for Euclidean data is no longer appropriate. Instead, we adopt the assumption of axial symmetry, which posits that only data on the same latitude are stationary (Jones Citation1963). This leads to the covariance c{(Li,lj),(Li,lj)} becoming a function of Li (θi), Li (θi), and |ljlj| (|ψjψj|, the central angle between points (Li, lj) and (Li,lj) expressed in chordal distance). Using a simple derivation outlined in Jones (Citation1963), we have (4) c(Li,Li,|ljlj|)=q=0Q1q=0Q1m=qqkqqmHqm(θi,0)Hqm(θi,0)exp(ιm|ψjψj|)(4) with E{(st(r))qm(st(r))qm}=δm=mkqqm. It means that axial symmetry assumes dependence only on coefficients of the same order. Thus, K becomes a sparse matrix, consisting of Q2+(Q1)2++12=Q3/3+Q2/2+Q/6 nonzero elements.

However, from and , we still observe heterogeneity between land and ocean data on the same latitude. The stochastic components on land and ocean exhibit different spatial structures, requiring different numbers of basis functions to capture them adequately. Therefore, we propose replacing (2) by (5) Zt(r)(Li,lj)=δ(Li,lj)Slq=0Ql1m=qq(st(r))qmHqm(Li,lj)+δ(Li,lj)Soq=0Qo1m=qq(st(r))qmHqm(Li,lj)+εt(r)(Li,lj),(5) where So (Sl) represents the set of grid points over the ocean (land) and Qo (Ql) denotes the value of Q for ocean (land). We ignore the continuity of temperature changes between land and ocean, since (5) is formulated on grid points and our target is not to predict at new points. Accordingly, the covariance c(Li,Li,|ljlj|) is given by plugging the Q values of (Li, lj) and (Li,lj) into (4). We present the process of choosing appropriate values for Qo and Ql in Section 4.

3.2.2 Modeling the Temporal Dependence

Next, we investigate the temporal dependence within the vector time series {st(r)}t=1T in the spectral domain. We will employ an auto-regressive model, but with two additional operations: converting the coefficients to real-valued ones and Gaussianizing them using the TGH transformation. Specifically, for the complex-valued vector st(r) with the special structure of (st(r))qm=(1)m(st(r))qm¯, we first linearly transform it into a real-valued vector: s˜t(r)=A1st(r)=[(st(r))00,I{(st(r))11},(st(r))10,R{(st(r))11},,I{(st(r))Q1Q1},,R{(st(r))Q1Q1}],where both A and its inverse are Q2×Q2 sparse matrices with known structures. For example, in the case of Q = 2, A and A1 can be represented as (10000ι0100100ι01)and(10000ι/20ι/2001001/201/2), respectively. For simplicity, we denote the (q2+q+m+1)th element of s˜t(r) as (s˜t(r))qm. Consequently, we have Zt(r)(Li,lj)=H(Li,lj)As˜t(r)+εt(r)(Li,lj), where AH(Li,lj) and s˜t(r) are vectors of bases and coefficients, respectively. Under the assumption of axial symmetry, the covariance of s˜t(r), denoted as K˜0=E(s˜t(r)s˜t(r)), is also a sparse matrix. Specifically, E{(s˜t(r))qm(s˜t(r))qm}=E{(s˜t(r))qm(s˜t(r))qm}=k˜qqmδm=mδm0. The derivation is given in Section S3.3 of supplementary materials. The matrix K˜0 consists of 2Q3/3+Q/3 nonzero elements and only Q3/3+Q2/2+Q/3 ones need to be stored. If Zt(r)(Li,lj) represents the stochastic component of annually aggregated temperature, it is reasonable to assume the Gaussianity of {Zt(r)(Li,lj)}t=1T, and therefore, the Gaussianity of {s˜t(r)}t=1T. This allows us to model it with a vector auto-regressive model of order P (VAR(P)): s˜t(r)=p=1PΦps˜tp(r)+ξt(r), where ξt(r)iidNQ2(0,U) and U=K˜0p=1PΦpK˜0Φpp=1PppPΦpK˜|pp|Φp with K˜|pp|=E(s˜tp(r)s˜tp(r)).

However, as in Sections 2 and S2, when Zt(r)(Li,lj) is the stochastic component of the monthly or daily aggregated temperature, the Gaussianity assumption for time series {Zt(r)(Li,lj)}t=1T at most grid points (Li, lj) may not hold. It implies that {(s˜t(r))qm}t=1T at some (q, m) pairs are not Gaussian and motivates us to employ a modified TGH transformation. Denote by Sgh the set of (q, m) such that {(s˜t(r))qm}t=1T rejects the Gaussianity. We model the coefficients with a TGH auto-regressive model: sˇt(r)=p=1PΦpsˇtp(r)+ξt(r), where the (q2+q+m+1)th element of sˇt(r) is (6) (sˇt(r))qm={λqmτgqm,hqm1{(s˜t(r))qm/ωqm},if(q,m)Sgh,(s˜t(r))qm,if(q,m)Sgh,(6) with τgqm,hqm(s)=(gqm)1{exp(gqms)1}exp(hqms2/2). Compared to the regular TGH, the one in (6) removes a location parameter and introduces a scale parameter λqm. The former is due to the zero-mean {(s˜t(r))qm}t=1,,T;r=1,,R. The later ensures the standard deviation of {(sˇt(r))qm}t=1,,T;r=1,,R to be equal to that of {(s˜t(r))qm}t=1,,T;r=1,,R. Four additional parameters describing characteristics beyond the first two moments are to be stored. Based on (6), the covariance matrix of ξt(r) is U=Kˇ0p=1PΦpKˇ0Φpp=1PppPΦpKˇ|pp|Φp, where both Kˇ0=E(sˇt(r)sˇt(r)) with E{(sˇt(r))qm(sˇt(r))qm}=E{(sˇt(r))qm(sˇt(r))qm}=kˇqqmδm=mδm0 and Kˇ|pp|=E(sˇtp(r)sˇtp(r)) have sparse structures under the assumption of axial symmetry.

Furthermore, we assume that Φp is a diagonal matrix with the (q2+q+m+1)th diagonal element denoted as (ϕp)qm, which neglects the cross dependence between {(s˜t(r))qm}q=0,,Q1;m=q,,q. This assumption enables the independent and parallel evaluation of {(ϕp)qm}p=1P across q and m and ensures U to inherit the same sparse structure from Kˇ0 and Kˇ|pp|. We briefly validated it in Section S3.4 by testing the significance of the first temporal lag of the cross-correlation. The proportion of p-values less than 0.05 is nearly zero, indicating negligible cross-dependence. The details about the parameter estimation are given in Section S3.5.

3.3 Emulation with SG

The procedures of constructing an SG for the monthly and daily simulations from CESM2-LENS2 is summarized in Algorithm S1 of Section S3.6. We assume that the tuning parameters K, Ql, Qo, and P are known. More details about their selection will be provided in Section 4. In Stage 1, we calculate and store a total of (5+2K)IJ parameters in the deterministic component. In Step 1) of Stage 2, although we use different Q values for land and ocean data to evaluate v(Li,lj), we still need coefficients up to Q=max(Ql,Qo). The number of stored parameters in Stage 2 is IJ+(4+P)Q2+(Q3/3+Q2/2+Q/3). Therefore, the total number of parameters is of the order O(IJ+Q3). For the annually aggregated temperature, there is no need to perform a TGH transformation in Step 3) of Stage 2. Consequently, Sgh is an empty set, and only IJ+PQ2+(Q3/3+Q2/2+Q/3) parameters are needed for the stochastic component. In summary, we develop an SG by training on simulations of size IJRT and store it for generating unlimited emulations using at most (6+2K)IJ+(4+P)Q2+(Q3/3+Q2/2+Q/3) memorized parameters.

Here, we contrast the storage requirements of our SG with that of another efficient SG proposed by Huang et al. (Citation2023) (referred to as HCBG-SG) for surface temperature simulations from LENS1 (Kay et al. Citation2015). HCBG-SG was primarily developed within the spatial domain and required memorization of (2K+PH+8)IJ+6I+2 parameters. This consisted of (5+2K)IJ parameters for the deterministic component, PHIJ parameters for evaluating the temporal dependence with PH representing the selected order for the auto-regressive model, 3IJ parameters for TGH transformation, and 6I+2 parameters for assessing the nonstationary spatial dependence. The storage advantage of our SG stems from its modeling in the spectral domain rather than the spatial domain, with Q2IJ. Taking the case of P=PH=1 and Q=70 as an example, which maximizes storage requirements of our SG for all temporal resolutions, the daily HCBG-SG necessitates additional 25,747 parameters. However, for annual simulations, which do not need a TGH, HCBG-SG needs 120,541 fewer parameters. A further comparison of their performance is presented in Section 4.

In Algorithm 1, we outline the procedures for generating monthly and daily temperature emulations for CESM2-LENS2. Note that a Cholesky decomposition should be performed on U when generating sˇt(r), which may be time-consuming if Q is not small. However, the sparse structure of U avoids the possible computational limitation. For the annually aggregated temperature, we can simply remove Step (b) in Stage 2 and replace s˜SG,t(r) in Step (c) with sˇSG,t(r).

Algorithm 1

Emulation with SG

Input: ρ(Li,lj), β0(Li,lj), β1(Li,lj), β2(Li,lj), a1(Li,lj),…, aK(Li,lj), b1(Li,lj),…, bK(Li,lj), σ(Li,lj), v(Li,lj), λqm, ωqm, gqm, hqm, (ϕ1)qm,,(ϕP)qm, i=1,,I, j=1,,J, q=0,,Qo1, m=q,,q, U, Ql, Qo, K, P, ARQo2×Qo2, R.

Stage 1: Preliminary

1) For each grid point (Li, lj), calculate {mt(Li,lj)}t=1T with given parameters of the deterministic component, i=1,,I and j=1,,J.

Stage 2: Emulation

For r=1,,R:

1) generate sˇSG,(p1)(r)NQo2(0,U), p=1,,P,

2) for t=1,,T:

(a) generate ξSG,t(r)NQo2(0,U), and calculate sˇSG,t(r)=p=1PΦpsˇSG,tp(r)+ξSG,t(r),

(b) calculate s˜SG,t(r), where (s˜SG,t(r))qm=ωqmτgqm,hqm{(sˇSG,t(r))qm/λqm},

(c) calculate sSG,t(r)=As˜SG,t(r), and perform the inverse SHT on sSG,t(r) with (5) to obtain {invSHT(sSG,t(r))(Li,lj)}i=1,,I;j=1,,J,

(d) generate εSG,t(r)(Li,lj)N(0,v̂(Li,lj)2), i=1,,I;j=1,,J;

(e) generate emulations ySG,t(r)(Li,lj)=mt(Li,lj)+σ(Li,lj){invSHT(sSG,t(r))(Li,lj) +εSG,t(r)(Li,lj)}.

Output: {ySG,t(r)(Li,lj)}i=1,,I;j=1,,J;r=1,,R.

4 Case Study: CESM2-LENS2

In this section, we develop emulators and generate emulations for surface temperature simulations from CESM2-LENS2 using the proposed algorithms. We delve into the implementation of our SG (referred to as SHT-SG in this section), covering the choice of tuning parameters and the modeling of the dependence structure, and the assessment of their performance. Additionally, we include the results of HCBG-SG for comparison. Discussion about the monthly SG is presented in Section S4.2 of the supplementary materials to save space.

The assessment of an SG relies on the specific purpose for which the emulator is intended (Castruccio et al. Citation2014). Our main goal is to generate emulations that closely resemble the given simulations. To achieve this, our SG should efficiently decouple and capture the deterministic and stochastic components of the data. First, the estimated mean trend m̂t should mimic and extract the variation of the ensemble mean in the simulations. After adding the stochastic component, the variability of emulations should be similar to the internal variability observed in simulations. Therefore, we will discuss two key factors: goodness-of-fit and variability, quantified with Ifit(Li,lj)=r=1Rt=1T{yt(r)(Li,lj)m̂t(Li,lj)}2R/(R1)r=1Rt=1T{yt(r)(Li,lj)y¯t(Li,lj)}2

and Iuq(Li,lj)=CRA{ySG,t(1)(Li,lj),,ySG,t(R)(Li,lj)}CRA{yt(1)(Li,lj),,yt(R)(Li,lj)},respectively. Given a grid point (Li, lj), the index Ifit(Li,lj) compares m̂t(Li,lj) with 2K+4 parameters with the ensemble mean y¯t(Li,lj)=R1r=1Ryt(r)(Li,lj) with T parameters among all ensembles and time points. The closer the Ifit value to 1, the better the ability of the SG to capture the variability of the ensemble mean. Index Iuq, a measure proposed in Huang et al. (Citation2023), leverages the concept of functional data depths (López-Pintado and Romo Citation2009). In particular, we treat {ySG,t(r)(Li,lj)}r=1R as R functions of t at (Li, lj) and measure the interquartile range (Sun and Genton Citation2011) of these functions using the central region area (CRA). Then, Iuq(Li,lj) is the uncertainty qualification of {ySG,t(r)(Li,lj)}r=1R against that of {yt(r)(Li,lj)}r=1R. A value of Iuq close to 1 suggests that the variability of the generated emulations closely reflects the internal variability of the training simulations.

Moreover, we will visually and numerically compare various statistical characteristics of the emulations to those of simulations. This ensures that our emulations faithfully capture the essential features of the temperature data from CESM2-LENS2. For example, at each grid point, we compare the empirical distribution of {ySG,t(r)(Li,lj)}t=1,,T;r=1,,R to that of {yt(r)(Li,lj)}t=1,,T;r=1,,R by the Wasserstein distance (Santambrogio Citation2015), which is denoted as WDS(Li,lj). Similarly, we calculate the Wasserstein distance between the empirical distribution of {ySG,t(r)(Li,lj)}i=1,,I;j=1,,J;r=1,,R and {yt(r)(Li,lj)}i=1,,I;j=1,,J;r=1,,R at each time point t, which is denoted as WDT(t).

4.1 Annually Aggregated Temperature

First, we evaluate the deterministic component mt and σ of the annually aggregated temperature simulations {yt(r)(Li,lj)}i=1,,I;j=1,,J;t=1,,T;r=1,,R, with a total of T = 86 time points. presents boxplots of {Ifit(Li,lj)}i=1,,I;j=1,,J for different values of R. These values tend to get closer to 1 as R increases, and they stabilize after R7, which means that using seven training simulations is sufficient to stabilize the inference. Therefore, we select R = 7 for the subsequent analysis. The corresponding {Ifit(Li,lj)}i=1,,I;j=1,,J values are shown in Figure S5(a), with a median value of 1.001. Temperature can be fitted well at most grid points. Furthermore, displays the map of σ̂(Li,lj), illustrating the nonstationarity of the data. It is evident that larger variations of surface temperatures are observed over land.

Fig. 3 Inference process for the annual SG. (a) shows boxplots of {Ifit(Li,lj)}i=1,,I;j=1,,J for different values of R. (b) is the map of the estimated standard deviation {σ̂(Li,lj)}i=1,,I;j=1,,J. (c) shows boxplots of {BIC*,t(r)(Q)}r=1,,R;t=1,,T for different values of Q, where * takes “l” and “o”. Points × show medians of BIC values. The red points indicate the minimum values. (d) is the map of {v̂(Li,lj)}i=1,,I;j=1,,J.

Fig. 3 Inference process for the annual SG. (a) shows boxplots of {Ifit(Li,lj)}i=1,…,I;j=1,…,J for different values of R. (b) is the map of the estimated standard deviation {σ̂(Li,lj)}i=1,…,I;j=1,…,J. (c) shows boxplots of {BIC*,t(r)(Q)}r=1,…,R;t=1,…,T for different values of Q, where * takes “l” and “o”. Points × show medians of BIC values. The red points indicate the minimum values. (d) is the map of {v̂(Li,lj)}i=1,…,I;j=1,…,J.

Then, we perform the SHT on the stochastic component, necessitating the selection of appropriate values for Ql and Qo. Based on the Gaussian assumption of εt(r)(Li,lj), we propose a Bayesian information criterion (BIC) for {Zt(r)(Li,lj)}i=1,,I;j=1,,J at time t of ensemble r: BIC*,t(r)(Q)=log(n*)Q2+n*log(2π)+δ(Li,lj)S*i=1Ij=1Jlog{v̂2(Li,lj)}+δ(Li,lj)S*i=1Ij=1J{εt(r)(Li,lj)v̂(Li,lj)}2,where v̂2(Li,lj)=(RT)1r=1Rt=1T{εt(r)(Li,lj)}2. When * takes “o”, the above BIC with no=i=1Ij=1Jδ(Li,lj)So helps select an appropriate Qo for Zt(r)(Li,lj) over the ocean. Similarly, when * is “l”, it leads to the determination of Ql. illustrates boxplots of {BIC*,t(r)}r=1,,R;t=1,,T against different values of Q, revealing a substantial discrepancy in preferred Q values between land and ocean. The stochastic components on land exhibit a preference for a smaller Q value compared to those on the ocean. This might seem counterintuitive, given that suggest that the stochastic component on land requires more harmonic terms for a better approximation. However, it is essential to note that the introduction of Q aims not only for accurate approximation but also to strike a balance between accuracy and the number of parameters, leading to improved model selection. Examining the boxplots for land, the improvement gained by including additional terms does not fully compensate for the associated drawbacks, such as increased storage requirements and accumulated errors from evaluation. The subsequent further validate this model selection result. Finally, the × points in suggest selecting Ql = 35 and Qo = 69 to minimize the medians of BIC for land and ocean, respectively.

Fig. 4 Temporal and spatial dependence structure. (a) is the map of the estimated {(ϕ1)qm}q=0,,Qo1;m=q,,q for the annual simulations. (b) is the first 69 rows and columns of K˜0 empirically evaluated by (RT)1r=1Rt=1Ts˜t(r)s˜t(r) and denoted as K˜emp. (c) is the first 69 rows and columns of K˜0 evaluated under the axial symmetric assumption and denoted as K˜axl. (d) and (e) show empirical and fitted auto-covariance at latitudes 11.8° and 36.3°, respectively.

Fig. 4 Temporal and spatial dependence structure. (a) is the map of the estimated {(ϕ1)qm}q=0,…,Qo−1;m=−q,…,q for the annual simulations. (b) is the first 69 rows and columns of K˜0 empirically evaluated by (RT)−1∑r=1R∑t=1Ts˜t(r)s˜t(r)⊤ and denoted as K˜emp. (c) is the first 69 rows and columns of K˜0 evaluated under the axial symmetric assumption and denoted as K˜axl. (d) and (e) show empirical and fitted auto-covariance at latitudes −11.8° and 36.3°, respectively.

In , we further analyze the values of v̂(Li,lj), which serves as an assessment of the low-rank approximation on (Li, lj). There are several regions on land that exhibit significantly larger v̂(Li,lj): (a) High-altitude regions such as the Himalayas and the Andes ranges with altitudes exceeding 3500 meters above sea level; and (b) Regions characterized by diverse topography, such as the Indonesian archipelago, encompassing islands, mountains, and coastal areas. These geographic factors contribute to a more complex spatial and temporal structure in surface temperature in these regions. To capture more details, one may consider including additional covariates in the deterministic component and choosing specific values of Q for these regions in the stochastic component. These considerations are avenues for future exploration.

Next, we assess the temporal dependence by fitting {(s˜t(r))qm}t=1,,T;r=1,,R with the VAR(P) model. For each (q, m), we suggest to choose P by minimizing BICqm(P)=Plog{(TP)R}+R(TP){log(2π)+1}+R(TP)log((ûqm)2), where (ûqm)2=R1(TP)1r=1Rt=P+1T{(s˜t(r))qmp=1P(ϕ̂p)qm(s˜tp(r))qm}2. For briefness, we use the conditional log-likelihood (given (s˜1(r))qm,,(s˜P(r))qm) rather than an exact log-likelihood in the above BIC. The proportions of P=1,,5 are 99.3%, 0.2%, 0.2%, 0.1%, and 0.2%, respectively. Therefore, we opt for P = 1 for the annual case. further demonstrates the estimates of {(ϕ1)qm}q=0,,Qo1;m=q,,q.

Finally, we evaluate the spatial dependence by examining the matrix K˜0. show top-left corners of two evaluations of K˜0. K˜emp is the sample covariance matrix of s˜t(r), forming a dense Qo2×Qo2 matrix. In contrast, K˜axl is a sparse matrix. The sparsity of K˜axl would be inherited by U, facilitating storage and enabling fast Cholesky decomposition using efficient algorithms (Furrer and Sain Citation2010). display empirical and fitted covariances for Z(Li,lj) and Z(Li,lj+1), specifically the auto-covariance {cov{Z(Li,lj),Z(Li,lj+1)}}j=1J1 at Li=11.8° and 36.3°. These plots highlight the distinction between the proposed separate SHT for land and ocean (5) (Axial-land/ocean) and the original SHT (2) (Axial). The empirical auto-covariance over the ocean appears almost flat, supporting the assumption of axial symmetry over the ocean. In contrast, the auto-covariance over land significantly differs from that over the ocean. This difference is not captured by the fitted covariance of Axial, which selects Q = 77 via BIC. Axial-land/ocean fits the covariance separately for land and ocean, resulting in segmented lines in . However, land temperature exhibits a complex and nonstationary dependence structure, with covariance weakening as one moves farther inland. This intricacy warrants further investigation in future research.

Inputting the above-estimated parameters, we generate R=R=7 emulations with Algorithm 1. Each emulation takes approximately 30 sec to produce without parallel computation. Additionally, we include emulations generated by HCBG-SG for comparison, with detailed implementation procedures shown in the supplementary materials. compare Iuq values for the proposed SHT-SG and HCBG-SG, with median values of 1.013 and 1.085, respectively. Our emulations can successfully capture the variability of simulations at most grid locations, including polar regions. A high-Iuq region near the location (50.00,300.00) is caused by the poor evaluation of the mean trend, which is discussed in Section S4.1.1 of the supplementary materials. shows a poor spatial continuity for HCBG-SG, which may be caused by the independent evaluation of parameters at each latitude and a relatively weak model of dependence across latitudes. A similar pattern can be seen in , which depicts the map of WDS for HCBG-SG, with a median value of 0.076. The WDS values for SHT-SG in , which are closer to zero with a median value of 0.062, provide further evidence of the similarity between our emulations and the simulations. From , geographic factors have no obvious impact on both two SGs regarding Iuq and WDS values. The WDT values for SHT-SG and HCBG-SG are comparable, with medians 0.126 and 0.108, respectively. In the supplementary materials, we further compare other basic statistical characteristics of our emulations to those of the simulations by replicating the procedures outlined in .

Fig. 5 (a) and (b) are maps of {Iuq(Li,lj)}i=1,,I;j=1,,J for the annual temperature emulations. (c) and (d) are maps of {WDS(Li,lj)}i=1,,I;j=1,,J for the annual temperature emulations.

Fig. 5 (a) and (b) are maps of {Iuq(Li,lj)}i=1,…,I;j=1,…,J for the annual temperature emulations. (c) and (d) are maps of {WDS(Li,lj)}i=1,…,I;j=1,…,J for the annual temperature emulations.

4.2 Daily Aggregated Temperature

Now, we develop an SG for the daily aggregated temperature simulations with T=31,390. Analyzing such a vast amount of data presents a computational challenge. Therefore, we adopt a strategy similar to Huang et al. (Citation2023) to perform inference only on data for the years 2020, 2040, 2060, 2080, and 2100. The procedure of inference closely follows that of the monthly case. Therefore, we put the details of development to Section S4.3 of the supplementary materials and focus on assessing the performance of the emulations and a further exploration of data characteristics.

For the daily data, we choose KD = 4, Ql = 36, Qo = 68, and P = 1. The choice of KD>KM=3 reveals the presence of more complex fluctuations in daily time series, requiring harmonic terms with higher frequencies in the model. The more structured and higher (ϕ̂1)qm in Figure S14(f) indicates a stronger dependence among time points. The performance of our SG, both with and without the use of TGH (referred to as SHT-SG(with TGH) and SHT-SG(without TGH), respectively), along with HCBG-SG, is illustrated in . For better demonstration, indices larger than maximum values in are excluded from , respectively. Comparing emulations generated by HCBG-SG with those of SHT-SG(with TGH), the latter illustrates a closer resemblance to simulations, greater robustness in numerically unstable regions, and better spatial continuity. From , the incorporation of TGH in our SG results in an improvement in WDS. The empirical distribution of emulations from SHT-SG(with TGH) is closer to that of simulations, aligning with the correct assumption. This advantage may not be apparent in , as Iuq is used mainly for assessing the variabilities of emulations.

Fig. 6 Performance assessment of R=7 daily emulations generated by various methods. (a) and (b) are maps of {Iuq(Li,lj)}i=1,,I;j=1,,J for the SHT-SG(with TGH) and HCBG-SG, respectively. (c) and (d) are maps of {WDS(Li,lj)}i=1,,I;j=1,,J for SHT-SG(with TGH) and HCBG-SG, respectively. Points × represent four testing grid points TGL=(2.36,10.00), TGO=(30.63,10.00), TGN=(56.07,10.00), and TGB=(64.55,10.00). (e) and (f) are boxplots of {Iuq(Li,lj)}i=1,,I;j=1,,J and {WDS(Li,lj)}i=1,,I;j=1,,J for comparing the SHT-SG(with TGH) and SHT-SG(without TGH).

Fig. 6 Performance assessment of R′=7 daily emulations generated by various methods. (a) and (b) are maps of {Iuq(Li,lj)}i=1,…,I;j=1,…,J for the SHT-SG(with TGH) and HCBG-SG, respectively. (c) and (d) are maps of {WDS(Li,lj)}i=1,…,I;j=1,…,J for SHT-SG(with TGH) and HCBG-SG, respectively. Points × represent four testing grid points TGL=(−2.36,10.00), TGO=(−30.63,10.00), TGN=(−56.07,10.00), and TGB=(−64.55,10.00). (e) and (f) are boxplots of {Iuq(Li,lj)}i=1,…,I;j=1,…,J and {WDS(Li,lj)}i=1,…,I;j=1,…,J for comparing the SHT-SG(with TGH) and SHT-SG(without TGH).

As in the monthly case, the daily emulations exhibit challenges in accurately representing certain regions, notably the North Pole and the Band regions. These regions display larger Iuq values, indicating higher uncertainty. Additionally, they may deviate from the behavior observed in the simulations, as reflected by the larger WDS values. Moreover, WDS values are generally higher over land compared to ocean areas, which may be attributed to the complex spatial and temporal structures and larger variation of the temperature on land.

We further explore these by intuitively displaying and comparing simulations and emulations at four testing grid points selected from the southern hemisphere with the same longitude in . TGL is a land point near the equator. In , temperature simulations on TGL exhibit larger variability but can be well replicated by the emulations. TGO is an ocean point in the middle latitudes. From , temperature simulations on TGO have smaller variability overall, with relatively larger variability at peaks and valleys of each year. These patterns can also be observed at emulations on TGO. TGN is also an ocean point in the middle latitudes, but very close to the Band region. In (and Figure S14(a)), some temperature simulations exhibit sudden drops at a few days in year 2020, which leads to an overestimate of σ̂(TGN), and hence a relatively higher variability in the emulations. Despite this, with Iuq(TGN)=1.200 and WDS(TGN)=0.078, the abnormal drops do not significantly affect these indices. Both Iuq and WDS exhibit a certain tolerance to isolated “outliers”. TGB is an ocean point within the Band region. Compared with the scenarios in , temperature simulations at TGB display much more rapid and significant fluctuations at lots of time slots in years 2020, 2040, and 2060. In contrast, in years 2080 and 2100, the temperature simulations behave as other grid points on ocean with small variability. As shown in Figure S14(b), simulations at the North Pole region exhibit similar patterns, that is, surface temperature experiences sharp fluctuations on numerous days. According to the explanation kindly provided by NCAR, these temperature fluctuations are caused by the way of data generation. Specifically, the surface temperature in simulations is the spatial average of the surface temperature of whatever medium is in the grid box, whether it is water, land, or sea ice. Grid points at Band and North Pole regions undergo a dynamic transition, shifting back and forth between being partial sea ice and being open ocean. Consequently, sometimes we get the temperature of sea ice, and sometimes we get the temperature of the top layer of the ocean (or a mixture of the two). Combining this with Figure S14, the numerical instabilities exhibit four features: they do (a) not occur in each ensemble; (b) not happen during a fixed period; (c) not vary within a similar extent; and 4) not exhibit a consistent pattern across different grid points. Therefore, although the high Iuq and WDS values at the North Pole and Band regions directly stem from the inadequate assumption of a fixed σ̂ over time, developing an SG that can accurately model the temperature fluctuations in these two regions is a challenge.

Fig. 7 Comparison of daily emulations and simulations at four testing grid points. The black solid curves are ensemble means for simulations and emulations. The red dashed curves are m̂t s.

Fig. 7 Comparison of daily emulations and simulations at four testing grid points. The black solid curves are ensemble means for simulations and emulations. The red dashed curves are m̂t s.

5 Conclusion and Future Work

In this article, we presented an efficient SG for global temperature simulations from the newly published CESM2-LENS2. With at most O(IJ+Qo3) parameters to be stored, an unlimited number of emulations of size O(IJT), even at a daily scale, can be generated to help with the investigation of climate internal variability. Such a saving of computational time and resources comes from the use of SHT, which expands data with Qo2IJ spherical harmonics and represents a practical low-rank approximation on the sphere. By customizing Q values for grid points on land and ocean and leveraging the axial symmetry, the proposed SG can properly capture the complex nonstationary dependencies among different latitudes and land/ocean regions. To account for the non-Gaussian nature of the high-resolution time series, we introduced a modified TGH transformation into the SG. In our case study, we developed SGs based on R = 7 annually, monthly and daily aggregated simulations, respectively. We evaluated the efficiency and accuracy of the proposed SG by comparing the generated emulations to the original simulations and emulations generated by Huang et al. (Citation2023) using various indexes and visual inspections.

While the proposed SGs have demonstrated commendable performance, there remains room for further enhancements: (a) Including altitude factors in both the deterministic and stochastic components. The influence of altitude has been evident in the inference process and the performance of emulations. Surface temperatures are sensitive to altitude (Castruccio and Genton Citation2016); (b) Improving the efficiency of SG on land, particularly for monthly and daily simulations. The larger variability observed in these cases suggests the need for more suitable models and additional factors to account for the complicated spatial and temporal structures on land; (c) Replacing the constant standard error with a temporal-varying one, denoted as σt. For simulations with higher temporal resolution, the assumption of a constant σ leads to an overestimate of standard error, further amplifying the variability of emulations, especially on land; and (d) Extending the current work to include hourly simulations. Hourly data may exhibit more intriguing characteristics, and exploring their emulation presents an opportunity for further investigation and understanding.

Beyond surface temperature, the idea of constructing an SG with SHT can be extended to various climate variables such as wind speed and atmospheric gas concentrations. After the removal of the deterministic component, the stochastic component can be efficiently expanded using spherical harmonics. The model in the spectral domain is contingent on the specific nature of the variable, which may ask for including other influential factors. Moreover, for variables that may exhibit interdependence, such as temperature and precipitation, exploring methods for their joint modeling in the spectral domain is also of interest. Investigating regional climate change, especially on land, is crucial for several reasons. On the one hand, climate change does not affect all regions equally. On the other hand, different regions have varying levels of vulnerability to climate change, for example, extreme weather events, sea-level rise, or precipitation patterns. Local communities, governments, and businesses may be interested in how specific areas are being impacted so that they can adapt and plan for the future. Therefore, another possible extension is to build an SG for high-resolution climate simulations on a specific region.

Supplementary Materials

Additional information about the CESM2-LENS2 data, inference process, derivation, validation, and results for the case studies are contained in the online supplementary materials. (.pdf file)

The R code and instructions for downloading, processing the data, and reproducing the results in the article are available at the GitHub repository: https://github.com/SpatialTemporalStats/LENS2_Emulator_Reproducibility_Materials

Supplemental material

Supplementary_Materials_WithAuthor.pdf

Download PDF (63.4 MB)

Disclosure Statement

The authors report there are no competing interests to declare.

References

  • Baker, A. H., Xu, H., Hammerling, D. M., Li, S., and Clyne, J. P. (2017), “Toward a Multi-Method Approach: Lossy Data Compression for Climate Simulation Data,” in International Conference on High Performance Computing, pp. 30–42, Springer.
  • Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008), “Gaussian Predictive Process Models for Large Spatial Data Sets,” Journal of the Royal Statistical Society, Series B, 70, 825–848. DOI: 10.1111/j.1467-9868.2008.00663.x.
  • Bolin, D., Simas, A. B., and Xiong, Z. (2023), “Covariance-based Rational Approximations of Fractional SPDEs for Computationally Efficient Bayesian Inference,” Journal of Computational and Graphical Statistics, 33, 64–74. DOI: 10.1080/10618600.2023.2231051.
  • Branstator, G., and Teng, H. (2010), “Two Limits of Initial-Value Decadal Predictability in a CGCM,” Journal of Climate, 23, 6292–6311. DOI: 10.1175/2010JCLI3678.1.
  • Castruccio, S., and Genton, M. G. (2016), “Compressing an Ensemble with Statistical Models: An Algorithm for Global 3D Spatio-Temporal Temperature,” Technometrics, 58, 319–328. DOI: 10.1080/00401706.2015.1027068.
  • ———(2018), “Principles for Statistical Inference on Big Spatio-Temporal Data from Climate Models,” Statistics & Probability Letters, 136, 92–96. DOI: 10.1016/j.spl.2018.02.026.
  • Castruccio, S., and Guinness, J. (2017), “An Evolutionary Spectrum Approach to Incorporate Large-Scale Geographical Descriptors on Global Processes,” Journal of the Royal Statistical Society, Series C, 66, 329–344. DOI: 10.1111/rssc.12167.
  • Castruccio, S., Hu, Z., Sanderson, B., Karspeck, A., and Hammerling, D. (2019), “Reproducing Internal Variability with Few Ensemble Runs,” Journal of Climate, 32, 8511–8522. DOI: 10.1175/JCLI-D-19-0280.1.
  • Castruccio, S., McInerney, D. J., Stein, M. L., Crouch, F. L., Jacob, R. L., and Moyer, E. J. (2014), “Statistical Emulation of Climate Model Projections based on Precomputed GCM Runs,” Journal of Climate, 27, 1829–1844. DOI: 10.1175/JCLI-D-13-00099.1.
  • Castruccio, S., and Stein, M. L. (2013), “Global Space-Time Models for Climate Ensembles,” The Annals of Applied Statistics, 7, 1593–1611. DOI: 10.1214/13-AOAS656.
  • Chang, W., Haran, M., Olson, R., and Keller, K. (2014), “Fast Dimension-Reduced Climate Model Calibration and the Effect of Data Aggregation,” The Annals of Applied Statistics, 8, 649–673. DOI: 10.1214/14-AOAS733.
  • Cressie, N., and Johannesson, G. (2008), “Fixed Rank Kriging for Very Large Spatial Data Sets,” Journal of the Royal Statistical Society, Series B, 70, 209–226. DOI: 10.1111/j.1467-9868.2007.00633.x.
  • Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016), “Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets,” Journal of the American Statistical Association, 111, 800–812. DOI: 10.1080/01621459.2015.1044091.
  • Dong, C., Peings, Y., and Magnusdottir, G. (2023), “Regulation of Southwestern United States precipitation by non-ENSO Teleconnections and the Impact of the Background Flow,” Journal of Climate, 36, 7415–7433. DOI: 10.1175/JCLI-D-23-0081.1.
  • Eyring, V., Bony, S., Meehl, G. A., Senior, C. A., Stevens, B., Stouffer, R. J., and Taylor, K. E. (2016), “Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) Experimental Design and Organization,” Geoscientific Model Development, 9, 1937–1958. DOI: 10.5194/gmd-9-1937-2016.
  • Furrer, R., and Sain, S. R. (2010), “spam: A Sparse Matrix R Package with Emphasis on MCMC Methods for Gaussian Markov Random Fields,” Journal of Statistical Software, 36, 1–25. DOI: 10.18637/jss.v036.i10.
  • Guinness, J. (2021), “Gaussian Process Learning via Fisher Scoring of Vecchia’s Approximation,” Statistics and Computing, 31, 25. DOI: 10.1007/s11222-021-09999-1.
  • Guinness, J., and Hammerling, D. (2018), “Compression and Conditional Emulation of Climate Model Output,” Journal of the American Statistical Association, 113, 56–67. DOI: 10.1080/01621459.2017.1395339.
  • Hu, W., and Castruccio, S. (2021), “Approximating the Internal Variability of Bias-Corrected Global Temperature Projections with Spatial Stochastic Generators,” Journal of Climate, 34, 8409–8418. DOI: 10.1175/JCLI-D-21-0083.1.
  • Huang, H., Castruccio, S., Baker, A. H., and Genton, M. G. (2023), “Saving Storage in Climate Ensembles: A Model-based Sstochastic Approach” (with Discussion), Journal of Agricultural, Biological and Environmental Statistics, 28, 324–344. DOI: 10.1007/s13253-022-00518-x.
  • Jeong, J., Castruccio, S., Crippa, P., and Genton, M. G. (2018), “Reducing Storage of Global Wind Ensembles with Stochastic Generators,” The Annals of Applied Statistics, 12, 490–509. DOI: 10.1214/17-AOAS1105.
  • Jeong, J., Jun, M., and Genton, M. (2017), “Spherical Process Models for Global Spatial Statistics,” Statistical Science, 32, 501–513. DOI: 10.1214/17-STS620.
  • Jeong, J., Yan, Y., Castruccio, S., and Genton, M. G. (2019), “A Stochastic Generator of Global Monthly Wind Energy with Tukey g-and-h Autoregressive Processes,” Statistica Sinica, 29, 1105–1126. DOI: 10.5705/ss.202017.0474.
  • Jones, R. H. (1963), “Stochastic Processes on a Sphere,” The Annals of Mathematical Statistics, 34, 213–218. DOI: 10.1214/aoms/1177704257.
  • Jun, M., and Stein, M. (2008), “Nonstationary Covariance Models for Global Data,” The Annals of Applied Statistics, 2, 1271–1289. DOI: 10.1214/08-AOAS183.
  • Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008), “Covariance Tapering for Likelihood-based Estimation in Large Spatial Data Sets,” Journal of the American Statistical Association, 103, 1545–1555. DOI: 10.1198/016214508000000959.
  • Kay, J. E., DeRepentigny, P., Holland, M. M., Bailey, D. A., DuVivier, A. K., Blanchard-Wrigglesworth, E., Deser, C., Jahn, A., Singh, H., Smith, M. M., et al. (2022), “Less Surface Sea Ice Melt in the CESM2 Improves Arctic Sea Ice Simulation with Minimal Non-Polar Climate Impacts,” Journal of Advances in Modeling Earth Systems, 14, e2021MS002679. DOI: 10.1029/2021MS002679.
  • Kay, J. E., Deser, C., Phillips, A., Mai, A., Hannay, C., Strand, G., Arblaster, J. M., Bates, S., Danabasoglu, G., Edwards, J., et al. (2015), “The Community Earth System Model (CESM) Large Ensemble Project: A Community Resource for Studying Climate Change in the Presence of Internal Climate Variability,” Bulletin of the American Meteorological Society, 96, 1333–1349. DOI: 10.1175/BAMS-D-13-00255.1.
  • Kennedy, M. C., and O’Hagan, A. (2001), “Bayesian Calibration of Computer Models,” Journal of the Royal Statistical Society, Series B, 63, 425–464. DOI: 10.1111/1467-9868.00294.
  • López-Pintado, S., and Romo, J. (2009), “On the Concept of Depth for Functional Data,” Journal of the American Statistical Association, 104, 718–734. DOI: 10.1198/jasa.2009.0108.
  • Lorenz, E. N. (1963), “Deterministic Nonperiodic Flow,” Journal of Atmospheric Sciences, 20, 130–141. DOI: 10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.
  • Muñoz, S. E., Dee, S. G., Luo, X., Haider, M. R., O’Donnell, M., Parazin, B., and Remo, J. W. F. (2023), “Mississippi River Low-Flows: Context, Causes, and Future Projections,” Environmental Research: Climate, 2, 031001. DOI: 10.1088/2752-5295/acd8e3.
  • National Oceanic and Atmospheric Administration (NOAA) (2022), “NOAA Celebrates 200 Years of Science, Service, and Stewardship,” Revised October 24, 2022.
  • Nychka, D., Hammerling, D., Sain, S., and Lenssen, N. (2016), “LatticeKrig: Multiresolution Kriging Based on Markov Random Fields,” R package version 8.4.
  • Oakley, J. E., and O’Hagan, A. (2002), “Bayesian Inference for the Uncertainty Distribution of Computer Model Outputs,” Biometrika, 89, 769–784. DOI: 10.1093/biomet/89.4.769.
  • ———(2004), “Probabilistic Sensitivity Analysis of Complex Models: A Bayesian Approach,” Journal of the Royal Statistical Society, Series B, 66, 751–769. DOI: 10.1111/j.1467-9868.2004.05304.x.
  • O’Neill, B., Kriegler, E., Riahi, K., Ebi, K., Hallegatte, S., Carter, T., Mathur, R., and Vuuren, D. (2014), “A New Scenario Framework for Climate Change Research: The Concept of Shared Socioeconomic Pathways,” Climatic Change, 122, 401–414. DOI: 10.1007/s10584-013-0971-5.
  • Riahi, K., van Vuuren, D. P., Kriegler, E., et al. (2017), “The Shared Socioeconomic Pathways and their Energy, Land Use, and Greenhouse Gas Emissions Implications: An Overview,” Global Environmental Change, 42, 153–168. DOI: 10.1016/j.gloenvcha.2016.05.009.
  • Rodgers, K. B., Lee, S.-S., Rosenbloom, N., et al. (2021), “Ubiquity of human-induced changes in climate variability,” Earth System Dynamics, 12, 1393–1411. DOI: 10.5194/esd-12-1393-2021.
  • Rue, H., Martino, S., and Chopin, N. (2009), “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations,” Journal of the Royal Statistical Society, Series B, 71, 319–392. DOI: 10.1111/j.1467-9868.2008.00700.x.
  • Sacks, J., Schiller, S., and Welch, W. (1989), “Designs for Computer Experiments,” Technometrics, 31, 41–47. DOI: 10.1080/00401706.1989.10488474.
  • Santambrogio, F. (2015), Optimal Transport for Applied Mathematicians (Vol. 55), pp. 58–63, New York: Birkäuser, 55.
  • Stein, M. L. (2007), “Spatial Variation of Total Column Ozone on a Global Scale,” The Annals of Applied Statistics, 1, 191–210. DOI: 10.1214/07-AOAS106.
  • ———(2008), “A Modeling Approach for Large Spatial Datasets,” Journal of the Korean Statistical Society, 37, 3–10. DOI: 10.1016/j.jkss.2007.09.001.
  • Sun, Y., and Genton, M. G. (2011), “Functional Boxplots,” Journal of Computational and Graphical Statistics, 20, 316–334. DOI: 10.1198/jcgs.2011.09224.
  • Tagle, F., Genton, M. G., Yip, A., Mostamandi, S., Stenchikov, G., and Castruccio, S. (2020), “A High-Resolution Bilevel Skew-t Stochastic Generator for Assessing Saudi Arabia’s Wind Energy Resources” (with Discussion), Environmetrics, 31, e2628.
  • Taylor, K. E., Stouffer, R. J., and Meehl, G. A. (2012), “An Overview of CMIP5 and the Experiment Design,” Bulletin of the American Meteorological Society, 93, 485–498. DOI: 10.1175/BAMS-D-11-00094.1.
  • Underwood, R., Bessac, J., Di, S., and Cappello, F. (2022), “Understanding the Effects of Modern Compressors on the Community Earth Science Model,” in 2022 IEEE/ACM 8th International Workshop on Data Analysis and Reduction for Big Scientific Data (DRBSD), IEEE, pp. 1–10. DOI: 10.1109/DRBSD56682.2022.00006.
  • Vecchia, A. V. (1988), “Estimation and Model Identification for Continuous Spatial Processes,” Journal of the Royal Statistical Society, Series B, 50, 297–312. DOI: 10.1111/j.2517-6161.1988.tb01729.x.
  • Yan, Y., and Genton, M. G. (2019), “Non-Gaussian Autoregressive Processes with Tukey g-and-h Transformations,” Environmetrics, 30, e2503.