
Unbiased Deep Solvers for Linear Parametric PDEs

Pages 299-329 | Received 06 Apr 2021, Accepted 14 Jan 2022, Published online: 28 Mar 2022

Abstract

We develop several deep learning algorithms for approximating families of parametric PDE solutions. The proposed algorithms approximate solutions together with their gradients, which, in the context of mathematical finance, means that the derivative prices and hedging strategies are computed simultaneously. Having approximated the gradient of the solution, one can combine it with a Monte Carlo simulation to remove the bias in the deep network approximation of the PDE solution (the derivative price). This is achieved by leveraging the Martingale Representation Theorem and combining the Monte Carlo simulation with the neural network. The resulting algorithm is robust with respect to the quality of the neural network approximation and consequently can be used as a black box in case only limited a priori information about the underlying problem is available. We believe this is important as neural network-based algorithms often require a fair amount of tuning to produce satisfactory results. The methods are empirically shown to work for high-dimensional problems (e.g., 100 dimensions). We provide diagnostics that shed light on appropriate network architectures.

1. Introduction

Numerical algorithms that solve PDEs suffer from the so-called 'curse of dimensionality', which makes it impractical to apply standard discretization algorithms, such as finite difference schemes, to high-dimensional PDEs. However, it has recently been shown that deep neural networks trained with stochastic gradient descent can overcome the curse of dimensionality (Beck, Gonon, and Jentzen 2020; Berner, Grohs, and Jentzen 2020), which has made them a popular choice for tackling this computational challenge in recent years.

In this work, we focus on the problem of numerically solving parametric linear PDEs arising from European option pricing in high dimensions. Let $B \subseteq \mathbb{R}^p$, $p \geq 1$, be a parameter space (for instance, in the Black–Scholes equation with fixed rate, $B$ is the domain of the volatility parameter). Consider $v = v(t,x;\beta)$ satisfying
$$\big[\partial_t v + b\,\nabla_x v + \tfrac{1}{2}\operatorname{tr}\!\big[\nabla_x^2 v\,\sigma\sigma^\top\big] - cv\big](t,x;\beta) = 0,\quad v(T,x;\beta) = g(x;\beta),\quad t \in [0,T],\ x \in \mathbb{R}^d,\ \beta \in B. \tag{1}$$
Here $t \in [0,T]$, $x \in \mathbb{R}^d$ and $\beta \in B$, and $b$, $\sigma$, $c$ and $g$ are functions of $(t,x;\beta)$ which specify the problem. The Feynman–Kac theorem provides a probabilistic representation for $v$, so that Monte Carlo methods can be used for its unbiased approximation at one single point $(t,x;\beta)$. What we propose in this work is a method for harnessing the power of deep learning algorithms to numerically solve (1) in a way that is robust even in edge cases where the output of the neural network is not of the expected quality, by combining them with Monte Carlo algorithms.

From the results in this article, we observe that neural networks provide an efficient computational device for high-dimensional problems. However, we also observed that these algorithms are sensitive to the network architecture, the parameters and the distribution of the training data; a fair amount of tuning is required to obtain good results. Based on this, we believe that there is great potential in combining artificial neural networks with already developed and well-understood probabilistic computational methods, in particular the control variate method, which uses potentially imperfect neural network approximations to obtain unbiased solutions to a given problem, see Algorithm 1.

1.1. Main Contributions

We propose three classes of learning algorithms for simultaneously finding the solutions and gradients of parametric families of PDEs.

  1. Projection solver: See Algorithm 2. We leverage the Feynman–Kac representation together with the fact that the conditional expectation can be viewed as an $L^2$-projection operator. The gradient can be obtained by automatic differentiation of the already-obtained approximation of the PDE solution.

  2. Martingale representation solver: See Algorithm 3. This algorithm was inspired by Cvitanic and Zhang (2005), Weinan, Han, and Jentzen (2017) and Han, Jentzen, and Weinan (2017), and is referred to as the deep BSDE solver. Our algorithm differs from Weinan, Han, and Jentzen (2017) in that we approximate the solution and its gradient at all the time steps and across the entire space and parameter domains, rather than at only one space-time point. Furthermore, we propose to approximate the solution map and its gradient by separate networks.

  3. Martingale control variates solver: See Algorithms 4 and 5. Here we exploit the fact that the martingale representation induces a control variate that can produce a zero-variance estimator. Obviously, such a control variate is not implementable, but it provides a basis for a novel learning algorithm for the PDE solution.

For each of these classes of algorithms, we develop and test different implementation strategies. Indeed, one can either take one (large) network to approximate the entire family of solutions of (1), or take a number of (smaller) networks, each of which approximates the solution at one time point of a grid. The former has the advantage that one can take an arbitrarily fine time discretization without increasing the overall network size. The advantage of the latter is that each learning task is simpler, due to each network being smaller. One can further leverage the smoothness of the solution in time and learn the weights iteratively, initializing the network parameters at each time step to those of the previous time step. We test both approaches numerically. At a high level, all the algorithms work in the path-dependent (non-Markovian) setting, but there the challenge is an efficient method for encoding the information in each path. This problem is solved in the companion paper (Sabate-Vidales, Šiška, and Szpruch 2020).

To summarize, the key contributions of this work are:

  1. We derive and implement three classes of learning algorithms for the approximation of a parametric PDE solution map and its gradient.

  2. We propose a novel iterative training algorithm that exploits the regularity of the function we seek to approximate and allows using neural networks with a smaller number of parameters.

  3. The proposed algorithms are truly black-box in that the quality of the network approximation only impacts the computational benefit of the approach and does not introduce approximation bias. This is achieved by combining the network approximation with Monte Carlo as stated in Algorithm 1.

  4. Code for the numerical experiments presented in this paper is being made available on GitHub: https://github.com/msabvid/Deep-PDE-Solvers.

We stress the importance of point 3 above by directing the reader's attention to Figure 1, where we test the generalization error of a trained neural network for the five-dimensional family of PDEs corresponding to pricing a basket option under the Black–Scholes model. We refer the reader to Example A.2 for details. We see that while the average error over the test set is of order $10^{-5}$, the errors for a given input vary significantly. Indeed, it has been observed in the deep learning community that for high-dimensional problems one can find input data on which a trained neural network that appears to generalize well (i.e., achieves small errors on out-of-training data) produces poor results (Goodfellow, Shlens, and Szegedy 2014).

Figure 1. Histogram of mean-square-error of solution to the PDE on the test data set.


1.2. Literature Review

Deep neural networks trained with stochastic gradient descent have proved extremely successful in a number of applications such as computer vision, natural language processing, generative models and reinforcement learning (LeCun, Bengio, and Hinton 2015). Their application to PDE solvers is relatively new and has been pioneered by Weinan, Han, and Jentzen (2017), Han, Jentzen, and Weinan (2017) and Sirignano and Spiliopoulos (2017). See also Cvitanic and Zhang (2005) for the idea of solving PDEs with gradient methods and for a direct PDE approximation algorithm. PDEs provide an excellent test bed for neural network approximation because (a) there exist alternative solvers, e.g., Monte Carlo, and (b) we have a well-developed theory for PDEs, and that knowledge can be used to tune the algorithms. This is in contrast to mainstream neural network applications such as text or image classification.

Apart from the growing body of empirical results in the literature on 'deep PDE solvers', Chan-Wai-Nam, Mikael, and Warin (2019), Huré, Pham, and Warin (2019), Beck et al. (2018), Jacquier and Oumgari (2019) and Henry-Labordere (2017), there have also been some important theoretical contributions. It has been proved that deep artificial neural networks approximate solutions to parabolic PDEs to an arbitrary accuracy without suffering from the curse of dimensionality. The first mathematically rigorous proofs are given in Grohs et al. (2018) and Jentzen, Salimova, and Welti (2018). The high-level idea is that a neural network approximation to the PDE can be established by building on the Feynman–Kac representation and a Monte Carlo approximation: by checking that Monte Carlo simulations do not suffer from the curse of dimensionality, one can deduce that the same is true for the neural network approximation. Furthermore, it has recently been demonstrated in Hu et al. (2021) and Mei, Montanari, and Nguyen (2018) that the noisy gradient descent algorithm used for training neural networks of the form considered in Grohs et al. (2018) and Jentzen, Salimova, and Welti (2018) induces a unique probability distribution over the parameter space which minimizes the learning objective. See Du et al. (2018), Chizat and Bach (2018), Rotskoff and Vanden-Eijnden (2018), Sirignano and Spiliopoulos (2020), Wang and Tang (2021) and Han and Long (2020) for related ideas on the convergence of gradient algorithms for overparametrized neural networks. This means that there are theoretical guarantees for the approximation of (parabolic) PDEs with neural networks trained by noisy gradient methods, alleviating the curse of dimensionality.

An important application of deep PDE solvers is that one can in fact approximate a parametric family of solutions of a PDE. To be more precise, let $B \subseteq \mathbb{R}^p$, $p \geq 1$, be a parameter space. In the context of finance, the parameters might be, for example, the initial volatility, the volatility of volatility, the interest rate and mean-reversion parameters. One can approximate the parametric family of functions $\{F(\cdot;\beta)\}_{\beta \in B}$ for an arbitrary range of parameters. This then allows for swift calibration of models to data (e.g., option prices). This is particularly appealing for high-dimensional problems, where calibrating directly using noisy Monte Carlo samples might be inefficient. This line of research has gained some popularity recently and the idea has been tested numerically on various models and data sets (Horvath, Muguruza, and Tomas 2019; Liu et al. 2019; Bayer and Stemper 2018; Stone 2018; Hernandez 2016; Itkin 2019; McGhee 2018). Some remarks are in order. In the context of model calibration, while the training might be expensive, it can be done offline, once and for all. One can also notice that the training data could be used to produce a 'look-up table' taking model parameters to prices. From this perspective, the neural network essentially becomes an interpolator and a compression tool: the number of parameters of the network is much smaller than the number of training data points, and it is therefore more efficient to store the former. The final remark is that while there are other methods available, such as Chebyshev functions, neural networks appear robust in high dimensions, which makes them our method of choice.

1.3. Notation

We denote by $\mathcal{DN}$ the set of all fully connected feedforward neural networks (see Appendix 3). We also use $\mathcal{R}[f]_\theta \in \mathcal{DN}$ with $\theta \in \mathbb{R}^\kappa$ to denote a neural network with weights $\theta$ approximating the function $f : \mathbb{R}^{d_0} \to \mathbb{R}^{d_1}$ for some $d_0, d_1 \in \mathbb{N}$.

1.4. Outline

This paper is organized as follows. Section 2 provides the theoretical underpinning for the derivation of all the algorithms we propose to solve (1). More specifically, in Section 2.2 we combine the approximation of the gradient of the PDE solution produced by the deep learning algorithms with Monte Carlo to obtain an unbiased approximation of the solution of the PDE. In Section 3, we describe the algorithms in detail.

Finally, in Section 4, we provide numerical tests of the proposed algorithms. We empirically test these methods on relevant examples, including a 100-dimensional option pricing problem, see Examples 4.4 and A.3. We carefully measure the training cost and report the variance reduction achieved.

Since we work in a situation where the function approximated by the neural network can also be obtained via other methods (Monte Carlo, PDE solvers), we are able to test how the expressiveness of fully connected artificial neural networks depends on the number of layers and the number of neurons per layer. See Section A.2 for details.

2. PDE Martingale Control Variate

The control variate method is one of the most powerful variance reduction techniques for Monte Carlo simulation. While a good control variate can reduce the computational cost of a Monte Carlo computation by several orders of magnitude, it relies on judiciously chosen control variate functions that are problem specific. For example, when computing the price of a basket option, a sound strategy is to choose as control variates call options written on each of the stocks in the basket, since in many models these are priced by closed-form formulae. In this article, we are interested in a black-box-type control variate approach that leverages the Martingale Representation Theorem and neural networks. The idea of using the martingale representation to obtain control variates goes back at least to Newton (1994). It has been further studied in combination with regression in Milstein and Tretyakov (2009) and Belomestny et al. (2018).

The bias in the approximation of the solution can be completely removed by employing control variates, where the deep network provides the control variate, resulting in a very high variance reduction factor in the corresponding Monte Carlo simulation.

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and consider an $\mathbb{R}^d$-valued Wiener process $W = (W^j)_{j=1}^d = ((W_t^j)_{t \geq 0})_{j=1}^d$. We will use $(\mathcal{F}_t^W)_{t \geq 0}$ to denote the filtration generated by $W$. Consider a $D \subseteq \mathbb{R}^d$-valued, continuous stochastic process, defined for the parameters $\beta \in B \subseteq \mathbb{R}^p$, $X^\beta = (X^{\beta,i})_{i=1}^d = ((X_t^{\beta,i})_{t \geq 0})_{i=1}^d$, adapted to $(\mathcal{F}_t^W)_{t \geq 0}$ and given as the solution to
$$dX_s^\beta = b(s, X_s^\beta; \beta)\,ds + \sigma(s, X_s^\beta; \beta)\,dW_s,\quad s \in [t,T],\quad X_t^\beta = x \in \mathbb{R}^d. \tag{2}$$
We will use $(\mathcal{F}_t^\beta)_{t \geq 0}$ to denote the filtration generated by $X^\beta$.

Let $g : \mathbb{R}^d \to \mathbb{R}$ be a measurable function and assume that there is a (stochastic) discount factor given by $D(t_1, t_2; \beta) := e^{-\int_{t_1}^{t_2} c(s, X_s^\beta; \beta)\,ds}$ for an appropriate function $c = c(t,x;\beta)$. We will omit $\beta$ from the discount factor notation for brevity. We now interpret $\mathbb{P}$ as some risk-neutral measure, and so the $\mathbb{P}$-price of our contingent claim is
$$v(t,x;\beta) := \mathbb{E}\big[D(t,T)\,g(X_T^\beta)\,\big|\,X_t^\beta = x\big]. \tag{3}$$
Say we have i.i.d. random variables $(X_T^{\beta,i})_{i=1}^N$ with the same distribution as $X_T^\beta$, where $X_t^{\beta,i} = x$ for each $i$. Then the standard Monte Carlo estimator is $v^N(t,x;\beta) := \frac{1}{N}\sum_{i=1}^N D^i(t,T)\,g(X_T^{\beta,i})$. Convergence $v^N(t,x;\beta) \to v(t,x;\beta)$ in probability as $N \to \infty$ is granted by the Law of Large Numbers. Moreover, the classical Central Limit Theorem tells us that
$$\mathbb{P}\Big(v(t,x;\beta) \in \Big[v^N(t,x;\beta) - z_{\alpha/2}\tfrac{\sigma}{\sqrt{N}},\; v^N(t,x;\beta) + z_{\alpha/2}\tfrac{\sigma}{\sqrt{N}}\Big]\Big) \to 1 - \alpha \quad \text{as } N \to \infty,$$
where $\sigma^2 := \operatorname{Var}[D(t,T)\,g(X_T^\beta)]$ and $z_{\alpha/2}$ is such that $1 - \Phi(z_{\alpha/2}) = \alpha/2$, with $\Phi$ the cumulative distribution function of the standard normal distribution. To decrease the width of the confidence interval one can increase $N$, but this also increases the computational cost. A better strategy is to reduce the variance by finding an alternative Monte Carlo estimator, say $V^N(t,x;\beta)$, such that
$$\mathbb{E}[V^N(t,x;\beta)] = v(t,x;\beta) \quad \text{and} \quad \operatorname{Var}[V^N(t,x;\beta)] < \operatorname{Var}[v^N(t,x;\beta)], \tag{4}$$
and the cost of computing $V^N(t,x;\beta)$ is similar to that of $v^N(t,x;\beta)$.

In the remainder of the article, we will devise and test several strategies, based on deep learning, to find a suitable estimator $V^N(t,x;\beta)$, by exploiting the connection between the SDE (2) and its associated PDE.

2.1. PDE Derivation of the Control Variate

It can be shown that, under suitable assumptions on $b$, $\sigma$, $c$ and $g$, and for fixed $\beta \in B$, we have $v \in C^{1,2}([0,T] \times D)$. See, e.g., Krylov (1999). Let $a := \frac{1}{2}\sigma\sigma^\top$. Then, from the Feynman–Kac formula (see, e.g., Th. 8.2.1 in Øksendal 2003), we get
$$\begin{cases} \big[\partial_t v + \operatorname{tr}(a\,\nabla_x^2 v) + b\,\nabla_x v - cv\big](t,x;\beta) = 0 & \text{in } [0,T) \times D,\\ v(T,\cdot) = g & \text{on } D. \end{cases} \tag{5}$$
Since $v \in C^{1,2}([0,T] \times D)$ and $v$ satisfies the above PDE, applying Itô's formula yields
$$D(t,T)\,v(T, X_T^\beta; \beta) = v(t,x;\beta) + \int_t^T D(t,s)\,\nabla_x v(s, X_s^\beta; \beta)\,\sigma(s, X_s^\beta; \beta)\,dW_s. \tag{6}$$
Hence the Feynman–Kac representation, together with the fact that $v(T, X_T^\beta; \beta) = g(X_T^\beta)$, yields
$$v(t,x;\beta) = D(t,T)\,g(X_T^\beta) - \int_t^T D(t,s)\,\nabla_x v(s, X_s^\beta; \beta)\,\sigma(s, X_s^\beta; \beta)\,dW_s. \tag{7}$$
Provided that $\sup_{s \in [t,T]} \mathbb{E}\big[|D(t,s)\,\nabla_x v(s, X_s^\beta; \beta)\,\sigma(s, X_s^\beta; \beta)|^2\big] < \infty$, the stochastic integral is a martingale. Thus we can consider the Monte Carlo estimator
$$V^N(t,x;\beta) := \frac{1}{N}\sum_{i=1}^N \Big\{ D^i(t,T)\,g(X_T^{\beta,i}) - \int_t^T D^i(t,s)\,\nabla_x v(s, X_s^{\beta,i}; \beta)\,\sigma(s, X_s^{\beta,i}; \beta)\,dW_s^i \Big\}. \tag{8}$$
To obtain a control variate, we thus need to approximate $\nabla_x v$. If one used classical approximation techniques for the PDE, such as finite difference or finite element methods, one would run into the curse of dimensionality – the very reason one employs Monte Carlo simulations in the first place. Artificial neural networks have been shown to break the curse of dimensionality in specific situations (Grohs et al. 2018). To be more precise, the authors of Berner, Grohs, and Jentzen (2020), Elbrächter et al. (2022), Jentzen, Salimova, and Welti (2018), Grohs, Jentzen, and Salimova (2019), Hutzenthaler et al. (2020), Grohs et al. (2018), Kutyniok et al. (2020), Gonon et al. (2021) and Reisinger and Zhang (2020) have shown that there always exists a deep feedforward neural network with parameters such that the network approximates the solution of a linear PDE arbitrarily well in a suitable norm, under reasonable assumptions (the terminal condition and the coefficients can themselves be approximated by neural networks). Moreover, the number of parameters grows only polynomially in the dimension, so there is no curse of dimensionality. However, while the papers above construct the network, they do not tell us how to find the 'good' parameters: in practice, the parameter search still relies on gradient descent-based minimization over a non-convex landscape. The application of the deep-network approximation of the PDE solution as a martingale control variate is an ideal compromise.

If there is no exact solution to the PDE (5), as would be the case in any reasonable application, then we will approximate $\nabla_x v$ by $\mathcal{R}[\nabla_x v]_\theta \in \mathcal{DN}$.

To obtain an implementable algorithm, we discretize the integrals above. We take a partition of $[0,T]$, denoted $\pi := \{t = t_0 < \cdots < t_{N_{\text{steps}}} = T\}$, and consider an approximation of (2) by $(X_{t_k}^{\beta,\pi})_{t_k \in \pi}$. For simplicity, we approximate all the integrals that arise by Riemann sums, always taking the left-hand point when approximating the value of the integrand.

The implementable control variate Monte Carlo estimator is then of the form
$$V^{\pi,\theta,\lambda,N}(t,x;\beta) := \frac{1}{N}\sum_{i=1}^N \Big\{ (D^\pi(t,T))^i\,g(X_T^{\beta,\pi,i}) - \lambda \sum_{k=0}^{N_{\text{steps}}-1} (D^\pi(t,t_k))^i\,\mathcal{R}[\nabla_x v]_\theta(t_k, X_{t_k}^{\beta,\pi,i}; \beta)\,\sigma(t_k, X_{t_k}^{\beta,\pi,i}; \beta)\,(W_{t_{k+1}}^i - W_{t_k}^i) \Big\}, \tag{9}$$
where $D^\pi(t,T) := e^{-\sum_{k=0}^{N_{\text{steps}}-1} c(t_k, X_{t_k}^{\beta,\pi})(t_{k+1} - t_k)}$ and $\lambda$ is a free parameter to be chosen (because we discretize and use an approximation of the PDE, it is expected that $\lambda \approx 1$). Again, we point out that the only bias of the above estimator comes from the numerical scheme used to solve the forward and backward processes. In particular, $\mathcal{R}[\nabla_x v]_\theta$ does not add any additional bias, regardless of the choice of $\theta$. We will discuss possible strategies for approximating $\nabla_x v$ with $\mathcal{R}[\nabla_x v]_\theta$ in the following section.

In this section, we have in fact derived an explicit form of the martingale representation (see, e.g., Cohen and Elliott 2015, Th. 14.5.1) of $D(t,T)\,g(X_T^\beta)$ in terms of the solution of the PDE associated with the process $X^\beta$, which is given as the solution to (2). In Appendix 1, we provide a more general framework for building a low-variance Monte Carlo estimator $V_t^N$ for any (possibly non-Markovian) $\mathcal{F}^W$-adapted process $X^\beta$.

2.2. Unbiased Parametric PDE Approximation

After having trained the networks $\mathcal{R}[\nabla_x v]_\theta$ and $\mathcal{R}[v]_\eta$ that approximate $\nabla_x v$ and $v$ (using any of Algorithms 2 and 3, which we will introduce in Section 3), one has two options to approximate $v(t, x_t; \beta)$:

  1. Directly with $\mathcal{R}[v]_\eta(t, x_t; \beta)$ if Algorithm 2 or 3 was used, which will introduce some approximation bias.

  2. By combining $\mathcal{R}[\nabla_x v]_\theta$ with the Monte Carlo approximation of $v(t, x_t; \beta)$ using (9), which will yield an unbiased estimator of $v(t, x_t; \beta)$. The complete method is stated as Algorithm 1; a code sketch follows below.
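To make option 2 concrete, the following is a minimal sketch of the estimator (9) with $\lambda = 1$, written in PyTorch. It assumes a trained gradient network `grad_v_net(t, x, beta)` approximating $\nabla_x v$, together with user-supplied coefficient functions `b`, `sigma`, `c` and payoff `g`; all names are ours for illustration and do not come from the published code.

```python
import torch

def mc_control_variate(g, b, sigma, c, grad_v_net, x0, beta, T, n_steps, n_paths):
    """Unbiased estimator (9) with lambda = 1: Euler scheme for (2) plus the
    left-point Riemann sum of the control variate stochastic integral.
    Any error in grad_v_net increases the variance but never adds bias."""
    h = T / n_steps
    x = x0.unsqueeze(0).expand(n_paths, -1).clone()   # (N, d) current states
    disc = torch.ones(n_paths)                        # running discount D(t, t_k)
    mart = torch.zeros(n_paths)                       # accumulated control variate
    t = 0.0
    for _ in range(n_steps):
        dw = torch.randn_like(x) * h ** 0.5           # Brownian increments
        vol = sigma(t, x, beta)                       # (N, d, d) diffusion matrix
        grad = grad_v_net(t, x, beta)                 # (N, d) approx. of grad_x v
        mart = mart + disc * torch.einsum('ni,nij,nj->n', grad, vol, dw)
        disc = disc * torch.exp(-c(t, x, beta) * h)   # left-point discounting
        x = x + b(t, x, beta) * h + torch.einsum('nij,nj->ni', vol, dw)
        t += h
    samples = disc * g(x) - mart                      # D(t,T) g(X_T) minus control variate
    return samples.mean(), samples.var()
```

Here `sigma` is assumed to return the full diffusion matrix per path; in the Black–Scholes model of Section 4.1, one would instead simulate (12) exactly.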

3. Deep PDE Solvers

In this section, we propose two algorithms that learn the PDE solution (or its gradient) and then use it to build the control variate via (9). We also include in Appendix 2 an additional algorithm for solving such linear PDEs using deep neural networks.

3.1. Projection Solver

Before we proceed further, we recall a well-known property of conditional expectations; for a proof see, e.g., Krylov (2002, Ch. 3, Th. 14).

Theorem 3.1

Let $X \in L^2(\mathcal{F})$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. There exists a random variable $Y \in L^2(\mathcal{G})$ such that $\mathbb{E}[|X - Y|^2] = \inf_{\eta \in L^2(\mathcal{G})} \mathbb{E}[|X - \eta|^2]$. The minimizer $Y$ is unique and is given by $Y = \mathbb{E}[X \mid \mathcal{G}]$.

The theorem tells us that the conditional expectation is an orthogonal projection of a random variable $X$ onto $L^2(\mathcal{G})$. Instead of working directly with (5), we work with its probabilistic representation (6). To formulate the learning task, we take $X = D(t,T)\,g(X_T^\beta)$, so that $v(t, X_t^\beta; \beta) = \mathbb{E}[X \mid X_t^\beta]$. Hence, by Theorem 3.1, $\mathbb{E}[|X - v(t, X_t^\beta; \beta)|^2] = \inf_{\eta \in L^2(\sigma(X_t^\beta))} \mathbb{E}[|X - \eta|^2]$, and we know that, for fixed $t$, the random variable which minimizes the mean square error is a function of $X_t^\beta$. By the Doob–Dynkin Lemma (Cohen and Elliott 2015, Th. 1.3.12), every $\eta \in L^2(\sigma(X_t^\beta))$ can be expressed as $\eta = h_t(X_t^\beta)$ for some appropriate measurable $h_t$. For the practical algorithm, we restrict the search for the function $h_t$ to the class of functions that can be expressed as deep neural networks in $\mathcal{DN}$. Hence we consider a family of networks $\mathcal{R}_\theta \in \mathcal{DN}$ and set the learning task as
$$\theta^* := \operatorname*{argmin}_\theta\, \mathbb{E}_\beta\Big[\mathbb{E}_{(X_t^{\beta,\pi})_{t \in \pi}}\Big[\sum_{k=0}^{N_{\text{steps}}} \big(D(t_k, T)\,g(X_T^{\beta,\pi}) - \mathcal{R}[v]_\theta^{t_k}(X_{t_k}^{\beta,\pi}; \beta)\big)^2\Big]\Big]. \tag{10}$$
The inner expectation in (10) is taken across all paths generated using a numerical scheme for (2) with a fixed $\beta$, and it allows us to solve the PDE (5) for that $\beta$. The outer expectation is taken over $\beta$, whose distribution is fixed beforehand (e.g., uniform on $B$ if $B$ is compact), thus allowing the algorithm to find the optimal neural network weights $\theta$ for the parametric family of PDEs (5). Automatic differentiation is used to approximate $\nabla_x v$. Algorithm 2 describes the method; a code sketch of the loss follows.
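As an illustration, here is a minimal sketch, under our own naming assumptions, of the Monte Carlo version of the loss (10): one network `v_nets[k]` per grid point $t_k$, pre-simulated paths of shape `(N, n_steps + 1, d)` and running discount factors `disc[:, k]` $\approx D(t, t_k)$.

```python
import torch

def projection_loss(v_nets, g, paths, disc, beta):
    """Empirical version of the projection loss (10): at each t_k the network
    is fitted as the L^2-projection of the discounted payoff onto functions
    of X_{t_k}."""
    payoff = g(paths[:, -1, :])                        # g(X_T), shape (N,)
    loss = torch.tensor(0.0)
    for k, net in enumerate(v_nets):
        d_k_T = disc[:, -1] / disc[:, k]               # D(t_k, T) along each path
        pred = net(paths[:, k, :], beta).squeeze(-1)   # R[v]_theta^{t_k}
        loss = loss + ((d_k_T * payoff - pred) ** 2).mean()
    return loss
```

The gradient $\nabla_x v$ needed for the control variate is then obtained from the trained networks by automatic differentiation, e.g. via `torch.autograd.grad`.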

3.2. Probabilistic Representation Based on Backward SDE

Instead of working directly with (5), we work with its probabilistic representation (6) and view it as a BSDE. To formulate the learning task based on this, we recall the time grid $\pi$, so that we can write it recursively as
$$v(t_{N_{\text{steps}}}, X_{t_{N_{\text{steps}}}}^\beta; \beta) = g(X_{t_{N_{\text{steps}}}}^\beta),$$
$$D(t, t_{m+1})\,v(t_{m+1}, X_{t_{m+1}}^\beta; \beta) = D(t, t_m)\,v(t_m, X_{t_m}^\beta; \beta) + \int_{t_m}^{t_{m+1}} D(t,s)\,\nabla_x v(s, X_s^\beta; \beta)\,\sigma(s, X_s^\beta; \beta)\,dW_s$$
for $m = 0, 1, \ldots, N_{\text{steps}} - 1$. Next, consider deep network approximations for each time step in $\pi$, for both the solution of (5) and its gradient:
$$\mathcal{R}[v]_{\eta_m}(x;\beta) \approx v(t_m, x; \beta), \qquad \mathcal{R}[\nabla_x v]_{\theta_m}(x;\beta) \approx \nabla_x v(t_m, x; \beta), \qquad t_m \in \pi,\ x \in \mathbb{R}^d.$$
The approximations depend on the weights $\eta_m \in \mathbb{R}^{k_\eta}$, $\theta_m \in \mathbb{R}^{k_\theta}$. We then set the learning task as
$$(\eta^*, \theta^*) := \operatorname*{argmin}_{(\eta,\theta)}\, \mathbb{E}_{\beta, X^\beta}\Big[\big|g(X_{t_{N_{\text{steps}}}}^{\beta,\pi}) - \mathcal{R}[v]_{\eta_{N_{\text{steps}}}}(X_{t_{N_{\text{steps}}}}^{\beta,\pi})\big|^2 + \frac{1}{N_{\text{steps}}}\sum_{m=0}^{N_{\text{steps}}-1} |E_{m+1}(\eta,\theta)|^2\Big], \tag{11}$$
$$E_{m+1}(\eta,\theta) := D(t, t_{m+1})\,\mathcal{R}[v]_{\eta_{m+1}}(X_{t_{m+1}}^{\beta,\pi};\beta) - D(t, t_m)\,\mathcal{R}[v]_{\eta_m}(X_{t_m}^{\beta,\pi};\beta) - D(t, t_m)\,\mathcal{R}[\nabla_x v]_{\theta_m}(X_{t_m}^{\beta,\pi};\beta)\,\sigma(t_m, X_{t_m}^{\beta,\pi};\beta)\,\Delta W_{t_{m+1}},$$
where $\eta = \{\eta_0, \ldots, \eta_{N_{\text{steps}}}\}$, $\theta = \{\theta_0, \ldots, \theta_{N_{\text{steps}}}\}$. The complete learning method is stated as Algorithm 3, where we split the optimization (11) into several optimization problems, one per time step: learning the weights $\theta_m$ or $\eta_m$ at a time step $t_m < t_{N_{\text{steps}}}$ only requires knowing the weights $\eta_{m+1}$. At $m = N_{\text{steps}}$, learning the weights $\eta_{N_{\text{steps}}}$ only requires the terminal condition $g$. Note that the algorithm assumes that networks adjacent in time will be similar; we therefore initialize $\eta_m$ and $\theta_m$ with $\eta_{m+1}$ and $\theta_{m+1}$. A sketch of one backward training step follows.
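A minimal sketch of one backward step of this procedure, under our own naming conventions: `v_m`, `z_m` are the networks at $t_m$, `v_m1` is the frozen value network at $t_{m+1}$, and `d_m`, `d_m1` are the discount factors $D(t,t_m)$, $D(t,t_{m+1})$ along each simulated path.

```python
import torch

def train_time_step(v_m, z_m, v_m1, x_m, x_m1, d_m, d_m1, sig_m, dw, beta,
                    n_iters=500, lr=1e-3):
    """Fit v_m ~ v(t_m, .) and z_m ~ grad_x v(t_m, .) by minimizing the
    one-step residual |E_{m+1}|^2 from (11); v_m1 is kept frozen."""
    # warm start: networks adjacent in time are expected to be similar
    v_m.load_state_dict(v_m1.state_dict())
    opt = torch.optim.Adam(list(v_m.parameters()) + list(z_m.parameters()), lr=lr)
    with torch.no_grad():
        target = d_m1 * v_m1(x_m1, beta).squeeze(-1)   # D(t,t_{m+1}) R[v]_{m+1}
    for _ in range(n_iters):
        pred = d_m * v_m(x_m, beta).squeeze(-1)        # D(t,t_m) R[v]_m
        stoch = d_m * torch.einsum('ni,nij,nj->n', z_m(x_m, beta), sig_m, dw)
        loss = ((target - pred - stoch) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```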

3.3. Martingale Control Variate Deep Solvers

So far, the presented methodology for obtaining the control variate consists of first learning the solution of the PDE and, more importantly, its gradient (Algorithms 2 and 3), which is then plugged into (9). Alternatively, one can directly use the variance of (9) as the loss function to be optimized in order to learn the control variate. We expand on this idea and design two additional algorithms.

Recall the definition of $V_{t,T}^{\beta,\pi,\theta,\lambda,N}$ given by (9). From (8) we know that the theoretical control variate Monte Carlo estimator has zero variance, and so it is natural to set up a learning task which aims to learn the network weights $\theta$ in a way which minimizes said variance:
$$\theta^{*,\text{var}} := \operatorname*{argmin}_\theta\, \operatorname{Var}\big[V_{t,T}^{\beta,\pi,\theta,\lambda,N}\big].$$
Setting $\lambda = 1$, the learning task is stated as Algorithm 4; a sketch of the loss follows.
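A minimal sketch of this loss, with our own naming; `paths`, `disc`, `dw` and the diffusion values `sig` are pre-simulated as before.

```python
import torch

def variance_loss(grad_v_net, g, paths, disc, sig, dw, beta):
    """Empirical variance of the controlled samples in (9) with lambda = 1,
    used directly as the training objective of Algorithm 4."""
    payoff = disc[:, -1] * g(paths[:, -1, :])          # D(t,T) g(X_T)
    mart = torch.zeros_like(payoff)
    for k in range(paths.shape[1] - 1):
        grad = grad_v_net(k, paths[:, k, :], beta)     # R[grad_x v]_theta at t_k
        mart = mart + disc[:, k] * torch.einsum(
            'ni,nij,nj->n', grad, sig[:, k], dw[:, k])
    return (payoff - mart).var()                       # minimized over theta
```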

We include a second, similar algorithm in Appendix 2.

4. Examples and Experiments

4.1. Options in Black–Scholes Model on d>1 Assets

Take a $d$-dimensional Wiener process $W$. We assume that we are given a symmetric, positive-definite matrix (covariance matrix) $\Sigma$ and a lower triangular matrix $C$ s.t. $\Sigma = CC^\top$. For such a positive-definite $\Sigma$ we can always use the Cholesky decomposition to find $C$. The risky assets will have volatilities given by $\sigma^i$. We will (abusing notation) write $\sigma^{ij} := \sigma^i C^{ij}$ when we don't need to separate the volatility of a single asset from the correlations. The risky assets under the risk-neutral measure are then given by
$$dS_t^i = r S_t^i\,dt + \sigma^i S_t^i \sum_j C^{ij}\,dW_t^j. \tag{12}$$
All sums will be from 1 to $d$ unless indicated otherwise. Note that the SDE can be simulated exactly, since
$$S_{t_{n+1}}^i = S_{t_n}^i \exp\Big(\Big(r - \tfrac{1}{2}\sum_j (\sigma^{ij})^2\Big)(t_{n+1} - t_n) + \sum_j \sigma^{ij}\big(W_{t_{n+1}}^j - W_{t_n}^j\big)\Big).$$
The associated PDE is (with $a^{ij} := \sum_k \sigma^{ik}\sigma^{jk}$)
$$\partial_t v(t,S) + \tfrac{1}{2}\sum_{i,j} a^{ij} S^i S^j\,\partial_{S^i}\partial_{S^j} v(t,S) + r\sum_i S^i\,\partial_{S^i} v(t,S) - r\,v(t,S) = 0,$$
for $(t,S) \in [0,T) \times (\mathbb{R}^+)^d$, together with the terminal condition $v(T,S) = g(S)$ for $S \in (\mathbb{R}^+)^d$.
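For concreteness, a short sketch of the exact simulation (function and variable names are ours):

```python
import torch

def simulate_bs_paths(s0, r, sigma, chol, T, n_steps, n_paths):
    """Exact simulation of (12): the log-price increments are Gaussian.
    s0: (d,) initial prices, sigma: (d,) volatilities,
    chol: (d, d) lower-triangular C with Sigma = C C^T."""
    d = s0.shape[0]
    h = T / n_steps
    sig = sigma[:, None] * chol                     # sigma^{ij} = sigma^i C^{ij}
    drift = (r - 0.5 * (sig ** 2).sum(dim=1)) * h   # (d,) exact log-drift per step
    s = torch.empty(n_paths, n_steps + 1, d)
    s[:, 0] = s0
    for n in range(n_steps):
        dw = torch.randn(n_paths, d) * h ** 0.5     # Brownian increments
        s[:, n + 1] = s[:, n] * torch.exp(drift + dw @ sig.T)
    return s
```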

4.2. Deep Learning Setting

In this section, we describe the neural networks used in the four proposed algorithms, as well as the training setting, in the specific situation of an options problem in the Black–Scholes model on $d > 1$ assets.

Learning Algorithms 3–5 share the same underlying fully connected network architecture, with different networks used at different $t_k$, $k = 0, 1, \ldots, N_{\text{steps}} - 1$. At each time step, we use a fully connected artificial neural network denoted $\mathcal{R}[\cdot]_{\theta_k} \in \mathcal{DN}$. The choice of the number of layers and the network width is motivated by empirical results on different possible architectures applied to a short-lived options problem. We present the results of this study in Appendix A.2. The architecture is similar to that proposed in Beck et al. (2018).

At each time step, the network consists of four layers: one $d$-dimensional input layer, two $(d+20)$-dimensional hidden layers and one output layer. The output layer is one-dimensional if the network is an approximation of $v$ and $d$-dimensional if the network is an approximation of $\nabla_x v$. The non-linear activation function used on the hidden layers is the rectified linear unit (ReLU). In all experiments, except for Algorithm 3 on the basket options problem, we used batch normalization (Ioffe and Szegedy 2015) on the input of each network, just before the two non-linear activations in front of the hidden layers, and also after the last linear transformation.

The networks' optimal parameters are approximated by the Adam optimizer (Kingma and Ba 2017) on the loss function specific to each method. Each parameter update (i.e., one step of the optimizer) is calculated on a batch of $5 \times 10^3$ paths $(x_{t_n}^i)_{n=0}^{N_{\text{steps}}}$ obtained by simulating the SDE. We take the necessary number of training steps until the stopping criterion defined below is met, with a learning rate of $10^{-3}$ during the first $10^4$ iterations, decreased to $10^{-4}$ afterwards.

During training of any of the algorithms, the loss value at each iteration is recorded. A model is considered trained if the difference between the loss averages over the two last consecutive windows of length 100 is less than a certain $\epsilon$.
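In code, the stopping criterion can be expressed as follows (a sketch; the window length 100 and the tolerance $\epsilon$ match the description above):

```python
def has_converged(losses, window=100, eps=5e-6):
    """Stop when the averages over the two last consecutive windows of
    `window` recorded loss values differ by less than eps."""
    if len(losses) < 2 * window:
        return False
    last = sum(losses[-window:]) / window
    prev = sum(losses[-2 * window:-window]) / window
    return abs(last - prev) < eps
```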

4.3. Evaluating Variance Reduction

We use the specified network architectures to assess the variance reduction in several examples below. After training the models in each particular example, they are evaluated as follows:

  1. We calculate, $N_{\text{MC}} = 10$ times, the Monte Carlo estimate $\bar{\Xi}_T := \frac{1}{N_{\text{in}}}\sum_{i=1}^{N_{\text{in}}} \Xi_T^i$ and the Monte Carlo with control variate estimate $\bar{V}_{t,T}^{\pi,\theta,\lambda,N_{\text{steps}}} = \frac{1}{N_{\text{in}}}\sum_{i=1}^{N_{\text{in}}} V_{t,T}^{\pi,\theta,\lambda,N_{\text{steps}},i}$, using $N_{\text{in}} = 10^6$ Monte Carlo samples.

  2. By the Central Limit Theorem, as $N_{\text{in}}$ increases the standardized estimators converge in distribution to the normal. Therefore, a 95% confidence interval for the variance of the estimator is given by
$$\Big[\frac{(N_{\text{MC}}-1)S^2}{\chi^2_{1-\alpha/2,\,N_{\text{MC}}-1}},\; \frac{(N_{\text{MC}}-1)S^2}{\chi^2_{\alpha/2,\,N_{\text{MC}}-1}}\Big],$$
where $S^2$ is the sample variance of the $N_{\text{MC}}$ controlled estimators $\bar{V}_{t,T}^{\pi,\theta,\lambda,N_{\text{steps}}}$ and $\alpha = 0.05$. These intervals are calculated for both the Monte Carlo estimate and the Monte Carlo with control variate estimate (see the sketch after this list).

  3. We use the $N_{\text{MC}} \cdot N_{\text{in}} = 10^7$ generated samples $\Xi_T^i$ and $V_{t,T}^{\pi,\theta,\lambda,N_{\text{steps}},i}$ to calculate and compare the empirical variances $\tilde{\sigma}^2_{\Xi_T}$ and $\tilde{\sigma}^2_{V_{t,T}^{\pi,\theta,\lambda,N_{\text{steps}}}}$.

  4. The number of optimizer steps and, equivalently, the number of random paths generated for training provide a cost measure for the proposed algorithms.

  5. We evaluate the variance reduction if we use the trained models to create control variates for options in Black–Scholes models with volatilities different from the one used to train our models.
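Step 2 of this procedure can be sketched as follows, using SciPy's chi-squared quantiles (the function name is ours):

```python
import numpy as np
from scipy.stats import chi2

def variance_confidence_interval(estimates, alpha=0.05):
    """Confidence interval for the variance of an estimator from N_MC
    independent replications, via the chi-squared distribution (step 2)."""
    n = len(estimates)                       # N_MC replications
    s2 = np.var(estimates, ddof=1)           # sample variance S^2
    lower = (n - 1) * s2 / chi2.ppf(1.0 - alpha / 2, df=n - 1)
    upper = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)
    return lower, upper
```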

Example 4.1

Low dimensional problem with explicit solution

We consider an exchange option on two assets. In this case, the exact price is given by the Margrabe formula. We take $d = 2$, $S_0^i = 100$, $r = 5\%$, $\sigma^i = 30\%$, $\Sigma_{ii} = 1$, $\Sigma_{ij} = 0$ for $i \neq j$. The payoff is $g(S) = g(S^{(1)}, S^{(2)}) := \max(0, S^{(1)} - S^{(2)})$. From Margrabe's formula, we know that
$$v(0,S) = \mathrm{BlackScholes}\big(\text{risky price} = S^{(1)}/S^{(2)},\ \text{strike} = 1,\ T,\ r,\ \bar{\sigma}\big),$$
where $\bar{\sigma} := \sqrt{(\sigma^{11} - \sigma^{21})^2 + (\sigma^{22} - \sigma^{12})^2}$.
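For reference, a minimal implementation of the Margrabe benchmark price (the interest rate cancels for an exchange option):

```python
from math import log, sqrt
from statistics import NormalDist

def margrabe_price(s1, s2, T, sigma_bar):
    """Margrabe price of the exchange option max(0, S1 - S2): Black-Scholes
    on the ratio S1/S2 with strike 1 and volatility sigma_bar."""
    d1 = (log(s1 / s2) + 0.5 * sigma_bar ** 2 * T) / (sigma_bar * sqrt(T))
    d2 = d1 - sigma_bar * sqrt(T)
    N = NormalDist().cdf
    return s1 * N(d1) - s2 * N(d2)

# e.g. with the parameters of Example 4.1: sigma_bar = sqrt(0.3**2 + 0.3**2)
```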

We organize the experiment as follows: we train our models with batches of 5000 random paths $(s_{t_n}^i)_{n=0}^{N_{\text{steps}}}$ sampled from the SDE (12), where $N_{\text{steps}} = 50$. The assets' initial values $s_{t_0}^i$ are sampled from the lognormal distribution $X\exp\big((\mu - \tfrac{1}{2}\sigma^2)\tau + \sigma\sqrt{\tau}\,\xi\big)$, where $\xi \sim N(0,1)$, $\mu = 0.08$, $\tau = 0.1$. The existence of an explicit solution allows us to build a control variate of the form (9) using the known exact solution to obtain $\nabla_x v$. For a fixed number of time steps $N_{\text{steps}}$, this provides an upper bound on the variance reduction an artificial neural network approximation of $\nabla_x v$ can achieve.

We follow the evaluation framework above, simulating $N_{\text{MC}} \cdot N_{\text{in}}$ paths from (12) with constant initial value $(S_0^1, S_0^2) = (1,1)$. We report the following results:

  1. Table 1 provides the empirical variances calculated over $10^6$ generated Monte Carlo samples and their corresponding control variates. The variance reduction measure indicates the quality of each control variate method. The variance reduction using the control variate given by Margrabe's formula provides a benchmark for our methods. Table 1 also provides the cost of training for each method, given by the number of optimizer iterations performed before hitting the stopping criterion defined before with $\epsilon = 5 \times 10^{-6}$. We add an additional row with the control variate built using automatic differentiation on the network parametrized using the Deep Galerkin Method (Sirignano and Spiliopoulos 2017). The DGM attempts to find the optimal parameters of the network satisfying the PDE on a pre-determined time and space domain. In contrast to our algorithms, the DGM method is not restricted to learning the solution of the PDE on the paths built from the probabilistic representation of the PDE. However, this restriction is precisely what enhances the performance of our methods in terms of variance reduction, since they specifically learn an approximation of the solution of the PDE and its gradient such that the resulting control variate yields a low-variance Monte Carlo estimator.

  2. Table 2 provides the confidence intervals for the variances of the Monte Carlo estimator and of the Monte Carlo estimator with control variate, assuming these are calculated on $10^6$ random paths. Moreover, we add the confidence interval for the variance of the Monte Carlo estimator calculated over $N_{\text{in}}$ antithetic paths, where the first $N_{\text{in}}/2$ Brownian paths are generated using samples $(Z_i)_{i=1,\ldots,N_{\text{steps}}}$ from a normal distribution and the second half of the Brownian paths are generated using the antithetic samples $(-Z_i)_{i=1,\ldots,N_{\text{steps}}}$. See Belomestny, Iosipoi, and Zhivotovskiy (2017, Section 4.2) for more details. All the algorithms proposed in this paper outperform the Monte Carlo estimator and the Monte Carlo estimator with antithetic paths; compared to the latter, our algorithms produce unbiased estimators with variances that are two orders of magnitude smaller.

  3. Figure 2 studies the iterative training for the BSDE solver. As has been observed before, this type of training does not allow us to study the overall loss function as the number of training steps increases. Therefore, we train the same model four times for different values of $\epsilon$ between $0.01$ and $5 \times 10^{-6}$, and we study the number of iterations necessary to meet the stopping criterion defined by $\epsilon$, the variance reduction once the stopping criterion is met, and the relationship between the number of iterations and the variance reduction. Note that the variance reduction stabilizes for $\epsilon < 10^{-5}$. Moreover, the number of iterations necessary to meet the stopping criterion increases exponentially as $\epsilon$ decreases, and therefore for the results reported in Tables 1 and 2 we employ $\epsilon = 5 \times 10^{-6}$ (Figure 3).

  4. Figure 4 displays the variance reduction after using the trained models on several Black–Scholes exchange option problems with values of $\sigma$ other than $0.3$, which was the one used for training. We see that the various algorithms work similarly well in this case (not taking training cost into account). We note that the variance reduction is close to the theoretical maximum, which is restricted by the time discretization. Finally, we see that the variance reduction is still significant even when the neural network is applied with model parameters different from those it was trained with (in our case, the volatility in the option pricing example). The labels of Figure 4 can be read as follows:

    1. MC + CV Corr op: Monte Carlo estimate with Deep Learning-based Control Variate built using Algorithm 5.

    2. MC + CV Var op: Monte Carlo estimate with Deep Learning-based Control Variate built using Algorithm 4.

    3. MC + CV BSDE solver: Monte Carlo estimate with Deep Learning-based Control Variate built using Algorithm 3.

    4. MC + CV Margrabe: Monte Carlo estimate with Control Variate using analytical solution for this problem given by Margrabe formula.

Figure 2. Left: Variance reduction in terms of number of optimizer iterations. Right: Variance reduction in terms of epsilon. Both are for Example 4.1 and Algorithm 3.


Figure 3. Number of optimizer iterations in terms of epsilon for Example 4.1 and Algorithm 3.


Figure 4. Variance reduction achieved by a network trained with $\sigma = 0.3$ but then applied in situations where $\sigma \in [0.2, 0.4]$. We can see that a significant variance reduction is achieved by a neural network that was trained with an 'incorrect' $\sigma$. Note that 'MC + CV Margrabe' displays the optimal variance reduction that can be achieved by using the exact solution to the problem. The variance reduction is not infinite even in this case, since the stochastic integrals are approximated by Riemann sums.

Table 1. Results on exchange option problem on two assets, Example 4.1. Empirical Variance and variance reduction factor.

Table 2. Results on exchange option problem on two assets, Example 4.1.

Example 4.2

Low-dimensional problem with explicit solution – Approximation of Price using PDE solver compared to Control Variate

We consider exchange options on two assets as in Example 4.1. We consider Algorithm 3, which can be applied in two different ways:

  1. It directly approximates the solution of the PDE (5) and its gradient at every point.

  2. We can use the approximation of $\nabla_x v$ to build the control variate using the probabilistic representation (6) of the PDE.

We compare both applications by calculating the expected $L^2$-error of each of them with respect to the analytical solution given by the Margrabe formula: from Margrabe's formula, we know that $v(0,S) = \mathrm{BlackScholes}(\text{risky price} = S^{(1)}/S^{(2)}, \text{strike} = 1, T, r, \bar{\sigma})$. Let $\mathcal{R}[v]_{\eta_0}(x) \approx v(0,x)$ be the deep learning approximation of the price at any point at initial time, calculated using Algorithm 3, and let $\mathcal{R}[\nabla_x v]_{\theta_m}(x) \approx \nabla_x v(t_m, x)$ be the deep learning approximation of its gradient at every time step of the time discretization. The aim of this experiment is to show that even if Algorithm 3 numerically converges to a biased approximation of $v(0,x)$ (see Figure 5, left), it is still possible to use $\mathcal{R}[\nabla_x v]_{\theta_m}(x)$ to build an unbiased Monte Carlo approximation of $v(0,x)$ with low variance.

Figure 5. Left: Loss of Algorithm 3 and squared error of $\mathcal{R}[v](t, x_0)$ in terms of training iterations. Right: Expected MSE of the two different approaches with respect to the analytical solution, in terms of the number of Monte Carlo samples.

We organize the experiment as follows.

  1. We calculate the expected value of the $L^2$-error of $\mathcal{R}[v]_{\eta_0}(x)$, where each component of $x \in \mathbb{R}^2$ is sampled from a lognormal distribution:
$$\mathbb{E}\big[|v(0,x) - \mathcal{R}[v]_{\eta_0}(x)|^2\big] \approx \frac{1}{N}\sum_{i=1}^N |v(0, x_i) - \mathcal{R}[v]_{\eta_0}(x_i)|^2.$$

  2. We calculate the expected value of the $L^2$-error of the Monte Carlo estimator with control variate, where each component of $x \in \mathbb{R}^2$ is sampled from a lognormal distribution:
$$\mathbb{E}\big[|v(0,x) - V_{0,T}^{\pi,\theta,\lambda,N_{\text{MC}},x}|^2\big] \approx \frac{1}{N}\sum_{i=1}^N |v(0, x_i) - V_{0,T}^{\pi,\theta,\lambda,N_{\text{MC}},x_i}|^2,$$
where $V_{0,T}^{\pi,\theta,\lambda,N_{\text{MC}},x}$ is given by (9) and is calculated for different numbers of Monte Carlo samples.

  3. We calculate the expected value of the $L^2$-error of the Monte Carlo estimator without control variate, where each component of $x \in \mathbb{R}^2$ is sampled from a lognormal distribution:
$$\mathbb{E}\big[|v(0,x) - \Xi_{0,T}^{N_{\text{MC}},x}|^2\big] \approx \frac{1}{N}\sum_{i=1}^N |v(0, x_i) - \Xi_{0,T}^{N_{\text{MC}},x_i}|^2, \quad \text{where } \Xi_{0,T}^{N_{\text{MC}},x} := \frac{1}{N_{\text{MC}}}\sum_{j=1}^{N_{\text{MC}}} D(t,T)\,g(X_T^j).$$

Figure 5 (right) provides one realization of the described experiment for different numbers of Monte Carlo samples between 10 and 200. It shows that, in this realization, 60 Monte Carlo samples are enough to build a Monte Carlo estimator with control variate having a lower error than the biased solution provided by Algorithm 3.

Example 4.3

Low-dimensional problem with explicit solution. Training on random values for volatility

We consider an exchange option on two assets; the exact price is given by the Margrabe formula. The difference with respect to the previous example is that now we aim to generalize our model so that it can build control variates for different Black–Scholes models. For this we take $d = 2$, $S_0^i = 100$, $r = 0.05$, $\sigma^i \sim \mathrm{Unif}(0.2, 0.4)$, $\Sigma_{ii} = 1$, $\Sigma_{ij} = 0$ for $i \neq j$.

The payoff is $g(S) = g(S^{(1)}, S^{(2)}) := \max(0, S^{(1)} - S^{(2)})$. We organize the experiment as follows: for comparison purposes with the BSDE solver from the previous example, we train our model for exactly the same number of iterations, i.e., 1380 batches of 5000 random paths $(s_{t_n}^i)_{n=0}^{N_{\text{steps}}}$ sampled from the SDE (12), where $N_{\text{steps}} = 50$. The assets' initial values $s_{t_0}^i$ are sampled from the lognormal distribution $X\exp\big((\mu - \tfrac{1}{2}\sigma^2)\tau + \sigma\sqrt{\tau}\,\xi\big)$, where $\xi \sim N(0,1)$, $\mu = 0.08$, $\tau = 0.1$. Since $\sigma$ can now take different values, it is included as an input to the networks at each time step.

The existence of an explicit solution allows us to build a control variate of the form (9) using the known exact solution to obtain $\nabla_x v$. For a fixed number of time steps $N_{\text{steps}}$, this provides an upper bound on the variance reduction an artificial neural network approximation of $\nabla_x v$ can achieve.

Figure 6 adds the performance of this model to Figure 4, where the variance reduction of the control variate is displayed for different values of the volatility between 0.2 and 0.4.

Figure 6. Extension of Figure 4 with variance reduction achieved by training the model on different Black–Scholes models.

Example 4.4

High-dimensional problem, exchange against average

We extend the previous example to 100 dimensions. This example is similar to EX10E from Broadie, Du, and Moallemi (2015). We take $S_0^i = 100$, $r = 5\%$, $\sigma^i = 30\%$, $\Sigma_{ii} = 1$, $\Sigma_{ij} = 0$ for $i \neq j$.

We take the payoff to be $g(S) := \max\big(0,\, S^1 - \frac{1}{d-1}\sum_{i=2}^d S^i\big)$. The experiment is organized as follows: we train our models with batches of $5 \times 10^3$ random paths $(s_{t_n}^i)_{n=0}^{N_{\text{steps}}}$ sampled from the SDE (12), where $N_{\text{steps}} = 50$. The assets' initial values $s_{t_0}^i$ are sampled from the lognormal distribution $X\exp\big((\mu - \tfrac{1}{2}\sigma^2)\tau + \sigma\sqrt{\tau}\,\xi\big)$, where $\xi \sim N(0,1)$, $\mu = 0.08$, $\tau = 0.1$.

We follow the evaluation framework above, simulating $N_{\text{MC}} \cdot N_{\text{in}}$ paths from (12) with constant initial value $S_0^i = 1$ for $i = 1, \ldots, 100$. We have the following results:

  1. Table 3 provides the empirical variances calculated over $10^6$ generated Monte Carlo samples and their corresponding control variates. The variance reduction measure indicates the quality of each control variate method. Table 3 also provides the cost of training for each method, given by the number of optimizer iterations performed before hitting the stopping criterion with $\epsilon = 5 \times 10^{-6}$. Algorithm 3 outperforms the other algorithms in terms of the variance reduction factor. This is not surprising, as Algorithm 3 explicitly learns the discretization of the martingale representation (7) from which the control variate arises.

  2. Table 4 provides the confidence intervals for the variance of the Monte Carlo estimator and of the Monte Carlo estimator with control variate, assuming these are calculated on $10^6$ random paths.

  3. Figures 7 and 8 study the iterative training for the BSDE solver. We train the same model four times for different values of $\epsilon$ between $0.01$ and $5 \times 10^{-6}$, and we study the number of iterations necessary to meet the stopping criterion defined by $\epsilon$, the variance reduction once the stopping criterion is met, and the relationship between the number of iterations and the variance reduction. Note that in this case the variance reduction does not stabilize for $\epsilon < 10^{-5}$. However, the number of training iterations increases exponentially as $\epsilon$ decreases, and therefore we also choose $\epsilon = 5 \times 10^{-6}$ to avoid building a control variate that requires a high number of random paths to be trained (Figure 8).

Figure 7. Left: Variance reduction in terms of number of optimizer iterations. Right: Variance reduction in terms of epsilon. Both for Example 4.4 and Algorithm 3.


Figure 8. Number of optimizer iterations in terms of ϵ for Example 4.4 and Algorithm 3.


Figure 9. Variance reduction with the network trained with $\sigma = 0.3$ but applied for $\sigma \in [0.2, 0.4]$ for the model of Example 4.4. We see that the variance reduction factor is considerable even in the case when the network is used with the 'wrong' $\sigma$. It appears that Algorithm 4 does not perform well in this case.

Table 3. Results on exchange option problem on 100 assets, Example 4.4. Empirical Variance and variance reduction factor and costs in terms of paths used for training and optimizer steps.

Table 4. Results on exchange option problem on 100 assets, Example 4.4.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work was supported by the Alan Turing Institute under EPSRC grant no. EP/N510129/1.

References

  • Bayer C., and B. Stemper. 2018. “Deep Calibration of Rough Stochastic Volatility Models.” arXiv:1810.03399.
  • Beck C., S. Becker, P. Grohs, N. Jaafari, and A. Jentzen. 2018. “Solving Stochastic Differential Equations and Kolmogorov Equations by Means of Deep Learning.” arXiv:1806.00421.
  • Beck C., L. Gonon, and A. Jentzen. 2020. “Overcoming the Curse of Dimensionality in the Numerical Approximation of High-Dimensional Semilinear Elliptic Partial Differential Equations”.
  • Belomestny D., L. Iosipoi, and N. Zhivotovskiy. 2017. “Variance Reduction via Empirical Variance Minimization: Convergence and Complexity.” arXiv:1712.04667.
  • Belomestny D., S. Hafner, T. Nagapetyan, and M. Urusov. 2018. “Variance Reduction for Discretised Diffusions Via Regression.” Journal of Mathematical Analysis and Applications 458 (1): 393–418.
  • Berner J., P. Grohs, and A. Jentzen. 2020. “Analysis of the Generalization Error: Empirical Risk Minimization Over Deep Artificial Neural Networks Overcomes the Curse of Dimensionality in the Numerical Approximation of Black–Scholes Partial Differential Equations.” SIAM Journal on Mathematics of Data Science 2 (3): 631–657.
  • Broadie M., Y. Du, and C. C. Moallemi. 2015. “Risk Estimation Via Regression.” Operations Research 63 (5): 1077–1097.
  • Chan-Wai-Nam Q., J. Mikael, and X. Warin. 2019. “Machine Learning for Semi-Linear PDEs.” Journal of Scientific Computing 79 (3): 1667–1712.
  • Chizat L., and F. Bach. 2018. “On the Global Convergence of Gradient Descent for Over-Parameterized Models Using Optimal Transport.” In Advances in Neural Information Processing Systems, 3036–3046.
  • Cohen S. N., and R. J. Elliott. 2015. Stochastic Calculus and Applications. Vol. 2. New York: Birkhäuser.
  • Cont R., and Y. Lu. March, 2016. “Weak Approximation of Martingale Representations.” Stochastic Processes and Their Applications 126 (3): 857–882.
  • Cvitanic J., and J. Zhang. 2005. “The Steepest Descent Method for Forward-Backward SDEs.” Electronic Journal of Probability 10: 1468–1495.
  • Kingma D. P., and J. Ba. 2017. “Adam: A Method for Stochastic Optimization.” arXiv:1412.6980.
  • Du S. S., X. Zhai, B. Poczos, and A. Singh. 2018. “Gradient Descent Provably Optimizes Over-Parameterized Neural Networks.” arXiv:1810.02054.
  • Elbrächter D., P. Grohs, A. Jentzen, and C. Schwab. 2022. “DNN Expression Rate Analysis of High-Dimensional PDEs: Application to Option Pricing.” Constructive Approximation 55 (1): 3–7.
  • Glasserman P. 2013. Monte Carlo Methods in Financial Engineering. Springer.
  • Gonon L., P. Grohs, A. Jentzen, D. Kofler, and D. Šiška. 2021. “Uniform Error Estimates for Artificial Neural Network Approximations for the Heat Equation.” IMA Journal of Numerical Analysis. https://doi.org/10.1093/imanum/drab027.
  • Goodfellow I., J. Shlens, and C. Szegedy. 2014. “Explaining and Harnessing Adversarial Examples.” arXiv:1412.6572.
  • Grohs P., F. Hornung, A. Jentzen, and P. von Wurstemberger. 2018. “A Proof That Artificial Neural Networks Overcome the Curse of Dimensionality in the Numerical Approximation of Black-Scholes Partial Differential Equations.” arXiv:1809.02362.
  • Grohs P., A. Jentzen, and D. Salimova. 2019. “Deep Neural Network Approximations for Monte Carlo Algorithms”.
  • Han J., and J. Long. 2020. “Convergence of the Deep BSDE Method for Coupled FBSDEs.” Probability, Uncertainty and Quantitative Risk 5 (1): 1–33.
  • Han J., A. Jentzen, and E. Weinan. 2017. “Solving High-Dimensional Partial Differential Equations Using Deep Learning.” arXiv:1707.02568.
  • Henry-Labordere P. 2017. “Deep Primal-Dual Algorithm for BSDEs: Applications of Machine Learning to CVA and IM.” Available at SSRN 3071506.
  • Hernandez A. 2016. “Model Calibration With Neural Networks.” Available at SSRN 2812140.
  • Horvath B., A. Muguruza, and M. Tomas. 2019. “Deep Learning Volatility.” Available at SSRN 3322085.
  • Hu K., Z. Ren, D. Šiška, and L. Szpruch. 2021. “Mean-Field Langevin Dynamics and Energy Landscape of Neural Networks.” Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 57 (4): 2043–2065.
  • Huré C., H. Pham, and X. Warin. 2019. “Some Machine Learning Schemes for High-Dimensional Nonlinear PDEs.” arXiv:1902.01599.
  • Hutzenthaler M., A. Jentzen, T. Kruse, and T. A. Nguyen. April, 2020. “A Proof that Rectified Deep Neural Networks Overcome the Curse of Dimensionality in the Numerical Approximation of Semilinear Heat Equations.” SN Partial Differential Equations and Applications 1 (2): 1–34.
  • Itkin A. 2019. “Deep Learning Calibration of Option Pricing Models: Some Pitfalls and Solutions.” arXiv:1906.03507.
  • Jacquier A. J., and M. Oumgari. 2019. “Deep PPDEs for Rough Local Stochastic Volatility.” Available at SSRN 3400035.
  • Jentzen A., D. Salimova, and T. Welti. 2018. “A Proof That Deep Artificial Neural Networks Overcome the Curse of Dimensionality in the Numerical Approximation of Kolmogorov Partial Differential Equations With Constant Diffusion and Nonlinear Drift Coefficients.” arXiv:1809.07321.
  • Krylov N. 1999. “On Kolmogorov's Equations for Finite Dimensional Diffusions.” In Stochastic PDE's and Kolmogorov Equations in Infinite Dimensions, 1–63. Springer.
  • Krylov N. V. 2002. Introduction to the Theory of Random Processes. Vol. 43. American Mathematical Society.
  • Kutyniok G., P. Petersen, M. Raslan, and R. Schneider. 2020. “A Theoretical Analysis of Deep Neural Networks and Parametric PDEs”.
  • LeCun Y., Y. Bengio, and G. Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436.
  • Liu S., A. Borovykh, L. A. Grzelak, and C. W. Oosterlee. 2019. “A Neural Network-Based Framework for Financial Model Calibration.” arXiv:1904.10523.
  • McGhee W. A. 2018. “An Artificial Neural Network Representation of the SABR Stochastic Volatility Model.” SSRN 3288882.
  • Mei S., A. Montanari, and P.-M. Nguyen. 2018. “A Mean Field View of the Landscape of Two-layer Neural Networks.” Proceedings of the National Academy of Sciences 115 (33): E7665–E7671.
  • Milstein G., and M. Tretyakov. 2009. “Solving Parabolic Stochastic Partial Differential Equations Via Averaging Over Characteristics.” Mathematics of Computation 78 (268): 2075–2106.
  • Newton N. J. 1994. “Variance Reduction for Simulated Diffusions.” SIAM Journal on Applied Mathematics 54 (6): 1780–1805.
  • Øksendal B. 2003. “Stochastic Differential Equations.” In Stochastic Differential Equations, 65–84. Springer.
  • Reisinger C., and Y. Zhang. 2020. “Rectified Deep Neural Networks Overcome the Curse of Dimensionality for Nonsmooth Value Functions in Zero-Sum Games of Nonlinear Stiff Systems”.
  • Rotskoff G. M., and E. Vanden-Eijnden. 2018. “Neural Networks as Interacting Particle Systems: Asymptotic Convexity of the Loss Landscape and Universal Scaling of the Approximation Error.” arXiv:1805.00915.
  • Sabate-Vidales M., D. Šiška, and L. Szpruch. 2020. “Solving Path Dependent PDEs with LSTM Networks and Path Signatures”.
  • Ioffe S., and C. Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arXiv:1502.03167.
  • Sirignano J., and K. Spiliopoulos. 2017. “DGM: A Deep Learning Algorithm for Solving Partial Differential Equations.” arXiv:1708.07469.
  • Sirignano J., and K. Spiliopoulos. 2020. “Mean Field Analysis of Neural Networks: A Central Limit Theorem.” Stochastic Processes and Their Applications 130 (3): 1820–1852.
  • Stone H. 2018. “Calibrating Rough Volatility Models: A Convolutional Neural Network Approach.” arXiv:1812.05315.
  • Wang Z., and S. Tang. 2021. “Gradient Convergence of Deep Learning-Based Numerical Methods for BSDEs.” Chinese Annals of Mathematics, Series B 42 (2): 199–216.
  • Weinan E., J. Han, and A. Jentzen. 2017. “Deep Learning-based Numerical Methods for High-dimensional Parabolic Partial Differential Equations and Backward Stochastic Differential Equations.” Communications in Mathematics and Statistics 5 (4): 349–380.

Appendices

Appendix 1.

Martingale Control Variate

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and consider an $\mathbb{R}^d$-valued Wiener process $W = (W^j)_{j=1}^d = ((W_t^j)_{t \geq 0})_{j=1}^d$. We will use $(\mathcal{F}_t^W)_{t \geq 0}$ to denote the filtration generated by $W$. Consider a $D \subseteq \mathbb{R}^d$-valued, continuous stochastic process, defined for the parameters $\beta \in B \subseteq \mathbb{R}^p$, $X^\beta = (X^{\beta,i})_{i=1}^d = ((X_t^{\beta,i})_{t \geq 0})_{i=1}^d$, adapted to $(\mathcal{F}_t^W)_{t \geq 0}$.

Let $g : C([0,T], \mathbb{R}^d) \to \mathbb{R}$ be a measurable function. We shall consider path-dependent contingent claims of the form $g((X_s^\beta)_{s \in [0,T]})$. Finally, we assume that there is a (stochastic) discount factor given by $D(t_1, t_2; \beta) = e^{-\int_{t_1}^{t_2} c(s, X_s^\beta; \beta)\,ds}$ for an appropriate function $c = c(t,x;\beta)$. We will omit $\beta$ from the discount factor notation. Let $\Xi_T^\beta := D(t,T)\,g((X_s^\beta)_{s \in [0,T]})$. We now interpret $\mathbb{P}$ as some risk-neutral measure, and so the $\mathbb{P}$-price of our contingent claim is
$$V_t^\beta = \mathbb{E}\big[\Xi_T^\beta \mid \mathcal{F}_t^\beta\big] = \mathbb{E}\big[D(t,T)\,g((X_s^\beta)_{s \in [0,T]}) \mid \mathcal{F}_t^\beta\big].$$
By assumption, $\Xi_T^\beta$ is $\mathcal{F}_T^W$-measurable and $\mathbb{E}[|\Xi_T^\beta|^2] < \infty$. Hence, by the Martingale Representation Theorem (see, e.g., Cohen and Elliott 2015, Th. 14.5.1), there exists a unique process $(Z_t^\beta)_t$ adapted to $(\mathcal{F}_t^W)_t$ with $\mathbb{E}\big[\int_0^T |Z_s^\beta|^2\,ds\big] < \infty$ such that
$$\Xi_T^\beta = \mathbb{E}\big[\Xi_T^\beta \mid \mathcal{F}_0^W\big] + \int_0^T Z_s^\beta\,dW_s. \tag{A1}$$
The proof of the existence of the process $(Z_t^\beta)_t$ is non-constructive. In the setup of this paper, we used the Markovian property of $\Xi_t^\beta$ to approximate $Z_t^\beta$ via the associated linear PDE. In the more general non-Markovian setup, Cont and Lu (2016) provide a numerical method to construct the martingale representation.

Observe that in our setup, $\mathcal{F}_0 = \mathcal{F}_0^W$ and $\mathcal{F}_t^\beta \subseteq \mathcal{F}_t^W$ for $t \geq 0$. Hence the tower property of the conditional expectation implies that
$$\mathbb{E}\big[\Xi_T^\beta \mid \mathcal{F}_t^\beta\big] = \mathbb{E}\big[\Xi_T^\beta \mid \mathcal{F}_0^W\big] + \int_0^t Z_s^\beta\,dW_s. \tag{A2}$$
Consequently, (A1) and (A2) imply $\mathbb{E}[\Xi_T^\beta \mid \mathcal{F}_t^\beta] = \Xi_T^\beta - \int_t^T Z_s^\beta\,dW_s$. We then observe that
$$V_t^\beta = \mathbb{E}\big[\Xi_T^\beta \mid \mathcal{F}_t^\beta\big] = \mathbb{E}\Big[\Xi_T^\beta - \int_t^T Z_s^\beta\,dW_s \,\Big|\, \mathcal{F}_t^\beta\Big].$$
If we can generate i.i.d. $(W^i)_{i=1}^N$ and $(Z^{\beta,i})_{i=1}^N$ with the same distributions as $W$ and $Z^\beta$, respectively, then we can consider the following Monte Carlo estimator of $V_t^\beta$:
$$V_t^{\beta,N} := \frac{1}{N}\sum_{i=1}^N \Big(\Xi_T^{\beta,i} - \int_t^T Z_s^{\beta,i}\,dW_s^i\Big).$$
In the companion paper (Sabate-Vidales, Šiška, and Szpruch 2020), we provide deep learning algorithms to price path-dependent options under the risk-neutral measure by solving the corresponding path-dependent PDE, using a combination of recurrent neural networks and path signatures to parametrize the process $Z^\beta$.

Appendix 2.

Martingale Control Variate Deep Solvers

A.1. Empirical Correlation Maximization

This method is based on the idea that, since we are looking for a good control variate, we should directly train the network to maximize the variance reduction between the vanilla Monte Carlo estimator and the control variate Monte Carlo estimator, by also trying to optimize $\lambda$.

Recall that we denote $\Xi_T = D(t,T)\,g((X_s)_{s \in [t,T]})$. We denote by $M_{t,T}^\theta$ the stochastic integral that arises in the martingale representation of $\Xi_T$. The optimal coefficient $\lambda^{*,\theta}$ that minimizes the variance $\operatorname{Var}[\Xi_T - \lambda M_{t,T}^\theta]$ is
$$\lambda^{*,\theta} = \frac{\operatorname{Cov}[\Xi_T, M_{t,T}^\theta]}{\operatorname{Var}[M_{t,T}^\theta]}.$$
Let $\rho_{\Xi_T, M_{t,T}^\theta}$ denote the Pearson correlation coefficient between $\Xi_T$ and $M_{t,T}^\theta$, i.e.,
$$\rho_{\Xi_T, M_{t,T}^\theta} = \frac{\operatorname{Cov}(\Xi_T, M_{t,T}^\theta)}{\sqrt{\operatorname{Var}[\Xi_T]\operatorname{Var}[M_{t,T}^\theta]}}.$$
With the optimal $\lambda$ we then have that the variance reduction obtained from the control variate is
$$\frac{\operatorname{Var}[V_{t,T}^{\pi,\theta,\lambda,N}]}{\operatorname{Var}[\Xi_T]} = 1 - \big(\rho_{\Xi_T, M_{t,T}^\theta}\big)^2.$$
See Glasserman (2013, Ch. 4.1) for more details. Therefore we set the learning task as
$$\theta^{*,\text{cor}} := \operatorname*{argmin}_\theta\, \big[1 - \big(\rho_{\Xi_T, M_{t,T}^\theta}\big)^2\big].$$
The implementable version requires the definition of $V_{t,T}^{\beta,\pi,\theta,\lambda,N}$ in (9), where we set
$$\Xi_T^{\beta,\pi,i} := (D^\pi(t,T))^i\,g(X_T^{\beta,\pi,i}), \qquad M_{t,T}^{\beta,\pi,i,\theta} := \sum_{k=0}^{N_{\text{steps}}-1} (D^\pi(t,t_k))^i\,\mathcal{R}[\nabla_x v]_\theta(t_k, X_{t_k}^{\beta,\pi,i})\,\sigma(t_k, X_{t_k}^{\beta,\pi,i})\,(W_{t_{k+1}}^i - W_{t_k}^i).$$
The full method is stated as Algorithm 5; a sketch of the correlation loss follows.
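A minimal sketch of the loss $1 - \rho^2$, under our own naming assumptions: `xi` holds the payoff samples $\Xi_T^{\beta,\pi,i}$ and `mart` the samples $M_{t,T}^{\beta,\pi,i,\theta}$, which carry the dependence on $\theta$.

```python
import torch

def correlation_loss(xi, mart):
    """1 - rho^2 for the Pearson correlation between the discounted payoff
    and the learned stochastic integral; minimized over theta in Algorithm 5."""
    xi_c = xi - xi.mean()
    m_c = mart - mart.mean()
    rho = (xi_c * m_c).mean() / torch.sqrt(
        (xi_c ** 2).mean() * (m_c ** 2).mean() + 1e-12)
    return 1.0 - rho ** 2

# after training, the optimal coefficient is lambda = Cov(xi, mart) / Var(mart)
```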

Appendix 3.

Artificial Neural Networks

We fix a locally Lipschitz function $a:\mathbb{R}\to\mathbb{R}$ and for $d\in\mathbb{N}$ define $A_d:\mathbb{R}^d\to\mathbb{R}^d$ as the function given, for $x=(x_1,\ldots,x_d)$, by $A_d(x)=(a(x_1),\ldots,a(x_d))$. We fix $L\in\mathbb{N}$ (the number of layers), $l_k\in\mathbb{N}$, $k=0,1,\ldots,L-1$ (the size of the input to layer $k$) and $l_L\in\mathbb{N}$ (the size of the network output). A fully connected artificial neural network is then given by $\Phi=((W_1,B_1),\ldots,(W_L,B_L))$, where, for $k=1,\ldots,L$, we have real $l_k\times l_{k-1}$ matrices $W_k$ (so that $W_k x_{k-1}$ is well defined) and real $l_k$-dimensional vectors $B_k$.

The artificial neural network defines a function $\mathcal{R}\Phi:\mathbb{R}^{l_0}\to\mathbb{R}^{l_L}$ given recursively, for $x_0\in\mathbb{R}^{l_0}$, by
$\mathcal{R}\Phi(x_0)=W_L x_{L-1}+B_L,\qquad x_k=A_{l_k}(W_k x_{k-1}+B_k),\quad k=1,\ldots,L-1.$
We can also define the function $P$ which counts the number of parameters, $P(\Phi)=\sum_{k=1}^L(l_{k-1}l_k+l_k)$. We will call this class of fully connected artificial neural networks $\mathcal{DN}$. Note that, since the activation functions and architecture are fixed, the learning task entails finding the optimal $\Phi\in\mathbb{R}^{P(\Phi)}$.
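A minimal NumPy sketch of the realization map $\mathcal{R}\Phi$ and the parameter count $P(\Phi)$ follows; it represents $\Phi$ as a list of `(W, B)` pairs with `W` of shape $(l_k, l_{k-1})$, matching the definition above.

```python
import numpy as np

def realize(phi, x0, a=np.tanh):
    """Evaluate R_Phi(x0): activated affine maps on all layers except the last."""
    x = x0
    for W, B in phi[:-1]:
        x = a(W @ x + B)       # x_k = A_{l_k}(W_k x_{k-1} + B_k)
    W_L, B_L = phi[-1]
    return W_L @ x + B_L       # output layer has no activation

def num_params(phi):
    """P(Phi) = sum_k (l_{k-1} * l_k + l_k)."""
    return sum(W.size + B.size for W, B in phi)
```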

Appendix 4.

Additional Numerical Results

Example A.1

Low dimensional basket option

We consider the problem of pricing basket options, using the example from Belomestny, Iosipoi, and Zhivotovskiy (2017, Sec 4.2.3). The payoff function is $g(S):=\max(0,\sum_{i=1}^d S^i-K)$. We first consider the basket options problem on two assets, with $d=2$, $S^i_0=0.7$, $r=50\%$, $\sigma^i=100\%$, $\Sigma_{ii}=1$, $\Sigma_{ij}=0$ for $i\neq j$, and constant strike $K=\sum_{i=1}^d S^i_0$. In line with the example from Belomestny, Iosipoi, and Zhivotovskiy (2017, Sec 4.2.3), for comparison purposes we organize the experiment as follows. The control variates are trained on 20,000 batches of 5,000 samples each of $(s^\pi_{t_n})_{n=0}^{N_{\mathrm{steps}}}$, obtained by simulating the SDE (12) with $N_{\mathrm{steps}}=50$; the assets' initial values $s_{t_0}$ are always constant, $S^i_{t_0}=0.7$. We follow the evaluation framework to evaluate the model, simulating $N_{\mathrm{MC}}\,N_{\mathrm{in}}$ paths of (12) with constant $S^i_0=0.7$ for $i=1,\ldots,d$. We have the following results:

  1. Table E1 provides the empirical variances calculated over $10^6$ generated Monte Carlo samples and their corresponding control variates; the variance reduction factor (its empirical estimation is sketched after this list) indicates the quality of each control variate method. Table E1 also provides the cost of training for each method, given by the number of optimizer iterations performed before hitting the stopping criterion defined before, with $\epsilon=5\times 10^{-6}$.

  2. Table E2 provides the confidence interval for the variance of the Monte Carlo estimator, and for the Monte Carlo estimator with control variate, assuming these are calculated on $10^6$ random paths.

  3. Figures A1 and A2 study the iterative training for the BSDE solver. We train the same model four times for different values of $\epsilon$ between 0.01 and $5\times 10^{-6}$ and study the number of iterations necessary to meet the stopping criterion defined by $\epsilon$, the variance reduction once the stopping criterion is met, and the relationship between the number of iterations and the variance reduction. Note that the variance reduction stabilizes for $\epsilon<10^{-5}$. Furthermore, the number of training iterations increases exponentially as $\epsilon$ decreases. We choose $\epsilon=5\times 10^{-6}$.
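The variance reduction factors reported in the tables below can be estimated as in the following sketch, assuming arrays `xi` and `cv` holding the discounted payoffs and the corresponding simulated control variates over the evaluation paths (names illustrative):

```python
import numpy as np

def variance_reduction(xi, cv):
    """Empirical variances of the plain and control-variate estimators and their ratio."""
    var_mc = np.var(xi, ddof=1)         # Var[Xi_T]
    var_cv = np.var(xi - cv, ddof=1)    # Var[Xi_T - M]
    return var_mc, var_cv, var_mc / var_cv
```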

Table E1. Results for the basket options problem on two assets, Example A.1. Models trained with $S_0$ fixed (non-random). Empirical variance and variance reduction factor are presented.

Table E2. Results for the basket options problem on two assets, Example A.1. Models trained with $S_0$ fixed (non-random).

Figure A1. Left: Variance reduction in terms of number of optimizer iterations. Right: Variance reduction in terms of epsilon. Both refer to Algorithm 3 used in Example A.1.


Figure A2. Number of optimizer iterations in terms of ϵ for Algorithm 3 used in Example A.1.


We note that in the example from Belomestny, Iosipoi, and Zhivotovskiy (2017, Sec 4.2.3), the control variate is trained with $S_0=0.7$ fixed. In this setting, Algorithm 2 cannot be used to approximate the control variate in (9): since the network at $t=0$, $\mathcal{R}[v]_{\eta_0}$, is trained only at $S_0=0.7$, automatic differentiation to approximate $\partial_x\mathcal{R}[v]_{\eta_0}(0.7)$ yields a poor approximation of $\partial_x v(0.7)$; indeed, during training $\mathcal{R}[v]_{\eta_0}$ cannot capture how $v$ changes around $S_0$ at $t=0$. For this reason, Algorithm 2 is not included in the following results.

Example A.2

Basket option with random σ

In this example, as in Example 4.2, we aim to show that our approach, which builds a control variate by approximating the process $(Z_{t_k})_{k=0,\ldots,N_{\mathrm{steps}}}$, is more robust than directly approximating the price by some function in a high-dimensional setting.

We use the methodology proposed in Horvath and Muguruza (2019), where the authors present a deep learning-based calibration method with a two-step approach: first, they learn a model that approximates the pricing map using an artificial neural network whose inputs are the parameters of the volatility model; second, they calibrate the learned model to available data by means of different optimization methods.

For a fair comparison between our deep learning-based control variate approach and the method proposed in Horvath and Muguruza (2019), we make the following remarks:

  1. We only use the first step detailed in Horvath and Muguruza (2019), where the inputs to the model approximating the pricing map are the volatility model's parameters $\sigma\in\mathbb{R}^d$ and $r$, while the initial price is held constant for training purposes. We run the experiment for $d=5$.

  2. In Horvath and Muguruza (2019), the authors build a training set and then perform gradient descent-based optimization on it for a number of epochs. This is somewhat limiting in the current setting, where one can generate as much data as desired from the given distributions. In line with our other experiments, instead of building a fixed training set, in each optimization step we sample a fresh batch from the given distributions.

  3. In Horvath and Muguruza (2019), the price mapping function is learned on a grid of combinations of maturities and strikes. In this experiment, we reduce the grid to a single point, taking $T=0.5$ and $K=\sum_i S^i_0$, where $S^i_0=0.7$ for each $i$.

  4. We use Algorithm 3 to build the control variate, with the difference that now $\sigma\in\mathbb{R}^d$ and $r\in\mathbb{R}$ are passed as inputs to each network $\mathcal{R}[v]_{\eta_k}$, $\mathcal{R}[\partial_x v]_{\theta_k}$ at each time step.

The experiment is organized as follows:

  1. We train the network proposed in Horvath and Muguruza (2019) to approximate the price under the Black–Scholes model with the basket option payoff. In each optimization iteration, a batch of size 1000 is sampled, with the volatility model's parameters drawn as $\sigma\sim\mathcal{U}(0.9,1.1)$ and $r\sim\mathcal{U}(0.4,0.6)$ (see the sketch after this list). We keep a test set of size 150, $\mathcal{S}=\{[(\sigma_i,r_i);\,p(\sigma_i,r_i)],\ i=1,\ldots,150\}$, where $p(\sigma_i,r_i)$ denotes the price, generated using 50,000 Monte Carlo samples.

  2. We use Algorithm 3 to build the control variate, where $\sigma$ and $r$ are sampled as above and, in contrast with plain Algorithm 3, are included as inputs to the networks. We denote the trained models by $\mathcal{R}[\partial_x v]_{\theta_k}$, $k=1,\ldots,N_{\mathrm{steps}}$.
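A minimal sketch of the batch construction referred to in step 1 (names and helper structure are illustrative, not the paper's code):

```python
import numpy as np

def sample_parameter_batch(batch_size=1000, d=5, seed=None):
    """Draw a fresh batch of volatility-model parameters for one optimizer step."""
    rng = np.random.default_rng(seed)
    sigma = rng.uniform(0.9, 1.1, size=(batch_size, d))  # sigma ~ U(0.9, 1.1)
    r = rng.uniform(0.4, 0.6, size=(batch_size, 1))      # r ~ U(0.4, 0.6)
    return sigma, r
```

Because a new batch is drawn at every optimization step, no epoch structure or training-set reuse is needed.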

We present the following results:

  1. The histogram of the squared error of the approximation of the PDE solution $\mathcal{R}[v]_\eta$ over the instances in $\mathcal{S}$ shows that the error spans from almost $10^{-8}$ to $10^{-3}$, i.e., almost five orders of magnitude.

  2. We build the control variate for the instance in the test set on which $\mathcal{R}[v]_\eta$ generalizes worst. For that particular $(\sigma,r)$, Table E3 provides the variance reduction factor.

Table E3. Results for the basket options problem on 5 assets. Model trained with non-random $S_0$ and random $\sigma, r$.

Example A.3

High dimensional basket option

We also consider the basket options problem on $d=100$ assets, otherwise identical to the setting of Example A.1. We compare our results against the same experiment in Belomestny, Iosipoi, and Zhivotovskiy (2017, Sec 4.2.3, Tables 6 and 7).

Table E4 shows a significant improvement in the variance reduction factor (between 10× and 100× better) for all our algorithms compared with the methods proposed in Belomestny, Iosipoi, and Zhivotovskiy (2017) and applied in the same example.

Table E4. Results for the basket options problem on 100 assets, Example A.3. Models trained with non-random $S_0$ so that the results can be directly compared to Belomestny, Iosipoi, and Zhivotovskiy (2017).

Table E5. Results for the basket options problem on 100 assets, Example A.3. Models trained with non-random $S_0$.

A.2. Empirical Network Diagnostics

In this section, we consider the exchange options problem on two assets from Example 4.1, where the time horizon is 1 day. We consider different network architectures for the BSDE method described by Algorithm 3 to understand their impact on the final result and their ability to approximate the solution of the PDE and its gradient. We choose this problem because an explicit solution (Margrabe's formula) exists and can be used as a benchmark. The experiment is organized as follows:

  1. Let $L-2$ be the number of hidden layers of $\mathcal{R}[\partial_x v]_{\theta_{t_0}}\in\mathcal{DN}$ and $\mathcal{R}[v]_{\theta_{t_0}}\in\mathcal{DN}$. Let $l_k$ be the number of neurons in hidden layer $k$.

  2. We train each of the possible combinations of $L-2\in\{1,2,3\}$ and $l_k\in\{2,4,6,\ldots,20\}$ four times, using $\epsilon=5\times 10^{-6}$ for the stopping criterion. The assets' initial values $s^i_{t_0}$ are sampled from the lognormal distribution $X\sim\exp((\mu-0.5\sigma^2)\tau+\sigma\sqrt{\tau}\,\xi)$, where $\xi\sim N(0,1)$, $\mu=0.08$, $\tau=0.1$.

  3. We approximate the $L^2$-errors of $\mathcal{R}[v]_{\theta_{t_0}}(x)$ and $\mathcal{R}[\partial_x v]_{\theta_{t_0}}(x)$ with respect to the exact solution given by Margrabe's formula and its gradient (see the sketch after this list).
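The benchmark referred to in step 3 is Margrabe's formula for the exchange option; a minimal sketch, assuming zero dividend yields (in which case the riskless rate drops out), is:

```python
import numpy as np
from scipy.stats import norm

def margrabe(s1, s2, sigma1, sigma2, rho, tau):
    """Price of the exchange option E[max(S1_T - S2_T, 0)] under Black-Scholes."""
    sig = np.sqrt(sigma1**2 - 2.0 * rho * sigma1 * sigma2 + sigma2**2)
    d1 = (np.log(s1 / s2) + 0.5 * sig**2 * tau) / (sig * np.sqrt(tau))
    d2 = d1 - sig * np.sqrt(tau)
    return s1 * norm.cdf(d1) - s2 * norm.cdf(d2)
```

Its gradient is also available in closed form, $(\partial_{s_1},\partial_{s_2})=(\Phi(d_1),-\Phi(d_2))$, which is what the $L^2$-error of the gradient network is measured against.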

Figure A4 displays the average of the $L^2$-errors and its confidence interval. We can conclude that, for this particular problem, the accuracy of $\mathcal{R}[v]_{\theta_{t_0}}(x)$ does not strongly depend on the number of layers, and that there is no improvement beyond eight nodes per hidden layer; at that point the training (its inputs and the gradient descent algorithm together with the stopping criterion) becomes the limiting factor. The accuracy of $\mathcal{R}[\partial_x v]_{\theta_{t_0}}(x)$ is clearly better with two or three hidden layers than with just one. Moreover, there appears to be a benefit in taking as many as 10 nodes per hidden layer.

Figure A3. Left: Variance reduction in terms of number of optimizer iterations. Right: Variance reduction in terms of epsilon. Both are for Example A.3 and Algorithm 3.


Figure A4. Average error of the PDE solution approximation and its gradient, with 95% confidence intervals, for different combinations of the number of layers and network width. Left: error of the model. Right: error of the gradient of the model.
