
Unbiased Deep Solvers for Linear Parametric PDEs

Pages 299-329 | Received 06 Apr 2021, Accepted 14 Jan 2022, Published online: 28 Mar 2022

Abstract

We develop several deep learning algorithms for approximating families of parametric PDE solutions. The proposed algorithms approximate solutions together with their gradients, which, in the context of mathematical finance, means that the derivative prices and hedging strategies are computed simultaneously. Having approximated the gradient of the solution, one can combine it with a Monte Carlo simulation to remove the bias in the deep network approximation of the PDE solution (the derivative price). This is achieved by leveraging the Martingale Representation Theorem and combining the Monte Carlo simulation with the neural network. The resulting algorithm is robust with respect to the quality of the neural network approximation and consequently can be used as a black box in case only limited a priori information about the underlying problem is available. We believe this is important as neural network-based algorithms often require a fair amount of tuning to produce satisfactory results. The methods are empirically shown to work for high-dimensional problems (e.g., 100 dimensions). We provide diagnostics that shed light on appropriate network architectures.

1. Introduction

Numerical algorithms that solve PDEs suffer from the so-called 'curse of dimensionality', which makes it impractical to apply standard discretization algorithms, such as finite difference schemes, to high-dimensional PDEs. However, it has recently been shown that deep neural networks trained with stochastic gradient descent can overcome the curse of dimensionality (Beck, Gonon, and Jentzen 2020; Berner, Grohs, and Jentzen 2020), which has made them a popular choice for tackling this computational challenge in recent years.

In this work, we focus on the problem of numerically solving parametric linear PDEs arising from European option pricing in high dimensions. Let $B \subseteq \mathbb{R}^p$, $p \geq 1$, be a parameter space (for instance, in the Black–Scholes equation with fixed rate, $B$ is the domain of the volatility parameter). Consider $v = v(t,x;\beta)$ satisfying
$$\big[\partial_t v + b\,\nabla_x v + \tfrac{1}{2}\operatorname{tr}\!\big[\nabla_x^2 v\,\sigma\sigma^\top\big] - cv\big](t,x;\beta) = 0,\quad v(T,x;\beta) = g(x;\beta),\quad t \in [0,T],\ x \in \mathbb{R}^d,\ \beta \in B. \tag{1}$$
Here $t \in [0,T]$, $x \in \mathbb{R}^d$ and $\beta \in B$, and $b$, $\sigma$, $c$ and $g$ are functions of $(t,x;\beta)$ which specify the problem. The Feynman–Kac theorem provides a probabilistic representation for $v$, so that Monte Carlo methods can be used for its unbiased approximation at one single point $(t,x;\beta)$. What we propose in this work is a method for harnessing the power of deep learning algorithms to numerically solve (1) in a way that is robust even in edge cases where the output of the neural network is not of the expected quality, by combining them with Monte Carlo algorithms.

From the results in this article, we observe that neural networks provide an efficient computational device for high-dimensional problems. However, we also observed that these algorithms are sensitive to the network architecture, the parameters and the distribution of the training data; a fair amount of tuning is required to obtain good results. Based on this, we believe that there is great potential in combining artificial neural networks with already developed and well-understood probabilistic computational methods, in particular the control variate method, which uses potentially imperfect neural network approximations to obtain unbiased solutions to a given problem, see Algorithm 1.

1.1. Main Contributions

We propose three classes of learning algorithms for simultaneously finding the solutions and gradients of parametric families of PDEs.

  1. Projection solver: See Algorithm 2. We leverage the Feynman–Kac representation together with the fact that the conditional expectation can be viewed as an $L^2$-projection operator. The gradient can be obtained by automatic differentiation of the already-obtained approximation of the PDE solution.

  2. Martingale representation solver: See Algorithm 3. This algorithm was inspired by Cvitanic and Zhang (2005), Weinan, Han, and Jentzen (2017) and Han, Jentzen, and Weinan (2017), and is referred to as the deep BSDE solver. Our algorithm differs from Weinan, Han, and Jentzen (2017) in that we approximate the solution and its gradient at all the time steps and across the entire space and parameter domains, rather than at only one space-time point. Furthermore, we propose to approximate the solution map and its gradient by separate networks.

  3. Martingale control variates solver: See Algorithms 4 and 5. Here we exploit the fact that the martingale representation induces a control variate that can produce a zero-variance estimator. Obviously, such a control variate is not implementable, but it provides a basis for a novel learning algorithm for the PDE solution.

For each of these classes of algorithms, we develop and test different implementation strategies. Indeed, one can either take one (large) network to approximate the entire family of solutions of (1), or take a number of (smaller) networks, each of which approximates the solution at one time point of a grid. The former has the advantage that one can take an arbitrarily fine time discretization without increasing the overall network size. The advantage of the latter is that each learning task is simpler, due to each network being smaller. One can further leverage the smoothness of the solution in time and learn the weights iteratively, initializing the network parameters at each time step to those of the previous time step. We test both approaches numerically. At a high level, all the algorithms work in the path-dependent (non-Markovian) setting, but there the challenge is an efficient method for encoding the information in each path. This problem is solved in the companion paper (Sabate-Vidales, Šiška, and Szpruch 2020).

To summarize, the key contributions of this work are:

  1. We derive and implement three classes of learning algorithms for the approximation of a parametric PDE solution map and its gradient.

  2. We propose a novel iterative training algorithm that exploits the regularity of the function we seek to approximate and allows using neural networks with a smaller number of parameters.

  3. The proposed algorithms are truly black-box in that the quality of the network approximation only impacts the computational benefit of the approach and does not introduce approximation bias. This is achieved by combining the network approximation with Monte Carlo as stated in Algorithm 1.

  4. Code for the numerical experiments presented in this paper is being made available on GitHub: https://github.com/msabvid/Deep-PDE-Solvers.

We stress the importance of point 3 above by directing the reader's attention to Figure 1, where we test the generalization error of a trained neural network for the five-dimensional family of PDEs corresponding to pricing a basket option under the Black–Scholes model. We refer the reader to Example A.2 for details. We see that while the average error over the test set is of order $10^{-5}$, the errors for a given input vary significantly. Indeed, it has been observed in the deep learning community that for high-dimensional problems one can find input data on which a trained neural network that appears to generalize well (i.e., achieves small errors on out-of-training data) produces poor results (Goodfellow, Shlens, and Szegedy 2014).

Figure 1. Histogram of mean-square-error of solution to the PDE on the test data set.


1.2. Literature Review

Deep neural networks trained with stochastic gradient descent have proved extremely successful in a number of applications such as computer vision, natural language processing, generative models and reinforcement learning (LeCun, Bengio, and Hinton 2015). Their application to PDE solvers is relatively new and has been pioneered by Weinan, Han, and Jentzen (2017), Han, Jentzen, and Weinan (2017) and Sirignano and Spiliopoulos (2017). See also Cvitanic and Zhang (2005) for the idea of solving PDEs with gradient methods and for a direct PDE approximation algorithm. PDEs provide an excellent test bed for neural network approximation because (a) there exist alternative solvers, e.g., Monte Carlo, and (b) we have a well-developed theory for PDEs, and that knowledge can be used to tune the algorithms. This is in contrast to mainstream neural network applications such as text or image classification.

Apart from the growing body of empirical results in the literature on 'deep PDE solvers', Chan-Wai-Nam, Mikael, and Warin (2019), Huré, Pham, and Warin (2019), Beck et al. (2018), Jacquier and Oumgari (2019) and Henry-Labordere (2017), there have also been some important theoretical contributions. It has been proved that deep artificial neural networks approximate solutions to parabolic PDEs to an arbitrary accuracy without suffering from the curse of dimensionality. The first mathematically rigorous proofs are given in Grohs et al. (2018) and Jentzen, Salimova, and Welti (2018). The high-level idea is that a neural network approximation to the PDE can be established by building on the Feynman–Kac representation and a Monte Carlo approximation: by checking that Monte Carlo simulations do not suffer from the curse of dimensionality, one can deduce that the same is true for the neural network approximation. Furthermore, it has recently been demonstrated in Hu et al. (2021) and Mei, Montanari, and Nguyen (2018) that the noisy gradient descent algorithm used for training neural networks of the form considered in Grohs et al. (2018) and Jentzen, Salimova, and Welti (2018) induces a unique probability distribution over the parameter space which minimizes the learning objective. See Du et al. (2018), Chizat and Bach (2018), Rotskoff and Vanden-Eijnden (2018), Sirignano and Spiliopoulos (2020), Wang and Tang (2021) and Han and Long (2020) for related ideas on the convergence of gradient algorithms for overparametrized neural networks. This means that there are theoretical guarantees for the approximation of (parabolic) PDEs with neural networks trained by noisy gradient methods, alleviating the curse of dimensionality.

An important application of deep PDE solvers is that one can in fact approximate a parametric family of solutions of a PDE. To be more precise, let $B \subseteq \mathbb{R}^p$, $p \geq 1$, be a parameter space. In the context of finance, the parameters might be, for example, the initial volatility, the volatility of volatility, the interest rate and mean-reversion parameters. One can approximate the parametric family of functions $\{F(\cdot;\beta)\}_{\beta \in B}$ for an arbitrary range of parameters. This then allows for swift calibration of models to data (e.g., option prices). This is particularly appealing for high-dimensional problems, where calibrating directly using noisy Monte Carlo samples might be inefficient. This line of research has gained some popularity recently and the idea has been tested numerically on various models and data sets (Horvath, Muguruza, and Tomas 2019; Liu et al. 2019; Bayer and Stemper 2018; Stone 2018; Hernandez 2016; Itkin 2019; McGhee 2018). Some remarks are in order. In the context of model calibration, while the training might be expensive, it can be done offline, once and for all. One can also notice that the training data could be used to produce a 'look-up table' taking model parameters to prices. From this perspective, the neural network essentially becomes an interpolator and a compression tool: the number of parameters of the network is much smaller than the number of training data points, and it is therefore more efficient to store the former. The final remark is that while there are other methods available, such as Chebyshev functions, neural networks appear robust in high dimensions, which makes them our method of choice.

1.3. Notation

We denote by $\mathcal{DN}$ the set of all fully connected feedforward neural networks (see Appendix 3). We also use $\mathcal{R}[f]_\theta \in \mathcal{DN}$ with $\theta \in \mathbb{R}^\kappa$ to denote a neural network with weights $\theta$ approximating the function $f : \mathbb{R}^{d_0} \to \mathbb{R}^{d_1}$ for some $d_0, d_1 \in \mathbb{N}$.

1.4. Outline

This paper is organized as follows. Section 2 provides the theoretical underpinning for the derivation of all the algorithms we propose to solve (1). More specifically, in Section 2.2 we combine the approximation of the gradient of the PDE solution produced by the deep learning algorithms with Monte Carlo to obtain an unbiased approximation of the solution of the PDE. In Section 3, we describe the algorithms in detail.

Finally, in Section 4, we provide numerical tests of the proposed algorithms. We empirically test these methods on relevant examples, including a 100-dimensional option pricing problem, see Examples 4.4 and A.3. We carefully measure the training cost and report the variance reduction achieved.

Since we work in a situation where the function approximated by the neural network can also be obtained via other methods (Monte Carlo, PDE solvers), we are able to test how the expressiveness of fully connected artificial neural networks depends on the number of layers and the number of neurons per layer. See Section A.2 for details.

2. PDE Martingale Control Variate

The control variate method is one of the most powerful variance reduction techniques for Monte Carlo simulation. While a good control variate can reduce the computational cost of a Monte Carlo computation by several orders of magnitude, it relies on judiciously chosen control variate functions that are problem specific. For example, when computing the price of a basket option, a sound strategy is to choose as control variates call options written on each of the stocks in the basket, since in many models these are priced by closed-form formulae. In this article, we are interested in a black-box-type control variate approach that leverages the Martingale Representation Theorem and neural networks. The idea of using the martingale representation to obtain control variates goes back at least to Newton (1994). It has been further studied in combination with regression in Milstein and Tretyakov (2009) and Belomestny et al. (2018).

The bias in the approximation of the solution can be completely removed by employing control variates, where the deep network provides the control variate, resulting in a very high variance reduction factor in the corresponding Monte Carlo simulation.

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and consider an $\mathbb{R}^d$-valued Wiener process $W = (W^j)_{j=1}^d = ((W_t^j)_{t \geq 0})_{j=1}^d$. We will use $(\mathcal{F}_t^W)_{t \geq 0}$ to denote the filtration generated by $W$. Consider a $D \subseteq \mathbb{R}^d$-valued, continuous stochastic process, defined for the parameters $\beta \in B \subseteq \mathbb{R}^p$, $X^\beta = (X^{\beta,i})_{i=1}^d = ((X_t^{\beta,i})_{t \geq 0})_{i=1}^d$, adapted to $(\mathcal{F}_t^W)_{t \geq 0}$ and given as the solution to
$$dX_s^\beta = b(s, X_s^\beta; \beta)\,ds + \sigma(s, X_s^\beta; \beta)\,dW_s,\quad s \in [t,T],\quad X_t^\beta = x \in \mathbb{R}^d. \tag{2}$$
We will use $(\mathcal{F}_t^\beta)_{t \geq 0}$ to denote the filtration generated by $X^\beta$.

Let $g : \mathbb{R}^d \to \mathbb{R}$ be a measurable function and assume that there is a (stochastic) discount factor given by $D(t_1, t_2; \beta) := e^{-\int_{t_1}^{t_2} c(s, X_s^\beta; \beta)\,ds}$ for an appropriate function $c = c(t,x;\beta)$. We will omit $\beta$ from the discount factor notation for brevity. We now interpret $\mathbb{P}$ as some risk-neutral measure, and so the $\mathbb{P}$-price of our contingent claim is
$$v(t,x;\beta) := \mathbb{E}\big[D(t,T)\,g(X_T^\beta)\,\big|\,X_t^\beta = x\big]. \tag{3}$$
Say we have i.i.d. random variables $(X_T^{\beta,i})_{i=1}^N$ with the same distribution as $X_T^\beta$, where $X_t^{\beta,i} = x$ for each $i$. Then the standard Monte Carlo estimator is $v^N(t,x;\beta) := \frac{1}{N}\sum_{i=1}^N D^i(t,T)\,g(X_T^{\beta,i})$. Convergence $v^N(t,x;\beta) \to v(t,x;\beta)$ in probability as $N \to \infty$ is granted by the Law of Large Numbers. Moreover, the classical Central Limit Theorem tells us that
$$\mathbb{P}\Big(v(t,x;\beta) \in \Big[v^N(t,x;\beta) - z_{\alpha/2}\tfrac{\sigma}{\sqrt{N}},\; v^N(t,x;\beta) + z_{\alpha/2}\tfrac{\sigma}{\sqrt{N}}\Big]\Big) \to 1 - \alpha \quad \text{as } N \to \infty,$$
where $\sigma^2 := \operatorname{Var}[D(t,T)\,g(X_T^\beta)]$ and $z_{\alpha/2}$ is such that $1 - \Phi(z_{\alpha/2}) = \alpha/2$, with $\Phi$ the cumulative distribution function of the standard normal distribution. To decrease the width of the confidence interval one can increase $N$, but this also increases the computational cost. A better strategy is to reduce the variance by finding an alternative Monte Carlo estimator, say $V^N(t,x;\beta)$, such that
$$\mathbb{E}[V^N(t,x;\beta)] = v(t,x;\beta) \quad \text{and} \quad \operatorname{Var}[V^N(t,x;\beta)] < \operatorname{Var}[v^N(t,x;\beta)], \tag{4}$$
and the cost of computing $V^N(t,x;\beta)$ is similar to that of $v^N(t,x;\beta)$.

In the remainder of the article, we will devise and test several strategies, based on deep learning, to find a suitable estimator $V^N(t,x;\beta)$, by exploiting the connection between the SDE (2) and its associated PDE.

2.1. PDE Derivation of the Control Variate

It can be shown that, under suitable assumptions on $b$, $\sigma$, $c$ and $g$, and for fixed $\beta \in B$, we have $v \in C^{1,2}([0,T] \times D)$. See, e.g., Krylov (1999). Let $a := \frac{1}{2}\sigma\sigma^\top$. Then, from the Feynman–Kac formula (see, e.g., Th. 8.2.1 in Øksendal 2003), we get
$$\begin{cases} \big[\partial_t v + \operatorname{tr}(a\,\nabla_x^2 v) + b\,\nabla_x v - cv\big](t,x;\beta) = 0 & \text{in } [0,T) \times D,\\ v(T,\cdot) = g & \text{on } D. \end{cases} \tag{5}$$
Since $v \in C^{1,2}([0,T] \times D)$ and $v$ satisfies the above PDE, applying Itô's formula yields
$$D(t,T)\,v(T, X_T^\beta; \beta) = v(t,x;\beta) + \int_t^T D(t,s)\,\nabla_x v(s, X_s^\beta; \beta)\,\sigma(s, X_s^\beta; \beta)\,dW_s. \tag{6}$$
Hence the Feynman–Kac representation, together with the fact that $v(T, X_T^\beta; \beta) = g(X_T^\beta)$, yields
$$v(t,x;\beta) = D(t,T)\,g(X_T^\beta) - \int_t^T D(t,s)\,\nabla_x v(s, X_s^\beta; \beta)\,\sigma(s, X_s^\beta; \beta)\,dW_s. \tag{7}$$
Provided that $\sup_{s \in [t,T]} \mathbb{E}\big[|D(t,s)\,\nabla_x v(s, X_s^\beta; \beta)\,\sigma(s, X_s^\beta; \beta)|^2\big] < \infty$, the stochastic integral is a martingale. Thus we can consider the Monte Carlo estimator
$$V^N(t,x;\beta) := \frac{1}{N}\sum_{i=1}^N \Big\{ D^i(t,T)\,g(X_T^{\beta,i}) - \int_t^T D^i(t,s)\,\nabla_x v(s, X_s^{\beta,i}; \beta)\,\sigma(s, X_s^{\beta,i}; \beta)\,dW_s^i \Big\}. \tag{8}$$
To obtain a control variate, we thus need to approximate $\nabla_x v$. If one used classical approximation techniques for the PDE, such as finite difference or finite element methods, one would run into the curse of dimensionality – the very reason one employs Monte Carlo simulations in the first place. Artificial neural networks have been shown to break the curse of dimensionality in specific situations (Grohs et al. 2018). To be more precise, the authors of Berner, Grohs, and Jentzen (2020), Elbrächter et al. (2022), Jentzen, Salimova, and Welti (2018), Grohs, Jentzen, and Salimova (2019), Hutzenthaler et al. (2020), Grohs et al. (2018), Kutyniok et al. (2020), Gonon et al. (2021) and Reisinger and Zhang (2020) have shown that there always exists a deep feedforward neural network with parameters such that the network approximates the solution of a linear PDE arbitrarily well in a suitable norm, under reasonable assumptions (the terminal condition and the coefficients can themselves be approximated by neural networks). Moreover, the number of parameters grows only polynomially in the dimension, so there is no curse of dimensionality. However, while the papers above construct the network, they do not tell us how to find the 'good' parameters: in practice, the parameter search still relies on gradient descent-based minimization over a non-convex landscape. The application of the deep-network approximation of the PDE solution as a martingale control variate is an ideal compromise.

If there is no exact solution to the PDE (5), as would be the case in any reasonable application, then we will approximate $\nabla_x v$ by $\mathcal{R}[\nabla_x v]_\theta \in \mathcal{DN}$.

To obtain an implementable algorithm, we discretize the integrals above. We take a partition of $[0,T]$, denoted $\pi := \{t = t_0 < \cdots < t_{N_{\text{steps}}} = T\}$, and consider an approximation of (2) by $(X_{t_k}^{\beta,\pi})_{t_k \in \pi}$. For simplicity, we approximate all the integrals that arise by Riemann sums, always taking the left-hand point when approximating the value of the integrand.

The implementable control variate Monte Carlo estimator is then of the form
$$V^{\pi,\theta,\lambda,N}(t,x;\beta) := \frac{1}{N}\sum_{i=1}^N \Big\{ (D^\pi(t,T))^i\,g(X_T^{\beta,\pi,i}) - \lambda \sum_{k=0}^{N_{\text{steps}}-1} (D^\pi(t,t_k))^i\,\mathcal{R}[\nabla_x v]_\theta(t_k, X_{t_k}^{\beta,\pi,i}; \beta)\,\sigma(t_k, X_{t_k}^{\beta,\pi,i}; \beta)\,(W_{t_{k+1}}^i - W_{t_k}^i) \Big\}, \tag{9}$$
where $D^\pi(t,T) := e^{-\sum_{k=0}^{N_{\text{steps}}-1} c(t_k, X_{t_k}^{\beta,\pi})(t_{k+1} - t_k)}$ and $\lambda$ is a free parameter to be chosen (because we discretize and use an approximation of the PDE, it is expected that $\lambda \approx 1$). Again, we point out that the only bias of the above estimator comes from the numerical scheme used to solve the forward and backward processes. In particular, $\mathcal{R}[\nabla_x v]_\theta$ does not add any additional bias, regardless of the choice of $\theta$. We will discuss possible strategies for approximating $\nabla_x v$ with $\mathcal{R}[\nabla_x v]_\theta$ in the following section.

In this section, we have in fact derived an explicit form of the martingale representation (see, e.g., Cohen and Elliott 2015, Th. 14.5.1) of $D(t,T)\,g(X_T^\beta)$ in terms of the solution of the PDE associated with the process $X^\beta$, which is given as the solution to (2). In Appendix 1, we provide a more general framework for building a low-variance Monte Carlo estimator $V_t^N$ for any (possibly non-Markovian) $\mathcal{F}^W$-adapted process $X^\beta$.

2.2. Unbiased Parametric PDE Approximation

After having trained the networks $\mathcal{R}[\nabla_x v]_\theta$ and $\mathcal{R}[v]_\eta$ that approximate $\nabla_x v$ and $v$ (using any of Algorithms 2 and 3, which we will introduce in Section 3), one has two options to approximate $v(t, x_t; \beta)$:

  1. Directly with $\mathcal{R}[v]_\eta(t, x_t; \beta)$ if Algorithm 2 or 3 was used, which will introduce some approximation bias.

  2. By combining $\mathcal{R}[\nabla_x v]_\theta$ with the Monte Carlo approximation of $v(t, x_t; \beta)$ using (9), which will yield an unbiased estimator of $v(t, x_t; \beta)$. The complete method is stated as Algorithm 1; a code sketch follows below.
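To make option 2 concrete, the following is a minimal sketch of the estimator (9) with $\lambda = 1$, written in PyTorch. It assumes a trained gradient network `grad_v_net(t, x, beta)` approximating $\nabla_x v$, together with user-supplied coefficient functions `b`, `sigma`, `c` and payoff `g`; all names are ours for illustration and do not come from the published code.

```python
import torch

def mc_control_variate(g, b, sigma, c, grad_v_net, x0, beta, T, n_steps, n_paths):
    """Unbiased estimator (9) with lambda = 1: Euler scheme for (2) plus the
    left-point Riemann sum of the control variate stochastic integral.
    Any error in grad_v_net increases the variance but never adds bias."""
    h = T / n_steps
    x = x0.unsqueeze(0).expand(n_paths, -1).clone()   # (N, d) current states
    disc = torch.ones(n_paths)                        # running discount D(t, t_k)
    mart = torch.zeros(n_paths)                       # accumulated control variate
    t = 0.0
    for _ in range(n_steps):
        dw = torch.randn_like(x) * h ** 0.5           # Brownian increments
        vol = sigma(t, x, beta)                       # (N, d, d) diffusion matrix
        grad = grad_v_net(t, x, beta)                 # (N, d) approx. of grad_x v
        mart = mart + disc * torch.einsum('ni,nij,nj->n', grad, vol, dw)
        disc = disc * torch.exp(-c(t, x, beta) * h)   # left-point discounting
        x = x + b(t, x, beta) * h + torch.einsum('nij,nj->ni', vol, dw)
        t += h
    samples = disc * g(x) - mart                      # D(t,T) g(X_T) minus control variate
    return samples.mean(), samples.var()
```

Here `sigma` is assumed to return the full diffusion matrix per path; in the Black–Scholes model of Section 4.1, one would instead simulate (12) exactly.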

3. Deep PDE Solvers

In this section, we propose two algorithms that learn the PDE solution (or its gradient) and then use it to build the control variate via (9). We also include in Appendix 2 an additional algorithm for solving such linear PDEs using deep neural networks.

3.1. Projection Solver

Before we proceed further, we recall a well-known property of conditional expectations; for a proof see, e.g., Krylov (2002, Ch. 3, Th. 14).

Theorem 3.1

Let $X \in L^2(\mathcal{F})$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. There exists a random variable $Y \in L^2(\mathcal{G})$ such that $\mathbb{E}[|X - Y|^2] = \inf_{\eta \in L^2(\mathcal{G})} \mathbb{E}[|X - \eta|^2]$. The minimizer $Y$ is unique and is given by $Y = \mathbb{E}[X \mid \mathcal{G}]$.

The theorem tells us that the conditional expectation is an orthogonal projection of a random variable $X$ onto $L^2(\mathcal{G})$. Instead of working directly with (5), we work with its probabilistic representation (6). To formulate the learning task, we take $X = D(t,T)\,g(X_T^\beta)$, so that $v(t, X_t^\beta; \beta) = \mathbb{E}[X \mid X_t^\beta]$. Hence, by Theorem 3.1, $\mathbb{E}[|X - v(t, X_t^\beta; \beta)|^2] = \inf_{\eta \in L^2(\sigma(X_t^\beta))} \mathbb{E}[|X - \eta|^2]$, and we know that, for fixed $t$, the random variable which minimizes the mean square error is a function of $X_t^\beta$. By the Doob–Dynkin Lemma (Cohen and Elliott 2015, Th. 1.3.12), every $\eta \in L^2(\sigma(X_t^\beta))$ can be expressed as $\eta = h_t(X_t^\beta)$ for some appropriate measurable $h_t$. For the practical algorithm, we restrict the search for the function $h_t$ to the class of functions that can be expressed as deep neural networks in $\mathcal{DN}$. Hence we consider a family of networks $\mathcal{R}_\theta \in \mathcal{DN}$ and set the learning task as
$$\theta^* := \operatorname*{argmin}_\theta\, \mathbb{E}_\beta\Big[\mathbb{E}_{(X_t^{\beta,\pi})_{t \in \pi}}\Big[\sum_{k=0}^{N_{\text{steps}}} \big(D(t_k, T)\,g(X_T^{\beta,\pi}) - \mathcal{R}[v]_\theta^{t_k}(X_{t_k}^{\beta,\pi}; \beta)\big)^2\Big]\Big]. \tag{10}$$
The inner expectation in (10) is taken across all paths generated using a numerical scheme for (2) with a fixed $\beta$, and it allows us to solve the PDE (5) for that $\beta$. The outer expectation is taken over $\beta$, whose distribution is fixed beforehand (e.g., uniform on $B$ if $B$ is compact), thus allowing the algorithm to find the optimal neural network weights $\theta$ for the parametric family of PDEs (5). Automatic differentiation is used to approximate $\nabla_x v$. Algorithm 2 describes the method; a code sketch of the loss follows.
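As an illustration, here is a minimal sketch, under our own naming assumptions, of the Monte Carlo version of the loss (10): one network `v_nets[k]` per grid point $t_k$, pre-simulated paths of shape `(N, n_steps + 1, d)` and running discount factors `disc[:, k]` $\approx D(t, t_k)$.

```python
import torch

def projection_loss(v_nets, g, paths, disc, beta):
    """Empirical version of the projection loss (10): at each t_k the network
    is fitted as the L^2-projection of the discounted payoff onto functions
    of X_{t_k}."""
    payoff = g(paths[:, -1, :])                        # g(X_T), shape (N,)
    loss = torch.tensor(0.0)
    for k, net in enumerate(v_nets):
        d_k_T = disc[:, -1] / disc[:, k]               # D(t_k, T) along each path
        pred = net(paths[:, k, :], beta).squeeze(-1)   # R[v]_theta^{t_k}
        loss = loss + ((d_k_T * payoff - pred) ** 2).mean()
    return loss
```

The gradient $\nabla_x v$ needed for the control variate is then obtained from the trained networks by automatic differentiation, e.g. via `torch.autograd.grad`.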

3.2. Probabilistic Representation Based on Backward SDE

Instead of working directly with (5), we work with its probabilistic representation (6) and view it as a BSDE. To formulate the learning task based on this, we recall the time grid $\pi$, so that we can write it recursively as
$$v(t_{N_{\text{steps}}}, X_{t_{N_{\text{steps}}}}^\beta; \beta) = g(X_{t_{N_{\text{steps}}}}^\beta),$$
$$D(t, t_{m+1})\,v(t_{m+1}, X_{t_{m+1}}^\beta; \beta) = D(t, t_m)\,v(t_m, X_{t_m}^\beta; \beta) + \int_{t_m}^{t_{m+1}} D(t,s)\,\nabla_x v(s, X_s^\beta; \beta)\,\sigma(s, X_s^\beta; \beta)\,dW_s$$
for $m = 0, 1, \ldots, N_{\text{steps}} - 1$. Next, consider deep network approximations for each time step in $\pi$, for both the solution of (5) and its gradient:
$$\mathcal{R}[v]_{\eta_m}(x;\beta) \approx v(t_m, x; \beta), \qquad \mathcal{R}[\nabla_x v]_{\theta_m}(x;\beta) \approx \nabla_x v(t_m, x; \beta), \qquad t_m \in \pi,\ x \in \mathbb{R}^d.$$
The approximations depend on the weights $\eta_m \in \mathbb{R}^{k_\eta}$, $\theta_m \in \mathbb{R}^{k_\theta}$. We then set the learning task as
$$(\eta^*, \theta^*) := \operatorname*{argmin}_{(\eta,\theta)}\, \mathbb{E}_{\beta, X^\beta}\Big[\big|g(X_{t_{N_{\text{steps}}}}^{\beta,\pi}) - \mathcal{R}[v]_{\eta_{N_{\text{steps}}}}(X_{t_{N_{\text{steps}}}}^{\beta,\pi})\big|^2 + \frac{1}{N_{\text{steps}}}\sum_{m=0}^{N_{\text{steps}}-1} |E_{m+1}(\eta,\theta)|^2\Big], \tag{11}$$
$$E_{m+1}(\eta,\theta) := D(t, t_{m+1})\,\mathcal{R}[v]_{\eta_{m+1}}(X_{t_{m+1}}^{\beta,\pi};\beta) - D(t, t_m)\,\mathcal{R}[v]_{\eta_m}(X_{t_m}^{\beta,\pi};\beta) - D(t, t_m)\,\mathcal{R}[\nabla_x v]_{\theta_m}(X_{t_m}^{\beta,\pi};\beta)\,\sigma(t_m, X_{t_m}^{\beta,\pi};\beta)\,\Delta W_{t_{m+1}},$$
where $\eta = \{\eta_0, \ldots, \eta_{N_{\text{steps}}}\}$, $\theta = \{\theta_0, \ldots, \theta_{N_{\text{steps}}}\}$. The complete learning method is stated as Algorithm 3, where we split the optimization (11) into several optimization problems, one per time step: learning the weights $\theta_m$ or $\eta_m$ at a time step $t_m < t_{N_{\text{steps}}}$ only requires knowing the weights $\eta_{m+1}$. At $m = N_{\text{steps}}$, learning the weights $\eta_{N_{\text{steps}}}$ only requires the terminal condition $g$. Note that the algorithm assumes that networks adjacent in time will be similar; we therefore initialize $\eta_m$ and $\theta_m$ with $\eta_{m+1}$ and $\theta_{m+1}$. A sketch of one backward training step follows.
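A minimal sketch of one backward step of this procedure, under our own naming conventions: `v_m`, `z_m` are the networks at $t_m$, `v_m1` is the frozen value network at $t_{m+1}$, and `d_m`, `d_m1` are the discount factors $D(t,t_m)$, $D(t,t_{m+1})$ along each simulated path.

```python
import torch

def train_time_step(v_m, z_m, v_m1, x_m, x_m1, d_m, d_m1, sig_m, dw, beta,
                    n_iters=500, lr=1e-3):
    """Fit v_m ~ v(t_m, .) and z_m ~ grad_x v(t_m, .) by minimizing the
    one-step residual |E_{m+1}|^2 from (11); v_m1 is kept frozen."""
    # warm start: networks adjacent in time are expected to be similar
    v_m.load_state_dict(v_m1.state_dict())
    opt = torch.optim.Adam(list(v_m.parameters()) + list(z_m.parameters()), lr=lr)
    with torch.no_grad():
        target = d_m1 * v_m1(x_m1, beta).squeeze(-1)   # D(t,t_{m+1}) R[v]_{m+1}
    for _ in range(n_iters):
        pred = d_m * v_m(x_m, beta).squeeze(-1)        # D(t,t_m) R[v]_m
        stoch = d_m * torch.einsum('ni,nij,nj->n', z_m(x_m, beta), sig_m, dw)
        loss = ((target - pred - stoch) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```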

3.3. Martingale Control Variate Deep Solvers

So far, the presented methodology for obtaining the control variate consists of first learning the solution of the PDE and, more importantly, its gradient (Algorithms 2 and 3), which is then plugged into (9). Alternatively, one can directly use the variance of (9) as the loss function to be optimized in order to learn the control variate. We expand on this idea and design two additional algorithms.

Recall the definition of $V_{t,T}^{\beta,\pi,\theta,\lambda,N}$ given by (9). From (8) we know that the theoretical control variate Monte Carlo estimator has zero variance, and so it is natural to set up a learning task which aims to learn the network weights $\theta$ in a way which minimizes said variance:
$$\theta^{*,\text{var}} := \operatorname*{argmin}_\theta\, \operatorname{Var}\big[V_{t,T}^{\beta,\pi,\theta,\lambda,N}\big].$$
Setting $\lambda = 1$, the learning task is stated as Algorithm 4; a sketch of the loss follows.
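A minimal sketch of this loss, with our own naming; `paths`, `disc`, `dw` and the diffusion values `sig` are pre-simulated as before.

```python
import torch

def variance_loss(grad_v_net, g, paths, disc, sig, dw, beta):
    """Empirical variance of the controlled samples in (9) with lambda = 1,
    used directly as the training objective of Algorithm 4."""
    payoff = disc[:, -1] * g(paths[:, -1, :])          # D(t,T) g(X_T)
    mart = torch.zeros_like(payoff)
    for k in range(paths.shape[1] - 1):
        grad = grad_v_net(k, paths[:, k, :], beta)     # R[grad_x v]_theta at t_k
        mart = mart + disc[:, k] * torch.einsum(
            'ni,nij,nj->n', grad, sig[:, k], dw[:, k])
    return (payoff - mart).var()                       # minimized over theta
```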

We include a second, similar algorithm in Appendix 2.

4. Examples and Experiments

4.1. Options in Black–Scholes Model on d>1 Assets

Take a $d$-dimensional Wiener process $W$. We assume that we are given a symmetric, positive-definite matrix (covariance matrix) $\Sigma$ and a lower triangular matrix $C$ s.t. $\Sigma = CC^\top$. For such a positive-definite $\Sigma$ we can always use the Cholesky decomposition to find $C$. The risky assets will have volatilities given by $\sigma^i$. We will (abusing notation) write $\sigma^{ij} := \sigma^i C^{ij}$ when we don't need to separate the volatility of a single asset from the correlations. The risky assets under the risk-neutral measure are then given by
$$dS_t^i = r S_t^i\,dt + \sigma^i S_t^i \sum_j C^{ij}\,dW_t^j. \tag{12}$$
All sums will be from 1 to $d$ unless indicated otherwise. Note that the SDE can be simulated exactly, since
$$S_{t_{n+1}}^i = S_{t_n}^i \exp\Big(\Big(r - \tfrac{1}{2}\sum_j (\sigma^{ij})^2\Big)(t_{n+1} - t_n) + \sum_j \sigma^{ij}\big(W_{t_{n+1}}^j - W_{t_n}^j\big)\Big).$$
The associated PDE is (with $a^{ij} := \sum_k \sigma^{ik}\sigma^{jk}$)
$$\partial_t v(t,S) + \tfrac{1}{2}\sum_{i,j} a^{ij} S^i S^j\,\partial_{S^i}\partial_{S^j} v(t,S) + r\sum_i S^i\,\partial_{S^i} v(t,S) - r\,v(t,S) = 0,$$
for $(t,S) \in [0,T) \times (\mathbb{R}^+)^d$, together with the terminal condition $v(T,S) = g(S)$ for $S \in (\mathbb{R}^+)^d$.
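For concreteness, a short sketch of the exact simulation (function and variable names are ours):

```python
import torch

def simulate_bs_paths(s0, r, sigma, chol, T, n_steps, n_paths):
    """Exact simulation of (12): the log-price increments are Gaussian.
    s0: (d,) initial prices, sigma: (d,) volatilities,
    chol: (d, d) lower-triangular C with Sigma = C C^T."""
    d = s0.shape[0]
    h = T / n_steps
    sig = sigma[:, None] * chol                     # sigma^{ij} = sigma^i C^{ij}
    drift = (r - 0.5 * (sig ** 2).sum(dim=1)) * h   # (d,) exact log-drift per step
    s = torch.empty(n_paths, n_steps + 1, d)
    s[:, 0] = s0
    for n in range(n_steps):
        dw = torch.randn(n_paths, d) * h ** 0.5     # Brownian increments
        s[:, n + 1] = s[:, n] * torch.exp(drift + dw @ sig.T)
    return s
```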

4.2. Deep Learning Setting

In this section, we describe the neural networks used in the four proposed algorithms, as well as the training setting, in the specific situation of an options problem in the Black–Scholes model on $d > 1$ assets.

Learning Algorithms 3–5 share the same underlying fully connected network architecture, with different networks used at different $t_k$, $k = 0, 1, \ldots, N_{\text{steps}} - 1$. At each time step, we use a fully connected artificial neural network denoted $\mathcal{R}[\cdot]_{\theta_k} \in \mathcal{DN}$. The choice of the number of layers and the network width is motivated by empirical results on different possible architectures applied to a short-lived options problem. We present the results of this study in Appendix A.2. The architecture is similar to that proposed in Beck et al. (2018).

At each time step, the network consists of four layers: one $d$-dimensional input layer, two $(d+20)$-dimensional hidden layers and one output layer. The output layer is one-dimensional if the network is an approximation of $v$ and $d$-dimensional if the network is an approximation of $\nabla_x v$. The non-linear activation function used on the hidden layers is the rectified linear unit (ReLU). In all experiments, except for Algorithm 3 on the basket options problem, we used batch normalization (Ioffe and Szegedy 2015) on the input of each network, just before the two non-linear activations in front of the hidden layers, and also after the last linear transformation.

The networks' optimal parameters are approximated by the Adam optimizer (Kingma and Ba 2017) on the loss function specific to each method. Each parameter update (i.e., one step of the optimizer) is calculated on a batch of $5 \times 10^3$ paths $(x_{t_n}^i)_{n=0}^{N_{\text{steps}}}$ obtained by simulating the SDE. We take the necessary number of training steps until the stopping criterion defined below is met, with a learning rate of $10^{-3}$ during the first $10^4$ iterations, decreased to $10^{-4}$ afterwards.

During training of any of the algorithms, the loss value at each iteration is recorded. A model is considered trained if the difference between the loss averages over the two last consecutive windows of length 100 is less than a certain $\epsilon$.
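In code, the stopping criterion can be expressed as follows (a sketch; the window length 100 and the tolerance $\epsilon$ match the description above):

```python
def has_converged(losses, window=100, eps=5e-6):
    """Stop when the averages over the two last consecutive windows of
    `window` recorded loss values differ by less than eps."""
    if len(losses) < 2 * window:
        return False
    last = sum(losses[-window:]) / window
    prev = sum(losses[-2 * window:-window]) / window
    return abs(last - prev) < eps
```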

4.3. Evaluating Variance Reduction

We use the specified network architectures to assess the variance reduction in several examples below. After training the models in each particular example, they are evaluated as follows:

  1. We calculate, $N_{\text{MC}} = 10$ times, the Monte Carlo estimate $\bar{\Xi}_T := \frac{1}{N_{\text{in}}}\sum_{i=1}^{N_{\text{in}}} \Xi_T^i$ and the Monte Carlo with control variate estimate $\bar{V}_{t,T}^{\pi,\theta,\lambda,N_{\text{steps}}} = \frac{1}{N_{\text{in}}}\sum_{i=1}^{N_{\text{in}}} V_{t,T}^{\pi,\theta,\lambda,N_{\text{steps}},i}$, using $N_{\text{in}} = 10^6$ Monte Carlo samples.

  2. By the Central Limit Theorem, as $N_{\text{in}}$ increases the standardized estimators converge in distribution to the normal. Therefore, a 95% confidence interval for the variance of the estimator is given by
$$\Big[\frac{(N_{\text{MC}}-1)S^2}{\chi^2_{1-\alpha/2,\,N_{\text{MC}}-1}},\; \frac{(N_{\text{MC}}-1)S^2}{\chi^2_{\alpha/2,\,N_{\text{MC}}-1}}\Big],$$
where $S^2$ is the sample variance of the $N_{\text{MC}}$ controlled estimators $\bar{V}_{t,T}^{\pi,\theta,\lambda,N_{\text{steps}}}$ and $\alpha = 0.05$. These intervals are calculated for both the Monte Carlo estimate and the Monte Carlo with control variate estimate (see the sketch after this list).

  3. We use the $N_{\text{MC}} \cdot N_{\text{in}} = 10^7$ generated samples $\Xi_T^i$ and $V_{t,T}^{\pi,\theta,\lambda,N_{\text{steps}},i}$ to calculate and compare the empirical variances $\tilde{\sigma}^2_{\Xi_T}$ and $\tilde{\sigma}^2_{V_{t,T}^{\pi,\theta,\lambda,N_{\text{steps}}}}$.

  4. The number of optimizer steps and, equivalently, the number of random paths generated for training provide a cost measure for the proposed algorithms.

  5. We evaluate the variance reduction if we use the trained models to create control variates for options in Black–Scholes models with volatilities different from the one used to train our models.
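Step 2 of this procedure can be sketched as follows, using SciPy's chi-squared quantiles (the function name is ours):

```python
import numpy as np
from scipy.stats import chi2

def variance_confidence_interval(estimates, alpha=0.05):
    """Confidence interval for the variance of an estimator from N_MC
    independent replications, via the chi-squared distribution (step 2)."""
    n = len(estimates)                       # N_MC replications
    s2 = np.var(estimates, ddof=1)           # sample variance S^2
    lower = (n - 1) * s2 / chi2.ppf(1.0 - alpha / 2, df=n - 1)
    upper = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)
    return lower, upper
```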

Example 4.1

Low dimensional problem with explicit solution

We consider an exchange option on two assets. In this case, the exact price is given by the Margrabe formula. We take $d = 2$, $S_0^i = 100$, $r = 5\%$, $\sigma^i = 30\%$, $\Sigma_{ii} = 1$, $\Sigma_{ij} = 0$ for $i \neq j$. The payoff is $g(S) = g(S^{(1)}, S^{(2)}) := \max(0, S^{(1)} - S^{(2)})$. From Margrabe's formula, we know that
$$v(0,S) = \mathrm{BlackScholes}\big(\text{risky price} = S^{(1)}/S^{(2)},\ \text{strike} = 1,\ T,\ r,\ \bar{\sigma}\big),$$
where $\bar{\sigma} := \sqrt{(\sigma^{11} - \sigma^{21})^2 + (\sigma^{22} - \sigma^{12})^2}$.
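For reference, a minimal implementation of the Margrabe benchmark price (the interest rate cancels for an exchange option):

```python
from math import log, sqrt
from statistics import NormalDist

def margrabe_price(s1, s2, T, sigma_bar):
    """Margrabe price of the exchange option max(0, S1 - S2): Black-Scholes
    on the ratio S1/S2 with strike 1 and volatility sigma_bar."""
    d1 = (log(s1 / s2) + 0.5 * sigma_bar ** 2 * T) / (sigma_bar * sqrt(T))
    d2 = d1 - sigma_bar * sqrt(T)
    N = NormalDist().cdf
    return s1 * N(d1) - s2 * N(d2)

# e.g. with the parameters of Example 4.1: sigma_bar = sqrt(0.3**2 + 0.3**2)
```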

We organize the experiment as follows: we train our models with batches of 5000 random paths $(s_{t_n}^i)_{n=0}^{N_{\text{steps}}}$ sampled from the SDE (12), where $N_{\text{steps}} = 50$. The assets' initial values $s_{t_0}^i$ are sampled from the lognormal distribution $X\exp\big((\mu - \tfrac{1}{2}\sigma^2)\tau + \sigma\sqrt{\tau}\,\xi\big)$, where $\xi \sim N(0,1)$, $\mu = 0.08$, $\tau = 0.1$. The existence of an explicit solution allows us to build a control variate of the form (9) using the known exact solution to obtain $\nabla_x v$. For a fixed number of time steps $N_{\text{steps}}$, this provides an upper bound on the variance reduction an artificial neural network approximation of $\nabla_x v$ can achieve.

We follow the evaluation framework above, simulating $N_{\text{MC}} \cdot N_{\text{in}}$ paths from (12) with constant initial value $(S_0^1, S_0^2) = (1,1)$. We report the following results:

  1. Table 1 provides the empirical variances calculated over $10^6$ generated Monte Carlo samples and their corresponding control variates. The variance reduction measure indicates the quality of each control variate method. The variance reduction using the control variate given by Margrabe's formula provides a benchmark for our methods. Table 1 also provides the cost of training for each method, given by the number of optimizer iterations performed before hitting the stopping criterion defined before with $\epsilon = 5 \times 10^{-6}$. We add an additional row with the control variate built using automatic differentiation on the network parametrized using the Deep Galerkin Method (Sirignano and Spiliopoulos 2017). The DGM attempts to find the optimal parameters of the network satisfying the PDE on a pre-determined time and space domain. In contrast to our algorithms, the DGM method is not restricted to learning the solution of the PDE on the paths built from the probabilistic representation of the PDE. However, this restriction is precisely what enhances the performance of our methods in terms of variance reduction, since they specifically learn an approximation of the solution of the PDE and its gradient such that the resulting control variate yields a low-variance Monte Carlo estimator.

  2. Table 2 provides the confidence intervals for the variances of the Monte Carlo estimator and of the Monte Carlo estimator with control variate, assuming these are calculated on $10^6$ random paths. Moreover, we add the confidence interval for the variance of the Monte Carlo estimator calculated over $N_{\text{in}}$ antithetic paths, where the first $N_{\text{in}}/2$ Brownian paths are generated using samples $(Z_i)_{i=1,\ldots,N_{\text{steps}}}$ from a normal distribution and the second half of the Brownian paths are generated using the antithetic samples $(-Z_i)_{i=1,\ldots,N_{\text{steps}}}$. See Belomestny, Iosipoi, and Zhivotovskiy (2017, Section 4.2) for more details. All the algorithms proposed in this paper outperform the Monte Carlo estimator and the Monte Carlo estimator with antithetic paths; compared to the latter, our algorithms produce unbiased estimators with variances that are two orders of magnitude smaller.

  3. Figure 2 studies the iterative training for the BSDE solver. As has been observed before, this type of training does not allow us to study the overall loss function as the number of training steps increases. Therefore, we train the same model four times for different values of $\epsilon$ between $0.01$ and $5 \times 10^{-6}$, and we study the number of iterations necessary to meet the stopping criterion defined by $\epsilon$, the variance reduction once the stopping criterion is met, and the relationship between the number of iterations and the variance reduction. Note that the variance reduction stabilizes for $\epsilon < 10^{-5}$. Moreover, the number of iterations necessary to meet the stopping criterion increases exponentially as $\epsilon$ decreases, and therefore for the results reported in Tables 1 and 2 we employ $\epsilon = 5 \times 10^{-6}$ (Figure 3).

  4. Figure 4 displays the variance reduction after using the trained models on several Black–Scholes exchange option problems with values of $\sigma$ other than $0.3$, which was the one used for training. We see that the various algorithms work similarly well in this case (not taking training cost into account). We note that the variance reduction is close to the theoretical maximum, which is restricted by the time discretization. Finally, we see that the variance reduction is still significant even when the neural network is applied with model parameters different from those it was trained with (in our case, the volatility in the option pricing example). The labels of Figure 4 can be read as follows:

    1. MC + CV Corr op: Monte Carlo estimate with Deep Learning-based Control Variate built using Algorithm 5.

    2. MC + CV Var op: Monte Carlo estimate with Deep Learning-based Control Variate built using Algorithm 4.

    3. MC + CV BSDE solver: Monte Carlo estimate with Deep Learning-based Control Variate built using Algorithm 3.

    4. MC + CV Margrabe: Monte Carlo estimate with Control Variate using analytical solution for this problem given by Margrabe formula.

Figure 2. Left: Variance reduction in terms of number of optimizer iterations. Right: Variance reduction in terms of epsilon. Both are for Example 4.1 and Algorithm 3.


Figure 3. Number of optimizer iterations in terms of epsilon for Example 4.1 and Algorithm 3.


Figure 4. Variance reduction achieved by a network trained with $\sigma = 0.3$ but then applied in situations where $\sigma \in [0.2, 0.4]$. We can see that a significant variance reduction is achieved by a neural network that was trained with an 'incorrect' $\sigma$. Note that 'MC + CV Margrabe' displays the optimal variance reduction that can be achieved by using the exact solution to the problem. The variance reduction is not infinite even in this case, since the stochastic integrals are approximated by Riemann sums.

Table 1. Results on exchange option problem on two assets, Example 4.1. Empirical Variance and variance reduction factor.

Table 2. Results on exchange option problem on two assets, Example 4.1.

Example 4.2

Low-dimensional problem with explicit solution – Approximation of Price using PDE solver compared to Control Variate

We consider exchange options on two assets as in Example 4.1. We consider Algorithm 3, which can be applied in two different ways:

  1. It directly approximates the solution of the PDE (5) and its gradient at every point.

  2. We can use the approximation of $\nabla_x v$ to build the control variate using the probabilistic representation (6) of the PDE.

We compare both applications by calculating the expected $L^2$-error of each of them with respect to the analytical solution given by the Margrabe formula: from Margrabe's formula, we know that $v(0,S) = \mathrm{BlackScholes}(\text{risky price} = S^{(1)}/S^{(2)}, \text{strike} = 1, T, r, \bar{\sigma})$. Let $\mathcal{R}[v]_{\eta_0}(x) \approx v(0,x)$ be the deep learning approximation of the price at any point at initial time, calculated using Algorithm 3, and let $\mathcal{R}[\nabla_x v]_{\theta_m}(x) \approx \nabla_x v(t_m, x)$ be the deep learning approximation of its gradient at every time step of the time discretization. The aim of this experiment is to show that even if Algorithm 3 numerically converges to a biased approximation of $v(0,x)$ (see Figure 5, left), it is still possible to use $\mathcal{R}[\nabla_x v]_{\theta_m}(x)$ to build an unbiased Monte Carlo approximation of $v(0,x)$ with low variance.

Figure 5. Left: Loss of Algorithm 3 and squared error of $\mathcal{R}[v](t, x_0)$ in terms of training iterations. Right: Expected MSE of the two different approaches with respect to the analytical solution, in terms of the number of Monte Carlo samples.

We organize the experiment as follows.

  1. We calculate the expected value of the $L^2$-error of $\mathcal{R}[v]_{\eta_0}(x)$, where each component of $x \in \mathbb{R}^2$ is sampled from a lognormal distribution:
$$\mathbb{E}\big[|v(0,x) - \mathcal{R}[v]_{\eta_0}(x)|^2\big] \approx \frac{1}{N}\sum_{i=1}^N |v(0, x_i) - \mathcal{R}[v]_{\eta_0}(x_i)|^2.$$

  2. We calculate the expected value of the $L^2$-error of the Monte Carlo estimator with control variate, where each component of $x \in \mathbb{R}^2$ is sampled from a lognormal distribution:
$$\mathbb{E}\big[|v(0,x) - V_{0,T}^{\pi,\theta,\lambda,N_{\text{MC}},x}|^2\big] \approx \frac{1}{N}\sum_{i=1}^N |v(0, x_i) - V_{0,T}^{\pi,\theta,\lambda,N_{\text{MC}},x_i}|^2,$$
where $V_{0,T}^{\pi,\theta,\lambda,N_{\text{MC}},x}$ is given by (9) and is calculated for different numbers of Monte Carlo samples.

  3. We calculate the expected value of the $L^2$-error of the Monte Carlo estimator without control variate, where each component of $x \in \mathbb{R}^2$ is sampled from a lognormal distribution:
$$\mathbb{E}\big[|v(0,x) - \Xi_{0,T}^{N_{\text{MC}},x}|^2\big] \approx \frac{1}{N}\sum_{i=1}^N |v(0, x_i) - \Xi_{0,T}^{N_{\text{MC}},x_i}|^2, \quad \text{where } \Xi_{0,T}^{N_{\text{MC}},x} := \frac{1}{N_{\text{MC}}}\sum_{j=1}^{N_{\text{MC}}} D(t,T)\,g(X_T^j).$$

Figure 5 (right) provides one realization of the described experiment for different numbers of Monte Carlo samples between 10 and 200. It shows that, in this realization, 60 Monte Carlo samples are enough to build a Monte Carlo estimator with control variate having a lower error than the biased solution provided by Algorithm 3.

Example 4.3

Low-dimensional problem with explicit solution. Training on random values for volatility

We consider an exchange option on two assets; the exact price is given by the Margrabe formula. The difference with respect to the previous example is that now we aim to generalize our model so that it can build control variates for different Black–Scholes models. For this we take $d = 2$, $S_0^i = 100$, $r = 0.05$, $\sigma^i \sim \mathrm{Unif}(0.2, 0.4)$, $\Sigma_{ii} = 1$, $\Sigma_{ij} = 0$ for $i \neq j$.

The payoff is $g(S) = g(S^{(1)}, S^{(2)}) := \max(0, S^{(1)} - S^{(2)})$. We organize the experiment as follows: for comparison purposes with the BSDE solver from the previous example, we train our model for exactly the same number of iterations, i.e., 1380 batches of 5000 random paths $(s_{t_n}^i)_{n=0}^{N_{\text{steps}}}$ sampled from the SDE (12), where $N_{\text{steps}} = 50$. The assets' initial values $s_{t_0}^i$ are sampled from the lognormal distribution $X\exp\big((\mu - \tfrac{1}{2}\sigma^2)\tau + \sigma\sqrt{\tau}\,\xi\big)$, where $\xi \sim N(0,1)$, $\mu = 0.08$, $\tau = 0.1$. Since $\sigma$ can now take different values, it is included as an input to the networks at each time step.

The existence of an explicit solution allows us to build a control variate of the form (9) using the known exact solution to obtain $\nabla_x v$. For a fixed number of time steps $N_{\text{steps}}$, this provides an upper bound on the variance reduction an artificial neural network approximation of $\nabla_x v$ can achieve.

Figure 6 adds the performance of this model to Figure 4, where the variance reduction of the control variate is displayed for different values of the volatility between 0.2 and 0.4.

Figure 6. Extension of Figure 4 with variance reduction achieved by training the model on different Black–Scholes models.

Example 4.4

High-dimensional problem, exchange against average

We extend the previous example to 100 dimensions. This example is similar to EX10E from Broadie, Du, and Moallemi (2015). We take $S_0^i = 100$, $r = 5\%$, $\sigma^i = 30\%$, $\Sigma_{ii} = 1$, $\Sigma_{ij} = 0$ for $i \neq j$.

We take the payoff to be $g(S) := \max\big(0,\, S^1 - \frac{1}{d-1}\sum_{i=2}^d S^i\big)$. The experiment is organized as follows: we train our models with batches of $5 \times 10^3$ random paths $(s_{t_n}^i)_{n=0}^{N_{\text{steps}}}$ sampled from the SDE (12), where $N_{\text{steps}} = 50$. The assets' initial values $s_{t_0}^i$ are sampled from the lognormal distribution $X\exp\big((\mu - \tfrac{1}{2}\sigma^2)\tau + \sigma\sqrt{\tau}\,\xi\big)$, where $\xi \sim N(0,1)$, $\mu = 0.08$, $\tau = 0.1$.

We follow the evaluation framework above, simulating $N_{\text{MC}} \cdot N_{\text{in}}$ paths from (12) with constant initial value $S_0^i = 1$ for $i = 1, \ldots, 100$. We have the following results:

  1. Table 3 provides the empirical variances calculated over $10^6$ generated Monte Carlo samples and their corresponding control variates. The variance reduction measure indicates the quality of each control variate method. Table 3 also provides the cost of training for each method, given by the number of optimizer iterations performed before hitting the stopping criterion with $\epsilon = 5 \times 10^{-6}$. Algorithm 3 outperforms the other algorithms in terms of the variance reduction factor. This is not surprising, as Algorithm 3 explicitly learns the discretization of the martingale representation (7) from which the control variate arises.

  2. Table 4 provides the confidence intervals for the variance of the Monte Carlo estimator and of the Monte Carlo estimator with control variate, assuming these are calculated on $10^6$ random paths.

  3. Figures 7 and 8 study the iterative training for the BSDE solver. We train the same model four times for different values of $\epsilon$ between $0.01$ and $5 \times 10^{-6}$, and we study the number of iterations necessary to meet the stopping criterion defined by $\epsilon$, the variance reduction once the stopping criterion is met, and the relationship between the number of iterations and the variance reduction. Note that in this case the variance reduction does not stabilize for $\epsilon < 10^{-5}$. However, the number of training iterations increases exponentially as $\epsilon$ decreases, and therefore we also choose $\epsilon = 5 \times 10^{-6}$ to avoid building a control variate that requires a high number of random paths to be trained (Figure 8).

Figure 7. Left: Variance reduction in terms of number of optimizer iterations. Right: Variance reduction in terms of epsilon. Both for Example 4.4 and Algorithm 3.


Figure 8. Number of optimizer iterations in terms of ϵ for Example 4.4 and Algorithm 3.


Figure 9. Variance reduction with the network trained with $\sigma = 0.3$ but applied for $\sigma \in [0.2, 0.4]$ for the model of Example 4.4. We see that the variance reduction factor is considerable even in the case when the network is used with the 'wrong' $\sigma$. It appears that Algorithm 4 does not perform well in this case.

Table 3. Results on exchange option problem on 100 assets, Example 4.4. Empirical Variance and variance reduction factor and costs in terms of paths used for training and optimizer steps.

Table 4. Results on exchange option problem on 100 assets, Example 4.4.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work was supported by the Alan Turing Institute under EPSRC grant no. EP/N510129/1.

References

  • Bayer C., and B. Stemper. 2018. “Deep Calibration of Rough Stochastic Volatility Models.” arXiv:1810.03399.
  • Beck C., S. Becker, P. Grohs, N. Jaafari, and A. Jentzen. 2018. “Solving Stochastic Differential Equations and Kolmogorov Equations by Means of Deep Learning.” arXiv:1806.00421.
  • Beck C., L. Gonon, and A. Jentzen. 2020. “Overcoming the Curse of Dimensionality in the Numerical Approximation of High-Dimensional Semilinear Elliptic Partial Differential Equations”.
  • Belomestny D., L. Iosipoi, and N. Zhivotovskiy. 2017. “Variance Reduction via Empirical Variance Minimization: Convergence and Complexity.” arXiv:1712.04667.
  • Belomestny D., S. Hafner, T. Nagapetyan, and M. Urusov. 2018. “Variance Reduction for Discretised Diffusions Via Regression.” Journal of Mathematical Analysis and Applications 458 (1): 393–418.
  • Berner J., P. Grohs, and A. Jentzen. 2020. “Analysis of the Generalization Error: Empirical Risk Minimization Over Deep Artificial Neural Networks Overcomes the Curse of Dimensionality in the Numerical Approximation of Black–Scholes Partial Differential Equations.” SIAM Journal on Mathematics of Data Science 2 (3): 631–657.
  • Broadie M., Y. Du, and C. C. Moallemi. 2015. “Risk Estimation Via Regression.” Operations Research 63 (5): 1077–1097.
  • Chan-Wai-Nam Q., J. Mikael, and X. Warin. 2019. “Machine Learning for Semi-Linear PDEs.” Journal of Scientific Computing 79 (3): 1667–1712.
  • Chizat L., and F. Bach. 2018. “On the Global Convergence of Gradient Descent for Over-Parameterized Models Using Optimal Transport.” In Advances in Neural Information Processing Systems, 3036–3046.
  • Cohen S. N., and R. J. Elliott. 2015. Stochastic Calculus and Applications. Vol. 2. New York: Birkhäuser.
  • Cont R., and Y. Lu. March, 2016. “Weak Approximation of Martingale Representations.” Stochastic Processes and Their Applications 126 (3): 857–882.
  • Cvitanic J., and J. Zhang. 2005. “The Steepest Descent Method for Forward-Backward SDEs.” Electronic Journal of Probability 10: 1468–1495.
  • Kingma D. P., and J. Ba. 2017. “Adam: A Method for Stochastic Optimization.” arXiv:1412.6980.
  • Du S. S., X. Zhai, B. Poczos, and A. Singh. 2018. “Gradient Descent Provably Optimizes Over-Parameterized Neural Networks.” arXiv:1810.02054.
  • Elbrächter D., P. Grohs, A. Jentzen, and C. Schwab. 2022. “DNN Expression Rate Analysis of High-Dimensional PDEs: Application to Option Pricing.” Constructive Approximation 55 (1): 3–7.
  • Glasserman P. 2013. Monte Carlo Methods in Financial Engineering. Springer.
  • Gonon L., P. Grohs, A. Jentzen, D. Kofler, and D. Šiška. 2021. “Uniform Error Estimates for Artificial Neural Network Approximations for the Heat Equation.” IMA Journal of Numerical Analysis. https://doi.org/10.1093/imanum/drab027.
  • Goodfellow I., J. Shlens, and C. Szegedy. 2014. “Explaining and Harnessing Adversarial Examples.” arXiv:1412.6572.
  • Grohs P., F. Hornung, A. Jentzen, and P. von Wurstemberger. 2018. “A Proof That Artificial Neural Networks Overcome the Curse of Dimensionality in the Numerical Approximation of Black-Scholes Partial Differential Equations.” arXiv:1809.02362.
  • Grohs P., A. Jentzen, and D. Salimova. 2019. “Deep Neural Network Approximations for Monte Carlo Algorithms”.
  • Han J., and J. Long. 2020. “Convergence of the Deep BSDE Method for Coupled FBSDEs.” Probability, Uncertainty and Quantitative Risk 5 (1): 1–33.
  • Han J., A. Jentzen, and E. Weinan. 2017. “Solving High-Dimensional Partial Differential Equations Using Deep Learning.” arXiv:1707.02568.
  • Henry-Labordere P. 2017. “Deep Primal-Dual Algorithm for BSDEs: Applications of Machine Learning to CVA and IM.” Available at SSRN 3071506.
  • Hernandez A. 2016. “Model Calibration With Neural Networks.” Available at SSRN 2812140.
  • Horvath B., A. Muguruza, and M. Tomas. 2019. “Deep Learning Volatility.” Available at SSRN 3322085.
  • Hu K., Z. Ren, D. Šiška, and L. Szpruch. 2021. “Mean-Field Langevin Dynamics and Energy Landscape of Neural Networks.” Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 57 (4): 2043–2065.
  • Huré C., H. Pham, and X. Warin. 2019. “Some Machine Learning Schemes for High-Dimensional Nonlinear PDEs.” arXiv:1902.01599.
  • Hutzenthaler M., A. Jentzen, T. Kruse, and T. A. Nguyen. April, 2020. “A Proof that Rectified Deep Neural Networks Overcome the Curse of Dimensionality in the Numerical Approximation of Semilinear Heat Equations.” SN Partial Differential Equations and Applications 1 (2): 1–34.
  • Itkin A. 2019. “Deep Learning Calibration of Option Pricing Models: Some Pitfalls and Solutions.” arXiv:1906.03507.
  • Jacquier A. J., and M. Oumgari. 2019. “Deep PPDEs for Rough Local Stochastic Volatility.” Available at SSRN 3400035.
  • Jentzen A., D. Salimova, and T. Welti. 2018. “A Proof That Deep Artificial Neural Networks Overcome the Curse of Dimensionality in the Numerical Approximation of Kolmogorov Partial Differential Equations With Constant Diffusion and Nonlinear Drift Coefficients.” arXiv:1809.07321.
  • Krylov N. 1999. “On Kolmogorov's Equations for Finite Dimensional Diffusions.” In Stochastic PDE's and Kolmogorov Equations in Infinite Dimensions, 1–63. Springer.
  • Krylov N. V. 2002. Introduction to the Theory of Random Processes. Vol. 43. American Mathematical Society.
  • Kutyniok G., P. Petersen, M. Raslan, and R. Schneider. 2020. “A Theoretical Analysis of Deep Neural Networks and Parametric PDEs”.
  • LeCun Y., Y. Bengio, and G. Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436.
  • Liu S., A. Borovykh, L. A. Grzelak, and C. W. Oosterlee. 2019. “A Neural Network-Based Framework for Financial Model Calibration.” arXiv:1904.10523.
  • McGhee W. A. 2018. “An Artificial Neural Network Representation of the SABR Stochastic Volatility Model.” SSRN 3288882.
  • Mei S., A. Montanari, and P.-M. Nguyen. 2018. “A Mean Field View of the Landscape of Two-layer Neural Networks.” Proceedings of the National Academy of Sciences 115 (33): E7665–E7671.
  • Milstein G., and M. Tretyakov. 2009. “Solving Parabolic Stochastic Partial Differential Equations Via Averaging Over Characteristics.” Mathematics of Computation 78 (268): 2075–2106.
  • Newton N. J. 1994. “Variance Reduction for Simulated Diffusions.” SIAM Journal on Applied Mathematics 54 (6): 1780–1805.
  • Øksendal B. 2003. “Stochastic Differential Equations.” In Stochastic Differential Equations, 65–84. Springer.
  • Reisinger C., and Y. Zhang. 2020. “Rectified Deep Neural Networks Overcome the Curse of Dimensionality for Nonsmooth Value Functions in Zero-Sum Games of Nonlinear Stiff Systems”.
  • Rotskoff G. M., and E. Vanden-Eijnden. 2018. “Neural Networks as Interacting Particle Systems: Asymptotic Convexity of the Loss Landscape and Universal Scaling of the Approximation Error.” arXiv:1805.00915.
  • Sabate-Vidales M., D. Šiška, and L. Szpruch. 2020. “Solving Path Dependent PDEs with LSTM Networks and Path Signatures”.
  • Ioffe S., and C. Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arXiv:1502.03167.
  • Sirignano J., and K. Spiliopoulos. 2017. “DGM: A Deep Learning Algorithm for Solving Partial Differential Equations.” arXiv:1708.07469.
  • Sirignano J., and K. Spiliopoulos. 2020. “Mean Field Analysis of Neural Networks: A Central Limit Theorem.” Stochastic Processes and Their Applications 130 (3): 1820–1852.
  • Stone H. 2018. “Calibrating Rough Volatility Models: A Convolutional Neural Network Approach.” arXiv:1812.05315.
  • Wang Z., and S. Tang. 2021. “Gradient Convergence of Deep Learning-Based Numerical Methods for BSDEs.” Chinese Annals of Mathematics, Series B 42 (2): 199–216.
  • Weinan E., J. Han, and A. Jentzen. 2017. “Deep Learning-based Numerical Methods for High-dimensional Parabolic Partial Differential Equations and Backward Stochastic Differential Equations.” Communications in Mathematics and Statistics 5 (4): 349–380.

Appendices

Appendix 1.

Martingale Control Variate

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and consider an $\mathbb{R}^d$-valued Wiener process $W = (W^j)_{j=1}^d = ((W_t^j)_{t \geq 0})_{j=1}^d$. We will use $(\mathcal{F}_t^W)_{t \geq 0}$ to denote the filtration generated by $W$. Consider a $D \subseteq \mathbb{R}^d$-valued, continuous stochastic process, defined for the parameters $\beta \in B \subseteq \mathbb{R}^p$, $X^\beta = (X^{\beta,i})_{i=1}^d = ((X_t^{\beta,i})_{t \geq 0})_{i=1}^d$, adapted to $(\mathcal{F}_t^W)_{t \geq 0}$.

Let $g : C([0,T], \mathbb{R}^d) \to \mathbb{R}$ be a measurable function. We shall consider path-dependent contingent claims of the form $g((X_s^\beta)_{s \in [0,T]})$. Finally, we assume that there is a (stochastic) discount factor given by $D(t_1, t_2; \beta) = e^{-\int_{t_1}^{t_2} c(s, X_s^\beta; \beta)\,ds}$ for an appropriate function $c = c(t,x;\beta)$. We will omit $\beta$ from the discount factor notation. Let $\Xi_T^\beta := D(t,T)\,g((X_s^\beta)_{s \in [0,T]})$. We now interpret $\mathbb{P}$ as some risk-neutral measure, and so the $\mathbb{P}$-price of our contingent claim is
$$V_t^\beta = \mathbb{E}\big[\Xi_T^\beta \mid \mathcal{F}_t^\beta\big] = \mathbb{E}\big[D(t,T)\,g((X_s^\beta)_{s \in [0,T]}) \mid \mathcal{F}_t^\beta\big].$$
By assumption, $\Xi_T^\beta$ is $\mathcal{F}_T^W$-measurable and $\mathbb{E}[|\Xi_T^\beta|^2] < \infty$. Hence, by the Martingale Representation Theorem (see, e.g., Cohen and Elliott 2015, Th. 14.5.1), there exists a unique process $(Z_t^\beta)_t$ adapted to $(\mathcal{F}_t^W)_t$ with $\mathbb{E}\big[\int_0^T |Z_s^\beta|^2\,ds\big] < \infty$ such that
$$\Xi_T^\beta = \mathbb{E}\big[\Xi_T^\beta \mid \mathcal{F}_0^W\big] + \int_0^T Z_s^\beta\,dW_s. \tag{A1}$$
The proof of the existence of the process $(Z_t^\beta)_t$ is non-constructive. In the setup of this paper, we used the Markovian property of $\Xi_t^\beta$ to approximate $Z_t^\beta$ via the associated linear PDE. In the more general non-Markovian setup, Cont and Lu (2016) provide a numerical method to construct the martingale representation.

Observe that in our setup, $\mathcal{F}_0 = \mathcal{F}_0^W$ and $\mathcal{F}_t^\beta \subseteq \mathcal{F}_t^W$ for $t \geq 0$. Hence the tower property of the conditional expectation implies that
$$\mathbb{E}\big[\Xi_T^\beta \mid \mathcal{F}_t^\beta\big] = \mathbb{E}\big[\Xi_T^\beta \mid \mathcal{F}_0^W\big] + \int_0^t Z_s^\beta\,dW_s. \tag{A2}$$
Consequently, (A1) and (A2) imply $\mathbb{E}[\Xi_T^\beta \mid \mathcal{F}_t^\beta] = \Xi_T^\beta - \int_t^T Z_s^\beta\,dW_s$. We then observe that
$$V_t^\beta = \mathbb{E}\big[\Xi_T^\beta \mid \mathcal{F}_t^\beta\big] = \mathbb{E}\Big[\Xi_T^\beta - \int_t^T Z_s^\beta\,dW_s \,\Big|\, \mathcal{F}_t^\beta\Big].$$
If we can generate i.i.d. $(W^i)_{i=1}^N$ and $(Z^{\beta,i})_{i=1}^N$ with the same distributions as $W$ and $Z^\beta$, respectively, then we can consider the following Monte Carlo estimator of $V_t^\beta$:
$$V_t^{\beta,N} := \frac{1}{N}\sum_{i=1}^N \Big(\Xi_T^{\beta,i} - \int_t^T Z_s^{\beta,i}\,dW_s^i\Big).$$
In the companion paper (Sabate-Vidales, Šiška, and Szpruch 2020), we provide deep learning algorithms to price path-dependent options under the risk-neutral measure by solving the corresponding path-dependent PDE, using a combination of recurrent neural networks and path signatures to parametrize the process $Z^\beta$.

Appendix 2.

Martingale Control Variate Deep Solvers

A.1. Empirical Correlation Maximization

This method is based on the idea that, since we are looking for a good control variate, we should directly train the network to maximize the variance reduction between the vanilla Monte Carlo estimator and the control variate Monte Carlo estimator, by also trying to optimize $\lambda$.

Recall that we denote $\Xi_T = D(t,T)\,g((X_s)_{s \in [t,T]})$. We denote by $M_{t,T}^\theta$ the stochastic integral that arises in the martingale representation of $\Xi_T$. The optimal coefficient $\lambda^{*,\theta}$ that minimizes the variance $\operatorname{Var}[\Xi_T - \lambda M_{t,T}^\theta]$ is
$$\lambda^{*,\theta} = \frac{\operatorname{Cov}[\Xi_T, M_{t,T}^\theta]}{\operatorname{Var}[M_{t,T}^\theta]}.$$
Let $\rho_{\Xi_T, M_{t,T}^\theta}$ denote the Pearson correlation coefficient between $\Xi_T$ and $M_{t,T}^\theta$, i.e.,
$$\rho_{\Xi_T, M_{t,T}^\theta} = \frac{\operatorname{Cov}(\Xi_T, M_{t,T}^\theta)}{\sqrt{\operatorname{Var}[\Xi_T]\operatorname{Var}[M_{t,T}^\theta]}}.$$
With the optimal $\lambda$ we then have that the variance reduction obtained from the control variate is
$$\frac{\operatorname{Var}[V_{t,T}^{\pi,\theta,\lambda,N}]}{\operatorname{Var}[\Xi_T]} = 1 - \big(\rho_{\Xi_T, M_{t,T}^\theta}\big)^2.$$
See Glasserman (2013, Ch. 4.1) for more details. Therefore we set the learning task as
$$\theta^{*,\text{cor}} := \operatorname*{argmin}_\theta\, \big[1 - \big(\rho_{\Xi_T, M_{t,T}^\theta}\big)^2\big].$$
The implementable version requires the definition of $V_{t,T}^{\beta,\pi,\theta,\lambda,N}$ in (9), where we set
$$\Xi_T^{\beta,\pi,i} := (D^\pi(t,T))^i\,g(X_T^{\beta,\pi,i}), \qquad M_{t,T}^{\beta,\pi,i,\theta} := \sum_{k=0}^{N_{\text{steps}}-1} (D^\pi(t,t_k))^i\,\mathcal{R}[\nabla_x v]_\theta(t_k, X_{t_k}^{\beta,\pi,i})\,\sigma(t_k, X_{t_k}^{\beta,\pi,i})\,(W_{t_{k+1}}^i - W_{t_k}^i).$$
The full method is stated as Algorithm 5; a sketch of the correlation loss follows.
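A minimal sketch of the loss $1 - \rho^2$, under our own naming assumptions: `xi` holds the payoff samples $\Xi_T^{\beta,\pi,i}$ and `mart` the samples $M_{t,T}^{\beta,\pi,i,\theta}$, which carry the dependence on $\theta$.

```python
import torch

def correlation_loss(xi, mart):
    """1 - rho^2 for the Pearson correlation between the discounted payoff
    and the learned stochastic integral; minimized over theta in Algorithm 5."""
    xi_c = xi - xi.mean()
    m_c = mart - mart.mean()
    rho = (xi_c * m_c).mean() / torch.sqrt(
        (xi_c ** 2).mean() * (m_c ** 2).mean() + 1e-12)
    return 1.0 - rho ** 2

# after training, the optimal coefficient is lambda = Cov(xi, mart) / Var(mart)
```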

Appendix 3.

Artificial Neural Networks

We fix a locally Lipschitz function $a:\mathbb{R}\to\mathbb{R}$ and for $d\in\mathbb{N}$ define $A_d:\mathbb{R}^d\to\mathbb{R}^d$ as the function given, for $x=(x_1,\ldots,x_d)$, by $A_d(x)=(a(x_1),\ldots,a(x_d))$. We fix $L\in\mathbb{N}$ (the number of layers), $l_k\in\mathbb{N}$, $k=0,1,\ldots,L-1$ (the size of the input to layer $k$) and $l_L\in\mathbb{N}$ (the size of the network output). A fully connected artificial neural network is then given by $\Phi=((W_1,B_1),\ldots,(W_L,B_L))$, where, for $k=1,\ldots,L$, we have real $l_k\times l_{k-1}$ matrices $W_k$ (so that $W_k x_{k-1}$ is well defined) and real $l_k$-dimensional vectors $B_k$.

The artificial neural network defines a function $\mathcal{R}\Phi:\mathbb{R}^{l_0}\to\mathbb{R}^{l_L}$ given recursively, for $x_0\in\mathbb{R}^{l_0}$, by
$\mathcal{R}\Phi(x_0)=W_L x_{L-1}+B_L,\qquad x_k=A_{l_k}(W_k x_{k-1}+B_k),\quad k=1,\ldots,L-1.$
We can also define the function $P$ which counts the number of parameters, $P(\Phi)=\sum_{k=1}^L(l_{k-1}l_k+l_k)$. We will call this class of fully connected artificial neural networks $\mathcal{DN}$. Note that, since the activation functions and architecture are fixed, the learning task entails finding the optimal $\Phi\in\mathbb{R}^{P(\Phi)}$.
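A minimal NumPy sketch of the realization map $\mathcal{R}\Phi$ and the parameter count $P(\Phi)$ follows; it represents $\Phi$ as a list of `(W, B)` pairs with `W` of shape $(l_k, l_{k-1})$, matching the definition above.

```python
import numpy as np

def realize(phi, x0, a=np.tanh):
    """Evaluate R_Phi(x0): activated affine maps on all layers except the last."""
    x = x0
    for W, B in phi[:-1]:
        x = a(W @ x + B)       # x_k = A_{l_k}(W_k x_{k-1} + B_k)
    W_L, B_L = phi[-1]
    return W_L @ x + B_L       # output layer has no activation

def num_params(phi):
    """P(Phi) = sum_k (l_{k-1} * l_k + l_k)."""
    return sum(W.size + B.size for W, B in phi)
```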

Appendix 4.

Additional Numerical Results

Example A.1

Low dimensional basket option

We consider the problem of pricing basket options, using the example from Belomestny, Iosipoi, and Zhivotovskiy (2017, Sec 4.2.3). The payoff function is $g(S):=\max(0,\sum_{i=1}^d S^i-K)$. We first consider the basket options problem on two assets, with $d=2$, $S^i_0=0.7$, $r=50\%$, $\sigma^i=100\%$, $\Sigma_{ii}=1$, $\Sigma_{ij}=0$ for $i\neq j$, and constant strike $K=\sum_{i=1}^d S^i_0$. In line with the example from Belomestny, Iosipoi, and Zhivotovskiy (2017, Sec 4.2.3), for comparison purposes we organize the experiment as follows. The control variates are trained on 20,000 batches of 5,000 samples each of $(s^\pi_{t_n})_{n=0}^{N_{\mathrm{steps}}}$, obtained by simulating the SDE (12) with $N_{\mathrm{steps}}=50$; the assets' initial values $s_{t_0}$ are always constant, $S^i_{t_0}=0.7$. We follow the evaluation framework to evaluate the model, simulating $N_{\mathrm{MC}}\,N_{\mathrm{in}}$ paths of (12) with constant $S^i_0=0.7$ for $i=1,\ldots,d$. We have the following results:

  1. Table E1 provides the empirical variances calculated over $10^6$ generated Monte Carlo samples and their corresponding control variates; the variance reduction factor (its empirical estimation is sketched after this list) indicates the quality of each control variate method. Table E1 also provides the cost of training for each method, given by the number of optimizer iterations performed before hitting the stopping criterion defined before, with $\epsilon=5\times 10^{-6}$.

  2. Table E2 provides the confidence interval for the variance of the Monte Carlo estimator, and for the Monte Carlo estimator with control variate, assuming these are calculated on $10^6$ random paths.

  3. Figures A1 and A2 study the iterative training for the BSDE solver. We train the same model four times for different values of $\epsilon$ between 0.01 and $5\times 10^{-6}$ and study the number of iterations necessary to meet the stopping criterion defined by $\epsilon$, the variance reduction once the stopping criterion is met, and the relationship between the number of iterations and the variance reduction. Note that the variance reduction stabilizes for $\epsilon<10^{-5}$. Furthermore, the number of training iterations increases exponentially as $\epsilon$ decreases. We choose $\epsilon=5\times 10^{-6}$.
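The variance reduction factors reported in the tables below can be estimated as in the following sketch, assuming arrays `xi` and `cv` holding the discounted payoffs and the corresponding simulated control variates over the evaluation paths (names illustrative):

```python
import numpy as np

def variance_reduction(xi, cv):
    """Empirical variances of the plain and control-variate estimators and their ratio."""
    var_mc = np.var(xi, ddof=1)         # Var[Xi_T]
    var_cv = np.var(xi - cv, ddof=1)    # Var[Xi_T - M]
    return var_mc, var_cv, var_mc / var_cv
```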

Table E1. Results for the basket options problem on two assets, Example A.1. Models trained with $S_0$ fixed (non-random). Empirical variance and variance reduction factor are presented.

Table E2. Results for the basket options problem on two assets, Example A.1. Models trained with $S_0$ fixed (non-random).

Figure A1. Left: Variance reduction in terms of number of optimizer iterations. Right: Variance reduction in terms of epsilon. Both refer to Algorithm 3 used in Example A.1.


Figure A2. Number of optimizer iterations in terms of ϵ for Algorithm 3 used in Example A.1.


We note that in the example from Belomestny, Iosipoi, and Zhivotovskiy (2017, Sec 4.2.3), the control variate is trained with $S_0=0.7$ fixed. In this setting, Algorithm 2 cannot be used to approximate the control variate in (9): since the network at $t=0$, $\mathcal{R}[v]_{\eta_0}$, is trained only at $S_0=0.7$, automatic differentiation to approximate $\partial_x\mathcal{R}[v]_{\eta_0}(0.7)$ yields a poor approximation of $\partial_x v(0.7)$; indeed, during training $\mathcal{R}[v]_{\eta_0}$ cannot capture how $v$ changes around $S_0$ at $t=0$. For this reason, Algorithm 2 is not included in the following results.

Example A.2

Basket option with random σ

In this example, as in Example 4.2, we aim to show that our approach, which builds a control variate by approximating the process $(Z_{t_k})_{k=0,\ldots,N_{\mathrm{steps}}}$, is more robust than directly approximating the price by some function in a high-dimensional setting.

We use the methodology proposed in Horvath and Muguruza (2019), where the authors present a deep learning-based calibration method with a two-step approach: first, they learn a model that approximates the pricing map using an artificial neural network whose inputs are the parameters of the volatility model; second, they calibrate the learned model to available data by means of different optimization methods.

For a fair comparison between our deep learning-based control variate approach and the method proposed in Horvath and Muguruza (2019), we make the following remarks:

  1. We only use the first step detailed in Horvath and Muguruza (2019), where the inputs to the model approximating the pricing map are the volatility model's parameters $\sigma\in\mathbb{R}^d$ and $r$, while the initial price is held constant for training purposes. We run the experiment for $d=5$.

  2. In Horvath and Muguruza (2019), the authors build a training set and then perform gradient descent-based optimization on it for a number of epochs. This is somewhat limiting in the current setting, where one can generate as much data as desired from the given distributions. In line with our other experiments, instead of building a fixed training set, in each optimization step we sample a fresh batch from the given distributions.

  3. In Horvath and Muguruza (2019), the price mapping function is learned on a grid of combinations of maturities and strikes. In this experiment, we reduce the grid to a single point, taking $T=0.5$ and $K=\sum_i S^i_0$, where $S^i_0=0.7$ for each $i$.

  4. We use Algorithm 3 to build the control variate, with the difference that now $\sigma\in\mathbb{R}^d$ and $r\in\mathbb{R}$ are passed as inputs to each network $\mathcal{R}[v]_{\eta_k}$, $\mathcal{R}[\partial_x v]_{\theta_k}$ at each time step.

The experiment is organized as follows:

  1. We train the network proposed in Horvath and Muguruza (2019) to approximate the price under the Black–Scholes model with the basket option payoff. In each optimization iteration, a batch of size 1000 is sampled, with the volatility model's parameters drawn as $\sigma\sim\mathcal{U}(0.9,1.1)$ and $r\sim\mathcal{U}(0.4,0.6)$ (see the sketch after this list). We keep a test set of size 150, $\mathcal{S}=\{[(\sigma_i,r_i);\,p(\sigma_i,r_i)],\ i=1,\ldots,150\}$, where $p(\sigma_i,r_i)$ denotes the price, generated using 50,000 Monte Carlo samples.

  2. We use Algorithm 3 to build the control variate, where $\sigma$ and $r$ are sampled as above and, in contrast with plain Algorithm 3, are included as inputs to the networks. We denote the trained models by $\mathcal{R}[\partial_x v]_{\theta_k}$, $k=1,\ldots,N_{\mathrm{steps}}$.
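A minimal sketch of the batch construction referred to in step 1 (names and helper structure are illustrative, not the paper's code):

```python
import numpy as np

def sample_parameter_batch(batch_size=1000, d=5, seed=None):
    """Draw a fresh batch of volatility-model parameters for one optimizer step."""
    rng = np.random.default_rng(seed)
    sigma = rng.uniform(0.9, 1.1, size=(batch_size, d))  # sigma ~ U(0.9, 1.1)
    r = rng.uniform(0.4, 0.6, size=(batch_size, 1))      # r ~ U(0.4, 0.6)
    return sigma, r
```

Because a new batch is drawn at every optimization step, no epoch structure or training-set reuse is needed.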

We present the following results:

  1. The histogram of the squared error of the approximation of the PDE solution $\mathcal{R}[v]_\eta$ over the instances in $\mathcal{S}$ shows that the error spans from almost $10^{-8}$ to $10^{-3}$, i.e., almost five orders of magnitude.

  2. We build the control variate for the instance in the test set on which $\mathcal{R}[v]_\eta$ generalizes worst. For that particular $(\sigma,r)$, Table E3 provides the variance reduction factor.

Table E3. Results for the basket options problem on 5 assets. Model trained with non-random $S_0$ and random $\sigma, r$.

Example A.3

High dimensional basket option

We also consider the basket options problem on $d=100$ assets, otherwise identical to the setting of Example A.1. We compare our results against the same experiment in Belomestny, Iosipoi, and Zhivotovskiy (2017, Sec 4.2.3, Tables 6 and 7).

Table E4 shows a significant improvement in the variance reduction factor (between 10× and 100× better) for all our algorithms compared with the methods proposed in Belomestny, Iosipoi, and Zhivotovskiy (2017) and applied in the same example.

Table E4. Results for the basket options problem on 100 assets, Example A.3. Models trained with non-random $S_0$ so that the results can be directly compared to Belomestny, Iosipoi, and Zhivotovskiy (2017).

Table E5. Results for the basket options problem on 100 assets, Example A.3. Models trained with non-random $S_0$.

A.2. Empirical Network Diagnostics

In this section, we consider the exchange options problem on two assets from Example 4.1, where the time horizon is 1 day. We consider different network architectures for the BSDE method described by Algorithm 3 to understand their impact on the final result and their ability to approximate the solution of the PDE and its gradient. We choose this problem because an explicit solution (Margrabe's formula) exists and can be used as a benchmark. The experiment is organized as follows:

  1. Let $L-2$ be the number of hidden layers of $\mathcal{R}[\partial_x v]_{\theta_{t_0}}\in\mathcal{DN}$ and $\mathcal{R}[v]_{\theta_{t_0}}\in\mathcal{DN}$. Let $l_k$ be the number of neurons in hidden layer $k$.

  2. We train each of the possible combinations of $L-2\in\{1,2,3\}$ and $l_k\in\{2,4,6,\ldots,20\}$ four times, using $\epsilon=5\times 10^{-6}$ for the stopping criterion. The assets' initial values $s^i_{t_0}$ are sampled from the lognormal distribution $X\sim\exp((\mu-0.5\sigma^2)\tau+\sigma\sqrt{\tau}\,\xi)$, where $\xi\sim N(0,1)$, $\mu=0.08$, $\tau=0.1$.

  3. We approximate the $L^2$-errors of $\mathcal{R}[v]_{\theta_{t_0}}(x)$ and $\mathcal{R}[\partial_x v]_{\theta_{t_0}}(x)$ with respect to the exact solution given by Margrabe's formula and its gradient (see the sketch after this list).
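The benchmark referred to in step 3 is Margrabe's formula for the exchange option; a minimal sketch, assuming zero dividend yields (in which case the riskless rate drops out), is:

```python
import numpy as np
from scipy.stats import norm

def margrabe(s1, s2, sigma1, sigma2, rho, tau):
    """Price of the exchange option E[max(S1_T - S2_T, 0)] under Black-Scholes."""
    sig = np.sqrt(sigma1**2 - 2.0 * rho * sigma1 * sigma2 + sigma2**2)
    d1 = (np.log(s1 / s2) + 0.5 * sig**2 * tau) / (sig * np.sqrt(tau))
    d2 = d1 - sig * np.sqrt(tau)
    return s1 * norm.cdf(d1) - s2 * norm.cdf(d2)
```

Its gradient is also available in closed form, $(\partial_{s_1},\partial_{s_2})=(\Phi(d_1),-\Phi(d_2))$, which is what the $L^2$-error of the gradient network is measured against.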

Figure A4 displays the average of the $L^2$-errors and its confidence interval. We can conclude that, for this particular problem, the accuracy of $\mathcal{R}[v]_{\theta_{t_0}}(x)$ does not strongly depend on the number of layers, and that there is no improvement beyond eight nodes per hidden layer; at that point the training (its inputs and the gradient descent algorithm together with the stopping criterion) becomes the limiting factor. The accuracy of $\mathcal{R}[\partial_x v]_{\theta_{t_0}}(x)$ is clearly better with two or three hidden layers than with just one. Moreover, there appears to be a benefit in taking as many as 10 nodes per hidden layer.

Figure A3. Left: Variance reduction in terms of number of optimizer iterations. Right: Variance reduction in terms of epsilon. Both are for Example A.3 and Algorithm 3.


Figure A4. Average error of the PDE solution approximation and its gradient, with 95% confidence intervals, for different combinations of the number of layers and network width. Left: error of the model. Right: error of the gradient of the model.
