Abstract
We propose a CNN-LSTM deep learning model trained to classify spread sequences of cointegrated stocks as profitable or unprofitable in a large-scale market backtest ranging from January 1991 to December 2017. We show that the proposed model achieves high levels of accuracy and successfully derives features from the market data. We formalize and implement a trading strategy based on the model output that generates significant risk-adjusted excess returns which are orthogonal to market risks. The out-of-sample Sharpe ratio and alpha coefficient significantly outperform those of the reference model, which is based on a standard-deviation rule, even after accounting for transaction costs.
Acknowledgements
We thank the editor and two anonymous referees for carefully reading the manuscript and for several constructive and detailed comments that helped to improve our paper.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1 Harlacher (Citation2016) finds that differences between the ADF test and other unit root tests such as the Phillips–Perron or the Phillips–Ouliaris test are not significant.
2 CNN-LSTM architectures have been found to achieve state-of-the-art performance in time series forecasting tasks related, e.g., to heart rate signals (Swapna et al. Citation2018), rainfall intensity (Shi et al. Citation2017), particulate matter (Huang and Kuo Citation2018), waterworks operations (Cao et al. Citation2018), or the gold price (Livieris et al. Citation2020).
3 Hyperparameters are inspired by the choices in Livieris et al. (Citation2020), except for the number of filters, where we found better optimization results with fewer filters than the 32 and 64 filters used in Livieris et al. (Citation2020).
4 We tested the following hyperparameters on 10 randomly selected pairs: number of hidden LSTM layers ∈ {1, 2} and LSTM cells per layer ∈ {2, 5, 10, 15, 20}. We found that a single layer with 10 cells returned the most accurate results.
5 The total number of trainable parameters of the model is 1,891.
6 For example, each element $x_i$ of the input vector that is passed to the outermost layer is standardized according to
$$\tilde{x}_i = \frac{x_i - \mu}{\sigma},$$
where $\mu$ and $\sigma$ denote the mean and standard deviation of the input vector.
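The standardization described in this note can be sketched as follows; this is a minimal illustrative implementation, not the authors' code, and the function name is our own.

```python
import numpy as np

def standardize(x):
    """Z-score standardization of an input vector:
    subtract the vector's mean, divide by its standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

z = standardize([1.0, 2.0, 3.0, 4.0])
# The standardized vector has mean 0 and standard deviation 1.
```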
7 According to the classification in Krauss (Citation2017), this model represents a stochastic control approach.
9 We did not observe any problems related to vanishing or exploding gradients during the training of the models.
10 Precision is defined as $\mathrm{Precision} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$ and the F1 score is defined as
$$F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},$$
where TP, FP, and FN refer to true positives, false positives, and false negatives, respectively.
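The two classification metrics defined in this note can be computed directly from the confusion-matrix counts; the sketch below uses the standard definitions (function names are ours).

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): share of positive predictions that are correct."""
    return tp / (tp + fp)

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN): harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

p = precision(8, 2)        # 8 / 10 = 0.8
f1 = f1_score(8, 2, 2)     # 16 / 20 = 0.8
```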
11 We compared different values of the extrapolation parameter k and obtained the most promising results for k = 5. We found that the superior performance of the k = 5 variant can be attributed to its better out-of-sample classification accuracy compared to the alternatives k = 10 and k = 20 days. Final average out-of-sample accuracies are 68.5% for k = 5, 66.5% for k = 10, and 67.1% for k = 20.
12 Alpha and beta coefficients relate to the one-factor model regression. We discuss further dependencies on risk factors in Section 4.3.3.
13 We use the statsmodels library (Seabold and Perktold Citation2010) in Python with default parameters for the linear regression.
14 The authors thank Kenneth French for allowing all data to be sourced from his website: https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
15 Note that Chen and Bassett (Citation2014) show that due to the self-financing nature of these factor portfolios and the market capitalization structure, this interpretation is not necessarily true.
16 It is important to note that deep learning techniques such as LSTM or CNN models were only introduced in the late 1990s. As such, the high risk-adjusted returns in the 1990s need to be seen against the backdrop that neither the theory nor the necessary technology for this strategy was available to the majority of market participants.
17 Note that for the backtest with m>5 we need to re-optimize the extended CNN-LSTM models each trading period based on the enlarged data set, i.e. on m = 20. The results for m>5 are therefore based on newly trained models.
18 We refer to Petersen (Citation2020) for a detailed mathematical study on neural networks.
19 We will refer to the LSTM model as established by Gers et al. (Citation2000), who modified the original LSTM of Hochreiter and Schmidhuber (Citation1997) and proposed a total of three gates named according to their functions: input, output and forget gate.
20 Subscripts express the to-from relationships, i.e. $W_{fh}$ denotes the recurrent weight connection from the previous time step's hidden state $h_{t-1}$ to the current time step's forget gate $f_t$.
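The forget-gate computation in the Gers et al. (2000) LSTM referenced in these notes can be sketched as follows; the dimensions and weight initializations are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions: 3 input features, 10 LSTM cells.
n_in, n_hidden = 3, 10
rng = np.random.default_rng(0)
W_fx = rng.normal(size=(n_hidden, n_in))       # input-to-forget-gate weights
W_fh = rng.normal(size=(n_hidden, n_hidden))   # recurrent hidden-to-forget-gate weights
b_f = np.zeros(n_hidden)                       # forget-gate bias

x_t = rng.normal(size=n_in)                    # current input vector
h_prev = np.zeros(n_hidden)                    # previous hidden state h_{t-1}
# Forget gate: f_t = sigmoid(W_fx x_t + W_fh h_{t-1} + b_f), entries in (0, 1).
f_t = sigmoid(W_fx @ x_t + W_fh @ h_prev + b_f)
```

Each entry of `f_t` scales how much of the corresponding cell state is retained at this time step.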