Assessing the accuracy of exponentially weighted moving average models for Value-at-Risk and Expected Shortfall of crypto portfolios

Abstract

A plethora of academic papers on generalized autoregressive conditional heteroscedasticity (GARCH) models for bitcoin and other cryptocurrencies have been published in academic journals. Yet few, if indeed any, of these are employed by practitioners. Previous academic studies produce results that are fragmented, confusing and conflicting, so there is no commercial incentive to drive an expensive implementation of complex multivariate GARCH models, which anyway would commonly require more data for calibration than are available in the history of most cryptocurrencies, at least at the daily frequency. Consequently, this paper assesses the forecasting accuracy of simple parametric RiskMetrics $^{TM}$ type volatility and covariance models, with a focus on ad hoc parameter choice instead of a data-intensive calibration procedure. We provide extensive backtests of hourly and daily Value-at-Risk (VaR) and Expected Shortfall (ES) forecasts that are regarded as best practice in the industry and commonly used for regulatory approval. Our results demonstrate that much simpler models in the exponentially weighted moving average (EWMA) class are just as accurate as GARCH models for VaR and ES forecasting, provided they capture an asymmetric volatility response and a heavy-tailed returns distribution. Moreover, on ranking each model's variance and covariance forecasts using average scores generated from proper univariate and multivariate scoring rules, there is no evidence of superior performance of variance and covariance forecasts generated by GARCH models, using either daily or hourly data.

1. Introduction

The modelling and forecasting of volatility and quantile risk measures for cryptocurrencies is a fairly well-researched topic. Almost 350 papers have been published in academic journals and over 100 of these have appeared during the last two years.¹ This strand of research has become increasingly complex over time, examining numerous variants from the generalized autoregressive conditional heteroscedasticity (GARCH) family of models initially introduced by Bollerslev (1986), several models in the generalized autoregressive score (GAS) class introduced by Creal et al. (2013), as well as mixture and regime-switching specifications of both. A similar degree of variety and complexity exists in the distribution assumptions for cryptocurrency returns: while the normal distribution is used by some authors, the most common choices are heavy-tailed distributions such as the Student-t. Many papers employ even more complex heavy-tailed and skewed distributions, such as the generalized error distribution (GED), the Weibull, Beta, generalized hyperbolic, inverse Gaussian and Johnson's SU distribution.

However, this complexity in modelling choices for cryptocurrency risk modelling in the academic literature is in stark contrast with current practice in cryptocurrency markets. It is quite common for investors to apply no form of risk analysis at all, with risk management strategies consisting at most of stop-loss limit orders placed at arbitrary price levels for open positions.² The few online sources that do discuss, use or provide forecasts of volatility, Value-at-Risk (VaR) and/or Expected Shortfall (ES) use equally-weighted methodologies and inappropriate assumptions. For instance, Cryptodatadownload, a cryptocurrency market data and analytics provider, produces daily 1% and 5% VaR and ES forecasts for several cryptocurrencies using a historical methodology over a two-year period, i.e. the percentage VaR is forecast as $-1 \times$ the corresponding quantile of the empirical returns distribution and ES is $-1 \times$ the average of the returns that are lower than the corresponding quantile. A blog from the cryptocurrency exchange OKEx presents a parametric VaR estimation for bitcoin, under the assumption that its one-minute returns follow a normal distribution; the 1% and 5% VaR are then forecast using the sample mean and standard deviation of one-minute returns over the past seven days.³ Similarly, the daily ‘Bitcoin Volatility Index’ is calculated using the standard deviation of returns over the past 30 and 60 days; and the bitcoin Fear & Greed Index and a Forbes article (Bovaird 2021) reporting on bitcoin's volatility both appear to be estimating volatility with a similar equally-weighted moving average. But there is a very well-known problem with any equally-weighted VaR or ES model. Even a single historical outlier, a large negative return which may have occurred far in the past, will have exactly the same influence on the current value of the risk measure as if it happened just now.⁴
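The historical VaR and ES calculation described above can be sketched in a few lines. This is a minimal illustration, not the code of any provider named in the text; the function name `historical_var_es` and the use of NumPy's default linear quantile interpolation are assumptions made here.

```python
import numpy as np

def historical_var_es(returns, alpha=0.01):
    """Equally-weighted historical VaR and ES at tail probability alpha.

    VaR is -1 times the empirical alpha-quantile of the returns.
    ES is -1 times the mean of the returns strictly below that quantile.
    """
    r = np.asarray(returns, dtype=float)
    q = np.quantile(r, alpha)      # empirical quantile (linear interpolation)
    tail = r[r < q]                # returns lower than the quantile
    var = -q
    # With very small samples the tail can be empty; fall back to VaR.
    es = -tail.mean() if tail.size else var
    return var, es
```

Every return in the window receives the same weight, so an extreme return enters and leaves the estimate abruptly as the window rolls, which is the weakness of equally-weighted methods noted above.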

The calibration of GARCH and GAS models requires a large number of historical returns.⁵ While some cryptocurrencies such as bitcoin or ether have been trading for some time, the continuous emergence of new coins and tokens that gain investor attention often means that newer cryptocurrencies have insufficient data available to produce robust parameter estimates. For instance, at the time of writing, the list of top ten cryptocurrencies by market cap reported by Cryptocompare includes Avalanche, Solana and Terra which have only been trading for about two years. For such cryptocurrencies, volatility models that can be ‘jump-started’ and produce forecasts without the need for a lengthy estimation period, such as the RiskMetrics $^{TM}$ exponentially-weighted moving average (EWMA) model (Longerstaey and Spencer 1996), are ideal. EWMA models have the added advantage of allowing the use of ad hoc parameter values even when we include features such as an asymmetric volatility response and a heavy-tailed Student-t distribution assumption.
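To make the ‘jump-start’ property concrete, the basic symmetric EWMA variance recursion can be sketched as below. This is a sketch under stated assumptions: zero-mean returns, the standard RiskMetrics daily smoothing constant of 0.94 as the default, and a hypothetical function name `ewma_variance`. It does not include the asymmetric response or Student-t features discussed in the text, and the paper's own parameter choices may differ.

```python
import numpy as np

def ewma_variance(returns, lam=0.94, init_var=None):
    """One-step-ahead EWMA variance forecasts.

    var[t] is the forecast for period t given returns up to t-1:
        var[t+1] = lam * var[t] + (1 - lam) * returns[t]**2
    The final element is the forecast for the next, unobserved period.
    """
    r = np.asarray(returns, dtype=float)
    var = np.empty(r.size + 1)
    # No estimation period is needed: seed with the first squared return
    # (an assumption made here) or with a user-supplied starting value.
    var[0] = r[0] ** 2 if init_var is None else init_var
    for t in range(r.size):
        var[t + 1] = lam * var[t] + (1.0 - lam) * r[t] ** 2
    return var
```

Because the only parameter is fixed in advance, the recursion can start from the first available return, so the whole price history remains available for backtesting.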

However, before this paper we had little or no idea of the performance of EWMA models for bitcoin and other cryptocurrencies, relative to the more complex models that have been the focus of previous academic research. Indeed, a major limitation of the extant literature is the lack of consideration of simpler models, even though such models are most commonly employed by practitioners. There are also numerous gaps in the extant literature on cryptocurrency risk metrics. For instance, there is a complete absence of the traffic lights for VaR and ES backtesting which have been standard practice in the industry since Basel Committee (1996), and there is just one paper which uses scoring rules for density forecast evaluation. Likewise, only one other paper examines the VaR and ES of short positions on cryptocurrencies even though these are as easily traded as long positions on all the major exchanges. Furthermore, hardly any other papers examine the forecasting accuracy of multivariate models, even though these should form the cornerstone of cryptocurrency portfolio optimization techniques. And all previous academic studies employ data at the daily frequency, with samples that are often too small to yield robust and reliable results. None of them use hourly data even though these data are readily available and there are distinct advantages of using hourly data: firstly for a 24-fold increase in sample size and hence a much larger data set for risk model calibration and backtesting; and secondly for a means to capture intraday volatility, which is especially important for cryptocurrencies because they have many more price jumps and short bursts of volatility than traditional assets. One purpose of this paper is to fill all these gaps in the otherwise highly prolific literature.

By contrast, the complex end of the modelling spectrum is over-researched, at least from the cryptocurrency practitioner's perspective. Our tenet is that there is very limited scope for real-world applications of FIGARCH, ACGARCH, TGARCH, H-GARCH, ALL-GARCH, APARCH, MS-GARCH and several other varieties that have been explored in this strand of cryptocurrency research. Instead, a class of EWMA models which extends the basic RiskMetrics $^{TM}$ methodology is ideally suited for risk-based applications of cryptocurrency portfolios, for two main reasons: first because the methodology is easy to understand, validate, and explain in a simple technical document; second, and perhaps most importantly, these models do not require large samples of historical data for parameter estimation, and so they can be backtested using the maximum amount of historical data available which, for some cryptocurrencies, is already rather small.

This paper investigates the relative performance of different types of EWMA model and a variety of GARCH models for capturing volatility clustering in USD prices of bitcoin, ether, ripple and litecoin. We use these coins because, unlike many other coins or tokens, they have the sufficiently long history that is needed for proper calibration and thorough backtesting of multivariate GARCH models. Bitcoin, ether and ripple are also among the largest coins by market capitalization, as litecoin also used to be. Our main purpose is to quantify the gains, if any, from using the complex GARCH models whose performance for bitcoin and a few other cryptocurrencies has already been extensively analysed in a burgeoning yet fragmented literature. First we present a concise and accessible summary of the crypto GARCH literature, reviewing its unifying themes and obvious gaps, and conclude that there is no consistent evidence to support the use of any model more complex than a simple asymmetric GARCH(1,1) with Student-t innovations. Empirical results are divided as to whether the exponential GARCH (EGARCH) model of Nelson (1991) or the GJR-GARCH model of Glosten et al. (1993) is better at capturing the necessary asymmetry; we find the EGARCH slightly better for major coins, but either would serve.

Our benchmark volatility model is the sample standard deviation—a simple equally-weighted moving average of past squared returns—against which we assess the performance, in both univariate and multivariate systems, of several adapted EWMA models, with and without asymmetric volatility responses, and both symmetric and asymmetric GARCH models, all with Student-t innovations. Our applications extend previous research in several ways: by analysing hourly as well as daily log returns; by backtesting one-step-ahead ES as well as standard VaR metrics; by studying both univariate and multivariate systems; and by further evaluating the volatility and covariance forecasts using univariate and multivariate proper scoring rules.

The daily data backtesting sample is from January 2017 to August 2021 and for the hourly data we produce forecasts from 1 May 2021 to 1 July 2021. Because this research is targeted towards risk management professionals, we backtest VaR forecasts with the industry-standard traffic light and conditional coverage test of Christoffersen (1998); similarly we use a modified traffic light test for ES as well as the exceedance residual test of McNeil and Frey (2000). The accuracy of volatility forecasts is also assessed using the continuous ranked probability score of Gneiting and Ranjan (2011), the energy score developed by Gneiting and Raftery (2007) is employed for evaluating covariance forecasts, and we also assess forecasting accuracy using the univariate and multivariate negatively oriented logarithmic scoring rules, as mentioned by Gneiting and Ranjan (2011) and used by Catania et al. (2019).
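For readers unfamiliar with the traffic light approach, the sketch below classifies a VaR model by its exceedance count in the standard Basel (1996) manner: the zone depends on the cumulative binomial probability of observing that many exceedances or fewer if the model were correct. The function names are hypothetical and the 95% and 99.99% zone boundaries are the usual Basel conventions; the modified ES version used in this paper is not reproduced here.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def traffic_light_zone(n_exceed, n_obs=250, coverage=0.99):
    """Basel-style zone for a count of VaR exceedances.

    green : cumulative probability below 95%
    yellow: cumulative probability in [95%, 99.99%)
    red   : cumulative probability of 99.99% or more
    """
    p = 1.0 - coverage                  # exceedance probability under H0
    cum = binom_cdf(n_exceed, n_obs, p)
    if cum < 0.95:
        return "green"
    if cum < 0.9999:
        return "yellow"
    return "red"
```

With 250 observations at 99% coverage this reproduces the familiar boundaries: up to four exceedances is green, five to nine is yellow, and ten or more is red.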

Overall, we conclude that EWMA models perform at least as well as GARCH models at all levels of coverage up to and including 99%, and sometimes they perform even better. Interestingly, we find that hourly forecasts are less accurate than daily forecasts in general, when examining the number of models that fail the VaR and ES backtesting in each case. Nevertheless, most EWMA models are sufficiently accurate to pass traffic light and coverage tests at all three tail quantiles, for both long and short positions. By contrast, the more sophisticated Student-t exponential GARCH models often fail to make accurate predictions at the hourly level. Their parameter estimates are less stable than they are with a daily rolling-window re-calibration. At the hourly frequency it seems that GARCH models are fitting high-frequency fluctuations that appear irrelevant for forecasting the tails of one-hour-ahead distributions and it is better to use the stable, if ad hoc parameters of a EWMA model.

For predicting the volatility and covariance structure and when assessing the results using proper scoring rules, all models (including the random walk benchmark) are equally (in)accurate. This is true for both univariate and multivariate density predictions and for one-day-ahead as well as one-hour-ahead forecasts. This finding supports a simple form of market efficiency, which is not surprising since the trading volumes on large coins have grown very rapidly during the last few years, so by now the markets have become quite mature. Nevertheless it is worthwhile to have demonstrated this efficiency empirically, at the daily frequency since January 2017 and at the hourly frequency since 1 May 2021.

In the following: Section 2 provides a critical survey of the extensive literature on cryptocurrency volatility, VaR, ES and covariance forecasting; Section 3 specifies the models used in our empirical study, as well as the backtesting of VaR and ES predictions and the use of proper scoring rules for assessing the accuracy of volatility and covariance forecasts; Section 4 provides an overview of the daily and hourly historical data used for the analysis; Section 5 presents our empirical results; and Section 6 summarizes and concludes.

2. State-of-the-art crypto risk models

Here we summarize the burgeoning academic literature on cryptocurrency market risk modelling by focusing on papers which assess the in-sample and out-of-sample performance of parametric volatility and/or covariance models applied to cryptocurrency returns. For ease of reference, the main characteristics of the most relevant academic papers are summarized in Table 1.

Table 1. Key characteristics of the relevant academic papers that assess the forecasting performance of cryptocurrency volatility and covariance models.

Table 1 reports the cryptocurrencies examined, the sample period, the models employed and their distributional assumptions, and the performance criteria used to discriminate between competing models. The cryptocurrencies most commonly examined are: bitcoin (BTC), ether (ETH), ripple (XRP), litecoin (LTC), dogecoin (DOGE), dash, monero (XMR), maidsafecoin (MAID), stellar (XLM), bytecoin (BCN), bitcoin cash (BCH), bitcoin gold (BTG), bitcoin diamond (BCD), bitcoin private (BTCP), and also an equally-weighted and a minimum variance portfolio.⁶ A few authors examine a more expanded cryptocurrency universe, e.g. Catania and Grassi (2021) analyse a total of 606 cryptocurrencies having at least 700 daily price observations until September 2019, but the majority of papers focus on bitcoin, ether, ripple and litecoin because these offer a historical period of at least five years and they are consistently amongst the largest coins by market capitalization. The sample frequency is almost invariably daily and the sample period used in each paper usually depends on the available historical data. For example, Katsiampa (2017) and Baur et al. (2018) only examine bitcoin, so their sample period begins in 2010. However, Fantazzini and Zimin (2020) use less than three years of data for both calibration and backtesting. This, like most of the studies summarized in Table 1, would not pass the stringent Basel guidelines on historical data for market risk capital calculation.⁷

2.1. Survey of models employed

First we summarize the models used not only in the papers summarized in Table 1 but also for numerous other applications of GARCH models to cryptocurrency returns. Regarding the literature summarized in Table 1, the most common choices include the symmetric GARCH of Bollerslev (1986) and asymmetric models such as the GJR-GARCH of Glosten et al. (1993), the exponential GARCH (EGARCH) of Nelson (1991), the threshold GARCH (TGARCH) of Zakoian (1994), the asymmetric power ARCH (APARCH) of Ding et al. (1993) and, less often, the AGARCH of Engle and Ng (1993). These models are in some cases extended further with distribution mixture and Markov switching (MS) frameworks. Some authors use the component GARCH (CGARCH) of Engle and Lee (1999) and variants such as its asymmetric extension ACGARCH, the weighted component GARCH (wCGARCH) of Bauwens and Storti (2009) and the component with multiple threshold (CMT) GARCH of Bouoiyour and Selmi (2014). Still more complex volatility model choices include the H-GARCH and ALL-GARCH of Hentschel (1995), the non-linear NGARCH of Higgins and Bera (1992), the AVGARCH of Schwert (1990), the robust GARCH model of Trucíos et al. (2017), the realized GARCH model of Hansen et al. (2012), the GARCH-MIDAS (mixed data sampling) model of Engle et al. (2013), and also an autoregressive jump intensity (ARJI) model and a stochastic volatility model with co-jumps (SVCJ).

More sophisticated univariate models include the realized GARCH and stochastic volatility models which are discussed by Takahashi et al. (2016), Chen et al. (2021) and Takahashi et al. (2021). In the context of cryptocurrencies, stochastic volatility models have been used by Tiwari et al. (2019) and realized GARCH by Trucíos and Taylor (2022), but only in the univariate context. These papers also have mixed results, suggesting a possible need for further research. In a similar vein, more complex distributional assumptions beyond the normal and Student-t and their skewed variants could be applied, including the generalized error distribution (GED), generalized hyperbolic, Weibull, Laplace, Beta-skew-t, generalized Pareto, reflected Gamma, inverse Gaussian and Johnson's SU distribution. All of these GARCH variants have been explored in the voluminous research literature on univariate GARCH modelling, but their extension to large dimensional multivariate systems of returns presents a challenge. Consequently it is not surprising that there is no evidence of widespread adoption of these complex models by financial risk practitioners, even for volatility modelling in traditional asset classes.⁸ It may be that some of these state-of-the-art volatility models could produce superior results, for some individual cryptocurrencies, but in this paper our focus is on the widespread uptake of simpler, multivariate risk models by practitioners, specifically those that fall within an asymmetric extension of the RiskMetrics $^{TM}$ EWMA class.

The vast majority of other papers are about the diversification or hedging effects of bitcoin, and these typically employ some variant of the GARCH class with normal or Student-t distributed innovations, and again all such models are GARCH(1,1).⁹ For instance: Dyhrberg (2016) compares bitcoin with gold and the dollar using both symmetric and exponential normal GARCH; Bouri et al. (2017) examine the hedging and safe-haven properties of bitcoin and use a symmetric model with innovations that follow a generalized error distribution (GED); Al-Khazali et al. (2018) compare the impact of macroeconomic news on bitcoin and gold and find that the best GARCH model is the exponential GARCH with normally distributed error terms; Corbet et al. (2018) examine the applications of bitcoin futures and use a symmetric GARCH; Vidal-Tomás and Ibañez (2018) use a component GARCH to examine the efficiency of bitcoin traded prices; Al-Yahyaee et al. (2019) study the diversification effects of bitcoin and gold for crude oil and S&P 500 investments and use several GARCH models including a fractionally integrated (FI) EGARCH model; and López-Cabarcos et al. (2020) analyse the effect of investor sentiment and S&P 500 and VIX returns on bitcoin's volatility, using GARCH and EGARCH models.

Due to its simplicity and ease of use, the RiskMetrics $^{TM}$ EWMA model of Longerstaey and Spencer (1996) is very popular in financial market applications, and some academic papers focus on assessing its forecasting accuracy using traditional asset as well as cryptocurrency data. For instance, Pafka and Kondor (2001) examine its VaR forecasting ability for returns on the 30 constituent stocks of the DJIA index, arguing that it performs well at lower (e.g. 95%) coverage levels and for short-term risk horizons, but that its accuracy declines at 99% coverage and also for multi-period forecasts. Similar results are reported by McMillan and Kambouroudis (2009), now examining 31 stock market indices. Specifically in the cryptocurrency literature, there is some support for the use of integrated GARCH (IGARCH) models, and the EWMA model falls into the integrated volatility model class. For instance Chu et al. (2017) and Köchling et al. (2020) find that IGARCH provides the optimal in-sample fit for bitcoin and other cryptocurrencies; and Bouoiyour and Selmi (2016) and Baur et al. (2018) both find that bitcoin's variance process is integrated. The forecasting performance of EWMA volatility models is assessed by Catania et al. (2019), Bazán-Palomino (2020), Nekhili and Sultan (2020) and Silahli et al. (2021). Silahli et al. (2021) also examine an even simpler equally-weighted moving average (EQMA) model as a benchmark, while Guesmi et al. (2019) and Segnon and Bekiros (2020) use fractionally integrated models such as the FIGARCH and FIAPARCH. Liu et al. (2020) consider several score-driven EWMA models based on the generalized autoregressive score (GAS) model framework of Creal et al. (2013), and Trucíos (2019), Troster et al. (2019) and Catania and Grassi (2021) also use GAS models.
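To see why the EWMA model belongs to the integrated class, it helps to set the two variance recursions side by side. The notation below is chosen here for illustration and is an assumption rather than the paper's own: $r_t$ is a zero-mean return, $\sigma_t^2$ its conditional variance and $\lambda \in (0,1)$ the EWMA smoothing constant.

$$\sigma_t^2 = \omega + \alpha\, r_{t-1}^2 + \beta\, \sigma_{t-1}^2 \qquad \text{(GARCH(1,1))}$$

$$\sigma_t^2 = (1-\lambda)\, r_{t-1}^2 + \lambda\, \sigma_{t-1}^2 \qquad \text{(EWMA)}$$

Setting $\omega = 0$, $\alpha = 1-\lambda$ and $\beta = \lambda$ in the first recursion gives the second, and these values satisfy $\alpha + \beta = 1$, which is the defining restriction of the IGARCH(1,1) model. The EWMA can therefore be read as an IGARCH(1,1) with no constant term and a single parameter that is fixed in advance rather than estimated.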

The forecasting performance of multivariate covariance models has been only rarely studied, and in these few papers only in-sample performance has been assessed. Bouri et al. (2017) were the first to examine cryptocurrencies in a multivariate context, using a dynamic conditional correlation (DCC) model of Engle (2002) to test the hedge and safe-haven properties of bitcoin. The majority of other studies use the DCC model and only a few employ the earlier BEKK model of Engle and Kroner (1995). For instance, Bazán-Palomino (2020) considers the relationship between bitcoin and similarly structured cryptocurrencies using the multivariate EWMA, BEKK-GARCH and DCC-GARCH, while Guesmi et al. (2019) use the DCC model to examine bitcoin as well as a number of traditional financial assets. Regarding applications of a multivariate EWMA model, Matkovskyy et al. (2020) use one to examine the interdependence between bitcoin, economic policy uncertainty and traditional financial assets, but none of the relevant papers assess its forecasting performance for VaR and ES of cryptocurrencies, nor do they evaluate the accuracy of covariance forecasts via scoring rules. Other covariance modelling choices reported in Table 1 include the asymmetric ADCC model of Cappiello et al. (2006), the modified cDCC and cADCC of Aielli (2013), multivariate extensions of the marginal densities using copula functions to model the correlation structure and time-varying parameter vector autoregression (TVP-VAR) models.

2.2. Survey of performance results

Engle et al. (2012) provide a useful survey of the numerous papers that explore the best specification for univariate GARCH models on different types of financial data. To update this survey to include the recent research on cryptocurrencies is difficult because the results are often contradictory, suggesting that the best in-sample fit very much depends on both the cryptocurrencies chosen and the sample period, which vary considerably from study to study. At least, as noted above, all previous work employs data at the same daily frequency.

Katsiampa (2017) tests several parametric volatility models for the best in-sample fit on bitcoin returns and all criteria indicate that the ACGARCH model is optimal; this is consistent with Bouoiyour and Selmi (2016) whose in-sample analysis also indicates a model with a transitory and a permanent volatility component. The in-sample analysis of Baur et al. (2018) indicates superiority of the EGARCH model for bitcoin returns, and the authors note that using different asymmetric volatility models does not improve the in-sample fit. Tiwari et al. (2019) compare the fit of GARCH and stochastic volatility models for bitcoin and litecoin and find mixed results, for instance concluding that cryptocurrency returns do not exhibit any asymmetric volatility response, which is at odds with the previous findings. The findings of Sosa et al. (2019) suggest that an EGARCH model with GED innovations provides the best in-sample model fit for bitcoin. Troster et al. (2019) agree that a GED assumption instead of a normal significantly improves goodness-of-fit, but further conclude that the hyperbolic HGARCH model with GED innovations provides the best in-sample fit, which is again contrary to previous findings.

In the class of regime-switching volatility models, Ardia et al. (2019) find that a two-state Markov switching skewed Student-t GJR-GARCH provides a better in-sample fit for bitcoin compared to both non-switching and three-state switching models; the authors propose that the two-state model provides a better trade-off between fitting quality and model complexity and further show for three-regime models that fitting gains are only observed for the normal distribution. Alexander and Dakos (2020) also explore the in-sample fit of two-state Markov switching GARCH models for bitcoin returns and show that the best model depends on the exact source of data used.

To sum up, the plethora of in-sample diagnostics applied to GARCH models of cryptocurrency volatility reveals a picture of numerous, but imprecise and highly contradictory conclusions, derived from the painstaking estimation of increasingly complex models which often use insufficient data to provide robust and accurate results. Yet, the state-of-the-art results on out-of-sample forecasting for cryptocurrency returns, to which we now turn, are even more confusing.

Out-of-sample forecasting centres on VaR and Expected Shortfall backtests, usually focusing on the left tail of the returns' distribution to assess the risk of downward price movements on long crypto asset positions. It is worth noting that the only study other than ours that assesses the performance of right-tail forecasts for losses made on short positions is that of Stavroyiannis (2018), who examines the GJR-GARCH model calibrated to bitcoin returns. The most common backtesting methodologies for VaR forecasts are the unconditional coverage (UC) test of Kupiec (1995), the conditional coverage (CC) test of Christoffersen (1998) and the dynamic quantile (DQ) test of Engle and Manganelli (2004); for ES, common backtesting methods include the exceedance residual (ER) of McNeil and Frey (2000), the regression-based ESR test of Bayer and Dimitriadis (2020) and the multi-level backtest approximation via VaR of Kratz et al. (2018). Other methods of analysis include the use of loss functions either in the model confidence set (MCS) process of Hansen et al. (2011) or also in hypothesis tests of equal forecasting performance such as the DM test of Diebold and Mariano (1995). Finally, the use of proper scoring rules to evaluate cryptocurrency returns density forecasts is much less common, with Catania and Grassi (2021) using the continuous ranked probability score and Catania et al. (2019) using the log score. Also, and very much in the vein of our paper, we emphasize that the industry standard traffic light backtesting framework of the Basel Committee (1996), e.g. as described by Costanzino and Curran (2018), is overlooked by all these papers.
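As an illustration of the simplest of these backtests, the unconditional coverage test compares the observed exceedance frequency with the nominal tail probability through a likelihood ratio. The sketch below is a minimal version using only the standard library; the function name `kupiec_uc_test` is hypothetical, and the p-value relies on the fact that the survival function of a chi-squared variable with one degree of freedom equals $\mathrm{erfc}(\sqrt{x/2})$.

```python
import math

def kupiec_uc_test(n_obs, n_exceed, alpha):
    """Likelihood ratio test of H0: P(exceedance) = alpha.

    Returns (LR statistic, p-value); LR is asymptotically chi-squared(1).
    """
    def loglik(p):
        # Binomial log-likelihood kernel, using the 0 * log(0) = 0 convention.
        ll = 0.0
        if n_exceed > 0:
            ll += n_exceed * math.log(p)
        if n_obs - n_exceed > 0:
            ll += (n_obs - n_exceed) * math.log(1.0 - p)
        return ll

    p_hat = n_exceed / n_obs                       # observed exceedance rate
    lr = -2.0 * (loglik(alpha) - loglik(p_hat))
    lr = max(lr, 0.0)                              # guard against rounding
    p_value = math.erfc(math.sqrt(lr / 2.0))       # chi-squared(1) tail
    return lr, p_value
```

For example, 25 exceedances of a 1% VaR in 1000 observations is rejected at any conventional level, whereas exactly 10 exceedances gives a statistic of zero. The test ignores the timing of exceedances, which is what the CC and DQ tests add.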

One reason for the confusing conclusions drawn from out-of-sample results is that they depend not only on the models employed but also on the particular cryptocurrency returns studied, the sample period employed and the significance levels examined. For instance, Ardia et al. (2019) compare the VaR forecasting accuracy of single-regime and Markov switching models for bitcoin, concluding that only regime-switching models produce accurate VaR forecasts at the 1% significance level; however, it is worth noting that 5% daily VaR forecasts produced using the relatively simpler single-regime skewed Student-t GJR-GARCH model also pass the CC test, because we cannot reject the null hypothesis of no clustering in exceedances at the 5% significance level, and the DQ test as well. Maciel (2021) compares the prediction performance of Markov switching GARCH against single-regime GARCH models for several crypto assets and is in favour of more complex models similar to Ardia et al. (2019), but the results for a similar set of single- and two-regime GARCH models, also applied to bitcoin, are somewhat mixed. Caporale and Zekokh (2019) also apply a variety of different backtests to VaR and ES forecasts for bitcoin, ether, ripple and litecoin with an exhaustive set of mixture and regime switching model combinations, but again the results are inconclusive.

It further transpires that even when very complex volatility models can produce accurate out-of-sample VaR and ES forecasts, relatively simpler models can produce equally accurate results. For instance, Bonello and Suda (2018) compare VaR forecasts for bitcoin using single-regime and two-regime normal and Student-t GARCH models, and find that all specifications can produce accurate VaR forecasts at a 5% significance level. Troster et al. (2019) backtest daily 1% VaR forecasts for bitcoin and find that a Student-t standard GARCH model is on a par with several more complex GARCH and GAS models included in their study. Trucíos (2019) evaluates VaR forecasts for bitcoin between 2011 and 2017 using six competing models, finding that only a robust bootstrap VaR method produces accurate forecasts at the 1% significance level. In fact, in the preliminary results of a subsequent working paper, Trucíos and Taylor (2022) use a more recent sample period and show that bitcoin and ether VaR forecasts based on simpler volatility models such as the standard GARCH may be considered accurate. Acereda et al. (2020) find that more complex model specifications do not outperform the simpler ones for bitcoin VaR, as long as heavy-tailed distributions are used instead of the standard normal. Silahli et al. (2021) also find that simple benchmark models pass various VaR backtests for several crypto assets.

Contradictory results are even apparent when one considers EWMA models alone. For example, Silahli et al. (Citation2021) claim that a normal EWMA volatility model produces accurate VaR forecasts for all cryptocurrencies, but Liu et al. (Citation2020) find that a similar model fails VaR backtests. Nekhili and Sultan (Citation2020) examine the out-of-sample performance of a benchmark RiskMetrics $^{TM}$ EWMA model and find that it produces accurate VaR forecasts at the 5% level, but not at 1%; yet for ES forecasts of almost all cryptocurrencies examined, a EWMA produces accurate ES forecasts according to the ER test. Within the multivariate setting the results seem a little more consistent: Silahli et al. (Citation2021) find that a EWMA covariance model used to produce VaR forecasts for a portfolio of bitcoin, litecoin, ripple and dash passes performance tests; and Catania et al. (Citation2019) examine bitcoin, ether, ripple and litecoin, testing several complex multivariate models against a vector autoregression with EWMA variance and find that none significantly outperform this much simpler benchmark.

Finally, Catania et al. (Citation2019) and Catania and Grassi (Citation2021) are the only applications of proper scoring rules specific to cryptocurrencies at the time of writing. Catania et al. (Citation2019) produce multi-period point and density forecasts for bitcoin, litecoin, ripple and ether returns, employing the log score as a measure of forecast accuracy and conclude that most models outperform the EWMA benchmark. Catania and Grassi (Citation2021) use the continuous ranked probability score (CRPS) to assess volatility forecasts from the GAS model versus EGARCH, concluding equal predictive ability as measured by the DM test. They backtest VaR and ES forecasts for a total of 606 cryptocurrencies with at least 700 daily price observations until September 2019. The authors use the score-driven volatility model specifications that incorporate several stylized features such as leverage effects, long memory of the volatility process and time-varying higher order moments, with a generalized hyperbolic skewed Student-t distribution. These models are compared against a benchmark Beta-Skew-t-EGARCH, producing multi-period 1% and 5% VaR and ES forecasts. VaR and ES forecasts are backtested with the DQ and ER tests and the density forecasts are assessed using the CRPS. Score-driven specifications produce accurate 5% and 1% ES and 5% VaR forecasts more often than the Beta-Skew-t-EGARCH benchmark, but GAS models and the EGARCH benchmark are on par when backtesting 1% VaR. Regarding density forecast evaluation via CRPS the authors find that certain score-driven models outperform the benchmark more often than they underperform it. However, even for these successful specifications, equal predictive ability is the most common outcome. For instance, when examining the uniformly-weighted CRPS of the one-day-ahead density forecast across all cryptocurrencies, equal predictive ability occurs in 83% of cryptocurrencies examined, including bitcoin, ether, ripple and litecoin.

While both Liu et al. (Citation2020) and Catania and Grassi (Citation2021) examine several volatility model specifications, the range of models examined is somewhat limited in both cases. Liu et al. (Citation2020) focus specifically on EWMA-type models and do not test other more complex models such as GARCH specifications, nor simpler model specifications that require no calibration such as an equally-weighted moving average or a EWMA with an ad-hoc value chosen for the decay parameter. Therefore, their results are not conclusive with respect to the overall suitability of EWMA-type models in forecasting cryptocurrency volatility compared to other more complex or simpler models. By comparison, Catania and Grassi (Citation2021) focus on highly sophisticated GAS model specifications with a similarly sophisticated heavy-tailed distribution assumption and test these against an already complex benchmark Beta-skew-t-EGARCH model, often finding equal forecasting performance. It is important to note that, as discussed previously, the above finding also extends to VaR and ES forecasting, i.e. the VaR and ES forecasting performance of highly complex GARCH and GAS model specifications can be on par with relatively simpler models such as the standard GARCH. For instance, this is shown in the results of Bonello and Suda (Citation2018), Troster et al. (Citation2019), Acereda et al. (Citation2020), Silahli et al. (Citation2021) and also in the working paper results of Trucíos and Taylor (Citation2022).

3. Methodology

Our benchmark model is that returns are normally distributed with zero mean and variance estimated as an equally-weighted moving average of the past n squared returns. Except for the benchmark model, we make the universal assumption of Student-t innovations, again with zero mean returns.Footnote¹⁰ This is because previous results, available on request, showed that none of the normal models outperformed their Student-t equivalent, for any cryptocurrency. On the other hand, using more complex distributional assumptions as in Chu et al. (Citation2017), Trucíos (Citation2019) and Liu et al. (Citation2020) is tangential to the theme of this paper. It would obfuscate the motivation for this paper by providing too many details. Therefore, to retain our focus on the main story here—i.e. the relative effectiveness of using ad-hoc values for EWMA parameters—we only describe the models and report the results for Student-t innovations in all the EWMA and GARCH models.

As noted above, the benchmark model assumes that returns are normal, with variance estimated by an n-period equally-weighted moving average of squared returns; we call it the random walk for short. We then have a EWMA model as per the RiskMetrics $^{TM}$ technical document (Longerstaey and Spencer Citation1996) and our own asymmetric extension, similar to the A-GARCH model of Engle and Ng (Citation1993), followed by a symmetric GARCH(1,1) model (Bollerslev Citation1986) and an asymmetric EGARCH(1,1) model (Nelson Citation1991); all of these models assume a Student-t distribution. Joint density forecasts are produced via n-period equally-weighted moving average covariance matrix estimates and multivariate versions of the EWMA models, while the GARCH and EGARCH models are combined with the dynamic conditional correlation (DCC) model of Engle (Citation2002) and Tse and Tsui (Citation2002) and with its asymmetric extension, the ADCC model of Cappiello et al. (Citation2006).

The basic econometric methodology consists of producing one-period-ahead volatility and covariance forecasts on a daily or hourly rolling basis. These are then combined with parametric distribution assumptions to produce one-period-ahead VaR and ES forecasts at various quantiles, where each model has univariate versions for each cryptocurrency and a multivariate version. To assess the risk of both long and short positions we backtest quantiles at 1%, 2.5%, 5%, 95%, 97.5% and 99%. Then the accuracy of the one-period-ahead volatility and covariance forecasts is evaluated via univariate and multivariate proper scoring rules, respectively.

We test the performance of VaR and ES predictions using the traffic light backtests which have been the industry standard for more than two decades (Basel Committee Citation1996), along with the two standard tests for clustering of exceedances, i.e. the conditional coverage (CC) test of Christoffersen (Citation1998) for VaR, and the (raw) exceedance residual (ER) test of McNeil and Frey (Citation2000) for ES.Footnote¹¹ Beyond quantile prediction backtesting, we also examine the accuracy of volatility forecasts using the continuous ranked probability score (CRPS) of Gneiting and Ranjan (Citation2011) for univariate forecasts and the energy score from Gneiting and Raftery (Citation2007) for covariance forecasts.Footnote¹² Additionally, we employ the univariate and multivariate negatively oriented logarithmic scoring rule as described by Gneiting and Ranjan (Citation2011). Note that all models assume a zero mean, so these scoring rules aim to examine the accuracy of one-period-ahead volatility and covariance forecasts, over and above the specific quantile predictions previously assessed.

3.1. Variance and covariance models

Denote the return on a single cryptocurrency at time t by $r_{t}$ and assume their mean is zero. In the random walk benchmark model we have:

$r_{t} = σ_{t} ϵ_{t}, \quad \text{with } ϵ_{t} \sim N(0, 1),$ (1)

where $σ_{t}^{2}$ is the average squared return over the most recent n periods. In both the EWMA and GARCH models, returns are assumed to follow a zero-mean, location-scale transformed Student-t distribution:

$r_{t} = σ_{t} ϵ_{t}, \quad \text{with } \sqrt{\frac{ν - 2}{ν}} ϵ_{t} \sim t_{ν},$ (2)

where $t_{ν}$ denotes the standardized Student-t distribution with ν degrees of freedom, $σ_{t}$ is the standard deviation of $r_{t}$ and the distribution of $ϵ_{t}$ is defined such that $ϵ_{t}$ has unit standard deviation. The variance under the standard EWMA model with decay parameter λ is calculated as:

$σ_{t}^{2} = (1 - λ) r_{t-1}^{2} + λ σ_{t-1}^{2}.$ (3)

Based on the AGARCH model of Engle and Ng (Citation1993), we introduce the asymmetric EWMA model with a decay parameter λ and an asymmetric volatility response parameter η. Under the AEWMA model, the variance is calculated as:

$σ_{t}^{2} = (1 - λ) (r_{t-1} - η)^{2} + λ σ_{t-1}^{2}.$ (4)

In the standard (symmetric) GARCH(1,1) model, the conditional variance is given by:

$σ_{t}^{2} = ω + α r_{t-1}^{2} + β σ_{t-1}^{2}.$ (5)

Similarly, in the Student-t EGARCH(1,1) model, we have:

$\begin{aligned} \ln(σ_{t}^{2}) & = ω + g(ϵ_{t-1}) + β \ln(σ_{t-1}^{2}) \\ g(ϵ_{t}) & = θ ϵ_{t} + γ (|ϵ_{t}| - E[|ϵ_{t}|]). \end{aligned}$ (6)

Regarding volatility forecasts, the random walk, EWMA and AEWMA models described in Equations (1), (3) and (4) have a constant volatility term structure, so their volatility forecasts for period t + 1 are set equal to the corresponding volatility estimates at time t. For the GARCH and EGARCH models the one-period-ahead volatility forecasts $\hat{σ}_{t+1}$ are obtained by updating the conditional volatility Equations (5) and (6) using the estimated model parameters and the last of the in-sample estimates for $\hat{σ}_{t}$ and $\hat{ϵ}_{t}$.
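The EWMA and AEWMA recursions in Equations (3) and (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ewma_variance`, the seed value for the recursion and the value chosen for η are hypothetical, and λ = 0.94 is the familiar RiskMetrics daily setting rather than a value taken from this section.

```python
def ewma_variance(returns, lam=0.94, eta=0.0, init_var=None):
    """Run the recursion sigma2[t] = (1 - lam) * (r[t-1] - eta)**2 + lam * sigma2[t-1].

    eta = 0 gives the symmetric EWMA of Equation (3); eta != 0 gives the
    asymmetric version of Equation (4). The returned list has one more
    element than `returns`; its last element is the one-period-ahead forecast.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    if init_var is None:
        # Assumption: seed the recursion with the mean squared return.
        init_var = sum(r * r for r in returns) / len(returns)
    sigma2 = [init_var]
    for r in returns:
        sigma2.append((1.0 - lam) * (r - eta) ** 2 + lam * sigma2[-1])
    return sigma2


# Hypothetical returns, used only to exercise the recursion.
returns = [0.01, -0.03, 0.02, -0.05, 0.04]
sym = ewma_variance(returns, lam=0.94, init_var=0.0004)
asym = ewma_variance(returns, lam=0.94, eta=0.005, init_var=0.0004)

next_vol_sym = sym[-1] ** 0.5    # one-period-ahead volatility, EWMA
next_vol_asym = asym[-1] ** 0.5  # one-period-ahead volatility, AEWMA
```

With a positive η, a negative return moves the variance more than a positive return of the same size, which is the asymmetric volatility response that the AEWMA is designed to capture.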

In a multivariate setting, denote by $r_{t}$ the $(m \times 1)$ vector of the m cryptocurrency returns at time t. The multivariate random walk benchmark model assumes that $r_{t}$ follows a multivariate normal distribution:

$r_{t} \sim N(0, Σ_{t}),$ (7)

where the covariance matrix $Σ_{t}$ is estimated as the sample covariance matrix of returns over the past n days. The EWMA and GARCH models follow their univariate counterparts, so the vector of returns is assumed to follow a multivariate location-scale transformed Student-t distribution with ν degrees of freedom:

$r_{t} \sim t_{ν} (0, \frac{ν - 2}{ν} Σ_{t}),$ (8)

where $Σ_{t}$ is the covariance matrix of $r_{t}$, so that $\frac{ν - 2}{ν} Σ_{t}$ is the distribution's scale matrix. The covariance matrix in the multivariate EWMA model with parameter λ is given by:

$Σ_{t} = (1 - λ) r_{t-1} r_{t-1}^{'} + λ Σ_{t-1}.$ (9)

The covariance matrix of the asymmetric EWMA with parameters λ and η is calculated as:

$Σ_{t} = (1 - λ) (r_{t-1} - η 1) (r_{t-1} - η 1)^{'} + λ Σ_{t-1},$ (10)

where $1$ is an $(m \times 1)$ vector of ones. For the multivariate GARCH models, the covariance matrix is modelled as:

$\begin{aligned} Σ_{t} & = D_{t} C_{t} D_{t} \\ C_{t} & = diag(Q_{t})^{-1/2} Q_{t} \, diag(Q_{t})^{-1/2}, \end{aligned}$ (11)

where $D_{t}$ is the diagonal matrix of conditional standard deviations estimated via the univariate GARCH or EGARCH model and $C_{t}$ is the conditional correlation matrix, which is modelled indirectly via the $Q_{t}$ matrix to ensure that $C_{t}$ is a proper, positive semi-definite correlation matrix. In the DCC model, $Q_{t}$ is given by:

$Q_{t} = (1 - a - b) \bar{Q} + a ϵ_{t-1} ϵ_{t-1}^{'} + b Q_{t-1}.$ (12)

Similarly, in the ADCC model $Q_{t}$ is calculated as:

$Q_{t} = (1 - a - b) \bar{Q} - g \bar{Q}^{-} + a ϵ_{t-1} ϵ_{t-1}^{'} + b Q_{t-1} + g ϵ_{t-1}^{-} (ϵ_{t-1}^{-})^{'},$ (13)

where $ϵ_{t}$ is the vector of standardized errors; $ϵ_{t}^{-}$ are the zero-threshold errors defined as equal to $ϵ_{t}$ when the corresponding elements are less than zero and equal to zero otherwise; and $\bar{Q}$ and $\bar{Q}^{-}$ are the unconditional covariance matrices of $ϵ_{t}$ and $ϵ_{t}^{-}$.

The one-period-ahead covariance matrix forecasts are produced in the same way as the volatility forecasts described previously. For the multivariate random walk, EWMA and AEWMA models, the one-period-ahead covariance matrix forecast at time t is set equal to the estimate at time t−1, and for the DCC and ADCC models it is obtained by updating the conditional covariance Equation (11).

3.2. Backtesting methods

The forecasting accuracy of the volatility models presented in the previous section is assessed by producing rolling forecasts and backtesting them against realized returns. For each of the two quantile risk measures, we use the industry standard traffic light test of the Basel Committee (Citation1996) and one academic standard test, i.e. the conditional coverage (CC) test of Christoffersen (Citation1998) for VaR and the exceedance residual (ER) test of McNeil and Frey (Citation2000) for ES.

3.2.1. Value-at-Risk

The VaR at a significance level α is defined as $-1 \times$ the α-quantile of $F_{t}$, the one-period-ahead forecast of the returns' distribution function made at time t. We set $α = 1\%, 2.5\%, 5\%$ for lower (left-tail) quantiles, using $1 - α$ for upper (right-tail) quantiles, so:

${VaR}_{t}(α) = \begin{cases} -F_{t}^{-1}(α), & \text{for long positions (left-tail VaR)} \\ F_{t}^{-1}(1 - α), & \text{for short positions (right-tail VaR).} \end{cases}$ (14)

The traffic light approach of the Basel Committee (Citation1996), as described in Costanzino and Curran (Citation2018), is extended here to both left- and right-tail VaR. The exceedance indicator $X_{t}^{VaR}(α)$ of each 1-period-ahead left- and right-tail 100α%-VaR forecast at times $t = 1, \dots, N$ is defined as:

$X_{t}^{VaR}(α) = \begin{cases} 1_{\{r_{t} \leq -{VaR}_{t}(α)\}}, & \text{for long positions} \\ 1_{\{r_{t} \geq {VaR}_{t}(α)\}}, & \text{for short positions,} \end{cases}$ (15)

where $1_{\{condition\}}$ denotes an indicator function which equals 1 if the condition is satisfied and 0 otherwise. The cumulative number of VaR exceedances $X_{N}^{VaR}(α)$ over the entire forecasting period $t = 1, \dots, N$ is then calculated as:

$X_{N}^{VaR}(α) = \sum_{t=1}^{N} X_{t}^{VaR}(α).$ (16)

Under the null hypothesis that the VaR model is specified correctly, the total number of VaR exceedances follows a binomial distribution with parameters N and α;Footnote¹³ we approximate the binomial with a normal distribution as:Footnote¹⁴

$X_{N}^{VaR}(α) \sim N(Nα, Nα(1 - α)).$ (17)

Let $x^{VaR}$ be the number of realized VaR exceedances over the forecasting period and let z be its standard normal transform, i.e. $z = (x^{VaR} - Nα) / \sqrt{Nα(1 - α)}$ under Equation (17). Denote the probability of obtaining $x^{VaR}$ or fewer exceedances as $Φ(z)$, where Φ is the standard normal distribution function.Footnote¹⁵ The traffic light colour zones are then defined as: Green if $Φ(z) < 0.95$; Yellow if $0.95 \leq Φ(z) < 0.9999$; Red if $Φ(z) \geq 0.9999$.
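The traffic light classification built from Equations (15) to (17) can be sketched as below. The function name `var_traffic_light` and the simulated inputs are hypothetical, and the sketch applies the plain normal approximation of Equation (17) with no continuity correction, which is an assumption because the footnotes are not reproduced here.

```python
from statistics import NormalDist


def var_traffic_light(returns, var_forecasts, alpha, tail="left"):
    """Classify a VaR backtest as green, yellow or red.

    VaR forecasts are quoted as positive numbers for both tails,
    matching the sign convention of Equation (14).
    """
    if tail == "left":
        hits = [1 if r <= -v else 0 for r, v in zip(returns, var_forecasts)]
    else:
        hits = [1 if r >= v else 0 for r, v in zip(returns, var_forecasts)]
    n = len(hits)
    x = sum(hits)  # cumulative exceedances, Equation (16)
    # Standard normal transform implied by Equation (17).
    z = (x - n * alpha) / (n * alpha * (1.0 - alpha)) ** 0.5
    p = NormalDist().cdf(z)
    if p < 0.95:
        zone = "green"
    elif p < 0.9999:
        zone = "yellow"
    else:
        zone = "red"
    return x, z, p, zone


# Hypothetical backtest: 1,000 periods, 1% VaR fixed at 5%, and a chosen
# number k of returns placed beyond the VaR.
n_periods, alpha = 1000, 0.01
var_forecasts = [0.05] * n_periods


def simulated_returns(k):
    return [-0.10] * k + [0.0] * (n_periods - k)


zone_10 = var_traffic_light(simulated_returns(10), var_forecasts, alpha)[3]
zone_20 = var_traffic_light(simulated_returns(20), var_forecasts, alpha)[3]
zone_25 = var_traffic_light(simulated_returns(25), var_forecasts, alpha)[3]
```

In this toy setting the expected number of exceedances is 10, so the three cases show how the zone worsens as the realized count moves further above that expectation.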

As described by the Basel Committee (Citation1996), the three-zone approach is introduced to mitigate the statistical limitations of backtesting and balance the two error types: type I, i.e. the possibility that an accurate model is classified as inaccurate based on its backtesting results; type II, i.e. the possibility that an inaccurate model is not classified as such based on its backtesting results. In the green zone, the backtesting results are considered consistent with an accurate model and the probability of erroneously accepting an inaccurate model is low. In the red zone, the backtesting results are highly unlikely to have resulted from an accurate model, and the probability of erroneously rejecting an accurate model is low. In the yellow zone, backtesting results could be consistent with either accurate or inaccurate models, so additional information is required to determine whether the model is specified correctly.

The VaR forecasts are further backtested using the conditional coverage (CC) test of Christoffersen (Citation1998), for which the likelihood ratio test statistic $LR_{cc}$ is:

$LR_{cc} = \frac{α^{n_{1}} (1 - α)^{n_{0}}}{\hat{π}_{01}^{n_{01}} (1 - \hat{π}_{01})^{n_{00}} \hat{π}_{11}^{n_{11}} (1 - \hat{π}_{11})^{n_{10}}},$ (18)

where: α is the significance level used in the VaR model; $\hat{π}_{01} = \frac{n_{01}}{n_{00} + n_{01}}$; $\hat{π}_{11} = \frac{n_{11}}{n_{10} + n_{11}}$; $n_{1}$ is the number of realized VaR exceedances; $n_{0} = N - n_{1}$ is the number of realized returns that do not exceed the VaR forecast; $n_{00}$ is the number of non-exceedances preceded by a non-exceedance; $n_{01}$ is the number of exceedances preceded by a non-exceedance; $n_{10}$ is the number of non-exceedances preceded by an exceedance; $n_{11}$ is the number of exceedances preceded by an exceedance.Footnote¹⁶ The asymptotic distribution of $-2 \ln LR_{cc}$ under the null hypothesis is chi-squared with 2 degrees of freedom and the null hypothesis of the CC test for the true transition probabilities $π_{01}$ and $π_{11}$ is that $π_{01} = π_{11} = α$, suggesting that there is a correct probability of exceedances and no clustering in exceedances.
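As a rough illustration of Equation (18), the statistic $-2 \ln LR_{cc}$ can be computed from a 0/1 exceedance sequence as follows. The function names and the two toy hit sequences are hypothetical. The sketch uses the counts exactly as defined in the text, adopts the convention 0 · log 0 = 0 when a transition count is empty, and relies on the fact that the chi-squared survival function with 2 degrees of freedom is exp(−x/2).

```python
import math


def _xlogy(count, prob):
    """count * log(prob), with the convention 0 * log(0) = 0."""
    return 0.0 if count == 0 else count * math.log(prob)


def conditional_coverage_test(hits, alpha):
    """Return (-2 ln LR_cc, p-value) for a 0/1 exceedance sequence."""
    n = len(hits)
    n1 = sum(hits)
    n0 = n - n1
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, curr in zip(hits[:-1], hits[1:]):
        counts[(prev, curr)] += 1
    n00, n01 = counts[(0, 0)], counts[(0, 1)]
    n10, n11 = counts[(1, 0)], counts[(1, 1)]
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    log_null = _xlogy(n1, alpha) + _xlogy(n0, 1.0 - alpha)
    log_alt = (_xlogy(n01, pi01) + _xlogy(n00, 1.0 - pi01)
               + _xlogy(n11, pi11) + _xlogy(n10, 1.0 - pi11))
    stat = -2.0 * (log_null - log_alt)
    p_value = math.exp(-stat / 2.0)  # chi-squared(2) survival function
    return stat, p_value


# Two hypothetical sequences with the same number of exceedances (10 in 1,000):
# evenly spread versus one cluster of consecutive exceedances.
spread = [1 if t % 100 == 50 else 0 for t in range(1000)]
clustered = [1 if 500 <= t < 510 else 0 for t in range(1000)]

stat_spread, p_spread = conditional_coverage_test(spread, alpha=0.01)
stat_clustered, p_clustered = conditional_coverage_test(clustered, alpha=0.01)
```

Both sequences have the correct unconditional exceedance rate of 1%, so any difference between the two p-values comes from the clustering that the test is designed to detect.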

3.2.2. Expected Shortfall

Expected Shortfall (ES) is defined as the expected loss given that the corresponding VaR forecast is exceeded, i.e.

${ES}_{t}(α) = \frac{1}{α} \int_{0}^{α} {VaR}_{t}(p) \, dp.$ (19)

Also called ‘expected tail loss’ or sometimes ‘conditional VaR’, ES addresses a limitation of VaR in that it cannot capture tail risk beyond the specified quantile of the returns distribution (Basel Committee Citation2012). A traffic light backtesting method for ES was introduced by Costanzino and Curran (Citation2018) as a generalization of the VaR traffic light backtest of the Basel Committee (Citation1996). Extending the idea of VaR exceedances, Costanzino and Curran (Citation2018) introduce the ES generalized exceedance indicator $X_{t}^{ES}(α) \in [0, 1]$ by applying the definition of ES in Equation (19) to the left- and right-tail VaR exceedance indicator $X_{t}^{VaR}(α)$ defined in Equation (15), i.e. $X_{t}^{ES}(α) = \frac{1}{α} \int_{0}^{α} X_{t}^{VaR}(p) \, dp$. We further extend this definition to right-tail ES, which yields:

$X_{t}^{ES}(α) = \begin{cases} (1 - \frac{F_{t}(r_{t})}{α}) 1_{\{r_{t} \leq -{VaR}_{t}(α)\}}, & \text{for long positions} \\ (1 - \frac{1 - F_{t}(r_{t})}{α}) 1_{\{r_{t} \geq {VaR}_{t}(α)\}}, & \text{for short positions.} \end{cases}$ (20)

The terms $(1 - \frac{F_{t}(r_{t})}{α})$ and $(1 - \frac{1 - F_{t}(r_{t})}{α})$ capture the severity of each VaR exceedance. Returns that exceed the VaR but not the ES receive a relatively low weight and $X_{t}^{ES}(α)$ is dominated by returns of greater magnitude that exceed both the VaR and ES. The cumulative ES generalized exceedance is then calculated as:

$X_{N}^{ES}(α) = \sum_{t=1}^{N} X_{t}^{ES}(α).$ (21)

Under the null hypothesis that the ES model is specified correctly, the distribution of $X_{N}^{ES}(α)$ is provided by Costanzino and Curran (Citation2018) based on the binomial and Irwin-Hall distributions;Footnote¹⁷ the authors further note that the distribution tends asymptotically to a normal distribution for large forecasting periods, based on the derivation of Costanzino and Curran (Citation2015):Footnote¹⁸

$X_{N}^{ES}(α) \sim N(\frac{1}{2} Nα, Nα(\frac{4 - 3α}{12})).$ (22)

Given the total realized ES generalized exceedances over the forecasting period $x^{ES}$, the probability of obtaining $x^{ES}$ or fewer ES generalized exceedances is $Φ(z)$, where z is again derived from the standard normal transformation of $x^{ES}$. The traffic light colour zones are therefore again defined as: Green if $Φ(z) < 0.95$; Yellow if $0.95 \leq Φ(z) < 0.9999$; Red if $Φ(z) \geq 0.9999$.
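The ES traffic light test of Equations (20) to (22) only needs the forecast distribution evaluated at each realized return, $F_{t}(r_{t})$. The sketch below takes those values as an input list called `pit` (a hypothetical name) and uses the fact that a left-tail VaR exceedance is equivalent to $F_{t}(r_{t}) \leq α$ for a continuous forecast distribution. The function name and the synthetic inputs are assumptions made for illustration.

```python
from statistics import NormalDist


def es_traffic_light(pit, alpha, tail="left"):
    """Classify an ES backtest from the values F_t(r_t) in `pit`."""
    if tail == "right":
        # Right tail: work with 1 - F_t(r_t), as in the second case of Equation (20).
        pit = [1.0 - u for u in pit]
    # Generalized exceedances, summed as in Equation (21).
    x_es = sum(1.0 - u / alpha for u in pit if u <= alpha)
    n = len(pit)
    mean = 0.5 * n * alpha
    variance = n * alpha * (4.0 - 3.0 * alpha) / 12.0  # Equation (22)
    z = (x_es - mean) / variance ** 0.5
    p = NormalDist().cdf(z)
    if p < 0.95:
        zone = "green"
    elif p < 0.9999:
        zone = "yellow"
    else:
        zone = "red"
    return x_es, z, p, zone


# A perfectly calibrated toy case: F_t(r_t) spread evenly over (0, 1).
calibrated = [(t + 0.5) / 1000 for t in range(1000)]
x_ok, z_ok, p_ok, zone_ok = es_traffic_light(calibrated, alpha=0.025)

# A badly calibrated toy case: far too many returns deep in the left tail.
too_many_tail_losses = [0.001] * 60 + [0.5] * 940
x_bad, z_bad, p_bad, zone_bad = es_traffic_light(too_many_tail_losses, alpha=0.025)
```

In the calibrated case the cumulative generalized exceedance equals its expected value of ½Nα = 12.5, so z is essentially zero and the model lands in the green zone.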

The ES forecasts are further analysed using the exceedance residual (ER) test of McNeil and Frey (Citation2000) based on the raw residuals (i.e. not divided by the estimated standard deviation), as suggested by Bayer and Dimitriadis (Citation2020):

$ϵ_{t} = \begin{cases} (-r_{t} - {ES}_{t}(α)) 1_{\{r_{t} \leq -{VaR}_{t}(α)\}}, & \text{for long positions} \\ (r_{t} - {ES}_{t}(α)) 1_{\{r_{t} \geq {VaR}_{t}(α)\}}, & \text{for short positions.} \end{cases}$ (23)

The ER test statistic is then calculated as the sample mean of $ϵ_{t}$:

$\hat{μ} = \begin{cases} \frac{\sum_{t=1}^{N} ϵ_{t}}{\sum_{t=1}^{N} 1_{\{r_{t} \leq -{VaR}_{t}(α)\}}}, & \text{for long positions} \\ \frac{\sum_{t=1}^{N} ϵ_{t}}{\sum_{t=1}^{N} 1_{\{r_{t} \geq {VaR}_{t}(α)\}}}, & \text{for short positions.} \end{cases}$ (24)

The test statistic $\hat{μ}$ does not have a standard distribution so we estimate it using a bootstrap simulation. In the results presented in Section 5, the distribution of the ER test statistic $\hat{μ}$ is simulated using 1000 bootstrapped replications. The null hypothesis is that $E[ϵ_{t}] = 0$; this is tested against a 1-sided alternative that $E[ϵ_{t}] > 0$, suggesting that ES is systematically underestimated.
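One simple way to bootstrap the ER statistic of Equation (24) for long positions is sketched below. The resampling scheme is an assumption, not necessarily the one used for the results in this paper: the exceedance residuals are centred to impose the null hypothesis of zero mean, resampled with replacement, and the one-sided p-value is the share of bootstrapped means at least as large as the observed $\hat{μ}$. The function name, the seed and the toy data are hypothetical.

```python
import random


def exceedance_residual_test(returns, var_forecasts, es_forecasts,
                             n_boot=1000, seed=0):
    """Left-tail ER test on raw residuals with a bootstrap p-value."""
    # Raw exceedance residuals of Equation (23), kept only when VaR is exceeded.
    resid = [-r - es
             for r, v, es in zip(returns, var_forecasts, es_forecasts)
             if r <= -v]
    if not resid:
        raise ValueError("no VaR exceedances, so the test is undefined")
    k = len(resid)
    mu_hat = sum(resid) / k  # Equation (24)
    # Impose the null E[eps] = 0 by centring before resampling (assumption).
    centred = [e - mu_hat for e in resid]
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(n_boot):
        sample_mean = sum(centred[rng.randrange(k)] for _ in range(k)) / k
        if sample_mean >= mu_hat:
            at_least_as_large += 1
    return mu_hat, at_least_as_large / n_boot


# Hypothetical forecasts: 5% VaR and 8% ES in every period.
var_f = [0.05] * 20
es_f = [0.08] * 20

# Losses that are always well above the ES forecast (ES underestimated).
bad_returns = [-(0.08 + 0.02 + 0.001 * i) for i in range(20)]
mu_bad, p_bad = exceedance_residual_test(bad_returns, var_f, es_f)

# Losses scattered symmetrically around the ES forecast.
ok_returns = [-(0.08 + e) for e in [-0.02, -0.01, 0.01, 0.02] * 5]
mu_ok, p_ok = exceedance_residual_test(ok_returns, var_f, es_f)
```

A small p-value rejects the null in favour of systematically underestimated ES, as in the first toy case, while the second case gives no evidence against the ES forecasts.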

3.2.3. Score-based tests for variance

Scoring rules measure the accuracy of probabilistic forecasts and allow for comparisons between competing prediction models. In the case of negatively oriented scoring rules, a lower score indicates a better forecast for the entire distribution, but the most important determinant of the score is the ability to predict an accurate expected value. Yet here we are setting all models equal in that sense—every model simply assuming a zero mean return, because our focus is on the accuracy (or otherwise) of RiskMetrics $^{TM}$ type volatility forecasts. Therefore, the difference between scores in our study is entirely due to difference in accuracy of the variance forecast. We find these score-based tests useful, above and beyond the quantile predictions relating to VaR and ES metrics, because our scores can be used to rank the accuracy of a variance forecast in one simple number.

We use the continuous ranked probability score (CRPS) for univariate distribution forecasts and its multivariate extension, the energy score, for joint density forecast evaluation. Similarly, we use the negatively oriented logarithmic score (LogS) to evaluate the univariate and joint density forecasts. The CRPS (Matheson and Winkler Citation1976 and Gneiting and Ranjan Citation2011) generalizes the mean absolute error of an observation y under a forecast distribution F:

$CRPS(F, y) = \int_{-\infty}^{+\infty} (F(z) - 1_{\{y \leq z\}})^{2} \, dz.$ (25)

According to Gneiting and Raftery (Citation2007), the CRPS can also be expressed as:

$CRPS(F, y) = E_{F} |X - y| - \frac{1}{2} E_{F} |X - X^{'}|,$ (26)

where X and $X^{'}$ are independent random variables with sampling distribution F. This representation leads to the energy score extension which generalizes the CRPS for multivariate distributions and is defined (Gneiting and Raftery Citation2007) as:

$ES(F, y) = E_{F} (‖X - y‖) - \frac{1}{2} E_{F} (‖X - X^{'}‖),$ (27)

where $‖ \cdot ‖$ denotes the Euclidean norm on $R^{n}$, $X$ and $X^{'}$ are independent ($n \times 1$) random vectors from a multivariate distribution with CDF forecast F and $y = (y_{1}, \dots, y_{n})$ is a realized observation. Moreover, if F is given via m discrete (n-dimensional) samples $X_{1}, \dots, X_{m}$, then the energy score is calculated as:

$ES(F, y) = \frac{1}{m} \sum_{i=1}^{m} ‖X_{i} - y‖ - \frac{1}{2 m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} ‖X_{i} - X_{j}‖.$ (28)

Finally, the uniformly-weighted, negatively oriented logarithmic score of an observation y from a univariate or multivariate forecast distribution F is defined by Gneiting and Ranjan (Citation2011) as:

$LogS(F, y) = -\log f(y),$ (29)

where f denotes the probability density function of F. Given the 1-period-ahead probability density function forecasts $f_{t}$, $g_{t}$ and their corresponding univariate or multivariate scores $S(f_{t})$ and $S(g_{t})$ produced on a rolling basis over the out-of-sample period $t = 1, \dots, N$, we compare the forecasting performance of f and g directly using their average scores over the out-of-sample period. Alternatively, we use the hypothesis test of equal performance described by Gneiting and Ranjan (Citation2011). If the average scores of f and g over the out-of-sample period are $\bar{S}_{N}^{f}$ and $\bar{S}_{N}^{g}$ respectively, then the test of equal performance is based on the statistic:

$t_{N} = \sqrt{N} (\frac{\bar{S}_{N}^{f} - \bar{S}_{N}^{g}}{\hat{σ}_{N}}),$ (30)

where:

$\hat{σ}_{N}^{2} = \frac{1}{N} \sum_{t=1}^{N} (S(f_{t}) - S(g_{t}))^{2}.$ (31)

The test statistic $t_{N}$ is asymptotically standard normal under the null hypothesis of vanishing expected score differentials; therefore in case of rejection, f is chosen if $t_{N}$ is negative and g is chosen if $t_{N}$ is positive.
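The sample-based energy score of Equation (28) and the equal-performance statistic of Equations (30) and (31) translate almost directly into code. In the sketch below the function names and the toy inputs are hypothetical; with one-dimensional samples the energy score reduces to a sample estimate of the CRPS in Equation (26). `math.dist` requires Python 3.8 or later.

```python
import math


def energy_score(samples, y):
    """Sample-based energy score of Equation (28).

    `samples` is a list of m points drawn from the forecast distribution
    and `y` is the realized observation, all of the same dimension.
    """
    m = len(samples)
    term1 = sum(math.dist(x, y) for x in samples) / m
    term2 = sum(math.dist(xi, xj) for xi in samples for xj in samples)
    return term1 - term2 / (2.0 * m * m)


def equal_performance_stat(scores_f, scores_g):
    """Test statistic t_N of Equations (30) and (31)."""
    n = len(scores_f)
    diffs = [a - b for a, b in zip(scores_f, scores_g)]
    sigma2 = sum(d * d for d in diffs) / n
    return math.sqrt(n) * (sum(diffs) / n) / math.sqrt(sigma2)


# Univariate toy case: two samples at 0 and 2, observation at 1.
crps_estimate = energy_score([(0.0,), (2.0,)], (1.0,))

# Bivariate toy case with four hypothetical samples.
es_estimate = energy_score(
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)], (0.5, 0.5))

# Hypothetical score series for two competing forecasts f and g.
t_stat = equal_performance_stat([1.0, 2.0, 1.0, 2.0], [2.0, 3.0, 2.0, 3.0])
```

Since the scores are negatively oriented, a negative `t_stat` points towards f and a positive value towards g, to be judged against standard normal critical values.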

4. Data

Intraday volatility is much greater in cryptocurrency markets than in traditional financial markets, so it is worth analysing hourly data here.Footnote¹⁹ Thus, we obtain both daily and hourly historical data on four of the largest cap cryptocurrencies as of 1 January 2021: bitcoin, ether, ripple and litecoin. Since then all but litecoin have remained in the top five cryptocurrencies by market cap. Nevertheless, we retain litecoin because so many of the papers reviewed earlier also apply their models and tests to litecoin. Historical price data are collected using the Cryptocompare API and are in the form of volume-weighted (VWAP) close prices, averaged across multiple USD-denominated exchange-traded prices for each crypto asset. For the daily frequency analysis, the sample period is between 20 August 2015 and 31 August 2021, with daily prices recorded at 00:00 UTC 365 days per year. The rolling estimation window length for the GARCH models is fixed at 500 days so that the forecasting period consists of 1,704 daily observations between 1 January 2017 and 31 August 2021. For the hourly frequency analysis, the sample period is between 1 January 2021 00:00 UTC and 1 July 2021 00:00 UTC, with an estimation window length of 4 months, i.e. 2,882 hourly returns observations; the forecasting period therefore consists of 1,465 hourly observations, between 1 May 2021 00:00 UTC and 1 July 2021 00:00 UTC.

Figure 1 depicts time series of daily log returns for each cryptocurrency. Bitcoin appears to be considerably less volatile than the other currencies, except during the ‘Black Thursday’ crypto market crash on 12 March 2020, and common volatility clusters are often observed simultaneously across all four cryptocurrencies. Figure 2 displays the time series of hourly log returns for each cryptocurrency over the entire sample period January to June 2021. All returns exhibit common volatility clustering and some extreme hourly returns above 10% or below −10%, as also shown in the minimum and maximum returns in Table 3.

Figure 1. Daily log returns on bitcoin, ether, ripple and litecoin VWAP USD prices obtained from Cryptocompare. The sample period is 20 August 2015 to 31 August 2021.

Figure 2. Hourly log returns on bitcoin, ether, ripple and litecoin VWAP USD prices obtained from Cryptocompare. The sample period is 1 January 2021 to 1 July 2021.

Table 1. Key characteristics of the relevant academic papers that assess the forecasting performance of cryptocurrency volatility and covariance models.

Table 2. Summary statistics of daily log returns on bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC) VWAP USD prices obtained from Cryptocompare.

Table 3. Summary statistics of hourly log returns on bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC) VWAP USD prices obtained from Cryptocompare.

Table 4. Parameter estimates and p-values (in parentheses) for each cryptocurrency obtained from the robust standard errors of the univariate skewed-Student-t EGARCH model estimated for the entire daily frequency sample period 20 August 2015—31 August 2021.

Table 5. Parameter estimates and p-values (in parentheses) for each cryptocurrency obtained from the robust standard errors of the univariate skewed-Student-t EGARCH model estimated for the entire hourly frequency sample period 1 January 2021—1 July 2021.

Table 6. Parameter estimates and p-values (in parentheses) for each cryptocurrency obtained from the robust standard errors of the univariate Student-t EGARCH model estimated for the entire daily frequency sample period 20 August 2015—31 August 2021.

Table 7. Parameter estimates and p-values (in parentheses) for each cryptocurrency obtained from the robust standard errors of the univariate Student-t EGARCH model estimated for the entire hourly frequency sample period 1 January 2021—1 July 2021.

Table 8. Backtesting results for one-day-ahead left-tail 1% and 2.5% VaR forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 January 2017—31 August 2021.

Table 9. Backtesting results for one-day-ahead right-tail 1% and 2.5% VaR forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 January 2017—31 August 2021.

Table 10. Backtesting results for one-day-ahead left-tail 1% and 2.5% ES forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 January 2017—31 August 2021.

Table 11. Backtesting results for one-day-ahead right-tail 1% and 2.5% ES forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 January 2017—31 August 2021.

Table 12. Average CRPS of one-day-ahead univariate density forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC) daily log returns, based on an out-of-sample period between 1 January 2017—31 August 2021.

Table 13. Average log score of one-day-ahead univariate density forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC) daily log returns.

5.2. Hourly forecasts

5.2.1. VaR and ES backtesting

Table 15. Backtesting results for one-hour-ahead left-tail 1% and 2.5% VaR forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 May 2021 and 1 July 2021.

Table 16. Backtesting results for one-hour-ahead left-tail 1% and 2.5% ES forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC).

Table 17. Backtesting results for one-hour-ahead right-tail 1% and 2.5% ES forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC).

5.2.2. Score-based tests for variance

Table 18. Average CRPS of one-hour-ahead univariate density forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC) hourly log returns.

Table 19. Average log score of one-hour-ahead univariate density forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC) hourly log returns.

6. Conclusions

Disclosure statement

Notes

References

Appendix

Table A1. Backtesting results for one-day-ahead left- (long position) and right-tail (short position) 5% VaR forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 January 2017 and 31 August 2021.

Table A2. Backtesting results for one-day-ahead left- (long position) and right-tail (short position) 5% ES forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 January 2017 and 31 August 2021.

Table A3. Backtesting results for one-hour-ahead left- (long position) and right-tail (short position) 5% VaR forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 May 2021 and 1 July 2021.

Table A4. Backtesting results for one-hour-ahead left- (long position) and right-tail (short position) 5% ES forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 May 2021 and 1 July 2021.

Table A5. Backtesting of daily left-tail 1% VaR for a short position on bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC).

Table A6. Backtesting of hourly left-tail 1% VaR on bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC).

Table A7. Backtesting of hourly right-tail 1% VaR on bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC).
