Finance and Banking Economics

Value-at-risk in the presence of asset price bubbles

Pages 361-384 | Received 13 Dec 2019, Accepted 03 May 2021, Published online: 12 Apr 2022

ABSTRACT

In this study, we respond to the criticism that the value-at-risk (VaR) measure fails during financial crises and is only applicable during periods without asset price bubbles. We propose a new dating mechanism, based on the work of Phillips et al. (2015), to date-stamp the origination and termination of asset price bubbles. Our method relaxes the minimum bubble duration constraint of the original model, and the empirical application statistically identifies the bubble periods in nine stock markets (Australia, Canada, China, Germany, Spain, Hong Kong, Japan, the United Kingdom, and the United States). We choose two of the most widely adopted VaR models (RiskMetrics and RiskMetrics 2006) to test performance. Our results show that the RiskMetrics model fails in most periods, whereas the RiskMetrics 2006 model performs efficiently in periods with asset price bubbles. These results indicate that the criticism that all VaR models fail during crises is invalid.

1. Introduction

Financial markets have experienced several crises over the last two decades. The Asian financial crisis in 1997, the dot-com bubble burst in 2001, the subprime mortgage crisis in 2008, and the European sovereign debt crisis in 2009 have driven financial institutions to use better measures to manage downside risk. The losses caused by these financial crises were tremendous; the 2008 subprime mortgage crisis cost the US economy up to USD 14 trillion (Atkinson, Luttrell, & Rosenblum, 2013), which is equivalent to the average annual output of the entire US economy. Furthermore, the global economy has become more integrated and cross-border financial flows have steadily increased. This widens the contagion effect and further complicates the effects of a financial crisis.

Financial crises are predominantly caused by the burst of asset price bubbles. An asset price bubble forms when the asset's price deviates from its fundamental value. During the bubble's boom period, the asset's price grows at an explosive rate. Blanchard and Watson (1982) suggest that asset bubbles last only until the market recognises them and makes corrections. The correction is usually associated with heavy selling pressure that causes a plunge in the stock price. This process is commonly referred to as a bubble burst. However, it is difficult to identify bubbles and date-stamp the bursts.

The losses from the financial crises have led practitioners and regulators to actively look for risk models that measure downside risk. Over the past decade, value-at-risk (VaR) has emerged as one of the most popular methods for measuring the downside risk of financial investments. VaR estimates the minimum loss in a portfolio's value that will be incurred over a given period with a certain probability. After the subprime mortgage crisis in 2008, practitioners and regulators criticised the VaR models for failing to reveal the underlying risk, which led many financial institutions to suffer unexpected losses well above the VaR value and resulted in a credit crunch. However, most of this criticism is not backed by statistical analysis, which leads to false conclusions about the efficiency of VaR models.

Although a considerable body of research criticises VaR for performing poorly during turbulent financial periods (see Mertzanis, 2013; Lockwood, 2015), most of the literature separates crisis and non-crisis periods by subjective judgment rather than by a clear rule. Halbleib and Pohlmeier (2012) proposed a data-driven VaR method that combines quantile forecasts to improve VaR's performance during a crisis. They formed three portfolios to represent small-, middle-, and large-cap stocks in the Dow Jones index and evaluated the performance in the calm period (1 January 2007 to 17 July 2007), crisis period (18 July 2007 to 1 July 2009), and crash period (1 September 2008 to 1 July 2009). The results show that the method performs well in the crisis periods. However, the crisis and non-crisis periods were defined arbitrarily without a statistical basis. Chen, Gerlach, Lin, and Lee (2012) studied the RiskMetrics models (Riskmetrics, 1996) and several VaR models from the generalised autoregressive conditional heteroscedasticity (GARCH) family (Bollerslev, 1986) under Bayesian forecasting tests in the Japanese, Hong Kong, Korean, and US markets. Their study ranks the RiskMetrics model last among all models and shows that all VaR models underestimate the risk level during a crisis period. However, as in Halbleib and Pohlmeier (2012), the crisis and non-crisis periods were defined arbitrarily, and the crisis periods were assumed to be the same across stock markets with different characteristics, which is debatable.

Our study contributes to the literature by addressing these limitations: we empirically test the VaR models' performance in periods with and without asset price bubbles, particularly in the periods of financial crisis after the bubbles burst. Against the background of the criticism of VaR models, we perform a series of backtests to statistically evaluate their performance. Following Phillips, Shi, and Yu (2015), we use the backward supremum augmented Dickey–Fuller (BSADF) test to date-stamp the origination and termination of the bubbles. However, we modify the date-stamping algorithm to suit backtesting. We define three periods for the tests: the pre-bubble period, bubble period, and post-burst period. The bubble period is bounded by the bubble's origination and termination dates, and the pre-bubble and post-burst periods are defined as the 2 years before the origination date and the 2 years after the termination date, respectively. Our date-stamping method allows researchers to evaluate financial models that may behave differently at different stages of a bubble.

Though VaR is a popular means of quantifying downside risk, there is little consensus on the preferred VaR model. Köksal and Orhan (2013) tested a VaR model based on a simple autoregressive conditional heteroscedasticity (ARCH) setting in developed and emerging markets during a financial crisis. Their results show that VaR performs less efficiently in developed countries than in emerging countries and, overall, fails to reveal the downside risk. However, their results may only hold for the ARCH-based VaR model. Bams, Blanchard, and Lehnert (2017) compared VaR's performance in the S&P 500, Dow Jones Industrial Average, and Nasdaq 100 indices. Their results show that VaR based on a historical volatility measure outperforms other parametric VaR models. Lee, Chiou, and Lin (2006) employed the dynamic conditional correlation (DCC) estimators of Engle (2002) to estimate a portfolio's VaR. The authors found that VaR performance at the portfolio level may be unreliable because of the difficulty of forecasting the dynamic correlation among assets. To model a portfolio's volatility better, Chiriac and Voev (2011) developed a multivariate vector fractionally integrated ARMA (VARFIMA) process that captures the long memory characteristics of financial volatility. Furthermore, Engle, Ledoit, and Wolf (2019) proposed a new standard DCC model that applies nonlinear shrinkage in the estimation of large-dimensional portfolios. Recent studies such as Fiszeder, Fałdziński, and Molnár (2019) and Law, Li, and Yu (2020) incorporate state-of-the-art volatility estimation methods in the calculation of portfolio VaR. Their results show that the choice of volatility model significantly affects the accuracy of the VaR estimates.

We follow Campbell and Shiller (1989) and Phillips et al. (2015) in using explosive behaviour in the log price–dividend ratio to detect asset price bubbles. To understand VaR performance in different markets, we examine nine stock markets (Australia, Canada, China, Germany, Spain, Hong Kong, Japan, the UK, and the US) and proxy each market portfolio by its respective market index. Accordingly, our analysis focuses on univariate VaR models and uses two widely adopted univariate VaR models – RiskMetrics (Riskmetrics, 1996) and RiskMetrics 2006 (Zumbach, 2007) – in our empirical tests in response to the criticism of VaR failure. To evaluate the performance of the VaR models in different periods, we conduct unconditional coverage tests (Kupiec, 1995), independence tests (Christoffersen & Pelletier, 2004), and joint coverage tests. The empirical results of both the conditional and unconditional coverage tests suggest that the RiskMetrics 2006 model adequately described the downside risks in all nine markets during the bubble periods. However, one weakness of the RiskMetrics 2006 model was its tendency to occasionally overstate the downside risk after market turbulence: the model reported zero VaR violations in the post-burst periods in Australia, Canada, Japan, and the UK. The empirical results of the conditional coverage tests, unconditional coverage tests, and Fisher's exact tests suggest that the RiskMetrics 2006 model performs efficiently in most periods, rendering the criticism against VaR models statistically invalid.

The remainder of this paper is organised as follows: Section 2 discusses the rationale of using the log price/dividend ratio as an indicator of asset price bubbles. Section 3 explains the SADF and generalised SADF (GSADF) tests used to identify asset price bubbles and the date-stamping mechanism. Section 4 describes the RiskMetrics and RiskMetrics 2006 VaR models, as well as the backtesting methods. The empirical results of the GSADF tests, identification of the bubble periods, and VaR backtesting results for the sample markets are presented in Section 5. Section 6 concludes the study.

2. Asset prices and bubbles

Asset price bubbles are usually driven by speculative behaviours that bid up the asset prices beyond their fundamental values. The fundamental value of an asset is the sum of the discounted future cash flows. In the presence of bubbles, asset prices behave explosively. The log price of a security is defined as

$$p_t = p_t^{f} + b_t, \tag{1}$$

where $p_t$ is the log price at time $t$, $p_t^{f}$ is the fundamental value, and $b_t$ is the bubble component (Campbell & Shiller, 1989).

In Equation 1, $p_t^{f}$ is the fundamental component of the stock price, as it depends on the expected dividends. In contrast, $b_t$ is the speculative component, as it is based on future expectations of the stock price. The stock price will be explosive if the bubble component $b_t$ in Equation 1 is non-zero.

The log price–dividend ratio is the sum of a series of log dividend differences and the bubble component (Campbell & Shiller, 1989; Phillips et al., 2015). If the log dividend is stationary after differencing, explosive behaviour in the log price–dividend ratio would be caused by the presence of a non-zero bubble component $b_t$. Thus, we can detect whether asset price bubbles exist by checking for explosive behaviour in the log price–dividend ratio series together with non-explosive behaviour in the first-differenced log dividend series.

3. SADF and GSADF tests

Previous studies have suggested using the supremum of a set of recursive right-tailed augmented Dickey–Fuller (ADF) tests (Dickey & Fuller, 1979) to detect the presence of stock bubbles. This SADF test applies the right-tailed ADF test to the coefficient $\pi$ of the lagged level, with the null hypothesis of a unit root ($\pi = 0$) and the alternative hypothesis of an explosive root ($\pi > 0$).

The regression model used in the SADF test is

$$\Delta y_t = \alpha + \pi y_{t-1} + \sum_{j=1}^{k} \gamma_j \Delta y_{t-j} + \varepsilon_t, \tag{2}$$

where $k$ is the lag order and $\varepsilon_t$ is the random error.

The SADF test begins by testing the first $r_0$ fraction of the observations. The ADF test is then repeated on forward-expanding samples until the sample fraction has increased from $r_0$ to 1, which is denoted by $ADF_{r_0}^{1}$. The forward sequence of regressions starts from observation 1 and ends at $\lfloor T r_w \rfloor$, where $\lfloor \cdot \rfloor$ is the integer part of the argument, $T$ is the total number of observations, and $r_w \in [r_0, 1]$ is the fraction of the observations. The SADF statistic is

$$SADF(r_0) = \sup_{r_w \in [r_0, 1]} ADF_0^{r_w}.$$
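The recursion above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions that are not taken from the paper: the lag order is fixed at $k = 0$, so the ADF regression reduces to a simple regression of $\Delta y_t$ on a constant and $y_{t-1}$, and the function names `adf_stat` and `sadf` as well as the simulated series are hypothetical.

```python
import math
import random

def adf_stat(y):
    """t-statistic of pi in dy_t = alpha + pi * y_{t-1} + e_t (assumes lag order k = 0)."""
    x = y[:-1]                                   # lagged levels y_{t-1}
    dy = [b - a for a, b in zip(y[:-1], y[1:])]  # first differences
    n = len(x)
    mx, md = sum(x) / n, sum(dy) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (di - md) for xi, di in zip(x, dy))
    pi_hat = sxy / sxx
    alpha_hat = md - pi_hat * mx
    rss = sum((di - alpha_hat - pi_hat * xi) ** 2 for xi, di in zip(x, dy))
    se = math.sqrt(rss / (n - 2) / sxx)          # standard error of pi_hat
    return pi_hat / se

def sadf(y, r0):
    """Supremum of ADF statistics over samples y[0:end], end = floor(T*r0), ..., T."""
    T = len(y)
    first_end = max(math.floor(T * r0), 4)       # at least 4 points so that n - 2 > 0
    return max(adf_stat(y[:end]) for end in range(first_end, T + 1))

# Hypothetical explosive series, used only to exercise the functions.
random.seed(0)
series = [10.0]
for _ in range(100):
    series.append(1.05 * series[-1] + random.gauss(0.0, 0.1))

full_sample_adf = adf_stat(series)
sadf_value = sadf(series, r0=0.2)
```

Because the full sample is one of the windows in the recursion, the SADF value can never fall below the full-sample ADF statistic. In practice the statistic would be compared with simulated right-tail critical values, which this sketch does not compute.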

The corresponding asymptotic distribution of the SADF test is discussed by Phillips, Shi, and Yu (2014). The asymptotic distribution of the SADF test statistic for the null hypothesis that the true process is a random walk without drift is

$$SADF(r_0) \to \sup_{r_w \in [r_0, 1]} \frac{ r_w \left[ \int_0^{r_w} W\,dW - \tfrac{1}{2} r_w \right] - W(r_w) \int_0^{r_w} W\,dr }{ r_w^{1/2} \left\{ r_w \int_0^{r_w} W^2\,dr - \left[ \int_0^{r_w} W(r)\,dr \right]^2 \right\}^{1/2} }, \tag{3}$$

where W is a Wiener process.

We determine whether the behaviour of the data series is explosive by comparing the SADF statistic of the data series with the asymptotic distribution in Equation 3. We perform a backward ADF (BADF) test to date-stamp the explosion. The BADF test performs the ADF test repeatedly by fixing the starting point of the sample at the first observation and rolling the ending point from $\lfloor T r_0 \rfloor$ to $T$. For example, if the testing sequence starts from $r_1$ ($r_1 = 0$ in the BADF test) and ends at $r_2$, the corresponding BADF test statistic would be $BADF_{r_1}^{r_2}$. The BADF test statistic is denoted by $BADF_{r_2}$ because the first observation is always the starting point.

The explosion originates at $\lfloor T r_e \rfloor$, the first observation at which the $BADF_{r_e}$ statistic exceeds the critical value. Phillips, Wu, and Yu (2011) impose the conditions that the bubble duration must be longer than $\log(T)$ and that the termination date of the explosion, $\lfloor T r_f \rfloor$, must be the first observation after $\lfloor T r_e \rfloor + \log(T)$ at which the $BADF_{r_f}$ statistic is below the critical value.

$$r_e = \inf_{r_2 \in [r_0, 1]} \left\{ r_2 : BADF_{r_2} > cv_{r_2}^{\beta_T} \right\} \tag{4a}$$
$$r_f = \inf_{r_2 \in [r_e + \log(T)/T,\, 1]} \left\{ r_2 : BADF_{r_2} < cv_{r_2}^{\beta_T} \right\} \tag{4b}$$

Phillips et al. (2011) further recommended that the critical value $cv_{r_2}^{\beta_T}$ should vary with the number of observations in the testing window, so that it diverges to infinity and eliminates type I errors for large $T$; specifically, they suggested setting $cv_{r_2}^{\beta_T} = \log(\log(T r_s))/100$. The test using the bubble date-stamping method of Equations 4a and 4b is referred to as the PWY test in this paper.

The disadvantage of the PWY test is that it may fail when multiple bubbles are present in the sample. Phillips et al. (2015) proposed a generalised version of the SADF test (i.e., the GSADF test) to address this issue; it is more flexible because it allows the starting point of the testing window to change. The GSADF statistic is defined as

$$GSADF(r_0) = \sup_{r_w \in [r_0, 1]} \left\{ \sup_{r_1 \in [0,\, 1 - r_w]} ADF_{r_1}^{r_w} \right\}.$$

The asymptotic distribution of the GSADF test is elaborated by Phillips et al. (2014).

The date-stamping method used in the GSADF test is an extended version of the one based on the BADF statistic. The BSADF test performs an SADF test by rolling the starting point of the test window, $r_1 \in [0, r_2 - r_0]$, backwards from observation $\lfloor T(r_2 - r_0) \rfloor$ to the first observation. The BSADF statistic for a testing sequence that starts at $r_1$ and ends at $r_2$ is defined as

$$BSADF_{r_2} = \sup_{r_1 \in [0,\, r_2 - r_0]} BADF_{r_1}^{r_2}.$$
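As an illustration, the backward supremum for one ending point can be sketched as follows. The sketch assumes a lag order of $k = 0$ and measures windows in observations rather than fractions; the helper names `adf_stat` and `bsadf`, the minimum window of 20 observations, and the simulated series are hypothetical choices, not the paper's implementation. The function also returns the starting index that attains the supremum, because the modified date-stamping rule in Equation 6a needs it.

```python
import math
import random

def adf_stat(y):
    """t-statistic of pi in dy_t = alpha + pi * y_{t-1} + e_t (assumes lag order k = 0)."""
    x = y[:-1]
    dy = [b - a for a, b in zip(y[:-1], y[1:])]
    n = len(x)
    mx, md = sum(x) / n, sum(dy) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (di - md) for xi, di in zip(x, dy))
    pi_hat = sxy / sxx
    alpha_hat = md - pi_hat * mx
    rss = sum((di - alpha_hat - pi_hat * xi) ** 2 for xi, di in zip(x, dy))
    return pi_hat / math.sqrt(rss / (n - 2) / sxx)

def bsadf(y, end, min_win):
    """Sup of ADF statistics over windows y[start:end] with at least min_win points.

    Returns (statistic, start index attaining the supremum).
    """
    best, best_start = -math.inf, None
    for start in range(0, end - min_win + 1):
        stat = adf_stat(y[start:end])
        if stat > best:
            best, best_start = stat, start
    return best, best_start

# Hypothetical series: a random walk that turns explosive after observation 80.
random.seed(1)
series = [100.0]
for t in range(120):
    growth = 1.04 if t >= 80 else 1.0
    series.append(growth * series[-1] + random.gauss(0.0, 1.0))

last_stat, last_start = bsadf(series, end=len(series), min_win=20)
# The GSADF statistic is the largest BSADF value over all ending points.
gsadf = max(bsadf(series, end, 20)[0] for end in range(20, len(series) + 1))
```

The triple loop is cubic in the sample size, which is acceptable for monthly data of a few hundred observations but would need a faster recursion for long daily series.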

Similar to the PWY test, the origination date of a bubble in the GSADF test is $\lfloor T\hat{r}_e \rfloor$, the first observation at which the $BSADF_{\hat{r}_e}$ statistic exceeds the critical value. The minimum bubble duration in the BSADF statistic is generalised to $\delta \log(T)$, where $\delta$ is a frequency-dependent parameter. Furthermore, the termination date of the explosion, $\lfloor T\hat{r}_f \rfloor$, is the first observation after $\lfloor T\hat{r}_e \rfloor + \delta\log(T)$ at which the $BSADF_{r_2}$ statistic is below the critical value. The bubble date-stamping method of Equations 5a and 5b is referred to as the PSY test in this paper.

$$\hat{r}_e = \inf_{r_2 \in [r_0, 1]} \left\{ r_2 : BSADF_{r_2} > cv_{r_2}^{\beta_T} \right\} \tag{5a}$$
$$\hat{r}_f = \inf_{r_2 \in [\hat{r}_e + \delta\log(T)/T,\, 1]} \left\{ r_2 : BSADF_{r_2} < cv_{r_2}^{\beta_T} \right\} \tag{5b}$$

The bubble origination date in the GSADF test is date-stamped using Equation 5a, and it corresponds to the ending point $r_2$ of the testing window in the BSADF statistic. In other words, the PSY date-stamping method picks the ending point of an explosive series as the bubble origination date. Although this gives confidence in forward tests, it might miss the bubble formation period, which is crucial for a backtest. Instead, we take the starting point $r_1$, rather than the ending point $r_2$, of the explosive series as the bubble origination date, and modify the date-stamping method accordingly, as in Equation 6a. The new bubble origination date is $\lfloor T\bar{r}_e \rfloor$. In Equation 6a, $\bar{r}_2$ is the end of the testing sequence in the BSADF test for bubble origination. We use this ending point as the starting point for detecting the termination date of the explosion, $\lfloor T\bar{r}_f \rfloor$, in Equation 6b. The minimum length of the bubble duration is therefore the size of the testing window whose starting point $r_1$ is the bubble origination date. In this way, we relax the minimum bubble duration constraint $\delta\log(T)$ that the PSY test imposes to obtain the bubble termination date. The differences between the PSY date-stamping method and the modified method are presented in Section 5.1. This new date-stamping method, presented in Equations 6a and 6b, is used throughout the study.

$$(\bar{r}_e;\, \bar{r}_2) = \inf_{r_2 \in [r_0, 1]} \left\{ (r_1;\, r_2) : BSADF_{r_2} > cv_{r_2}^{\beta_T} \right\} \tag{6a}$$
$$\bar{r}_f = \inf_{r_2 \in [\bar{r}_2,\, 1]} \left\{ r_2 : BSADF_{r_2} < cv_{r_2}^{\beta_T} \right\} \tag{6b}$$
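The difference between the PSY rule (Equations 5a and 5b) and the modified rule (Equations 6a and 6b) can be made concrete with a small sketch. It assumes that the BSADF sequence, the window starting point attaining each supremum, and the critical values have already been computed; the function names and the toy numbers below are hypothetical and serve only to show how the two rules read the same sequence differently.

```python
def stamp_psy(ends, stats, cvs, min_dur):
    """PSY-style rule: origination is the first END point whose BSADF exceeds its
    critical value; termination is the first end point at least min_dur later
    whose BSADF is below its critical value."""
    for i, (end, stat, cv) in enumerate(zip(ends, stats, cvs)):
        if stat > cv:
            for end2, stat2, cv2 in zip(ends[i:], stats[i:], cvs[i:]):
                if end2 >= end + min_dur and stat2 < cv2:
                    return end, end2
            return end, None          # bubble still running at sample end
    return None, None                 # no explosive episode found

def stamp_modified(ends, starts, stats, cvs):
    """Modified rule: origination is the START point r1 of the first window whose
    BSADF exceeds its critical value; the termination search begins at that
    window's end point, with no separate minimum-duration constraint."""
    for i, (stat, cv) in enumerate(zip(stats, cvs)):
        if stat > cv:
            for end2, stat2, cv2 in zip(ends[i:], stats[i:], cvs[i:]):
                if stat2 < cv2:
                    return starts[i], end2
            return starts[i], None
    return None, None

# Toy inputs (observation indices and statistics are made up for illustration).
ends = [10, 11, 12, 13, 14, 15, 16, 17]           # window end points
starts = [0, 0, 4, 4, 5, 6, 6, 7]                 # start attaining each supremum
stats = [0.5, 0.8, 1.6, 2.0, 1.9, 0.9, 0.4, 0.3]  # BSADF values
cvs = [1.5] * len(ends)                           # critical values

psy_dates = stamp_psy(ends, stats, cvs, min_dur=2)
modified_dates = stamp_modified(ends, starts, stats, cvs)
```

In this toy case both rules end the bubble at observation 15, but the PSY rule dates its origin at observation 12 while the modified rule dates it at observation 4, the start of the explosive window.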

4. VaR models and backtests

We compute daily 1% VaRs using the RiskMetrics (Riskmetrics, 1996) and RiskMetrics 2006 (Zumbach, 2007) models to compare the performance of VaR in the pre-bubble, bubble, and post-burst periods. The RiskMetrics model assumes that the asset returns $x_t$ are normally distributed with mean $\mu_t$ and variance $\sigma_t^2$. Using the standard normal cumulative distribution function $\Phi(x_t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x_t} e^{-y^2/2}\,dy$ and the cumulative distribution function of the asset return $F(x_t; \mu_t, \sigma_t) = \Phi((x_t - \mu_t)/\sigma_t)$, the $\alpha$ percent one-day VaR can be obtained using Equations 7a to 7c. The estimated values of $\mu_t$ and $\sigma_t$ are computed from an estimation window of size $W_E$, which is set to 250 days in this study. Because Student's t-distribution is another prominent distribution for describing financial asset returns, we also study a variant of the original RiskMetrics model that uses the inverse cumulative t-distribution $\Gamma_v^{-1}(1-\alpha)$ with $v$ degrees of freedom to compute the $\alpha$ percent one-day VaR, as shown in Equation 8.

$$\alpha = F(-VaR_\alpha) \tag{7a}$$
$$-VaR_\alpha = \mu_t + \Phi^{-1}(\alpha)\,\sigma_t \tag{7b}$$
$$VaR_\alpha = \Phi^{-1}(1-\alpha)\,\sigma_t - \mu_t \tag{7c}$$
$$VaR_\alpha = \Gamma_v^{-1}(1-\alpha)\,\sqrt{\frac{v-2}{v}}\,\sigma_t - \mu_t \tag{8}$$
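Equation 7c translates directly into code. The following is a minimal sketch for the normal-distribution variant only, using Python's standard library; the function name `riskmetrics_var` and the artificial return window are assumptions made for illustration, and the t-distribution variant of Equation 8 is not covered because it would need a t quantile function from an external library.

```python
from statistics import NormalDist, fmean, pstdev

def riskmetrics_var(returns, alpha=0.01):
    """One-day VaR from Equation 7c: VaR = Phi^{-1}(1 - alpha) * sigma - mu.

    mu and sigma are the mean and the square root of the second central moment
    of the returns in the estimation window.
    """
    mu = fmean(returns)
    sigma = pstdev(returns, mu)   # population standard deviation around mu
    return NormalDist().inv_cdf(1.0 - alpha) * sigma - mu

# Artificial 250-day window with mean 0 and standard deviation 0.01.
window = [0.01, -0.01] * 125
var_1pct = riskmetrics_var(window)   # roughly 0.0233, i.e. a 2.33% loss
```

The result is reported as a positive loss, so a violation occurs when the realised return falls to or below the negative of this number, which matches the sign convention of Equations 7a to 7c.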

RiskMetrics estimates the volatility $\sigma_t$ using the second central moment of the asset returns $x_t$. In contrast, RiskMetrics 2006 allocates a heavier weight to recent observations while preserving the long-lasting impact of shocks. To this end, RiskMetrics 2006 introduces a hyperbolic decay factor $h_k = \exp(-1/\tau_k)$, based on the geometric time horizon factor $\tau_k$ defined in Equation 9.

$$\tau_k = \tau_1 \rho^{k-1} \tag{9}$$

Here $\rho$ is an operationalised parameter. RiskMetrics 2006 obtains the volatility $\sigma_{t+1}^2$ using Equation 10, by summing the $K$ historical volatilities $\sigma_{k,t}^2$ (defined in Equation 11a) with the logarithmic decay weights $w_k$ given by Equation 11b.

$$\sigma_{t+1}^2 = \sum_{k=1}^{K} w_k \sigma_{k,t}^2 \tag{10}$$
$$\sigma_{k,t}^2 = h_k \sigma_{k,t-1}^2 + (1 - h_k)(x_t - \mu_t)^2 \tag{11a}$$
$$w_k = \frac{1}{C}\left(1 - \frac{\ln(\tau_k)}{\ln(\tau_0)}\right) \tag{11b}$$
$$C = K - \sum_{k=1}^{K} \frac{\ln(\tau_k)}{\ln(\tau_0)} \tag{11c}$$

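where $C$ is a normalising constant that ensures that the weights $w_k$ sum to 1 (i.e., $\sum_{k} w_k = 1$). The long memory RiskMetrics 2006 model is controlled by three parameters: the logarithmic decay factor $\tau_0$, lower cut-off $\tau_1$, and upper cut-off $\tau_K$. We set $\rho = \sqrt{2}$, $\tau_0$ = 1,560 days, $\tau_1$ = 4 days, $\tau_K$ = 512 days, and $K$ = 15 following Zumbach (2007). These values are mutually consistent under Equation 9, since $\tau_K = \tau_1 \rho^{K-1} = 4 \cdot (\sqrt{2})^{14} = 512$.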
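Equations 9 to 11c can be checked numerically with a short sketch. It assumes $\rho = \sqrt{2}$, the value for which $\tau_1 \rho^{K-1}$ reaches the stated upper cut-off of 512 days with $K = 15$. The function names and the initialisation of the component variances (each set to the mean squared deviation of the input) are hypothetical choices, since the text does not specify how the recursion in Equation 11a is started.

```python
import math

def rm2006_weights(tau0=1560.0, tau1=4.0, rho=math.sqrt(2.0), K=15):
    """Time horizons (Eq. 9), decay factors h_k, and weights w_k (Eqs. 11b, 11c)."""
    taus = [tau1 * rho ** (k - 1) for k in range(1, K + 1)]
    hs = [math.exp(-1.0 / tau) for tau in taus]
    C = K - sum(math.log(tau) / math.log(tau0) for tau in taus)
    ws = [(1.0 - math.log(tau) / math.log(tau0)) / C for tau in taus]
    return taus, hs, ws

def rm2006_variance(returns, mu=0.0):
    """One-step-ahead variance forecast from Equations 10 and 11a."""
    _, hs, ws = rm2006_weights()
    # Assumed initialisation: every component starts at the mean squared deviation.
    init = sum((x - mu) ** 2 for x in returns) / len(returns)
    comps = [init] * len(hs)
    for x in returns:
        comps = [h * s + (1.0 - h) * (x - mu) ** 2 for h, s in zip(hs, comps)]
    return sum(w * s for w, s in zip(ws, comps))

taus, hs, ws = rm2006_weights()
forecast = rm2006_variance([0.01, -0.01] * 125)   # constant squared returns
```

With these parameters the longest horizon is 512 days, the weights are positive and sum to one, and a series with constant squared returns of $10^{-4}$ reproduces a variance forecast of $10^{-4}$, which is a useful sanity check on the recursion.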

We define $T$ as the total number of observations in the data set, $W_E$ as the size of the estimation window, and $W_T$ as the size of the testing window for VaR violations. A VaR violation ($x_t = 1$) is recorded when the loss on trading day $t$ exceeds the calculated VaR value. The total number of VaR violations $\nu_1$ in the testing period $W_T$ is calculated using Equation 12; $\nu_0$, calculated using Equation 13, indicates the number of days without violations.

$$W_E + W_T = T$$
$$x_t = \begin{cases} 1 & \text{if } y_t \le -VaR_t, \\ 0 & \text{if } y_t > -VaR_t \end{cases}$$
$$\nu_1 = \sum_{t} x_t \tag{12}$$
$$\nu_0 = W_T - \nu_1 \tag{13}$$
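The violation count in Equations 12 and 13 amounts to a simple comparison of each day's return with the negative of that day's VaR forecast. The sketch below assumes that VaR is expressed as a positive loss; the function name and the sample numbers are hypothetical.

```python
def count_violations(returns, var_forecasts):
    """Return the hit sequence x_t, the number of violations nu1 (Eq. 12),
    and the number of days without violations nu0 (Eq. 13)."""
    hits = [1 if y <= -var else 0 for y, var in zip(returns, var_forecasts)]
    nu1 = sum(hits)
    nu0 = len(hits) - nu1
    return hits, nu1, nu0

# Made-up daily returns and positive VaR forecasts for a five-day testing window.
daily_returns = [0.004, -0.031, 0.010, -0.012, -0.026]
daily_var = [0.023, 0.023, 0.025, 0.025, 0.024]

hits, nu1, nu0 = count_violations(daily_returns, daily_var)
```

Here the second and fifth days breach the forecast, so the hit sequence is `[0, 1, 0, 0, 1]`. The hit sequence itself, not just the count, is worth keeping, because the independence tests depend on the timing of the violations.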

We employ three categories of backtests to test the accuracy of the VaR models: unconditional, conditional, and joint coverage tests. Unconditional coverage tests evaluate the VaR models by testing the number of violations at a given confidence level. For the unconditional coverage tests, we employ the proportion of failures (POF) test and the time until first failure (TUFF) test of Kupiec (1995). The POF test is a likelihood ratio test that examines whether the VaR failure rate (the number of violations divided by the number of observations) is consistent with the VaR confidence level. The TUFF test assesses the time taken for the first violation to occur in a sample. Both tests have the null hypothesis that the violation rate equals the significance level $\alpha$. Beyond the frequency of the violations (unconditional coverage), the independence of the violations (conditional coverage) is considered equally important when assessing a VaR model. Ideally, the probabilities of VaR violations at times $t$ and $t-1$ should show no dependence. Christoffersen and Pelletier (2004) developed an interval forecast test that uses a binary first-order Markov chain and a transition probability matrix to test the independence of the violations. However, Christoffersen's independence test is limited to assessing dependence between two consecutive trading days. The mixed-Kupiec independence test (Hass, 2001) overcomes this limitation by considering the timing of the violations without relying on a first-order Markov chain. Both conditional coverage tests have the null hypothesis that there is no violation clustering (consecutive violations).
Furthermore, we employ Christoffersen's joint test (a likelihood ratio test that combines the POF test and Christoffersen's independence test) and the mixed-Kupiec joint test (a likelihood ratio test that combines the TUFF test and the mixed-Kupiec independence test) to test the joint hypothesis that the true violation rate is $\alpha$ and the violations are independent. The details of the coverage tests are provided in Appendix A.
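To illustrate the unconditional coverage idea, the POF test can be sketched as a binomial likelihood ratio. The formula used here is the standard textbook form of Kupiec's statistic, not a transcription of Appendix A, and the function name is hypothetical. The p-value uses the fact that the survival function of a chi-squared variable with one degree of freedom equals $\mathrm{erfc}(\sqrt{x/2})$.

```python
import math

def kupiec_pof(n_obs, n_viol, alpha=0.01):
    """Likelihood ratio statistic and p-value for H0: violation rate = alpha."""
    def loglik(p):
        # Binomial log-likelihood with the convention 0 * log(0) = 0.
        ll = 0.0
        if n_viol > 0:
            ll += n_viol * math.log(p)
        if n_obs - n_viol > 0:
            ll += (n_obs - n_viol) * math.log(1.0 - p)
        return ll

    p_hat = n_viol / n_obs
    lr = max(-2.0 * (loglik(alpha) - loglik(p_hat)), 0.0)
    p_value = math.erfc(math.sqrt(lr / 2.0))   # chi-squared(1) survival function
    return lr, p_value

lr_ok, p_ok = kupiec_pof(500, 5)      # exactly the expected 1% of violations
lr_bad, p_bad = kupiec_pof(500, 15)   # three times the expected number
```

With 5 violations in 500 days the statistic is zero and the null is not rejected, whereas 15 violations give a statistic of about 13.2, well above the 5% critical value of 3.84. Note that the test is two-sided: too few violations, as reported for the post-burst periods, can also lead to rejection when the sample is long enough.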

4.1. Fisher’s exact test

We perform Fisher's exact test (Fisher, 1922) to examine the significance of the association between the VaR failure rates in different crisis periods. We use Fisher's exact test because the number of VaR violations is small (a 1% VaR implies an expected 5 violations in 500 observations). We use a 2×2 contingency table to represent the number of VaR violations in different periods. For instance, in Table 1, for $T$ days in period A, the VaR measure performed well for $v_{0A}$ days but failed for $v_{1A}$ ($= T - v_{0A}$) days. Under the null hypothesis, the VaR failures in different periods are stochastically independent; that is, the VaR failure rates show no significant difference between the crisis and non-crisis periods. The alternative hypothesis is that the VaR failure rates differ systematically between periods, so that the observed difference is not coincidental.

Table 1. Contingency table of Fisher's exact test for VaR measures in different periods
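A two-sided Fisher's exact test on such a 2×2 table can be sketched with the hypergeometric distribution. The sketch uses one common two-sided convention (summing the probabilities of all tables that are no more likely than the observed one), which is an assumption and may differ from the convention used for the reported results; the function name and the example counts are hypothetical.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided p-value for the table [[a, b], [c, d]].

    Rows are periods A and B; columns are days with and without a VaR violation.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x violations falling in period A.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at most as likely as the observed one (small float tolerance).
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Made-up counts: 500 days per period, violations listed first.
p_same = fisher_exact_2x2(5, 495, 5, 495)      # identical failure rates
p_diff = fisher_exact_2x2(30, 470, 5, 495)     # very different failure rates
p_small = fisher_exact_2x2(3, 1, 1, 3)         # small table, easy to check by hand
```

Identical failure rates give a p-value of 1, while 30 versus 5 violations in 500 days each gives a p-value far below 5%, which would indicate that the VaR model behaves differently in the two periods.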

5. Data and empirical results

The nine stock markets employed in our empirical analysis represent both developed and emerging markets. The selection was based on market capitalisation and trading time zones, as shown in Table 2.

Table 2. Stock exchanges and the respective indices used

Monthly price and dividend data were used in the bubble tests and daily price data were used in the VaR backtests. The data were obtained from Bloomberg.

5.1. Pre-bubble, bubble, and post-burst periods

Table 3 shows the GSADF tests of the log price–dividend ratio and log dividend difference series of the nine markets. The finite-sample critical values used in the GSADF tests were obtained from 2,000 simulations. We followed Phillips et al. (2015) in determining the minimum window size of the test based on the rule $r_0 = 0.01 + 1.8/\sqrt{T}$, where $T$ is the corresponding sample size. The GSADF test statistics of all the log price–dividend ratios exceeded their corresponding 10% right-tail critical values, indicating explosive behaviour in the log price–dividend ratio, whereas the log dividend difference series showed non-explosive behaviour. This provided evidence for explosive sub-periods in the nine markets.

Table 3. GSADF tests for the log price–dividend ratio and log dividend difference in the nine markets

We performed both the PSY test and the modified test to date-stamp the origination and termination of the bubbles using Equations 5 and 6, respectively. Table 4 shows the bubble date-stamping results from these methods in the nine stock markets. With the PSY date-stamping method, the estimated bubble origination dates were close to the termination dates in most cases and the bubble durations were short; the average duration of the bubble period was 7 months. Although all the tests identified asset price bubbles around the subprime mortgage crisis, they exhibited a time lag. For the stock markets in Australia, Canada, Germany, Spain, Japan, the UK, and the US, the PSY method suggested that the bubbles originated in late 2008 and terminated in mid-2009. These origination dates were close to the actual bubble burst, that is, when Lehman Brothers filed for the largest bankruptcy protection in US history on 15 September 2008. The short durations of the asset price bubbles indicated by the PSY test run contrary to the general agreement that a bubble takes time to form. The reason is the date-stamping strategy, which chooses the ending point of the testing window when the whole sample is explosive. We believe that a more appropriate choice for the bubble origination date would be the starting point of the testing window, instead of the ending point.

Table 4. Bubble periods obtained by the PSY test and the modified date-stamping method

The bubble origination dates identified by the modified method are from late 2002 to mid-2005 and the termination dates are from early 2009. Our results agree with those from previous studies (Brown, Stein, & Zafar, 2015; Demyanyk & Van Hemert, 2011; Lewis, 2009; Mian & Sufi, 2009; Sanders, 2008).

To illustrate the differences between the two date-stamping methods, we used the Hong Kong stock market as an example. The PSY approach identified a five-month bubble from November 2007 to March 2008; this is illustrated in Figure 1, with the bubble period highlighted in grey. In contrast, the bubble period identified by the modified method was from April 2004 to October 2007, as shown in Figure 1. Unlike the PSY approach, the modified method did not indicate shorter durations and agreed well with the asset price bubble cycles of the subprime mortgage crisis (see Dell'ariccia, Igan, & Laeven, 2012; Lewis, 2009; Tridico, 2012).

Figure 1. Bubble identification results for the Hong Kong stock market.


5.2. Backtesting results

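We defined the pre-bubble period as the 2 years before the bubble and the post-burst period as the 2 years after it.

$$\text{pre-bubble period}: \left[\lfloor T\bar{r}_e \rfloor - p,\; \lfloor T\bar{r}_e \rfloor\right] \tag{14}$$
$$\text{post-burst period}: \left[\lfloor T\bar{r}_f \rfloor,\; \lfloor T\bar{r}_f \rfloor + p\right] \tag{15}$$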

Here $p$ is the time frequency-dependent parameter; $p = 24$ for the monthly data and $p = 500$ for the daily data (we assume 250 trading days per year). Taking Hong Kong as an example, for a bubble period identified from September 2004 to October 2007, the pre-bubble period extended from September 2002 to August 2004 and the post-burst period from November 2007 to October 2009. Further, we define the periods not covered by the pre-bubble, bubble, and post-burst periods as "normal periods". For example, two normal periods, September 1993 to August 2002 and November 2009 to October 2019, were defined from the entire period considered for the Hong Kong market (September 1993 to October 2019). Table 5 shows the bubble periods identified for the nine stock markets. One bubble period each was identified for Australia, Canada, China, Germany, Hong Kong, Japan, and the UK, and two each for Spain and the US. The additional bubble period in Spain and the US could be due to the extent of the data available: Spain's data start from January 1991 and the US data from February 1978.
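The partition of the sample into pre-bubble, bubble, post-burst, and normal periods can be sketched as follows. The sketch works on observation indices rather than calendar dates, treats each period as a pair of end points in the spirit of Equations 14 and 15, and clips the windows at the sample boundaries; the function name and the example indices are hypothetical.

```python
def split_periods(T, origin, termination, p):
    """Split observations 0..T into the periods used in the backtests.

    origin and termination are the bubble's date-stamped indices; p is the
    window length (24 for monthly data, 500 for daily data in the text).
    """
    pre_bubble = (max(origin - p, 0), origin)
    bubble = (origin, termination)
    post_burst = (termination, min(termination + p, T))
    # Normal periods are whatever remains before and after these windows.
    candidates = [(0, pre_bubble[0]), (post_burst[1], T)]
    normal = [(start, end) for start, end in candidates if end > start]
    return {
        "pre_bubble": pre_bubble,
        "bubble": bubble,
        "post_burst": post_burst,
        "normal": normal,
    }

# Made-up monthly example: 300 observations, bubble from index 100 to index 150.
periods = split_periods(T=300, origin=100, termination=150, p=24)
```

In this example the two normal periods are the index ranges before 76 and after 174. A market with two bubbles, as found for Spain and the US, would need the function applied once per bubble and the normal periods intersected, which this sketch does not do.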

Table 5. Full, pre-bubble, bubble, post-burst, and normal periods for the bubbles examined in the nine stock markets

Tables 6 and 7 show the results of the RiskMetrics VaR backtests, and Tables 8 and 9 show the results of RiskMetrics 2006 in different periods. In all the hypothesis tests, we tested the null hypotheses at a significance level of 5%. For the periods in which no VaR violation was observed, we have not reported the values for the independence tests, as the results of the violation clustering test were ambiguous (see, for example, the independence test and joint test results in the normal period for the UK (period 1)). In the coverage tests, a case is deemed a "failure" only if the null hypotheses are rejected in both coverage tests. An analogous criterion was adopted for the independence tests and joint tests.

Table 6. Backtest results of RiskMetrics VaR in the pre-bubble, bubble, and post-burst periods

Table 7. Backtest results of RiskMetrics VaR in the normal and full periods

Table 8. Backtest results of RiskMetrics 2006 VaR in the pre-bubble, bubble, and post-burst periods

Table 9. Backtest results of RiskMetrics 2006 VaR in the full and normal periods

Table 6 shows that RiskMetrics performed poorly around turbulent financial times. The RiskMetrics models with the normal distribution and the t-distribution produced analogous results. Minor differences were found in Canada and the UK, but these had no impact on the outcomes of the hypothesis tests. With the large estimation window used in the RiskMetrics model, the fitted t-distribution approaches the normal distribution. Further, RiskMetrics performed effectively only in the coverage tests for the pre-bubble period. We could not reject the null hypothesis that the probability of a violation was the same as the coverage rate for the Australian, Canadian, Chinese, Spanish, Hong Kong, Japanese, and US markets. The exceptions were the German market (the p-value was 0.028 in the TUFF test) and the UK market (the p-value was 0.003 in the POF test). Violation clustering was also noted in Canada, Germany, the UK, and the US, which led to the rejection of the null hypotheses in these markets in all the joint tests.

RiskMetrics performed inadequately in the bubble period as well. The null hypotheses of the POF coverage tests were rejected in all nine markets. Although Christoffersen's independence test showed no significant consecutive violations, the violation clustering was severe, leading to the null hypotheses being rejected by the mixed-Kupiec independence test and all the joint tests. However, RiskMetrics performed noticeably better in the post-burst period, as the null hypotheses were rejected only for China, Hong Kong, and Japan. The coverage tests (with a p-value of 0.001 in the POF test and 0.011 in the TUFF test) showed that the number of violations in China was misspecified. The independence tests showed that the Japanese market exhibited violation clusters (with a p-value of 0.035 in the Christoffersen test and 0.005 in the mixed-Kupiec test). In the post-burst period, RiskMetrics underestimated the downside risk in China, Hong Kong, and Japan. Further, Table 7 shows that RiskMetrics presented inaccurate downside risk measures for both the normal and full periods. Similar to the results for the bubble period, the null hypotheses were rejected in most of the POF coverage tests, independence tests, and joint tests. We found that RiskMetrics failed to reveal the downside risk in most cases.

The results in provide convincing evidence that RiskMetrics 2006 works efficiently in a period of market turbulence. In the pre-bubble period, we could not reject the null hypotheses in any of the tests, except for China and the US. In China, the null hypothesis was rejected by both coverage tests (the p-value was 0.002 in the POF test and 0.033 in the TUFF test). In the US, both independence tests (period 1) rejected the null hypothesis (with p-values of 0.034 and 0.020). Further, RiskMetrics 2006 performed efficiently in the bubble period. It failed only in the coverage tests and joint tests for the US market in period 2, defined in . However, the long memory RiskMetrics 2006 behaved conservatively in the post-burst period, in that no violation was found in Australia, Canada, Japan, the UK, and the US (period 2). In the normal period, RiskMetrics 2006 performed efficiently, failing only in period 2 of the independence tests for Australia (with a p-value of 0.008 in the Christoffersen test and 0.004 in the mixed-Kupiec test), Japan (with a p-value of 0.008 in the Christoffersen test and 0.008 in the mixed-Kupiec test), and the US (with a p-value of 0.002 in the Christoffersen test and 0.000 in the mixed-Kupiec test).

RiskMetrics 2006 behaved conservatively, overestimating the downside risk after the bubbles. However, over the total duration considered, the numbers of VaR violations were 34, 57, 49, 23, 45, 43, 43, 25, and 77 in Australia, Canada, China, Germany, Spain, Hong Kong, Japan, the UK, and the US, respectively. These violation numbers were substantially lower than those of the RiskMetrics model, which were 100, 172, 112, 129, 139, 140, 146, 108, and 212, respectively, as shown in . Although RiskMetrics 2006 may require financial institutions to allocate more capital than necessary to manage risks, it performed efficiently in most periods.

5.3. VaR performance in pre-bubble, bubble, and post-burst periods

We conducted Fisher’s exact tests for both the RiskMetrics and RiskMetrics 2006 models to gain further insight into their performance in different crisis periods. We performed 10 Fisher’s exact tests, one for each pair of periods, to compare the VaR performance in each of the nine markets during the full, pre-bubble, bubble, post-burst, and normal periods. The results are shown in Table 10, which indicates that, unlike RiskMetrics, RiskMetrics 2006 performs consistently across periods in all nine markets. In each panel, the lower triangular matrix contains the p-values of the RiskMetrics tests, while the upper triangular matrix contains the p-values of the RiskMetrics 2006 tests. A Fisher test p-value greater than 5% indicates no statistically significant difference in the VaR failure rates between the two chosen periods. Panels 10(a), 10(g), and 10(h) show that the performance of RiskMetrics VaR differed in the bubble periods: the null hypotheses were rejected in the tests between the bubble period and the pre-bubble, post-burst, and full periods. After combining the results in , the performance of RiskMetrics was found to be inconsistent across periods, with the frequency of failure being higher in the bubble periods. Panel 10(i) shows that RiskMetrics behaved considerably differently in the US post-burst period than in the other periods. Comparing this with the results in , we observe that the RiskMetrics model failed in all the periods in the US except for the post-burst period. Similarly, panel 10(c) shows that it performed well only in the pre-bubble period in China. The overall results suggest a significant difference in RiskMetrics performance across countries and crisis periods.

Table 10. Fisher exact test for the significance level of the difference of VaR failure rate by time period

Contrary to RiskMetrics, RiskMetrics 2006 did not show extensive performance differences across periods. The VaR performance seen in panels 10(d), 10(e), 10(f), and 10(i) shows no significant difference, whereas occasional inconsistencies are observed in panels 10(b), 10(g), and 10(h). Specifically, the inconsistency in these three panels was between the bubble and the post-burst periods. The Fisher’s test results support our finding in section 5.2 that, although RiskMetrics 2006 occasionally overstated the downside risk in post-burst periods, it performed efficiently in most periods.
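The period-by-period comparison described in this section can be illustrated with a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of violation and non-violation counts for two periods. The function name and the counts used below are hypothetical and are not taken from Table 10; the two-sided p-value convention (summing all tables no more likely than the observed one) is an assumption, as the paper does not state which convention it used.

```python
import math


def fisher_exact_two_sided(viol_a, ok_a, viol_b, ok_b):
    """Two-sided Fisher exact p-value for the 2x2 table
    [[viol_a, ok_a], [viol_b, ok_b]] (violations vs. non-violations
    in two periods). Hypothetical helper, standard library only."""
    row_a, row_b = viol_a + ok_a, viol_b + ok_b
    total_viol = viol_a + viol_b
    denom = math.comb(row_a + row_b, total_viol)

    def prob(k):
        # Hypergeometric probability of k violations falling in period A.
        return math.comb(row_a, k) * math.comb(row_b, total_viol - k) / denom

    p_obs = prob(viol_a)
    k_min = max(0, total_viol - row_b)
    k_max = min(row_a, total_viol)
    # Sum the probabilities of all tables at most as likely as the observed one.
    p_value = sum(pk for pk in map(prob, range(k_min, k_max + 1))
                  if pk <= p_obs * (1 + 1e-7))
    return min(1.0, p_value)


# Hypothetical counts: 40 violations in 1,000 days vs. 10 in 1,000 days.
p_diff = fisher_exact_two_sided(40, 960, 10, 990)
# Identical failure rates in both periods.
p_same = fisher_exact_two_sided(10, 990, 10, 990)
```

Under the 5% threshold used in this section, a p-value such as `p_diff` would indicate different failure rates between the two periods, whereas `p_same` would not.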

6. Summary and conclusions

In this study, we conducted empirical tests to respond to the criticism that VaR models fail during financial crises when asset bubbles burst. Our empirical tests on the nine markets, namely Australia, Canada, China, Germany, Spain, Hong Kong, Japan, the UK, and the US, were conducted using data from the earliest available dates to October 2019. Asset price bubbles were date-stamped using the modified PSY tests on the log price-dividend ratios of these nine markets. However, as the date-stamping method of the original PSY test is forward-looking, it selects the ending point of the testing window as the bubble origination date. To suit our need for a backward-looking date-stamping method for the backtests, we modified the date-stamping method by choosing the starting point of the testing window. Our results show that the original date-stamping method has a time lag and indicates a shorter duration for the bubble period. It date-stamps the bubble start near the end of a true bubble, and the average duration of the bubble periods detected is 8 months. The results also show that our modified method addresses both the time lag and the shorter duration issues present in the original date-stamping method and is thus more suitable for backtesting.

The empirical test results of the nine backtests allowed us to draw two main conclusions. First, the RiskMetrics 2006 model outperforms the RiskMetrics model. Specifically, the former works well in pre-bubble, bubble, and normal periods but behaves conservatively in the post-burst period, which may overestimate the downside risk. Second, the criticism that all VaR models fail in crisis or bubble periods is statistically invalid. The power of VaR is affected by the modelling practices adopted by different financial institutions.

An interesting direction for future research would be to examine parametric and non-parametric VaR approaches. Non-parametric approaches comprise historical and Monte Carlo simulations, whereas parametric approaches include GARCH, GJR-GARCH (Glosten, Jagannathan, & Runkle, 1993), and the multivariate DCC (Engle, 2002) approach.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Raymond Kwong

Raymond Kwong is a Senior Lecturer in the College of Professional and Continuing Education, the Hong Kong Polytechnic University. He received his PhD in Economics from Newcastle University, United Kingdom. His research focuses on asset price bubbles, market microstructure, price discovery, and risk management in equity markets.

Helen Wong

Helen Wong is a Principal Lecturer in the College of Professional and Continuing Education, the Hong Kong Polytechnic University (HKPU). Her research covers customer relationship building, corporate social responsibility, accounting, and finance. Prior to joining HKPU, Helen worked in several renowned organisations in Hong Kong and Canada, such as PricewaterhouseCoopers, Hong Kong Stock Exchange, University of Toronto, and Bank of Montreal.

References

Appendix A. Technical details of VaR backtests

A.1. Unconditional Coverage Tests

The POF test examines whether the observed failure rate (number of VaR violations) is significantly different from the selected failure rate $p$. The null hypothesis of the POF test is $H_0: p = \hat{p} = \nu_1 / W_T$, and equation A.1 defines the likelihood ratio test statistic of the POF test. Under the null hypothesis, the likelihood ratio test statistic $LR_{POF}$ is asymptotically $\chi^2$-distributed with 1 degree of freedom.

The TUFF test assesses the time it takes for the first violation to occur. It assumes that the first violation occurs in $\nu = 1/p$ days. For the 1% VaR calculation, a violation is expected to occur every 100 days. The null hypothesis of the TUFF test is $H_0: p = \hat{p} = 1/\nu$. Equation A.2 defines the likelihood ratio test statistic of the TUFF test. Similar to the POF test, under the null hypothesis, the likelihood ratio test statistic $LR_{TUFF}$ is asymptotically $\chi^2$-distributed with 1 degree of freedom.

$$ LR_{POF} = -2 \ln\left( \frac{(1-p)^{W_T-\nu_1}\, p^{\nu_1}}{(1-\hat{p})^{W_T-\nu_1}\, \hat{p}^{\nu_1}} \right) \tag{A.1} $$

$$ LR_{TUFF} = -2 \ln\left( \frac{p\,(1-p)^{\nu-1}}{\hat{p}\,(1-\hat{p})^{\nu-1}} \right) \tag{A.2} $$
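As an illustration of equations A.1 and A.2, the following minimal sketch computes both likelihood ratio statistics and their p-values using only the Python standard library. It reads ν1 as the number of violations and WT as the number of days in the testing window, which is an interpretation of the notation rather than something stated in this appendix; the function names and the example figures are hypothetical.

```python
import math


def chi2_sf_1dof(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


def _xlogy(x, y):
    """x * ln(y), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def lr_pof(violations, window, p):
    """Equation A.1: proportion-of-failures likelihood ratio statistic."""
    p_hat = violations / window
    log_null = _xlogy(window - violations, 1 - p) + _xlogy(violations, p)
    log_alt = _xlogy(window - violations, 1 - p_hat) + _xlogy(violations, p_hat)
    return -2.0 * (log_null - log_alt)


def lr_tuff(first_violation_day, p):
    """Equation A.2: time-until-first-failure likelihood ratio statistic."""
    nu = first_violation_day
    p_hat = 1.0 / nu
    log_null = math.log(p) + _xlogy(nu - 1, 1 - p)
    log_alt = math.log(p_hat) + _xlogy(nu - 1, 1 - p_hat)
    return -2.0 * (log_null - log_alt)


# Hypothetical example: 25 violations of a 1% VaR in 1,000 days,
# with the first violation on day 12.
pof_stat = lr_pof(25, 1000, 0.01)
pof_pvalue = chi2_sf_1dof(pof_stat)
tuff_stat = lr_tuff(12, 0.01)
tuff_pvalue = chi2_sf_1dof(tuff_stat)
```

When the observed rate matches the nominal rate exactly (for example, 10 violations in 1,000 days at p = 0.01), both numerator and denominator coincide and the statistic is zero.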

A.2. Conditional Coverage Tests

Christoffersen’s independence test uses a binary first-order Markov chain and a transition probability matrix Π to test the independence of the violations.

$$ \Pi_1 = \begin{pmatrix} 1-p_{01} & p_{01} \\ 1-p_{11} & p_{11} \end{pmatrix} \tag{A.3} $$

The indicator variable $I_t$ takes the value 1 when a VaR violation occurs and 0 otherwise, and $p_{ij} = \Pr(I_t = j \mid I_{t-1} = i)$. Christoffersen and Pelletier (2004) suggested that the likelihood function for the violation sequence is Bernoulli-distributed and follows the first-order Markov chain of equation A.4.

$$ L_U(\Pi_1; I_1, I_2, \ldots, I_t) = (1-p_{01})^{n_{00}}\, p_{01}^{n_{01}}\, (1-p_{11})^{n_{10}}\, p_{11}^{n_{11}} \tag{A.4} $$

where $n_{ij}$ is the number of observations with $I_{t-1} = i$ followed by $I_t = j$. The unrestricted maximum likelihood estimates of the transition probabilities in $L_U(\Pi_1)$ are simply ratios of the counts of the respective cells, as shown in equation A.5.

$$ \Pi_1 = \begin{pmatrix} \dfrac{n_{00}}{n_{00}+n_{01}} & \dfrac{n_{01}}{n_{00}+n_{01}} \\ \dfrac{n_{10}}{n_{10}+n_{11}} & \dfrac{n_{11}}{n_{10}+n_{11}} \end{pmatrix} \tag{A.5} $$

If the violations are independent, the violation on day $t$ is independent from that on day $t-1$; thus, the probability $p_{01}$ should be equal to $p_{11}$. Equation A.6 shows the transition probability matrix under the null hypothesis $H_0: p_{01} = p_{11} = \hat{p}$ and equation A.7 shows the restricted form of the maximum likelihood estimates for the log-likelihood function $L_R(\Pi_0)$.

$$ H_0: \Pi_0 = \begin{pmatrix} 1-\hat{p} & \hat{p} \\ 1-\hat{p} & \hat{p} \end{pmatrix} \tag{A.6} $$

$$ L_R(\Pi_0; I_1, I_2, \ldots, I_t) = (1-\hat{p})^{n_{00}+n_{10}}\, \hat{p}^{\,n_{01}+n_{11}} \tag{A.7} $$

where $\hat{p} = (n_{01}+n_{11})/(n_{00}+n_{01}+n_{10}+n_{11})$. Equation A.8 shows the likelihood ratio test statistic for independence, $LR_{indc}$, which is asymptotically $\chi^2$-distributed with 1 degree of freedom under the null hypothesis of the restricted model being the preferred model.

$$ LR_{indc} = 2\left( \log L_U(\Pi_1) - \log L_R(\Pi_0) \right) \sim \chi^2(1) \tag{A.8} $$
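Equations A.4 to A.8 can be traced step by step in a short sketch: count the transitions, estimate the unrestricted and restricted probabilities, and form the likelihood ratio. The function names and the two violation sequences below are hypothetical, and the sketch assumes a 0/1 violation sequence with at least two observations.

```python
import math


def _xlogy(x, y):
    """x * ln(y), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def transition_counts(hits):
    """Return n[i][j]: number of days with I_{t-1} = i followed by I_t = j."""
    n = [[0, 0], [0, 0]]
    for prev, curr in zip(hits[:-1], hits[1:]):
        n[prev][curr] += 1
    return n


def lr_ind_christoffersen(hits):
    """Equation A.8: likelihood ratio statistic for first-order independence."""
    (n00, n01), (n10, n11) = transition_counts(hits)
    # Equation A.5: unrestricted estimates of the transition probabilities.
    p01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    p11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    # Restricted estimate under H0: p01 = p11.
    p_hat = (n01 + n11) / (n00 + n01 + n10 + n11)
    # Logs of equations A.4 and A.7.
    log_lu = (_xlogy(n00, 1 - p01) + _xlogy(n01, p01)
              + _xlogy(n10, 1 - p11) + _xlogy(n11, p11))
    log_lr = _xlogy(n00 + n10, 1 - p_hat) + _xlogy(n01 + n11, p_hat)
    return 2.0 * (log_lu - log_lr)


def chi2_sf_1dof(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


# Hypothetical sequences: ten violations in one cluster vs. a pattern
# in which a violation is equally likely after either state.
clustered = [0] * 90 + [1] * 10
balanced = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0]
stat_clustered = lr_ind_christoffersen(clustered)
stat_balanced = lr_ind_christoffersen(balanced)
```

Because the test only looks at consecutive days, violations that are close together but not adjacent leave the statistic near zero, which is the limitation the mixed-Kupiec test addresses.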

Haas (2001) proposed the mixed-Kupiec independence test, which combines the ideas of Christoffersen and Kupiec and considers the timing of the individual violations to test their independence. The mixed-Kupiec test overcomes the limitation of Christoffersen’s independence test, in which the first-order Markov chain considers only the dependence between two consecutive trading days. Equation A.9 shows the likelihood ratio statistic $LR_{indmk}$ of the mixed-Kupiec test.

$$ LR_{indmk} = \sum_{i=1}^{n} \left[ -2 \ln\left( \frac{p\,(1-p)^{b_i-1}}{\frac{1}{b_i}\left(1-\frac{1}{b_i}\right)^{b_i-1}} \right) \right] \tag{A.9} $$

where $b_i$ is the time between VaR violations $i-1$ and $i$, and $n$ is the number of exceptions in the testing period.
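A minimal sketch of equation A.9 follows. It assumes that the first gap, b1, is measured from the start of the testing period to the first violation, which the text above does not state explicitly; the function name and example sequences are hypothetical.

```python
import math


def _xlogy(x, y):
    """x * ln(y), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def lr_ind_mixed_kupiec(hits, p):
    """Equation A.9: sum of one likelihood ratio term per violation.

    Returns the statistic and the number of violations n.
    Assumption: b_1 is the day (counted from 1) of the first violation.
    """
    days = [t for t, h in enumerate(hits, start=1) if h == 1]
    if not days:
        return 0.0, 0
    gaps = [days[0]] + [later - earlier for earlier, later in zip(days, days[1:])]
    total = 0.0
    for b in gaps:
        log_null = math.log(p) + _xlogy(b - 1, 1 - p)
        log_alt = math.log(1.0 / b) + _xlogy(b - 1, 1 - 1.0 / b)
        total += -2.0 * (log_null - log_alt)
    return total, len(days)


# Hypothetical 1% VaR sequences over 300 days.
evenly_spaced = ([0] * 99 + [1]) * 3           # violations on days 100, 200, 300
clustered = [0] * 99 + [1, 1, 1] + [0] * 198   # violations on days 100, 101, 102
stat_even, n_even = lr_ind_mixed_kupiec(evenly_spaced, 0.01)
stat_clustered, n_clustered = lr_ind_mixed_kupiec(clustered, 0.01)
```

Violations spaced exactly 1/p days apart contribute nothing to the statistic, whereas each back-to-back violation (b = 1) adds a term of -2 ln(p).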

A.3. Joint Coverage Tests

The joint test considers both the coverage and the independence of the violations by combining the likelihood ratios of both the tests. The Christoffersen joint test combines the likelihood ratios of the POF test and the Christoffersen’s independence test, whereas the mixed-Kupiec joint test combines the likelihood ratios of the TUFF test and the mixed-Kupiec independence test.

The Christoffersen joint test of coverage and independence can be performed by combining the $LR_{POF}$ and $LR_{indc}$ statistics. The null hypothesis is that there are no coverage and clustering issues. Under the null hypothesis, the conditional coverage $LR_{cc}$ statistic, shown in equation A.10, is asymptotically $\chi^2$-distributed with 2 degrees of freedom.

$$ LR_{cc} = LR_{POF} + LR_{indc} \sim \chi^2(2) \tag{A.10} $$

Similarly, the mixed-Kupiec joint test combines the $LR_{TUFF}$ and $LR_{indmk}$ statistics into the conditional coverage $LR_{mixed}$ statistic. This statistic, shown in equation A.11, is asymptotically $\chi^2$-distributed with $(n+1)$ degrees of freedom under the null hypothesis.

$$ LR_{mixed} = LR_{TUFF} + LR_{indmk} \sim \chi^2(n+1) \tag{A.11} $$
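The two joint statistics in equations A.10 and A.11 are plain sums, so the only extra ingredient is a chi-square tail probability with the appropriate degrees of freedom. The sketch below uses a hand-rolled helper for integer degrees of freedom, built on the standard recurrence for the regularized upper incomplete gamma function; this helper and the function names are illustrative assumptions, and a statistics library would normally be used instead. The input statistics are hypothetical.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable with integer dof >= 1.

    Uses Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1), starting
    from Q(1/2, y) = erfc(sqrt(y)) or Q(1, y) = exp(-y), with y = x / 2.
    """
    if x <= 0:
        return 1.0
    y = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < dof / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


def christoffersen_joint(lr_pof, lr_indc):
    """Equation A.10: joint statistic and its chi-square(2) p-value."""
    lr_cc = lr_pof + lr_indc
    return lr_cc, chi2_sf(lr_cc, 2)


def mixed_kupiec_joint(lr_tuff, lr_indmk, n_violations):
    """Equation A.11: joint statistic and its chi-square(n + 1) p-value."""
    lr_mixed = lr_tuff + lr_indmk
    return lr_mixed, chi2_sf(lr_mixed, n_violations + 1)


# Hypothetical component statistics.
lr_cc, p_cc = christoffersen_joint(4.2, 6.1)
lr_mixed, p_mixed = mixed_kupiec_joint(1.3, 18.4, 3)
```

Note that the degrees of freedom of the mixed-Kupiec joint test grow with the number of violations, so the same value of the statistic is less significant in a sample with more exceptions.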