23 Test Backtesting

Statistical validation of risk models requires moving beyond simple violation counting to formal hypothesis testing. Having generated backtesting data in Chapter 22, we now implement the statistical framework to determine whether our VaR and ES models perform adequately.

This chapter addresses the statistical testing of backtesting results as described in Chapter 7 of Financial Risk Forecasting. We implement VaR coverage testing to examine whether violations occur at the expected frequency, and VaR independence testing to determine whether violations are randomly distributed over time. The chapter also discusses ES backtesting challenges and limitations, along with statistical comparison across different risk modelling approaches.

Building on the backtesting data from Chapter 22, we implement the Christoffersen (1998) testing framework for VaR validation and discuss the Acerbi-Székely (2014) approach for ES evaluation.

23.1 Data and libraries

library(zoo)
library(lubridate)
library(parallel)
source("common/functions.r",chdir=TRUE)

23.2 Backtest Data

We load the backtesting results generated in Chapter 22 using the LoadBacktestData() function from functions.r. This data contains VaR and ES forecasts from all four methods (HS, EWMA, GARCH, and t-GARCH) alongside actual returns and dates.

# Load backtest data using the centralized function
result = LoadBacktestData()
backtest = result$backtest
par = result$par

cat("Loaded", nrow(backtest), "trading days of backtest data\n")

Loaded 6938 trading days of backtest data

cat("Period:", as.character(min(backtest$date)), "to", as.character(max(backtest$date)), "\n")

Period: 19971128 to 20250630

The par object contains the backtesting parameters including the confidence level and window sizes used in the forecasting process.

23.3 Reusable Analysis Functions

To support consistent analysis and visualization throughout the chapter, we create several reusable functions. This approach eliminates code duplication and ensures consistent formatting across different time periods and data subsets.

23.3.1 VaR vs Returns Plotting Function

First, create a flexible plotting function for visualizing VaR forecasts against actual returns:

plot_var_vs_returns = function(data) {
  main_title = paste0("S&P 500 Returns vs VaR Forecasts")
  plot(data$date.t, data$y * par$value, type='l',
       main=main_title,
       xlab="Date", ylab="P&L ($)",
       col="black", lwd=1)
  lines(data$date.t, -data$VaR.HS, col="red", lwd=2)
  lines(data$date.t, -data$VaR.EWMA, col="blue", lwd=2)
  lines(data$date.t, -data$VaR.GARCH, col="green", lwd=2)
  lines(data$date.t, -data$VaR.tGARCH, col="purple", lwd=2)
  legend("bottomleft",
         c("Returns", "HS VaR", "EWMA VaR", "GARCH VaR", "tGARCH VaR"),
         col=c("black", "red", "blue", "green", "purple"),
         lwd=c(1,2,2,2,2), cex=0.8,bty='n',ncol=3)
  abline(h=0, col="gray", lty=2)
  w=seq(ymd(19900101),ymd(20301231),by="year")
  axis(1,at=w,label=FALSE,pch=-0.3)
}

23.3.2 Violation Plotting Function

Next, create a function for plotting violation patterns over time:

plot_violations = function(data, violations_matrix) {
  colors = c("red", "blue", "green", "purple")
  methods = c("HS", "EWMA", "GARCH", "tGARCH")

  plot(data$date.t,violations_matrix[,1] ,
       type='p', pch=20, col=colors[1], main="VaR Violations",
       xlab="Date", ylab="", ylim=c(0.5, 4.5), yaxt="n")
  for (i in 2:4) {
    points(data$date.t,violations_matrix[,i]*i,
           pch=20, col=colors[i])
  }
  axis(2, at=1:4, labels=methods, las=1)
  w=seq(ymd(19900101),ymd(20301231),by="year")
  axis(1,at=w,label=FALSE,pch=-0.3)
}

23.3.3 Violation Analysis Function

Finally, create a function for calculating violation statistics and generating summary tables:

# Define VaR methods globally for use throughout the chapter
VaR_methods = c("VaR.HS", "VaR.EWMA", "VaR.GARCH", "VaR.tGARCH")

# Create reusable violation analysis function
create_violation_analysis = function(data, par, data_name = "Dataset") {
  # Create violations matrix
  PL = data$y * par$value
  Violations = matrix(0, nrow=nrow(data), ncol=length(VaR_methods))
  colnames(Violations) = VaR_methods

  # Create binary violation indicators (1 = violation, 0 = no violation)
  for(i in 1:length(VaR_methods)){
    VaR_values = data[,VaR_methods[i]]
    Violations[,i] = ifelse(PL < -VaR_values, 1, 0)
  }

  # Calculate violation statistics and create summary table
  violation_counts = colSums(Violations)
  violation_rates = violation_counts / nrow(Violations)
  expected_violations = par$p * nrow(data)
  expected_rate = par$p

  # Consolidated violation summary
  violation_summary = data.frame(
    Method = gsub("VaR\\.", "", VaR_methods),
    Violations = violation_counts,
    Rate_Percent = round(violation_rates * 100, 2),
    Expected_Count = round(expected_violations, 1),
    Expected_Rate = round(expected_rate * 100, 1),
    Difference = violation_counts - expected_violations,
    Performance = ifelse(abs(violation_rates - expected_rate) < 0.01, "Good",
                        ifelse(violation_rates < expected_rate, "Conservative", "Liberal"))
  )

  cat(data_name, "Violation Analysis:\n")
  print(violation_summary)

  # Identify best performing method
  best_method = which.min(abs(violation_rates - expected_rate))
  cat("\nClosest to expected rate:", violation_summary$Method[best_method], "\n")

  # Return results for further analysis
  return(list(
    violations = Violations,
    summary = violation_summary,
    rates = violation_rates,
    counts = violation_counts
  ))
}

23.4 Initial data exploration

The backtest data frame contains VaR and ES forecasts for all methods, along with actual returns and dates. We can visualise the forecasts against actual returns to see how well our models capture market risk.

plot_var_vs_returns(backtest)

# Summary of the loaded data
cat("Backtest period:", as.character(min(backtest$date)), "to", as.character(max(backtest$date)), "\n")

Backtest period: 19971128 to 20250630

cat("Number of trading days:", nrow(backtest), "\n")

Number of trading days: 6938

cat("Estimation window size:", par$WE, "days\n")

Estimation window size: 2000 days

cat("Testing window size:", par$WT, "days\n")

Testing window size: 6938 days

cat("VaR confidence level:", (1-par$p)*100, "%\n")

VaR confidence level: 99 %

23.5 Violations

A violation occurs when the actual loss exceeds the VaR forecast. For a well-calibrated VaR model, violations should occur at the expected frequency — a 1% VaR model should be exceeded roughly 1% of the time.

We create violations by comparing actual losses to VaR forecasts and convert them to binary indicators for systematic analysis.

# Use function to create violation analysis for full dataset
violation_results = create_violation_analysis(backtest, par, "Overall")

Overall Violation Analysis:
           Method Violations Rate_Percent Expected_Count Expected_Rate
VaR.HS         HS        107         1.54           69.4             1
VaR.EWMA     EWMA        151         2.18           69.4             1
VaR.GARCH   GARCH        136         1.96           69.4             1
VaR.tGARCH tGARCH         87         1.25           69.4             1
           Difference Performance
VaR.HS          37.62        Good
VaR.EWMA        81.62     Liberal
VaR.GARCH       66.62        Good
VaR.tGARCH      17.62        Good

Closest to expected rate: tGARCH

# Extract results for use in other sections
Violations = violation_results$violations
violation_summary = violation_results$summary
violation_rates = violation_results$rates
violation_counts = violation_results$counts

23.6 Violation patterns over time

The temporal distribution of violations reveals important information about model performance during different market conditions. Well-calibrated models should show violations distributed randomly over time, while clustering indicates the model fails to adapt quickly to changing volatility.

# Use function to create violation plot
plot_violations(backtest, Violations)

The plot reveals significant clustering during stress periods, particularly during the COVID-19 market disruption in early 2020. This clustering suggests that all methods struggle to adapt quickly enough to rapidly changing market conditions, with violations occurring in concentrated bursts rather than randomly. Such patterns violate the independence assumption required for well-functioning VaR models and motivate the formal independence testing discussed below.

23.7 Statistical Analysis of VaR

The violation analysis reveals that HS performs closest to the expected 1% rate (1.5%), while EWMA, GARCH and tGARCH all show elevated violation rates around 1.3-2.2%. However, we cannot conclude from violation counts alone whether these differences represent genuine model failures or acceptable random variation.

To make definitive judgements about model adequacy, we need formal hypothesis testing. A model showing 2.2% violations instead of 1% might still be acceptable if this difference falls within normal sampling variation. Conversely, even small deviations might be statistically significant with sufficient data.

We implement the testing framework developed by Christoffersen (1998). This framework tests two fundamental properties that well-functioning VaR models must satisfy.

23.7.1 Coverage Test

The Bernoulli coverage test examines whether the violation rate is statistically different from the expected rate. Under the null hypothesis, violations should occur at the expected frequency (1% for our models).

# Bernoulli coverage test (from Financial Risk Forecasting book)
bern_test = function(p, v) {
  lv = length(v)
  sv = sum(v)
  al = log(p) * sv + log(1-p) * (lv-sv)
  bl = log(sv/lv) * sv + log(1-sv/lv) * (lv-sv)
  return(-2 * (al-bl))
}

Bernoulli Coverage Test Results:

H0: Violation rate equals expected rate (1%)

# Apply coverage test to all methods
m = c("HS", "EWMA", "GARCH", "tGARCH")
coverage_results = data.frame(
  Method = m,
  Test_Statistic = numeric(4),
  P_Value = numeric(4),
  Result = character(4),
  stringsAsFactors = FALSE
)

for(i in 1:4) {
  test_stat = bern_test(par$p, Violations[,i])
  p_value = 1 - pchisq(test_stat, df = 1)

  coverage_results$Test_Statistic[i] = round(test_stat, 3)
  coverage_results$P_Value[i] = round(p_value, 4)
  coverage_results$Result[i] = ifelse(p_value < 0.05, "REJECT", "PASS")

  cat(sprintf("%-8s: Test statistic = %.3f, p-value = %.4f (%s)\n",
              m[i], test_stat, p_value,
              ifelse(p_value < 0.05, "REJECT H0", "PASS")))
}

HS      : Test statistic = 17.678, p-value = 0.0000 (REJECT H0)
EWMA    : Test statistic = 72.593, p-value = 0.0000 (REJECT H0)
GARCH   : Test statistic = 50.480, p-value = 0.0000 (REJECT H0)
tGARCH  : Test statistic = 4.183, p-value = 0.0408 (REJECT H0)

print(coverage_results)

  Method Test_Statistic P_Value Result
1     HS         17.678  0.0000 REJECT
2   EWMA         72.593  0.0000 REJECT
3  GARCH         50.480  0.0000 REJECT
4 tGARCH          4.183  0.0408 REJECT

23.7.2 Independence Test

The independence test examines whether violations are randomly distributed over time or exhibit clustering patterns. If violations tend to cluster, this suggests the VaR model fails to adapt quickly to changing market conditions.

A properly functioning VaR model should produce violations that appear random — like a coin flip with probability 0.01. When violations cluster together, it indicates the model reacts too slowly to volatility changes or allows repeated surprises during stress periods.

We model violations as a first-order Markov chain where today’s violation probability depends only on yesterday’s outcome. Under the null hypothesis of independence, the probability of a violation today should be the same regardless of whether there was a violation yesterday.

The test examines four transition states: - \(\eta_{00}\): No violation yesterday, no violation today - \(\eta_{01}\): No violation yesterday, violation today - \(\eta_{10}\): Violation yesterday, no violation today - \(\eta_{11}\): Violation yesterday, violation today

# Independence test function (from Financial Risk Forecasting book)
ind_test = function(V){
  J = matrix(ncol=4, nrow=length(V))
  for (i in 2:length(V)){
    J[i,1] = V[i-1]==0 & V[i]==0
    J[i,2] = V[i-1]==0 & V[i]==1
    J[i,3] = V[i-1]==1 & V[i]==0
    J[i,4] = V[i-1]==1 & V[i]==1
  }
  V_00 = sum(J[,1], na.rm=TRUE)
  V_01 = sum(J[,2], na.rm=TRUE)
  V_10 = sum(J[,3], na.rm=TRUE)
  V_11 = sum(J[,4], na.rm=TRUE)
  p_00 = V_00/(V_00+V_01)
  p_01 = V_01/(V_00+V_01)
  p_10 = V_10/(V_10+V_11)
  p_11 = V_11/(V_10+V_11)
  hat_p = (V_01+V_11)/(V_00+V_01+V_10+V_11)
  al = log(1-hat_p)*(V_00+V_10) + log(hat_p)*(V_01+V_11)
  bl = log(p_00)*V_00 + log(p_01)*V_01 + log(p_10)*V_10 + log(p_11)*V_11
  return(-2*(al-bl))
}

Independence Test Results:

H0: Violations are independent over time

# Apply independence test to all methods
independence_results = data.frame(
  Method = m,
  Test_Statistic = numeric(4),
  P_Value = numeric(4),
  Result = character(4),
  stringsAsFactors = FALSE
)

for(i in 1:4) {
  test_stat = ind_test(Violations[,i])
  p_value = 1 - pchisq(test_stat, df = 1)

  independence_results$Test_Statistic[i] = round(test_stat, 3)
  independence_results$P_Value[i] = round(p_value, 4)
  independence_results$Result[i] = ifelse(p_value < 0.05, "REJECT", "PASS")

  cat(sprintf("%-8s: Test statistic = %.3f, p-value = %.4f (%s)\n",
              m[i], test_stat, p_value,
              ifelse(p_value < 0.05, "REJECT H0", "PASS")))
}

HS      : Test statistic = 13.343, p-value = 0.0003 (REJECT H0)
EWMA    : Test statistic = 9.453, p-value = 0.0021 (REJECT H0)
GARCH   : Test statistic = 1.703, p-value = 0.1919 (PASS)
tGARCH  : Test statistic = 4.776, p-value = 0.0289 (REJECT H0)

print(independence_results)

  Method Test_Statistic P_Value Result
1     HS         13.343  0.0003 REJECT
2   EWMA          9.453  0.0021 REJECT
3  GARCH          1.703  0.1919   PASS
4 tGARCH          4.776  0.0289 REJECT

The independence test results confirm the clustering patterns observed in the violation plots. Methods that fail this test show systematic dependencies in violation timing, indicating inadequate adaptation to changing market volatility. This temporal clustering violates a fundamental assumption of well-functioning VaR models and suggests these methods react too slowly to evolving market conditions.

23.8 ES Backtesting

ES, also known as Conditional Value-at-Risk, measures the average loss in the worst cases beyond the VaR threshold:

\[\text{ES}_\alpha = -\mathbb{E}[r_t \mid r_t < -\text{VaR}_\alpha]\]

Unlike VaR, ES cannot be backtested using simple violation counting because:

ES is not defined by a single threshold breach but depends on the full shape of the loss tail
ES is not elicitable on its own (no simple scoring function exists)
ES is sensitive to rare, extreme losses that occur infrequently
Traditional frequency-based backtesting (counting violations) does not apply

However, Fissler and Ziegel (2016) showed that VaR and ES are jointly elicitable, meaning we can create consistent scoring functions that evaluate both measures together, enabling meaningful backtesting.

23.8.1 The Acerbi-Székely (2019) Approach

ES can be back‑tested with the minimally biased statistic proposed by Acerbi and Székely (2019)¹. The statistic

\[Z_{\mathrm{ES},t}=\text{ES}_t - \text{VaR}_t - \frac{1}{\rho}\min\bigl(0,\,y_t + \text{VaR}_t\bigr)\]

where:

\(\text{ES}_t\) = ES forecast for day \(t\) (predicted expected shortfall)
\(\text{VaR}_t\) = VaR forecast for day \(t\) (predicted value-at-risk threshold)
\(y_t\) = realised return for day \(t\) (actual portfolio return, negative for losses)
\(\rho\) = confidence level (e.g., 0.025 for 97.5% confidence)
\(\min(0, y_t + \text{VaR}_t)\) = captures only cases where actual loss exceeds VaR threshold

The formula provides unbiased estimates when the ES model is correctly specified up to a small prudential bias that vanishes when the accompanying VaR forecast is exact. The bias disadvantages the modeller and hence provides a conservative assessment.

23.8.1.1 Implementation

# Minimally biased ES backtest
Z_ES = function(ES_t, VaR_t, y_t, p = 0.025) {
  ES_t - VaR_t - (1 / p) * pmin(0, y_t + VaR_t)
}

# Example implementation for first method
method_name = gsub("VaR\\.", "", VaR_methods[1])
var_forecasts = backtest[,paste0("VaR.", method_name)]
es_forecasts = backtest[,paste0("ES.", method_name)]
PL = backtest$y * par$value

# Identify violation days
violations = PL < -var_forecasts
violation_days = which(violations)

if (length(violation_days) > 0) {
  # Compute statistic on violation days
  Z_values = Z_ES(es_forecasts[violation_days],
                  var_forecasts[violation_days],
                  backtest$y[violation_days])

  t_test = t.test(Z_values, mu = 0)
  print(paste("Method:", method_name))
  print(paste("T-statistic:", round(t_test$statistic, 3)))
  print(paste("P-value:", round(t_test$p.value, 4)))
}

[1] "Method: HS"
[1] "T-statistic: 28.096"
[1] "P-value: 0"

23.8.1.2 Interpretation

Negative t values show ES underestimation
Positive values point to overestimation
Small magnitudes imply consistency

A relative variant divides \(Z_{\mathrm{ES},t}\) by \(\text{ES}_t\) and provides a scale-free factor useful for multipliers. This relative measure is particularly valuable for regulatory capital calculations where proportional adjustments to ES estimates are required.

Acerbi, C. and Székely, B. (2019). The minimally biased backtest for ES. Risk, September 2019, 76-81.↩︎