Coverage for src/scores/stats/statistical_tests/diebold_mariano_impl.py: 100%
90 statements
1"""
2Functions for calculating a modified Diebold-Mariano test statistic
3"""
4import warnings
5from typing import Literal
7import numpy as np
8import scipy as sp
9import xarray as xr
10from scipy.optimize import least_squares
12from scores.utils import dims_complement


def diebold_mariano(
    da_timeseries: xr.DataArray,
    ts_dim: str,
    h_coord: str,
    method: Literal["HG", "HLN"] = "HG",
    confidence_level: float = 0.95,
    statistic_distribution: Literal["normal", "t"] = "normal",
) -> xr.Dataset:
23 """
24 Given an array of (multiple) timeseries, with each timeseries consisting of score
25 differences for h-step ahead forecasts, calculates a modified Diebold-Mariano test
26 statistic for each timeseries. Several other statistics are also returned such as
27 the confidence that the population mean of score differences is greater than zero
28 and confidence intervals for that mean.
30 Two methods for calculating the test statistic have been implemented: the "HG"
31 method Hering and Genton (2011) and the "HLN" method of Harvey, Leybourne and
32 Newbold (1997). The default "HG" method has an advantage of only generating positive
33 estimates for the spectral density contribution to the test statistic. For further
34 details see `scores.stats.confidence_intervals.impl._dm_test_statistic`.
36 Prior to any calculations, NaNs are removed from each timeseries. If there are NaNs
37 in `da_timeseries` then a warning will occur. This is because NaNs may impact the
38 autocovariance calculation.
40 To determine the value of h for each timeseries of score differences of h-step ahead
41 forecasts, one may ask 'How many observations of the phenomenon will be made between
42 making the forecast and having the observation that will validate the forecast?'
43 For example, suppose that the phenomenon is afternoon precipitation accumulation in
44 New Zealand (00z to 06z each day). Then a Day+1 forecast issued at 03z on Day+0 will
45 be a 2-ahead forecast, since Day+0 and Day+1 accumulations will be observed before
46 the forecast can be validated. On the other hand, a Day+1 forecast issued at 09z on
47 Day+0 will be a 1-step ahead forecast. The value of h for each timeseries in the
48 array needs to be specified in one of the sets of coordinates.
49 See the example below.
51 Confidence intervals and "confidence_gt_0" statistics are calculated using the
52 test statistic, which is assumed to have either the standard normal distribution
53 or Student's t distribution with n - 1 degrees of freedom (where n is the length of
54 the timeseries). The distribution used is specified by `statistic_distribution`. See
55 Harvey, Leybourne and Newbold (1997) for why the t distribution may be preferred,
56 especially for shorter timeseries.
58 If `da_timeseries` is a chunked array, data will be brought into memory during
59 this calculation due to the autocovariance implementation.
61 Args:
62 da_timeseries: a 2 dimensional array containing the timeseries.
63 ts_dim: name of the dimension which identifies each timeseries in the array.
64 h_coord: name of the coordinate specifying, for each timeseries, that the
65 timeseries is an h-step ahead forecast. `h_coord` coordinates must be
66 indexed by the dimension `ts_dim`.
67 method: method for calculating the test statistic, one of "HG" or "HLN".
68 confidence_level: the confidence level, between 0 and 1 exclusive, at which to
69 calculate confidence intervals.
70 statistic_distribution: the distribution of the test-statistic under the null
71 hypothesis of equipredictive skill. Used to calculate the "confidence_gt_0"
72 statistic and confidence intervals. One of "normal" or "t" (for Student's t
73 distribution).
75 Returns:
76 Dataset, indexed by `ts_dim`, with six variables:
77 - "mean": the mean value for each timeseries, ignoring NaNs
78 - "dm_test_stat": the modified Diebold-Mariano test statistic for each
79 timeseries
80 - "timeseries_len": the length of each timeseries, with NaNs removed.
81 - "confidence_gt_0": the confidence that the mean value of the population is
82 greater than zero, based on the specified `statistic_distribution`.
83 Precisely, it is the value of the cumululative distribution function
84 evaluated at `dm_test_stat`.
85 - "ci_upper": the upper end point of a confidence interval about the mean at
86 specified `confidence_level`.
87 - "ci_lower": the lower end point of a confidence interval about the mean at
88 specified `confidence_level`.
90 Raises:
91 ValueError: if `method` is not one of "HG" or "HLN".
92 ValueError: if `statistic_distribution` is not one of "normal" or "t".
93 ValueError: if `0 < confidence_level < 1` fails.
94 ValueError: if `len(da_timeseries.dims) != 2`.
95 ValueError: if `ts_dim` is not a dimension of `da_timeseries`.
96 ValueError: if `h_coord` is not a coordinate of `da_timeseries`.
97 ValueError: if `ts_dim` is not the only dimension of
98 `da_timeseries[h_coord]`.
99 ValueError: if `h_coord` values aren't positive integers.
100 ValueError: if `h_coord` values aren't less than the lengths of the
101 timeseries after NaNs are removed.
102 RuntimeWarnning: if there is a NaN in diffs.
104 References:
105 - Diebold and Mariano, 'Comparing predictive accuracy', Journal of Business and
106 Economic Statistics 13 (1995), 253-265.
107 - Hering and Genton, 'Comparing spatial predictions',
108 Technometrics 53 no. 4 (2011), 414-425.
109 - Harvey, Leybourne and Newbold, 'Testing the equality of prediction mean
110 squared errors', International Journal of Forecasting 13 (1997), 281-291.
112 Example:
114 This array gives three timeseries of score differences.
115 Coordinates in the "lead_day" dimension uniquely identify each timeseries.
116 Here `ts_dim="lead_day"`.
117 Coordinates in the "valid_date" dimension give the forecast validity timestamp
118 of each item in the timeseries.
119 The "h" coordinates specify that the timeseries are for 2, 3 and 4-step
120 ahead forecasts respectively. Here `h_coord="h"`.
122 >>> da_timeseries = xr.DataArray(
123 ... data=[[1, 2, 3.0, 4, np.nan], [2.0, 1, -3, -1, 0], [1.0, 1, 1, 1, 1]],
124 ... dims=["lead_day", "valid_date"],
125 ... coords={
126 ... "lead_day": [1, 2, 3],
127 ... "valid_date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05"],
128 ... "h": ("lead_day", [2, 3, 4]),
129 ... },
130 ... )
132 >>> dm_test_stats(da_timeseries, "lead_day", "h")
133 """
    if method not in ["HLN", "HG"]:
        raise ValueError("`method` must be one of 'HLN' or 'HG'.")

    if statistic_distribution not in ["normal", "t"]:
        raise ValueError("`statistic_distribution` must be one of 'normal' or 't'.")

    if not 0 < confidence_level < 1:
        raise ValueError("`confidence_level` must be strictly between 0 and 1.")

    if len(da_timeseries.dims) != 2:
        raise ValueError("`da_timeseries` must have exactly two dimensions.")

    if ts_dim not in da_timeseries.dims:
        raise ValueError(f"`ts_dim` '{ts_dim}' must be a dimension of `da_timeseries`.")

    if h_coord not in da_timeseries.coords:
        raise ValueError("`h_coord` must be among the coordinates of `da_timeseries`.")

    # the following will also catch NaNs in da_timeseries[h_coord];
    # it allows values like 7.0 to pass, which is OK for the application
    if any(da_timeseries[h_coord].values % 1 != 0):
        raise ValueError("Every value in `da_timeseries[h_coord]` must be an integer.")

    if (da_timeseries[h_coord] <= 0).any():
        raise ValueError("Every value in `da_timeseries[h_coord]` must be positive.")

    other_dim = dims_complement(da_timeseries, [ts_dim])[0]
    da_timeseries_len = da_timeseries.count(other_dim)

    if (da_timeseries_len <= da_timeseries[h_coord]).any():
        msg = "Each `h_coord` value must be less than the length of the corresponding timeseries"
        raise ValueError(msg + " after NaNs are removed")
    ts_dim_len = len(da_timeseries[ts_dim])
    test_stats = np.empty([ts_dim_len])
    ts_mean = da_timeseries.mean(other_dim).values

    for i in range(ts_dim_len):
        timeseries = da_timeseries.isel({ts_dim: i})
        h = int(timeseries[h_coord])
        test_stats[i] = _dm_test_statistic(timeseries.values, h, method)

    if statistic_distribution == "normal":
        pvals = sp.stats.norm.cdf(test_stats)
        ci_quantile = sp.stats.norm.ppf(1 - (1 - confidence_level) / 2)
    else:
        pvals = sp.stats.t.cdf(test_stats, da_timeseries_len.values - 1)
        ci_quantile = sp.stats.t.ppf(1 - (1 - confidence_level) / 2, da_timeseries_len.values - 1)
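
    # note: ts_mean / test_stats acts as the (method-dependent) standard error
    # estimate of the mean, so ci_upper and ci_lower below form the usual
    # two-sided interval mean +/- ci_quantile * standard_error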
    result = xr.Dataset(
        data_vars=dict(
            mean=([ts_dim], ts_mean),
            dm_test_stat=([ts_dim], test_stats),
            timeseries_len=([ts_dim], da_timeseries_len.values),
            confidence_gt_0=([ts_dim], pvals),
            ci_upper=([ts_dim], ts_mean * (1 + ci_quantile / test_stats)),
            ci_lower=([ts_dim], ts_mean * (1 - ci_quantile / test_stats)),
        ),
        coords={ts_dim: da_timeseries[ts_dim].values},
    )

    return result


def _dm_test_statistic(diffs: np.ndarray, h: int, method: Literal["HG", "HLN"] = "HG") -> float:
    """
    Given a timeseries of score differences for h-step ahead forecasts, as a 1D numpy
    array, returns a modified Diebold-Mariano test statistic. NaNs are removed prior
    to computing the statistic.

    Two methods for computing the test statistic can be used: either the "HLN" method of
    Harvey, Leybourne and Newbold (1997), or the "HG" method of Hering and Genton (2011).

    Both methods use a different technique for estimating the spectral density of
    `diffs` at frequency 0, compared with Diebold and Mariano (1995). The HLN method
    uses an improved and less biased estimate V_hat (see Equation (5) in Harvey et al
    (1997)). However, this estimate can sometimes be nonpositive, in which case NaN is
    returned.

    The HG method estimates the spectral density component using an exponential model
    for the autocovariances of `diffs`, so that positivity is guaranteed.
    Hering and Genton (2011) fit model parameters using empirical autocovariances
    computed up to half of the maximum lag. In this implementation, empirical
    autocovariances are computed up to half of the maximum lag or a lag of `h`,
    whichever is larger. Model parameters are computed using
    `scipy.optimize.least_squares`. It is assumed that the two model parameters (sigma
    and theta, in the notation of Hering and Genton (2011)) are positive.

    In both methods, if the `diffs` sequence consists only of 0 values then NaN is
    returned.

    Args:
        diffs: timeseries of score differences as a 1D numpy array, assumed not all NaN.
        h: integer indicating that forecasts are h-step ahead, assumed to be positive
            and less than the length of the timeseries with NaNs removed.
        method: the method for computing the test statistic, either "HG" or "HLN".

    Returns:
        Modified Diebold-Mariano test statistic for the sequence of score differences,
        with NaNs removed.

    Raises:
        ValueError: if `method` is not one of "HLN" or "HG".
        ValueError: if `0 < h < len(diffs)` fails after NaNs removed.
        RuntimeWarning: if there is a NaN in `diffs`.

    References:
        - Diebold and Mariano, 'Comparing predictive accuracy', Journal of Business and
          Economic Statistics 13 (1995), 253-265.
        - Hering and Genton, 'Comparing spatial predictions',
          Technometrics 53 no. 4 (2011), 414-425.
        - Harvey, Leybourne and Newbold, 'Testing the equality of prediction mean
          squared errors', International Journal of Forecasting 13 (1997), 281-291.
    """
    if method not in ["HLN", "HG"]:
        raise ValueError("`method` must be one of 'HLN' or 'HG'.")

    if np.isnan(np.sum(diffs)):
        warnings.warn(
            RuntimeWarning(
                "At least one NaN value was detected in `da_timeseries`. This may impact the "
                "calculation of autocovariances."
            )
        )

    diffs = diffs[~np.isnan(diffs)]

    if not 0 < h < len(diffs):
        raise ValueError("The condition `0 < h < len(diffs)`, after NaNs removed, failed.")

    nonzero_diffs = diffs[diffs != 0]

    if len(nonzero_diffs) == 0:
        test_stat = np.nan
    elif method == "HLN":
        test_stat = _hln_method_stat(diffs, h)
    else:  # method == 'HG'
        test_stat = _hg_method_stat(diffs, h)

    return test_stat
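
# Note: under the null hypothesis of equipredictive skill, the statistic returned
# by `_dm_test_statistic` is treated as approximately standard normal (or Student's
# t for short timeseries; see `diebold_mariano`), which is what the confidence
# calculations in `diebold_mariano` rely on.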


def _hg_func(pars: list, lag: np.ndarray, acv: np.ndarray) -> np.ndarray:
    """
    Function whose values are to be minimised as part of the HG method for estimating
    the spectral density at 0.

    Args:
        pars: list of two model parameters (sigma and theta in the notation of Hering
            and Genton 2011)
        lag: 1D numpy array of the form [0, 1, 2, ..., n - 1], where n is the length
            of `acv`
        acv: 1D numpy array of empirical autocovariances with lags corresponding
            to `lag`

    Returns:
        Difference between modelled and empirical autocovariances.

    References:
        Hering and Genton, 'Comparing spatial predictions',
        Technometrics 53 no. 4 (2011), 414-425.
    """
    return (pars[0] ** 2) * np.exp(-3 * lag / pars[1]) - acv
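
# Note on the model in `_hg_func`: pars[0] ** 2 (sigma squared) is the modelled
# lag-0 autocovariance, and pars[1] (theta) acts as a practical range: at
# lag == theta the modelled autocovariance has decayed by a factor of exp(-3),
# roughly 0.05.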


def _hg_method_stat(diffs: np.ndarray, h: int) -> float:
    """
    Calculates the modified Diebold-Mariano test statistic using the "HG" method.
    Assumes that h < len(diffs).

    Args:
        diffs: a single (1D array) timeseries of score differences with NaNs removed.
        h: integer indicating that forecasts are h-step ahead, assumed to be positive
            and less than the length of the timeseries with NaNs removed.

    Returns:
        Diebold-Mariano test statistic using the HG method.
    """
    from scores.stats.statistical_tests.acovf import acovf

    n = len(diffs)

    # use an exponential model for autocovariances of `diffs`
    max_lag = int(max(np.floor((n - 1) / 2), h))
    sample_autocvs = acovf(diffs)[0:max_lag]
    sample_lags = np.arange(max_lag)
    model_params = least_squares(_hg_func, [1, 1], args=(sample_lags, sample_autocvs), bounds=(0, np.inf)).x

    # use the model autocovariances to estimate spectral density at 0
    all_lags = np.arange(n)
    model_autocovs = (model_params[0] ** 2) * np.exp(-3 * all_lags / model_params[1])
    density_estimate = model_autocovs[0] + 2 * np.sum(model_autocovs[1:])
    test_stat = np.mean(diffs) / np.sqrt(density_estimate / n)

    return test_stat


def _hln_method_stat(diffs: np.ndarray, h: int) -> float:
    """
    Given a timeseries of score differences for h-step ahead forecasts, as a 1D numpy
    array without NaNs, returns the modified Diebold-Mariano test statistic of
    Harvey et al (1997).

    If the value V_hat (see Equation (5) in Harvey et al (1997)) is nonpositive then
    NaN is returned.

    Args:
        diffs: timeseries of score differences as a 1D numpy array without any NaNs.
        h: integer indicating that forecasts are h-step ahead, assumed to be positive
            and less than the length of the timeseries with NaNs removed.

    Returns:
        Diebold-Mariano test statistic using the HLN method.
    """
    n = len(diffs)
    diffs_bar = np.mean(diffs)

    # Harvey (1997) Equation (3)
    test_stat = diffs_bar / _dm_v_hat(diffs, diffs_bar, n, h) ** 0.5

    # Harvey (1997) Equation (9)
    correction_factor = (n + 1 - 2 * h + h * (h - 1) / n) / n
    test_stat = (correction_factor**0.5) * test_stat

    return test_stat
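
# For reference, combining Equations (3) and (9) of Harvey et al (1997), the
# statistic computed above is
#     sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n) * diffs_bar / V_hat(d_bar) ** 0.5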


def _dm_gamma_hat_k(diffs: np.ndarray, diffs_bar: float, n: int, k: int) -> float:
    """
    Computes the quantity (n - k) * gamma_hat_star_k of Equation (5) in
    Harvey et al (1997).

    Args:
        diffs: a single timeseries of score differences with NaNs removed.
        diffs_bar: mean of diffs.
        n: length of diffs.
        k: integer between 0 and n - 1 (inclusive), where n = len(diffs)

    Returns:
        The quantity (n - k) * gamma_hat_star_k.
    """
    prod = (diffs[k:n] - diffs_bar) * (diffs[0 : n - k] - diffs_bar)

    return np.sum(prod)


def _dm_v_hat(diffs: np.ndarray, diffs_bar: float, n: int, h: int) -> float:
    """
    Computes the quantity V_hat(d_bar) of Equation (5) in Harvey et al (1997).

    Args:
        diffs: a single timeseries of score differences with NaNs removed.
        diffs_bar: mean of diffs.
        n: length of diffs.
        h: integer between 1 and n - 1 (inclusive), where n = len(diffs)

    Returns:
        The quantity V_hat(d_bar). If the result is not positive, NaN is returned.
    """
    summands = np.empty(h - 1)
    for k in range(h - 1):
        summands[k] = _dm_gamma_hat_k(diffs, diffs_bar, n, k + 1)
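
    # `_dm_gamma_hat_k` returns the unnormalised sum (n - k) * gamma_hat_star_k, so
    # dividing by n ** 2 below yields Equation (5) of Harvey et al (1997):
    # V_hat(d_bar) = [gamma_hat_0 + 2 * (gamma_hat_1 + ... + gamma_hat_{h-1})] / n,
    # where each gamma_hat_k is the corresponding unnormalised sum divided by n.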
    result = (_dm_gamma_hat_k(diffs, diffs_bar, n, 0) + 2 * np.sum(summands)) / n**2

    if result <= 0:
        result = np.nan

    return result
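

# ---------------------------------------------------------------------------
# Illustrative sketch (not part of the library API): a minimal usage example
# reproducing the array from the `diebold_mariano` docstring. The data values
# are assumptions chosen purely for demonstration. Run the module directly to
# inspect the resulting Dataset; the NaN in the first timeseries triggers the
# documented RuntimeWarning.
if __name__ == "__main__":
    _demo_timeseries = xr.DataArray(
        data=[[1, 2, 3.0, 4, np.nan], [2.0, 1, -3, -1, 0], [1.0, 1, 1, 1, 1]],
        dims=["lead_day", "valid_date"],
        coords={
            "lead_day": [1, 2, 3],
            "valid_date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05"],
            "h": ("lead_day", [2, 3, 4]),
        },
    )
    # expect a Dataset indexed by "lead_day" with variables: mean, dm_test_stat,
    # timeseries_len, confidence_gt_0, ci_upper and ci_lower
    print(diebold_mariano(_demo_timeseries, "lead_day", "h"))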