Coverage for src/scores/stats/statistical_tests/diebold_mariano_impl.py: 100%

90 statements  

coverage.py v7.3.2, created at 2024-02-28 12:51 +1100

1""" 

2Functions for calculating a modified Diebold-Mariano test statistic 

3""" 

4import warnings 

5from typing import Literal 

6 

7import numpy as np 

8import scipy as sp 

9import xarray as xr 

10from scipy.optimize import least_squares 

11 

12from scores.utils import dims_complement 

13 

14 

15def diebold_mariano( 

16 da_timeseries: xr.DataArray, 

17 ts_dim: str, 

18 h_coord: str, 

19 method: Literal["HG", "HLN"] = "HG", 

20 confidence_level: float = 0.95, 

21 statistic_distribution: Literal["normal", "t"] = "normal", 

22) -> xr.Dataset: 

23 """ 

24 Given an array of (multiple) timeseries, with each timeseries consisting of score 

25 differences for h-step ahead forecasts, calculates a modified Diebold-Mariano test 

26 statistic for each timeseries. Several other statistics are also returned such as 

27 the confidence that the population mean of score differences is greater than zero 

28 and confidence intervals for that mean. 

29 

30 Two methods for calculating the test statistic have been implemented: the "HG" 

31 method of Hering and Genton (2011) and the "HLN" method of Harvey, Leybourne and 

32 Newbold (1997). The default "HG" method has an advantage of only generating positive 

33 estimates for the spectral density contribution to the test statistic. For further 

34 details see `scores.stats.statistical_tests.diebold_mariano_impl._dm_test_statistic`. 

35 

36 Prior to any calculations, NaNs are removed from each timeseries. If there are NaNs 

37 in `da_timeseries` then a warning will occur. This is because NaNs may impact the 

38 autocovariance calculation. 

39 

40 To determine the value of h for each timeseries of score differences of h-step ahead 

41 forecasts, one may ask 'How many observations of the phenomenon will be made between 

42 making the forecast and having the observation that will validate the forecast?' 

43 For example, suppose that the phenomenon is afternoon precipitation accumulation in 

44 New Zealand (00z to 06z each day). Then a Day+1 forecast issued at 03z on Day+0 will 

45 be a 2-step ahead forecast, since Day+0 and Day+1 accumulations will be observed before 

46 the forecast can be validated. On the other hand, a Day+1 forecast issued at 09z on 

47 Day+0 will be a 1-step ahead forecast. The value of h for each timeseries in the 

48 array needs to be specified in one of the sets of coordinates. 

49 See the example below. 

50 

51 Confidence intervals and "confidence_gt_0" statistics are calculated using the 

52 test statistic, which is assumed to have either the standard normal distribution 

53 or Student's t distribution with n - 1 degrees of freedom (where n is the length of 

54 the timeseries). The distribution used is specified by `statistic_distribution`. See 

55 Harvey, Leybourne and Newbold (1997) for why the t distribution may be preferred, 

56 especially for shorter timeseries. 

57 

58 If `da_timeseries` is a chunked array, data will be brought into memory during 

59 this calculation due to the autocovariance implementation. 

60 

61 Args: 

62 da_timeseries: a 2 dimensional array containing the timeseries. 

63 ts_dim: name of the dimension which identifies each timeseries in the array. 

64 h_coord: name of the coordinate specifying, for each timeseries, that the 

65 timeseries is an h-step ahead forecast. `h_coord` coordinates must be 

66 indexed by the dimension `ts_dim`. 

67 method: method for calculating the test statistic, one of "HG" or "HLN". 

68 confidence_level: the confidence level, between 0 and 1 exclusive, at which to 

69 calculate confidence intervals. 

70 statistic_distribution: the distribution of the test-statistic under the null 

71 hypothesis of equipredictive skill. Used to calculate the "confidence_gt_0" 

72 statistic and confidence intervals. One of "normal" or "t" (for Student's t 

73 distribution). 

74 

75 Returns: 

76 Dataset, indexed by `ts_dim`, with six variables: 

77 - "mean": the mean value for each timeseries, ignoring NaNs 

78 - "dm_test_stat": the modified Diebold-Mariano test statistic for each 

79 timeseries 

80 - "timeseries_len": the length of each timeseries, with NaNs removed. 

81 - "confidence_gt_0": the confidence that the mean value of the population is 

82 greater than zero, based on the specified `statistic_distribution`. 

83 Precisely, it is the value of the cumulative distribution function 

84 evaluated at `dm_test_stat`. 

85 - "ci_upper": the upper end point of a confidence interval about the mean at 

86 specified `confidence_level`. 

87 - "ci_lower": the lower end point of a confidence interval about the mean at 

88 specified `confidence_level`. 

89 

90 Raises: 

91 ValueError: if `method` is not one of "HG" or "HLN". 

92 ValueError: if `statistic_distribution` is not one of "normal" or "t". 

93 ValueError: if `0 < confidence_level < 1` fails. 

94 ValueError: if `len(da_timeseries.dims) != 2`. 

95 ValueError: if `ts_dim` is not a dimension of `da_timeseries`. 

96 ValueError: if `h_coord` is not a coordinate of `da_timeseries`. 

97 ValueError: if `ts_dim` is not the only dimension of 

98 `da_timeseries[h_coord]`. 

99 ValueError: if `h_coord` values aren't positive integers. 

100 ValueError: if `h_coord` values aren't less than the lengths of the 

101 timeseries after NaNs are removed. 

102 RuntimeWarning: if there is a NaN in `da_timeseries`. 

103 

104 References: 

105 - Diebold and Mariano, 'Comparing predictive accuracy', Journal of Business and 

106 Economic Statistics 13 (1995), 253-265. 

107 - Hering and Genton, 'Comparing spatial predictions', 

108 Technometrics 53 no. 4 (2011), 414-425. 

109 - Harvey, Leybourne and Newbold, 'Testing the equality of prediction mean 

110 squared errors', International Journal of Forecasting 13 (1997), 281-291. 

111 

112 Example: 

113 

114 This array gives three timeseries of score differences. 

115 Coordinates in the "lead_day" dimension uniquely identify each timeseries. 

116 Here `ts_dim="lead_day"`. 

117 Coordinates in the "valid_date" dimension give the forecast validity timestamp 

118 of each item in the timeseries. 

119 The "h" coordinates specify that the timeseries are for 2, 3 and 4-step 

120 ahead forecasts respectively. Here `h_coord="h"`. 

121 

122 >>> da_timeseries = xr.DataArray( 

123 ... data=[[1, 2, 3.0, 4, np.nan], [2.0, 1, -3, -1, 0], [1.0, 1, 1, 1, 1]], 

124 ... dims=["lead_day", "valid_date"], 

125 ... coords={ 

126 ... "lead_day": [1, 2, 3], 

127 ... "valid_date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05"], 

128 ... "h": ("lead_day", [2, 3, 4]), 

129 ... }, 

130 ... ) 

131 

132 >>> diebold_mariano(da_timeseries, "lead_day", "h") 

133 """ 

134 if method not in ["HLN", "HG"]: 

135 raise ValueError("`method` must be one of 'HLN' or 'HG'.") 

136 

137 if statistic_distribution not in ["normal", "t"]: 

138 raise ValueError("`statistic_distribution` must be one of 'normal' or 't'.") 

139 

140 if not 0 < confidence_level < 1: 

141 raise ValueError("`confidence_level` must be strictly between 0 and 1.") 

142 

143 if len(da_timeseries.dims) != 2: 

144 raise ValueError("`da_timeseries` must have exactly two dimensions.") 

145 

146 if ts_dim not in da_timeseries.dims: 

147 raise ValueError(f"`ts_dim` '{ts_dim}' must be a dimension of `da_timeseries`.") 

148 

149 if h_coord not in da_timeseries.coords: 

150 raise ValueError("`h_coord` must be among the coordinates of `da_timeseries`.") 

151 

152 # the following will also catch NaNs in da_timeseries[h_coord] 

153 # It allows values like 7.0 to pass, which is OK for the application. 

154 if any(da_timeseries[h_coord].values % 1 != 0): 

155 raise ValueError("Every value in `da_timeseries[h_coord]` must be an integer.") 

156 

157 if (da_timeseries[h_coord] <= 0).any(): 

158 raise ValueError("Every value in `da_timeseries[h_coord]` must be positive.") 

159 

160 other_dim = dims_complement(da_timeseries, [ts_dim])[0] 

161 da_timeseries_len = da_timeseries.count(other_dim) 

162 

163 if (da_timeseries_len <= da_timeseries[h_coord]).any(): 

164 msg = "Each `h_coord` value must be less than the length of the corresponding timeseries" 

165 raise ValueError(msg + " after NaNs are removed") 

166 

167 ts_dim_len = len(da_timeseries[ts_dim]) 

168 test_stats = np.empty([ts_dim_len]) 

169 ts_mean = da_timeseries.mean(other_dim).values 

170 

171 for i in range(ts_dim_len): 

172 timeseries = da_timeseries.isel({ts_dim: i}) 

173 h = int(timeseries[h_coord]) 

174 test_stats[i] = _dm_test_statistic(timeseries.values, h, method) 

175 

176 if statistic_distribution == "normal": 

177 pvals = sp.stats.norm.cdf(test_stats) 

178 ci_quantile = sp.stats.norm.ppf(1 - (1 - confidence_level) / 2) 

179 else: 

180 pvals = sp.stats.t.cdf(test_stats, da_timeseries_len.values - 1) 

181 ci_quantile = sp.stats.t.ppf(1 - (1 - confidence_level) / 2, da_timeseries_len.values - 1) 

182 

183 result = xr.Dataset( 

184 data_vars=dict( 

185 mean=([ts_dim], ts_mean), 

186 dm_test_stat=([ts_dim], test_stats), 

187 timeseries_len=([ts_dim], da_timeseries_len.values), 

188 confidence_gt_0=([ts_dim], pvals), 

189 ci_upper=([ts_dim], ts_mean * (1 + ci_quantile / test_stats)), 

190 ci_lower=([ts_dim], ts_mean * (1 - ci_quantile / test_stats)), 

191 ), 

192 coords={ts_dim: da_timeseries[ts_dim].values}, 

193 ) 

194 

195 return result 
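The confidence and interval formulas used above can be illustrated for a single timeseries. The following is a minimal sketch, assuming the normal distribution and using only the standard library; the helper name `normal_summary` is hypothetical and not part of `scores`. It relies on the fact that the mean divided by the test statistic is the estimated standard error of the mean, and it assumes the test statistic is nonzero.

```python
from statistics import NormalDist


def normal_summary(mean: float, test_stat: float, confidence_level: float = 0.95):
    """Return (confidence_gt_0, ci_lower, ci_upper) for one timeseries.

    Hypothetical helper: assumes the test statistic is standard normal
    and nonzero.
    """
    if not 0 < confidence_level < 1:
        raise ValueError("confidence_level must be strictly between 0 and 1")
    std_normal = NormalDist()
    # Confidence that the population mean exceeds zero: CDF at the statistic.
    confidence_gt_0 = std_normal.cdf(test_stat)
    # Two-sided quantile, roughly 1.96 for a 95% interval.
    quantile = std_normal.inv_cdf(1 - (1 - confidence_level) / 2)
    # mean / test_stat is the estimated standard error of the mean, so this
    # matches mean * (1 +/- quantile / test_stat) in the function above.
    half_width = quantile * mean / test_stat
    return confidence_gt_0, mean - half_width, mean + half_width


confidence, lower, upper = normal_summary(mean=2.0, test_stat=2.0)
print(confidence, lower, upper)
```

With Student's t distribution the same arithmetic applies, but the CDF and quantile would come from a t distribution with n - 1 degrees of freedom.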

196 

197 

198def _dm_test_statistic(diffs: np.ndarray, h: int, method: Literal["HG", "HLN"] = "HG") -> float: 

199 """ 

200 Given a timeseries of score differences for h-step ahead forecasts, as a 1D numpy 

201 array, returns a modified Diebold-Mariano test statistic. NaNs are removed prior 

202 to computing the statistic. 

203 

204 Two methods for computing the test statistic can be used: either the "HLN" method of 

205 Harvey, Leybourne and Newbold (1997), or "HG" method of Hering and Genton (2011). 

206 

207 Both methods use a different technique for estimating the spectral density of 

208 `diffs` at frequency 0, compared with Diebold and Mariano (1995). The HLN method 

209 uses an improved and less biased estimate (see V_hat in Equation (5) of Harvey et al (1997)). 

210 However, this estimate can sometimes be nonpositive, in which case NaN is returned. 

211 

212 The HG method estimates the spectral density component using an exponential model 

213 for the autocovariances of `diffs`, so that positivity is guaranteed. 

214 Hering and Genton (2011) fit model parameters using empirical autocovariances 

215 computed up to half of the maximum lag. In this implementation, empirical 

216 autocovariances are computed up to half of the maximum lag or a lag of `h`, 

217 whichever is larger. Model parameters are computed using 

218 `scipy.optimize.least_squares`. It is assumed that the two model parameters (sigma 

219 and theta, in the notation of Hering and Genton (2011)) are positive. 

220 

221 In both methods, if the `diffs` sequence consists only of 0 values then NaN is 

222 returned. 

223 

224 Args: 

225 diffs: timeseries of score difference as a 1D numpy array, assumed not all NaN. 

226 h: integer indicating that forecasts are h-step ahead, assumed to be positive and 

227 less than the length of the timeseries with NaNs removed. 

228 method: the method for computing the test statistic, either "HG" or "HLN". 

229 

230 Returns: 

231 Modified Diebold-Mariano test statistic for sequence of score differences, with 

232 NaNs removed. 

233 

234 Raises: 

235 ValueError: if `method` is not one of "HLN" or "HG". 

236 ValueError: if `0 < h < len(diffs)` fails after NaNs removed. 

237 RuntimeWarning: if there is a NaN in `diffs`. 

238 

239 References: 

240 - Diebold and Mariano, 'Comparing predictive accuracy', Journal of Business and 

241 Economic Statistics 13 (1995), 253-265. 

242 - Hering and Genton, 'Comparing spatial predictions', 

243 Technometrics 53 no. 4 (2011), 414-425. 

244 - Harvey, Leybourne and Newbold, 'Testing the equality of prediction mean 

245 squared errors', International Journal of Forecasting 13 (1997), 281-291. 

246 """ 

247 if method not in ["HLN", "HG"]: 

248 raise ValueError("`method` must be one of 'HLN' or 'HG'.") 

249 

250 if np.isnan(np.sum(diffs)): 

251 warnings.warn( 

252 RuntimeWarning( 

253 "At least one NaN value was detected in `da_timeseries`. This may impact the " 

254 "calculation of autocovariances." 

255 ) 

256 ) 

257 

258 diffs = diffs[~np.isnan(diffs)] 

259 

260 if not 0 < h < len(diffs): 

261 raise ValueError("The condition `0 < h < len(diffs)`, after NaNs removed, failed.") 

262 

263 nonzero_diffs = diffs[diffs != 0] 

264 

265 if len(nonzero_diffs) == 0: 

266 test_stat = np.nan 

267 elif method == "HLN": 

268 test_stat = _hln_method_stat(diffs, h) 

269 else: # method == 'HG' 

270 test_stat = _hg_method_stat(diffs, h) 

271 

272 return test_stat 
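The preprocessing performed before a method is dispatched (drop NaNs, validate `h`, detect an all-zero series) can be sketched without numpy. This is an illustrative stand-in only; the name `clean_and_check` is hypothetical and the sketch omits the warning raised above.

```python
import math


def clean_and_check(diffs, h):
    """Drop NaNs, validate h, and report whether every remaining value is 0.

    Hypothetical helper mirroring the checks in `_dm_test_statistic`.
    """
    cleaned = [d for d in diffs if not math.isnan(d)]
    # h must be positive and strictly less than the cleaned length.
    if not 0 < h < len(cleaned):
        raise ValueError("The condition 0 < h < len(diffs), after NaNs removed, failed.")
    # An all-zero series has no usable test statistic (NaN is returned above).
    all_zero = all(d == 0 for d in cleaned)
    return cleaned, all_zero


print(clean_and_check([1.0, float("nan"), 2.0, 0.0], h=2))
```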

273 

274 

275def _hg_func(pars: list, lag: np.ndarray, acv: np.ndarray) -> np.ndarray: 

276 """ 

277 Function whose values are to be minimised as part of the HG method for estimating 

278 the spectral density at 0. 

279 

280 Args: 

281 pars: list of two model parameters (sigma and theta in the notation of Hering 

282 and Genton 2011) 

283 lag: 1D numpy array of the form [0, 1, 2, ..., n - 1], where n is the length 

284 of `acv` 

285 acv: 1D numpy array of empirical autocovariances with lags corresponding 

286 to `lag` 

287 

288 Returns: 

289 Difference between modelled and empirical autocovariances. 

290 

291 References: 

292 Hering and Genton, 'Comparing spatial predictions', 

293 Technometrics 53 no. 4 (2011), 414-425. 

294 """ 

295 return (pars[0] ** 2) * np.exp(-3 * lag / pars[1]) - acv 
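The residual function above encodes an exponential autocovariance model, sigma^2 * exp(-3 * lag / theta). A minimal pure-Python sketch of that model and its residuals follows; the function names are hypothetical, and the parameter values are arbitrary illustrations rather than fitted results.

```python
import math


def exponential_autocov(sigma: float, theta: float, lag: int) -> float:
    """Modelled autocovariance at a given lag: sigma^2 * exp(-3 * lag / theta)."""
    return sigma**2 * math.exp(-3 * lag / theta)


def residuals(sigma: float, theta: float, empirical: list) -> list:
    """Modelled minus empirical autocovariance at lags 0, 1, ..., len(empirical) - 1."""
    return [exponential_autocov(sigma, theta, k) - acv for k, acv in enumerate(empirical)]


# Data generated by the model itself gives residuals of zero.
exact = [exponential_autocov(2.0, 3.0, k) for k in range(4)]
print(residuals(2.0, 3.0, exact))
```

A least-squares fit searches for the sigma and theta that make these residuals as small as possible for the observed autocovariances.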

296 

297 

298def _hg_method_stat(diffs: np.ndarray, h: int) -> float: 

299 """ 

300 Calculates the modified Diebold-Mariano test statistic using the "HG" method. 

301 Assumes that h < len(diffs). 

302 

303 Args: 

304 diffs: a single (1D array) timeseries of score differences with NaNs removed. 

305 h: integer indicating that forecasts are h-step ahead, assumed to be positive 

306 and less than the length of the timeseries with NaNs removed. 

307 

308 Returns: 

309 Diebold-Mariano test statistic using the HG method. 

310 """ 

311 from scores.stats.statistical_tests.acovf import acovf 

312 

313 n = len(diffs) 

314 

315 # use an exponential model for autocovariances of `diffs` 

316 max_lag = int(max(np.floor((n - 1) / 2), h)) 

317 sample_autocvs = acovf(diffs)[0:max_lag] 

318 sample_lags = np.arange(max_lag) 

319 model_params = least_squares(_hg_func, [1, 1], args=(sample_lags, sample_autocvs), bounds=(0, np.inf)).x 

320 

321 # use the model autocovariances to estimate spectral density at 0 

322 all_lags = np.arange(n) 

323 model_autocovs = (model_params[0] ** 2) * np.exp(-3 * all_lags / model_params[1]) 

324 density_estimate = model_autocovs[0] + 2 * np.sum(model_autocovs[1:]) 

325 test_stat = np.mean(diffs) / np.sqrt(density_estimate / n) 

326 

327 return test_stat 
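Once the model parameters are known, the last three steps of the HG method reduce to simple arithmetic. The sketch below assumes sigma and theta have already been fitted (it does not perform the least-squares step) and uses a hypothetical function name.

```python
import math


def hg_stat_from_params(diffs: list, sigma: float, theta: float) -> float:
    """HG-style test statistic given already-fitted model parameters.

    Hypothetical helper: the fitting step is assumed to have happened elsewhere.
    """
    n = len(diffs)
    # Modelled autocovariances at lags 0, 1, ..., n - 1.
    model = [sigma**2 * math.exp(-3 * k / theta) for k in range(n)]
    # Spectral density at frequency 0: lag-0 term plus twice the remaining lags.
    density = model[0] + 2 * sum(model[1:])
    mean = sum(diffs) / n
    return mean / math.sqrt(density / n)


print(hg_stat_from_params([1.0, 2.0, 3.0, 2.0], sigma=1.0, theta=3.0))
```

Because every modelled autocovariance is positive when sigma is nonzero, the density estimate is positive and the square root is always defined.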

328 

329 

330def _hln_method_stat(diffs: np.ndarray, h: int) -> float: 

331 """ 

332 Given a timeseries of score differences for h-step ahead forecasts, as a 1D numpy 

333 array without NaNs, returns the modified Diebold-Mariano test statistic of 

334 Harvey et al (1997). 

335 

336 If the value V_hat (see Equation (5) in Harvey et al (1997)) is nonpositive then NaN is returned. 

337 

338 Args: 

339 diffs: timeseries of score difference as a 1D numpy array without any NaNs. 

340 h: integer indicating that forecasts are h-step ahead, assumed to be positive 

341 and less than the length of the timeseries with NaNs removed. 

342 

343 Returns: 

344 Diebold-Mariano test statistic using the HLN method. 

345 """ 

346 n = len(diffs) 

347 diffs_bar = np.mean(diffs) 

348 

349 # Harvey (1997) Equation (3) 

350 test_stat = diffs_bar / _dm_v_hat(diffs, diffs_bar, n, h) ** 0.5 

351 

352 # Harvey (1997) Equation (9) 

353 correction_factor = (n + 1 - 2 * h + h * (h - 1) / n) / n 

354 test_stat = (correction_factor**0.5) * test_stat 

355 

356 return test_stat 
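The small-sample correction applied above depends only on the series length n and the horizon h. A short sketch of that factor, with hypothetical function names, shows how it shrinks the statistic as h grows relative to n.

```python
def hln_correction_factor(n: int, h: int) -> float:
    """Correction factor (n + 1 - 2h + h(h - 1)/n) / n used in the code above."""
    return (n + 1 - 2 * h + h * (h - 1) / n) / n


def apply_hln_correction(raw_stat: float, n: int, h: int) -> float:
    """Scale an uncorrected statistic by the square root of the correction factor."""
    return hln_correction_factor(n, h) ** 0.5 * raw_stat


for h in (1, 2, 3):
    print(h, hln_correction_factor(10, h))
```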

357 

358 

359def _dm_gamma_hat_k(diffs: np.ndarray, diffs_bar: float, n: int, k: int) -> float: 

360 """ 

361 Computes the quantity (n - k) * gamma_hat_star_k of Equation (5) in 

362 Harvey et al (1997). 

363 

364 Args: 

365 diffs: a single timeseries of score differences with NaNs removed. 

366 diffs_bar: mean of diffs. 

367 n: length of diffs. 

368 k: integer between 0 and n - 1 (inclusive), where n = len(diffs) 

369 

370 Returns: 

371 The quantity (n - k) * gamma_hat_star_k. 

372 """ 

373 prod = (diffs[k:n] - diffs_bar) * (diffs[0 : n - k] - diffs_bar) 

374 

375 return np.sum(prod) 

376 

377 

378def _dm_v_hat(diffs: np.ndarray, diffs_bar: float, n: int, h: int) -> float: 

379 """ 

380 Computes the quantity V_hat(d_bar) of Equation (5) in Harvey et al (1997). 

381 

382 Args: 

383 diffs: a single timeseries of score differences with NaNs removed. 

384 diffs_bar: mean of diffs. 

385 n: length of diffs. 

386 h: integer between 1 and n - 1 (inclusive), where n = len(diffs) 

387 

388 Returns: 

389 The quantity V_hat(d_bar). If the result is not positive, NaN is returned. 

390 """ 

391 summands = np.empty(h - 1) 

392 for k in range(h - 1): 

393 summands[k] = _dm_gamma_hat_k(diffs, diffs_bar, n, k + 1) 

394 

395 result = (_dm_gamma_hat_k(diffs, diffs_bar, n, 0) + 2 * np.sum(summands)) / n**2 

396 

397 if result <= 0: 

398 result = np.nan 

399 

400 return result