Time series forecasting with Python and Scikit-learn

Joaquín Amat Rodrigo, Javier Escobar Ortiz
February, 2021 (last update December 2021)

Introduction


A time series is a succession of chronologically ordered data spaced at equal or unequal intervals. The forecasting process consists of predicting the future value of a time series, either by modeling the series solely based on its past behavior (autoregressive) or by using other external variables.

This paper describes how to use Scikit-learn regression models to perform forecasting on time series. Specifically, it introduces Skforecast, a simple package that contains the classes and functions necessary to adapt any Scikit-learn regression model to forecasting problems.

Multi-Step Time Series Forecasting


When working with time series, the goal is rarely to predict only the next element in the series ($t_{+1}$); more often it is to predict an entire future interval or a point far ahead in time ($t_{+n}$). Each prediction jump is known as a step.

Several strategies can generate this type of multi-step prediction.

Recursive multi-step forecasting

To predict the moment $t_{n}$, the value of $t_{n-1}$ is needed, but that value is unknown when forecasting more than one step ahead. Recursive predictions are therefore required: each new prediction uses the previous ones as predictors. This process is known as recursive forecasting or recursive multi-step forecasting.
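The recursive loop can be sketched in a few lines. This is only an illustration: `recursive_forecast` and `predict_one_step` are hypothetical names, not part of Skforecast, and the "model" is a toy rule that returns the mean of the current window in place of a fitted regressor.

```python
import numpy as np

def recursive_forecast(predict_one_step, last_window, steps):
    """Hypothetical helper: forecast `steps` values, feeding each
    prediction back into the window used for the next one."""
    window = np.asarray(last_window, dtype=float).copy()
    predictions = []
    for _ in range(steps):
        y_hat = predict_one_step(window)
        predictions.append(y_hat)
        # Slide the window: drop the oldest value, append the new prediction
        window = np.append(window[1:], y_hat)
    return np.array(predictions)

# Toy one-step "model" (assumption): the mean of the window
mean_model = lambda w: float(w.mean())

preds = recursive_forecast(mean_model, last_window=[1.0, 2.0, 3.0], steps=3)
print(preds)
```

From the second step onward, the window contains predicted values rather than observed ones, so errors can accumulate as the horizon grows.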

The main adaptation needed to apply Scikit-learn models to recursive multi-step forecasting problems is to transform the time series into a matrix in which each value is associated with the time window (lags) preceding it. This forecasting strategy can be easily generated with the ForecasterAutoreg and ForecasterAutoregCustom classes from the Skforecast package.

Transformation of a time series into a 5 lags matrix and a vector with the value of the series that follows each row of the matrix.
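A minimal NumPy sketch of this transformation follows. The helper name `make_lag_matrix` is an assumption made for illustration; Skforecast builds the matrix internally, and its implementation may differ.

```python
import numpy as np

def make_lag_matrix(y, n_lags):
    """Hypothetical helper: return (X, target) where each row of X holds
    the n_lags values that precede the matching entry of target."""
    y = np.asarray(y, dtype=float)
    # Row i is the window y[i], ..., y[i + n_lags - 1]
    X = np.array([y[i:i + n_lags] for i in range(len(y) - n_lags)])
    # Reverse the columns so that column 0 is lag 1 (the most recent value)
    X = X[:, ::-1]
    target = y[n_lags:]
    return X, target

series = np.arange(10, dtype=float)   # toy series 0, 1, ..., 9
X, target = make_lag_matrix(series, n_lags=5)
print(X.shape, target.shape)
```

A series of 10 values and 5 lags yields 5 training rows, because the first 5 observations have no complete window behind them.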

This type of transformation also allows exogenous variables to be included alongside the lags of the time series.

Transformation of a time series joining an exogenous variable.
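As a rough sketch, the exogenous variable can be appended as an extra column of the lag matrix. The example assumes that each row receives the exogenous value aligned with the time step of its target; the arrays are toy data, not the series used later in this document.

```python
import numpy as np

y = np.arange(8, dtype=float)        # toy target series
exog = 10.0 * np.arange(8)           # toy exogenous series aligned with y
n_lags = 3

# Lag matrix: column 0 is lag 1, column 1 is lag 2, and so on
X_lags = np.array([y[i:i + n_lags][::-1] for i in range(len(y) - n_lags)])

# Append the exogenous value that corresponds to each target time step
X = np.column_stack([X_lags, exog[n_lags:]])
target = y[n_lags:]
print(X[0], target[0])
```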

Direct multi-step forecasting

The direct multi-step forecasting method consists of training a different model for each step. For example, to predict the following 5 values of a time series, 5 different models must be trained, one for each step. As a result, the predictions are independent of each other.

The main complexity of this approach is to generate the correct training matrices for each model. The ForecasterAutoregMultiOutput class of the Skforecast package automates this process. It is also important to bear in mind that this strategy has a higher computational cost since it requires training multiple models. The following diagram shows the process for a case in which the response variable and two exogenous variables are available.

Transformation of a time series into the necessary matrices to train a direct multi-step forecasting model.
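The sketch below illustrates the idea for lags only (no exogenous variables): all models share the same lag matrix, and the model for step h is trained on the target shifted h-1 positions further into the future. `direct_training_sets` is a hypothetical helper, not the ForecasterAutoregMultiOutput implementation.

```python
import numpy as np

def direct_training_sets(y, n_lags, steps):
    """Hypothetical helper: one shared lag matrix X and one target
    vector per forecasting step (keys 1..steps)."""
    y = np.asarray(y, dtype=float)
    # Only rows with a known target for every step can be used
    n_rows = len(y) - n_lags - steps + 1
    X = np.array([y[i:i + n_lags][::-1] for i in range(n_rows)])
    targets = {
        h: y[n_lags + h - 1 : n_lags + h - 1 + n_rows]
        for h in range(1, steps + 1)
    }
    return X, targets

X, targets = direct_training_sets(np.arange(10.0), n_lags=3, steps=2)
print(X.shape, targets[1], targets[2])
```

Each entry of `targets` would be paired with `X` to fit a separate regressor, which is why the cost grows with the number of steps.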



Multiple output forecasting

Certain models are capable of simultaneously predicting several values of a sequence (one-shot). An example of a model with this capability is the LSTM neural network.

Recursive autoregressive forecasting


The data set is a time series of the monthly expenditure (millions of dollars) of the Australian health system on corticosteroid drugs between 1991 and 2008. The goal is to create an autoregressive model capable of predicting future monthly expenditures.

Packages


The packages used in this paper are:

In [1]:
# Data manipulation
# ==============================================================================
import numpy as np
import pandas as pd

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
%matplotlib inline

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

In addition to the above, Skforecast, a library containing the classes and functions needed to adapt any Scikit-learn regression model to forecasting problems, is used. It can be installed in the following ways:

pip install skforecast

A specific version:

pip install git+https://github.com/JoaquinAmatRodrigo/skforecast@v0.3.0

Latest version (unstable):

pip install git+https://github.com/JoaquinAmatRodrigo/skforecast@master

In [2]:
# Modeling and Forecasting
# ==============================================================================
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregCustom import ForecasterAutoregCustom
from skforecast.ForecasterAutoregMultiOutput import ForecasterAutoregMultiOutput
from skforecast.model_selection import grid_search_forecaster
from skforecast.model_selection import backtesting_forecaster
from skforecast.model_selection import backtesting_forecaster_intervals

from joblib import dump, load

Data


The data used in the examples of this paper have been obtained from the magnificent book Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos.

In [3]:
# Data download
# ==============================================================================
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o.csv'
data_raw = pd.read_csv(url, sep=',')
data_raw = data_raw.rename(columns={'fecha': 'date'})

The column date has been stored as a string. To convert it to datetime, the pd.to_datetime() function can be used. Once in datetime format, and to make use of pandas functionalities, it is set as the index. Also, since the data is monthly, the frequency is set to month start ('MS').

In [4]:
# Data preparation
# ==============================================================================
data = data_raw.copy()
data['date'] = pd.to_datetime(data['date'], format='%Y/%m/%d')
data = data.set_index('date')
data = data.rename(columns={'x': 'y'})
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()

The time series is verified to be complete.

In [5]:
# Verify that a temporal index is complete
# ==============================================================================
(data.index == pd.date_range(start=data.index.min(),
                             end=data.index.max(),
                             freq=data.index.freq)).all()
Out[5]:
True
In [6]:
# Fill gaps in a temporal index
# ==============================================================================
# data.asfreq(freq='30min', fill_value=np.nan)

The last 36 months are used as the test set to evaluate the predictive capacity of the model.

In [7]:
# Split data into train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

fig, ax=plt.subplots(figsize=(9, 4))
data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
ax.legend();

ForecasterAutoreg


With the ForecasterAutoreg class, a model is created and trained from a RandomForestRegressor regressor with a time window of 6 lags. This means that the model uses the previous 6 months as predictors.

In [8]:
# Create and train forecaster
# ==============================================================================
forecaster_rf = ForecasterAutoreg(
                    regressor=RandomForestRegressor(random_state=123),
                    lags=6
                )

forecaster_rf.fit(y=data_train)

forecaster_rf
Out[8]:
=======================ForecasterAutoreg=======================
Regressor: RandomForestRegressor(random_state=123)
Lags: [1 2 3 4 5 6]
Exogenous variable: False, None
Parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 123, 'verbose': 0, 'warm_start': False}

Predictions


Once the model is trained, the test data is predicted (36 months into the future).

In [9]:
# Predictions
# ==============================================================================
steps = 36
predictions = forecaster_rf.predict(steps=steps)
# Temporal index is added to predictions
predictions = pd.Series(data=predictions, index=data_test.index)
predictions.head()
Out[9]:
date
2005-07-01    0.866263
2005-08-01    0.874688
2005-09-01    0.951697
2005-10-01    0.991223
2005-11-01    0.952589
Freq: MS, dtype: float64
In [10]:
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(9, 4))
data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();

Prediction error in the test set


The error that the model makes in its predictions is quantified. In this case, the metric used is the mean squared error (mse).
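For reference, the mean squared error is the average of the squared differences between observed and predicted values, $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. A small hand computation with NumPy on toy numbers (not the series used here) shows what the metric measures:

```python
import numpy as np

# Toy values chosen for illustration only
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# Squared errors are 0.25, 0.0 and 1.0; their mean is the mse
mse = float(np.mean((y_true - y_pred) ** 2))
print(mse)
```

Because the errors are squared, the metric is expressed in squared units of the series and penalizes large deviations more heavily than small ones.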

In [11]:
# Error
# ==============================================================================
error_mse = mean_squared_error(
                y_true = data_test,
                y_pred = predictions
            )
print(f"Test error (mse): {error_mse}")
Test error (mse): 0.0665147739321922

Hyperparameter tuning


The trained ForecasterAutoreg uses a time window of 6 lags and a Random Forest model with the default hyperparameters. However, there is no reason to assume that these values are the most suitable.

To identify the best combination of lags and hyperparameters, time series cross-validation and backtesting strategies are available in the Skforecast package. Regardless of the procedure used, it is important not to include the test data in the search process to avoid overfitting problems. Time series cross-validation along the training dataset is used in this case. For the first fold, the initial 50% of the observations are the training data and the next 10 steps represent the validation set. In successive folds, the training set contains all the data used in the previous fold and the next 10 steps are used as new validation data. This process is repeated until the entire training data set is used.
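The fold layout described above can be sketched with plain Python. The function below is a hypothetical illustration of an expanding training window; its parameter names mirror the arguments used in this document, but it is not the code that Skforecast runs.

```python
def expanding_window_folds(n_obs, initial_train_size, steps,
                           allow_incomplete_fold=False):
    """Hypothetical helper: return (train_indices, validation_indices)
    pairs where the training window grows by `steps` at each fold."""
    folds = []
    train_end = initial_train_size
    while train_end < n_obs:
        val_end = min(train_end + steps, n_obs)
        # Optionally discard a last fold with fewer than `steps` observations
        if val_end - train_end < steps and not allow_incomplete_fold:
            break
        folds.append((range(0, train_end), range(train_end, val_end)))
        train_end = val_end
    return folds

folds = expanding_window_folds(n_obs=45, initial_train_size=20, steps=10)
for train_idx, val_idx in folds:
    print(f"train: 0..{train_idx[-1]}  validation: {val_idx[0]}..{val_idx[-1]}")
```

With 45 observations, the last 5 do not fill a complete validation fold, so they are only used when `allow_incomplete_fold=True`.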

In [12]:
# Hyperparameter Grid search
# ==============================================================================
forecaster_rf = ForecasterAutoreg(
                    regressor = RandomForestRegressor(random_state=123),
                    lags      = 12 # This value will be replaced in the grid search
                 )

# Regressor's hyperparameters
param_grid = {'n_estimators': [100, 500],
              'max_depth': [3, 5, 10]}

# Lags used as predictors
lags_grid = [10, 20]

results_grid = grid_search_forecaster(
                        forecaster  = forecaster_rf,
                        y           = data_train,
                        param_grid  = param_grid,
                        lags_grid   = lags_grid,
                        steps       = 10,
                        method      = 'cv',
                        metric      = 'mean_squared_error',
                        initial_train_size    = int(len(data_train)*0.5),
                        allow_incomplete_fold = False,
                        return_best = True,
                        verbose     = False
                   )
2021-12-02 14:53:49,167 root       INFO  Number of models compared: 12
2021-12-02 14:54:32,205 root       INFO  Refitting `forecaster` using the best found parameters and the whole data set: 
lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] 
params: {'max_depth': 10, 'n_estimators': 500}

In [13]:
# Grid Search results
# ==============================================================================
results_grid
Out[13]:
    lags                                               params                                  metric    max_depth  n_estimators
11  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_depth': 10, 'n_estimators': 500}  0.005271  10         500
10  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_depth': 10, 'n_estimators': 100}  0.005331  10         100
 9  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_depth': 5, 'n_estimators': 500}   0.005354  5          500
 8  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_depth': 5, 'n_estimators': 100}   0.005514  5          100
 7  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_depth': 3, 'n_estimators': 500}   0.005744  3          500
 6  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_depth': 3, 'n_estimators': 100}   0.005821  3          100
 5  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]                    {'max_depth': 10, 'n_estimators': 500}  0.026603  10         500
 4  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]                    {'max_depth': 10, 'n_estimators': 100}  0.028092  10         100
 2  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]                    {'max_depth': 5, 'n_estimators': 100}   0.028693  5          100
 3  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]                    {'max_depth': 5, 'n_estimators': 500}   0.028870  5          500
 1  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]                    {'max_depth': 3, 'n_estimators': 500}   0.029387  3          500
 0  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]                    {'max_depth': 3, 'n_estimators': 100}   0.030020  3          100

The best results are obtained using a time window of 20 lags and a Random Forest configuration of {'max_depth': 10, 'n_estimators': 500}.
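The 12 models compared in this search result from crossing the 2 lag options with the 6 hyperparameter combinations. The following standard-library sketch reproduces that count; it illustrates the enumeration only and is not how grid_search_forecaster() is implemented.

```python
from itertools import product

param_grid = {'n_estimators': [100, 500],
              'max_depth': [3, 5, 10]}
lags_grid = [10, 20]

# Every combination of regressor hyperparameters (2 x 3 = 6)
param_combinations = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

# Every (lags, hyperparameters) candidate evaluated by the search
candidates = [(lags, params)
              for lags in lags_grid
              for params in param_combinations]
print(len(candidates))
```

Since every candidate is refitted on each validation fold, the search time grows quickly as values are added to either grid.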

Final model


Finally, a ForecasterAutoreg is trained with the optimal configuration found by validation. This step is not necessary if return_best = True is specified in the grid_search_forecaster() function.

In [14]:
# Create and train forecaster with the best hyperparameters
# ==============================================================================
regressor = RandomForestRegressor(max_depth=10, n_estimators=500, random_state=123)

forecaster_rf = ForecasterAutoreg(
                    regressor = regressor,
                    lags      = 20
                )

forecaster_rf.fit(y=data_train)
In [15]:
# Predictions
# ==============================================================================
predictions = forecaster_rf.predict(steps=steps)
# Temporal index is added to predictions
predictions = pd.Series(data=predictions, index=data_test.index)
In [16]:
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(9, 4))
data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
In [17]:
# Test error
# ==============================================================================
error_mse = mean_squared_error(
                y_true = data_test,
                y_pred = predictions
            )
print(f"Test error (mse): {error_mse}")
Test error (mse): 0.003977043111878048

With the optimal combination of lags and hyperparameters, the test error (mse) drops from 0.0665 to 0.0040.

Predictors importance


Since the ForecasterAutoreg object uses Scikit-learn models, the importance of the predictors can be accessed once the model is trained. When the regressor is a LinearRegression(), Lasso() or Ridge(), the model coefficients reflect the importance of each predictor and are obtained with the get_coef() method. For GradientBoostingRegressor() or RandomForestRegressor() regressors, the importance of the predictors is based on impurity reduction and is accessible through the get_feature_importances() method. In both cases, the returned values follow the order of the lags.

In [18]:
# Predictors importance
# ==============================================================================
importance = forecaster_rf.get_feature_importances()
dict(zip(forecaster_rf.lags, importance))
Out[18]:
{1: 0.012553886713487061,
 2: 0.08983951332807713,
 3: 0.010659102591406495,
 4: 0.002089457758242796,
 5: 0.0020659673443678165,
 6: 0.0025603607878369916,
 7: 0.002581234419075128,
 8: 0.00692666663137377,
 9: 0.011527319810863246,
 10: 0.02536813007770579,
 11: 0.0179467896103424,
 12: 0.7756201041018168,
 13: 0.0031870877064889024,
 14: 0.014718107972907108,
 15: 0.00787688226587692,
 16: 0.003480591965198101,
 17: 0.0027116142341500264,
 18: 0.0020417148550876244,
 19: 0.0018876654387458749,
 20: 0.004357802386950057}
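To read such output more easily, the lags can be ranked by importance. The sketch below uses a subset of the values shown above, rounded to four decimals and typed by hand, so it runs without the fitted forecaster.

```python
# Subset of the importances above, rounded to 4 decimals
importance_by_lag = {1: 0.0126, 2: 0.0898, 3: 0.0107, 10: 0.0254,
                     11: 0.0179, 12: 0.7756, 14: 0.0147}

# Sort the (lag, importance) pairs from most to least important
ranked = sorted(importance_by_lag.items(),
                key=lambda item: item[1], reverse=True)
top_lags = [lag for lag, _ in ranked[:3]]
print(top_lags)
```

Lag 12 accounts for most of the importance, which would be consistent with a yearly seasonal pattern in monthly data, although impurity-based importances should be interpreted with caution.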

Recursive autoregressive forecasting with exogenous variables


In the previous example, only lags of the predicted variable itself have been used as predictors. In certain scenarios, information is available about other variables whose future values are known, so they can serve as additional predictors in the model.

Continuing with the previous example, a new variable is simulated whose behavior is correlated with the modeled time series and which is to be incorporated as a predictor. The same approach applies to multiple exogenous variables.

Data

In [19]:
# Data download
# ==============================================================================
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o_exog.csv'
data_raw = pd.read_csv(url, sep=',')
data_raw = data_raw.rename(columns={'fecha': 'date'})

# Data preparation
# ==============================================================================
data = data_raw.copy()
data['date'] = pd.to_datetime(data['date'], format='%Y/%m/%d')
data = data.set_index('date')
data = data.asfreq('MS')
data = data.sort_index()

fig, ax = plt.subplots(figsize=(9, 4))
data['y'].plot(ax=ax, label='y')
data['exog_1'].plot(ax=ax, label='exogenous variable')
ax.legend();
In [20]:
# Split data into train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

ForecasterAutoreg

In [21]:
# Create and train forecaster
# ==============================================================================
forecaster_rf = ForecasterAutoreg(
                    regressor = RandomForestRegressor(random_state=123),
                    lags      = 8
                )

forecaster_rf.fit(y=data_train['y'], exog=data_train['exog_1'])

Predictions


If the ForecasterAutoreg is trained with an exogenous variable, the value of this variable must be passed to predict(). It is only applicable to scenarios in which future information on the exogenous variable is available.
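The reason becomes clear when the recursive loop is written out: at every step the model needs the exogenous value for that future time point, in addition to the lags. The sketch below is a hypothetical illustration with a toy rule in place of a fitted regressor; none of these names belong to Skforecast.

```python
import numpy as np

def recursive_forecast_with_exog(predict_one_step, last_window, future_exog):
    """Hypothetical helper: one prediction per future exogenous value,
    feeding each prediction back into the lag window."""
    window = np.asarray(last_window, dtype=float).copy()
    predictions = []
    for exog_value in future_exog:
        y_hat = predict_one_step(window, exog_value)
        predictions.append(y_hat)
        # Slide the window: drop the oldest value, append the new prediction
        window = np.append(window[1:], y_hat)
    return np.array(predictions)

# Toy rule (assumption): average of the last value and the exogenous value
toy_model = lambda window, exog_value: 0.5 * window[-1] + 0.5 * exog_value

preds = recursive_forecast_with_exog(toy_model, [1.0, 2.0],
                                     future_exog=[4.0, 6.0, 8.0])
print(preds)
```

The number of steps that can be predicted is bounded by the number of future exogenous values available.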

In [22]:
# Predictions
# ==============================================================================
predictions = forecaster_rf.predict(steps=steps, exog=data_test['exog_1'])
# Temporal index is added to predictions
predictions = pd.Series(data=predictions, index=data_test.index)
In [23]:
# Plot
# ==============================================================================
fig, ax=plt.subplots(figsize=(9, 4))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();

Prediction error in the test set

In [24]:
# Error
# ==============================================================================
error_mse = mean_squared_error(
                y_true = data_test['y'],
                y_pred = predictions
            )
print(f"Test error (mse): {error_mse}")
Test error (mse): 0.03989087922533575

Hyperparameter tuning

In [25]:
# Hyperparameter Grid search
# ==============================================================================
forecaster_rf = ForecasterAutoreg(
                    regressor = RandomForestRegressor(random_state=123),
                    lags      = 12 # This value will be replaced in the grid search
                 )

param_grid = {'n_estimators': [50, 100, 500],
              'max_depth': [3, 5, 10]}

lags_grid = [5, 12, 20]

results_grid = grid_search_forecaster(
                        forecaster  = forecaster_rf,
                        y           = data_train['y'],
                        exog        = data_train['exog_1'],
                        param_grid  = param_grid,
                        lags_grid   = lags_grid,
                        steps       = 10,
                        method      = 'cv',
                        metric      = 'mean_squared_error',
                        initial_train_size    = int(len(data_train)*0.5),
                        allow_incomplete_fold = False,
                        return_best = True,
                        verbose     = False
                    )
2021-12-02 14:54:40,852 root       INFO  Number of models compared: 27
2021-12-02 14:55:48,054 root       INFO  Refitting `forecaster` using the best found parameters and the whole data set: 
lags: [ 1  2  3  4  5  6  7  8  9 10 11 12] 
params: {'max_depth': 3, 'n_estimators': 50}

In [26]:
# Grid Search results
# ==============================================================================
results_grid.head()
Out[26]:
    lags                                               params                                  metric    max_depth  n_estimators
 9  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]            {'max_depth': 3, 'n_estimators': 50}    0.007815  3          50
12  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]            {'max_depth': 5, 'n_estimators': 50}    0.007940  5          50
20  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_depth': 3, 'n_estimators': 500}   0.007944  3          500
19  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_depth': 3, 'n_estimators': 100}   0.008057  3          100
25  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_depth': 10, 'n_estimators': 100}  0.008106  10         100

The best results are obtained using a time window of 12 lags and a Random Forest configuration of {'max_depth': 3, 'n_estimators': 50}.

Final model


Since return_best = True was set in grid_search_forecaster(), the ForecasterAutoreg object has already been updated and retrained with the best configuration found during the search.

In [27]:
# Predictions
# ==============================================================================
predictions = forecaster_rf.predict(steps=steps, exog=data_test['exog_1'])
# Temporal index is added to predictions
predictions = pd.Series(data=predictions, index=data_test.index)

# Plot
# ==============================================================================
fig, ax=plt.subplots(figsize=(9, 4))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
In [28]:
# Error
# ==============================================================================
error_mse = mean_squared_error(y_true = data_test['y'], y_pred = predictions)
print(f"Test error (mse): {error_mse}")
Test error (mse): 0.003965444559763559

Recursive autoregressive forecasting with custom predictors


In some scenarios, it may be useful to incorporate other characteristics of the time series in addition to the lags. For example, the moving average of the last n values could be used to capture the trend of the series.

The ForecasterAutoregCustom class behaves very similarly to the ForecasterAutoreg class seen in the previous sections, with the difference that the user defines the function used to create the predictors.

The first example of the paper, predicting the last 36 months of the time series, is repeated. In this case, the predictors are the first 10 lags and the moving average of the last 20 months.

Data

In [29]:
# Data download
# ==============================================================================
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o.csv'
data_raw = pd.read_csv(url, sep=',')
data_raw = data_raw.rename(columns={'fecha': 'date'})
In [30]:
# Data preparation
# ==============================================================================
data = data_raw.copy()
data['date'] = pd.to_datetime(data['date'], format='%Y/%m/%d')
data = data.set_index('date')
data = data.rename(columns={'x': 'y'})
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()
In [31]:
# Split data into train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

ForecasterAutoregCustom


A ForecasterAutoregCustom is created and trained from a RandomForestRegressor regressor. The create_predictors() function, which calculates the first 10 lags and the moving average of the last 20 values, is used to create the predictors.

In [32]:
# Function to calculate predictors from time series
# ==============================================================================
def create_predictors(y):
    '''
    Create the first 10 lags.
    Calculate the moving average of the last 20 values.
    '''
    
    X_train = pd.DataFrame({'y':y.copy()})
    for i in range(0, 10):
        X_train[f'lag_{i+1}'] = X_train['y'].shift(i)
        
    X_train['moving_avg'] = X_train['y'].rolling(20).mean()
    
    X_train = X_train.drop(columns='y').tail(1).to_numpy()  
    
    return X_train  

When creating the forecaster, the window_size argument must be equal to or greater than the window used by the function that creates the predictors. This value, in this case, is 20.
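To see why 20 values are needed, the same predictors can be computed with NumPy on a single window. This sketch assumes that the function receives the most recent window_size observations in chronological order; `create_predictors_np` is a hypothetical alternative written for illustration, not the function used in the rest of this document.

```python
import numpy as np

def create_predictors_np(window):
    """Hypothetical NumPy version: 10 lags plus the moving average
    of the last 20 values, returned as a single feature vector."""
    window = np.asarray(window, dtype=float)
    lags = window[-1:-11:-1]            # last 10 values, most recent first
    moving_avg = window[-20:].mean()    # needs at least 20 observations
    return np.hstack([lags, moving_avg])

window = np.arange(1.0, 21.0)           # toy window with values 1..20
features = create_predictors_np(window)
print(features)
```

The lags only require the last 10 values, but the moving average spans 20, so the longest window used by the function determines the minimum window_size.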

In [33]:
# Create and train forecaster
# ==============================================================================
forecaster_rf = ForecasterAutoregCustom(
                    regressor      = RandomForestRegressor(random_state=123),
                    fun_predictors = create_predictors,
                    window_size    = 20
                )

forecaster_rf.fit(y=data_train)
forecaster_rf
Out[33]:
=======================ForecasterAutoregCustom=======================
Regressor: RandomForestRegressor(random_state=123)
Predictors created with: create_predictors
Window size: 20
Exogenous variable: False, None
Parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 123, 'verbose': 0, 'warm_start': False}

Predictions

In [34]:
# Predictions
# ==============================================================================
steps = 36
predictions = forecaster_rf.predict(steps=steps)
# Temporal index is added to predictions
predictions = pd.Series(data=predictions, index=data_test.index)
In [35]:
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(9, 4))
data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();

Prediction error in the test set

In [36]:
# Error
# ==============================================================================
error_mse = mean_squared_error(
                y_true = data_test,
                y_pred = predictions
            )
print(f"Test error (mse): {error_mse}")
Test error (mse): 0.04487765885818191

Hyperparameter tuning


When using the grid_search_forecaster() function with a ForecasterAutoregCustom, the lags_grid argument is not specified.

In [37]:
# Hyperparameter Grid search
# ==============================================================================
forecaster_rf = ForecasterAutoregCustom(
                    regressor      = RandomForestRegressor(random_state=123),
                    fun_predictors = create_predictors,
                    window_size    = 20
                )

# Regressor's hyperparameters
param_grid = {'n_estimators': [100, 500],
              'max_depth': [3, 5, 10]}

results_grid = grid_search_forecaster(
                        forecaster  = forecaster_rf,
                        y           = data_train,
                        param_grid  = param_grid,
                        steps       = 10,
                        method      = 'cv',
                        metric      = 'mean_squared_error',
                        initial_train_size    = int(len(data_train)*0.5),
                        allow_incomplete_fold = True,
                        return_best = True,
                        verbose     = False
                    )
2021-12-02 14:55:49,913 root       INFO  Number of models compared: 6
2021-12-02 14:56:38,247 root       INFO  Refitting `forecaster` using the best found parameters and the whole data set: 
lags: custom predictors 
params: {'max_depth': 10, 'n_estimators': 500}

In [38]:
# Grid Search results
# ==============================================================================
results_grid
Out[38]:
   lags               params                                  metric    max_depth  n_estimators
5  custom predictors  {'max_depth': 10, 'n_estimators': 500}  0.022736         10           500
3  custom predictors  {'max_depth': 5, 'n_estimators': 500}   0.022742          5           500
4  custom predictors  {'max_depth': 10, 'n_estimators': 100}  0.023564         10           100
2  custom predictors  {'max_depth': 5, 'n_estimators': 100}   0.024030          5           100
1  custom predictors  {'max_depth': 3, 'n_estimators': 500}   0.025694          3           500
0  custom predictors  {'max_depth': 3, 'n_estimators': 100}   0.026545          3           100
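
The 6 models reported in the log correspond to every combination of the values in param_grid (2 values of n_estimators times 3 values of max_depth). A minimal sketch of that enumeration, using only the standard library, could look as follows. It illustrates the idea and is not the code used internally by grid_search_forecaster().

```python
from itertools import product

# Same hyperparameter grid as in the search above
param_grid = {'n_estimators': [100, 500],
              'max_depth': [3, 5, 10]}

# Cartesian product of all candidate values, one dict per combination
keys = sorted(param_grid)
combinations = [dict(zip(keys, values))
                for values in product(*(param_grid[k] for k in keys))]

print(len(combinations))
for params in combinations:
    print(params)
```

Each combination is evaluated with the same validation scheme, so the search time grows with the product of the sizes of all lists in the grid.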

Final model

In [39]:
# Predictions
# ==============================================================================
predictions = forecaster_rf.predict(steps=steps)
# Temporal index is added to predictions
predictions = pd.Series(data=predictions, index=data_test.index)

# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(9, 4))
data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
In [40]:
# Error
# ==============================================================================
error_mse = mean_squared_error(y_true = data_test, y_pred = predictions)
print(f"Test error (mse) {error_mse}")
Test error (mse) 0.044590618568342955

Direct multi-step forecasting


The ForecasterAutoreg and ForecasterAutoregCustom models follow a recursive prediction strategy in which each new prediction builds on the previous one. An alternative is to train a separate model for each of the steps to be predicted. This strategy, commonly known as direct multi-step forecasting, is computationally more expensive than the recursive one because it requires training as many models as steps. However, in some scenarios, it achieves better results. These kinds of models can be obtained with the ForecasterAutoregMultiOutput class and can include one or multiple exogenous variables.
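
To make the idea concrete, the following sketch implements the direct strategy with plain NumPy: one least-squares linear model is fitted per forecast horizon, all of them sharing the same lags as predictors. The helpers fit_direct and predict_direct are hypothetical names introduced for this illustration and are not part of Skforecast. The toy series is also an assumption.

```python
import numpy as np

def fit_direct(y, n_lags, steps):
    """Fit one least-squares linear model per forecast horizon (1..steps)."""
    y = np.asarray(y, dtype=float)
    models = []
    for h in range(1, steps + 1):
        X, target = [], []
        # Each row holds n_lags consecutive values; its target lies h steps ahead
        for i in range(n_lags, len(y) - h + 1):
            X.append(y[i - n_lags:i])
            target.append(y[i + h - 1])
        X = np.column_stack([np.ones(len(X)), np.array(X)])  # intercept column
        coef, *_ = np.linalg.lstsq(X, np.array(target), rcond=None)
        models.append(coef)
    return models

def predict_direct(models, y, n_lags):
    """Every model predicts its own horizon from the same last window."""
    last_window = np.concatenate([[1.0], np.asarray(y, dtype=float)[-n_lags:]])
    return np.array([last_window @ coef for coef in models])

# Toy series 0, 1, ..., 49 (illustrative assumption)
y = np.arange(50, dtype=float)
models = fit_direct(y, n_lags=3, steps=5)
predictions = predict_direct(models, y, n_lags=3)
print(np.round(predictions, 6))
```

Note that no prediction is fed back as a predictor: each horizon has its own model, which is why the number of steps has to be fixed before training.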

Data

In [41]:
# Data download
# ==============================================================================
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o.csv'
data_raw = pd.read_csv(url, sep=',')
data_raw = data_raw.rename(columns={'fecha': 'date'})

# Data preparation
# ==============================================================================
data = data_raw.copy()
data['date'] = pd.to_datetime(data['date'], format='%Y/%m/%d')
data = data.set_index('date')
data = data.rename(columns={'x': 'y'})
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()
In [42]:
# Split data into train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

ForecasterAutoregMultiOutput


Unlike ForecasterAutoreg and ForecasterAutoregCustom, a ForecasterAutoregMultiOutput requires the number of steps to be specified when the forecaster is created, because one model is trained for each step. As a consequence, the predict() method always returns the same number of predictions.

In [43]:
# Hyperparameter Grid search
# ==============================================================================
forecaster_rf = ForecasterAutoregMultiOutput(
                    regressor = Lasso(random_state=123),
                    steps     = 36,
                    lags      = 8 # This value will be replaced in the grid search
                )

param_grid = {'alpha': np.logspace(-5, 5, 10)}

lags_grid = [5, 12, 20]

results_grid = grid_search_forecaster(
                        forecaster  = forecaster_rf,
                        y           = data_train,
                        param_grid  = param_grid,
                        lags_grid   = lags_grid,
                        steps       = 36,
                        method      = 'cv',
                        metric      = 'mean_squared_error',
                        initial_train_size    = int(len(data_train)*0.5),
                        allow_incomplete_fold = False,
                        return_best = True,
                        verbose     = False
                    )
2021-12-02 14:56:40,855 root       INFO  Number of models compared: 30
2021-12-02 14:56:42,208 root       INFO  Refitting `forecaster` using the best found parameters and the whole data set: 
lags: [ 1  2  3  4  5  6  7  8  9 10 11 12] 
params: {'alpha': 0.0016681005372000592}

In [44]:
# Grid Search results
# ==============================================================================
results_grid.head()
Out[44]:
    lags                                                params                            metric    alpha
12  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]             {'alpha': 0.0016681005372000592}  0.009650  0.001668
22  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...   {'alpha': 0.0016681005372000592}  0.009872  0.001668
11  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]             {'alpha': 0.0001291549665014884}  0.012075  0.000129
10  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]             {'alpha': 1e-05}                  0.012371  0.000010
21  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...   {'alpha': 0.0001291549665014884}  0.015304  0.000129

The best results are obtained using a time window of 12 lags and a Lasso setting {'alpha': 0.001668}.
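
The candidate values of alpha come from np.logspace(-5, 5, 10), which returns 10 values evenly spaced on a base-10 logarithmic scale between 1e-5 and 1e5. Combined with the 3 entries of lags_grid, this gives the 30 models reported in the log. A short check of both facts:

```python
import numpy as np

# 10 candidate values, evenly spaced in log10 between 1e-5 and 1e5
alphas = np.logspace(-5, 5, 10)
lags_grid = [5, 12, 20]

# Total number of (lags, alpha) combinations evaluated by the grid search
n_models = len(alphas) * len(lags_grid)
print(n_models)

# First candidate values; the third one is the alpha selected above
print(alphas[:4])
```

Since the spacing is logarithmic, consecutive candidates differ by a constant factor (about 12.9 here) rather than by a constant amount.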

In [45]:
# Predictions
# ==============================================================================
predictions = forecaster_rf.predict()
# Temporal index is added to predictions
predictions = pd.Series(data=predictions, index=data_test.index)

# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(9, 4))
data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
In [46]:
# Error
# ==============================================================================
error_mse = mean_squared_error(y_true = data_test, y_pred = predictions)
print(f"Test error (mse) {error_mse}")
Test error (mse) 0.008363774087426282

Backtesting


Backtesting consists of simulating how the model would have performed if it had been run on a recurring basis, for example, forecasting a total of 9 years in consecutive intervals of 3 years (36 months). This type of evaluation can be easily applied with the backtesting_forecaster() function, which returns an error metric in addition to the predictions.

In [47]:
# Backtesting
# ==============================================================================
n_test = 36*3 # The last 9 years are separated for the backtest
data_train = data[:-n_test]
data_test  = data[-n_test:]

steps = 36 # 3 year (36 month) folds are used
regressor = LinearRegression()
forecaster = ForecasterAutoreg(regressor=regressor, lags=15)

metric, predictions_backtest = backtesting_forecaster(
                                    forecaster = forecaster,
                                    y          = data,
                                    initial_train_size = len(data_train),
                                    steps      = steps,
                                    metric     = 'mean_squared_error',
                                    verbose    = True
                               )

print(f"Backtest error: {metric}")
Number of observations used for training: 96
Number of observations used for testing: 108
    Number of folds: 3
    Number of steps per fold: 36
Backtest error: [0.0202643]
In [48]:
# Add datetime index to predictions
predictions_backtest = pd.Series(data=predictions_backtest, index=data_test.index)
fig, ax = plt.subplots(figsize=(9, 4))
#data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
predictions_backtest.plot(ax=ax, label='predictions')
ax.legend();
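
The fold layout printed by the verbose output (96 training observations, 108 test observations, 3 folds of 36 steps) can be reproduced with a few lines of plain Python. The helper backtest_folds is a hypothetical name used only for this sketch, and n_obs=204 is the sum of the training and test sizes reported above.

```python
def backtest_folds(n_obs, initial_train_size, steps):
    """Return (start, end) positions of each test fold; end is exclusive."""
    folds = []
    start = initial_train_size
    while start < n_obs:
        end = min(start + steps, n_obs)  # the last fold may be shorter
        folds.append((start, end))
        start = end
    return folds

folds = backtest_folds(n_obs=204, initial_train_size=96, steps=36)
print(folds)  # [(96, 132), (132, 168), (168, 204)]
```

Each pair delimits the positions of the series that are predicted in one fold, so the three folds together cover the 108 test observations exactly once.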

Prediction intervals


A prediction interval defines the interval within which the true value of $y$ is expected to be found with a given probability.

In their book Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos list multiple ways to estimate prediction intervals, most of which require the residuals (errors) of the model to be normally distributed. When this property cannot be assumed, bootstrapping can be used instead, since it only assumes that the residuals are uncorrelated. This is the method used in the Skforecast library for the ForecasterAutoreg and ForecasterAutoregCustom models.
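
The bootstrapping idea can be sketched with NumPy alone: fit an autoregressive model, keep its in-sample residuals, simulate many future paths by adding a randomly resampled residual to each recursive prediction, and take percentiles of the simulated paths at every step. The functions fit_ar and bootstrap_interval, as well as the synthetic series, are assumptions made for illustration. This is not Skforecast's implementation.

```python
import numpy as np

def fit_ar(y, n_lags):
    """Least-squares autoregressive model; returns coefficients and residuals."""
    X = np.column_stack(
        [np.ones(len(y) - n_lags)]
        + [y[n_lags - k:len(y) - k] for k in range(1, n_lags + 1)]  # lag k
    )
    target = y[n_lags:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef, target - X @ coef

def bootstrap_interval(y, coef, residuals, steps, interval=(1, 99),
                       n_boot=500, seed=123):
    """Percentile interval from simulated recursive forecast paths."""
    rng = np.random.default_rng(seed)
    n_lags = len(coef) - 1
    paths = np.empty((n_boot, steps))
    for b in range(n_boot):
        window = list(y[-n_lags:])  # chronological order
        for s in range(steps):
            # Lag 1 is the most recent value, so the window is reversed
            x = np.concatenate([[1.0], window[::-1]])
            value = x @ coef + rng.choice(residuals)  # add a resampled residual
            paths[b, s] = value
            window = window[1:] + [value]
    lower, upper = np.percentile(paths, interval, axis=0)
    return lower, upper

# Synthetic series with a 12-period cycle plus noise (illustrative assumption)
rng = np.random.default_rng(0)
t = np.arange(200)
y = np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.1, size=t.size)

coef, residuals = fit_ar(y, n_lags=15)
lower, upper = bootstrap_interval(y, coef, residuals, steps=36)
print(lower[:3], upper[:3])
```

Because each simulated value is fed back as a predictor for the following steps, the uncertainty accumulates along the forecast horizon without assuming any particular distribution for the residuals.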

In [49]:
# Data download
# ==============================================================================
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o.csv'
data_raw = pd.read_csv(url, sep=',')
data_raw = data_raw.rename(columns={'fecha': 'date'})

# Data preparation
# ==============================================================================
data = data_raw.copy()
data['date'] = pd.to_datetime(data['date'], format='%Y/%m/%d')
data = data.set_index('date')
data = data.rename(columns={'x': 'y'})
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()

# Split data into train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]
In [50]:
# Create and train forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                    regressor=LinearRegression(),
                    lags=15
                )

forecaster.fit(y=data_train)

# Prediction intervals
# ==============================================================================
predictions = forecaster.predict_interval(
                    steps    = steps,
                    interval = [1, 99],
                    n_boot   = 1000
              )

# Datetime index added
predictions = pd.DataFrame(data=predictions, index=data_test.index)

# Prediction Error
# ==============================================================================
error_mse = mean_squared_error(
                y_true = data_test,
                y_pred = predictions.iloc[:, 0]
            )
print(f"Test error (mse): {error_mse}")

# Plot
# ==============================================================================
fig, ax=plt.subplots(figsize=(9, 4))
#data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
predictions.iloc[:, 0].plot(ax=ax, label='predictions')
ax.fill_between(predictions.index,
                predictions.iloc[:, 1],
                predictions.iloc[:, 2],
                alpha=0.5)
ax.legend();
Test error (mse): 0.011051937043503714
In [51]:
# Backtest with prediction intervals
# ==============================================================================
n_test = 36*3
data_train = data[:-n_test]
data_test  = data[-n_test:]

steps = 36
regressor = LinearRegression()
forecaster = ForecasterAutoreg(regressor=regressor, lags=15)

metric, predictions = backtesting_forecaster_intervals(
                            forecaster = forecaster,
                            y          = data,
                            initial_train_size = len(data_train),
                            steps      = steps,
                            metric     = 'mean_squared_error',
                            interval            = [1, 99],
                            n_boot              = 100,
                            in_sample_residuals = True,
                            verbose             = True,
                       )

print(metric)

# Datetime index is added
predictions = pd.DataFrame(data=predictions, index=data_test.index)

# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(9, 4))
#data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
predictions.iloc[:, 0].plot(ax=ax, label='predictions')
ax.fill_between(predictions.index,
                predictions.iloc[:, 1],
                predictions.iloc[:, 2],
                alpha=0.5)
ax.legend();
Number of observations used for training: 96
Number of observations used for testing: 108
    Number of folds: 3
    Number of steps per fold: 36
[0.0202643]

Load and save models


Skforecast models can be saved and loaded using the pickle or joblib packages. A simple example using joblib is shown below.

In [52]:
# Create forecaster
forecaster = ForecasterAutoreg(LinearRegression(), lags=3)
forecaster.fit(y=pd.Series(np.arange(50)))
In [53]:
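# Save model (dump and load come from joblib; a .joblib extension avoids
# confusing the binary file with a Python script)
dump(forecaster, filename='forecaster.joblib')
Out[53]:
['forecaster.joblib']
In [54]:
# Load model
forecaster_loaded = load('forecaster.joblib')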
In [55]:
# Predict
forecaster_loaded.predict(steps=5)
Out[55]:
array([50., 51., 52., 53., 54.])

Session info

In [56]:
import session_info
session_info.show(html=False)
-----
ipykernel           6.4.1
joblib              1.0.1
matplotlib          3.4.3
numpy               1.19.5
pandas              1.3.0
session_info        1.0.0
skforecast          0.3.0
sklearn             0.24.2
-----
IPython             7.27.0
jupyter_client      6.1.12
jupyter_core        4.7.1
jupyterlab          3.1.11
notebook            6.3.0
-----
Python 3.7.10 (default, Feb 26 2021, 18:47:35) [GCC 7.3.0]
Linux-5.11.0-41-generic-x86_64-with-debian-bullseye-sid
-----
Session information updated at 2021-12-02 14:56

Bibliography


Hyndman, R.J., & Athanasopoulos, G. (2021). Forecasting: principles and practice, 3rd edition. OTexts: Melbourne, Australia.

Svetunkov, I. Time Series Analysis and Forecasting with ADAM.

VanderPlas, J. Python Data Science Handbook.

Hilpisch, Y. Python for Finance: Mastering Data-Driven Finance.

Skforecast

How to cite this paper?

Forecasting series temporales con Python y Scikitlearn by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under an Attribution 4.0 International (CC BY 4.0) license at https://www.cienciadedatos.net/py27-forecasting-series-temporales-python-scikitlearn.html

This work by Joaquín Amat Rodrigo is licensed under a Creative Commons Attribution 4.0 International License.