Model evaluation

This example illustrates how to evaluate a model's performance on historical soccer data.

# Author: Georgios Douzas <gdouzas@icloud.com>
# Licence: MIT

import numpy as np
from sportsbet.datasets import SoccerDataLoader
from sklearn.neighbors import KNeighborsClassifier

Extracting the training data

We extract the training data for the Spanish league. We also remove any missing values and select the market maximum odds.

dataloader = SoccerDataLoader(param_grid={'league': ['Spain']})
X_train, Y_train, Odds_train = dataloader.extract_train_data(
    drop_na_thres=1.0, odds_type='market_maximum'
)

Out:

Football-Data.co.uk: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

The input data:

X_train
home_team away_team league division year home_team_soccer_power_index away_team_soccer_power_index home_team_probability_win away_team_probability_win probability_draw home_team_projected_score away_team_projected_score match_quality
date
2016-08-19 Malaga Osasuna Spain 1 2017 72.57 56.93 0.5475 0.1897 0.2628 1.56 0.70 63.805561
2016-08-19 La Coruna Eibar Spain 1 2017 66.52 62.29 0.5003 0.2260 0.2738 1.47 0.79 64.335545
2016-08-20 Barcelona Betis Spain 1 2017 96.35 69.95 0.9591 0.0071 0.0337 3.40 0.42 81.054510
2016-08-20 Sevilla Espanol Spain 1 2017 78.76 68.75 0.5952 0.1760 0.2288 1.89 0.88 73.415362
2016-08-20 Granada Villarreal Spain 1 2017 55.69 76.79 0.3194 0.3917 0.2889 1.07 1.19 64.559709
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021-12-19 Fuenlabrada Oviedo Spain 2 2022 23.93 34.47 0.3181 0.3483 0.3336 0.97 1.03 28.248873
2021-12-19 Ponferradina Amorebieta Spain 2 2022 32.85 25.44 0.4926 0.2281 0.2793 1.53 0.95 28.674009
2021-12-31 Burgos Amorebieta Spain 2 2022 25.64 25.98 0.3896 0.3090 0.3014 1.26 1.09 25.808880
2021-12-31 Oviedo Ponferradina Spain 2 2022 34.04 32.07 0.4257 0.2608 0.3135 1.24 0.91 33.025648
2021-12-31 Eibar Sociedad B Spain 2 2022 35.73 25.54 0.5271 0.2011 0.2718 1.60 0.88 29.787635

2724 rows × 13 columns



The targets:

Y_train
away_win__full_time_goals draw__full_time_goals home_win__full_time_goals over_2.5__full_time_goals under_2.5__full_time_goals
0 False True False False True
1 False False True True False
2 False False True True False
3 False False True True False
4 False True False False True
... ... ... ... ... ...
2719 False True False False True
2720 False True False False True
2721 False True False True False
2722 False False True False True
2723 False False True True False

2724 rows × 5 columns



Splitting the data

The data are already sorted by date, so we split them chronologically: the first 80% of the observations are kept as training data and the remaining 20% as testing data.

ind = int(len(X_train) * 0.80)
X_test, Y_test, Odds_test = X_train[ind:], Y_train[ind:], Odds_train[ind:]
X_train, Y_train = X_train[:ind], Y_train[:ind]
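
The same chronological split can be sketched on a small synthetic array. The `dates` array and the helper `chronological_split` below are hypothetical and not part of sportsbet; the sketch only shows that slicing date-sorted data by position keeps every test observation later than every training observation.

```python
import numpy as np


def chronological_split(data, train_fraction=0.8):
    """Split date-sorted data by position into (train, test)."""
    ind = int(len(data) * train_fraction)
    return data[:ind], data[ind:]


# Hypothetical toy data: ten consecutive match dates, already sorted.
dates = np.arange('2021-01-01', '2021-01-11', dtype='datetime64[D]')
train, test = chronological_split(dates)

print(len(train), len(test))
# Every training date precedes every test date.
print(train.max() < test.min())
```

A random shuffle would instead mix later matches into the training set, which is usually undesirable when the model is meant to predict future matches.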

Training a multi-output classifier

We train a KNeighborsClassifier on the numerical features of the input data, using the extracted targets as the multi-output labels.

num_features = [
    col
    for col in X_train.columns
    if X_train[col].dtype in (np.dtype(int), np.dtype(float))
]
clf = KNeighborsClassifier()
clf.fit(X_train[num_features], Y_train)

Out:

KNeighborsClassifier()
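
As a minimal sketch of what multi-output fitting means here, the snippet below fits the same estimator on a tiny synthetic data set with two boolean target columns. The arrays `X_toy` and `Y_toy` are made up for illustration. With a two-dimensional target, scikit-learn's `KNeighborsClassifier.predict_proba` returns a list with one probability array per target column, so the positive-class probabilities have to be stacked column by column.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: six samples, one feature, two boolean targets.
X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
Y_toy = np.array(
    [
        [True, False],
        [True, False],
        [True, False],
        [False, True],
        [False, True],
        [False, True],
    ]
)

clf_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, Y_toy)
probs = clf_toy.predict_proba(np.array([[0.05], [1.15]]))

# One array per target column, each of shape (n_samples, n_classes).
print(len(probs), probs[0].shape)
# Keep the probability of the positive class (True) for every target.
positive = np.column_stack([p[:, 1] for p in probs])
print(positive)
```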

Estimating the value bets

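We estimate the value bets with the fitted classifier: a bet is a value bet when the predicted probability of its outcome multiplied by the offered odds exceeds 1.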

# predict_proba returns one array per target; keep the positive-class column.
Y_pred_prob = np.concatenate(
    [prob[:, 1].reshape(-1, 1) for prob in clf.predict_proba(X_test[num_features])],
    axis=1,
)
value_bets = Y_pred_prob * Odds_test > 1
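
A short worked example may help here. Assuming decimal odds, as the comparison with 1 implies, a unit stake on an outcome with estimated probability p and odds o has expected return p * o, so the bet has positive expected profit exactly when p * o > 1. The probabilities and odds below are invented for illustration.

```python
import numpy as np

# Hypothetical estimated probabilities for two matches and three outcomes
# (columns: away win, draw, home win).
prob = np.array([[0.20, 0.30, 0.50], [0.10, 0.25, 0.65]])
# Hypothetical decimal odds offered for the same outcomes.
odds = np.array([[4.00, 3.60, 1.90], [12.00, 3.80, 1.50]])

# Expected return of a unit stake: probability times decimal odds.
expected_return = prob * odds
is_value_bet = expected_return > 1
print(is_value_bet)
```

In this toy case only the draw in the first match (0.30 * 3.60 = 1.08) and the away win in the second match (0.10 * 12.00 = 1.20) qualify as value bets.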

We assume a stake of 1 unit on every value bet. The figure below is the mean profit over all match and market combinations in the test set, where combinations that are not value bets contribute zero:

profit = np.nanmean((Y_test.values * Odds_test.values - 1) * value_bets.values)
profit

Out:

-0.013812844036697245
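
To make the calculation concrete, the sketch below repeats it on invented numbers. The expression used above multiplies by the value-bet mask before averaging, so entries that are not value bets count as zero profit. The second figure, averaging only over the bets actually placed, is an alternative shown for comparison and is not part of the original example.

```python
import numpy as np

# Hypothetical outcomes, decimal odds and value-bet flags for two matches.
y = np.array([[False, True, False], [False, False, True]])
odds = np.array([[4.0, 3.5, 2.0], [6.0, 4.0, 1.5]])
value_bets = np.array([[False, True, False], [True, False, False]])

# Profit of a unit stake: odds - 1 if the outcome occurred, otherwise -1.
unit_profit = y * odds - 1
# Same calculation as above: non-value bets contribute zero to the mean.
mean_over_all = np.nanmean(unit_profit * value_bets)
# Alternative (assumption): average only over the bets that were placed.
mean_over_placed = np.nanmean(unit_profit[value_bets])
print(mean_over_all, mean_over_placed)
```

Here one value bet wins 2.5 units and the other loses 1 unit, giving 0.25 when averaged over all six entries and 0.75 when averaged over the two placed bets.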

