---
title: Kaggle-Competition-Favorita
keywords: fastai
sidebar: home_sidebar
nb_path: "nbs/data_datasets__favorita.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
{% endraw %} {% raw %}
{% endraw %} {% raw %}

check_nans[source]

check_nans(df)

Logs the number of rows in df and, for each column, its dtype and NaN share (nan_prc). Used for data wrangling logs.
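
The logs further down this page suggest what such a helper reports: a row count followed by a table with `col`, `dtype` and `nan_prc` columns. A minimal sketch of a function with that behavior is given here. It is not the library's actual implementation: the name `check_nans_sketch` is hypothetical, and treating `nan_prc` as a percentage between 0 and 100 is an assumption.

```python
import pandas as pd


def check_nans_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for check_nans: summarize dtype and NaN share per column."""
    summary = pd.DataFrame({
        "col": df.columns,
        "dtype": [str(t) for t in df.dtypes],
        # Assumption: nan_prc is the percentage (0-100) of missing values per column.
        "nan_prc": df.isna().mean().to_numpy() * 100,
    })
    print(f"dataframe n_rows {len(df)}")
    print(summary)
    return summary


# Toy frame with one missing value out of four in unit_sales.
example = pd.DataFrame({
    "store_nbr": [1, 2, 3, 4],
    "unit_sales": [2.0, None, 5.0, 1.0],
})
report = check_nans_sketch(example)
```

Returning the summary in addition to printing it makes it possible to assert on the result, for example to fail a pipeline step when any `nan_prc` is non-zero.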

{% endraw %} {% raw %}
{% endraw %} {% raw %}

class Favorita[source]

Favorita()

{% endraw %} {% raw %}
{% endraw %} {% raw %}
 
100%|██████████| 480M/480M [00:15<00:00, 31.5MiB/s] 
INFO:nixtla.data.datasets.utils:Successfully downloaded favorita-grocery-sales-forecasting.zip?dl=1, 480014675, bytes.
INFO:nixtla.data.datasets.utils:Decompressing zip file...
INFO:nixtla.data.datasets.utils:Successfully decompressed data/favorita/favorita-grocery-sales-forecasting.zip?dl=1
INFO:__main__:Successfully decompressed ./data/favorita/holidays_events.csv.7z
INFO:__main__:Successfully decompressed ./data/favorita/items.csv.7z
INFO:__main__:Successfully decompressed ./data/favorita/oil.csv.7z
INFO:__main__:Successfully decompressed ./data/favorita/sample_submission.csv.7z
INFO:__main__:Successfully decompressed ./data/favorita/stores.csv.7z
INFO:__main__:Successfully decompressed ./data/favorita/test.csv.7z
INFO:__main__:Successfully decompressed ./data/favorita/train.csv.7z
INFO:__main__:Successfully decompressed ./data/favorita/transactions.csv.7z
saved train.csv to train.feather for fast access
S wrangle time 8.49 segs 



dataframe n_rows 174685
            col  dtype  nan_prc
0      item_nbr  int32      0.0
1       traj_id  int64      0.0
2      family_0  uint8      0.0
3      family_1  uint8      0.0
4      family_2  uint8      0.0
..          ...    ...      ...
481  cluster_13  uint8      0.0
482  cluster_14  uint8      0.0
483  cluster_15  uint8      0.0
484  cluster_16  uint8      0.0
485   store_nbr   int8      0.0

[486 rows x 3 columns]




dataframe n_rows 125497040
           col           dtype  nan_prc
0         date  datetime64[ns]      0.0
1    store_nbr            int8      0.0
2     item_nbr           int32      0.0
3   unit_sales         float32      0.0
4  onpromotion         boolean      0.0
5         open           int64      0.0


Filter and Removing returns data 4.39 segs


dataframe n_rows 41518905
           col           dtype  nan_prc
0         date  datetime64[ns]      0.0
1    store_nbr            int8      0.0
2     item_nbr           int32      0.0
3   unit_sales         float32      0.0
4  onpromotion         boolean      0.0
5         open           int64      0.0
6      traj_id           int64      0.0


Resampling


dataframe n_rows 58106641
           col           dtype  nan_prc
0    store_nbr            int8      0.0
1     item_nbr           int32      0.0
2      traj_id           int64      0.0
3  onpromotion         boolean      0.0
4         date  datetime64[ns]      0.0
5   unit_sales         float32      0.0
6         open           int64      0.0
7    log_sales         float32      0.0


Resampling to regular grid 803.1067633628845 segs
Transactions variable 12.72 segs
Oil variables 12.66 segs
Calendar variables 12.76 segs
Holiday variables 106.95 segs


dataframe n_rows 58106641
             col           dtype  nan_prc
0      store_nbr            int8      0.0
1       item_nbr           int32      0.0
2        traj_id           int64      0.0
3    onpromotion         boolean      0.0
4           date  datetime64[ns]      0.0
5     unit_sales         float32      0.0
6           open           int64      0.0
7      log_sales         float32      0.0
8   transactions         float64      0.0
9            oil         float64      0.0
10   day_of_week           int64      0.0
11  day_of_month           int64      0.0
12         month           int64      0.0
13  national_hol          object      0.0
14  regional_hol          object      0.0
15     local_hol          object      0.0


Saving processed file to ./data/favorita/temporal.feather
{% endraw %} {% raw %}
 
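| | item_nbr | traj_id | family_0 | family_1 | family_2 | family_3 | family_4 | family_5 | family_6 | family_7 | ... | cluster_7 | cluster_8 | cluster_9 | cluster_10 | cluster_11 | cluster_12 | cluster_13 | cluster_14 | cluster_15 | cluster_16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 103665 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 105574 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 105575 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 108079 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 108701 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |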

5 rows × 485 columns

{% endraw %} {% raw %}
 
| | unique_id | ds | log_sales |
|---|---|---|---|
| 0 | 1 | 2015-01-02 | 2.302585 |
| 1 | 1 | 2015-01-03 | 1.386294 |
| 2 | 1 | 2015-01-04 | 1.098612 |
| 3 | 1 | 2015-01-05 | 0.693147 |
| 4 | 1 | 2015-01-06 | 1.098612 |
{% endraw %} {% raw %}
 
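| | unique_id | ds | onpromotion | open | transactions | oil | day_of_week | day_of_month | month |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2015-01-02 | False | 1 | 3114.0 | 52.72 | 4 | 2 | 1 |
| 1 | 1 | 2015-01-03 | False | 1 | 2495.0 | -1.00 | 5 | 3 | 1 |
| 2 | 1 | 2015-01-04 | False | 1 | 999.0 | -1.00 | 6 | 4 | 1 |
| 3 | 1 | 2015-01-05 | False | 1 | 981.0 | 50.05 | 0 | 5 | 1 |
| 4 | 1 | 2015-01-06 | False | 1 | 827.0 | 47.98 | 1 | 6 | 1 |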
{% endraw %}

Kaggle-Competition-Favorita References

The evaluation metric of the Favorita Kaggle competition was the normalized weighted root mean squared logarithmic error (NWRMSLE). Perishable items have a score weight of 1.25; otherwise, the weight is 1.0.

{% raw %} $$ \mathrm{NWRMSLE} = \sqrt{\frac{\sum_{i=1}^{n} w_{i}\left(\log(\hat{y}_{i}+1) - \log(y_{i}+1)\right)^{2}}{\sum_{i=1}^{n} w_{i}}} $$ {% endraw %}
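
The metric can be computed directly from its definition. The following is a minimal NumPy sketch, not code from the competition or from this library. The function name `nwrmsle` and the boolean `perishable` array are assumptions made for illustration, and the inputs are assumed to be greater than -1 so that the logarithm is defined.

```python
import numpy as np


def nwrmsle(y_true, y_pred, weights):
    """Normalized weighted root mean squared logarithmic error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # log1p(x) = log(x + 1), matching the log(y + 1) terms in the formula.
    sq_log_err = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.sum(weights * sq_log_err) / np.sum(weights)))


# Hypothetical item flags: perishable items weigh 1.25, all others 1.0.
perishable = np.array([True, False, False])
weights = np.where(perishable, 1.25, 1.0)

# A perfect forecast gives a score of zero.
score = nwrmsle([3.0, 0.0, 7.0], [3.0, 0.0, 7.0], weights)
```

Because the weights appear in both the numerator and the denominator, only their ratio matters: scaling all weights by a constant leaves the score unchanged.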

| Kaggle Competition Forecasting Methods | 16D ahead NWRMSLE |
|---|---|
| LGBM [1] | 0.5091 |
| Seq2Seq WaveNet [2] | 0.5129 |

Favorita Extra References

Because the test set is unavailable, some papers evaluate on a different data partition. The following table reports results on the Favorita dataset using 90 days for training and a 30-day forecast horizon, with the logarithm of sales as the target. The evaluation metric in those papers is the quantile risk for P50 (normalized quantile loss):

{% raw %} $$ \text{QL-risk} = \frac{2 \sum_{y_{t} \in \Omega} \sum_{\tau=1}^{\tau_{\max}} \left[ q \left(y_{t+\tau}-\hat{y}^{(q)}_{t+\tau}\right)_{+} + (1-q) \left(\hat{y}^{(q)}_{t+\tau}-y_{t+\tau}\right)_{+} \right]}{\sum_{y_{t} \in \Omega} \sum_{\tau=1}^{\tau_{\max}} \left| y_{t+\tau} \right|} $$ {% endraw %}
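
A small NumPy sketch of the quantile risk may make the formula more concrete. It is an illustration, not the evaluation code used in the cited papers. The function name `quantile_risk` is hypothetical, the inputs are assumed to be arrays of shape (number of forecast origins, horizon), and the denominator is assumed to be the sum of absolute actual values over the same forecast points.

```python
import numpy as np


def quantile_risk(y_true, y_pred, q=0.5):
    """Normalized quantile loss over all forecast origins and horizons."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # (x)_+ = max(x, 0): penalize under- and over-prediction separately.
    under = np.maximum(y_true - y_pred, 0.0)
    over = np.maximum(y_pred - y_true, 0.0)
    loss = q * under + (1.0 - q) * over
    # Assumption: normalize by the sum of absolute actuals at the evaluated points.
    return float(2.0 * loss.sum() / np.abs(y_true).sum())


# Two forecast origins with a horizon of two steps each.
actuals = np.array([[2.0, 4.0], [1.0, 3.0]])
forecasts = np.array([[1.0, 4.0], [2.0, 5.0]])
risk = quantile_risk(actuals, forecasts, q=0.5)
```

For `q = 0.5` the two penalties are symmetric, so the risk reduces to the sum of absolute errors divided by the sum of absolute actuals.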

| Forecasting Methods | 30D ahead P50 QL-risk |
|---|---|
| MQTransformer [4] | 0.323 |
| TFT [3] | 0.354 |
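
The 90-day training and 30-day forecast partition described above can be sketched with pandas. This is only one possible reading of that setup and is not taken from the cited papers: the function name `split_last_windows` is hypothetical, a single global date cutoff is an assumption, and the column names `unique_id`, `ds` and `log_sales` follow the long-format frame shown earlier on this page.

```python
import pandas as pd


def split_last_windows(df, train_days=90, horizon=30):
    """Hypothetical split: the last `horizon` days form the test window,
    the `train_days` days immediately before them form the training window."""
    last_date = df["ds"].max()
    test_start = last_date - pd.Timedelta(days=horizon - 1)
    train_start = test_start - pd.Timedelta(days=train_days)
    train = df[(df["ds"] >= train_start) & (df["ds"] < test_start)]
    test = df[df["ds"] >= test_start]
    return train, test


# Toy single-series panel with 150 consecutive daily observations.
dates = pd.date_range("2015-01-01", periods=150, freq="D")
panel = pd.DataFrame({"unique_id": 1, "ds": dates, "log_sales": 0.0})
train, test = split_last_windows(panel)
```

Splitting on dates rather than on row positions keeps the windows aligned across series, which relies on the panel having been resampled to a regular daily grid.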