rainforest.ml package

This submodule deals with the training and evaluation of machine learning QPE methods. It also allows trained models to be read from pickle files stored in the rf_models subfolder.

rainforest.ml.rf : main module used to train RF regressors

rainforest.ml.rfdefinitions : reference module that contains definitions of the RF regressors and allows to load them from files

rainforest.ml.rf_train : command-line utility to train RF models and prepare input features

rainforest.ml.utils : small utilities used in this module only (for example for vertical aggregation)

rainforest.ml.rf module

Main module to train RF regressors

class rainforest.ml.rf.RFTraining(db_location, input_location=None, force_regenerate_input=False)

Bases: object

This is the main class that allows you to prepare data for random forest training, train random forests and perform cross-validation of trained models.

Initializes the class and, if needed, prepares the input data for the training.

Note that when calling this constructor the input data is only generated for the central pixel (NX = NY = 0, the location of the gauge). To regenerate the inputs for all neighbour pixels, call self.prepare_input(only_center=False).

Parameters
  • db_location (str) – Location of the main directory of the database (with subfolders ‘reference’, ‘gauge’ and ‘radar’ on the filesystem)

  • input_location (str) – Location of the prepared input data. If this data cannot be found in this folder, it will be computed and stored there. Default is a subfolder called rf_input_data within db_location

  • force_regenerate_input (bool) – if True the input parquet files will always be regenerated from the database even if already present in the input_location folder

fit_models(config_file, features_dic, tstart=None, tend=None, output_folder=None)

Fits new RF models that can be used to compute QPE realizations and saves them to disk in pickle format

Parameters
  • config_file (str) – Location of the RF training configuration file; if not provided, the default one in the ml submodule will be used

  • features_dic (dict) – A dictionary whose keys are the names of the models you want to create (strings) and whose values are lists of the features you want to use. For example {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean', 'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT', 'VISIB_mean']} will train a model with all these features, which will then be stored under the name RF_dualpol_BC_<type of BC>.p in the ml/rf_models dir

  • tstart (datetime) – the starting time of the training time interval, default is to start at the beginning of the time interval covered by the database

  • tend (datetime) – the end time of the training time interval, default is to end at the end of the time interval covered by the database

  • output_folder (str) – Location where to store the trained models in pickle format, if not provided it will store them in the standard location <library_path>/ml/rf_models
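As an illustration of the features_dic layout and the naming pattern described above, here is a minimal sketch. The helper expected_model_filename is hypothetical (it is not part of rainforest) and only reproduces the documented pattern <model name>_BC_<type of BC>.p; which bias-correction types are actually trained depends on the configuration file.

```python
# Feature lists keyed by model name, as in the example above.
features_dic = {
    "RF_dualpol": ["RADAR", "zh_VISIB_mean", "zv_VISIB_mean", "KDP_mean",
                   "RHOHV_mean", "T", "HEIGHT", "VISIB_mean"],
}


def expected_model_filename(model_name, bctype):
    """Hypothetical helper: file name following the documented pattern."""
    return f"{model_name}_BC_{bctype}.p"


# One file name per model key, here assuming the "cdf" bias correction.
filenames = [expected_model_filename(name, "cdf") for name in features_dic]
print(filenames)
```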

model_intercomparison(features_dic, intercomparison_configfile, output_folder, reference_products=['CPC', 'RZC'], bounds10=[0, 2, 10, 100], bounds60=[0, 1, 10, 100], K=5)

Performs an intercomparison (cross-validation) of different RF models and reference products (RZC, CPC, …) and produces the performance plots

Parameters
  • features_dic (dict) – A dictionary whose keys are the names of the models you want to compare (strings) and whose values are lists of the features you want to use. For example {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean', 'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT', 'VISIB_mean'], 'RF_hpol': ['RADAR', 'zh_VISIB_mean', 'T', 'HEIGHT', 'VISIB_mean']} will compare an RF model with polarimetric information to a model with only horizontal polarization

  • output_folder (str) – Location where to store the output plots

  • intercomparison_configfile (str) – Location of the intercomparison configuration file, a yaml file that gives, for every model key of features_dic, which training parameters you want to use (see the file intercomparison_config_example.yml in this module for an example)

  • reference_products (list of str) – Names of the reference products to which the RF will be compared; they need to be in the reference table of the database

  • bounds10 (list of float) – list of precipitation bounds for which to compute scores separately at 10 min time resolution, e.g. [0,2,10,100] will give scores in the ranges [0-2], [2-10] and [10-100]

  • bounds60 (list of float) – list of precipitation bounds for which to compute scores separately at hourly time resolution, e.g. [0,1,10,100] will give scores in the ranges [0-1], [1-10] and [10-100]

  • K (int) – Number of splits (iterations) to perform in the K-fold cross-validation
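The way a list of bounds maps to score ranges can be sketched as follows. Both helpers are hypothetical illustrations, and the edge handling (lower bound inclusive, upper bound exclusive) is an assumption, since the documentation does not state it.

```python
def bounds_to_ranges(bounds):
    """Turn [0, 2, 10, 100] into [(0, 2), (2, 10), (10, 100)]."""
    return list(zip(bounds[:-1], bounds[1:]))


def range_index(value, bounds):
    """Index of the range containing value, or None if it lies outside.

    Assumption: lower edge inclusive, upper edge exclusive.
    """
    for i, (low, high) in enumerate(bounds_to_ranges(bounds)):
        if low <= value < high:
            return i
    return None


bounds10 = [0, 2, 10, 100]
print(bounds_to_ranges(bounds10))
print(range_index(5.0, bounds10))
```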

prepare_input(only_center=True)

Reads the data from the database in db_location, processes it to create easy-to-use parquet input files for the ML training and stores them in input_location. The processing steps are listed below.

For every neighbour of the station (i.e. for NX and NY from -1 to +1):

  • Replace missing flags by nans

  • Filter out timesteps which are not present in the three tables (gauge, reference and radar)

  • Filter out incomplete hours (i.e. hours for which fewer than six 10 min timesteps are available)

  • Add height above ground and height of iso0 to radar data

  • Save a separate parquet file for radar, gauge and reference data

  • Save a grouping_idx pickle file containing grp_vertical index (groups all radar rows with same timestep and station), grp_hourly (groups all timesteps with same hours) and tstamp_unique (list of all unique timestamps)

Parameters

only_center (bool) – If set to True, only the input data for the central neighbour, i.e. NX = NY = 0 (the location of the gauge), will be recomputed. This takes much less time and is the default option, since so far the neighbour values are not used in the training of the RF QPE
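The "incomplete hours" filter among the processing steps above could look roughly like the sketch below. It is not the library's implementation: it assumes UNIX timestamps in seconds, unique timestamps, and that an hour is identified by integer division of the timestamp by 3600. The function name keep_complete_hours is hypothetical.

```python
import numpy as np


def keep_complete_hours(timestamps, steps_per_hour=6):
    """Boolean mask that is True for timesteps belonging to a complete hour.

    Assumes unique UNIX timestamps (seconds) at 10 min resolution.
    """
    ts = np.asarray(timestamps)
    hours = ts // 3600  # hour label of every timestep
    _, inverse, counts = np.unique(hours, return_inverse=True,
                                   return_counts=True)
    # counts[inverse] is the number of timesteps found in each row's hour
    return counts[inverse] >= steps_per_hour


# Hour 0 has six 10 min steps, hour 1 only three.
ts = [0, 600, 1200, 1800, 2400, 3000, 3600, 4200, 4800]
mask = keep_complete_hours(ts)
print(mask)
```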

rainforest.ml.rf_train module

Command line script to prepare input features and train RF models

see rf_train

rainforest.ml.rf_train.main()

rainforest.ml.rfdefinitions module

Class declarations and reading functions required to unpickle trained RandomForest models

Daniel Wolfensberger MeteoSwiss/EPFL daniel.wolfensberger@epfl.ch December 2019

class rainforest.ml.rfdefinitions.MyCustomUnpickler

Bases: _pickle.Unpickler

This is an extension of the pickle Unpickler that handles the bookkeeping of references to the RandomForestRegressorBC class

find_class(module, name)

Return an object from a specified module.

If necessary, the module will be imported. Subclasses may override this method (e.g. to restrict unpickling of arbitrary classes and functions).

This method is called whenever a class or a function object is needed. Both arguments passed are str objects.
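To show the general idea of overriding find_class, here is a minimal, self-contained sketch using only the standard library. It is not the code of MyCustomUnpickler: the class Model, the module name old_defs and the class RedirectingUnpickler are all hypothetical and only simulate a pickle that references a class under a module path that no longer exists.

```python
import io
import pickle
import sys
import types


class Model:
    """Stand-in for a custom regressor class (hypothetical)."""

    def __init__(self, beta):
        self.beta = beta


# Pretend the class lived in a module called "old_defs" when it was pickled.
Model.__module__ = "old_defs"
Model.__qualname__ = "Model"
old_module = types.ModuleType("old_defs")
old_module.Model = Model
sys.modules["old_defs"] = old_module
payload = pickle.dumps(Model(beta=0.5))
del sys.modules["old_defs"]  # the old module path no longer exists


class RedirectingUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Redirect the stale reference to the class that is available now.
        if (module, name) == ("old_defs", "Model"):
            return Model
        return super().find_class(module, name)


restored = RedirectingUnpickler(io.BytesIO(payload)).load()
print(type(restored).__name__, restored.beta)
```

A plain pickle.loads(payload) would fail here because the module old_defs cannot be imported; the override resolves the reference before the default lookup is attempted.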

class rainforest.ml.rfdefinitions.RandomForestRegressorBC(variables, beta, degree=1, bctype='cdf', n_estimators=100, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False)

Bases: sklearn.ensemble.forest.RandomForestRegressor

This is an extension of the RandomForestRegressor regressor class of sklearn that does additional bias correction, is able to apply a rounding function to the outputs on the fly and adds a bit of metadata:

  • bctype : type of bias correction method

  • variables : name of input features

  • beta : weighting factor in vertical aggregation

  • degree : order of the polyfit used in some bias-correction methods

For bctype, the currently available methods are “raw”: simple linear fit between prediction and observation, “cdf”: linear fit between sorted predictions and sorted observations, and “spline”: spline fit between sorted predictions and sorted observations. Any new method must be added to this class in order to be used.

For any information regarding the sklearn parent class see

https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/ensemble/_forest.py#L1150
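The difference between the “raw” and “cdf” bias-correction types can be sketched with NumPy as below. This is an illustration under assumptions, not the class's actual code: the function names are hypothetical, the fit direction (predictions as x, observations as y) is assumed, and the “spline” type is left out.

```python
import numpy as np


def fit_bias_correction(pred, obs, bctype="cdf", degree=1):
    """Polynomial coefficients mapping predictions to observations."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    if bctype == "raw":
        x, y = pred, obs  # pairwise fit
    elif bctype == "cdf":
        x, y = np.sort(pred), np.sort(obs)  # fit between sorted values
    else:
        raise ValueError(f"bctype {bctype!r} is not covered by this sketch")
    return np.polyfit(x, y, degree)


def apply_bias_correction(pred, coeffs):
    """Evaluate the fitted polynomial on new predictions."""
    return np.polyval(coeffs, np.asarray(pred, dtype=float))


pred = np.array([3.0, 1.0, 2.0])
obs = np.array([10.0, 20.0, 30.0])
coeffs_cdf = fit_bias_correction(pred, obs, bctype="cdf")
corrected = apply_bias_correction(pred, coeffs_cdf)
print(coeffs_cdf, corrected)
```

With “cdf” the pairing between individual predictions and observations is discarded and only the two distributions are matched, which is why the sorted arrays are used.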

fit(X, y, sample_weight=None)

Fit both estimator and a-posteriori bias correction

Parameters
  • X – The input samples. Use dtype=np.float32 for maximum efficiency. Sparse matrices are also supported, use sparse csc_matrix for maximum efficiency.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.

Returns

self

Return type

object

predict(X, round_func=None, bc=True)

Predict regression target for X. The predicted regression target of an input sample is computed as the mean predicted regression targets of the trees in the forest.

Parameters
  • X – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

  • round_func (lambda function) – Optional function to apply to outputs (for example to discretize them using MCH lookup tables). If not provided f(x) = x will be applied (i.e. no function)

  • bc (bool) – if True the bias correction function will be applied

Returns

y – The predicted values.

Return type

array-like of shape (n_samples,) or (n_samples, n_outputs)
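A possible shape for a round_func is sketched below. The lookup table is made up for the example and is not the MCH lookup table; the function name round_to_lookup is likewise hypothetical. It maps every value to the nearest entry of the table.

```python
import numpy as np

# Made-up table of allowed output values (mm/h); NOT the MCH lookup table.
lookup = np.array([0.0, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0])


def round_to_lookup(values):
    """Replace every value by the closest entry of the lookup table."""
    values = np.asarray(values, dtype=float)
    idx = np.abs(values[:, None] - lookup[None, :]).argmin(axis=1)
    return lookup[idx]


rounded = round_to_lookup([0.04, 0.7, 3.0, 8.0])
print(rounded)
```

A function of this kind could then be passed as round_func=round_to_lookup when calling predict.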

rainforest.ml.rfdefinitions.read_rf(rf_name)

Reads a randomForest model from the RF models folder using pickle. All custom classes and functions used in the construction of these pickled models must be defined in the script ml/rfdefinitions.py

Parameters

rf_name (str) – Name of the randomForest model. It must be stored in the folder /ml/rf_models and must have been created with the RFTraining.fit_models method of the rf module

Returns

A trained sklearn randomForest instance whose predict() method can be used to predict precipitation intensities for new points

rainforest.ml.utils module

Utility functions for the ML submodule

rainforest.ml.utils.nesteddictvalues(d)
rainforest.ml.utils.split_event(timestamps, n=5, threshold_hr=12)

Splits the dataset into n subsets by separating the observations into separate precipitation events and attributing these events randomly to the subsets

Parameters
  • timestamps (int array) – array containing the UNIX timestamps of the precipitation observations

  • n (int) – number of subsets to create

  • threshold_hr (int) – threshold in hours to distinguish precip events. Two timestamps are considered to belong to different events if there are at least threshold_hr hours of no observations (no rain) between them.

Returns

split_idx – array containing the subset grouping, with values from 0 to n - 1

Return type

int array
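The event-based splitting described above can be sketched as follows. This is an illustration under assumptions, not the library's code: the function name split_event_sketch and the seed parameter are hypothetical, timestamps are assumed to be UNIX seconds, and the input is assumed to be non-empty.

```python
import numpy as np


def split_event_sketch(timestamps, n=5, threshold_hr=12, seed=0):
    """Assign every timestamp to one of n subsets, event by event."""
    ts = np.asarray(timestamps)
    order = np.argsort(ts)
    gaps = np.diff(ts[order])
    # A new event starts where the gap to the previous observation
    # is at least threshold_hr hours.
    new_event = np.concatenate(([False], gaps >= threshold_hr * 3600))
    event_id_sorted = np.cumsum(new_event)
    # Draw one random subset label per event.
    rng = np.random.default_rng(seed)
    subset_of_event = rng.integers(0, n, size=event_id_sorted[-1] + 1)
    # Map the labels back to the original (unsorted) order.
    split_idx = np.empty(len(ts), dtype=int)
    split_idx[order] = subset_of_event[event_id_sorted]
    return split_idx


ts = [0, 600, 1200, 100000, 100600]  # two events separated by > 12 h
split_idx = split_event_sketch(ts, n=5, threshold_hr=12)
print(split_idx)
```

Because whole events are assigned to a subset, observations from the same precipitation event never end up in different cross-validation folds.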

rainforest.ml.utils.vert_aggregation(radar_data, vert_weights, grp_vertical, visib_weight=True, visib=None)

Performs vertical aggregation of radar observations aloft to the ground using a weighted average. Categorical variables such as ‘RADAR’, ‘HYDRO’ and ‘TCOUNT’ will be converted to dummy variables and these dummy variables will be aggregated, resulting in columns such as RADAR_propA, which gives the weighted proportion of radar observations aloft that were obtained with the Albis radar

Parameters
  • radar_data (Pandas DataFrame) – A Pandas DataFrame containing all required input features aloft as explained in the rf.py module

  • vert_weights (np.array of float) – vertical weights to use for every observation in radar, must have the same len as radar_data

  • grp_vertical (np.array of int) – grouping index for the vertical aggregation. It must have the same len as radar_data. All observations corresponding to the same timestep must have the same label

  • visib_weight (bool) – if True the input features will be weighted by the visibility when doing the vertical aggregation to the ground

  • visib (np array) – visibility of every observation, required only if visib_weight = True
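The weighted average per group can be sketched with NumPy as below. This is not the implementation of vert_aggregation: the function name weighted_group_mean is hypothetical, it handles a single column, it assumes group labels are integers starting at 0 with a non-zero total weight in every group, and it assumes that visibility enters by multiplying the vertical weights.

```python
import numpy as np


def weighted_group_mean(values, vert_weights, grp_vertical, visib=None):
    """Weighted mean of values within each group label."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(vert_weights, dtype=float)
    if visib is not None:
        # Assumption: visibility simply scales the vertical weights.
        weights = weights * np.asarray(visib, dtype=float)
    groups = np.asarray(grp_vertical)
    weighted_sum = np.bincount(groups, weights=weights * values)
    weight_sum = np.bincount(groups, weights=weights)
    return weighted_sum / weight_sum


values = [1.0, 3.0, 10.0, 20.0]
vert_weights = [1.0, 1.0, 1.0, 3.0]
grp_vertical = [0, 0, 1, 1]
plain = weighted_group_mean(values, vert_weights, grp_vertical)
with_visib = weighted_group_mean(values, vert_weights, grp_vertical,
                                 visib=[1.0, 0.0, 1.0, 1.0])
print(plain, with_visib)
```

Applying the same function to a 0/1 dummy column of a categorical variable would give a weighted proportion of the kind stored in columns such as RADAR_propA.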