preprocessors module#

akerbp.mlpet.preprocessors.encode_columns(df, ...)

Encodes categorical columns. Only available for the FORMATION, GROUP and lsuName columns.

akerbp.mlpet.preprocessors.feature_engineering

akerbp.mlpet.preprocessors.fill_zloc_from_depth(df, ...)

Fill missing values in Z_LOC column with values from the DEPTH_MD column

akerbp.mlpet.preprocessors.fillna_with_fillers(df, ...)

Fills all NaNs in numeric columns with a num_filler and all NaNs in categorical columns with a cat_filler.

akerbp.mlpet.preprocessors.normalize_curves(df, ...)

Normalizes dataframe columns.

akerbp.mlpet.preprocessors.process_wells(df, ...)

Performs preprocessing per well

akerbp.mlpet.preprocessors.remove_noise(df, ...)

Removes noise by applying a median rolling window on each curve.

akerbp.mlpet.preprocessors.remove_outliers(df, ...)

Returns the dataframe after applying curve-specific cutoff values, warning if the outlier threshold (th) is exceeded.

akerbp.mlpet.preprocessors.remove_small_negative_values(df, ...)

Replaces small negative values with np.NaN in all the numeric columns.

akerbp.mlpet.preprocessors.scale_curves(df, ...)

Scales specified columns

akerbp.mlpet.preprocessors.select_columns(df, ...)

Returns a dataframe with only the curves chosen by the user, filtered from the original dataframe

akerbp.mlpet.preprocessors.drop_columns(df, ...)

Returns a dataframe with the requested curves dropped

akerbp.mlpet.preprocessors.set_as_nan(df, ...)

Replaces the provided numerical and categorical values with np.nan in the respective numerical and categorical columns.

This module contains all the preprocessors available to the Dataset class of the mlpet repo (besides the preprocessing functions found in feature_engineering and imputers). All preprocessing functions in mlpet MUST follow a strict API in order to be used in conjunction with the preprocess method of the Dataset class.

The preprocessing API looks like this:

def some_preprocessing_function(df: pd.DataFrame, **kwargs) -> pd.DataFrame:
    ...  # do something with df
    return df

This API allows for defining a preprocessing pipeline at runtime and passing it to the preprocess method instead of defining it prior to the initialisation of the Dataset class.
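For instance, a custom preprocessor conforming to this API might look like the following sketch. The clip_gr function and its gr_upper kwarg are hypothetical, purely illustrative names, not part of mlpet:

```python
import pandas as pd

def clip_gr(df: pd.DataFrame, **kwargs) -> pd.DataFrame:
    """Hypothetical preprocessor following the mlpet API: clips the GR
    curve to an upper bound passed via kwargs."""
    upper = kwargs.get("gr_upper", 250)
    if "GR" in df.columns:
        df["GR"] = df["GR"].clip(upper=upper)
    return df

df = pd.DataFrame({"GR": [10.0, 300.0, 120.0]})
out = clip_gr(df, gr_upper=250)
```

Because it takes a dataframe plus kwargs and returns a dataframe, such a function can be slotted into a runtime-defined pipeline alongside the built-in preprocessors.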

akerbp.mlpet.preprocessors.set_as_nan(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame[source]#

Replaces the provided numerical and categorical values with np.nan in the respective numerical and categorical columns. If numerical or categorical column names are not provided they will be inferred using the get_col_types utility function

Parameters

df (pd.DataFrame) – dataframe to apply metadata to

Keyword Arguments
  • numerical_curves (List[str], optional) – The numerical columns in which the numerical value should be replaced with np.nan.

  • categorical_curves (List[str], optional) – The categorical columns in which the numerical value should be replaced with np.nan.

  • numerical_value (float/int, optional) – The numerical value that should be replaced with np.nan.

  • categorical_value (str, optional) – The categorical value that should be replaced with np.nan.

Returns

The original dataframe filled with np.nan where requested

Return type

pd.DataFrame
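The core of this replacement can be sketched in plain pandas. The function name, column names and sentinel values below are illustrative only:

```python
import numpy as np
import pandas as pd

def set_as_nan_sketch(df, numerical_curves, numerical_value,
                      categorical_curves, categorical_value):
    # Replace the sentinel values with np.nan in the respective columns
    df[numerical_curves] = df[numerical_curves].replace(numerical_value, np.nan)
    df[categorical_curves] = df[categorical_curves].replace(categorical_value, np.nan)
    return df

df = pd.DataFrame({"GR": [-999.25, 50.0], "GROUP": ["MISSING", "VIKING GP."]})
out = set_as_nan_sketch(df, ["GR"], -999.25, ["GROUP"], "MISSING")
```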

akerbp.mlpet.preprocessors.remove_outliers(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame[source]#

Returns the dataframe after applying the curve-specific cutoff values, warning if the threshold (th) on the fraction of outlier samples is exceeded. The following curves and corresponding cutoff values are used (if they exist in the provided curves list):

  • GR: low cutoff: 0, high cutoff: 250

  • RMED: high cutoff: 100

  • RDEP: high cutoff: 100

  • RSHA: high cutoff: 100

  • NEU: low cutoff: -0.5, high cutoff: 1 (replaced with np.nan)

  • PEF: high cutoff: 10 (replaced with np.nan)

If not otherwise specified above, values above/below the cutoffs are replaced with the corresponding cutoff value.

Parameters

df (pd.DataFrame) – dataframe to remove outliers

Keyword Arguments
  • outlier_curves (list) – The curves to remove outliers for using the above rules

  • threshold (float, optional) – threshold on the fraction of samples that are outliers. Used for displaying a warning when too many samples are removed. Defaults to 0.05.

Returns

dataframe without outliers

Return type

pd.DataFrame
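The two behaviours in the cutoff table above (clipping to the cutoff vs. replacing with np.nan) can be sketched with plain pandas, using the GR and NEU rules as examples:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"GR": [-10.0, 120.0, 400.0], "NEU": [0.3, 1.5, -0.7]})

# GR: values outside [0, 250] are clipped to the cutoff values
df["GR"] = df["GR"].clip(lower=0, upper=250)
# NEU: values outside [-0.5, 1] are replaced with np.nan instead
df["NEU"] = df["NEU"].where(df["NEU"].between(-0.5, 1), np.nan)
```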

akerbp.mlpet.preprocessors.remove_small_negative_values(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame[source]#

Replaces small negative values with np.NaN in all the numeric columns. The small negative values are determined by defining a nan_threshold. If the negative value is smaller than the threshold it is set to nan. Naturally, this operation is only done on numeric columns.

Parameters

df (pd.DataFrame) – dataframe to be preprocessed

Keyword Arguments
  • numerical_curves (List[str]) – The column names for which small negative values should be replaced with NaNs. If not provided, this list is generated using the get_col_types utility function

  • nan_threshold (float, optional) – The threshold determining the smallest acceptable negative value. Defaults to None

Returns

preprocessed dataframe

Return type

pd.DataFrame
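One plausible reading of the nan_threshold rule (negative values of magnitude below the threshold are treated as noise) can be sketched as follows; the function name is hypothetical and the exact threshold comparison used by the library may differ:

```python
import numpy as np
import pandas as pd

def drop_small_negatives(s: pd.Series, nan_threshold: float) -> pd.Series:
    # Assumption: negatives closer to zero than the threshold are set to NaN,
    # while larger-magnitude negatives are left untouched
    mask = (s < 0) & (s > -nan_threshold)
    return s.where(~mask, np.nan)

s = pd.Series([-0.001, -5.0, 2.0])
out = drop_small_negatives(s, nan_threshold=0.01)
```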

akerbp.mlpet.preprocessors.fill_zloc_from_depth(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame[source]#

Fill missing values in Z_LOC column with values from the DEPTH_MD column

Parameters

df (pd.DataFrame) – The dataframe containing both Z_LOC and DEPTH_MD columns

Returns

The original dataframe with the Z_LOC column filled where possible.

Return type

pd.DataFrame
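The simplest reading of this docstring is a fillna from one column into another, sketched below. Note the library may also apply a sign or datum correction between Z_LOC and DEPTH_MD; that is not shown here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Z_LOC": [np.nan, -1210.0], "DEPTH_MD": [1200.0, 1250.0]})
# Take the DEPTH_MD value wherever Z_LOC is missing
df["Z_LOC"] = df["Z_LOC"].fillna(df["DEPTH_MD"])
```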

akerbp.mlpet.preprocessors.fillna_with_fillers(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame[source]#

Fills all NaNs in numeric columns with a num_filler and all NaNs in categorical columns with a cat_filler. All four of these variables are passed as kwargs.

If a num_filler and/or cat_filler is passed without corresponding column names, column types are inferred using the get_col_types utility function.

Parameters

df (pd.DataFrame) – The dataframe to be preprocessed

Keyword Arguments
  • num_filler (float) – The numeric value to fill NaNs with in the numeric columns

  • numerical_curves (List[str]) – The column names for all numeric columns where the NaNs will be filled with the num_filler

  • cat_filler (str) – The categorical value to fill NaNs with in the categorical columns

  • categorical_curves (List[str]) – The column names for all categorical columns where the NaNs will be filled with the cat_filler

Returns

Preprocessed dataframe

Return type

pd.DataFrame
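The filling logic can be sketched directly with pandas. The filler values below are illustrative, not mlpet defaults:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"GR": [50.0, np.nan], "GROUP": ["VIKING GP.", np.nan]})
num_filler, cat_filler = -999.25, "MISSING"  # illustrative filler values
# Fill NaNs in numeric columns with num_filler, categorical with cat_filler
df[["GR"]] = df[["GR"]].fillna(num_filler)
df[["GROUP"]] = df[["GROUP"]].fillna(cat_filler)
```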

akerbp.mlpet.preprocessors.encode_columns(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame[source]#
Encodes categorical columns. Only available for:
  • FORMATION column - categories are encoded using the formations_map provided in the kwargs.

  • GROUP column - categories are encoded using the groups_map provided in the kwargs.

  • lsuName column - categories are encoded using the groups_map provided in the kwargs.

Note: All names are standardized prior to mapping using the utility function standardize_group_formation_name, and all categories that weren’t mapped are encoded with -1.

Parameters

df (pd.DataFrame) – dataframe to which to apply encoding of categorical variables

Keyword Arguments
  • columns_to_encode (list) – which columns to encode. Defaults to no columns being encoded. If no columns are passed, the get_col_types utility function is used to determine the categorical columns

  • formations_map (dict) – A mapping dictionary mapping formation names to corresponding integers. Defaults to an empty dictionary (ie no encoding).

  • groups_map (dict) – A mapping dictionary mapping group names to corresponding integers. Defaults to an empty dictionary (ie no encoding).

  • missing_encoding_value (int) – The value used to encode categories for which no match was found in the provided mappings. Defaults to -1.

Returns

dataframe with categorical columns encoded

Return type

pd.DataFrame
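The mapping-plus-fallback behaviour described above reduces to a map and fillna in pandas. The group names and integer codes below are made up for illustration:

```python
import pandas as pd

groups_map = {"VIKING GP.": 0, "ROGALAND GP.": 1}  # illustrative mapping
s = pd.Series(["VIKING GP.", "UNKNOWN GP.", "ROGALAND GP."])
# Unmapped categories fall back to the missing_encoding_value (-1)
encoded = s.map(groups_map).fillna(-1).astype(int)
```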

akerbp.mlpet.preprocessors.select_columns(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame[source]#

Returns a dataframe with only the curves chosen by the user, filtered from the original dataframe

Parameters

df (pd.DataFrame) – dataframe to filter

Keyword Arguments
  • curves_to_select (list) – which curves should be kept. Defaults to None.

  • label_column (str) – The name of the label column to keep if also desired. Defaults to None

  • id_column (str) – The name of the id column to keep if also desired. Defaults to None

Returns

dataframe with relevant curves

Return type

pd.DataFrame

akerbp.mlpet.preprocessors.drop_columns(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame[source]#

Returns a dataframe with the requested curves dropped

Parameters

df (pd.DataFrame) – dataframe to filter

Keyword Arguments

curves_to_drop (list) – The curves to be dropped. Defaults to None.

Returns

dataframe with requested curves dropped

Return type

pd.DataFrame

akerbp.mlpet.preprocessors.normalize_curves(df: pandas.core.frame.DataFrame, **kwargs) Union[pandas.core.frame.DataFrame, Tuple[pandas.core.frame.DataFrame, Dict[str, Any]]][source]#

Normalizes dataframe columns.

We choose one well to be a “key well” and normalize all other wells to its low and high values. This process requires the kwarg ‘id_column’ to be passed so that wells can be grouped by their ID.

For each curve to be normalized, high and low quantiles are calculated per well (the high and low percentage keyword arguments dictate this).

If the user provides key wells, the key wells calculation is not performed.

Parameters

df (pd.DataFrame) – dataframe with columns to normalize

Keyword Arguments
  • curves_to_normalize (list) – List of curves (column names) to normalize. Defaults to None (i.e. no curves being normalized).

  • id_column (str) – The name of the well ID column. This keyword argument MUST be provided to use this method.

  • low_perc (float) – low quantile to use as min value. Defaults to 5%

  • high_perc (float) – high quantile to use as max value. Defaults to 95%

  • user_key_wells (dict) – dictionary with curves as keys and min/max values and key well as values

  • save_key_wells (bool) – whether to save the key wells dictionary in folder_path. Defaults to False

  • folder_path (str) – The folder to save the key wells dictionary in. Defaults to “” so an error will be raised if saving is set to True but no folder_path is provided.

Returns

pd.DataFrame with normalized values and dictionary with key wells that were used to normalize the curves_to_normalize

Return type

tuple(pd.DataFrame, dict)
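The per-well quantile normalization described above amounts to rescaling each well's [low, high] quantile range onto the key well's range. A sketch of that rescaling (the function name is hypothetical; the library's exact formula may differ):

```python
import pandas as pd

def normalize_to_key_well(s, well_low, well_high, key_low, key_high):
    # Map this well's [low, high] quantile range onto the key well's range
    return key_low + (s - well_low) * (key_high - key_low) / (well_high - well_low)

s = pd.Series([10.0, 55.0, 100.0])
out = normalize_to_key_well(s, well_low=10.0, well_high=100.0,
                            key_low=20.0, key_high=200.0)
```

In mlpet, well_low/well_high would come from the low_perc/high_perc quantiles of the well being normalized, and key_low/key_high from the key well's quantiles.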

akerbp.mlpet.preprocessors.scale_curves(df: pandas.core.frame.DataFrame, **kwargs) Union[pandas.core.frame.DataFrame, Tuple[pandas.core.frame.DataFrame, Dict[str, sklearn.base.BaseEstimator]]][source]#

Scales specified columns

Parameters

df (pd.DataFrame) – dataframe containing columns to scale

Keyword Arguments
  • curves_to_scale (list) – list of curves (column names) to scale

  • scaler_method (str) – string of any sklearn scalers. Defaults to RobustScaler

  • scaler_kwargs (dict) – dictionary of any kwargs to pass to the sklearn scaler

  • scaler (BaseEstimator) – a pre-fitted sklearn scaler object to apply directly to the curves_to_scale. If this kwarg is provided, none of the other kwargs BESIDES curves_to_scale are needed.

  • save_scaler (bool) – whether to save scaler in folder_path or not. Defaults to False.

  • folder_path (str) – Which folder to save the scalers in. Defaults to no path so a path needs to be provided if the save_scaler kwarg is set to True.

Returns

scaled columns and the scaler object that was used to scale the scaled columns stored in a dict.

Return type

tuple(pd.DataFrame, dict)

akerbp.mlpet.preprocessors.process_wells(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame[source]#

Performs preprocessing per well

This is a convenience function that performs several preprocessing steps per well if an id_column is provided in the kwargs. Otherwise, it treats the entire df as one well and preprocesses it according to the same pipeline as the per-well treatment.

The preprocessing pipeline performed is as follows:

  1. imputation (if the ‘imputer’ kwarg is set)

  2. feature engineering:

    • Rolling features created using the add_rolling_features function (if the ‘rolling_features’ kwarg is set)

    • Gradient features created using the add_gradient_features function (if the ‘gradient_features’ kwarg is set)

    • Sequential features created using the add_sequential_features function (if the ‘sequential_features’ kwarg is set)

The kwargs for each method discussed above must also be provided to this method. Please refer to the specific methods to determine which kwargs to provide

Parameters

df (pd.DataFrame) – dataframe of data to be preprocessed

Keyword Arguments
  • id_column (str) – The well ID column name to use to groupby well ID

  • imputation_type (str) –

    Which imputer to use. Can be one of the following two options:

    1. ’iterative’ - runs the iterative_impute method from the imputers module. Please refer to that method to read up on all necessary kwargs to use it properly

    2. ’simple’ - runs the simple_impute method from the imputers module. Please refer to that method to read up on all necessary kwargs to use it properly

Returns

dataframe of preprocessed data

Return type

pd.DataFrame
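The per-well dispatch described above is essentially a groupby-apply, falling back to treating the whole frame as one well when no id_column is present. A minimal sketch (function and column names are illustrative):

```python
import pandas as pd

def per_well(df: pd.DataFrame, id_column: str, func) -> pd.DataFrame:
    # Apply a preprocessing function to each well separately when an
    # id_column is available, as process_wells does
    if id_column in df.columns:
        return df.groupby(id_column, group_keys=False).apply(func)
    return func(df)  # otherwise treat the whole frame as one well

df = pd.DataFrame({"well_id": ["A", "A", "B"], "GR": [10.0, 30.0, 100.0]})
out = per_well(df, "well_id", lambda g: g.assign(GR_mean=g["GR"].mean()))
```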

akerbp.mlpet.preprocessors.remove_noise(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame[source]#

Removes noise by applying a median rolling window on each curve.

Warning

Both kwargs are required for this function. If they are not provided, no noise filtering is performed and the df is returned untouched

Parameters

df (pd.DataFrame) – dataframe to which to apply median filtering

Keyword Arguments
  • noisy_curves (list) – list of curves (columns) to apply noise removal to with a median filter. If none are provided, median filtering is applied to all numerical columns, identified using the get_col_types utility function

  • noise_removal_window (int) – the window size to use when applying median filtering

Returns

dataframe after removing noise

Return type

pd.DataFrame
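The rolling-median denoising described above maps directly onto pandas' rolling API. The window settings below (centering, min_periods) are assumptions for illustration, not necessarily what remove_noise uses internally:

```python
import pandas as pd

s = pd.Series([1.0, 1.0, 9.0, 1.0, 1.0])  # a single noisy spike
# A centred rolling median suppresses the spike without shifting the curve
denoised = s.rolling(window=3, center=True, min_periods=1).median()
```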

akerbp.mlpet.preprocessors.apply_calibration(df_measured: pandas.core.frame.DataFrame, df_predicted: pandas.core.frame.DataFrame, curves: List[str], location_curves: List[str], level: str, mode: str, id_column: str, distance_thres: float = 9999.0, calib_map: Optional[pandas.core.frame.DataFrame] = None, standardize_level_names: bool = True) pandas.core.frame.DataFrame[source]#

Applies calibration to predicted curves, removing biases with the help of measured curves, either in the same well or closest wells

Parameters
  • df_measured (pd.DataFrame) – original dataframe with measured values

  • df_predicted (pd.DataFrame) – dataframe with predicted values, same column names

  • curves (List[str]) – curves to which apply calibration

  • location_curves (List[str]) – which curves to use for distance to get closest wells

  • level (str) – which grouping level to apply calibration at (per group, per formation)

  • mode (str) – type of value aggregation (mean, median, mode)

  • id_column (str) – well id

  • distance_thres (float, optional) – threshold for a well to be considered close enough. Defaults to 9999.0.

  • calib_map (pd.DataFrame, optional) – calibration map for the level. Defaults to None.

  • standardize_level_names (bool, optional) – whether to standardize formation or group names. Defaults to True.

Returns

dataframe with calibrated values

Return type

pd.DataFrame