preprocessors module#
- encode_columns – Encodes categorical columns.
- fill_zloc_from_depth – Fills missing values in the Z_LOC column with values from the DEPTH_MD column.
- fillna_with_fillers – Fills all NaNs in numeric columns with a num_filler and all NaNs in categorical columns with a cat_filler.
- normalize_curves – Normalizes dataframe columns.
- process_wells – Performs preprocessing per well.
- remove_noise – Removes noise by applying a median rolling window on each curve.
- remove_outliers – Returns the dataframe after applying the curve specific cutoff values if the threshold (th) of the number of outliers is passed.
- remove_small_negative_values – Replaces small negative values with np.NaN in all the numeric columns.
- scale_curves – Scales specified columns.
- select_columns – Returns a dataframe with only curves chosen by the user, filtered from the original dataframe.
- drop_columns – Returns a dataframe with the requested curves dropped.
- set_as_nan – Replaces the provided numerical and categorical values with np.nan in the respective numerical and categorical columns.
This module contains all the preprocessors available to the Dataset class of the mlpet repo (besides the preprocessing functions found in feature_engineering and imputers). All preprocessing functions in mlpet MUST follow a strict API in order to be used in conjunction with the preprocess method of the Dataset class.
The preprocessing API looks like this:
def some_preprocessing_function(df: pd.DataFrame, **kwargs) -> pd.DataFrame:
    # ... perform the preprocessing on df ...
    return df
This API allows for defining a preprocessing pipeline at runtime and passing it to the preprocess method instead of defining it prior to the initialisation of the Dataset class.
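To make the contract concrete, here is a minimal sketch of a conforming preprocessing function. The min_depth kwarg and the use of the DEPTH_MD column are assumptions made purely for this example; they are not part of mlpet.

import pandas as pd

def remove_shallow_samples(df: pd.DataFrame, **kwargs) -> pd.DataFrame:
    # Hypothetical example: drop rows shallower than an assumed 'min_depth' kwarg
    min_depth = kwargs.get("min_depth", 0.0)
    return df[df["DEPTH_MD"] >= min_depth]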
- akerbp.mlpet.preprocessors.set_as_nan(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame [source]#
Replaces the provided numerical and categorical values with np.nan in the respective numerical and categorical columns. If numerical or categorical column names are not provided they will be inferred using the get_col_types utility function
- Parameters
df (pd.DataFrame) – dataframe in which to replace the given values with np.nan
- Keyword Arguments
numerical_curves (List[str], optional) – The numerical columns in which the numerical value should be replaced with np.nan.
categorical_curves (List[str], optional) – The categorical columns in which the numerical value should be replaced with np.nan.
numerical_value (float/int, optional) – The numerical value that should be replaced with np.nan.
categorical_value (str, optional) – The categorical value that should be replaced with np.nan.
- Returns
The original dataframe filled with np.nan where requested
- Return type
pd.DataFrame
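A hedged usage sketch based on the keyword arguments documented above; the curve names and sentinel values are assumptions for illustration only.

from akerbp.mlpet import preprocessors

df = preprocessors.set_as_nan(
    df,
    numerical_curves=["GR", "RDEP"],   # assumed numerical column names
    numerical_value=-9999.0,           # assumed missing-value sentinel
    categorical_curves=["FORMATION"],  # assumed categorical column name
    categorical_value="UNKNOWN",       # assumed missing-value sentinel
)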
- akerbp.mlpet.preprocessors.remove_outliers(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame [source]#
Returns the dataframe after applying the curve specific cutoff values if the threshold (th) of the number of outliers is passed. The following curves and corresponding cutoff values are used (if they exist in the provided curves list):
GR: low cutoff: 0, high cutoff: 250
RMED: high cutoff: 100
RDEP: high cutoff: 100
RSHA: high cutoff: 100
NEU: low cutoff: -0.5, high cutoff: 1 (replaced with np.nan)
PEF: high cutoff: 10 (replaced with np.nan)
If not otherwise specified above, values above/below the cutoffs are replaced with the corresponding cutoff value.
- Parameters
df (pd.DataFrame) – dataframe to remove outliers from
- Keyword Arguments
outlier_curves (list) – The curves to remove outliers for using the above rules
threshold (float, optional) – threshold on the number of samples that are outliers; used to display a warning when too many samples are removed. Defaults to 0.05.
- Returns
dataframe without outliers
- Return type
pd.DataFrame
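A minimal call sketch, assuming df contains some of the curves with defined cutoffs:

from akerbp.mlpet import preprocessors

df = preprocessors.remove_outliers(
    df,
    outlier_curves=["GR", "RMED", "NEU"],  # assumed subset of the curves listed above
    threshold=0.05,                        # documented default
)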
- akerbp.mlpet.preprocessors.remove_small_negative_values(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame [source]#
Replaces small negative values with np.NaN in all the numeric columns. The small negative values are determined by defining a nan_threshold. If the negative value is smaller than the threshold it is set to nan. Naturally, this operation is only done on numeric columns.
- Parameters
df (pd.DataFrame) – dataframe to be preprocessed
- Keyword Arguments
numerical_curves (List[str]) – The column names for which small negative values should be replaced with NaNs. If not provided, this list is generated using the get_col_types utility function
nan_threshold (float, optional) – The threshold determining the smallest acceptable negative value. Defaults to None
- Returns
preprocessed dataframe
- Return type
pd.DataFrame
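A usage sketch with assumed curve names and an assumed threshold value:

from akerbp.mlpet import preprocessors

df = preprocessors.remove_small_negative_values(
    df,
    numerical_curves=["GR", "RDEP"],  # assumed; inferred via get_col_types if omitted
    nan_threshold=-0.01,              # assumed smallest acceptable negative value
)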
- akerbp.mlpet.preprocessors.fill_zloc_from_depth(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame [source]#
Fill missing values in Z_LOC column with values from the DEPTH_MD column
- Parameters
df (pd.DataFrame) – The dataframe containing both Z_LOC and DEPTH_MD columns
- Returns
The original dataframe with the Z_LOC column filled where possible.
- Return type
pd.DataFrame
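Since the relevant column names are fixed, usage reduces to a single call (df is assumed to contain both columns):

from akerbp.mlpet import preprocessors

df = preprocessors.fill_zloc_from_depth(df)  # df must contain Z_LOC and DEPTH_MD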
- akerbp.mlpet.preprocessors.fillna_with_fillers(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame [source]#
Fills all NaNs in numeric columns with a num_filler and all NaNs in categorical columns with a cat_filler. All four of these variables are passed as kwargs.
If a num_filler and/or cat_filler is passed without corresponding column names, column types are inferred using the get_col_types utility function.
- Parameters
df (pd.DataFrame) – The dataframe to be preprocessed
- Keyword Arguments
num_filler (float) – The numeric value to fill NaNs with in the numerical columns
numerical_curves (List[str]) – The column names for all numerical columns where the NaNs will be filled with the num_filler
cat_filler – The value to fill NaNs with in the categorical columns
categorical_curves (List[str]) – The column names for all categorical columns where the NaNs will be filled with the cat_filler
- Returns
Preprocessed dataframe
- Return type
pd.DataFrame
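A sketch with assumed filler values and column names:

from akerbp.mlpet import preprocessors

df = preprocessors.fillna_with_fillers(
    df,
    num_filler=-9999.0,                # assumed numerical filler
    numerical_curves=["GR", "RDEP"],   # assumed numerical columns
    cat_filler="MISSING",              # assumed categorical filler
    categorical_curves=["FORMATION"],  # assumed categorical columns
)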
- akerbp.mlpet.preprocessors.encode_columns(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame [source]#
Encodes categorical columns. Only available for:
- FORMATION column - categories are encoded using the formations_map provided in the kwargs.
- GROUP column - categories are encoded using the groups_map provided in the kwargs.
- lsuName column - categories are encoded using the groups_map provided in the kwargs.
Note: All names are standardized prior to mapping using the utility function standardize_group_formation_name and all categories that weren’t mapped are encoded with -1.
- Parameters
df (pd.DataFrame) – dataframe to which apply encoding of categorical variables
- Keyword Arguments
columns_to_encode (list) – which columns to encode. Defaults to no columns being passed, in which case the get_col_types utility function is used to determine the categorical columns to encode.
formations_map (dict) – A mapping dictionary mapping formation names to corresponding integers. Defaults to an empty dictionary (ie no encoding).
groups_map (dict) – A mapping dictionary mapping group names to corresponding integers. Defaults to an empty dictionary (ie no encoding).
missing_encoding_value (int) – The value used to encode categories for which no match was found in the provided mappings. Defaults to -1.
- Returns
dataframe with categorical columns encoded
- Return type
pd.DataFrame
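A sketch using hypothetical mapping dictionaries; the formation and group names shown are assumptions, not mappings shipped with mlpet:

from akerbp.mlpet import preprocessors

df = preprocessors.encode_columns(
    df,
    columns_to_encode=["FORMATION", "GROUP"],
    formations_map={"UTSIRA FM": 0, "SKADE FM": 1},    # assumed mapping
    groups_map={"NORDLAND GP": 0, "HORDALAND GP": 1},  # assumed mapping
    missing_encoding_value=-1,                         # documented default
)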
- akerbp.mlpet.preprocessors.select_columns(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame [source]#
Returns a dataframe with only curves chosen by user, filtered from the original dataframe
- Parameters
df (pd.DataFrame) – dataframe to filter
- Keyword Arguments
curves_to_select (list) – which curves should be kept. Defaults to None.
label_column (str) – The name of the label column to keep if also desired. Defaults to None
id_column (str) – The name of the id column to keep if also desired. Defaults to None
- Returns
dataframe with relevant curves
- Return type
pd.DataFrame
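A call sketch with assumed curve, label and ID column names:

from akerbp.mlpet import preprocessors

df = preprocessors.select_columns(
    df,
    curves_to_select=["GR", "RDEP", "NEU"],  # assumed curve names
    label_column="LITHOLOGY",                # assumed label column
    id_column="well_name",                   # assumed well ID column
)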
- akerbp.mlpet.preprocessors.drop_columns(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame [source]#
Returns a dataframe with the requested curves dropped
- Parameters
df (pd.DataFrame) – dataframe to filter
- Keyword Arguments
curves_to_drop (list) – The curves to be dropped. Defaults to None.
- Returns
dataframe with requested curves dropped
- Return type
pd.DataFrame
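Usage is a single call; the curve names here are assumed:

from akerbp.mlpet import preprocessors

df = preprocessors.drop_columns(df, curves_to_drop=["SP", "CALI"])  # assumed curve names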
- akerbp.mlpet.preprocessors.normalize_curves(df: pandas.core.frame.DataFrame, **kwargs) Union[pandas.core.frame.DataFrame, Tuple[pandas.core.frame.DataFrame, Dict[str, Any]]] [source]#
Normalizes dataframe columns.
We choose one well to be a “key well” and normalize all other wells to its low and high values. This process requires the kwarg ‘id_column’ to be passed so that wells can be grouped by their ID.
For each curve to be normalized, high and low quantiles are calculated per well (the high and low percentage keyword arguments dictate this).
If the user provides key wells, the key wells calculation is not performed.
- Parameters
df (pd.DataFrame) – dataframe with columns to normalize
- Keyword Arguments
curves_to_normalize (list) – List of curves (column names) to normalize. Defaults to None (i.e. no curves being normalized).
id_column (str) – The name of the well ID column. This keyword argument MUST be provided to use this method.
low_perc (float) – low quantile to use as min value. Defaults to 5%
high_perc (float) – high quantile to use as max value. Defaults to 95%
user_key_wells (dict) – dictionary with curves as keys and the corresponding min/max values and key well as values
save_key_wells (bool) – whether to save the key wells dictionary in folder_path. Defaults to False
folder_path (str) – The folder to save the key wells dictionary in. Defaults to “” so an error will be raised if saving is set to True but no folder_path is provided.
- Returns
pd.DataFrame with normalized values and dictionary with key wells that were used to normalize the curves_to_normalize
- Return type
tuple(pd.DataFrame, dict)
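A hedged sketch; the curve and ID column names are assumed, and whether a key wells dictionary is returned alongside the dataframe follows the documented return type:

from akerbp.mlpet import preprocessors

result = preprocessors.normalize_curves(
    df,
    curves_to_normalize=["GR"],  # assumed curve name
    id_column="well_name",       # assumed well ID column (required)
)
# Per the documented return type, result may be a (pd.DataFrame, key wells dict) tuple.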
- akerbp.mlpet.preprocessors.scale_curves(df: pandas.core.frame.DataFrame, **kwargs) Union[pandas.core.frame.DataFrame, Tuple[pandas.core.frame.DataFrame, Dict[str, sklearn.base.BaseEstimator]]] [source]#
Scales specified columns
- Parameters
df (pd.DataFrame) – dataframe containing columns to scale
- Keyword Arguments
curves_to_scale (list) – list of curves (column names) to scale
scaler_method (str) – The name of any sklearn scaler, given as a string. Defaults to RobustScaler
scaler_kwargs (dict) – dictionary of any kwargs to pass to the sklearn scaler
scaler (BaseEstimator) – a pre-fitted sklearn scaler object to apply directly to the curves_to_scale. If this kwarg is provided none of the other kwargs BESIDES curves_to_scale is needed.
save_scaler (bool) – whether to save scaler in folder_path or not. Defaults to False.
folder_path (str) – Which folder to save the scalers in. Defaults to no path so a path needs to be provided if the save_scaler kwarg is set to True.
- Returns
scaled columns and the scaler object that was used to scale them, stored in a dict.
- Return type
tuple(pd.DataFrame, dict)
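A sketch assuming the scaler is selected by its sklearn class name; the curve names and scaler kwargs are illustrative:

from akerbp.mlpet import preprocessors

result = preprocessors.scale_curves(
    df,
    curves_to_scale=["GR", "RDEP"],                 # assumed curve names
    scaler_method="RobustScaler",                   # assumed spelling of the scaler name
    scaler_kwargs={"quantile_range": (5.0, 95.0)},  # forwarded to the sklearn scaler
)
# Per the documented return type, result may be a (pd.DataFrame, scaler dict) tuple.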
- akerbp.mlpet.preprocessors.process_wells(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame [source]#
Performs preprocessing per well
This is a convenience function that will perform several preprocessing steps per well if an id_column is provided in the kwargs. Otherwise it will treat the entire df as one well and preprocess it according to the same pipeline as the per well treatment.
The preprocessing pipeline performed is as follows:
- imputation (if the ‘imputer’ kwarg is set)
- feature engineering:
  - Rolling features created using the add_rolling_features function (if the ‘rolling_features’ kwarg is set)
  - Gradient features created using the add_gradient_features function (if the ‘gradient_features’ kwarg is set)
  - Sequential features created using the add_sequential_features function (if the ‘sequential_features’ kwarg is set)
The kwargs for each method discussed above must also be provided to this method. Please refer to the specific methods to determine which kwargs to provide
- Parameters
df (pd.DataFrame) – dataframe of data to be preprocessed
- Keyword Arguments
id_column (str) – The well ID column name to use to groupby well ID
imputation_type (str) – Which imputer to use. Can be one of the following two options:
- ‘iterative’ – runs the iterative_impute method from the imputers module. Please refer to that method to read up on all necessary kwargs to use that method properly
- ‘simple’ – runs the simple_impute method from the imputers module. Please refer to that method to read up on all necessary kwargs to use that method properly
- Returns
dataframe of preprocessed data
- Return type
pd.DataFrame
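A minimal sketch; the well ID column name is assumed, and any kwargs required by the chosen imputer (see the imputers module) must be supplied as well:

from akerbp.mlpet import preprocessors

df = preprocessors.process_wells(
    df,
    id_column="well_name",     # assumed well ID column
    imputation_type="simple",  # runs simple_impute; its kwargs must also be provided
)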
- akerbp.mlpet.preprocessors.remove_noise(df: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame [source]#
Removes noise by applying a median rolling window on each curve.
Warning
Both kwargs are required for this function. If they are not provided, no noise filtering is performed and the df is returned untouched
- Parameters
df (pd.DataFrame) – dataframe to which apply median filtering
- Keyword Arguments
noisy_curves (list) – list of curves (columns) to apply noise removal to with a median filter. If none are provided, median filtering will be applied to all numerical columns. Numerical columns are identified using the get_col_types utility function
noise_removal_window (int) – the window size to use when applying median filtering
- Returns
dataframe after removing noise
- Return type
pd.DataFrame
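Both kwargs are required, so a sketch looks like this (curve names and window size are assumed):

from akerbp.mlpet import preprocessors

df = preprocessors.remove_noise(
    df,
    noisy_curves=["GR", "RDEP"],  # assumed curve names
    noise_removal_window=5,       # assumed window size in samples
)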
- akerbp.mlpet.preprocessors.apply_calibration(df_measured: pandas.core.frame.DataFrame, df_predicted: pandas.core.frame.DataFrame, curves: List[str], location_curves: List[str], level: str, mode: str, id_column: str, distance_thres: float = 9999.0, calib_map: Optional[pandas.core.frame.DataFrame] = None, standardize_level_names: bool = True) pandas.core.frame.DataFrame [source]#
Applies calibration to predicted curves, removing biases with the help of measured curves, either in the same well or closest wells
- Parameters
df_measured (pd.DataFrame) – original dataframe with measured values
df_predicted (pd.DataFrame) – dataframe with predicted values, same column names as df_measured
curves (List[str]) – curves to which apply calibration
location_curves (List[str]) – which curves to use for distance to get closest wells
level (str) – which grouping type to apply calibration with (per group, per formation)
mode (str) – type of value aggregation (mean, median, mode)
id_column (str) – well ID column name
distance_thres (float, optional) – threshold for a well to be considered close enough. Defaults to 9999.0.
calib_map (pd.DataFrame, optional) – calibration map for the level. Defaults to None.
standardize_level_names (bool, optional) – whether to standardize formation or group names. Defaults to True.
- Returns
dataframe with calibrated values
- Return type
pd.DataFrame
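A hedged call sketch; all column names are assumed, and the exact strings accepted for level and mode are not confirmed here beyond the hints given above:

from akerbp.mlpet import preprocessors

df_calibrated = preprocessors.apply_calibration(
    df_measured=df_measured,
    df_predicted=df_predicted,
    curves=["DEN"],                      # assumed curve name
    location_curves=["X_LOC", "Y_LOC"],  # assumed location columns
    level="GROUP",                       # assumed spelling of the grouping level
    mode="median",
    id_column="well_name",               # assumed well ID column
)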