feature_engineering module
- akerbp.mlpet.feature_engineering.add_log_features(df: pandas.core.frame.DataFrame, **kwargs) -> pandas.core.frame.DataFrame
Creates columns with log10 of curves. All created columns are suffixed with ‘_log’. Negative values are set to zero and 1 is added to all values before the logarithm is taken. In other words, this function is the base-10 analogue of numpy’s log1p, applied after clipping negative values to zero.
- Parameters
df (pd.DataFrame) – dataframe with columns to calculate log10 from
- Keyword Arguments
log_features (list, optional) – list of column names for the columns that should be log-transformed. Defaults to None.
- Returns
New dataframe with calculated log columns
- Return type
pd.DataFrame
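A minimal sketch of the transformation described for add_log_features, assuming it amounts to clipping negatives to zero, adding 1 and taking log10. The helper name log10_plus_one is hypothetical and is not part of akerbp.mlpet.

```python
import numpy as np
import pandas as pd


def log10_plus_one(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: clip negatives to 0, add 1, then take log10."""
    out = df.copy()
    for col in columns:
        # Suffix follows the '_log' convention described in the docstring.
        out[f"{col}_log"] = np.log10(out[col].clip(lower=0) + 1)
    return out


example = pd.DataFrame({"RDEP": [-5.0, 0.0, 9.0, 99.0]})
result = log10_plus_one(example, ["RDEP"])
```

Negative and zero inputs both map to 0 here, because both become log10(1).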
- akerbp.mlpet.feature_engineering.add_gradient_features(df: pandas.core.frame.DataFrame, **kwargs) -> pandas.core.frame.DataFrame
Creates columns with gradient of curves. All created columns are suffixed with ‘_gradient’.
- Parameters
df (pd.DataFrame) – dataframe with columns to calculate gradient from
- Keyword Arguments
gradient_features (list, optional) – list of column names for the columns that gradient features should be calculated for. Defaults to None.
- Returns
New dataframe with calculated gradient feature columns
- Return type
pd.DataFrame
- akerbp.mlpet.feature_engineering.add_rolling_features(df: pandas.core.frame.DataFrame, **kwargs) -> pandas.core.frame.DataFrame
Creates columns with window/rolling features of curves. All created columns are suffixed with ‘_window_mean’ / ‘_window_max’ / ‘_window_min’.
- Parameters
df (pd.DataFrame) – dataframe with columns to calculate rolling features from
- Keyword Arguments
rolling_features (list) – columns to apply rolling features to. Defaults to None.
depth_column (str) – The name of the column to use to determine the sampling rate. Without this kwarg no rolling features are calculated.
window (float) – The window size to use for calculating the rolling features. The window size is defined in distance! The sampling rate is determined from the depth_column kwarg and used to transform the window size into an index based window. If this is not provided, no rolling features are calculated.
- Returns
New dataframe with calculated rolling feature columns
- Return type
pd.DataFrame
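The conversion from a distance-based window to an index-based window can be sketched as follows. This is an illustration only: the helper name rolling_sketch is hypothetical, and it assumes the sampling rate is the median spacing of the depth column and that the window is trailing, neither of which is stated in the docstring.

```python
import pandas as pd


def rolling_sketch(df, column, depth_column, window):
    """Hypothetical helper: rolling mean/max/min over a distance window."""
    # Assumption: sampling rate = median spacing between consecutive depths.
    step = df[depth_column].diff().median()
    # Convert the distance window into a number of samples (at least 1).
    n_samples = max(int(round(window / step)), 1)
    roll = df[column].rolling(n_samples, min_periods=1)
    out = df.copy()
    out[f"{column}_window_mean"] = roll.mean()
    out[f"{column}_window_max"] = roll.max()
    out[f"{column}_window_min"] = roll.min()
    return out


logs = pd.DataFrame(
    {"DEPTH": [100.0, 100.5, 101.0, 101.5], "GR": [10.0, 20.0, 30.0, 40.0]}
)
rolled = rolling_sketch(logs, "GR", "DEPTH", window=1.0)
```

With a 0.5 depth step, a window of 1.0 spans two samples in this sketch.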
- akerbp.mlpet.feature_engineering.add_sequential_features(df: pandas.core.frame.DataFrame, **kwargs) -> pandas.core.frame.DataFrame
Adds n past values of columns (for sequential modelling). All created columns are suffixed with ‘_1’ / ‘_2’ / … / ‘_n’.
- Parameters
df (pd.DataFrame) – dataframe to add time features to
- Keyword Arguments
sequential_features (list, optional) – columns to apply shifting to. Defaults to None.
shift_size (int, optional) – Size of the shifts to calculate. In other words, number of past values to include. If this is not provided, no sequential features are calculated.
- Returns
New dataframe with sequential feature columns
- Return type
pd.DataFrame
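The shifting described above can be sketched with pandas' shift. The helper name shift_sketch is hypothetical, and the sketch ignores any per-well grouping the library may apply.

```python
import pandas as pd


def shift_sketch(df, columns, shift_size):
    """Hypothetical helper: add the previous shift_size values of each column."""
    out = df.copy()
    for col in columns:
        for i in range(1, shift_size + 1):
            # Column '<name>_i' holds the value from i rows earlier.
            out[f"{col}_{i}"] = out[col].shift(i)
    return out


seq = shift_sketch(pd.DataFrame({"GR": [1.0, 2.0, 3.0]}), ["GR"], shift_size=2)
```

The first rows of each shifted column are NaN because no past value exists for them.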
- akerbp.mlpet.feature_engineering.add_petrophysical_features(df: pandas.core.frame.DataFrame, **kwargs) -> pandas.core.frame.DataFrame
Creates petrophysical features according to relevant heuristics/formulas.
The features created are as follows (each one can be toggled on/off via the ‘petrophysical_features’ kwarg):
- VPVS = ACS / AC
- PR = (VP ** 2 - 2 * VS ** 2) / (2 * (VP ** 2 - VS ** 2)), where
  - VP = 304.8 / AC
  - VS = 304.8 / ACS
- RAVG = AVG(RDEP, RMED, RSHA), if at least two of those are present
- LFI = 2.95 - ((NEU + 0.15) / 0.6) - DEN, where
  - LFI values below -0.9 are set to 0
  - NaNs are filled with 0
- FI = (ABS(LFI) + LFI) / 2
- LI = ABS(ABS(LFI) - LFI) / 2
- AI = DEN * ((304.8 / AC) ** 2)
- CALI-BS = CALI - BS, where
  - BS is calculated using the guess_BS_from_CALI function from this module if it is not found in the passed dataframe
- VSH = refer to the calculate_VSH docstring for more info on this
- Parameters
df (pd.DataFrame) – dataframe from which the features are calculated and to which they are added
- Keyword Arguments
petrophysical_features (list) – A list of all the petrophysical features that should be created (see above for all the potential features this method can create). This defaults to an empty list (i.e. no features created).
- Returns
dataframe with added features
- Return type
pd.DataFrame
- akerbp.mlpet.feature_engineering.add_well_metadata(df: pandas.core.frame.DataFrame, **kwargs) -> pandas.core.frame.DataFrame
Adds well metadata columns to the provided dataframe from the provided well metadata dictionary (kwarg).
Warning
This method will not work without the three kwargs listed below! It will return the df untouched and print a warning if kwargs are missing.
- Parameters
df (pd.DataFrame) – The dataframe in which the well metadata columns will be added
- Keyword Arguments
metadata_dict (dict) – The dictionary containing the relevant metadata per well (usually generated with the get_well_metadata function from akerbp.mlpet.utilities).
metadata_columns (list) – List of metadata columns to add (each entry must correspond to a metadata key in the provided metadata_dict kwarg)
id_column (str) – The name of the column containing the well names (to be matched with the keys in the provided metadata_dict)
- Returns
Return the passed dataframe with the requested columns added
- Return type
pd.DataFrame
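A rough sketch of how the three kwargs fit together. It assumes metadata_dict has the shape {well name: {metadata key: value}}, which the docstring implies but does not spell out. The helper name add_metadata_sketch and the FIELD key are hypothetical.

```python
import pandas as pd


def add_metadata_sketch(df, metadata_dict, metadata_columns, id_column):
    """Hypothetical helper: map per-well metadata onto new columns."""
    out = df.copy()
    for key in metadata_columns:
        # Build a well -> value lookup for this metadata key.
        lookup = {well: meta.get(key) for well, meta in metadata_dict.items()}
        out[key] = out[id_column].map(lookup)
    return out


metadata = {"WELL_A": {"FIELD": "Alpha"}, "WELL_B": {"FIELD": "Beta"}}
wells = pd.DataFrame({"well_name": ["WELL_A", "WELL_B", "WELL_C"]})
with_meta = add_metadata_sketch(wells, metadata, ["FIELD"], "well_name")
```

Wells missing from the dictionary (WELL_C here) end up with missing values in this sketch.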
- akerbp.mlpet.feature_engineering.add_formations_and_groups(df: pandas.core.frame.DataFrame, **kwargs) -> pandas.core.frame.DataFrame
Adds a FORMATION AND GROUP column to the dataframe based on the well formation tops metadata and the depth in the column.
Note
This function requires several kwargs to be able to run. If they are not provided, a warning is raised and the df is returned untouched.
Note
If the well is not found in formation_tops_mapping, the code will print a warning and continue to the next well.
Example
An example mapper dictionary that would classify all depths in WELL_A between 120 and 879 as NORDLAND GP and all depths between 879 and 2014 as HORDALAND GP would look like this:
formation_tops_mapper = {
    "WELL_A": {
        "labels": ["NORDLAND GP", "HORDALAND GP"],
        "levels": [120.0, 879.0, 2014.0]
    }
    ...
}
It can be generated by using the get_formation_tops function from akerbp.mlpet.utilities.
- Parameters
df (pd.DataFrame) – The dataframe in which the formation tops label column should be added
- Keyword Arguments
id_column (str) – The name of the column of well IDs
depth_column (str) – The name of the depth column to use for applying the mappings.
formation_tops_mapper (dict) –
A dictionary mapping the well IDs to the formation tops labels, chronostrat and depth levels. For example:
formation_tops_mapper = {
    "31/6-6": {
        "group_labels": ['Nordland Group', 'Hordaland Group', ...],
        "group_labels_chronostrat": ['Cenozoic', 'Paleogene', ...],
        "group_levels": [336.0, 531.0, 650.0, ...],
        "formation_labels": ['Balder Formation', 'Sele Formation', ...],
        "formation_labels_chronostrat": ['Eocene', 'Paleocene', ...],
        "formation_levels": [650.0, 798.0, 949.0, ...]
    }
    ...
}
The above example would classify all depths in well 31/6-6 between 336 & 531 to belong to the Nordland Group, and the corresponding chronostrat is the Cenozoic period. Depths between 650 and 798 are classified to belong to the Balder formation, which belongs to the Eocene period.
client (CogniteClient) – client to query CDF for formation tops if a mapping dictionary is not provided. Defaults to None.
- Returns
dataframe with additional columns for FORMATION and GROUP
- Return type
pd.DataFrame
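The depth-to-label classification can be illustrated with pandas' cut, using the WELL_A example mapper from above. This is only a sketch: the column names are hypothetical, and whether the library treats the level boundaries as inclusive on the left or the right is an assumption (left-inclusive here).

```python
import pandas as pd

mapper = {
    "WELL_A": {
        "labels": ["NORDLAND GP", "HORDALAND GP"],
        "levels": [120.0, 879.0, 2014.0],
    }
}

tops_df = pd.DataFrame(
    {"well_name": ["WELL_A"] * 4, "DEPTH": [50.0, 500.0, 1000.0, 3000.0]}
)

entry = mapper["WELL_A"]
# Bin edges are the levels; each bin [level_i, level_i+1) gets label_i.
tops_df["GROUP"] = pd.cut(
    tops_df["DEPTH"], bins=entry["levels"], labels=entry["labels"], right=False
)
```

Depths outside the mapped levels (50 and 3000 here) receive no label.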
- akerbp.mlpet.feature_engineering.guess_BS_from_CALI(df: pandas.core.frame.DataFrame, standard_BS_values: Optional[List[float]] = None) -> pandas.core.frame.DataFrame
Guess bitsize from CALI, given the standard bitsizes.
- Parameters
df (pd.DataFrame) – dataframe to preprocess
- Keyword Arguments
standard_BS_values (ndarray) –
Numpy array of standardized bitsizes to consider. Defaults to:
np.array([6, 8.5, 9.875, 12.25, 17.5, 26])
- Returns
preprocessed dataframe
- Return type
pd.DataFrame
- akerbp.mlpet.feature_engineering.calculate_CALI_BS(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame
Calculates CALI-BS assuming at least CALI is provided in the dataframe argument. If BS is not provided, it is estimated using the guess_BS_from_CALI method from this module.
- Parameters
df (pd.DataFrame) – The dataframe to which CALI-BS should be added.
- Raises
ValueError – Raises an error if neither CALI nor BS are provided
- Returns
Returns the dataframe with CALI-BS as a new column
- Return type
pd.DataFrame
- akerbp.mlpet.feature_engineering.calculate_AI(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame
Calculates AI from DEN and AC according to the following formula:
AI = DEN * ((304.8 / AC) ** 2)
- Parameters
df (pd.DataFrame) – The dataframe to which AI should be added.
- Raises
ValueError – Raises an error if neither DEN nor AC are provided
- Returns
Returns the dataframe with AI as a new column
- Return type
pd.DataFrame
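The AI formula translates directly to pandas arithmetic. The helper name ai_sketch is hypothetical, and the sketch does not reproduce the library's error handling.

```python
import pandas as pd


def ai_sketch(df):
    """Hypothetical helper: AI = DEN * ((304.8 / AC) ** 2)."""
    out = df.copy()
    out["AI"] = out["DEN"] * (304.8 / out["AC"]) ** 2
    return out


acoustic = ai_sketch(pd.DataFrame({"DEN": [2.0], "AC": [152.4]}))
```

For DEN = 2.0 and AC = 152.4 the velocity term is 2.0, giving AI = 2.0 * 4.0 = 8.0.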
- akerbp.mlpet.feature_engineering.calculate_LI(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame
Calculates LI from LFI according to the following formula:
LI = ABS(ABS(LFI) - LFI) / 2
If LFI is not in the provided dataframe, it is calculated using the calculate_LFI method of this module.
- Parameters
df (pd.DataFrame) – The dataframe to which LI should be added.
- Raises
ValueError – Raises an error if none of NEU, DEN or LFI are provided
- Returns
Returns the dataframe with LI as a new column
- Return type
pd.DataFrame
- akerbp.mlpet.feature_engineering.calculate_FI(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame
Calculates FI from LFI according to the following formula:
FI = (ABS(LFI) + LFI) / 2
If LFI is not in the provided dataframe, it is calculated using the calculate_LFI method of this module.
- Parameters
df (pd.DataFrame) – The dataframe to which FI should be added.
- Raises
ValueError – Raises an error if none of NEU, DEN or LFI are provided
- Returns
Returns the dataframe with FI as a new column
- Return type
pd.DataFrame
- akerbp.mlpet.feature_engineering.calculate_LFI(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame
Calculates LFI from NEU and DEN according to the following formula:
LFI = 2.95 - ((NEU + 0.15) / 0.6) - DEN
where:
LFI values below -0.9 are set to 0
NaNs are filled with 0
- Parameters
df (pd.DataFrame) – The dataframe to which LFI should be added.
- Raises
ValueError – Raises an error if neither NEU nor DEN are provided
- Returns
Returns the dataframe with LFI as a new column
- Return type
pd.DataFrame
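The LFI formula and the FI and LI formulas that build on it can be sketched together. The helper names are hypothetical, and the order of the two clean-up steps (thresholding, then NaN filling) is an assumption.

```python
import numpy as np
import pandas as pd


def lfi_sketch(neu, den):
    """Hypothetical helper: LFI = 2.95 - ((NEU + 0.15) / 0.6) - DEN."""
    lfi = 2.95 - ((neu + 0.15) / 0.6) - den
    lfi = lfi.mask(lfi < -0.9, 0.0)  # values below -0.9 are set to 0
    return lfi.fillna(0.0)           # NaNs are filled with 0


neu = pd.Series([0.45, 0.15, 0.45, np.nan])
den = pd.Series([2.0, 2.0, 3.0, 2.0])

lfi = lfi_sketch(neu, den)
fi = (lfi.abs() + lfi) / 2          # keeps the positive part of LFI
li = (lfi.abs() - lfi).abs() / 2    # keeps the magnitude of the negative part
```

FI and LI split LFI into its positive and negative parts, so at most one of them is non-zero for any sample.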
- akerbp.mlpet.feature_engineering.calculate_RAVG(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame
Calculates RAVG from RDEP, RMED, RSHA according to the following formula:
RAVG = AVG(RDEP, RMED, RSHA), if at least two of those are present
- Parameters
df (pd.DataFrame) – The dataframe to which RAVG should be added.
- Raises
ValueError – Raises an error if one or fewer resistivity curves are found in the provided dataframe
- Returns
Returns the dataframe with RAVG as a new column
- Return type
pd.DataFrame
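A minimal sketch of the averaging rule, assuming a plain row-wise mean over whichever resistivity curves are present. The helper name ravg_sketch is hypothetical, and how the library treats NaNs within a row is not stated in the docstring (pandas skips them by default).

```python
import pandas as pd


def ravg_sketch(df):
    """Hypothetical helper: average the available resistivity curves."""
    present = [c for c in ("RDEP", "RMED", "RSHA") if c in df.columns]
    if len(present) < 2:
        raise ValueError("At least two of RDEP, RMED and RSHA are required")
    out = df.copy()
    out["RAVG"] = out[present].mean(axis=1)
    return out


res = ravg_sketch(pd.DataFrame({"RDEP": [1.0, 3.0], "RMED": [3.0, 5.0]}))
```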
- akerbp.mlpet.feature_engineering.calculate_VPVS(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame
Calculates VPVS from ACS and AC according to the following formula:
VPVS = ACS / AC
- Parameters
df (pd.DataFrame) – The dataframe to which VPVS should be added.
- Raises
ValueError – Raises an error if neither ACS nor AC are found in the provided dataframe
- Returns
Returns the dataframe with VPVS as a new column
- Return type
pd.DataFrame
- akerbp.mlpet.feature_engineering.calculate_PR(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame
Calculates PR from VP and VS or ACS and AC (if VP and VS are not found) according to the following formula:
PR = (VP ** 2 - 2 * VS ** 2) / (2 * (VP ** 2 - VS ** 2))
where:
VP = 304.8 / AC
VS = 304.8 / ACS
- Parameters
df (pd.DataFrame) – The dataframe to which PR should be added.
- Raises
ValueError – Raises an error if none of AC, ACS, VP or VS are found in the provided dataframe
- Returns
Returns the dataframe with PR as a new column
- Return type
pd.DataFrame
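The PR formula, together with the VPVS ratio from calculate_VPVS, can be checked with a small worked example. The helper name pr_sketch is hypothetical, and the sketch only covers the case where AC and ACS are available.

```python
import pandas as pd


def pr_sketch(df):
    """Hypothetical helper: derive VPVS and PR from AC and ACS."""
    out = df.copy()
    vp = 304.8 / out["AC"]   # VP = 304.8 / AC
    vs = 304.8 / out["ACS"]  # VS = 304.8 / ACS
    out["VPVS"] = out["ACS"] / out["AC"]
    out["PR"] = (vp**2 - 2 * vs**2) / (2 * (vp**2 - vs**2))
    return out


sonic = pr_sketch(pd.DataFrame({"AC": [100.0], "ACS": [200.0]}))
```

With ACS twice AC, VP / VS = 2, so PR = (4 - 2) / (2 * (4 - 1)) = 1/3.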
- akerbp.mlpet.feature_engineering.calculate_VP(df: pandas.core.frame.DataFrame, **kwargs) -> pandas.core.frame.DataFrame
Calculates VP (if AC is found) according to the following formula:
VP = 304.8 / AC
- Parameters
df (pd.DataFrame) – The dataframe to which VP should be added.
- Raises
ValueError – Raises an error if AC is not found in the provided dataframe
- Returns
Returns the dataframe with VP as a new column
- Return type
pd.DataFrame
- akerbp.mlpet.feature_engineering.calculate_VS(df: pandas.core.frame.DataFrame, **kwargs) -> pandas.core.frame.DataFrame
Calculates VS (if ACS is found) according to the following formula:
VS = 304.8 / ACS
- Parameters
df (pd.DataFrame) – The dataframe to which VS should be added.
- Raises
ValueError – Raises an error if ACS is not found in the provided dataframe
- Returns
Returns the dataframe with VS as a new column
- Return type
pd.DataFrame
- akerbp.mlpet.feature_engineering.calculate_VSH(df: pandas.core.frame.DataFrame, **kwargs) -> pandas.core.frame.DataFrame
Calculates the VSH curve based on the GR curve and the type of formation defined in the GROUP column, as follows:
VSH = (GR - GR_ss) / (GR_sh_Gp_f - GR_ss)
where:
- GR_ss = The quant_ss quantile of each defined system (some systems are grouped if relevant). The quantile can be changed via the quant_ss kwarg.
- GR_sh_Gp_f = Shale formation groups are grouped by GROUP and a rolling quantile calculation is applied to each group (the window size is set by the rolling_window_size kwarg and the quantile by the quant_sh kwarg; see the keyword arguments below for their defaults). A savgol filter of window length min(501, number_of_non_nans // 2) and polynomial order 3 is then applied to the rolling quantile group. Note that the filter is ONLY applied if there is enough non-NaN data present in the rolling quantiles. This limit is currently set to 10. If the group still contains NaNs after the filter is applied, linear interpolation is used to fill the gaps (provided there is data that can be used to interpolate). GR_sh_Gp_f represents this final result for all groups.
Note
This calculation is performed per well! Formation tops column in input df is forced into upper case for generalization.
Warning
If a calculation fails for one well, the well will be skipped and the calculation continues with the next well.
Note
If no mapping could be made to the pre-defined systems, the GROUP will be labeled as ‘other’.
- Parameters
df (pd.DataFrame) – The dataframe to which VSH should be added.
- Keyword Arguments
groups_column_name (str) – The name of the column containing group names. Defaults to ‘GROUP’
formations_column_name (str) – The name of the column containing formation names. Defaults to ‘FORMATION’
id_column (str) – The name of the well ID column to use for grouping the dataset by well. Defaults to ‘well_name’
rolling_window_size (int) – The size of the window to use for the rolling quantile calculation of the shale formation groups. Defaults to 2000 or len(group_df) // 2 if less than 2000 where group_df is the dataframe for the specific shale formation group.
filter_window_size (int) – The size of the window to use for the savgol filtering. Defaults to 501 or odd(len(filter_series) // 2) if less than 501 where filter_series is the series of rolling quantiles to be filtered by the savgol filter. MUST be odd (if an even int is provided, the code automatically converts it to an odd window size)
quant_ss (float) – The quantile to use for each age group in the sand formation groups calculation (GR_ss). Defaults to 0.02
quant_sh (float) – The quantile to use in the rolling quantile calculation of the shale formation groups. Defaults to 0.95
NHR_ss_threshold (float) –
The sand point threshold above which the Nordland, Hordaland & Rogaland (NHR) groups should be merged. The threshold is represented as the ratio between the group specific sandpoint (quant_ss) and the NHR system sand point (quant_ss calculated across all three groups - N, H & R). If this ratio is greater than this threshold the groups are merged according to the following strategy:
- Nordland’s sandpoint is set to Hordaland’s sandpoint. If there is no Hordaland group present in the well it falls back to being set to the NHR system sandpoint.
- Hordaland’s sandpoint is set to the average of Nordland and Rogaland’s sandpoints.
- Rogaland’s sandpoint is set to Hordaland’s sandpoint. If there is no Hordaland group present in the well it falls back to being set to the NHR system sandpoint.
non_shale_window_threshold (float) –
A threshold for the following ratio:
NSWT = GR_ss / (GR_sh_Gp_f * (GR_sh_Gp_f - GR_ss))
This threshold causes the VSH_AUTO calculation to linearly interpolate between local minima in the GR_sh_Gp_f curve whenever the above ratio goes above the user-provided threshold. Initial user testing suggests a threshold of 0.015 is a good starting point.
- Returns
Returns the dataframe with VSH as a new column
- Return type
pd.DataFrame
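The final VSH step can be illustrated in isolation. In this sketch the sand baseline (GR_ss) and shale baseline (GR_sh_Gp_f) are supplied by hand as assumptions; the grouping, rolling quantile, savgol filtering and interpolation that the library uses to derive them are not reproduced. The helper name vsh_linear_sketch is hypothetical.

```python
import pandas as pd


def vsh_linear_sketch(gr, gr_ss, gr_sh):
    """Hypothetical helper: VSH = (GR - GR_ss) / (GR_sh_Gp_f - GR_ss)."""
    return (gr - gr_ss) / (gr_sh - gr_ss)


gr = pd.Series([20.0, 60.0, 100.0])
# Assumed baselines: constant sand point and a flat shale line.
vsh = vsh_linear_sketch(gr, gr_ss=20.0, gr_sh=pd.Series([100.0, 100.0, 100.0]))
```

A GR reading on the sand baseline gives VSH = 0 and a reading on the shale baseline gives VSH = 1.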
- akerbp.mlpet.feature_engineering.add_vertical_depths(df: pandas.core.frame.DataFrame, **kwargs) -> pandas.core.frame.DataFrame
Adds vertical depths, i.e. TVDKB, TVDSS and TVDBML, to the input dataframe. This function relies on a keyword argument for a vertical depth mapper dictionary, created by querying CDF at discrete points along the wellbore for each well. To map the vertical depths along the entire wellbore, the data in the dictionary is interpolated using the measured depth.
- Parameters
df (pd.DataFrame) – pandas dataframe to add vertical depths to
- Keyword Arguments
md_column (str) – identifier for the measured depth column in the provided dataframe. Defaults to None.
id_column (str) – identifier for the well column in the provided dataframe. Defaults to None.
vertical_depths_mapper (dict) –
dictionary containing vertical- and measured depths queried from CDF at discrete points along the wellbore for each well. For example:
vertical_depths_mapper = {
    "25/6-2": {
        "TVDKB": [0.0, 145.0, 149.9998, ...],
        "TVDSS": [-26.0, 119.0, 123.9998, ...],
        "TVDBML": [-145.0, 0.0, 4.999799999999993, ...],
        "MD": [0.0, 145.0, 150.0, ...]
    }
}
Defaults to an empty dictionary, i.e. {}.
client (CogniteClient) – client for querying vertical depths from CDF if a mapping dictionary is not provided. Defaults to None.
- Returns
dataframe with additional columns for TVDKB, TVDSS and TVDBML
- Return type
pd.DataFrame
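The interpolation step can be sketched with numpy's interp, assuming linear interpolation over measured depth (the docstring does not name the interpolation method). The mapper below keeps only the first three points of the example above, the column names are hypothetical, and interpolate_depths_sketch is not part of akerbp.mlpet.

```python
import numpy as np
import pandas as pd

# Truncated copy of the example mapper: only the first three points per curve.
vertical_depths_mapper = {
    "25/6-2": {
        "TVDKB": [0.0, 145.0, 149.9998],
        "TVDSS": [-26.0, 119.0, 123.9998],
        "TVDBML": [-145.0, 0.0, 4.9998],
        "MD": [0.0, 145.0, 150.0],
    }
}
CURVES = ("TVDKB", "TVDSS", "TVDBML")


def interpolate_depths_sketch(df, mapper, md_column, id_column):
    """Hypothetical helper: interpolate vertical depths per well from MD."""
    out = df.copy()
    for curve in CURVES:
        out[curve] = np.nan
    for well, rows in out.groupby(id_column):
        if well not in mapper:
            continue  # wells without mapper data keep NaN
        points = mapper[well]
        for curve in CURVES:
            # np.interp holds the end values for depths outside the MD range.
            out.loc[rows.index, curve] = np.interp(
                rows[md_column], points["MD"], points[curve]
            )
    return out


wells = pd.DataFrame({"well_name": ["25/6-2", "25/6-2"], "DEPTH": [145.0, 147.5]})
depths = interpolate_depths_sketch(
    wells, vertical_depths_mapper, "DEPTH", "well_name"
)
```

A measured depth of 147.5 lies halfway between the mapper points at 145.0 and 150.0, so each vertical depth is the midpoint of the corresponding pair of values.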