utilities module

akerbp.mlpet.utilities.df_split_train_test(df, ...)

Splits dataframe into two groups: train and val/test set.

akerbp.mlpet.utilities.drop_rows_wo_label(df, ...)

Removes rows with missing targets.

akerbp.mlpet.utilities.feature_target_split(df, ...)

Splits set into features and target

akerbp.mlpet.utilities.get_col_types(df[, ...])

Returns lists of numerical and categorical columns

akerbp.mlpet.utilities.get_formation_tops(...)

Retrieves formation tops metadata for a provided list of well names (IDs) from CDF and returns them in a dictionary of depth levels and labels per well.

akerbp.mlpet.utilities.get_well_metadata(...)

Retrieve relevant well metadata for the provided well_names

akerbp.mlpet.utilities.normalize(col, ...)

Helper function that applies min-max normalization on a pandas series and rescales it to a reference range.

akerbp.mlpet.utilities.standardize_curve_names(df, ...)

Standardize curve names in a dataframe based on the curve_mappings dictionary.

akerbp.mlpet.utilities.standardize_group_formation_name(name)

Performs several string operations to standardize group formation names for later categorisation.

akerbp.mlpet.utilities.standardize_names(...)

Standardize curve names in a list based on the curve_mappings dictionary.

akerbp.mlpet.utilities.train_test_split(df, ...)

Splits a dataset into training and val/test sets by well.

akerbp.mlpet.utilities.wells_split_train_test(df, ...)

Splits wells into two groups (train and val/test)

akerbp.mlpet.utilities.drop_rows_wo_label(df: pandas.core.frame.DataFrame, label_column: str, **kwargs) → pandas.core.frame.DataFrame

Removes rows with missing targets.

Now that the imputation is done via pd.df.fillna(), what we need is the constant filler_value. If the imputation is ever done using one of the sklearn.impute methods or a similar API, we can use the indicator column (add_indicator=True).

Parameters
  • df (pd.DataFrame) – dataframe to process

  • label_column (str) – Name of the label column containing rows without labels

Keyword Arguments

missing_label_value (str, optional) – If nans are denoted differently than np.nans, a missing_label_value can be passed as a kwarg and all rows containing this missing_label_value in the label column will be dropped

Returns

processed dataframe

Return type

pd.DataFrame
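
The behaviour described above can be sketched with plain pandas. This is a minimal illustration, not mlpet's implementation: the helper name drop_unlabelled_rows, the column names, and the "MISSING" marker are all hypothetical.

```python
import numpy as np
import pandas as pd


def drop_unlabelled_rows(df, label_column, missing_label_value=None):
    """Hypothetical stand-in for illustration; not mlpet's implementation."""
    # Rows whose label is NaN are always treated as unlabelled
    missing = df[label_column].isna()
    if missing_label_value is not None:
        # Also treat the custom marker (e.g. a string) as a missing label
        missing |= df[label_column] == missing_label_value
    return df.loc[~missing]


df = pd.DataFrame(
    {
        "GR": [80.0, 95.0, 60.0, 72.0],
        "LITHOLOGY": ["shale", np.nan, "MISSING", "sand"],
    }
)
cleaned = drop_unlabelled_rows(df, "LITHOLOGY", missing_label_value="MISSING")
```

Only the rows labelled "shale" and "sand" remain in cleaned; the original index is kept so rows can still be traced back to df.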

akerbp.mlpet.utilities.readPickle(path)

A cached helper function for loading pickle files. Loading pickle files multiple times can really slow down execution.

Parameters

path (str) – Path to the pickled object to be loaded

Returns

Return the loaded pickled data

Return type

data

akerbp.mlpet.utilities.map_formation_and_group(form_or_group: pandas.core.series.Series, MissingValue: Union[float, str] = nan) → Tuple[Union[float, str], Union[float, str]]

A helper function for retrieving the formation and group of a standardised formation/group based on mlpet’s NPD pickle mapper.

Parameters
  • form_or_group (pd.Series) – A pandas series containing AkerBP legal formation/group names to be mapped

  • MissingValue (Any) – If no mapping is found, return this missing value

Returns

Returns a formation and group series respectively corresponding to the input string series

Return type

tuple(pd.Series)

akerbp.mlpet.utilities.standardize_group_formation_name(name: Union[str, Any]) → Union[str, Any]

Performs several string operations to standardize group formation names for later categorisation.

Parameters

name (str) – A group formation name

Returns

Returns the standardized group formation name or np.nan if the name == “NAN”.

Return type

float or str

akerbp.mlpet.utilities.standardize_names(names: List[str], mapper: Dict[str, str]) → Tuple[List[str], Dict[str, str]]

Standardize curve names in a list based on the curve_mappings dictionary. Any columns not in the dictionary are ignored.

Parameters
  • names (list) – list of curve names

  • mapper (dictionary) – dictionary with mappings. Defaults to curve_mappings.

Returns

list of strings with standardized curve names

Return type

list
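
As a rough illustration of dictionary-based name standardization, the sketch below renames any name found in the mapper and passes all other names through unchanged. It assumes that "ignored" means "left as-is"; the helper name standardize and the example mapping are hypothetical, and the real function may apply further steps.

```python
def standardize(names, mapper):
    """Hypothetical helper for illustration; not mlpet's implementation."""
    # Names found in the mapper are renamed; all others pass through unchanged
    return [mapper.get(name, name) for name in names]


# Hypothetical mapping: keys are old curve names, values the desired names
curve_mappings = {"GRD": "GR", "DENS": "DEN"}
result = standardize(["GRD", "DENS", "CALI"], curve_mappings)
```

Here "CALI" has no entry in the mapping, so it is returned untouched alongside the renamed "GR" and "DEN".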

akerbp.mlpet.utilities.standardize_curve_names(df: pandas.core.frame.DataFrame, mapper: Dict[str, str]) → pandas.core.frame.DataFrame

Standardize curve names in a dataframe based on the curve_mappings dictionary. Any columns not in the dictionary are ignored.

Parameters
  • df (pd.DataFrame) – dataframe whose column names should be standardized

  • mapper (dictionary) – dictionary with mappings. Defaults to curve_mappings. The keys should be the old curve names and the values the desired curve names.

Returns

dataframe with standardized column names

Return type

pd.DataFrame

akerbp.mlpet.utilities.get_col_types(df: pandas.core.frame.DataFrame, categorical_curves: Optional[List[str]] = None, warn: bool = True) → Tuple[List[str], List[str]]

Returns lists of numerical and categorical columns

Parameters
  • df (pd.DataFrame) – dataframe with columns to classify

  • categorical_curves (list) – List of column names that should be considered as categorical. Defaults to an empty list.

  • warn (bool) – Whether to warn the user if categorical curves were detected which were not in the provided categorical curves list.

Returns

lists of numerical and categorical columns

Return type

tuple
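
One plausible way to classify columns is sketched below. It assumes that non-numeric columns are detected as categorical and that explicitly listed categorical_curves are added to them; this detection rule, and all column names, are assumptions for illustration rather than the library's documented logic.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "GR": [80.0, 95.0],
        "DEN": [2.3, 2.5],
        "LITHOLOGY": ["shale", "sand"],
        "ZONE": [1, 2],
    }
)
categorical_curves = ["ZONE"]  # numeric codes that should count as categorical

# Assumption: any non-numeric column is detected as categorical
detected = df.select_dtypes(exclude="number").columns.tolist()
categorical = sorted(set(detected) | set(categorical_curves))
numerical = [col for col in df.columns if col not in categorical]
```

"LITHOLOGY" is detected from its dtype, while "ZONE" is only categorical because it was listed explicitly, which is the situation the warn flag is described as reporting on.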

akerbp.mlpet.utilities.wells_split_train_test(df: pandas.core.frame.DataFrame, id_column: str, test_size: float, **kwargs) → Tuple[List[str], List[str], List[str]]

Splits wells into two groups (train and val/test)

NOTE: Set operations are used to perform the splits so ordering is not preserved! The well IDs will be randomly ordered.

Parameters
  • df (pd.DataFrame) – dataframe with data of wells and well ID

  • id_column (str) – The name of the column containing well names which will be used to perform the split.

  • test_size (float) – fraction (0-1) of wells to be in val/test data

Returns

  • wells (list) – well IDs

  • test_wells (list) – well IDs of val/test data

  • training_wells (list) – well IDs of training data

Return type

tuple
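
A set-based well split can be sketched as follows. This is only an illustration under stated assumptions: the helper name split_wells, the seed argument, and rounding the number of test wells down are all hypothetical choices, not mlpet's behaviour.

```python
import random


def split_wells(well_ids, test_size, seed=0):
    """Hypothetical helper for illustration; not mlpet's implementation."""
    wells = set(well_ids)  # unique well IDs; ordering is lost here
    n_test = int(len(wells) * test_size)  # assumption: round down
    rng = random.Random(seed)
    # Sort before sampling so the result is reproducible for a given seed
    test_wells = set(rng.sample(sorted(wells), n_test))
    training_wells = wells - test_wells  # set difference gives the rest
    return list(wells), list(test_wells), list(training_wells)


wells, test_wells, training_wells = split_wells(
    ["A", "B", "C", "D", "E", "A"], test_size=0.2
)
```

With five unique wells and test_size=0.2, one well lands in the val/test group and four in the training group. Because sets are involved, the order of the returned lists should not be relied on.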

akerbp.mlpet.utilities.df_split_train_test(df: pandas.core.frame.DataFrame, id_column: str, test_size: float = 0.2, test_wells: Optional[List[str]] = None, **kwargs) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, List[str]]

Splits dataframe into two groups: train and val/test set.

Parameters
  • df (pd.DataFrame) – dataframe to split

  • id_column (str) – The name of the column containing well names which will be used to perform the split.

  • test_size (float, optional) – size of val/test data. Defaults to 0.2.

  • test_wells (list, optional) – list of wells to be in val/test data. Defaults to None.

Returns

dataframes for train and test sets, and list of test well IDs

Return type

tuple

akerbp.mlpet.utilities.train_test_split(df: pandas.core.frame.DataFrame, target_column: str, id_column: str, **kwargs) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Splits a dataset into training and val/test sets by well (i.e. for an 80-20 split, the provided dataset would need data from at least 5 wells).

This function makes use of several other utility functions. The workflow it executes is:

  1. Drops rows without labels

  2. Splits into train and test sets using df_split_train_test, which in turn performs the split via wells_split_train_test

Parameters
  • df (pd.DataFrame, optional) – dataframe with data

  • target_column (str) – Name of the target column (y)

  • id_column (str) – Name of the wells ID column. This is used to perform the split based on well ID.

Keyword Arguments
  • test_size (float, optional) – size of val/test data. Defaults to 0.2.

  • test_wells (list, optional) – list of wells to be in val/test data. Defaults to None.

  • missing_label_value (str, optional) – If nans are denoted differently than np.nans, a missing_label_value can be passed as a kwarg and all rows containing this missing_label_value in the label column will be dropped

Returns

dataframes for train and test sets, and list of test wells IDs

Return type

tuple
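
The two-step workflow described for this function can be approximated with plain pandas as below. The sketch uses explicitly chosen test wells and hypothetical column names; it does not call mlpet and is not its implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "WELL": ["A", "A", "B", "B", "C", "C"],
        "GR": [80.0, 85.0, 60.0, 62.0, 90.0, 95.0],
        "LABEL": [1.0, np.nan, 0.0, 1.0, 0.0, 1.0],
    }
)

# Step 1: drop rows without labels
labelled = df.dropna(subset=["LABEL"])

# Step 2: split by well ID so that no well contributes to both sets
test_wells = ["C"]  # hypothetical choice of val/test wells
is_test = labelled["WELL"].isin(test_wells)
df_train, df_test = labelled[~is_test], labelled[is_test]
```

Splitting on the well ID rather than on individual rows keeps all samples from a given well on the same side of the split.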

akerbp.mlpet.utilities.feature_target_split(df: pandas.core.frame.DataFrame, target_column: str) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Splits set into features and target

Parameters
  • df (pd.DataFrame) – dataframe to be split

  • target_column (str) – target column name

Returns

input (features) and output (target) dataframes

Return type

tuple

akerbp.mlpet.utilities.normalize(col: pandas.core.series.Series, ref_min: numpy.float64, ref_max: numpy.float64, col_min: float, col_max: float) → pandas.core.series.Series

Helper function that applies min-max normalization on a pandas series and rescales it to a reference range according to the following formula:

ref_min + ((col - col_min) * (ref_max - ref_min) / (col_max - col_min))

Parameters
  • col (pd.Series) – column from dataframe to normalize (series)

  • ref_min (float) – min value of the column of the well of reference

  • ref_max (float) – max value of the column of the well of reference

  • col_min (float) – min value of the column of well to normalize

  • col_max (float) – max value of the column of well to normalize

Returns

normalized series

Return type

pd.Series
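
To make the rescaling formula concrete, here is a small plain-Python sketch using the parameter names from the signature (ref_min, ref_max, col_min, col_max). The helper name rescale and the sample values are hypothetical, and the sketch works on a list rather than a pandas series.

```python
def rescale(values, ref_min, ref_max, col_min, col_max):
    """Hypothetical helper for illustration; not mlpet's implementation."""
    # Ratio between the reference range and the range of the input column
    scale = (ref_max - ref_min) / (col_max - col_min)
    # Shift to zero, stretch to the reference range, shift to ref_min
    return [ref_min + (value - col_min) * scale for value in values]


result = rescale(
    [10.0, 15.0, 20.0], ref_min=0.0, ref_max=100.0, col_min=10.0, col_max=20.0
)
# result == [0.0, 50.0, 100.0]
```

The column minimum maps to ref_min and the column maximum to ref_max, with everything in between scaled linearly.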

akerbp.mlpet.utilities.get_well_metadata(client: cognite.client._cognite_client.CogniteClient, well_names: List[str]) → Dict[str, Dict[str, Any]]

Retrieve relevant well metadata for the provided well_names

Warning

If a well is not found in the asset database, it is not returned in the returned dictionary. Instead a warning is printed to the console with the corresponding well name.

Metadata retrieved:

  • COMPLETION_DATE

  • COORD_SYSTEM_NAME

  • KB_ELEV

  • KB_ELEV_OUOM

  • PUBLIC

  • SPUD_DATE

  • WATER_DEPTH

  • CDF_wellName

  • WATER_DEPTH_DSDSUNIT

  • X_COORDINATE

  • Y_COORDINATE

  • DATUM_ELEVATION

  • DATUM_ELEVATION_UNIT

  • LATITUDE

  • LONGITUDE

Parameters
  • client (CogniteClient) – A connected cognite client instance

  • well_names (List) – The list of well names to retrieve metadata for

Returns

Returns a dictionary where the keys are the well names and the

values are dictionaries with metadata keys and values.

Return type

dict

Example

Example return dictionary:

{
    '25/10-10': {
        'COMPLETION_DATE': '2010-04-02T00:00:00',
        'COORD_SYSTEM_NAME': 'ED50 / UTM zone 31N',
        'DATUM_ELEVATION': '0.0',
        ...},
    '25/10-12 ST2': {
        'COMPLETION_DATE': '2015-01-18T00:00:00',
        'COORD_SYSTEM_NAME': 'ED50 / UTM zone 31N',
        'DATUM_ELEVATION': nan,
        ...},
}

akerbp.mlpet.utilities.get_formation_tops(well_names: str, client: cognite.client._cognite_client.CogniteClient, **kwargs) → Dict[str, Dict[str, Any]]

Retrieves formation tops metadata for a provided list of well names (IDs) from CDF and returns them in a dictionary of depth levels and labels per well.

Parameters
  • well_names (str) – A list of well names (IDs)

  • client (CogniteClient) – A connected instance of the Cognite Client.

Keyword Arguments

undefined_name (str) – Name for undefined formation/group tops. Defaults to ‘UNKNOWN’

NOTE: The formation will be skipped if it’s only 1m thick.

NPD does not provide technical side tracks, so the information (formation tops) provided by NPD is missing T-labels.

Returns

Returns a dictionary of formation tops metadata per well in this format:

formation_tops_mapper = {
    "31/6-6": {
        "group_labels": ['Nordland Group', 'Hordaland Group', ...],
        "group_labels_chronostrat": ['Cenozoic', 'Paleogene', ...],
        "group_levels": [336.0, 531.0, 650.0, ...],
        "formation_labels": ['Balder Formation', 'Sele Formation', ...],
        "formation_labels_chronostrat": ['Eocene', 'Paleocene', ...],
        "formation_levels": [650.0, 798.0, 949.0, ...]
    }
    ...
}

Return type

Dict

NOTE: The length of the levels entries equals the length of the corresponding labels entries + 1, such that the first entry of a label entry lies between the first and the second entries of the corresponding level entry.
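
Given that layout (one more level than labels), a depth can be mapped to its label with a binary search. The sketch below is an illustration only: the helper name label_at_depth is hypothetical, the sample values are shortened from the example above, and treating each interval as including its top but excluding its base is an assumption.

```python
from bisect import bisect_right


def label_at_depth(depth, levels, labels):
    """Hypothetical helper for illustration; not part of mlpet."""
    # labels[i] is assumed to span the interval [levels[i], levels[i + 1])
    i = bisect_right(levels, depth) - 1
    if i < 0 or i >= len(labels):
        return None  # depth lies above the first top or below the last level
    return labels[i]


group_levels = [336.0, 531.0, 650.0]  # three levels bound two labels
group_labels = ["Nordland Group", "Hordaland Group"]

shallow = label_at_depth(400.0, group_levels, group_labels)
deeper = label_at_depth(531.0, group_levels, group_labels)
outside = label_at_depth(700.0, group_levels, group_labels)
```

Here shallow resolves to "Nordland Group", deeper to "Hordaland Group", and outside to None because 700.0 is below the last level.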

akerbp.mlpet.utilities.get_vertical_depths(well_names: List[str], client: cognite.client._cognite_client.CogniteClient) → Dict[str, Dict[str, List[float]]]

Makes trajectory queries to CDF for all provided wells and extracts vertical and measured depths. These depths are used further down the pipeline to interpolate the vertical depths along the entire wellbores.

Parameters
  • well_names (List[str]) – list of well names

  • client (CogniteClient) – cognite client

Returns

Dictionary containing vertical- and measured depths (values) for each well (keys), list of wells with empty trajectory query to CDF

Return type

Dict[str, Dict[str, List[float]]]

akerbp.mlpet.utilities.get_calibration_map(df: pandas.core.frame.DataFrame, curves: List[str], location_curves: List[str], mode: str, id_column: str, levels: Optional[List[str]] = None, standardize_level_names: bool = True) → Dict[str, pandas.core.frame.DataFrame]

Returns calibration maps for each level, per well, typically formation and group. Calibration maps are pandas dataframes with the well name and unique values for each curve and location, where the value is the chosen “mode”, such as mean, median, mode, etc, specified by the user. Useful for functions preprocessors.apply_calibration() and imputers.fillna_callibration_values().

Parameters
  • df (pd.DataFrame) – dataframe with wells data

  • curves (List[str]) – list of curves to fetch unique values

  • location_curves (List[str]) – list of curves indicating location of well/formation/group. Typically latitude, longitude, tvdbml, depth.

  • mode (str) – any method supported in pandas dataframe for representing the curve, such as median, mean, mode, min, max, etc.

  • id_column (str) – column with well names

  • levels (List[str], optional) – how to group samples in a well, typically per group or formation. Defaults to ["FORMATION", "GROUP"].

  • standardize_level_names (bool, optional) – whether to standardize formation or group names. Defaults to True.

Returns

dictionary with keys being level and values being the calibration map in dataframe format

Return type

Dict[str, pd.DataFrame]
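
The idea of a calibration map, one representative value per curve for each well and level, can be sketched with a pandas groupby. The column names and the choice of "median" are hypothetical, and the exact layout of the dataframes mlpet returns may differ from this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "WELL": ["A", "A", "A", "B", "B"],
        "FORMATION": ["Balder", "Balder", "Sele", "Balder", "Balder"],
        "GR": [50.0, 70.0, 90.0, 40.0, 44.0],
        "LATITUDE": [59.0, 59.0, 59.0, 60.0, 60.0],
    }
)
curves = ["GR"]
location_curves = ["LATITUDE"]
levels = ["FORMATION"]

# One aggregated row per (well, level value), keyed by level name
calibration_maps = {
    level: df.groupby(["WELL", level])[curves + location_curves].agg("median")
    for level in levels
}
formation_map = calibration_maps["FORMATION"]
```

In formation_map, well "A" in the Balder formation gets a GR value of 60.0 (the median of 50.0 and 70.0), alongside its representative latitude.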

akerbp.mlpet.utilities.get_calibration_values(df: pandas.core.frame.DataFrame, curves: List[str], location_curves: List[str], level: str, mode: str, id_column: str, distance_thres: float = 99999.0, calibration_map: Optional[pandas.core.frame.DataFrame] = None, standardize_level_names: bool = True) → Dict[str, pandas.core.frame.DataFrame]

Gets the calibration map and fills NaN values (if any) for a well using the calibration maps of nearby wells.

Parameters
  • df (pd.DataFrame) – dataframe

  • curves (List[str]) – list of curves to take into account for maps

  • location_curves (List[str]) – which curves to consider for calculating the distance between wells

  • level (str) – how to group samples in a well, typically per group or formation

  • mode (str) – any method supported in pandas dataframe for representing the curve, such as median, mean, mode, min, max, etc.

  • id_column (str) – column with well names

  • distance_thres (float, optional) – threshold for indicating a well is too far to be considered close enough. Defaults to 99999.0.

  • calibration_map (pd.DataFrame, optional) – calibration map for the level. Defaults to None.

  • standardize_level_names (bool, optional) – whether to standardize formation or group names. Defaults to True.

Returns

_description_

Return type

Dict[str, pd.DataFrame]

akerbp.mlpet.utilities.get_violation_indices(mask: pandas.core.series.Series) → pandas.core.frame.DataFrame

Helper function to retrieve the indices where a mask series is True

Parameters

mask (pd.Series) – The mask series to retrieve True indices of

Returns

A dataframe with the columns [“first”, “last”] denoting the start and end indices of each block of True values in the passed mask.

Return type

pd.DataFrame
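
Finding the start and end of each run of True values can be sketched in plain Python as below. The helper name true_blocks is hypothetical, it returns tuples rather than a dataframe, and treating the "last" index as inclusive is an assumption.

```python
from itertools import groupby


def true_blocks(mask):
    """Hypothetical helper for illustration; not mlpet's implementation."""
    blocks = []
    position = 0
    # groupby yields consecutive runs of equal values
    for value, run in groupby(mask):
        length = len(list(run))
        if value:
            # Record (first, last) positions of this run, last inclusive
            blocks.append((position, position + length - 1))
        position += length
    return blocks


blocks = true_blocks([False, True, True, False, True])
# blocks == [(1, 2), (4, 4)]
```

Each tuple corresponds to one row of the described dataframe, with the first element as "first" and the second as "last".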

akerbp.mlpet.utilities.inflection_points(df: pandas.core.frame.DataFrame, curveName: str, before: int, after: int) → Tuple[int, int]

Helper function for identifying the first inflection point in a curve before and after certain indices.

Parameters
  • df (pd.DataFrame) – The dataframe containing the specified curveName.

  • curveName (str) – The curve for which to detect inflection points.

  • before (int) – The index before which inflection points should be detected

  • after (int) – The index after which inflection points should be detected

Returns

The first inflection point in the curve before the before index and after the after index

If no inflection point is found, np.nan is returned. If no inflection point before the before index and after the after index is found, a ValueError is raised.

Return type

tuple(int, int)

akerbp.mlpet.utilities.calculate_sampling_rate(array, max_sampling_rate=1)

Calculates the sampling rate of an array by calculating the weighted average diff between the array’s values.

Parameters
  • array (pd.Series) – The array for which the sampling rate should be calculated

  • max_sampling_rate – The maximum acceptable sampling rate above which the calculated sampling rates should not be included in the weighted average calculation (defined in samples/unit length e.g. m). Defaults to max 1 sample per m (where m is the assumed unit of the provided array)
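
One possible reading of a "weighted average diff" is sketched below. The details are assumptions, not the documented method: each distinct spacing between consecutive values is weighted by how often it occurs, and spacings larger than a threshold are left out. The helper name weighted_average_step and the max_step argument are hypothetical.

```python
from collections import Counter


def weighted_average_step(depths, max_step=1.0):
    """Hypothetical helper for illustration; not mlpet's implementation."""
    # Spacing between consecutive values, rounded to suppress float noise
    diffs = [round(b - a, 6) for a, b in zip(depths, depths[1:])]
    # Assumption: spacings above the threshold are gaps, not sampling steps
    counts = Counter(diff for diff in diffs if diff <= max_step)
    total = sum(counts.values())
    if total == 0:
        return None  # no usable spacing found
    # Average of the distinct spacings, weighted by their frequency
    return sum(diff * n for diff, n in counts.items()) / total


depths = [0.0, 0.5, 1.0, 1.5, 6.5, 7.0]  # one 5 m gap in 0.5 m sampled data
step = weighted_average_step(depths, max_step=1.0)
```

The 5 m gap is excluded by the threshold, so step reflects only the regular 0.5 m spacing.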