utilities module#
- akerbp.mlpet.utilities.drop_rows_wo_label(df: pandas.core.frame.DataFrame, label_column: str, **kwargs) pandas.core.frame.DataFrame [source]#
Removes rows with missing targets.
Note: since the imputation is done via pd.DataFrame.fillna(), what we need is the constant filler_value. If the imputation is ever done using one of the sklearn.impute methods or a similar API, we can use the indicator column (add_indicator=True) instead.
- Parameters
df (pd.DataFrame) – dataframe to process
label_column (str) – Name of the label column containing rows without labels
- Keyword Arguments
missing_label_value (str, optional) – If nans are denoted differently than np.nans, a missing_label_value can be passed as a kwarg and all rows containing this missing_label_value in the label column will be dropped
- Returns
processed dataframe
- Return type
pd.DataFrame
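The behaviour described above can be sketched with plain pandas (a minimal sketch, not the library's implementation; the `missing_label_value` handling follows the keyword description):

```python
import numpy as np
import pandas as pd

def drop_rows_wo_label(df, label_column, missing_label_value=None):
    """Drop rows whose label is NaN (or equal to missing_label_value)."""
    if missing_label_value is not None:
        df = df[df[label_column] != missing_label_value]
    return df.dropna(subset=[label_column])

df = pd.DataFrame({"x": [1, 2, 3], "y": [1.0, np.nan, 3.0]})
clean = drop_rows_wo_label(df, "y")
```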
- akerbp.mlpet.utilities.readPickle(path)[source]#
A cached helper function for loading pickle files. Loading pickle files multiple times can really slow down execution.
- Parameters
path (str) – Path to the pickled object to be loaded
- Returns
Return the loaded pickled data
- Return type
data
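A cached loader along these lines can be written with `functools.lru_cache` (a sketch of the idea, not the library's code; the example data is made up):

```python
import functools
import os
import pickle
import tempfile

@functools.lru_cache(maxsize=None)
def read_pickle(path):
    """Load a pickle file once and cache the result for repeated calls."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Demonstrate the cache: the second call returns the cached object,
# skipping the disk read entirely.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    pickle.dump({"GR": [45.2, 50.1]}, tmp)
first = read_pickle(tmp.name)
second = read_pickle(tmp.name)
os.unlink(tmp.name)
```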
- akerbp.mlpet.utilities.map_formation_and_group(form_or_group: pandas.core.series.Series, MissingValue: Union[float, str] = nan) Tuple[Union[float, str], Union[float, str]] [source]#
A helper function for retrieving the formation and group of a standardised formation/group based on mlpet’s NPD pickle mapper.
- Parameters
form_or_group (pd.Series) – A pandas series containing AkerBP legal formation/group names to be mapped
MissingValue (Any) – If no mapping is found, return this missing value
- Returns
- Returns a formation and group series respectively corresponding
to the input string series
- Return type
tuple(pd.Series)
- akerbp.mlpet.utilities.standardize_group_formation_name(name: Union[str, Any]) Union[str, Any] [source]#
Performs several string operations to standardize group formation names for later categorisation.
- Parameters
name (str) – A group formation name
- Returns
- Returns the standardized group formation name or np.nan
if the name == “NAN”.
- Return type
float or str
- akerbp.mlpet.utilities.standardize_names(names: List[str], mapper: Dict[str, str]) Tuple[List[str], Dict[str, str]] [source]#
Standardize curve names in a list based on the curve_mappings dictionary. Any columns not in the dictionary are ignored.
- Parameters
names (list) – list of curve names
mapper (dictionary) – dictionary with mappings. Defaults to curve_mappings.
- Returns
list of strings with standardized curve names
- Return type
list
- akerbp.mlpet.utilities.standardize_curve_names(df: pandas.core.frame.DataFrame, mapper: Dict[str, str]) pandas.core.frame.DataFrame [source]#
Standardize curve names in a dataframe based on the curve_mappings dictionary. Any columns not in the dictionary are ignored.
- Parameters
df (pd.DataFrame) – dataframe whose column names should be standardized
mapper (dictionary) – dictionary with mappings. Defaults to curve_mappings. The keys should be the old curve names and the values the desired curve names.
- Returns
dataframe with standardized column names
- Return type
pd.DataFrame
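The renaming can be sketched with `pd.DataFrame.rename` (a minimal sketch, not the library's implementation; the mapper entries are hypothetical):

```python
import pandas as pd

def standardize_curve_names(df, mapper):
    """Rename columns using mapper; columns not in mapper are left unchanged."""
    return df.rename(columns=mapper)

df = pd.DataFrame({"GRD": [45.0], "ACS": [120.0], "DEPTH": [1000.0]})
mapper = {"GRD": "GR", "ACS": "DTS"}  # hypothetical curve mappings
out = standardize_curve_names(df, mapper)
```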
- akerbp.mlpet.utilities.get_col_types(df: pandas.core.frame.DataFrame, categorical_curves: Optional[List[str]] = None, warn: bool = True) Tuple[List[str], List[str]] [source]#
Returns lists of numerical and categorical columns
- Parameters
df (pd.DataFrame) – dataframe with columns to classify
categorical_curves (list) – List of column names that should be considered as categorical. Defaults to an empty list.
warn (bool) – Whether to warn the user if categorical curves were detected which were not in the provided categorical curves list.
- Returns
lists of numerical and categorical columns
- Return type
tuple
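A dtype-based classification along these lines could look as follows (a sketch under the stated assumptions, not the library's exact logic; the `warn` behaviour is omitted):

```python
import pandas as pd

def get_col_types(df, categorical_curves=None):
    """Split columns into numerical and categorical lists based on dtype,
    forcing any explicitly listed curves into the categorical group."""
    categorical_curves = set(categorical_curves or [])
    numerical, categorical = [], []
    for col in df.columns:
        if col in categorical_curves or not pd.api.types.is_numeric_dtype(df[col]):
            categorical.append(col)
        else:
            numerical.append(col)
    return numerical, categorical

df = pd.DataFrame({"GR": [45.0, 50.0], "FORMATION": ["Balder", "Sele"], "FLAG": [0, 1]})
num, cat = get_col_types(df, categorical_curves=["FLAG"])
```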
- akerbp.mlpet.utilities.wells_split_train_test(df: pandas.core.frame.DataFrame, id_column: str, test_size: float, **kwargs) Tuple[List[str], List[str], List[str]] [source]#
Splits wells into two groups (train and val/test)
- NOTE: Set operations are used to perform the splits so ordering is not
preserved! The well IDs will be randomly ordered.
- Parameters
df (pd.DataFrame) – dataframe with data of wells and well ID
id_column (str) – The name of the column containing well names which will be used to perform the split.
test_size (float) – percentage (0-1) of wells to be in val/test data
- Returns
wells (list) – all well IDs, test_wells (list) – well IDs of the val/test data, training_wells (list) – well IDs of the training data
- Return type
tuple(list, list, list)
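A set-based split like the one described (ordering not preserved) can be sketched as follows; this is a minimal sketch, not the library's implementation, and the `seed` parameter is added here only to make the example reproducible:

```python
import random
import pandas as pd

def wells_split_train_test(df, id_column, test_size, seed=None):
    """Split unique well IDs into training and val/test groups.
    Set operations are used, so ordering is not preserved."""
    wells = set(df[id_column])
    rng = random.Random(seed)
    n_test = max(1, round(len(wells) * test_size))
    test_wells = set(rng.sample(sorted(wells), n_test))
    training_wells = wells - test_wells
    return list(wells), list(test_wells), list(training_wells)

df = pd.DataFrame({"well": ["A", "A", "B", "C", "D", "E"], "GR": range(6)})
wells, test_wells, train_wells = wells_split_train_test(df, "well", 0.2, seed=0)
```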
- akerbp.mlpet.utilities.df_split_train_test(df: pandas.core.frame.DataFrame, id_column: str, test_size: float = 0.2, test_wells: Optional[List[str]] = None, **kwargs) Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, List[str]] [source]#
Splits dataframe into two groups: train and val/test set.
- Parameters
df (pd.Dataframe) – dataframe to split
id_column (str) – The name of the column containing well names which will be used to perform the split.
test_size (float, optional) – size of val/test data. Defaults to 0.2.
test_wells (list, optional) – list of wells to be in val/test data. Defaults to None.
- Returns
dataframes for train and test sets, and list of test well IDs
- Return type
tuple
- akerbp.mlpet.utilities.train_test_split(df: pandas.core.frame.DataFrame, target_column: str, id_column: str, **kwargs) Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame] [source]#
Splits a dataset into training and val/test sets by well (i.e. for an 80-20 split, the provided dataset would need data from at least 5 wells).
This function makes use of several other utility functions. The workflow it executes is:
- Drops rows without labels
- Splits into train and test sets using df_split_train_test, which in
turn performs the split via wells_split_train_test
- Parameters
df (pd.DataFrame, optional) – dataframe with data
target_column (str) – Name of the target column (y)
id_column (str) – Name of the wells ID column. This is used to perform the split based on well ID.
- Keyword Arguments
test_size (float, optional) – size of val/test data. Defaults to 0.2.
test_wells (list, optional) – list of wells to be in val/test data. Defaults to None.
missing_label_value (str, optional) – If nans are denoted differently than np.nans, a missing_label_value can be passed as a kwarg and all rows containing this missing_label_value in the label column will be dropped
- Returns
dataframes for train and test sets, and list of test wells IDs
- Return type
tuple
- akerbp.mlpet.utilities.feature_target_split(df: pandas.core.frame.DataFrame, target_column: str) Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame] [source]#
Splits a dataset into features and target.
- Parameters
df (pd.DataFrame) – dataframe to be split
target_column (str) – target column name
- Returns
input (features) and output (target) dataframes
- Return type
tuple
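The split can be sketched in a couple of lines of pandas (a minimal sketch, not the library's implementation; the example columns are hypothetical):

```python
import pandas as pd

def feature_target_split(df, target_column):
    """Split a dataframe into a feature frame X and a target frame y."""
    X = df.drop(columns=[target_column])
    y = df[[target_column]]
    return X, y

df = pd.DataFrame({"GR": [45.0, 50.0], "DEN": [2.1, 2.3], "FORMATION": ["Balder", "Sele"]})
X, y = feature_target_split(df, "FORMATION")
```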
- akerbp.mlpet.utilities.normalize(col: pandas.core.series.Series, ref_min: numpy.float64, ref_max: numpy.float64, col_min: float, col_max: float) pandas.core.series.Series [source]#
Helper function that applies min-max normalization on a pandas series and rescales it according to a reference range according to the following formula:
ref_min + ((col - col_min) * (ref_max - ref_min) / (col_max - col_min))
- Parameters
col (pd.Series) – column from dataframe to normalize (series)
ref_min (float) – min value of the column of the reference well
ref_max (float) – max value of the column of the reference well
col_min (float) – min value of the column of the well to normalize
col_max (float) – max value of the column of the well to normalize
- Returns
normalized series
- Return type
pd.Series
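The formula above translates directly into code (a sketch of the documented formula; the example values are made up):

```python
import pandas as pd

def normalize(col, ref_min, ref_max, col_min, col_max):
    """Min-max normalize col and rescale it into the reference range."""
    return ref_min + (col - col_min) * (ref_max - ref_min) / (col_max - col_min)

# Rescale a 0-10 column into a 100-200 reference range
col = pd.Series([0.0, 5.0, 10.0])
scaled = normalize(col, ref_min=100.0, ref_max=200.0, col_min=0.0, col_max=10.0)
```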
- akerbp.mlpet.utilities.get_well_metadata(client: cognite.client._cognite_client.CogniteClient, well_names: List[str]) Dict[str, Dict[str, Any]] [source]#
Retrieve relevant well metadata for the provided well_names
Warning
If a well is not found in the asset database, it is omitted from the returned dictionary. Instead, a warning with the corresponding well name is printed to the console.
Metadata retrieved:
COMPLETION_DATE
COORD_SYSTEM_NAME
KB_ELEV
KB_ELEV_OUOM
PUBLIC
SPUD_DATE
WATER_DEPTH
CDF_wellName
WATER_DEPTH_DSDSUNIT
X_COORDINATE
Y_COORDINATE
DATUM_ELEVATION
DATUM_ELEVATION_UNIT
LATITUDE
LONGITUDE
- Parameters
client (CogniteClient) – A connected cognite client instance
well_names (List) – The list of well names to retrieve metadata for
- Returns
- Returns a dictionary where the keys are the well names and the
values are dictionaries with metadata keys and values.
- Return type
dict
Example
Example return dictionary:
{
    '25/10-10': {
        'COMPLETION_DATE': '2010-04-02T00:00:00',
        'COORD_SYSTEM_NAME': 'ED50 / UTM zone 31N',
        'DATUM_ELEVATION': '0.0',
        ...
    },
    '25/10-12 ST2': {
        'COMPLETION_DATE': '2015-01-18T00:00:00',
        'COORD_SYSTEM_NAME': 'ED50 / UTM zone 31N',
        'DATUM_ELEVATION': nan,
        ...
    },
}
- akerbp.mlpet.utilities.get_formation_tops(well_names: str, client: cognite.client._cognite_client.CogniteClient, **kwargs) Dict[str, Dict[str, Any]] [source]#
Retrieves formation tops metadata for a provided list of well names (IDs) from CDF and returns them in a dictionary of depth levels and labels per well.
- Parameters
well_names (str) – A list of well names (IDs)
client (CogniteClient) – A connected instance of the Cognite Client.
- Keyword Arguments
undefined_name (str) – Name for undefined formation/group tops. Defaults to ‘UNKNOWN’
- NOTE: A formation is skipped if it is only 1 m thick.
NPD does not provide technical sidetracks, so the formation tops information provided by NPD is missing T-labels.
- Returns
- Returns a dictionary of formation tops metadata per well in this format:
formation_tops_mapper = {
    "31/6-6": {
        "group_labels": ['Nordland Group', 'Hordaland Group', ...],
        "group_labels_chronostrat": ['Cenozoic', 'Paleogene', ...],
        "group_levels": [336.0, 531.0, 650.0, ...],
        "formation_labels": ['Balder Formation', 'Sele Formation', ...],
        "formation_labels_chronostrat": ['Eocene', 'Paleocene', ...],
        "formation_levels": [650.0, 798.0, 949.0, ...]
    },
    ...
}
- Return type
Dict
- NOTE: The length of each levels entry equals the length of the corresponding labels entry + 1,
such that label i lies between entries i and i+1 of the corresponding levels entry.
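Given that invariant, the group covering a particular depth can be looked up with `bisect` (a usage sketch against a hand-made mapping in the documented format, not real CDF data):

```python
import bisect

# Hand-made example in the documented format (not real CDF data)
formation_tops = {
    "31/6-6": {
        "group_labels": ["Nordland Group", "Hordaland Group"],
        "group_levels": [336.0, 531.0, 650.0],
    }
}

def group_at_depth(tops, well, depth):
    """Return the group label covering a depth, using the invariant that
    len(levels) == len(labels) + 1 and label i spans levels[i]..levels[i+1]."""
    levels = tops[well]["group_levels"]
    labels = tops[well]["group_labels"]
    if not levels[0] <= depth < levels[-1]:
        return None  # depth outside the covered interval
    return labels[bisect.bisect_right(levels, depth) - 1]

label = group_at_depth(formation_tops, "31/6-6", 600.0)
```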
- akerbp.mlpet.utilities.get_vertical_depths(well_names: List[str], client: cognite.client._cognite_client.CogniteClient) Dict[str, Dict[str, List[float]]] [source]#
Makes trajectory queries to CDF for all provided wells and extracts vertical and measured depths. These depths are used further down the pipeline to interpolate the vertical depths along the entire wellbore.
- Parameters
well_names (List[str]) – list of well names
client (CogniteClient) – cognite client
- Returns
Dictionary containing vertical- and measured depths (values) for each well (keys), list of wells with empty trajectory query to CDF
- Return type
Dict[str, Dict[str, List[float]]]
- akerbp.mlpet.utilities.get_calibration_map(df: pandas.core.frame.DataFrame, curves: List[str], location_curves: List[str], mode: str, id_column: str, levels: Optional[List[str]] = None, standardize_level_names: bool = True) Dict[str, pandas.core.frame.DataFrame] [source]#
Returns calibration maps for each level (typically formation and group), per well. Calibration maps are pandas dataframes with the well name and unique values for each curve and location, where the value is the chosen mode (e.g. mean, median, mode) specified by the user. Useful for the functions preprocessors.apply_calibration() and imputers.fillna_callibration_values().
- Parameters
df (pd.DataFrame) – dataframe with wells data
curves (List[str]) – list of curves to fetch unique values for
location_curves (List[str]) – list of curves indicating the location of the well/formation/group, typically latitude, longitude, tvdbml and depth
mode (str) – any method supported by pandas dataframes for representing the curve, such as median, mean, mode, min, max, etc.
id_column (str) – column with well names
levels (List[str], optional) – how to group samples in a well, typically per group or formation. Defaults to ["FORMATION", "GROUP"].
standardize_level_names (bool, optional) – whether to standardize formation or group names. Defaults to True.
- Returns
dictionary with keys being level and values being the calibration map in dataframe format
- Return type
Dict[str, pd.DataFrame]
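The core aggregation can be sketched as a pandas groupby over well and level (a minimal sketch for a single level without the location curves, not the library's implementation; the column names are hypothetical):

```python
import pandas as pd

def get_calibration_map(df, curves, level, mode, id_column):
    """One aggregated value per well and level (e.g. median GR per well
    and group), returned as a calibration-map dataframe."""
    return df.groupby([id_column, level], as_index=False)[curves].agg(mode)

df = pd.DataFrame({
    "well": ["A", "A", "A", "B"],
    "GROUP": ["Nordland", "Nordland", "Hordaland", "Nordland"],
    "GR": [40.0, 60.0, 80.0, 55.0],
})
cal_map = get_calibration_map(df, ["GR"], "GROUP", "median", "well")
```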
- akerbp.mlpet.utilities.get_calibration_values(df: pandas.core.frame.DataFrame, curves: List[str], location_curves: List[str], level: str, mode: str, id_column: str, distance_thres: float = 99999.0, calibration_map: Optional[pandas.core.frame.DataFrame] = None, standardize_level_names: bool = True) Dict[str, pandas.core.frame.DataFrame] [source]#
Gets the calibration map and fills NaN values (if any) for a well using calibration maps from nearby wells.
- Parameters
df (pd.DataFrame) – dataframe
curves (List[str]) – list of curves to take into account for maps
location_curves (List[str]) – which curves to consider for calculating the distance between wells
level (str) – how to group samples in a well, typically per group or formation
mode (str) – any method supported by pandas dataframes for representing the curve, such as median, mean, mode, min, max, etc.
id_column (str) – column with well names
distance_thres (float, optional) – threshold above which a well is considered too far away to count as nearby. Defaults to 99999.0.
calibration_map (pd.DataFrame, optional) – calibration map for the level. Defaults to None.
standardize_level_names (bool, optional) – whether to standardize formation or group names. Defaults to True.
- Returns
dictionary with the calibration values per curve for each well, with missing values filled from nearby wells
- Return type
Dict[str, pd.DataFrame]
- akerbp.mlpet.utilities.get_violation_indices(mask: pandas.core.series.Series) pandas.core.frame.DataFrame [source]#
Helper function to retrieve the indices where a mask series is True
- Parameters
mask (pd.Series) – The mask series to retrieve True indices of
- Returns
- A dataframe with the columns [“first”, “last”] denoting
the start and end indices of each block of True values in the passed mask.
- Return type
pd.DataFrame
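One way to find the first/last index of each True block is via a cumulative block id over mask transitions (a sketch of the technique, not the library's implementation):

```python
import pandas as pd

def get_violation_indices(mask):
    """Return a dataframe with the first and last index of every
    contiguous block of True values in the mask."""
    # A new block starts wherever the mask value differs from the
    # previous one; cumsum of those transitions labels each block.
    change = mask.ne(mask.shift(fill_value=False))
    block_id = change.cumsum()[mask]
    grouped = mask.index.to_series()[mask].groupby(block_id)
    return pd.DataFrame({"first": grouped.min().values, "last": grouped.max().values})

mask = pd.Series([False, True, True, False, True, False])
blocks = get_violation_indices(mask)
```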
- akerbp.mlpet.utilities.inflection_points(df: pandas.core.frame.DataFrame, curveName: str, before: int, after: int) Tuple[int, int] [source]#
Helper function for identifying the first inflection point in a curve before and after certain indices.
- Parameters
df (pd.DataFrame) – The dataframe containing the specified curveName.
curveName (str) – The curve for which to detect inflection points.
before (int) – The index before which inflection points should be detected
after (int) – The index after which inflection points should be detected
- Returns
- The first inflection point in the curve before the before index and after the after index.
If no inflection point is found, np.nan is returned. If no inflection point is found either before the before index or after the after index, a ValueError is raised.
- Return type
tuple(int, int)
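The documentation does not state how inflection points are detected; one common approach is to look for sign changes in the second difference of the curve (a sketch of that technique, not the library's method):

```python
import numpy as np

def inflection_indices(values):
    """Indices where the curve's second difference changes sign,
    i.e. candidate inflection points."""
    second = np.diff(values, n=2)
    signs = np.sign(second)
    # +1 offset because np.diff(n=2) shortens the array by two,
    # centring each curvature value on the middle sample of its triple.
    return [i + 1 for i in range(len(signs) - 1)
            if signs[i] != 0 and signs[i + 1] != 0 and signs[i] != signs[i + 1]]

# A curve that bends upward, then downward: one inflection near the middle
curve = [0.0, 1.0, 4.0, 9.0, 12.0, 13.0, 13.5]
points = inflection_indices(curve)
```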
- akerbp.mlpet.utilities.calculate_sampling_rate(array, max_sampling_rate=1)[source]#
Calculates the sampling rate of an array by calculating the weighted average diff between the array’s values.
- Parameters
array (pd.Series) – The array for which the sampling rate should be calculated
max_sampling_rate – The maximum acceptable sampling rate above which the calculated sampling rates should not be included in the weighted average calculation (defined in samples/unit length, e.g. m). Defaults to max 1 sample per m (where m is the assumed unit of the provided array)
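A frequency-weighted average of the successive differences can be sketched as below. This interprets the result as the typical sample spacing, drops spacings whose implied rate (1 / spacing) exceeds max_sampling_rate, and is an assumption-laden sketch rather than the library's implementation:

```python
import numpy as np

def calculate_sampling_rate(array, max_sampling_rate=1):
    """Estimate the typical sample spacing as the frequency-weighted
    average of the successive differences, ignoring spacings whose
    implied rate (1 / spacing) exceeds max_sampling_rate."""
    diffs = np.diff(np.asarray(array, dtype=float))
    valid = diffs[(diffs > 0) & (1.0 / diffs <= max_sampling_rate)]
    spacings, counts = np.unique(valid, return_counts=True)
    return float(np.average(spacings, weights=counts))

depths = [0.0, 1.0, 2.0, 3.0, 4.5, 5.5]  # mostly 1 m spacing with one gap
rate = calculate_sampling_rate(depths)
```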