dataset module
- class akerbp.mlpet.dataset.Dataset(mappings: Union[str, Dict[str, str]], settings: Union[str, Dict[str, Any]], folder_path: Union[str, pathlib.Path])
Bases: akerbp.mlpet.dataloader.DataLoader
The main class representing a dataset
Note
All settings on the first level of the settings dictionary/YAML passed to the class instance are set as class attributes
Warning
ALL file paths (regardless of whether they are passed directly to the class at instantiation or specified in the settings.yaml file) MUST be given in absolute form!
Note: The id_column is always considered a categorical variable!
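The two notes above can be pictured with a small stand-in class. This is only a sketch of the described behaviour, not the library's implementation: `SettingsDemo` is a hypothetical name, and the way the id column is merged into `categorical_curves` is an assumption made for illustration.

```python
from typing import Any, Dict


class SettingsDemo:
    """Hypothetical stand-in; NOT the real akerbp.mlpet Dataset class."""

    def __init__(self, settings: Dict[str, Any]) -> None:
        # Every first-level key of the settings becomes an attribute.
        for key, value in settings.items():
            setattr(self, key, value)
        # Assumption: the id column is always added to the categorical curves.
        curves = list(getattr(self, "categorical_curves", []))
        if self.id_column not in curves:
            curves.append(self.id_column)
        self.categorical_curves = curves


demo = SettingsDemo({"id_column": "well_name", "num_filler": 0})
print(demo.num_filler, demo.categorical_curves)
```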
- Parameters
mappings – dict or path to a yaml file. If a path is provided it must be provided as an absolute path
settings –
dict or path to a yaml file. If a path is provided, it must be an absolute path. The possible keys for the settings:
- id_column (required): name of the id column, e.g. well_name
- depth_column (optional): name of the measured depth column, e.g. “DEPTH”
- label_column (optional): name of the column containing the labels
- num_filler (optional - default 0): filler value for numerical curves (either a value already present in the data or the value you wish to use for replacing missing values)
- cat_filler (optional - default ‘MISSING’): filler value for categorical curves (either a value already present in the data or the value you wish to use for replacing missing values)
- categorical_curves (optional - default [id_column]): the curves to be considered categorical when identifying which columns are numerical (this setting is used in several places throughout the library, so it can be useful to define it in advance)
- keep_columns (optional - default []): columns passed in your dataframe that are not part of the preprocessing_pipeline you define but should still be part of the preprocessed dataframe
- preprocessing_pipeline (optional - default None): the list of preprocessing functions to be run when the class’s preprocess function is called. If this is not provided, the pipeline MUST be provided in the preprocess call. Each key in the preprocessing_pipeline can have the relevant kwargs for that particular preprocessor as its value. All passed kwargs are parsed and saved to the class instance where relevant, for use as defaults in the preprocessing functions
folder_path – The path to where preprocessing artifacts are stored/shall be saved to. Similar to the other two arguments this path must be provided as an absolute path.
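As a rough illustration of the parameters above, the following sketch builds a settings dictionary from the documented keys and an absolute folder path. All values (column names, the `mlpet_artifacts` folder) are assumptions chosen for the example, and `build_dataset` is a hypothetical helper that only wraps the documented constructor signature; the expected structure of `mappings` is not described here, so it is left to the caller.

```python
from pathlib import Path
from typing import Any, Dict

# Keys are the documented settings keys; every value is an illustrative assumption.
settings: Dict[str, Any] = {
    "id_column": "well_name",
    "depth_column": "DEPTH",
    "label_column": "LABEL",  # hypothetical column name
    "num_filler": 0,
    "cat_filler": "MISSING",
    "categorical_curves": ["well_name"],
    "keep_columns": [],
}

# Build an absolute path for a (hypothetical) artifact folder up front.
folder_path = Path.cwd() / "mlpet_artifacts"


def build_dataset(mappings: Dict[str, str]):
    """Instantiate the documented class; requires akerbp.mlpet to be installed."""
    from akerbp.mlpet.dataset import Dataset  # imported lazily on purpose

    return Dataset(mappings=mappings, settings=settings, folder_path=folder_path)
```

Building the path from `Path.cwd()` keeps it absolute regardless of where the script is launched, which matches the requirement stated in the warning.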
- settings: Dict[str, Any]
- settings_path: str
- all_curves: Set[str]
- id_column: str
- label_column: str
- num_filler: float
- cat_filler: str
- mappings: Dict[str, Any]
- petrophysical_features: List[str]
- keep_columns: List[str]
- preprocessing_pipeline: Dict[str, Dict[str, Any]]
- categorical_curves: List[str]
- preprocess(df: Optional[pandas.core.frame.DataFrame] = None, verbose=True, **kwargs) → pandas.core.frame.DataFrame
Main preprocessing function. Pass the dataframe to be preprocessed along with any kwargs for running any desired order (within reason) of the various supported preprocessing functions.
To see which functions are supported for preprocessing you can access the class attribute ‘supported_preprocessing_functions’.
To see the default settings for all the supported preprocessing functions, run the class ‘get_preprocess_defaults’ method without any arguments.
To see what kwargs are being used for the default workflow, run the class ‘get_preprocess_defaults’ with the class attribute ‘default_preprocessing_workflow’ as the main arg.
Warning
The preprocess function runs through the provided kwargs in the order given by the kwargs dictionary. In Python 3.7+, dictionaries are insertion ordered, and it is this implementation detail that the function builds upon. As such, either use Python 3.7 or later, or pass an OrderedDict instance as your kwargs, to have complete control over the order in which the preprocessing functions are run!
- Parameters
df (pd.DataFrame, optional) – dataframe to which preprocessing is applied. If none is provided, the class’ original df is used, if it exists.
verbose (bool, optional) – Whether to display some logs on the progression of the preprocessing pipeline being run. Defaults to True.
- Keyword Arguments
See above in the docstring on all potential kwargs and their relevant structures.
- Returns
preprocessed dataframe
- Return type
pd.DataFrame
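The ordering warning for preprocess can be illustrated without the library. The toy function below is a sketch, not the real method: `run_in_kwarg_order` and the step names are hypothetical, and it only records the order in which keyword arguments arrive, which is the property preprocess is described as relying on.

```python
from collections import OrderedDict
from typing import Any, Dict, List


def run_in_kwarg_order(**kwargs: Dict[str, Any]) -> List[str]:
    """Toy stand-in: record step names in the order the kwargs arrive."""
    executed = []
    for step_name, step_kwargs in kwargs.items():
        # A real pipeline would call the matching preprocessing function here.
        executed.append(step_name)
    return executed


# Hypothetical step names; a plain dict keeps insertion order on Python 3.7+.
order = run_in_kwarg_order(**{"first_step": {}, "second_step": {"option": 1}})

# An explicitly ordered mapping expresses the same intent.
explicit = OrderedDict([("second_step", {"option": 1}), ("first_step", {})])
reordered = run_in_kwarg_order(**explicit)

print(order)
print(reordered)
```

Swapping the insertion order of the two entries swaps the order in which the steps are visited, so the order in which the pipeline dictionary is written is effectively part of its configuration.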
- get_preprocess_defaults(kwargs: Optional[Dict[str, Dict[str, Any]]] = None) → Dict[str, Any]
Wrapper function to define and provide the default kwargs to use for preprocessing. This function allows the user to tweak only certain function kwargs rather than having to define a setting for every single kwarg of every function. If a kwargs dictionary is passed to the function, only the defaults for the function names found in that dictionary will be returned. In other words, to generate a full default kwargs example, run this method without any arguments.
- Parameters
kwargs (Dict[str, Dict[str, Any]], optional) – Any user-defined kwargs that should override the defaults. Defaults to None.
- Returns
A populated kwargs dictionary to be passed to all supported methods in preprocessing.
- Return type
Dict[str, Any]
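The override behaviour described for get_preprocess_defaults can be sketched as a plain dictionary merge. This is an assumption-laden illustration rather than the library's code: `DEFAULTS`, its step names and options, and `get_defaults_sketch` are all hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Hypothetical defaults; the real function names and options are not listed here.
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "fill_missing": {"value": 0, "strategy": "constant"},
    "scale": {"method": "standard"},
}


def get_defaults_sketch(
    kwargs: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Return all defaults, or only those for the functions named in kwargs."""
    if not kwargs:
        # No arguments: return a full copy of every default.
        return deepcopy(DEFAULTS)
    merged: Dict[str, Any] = {}
    for name, overrides in kwargs.items():
        # User-supplied values win over the defaults for the same key.
        merged[name] = {**DEFAULTS.get(name, {}), **(overrides or {})}
    return merged


full = get_defaults_sketch()
partial = get_defaults_sketch({"fill_missing": {"value": -1}})
print(partial)
```

Only the functions named in the argument appear in the result, with user values layered over the defaults, which mirrors the description above.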