Working with the Models Subpackage
The models subpackage provides a comprehensive suite of tools for creating and managing machine learning models within the MED3pa package.
Using the ModelFactory Class
The ModelFactory class within the models subpackage offers a streamlined approach to creating machine learning models, either from predefined configurations or from serialized states. Here’s how to use this functionality effectively:
Step 1: Importing Necessary Modules
Start by importing the required classes and utilities for model management:
from pprint import pprint
from MED3pa.models import factories
Step 2: Creating an Instance of ModelFactory
Instantiate the ModelFactory, which serves as your gateway to generating various model instances:
factory = factories.ModelFactory()
Step 3: Discovering Supported Models
Before creating a model, check which models are currently supported by the factory:
print("Supported models:", factory.get_supported_models())
Output:
Supported models: ['XGBoostModel']
With this knowledge, we can proceed to create a model with specific hyperparameters.
Step 4: Specifying and Creating a Model
Define hyperparameters for an XGBoost model and use these to instantiate a model:
xgb_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'eta': 0.1,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 1,
    'nthread': 4,
    'tree_method': 'hist',
    'device': 'cpu'
}
xgb_model = factory.create_model_with_hyperparams('XGBoostModel', xgb_params)
Now, let’s inspect the model’s configuration:
pprint(xgb_model.get_info())
Output:
{'data_preparation_strategy': 'ToDmatrixStrategy',
'model': 'XGBoostModel',
'model_type': 'Booster',
'params': {'colsample_bytree': 0.8,
'device': 'cpu',
'eta': 0.1,
'eval_metric': 'auc',
'max_depth': 6,
'min_child_weight': 1,
'nthread': 4,
'objective': 'binary:logistic',
'subsample': 0.8,
'tree_method': 'hist'},
'pickled_model': False}
This gives us general information about the model, such as its data_preparation_strategy, which indicates that the input data for training, prediction, and evaluation will be converted to a DMatrix to suit the underlying xgb.Booster model. It also shows the model’s parameters, the underlying model class (Booster in this case), and the wrapper class (XGBoostModel in this case). Finally, it indicates whether the model was created from a pickled file.
Step 5: Loading a Model from a Serialized State
For pre-trained models, use the create_model_from_pickled method to load a model from its serialized (pickled) state. You only need to specify the path to the pickled file; the method examines the file and extracts all necessary information.
xgb_model_pkl = factory.create_model_from_pickled('path_to_model.pkl')
pprint(xgb_model_pkl.get_info())
Output:
{'data_preparation_strategy': 'ToDmatrixStrategy',
'model': 'XGBoostModel',
'model_type': 'Booster',
'params': {'alpha': 0,
'base_score': 0.5,
'boost_from_average': 1,
'booster': 'gbtree',
'cache_opt': 1,
...
'updater': 'grow_quantile_histmaker',
'updater_seq': 'grow_quantile_histmaker',
'validate_parameters': 0},
'pickled_model': True}
Using the Model Class
In this section, we will learn how to train, predict with, and evaluate a machine learning model, using the model created in the previous section.
Step 1: Training the Model
Generate Training and Validation Data:
Prepare the data for training and validation. The following example generates synthetic data for demonstration purposes:
import numpy as np

np.random.seed(0)
X_train = np.random.randn(1000, 10)
y_train = np.random.randint(0, 2, 1000)
X_val = np.random.randn(1000, 10)
y_val = np.random.randint(0, 2, 1000)
Training the Model:
When training a model, you can specify additional training_parameters. If they are not specified, the model will use the initialization parameters. You can also specify whether to balance the training classes.
training_params = {
    'eval_metric': 'logloss',
    'eta': 0.1,
    'max_depth': 6
}
xgb_model.train(X_train, y_train, X_val, y_val, training_params, balance_train_classes=True)
This process optimizes the model based on the specified hyperparameters and validation data to prevent overfitting.
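MED3pa handles class balancing internally when balance_train_classes=True; its exact mechanism is not shown here, but a common XGBoost-style approach is to weight the positive class by the negative-to-positive ratio. A minimal sketch of that idea (the variable names are illustrative, not part of MED3pa’s API):

```python
import numpy as np

# Illustrative labels: three negatives for every positive
y = np.array([0, 0, 0, 1, 0, 0, 0, 1])

# Common XGBoost convention: weight positives by n_negative / n_positive
n_pos = int((y == 1).sum())
n_neg = int((y == 0).sum())
scale_pos_weight = n_neg / n_pos

print(scale_pos_weight)  # 3.0
```

Passing such a weight (e.g. XGBoost’s scale_pos_weight parameter) makes the loss treat each positive example as heavily as several negatives, compensating for the imbalance.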
Step 2: Predicting Using the Trained Model
Model Prediction:
Once the model is trained, use it to predict labels or probabilities on a new dataset. This step demonstrates predicting binary labels for the test data. The return_proba parameter specifies whether to return the predicted probabilities or the predicted labels; the labels are derived from the probabilities using the threshold.
X_test = np.random.randn(1000, 10)
y_test = np.random.randint(0, 2, 1000)
y_pred = xgb_model.predict(X_test, return_proba=False, threshold=0.5)
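To see how the threshold turns probabilities into labels, here is a sketch of the usual convention (probabilities at or above the threshold map to 1); MED3pa’s exact boundary handling may differ, and the probability values below are purely illustrative:

```python
import numpy as np

# Hypothetical probabilities, as predict(..., return_proba=True) would return
probs = np.array([0.20, 0.55, 0.49, 0.90])

# Usual thresholding convention: probability >= threshold -> label 1
threshold = 0.5
labels = (probs >= threshold).astype(int)

print(labels.tolist())  # [0, 1, 0, 1]
```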
Step 3: Evaluating the Model
Evaluate the model’s performance using various metrics to understand its effectiveness in making predictions. The supported metrics include Accuracy, AUC, Precision, Recall, and F1 Score, among others. The evaluate method handles the model predictions and then evaluates the model based on them; you only need to supply the test data.
To retrieve the list of supported classification metrics, use ClassificationEvaluationMetrics.supported_metrics():
from MED3pa.models import ClassificationEvaluationMetrics
# Display supported metrics
print("Supported evaluation metrics:", ClassificationEvaluationMetrics.supported_metrics())
# Evaluate the model
evaluation_results = xgb_model.evaluate(X_test, y_test, eval_metrics=['Auc', 'Accuracy'], print_results=True)
Output:
Supported evaluation metrics: ['Accuracy', 'BalancedAccuracy', 'Precision', 'Recall', 'F1Score', 'Specificity', 'Sensitivity', 'Auc', 'LogLoss', 'Auprc', 'NPV', 'PPV', 'MCC']
Evaluation Results:
Auc: 0.51
Accuracy: 0.50
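The values returned by evaluate can be sanity-checked by hand. For example, Accuracy is simply the fraction of predictions that match the true labels; a minimal check using illustrative arrays (not the model’s actual predictions):

```python
import numpy as np

# Illustrative labels and predictions (not the model's actual output)
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

# Accuracy: fraction of matching predictions
accuracy = (y_true == y_pred).mean()

print(accuracy)  # 0.75
```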
Step 4: Retrieving Model Information
The get_info method provides detailed information about the model, including its type, parameters, data preparation strategy, and whether it was loaded from a pickled file. This is useful for understanding the configuration and state of the model.
model_info = xgb_model.get_info()
pprint(model_info)
Output:
{'model': 'XGBoostModel',
'model_type': 'Booster',
'params': {'objective': 'binary:logistic',
'eval_metric': 'auc',
'eta': 0.1,
'max_depth': 6,
'subsample': 0.8,
'colsample_bytree': 0.8,
'min_child_weight': 1,
'nthread': 4,
'tree_method': 'hist',
'device': 'cpu'},
'data_preparation_strategy': 'ToDmatrixStrategy',
'pickled_model': False}
Step 5: Saving the Model
You can save the model using the save method, which writes the underlying model instance to a pickled file and the model’s information to a .json file; a model saved this way can later be reloaded with create_model_from_pickled:
xgb_model.save("./models/saved_model")