5. Ensemble of samplers

5.1. Samplers

An imbalanced data set can be balanced by creating several balanced subsets. The module imblearn.ensemble provides samplers to create such sets.

EasyEnsemble creates an ensemble of data sets by randomly under-sampling the original set:

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
...                            n_redundant=0, n_repeated=0, n_classes=3,
...                            n_clusters_per_class=1,
...                            weights=[0.01, 0.05, 0.94],
...                            class_sep=0.8, random_state=0)
>>> print(sorted(Counter(y).items()))
[(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.ensemble import EasyEnsemble
>>> ee = EasyEnsemble(random_state=0, n_subsets=10)
>>> X_resampled, y_resampled = ee.fit_sample(X, y)
>>> print(X_resampled.shape)
(10, 192, 2)
>>> print(sorted(Counter(y_resampled[0]).items()))
[(0, 64), (1, 64), (2, 64)]

EasyEnsemble has two important parameters: (i) n_subsets controls the number of subsets to return and (ii) replacement decides whether to sample randomly with or without replacement.
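
For instance, sampling with replacement can be enabled as follows. This is a minimal sketch reusing the X and y defined above; with replacement, the same majority-class sample may be drawn several times within a subset:

>>> ee_with_replacement = EasyEnsemble(random_state=0, n_subsets=10,
...                                    replacement=True)
>>> X_res, y_res = ee_with_replacement.fit_sample(X, y)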

BalanceCascade differs from the previous method by using a classifier (set with the parameter estimator) to ensure that misclassified samples can be selected again for the next subset. In fact, the classifier plays the role of a “smart” replacement method. The maximum number of subsets can be set with the parameter n_max_subset, and an additional bootstrapping step can be activated by setting bootstrap to True:

>>> from imblearn.ensemble import BalanceCascade
>>> from sklearn.linear_model import LogisticRegression
>>> bc = BalanceCascade(random_state=0,
...                     estimator=LogisticRegression(random_state=0),
...                     n_max_subset=4)
>>> X_resampled, y_resampled = bc.fit_sample(X, y)
>>> print(X_resampled.shape)
(4, 192, 2)
>>> print(sorted(Counter(y_resampled[0]).items()))
[(0, 64), (1, 64), (2, 64)]
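
The bootstrapping option mentioned above is not shown in this example; as a minimal sketch, it can be activated in the same way (the name bc_bootstrap is illustrative):

>>> bc_bootstrap = BalanceCascade(random_state=0,
...                               estimator=LogisticRegression(random_state=0),
...                               n_max_subset=4,
...                               bootstrap=True)
>>> X_resampled, y_resampled = bc_bootstrap.fit_sample(X, y)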

See Easy ensemble and Balance cascade.

5.2. Chaining ensemble of samplers and estimators

In ensemble classifiers, bagging methods build several estimators on different randomly selected subsets of data. In scikit-learn, this classifier is named BaggingClassifier. However, this classifier does not allow balancing each subset of data. Therefore, when trained on an imbalanced data set, this classifier will favor the majority classes:

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> bc = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
...                        random_state=0)
>>> bc.fit(X_train, y_train) 
BaggingClassifier(...)
>>> y_pred = bc.predict(X_test)
>>> confusion_matrix(y_test, y_pred)
array([[   9,    1,    2],
       [   0,   54,    5],
       [   1,    6, 1172]])

BalancedBaggingClassifier allows resampling each subset of data before training each estimator of the ensemble. In short, it combines the output of an EasyEnsemble sampler with an ensemble of classifiers (i.e., BaggingClassifier). Therefore, BalancedBaggingClassifier takes the same parameters as the scikit-learn BaggingClassifier. In addition, it has two extra parameters, ratio and replacement, as in the EasyEnsemble sampler:

>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
...                                 ratio='auto',
...                                 replacement=False,
...                                 random_state=0)
>>> bbc.fit(X_train, y_train) 
BalancedBaggingClassifier(...)
>>> y_pred = bbc.predict(X_test)
>>> confusion_matrix(y_test, y_pred)
array([[   9,    1,    2],
       [   0,   55,    4],
       [  42,   46, 1091]])
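
To make this combination concrete, the following sketch mimics the behaviour by hand: it fits one decision tree per EasyEnsemble subset and aggregates the predictions by majority vote. This is an illustration of the idea only, not the classifier's actual implementation:

>>> import numpy as np
>>> # create balanced subsets and fit one tree per subset
>>> X_subsets, y_subsets = EasyEnsemble(random_state=0,
...                                     n_subsets=10).fit_sample(X_train, y_train)
>>> trees = [DecisionTreeClassifier(random_state=0).fit(X_s, y_s)
...          for X_s, y_s in zip(X_subsets, y_subsets)]
>>> votes = np.array([tree.predict(X_test) for tree in trees])
>>> # majority vote across the ensemble for each test sample
>>> y_pred_manual = np.apply_along_axis(
...     lambda v: np.bincount(v).argmax(), axis=0, arr=votes)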

It is also possible to turn a balanced bagging classifier into a balanced random forest by using a decision tree classifier and setting the parameter max_features='auto'. This randomly selects a subset of features to consider at each split of each tree:

>>> brf = BalancedBaggingClassifier(
...          base_estimator=DecisionTreeClassifier(max_features='auto'),
...          random_state=0)
>>> brf.fit(X_train, y_train) 
BalancedBaggingClassifier(...)
>>> y_pred = brf.predict(X_test)
>>> confusion_matrix(y_test, y_pred)
array([[   9,    1,    2],
       [   0,   54,    5],
       [  31,   34, 1114]])
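
Note that the example above keeps the default number of estimators, while a random forest usually aggregates many more trees. The n_estimators parameter inherited from BaggingClassifier controls the ensemble size (100 below is an illustrative value):

>>> brf_large = BalancedBaggingClassifier(
...          base_estimator=DecisionTreeClassifier(max_features='auto'),
...          n_estimators=100,
...          random_state=0)
>>> brf_large.fit(X_train, y_train)
BalancedBaggingClassifier(...)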

See Comparison of balanced and imbalanced bagging classifiers.