dodoml

dodoml.compute_features_impact(model, Xval, Yval, row_sample=50000, features=None)[source]

Compute the feature impact for each feature.

Feature Impact for a given column measures how much worse a model’s error score would be if we made predictions after randomly shuffling that column (while leaving other columns unchanged). This technique is sometimes called Permutation Importance.

The calculation may take time. To speed it up you can either:
  • Use sampling with the row_sample parameter (number of rows; 0 for all rows)
  • Specify the columns of interest with the features parameter (list of columns; None for all)
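
A minimal usage sketch (model is assumed to be already fitted, Xval/Yval is a held-out validation set, and the column names are illustrative):

import dodoml

impact = dodoml.compute_features_impact(model, Xval, Yval,
                                        row_sample=10000,
                                        features=['age', 'income'])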

dodoml.compute_partial_dependence(model, Xval, features=None, row_sample=10000, percentiles=(0.05, 0.95), grid_resolution=20)[source]

Compute the partial dependence of each feature on the prediction function.

For a linear model, we can look at the regression coefficients to tell whether a feature impacts the predictions positively or negatively.

For a more complex model, we use partial dependence to visualize this relationship

The calculation may take time. To speed it up you can either:
  • Use sampling with the row_sample parameter (number of rows; 0 for all rows)
  • Specify the columns of interest with the features parameter (list of columns; None for all)
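
A minimal usage sketch (model assumed fitted; column names illustrative):

import dodoml

pdp = dodoml.compute_partial_dependence(model, Xval,
                                        features=['age', 'income'],
                                        row_sample=10000,
                                        grid_resolution=20)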

dodoml.compute_ace(df, numeric_features, categoric_features, target, target_as_cat)[source]

Compute the ACE correlation (http://www.stat.cmu.edu/~ryantibs/datamining/lectures/11-cor2-marked.pdf) between each of the numeric/categoric features and the target. The target variable can be treated either as numeric or as categoric using the target_as_cat parameter.
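
A minimal usage sketch (df and the column names are illustrative):

import dodoml

ace = dodoml.compute_ace(df,
                         numeric_features=['age', 'income'],
                         categoric_features=['city'],
                         target='churned',
                         target_as_cat=True)   # treat the target as categorical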

Machine Learning

class dodoml.ml.Hyperband(model, feat_space, task, max_iter=81)[source]

Simple sklearn-compatible implementation of the Hyperband algorithm (http://people.eecs.berkeley.edu/~kjamieson/hyperband.html)

Parameters:
  • `model` (The underlying model to optimize (sklearn classifier/regressor))
  • `feat_space` (The hyper-parameters space (dict))
  • `task` (Either 'classification' or 'regression')
  • `max_iter` (The maximum number of iterations)

Examples

from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score
from scipy.stats.distributions import uniform, randint

from dodoml.ml import Hyperband, ContinuableLGBMClassifier

param_space = {
    'max_depth': randint(2, 11),
    'min_child_weight': randint(1, 11),
    'subsample': uniform(0.5, 0.5),
}

# feature_pipeline (preprocessing), Xtrain, Ytrain, Xtest, Ytest are assumed to be defined
model = make_pipeline(
    feature_pipeline,
    Hyperband(
        ContinuableLGBMClassifier(learning_rate=0.1),
        feat_space=param_space,
        task='classification'
    )
)

model.fit(Xtrain, Ytrain)
roc_auc_score(Ytest, model.predict_proba(Xtest)[:, 1])
dodoml.ml.lgbm_hyperband_classifier(numeric_features, categoric_features, learning_rate=0.08)[source]

Simple classification pipeline using hyperband to optimize lightgbm hyper-parameters

Parameters:
  • `numeric_features` (The list of numeric features)
  • `categoric_features` (The list of categoric features)
  • `learning_rate` (The learning rate)
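
A minimal usage sketch (column names and the Xtrain/Ytrain/Xtest sets are illustrative):

from dodoml.ml import lgbm_hyperband_classifier

model = lgbm_hyperband_classifier(numeric_features=['age', 'income'],
                                  categoric_features=['city'])
model.fit(Xtrain, Ytrain)
proba = model.predict_proba(Xtest)[:, 1]
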
dodoml.ml.lgbm_hyperband_regressor(numeric_features, categoric_features, learning_rate=0.08)[source]

Simple regression pipeline using hyperband to optimize lightgbm hyper-parameters

Parameters:
  • `numeric_features` (The list of numeric features)
  • `categoric_features` (The list of categoric features)
  • `learning_rate` (The learning rate)
class dodoml.ml.ContinuableLGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective=None, min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, seed=0, nthread=-1, silent=True, **kwargs)[source]

Training continues from the current model when n_estimators is increased via set_params, instead of restarting from scratch:

clf = ContinuableLGBMClassifier(n_estimators=100)
clf.fit(X, Y)
clf.set_params(n_estimators=110)
clf.fit(X, Y)  # train 10 more estimators, not from scratch
class dodoml.ml.ContinuableLGBMRegressor(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective=None, min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, seed=0, nthread=-1, silent=True, **kwargs)[source]

Same continuation behaviour as ContinuableLGBMClassifier:

clf = ContinuableLGBMRegressor(n_estimators=100)
clf.fit(X, Y)
clf.set_params(n_estimators=110)
clf.fit(X, Y)  # train 10 more estimators, not from scratch

Pipeline Operations

class dodoml.pipeline.BoxCoxTransformer[source]

Box-Cox transformation for numerical columns, to make them more Gaussian-like.
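
A minimal sketch, assuming the standard sklearn fit/transform interface (Box-Cox is only defined for strictly positive values):

import pandas as pd
from dodoml.pipeline import BoxCoxTransformer

df = pd.DataFrame({'income': [1.0, 10.0, 100.0, 1000.0]})
transformed = BoxCoxTransformer().fit_transform(df)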

class dodoml.pipeline.ColumnApplier(underlying)[source]

Some sklearn transformers can be applied to only one column at a time. Wrap them with ColumnApplier to apply them to every column of the dataset.
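
A minimal sketch, wrapping the single-column TolerantLabelEncoder so it runs on every column (df is an illustrative DataFrame of categorical columns):

from dodoml.pipeline import ColumnApplier, TolerantLabelEncoder

encoder = ColumnApplier(TolerantLabelEncoder())
encoded = encoder.fit_transform(df)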

class dodoml.pipeline.CountFrequencyEncoder(min_card=5, count_na=False)[source]

Encode each value by its frequency observed in the training set.
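
A minimal sketch, assuming the standard sklearn fit/transform interface (df is illustrative):

from dodoml.pipeline import CountFrequencyEncoder

enc = CountFrequencyEncoder(min_card=5, count_na=False)
encoded = enc.fit_transform(df)   # each value replaced by its training-set frequency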

class dodoml.pipeline.Logify[source]

Log transformation

class dodoml.pipeline.OrdinalEncoder(min_support)[source]

Encode categorical values as natural numbers based on alphabetical order. N/A values are encoded to -2 and rare values to -1. Very similar to TolerantLabelEncoder. TODO: improve the implementation.
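
A minimal sketch, assuming the standard sklearn fit/transform interface (df and the min_support value are illustrative):

from dodoml.pipeline import OrdinalEncoder

enc = OrdinalEncoder(min_support=10)
encoded = enc.fit_transform(df)   # N/A -> -2, rare values -> -1, others ranked alphabetically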

class dodoml.pipeline.TolerantLabelEncoder[source]

sklearn's LabelEncoder is not tolerant to unseen values; this encoder handles values unseen at fit time.

class dodoml.pipeline.UniqueCountColumnSelector(lowerbound, upperbound)[source]

Select the columns whose number of unique values is between lowerbound (inclusive) and upperbound (exclusive).
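
A minimal sketch (bounds are illustrative):

from dodoml.pipeline import UniqueCountColumnSelector

# keep columns with at least 2 and fewer than 100 distinct values
selector = UniqueCountColumnSelector(lowerbound=2, upperbound=100)
selected = selector.fit_transform(df)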

class dodoml.pipeline.YToLog(delegate, shift=0)[source]

Transform Y to log scale before fitting the delegate model, and transform predictions back to the original scale before returning them.
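
A minimal sketch, assuming shift is added to Y before taking the log (useful when Y contains zeros); the delegate model is illustrative:

from sklearn.linear_model import Ridge
from dodoml.pipeline import YToLog

model = YToLog(Ridge(), shift=1)
model.fit(X, Y)           # fits on log-transformed targets
preds = model.predict(X)  # predictions mapped back to the original scale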

Random Layer

The random_layer module implements Random Layer transformers.

Random layers are arrays of hidden unit activations that are random functions of input activation values (dot products for simple activation functions, distances from prototypes for radial basis functions).

They are used in the implementation of Extreme Learning Machines (ELMs), but can be used as a general input mapping.

class dodoml.ml.random_layer.RandomLayer(n_hidden=20, alpha=0.5, random_state=None, activation_func='tanh', activation_args=None, user_components=None, rbf_width=1.0)[source]

RandomLayer is a transformer that creates a feature mapping of the inputs that corresponds to a layer of hidden units with randomly generated components.

The transformed values are a specified function of input activations that are a weighted combination of dot product (multilayer perceptron) and distance (rbf) activations:

input_activation = alpha * mlp_activation + (1-alpha) * rbf_activation

mlp_activation(x) = dot(x, weights) + bias
rbf_activation(x) = rbf_width * ||x - center||/radius

alpha and rbf_width are specified by the user

weights and biases are drawn from a normal distribution with mean 0 and standard deviation 1

centers are taken uniformly from the bounding hyperrectangle of the inputs, and radii are max(||x-c||)/sqrt(n_centers*2)

The input activation is transformed by a transfer function that defaults to numpy.tanh if not specified, but can be any callable that returns an array of the same shape as its argument (the input activation array, of shape [n_samples, n_hidden]). Functions provided are ‘sine’, ‘tanh’, ‘tribas’, ‘inv_tribas’, ‘sigmoid’, ‘hardlim’, ‘softlim’, ‘gaussian’, ‘multiquadric’, ‘inv_multiquadric’ and ‘reclinear’.

Parameters:
  • `n_hidden` (int, optional (default=20)) – Number of units to generate

  • `alpha` (float, optional (default=0.5)) – Mixing coefficient for distance and dot product input activations: activation = alpha*mlp_activation + (1-alpha)*rbf_width*rbf_activation

  • `rbf_width` (float, optional (default=1.0)) – multiplier on rbf_activation

  • `user_components` (dictionary, optional (default=None)) – dictionary containing values for components that would otherwise be randomly generated. Valid key/value pairs are as follows:

    ‘radii’ : array-like of shape [n_hidden]
    ‘centers’: array-like of shape [n_hidden, n_features]
    ‘biases’ : array-like of shape [n_hidden]
    ‘weights’: array-like of shape [n_features, n_hidden]

  • `activation_func` ({callable, string} optional (default=’tanh’)) – Function used to transform input activation

    It must be one of ‘tanh’, ‘sine’, ‘tribas’, ‘inv_tribas’, ‘sigmoid’, ‘hardlim’, ‘softlim’, ‘gaussian’, ‘multiquadric’, ‘inv_multiquadric’, ‘reclinear’ or a callable. If None is given, ‘tanh’ will be used.

    If a callable is given, it will be used to compute the activations.

  • `activation_args` (dictionary, optional (default=None)) – Supplies keyword arguments for a callable activation_func

  • `random_state` (int, RandomState instance or None (default=None)) – Control the pseudo random number generator used to generate the hidden unit weights at fit time.

Variables:
  • input_activations_ (numpy array of shape [n_samples, n_hidden]) – Array containing dot(x, hidden_weights) + bias for all samples
  • components_ (dictionary containing two keys:) –
    bias_weights_ : numpy array of shape [n_hidden]
    hidden_weights_ : numpy array of shape [n_features, n_hidden]

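A minimal numpy sketch of the activation mixing described above (illustrative only, not the library's internals; a single shared radius is used here for brevity):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                       # [n_samples, n_features]
n_hidden, alpha, rbf_width = 20, 0.5, 1.0

weights = rng.randn(5, n_hidden)           # normal, mean 0, sd 1
biases = rng.randn(n_hidden)
# centers drawn uniformly from the bounding hyperrectangle of the inputs
centers = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_hidden, 5))

dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
radius = dists.max() / np.sqrt(n_hidden * 2)

mlp_activation = X @ weights + biases
rbf_activation = rbf_width * dists / radius
input_activation = alpha * mlp_activation + (1 - alpha) * rbf_activation
hidden = np.tanh(input_activation)         # default transfer function
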
class dodoml.ml.random_layer.MLPRandomLayer(n_hidden=20, random_state=None, activation_func='tanh', activation_args=None, weights=None, biases=None)[source]

Wrapper for RandomLayer with alpha (mixing coefficient) set to 1.0 for MLP activations only

class dodoml.ml.random_layer.RBFRandomLayer(n_hidden=20, random_state=None, activation_func='gaussian', activation_args=None, centers=None, radii=None, rbf_width=1.0)[source]

Wrapper for RandomLayer with alpha (mixing coefficient) set to 0.0 for RBF activations only

class dodoml.ml.random_layer.GRBFRandomLayer(n_hidden=20, grbf_lambda=0.001, centers=None, radii=None, random_state=None)[source]

Random Generalized RBF Hidden Layer transformer

Creates a layer of radial basis function units where:

f(a), s.t. a = ||x - c|| / r

with c the unit center and f(a) = exp(-gamma * a^tau), where tau and r are computed based on [1]
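
A numpy sketch of this activation (gamma and tau are fixed here for illustration; the layer computes tau and the radii per [1]):

import numpy as np

def grbf_activation(X, centers, radii, gamma=1.0, tau=2.0):
    # a = ||x - c|| / r for every sample/center pair
    a = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) / radii
    return np.exp(-gamma * a ** tau)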

Parameters:
  • `n_hidden` (int, optional (default=20)) – Number of units to generate, ignored if centers are provided
  • `grbf_lambda` (float, optional (default=0.001)) – GRBF shape parameter
  • `gamma` ({int, float} optional (default=1.0)) – Width multiplier for GRBF distance argument
  • `centers` (array of shape (n_hidden, n_features), optional (default=None)) – If provided, overrides internal computation of the centers
  • `radii` (array of shape (n_hidden), optional (default=None)) – If provided, overrides internal computation of the radii
  • `use_exemplars` (bool, optional (default=False)) – If True, uses random examples from the input to determine the RBF centers, ignored if centers are provided
  • `random_state` (int or RandomState instance, optional (default=None)) – Control the pseudo random number generator used to generate the centers at fit time, ignored if centers are provided
Variables:
  • components_ (dictionary containing two keys:) –
    radii_ : numpy array of shape [n_hidden]
    centers_ : numpy array of shape [n_hidden, n_features]
  • input_activations_ (numpy array of shape [n_samples, n_hidden]) – Array containing ||x-c||/r for all samples

See also

ELMRegressor, ELMClassifier, SimpleELMRegressor, SimpleELMClassifier, SimpleRandomLayer

References

[1] Fernandez-Navarro, et al., “MELM-GRBF: a modified version of the extreme learning machine for generalized radial basis function neural networks”, Neurocomputing 74 (2011), 2502-2510.