dodoml
-
dodoml.compute_features_impact(model, Xval, Yval, row_sample=50000, features=None)
Compute the feature impact of each feature.
Feature Impact for a given column measures how much worse a model’s error score would be if we made predictions after randomly shuffling that column (while leaving other columns unchanged). This technique is sometimes called Permutation Importance.
The calculation may take time. To speed it up, you can either:
- Use sampling with the row_sample parameter (number of rows, 0 for all rows)
- Restrict the computation to the columns of interest with the features parameter (list of columns, None for all)
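The shuffling idea described above can be sketched in a few lines. This is a minimal illustration, not dodoml's implementation: the helper name `permutation_impact`, the use of mean squared error as the error score, and the toy data are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def permutation_impact(model, X, y, row_sample=0, features=None, seed=0):
    """Hypothetical helper: increase in MSE after shuffling each column."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Optional row sampling (0 means "use all rows")
    if row_sample and row_sample < len(X):
        idx = rng.choice(len(X), size=row_sample, replace=False)
        X, y = X[idx], y[idx]
    columns = range(X.shape[1]) if features is None else features
    base_error = mean_squared_error(y, model.predict(X))
    impact = {}
    for j in columns:
        shuffled = X.copy()
        # Shuffle one column, leave the others unchanged
        shuffled[:, j] = rng.permutation(shuffled[:, j])
        impact[j] = mean_squared_error(y, model.predict(shuffled)) - base_error
    return impact


# Toy data: only column 0 drives the target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 0.1 * rng.normal(size=500)
model = LinearRegression().fit(X, y)
impact = permutation_impact(model, X, y)
```

In this toy setting the impact of column 0 should dominate, while the two noise columns should stay close to zero.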
-
dodoml.compute_partial_dependence(model, Xval, features=None, row_sample=10000, percentiles=(0.05, 0.95), grid_resolution=20)
Compute the partial dependence of the prediction function on each feature.
For a linear model, we can look at the regression coefficients to tell whether a feature impacts the predictions positively or negatively.
For a more complex model, we use partial dependence to visualize this relationship.
The calculation may take time. To speed it up, you can either:
- Use sampling with the row_sample parameter (number of rows, 0 for all rows)
- Restrict the computation to the columns of interest with the features parameter (list of columns, None for all)
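As a rough illustration of what partial dependence measures, the sketch below fixes one feature at each value of a grid and averages the model's predictions over the data. The helper name `partial_dependence_1d` and the toy data are hypothetical, and the way the grid is built from `percentiles` and `grid_resolution` is an assumption about how those parameters are meant to be used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def partial_dependence_1d(model, X, feature, percentiles=(0.05, 0.95), grid_resolution=20):
    """Hypothetical helper: average prediction as one feature sweeps a grid."""
    X = np.asarray(X, dtype=float)
    # Grid spans the given quantiles of the feature (assumed semantics)
    lo, hi = np.quantile(X[:, feature], percentiles)
    grid = np.linspace(lo, hi, grid_resolution)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # force the feature to this value for every row
        averages.append(model.predict(X_mod).mean())
    return grid, np.array(averages)


# Toy data with a known linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1]
model = LinearRegression().fit(X, y)

grid, pd_values = partial_dependence_1d(model, X, feature=0)
slope = np.polyfit(grid, pd_values, 1)[0]  # for a linear model this recovers the coefficient
```

For a linear model the curve is a straight line whose slope equals the regression coefficient, which is why the coefficients alone are enough in that case.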
-
dodoml.compute_ace(df, numeric_features, categoric_features, target, target_as_cat)
Compute the ACE correlation (http://www.stat.cmu.edu/~ryantibs/datamining/lectures/11-cor2-marked.pdf) between each of the numeric/categoric features and the target. The target variable can be treated either as numeric or as categoric using the target_as_cat parameter.
Machine Learning
-
class dodoml.ml.Hyperband(model, feat_space, task, max_iter=81)
Simple sklearn-compatible implementation of the Hyperband algorithm (http://people.eecs.berkeley.edu/~kjamieson/hyperband.html).
Parameters: - `model` – The underlying model to optimize (sklearn classifier/regressor)
- `feat_space` – The hyper-parameter space (dict)
- `task` – Either classification or regression
- `max_iter` – The maximum number of iterations
Examples
    from scipy.stats.distributions import uniform, randint
    from sklearn.metrics import roc_auc_score
    from sklearn.pipeline import make_pipeline

    from dodoml.ml import Hyperband, ContinuableLGBMClassifier

    # Distributions each hyper-parameter is sampled from
    param_space = {
        'max_depth': randint(2, 11),
        'min_child_weight': randint(1, 11),
        'subsample': uniform(0.5, 0.5),
    }
    # feature_pipeline, Xtrain, Ytrain, Xtest and Ytest must be defined beforehand
    model = make_pipeline(
        feature_pipeline,
        Hyperband(
            ContinuableLGBMClassifier(learning_rate=0.1),
            feat_space=param_space,
            task='classification'
        )
    )
    model.fit(Xtrain, Ytrain)
    roc_auc_score(Ytest, model.predict_proba(Xtest)[:, 1])
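To give a feel for what `max_iter` controls, the sketch below computes the bracket schedule from one common formulation of Hyperband: how many configurations start in each bracket and how many iterations each survivor receives per round. The function name `hyperband_schedule` and the reduction factor `eta=3` are assumptions for illustration; dodoml's own implementation may use a different factor or rounding.

```python
def hyperband_schedule(max_iter=81, eta=3):
    """Hypothetical helper: list of brackets, each a list of (n_configs, n_iterations)."""
    # Largest s such that eta**s <= max_iter (integer arithmetic avoids float log issues)
    s_max = 0
    while eta ** (s_max + 1) <= max_iter:
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        # Initial number of configurations: ceil((s_max + 1) * eta**s / (s + 1))
        n = -(-(s_max + 1) * eta ** s // (s + 1))
        rounds = []
        for i in range(s + 1):
            n_i = n // eta ** i                    # configurations kept in round i
            r_i = max_iter * eta ** i // eta ** s  # iterations given to each of them
            rounds.append((n_i, r_i))
        brackets.append(rounds)
    return brackets


schedule = hyperband_schedule(max_iter=81, eta=3)
for rounds in schedule:
    print(rounds)
```

With `max_iter=81` and `eta=3`, the most aggressive bracket starts with 81 configurations trained for 1 iteration each and ends with a single configuration trained for 81 iterations, while the last bracket simply trains 5 configurations for 81 iterations.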
-
dodoml.ml.lgbm_hyperband_classifier(numeric_features, categoric_features, learning_rate=0.08)
Simple classification pipeline using Hyperband to optimize LightGBM hyper-parameters.
Parameters: - `numeric_features` – The list of numeric features
- `categoric_features` – The list of categoric features
- `learning_rate` – The learning rate
-
dodoml.ml.lgbm_hyperband_regressor(numeric_features, categoric_features, learning_rate=0.08)
Simple regression pipeline using Hyperband to optimize LightGBM hyper-parameters.
Parameters: - `numeric_features` – The list of numeric features
- `categoric_features` – The list of categoric features
- `learning_rate` – The learning rate
-
class dodoml.ml.ContinuableLGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective=None, min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, seed=0, nthread=-1, silent=True, **kwargs)
Examples
    clf = ContinuableLGBMClassifier(n_estimators=100)
    clf.fit(X, Y)
    clf.set_params(n_estimators=110)
    clf.fit(X, Y)  # train 10 more estimators, not from scratch
-
class dodoml.ml.ContinuableLGBMRegressor(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective=None, min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, seed=0, nthread=-1, silent=True, **kwargs)
Examples
    clf = ContinuableLGBMRegressor(n_estimators=100)
    clf.fit(X, Y)
    clf.set_params(n_estimators=110)
    clf.fit(X, Y)  # train 10 more estimators, not from scratch
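The same "continue training instead of starting over" pattern can be tried without dodoml by using scikit-learn's `warm_start` option. The sketch below swaps in `GradientBoostingClassifier` purely as an analogy; it is not how the Continuable LightGBM wrappers are implemented, and the data set is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, for illustration only
X, Y = make_classification(n_samples=200, n_features=5, random_state=0)

# warm_start=True makes a later fit() reuse the trees already trained
clf = GradientBoostingClassifier(n_estimators=20, warm_start=True, random_state=0)
clf.fit(X, Y)
n_before = clf.estimators_.shape[0]

clf.set_params(n_estimators=30)
clf.fit(X, Y)  # trains 10 additional trees on top of the existing 20
n_after = clf.estimators_.shape[0]
```

This kind of incremental fitting is what makes the estimators convenient for Hyperband, where promising configurations are repeatedly granted more iterations.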
Pipeline Operations
-
class dodoml.pipeline.BoxCoxTransformer
Box-Cox transformation for numerical columns, to make them more Gaussian-like.
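The effect of a Box-Cox transformation on a skewed column can be checked with `scipy.stats.boxcox`. This is a standalone sketch on synthetic log-normal data, not the transformer's own code; note that Box-Cox requires strictly positive inputs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strictly positive, right-skewed

# boxcox returns the transformed values and the fitted lambda
x_bc, lam = stats.boxcox(x)

skew_before = stats.skew(x)
skew_after = stats.skew(x_bc)
```

After the transformation the skewness should be much closer to zero, which is what "more Gaussian-like" refers to here.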
-
class
dodoml.pipeline.ColumnApplier(underlying)
Some sklearn transformers can be applied to only ONE column at a time. Wrap them with ColumnApplier to apply them to the whole dataset.
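A wrapper of this kind can be sketched as follows: one clone of the underlying transformer is fitted per column, and the transformed columns are stacked back together. The class name `PerColumnApplier` is hypothetical and the code only illustrates the idea; `LabelEncoder` is used because it accepts a single 1-D column.

```python
import numpy as np
from sklearn.base import clone
from sklearn.preprocessing import LabelEncoder


class PerColumnApplier:
    """Hypothetical sketch: apply a one-column transformer to every column."""

    def __init__(self, underlying):
        self.underlying = underlying

    def fit(self, X, y=None):
        X = np.asarray(X)
        # One independent copy of the transformer per column
        self.fitted_ = [clone(self.underlying).fit(X[:, j]) for j in range(X.shape[1])]
        return self

    def transform(self, X):
        X = np.asarray(X)
        columns = [t.transform(X[:, j]) for j, t in enumerate(self.fitted_)]
        return np.column_stack(columns)


X = np.array([["a", "x"], ["b", "y"], ["a", "y"]])
encoded = PerColumnApplier(LabelEncoder()).fit(X).transform(X)
```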
-
class
dodoml.pipeline.CountFrequencyEncoder(min_card=5, count_na=False)
Encode each value by its frequency observed in the training set.
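Frequency encoding itself is easy to reproduce with pandas. The sketch below uses relative frequencies and maps unseen values to 0; both choices are assumptions, and the `min_card` and `count_na` options of the class are not modelled here.

```python
import pandas as pd

train = pd.Series(["a", "b", "a", "c", "a", "b"])
test = pd.Series(["a", "c", "z"])  # "z" never appears in the training set

# Relative frequency of each value in the training data
freq = train.value_counts(normalize=True)

# Replace each value by its training frequency; unseen values get 0.0 (assumed convention)
encoded = test.map(freq).fillna(0.0)
```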
-
class
dodoml.pipeline.OrdinalEncoder(min_support)
Encode categorical values as natural numbers based on alphabetical order. N/A values are encoded as -2 and rare values as -1. Very similar to TolerentLabelEncoder. TODO: improve the implementation.
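The encoding rule described above can be sketched with two small helpers. The function names are hypothetical, and treating `min_support` as a minimum count (rather than a fraction) and mapping unseen values to -1 are assumptions made for the example.

```python
import pandas as pd


def fit_ordinal_mapping(values, min_support):
    """Hypothetical helper: codes 0..k-1 for frequent categories, in alphabetical order."""
    counts = pd.Series(values).value_counts()  # missing values are ignored by default
    kept = sorted(counts[counts >= min_support].index)
    return {category: code for code, category in enumerate(kept)}


def apply_ordinal_mapping(values, mapping):
    """Hypothetical helper: -1 for rare or unseen values, -2 for missing values."""
    series = pd.Series(values)
    codes = series.map(mapping).fillna(-1)
    codes[series.isna()] = -2
    return codes.astype(int)


data = ["b", "a", "b", "a", "c", None]
mapping = fit_ordinal_mapping(data, min_support=2)  # "c" appears once, so it is rare
codes = apply_ordinal_mapping(data, mapping)
```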
Random Layer
The random_layer module implements Random Layer transformers.
Random layers are arrays of hidden unit activations that are random functions of input activation values (dot products for simple activation functions, distances from prototypes for radial basis functions).
They are used in the implementation of Extreme Learning Machines (ELMs), but can be used as a general input mapping.
-
class dodoml.ml.random_layer.RandomLayer(n_hidden=20, alpha=0.5, random_state=None, activation_func='tanh', activation_args=None, user_components=None, rbf_width=1.0)
RandomLayer is a transformer that creates a feature mapping of the inputs that corresponds to a layer of hidden units with randomly generated components.
The transformed values are a specified function of input activations that are a weighted combination of dot product (multilayer perceptron) and distance (rbf) activations:
    input_activation = alpha * mlp_activation + (1-alpha) * rbf_activation
    mlp_activation(x) = dot(x, weights) + bias
    rbf_activation(x) = rbf_width * ||x - center||/radius
alpha and rbf_width are specified by the user.
Weights and biases are drawn from a normal distribution with mean 0 and standard deviation 1.
Centers are drawn uniformly from the bounding hyperrectangle of the inputs, and radii are max(||x-c||)/sqrt(n_centers*2).
The input activation is transformed by a transfer function that defaults to numpy.tanh if not specified, but can be any callable that returns an array of the same shape as its argument (the input activation array, of shape [n_samples, n_hidden]). Functions provided are ‘sine’, ‘tanh’, ‘tribas’, ‘inv_tribas’, ‘sigmoid’, ‘hardlim’, ‘softlim’, ‘gaussian’, ‘multiquadric’, ‘inv_multiquadric’ and ‘reclinear’.
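The formulas above can be traced step by step with NumPy. This is a minimal sketch under stated assumptions, not the library's code: the data is synthetic, and max(||x-c||) is interpreted here as the largest sample-to-center distance, which the text does not spell out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_hidden = 100, 3, 20
alpha, rbf_width = 0.5, 1.0
X = rng.normal(size=(n_samples, n_features))

# MLP part: weights and biases drawn from a standard normal distribution
weights = rng.normal(size=(n_features, n_hidden))
biases = rng.normal(size=n_hidden)
mlp_activation = X @ weights + biases

# RBF part: centers drawn uniformly from the bounding hyperrectangle of the inputs
centers = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_hidden, n_features))
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
# Assumed reading of max(||x-c||): the largest sample-to-center distance
radii = np.full(n_hidden, dists.max() / np.sqrt(2 * n_hidden))
rbf_activation = rbf_width * dists / radii

# Mix both parts, then apply the transfer function (tanh by default)
input_activation = alpha * mlp_activation + (1 - alpha) * rbf_activation
hidden = np.tanh(input_activation)
```

The result has one row per sample and one column per hidden unit, matching the shape [n_samples, n_hidden] required of the transfer function's output.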
Parameters: - `n_hidden` (int, optional (default=20)) – Number of units to generate
- `alpha` (float, optional (default=0.5)) – Mixing coefficient for distance and dot product input activations: activation = alpha*mlp_activation + (1-alpha)*rbf_activation, where rbf_activation already includes the rbf_width multiplier
- `rbf_width` (float, optional (default=1.0)) – Multiplier on rbf_activation
- `user_components` (dictionary, optional (default=None)) – Dictionary containing values for components that would otherwise be randomly generated. Valid key/value pairs are as follows:
  ‘radii’: array-like of shape [n_hidden]
  ‘centers’: array-like of shape [n_hidden, n_features]
  ‘biases’: array-like of shape [n_hidden]
  ‘weights’: array-like of shape [n_features, n_hidden]
- `activation_func` ({callable, string}, optional (default=‘tanh’)) – Function used to transform the input activation. It must be one of ‘tanh’, ‘sine’, ‘tribas’, ‘inv_tribas’, ‘sigmoid’, ‘hardlim’, ‘softlim’, ‘gaussian’, ‘multiquadric’, ‘inv_multiquadric’, ‘reclinear’ or a callable. If None is given, ‘tanh’ will be used. If a callable is given, it will be used to compute the activations.
- `activation_args` (dictionary, optional (default=None)) – Supplies keyword arguments for a callable activation_func
- `random_state` (int, RandomState instance or None (default=None)) – Controls the pseudo random number generator used to generate the hidden unit weights at fit time.
Variables: - input_activations_ (numpy array of shape [n_samples, n_hidden]) – Array containing dot(x, hidden_weights) + bias for all samples
- components_ (dictionary containing two keys):
  bias_weights_: numpy array of shape [n_hidden]
  hidden_weights_: numpy array of shape [n_features, n_hidden]
-
class dodoml.ml.random_layer.MLPRandomLayer(n_hidden=20, random_state=None, activation_func='tanh', activation_args=None, weights=None, biases=None)
Wrapper for RandomLayer with alpha (mixing coefficient) set to 1.0, for MLP activations only.
-
class dodoml.ml.random_layer.RBFRandomLayer(n_hidden=20, random_state=None, activation_func='gaussian', activation_args=None, centers=None, radii=None, rbf_width=1.0)
Wrapper for RandomLayer with alpha (mixing coefficient) set to 0.0, for RBF activations only.
-
class dodoml.ml.random_layer.GRBFRandomLayer(n_hidden=20, grbf_lambda=0.001, centers=None, radii=None, random_state=None)
Random Generalized RBF Hidden Layer transformer.
Creates a layer of radial basis function units where each unit computes f(a), with a = ||x-c||/r, c the unit center, and f(a) = exp(-gamma * a^tau), where tau and r are computed based on [1].
Parameters: - `n_hidden` (int, optional (default=20)) – Number of units to generate, ignored if centers are provided
- `grbf_lambda` (float, optional (default=0.001)) – GRBF shape parameter
- `gamma` ({int, float} optional (default=1.0)) – Width multiplier for GRBF distance argument
- `centers` (array of shape (n_hidden, n_features), optional (default=None)) – If provided, overrides internal computation of the centers
- `radii` (array of shape (n_hidden), optional (default=None)) – If provided, overrides internal computation of the radii
- `use_exemplars` (bool, optional (default=False)) – If True, uses random examples from the input to determine the RBF centers, ignored if centers are provided
- `random_state` (int or RandomState instance, optional (default=None)) – Control the pseudo random number generator used to generate the centers at fit time, ignored if centers are provided
Variables: - components_ (dictionary containing two keys):
  radii_: numpy array of shape [n_hidden]
  centers_: numpy array of shape [n_hidden, n_features]
- input_activations_ (numpy array of shape [n_samples, n_hidden]) – Array containing ||x-c||/r for all samples
See also
ELMRegressor, ELMClassifier, SimpleELMRegressor, SimpleELMClassifier, SimpleRandomLayer
References
[1] Fernandez-Navarro, et al, “MELM-GRBF: a modified version of the extreme learning machine for generalized radial basis function neural networks”, Neurocomputing 74 (2011), 2502-2510