hybparsimony package

Subpackages

Submodules

hybparsimony.Examples module

hybparsimony.hybparsimony module

hybparsimony for Python is a package for searching for accurate parsimonious models by combining feature selection (FS), model hyperparameter optimization (HO), and parsimonious model selection (PMS) based on a separate cost and complexity evaluation.

To improve the search for parsimony, the hybrid method combines GA mechanisms such as selection, crossover and mutation within a PSO-based optimization algorithm that includes a strategy in which the best position of each particle (thus also the best position of each neighborhood) is calculated taking into account not only the goodness-of-fit, but also the parsimony principle.

In hybparsimony, the percentage of variables to be replaced with GA at each iteration $t$ is selected by a decreasing exponential function:

$p_{crossover} = \max(0.80 \cdot e^{-\Gamma \cdot t}, 0.10)$, which is adjusted by a $\Gamma$ parameter (by default, $\Gamma$ is set to $0.50$). Thus, in the first iterations parsimony is promoted by GA mechanisms, i.e., a high percentage of particles is replaced by crossover at the beginning. Subsequently, optimization with PSO becomes more relevant for improving model accuracy. This differs from other hybrid methods in which crossover is applied between the best individual positions of the particles, and from approaches in which the worst particles are replaced by new particles at extreme positions.
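
As an illustration, the decay of the crossover percentage can be reproduced with a few lines of Python (a sketch of the formula only; inside the package this percentage is derived from the 'gamma_crossover' parameter):

import math

gamma = 0.50
for t in range(0, 30, 5):
    # Decreasing exponential with a floor of 0.10
    pcrossover = max(0.80 * math.exp(-gamma * t), 0.10)
    print(f"iter {t:2d}: pcrossover = {pcrossover:.3f}")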

Experiments show that, in general, and with a suitable $\Gamma$, hybparsimony obtains better, more parsimonious and more robust models than other methods. It also reduces the number of iterations and, consequently, the computational effort.

References

Divasón, J., Pernia-Espinoza, A., Martinez-de-Pison, F.J. (2022). New Hybrid Methodology Based on Particle Swarm Optimization with Genetic Algorithms to Improve the Search of Parsimonious Models in High-Dimensional Databases. In: García Bringas, P., et al. Hybrid Artificial Intelligent Systems. HAIS 2022. Lecture Notes in Computer Science, vol 13469. Springer, Cham. [https://doi.org/10.1007/978-3-031-15471-3_29](https://doi.org/10.1007/978-3-031-15471-3_29)

class hybparsimony.hybparsimony.HYBparsimony(fitness=None, features=None, algorithm=None, custom_eval_fun=None, cv=None, scoring=None, type_ini_pop='improvedLHS', npart=15, maxiter=250, early_stop=None, Lambda=1.0, c1=1.1931471805599454, c2=1.1931471805599454, IW_max=0.9, IW_min=0.4, K=3, pmutation=0.1, gamma_crossover=0.5, tol=0.0001, rerank_error=1e-09, keep_history=False, feat_thres=0.9, best_global_thres=1, particles_to_delete=None, seed_ini=1234, not_muted=3, feat_mut_thres=0.1, n_jobs=1, verbose=0)[source]

Bases: object

__init__(fitness=None, features=None, algorithm=None, custom_eval_fun=None, cv=None, scoring=None, type_ini_pop='improvedLHS', npart=15, maxiter=250, early_stop=None, Lambda=1.0, c1=1.1931471805599454, c2=1.1931471805599454, IW_max=0.9, IW_min=0.4, K=3, pmutation=0.1, gamma_crossover=0.5, tol=0.0001, rerank_error=1e-09, keep_history=False, feat_thres=0.9, best_global_thres=1, particles_to_delete=None, seed_ini=1234, not_muted=3, feat_mut_thres=0.1, n_jobs=1, verbose=0)[source]

A class for searching parsimonious models by feature selection and parameter tuning, using a hybrid method based on genetic algorithms and particle swarm optimization.

fitness: function, optional

The fitness function: any function that takes as input a chromosome combining the model hyperparameters to tune and the features to be selected. The fitness function returns a numerical vector with three values (validation_cost, testing_cost and model_complexity) together with the trained model.
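
A minimal sketch of such a fitness function is shown below. The chromosome attribute names ('params' and 'columns'), the extra keyword arguments and the exact return layout are assumptions for illustration and must be adapted to the actual Chromosome interface.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def custom_fitness(chromosome, X, y, **kwargs):
    # Assumed interface: a dict of hyperparameters and a boolean feature mask.
    alpha = chromosome.params["alpha"]
    selected = chromosome.columns

    if not np.any(selected):  # no features selected: worst possible fitness
        return np.array([-np.inf, -np.inf, np.inf]), None

    model = Ridge(alpha=alpha)
    validation_cost = cross_val_score(model, X[:, selected], y, cv=5,
                                      scoring="neg_mean_squared_error").mean()
    model.fit(X[:, selected], y)
    testing_cost = validation_cost             # or a score computed on a held-out set
    model_complexity = np.sum(selected) * 1e9  # e.g. dominated by the number of features

    # Documented output: (validation_cost, testing_cost, model_complexity) plus the trained model
    return np.array([validation_cost, testing_cost, model_complexity]), model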

features: list of str, default=None

The names of the features/columns in the dataset. If None, the names are extracted when X is a DataFrame; otherwise a list of positions is generated according to X.shape[1].

algorithm: string or dict, default=None

If a string, the name of the algorithm to optimize (defined in 'hybparsimony.util.models.py'). If a dictionary, it must define the following properties: 'estimator', any machine learning algorithm compatible with scikit-learn; 'complexity', the function that measures the complexity of the model; and the hyperparameters of the algorithm, which can be fixed values (defined by Population.CONSTANT) or a search range $[min, max]$ defined as {"range": (min, max), "type": Population.X}, where the type can take three values: integer (Population.INTEGER), float (Population.FLOAT) or powers of 10 (Population.POWER), i.e. $10^{[min, max]}$. If algorithm is None, hybparsimony uses 'LogisticRegression()' for classification problems and 'Ridge' for regression problems.
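
As an illustrative sketch (assuming 'Population' can be imported from 'hybparsimony.util'; the complexity function and the layout used for the fixed value are assumptions to be checked against 'hybparsimony.util.models.py'), a custom dictionary for scikit-learn's KernelRidge could look like this:

from sklearn.kernel_ridge import KernelRidge
from hybparsimony.util import Population

def kernel_ridge_complexity(model, n_features, **kwargs):
    # Illustrative complexity: dominated by the number of selected features,
    # with the number of dual coefficients as a tie-breaker.
    return n_features * 1e9 + len(model.dual_coef_)

KernelRidge_custom = {"estimator": KernelRidge,                # scikit-learn compatible estimator
                      "complexity": kernel_ridge_complexity,   # model-complexity function
                      "alpha": {"range": (-5, 3), "type": Population.POWER},      # searched as 10**[-5, 3]
                      "gamma": {"range": (1e-5, 1.0), "type": Population.FLOAT},  # continuous range
                      "kernel": {"value": "rbf", "type": Population.CONSTANT}}    # fixed value (layout assumed)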

custom_eval_fun: function, default=None

An evaluation function similar to scikit-learn's 'cross_val_score()'. If None, hybparsimony uses 'cross_val_score(cv=5)' (a sketch of customizing the evaluation follows the 'scoring' parameter below).

cv: int, cross-validation generator or an iterable, default=None

Determines the cross-validation splitting strategy (see scikit-learn’s ‘cross_val_score()’ function)

scoring: str, callable, list, tuple, or dict, default=None.

Strategy to evaluate the performance of the cross-validated model on the test set. If None, cv=5 is used and 'scoring' is defined as MSE for regression problems, 'log_loss' for binary classification problems, and 'f1_macro' for multiclass problems (see scikit-learn's 'cross_val_score()' function).
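
For instance, the evaluation strategy can be customized either through 'cv'/'scoring' or by supplying a full 'custom_eval_fun' (a sketch; the algorithm name 'Ridge' and the signature of the custom evaluation function are assumptions to be checked against the package):

from sklearn.model_selection import RepeatedKFold, cross_val_score
from hybparsimony import HYBparsimony

# Option 1: keep the default cross_val_score-based evaluation, but with an
# explicit splitter and scoring metric.
model = HYBparsimony(algorithm='Ridge',
                     cv=RepeatedKFold(n_splits=5, n_repeats=2, random_state=0),
                     scoring='neg_mean_squared_error')

# Option 2: replace the whole evaluation with a custom function.
def custom_eval(estimator, X, y):
    return cross_val_score(estimator, X, y, cv=10, scoring='neg_mean_squared_error')

model = HYBparsimony(custom_eval_fun=custom_eval)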

type_ini_pop: str, {'randomLHS', 'geneticLHS', 'improvedLHS', 'maximinLHS', 'optimumLHS', 'random'}, optional

Method to create the first population with the GAparsimony._population function. Possible values: randomLHS, geneticLHS, improvedLHS, maximinLHS, optimumLHS, random. The first five methods correspond to different Latin hypercube sampling designs for the initial population. By default it is set to improvedLHS.

npart: int, default=15

Number of particles in the swarm (population size)

maxiter: int, default=250

The maximum number of iterations to run before the HYB process is halted.

early_stop: int, optional

The number of consecutive generations without an improvement greater than 'tol' in the 'best_fitness' value before the search process is stopped.

tol: float, default=1e-4

Value defining a significant difference between the ‘best_fitness’ values between iterations for ‘early stopping’.
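
For example, a search limited to 500 iterations that stops after 20 consecutive iterations without an improvement greater than 1e-3 could be configured as follows (parameter names as documented above):

from hybparsimony import HYBparsimony

model = HYBparsimony(maxiter=500,
                     early_stop=20,  # consecutive iterations without improvement > tol
                     tol=1e-3)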

rerank_error: float, default=1e-09

When a value is provided, a second reranking process according to the model complexities is applied by the parsimony_rerank function. Its primary objective is to select individuals with a high validation cost while maintaining the robustness of a parsimonious model. This function switches the positions of two models if the first one is more complex than the second and no significant difference is found between their fitness values in terms of cost. Thus, if the absolute difference between the validation costs is lower than rerank_error, they are considered similar.
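
The rule can be summarized with the following illustrative sketch (a pseudo-implementation with hypothetical numbers, not the package's parsimony_rerank function):

def prefer_simpler(cost_a, complexity_a, cost_b, complexity_b, rerank_error=1e-9):
    # If the validation costs are indistinguishable (difference below rerank_error)
    # and model A is more complex, model B is promoted ahead of A.
    if abs(cost_a - cost_b) < rerank_error and complexity_a > complexity_b:
        return "swap"
    return "keep order"

# The costs differ by about 1.1e-5 < rerank_error=1e-3, so the simpler model wins.
print(prefer_simpler(-0.489468, 8.0e9, -0.489457, 7.5e9, rerank_error=1e-3))  # -> "swap"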

gamma_crossover: float, default=0.50

In hybparsimony, the percentage of variables to be replaced with GA at each iteration $t$ is selected by a decreasing exponential function that is adjusted by a ‘gamma_crossover’ parameter (see references for more info).

Lambda: float, default=1.0

PSO parameter (see References)

c1: float, default=1/2 + math.log(2)

PSO parameter (see References)

c2: float, default=1/2 + math.log(2)

PSO parameter (see References)

IW_max: float, default=0.9

PSO parameter (see References)

IW_min: float, default=0.4

PSO parameter (see References)

K: int, default=3

PSO parameter (see References)

best_global_thres: float, default=1.0

Percentage of particles that will be influenced by the best global of their neighbourhoods (otherwise, they will be influenced by the best of the iteration in each neighbourhood).

particles_to_delete: float, default=None

If 'particles_to_delete' is not None and len(particles_to_delete) < maxiter, i.e. the length of the array is lower than the number of iterations, the array is completed with zeros up to the number of iterations.

pmutation: float, default=0.1

The probability of mutation in a parent chromosome. Usually mutation occurs with a small probability. By default it is set to 0.10.

feat_mut_thres: float, default=0.1

Probability of the mutated features-chromosome to be one. The default value is set to 0.10.

feat_thres: float, default=0.90

Proportion of selected features in the initial population. A high percentage of selected features is recommended for the first generations.

keep_history: bool, default=False

If True, keeps the results of all particles in each iteration in the 'history' attribute.

seed_ini: int, optional

An integer value containing the random number generator state.

n_jobs: int, default=1

Number of cores to parallelize the evaluation of the swarm. It should be used with caution because the algorithms used or the ‘cross_validate()’ function used by default to evaluate individuals may also parallelize their internal processes.

verbose: int, default=0

The level of messages to display. Possible values: 0=silent mode, 1=monitor level, 2=debug level.

Attributes

minutes_total: float

Total elapsed time (in minutes).

history: list

A list with the results of the population at each iteration. 'history[iter]' returns a DataFrame with the results of iteration 'iter'.

best_model

The best model in the whole optimization process.

best_score: float

The validation score of the best model.

best_complexity: float

The complexity of the best model.

selected_features: list

The name of the selected features for the best model.

selected_features_bool: list

The selected features for the best model in Boolean form.

best_model_conf: Chromosome

The parameters and features of the best model in the whole optimization process.

Examples

Usage example for a regression model using the sklearn ‘diabetes’ dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from hybparsimony import HYBparsimony

# Load 'diabetes' dataset
diabetes = load_diabetes()

X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1234)

# Standardize X and y
scaler_X = StandardScaler()
X_train = scaler_X.fit_transform(X_train)
X_test = scaler_X.transform(X_test)
scaler_y = StandardScaler()
y_train = scaler_y.fit_transform(y_train.reshape(-1,1)).flatten()
y_test = scaler_y.transform(y_test.reshape(-1,1)).flatten()

algo = 'KernelRidge'
HYBparsimony_model = HYBparsimony(algorithm=algo,
                                features=diabetes.feature_names,
                                rerank_error=0.001,
                                verbose=1)

# Search the best hyperparameters and features 
# (increasing 'time_limit' to improve RMSE with high consuming algorithms)
HYBparsimony_model.fit(X_train, y_train, time_limit=0.20)
Running iteration 0
Best model -> Score = -0.510786 Complexity = 9,017,405,352.5 
Iter = 0 -> MeanVal = -0.88274  ValBest = -0.510786   ComplexBest = 9,017,405,352.5 Time(min) = 0.005858

Running iteration 1
Best model -> Score = -0.499005 Complexity = 8,000,032,783.88 
Iter = 1 -> MeanVal = -0.659969  ValBest = -0.499005   ComplexBest = 8,000,032,783.88 Time(min) = 0.004452

...
...
...

Running iteration 34
Best model -> Score = -0.489468 Complexity = 8,000,002,255.68 
Iter = 34 -> MeanVal = -0.527314  ValBest = -0.489468   ComplexBest = 8,000,002,255.68 Time(min) = 0.007533

Running iteration 35
Best model -> Score = -0.489457 Complexity = 8,000,002,199.12 
Iter = 35 -> MeanVal = -0.526294  ValBest = -0.489457   ComplexBest = 8,000,002,199.12 Time(min) = 0.006522

Time limit reached. Stopped.
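
A possible continuation of the example above evaluates the selected model on the held-out test set (method and attribute names as documented on this page):

import numpy as np

preds = HYBparsimony_model.predict(X_test)
print(f'Best Model = {HYBparsimony_model.best_model}')
print(f'Selected features: {HYBparsimony_model.selected_features}')
print(f'RMSE test = {np.sqrt(mean_squared_error(y_test, preds)):.6f}')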

Usage example for a classification model using the ‘breast_cancer’ dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import log_loss
from hybparsimony import HYBparsimony

# load 'breast_cancer' dataset
breast_cancer = load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target 
print(X.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=1)

# Standardize X and y (some algorithms require that)
scaler_X = StandardScaler()
X_train = scaler_X.fit_transform(X_train)
X_test = scaler_X.transform(X_test)

HYBparsimony_model = HYBparsimony(features=breast_cancer.feature_names,
                                rerank_error=0.005,
                                verbose=1)
HYBparsimony_model.fit(X_train, y_train, time_limit=0.50)
# Extract probs of class==1
preds = HYBparsimony_model.predict_proba(X_test)[:,1]
print(f'\n\nBest Model = {HYBparsimony_model.best_model}')
print(f'Selected features:{HYBparsimony_model.selected_features}')
print(f'Complexity = {round(HYBparsimony_model.best_complexity, 2):,}')
print(f'5-CV logloss = {-round(HYBparsimony_model.best_score,6)}')
print(f'logloss test = {round(log_loss(y_test, preds),6)}')
(569, 30)
Detected a binary-class problem. Using 'neg_log_loss' as default scoring function.
Running iteration 0
Best model -> Score = -0.091519 Complexity = 29,000,000,005.11 
Iter = 0 -> MeanVal = -0.297448  ValBest = -0.091519   ComplexBest = 29,000,000,005.11 Time(min) = 0.006501

Running iteration 1
Best model -> Score = -0.085673 Complexity = 27,000,000,009.97 
Iter = 1 -> MeanVal = -0.117216  ValBest = -0.085673   ComplexBest = 27,000,000,009.97 Time(min) = 0.004273

...
...

Running iteration 102
Best model -> Score = -0.064557 Complexity = 11,000,000,039.47 
Iter = 102 -> MeanVal = -0.076314  ValBest = -0.066261   ComplexBest = 9,000,000,047.25 Time(min) = 0.004769

Running iteration 103
Best model -> Score = -0.064557 Complexity = 11,000,000,039.47 
Iter = 103 -> MeanVal = -0.086243  ValBest = -0.064995   ComplexBest = 11,000,000,031.2 Time(min) = 0.004591

Time limit reached. Stopped.

Best Model = LogisticRegression(C=5.92705799354935)
Selected features:['mean texture' 'mean concave points' 'radius error' 'area error'
'compactness error' 'worst radius' 'worst perimeter' 'worst area'
'worst smoothness' 'worst concavity' 'worst symmetry']
Complexity = 11,000,000,039.47
5-CV logloss = 0.064557
logloss test = 0.076254
fit(X, y, time_limit=None)[source]

Performs the search for accurate parsimonious models by combining feature selection, hyperparameter optimization, and parsimonious model selection (PMS) with data matrix (X) and targets (y).

Parameters

X: pandas.DataFrame or numpy.array

Training vector.

y: pandas.DataFrame or numpy.array

Target vector relative to X.

time_limit: float, default=None

Maximum time to perform the optimization process in minutes.

predict(X)[source]

Predict result for samples in X.

Parameters

X: numpy.array or pandas.DataFrame

Samples.

Returns

numpy.array

A numpy.array with predictions.

predict_proba(X)[source]

Predict probabilities for each class and sample in X (only for classification models).

Parameters

X: numpy.array or pandas.DataFrame

Samples.

Returns

numpy.array

A numpy.array with predictions. Returns the probability of the sample for each class in the model.

hybparsimony.hybparsimony.multinomial(n, pvals, size=None)

Draw samples from a multinomial distribution.

The multinomial distribution is a multivariate generalization of the binomial distribution. Take an experiment with one of p possible outcomes. An example of such an experiment is throwing a dice, where the outcome can be 1 through 6. Each sample drawn from the distribution represents n such experiments. Its values, X_i = [X_0, X_1, ..., X_p], represent the number of times the outcome was i.

Note

New code should use the numpy.random.Generator.multinomial method of a numpy.random.Generator instance instead; please see the NumPy random quick start guide.

Parameters

n: int

Number of experiments.

pvals: sequence of floats, length p

Probabilities of each of the p different outcomes. These must sum to 1 (however, the last element is always assumed to account for the remaining probability, as long as sum(pvals[:-1]) <= 1).

size: int or tuple of ints, optional

Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

Returns

out: ndarray

The drawn samples, of shape size, if that was provided. If not, the shape is (N,).

In other words, each entry out[i,j,...,:] is an N-dimensional value drawn from the distribution.

See Also

random.Generator.multinomial: which should be used for new code.

Examples

Throw a dice 20 times:

>>> np.random.multinomial(20, [1/6.]*6, size=1)
array([[4, 1, 7, 5, 2, 1]]) # random

It landed 4 times on 1, once on 2, etc.

Now, throw the dice 20 times, and 20 times again:

>>> np.random.multinomial(20, [1/6.]*6, size=2)
array([[3, 4, 3, 3, 4, 3], # random
       [2, 4, 3, 4, 0, 7]])

For the first run, we threw 3 times 1, 4 times 2, etc. For the second, we threw 2 times 1, 4 times 2, etc.

A loaded die is more likely to land on number 6:

>>> np.random.multinomial(100, [1/7.]*5 + [2/7.])
array([11, 16, 14, 17, 16, 26]) # random

The probability inputs should be normalized. As an implementation detail, the value of the last entry is ignored and assumed to take up any leftover probability mass, but this should not be relied on. A biased coin which has twice as much weight on one side as on the other should be sampled like so:

>>> np.random.multinomial(100, [1.0 / 3, 2.0 / 3])  # RIGHT
array([38, 62]) # random

not like:

>>> np.random.multinomial(100, [1.0, 2.0])  # WRONG
Traceback (most recent call last):
ValueError: pvals < 0, pvals > 1 or pvals contains NaNs

Module contents