hybparsimony package
Subpackages
- hybparsimony.lhs package
- hybparsimony.util package
Submodules
hybparsimony.Examples module
hybparsimony.hybparsimony module
hybparsimony for Python is a package for searching for accurate and parsimonious models by combining feature selection (FS), model hyperparameter optimization (HO), and parsimonious model selection (PMS) based on a separate cost and complexity evaluation.
To improve the search for parsimony, the hybrid method combines GA mechanisms (selection, crossover and mutation) within a PSO-based optimization algorithm. It includes a strategy in which the best position of each particle (and hence the best position of each neighborhood) is calculated taking into account not only the goodness of fit, but also the parsimony principle.
- In hybparsimony, the percentage of variables to be replaced with GA at each iteration $t$ is selected by a decreasing exponential function:
$p_{crossover} = \max(0.80 \cdot e^{-\Gamma \cdot t}, 0.10)$, which is adjusted by a $\Gamma$ parameter (by default $\Gamma$ is set to $0.50$). Thus, in the first iterations parsimony is promoted by GA mechanisms, i.e., replacing by crossover a high percentage of particles at the beginning. Subsequently, optimization with PSO becomes more relevant for the improvement of model accuracy. This differs from other hybrid methods in which the crossover is applied between the best individual position of each particle, or other approaches in which the worst particles are also replaced by new particles, but at extreme positions.
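As an illustration of this schedule, the following minimal sketch (the helper name pcrossover_schedule is ours, not part of the package API) computes the replacement percentage for the first iterations with the default $\Gamma = 0.50$:

import numpy as np

# Illustrative sketch of the decreasing exponential crossover schedule described above
def pcrossover_schedule(t, gamma=0.50):
    # percentage of particles replaced by GA crossover at iteration t
    return max(0.80 * np.exp(-gamma * t), 0.10)

print([round(pcrossover_schedule(t), 3) for t in range(6)])
# [0.8, 0.485, 0.294, 0.179, 0.108, 0.1]

Crossover dominates the early iterations and then fades to the 0.10 floor, leaving PSO to refine model accuracy.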
Experiments show that, in general, and with a suitable $\Gamma$, hybparsimony obtains better, more parsimonious and more robust models compared to other methods. It also reduces the number of iterations and, consequently, the computational effort.
References
Divasón, J., Pernia-Espinoza, A., Martinez-de-Pison, F.J. (2022). New Hybrid Methodology Based on Particle Swarm Optimization with Genetic Algorithms to Improve the Search of Parsimonious Models in High-Dimensional Databases. In: García Bringas, P., et al. Hybrid Artificial Intelligent Systems. HAIS 2022. Lecture Notes in Computer Science, vol 13469. Springer, Cham. [https://doi.org/10.1007/978-3-031-15471-3_29](https://doi.org/10.1007/978-3-031-15471-3_29)
- class hybparsimony.hybparsimony.HYBparsimony(fitness=None, features=None, algorithm=None, custom_eval_fun=None, cv=None, scoring=None, type_ini_pop='improvedLHS', npart=15, maxiter=250, early_stop=None, Lambda=1.0, c1=1.1931471805599454, c2=1.1931471805599454, IW_max=0.9, IW_min=0.4, K=3, pmutation=0.1, gamma_crossover=0.5, tol=0.0001, rerank_error=1e-09, keep_history=False, feat_thres=0.9, best_global_thres=1, particles_to_delete=None, seed_ini=1234, not_muted=3, feat_mut_thres=0.1, n_jobs=1, verbose=0)[source]
Bases: object
- __init__(fitness=None, features=None, algorithm=None, custom_eval_fun=None, cv=None, scoring=None, type_ini_pop='improvedLHS', npart=15, maxiter=250, early_stop=None, Lambda=1.0, c1=1.1931471805599454, c2=1.1931471805599454, IW_max=0.9, IW_min=0.4, K=3, pmutation=0.1, gamma_crossover=0.5, tol=0.0001, rerank_error=1e-09, keep_history=False, feat_thres=0.9, best_global_thres=1, particles_to_delete=None, seed_ini=1234, not_muted=3, feat_mut_thres=0.1, n_jobs=1, verbose=0)[source]
A class for searching parsimonious models by feature selection and parameter tuning with a hybrid method based on genetic algorithms and particle swarm optimization.
- fitness: function, optional
The fitness function: any function that takes as input a chromosome combining the model parameters to tune and the features to be selected. The fitness function returns a numerical vector with three values (validation_cost, testing_cost and model_complexity) along with the trained model.
- features: list of str, default=None
The name of features/columns in the dataset. If None, it extracts the names if X is a dataframe, otherwise it generates a list of the positions according to the value of X.shape[1].
- algorithm: string or dict, default=None
If a string, the name of the algorithm to optimize (defined in ‘hybparsimony.util.models.py’). Alternatively, a dictionary with the following properties: ‘estimator’, any machine learning algorithm compatible with scikit-learn; ‘complexity’, the function that measures the complexity of the model; and the hyperparameters of the algorithm, which can be fixed values (defined by Population.CONSTANT) or a search range $[min, max]$ defined by {“range”:(min, max), “type”: Population.X}, where the type can take three values: integer (Population.INTEGER), float (Population.FLOAT) or powers of 10 (Population.POWER), i.e. $10^{[min, max]}$ (see the sketch after this parameter list). If algorithm==None, hybparsimony uses ‘LogisticRegression()’ for classification problems, and ‘Ridge’ for regression problems.
- custom_eval_fun: function, default=None
An evaluation function similar to scikit-learn’s ‘cross_val_score()’. If None, hybparsimony uses ‘cross_val_score(cv=5)’.
- cv: int, cross-validation generator or an iterable, default=None
Determines the cross-validation splitting strategy (see scikit-learn’s ‘cross_val_score()’ function)
- scoring: str, callable, list, tuple, or dict, default=None.
Strategy to evaluate the performance of the cross-validated model on the test set. If None, cv=5 and ‘scoring’ is set to MSE for regression problems, ‘log_loss’ for binary classification problems, and ‘f1_macro’ for multiclass problems (see scikit-learn’s ‘cross_val_score()’ function).
- type_ini_pop: str, {‘randomLHS’, ‘geneticLHS’, ‘improvedLHS’, ‘maximinLHS’, ‘optimumLHS’, ‘random’}, optional
Method to create the first population with the GAparsimony._population function. Possible values: randomLHS, geneticLHS, improvedLHS, maximinLHS, optimumLHS, random. The first five methods correspond to different Latin hypercube sampling strategies for the initial population. The default is ‘improvedLHS’.
- npart: int, default=15
Number of particles in the swarm (population size)
- maxiter: int, default=250
The maximum number of iterations to run before the HYB process is halted.
- early_stop: int, optional
The number of consecutive generations without an improvement of at least ‘tol’ in the ‘best_fitness’ value before the search process is stopped.
- tol: float, default=1e-4
Value defining a significant difference between ‘best_fitness’ values across iterations for early stopping.
- rerank_error: float, default=1e-09
When a value is provided, a second reranking process according to the model complexities is called by the parsimony_rerank function. Its primary objective is to select individuals with high validation cost while maintaining the robustness of a parsimonious model. This function switches the position of two models if the first one is more complex than the latter and no significant difference is found between their fitness values in terms of cost. Thus, if the absolute difference between the validation costs is lower than rerank_error, they are considered similar.
- gamma_crossover: float, default=0.50
In hybparsimony, the percentage of variables to be replaced with GA at each iteration $t$ is selected by a decreasing exponential function that is adjusted by a ‘gamma_crossover’ parameter (see references for more info).
- Lambda: float, default=1.0
PSO parameter (see References)
- c1: float, default=1/2 + math.log(2)
PSO parameter (see References)
- c2: float, default=1/2 + math.log(2)
PSO parameter (see References)
- IW_max: float, default=0.9
PSO parameter (see References)
- IW_min: float, default=0.4
PSO parameter (see References)
- K: int, default=3
PSO parameter (see References)
- best_global_thres: float, default=1.0
Percentage of particles that will be influenced by the best global of their neighbourhoods (otherwise, they will be influenced by the best of the iteration in each neighbourhood).
- particles_to_delete: float, default=None
If particles_to_delete is not None and len(particles_to_delete) < maxiter, i.e. the length of particles_to_delete is lower than the number of iterations, the array is completed with zeros up to the number of iterations.
- pmutation: float, default=0.1
The probability of mutation in a parent chromosome. Mutation usually occurs with a small probability. The default is 0.10.
- feat_mut_thres: float, default=0.1
Probability of a mutated feature in the chromosome being set to one. The default value is 0.10.
- feat_thres: float, default=0.90
Proportion of selected features in the initial population. A high percentage of selected features is recommended for the first generations.
- keep_history: bool, default=False
If True, the results of all particles in each iteration are kept in the ‘history’ attribute.
- seed_ini: int, optional
An integer value containing the random number generator state.
- n_jobs: int, default=1
Number of cores to parallelize the evaluation of the swarm. It should be used with caution because the algorithms used or the ‘cross_validate()’ function used by default to evaluate individuals may also parallelize their internal processes.
- verbose: int, default=0
The level of messages to display. Possible values: 0=silent mode, 1=monitor level, 2=debug level.
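To make the ‘algorithm’ dictionary format described above more concrete, here is a minimal sketch of a custom search space for KernelRidge. The import path of Population and the signature of the complexity function are assumptions based on the description above, not verified API:

# Hedged sketch of a custom 'algorithm' dictionary (names and import paths assumed).
from sklearn.kernel_ridge import KernelRidge
from hybparsimony import HYBparsimony
from hybparsimony.util import Population  # assumed location of the type constants

def krr_complexity(model, nFeatures, **kwargs):
    # Illustrative complexity measure (signature assumed): the number of
    # selected features dominates the complexity value.
    return nFeatures * 1e9

KRR_dict = {
    "estimator": KernelRidge,        # any scikit-learn compatible estimator
    "complexity": krr_complexity,    # function that measures model complexity
    "alpha": {"range": (-5, 3), "type": Population.POWER},    # searched as 10**[-5, 3]
    "gamma": {"range": (1e-3, 1.0), "type": Population.FLOAT},
}

HYBparsimony_model = HYBparsimony(algorithm=KRR_dict, rerank_error=0.001, verbose=1)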
Attributes
- minutes_total: float
Total elapsed time (in minutes).
- history: list
A list with the results of the population at all iterations. ‘history[iter]’ returns a DataFrame with the results of iteration ‘iter’.
- best_model
The best model in the whole optimization process.
- best_score: float
The validation score of the best model.
- best_complexity: float
The complexity of the best model.
- selected_features: list
The name of the selected features for the best model.
- selected_features_bool: list
The selected features for the best model in Boolean form.
- best_model_conf: Chromosome
The parameters and features of the best model in the whole optimization process.
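As a brief illustration (a hedged sketch, not part of the original documentation), these attributes can be inspected on a fitted instance such as the ones created in the examples below, assuming keep_history=True was passed to the constructor:

# Sketch: inspecting the documented attributes of a fitted HYBparsimony instance
print(HYBparsimony_model.best_score)         # validation score of the best model
print(HYBparsimony_model.best_complexity)    # complexity of the best model
print(HYBparsimony_model.selected_features)  # names of the selected features
print(HYBparsimony_model.minutes_total)      # total elapsed time (in minutes)
# with keep_history=True, history[i] returns a DataFrame with the results of iteration i
df_iter5 = HYBparsimony_model.history[5]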
Examples
Usage example for a regression model using the sklearn ‘diabetes’ dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from hybparsimony import HYBparsimony

# Load 'diabetes' dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# Standardize X and y
scaler_X = StandardScaler()
X_train = scaler_X.fit_transform(X_train)
X_test = scaler_X.transform(X_test)
scaler_y = StandardScaler()
y_train = scaler_y.fit_transform(y_train.reshape(-1,1)).flatten()
y_test = scaler_y.transform(y_test.reshape(-1,1)).flatten()

algo = 'KernelRidge'
HYBparsimony_model = HYBparsimony(algorithm=algo,
                                  features=diabetes.feature_names,
                                  rerank_error=0.001,
                                  verbose=1)

# Search the best hyperparameters and features
# (increase 'time_limit' to improve RMSE with time-consuming algorithms)
HYBparsimony_model.fit(X_train, y_train, time_limit=0.20)
Running iteration 0
Best model -> Score = -0.510786 Complexity = 9,017,405,352.5
Iter = 0 -> MeanVal = -0.88274  ValBest = -0.510786  ComplexBest = 9,017,405,352.5 Time(min) = 0.005858

Running iteration 1
Best model -> Score = -0.499005 Complexity = 8,000,032,783.88
Iter = 1 -> MeanVal = -0.659969  ValBest = -0.499005  ComplexBest = 8,000,032,783.88 Time(min) = 0.004452

...
...
...

Running iteration 34
Best model -> Score = -0.489468 Complexity = 8,000,002,255.68
Iter = 34 -> MeanVal = -0.527314  ValBest = -0.489468  ComplexBest = 8,000,002,255.68 Time(min) = 0.007533

Running iteration 35
Best model -> Score = -0.489457 Complexity = 8,000,002,199.12
Iter = 35 -> MeanVal = -0.526294  ValBest = -0.489457  ComplexBest = 8,000,002,199.12 Time(min) = 0.006522

Time limit reached. Stopped.
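After the search stops, the fitted object can be evaluated on the held-out test set. A short follow-up sketch, assuming the fitted object exposes a predict() method analogous to the predict_proba() call used in the classification example below:

# Follow-up sketch: evaluating the best model found (predict() assumed available)
preds = HYBparsimony_model.predict(X_test)
rmse_test = mean_squared_error(y_test, preds) ** 0.5
print(f'Best Model = {HYBparsimony_model.best_model}')
print(f'Selected features: {HYBparsimony_model.selected_features}')
print(f'RMSE test = {round(rmse_test, 6)}')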
Usage example for a classification model using the ‘breast_cancer’ dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import log_loss
from hybparsimony import HYBparsimony

# Load 'breast_cancer' dataset
breast_cancer = load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target
print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Standardize X (some algorithms require that)
scaler_X = StandardScaler()
X_train = scaler_X.fit_transform(X_train)
X_test = scaler_X.transform(X_test)

HYBparsimony_model = HYBparsimony(features=breast_cancer.feature_names,
                                  rerank_error=0.005,
                                  verbose=1)
HYBparsimony_model.fit(X_train, y_train, time_limit=0.50)

# Extract probs of class==1
preds = HYBparsimony_model.predict_proba(X_test)[:,1]
print(f'\n\nBest Model = {HYBparsimony_model.best_model}')
print(f'Selected features:{HYBparsimony_model.selected_features}')
print(f'Complexity = {round(HYBparsimony_model.best_complexity, 2):,}')
print(f'5-CV logloss = {-round(HYBparsimony_model.best_score,6)}')
print(f'logloss test = {round(log_loss(y_test, preds),6)}')
(569, 30)
Detected a binary-class problem. Using 'neg_log_loss' as default scoring function.
Running iteration 0
Best model -> Score = -0.091519 Complexity = 29,000,000,005.11
Iter = 0 -> MeanVal = -0.297448  ValBest = -0.091519  ComplexBest = 29,000,000,005.11 Time(min) = 0.006501

Running iteration 1
Best model -> Score = -0.085673 Complexity = 27,000,000,009.97
Iter = 1 -> MeanVal = -0.117216  ValBest = -0.085673  ComplexBest = 27,000,000,009.97 Time(min) = 0.004273

...
...

Running iteration 102
Best model -> Score = -0.064557 Complexity = 11,000,000,039.47
Iter = 102 -> MeanVal = -0.076314  ValBest = -0.066261  ComplexBest = 9,000,000,047.25 Time(min) = 0.004769

Running iteration 103
Best model -> Score = -0.064557 Complexity = 11,000,000,039.47
Iter = 103 -> MeanVal = -0.086243  ValBest = -0.064995  ComplexBest = 11,000,000,031.2 Time(min) = 0.004591

Time limit reached. Stopped.

Best Model = LogisticRegression(C=5.92705799354935)
Selected features:['mean texture' 'mean concave points' 'radius error' 'area error' 'compactness error' 'worst radius' 'worst perimeter' 'worst area' 'worst smoothness' 'worst concavity' 'worst symmetry']
Complexity = 11,000,000,039.47
5-CV logloss = 0.064557
logloss test = 0.076254
- fit(X, y, time_limit=None)[source]
- Performs the search of accurate parsimonious models by combining feature selection, hyperparameter optimization, and parsimonious model selection (PMS) with data matrix (X) and targets (y).
Parameters
- X: pandas.DataFrame or numpy.array
Training vector.
- y: pandas.DataFrame or numpy.array
Target vector relative to X.
- time_limit: float, default=None
Maximum time to perform the optimization process in minutes.
- hybparsimony.hybparsimony.multinomial(n, pvals, size=None)
Draw samples from a multinomial distribution.
The multinomial distribution is a multivariate generalization of the binomial distribution. Take an experiment with one of p possible outcomes. An example of such an experiment is throwing a dice, where the outcome can be 1 through 6. Each sample drawn from the distribution represents n such experiments. Its values, X_i = [X_0, X_1, ..., X_p], represent the number of times the outcome was i.
Note
New code should use the numpy.random.Generator.multinomial method of a numpy.random.Generator instance instead; please see the NumPy random Quick Start.
Parameters
- n: int
Number of experiments.
- pvals: sequence of floats, length p
Probabilities of each of the p different outcomes. These must sum to 1 (however, the last element is always assumed to account for the remaining probability, as long as sum(pvals[:-1]) <= 1).
- size: int or tuple of ints, optional
Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.
Returns
- out: ndarray
The drawn samples, of shape size, if that was provided. If not, the shape is (N,). In other words, each entry out[i,j,...,:] is an N-dimensional value drawn from the distribution.
See Also
random.Generator.multinomial: which should be used for new code.
Examples
Throw a dice 20 times:
>>> np.random.multinomial(20, [1/6.]*6, size=1)
array([[4, 1, 7, 5, 2, 1]]) # random
It landed 4 times on 1, once on 2, etc.
Now, throw the dice 20 times, and 20 times again:
>>> np.random.multinomial(20, [1/6.]*6, size=2)
array([[3, 4, 3, 3, 4, 3],   # random
       [2, 4, 3, 4, 0, 7]])
For the first run, we threw 3 times 1, 4 times 2, etc. For the second, we threw 2 times 1, 4 times 2, etc.
A loaded die is more likely to land on number 6:
>>> np.random.multinomial(100, [1/7.]*5 + [2/7.])
array([11, 16, 14, 17, 16, 26]) # random
The probability inputs should be normalized. As an implementation detail, the value of the last entry is ignored and assumed to take up any leftover probability mass, but this should not be relied on. A biased coin which has twice as much weight on one side as on the other should be sampled like so:
>>> np.random.multinomial(100, [1.0 / 3, 2.0 / 3])  # RIGHT
array([38, 62]) # random
not like:
>>> np.random.multinomial(100, [1.0, 2.0])  # WRONG
Traceback (most recent call last):
ValueError: pvals < 0, pvals > 1 or pvals contains NaNs
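Following the note above, new code should prefer the Generator API. An equivalent throw of the fair dice with numpy.random.default_rng (a brief illustrative addition; the printed numbers are random):

>>> rng = np.random.default_rng()
>>> rng.multinomial(20, [1/6.]*6)
array([4, 2, 3, 4, 3, 4]) # random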