Feature Selectors
FractionVariableSelector
class bcselector.variable_selection.FractionVariableSelector
Bases: bcselector.variable_selection._VariableSelector
Ranks all features in the dataset with the fraction cost filter method.
Methods Summary
fit(data, target_variable, costs, r[, …]) – Ranks all features in the dataset with the fraction cost filter method.
score(model, scoring_function) – Scores the selected features step by step with scoring_function.
plot_scores([budget, …]) – Plots the score of each iteration of the algorithm.
get_cost_results() – Getter to obtain cost-sensitive results.
get_no_cost_results() – Getter to obtain non-cost-sensitive results.
Methods Documentation
fit(data, target_variable, costs, r, j_criterion_func='cife', number_of_features=None, budget=None, stop_budget=False, **kwargs)
Ranks all features in the dataset with the fraction cost filter method.
- Parameters
data (np.ndarray or pd.DataFrame) – Matrix or data frame whose features we want to rank.
target_variable (np.ndarray or pd.core.series.Series) – Vector or series of the target variable. The number of rows in data must equal the length of target_variable.
costs (list or dict) – Costs of the features. Must have the same length as the number of columns in data. When data is an np.ndarray, provide costs as a list of floats or ints. When data is a pd.DataFrame, provide costs as a list of floats or ints, or as a dict {'col_1': cost_1, …}.
r (int or float) – Cost scaling parameter. The higher r is, the higher the impact of the cost on selection.
j_criterion_func (str) – Method of approximating the conditional mutual information. Must be one of ['mim', 'mifs', 'mrmr', 'jmi', 'cife']. All methods can be listed by running:
>>> from bcselector.information_theory import j_criterion_approximations
>>> j_criterion_approximations.__all__
number_of_features (int) – Optional constraint on the number of features selected.
budget (int or float) – Optional constraint on the total cost of the selected features.
stop_budget (bool) – Optional argument, TODO – this argument is to be deleted.
**kwargs – Arguments passed to the fraction_find_best_feature() function and then to j_criterion_func.
Examples
>>> from bcselector.variable_selection import FractionVariableSelector
>>> fvs = FractionVariableSelector()
>>> fvs.fit(X, y, costs, r=1, j_criterion_func='mim')
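The fraction criterion itself is implemented in fraction_find_best_feature(); as a rough intuition for what r does, the sketch below assumes the criterion divides a feature's J-criterion relevance by its cost raised to r. The function name rank_by_fraction and the toy relevance values are hypothetical, not part of the bcselector API.

```python
# Hypothetical sketch of a fraction cost filter: rank features by
# relevance / cost**r, so a higher r penalizes expensive features more.
# `relevances` stands in for J-criterion values (e.g. 'mim' scores).

def rank_by_fraction(relevances, costs, r):
    """Return feature indexes sorted by relevance / cost**r, best first."""
    scores = [rel / (cost ** r) for rel, cost in zip(relevances, costs)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

relevances = [0.9, 0.8, 0.3]   # toy J-criterion values
costs = [10.0, 1.0, 1.0]       # feature acquisition costs

# With r=0 cost is ignored; the most relevant feature wins.
print(rank_by_fraction(relevances, costs, r=0))   # -> [0, 1, 2]
# With r=1 the cheap, nearly-as-relevant feature overtakes it.
print(rank_by_fraction(relevances, costs, r=1))   # -> [1, 2, 0]
```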
score(model, scoring_function)
Scores the selected features step by step with scoring_function. In each step one more feature is added. Users could do this themselves, but the score method ensures that every step is evaluated on the same train set, and it is much easier than writing the loop by hand.
- Parameters
model (sklearn.base.ClassifierMixin) – Any classifier following the sklearn API.
scoring_function (function) – Classification metric function from sklearn. Must be one of ['roc_auc_score']. For more scoring functions, open a GH issue.
- Returns
total_scores (list) – List of scoring_function scores for each step. One step is one feature, in algorithm ranking order.
total_costs (list) – List of accumulated costs for each step. One step is one feature, in algorithm ranking order.
Examples
>>> from bcselector.variable_selection import FractionVariableSelector
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.linear_model import LogisticRegression
>>> fvs = FractionVariableSelector()
>>> fvs.fit(X, y, costs, r=1, j_criterion_func='mim')
>>> model = LogisticRegression()
>>> fvs.score(model, roc_auc_score)
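The step-by-step loop that score presumably runs can be sketched in plain Python. The helper stepwise_scores and the toy evaluate callable below are hypothetical stand-ins for the real method, which fits the given sklearn model on the stored train set at each step.

```python
# Sketch of the step-by-step scoring loop: at step k the model would be
# fit on the first k features of the ranking and scored once.
# `evaluate` stands in for fitting a classifier and applying the metric.

def stepwise_scores(ranking, costs, evaluate):
    """Return (total_scores, total_costs) for each prefix of the ranking."""
    total_scores, total_costs = [], []
    running_cost = 0.0
    for k in range(1, len(ranking) + 1):
        subset = ranking[:k]                  # first k features, ranking order
        running_cost += costs[ranking[k - 1]] # cost of the feature just added
        total_scores.append(evaluate(subset))
        total_costs.append(running_cost)
    return total_scores, total_costs

# Toy evaluator: pretend the score grows with the number of features used.
scores, accum = stepwise_scores([1, 2, 0], [10.0, 1.0, 1.0],
                                evaluate=lambda subset: 0.5 + 0.25 * len(subset))
print(scores)  # -> [0.75, 1.0, 1.25]
print(accum)   # -> [1.0, 2.0, 12.0]
```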
plot_scores(budget=None, compare_no_cost_method=False, savefig=False, annotate=False, annotate_box=False, figsize=(12, 8), bbox_pos=(0.72, 0.6), plot_title=None, x_axis_title=None, y_axis_title=None, **kwargs)
Plots the score of each iteration of the algorithm.
- Parameters
budget (int or float) – Budget to be plotted on the figure as a vertical line.
compare_no_cost_method (bool = False) – Plot the no-cost curve on the same figure.
savefig (bool) – Save the figure with scores; savefig arguments are passed via kwargs.
annotate (bool) – Annotate the plot with feature indexes.
annotate_box (bool) – Plot a box with feature data: id, name and cost.
figsize (tuple) – Figure size.
bbox_pos (tuple) – Position of the box with feature data.
plot_title (str) – Title of the plot.
x_axis_title (str) – Label of the x axis.
y_axis_title (str) – Label of the y axis.
**kwargs – Arguments passed to matplotlib's savefig().
get_cost_results()
Getter to obtain cost-sensitive results.
- Returns
variables_selected_order (list) – Indexes of the selected features.
cost_variables_selected_order (list) – Costs of the selected features, in the same order as variables_selected_order.
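Assuming the getter returns the two plain lists documented above, pairing them gives a quick cost audit of the ranking. The cumulative-sum post-processing below is a hypothetical convenience, not part of bcselector; the literal lists stand in for the getter's output.

```python
from itertools import accumulate

# Hypothetical post-processing of get_cost_results() output: pair each
# selected feature index with its cost and the running total spent so far.
variables_selected_order = [3, 0, 2]             # stand-in for returned indexes
cost_variables_selected_order = [1.0, 2.5, 0.5]  # stand-in for returned costs

running = list(accumulate(cost_variables_selected_order))
for idx, cost, total in zip(variables_selected_order,
                            cost_variables_selected_order, running):
    print(f"feature {idx}: cost={cost}, cumulative={total}")
# cumulative costs after each step: [1.0, 3.5, 4.0]
```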
get_no_cost_results()
Getter to obtain non-cost-sensitive results.
- Returns
variables_selected_order (list) – Indexes of the selected features.
cost_variables_selected_order (list) – Costs of the selected features, in the same order as variables_selected_order.
DiffVariableSelector
class bcselector.variable_selection.DiffVariableSelector
Bases: bcselector.variable_selection._VariableSelector
Ranks all features in the dataset with the difference cost filter method.
Methods Summary
fit(data, target_variable, costs, lamb[, …]) – Ranks all features in the dataset with the difference cost filter method.
score(model, scoring_function) – Scores the selected features step by step with scoring_function.
plot_scores([budget, …]) – Plots the score of each iteration of the algorithm.
get_cost_results() – Getter to obtain cost-sensitive results.
get_no_cost_results() – Getter to obtain non-cost-sensitive results.
Methods Documentation
fit(data, target_variable, costs, lamb, j_criterion_func='cife', number_of_features=None, budget=None, stop_budget=False, **kwargs)
Ranks all features in the dataset with the difference cost filter method.
- Parameters
data (np.ndarray or pd.DataFrame) – Matrix or data frame whose features we want to rank.
target_variable (np.ndarray or pd.core.series.Series) – Vector or series of the target variable. The number of rows in data must equal the length of target_variable.
costs (list or dict) – Costs of the features. Must have the same length as the number of columns in data. When data is an np.ndarray, provide costs as a list of floats or ints. When data is a pd.DataFrame, provide costs as a list of floats or ints, or as a dict {'col_1': cost_1, …}.
lamb (int or float) – Cost scaling parameter. The higher lamb is, the higher the impact of the cost on selection.
j_criterion_func (str) – Method of approximating the conditional mutual information. Must be one of ['mim', 'mifs', 'mrmr', 'jmi', 'cife']. All methods can be listed by running:
>>> from bcselector.information_theory import j_criterion_approximations
>>> j_criterion_approximations.__all__
number_of_features (int) – Optional constraint on the number of features selected.
budget (int or float) – Optional constraint on the total cost of the selected features.
stop_budget (bool) – Optional argument, TODO – this argument is to be deleted.
**kwargs – Arguments passed to the difference_find_best_feature() function and then to j_criterion_func.
Examples
>>> from bcselector.variable_selection import DiffVariableSelector
>>> dvs = DiffVariableSelector()
>>> dvs.fit(X, y, costs, lamb=1, j_criterion_func='mim')
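The difference criterion itself lives in difference_find_best_feature(); as a rough intuition for what lamb does, the sketch below assumes the criterion subtracts lamb times the feature's cost from its J-criterion relevance, a common form in cost-sensitive filter selection. The function name rank_by_difference and the toy values are hypothetical, not the library's exact code.

```python
# Hypothetical sketch of a difference cost filter: rank features by
# relevance - lamb * cost, so a larger lamb penalizes expensive features more.

def rank_by_difference(relevances, costs, lamb):
    """Return feature indexes sorted by relevance - lamb * cost, best first."""
    scores = [rel - lamb * cost for rel, cost in zip(relevances, costs)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

relevances = [0.9, 0.8, 0.3]  # toy J-criterion values
costs = [10.0, 1.0, 1.0]      # feature acquisition costs

# With lamb=0 cost is ignored; the most relevant feature wins.
print(rank_by_difference(relevances, costs, lamb=0))    # -> [0, 1, 2]
# With lamb=0.1 the expensive feature drops to last place.
print(rank_by_difference(relevances, costs, lamb=0.1))  # -> [1, 2, 0]
```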
score(model, scoring_function)
Scores the selected features step by step with scoring_function. In each step one more feature is added. Users could do this themselves, but the score method ensures that every step is evaluated on the same train set, and it is much easier than writing the loop by hand.
- Parameters
model (sklearn.base.ClassifierMixin) – Any classifier following the sklearn API.
scoring_function (function) – Classification metric function from sklearn. Must be one of ['roc_auc_score']. For more scoring functions, open a GH issue.
- Returns
total_scores (list) – List of scoring_function scores for each step. One step is one feature, in algorithm ranking order.
total_costs (list) – List of accumulated costs for each step. One step is one feature, in algorithm ranking order.
Examples
>>> from bcselector.variable_selection import DiffVariableSelector
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.linear_model import LogisticRegression
>>> dvs = DiffVariableSelector()
>>> dvs.fit(X, y, costs, lamb=1, j_criterion_func='mim')
>>> model = LogisticRegression()
>>> dvs.score(model, roc_auc_score)
plot_scores(budget=None, compare_no_cost_method=False, savefig=False, annotate=False, annotate_box=False, figsize=(12, 8), bbox_pos=(0.72, 0.6), plot_title=None, x_axis_title=None, y_axis_title=None, **kwargs)
Plots the score of each iteration of the algorithm.
- Parameters
budget (int or float) – Budget to be plotted on the figure as a vertical line.
compare_no_cost_method (bool = False) – Plot the no-cost curve on the same figure.
savefig (bool) – Save the figure with scores; savefig arguments are passed via kwargs.
annotate (bool) – Annotate the plot with feature indexes.
annotate_box (bool) – Plot a box with feature data: id, name and cost.
figsize (tuple) – Figure size.
bbox_pos (tuple) – Position of the box with feature data.
plot_title (str) – Title of the plot.
x_axis_title (str) – Label of the x axis.
y_axis_title (str) – Label of the y axis.
**kwargs – Arguments passed to matplotlib's savefig().
get_cost_results()
Getter to obtain cost-sensitive results.
- Returns
variables_selected_order (list) – Indexes of the selected features.
cost_variables_selected_order (list) – Costs of the selected features, in the same order as variables_selected_order.
get_no_cost_results()
Getter to obtain non-cost-sensitive results.
- Returns
variables_selected_order (list) – Indexes of the selected features.
cost_variables_selected_order (list) – Costs of the selected features, in the same order as variables_selected_order.