MABWiser Public API¶
base_mab¶
- Author
FMR LLC
This module defines the abstract base class for contextual multi-armed bandit algorithms.
-
class
mabwiser.base_mab.
BaseMAB
(rng: mabwiser.utils._NumpyRNG, arms: List[NewType.<locals>.new_type], n_jobs: int, backend: str = None)¶ Bases:
object
Abstract base class for multi-armed bandits.
This module is not intended to be used directly, instead it declares the basic skeleton of multi-armed bandits together with a set of parameters that are common to every bandit algorithm.
It declares abstract methods that sub-classes can override to implement specific bandit policies (see the sketch below):
__init__: constructor to initialize the bandit
add_arm: method to add a new arm
fit: method for training
partial_fit: method for online learning
predict_expectations: method to retrieve the expectation of each arm
predict: method for testing to retrieve the best arm based on the policy
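As an illustration only, here is a minimal sketch of a custom bandit built on BaseMAB. It overrides only the public abstract methods listed above; depending on the installed MABWiser version, additional private hooks may also need to be implemented, so treat it as a starting point rather than a complete implementation.
import numpy as np
from mabwiser.base_mab import BaseMAB

class GreedyMAB(BaseMAB):
    """Toy policy: always recommends the arm with the highest observed mean reward."""

    def fit(self, decisions, rewards, contexts=None):
        decisions, rewards = np.asarray(decisions), np.asarray(rewards)
        for arm in self.arms:
            arm_rewards = rewards[decisions == arm]
            # Expected reward is the empirical mean; zero if the arm was never played.
            self.arm_to_expectation[arm] = float(arm_rewards.mean()) if arm_rewards.size else 0.0

    def partial_fit(self, decisions, rewards, contexts=None):
        # Naive online update: refit on the new batch only.
        self.fit(decisions, rewards, contexts)

    def predict(self, contexts=None):
        return max(self.arm_to_expectation, key=self.arm_to_expectation.get)

    def predict_expectations(self, contexts=None):
        return dict(self.arm_to_expectation)

    def _uptake_new_arm(self, arm, binarizer=None, scaler=None):
        # add_arm() already registers the arm with zero expectation; nothing else to do here.
        pass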
-
rng
¶ The random number generator.
- Type
np.random.RandomState
-
arms
¶ The list of all arms.
- Type
List
-
n_jobs
¶ This is used to specify how many concurrent processes/threads should be used for parallelized routines. Default value is set to 1. If set to -1, all CPUs are used. If set to -2, all CPUs but one are used, and so on.
- Type
int
-
backend
¶ Specify a parallelization backend implementation supported in the joblib library. Supported options are:
- “loky” used by default, can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.
- “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.
- “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.
Default value is None. In this case the default backend selected by joblib will be used.
- Type
str, optional
-
arm_to_expectation
¶ The dictionary of arms (keys) to their expected rewards (values).
- Type
Dict[Arm, float]
-
add_arm
(arm: NewType.<locals>.new_type, binarizer: Callable = None, scaler: Callable = None) → NoReturn¶ Introduces a new arm to the bandit.
Adds the new arm with zero expectations and calls the _uptake_new_arm() function of the sub-class.
-
abstract
fit
(decisions: numpy.ndarray, rewards: numpy.ndarray, contexts: Optional[numpy.ndarray] = None) → NoReturn¶ Abstract method.
Fits the multi-armed bandit to the given decision and reward history and corresponding contexts if any.
-
abstract
partial_fit
(decisions: numpy.ndarray, rewards: numpy.ndarray, contexts: Optional[numpy.ndarray] = None) → NoReturn¶ Abstract method.
Updates the multi-armed bandit with the given decision and reward history and corresponding contexts if any.
-
abstract
predict
(contexts: Optional[numpy.ndarray] = None) → NewType.<locals>.new_type¶ Abstract method.
Returns the predicted arm.
-
abstract
predict_expectations
(contexts: Optional[numpy.ndarray] = None) → Dict[NewType.<locals>.new_type, Union[int, float]]¶ Abstract method.
Returns a dictionary from arms (keys) to their expected rewards (values).
mab¶
- Author
FMR LLC
- Version
1.10.0 of June 22, 2020
This module defines the public interface of the MABWiser Library providing access to the following modules:
MAB
LearningPolicy
NeighborhoodPolicy
-
class
mabwiser.mab.
LearningPolicy
¶ Bases:
tuple
-
class
EpsilonGreedy
(epsilon: Union[int, float] = 0.05)¶ Bases:
tuple
Epsilon Greedy Learning Policy.
This policy selects the arm with the highest expected reward with probability 1 - \(\epsilon\), and with probability \(\epsilon\) it selects an arm at random for exploration.
-
epsilon
¶ The probability of selecting a random arm for exploration. Integer or float. Must be between 0 and 1. Default value is 0.05.
- Type
Num
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
-
property
epsilon
¶ Alias for field number 0
-
-
class
LinTS
(alpha: Union[int, float] = 1.0, l2_lambda: Union[int, float] = 1.0, arm_to_scaler: Dict[NewType.<locals>.new_type, Callable] = None)¶ Bases:
tuple
LinTS Learning Policy
For each arm LinTS trains a ridge regression and creates a multivariate normal distribution for the coefficients using the calculated coefficients as the mean and the covariance as:
\[\alpha^{2} (x_i^{T}x_i + \lambda * I_d)^{-1}\]
The normal distribution is randomly sampled to obtain expected coefficients for the ridge regression for each prediction.
\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.
The multivariate normal distribution uses Cholesky decomposition to guarantee deterministic behavior. This method requires that the covariance is a positive definite matrix. To ensure this is the case, alpha and l2_lambda are required to be greater than zero.
-
alpha
¶ The multiplier to determine the degree of exploration. Integer or float. Must be greater than zero. Default value is 1.0.
- Type
Num
-
l2_lambda
¶ The regularization strength. Integer or float. Must be greater than zero. Default value is 1.0.
- Type
Num
-
arm_to_scaler
¶ Standardize context features by arm. Dictionary mapping each arm to a scaler object. It is assumed that the scaler objects are already fit and will only be used to transform context features. Default value is None.
- Type
Dict[Arm, Callable]
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinTS(alpha=0.25))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
-
property
alpha
¶ Alias for field number 0
-
property
arm_to_scaler
¶ Alias for field number 2
-
property
l2_lambda
¶ Alias for field number 1
-
-
class
LinUCB
(alpha: Union[int, float] = 1.0, l2_lambda: Union[int, float] = 1.0, arm_to_scaler: Dict[NewType.<locals>.new_type, Callable] = None)¶ Bases:
tuple
LinUCB Learning Policy.
This policy trains a ridge regression for each arm. Then, given a context, it predicts a regression value and calculates the upper confidence bound of that prediction. The arm with the highest upper bound is selected.
The UCB for each arm is calculated as:
\[UCB = x_i \beta + \alpha \sqrt{(x_i^{T}x_i + \lambda * I_d)^{-1}x_i}\]
where \(\beta\) is the matrix of the ridge regression coefficients, \(\lambda\) is the regularization strength, and \(I_d\) is the \(d \times d\) identity matrix, where \(d\) is the number of features in the context data.
\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.
-
alpha
¶ The parameter to control the exploration. Integer or float. Cannot be negative. Default value is 1.0.
- Type
Num
-
l2_lambda
¶ The regularization strength. Integer or float. Cannot be negative. Default value is 1.0.
- Type
Num
-
arm_to_scaler
¶ Standardize context features by arm. Dictionary mapping each arm to a scaler object. It is assumed that the scaler objects are already fit and will only be used to transform context features. Default value is None.
- Type
Dict[Arm, Callable]
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinUCB(alpha=1.25))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
-
property
alpha
¶ Alias for field number 0
-
property
arm_to_scaler
¶ Alias for field number 2
-
property
l2_lambda
¶ Alias for field number 1
-
-
class
Popularity
¶ Bases:
tuple
Randomized Popularity Learning Policy.
Returns a randomized popular arm for each prediction. The probability of selection for each arm is weighted by their mean reward. It assumes that the rewards are non-negative.
The probability of selection is calculated as:
\[P(arm) = \frac{ \mu_i }{ \Sigma{ \mu } }\]
where \(\mu_i\) is the mean reward for that arm.
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Popularity())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
-
class
Random
¶ Bases:
tuple
Random Learning Policy.
Returns a random arm for each prediction. Each arm is selected uniformly at random.
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Random())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
-
class
Softmax
(tau: Union[int, float] = 1)¶ Bases:
tuple
Softmax Learning Policy.
This policy selects each arm with a probability proportional to its average reward. The selection probability of each arm is calculated with a softmax (logistic) function as:
\[P(arm) = \frac{ e ^ \frac{\mu_i - \max{\mu}}{ \tau } } { \Sigma{e ^ \frac{\mu - \max{\mu}}{ \tau }} }\]
where \(\mu_i\) is the mean reward for that arm and \(\tau\) is the “temperature” that determines the degree of exploration.
-
tau
¶ The temperature to control the exploration. Integer or float. Must be greater than zero. Default value is 1.
- Type
Num
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Softmax(tau=1))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
-
property
tau
¶ Alias for field number 0
-
-
class
ThompsonSampling
(binarizer: Callable = None)¶ Bases:
tuple
Thompson Sampling Learning Policy.
This policy creates a beta distribution for each arm and then randomly samples from these distributions. The arm with the highest sample value is selected.
Notice that rewards must be binary to create beta distributions. If rewards are not binary, see the binarizer function.
-
binarizer
¶ If rewards are not binary, a binarizer function is required. Given an arm decision and its corresponding reward, the binarizer function returns True/False or 0/1 to denote whether the decision counts as a success, i.e., True/1 based on the reward or False/0 otherwise.
The function signature of the binarizer is:
binarize(arm: Arm, reward: Num) -> True/False or 0/1
- Type
Callable
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [1, 1, 1, 0]
>>> mab = MAB(list_of_arms, LearningPolicy.ThompsonSampling())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> arm_to_threshold = {'Arm1': 10, 'Arm2': 10}
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [10, 20, 15, 7]
>>> def binarize(arm, reward):
...     return reward > arm_to_threshold[arm]
>>> mab = MAB(list_of_arms, LearningPolicy.ThompsonSampling(binarizer=binarize))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
-
property
binarizer
¶ Alias for field number 0
-
-
class
UCB1
(alpha: Union[int, float] = 1)¶ Bases:
tuple
Upper Confidence Bound1 Learning Policy.
This policy calculates an upper confidence bound for the mean reward of each arm. It greedily selects the arm with the highest upper confidence bound.
The UCB for each arm is calculated as:
\[UCB = \mu_i + \alpha \times \sqrt{\frac{2 \times \log(N)}{n_i}}\]
where \(\mu_i\) is the mean reward for that arm, \(N\) is the total number of trials, and \(n_i\) is the number of times the arm has been selected.
\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.
-
alpha
¶ The parameter to control the exploration. Integer or float. Cannot be negative. Default value is 1.
- Type
Num
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.UCB1(alpha=1.25))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
-
property
alpha
¶ Alias for field number 0
-
-
class
mabwiser.mab.
MAB
(arms: List[NewType.<locals>.new_type], learning_policy: Union[mabwiser.mab.EpsilonGreedy, mabwiser.mab.Popularity, mabwiser.mab.Random, mabwiser.mab.Softmax, mabwiser.mab.ThompsonSampling, mabwiser.mab.UCB1, mabwiser.mab.LinTS, mabwiser.mab.LinUCB], neighborhood_policy: Union[None, mabwiser.mab.Clusters, mabwiser.mab.KNearest, mabwiser.mab.Radius] = None, seed: int = 123456, n_jobs: int = 1, backend: str = None)¶ Bases:
object
MABWiser: Contextual Multi-Armed Bandit Library
MABWiser is a research library for fast prototyping of multi-armed bandit algorithms. It supports context-free, parametric and non-parametric contextual bandit models.
-
arms
¶ The list of all of the arms available for decisions. Arms can be integers, strings, etc.
- Type
list
-
learning_policy
¶ The learning policy.
- Type
-
neighborhood_policy
¶ The neighborhood policy.
- Type
-
is_contextual
¶ True if contextual policy is given, false otherwise. This is a read-only data field.
- Type
bool
-
seed
¶ The random seed to initialize the internal random number generator. This is a read-only data field.
- Type
numbers.Rational
-
n_jobs
¶ This is used to specify how many concurrent processes/threads should be used for parallelized routines. Default value is set to 1. If set to -1, all CPUs are used. If set to -2, all CPUs but one are used, and so on.
- Type
int
-
backend
¶ Specify a parallelization backend implementation supported in the joblib library. Supported options are:
- “loky” used by default, can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.
- “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.
- “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.
Default value is None. In this case the default backend selected by joblib will be used.
- Type
str, optional
Examples
>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
>>> mab.add_arm('Arm3')
>>> mab.partial_fit(['Arm3'], [30])
>>> mab.predict()
'Arm3'
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1', 'Arm2']
>>> rewards = [20, 17, 25, 9, 11]
>>> contexts = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0], [1, 1, 1]]
>>> contextual_mab = MAB(arms, LearningPolicy.EpsilonGreedy(), NeighborhoodPolicy.KNearest(k=3))
>>> contextual_mab.fit(decisions, rewards, contexts)
>>> contextual_mab.predict([[1, 1, 0], [1, 1, 1], [0, 1, 0]])
['Arm2', 'Arm2', 'Arm2']
>>> contextual_mab.add_arm('Arm3')
>>> contextual_mab.partial_fit(['Arm3'], [30], [[1, 1, 1]])
>>> contextual_mab.predict([[1, 1, 1]])
'Arm3'
-
add_arm
(arm: NewType.<locals>.new_type, binarizer: Callable = None, scaler: Callable = None) → NoReturn¶ Adds an arm to the list of arms.
Incorporates the arm into the learning and neighborhood policies with no training data.
- Parameters
arm (Arm) – The new arm to be added.
binarizer (Callable) – The new binarizer function for Thompson Sampling.
scaler (Callable) – A scaler object from sklearn.preprocessing.
- Returns
- Return type
No return.
- Raises
TypeError – For ThompsonSampling, binarizer must be a callable function.
TypeError – The standard scaler object must have a transform method.
TypeError – The standard scaler object must be fit with calculated mean_ and var_ attributes.
ValueError – A binarizer function was provided but the learning policy is not Thompson Sampling.
ValueError – The arm already exists.
ValueError – The arm is None.
ValueError – The arm is NaN.
ValueError – The arm is Infinity.
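An illustrative sketch of adding an arm together with an updated binarizer when using Thompson Sampling (the threshold values below are arbitrary examples, not library defaults):
>>> from mabwiser.mab import MAB, LearningPolicy
>>> mab = MAB(['Arm1', 'Arm2'], LearningPolicy.ThompsonSampling(binarizer=lambda arm, reward: reward > 10))
>>> mab.fit(['Arm1', 'Arm1', 'Arm2'], [20, 5, 15])
>>> mab.add_arm('Arm3', binarizer=lambda arm, reward: reward > 15)
>>> mab.partial_fit(['Arm3'], [30])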
-
fit
(decisions: Union[List[NewType.<locals>.new_type], numpy.ndarray, pandas.core.series.Series], rewards: Union[List[Union[int, float]], numpy.ndarray, pandas.core.series.Series], contexts: Union[None, List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → NoReturn¶ Fits the multi-armed bandit to the given decisions, their corresponding rewards and contexts, if any.
Validates arguments and raises exceptions in case there are violations.
- This function makes the following assumptions:
each decision corresponds to an arm of the bandit.
there are no None, NaN, or Infinity values in the contexts.
- Parameters
decisions (Union[List[Arm], np.ndarray, pd.Series]) – The decisions that are made.
rewards (Union[List[Num], np.ndarray, pd.Series]) – The rewards that are received corresponding to the decisions.
contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None, i.e., no contexts.
- Returns
- Return type
No return.
- Raises
TypeError – Decisions and rewards are not given as list, numpy array or pandas series.
TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.
ValueError – Length mismatch between decisions, rewards, and contexts.
ValueError – Fitting contexts data when there is no contextual policy.
ValueError – Contextual policy when fitting no contexts data.
ValueError – Rewards contain None, NaN, or Infinity.
-
property
learning_policy
¶ Creates a named tuple of the learning policy based on the implementor.
- Returns
- Return type
The learning policy.
- Raises
NotImplementedError – MAB learning_policy property not implemented for this learning policy.
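A brief illustrative usage of the property (the value shown simply echoes the epsilon passed at construction):
>>> from mabwiser.mab import MAB, LearningPolicy
>>> mab = MAB(['Arm1', 'Arm2'], LearningPolicy.EpsilonGreedy(epsilon=0.25))
>>> mab.learning_policy.epsilon
0.25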
-
property
neighborhood_policy
¶ Creates a named tuple of the neighborhood policy based on the implementor.
- Returns
- Return type
The neighborhood policy.
-
partial_fit
(decisions: Union[List[NewType.<locals>.new_type], numpy.ndarray, pandas.core.series.Series], rewards: Union[List[Union[int, float]], numpy.ndarray, pandas.core.series.Series], contexts: Union[None, List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → NoReturn¶ Updates the multi-armed bandit with the given decisions, their corresponding rewards and contexts, if any.
Validates arguments and raises exceptions in case there are violations.
- This function makes the following assumptions:
each decision corresponds to an arm of the bandit.
there are no None, NaN, or Infinity values in the contexts.
- Parameters
decisions (Union[List[Arm], np.ndarray, pd.Series]) – The decisions that are made.
rewards (Union[List[Num], np.ndarray, pd.Series]) – The rewards that are received corresponding to the decisions.
contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None, i.e., no contexts.
- Returns
- Return type
No return.
- Raises
TypeError – Decisions and rewards are not given as list, numpy array or pandas series.
TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.
ValueError – Length mismatch between decisions, rewards, and contexts.
ValueError – Fitting contexts data when there is no contextual policy.
ValueError – Contextual policy when fitting no contexts data.
ValueError – Rewards contain None, NaN, or Infinity.
-
predict
(contexts: Union[None, List[Union[int, float]], List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → Union[NewType.<locals>.new_type, List[NewType.<locals>.new_type]]¶ Returns the “best” arm (or arms list if multiple contexts are given) based on the expected reward.
The definition of the best depends on the specified learning policy. Contextual learning policies and neighborhood policies require contexts data in training. In testing, they return the best arm given new context(s).
- Parameters
contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None. Contexts should be None for context-free bandits and are required for contextual bandits.
- Returns
- Return type
The recommended arm or recommended arms list.
- Raises
TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.
ValueError – Predicting with contexts data when there is no contextual policy.
ValueError – Contextual policy when predicting with no contexts data.
-
predict_expectations
(contexts: Union[None, List[Union[int, float]], List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → Union[Dict[NewType.<locals>.new_type, Union[int, float]], List[Dict[NewType.<locals>.new_type, Union[int, float]]]]¶ Returns a dictionary of arms (key) to their expected rewards (value).
Contextual learning policies and neighborhood policies require contexts data for expected rewards.
- Parameters
contexts (Union[None, List[Num], List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context for the expected rewards. Default value is None. Contexts should be None for context-free bandits and are required for contextual bandits.
- Returns
- Return type
The dictionary of arms (key) to their expected rewards (value), or a list of such dictionaries.
- Raises
TypeError – Contexts is not given as None, list, numpy array or pandas data frames.
ValueError – Predicting with contexts data when there is no contextual policy.
ValueError – Contextual policy when predicting with no contexts data.
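An illustrative usage sketch. The exact expectation values depend on the training data and the learning policy, so only the returned keys are checked here:
>>> from mabwiser.mab import MAB, LearningPolicy
>>> mab = MAB(['Arm1', 'Arm2'], LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(['Arm1', 'Arm1', 'Arm2', 'Arm1'], [20, 17, 25, 9])
>>> expectations = mab.predict_expectations()
>>> sorted(expectations.keys())
['Arm1', 'Arm2']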
-
-
class
mabwiser.mab.
NeighborhoodPolicy
¶ Bases:
tuple
-
class
Clusters
(n_clusters: Union[int, float] = 2, is_minibatch: bool = False)¶ Bases:
tuple
Clusters Neighborhood Policy.
Clusters is a k-means clustering approach that uses the observations from the closest cluster with a learning policy. Supports KMeans and MiniBatchKMeans.
-
n_clusters
¶ The number of clusters. Integer. Must be at least 2. Default value is 2.
- Type
Num
-
is_minibatch
¶ Boolean flag to use MiniBatchKMeans or not. Default value is False.
- Type
bool
Example
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.Clusters(3))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
-
property
is_minibatch
¶ Alias for field number 1
-
property
n_clusters
¶ Alias for field number 0
-
-
class
KNearest
(k: int = 1, metric: str = 'euclidean')¶ Bases:
tuple
KNearest Neighborhood Policy.
KNearest is a nearest neighbors approach that selects the k-nearest observations to be used with a learning policy.
-
k
¶ The number of neighbors to select. Integer value. Must be greater than zero. Default value is 1.
- Type
int
-
metric
¶ The metric used to calculate distance. Accepts any of the metrics supported by scipy.spatial.distance.cdist. Default value is Euclidean distance.
- Type
str
Example
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.KNearest(2, "euclidean"))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[1, 1]
-
property
k
¶ Alias for field number 0
-
property
metric
¶ Alias for field number 1
-
-
class
Radius
(radius: Union[int, float] = 0.05, metric: str = 'euclidean', no_nhood_prob_of_arm: Optional[List] = None)¶ Bases:
tuple
Radius Neighborhood Policy.
Radius is a nearest neighborhood approach that selects the observations within a given radius to be used with a learning policy.
-
radius
¶ The maximum distance within which to select observations. Integer or Float. Must be greater than zero. Default value is 0.05.
- Type
Num
-
metric
¶ The metric used to calculate distance. Accepts any of the metrics supported by scipy.spatial.distance.cdist. Default value is Euclidean distance.
- Type
str
-
no_nhood_prob_of_arm
¶ The probabilities associated with each arm, used when no observations fall within the given radius. If not given, a uniform random distribution over all arms is assumed. The probabilities should sum up to 1.
- Type
None or List
Example
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.Radius(2, "euclidean"))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
-
property
metric
¶ Alias for field number 1
-
property
no_nhood_prob_of_arm
¶ Alias for field number 2
-
property
radius
¶ Alias for field number 0
-
-
simulator¶
- Author
FMR LLC
- Version
1.10.0 of June 22, 2020
This module provides a simulation utility for comparing algorithms and hyper-parameter tuning.
-
class
mabwiser.simulator.
Simulator
(bandits: List[tuple], decisions: Union[List[NewType.<locals>.new_type], numpy.ndarray, pandas.core.series.Series], rewards: Union[List[Union[int, float]], numpy.ndarray, pandas.core.series.Series], contexts: Union[None, List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None, scaler: callable = None, test_size: float = 0.3, is_ordered: bool = False, batch_size: int = 0, evaluator: callable = <function default_evaluator>, seed: int = 123456, is_quick: bool = False, log_file: str = None, log_format: str = '%(asctime)s %(levelname)s %(message)s')¶ Bases:
object
Multi-Armed Bandit Simulator.
This utility runs a simulation using historic data and a collection of multi-armed bandits from the MABWiser library, or custom bandits that extend the BaseMAB class in MABWiser.
It can be used to run a simple simulation with a single bandit or to compare multiple bandits for policy selection, hyper-parameter tuning, etc.
Nearest Neighbor bandits that use the default Radius and KNearest implementations from MABWiser are converted to custom versions that share distance calculations to speed up the simulation. These custom versions also track statistics about the neighborhoods that can be used in evaluation.
The results can be accessed as the arms_to_stats, model_to_predictions, model_to_confusion_matrices, and models_to_evaluations properties.
When using partial fitting, an additional confusion matrix is calculated for all predictions after all of the batches are processed.
A log of the simulation tracks the experiment progress.
-
bandits
¶ A list of tuples of the name of each bandit and the bandit object.
- Type
list[(str, bandit)]
-
decisions
¶ The complete decision history to be used in train and test.
- Type
array
-
rewards
¶ The complete reward history to be used in train and test.
- Type
array
-
contexts
¶ The complete context history to be used in train and test.
- Type
array
-
scaler
¶ A scaler object from sklearn.preprocessing.
- Type
scaler
-
test_size
¶ The size of the test set.
- Type
float
-
is_ordered
¶ Whether to use a chronological division for the train-test split. If false, uses sklearn’s train_test_split.
- Type
bool
-
batch_size
¶ The size of each batch for online learning.
- Type
int
-
evaluator
¶ The function for evaluating the bandits. Values are stored in bandit_to_arm_to_stats_avg. Must have the function signature function(arms_to_stats_train: dictionary, predictions: list, decisions: np.ndarray, rewards: np.ndarray, metric: str).
- Type
callable
-
is_quick
¶ Flag to skip neighborhood statistics.
- Type
bool
-
logger
¶ The logger object.
- Type
Logger
-
arms
¶ The list of arms used by the bandits.
- Type
list
-
arm_to_stats_total
¶ Descriptive statistics for the complete data set.
- Type
dict
-
arm_to_stats_train
¶ Descriptive statistics for the training data.
- Type
dict
-
arm_to_stats_test
¶ Descriptive statistics for the test data.
- Type
dict
-
bandit_to_arm_to_stats_avg
¶ Descriptive statistics for the predictions made by each bandit based on means from training data.
- Type
dict
-
bandit_to_arm_to_stats_min
¶ Descriptive statistics for the predictions made by each bandit based on minimums from training data.
- Type
dict
-
bandit_to_arm_to_stats_max
¶ Descriptive statistics for the predictions made by each bandit based on maximums from training data.
- Type
dict
-
bandit_to_confusion_matrices
¶ The confusion matrices for each bandit.
- Type
dict
-
bandit_to_predictions
¶ The prediction for each item in the test set for each bandit.
- Type
dict
-
bandit_to_expectations
¶ The arm_to_expectations for each item in the test set for each bandit. For context-free bandits, there is a single dictionary for each batch.
- Type
dict
-
bandit_to_neighborhood_size
¶ The number of neighbors in each neighborhood for each row in the test set. Calculated when using a Radius neighborhood policy, or a custom class that inherits from it. Not calculated when is_quick is True.
- Type
dict
-
bandit_to_arm_to_stats_neighborhoods
¶ The arm_to_stats for each neighborhood for each row in the test set. Calculated when using Radius or KNearest, or a custom class that inherits from one of them. Not calculated when is_quick is True.
- Type
dict
-
test_indices
¶ The indices of the rows in the test set. If input was not zero-indexed, these will reflect their position in the input rather than actual index.
- Type
list
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> from mabwiser.simulator import Simulator
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab1 = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab2 = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.30), seed=123456)
>>> bandits = [('EG 25%', mab1), ('EG 30%', mab2)]
>>> offline_sim = Simulator(bandits, decisions, rewards, test_size=0.5, batch_size=0)
>>> offline_sim.run()
>>> offline_sim.bandit_to_arm_to_stats_avg['EG 30%']['Arm1']
{'count': 1, 'sum': 9, 'min': 9, 'max': 9, 'mean': 9.0, 'std': 0.0}
-
get_arm_stats
(decisions: numpy.ndarray, rewards: numpy.ndarray) → dict¶ Calculates descriptive statistics for each arm in the provided data set.
- Parameters
decisions (np.ndarray) – The decisions to filter the rewards.
rewards (np.ndarray) – The rewards to get statistics about.
- Returns
Arm_to_stats dictionary.
Dictionary has the format {arm {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}}
-
static
get_stats
(rewards: numpy.ndarray) → dict¶ Calculates descriptive statistics for the given array of rewards.
- Parameters
rewards (np.ndarray) – Array of rewards for a single arm.
- Returns
A dictionary of descriptive statistics.
Dictionary has the format {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}
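An illustrative call to the static helper; the output keys follow the format described above, and the numeric values are simply the statistics of the given rewards array:
>>> import numpy as np
>>> from mabwiser.simulator import Simulator
>>> stats = Simulator.get_stats(np.array([10, 20, 30]))
>>> sorted(stats.keys())
['count', 'max', 'mean', 'min', 'std', 'sum']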
-
plot
(metric: str = 'avg', is_per_arm: bool = False) → NoReturn¶ Generates a plot of the cumulative sum of the rewards for each bandit. Simulation must be run before calling this method.
- Parameters
metric (str) – The bandit_to_arm_to_stats to use to generate the plot. Must be ‘avg’, ‘min’, or ‘max’.
is_per_arm (bool) – Whether to plot each arm separately or use an aggregate statistic.
- Raises
AssertionError – Descriptive statistics for predictions are missing.
TypeError – Metric must be a string.
TypeError – The per_arm flag must be a boolean.
ValueError – The metric must be one of avg, min or max.
- Returns
- Return type
None
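A short usage sketch, assuming a Simulator has already been run as in the class-level example above:
>>> offline_sim.plot(metric='avg', is_per_arm=True)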
-
run
() → NoReturn¶ Runs the simulator.
Runs a simulation concurrently for all bandits in the bandits list.
- Returns
- Return type
None
-
-
mabwiser.simulator.
default_evaluator
(arms: List[NewType.<locals>.new_type], decisions: numpy.ndarray, rewards: numpy.ndarray, predictions: List[NewType.<locals>.new_type], arm_to_stats: dict, stat: str, start_index: int, nn: bool = False) → dict¶ Default evaluation function.
Calculates predicted rewards for the test batch based on predicted arms. When the predicted arm is the same as the historic decision, the historic reward is used. When the predicted arm is different, the mean, min or max reward from the training data is used. If using Radius or KNearest neighborhood policy, the statistics from the neighborhood are used instead of the entire training set.
The simulator supports custom evaluation functions, but they must have this signature to work with the simulation pipeline.
- Parameters
arms (list) – The list of arms.
decisions (np.ndarray) – The historic decisions for the batch being evaluated.
rewards (np.ndarray) – The historic rewards for the batch being evaluated.
predictions (list) – The predictions for the batch being evaluated.
arm_to_stats (dict) – The dictionary of descriptive statistics for each arm to use in evaluation.
stat (str) – Which metric from arm_to_stats to use. Takes the values ‘min’, ‘max’, ‘mean’.
start_index (int) – The index of the first row in the batch. For offline simulations it is 0. For online simulations it is batch size * batch number. Used to select the correct index from arm_to_stats if there are separate entries for each row in the test set.
nn (bool) – Whether the results are from one of the simulator custom nearest neighbors implementations.
- Returns
An arm_to_stats dictionary for the predictions in the batch.
Dictionary has the format {arm {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}}
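As an illustration of the required signature, here is a hedged sketch of a custom evaluator that only summarizes rows where the prediction matches the historic decision. It is a simplified stand-in written under the documented signature, not the library's own evaluation logic.
import numpy as np

def match_only_evaluator(arms, decisions, rewards, predictions, arm_to_stats, stat, start_index, nn=False):
    # For each arm, summarize the historic rewards of the rows where the bandit
    # predicted the same arm that was actually played in the data.
    results = {}
    rewards = np.asarray(rewards)
    for arm in arms:
        mask = np.array([d == p == arm for d, p in zip(decisions, predictions)], dtype=bool)
        arm_rewards = rewards[mask]
        if arm_rewards.size:
            results[arm] = {'count': int(arm_rewards.size), 'sum': float(arm_rewards.sum()),
                            'min': float(arm_rewards.min()), 'max': float(arm_rewards.max()),
                            'mean': float(arm_rewards.mean()), 'std': float(arm_rewards.std())}
        else:
            results[arm] = {'count': 0, 'sum': 0, 'min': 0, 'max': 0, 'mean': 0, 'std': 0}
    return results

# Hypothetical usage: pass it to the Simulator via the evaluator argument, e.g.
# sim = Simulator(bandits, decisions, rewards, evaluator=match_only_evaluator)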
utils¶
- Author
FMR LLC
This module provides a number of constants and helper functions.
-
mabwiser.utils.
Arm
(x)¶ Arm type is defined as integer, float, or string.
-
class
mabwiser.utils.
Constants
¶ Bases:
tuple
Constant values used by the modules.
-
default_seed
= 123456¶ The default random seed.
-
distance_metrics
= ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean']¶ The distance metrics supported by neighborhood policies.
-
-
mabwiser.utils.
Num
¶ Num type is defined as integer or float.
alias of Union[int, float]
-
mabwiser.utils.
argmax
(dictionary: Dict[NewType.<locals>.new_type, Union[int, float]]) → NewType.<locals>.new_type¶ Returns the first key with the maximum value.
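A small illustrative example:
>>> from mabwiser.utils import argmax
>>> argmax({'Arm1': 0.2, 'Arm2': 0.9, 'Arm3': 0.5})
'Arm2'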
-
mabwiser.utils.
check_false
(expression: bool, exception: Exception) → NoReturn¶ Checks that given expression is false, otherwise raises the given exception.
-
mabwiser.utils.
check_true
(expression: bool, exception: Exception) → NoReturn¶ Checks that given expression is true, otherwise raises the given exception.
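A small illustrative example of both helpers; each call passes silently because its expression holds, and would otherwise raise the given exception:
>>> from mabwiser.utils import check_false, check_true
>>> check_true(0 < 0.05 <= 1, ValueError("Epsilon must be between 0 and 1."))
>>> check_false(isinstance(0.05, str), TypeError("Epsilon must be numeric."))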
-
mabwiser.utils.
create_rng
(seed: int) → mabwiser.utils._BaseRNG¶ Returns an rng object
- Parameters
seed (int) – the seed of the rng
- Returns
out – An rng object that implements the base rng class
- Return type
_BaseRNG
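An illustrative call (the concrete rng implementation returned may vary by MABWiser version):
>>> from mabwiser.utils import create_rng
>>> rng = create_rng(seed=123456)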
-
mabwiser.utils.
reset
(dictionary: Dict, value) → NoReturn¶ Maps every key to the given value.