MABWiser Public API¶
base_mab¶
Author: FMR LLC
Email: mabwiser@fmr.com
This module defines the abstract base class for contextual multi-armed bandit algorithms.
-
class
mabwiser.base_mab.
BaseMAB
(rng: numpy.random.mtrand.RandomState, arms: List[NewType.<locals>.new_type], n_jobs: int, backend: str = None)¶ Bases:
object
Abstract base class for multi-armed bandits.
This module is not intended to be used directly, instead it declares the basic skeleton of multi-armed bandits together with a set of parameters that are common to every bandit algorithm.
It declares abstract methods that sub-classes can override to implement specific bandit policies using:
- __init__: constructor to initialize the bandit
- add_arm: method to add a new arm
- fit: method for training
- partial_fit: method for online learning
- predict_expectations: method to retrieve the expectation of each arm
- predict: method for testing to retrieve the best arm based on the policy
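To make the skeleton concrete, the following is a minimal sketch of a hypothetical greedy sub-class written against the interface listed above. It is illustrative only; the exact set of abstract methods (including private hooks such as _uptake_new_arm) should be checked against the BaseMAB source before sub-classing.
# Hypothetical sketch only; not part of MABWiser.
from typing import Dict, List, Optional
import numpy as np

class GreedyMAB:  # in practice this would inherit from mabwiser.base_mab.BaseMAB
    def __init__(self, rng: np.random.RandomState, arms: List, n_jobs: int = 1, backend: str = None):
        self.rng, self.arms, self.n_jobs, self.backend = rng, arms, n_jobs, backend
        self.arm_to_expectation: Dict = dict.fromkeys(arms, 0.0)

    def fit(self, decisions: np.ndarray, rewards: np.ndarray, contexts: Optional[np.ndarray] = None) -> None:
        for arm in self.arms:
            arm_rewards = rewards[decisions == arm]
            self.arm_to_expectation[arm] = float(arm_rewards.mean()) if arm_rewards.size else 0.0

    def partial_fit(self, decisions: np.ndarray, rewards: np.ndarray, contexts: Optional[np.ndarray] = None) -> None:
        self.fit(decisions, rewards, contexts)  # naive update: refit on the incoming batch only

    def predict_expectations(self, contexts: Optional[np.ndarray] = None) -> Dict:
        return dict(self.arm_to_expectation)

    def predict(self, contexts: Optional[np.ndarray] = None):
        return max(self.arm_to_expectation, key=self.arm_to_expectation.get)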
-
rng
¶ The random number generator.
Type: np.random.RandomState
-
arms
¶ The list of all arms.
Type: List
-
n_jobs
¶ This is used to specify how many concurrent processes/threads should be used for parallelized routines. Default value is set to 1. If set to -1, all CPUs are used. If set to -2, all CPUs but one are used, and so on.
Type: int
-
backend
¶ Specify a parallelization backend implementation supported in the joblib library. Supported options are:
- “loky”: used by default; can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.
- “multiprocessing”: previous process-based backend based on multiprocessing.Pool. Less robust than loky.
- “threading”: a very low-overhead backend, but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.
Default value is None. In this case the default backend selected by joblib will be used.
Type: str, optional
-
arm_to_expectation
¶ The dictionary of arms (keys) to their expected rewards (values).
Type: Dict[Arm, float]
-
add_arm
(arm: NewType.<locals>.new_type, binarizer: Callable = None, scaler: Callable = None) → NoReturn¶ Introduces a new arm to the bandit.
Adds the new arm with zero expectations and calls the _uptake_new_arm() function of the sub-class.
-
fit
(decisions: numpy.ndarray, rewards: numpy.ndarray, contexts: Optional[numpy.ndarray] = None) → NoReturn¶ Abstract method.
Fits the multi-armed bandit to the given decision and reward history and corresponding contexts if any.
-
partial_fit
(decisions: numpy.ndarray, rewards: numpy.ndarray, contexts: Optional[numpy.ndarray] = None) → NoReturn¶ Abstract method.
Updates the multi-armed bandit with the given decision and reward history and corresponding contexts if any.
-
predict
(contexts: Optional[numpy.ndarray] = None) → NewType.<locals>.new_type¶ Abstract method.
Returns the predicted arm.
-
predict_expectations
(contexts: Optional[numpy.ndarray] = None) → Dict[NewType.<locals>.new_type, Union[int, float]]¶ Abstract method.
Returns a dictionary from arms (keys) to their expected rewards (values).
mab¶
Author: FMR LLC
Email: mabwiser@fmr.com
Version: 1.9.1 of May 27, 2020
This module defines the public interface of the MABWiser Library providing access to the following modules:
MAB
LearningPolicy
NeighborhoodPolicy
-
class
mabwiser.mab.
LearningPolicy
¶ Bases:
tuple
-
class
EpsilonGreedy
¶ Bases:
tuple
Epsilon Greedy Learning Policy.
This policy selects the arm with the highest expected reward with probability 1 - \(\epsilon\), and with probability \(\epsilon\) it selects an arm at random for exploration.
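The selection rule itself is straightforward; a minimal sketch of the idea (not the library's internal implementation), assuming a hypothetical arm_to_expectation dictionary:
import numpy as np

rng = np.random.RandomState(123456)
arm_to_expectation = {'Arm1': 15.3, 'Arm2': 25.0}  # hypothetical expected rewards
epsilon = 0.25

if rng.rand() < epsilon:
    arm = rng.choice(list(arm_to_expectation))                  # explore: pick a random arm
else:
    arm = max(arm_to_expectation, key=arm_to_expectation.get)   # exploit: pick the best arm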
-
epsilon
¶ The probability of selecting a random arm for exploration. Integer or float. Must be between 0 and 1. Default value is 0.05.
Type: Num
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
-
epsilon
Alias for field number 0
-
-
class
LinTS
¶ Bases:
tuple
LinTS Learning Policy
For each arm LinTS trains a ridge regression and creates a multivariate normal distribution for the coefficients using the calculated coefficients as the mean and the covariance as:
\[\alpha^{2} (x_i^{T}x_i + \lambda * I_d)^{-1}\]
The normal distribution is randomly sampled to obtain expected coefficients for the ridge regression for each prediction.
\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.
The multivariate normal distribution uses Cholesky decomposition to guarantee deterministic behavior. This method requires that the covariance is a positive definite matrix. To ensure this is the case, alpha and l2_lambda are required to be greater than zero.
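The following numpy sketch illustrates the sampling step for a single arm under these formulas. It is an approximation of the idea, not the library's internal ridge implementation, and the data values are illustrative.
import numpy as np

rng = np.random.RandomState(123456)
X = np.array([[0., 1., 2., 3.], [1., 2., 3., 0.], [3., 2., 1., 0.]])  # contexts observed for one arm
y = np.array([20., 17., 9.])                                          # rewards observed for that arm
alpha, l2_lambda = 1.0, 1.0

A_inv = np.linalg.inv(X.T @ X + l2_lambda * np.eye(X.shape[1]))
beta_mean = A_inv @ X.T @ y             # ridge regression coefficients (mean of the distribution)
cov = alpha ** 2 * A_inv                # covariance as defined above

beta_sampled = rng.multivariate_normal(beta_mean, cov)   # sample coefficients for this prediction
expectation = np.array([3., 2., 0., 1.]) @ beta_sampled  # expected reward for a new context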
-
alpha
¶ The multiplier to determine the degree of exploration. Integer or float. Must be greater than zero. Default value is 1.0.
Type: Num
-
l2_lambda
¶ The regularization strength. Integer or float. Must be greater than zero. Default value is 1.0.
Type: Num
-
arm_to_scaler
¶ Standardize context features by arm. Dictionary mapping each arm to a scaler object. It is assumed that the scaler objects are already fit and will only be used to transform context features. Default value is None.
Type: Dict[Arm, Callable]
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinTS(alpha=0.25))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
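When per-arm standardization is needed, arm_to_scaler expects already-fit scaler objects; the sketch below extends the example above with sklearn's StandardScaler. This is a hedged illustration only: whether the fitted object itself or its transform callable is expected should be verified against the library.
>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> decisions_arr, contexts_arr = np.array(decisions), np.array(contexts, dtype=float)
>>> arm_to_scaler = {arm: StandardScaler().fit(contexts_arr[decisions_arr == arm]) for arm in list_of_arms}
>>> mab = MAB(list_of_arms, LearningPolicy.LinTS(alpha=0.25, arm_to_scaler=arm_to_scaler))
>>> mab.fit(decisions, rewards, contexts)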
-
alpha
Alias for field number 0
-
arm_to_scaler
Alias for field number 2
-
l2_lambda
Alias for field number 1
-
-
class
LinUCB
¶ Bases:
tuple
LinUCB Learning Policy.
This policy trains a ridge regression for each arm. Then, given a context, it predicts a regression value and calculates the upper confidence bound of that prediction. The arm with the highest upper bound is selected.
The UCB for each arm is calculated as:
\[UCB = x_i \beta + \alpha \sqrt{(x_i^{T}x_i + \lambda * I_d)^{-1}x_i}\]
Where \(\beta\) is the matrix of the ridge regression coefficients, \(\lambda\) is the regularization strength, and \(I_d\) is a d x d identity matrix where d is the number of features in the context data.
\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.
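A short numpy sketch of computing a LinUCB-style score for one arm and one context follows. It is illustrative only; the exact form of the exploration term should be taken from the formula above and the library source rather than this sketch.
import numpy as np

X = np.array([[0., 1., 2., 3.], [1., 2., 3., 0.], [3., 2., 1., 0.]])  # training contexts for one arm
y = np.array([20., 17., 9.])                                          # training rewards for that arm
alpha, l2_lambda = 1.25, 1.0

A_inv = np.linalg.inv(X.T @ X + l2_lambda * np.eye(X.shape[1]))
beta = A_inv @ X.T @ y                  # ridge regression coefficients

x = np.array([3., 2., 0., 1.])          # new context
ucb = x @ beta + alpha * np.sqrt(x @ A_inv @ x)   # prediction plus exploration bonus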
-
alpha
¶ The parameter to control the exploration. Integer or float. Cannot be negative. Default value is 1.0.
Type: Num
-
l2_lambda
¶ The regularization strength. Integer or float. Cannot be negative. Default value is 1.0.
Type: Num
-
arm_to_scaler
¶ Standardize context features by arm. Dictionary mapping each arm to a scaler object. It is assumed that the scaler objects are already fit and will only be used to transform context features. Default value is None.
Type: Dict[Arm, Callable]
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinUCB(alpha=1.25))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
-
alpha
Alias for field number 0
-
arm_to_scaler
Alias for field number 2
-
l2_lambda
Alias for field number 1
-
-
class
Popularity
¶ Bases:
tuple
Randomized Popularity Learning Policy.
Returns a randomized popular arm for each prediction. The probability of selecting each arm is weighted by its mean reward. It assumes that the rewards are non-negative.
The probability of selection is calculated as:
\[P(arm) = \frac{ \mu_i } { \Sigma{ \mu } }\]
where \(\mu_i\) is the mean reward for that arm.
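A minimal sketch of the weighted draw (illustrative mean rewards, not the library's internals):
import numpy as np

arm_to_mean = {'Arm1': 15.3, 'Arm2': 25.0}     # hypothetical mean rewards (must be non-negative)
total = sum(arm_to_mean.values())
arm_to_prob = {arm: mu / total for arm, mu in arm_to_mean.items()}

rng = np.random.RandomState(123456)
arm = rng.choice(list(arm_to_prob), p=list(arm_to_prob.values()))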
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Popularity())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
-
class
Random
¶ Bases:
tuple
Random Learning Policy.
Returns a random arm for each prediction. Each arm is selected uniformly at random.
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Random())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
-
class
Softmax
¶ Bases:
tuple
Softmax Learning Policy.
This policy selects each arm with a probability proportionate to its average reward. The selection probability of each arm is calculated as:
\[P(arm) = \frac{ e ^ \frac{\mu_i - \max{\mu}}{ \tau } } { \Sigma{e ^ \frac{\mu - \max{\mu}}{ \tau }} }\]
where \(\mu_i\) is the mean reward for that arm and \(\tau\) is the “temperature” to determine the degree of exploration.
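A minimal numpy sketch of the probability computation above (hypothetical mean rewards, not the library's internals):
import numpy as np

means = np.array([15.3, 25.0])                   # hypothetical mean rewards for Arm1, Arm2
tau = 1.0
weights = np.exp((means - means.max()) / tau)    # subtracting the max keeps the exponent stable
probs = weights / weights.sum()

rng = np.random.RandomState(123456)
arm = ['Arm1', 'Arm2'][rng.choice(len(probs), p=probs)]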
-
tau
¶ The temperature to control the exploration. Integer or float. Must be greater than zero. Default value is 1.
Type: Num
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Softmax(tau=1))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
-
tau
Alias for field number 0
-
-
class
ThompsonSampling
¶ Bases:
tuple
Thompson Sampling Learning Policy.
This policy creates a beta distribution for each arm and then randomly samples from these distributions. The arm with the highest sample value is selected.
Notice that rewards must be binary to create beta distributions. If rewards are not binary, see the binarizer function.
-
binarizer
¶ If rewards are not binary, a binarizer function is required. Given an arm decision and its corresponding reward, the binarizer function returns True/False or 0/1 to denote whether the decision counts as a success, i.e., True/1 based on the reward or False/0 otherwise.
The function signature of the binarizer is:
binarize(arm: Arm, reward: Num) -> True/False or 0/1
Type: Callable
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [1, 1, 1, 0]
>>> mab = MAB(list_of_arms, LearningPolicy.ThompsonSampling())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> arm_to_threshold = {'Arm1': 10, 'Arm2': 10}
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [10, 20, 15, 7]
>>> def binarize(arm, reward): return reward > arm_to_threshold[arm]
>>> mab = MAB(list_of_arms, LearningPolicy.ThompsonSampling(binarizer=binarize))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
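For intuition, the beta-sampling step can be sketched as follows; the counts and the Beta(1, 1) prior here are assumptions for illustration, not the library's exact internals.
import numpy as np

rng = np.random.RandomState(123456)
arm_to_counts = {'Arm1': (2, 1), 'Arm2': (1, 0)}   # hypothetical (successes, failures) per arm

arm_to_sample = {arm: rng.beta(1 + s, 1 + f) for arm, (s, f) in arm_to_counts.items()}
arm = max(arm_to_sample, key=arm_to_sample.get)    # arm with the highest sampled value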
-
binarizer
Alias for field number 0
-
-
class
UCB1
¶ Bases:
tuple
Upper Confidence Bound1 Learning Policy.
This policy calculates an upper confidence bound for the mean reward of each arm. It greedily selects the arm with the highest upper confidence bound.
The UCB for each arm is calculated as:
\[UCB = \mu_i + \alpha \times \sqrt[]{\frac{2 \times log(N)}{n_i}}\]
Where \(\mu_i\) is the mean for that arm, \(N\) is the total number of trials, and \(n_i\) is the number of times the arm has been selected.
\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.
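A minimal sketch of the UCB1 score computed from the formula above (hypothetical counts and means):
import numpy as np

arm_to_mean = {'Arm1': 15.3, 'Arm2': 25.0}    # hypothetical mean rewards
arm_to_count = {'Arm1': 3, 'Arm2': 1}         # times each arm has been selected
N = sum(arm_to_count.values())                # total number of trials
alpha = 1.25

arm_to_ucb = {arm: mu + alpha * np.sqrt(2 * np.log(N) / arm_to_count[arm])
              for arm, mu in arm_to_mean.items()}
arm = max(arm_to_ucb, key=arm_to_ucb.get)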
-
alpha
¶ The parameter to control the exploration. Integer or float. Cannot be negative. Default value is 1.
Type: Num
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.UCB1(alpha=1.25))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
-
alpha
Alias for field number 0
-
-
class
mabwiser.mab.
MAB
(arms: List[NewType.<locals>.new_type], learning_policy: Union[mabwiser.mab.EpsilonGreedy, mabwiser.mab.Popularity, mabwiser.mab.Random, mabwiser.mab.Softmax, mabwiser.mab.ThompsonSampling, mabwiser.mab.UCB1, mabwiser.mab.LinTS, mabwiser.mab.LinUCB], neighborhood_policy: Union[None, mabwiser.mab.Clusters, mabwiser.mab.KNearest, mabwiser.mab.Radius] = None, seed: int = 123456, n_jobs: int = 1, backend: str = None)¶ Bases:
object
MABWiser: Contextual Multi-Armed Bandit Library
MABWiser is a research library for fast prototyping of multi-armed bandit algorithms. It supports context-free, parametric and non-parametric contextual bandit models.
-
arms
¶ The list of all of the arms available for decisions. Arms can be integers, strings, etc.
Type: list
-
learning_policy
¶ The learning policy.
Type: LearningPolicy
-
neighborhood_policy
¶ The neighborhood policy.
Type: NeighborhoodPolicy
-
is_contextual
¶ True if contextual policy is given, false otherwise. This is a read-only data field.
Type: bool
-
seed
¶ The random seed to initialize the internal random number generator. This is a read-only data field.
Type: numbers.Rational
-
n_jobs
¶ This is used to specify how many concurrent processes/threads should be used for parallelized routines. Default value is set to 1. If set to -1, all CPUs are used. If set to -2, all CPUs but one are used, and so on.
Type: int
-
backend
¶ Specify a parallelization backend implementation supported in the joblib library. Supported options are:
- “loky”: used by default; can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.
- “multiprocessing”: previous process-based backend based on multiprocessing.Pool. Less robust than loky.
- “threading”: a very low-overhead backend, but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.
Default value is None. In this case the default backend selected by joblib will be used.
Type: str, optional
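For instance, parallelism can be requested at construction time; a brief illustration (whether it helps depends on the chosen backend and the workload):
>>> from mabwiser.mab import MAB, LearningPolicy
>>> parallel_mab = MAB(['Arm1', 'Arm2'], LearningPolicy.EpsilonGreedy(epsilon=0.25), n_jobs=-1, backend='threading')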
Examples
>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
>>> mab.add_arm('Arm3')
>>> mab.partial_fit(['Arm3'], [30])
>>> mab.predict()
'Arm3'
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1', 'Arm2']
>>> rewards = [20, 17, 25, 9, 11]
>>> contexts = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0], [1, 1, 1]]
>>> contextual_mab = MAB(arms, LearningPolicy.EpsilonGreedy(), NeighborhoodPolicy.KNearest(k=3))
>>> contextual_mab.fit(decisions, rewards, contexts)
>>> contextual_mab.predict([[1, 1, 0], [1, 1, 1], [0, 1, 0]])
['Arm2', 'Arm2', 'Arm2']
>>> contextual_mab.add_arm('Arm3')
>>> contextual_mab.partial_fit(['Arm3'], [30], [[1, 1, 1]])
>>> contextual_mab.predict([[1, 1, 1]])
'Arm3'
-
add_arm
(arm: NewType.<locals>.new_type, binarizer: Callable = None, scaler: Callable = None) → NoReturn¶ Adds an arm to the list of arms.
Incorporates the arm into the learning and neighborhood policies with no training data.
Parameters: - arm (Arm) – The new arm to be added.
- binarizer (Callable) – The new binarizer function for Thompson Sampling.
- scaler (Callable) – A scaler object from sklearn.preprocessing.
Returns: Return type: No return.
Raises: - TypeError: For ThompsonSampling, binarizer must be a callable function.
- TypeError: The standard scaler object must have a transform method.
- TypeError: The standard scaler object must be fit with calculated mean_ and var_ attributes.
- ValueError: A binarizer function was provided but the learning policy is not Thompson Sampling.
- ValueError: The arm already exists.
- ValueError: The arm is None.
- ValueError: The arm is NaN.
- ValueError: The arm is Infinity.
-
fit
(decisions: Union[List[NewType.<locals>.new_type], numpy.ndarray, pandas.core.series.Series], rewards: Union[List[Union[int, float]], numpy.ndarray, pandas.core.series.Series], contexts: Union[None, List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → NoReturn¶ Fits the multi-armed bandit to the given decisions, their corresponding rewards and contexts, if any.
Validates arguments and raises exceptions in case there are violations.
This function makes the following assumptions:
- each decision corresponds to an arm of the bandit.
- there are no None, NaN, or Infinity values in the contexts.
Parameters: - decisions (Union[List[Arm], np.ndarray, pd.Series]) – The decisions that are made.
- rewards (Union[List[Num], np.ndarray, pd.Series]) – The rewards that are received corresponding to the decisions.
- contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None, i.e., no contexts.
Returns: Return type: No return.
Raises: - TypeError: Decisions and rewards are not given as list, numpy array or pandas series.
- TypeError: Contexts is not given as None, list, numpy array, pandas series or data frames.
- ValueError: Length mismatch between decisions, rewards, and contexts.
- ValueError: Fitting contexts data when there is no contextual policy.
- ValueError: Contextual policy when fitting no contexts data.
- ValueError: Rewards contain None, NaN, or Infinity.
-
learning_policy
Creates named tuple of the learning policy based on the implementor.
Returns: Return type: The learning policy. Raises: NotImplementedError: MAB learning_policy property not implemented for this learning policy.
-
neighborhood_policy
Creates named tuple of the neighborhood policy based on the implementor.
Returns: Return type: The neighborhood policy
-
partial_fit
(decisions: Union[List[NewType.<locals>.new_type], numpy.ndarray, pandas.core.series.Series], rewards: Union[List[Union[int, float]], numpy.ndarray, pandas.core.series.Series], contexts: Union[None, List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → NoReturn¶ Updates the multi-armed bandit with the given decisions, their corresponding rewards and contexts, if any.
Validates arguments and raises exceptions in case there are violations.
This function makes the following assumptions:
- each decision corresponds to an arm of the bandit.
- there are no None, NaN, or Infinity values in the contexts.
Parameters: - decisions (Union[List[Arm], np.ndarray, pd.Series]) – The decisions that are made.
- rewards (Union[List[Num], np.ndarray, pd.Series]) – The rewards that are received corresponding to the decisions.
- contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None, i.e., no contexts.
Returns: Return type: No return.
Raises: - TypeError: Decisions and rewards are not given as list, numpy array or pandas series.
- TypeError: Contexts is not given as None, list, numpy array, pandas series or data frames.
- ValueError: Length mismatch between decisions, rewards, and contexts.
- ValueError: Fitting contexts data when there is no contextual policy.
- ValueError: Contextual policy when fitting no contexts data.
- ValueError: Rewards contain None, NaN, or Infinity.
-
predict
(contexts: Union[None, List[Union[int, float]], List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → Union[NewType.<locals>.new_type, List[NewType.<locals>.new_type]]¶ Returns the “best” arm (or arms list if multiple contexts are given) based on the expected reward.
The definition of the best depends on the specified learning policy. Contextual learning policies and neighborhood policies require contexts data in training. In testing, they return the best arm given new context(s).
Parameters: contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None. Contexts should be None for context-free bandits and are required for contextual bandits.
Returns: Return type: The recommended arm or recommended arms list.
Raises: - TypeError: Contexts is not given as None, list, numpy array, pandas series or data frames.
- ValueError: Predicting with contexts data when there is no contextual policy.
- ValueError: Contextual policy when predicting with no contexts data.
-
predict_expectations
(contexts: Union[None, List[Union[int, float]], List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → Union[Dict[NewType.<locals>.new_type, Union[int, float]], List[Dict[NewType.<locals>.new_type, Union[int, float]]]]¶ Returns a dictionary of arms (key) to their expected rewards (value).
Contextual learning policies and neighborhood policies require contexts data for expected rewards.
Parameters: contexts (Union[None, List[Num], List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context for the expected rewards. Default value is None. Contexts should be None for context-free bandits and are required for contextual bandits.
Returns: Return type: The dictionary of arms (key) to their expected rewards (value), or a list of such dictionaries.
Raises: - TypeError: Contexts is not given as None, list, numpy array or pandas data frames.
- ValueError: Predicting with contexts data when there is no contextual policy.
- ValueError: Contextual policy when predicting with no contexts data.
-
-
class
mabwiser.mab.
NeighborhoodPolicy
¶ Bases:
tuple
-
class
Clusters
¶ Bases:
tuple
Clusters Neighborhood Policy.
Clusters is a k-means clustering approach that uses the observations from the closest cluster with a learning policy. Supports KMeans and MiniBatchKMeans.
-
n_clusters
¶ The number of clusters. Integer. Must be at least 2. Default value is 2.
Type: Num
-
is_minibatch
¶ Boolean flag to use MiniBatchKMeans or not. Default value is False.
Type: bool
Example
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.Clusters(3))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
-
is_minibatch
Alias for field number 1
-
n_clusters
Alias for field number 0
-
-
class
KNearest
¶ Bases:
tuple
KNearest Neighborhood Policy.
KNearest is a nearest neighbors approach that selects the k-nearest observations to be used with a learning policy.
-
k
¶ The number of neighbors to select. Integer value. Must be greater than zero. Default value is 1.
Type: int
-
metric
¶ The metric used to calculate distance. Accepts any of the metrics supported by scipy.spatial.distance.cdist. Default value is Euclidean distance.
Type: str
Example
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.KNearest(2, "euclidean"))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[1, 1]
-
k
Alias for field number 0
-
metric
Alias for field number 1
-
-
class
Radius
¶ Bases:
tuple
Radius Neighborhood Policy.
Radius is a nearest neighborhood approach that selects the observations within a given radius to be used with a learning policy.
-
radius
¶ The maximum distance within which to select observations. Integer or Float. Must be greater than zero. Default value is 1.
Type: Num
-
metric
¶ The metric used to calculate distance. Accepts any of the metrics supported by scipy.spatial.distance.cdist. Default value is Euclidean distance.
Type: str
-
no_nhood_prob_of_arm
¶ The probabilities associated with each arm. If not given, a uniform random distribution over all arms is assumed. The probabilities should sum up to 1.
Type: None or List
Example
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.Radius(2, "euclidean"))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
-
metric
Alias for field number 1
-
no_nhood_prob_of_arm
Alias for field number 2
-
radius
Alias for field number 0
-
simulator¶
Author: FMR LLC
Email: mabwiser@fmr.com
Version: 1.9.0 of May 2, 2020
This module provides a simulation utility for comparing algorithms and hyper-parameter tuning.
-
class
mabwiser.simulator.
Simulator
(bandits: List[tuple], decisions: Union[List[NewType.<locals>.new_type], numpy.ndarray, pandas.core.series.Series], rewards: Union[List[Union[int, float]], numpy.ndarray, pandas.core.series.Series], contexts: Union[None, List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None, scaler: callable = None, test_size: float = 0.3, is_ordered: bool = False, batch_size: int = 0, evaluator: callable = <function default_evaluator>, seed: int = 123456, is_quick: bool = False, log_file: str = None, log_format: str = '%(asctime)s %(levelname)s %(message)s')¶ Bases:
object
Multi-Armed Bandit Simulator.
This utility runs a simulation using historic data and a collection of multi-armed bandits, either from the MABWiser library or custom bandits that extend the BaseMAB class in MABWiser.
It can be used to run a simple simulation with a single bandit or to compare multiple bandits for policy selection, hyper-parameter tuning, etc.
Nearest Neighbor bandits that use the default Radius and KNearest implementations from MABWiser are converted to custom versions that share distance calculations to speed up the simulation. These custom versions also track statistics about the neighborhoods that can be used in evaluation.
The results can be accessed as the arms_to_stats, model_to_predictions, model_to_confusion_matrices, and models_to_evaluations properties.
When using partial fitting, an additional confusion matrix is calculated for all predictions after all of the batches are processed.
A log of the simulation tracks the experiment progress.
-
bandits
¶ A list of tuples of the name of each bandit and the bandit object.
Type: list[(str, bandit)]
-
decisions
¶ The complete decision history to be used in train and test.
Type: array
-
rewards
¶ The complete reward history to be used in train and test.
Type: array
-
contexts
¶ The complete context history to be used in train and test.
Type: array
-
scaler
¶ A scaler object from sklearn.preprocessing.
Type: scaler
-
test_size
¶ The size of the test set
Type: float
-
is_ordered
¶ Whether to use a chronological division for the train-test split. If false, uses sklearn’s train_test_split.
Type: bool
-
batch_size
¶ The size of each batch for online learning.
Type: int
-
evaluator
¶ The function for evaluating the bandits. Values are stored in bandit_to_arm_to_stats_avg. Must have the function signature function(arms_to_stats_train: dictionary, predictions: list, decisions: np.ndarray, rewards: np.ndarray, metric: str).
Type: callable
-
is_quick
¶ Flag to skip neighborhood statistics.
Type: bool
-
logger
¶ The logger object.
Type: Logger
-
arms
¶ The list of arms used by the bandits.
Type: list
-
arm_to_stats_total
¶ Descriptive statistics for the complete data set.
Type: dict
-
arm_to_stats_train
¶ Descriptive statistics for the training data.
Type: dict
-
arm_to_stats_test
¶ Descriptive statistics for the test data.
Type: dict
-
bandit_to_arm_to_stats_avg
¶ Descriptive statistics for the predictions made by each bandit based on means from training data.
Type: dict
-
bandit_to_arm_to_stats_min
¶ Descriptive statistics for the predictions made by each bandit based on minimums from training data.
Type: dict
-
bandit_to_arm_to_stats_max
¶ Descriptive statistics for the predictions made by each bandit based on maximums from training data.
Type: dict
-
bandit_to_confusion_matrices
¶ The confusion matrices for each bandit.
Type: dict
-
bandit_to_predictions
¶ The prediction for each item in the test set for each bandit.
Type: dict
-
bandit_to_expectations
¶ The arm_to_expectations for each item in the test set for each bandit. For context-free bandits, there is a single dictionary for each batch.
Type: dict
-
bandit_to_neighborhood_size
¶ The number of neighbors in each neighborhood for each row in the test set. Calculated when using a Radius neighborhood policy, or a custom class that inherits from it. Not calculated when is_quick is True.
Type: dict
-
bandit_to_arm_to_stats_neighborhoods
¶ The arm_to_stats for each neighborhood for each row in the test set. Calculated when using Radius or KNearest, or a custom class that inherits from one of them. Not calculated when is_quick is True.
Type: dict
-
test_indices
¶ The indices of the rows in the test set. If input was not zero-indexed, these will reflect their position in the input rather than actual index.
Type: list
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> from mabwiser.simulator import Simulator
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab1 = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab2 = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.30), seed=123456)
>>> bandits = [('EG 25%', mab1), ('EG 30%', mab2)]
>>> offline_sim = Simulator(bandits, decisions, rewards, test_size=0.5, batch_size=0)
>>> offline_sim.run()
>>> offline_sim.bandit_to_arm_to_stats_avg['EG 30%']['Arm1']
{'count': 1, 'sum': 9, 'min': 9, 'max': 9, 'mean': 9.0, 'std': 0.0}
-
get_arm_stats
(decisions: numpy.ndarray, rewards: numpy.ndarray) → dict¶ Calculates descriptive statistics for each arm in the provided data set.
Parameters: - decisions (np.ndarray) – The decisions to filter the rewards.
- rewards (np.ndarray) – The rewards to get statistics about.
Returns: - Arm_to_stats dictionary.
- Dictionary has the format {arm {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}}
-
static
get_stats
(rewards: numpy.ndarray) → dict¶ Calculates descriptive statistics for the given array of rewards.
Parameters: rewards (np.ndarray) – Array of rewards for a single arm. Returns: - A dictionary of descriptive statistics.
- Dictionary has the format {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}
-
plot
(metric: str = 'avg', is_per_arm: bool = False) → NoReturn¶ Generates a plot of the cumulative sum of the rewards for each bandit. Simulation must be run before calling this method.
Parameters: - metric (str) – The bandit_to_arm_to_stats to use to generate the plot. Must be ‘avg’, ‘min’, or ‘max’.
- is_per_arm (bool) – Whether to plot each arm separately or use an aggregate statistic.
Raises: - AssertionError: Descriptive statistics for predictions are missing.
- TypeError: Metric must be a string.
- TypeError: The per_arm flag must be a boolean.
- ValueError: The metric must be one of avg, min or max.
Returns: Return type: None
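Continuing the Simulator example above, a quick illustration of the two plotting modes:
>>> offline_sim.plot(metric='avg', is_per_arm=False)   # one cumulative-reward line per bandit
>>> offline_sim.plot(metric='max', is_per_arm=True)    # each arm plotted separately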
-
run
() → NoReturn¶ Run simulator
Runs a simulation concurrently for all bandits in the bandits list.
Returns: Return type: None
-
-
mabwiser.simulator.
default_evaluator
(arms: List[NewType.<locals>.new_type], decisions: numpy.ndarray, rewards: numpy.ndarray, predictions: List[NewType.<locals>.new_type], arm_to_stats: dict, stat: str, start_index: int, nn: bool = False) → dict¶ Default evaluation function.
Calculates predicted rewards for the test batch based on predicted arms. When the predicted arm is the same as the historic decision, the historic reward is used. When the predicted arm is different, the mean, min or max reward from the training data is used. If using Radius or KNearest neighborhood policy, the statistics from the neighborhood are used instead of the entire training set.
The simulator supports custom evaluation functions, but they must have this signature to work with the simulation pipeline.
Parameters: - arms (list) – The list of arms.
- decisions (np.ndarray) – The historic decisions for the batch being evaluated.
- rewards (np.ndarray) – The historic rewards for the batch being evaluated.
- predictions (list) – The predictions for the batch being evaluated.
- arm_to_stats (dict) – The dictionary of descriptive statistics for each arm to use in evaluation.
- stat (str) – Which metric from arm_to_stats to use. Takes the values ‘min’, ‘max’, ‘mean’.
- start_index (int) – The index of the first row in the batch. For offline simulations it is 0. For online simulations it is batch size * batch number. Used to select the correct index from arm_to_stats if there are separate entries for each row in the test set.
- nn (bool) – Whether the results are from one of the simulator custom nearest neighbors implementations.
Returns: - An arm_to_stats dictionary for the predictions in the batch.
- Dictionary has the format {arm {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}}
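As a starting point for a custom evaluator with the same signature, here is a hedged sketch that reuses the historic reward when the prediction matches the historic decision and otherwise falls back to the arm's training statistic. Unlike default_evaluator, it ignores neighborhood statistics and the start_index/nn arguments.
import numpy as np

def simple_evaluator(arms, decisions, rewards, predictions, arm_to_stats,
                     stat, start_index, nn=False):
    # Hypothetical evaluator: assumes arm_to_stats has the format
    # {arm: {'count', 'sum', 'min', 'max', 'mean', 'std'}} described above.
    arm_to_rewards = {arm: [] for arm in arms}
    for decision, reward, prediction in zip(decisions, rewards, predictions):
        if prediction == decision:
            arm_to_rewards[prediction].append(reward)                      # historic reward on a match
        else:
            arm_to_rewards[prediction].append(arm_to_stats[prediction][stat])
    return {arm: {'count': len(values),
                  'sum': float(np.sum(values)),
                  'min': float(np.min(values)) if values else 0.0,
                  'max': float(np.max(values)) if values else 0.0,
                  'mean': float(np.mean(values)) if values else 0.0,
                  'std': float(np.std(values)) if values else 0.0}
            for arm, values in arm_to_rewards.items()}
Such a function could then be passed to the Simulator via its evaluator argument.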
utils¶
Author: FMR LLC
Email: mabwiser@fmr.com
This module provides a number of constants and helper functions.
-
mabwiser.utils.
Arm
(x)¶ Arm type is defined as integer, float, or string.
-
class
mabwiser.utils.
Constants
¶ Bases:
tuple
Constant values used by the modules.
-
default_seed
= 123456¶ The default random seed.
-
distance_metrics
= ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean']¶ The distance metrics supported by neighborhood policies.
-
-
mabwiser.utils.
Num
= typing.Union[int, float]¶ Num type is defined as integer or float.
-
mabwiser.utils.
argmax
(dictionary: Dict[NewType.<locals>.new_type, Union[int, float]]) → NewType.<locals>.new_type¶ Returns the first key with the maximum value.
-
mabwiser.utils.
check_false
(expression: bool, exception: Exception) → NoReturn¶ Checks that given expression is false, otherwise raises the given exception.
-
mabwiser.utils.
check_true
(expression: bool, exception: Exception) → NoReturn¶ Checks that given expression is true, otherwise raises the given exception.
-
mabwiser.utils.
reset
(dictionary: Dict[KT, VT], value) → NoReturn¶ Maps every key to the given value.
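A brief illustration of these helpers; the values shown follow from the documented behavior.
>>> from mabwiser.utils import argmax, check_true, reset
>>> expectations = {'Arm1': 0.2, 'Arm2': 0.7}
>>> argmax(expectations)
'Arm2'
>>> check_true(len(expectations) > 0, ValueError("There must be at least one arm."))
>>> reset(expectations, 0)
>>> expectations
{'Arm1': 0, 'Arm2': 0}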