---
title: AB Test Utils
keywords: fastai
sidebar: home_sidebar
summary: "AB Test Utils."
description: "AB Test Utils."
nb_path: "nbs/utils/ab_testing.ipynb"
---
Outline for A/B Tests:
We will run an A/B test for a hypothetical company that is trying to increase the number of users who sign up for a premium account. The goal of running an A/B test is to evaluate whether a change to a website will lead to improved performance on a specific metric. You may decide to test very simple alternatives, such as changing the look of a single button on a webpage, or testing different layouts and headlines. You could also run an A/B test on multi-step processes that have many differences. Examples of this include the steps required to sign up a new user or to process a sale on an online marketplace. A/B testing is a huge subject, and there are many techniques and rules for setting up an experiment.
Before running the test, we need to know the baseline conversion rate and the desired lift, or increase in signups, that we would like to test for. The baseline conversion rate is the rate at which we currently sign up new users under the existing design. For our example, we want the test to confirm that the changes we make to our signup process will result in at least a 2% increase in our signup rate. We currently sign up 10 out of every 100 users who are offered a premium account.
```python
import numpy as np
import pandas as pd
import scipy.stats as scs
import matplotlib.pyplot as plt

bcr = 0.10    # baseline conversion rate
d_hat = 0.02  # difference between the groups (the desired lift)
```
Typically, the total number of users participating in the A/B test makes up a small percentage of the total number of users. Users are randomly selected and assigned to either a control group or a test group. The sample size that you decide on will determine how long you might have to wait until you have collected enough data. For example, websites with large audiences may be able to collect enough data very quickly, while other websites may have to wait a number of weeks. There are some events that happen rarely even for high-traffic websites, so determining the necessary sample size will inform how soon you can assess your experiment and move on to improving other metrics.
Initially, we will collect 1000 users for each group and serve the current signup page to the control group and a new signup page to the test group.
N_A = 1000
N_B = 1000
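As a back-of-the-envelope check on how long this might take to collect, you can divide the total sample size by the traffic you expect to route into the experiment. The daily traffic figures below are made up purely for illustration:

```python
# hypothetical traffic figures, purely for illustration
daily_visitors = 500    # eligible users per day (assumed)
test_fraction = 0.5     # share of traffic enrolled in the experiment (assumed)

days_needed = (N_A + N_B) / (daily_visitors * test_fraction)
print(f"Roughly {days_needed:.0f} days to collect {N_A + N_B} users")
```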
```python
def generate_data(N_A, N_B, p_A, p_B, days=None, control_label='A',
                  test_label='B'):
    """Returns a pandas dataframe with fake CTR data

    Example:
        >>> generate_data(1000, 1000, 0.10, 0.12)

    Parameters:
        N_A (int): sample size for control group
        N_B (int): sample size for test group
            Note: the final group sizes may not match N_A and N_B exactly
            because the group for each row is drawn at random in proportion
            to N_A and N_B.
        p_A (float): conversion rate of the control group
        p_B (float): conversion rate of the test group
        days (int): optional; if provided, a 'ts' column will be included
            to divide the data into chunks of time
            Note: overflow data will be included in an extra day
        control_label (str): label used for the control group
        test_label (str): label used for the test group

    Returns:
        df (pd.DataFrame): one row per user with 'group' and 'converted'
    """
    # initiate empty container
    data = []

    # total number of rows in the data
    N = N_A + N_B

    # bernoulli distribution that assigns each row to a group in
    # proportion to the requested group sizes (0 -> control, 1 -> test)
    group_bern = scs.bernoulli(N_B / (N_A + N_B))

    # initiate bernoulli distributions from which to randomly sample conversions
    A_bern = scs.bernoulli(p_A)
    B_bern = scs.bernoulli(p_B)

    for idx in range(N):
        # initiate empty row
        row = {}
        # for 'ts' column
        if days is not None:
            if type(days) == int:
                row['ts'] = idx // (N // days)
            else:
                raise ValueError("Provide an integer for the days parameter.")
        # assign group (0 -> control, 1 -> test)
        row['group'] = group_bern.rvs()
        if row['group'] == 0:
            # assign conversion based on the control group's conversion rate
            row['converted'] = A_bern.rvs()
        else:
            # assign conversion based on the test group's conversion rate
            row['converted'] = B_bern.rvs()
        # collect row into data container
        data.append(row)

    # convert data into pandas dataframe
    df = pd.DataFrame(data)

    # transform group labels of 0s and 1s to user-defined group labels
    df['group'] = df['group'].apply(
        lambda x: control_label if x == 0 else test_label)

    return df
```
ab_data = generate_data(N_A, N_B, bcr, bcr+d_hat)
ab_data.head(10)
The converted column indicates whether a user signed up for the premium service or not with a 1 or 0, respectively. The A group will be used for our control group and the B group will be our test group.
Let’s look at a summary of the results using the pivot table function in Pandas.
```python
ab_summary = ab_data.pivot_table(values='converted', index='group', aggfunc='sum')
# add additional columns to the pivot table
ab_summary['total'] = ab_data.pivot_table(values='converted', index='group', aggfunc='count')
ab_summary['rate'] = ab_data.pivot_table(values='converted', index='group', aggfunc='mean')
ab_summary
```
It looks like the difference in conversion rates between the two groups is 0.024, which is more than the lift of 0.02 we initially wanted. This is a good sign, but it is not yet enough evidence for us to confidently adopt the new design: at this point we have not measured how confident we are in this result. We can address this by looking at the distributions of the two groups.
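As a quick check, we can read the observed lift straight off the summary table, using the group labels assigned by `generate_data`:

```python
# observed difference in conversion rates between test and control
observed_lift = ab_summary.loc['B', 'rate'] - ab_summary.loc['A', 'rate']
observed_lift
```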
We can compare the two groups by plotting the distribution of the control group and calculating the probability of getting the result from our test group. We can assume that the distribution for our control group is binomial because the data is a series of Bernoulli trials, where each trial only has two possible outcomes (similar to a coin flip).
```python
A_converted = ab_summary.loc['A', 'converted']
A_total = ab_summary.loc['A', 'total']
A_cr = ab_summary.loc['A', 'rate']
B_converted = ab_summary.loc['B', 'converted']
B_total = ab_summary.loc['B', 'total']
B_cr = ab_summary.loc['B', 'rate']

fig, ax = plt.subplots(figsize=(12, 6))
x = np.linspace(A_converted - 49, A_converted + 50, 100)
y = scs.binom(A_total, A_cr).pmf(x)
# control group distribution in red
ax.bar(x, y, color='red', alpha=0.5)
# test group result as a blue dashed line
ax.axvline(x=B_cr * A_total, c='blue', alpha=0.75, linestyle='--')
plt.xlabel('converted')
plt.ylabel('probability')
```
The distribution for the control group is shown in red and the result from the test group is indicated by the blue dashed line. We can see that the probability of getting the result from the test group is very high. However, the probability does not convey the confidence level of the results. It does not take the sample size of our test group into consideration. Intuitively, we would feel more confident in our results as our sample sizes grow larger. Let’s continue and plot the test group results as a binomial distribution and compare the distributions against each other.
```python
fig, ax = plt.subplots(figsize=(12, 6))

# control group distribution
xA = np.linspace(A_converted - 49, A_converted + 50, 100)
yA = scs.binom(A_total, A_cr).pmf(xA)
ax.bar(xA, yA, alpha=0.5)

# test group distribution
xB = np.linspace(B_converted - 49, B_converted + 50, 100)
yB = scs.binom(B_total, B_cr).pmf(xB)
ax.bar(xB, yB, alpha=0.5)

plt.xlabel('converted')
plt.ylabel('probability')
```
Let’s start off by defining the null hypothesis and the alternative hypothesis. The null hypothesis is that the new design has no effect, i.e., the true difference in conversion rates between the test and control groups is zero. The alternative hypothesis is that the new design changes the conversion rate by d_hat, the difference we actually observed. We can plot both hypotheses as distributions of the difference in conversion rates with the abplot function.
# use the actual values from the experiment for bcr and d_hat
# p_A is the conversion rate of the control group
# p_B is the conversion rate of the test group
bcr = A_cr
d_hat = B_cr - A_cr
abplot(N_A, N_B, bcr, d_hat)
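Under the hood, this comparison is essentially a two-proportion z-test. As a rough, hand-rolled sketch of that arithmetic (an illustration, not a call into the library), we can compute the pooled probability, the pooled standard error, the z-score of the observed difference, and a one-sided p-value:

```python
# illustrative two-proportion z-test (a sketch, not a library function)
# pooled probability of conversion under the null hypothesis
pooled_prob = (A_converted + B_converted) / (A_total + B_total)

# pooled standard error of the difference in conversion rates
se_pooled = np.sqrt(pooled_prob * (1 - pooled_prob) * (1 / A_total + 1 / B_total))

# z-score of the observed difference and its one-sided p-value
z_score = d_hat / se_pooled
p_value = 1 - scs.norm.cdf(z_score)
z_score, p_value
```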
Fortunately, both curves are identical in shape, so we can simply compare the distance between the means of the two distributions. We can see that the alternative hypothesis curve suggests that the test group has a higher conversion rate than the control group. This plot can also be used to directly determine the statistical power.
I think it is easier to define statistical power and significance level by first showing how they are represented in the plot of the null and alternative hypotheses. We can return a visualization of the statistical power by adding the parameter show_power=True.
abplot(N_A, N_B, bcr, d_hat, show_power=True)
Similarly, beta is the probability of a type II error, i.e., failing to detect a real improvement; power is 1 - beta. We can visualize it by passing show_beta=True.
abplot(N_A, N_B, bcr, d_hat, show_beta=True)
The gray dashed line that divides the area under the alternative curve into two also marks off the area associated with the significance level, often denoted by the Greek letter alpha.
abplot(N_A, N_B, bcr, d_hat, show_alpha=True)
Experiments are typically designed for a minimum desired power of 80%. If our new design is truly better, we want our experiment to detect that fact at least 80% of the time. We know that if we increase the sample size for each group, we will decrease the pooled variance for our null and alternative hypotheses. This makes the distributions narrower, which increases the statistical power. Let’s take a look at how sample size directly affects our results.
abplot(2000, 2000, bcr, d_hat, show_power=True)
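To make the effect of sample size concrete, here is a small, self-contained sketch of how power can be approximated by hand using normal approximations. The helper below, approx_power, is illustrative and is not the library's implementation, so its numbers may differ slightly from the plots:

```python
def approx_power(N_A, N_B, bcr, d_hat, sig_level=0.05):
    """Approximate statistical power of a two-proportion test
    using normal approximations (illustrative sketch)."""
    # standard error of the difference under the null hypothesis (no difference)
    se_null = np.sqrt(bcr * (1 - bcr) * (1 / N_A + 1 / N_B))
    # standard error under the alternative (rates bcr and bcr + d_hat)
    p_alt = bcr + d_hat
    se_alt = np.sqrt(bcr * (1 - bcr) / N_A + p_alt * (1 - p_alt) / N_B)
    # critical value of the observed difference at the given significance level
    crit = scs.norm.ppf(1 - sig_level / 2) * se_null
    # power: probability the observed difference exceeds the critical value
    # when the alternative hypothesis is true
    return 1 - scs.norm.cdf((crit - d_hat) / se_alt)

# power grows as the per-group sample size grows
approx_power(1000, 1000, 0.10, 0.02), approx_power(2000, 2000, 0.10, 0.02)
```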
Many people calculate z-values from standard normal tables. However, I am more of a visual learner, and I like to refer to a plot of the z-distribution from which the values are derived.
zplot()
zplot(0.80, False, True)
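If you prefer plain numbers, the same z-values can be pulled directly from scipy:

```python
# z-value for a two-sided significance level of 0.05
z_alpha = scs.norm.ppf(1 - 0.05 / 2)   # about 1.96

# z-value for a statistical power of 0.80
z_beta = scs.norm.ppf(0.80)            # about 0.84

z_alpha, z_beta
```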
min_sample_size(bcr=0.10, mde=0.02)
abplot(3843, 3843, 0.10, 0.02, show_power=True)
The calculated power for this sample size was approximately 0.80. Therefore, if our design change truly improves the conversion rate by about 2 percentage points (from 10% to 12%), we need at least 3843 samples in each group to reach a statistical power of at least 0.80.
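For reference, a common closed-form approximation for this kind of minimum sample size calculation looks like the sketch below. The helper approx_min_sample_size is illustrative, assumes a two-sided significance level of 0.05 and a power of 0.80 by default, and may differ in detail from the library's min_sample_size:

```python
def approx_min_sample_size(bcr, mde, power=0.8, sig_level=0.05):
    """Approximate minimum sample size per group for a two-proportion test
    (illustrative sketch using the normal approximation)."""
    z_alpha = scs.norm.ppf(1 - sig_level / 2)  # critical z for the significance level
    z_beta = scs.norm.ppf(power)               # z corresponding to the desired power
    # average of the control and test conversion rates
    pooled_prob = (bcr + (bcr + mde)) / 2
    # required sample size per group
    return 2 * pooled_prob * (1 - pooled_prob) * (z_alpha + z_beta)**2 / mde**2

approx_min_sample_size(bcr=0.10, mde=0.02)
```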
That was a very long but basic walkthrough of A/B tests. Once you have developed an understanding and familiarity with the procedure, you will probably be able to run an experiment and go directly to the plots for the null and alternative hypothesis to determine if your results achieved enough power. By calculating the minimum sample size you need prior to the experiment, you can determine how long it will take to get the results back to your team for a final decision.