A Straightforward Guide to A/B Testing – Data Science & Analytics Blog by Barrett Studdard
A/B Testing can be extremely useful during experimentation. Adding statistical rigor to situations where you compare one option against another. This is one step which can help guard against making faulty conclusions.
This article will demonstrate the critical steps in an A/B test:
- Determining Minimum Detectable Effect
- Calculating Sample Size
- Analyzing Results for Statistical Significance
Mục Lục
Scenario Overview
To demonstrate a situation where a company may employ A/B testing, I’m going to create a fictitious video game dataset.
In this scenario, a video game development company has recognized lots of users are quit playing after one particular level. A hypothesis posed by the product team is that the level is too hard, and making it easier will allow for less user frustration and ultimately more players continuing to play the game.
Sample Size Calculation
Desired Business Effect
Our video game company has made some tweaks to the level to make it easier, effectively changing the difficulty from hard to medium. Before blindly rolling an update out to all the users, the product team wants to test the changes on a small sample to see if they make a positive impact.
The desired metric is the percentage of players who continue playing after reaching our level in question. Currently, 70% of users continue to play where the remaining 30% stop playing. The product team has decided increasing this to 75% would warrant deploying the changes and making an update to the game.
Minimum Detectable Effect
This piece of information (70% to 75%) helps us calculate the minimum detectable effect – one of the inputs to calculating sample size. Our problem is testing two proportions, so we’ll use the proportion_effectsize function in statsmodels to translate this change into something we can work with.
import
statsmodels.
api as
sm
init_prop =
0.7
mde_prop =
0.75
effect_size =
sm.
stats.
proportion_effectsize(
init_prop,
mde_prop
)
print
(
f'For a change from
{
init_prop:
.2f
}
to
{
mde_prop:
.2f
}
- the effect size is
{
effect_size:
.2f
}
.'
)
Output: For a change from 0.70 to 0.75 - the effect size is -0.11.
Sample Size
Now that we have our minimum detectable effect, we can calculate sample size. To do this for a proportion problem, zt ind solve power is used (also from the statsmodels library).
We set the nobs1 (number of observations of sample 1) parameter to None which signifies this is what we are looking to solve for.
A significance level of 0.05 and power of 0.8 are commonly chosen default values, but these can be adjusted based on the scenario and desired false positive vs. false negative sensitivity.
from
math import
ceil
sample_size =
sm.
stats.
zt_ind_solve_power(
effect_size=
effect_size,
nobs1=
None
,
alpha=
0.05
,
power=
0.8
)
print
(
f'
{
ceil(
sample_size)
}
sample size required given power analysis and input parameters.'
)
Output: 1250 sample size required given power analysis and input parameters.
Experiment Data Overview
After communicating the sample size needed to the product team, an experiment was run where a random sample of 1,250 players played the new easier level and another 1,250 played the hard level.
After gathering the data, we learn that 980 users (of 1,250) continued playing after reaching the medium difficulty level while 881 (of 1,250) continued playing after reaching the hard difficulty level. This seems decent enough and above our hope of at least 5% improvement, should we make the change?
Before deciding, it’s important to test for statistical significance to guard against the possibility that this difference could have occurred simply by random chance (accounting for our significance and power).
Experiment Data
Analyzing Results
Calculating Inputs
The statistical test we’ll use is the proportions_ztest. We need to calculate a number of inputs first – the number of successes and observations. Data is stored in a pandas dataframe with columns for medium and hard – with 0 representing users that stopped playing and 1 representing those that continued.
medium_successes =
df.
loc[
:
,
'medium'
]
.
sum
(
)
hard_successes =
df.
loc[
:
,
'hard'
]
.
sum
(
)
medium_trials =
df.
loc[
:
,
'medium'
]
.
count(
)
hard_trials =
df.
loc[
:
,
'hard'
]
.
count(
)
print
(
f'Medium:
{
medium_successes}
successes of
{
medium_trials}
trials. Hard:
{
hard_successes}
successes of
{
hard_trials}
trials.'
)
Output: Medium: 980 successes of 1250 trials. Hard: 881 successes of 1250 trials.
Performing the Z Test
Performing the z test is simple with the required inputs:
from
statsmodels.
stats.
proportion import
proportions_ztest
z_stat,
p_value =
proportions_ztest(
count=
[
medium_successes,
hard_successes]
,
nobs=
[
medium_trials,
hard_trials]
)
print
(
f'z stat of
{
z_stat:
.3f
}
and p value of
{
p_value:
.4f
}
.'
)
Output: z stat of 4.539 and p value of 0.0000.
A p value below 0.05 meets our criteria. Reducing the level difficulty to medium is the way to go!
Summary
While this is a fictitious and straightforward example, utilizing a similar approach in the real world can add some rigor to your experiments. Consider using A/B testing when you have two samples you are looking to gauge whether an experiment resulted in a significant change or not based on your desired metric.
All examples and files available on Github.