A/B Testing in Python: “A User Experience Research Methodology”
Understand the results of an A/B test run by an e-commerce website.
A/B tests are very commonly performed by data analysts and data scientists, so it is important to get some practice working through their difficulties.
In this project from the Udacity Data Analyst Nanodegree program, we work to understand the results of an A/B test run by an e-commerce website. The company has developed a new web page in an effort to increase the number of users who “convert,” meaning the number of users who decide to pay for the company’s product. The goal is to help the company decide whether to implement the new page, keep the old page, or run the experiment longer before making a decision.
Table of contents:
- Part I — Probability
- Part II — A/B Testing
- Part III — Regression Approach
- Part IV — Conclusion
Data: ab_data.csv
#import libraries
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
random.seed(42)
np.random.seed(42) #the simulations below use np.random, so seed it as well
Part I — Probability
#Read in the `ab_data.csv` data. Store it in `df`.
df = pd.read_csv('ab_data.csv')
df.head()
#The number of rows in the dataset.
df.shape[0]
#The number of unique users in the dataset.
df.user_id.nunique()
#The proportion of users converted.
df.converted.mean()
#the number of times the new_page and treatment don't line up.
treatment_old = len(df.query('group == "treatment" and landing_page == "old_page"'))
control_new = len(df.query('group == "control" and landing_page == "new_page"'))
mismatched = treatment_old + control_new
print(mismatched)
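An alternative way to see these mismatches at a glance is a cross-tabulation of group against landing_page; this is a small sketch, not a cell from the original notebook:
#cross-tabulate group vs. landing_page; the off-diagonal cells
#are the mismatched rows counted above.
pd.crosstab(df['group'], df['landing_page'])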
#missing value check.
df.isnull().sum()
For the rows where treatment does not line up with new_page, or control does not line up with old_page, we cannot be sure whether the user truly received the new or the old page, so these rows should be removed.
#create a new dataset that meets the specifications and store the new dataframe in df2.
df2 = df.drop(df.query('(group == "treatment" and landing_page != "new_page") or (group == "control" and landing_page != "old_page")').index)
#test: this should return 0.
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]
#Unique user_ids in df2.
df2.user_id.nunique()
#repeated user_id in df2 & the row information
duplicated_user = df2[df2['user_id'].duplicated()]
df2[df2['user_id'].duplicated(keep=False)]
#remove one of the rows with a duplicate user_id.
df2 = df2.drop(duplicated_user.index)
#test: no duplicate user_ids should remain.
df2['user_id'].duplicated().sum()
#the probability of an individual converting regardless of the page they receive.
df2.converted.mean()
#the probability of an individual converting given that they are in the control group.
p_control = df2.query('group == "control"').converted.mean()
p_control
#the probability of an individual converting given that they are in the treatment group.
p_treatment = df2.query('group == "treatment"').converted.mean()
p_treatment
#the probability that an individual received the new page.
p_newpage = df2.query('landing_page == "new_page"').shape[0]/df2.shape[0]
p_newpage
# Is there sufficient evidence to say that the new treatment page leads to more conversions?
obs_diff = p_treatment - p_control
print('Observed difference is: {}'.format(obs_diff))
For now, it cannot be said that the new treatment page leads to more conversions; there is not sufficient evidence to support that statement. The data show that p_control and p_treatment are both roughly 12%, and the two conversion rates are too close to each other to draw a clear conclusion: the gap is less than 0.2 percentage points, as can be seen from obs_diff above. The probability of receiving the new page versus the old page comes out roughly 50/50. To decide between the pages, I need to test the null hypothesis and gather more evidence.
Part II — A/B Testing
If we want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, the null and alternative hypotheses for a one-sided test are:
H_0: p_old >= p_new
H_1: p_old < p_new
(p_old and p_new are the conversion rates for the old and new pages)
Assume under the null hypothesis that p_new and p_old both have “true” success rates equal to the overall converted rate, regardless of page (i.e., p_new = p_old), and use a sample size equal to that of ab_data.
#the conversion rate for p_new and p_old under the null.
p_new = df2.converted.mean()
p_old = df2.converted.mean()
print('p_new:' , p_new , 'p_old:', p_old)
#n_new and n_old
n_new = len(df2.query('landing_page == "new_page"'))
n_old = len(df2.query('landing_page == "old_page"'))
print('n_new:' , n_new , 'n_old:', n_old)
#Simulate n_new transactions with a conversion rate of p_new under the null; store these 1's and 0's in new_page_converted.
new_page_converted = np.random.binomial(1, p_new, n_new)
new_page_converted.mean()
#Simulate n_old transactions with a conversion rate of p_old under the null; store these 1's and 0's in old_page_converted.
old_page_converted = np.random.binomial(1, p_old, n_old)
old_page_converted.mean()
#p_new - p_old
difference = new_page_converted.mean() - old_page_converted.mean()
print('Simulated difference is: {}'.format(difference))
#simulate 10,000 p_new - p_old differences under the null and store them in p_diffs.
new_converted_simulation = np.random.binomial(n_new, p_new, 10000)/n_new
old_converted_simulation = np.random.binomial(n_old, p_old, 10000)/n_old
p_diffs = new_converted_simulation - old_converted_simulation
#plot a histogram of the simulated differences.
plt.hist(p_diffs)
plt.title('New-old probability diffs simulation')
plt.xlabel('p_diffs');
#plot the null distribution and mark the observed difference.
null_vals = np.random.normal(0, p_diffs.std(), p_diffs.size)
plt.hist(null_vals)
plt.axvline(x=obs_diff, c='red');
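The p-value referenced below is the proportion of draws from the null distribution that are greater than the observed difference. A minimal sketch of that computation (the original notebook's cell for this step is not shown here):
#p-value: the proportion of the null distribution greater than the
#observed difference (one-sided, in the direction of H_1).
p_value_sim = (null_vals > obs_diff).mean()
print('Simulated p-value: {}'.format(p_value_sim))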
As expected, the plot is approximately a normal distribution. The statistic calculated above is what scientific studies call a “p-value.” From this null distribution, p-value = 0.91 > 0.05, so we do not reject the null hypothesis (p_new = p_old).
Let n_old and n_new refer to the number of rows associated with the old and new pages, respectively. The same test can also be computed analytically with statsmodels' built-in z-test for proportions:
import statsmodels.api as sm
convert_old = df2.query("landing_page == 'old_page' and converted == 1").shape[0]
convert_new = df2.query("landing_page == 'new_page' and converted == 1").shape[0]
n_old = df2.query("landing_page == 'old_page'").shape[0]
n_new = df2.query("landing_page == 'new_page'").shape[0]
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative="smaller")
print('Z-score: {}'.format(z_score))
print('P-value of Z-Test: {}'.format(p_value))
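For reference, here is a sketch of the pooled two-proportion z-statistic that proportions_ztest computes under the hood; this manual version is an illustration under the pooled-variance assumption, not a cell from the original notebook:
#pooled two-proportion z-statistic; the sample order matches
#[convert_old, convert_new] passed to proportions_ztest above.
p_pool = (convert_old + convert_new) / (n_old + n_new)
se = np.sqrt(p_pool * (1 - p_pool) * (1/n_old + 1/n_new))
z_manual = (convert_old/n_old - convert_new/n_new) / se
print(z_manual) #should match z_score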
Both the z-score and the p-value from this test statistic indicate that we fail to reject the null hypothesis. When the test is computed without specifying the alternative (i.e., two-sided), the p-value is 0.189. These results agree with the findings in the previous part.
Part III — Regression Approach
Since the response I want to predict is categorical (converted or not), I will use logistic regression.
#create dummies for group; get_dummies orders the columns alphabetically
#(control, treatment), so a_page is the control indicator and ab_page the treatment indicator.
df2[['a_page', 'ab_page']] = pd.get_dummies(df2['group'])
#logistic regression model
df2['intercept'] = 1
logit_mod = sm.Logit(df2['converted'], df2[['intercept', 'a_page']])
results = logit_mod.fit() #fit the model
results.summary()
The p-value is ~0.19, well above the Type I error rate of 0.05, so the result is not statistically significant and we cannot reject the null hypothesis.
np.exp(0.015) #exponentiate the a_page coefficient to get an odds ratio
Holding all else constant, individuals in the control group (a_page = 1) are about 1.015 times as likely to convert as individuals in the treatment group.
In Part III, I assume p_new = p_old under the null and test two-sided.
The previous analysis in Part II was a one-sided test that took p_new > p_old as the alternative.
The null and alternative hypotheses for the two-sided test are:
H_0: p_new - p_old = 0
H_1: p_new - p_old != 0
The null and alternative hypotheses for the one-sided test were:
H_0: p_new - p_old <= 0
H_1: p_new - p_old > 0
When tested two-tailed, without specifying the alternative, the p-value is 0.1899 in both approaches.
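A quick sketch of how the one-sided p-value from the z-test relates to this two-sided value, assuming a symmetric null distribution (not a cell from the original notebook):
#for a symmetric null, the two-sided p-value is twice the smaller
#tail of the one-sided p-value returned by proportions_ztest above.
p_two_sided = 2 * min(p_value, 1 - p_value)
print(p_two_sided) #~0.19, matching the Logit regression p-value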
In some cases, a factor can be genuinely important and critical to the result; in such a case, adding it should improve the model. I can check whether a factor matters, for example by calculating its correlation coefficient with the response, and then decide to keep or remove it. On the other hand, adding too many factors creates an over-fitting problem and misleading results. There is also the further disadvantage that multicollinearity can occur if the additional factors are correlated with one another.
#the individual factors of country and page on conversion
#merging datasets
countries_df = pd.read_csv('./countries.csv')
df_new = countries_df.set_index('user_id').join(df2.set_index('user_id'), how='inner')
#creating dummy variables; get_dummies orders the columns alphabetically (CA, UK, US)
df_new[['CA', 'UK', 'US']] = pd.get_dummies(df_new['country'])
#logit model with US as the baseline country
model = sm.Logit(df_new['converted'], df_new[['intercept', 'ab_page', 'UK', 'CA']])
results_new = model.fit() #fitting the model
results_new.summary()
The p-values indicate that country has no significant impact on conversion.
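As a sanity check on the multicollinearity concern raised earlier, here is a minimal sketch using statsmodels' variance_inflation_factor (not part of the original notebook; values near 1 mean the predictors are essentially uncorrelated):
#variance inflation factors for the country/page model; VIFs well
#above ~5 would signal problematic multicollinearity (the intercept's
#VIF can be ignored).
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = df_new[['intercept', 'ab_page', 'UK', 'CA']]
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)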
#interaction between page and country, to see if there are significant effects on conversion
df_new['ab_UK'] = df_new['ab_page'] * df_new['UK']
df_new['ab_US'] = df_new['ab_page'] * df_new['US']
df_new['ab_CA'] = df_new['ab_page'] * df_new['CA']
#logit model with interaction terms
lmodel = sm.Logit(df_new['converted'], df_new[['intercept', 'ab_page', 'UK', 'CA', 'ab_UK', 'ab_CA']])
#fitting the logit model
results_factor = lmodel.fit()
results_factor.summary()
1/np.exp(results_factor.params) #exponentiate the coefficients (reciprocal form)
According to the statistical results, the p-values for both ab_UK and ab_CA are above the Type I error rate (0.05). Holding all else constant, a user in the UK is 1.08 times as likely to convert and a user in CA is 1.03 times as likely; these impacts are very small. Moreover, the results are not reliable enough to claim a significant effect on conversion from the individual factors of country and page.
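To put confidence intervals around those odds ratios, a short sketch (not in the original notebook) using the fitted model:
#odds ratios with 95% confidence intervals; conf_int() returns
#bounds on the log-odds scale, so exponentiate them as well.
ci = np.exp(results_factor.conf_int())
ci.columns = ['2.5%', '97.5%']
ci['odds_ratio'] = np.exp(results_factor.params)
print(ci)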
Part IV — Conclusion
All of the A/B testing results show that the findings do not provide enough evidence to reject the null hypothesis. The decision should be to keep the old page rather than switch to the new one: since the change would not improve conversions, implementing it would be a waste of time and effort.