Cracking A/B Testing for Interviews
What is A/B Testing?
Multi-Variate Testing / Split Testing / Conversion Rate Optimization / Landing Page Optimization / Digital Optimization / Online Experimentation / Growth Hacking? These are all essentially the same thing as A/B Testing.
In statistics, A/B Testing is a form of hypothesis testing. You hypothesize that one or more factors have a significant effect, then run an experiment (comparing two or more versions, such as version A against version B) to support or reject that hypothesis. In the business world, the factor can be anything from a new product or product line down to a small website modification. Typically, the control group sees the existing feature while the treatment group sees the new one. A single A/B test rarely yields a permanently useful answer, so continual testing is preferred.
So when does A/B Testing become heavily used at a company? Companies whose products or services are easily replaced by competitors, and therefore need constant improvement to stay competitive, are usually the ones that adopt A/B Testing.
A real-world example is Under Armour. Several years ago, Under Armour hired a large consulting team from Adobe to run a test on their website. The control in this test was the original website; the variation was the same website with a simple recommendation zone placed in the center of the home page. They wanted to find out whether adding recommendations to the home page would make an impact, using revenue per visitor as the main success metric. The test showed a 14% lift in revenue, which translated to roughly 3 million dollars more annually. (The story can be found in this video: https://www.youtube.com/watch?v=CH89jd4haRE&list=WL&index=4, starting at 10:47.) Under Armour kept the feature afterwards and kept testing it, until fairly recently, when they replaced the zone with other features they now consider more important. A/B Testing questions are therefore a good way for employers to test whether candidates are familiar with the industry and have sound business sense.
In the business world, the whole process of A/B Testing usually looks like this:
- Customer Funnel (Funnel Analysis)
- Define Metric (KPIs, Conversion Rate, etc.)
- Form Hypothesis
- Formulate Testing Plan
- Create Variation
- Run Experiment
- Analyze Testing Result
- Make Conclusion
Sample size calculator: www.evanmiller.org/ab-testing/sample-size.html
About 90% of companies are looking for whether candidates have business sense around A/B Testing: what hypothesis should be formed; which hypotheses are imperative for a specific product or service; how to launch the A/B test and get valid results; and so on. The remaining 10% might ask about advanced A/B Testing topics or methodologies.
In data science interviews, A/B Testing is usually asked about together with metric questions. As stated previously, the interview can include any A/B testing component: developing a new hypothesis, designing A/B tests, evaluating testing results, and making decisions.
Designing Qs
How long to run an A/B test?
- First, we need three parameters: the significance level, the statistical power (power = 1 − type II error rate), and the minimum detectable effect. From these, the sample size can be obtained.
- Determine the sample size: n ≈ 16σ²/δ², where σ² is the sample variance and δ (delta) is the difference between treatment and control. Alternatively, use the calculator linked previously (see the sketch after this list).
- Next, divide the sample size by the number of users entering each group per day to get an approximate duration for the experiment. The calculated duration is usually rounded up to whole weeks at the end.
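A minimal sketch of this rule of thumb in Python (the baseline rate and daily traffic below are made-up placeholders):

```python
import math

# Rule-of-thumb sample size per group: n ≈ 16 * variance / delta^2,
# which roughly corresponds to alpha = 0.05 and power = 0.8.
def sample_size_per_group(variance: float, delta: float) -> int:
    return math.ceil(16 * variance / delta ** 2)

# Example: binary metric with a hypothetical baseline rate of 3% and a
# minimum detectable effect of 10% relative (delta = 0.003 absolute).
p = 0.03
variance = p * (1 - p)            # Bernoulli variance
delta = 0.1 * p
n = sample_size_per_group(variance, delta)

# Divide by the (hypothetical) daily traffic per group, then round up
# to whole weeks to average out day-of-week effects.
daily_users_per_group = 5_000
days = math.ceil(n / daily_users_per_group)
weeks = math.ceil(days / 7)
print(f"n per group = {n:,}, about {days} days, round up to {weeks} week(s)")
```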
How does each parameter influence the sample size?
- The larger the sample variance, the more samples are needed.
- The larger the delta (difference between treatment and control), the fewer samples are needed.
How to estimate parameters?
- Sample variance can be obtained from the data
- But we will need to use the minimum detectable effect to estimate delta. The minimum detectable effect represents the smallest difference that matters in experiment/practice and is usually decided by multiple stakeholders.
Multiple Testing Qs
There could also be multiple testing problems in the interview. In that case, multiple variants are tested, and a sample question would be:
A company is running 10 tests trying different versions of a web page, and one variant wins with a p-value less than 0.05. Should the company make this change?
Answer: The company shouldn't make this change, because the unadjusted significance level (0.05) shouldn't be used here. With this many variants being tested, keeping a fixed per-test significance level inflates the probability of a false discovery: at α = 0.05, the chance of at least one falsely significant result across 10 independent tests is 1 − 0.95^10 ≈ 40%.
Solution:
Bonferroni Correction
- It divides the significance level by the number of tests.
- Drawback: it is a conservative method, so it can be too strict and miss real effects
Control the False Discovery Rate (FDR)
- FDR is the expected proportion of false positives among all rejected hypotheses (false positives / rejections)
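Both corrections are implemented in statsmodels; here is a short sketch with ten made-up p-values matching the scenario above (only one dips below 0.05):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 10 variants; only one is below 0.05.
p_values = [0.04, 0.21, 0.45, 0.08, 0.67, 0.33, 0.90, 0.12, 0.55, 0.74]

# Bonferroni: compares each p-value against alpha / number_of_tests.
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate instead.
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

# With these values neither procedure rejects anything, which matches
# the answer above: the single p = 0.04 "win" is not trustworthy.
print("Bonferroni rejections:", reject_bonf.sum())
print("BH rejections:        ", reject_bh.sum())
```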
Novelty and Primacy Qs
Primacy Effect / Change Aversion:
When changes happen, some people who are used to how things worked may be reluctant to adopt them.
Novelty Effect:
Unlike the previous group, these people welcome changes that resonate with them and initially use the product more frequently.
However, neither effect lasts long. When an A/B test shows an unusually large or small initial effect, the novelty or primacy effect is often the cause.
Solution:
- Rule out the possibility of these effects (run the test only on first-time users)
- Compare first-time users with experienced users in the treatment group (to get an actual estimate of the impact of the novelty or primacy effect)
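As a sketch of the second approach, assume a treatment-group table with a hypothetical is_new_user flag and a per-user metric; if the two segments differ sharply, a novelty or primacy effect is likely at play:

```python
import pandas as pd
from scipy import stats

# Hypothetical treatment-group data: one row per user.
treatment = pd.DataFrame({
    "is_new_user": [True, True, False, False, False, True, False, True],
    "metric":      [1.2,  0.9,  2.1,   1.8,   2.4,   1.1,  1.9,   0.8],
})

new = treatment.loc[treatment["is_new_user"], "metric"]
experienced = treatment.loc[~treatment["is_new_user"], "metric"]

# Welch's t-test: a large gap between first-time and experienced users
# suggests the headline treatment effect is inflated (or deflated).
t_stat, p_value = stats.ttest_ind(new, experienced, equal_var=False)
print(f"new: {new.mean():.2f}, experienced: {experienced.mean():.2f}, p = {p_value:.3f}")
```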
Groups Interference Qs
Interference between variants happens a lot in the social network industry (Twitter, Facebook, TikTok, etc.) and in two-sided markets (e.g., Uber and Lyft).
In the social network industry, a user's behavior is likely to be influenced by other people, especially those in their social circle; this is the so-called "network effect." It can cause users in the control group to be influenced by users in the treatment group, leading to an underestimation of the treatment effect. In two-sided markets, on the other hand, users in the control and treatment groups share the same pool of resources, so the treatment effect tends to be overestimated.
Solution:
Isolate users in the control and treatment groups.
For social networks:
1. Network Clusters
- Split users into clusters within which most of their interactions occur, then randomly assign whole clusters to treatment or control
2. Ego-network Randomization (originated at LinkedIn)
- A cluster is composed of an "ego" (a user/individual) and their "alters" (the user's direct contacts)
- Measures the one-out network effect (the effect of the alters' treatment on the ego); each user either has the feature or not
- Simpler and more scalable
For two-sided markets:
1. Geo-based Randomization
- Split the sample by geolocation (this isolates users, but has high variance, since each market is unique in certain ways)
2. Time-based Randomization
- Split the sample by day of the week and assign all users to the treatment or control group (works only for short-term treatment effects)
- Don’t use this for something like a referral program
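A minimal sketch of cluster-level assignment (the user-to-cluster mapping is a placeholder; in practice it would come from community detection on the social graph, or from geo/time buckets):

```python
import random

def assign_by_cluster(user_to_cluster: dict, seed: int = 42) -> dict:
    """Randomize at the cluster level so connected users share a variant."""
    rng = random.Random(seed)
    # Flip one coin per cluster, not per user, to limit interference.
    arms = {c: rng.choice(["control", "treatment"])
            for c in sorted(set(user_to_cluster.values()))}
    return {user: arms[c] for user, c in user_to_cluster.items()}

# Hypothetical mapping from users to network clusters / regions.
print(assign_by_cluster({"u1": 0, "u2": 0, "u3": 1, "u4": 2}))
# u1 and u2 always land in the same arm because they share a cluster.
```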
Case Study: Red vs. Green Button
Goal:
Quantify the impact that a different call-to-action (CTA) button color has on the core metrics
Hypotheses:
- Compared with a red CTA button, a green CTA button will attract more user clicks
- A fraction of these additional clicks will complete the transaction, thus increasing revenue
- There is a bigger lift for this change on mobile
Null Hypothesis:
- The green button causes no difference in Click-Through Rate (number of clicks / number of users who see it) or other user behaviors
KPI to measure:
Revenue, Purchase Rate (per visitor), Click Through Rate (clicks per visit/clicks per visitor)
Data to be collected:
User ID/Cookie ID, platform, page loads (onsite engagement), experiment assignment, engagement behavior metrics, etc.
Minimum detectable effect: a 10% relative increase
Current button CTR: 3% → successful experiment: 3.3% or more
Determine what fraction of visitors you want to be in the treatment:
- 90% of visitors → control group
- 10% of visitors → treatment group
Run a power analysis to decide how many data samples to collect, given our tolerances: the minimum detectable effect (minimum measurable difference), the false negative rate, and the false positive rate.
False Positive (Type I error): we see a significant result when there isn't one (we typically want a false positive rate < 5%, which is equivalent to the significance level of the statistical test)
False Negative (Type II error): there is an effect, but we weren't able to measure it (we typically want a false negative rate < 20%, so the power of the test is 1 − β = 0.8)
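A sketch of this power analysis with statsmodels, plugging in the case-study numbers (3% baseline CTR, 10% relative lift, α = 0.05, power = 0.8):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.03
target_ctr = 0.033                       # the 10% relative lift from above

# Cohen's h effect size for comparing two proportions.
effect = proportion_effectsize(target_ctr, baseline_ctr)

# Required sample size per group at alpha = 0.05 and power = 0.8.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"~{n_per_group:,.0f} users per group")  # roughly 50,000 per group
```

This agrees with the 16σ²/δ² rule of thumb from the design section, which gives about 52,000 per group for the same numbers.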
After the power analysis, we need to figure out how long to run the test and then run it for at least that long. But pay attention: unevenly sized treatment groups can cause biases, so make sure the treatment and control groups stay balanced as planned.
If this is the first A/B test being run, a dummy test (an A/A test) is highly recommended.
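To see what the A/A test checks, here is a quick simulation: both arms draw from the same CTR, and a well-configured setup should flag "significance" at roughly the nominal 5% rate (all numbers are illustrative):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
n, p, alpha, runs = 10_000, 0.03, 0.05, 1_000

# Repeat the A/A comparison many times; about `alpha` of the runs
# should come out falsely "significant" if the setup is sound.
false_positives = 0
for _ in range(runs):
    clicks = rng.binomial(n, p, size=2)        # two identical arms
    _, p_value = proportions_ztest(clicks, [n, n])
    false_positives += p_value < alpha

print(f"false positive rate ≈ {false_positives / runs:.3f}")  # ~0.05
```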
Where to learn more:
- A Summary of Udacity A/B Testing Course (towardsdatascience.com)
- 7 A/B Testing Questions and Answers in Data Science Interviews (towardsdatascience.com)
***The contents of this article were not developed solely by me. Many things were learned and gathered from the Internet (YouTube, blogs, books, webinars, and so on).
***This article is for learning purposes only (non-commercial use). Please do not use or spread it for any business/commercial purpose.