Showmax is a B2C streaming video service, so we have the luxury of developing our platform directly for the customer. The obvious benefit is that we get immediate and constant feedback on our work directly from the people using it. While we know that changes and updates have an effect on our users, the tricky part is knowing if the effects are positive or negative. Sometimes it’s pretty easy to see that the effect is negative - breaking something stands out. But, in every other instance, all we have is our collective gut feeling - and A/B testing.
This piece is a look at the statistics behind A/B testing, and at what to be careful about when evaluating an A/B test - and why.
Defining test parameters
The concept of A/B testing is pretty simple. Identify what kind of users you want in the test, divide them randomly into two groups - control and alternative - present each group a different version of the product (with and without the tested change), and measure user behavior.
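To make the random split concrete, here's a minimal sketch of deterministic bucketing. Everything in it - the function name, the salt, the 50/50 split via hashing - is our own illustration, not a description of how Showmax actually assigns users:

```python
import hashlib

def assign_variant(user_id: str, salt: str = "cta-test-1") -> str:
    """Deterministically bucket a user into 'control' or 'alternative' (50/50).

    Hashing the user id (plus a per-test salt) keeps the assignment stable
    across visits, so a user always sees the same variant.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return "control" if digest[0] < 128 else "alternative"

print(assign_variant("user-123"))  # e.g. 'alternative'
```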
For example, imagine a simple web page with a call-to-action (CTA) button that leads users to a payment page. Our hypothesis is that, once we change the button color from blue to pink, more users will be willing to proceed and subscribe to Showmax. To test that, we prepare two versions of the page, and present a randomly-chosen variant to the users. After we present one of the variants to a sufficient number of visitors, we’re ready to evaluate the test and decide which page performs better.
Old version
Alternate version
To make things simple, let’s assume that we know that our conversion rate of the unchanged page with the blue button is 20%. Our hypothesis is that changing the CTA button design will improve that conversion rate.
Note that, in this case, we will measure only the conversion rate of the alternative design because we already know how well the original performs. This is not a typical setup for an A/B test, but we’ll get to that.
We set up the new variant of the page and measure how many users clicked the button against how many visited the page (that ratio is the conversion rate). After enough users have visited the page, we stop the test and get a result like: “The alternative version of the page has a 1.3% higher success rate.”
Now what? Could this result be obtained by nothing more than bad (or good) luck? If we repeat the experiment, will we obtain the opposite result? Should I have run the test with more users? To answer these, the A/B test needs an additional set of parameters. If you find an A/B test calculator, you are typically asked for the following values:
- Probability of Type I error (the false positive), denoted as $\alpha$
- Test confidence (probability of true negative), denoted as $1 - \alpha$
- Probability of Type II error (the false negative), denoted as $\beta$
- Power of the test (probability of true positive), denoted as $1 - \beta$
- Expected difference between variants, $\Delta$
- Minimal sample size $N$
Let’s look at what those parameters mean and what role they play when making a decision based on your A/B test results.
Population and Sample
Several important terms need to be defined before we get started. Imagine we can run a never-ending A/B test where we can measure every single existing user that could ever possibly appear on the page. Having this luxury, we are able to compute the true population mean (conversion rate) and the population standard deviation, typically denoted as $\mu$ and $\sigma$.
In practice, we can't measure every user - it would be too expensive and would simply take an infinite amount of time. Instead, we measure only $N$ users and obtain a sample from this population. From the results $x_1 \dots x_N$, we can compute the sample mean $\bar{x} = \frac{1}{N}\sum_{i=1}^{N}{x_i}$, and the sample standard deviation $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}{(\bar{x}-x_i)^2}}$.
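In NumPy, the two estimates look like this (the array of 0/1 conversion results below is made up purely for illustration):

```python
import numpy as np

# hypothetical results for N visitors: 1 = converted, 0 = did not convert
x = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

x_bar = x.mean()       # sample mean - in our case, the conversion rate
s = x.std(ddof=1)      # sample standard deviation (note the N - 1 denominator)

print(x_bar, s)
```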
Any time we do the measurement and compute the sample mean $\bar{x}$, we get a different number. We can actually look at $\bar{x}$ as a specific realization of a random variable $\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$ where $X_i$ are random variables with identical distribution functions. The random variable $\bar{X}$ has some pretty useful properties.
One cool tool in the statistics toolbox is the Central Limit Theorem (CLT). It says that no matter what the probability distribution of a random variable $X$ with mean $\mu$ and standard deviation $\sigma$ is, $\bar{X}$ is a random variable that for $N \rightarrow \infty$ approximates a normal distribution with mean $\mu$ and standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$,
$\bar{X} \approx \mathrm{Norm}(\mu, \frac{\sigma}{\sqrt{N}})$.
From a theoretical standpoint, $N$ should be at least 30 for a reasonable approximation. In practice, the higher the better.
Let's do an experiment. Imagine we have some very crazy random variable $X$ with a distribution defined as:
$f(x) = 1 + \frac{\sin(x^2)}{x}$ for $x \in [-\sqrt{2\pi}, \sqrt{2\pi}]$, $0$ elsewhere.
It looks like:
Thorough readers may spot that the distribution function is not normalized, i.e. the area below the curve is not equal to one. Allow us this simplification - it has no effect on the demonstration below.
If you repeatedly take 20 numbers from this random variable, you will observe that the sample mean (red vertical line) is located in approximately the same area.
The red line in the graphs above is the average value (or conversion rate, if applied to our case). Each graph represents one test with 20 participants.
The CLT tells us that, if we increase the sample size, the variance around the true mean should decrease. This means that the red line we obtain has a higher chance of being close to the true mean. Let’s repeat the experiment above 10,000 times, noting the obtained average value each time, and display the results in a histogram. We will do this for sample sizes N = 1, 5, 50, and 500.
We can see the CLT in action. As we increase the sample size, we get a more accurate estimate of the true population mean, i.e. the standard deviation of the sample mean decreases.
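If you want to reproduce the experiment, here is a sketch. Only the distribution $f(x)$ and the sample sizes come from the text above; the sampler (simple rejection sampling) and the plotting details are our own choices:

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    # the "crazy" distribution from the text; x = 0 (where the limit is 1)
    # never comes up with continuous uniform proposals, so we don't special-case it
    return 1 + np.sin(x**2) / x

LO, HI = -np.sqrt(2 * np.pi), np.sqrt(2 * np.pi)
F_MAX = 2.0  # f(x) stays below 2 on the interval, good enough for rejection sampling

def sample_f(n, rng):
    """Draw n values distributed according to f(x) via rejection sampling."""
    out = []
    while len(out) < n:
        x = rng.uniform(LO, HI, size=n)
        u = rng.uniform(0, F_MAX, size=n)
        out.extend(x[u < f(x)])
    return np.array(out[:n])

rng = np.random.default_rng(0)
fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, n in zip(axes, [1, 5, 50, 500]):
    means = [sample_f(n, rng).mean() for _ in range(10_000)]
    ax.hist(means, bins=50)
    ax.set_title(f"N = {n}")
plt.show()
```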
Back to the page
So, we have our test done and we need to know if the change had any effect. Restated: In which version of the alternative universe are we? Is it:
- $H_0$, where the CTA button modification has no effect, or
- $H_a$, where the CTA button modification has the effect of an increased conversion rate?
We will use the terms $H_0$ or $H_a$ interchangeably with the statement that CTA has no effect, or CTA has an effect.
We maintain our assumption that we know the conversion rate of our current variant $H_0$ (unmodified CTA button) is $\mu = 0.2$. For our example, we also assume that the standard deviation is $\sigma = 1/2$.
Imagine that we conducted the test by presenting the new version of the page to 90 users (this number was chosen just for this demonstration). Remember, for simplicity we still measure only the alternative version.
The test is run, and the conversion rate (our sample mean) of the experiment with the new CTA button is $\bar{x} = \frac{23}{90} \approx 0.26$. We can now figure out whether our assumption that the CTA change had no effect holds true, and with what probability. To do this, imagine we really are in the universe $H_0$, where the CTA change has no effect, and where we know the true $\sigma$ and $\mu$. Draw the distribution function of $\bar{X}$ and the value of the sample mean $\bar{x}$ (red vertical line).
What is the probability of getting the measured value $\bar{x}$ or higher? In our case, it's the area to the right of the $\bar{x}$ below the distribution curve.
In our case, the blue area is close to 13%. We know that in the universe where $H_0$ (CTA has no effect) is true, there is still a 13% chance that our experiment will result in a conversion rate greater than or equal to 0.26. In a case like this, we should not reject $H_0$, because "getting a conversion rate of 0.26 can simply happen in the $H_0$ universe". Restated: We cannot prove that the CTA change had any effect.
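For reference, the blue area can be computed as the upper tail of the normal approximation of $\bar{X}$. This quick sketch uses the values from the text ($\mu = 0.2$, $\sigma = 0.5$, $N = 90$) and the rounded $\bar{x} = 0.26$:

```python
import math
import scipy.stats as stats

mu, sigma, N = 0.2, 0.5, 90
x_bar = 0.26                  # the measured conversion rate, 23/90 rounded

se = sigma / math.sqrt(N)     # standard error of the sample mean (CLT)
p = 1 - stats.norm.cdf(x_bar, loc=mu, scale=se)
print(p)                      # roughly 0.13 - the blue area
```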
We should actually reverse the process: pick a probability $\alpha$ and find the value for which you will say:
"In the $H_0$ universe, getting a conversion rate higher than some critical value $CV$ has such a low probability of $\alpha$, that if we get such a result we will reject the assumption that we are in this universe".
Written in math notation, find $CV$ so that
$P(\bar{x} \gt CV \mid H_0) = \alpha$, and if we measure $\bar{x} \gt CV$, we reject $H_0$.
The value of $\alpha$ is the probability that we reject the $H_0$ hypothesis (that the CTA has no effect) while $H_0$ really is true (the CTA had no effect). Rejecting $H_0$ while it's in fact true is called a _Type I error_, or _false positive_ error. In the pre-test analysis we usually set a value of $(1 - \alpha)$, called the _confidence level_ - the probability of not rejecting $H_0$ while it really is true. Its value is typically set to $0.95$.
For simplicity's sake, we assumed that the change we make to the CTA button could only improve performance. If the change could also make things worse, we would use a two-tailed test, and we'd need to use the value $\alpha/2$ instead of $\alpha$ for further computations. Read more about two-tailed tests here.
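For intuition, the difference shows up in the critical value of the standard normal: a one-tailed test at $\alpha = 0.05$ uses the 95th percentile, while a two-tailed one uses the 97.5th:

```python
import scipy.stats as stats

alpha = 0.05
z_one_tailed = stats.norm.ppf(1 - alpha)      # ~1.645
z_two_tailed = stats.norm.ppf(1 - alpha / 2)  # ~1.960
print(z_one_tailed, z_two_tailed)
```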
If we set the confidence level to 95% ($\alpha = 0.05$), we can compute the critical value $CV$ that can be compared to $\bar{x}$. In Python we can do it like this:
import scipy.stats as stats
# the scale parameter is the standard error of the sample mean, sigma / sqrt(N)
CV = stats.norm.ppf(confidence, loc=mu, scale=sigma / N**0.5)
In our case the $CV_{0.05}$ is:
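Plugging our concrete numbers into the call above ($\mu = 0.2$, $\sigma = 0.5$, $N = 90$, confidence $= 0.95$):

```python
import math
import scipy.stats as stats

mu, sigma, N = 0.2, 0.5, 90
confidence = 0.95

CV = stats.norm.ppf(confidence, loc=mu, scale=sigma / math.sqrt(N))
print(round(CV, 3))   # ~0.287, i.e. the 0.29 threshold used below
```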
If we get $\bar{x} \le 0.29$, then we fail to reject $H_0$. We won't come to the conclusion that the change to the CTA button has had any effect on the conversion rate. If we get $\bar{x} \gt 0.29$, we can conclude that the CTA change has had some effect. But what effect?
For that, we need to make some assumptions about what our alternative universe $H_a$ looks like. Let's assume that our new CTA button increases the conversion rate by 10 percentage points, $\Delta = 0.1$, so $\mu_a = \mu_0 + \Delta$, and that $\sigma$ does not change. Let's draw both universes - the one where $H_0$ is true and the one where $H_a$ is true.
We have already set the confidence level $(1-\alpha)$, so the $CV$ threshold (the red vertical line) is set.
Now assume we live in the universe where $H_a$ holds true. We know that we do not reject $H_0$ when the value is less than or equal to 0.29. In this universe, the green area, denoted as $\beta$, is the probability of rejecting the hypothesis $H_a$ (CTA has an effect) while $H_a$ is true. We can see it is 40% in our case!
Rejecting $H_a$ while it is true is called the _Type II error_, or the _false negative_ error. The complement, $(1 - \beta)$, is the probability of accepting $H_a$ while it is true, and is called the _Power of the test_. We want this value to be as large as possible.
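With our numbers ($\mu_0 = 0.2$, $\Delta = 0.1$, $\sigma = 0.5$, $N = 90$, $\alpha = 0.05$), the green area and the power can be sketched like this:

```python
import math
import scipy.stats as stats

mu_0, delta, sigma, N = 0.2, 0.1, 0.5, 90
alpha = 0.05

se = sigma / math.sqrt(N)
CV = stats.norm.ppf(1 - alpha, loc=mu_0, scale=se)   # ~0.287

mu_a = mu_0 + delta
beta = stats.norm.cdf(CV, loc=mu_a, scale=se)        # P(do not reject H0 | Ha), ~0.40
power = 1 - beta                                     # ~0.60
print(beta, power)
```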
To sum it up, in the graph we have four regions A, ..., D, where each area under the curve represents the probability of arriving at one of the following conclusions:
- (A): CTA has no effect while the CTA really has no effect.
- (B): CTA does have an effect while the CTA really has no effect: Type I error
- (C): CTA has no effect while the CTA really does have an effect: Type II error
- (D): CTA does have an effect while the CTA really does have an effect.
Our decision is based on the $CV$ value. If the experiment results in $\bar{x} < CV$, we have $(1-\alpha)$ probability we were right, and $\beta$ probability we made a _Type II_ error. Given our concrete example, we see that there is a 40% chance of rejecting the CTA button modification while it could have the desired effect, which is not good.
How can we increase $(1 - \beta)$, the power of the test? There are several options:
- Increase $\alpha$. This would have a negative effect on the significance of the test by increasing the probability of a false positive conclusion.
- Increase $\Delta$. This would decrease the sensitivity of the test - a smaller, but still worthwhile, difference might go undetected.
- Make the normal distribution curves narrower by increasing the sample size $N$. Remember - put the CLT to use.
The third option is the right one. The normal distribution narrows as the standard deviation $\sigma_{\bar{x}}$ gets smaller, and we know that $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$ - which means we have to increase the sample size $N$.
When doing a proper A/B test, we have the $\alpha$, $\beta$, and $\Delta$ parameters agreed upon before we get started.
What’s left is the sample size $N$, via which we can actually make sure the first three are met in the desired combination. Let’s see how it looks when we run the test with sample sizes equal to 30, 100, and 500.
In this case, we see that 100 users is too few and 500 might be more than enough. It would be great to know how to compute the exact $N$ with which we get the desired significance and power. We will look at how to find the optimal size in the future.
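Under the same assumptions as before, we can check the power for the three sample sizes directly (a sketch only; the exact sample-size formula is left for the next part):

```python
import math
import scipy.stats as stats

mu_0, delta, sigma, alpha = 0.2, 0.1, 0.5, 0.05

def power(N):
    """Power of the one-sided test from the example for a given sample size N."""
    se = sigma / math.sqrt(N)
    CV = stats.norm.ppf(1 - alpha, loc=mu_0, scale=se)
    return 1 - stats.norm.cdf(CV, loc=mu_0 + delta, scale=se)

for N in (30, 100, 500):
    print(N, round(power(N), 2))   # roughly 0.29, 0.64, 1.00
```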
So far, we have assumed that we know the behavior of the variant without the CTA change, and that this behavior won't change during the experiment. This may not always be true. In a typical setup, we present both variants to randomly-selected users, measure the sample means $\bar{x}_1$ and $\bar{x}_2$, and the sample standard deviations $s_1$ and $s_2$. A case like this can be converted to closely resemble the situation above with a neat trick - by defining a random variable $X = X_1 - X_2$. Here we know that in the $H_0$ universe (the CTA change has no effect), $\mu_0 = 0$, and in the alternative $H_a$ (the CTA has an effect), the mean is $\mu_a = \Delta$. We will discuss the ideal sample size derivation for such a random variable in the second part of this series.
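As a preview, evaluating that typical two-variant setup boils down to working with the difference of the two sample means. A minimal sketch, with made-up counts and a simple (unpooled) standard error for a one-sided test:

```python
import math
import scipy.stats as stats

# hypothetical raw results: conversions / visitors per variant
conv_a, n_a = 180, 900    # control
conv_b, n_b = 220, 900    # alternative

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a                                   # realization of X = X_1 - X_2

# standard error of the difference of two sample proportions
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

# under H0 the difference is centered at 0
p_value = 1 - stats.norm.cdf(diff / se)
print(diff, p_value)
```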
Key takeaways
- Significance is not the only parameter we should be looking at when evaluating a test
- Don’t forget about the power of the test. If ignored, you may be rejecting cool features just because you ran your test on a sample that was too small
- If the measured improvement is smaller than the $\Delta$ anticipated when the required sample size $N$ was estimated, the test is underpowered - the green area (the probability of rejecting a CTA button change that actually works) is bigger than you initially planned to tolerate.
- Compute your required sample size before running the test based on the $\alpha$, $\beta$, and $\Delta$ parameters. A nice calculator can be found here: https://abtestguide.com/calc/