## Introduction

When you’re testing a new model, a new product or a new process, how do you determine how good it is? One of the most common practical methods of answering that question is to take random samples for testing.

Because these random samples represent the larger population, the results of the tests applied to them can be taken as representative of what the results would have been if the tests had been performed on the entire population.

But this approach opens up a new question: just **how large** should the sample size be?

Sampling, by definition, rests on the underlying assumption that the sample is representative of the larger population. There will therefore always be an element of uncertainty, measured as the margin of error within a confidence interval.

## Cochran's Formula

When you are able to define the acceptable *margin of error* and *confidence interval*, Cochran’s Formula can be used to calculate the minimum required sample size.

\[n = {{Z ^ 2 p q} \over {e ^ 2}}\]

Where:

- *n*: Minimum sample size
- *Z*: *Z-score*, e.g. 1.96 for 95% confidence or 2.58 for 99% confidence
- *p*: Estimated proportion of the population having the desired attribute
- *q*: 1 - *p*, or the estimated proportion of the population without the desired attribute
- *e*: Margin of error

Let’s say we’re designing a model to predict customer churn. If the model works as intended, it can be used to identify customers at high risk of leaving so that we can act on them, for example by running targeted campaigns. But if the model does not work as intended, it might incorrectly label certain customers as having very low churn risk when in fact they are at imminent risk of leaving.

As the cost of such a model being wrong some of the time outweighs the benefit of it being accurate most of the time, it is important to test whether the customers the model identifies as low-risk are, in actual fact, low-risk. To do this, we would take a random sample of customers identified as low-risk and observe their behaviour within a given timeframe (say, 3 months) to see how many of them end up leaving.

## Minimum Sample Size

The test would be set up in the following way:

- Define the *null hypothesis*: > 1% of low-risk customers will leave within 3 months
- Obtain a random sample of low-risk customers and observe their behaviour
- If none of them leave within 3 months, we can reject the *null hypothesis* and conclude that the model works as intended, within the specified *margin of error* and *confidence level*

Let’s set the *margin of error* at +/-1%, and the *confidence level* at 95%.

Using *Cochran’s Formula*, we can calculate the minimum sample size for this test as:

\[\eqalign{ n &= {{Z ^ 2 p q} \over {e ^ 2}} \cr &= {{1.96 ^ 2 \times 0.01 \times (1 - 0.01)} \over {0.01 ^ 2}} \cr &= 380.32 }\]
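The same arithmetic can be reproduced directly in R:

```
# Cochran's Formula with Z = 1.96, p = 0.01, e = 0.01
(1.96 ^ 2 * 0.01 * (1 - 0.01)) / 0.01 ^ 2
# 380.3184
```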

Rounding up, this tells us that 381 low-risk customers will need to be sampled. Once all 381 test negative, we can make the statement:

> 0-2% of customers identified by the model as low risk will leave within 3 months (at 95% confidence)

Or rather, since we have **rejected** the *null hypothesis* that > 1% of low-risk customers will leave within 3 months, we can invert the statement and say that:

> 98-100% of customers identified by the model as low risk will *not* leave within 3 months (at 95% confidence)

## Impact of MoE and CI

What if we wanted a smaller margin of error (e.g. 0.5%) or a higher confidence level (e.g. 99%)?

Let’s use R to see how the minimum sample size is affected by different margins of error and confidence levels.

First, let’s load the libraries we’ll be using in this exercise. I’m also telling `ggplot2` to use `theme_tq` from the `tidyquant` library as the default visual theme.

```
# Load required libraries
library(tidyverse)
library(tidyquant)
# Define default theme for ggplot2
ggplot2::theme_set(tidyquant::theme_tq())
```

Next, we define a function for calculating the minimum sample size using Cochran’s Formula. Note that we are passing *p*, *e* and *ci* as arguments so we can easily iterate through different values.

```
# Function: Calculate Minimum Sample Size using Cochran's Formula
calc_min_sample_size <- function(p = 0.1, e = 0.01, ci = 0.95) {
  z <- qnorm(1 - (1 - ci) / 2)
  n <- z ^ 2 * p * (1 - p) / e ^ 2
  return(n)
}
```
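As a sanity check, calling the function with the values from the worked example closely reproduces the hand calculation (it comes in marginally below 380.32 because `qnorm(0.975)` returns 1.959964 rather than the rounded 1.96):

```
# Sanity check against the hand-calculated 380.32
calc_min_sample_size(p = 0.01, e = 0.01, ci = 0.95)
# ~380.30
```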

Now let’s take a look at the impact margin of error has on the minimum sample size.

```
# Impact Analysis: Margin of Error
x <- seq(from = 0.005, to = 0.02, by = 0.0001)
y <- calc_min_sample_size(p = 0.01, e = x)
ggplot2::qplot(x, y, geom = "line", color = I("firebrick")) +
  ggplot2::labs(
    title = "Impact Analysis: Margin of Error",
    subtitle = "Constants: p = 0.01, ci = 0.95",
    x = "Margin of Error",
    y = "Min. Sample Size"
  )
```

Observe how quickly the minimum sample size grows as the *margin of error* decreases: since *n* is proportional to 1/*e*^2, halving the margin of error quadruples the sample size, from **~381 at 1%** (0.01) to **~1,522 at 0.5%** (0.005), even with the *confidence level* held constant at 95%.
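These endpoint figures can be verified by calling the function directly and rounding up:

```
# Spot-check the endpoints of the margin-of-error chart
ceiling(calc_min_sample_size(p = 0.01, e = 0.01))   # 381
ceiling(calc_min_sample_size(p = 0.01, e = 0.005))  # 1522
```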

Next, let's see the impact the *confidence level* has on the minimum sample size.

```
# Impact Analysis: Confidence Level
x <- seq(from = 0.90, to = 0.999, by = 0.0001)
y <- calc_min_sample_size(p = 0.01, e = 0.01, ci = x)
ggplot2::qplot(x, y, geom = "line", color = I("firebrick")) +
  ggplot2::labs(
    title = "Impact Analysis: Confidence Level",
    subtitle = "Constants: p = 0.01, e = 0.01",
    x = "Confidence Level",
    y = "Min. Sample Size"
  )
```

You can observe how increasing the *confidence level* likewise has a non-linear impact on the minimum sample size, from **~268 at 90%** (0.90) to **~657 at 99%** (0.99) to **~1,072 at 99.9%** (0.999).

What happens when we do both at the same time, decreasing *e* to 0.005 and increasing *ci* to 0.99? Let's call our function to find out.

```
> calc_min_sample_size(p = 0.01, e = 0.005, ci = 0.99)
[1] 2627.419
```

We will need to test a whopping 2,628 samples!
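To compare several scenarios side by side, we can sketch a small grid over both parameters, reusing the `calc_min_sample_size` function defined above (which is vectorised over its arguments):

```
# Minimum sample sizes across a grid of (e, ci) scenarios
scenarios <- expand.grid(e = c(0.01, 0.005), ci = c(0.95, 0.99))
scenarios$n <- ceiling(
  calc_min_sample_size(p = 0.01, e = scenarios$e, ci = scenarios$ci)
)
scenarios
# n: 381, 1522, 657, 2628
```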

## Conclusion

Like most things in data science (you could probably even claim this for life in general), calculating the minimum sample size required for testing is an exercise in choosing trade-offs.

Naturally, we want the smallest *margin of error* and the highest *confidence level*. But the resulting number of samples that need to be tested is often prohibitively large, sometimes even to the extent of defeating the whole point of sampling in the first place.

This is why you see most studies conducted at 95% confidence, as it represents a reasonable trade-off between reliability of the result versus the effort required to conduct experiments.