Hey there. This lesson is meant to

be a conversational statistics

refresher to get you ready to use

the statistics formulas necessary for AB testing,

which is an important part of data science.

It's really helpful to have a strong understanding of

statistics for some kinds of data analysis.

If you already have a background in

statistics and you understand the difference

between standard error and

standard deviation and how these can be

used to compute a confidence interval,

you might be ready to jump right into

AB testing examples and skip this lesson altogether.

If, however, your statistics skills are rusty,

this'll give you some intuition

for some of the numbers that will be

used under the hood in our AB tests calculator.

Even if you don't have a strong background in stats,

it would help if you are already familiar with

the concept of a standard deviation and a distribution.

Okay. So here's a question from my real life.

I take the train to work,

and I enjoy my commute a lot

more if I can get a seat on the train.

So the question that I think I want to ask is,

how many seats are there on the train?

Does this question have a numerical answer?

No, not exactly.

The number of seats available varies.

It exists on a distribution.

Sometimes I'll get an empty train,

and there'll be lots of seats.

Sometimes the train will be full.

A better question to ask is,

what's the average number of

seats available on the train?

The way I can collect data to answer

this question is to make an observation.

I can go to the station and

observe how many seats are available.

So if I just did this once,

is that a good way to answer the question?

Maybe not. Imagine the case where

the trains are either really empty or really full.

I would call this distribution bimodal,

which means it has two peaks.

This won't look like a bell curve.

If I took just one observation

and used that to estimate the mean,

I might be way off.

So what's something else I could do?

I could collect a bunch of

observations and take the average.

I would be sampling randomly from the distribution,

empty, full, empty, full.

The average of these observations

would be a better estimate of the mean.

One of the cool things that

happens when we take an average over

several samples is that

the central limit theorem

unlocks a bunch of cool tools for us.

The central limit theorem basically says that even if

the original distribution we're

sampling from isn't normal,

like a bimodal distribution,

if I repeatedly take samples and compute

the average number of seats

from a collection of observations,

that set of means I compute will be normally distributed.

So it looks like a bell curve.

Then we have a normal distribution,

and we can use well-developed statistics

tools for analyzing it.
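To make that concrete, here's a small simulation sketch. The seat numbers are invented for illustration: individual observations come from a bimodal distribution, but averages of n observations pile up in a bell shape around the overall mean, just as the central limit theorem promises.

```python
import random
import statistics

random.seed(0)

# Hypothetical bimodal seat counts: trains are either nearly full
# (about 2 open seats) or nearly empty (about 30 open seats).
def observe_seats():
    if random.random() < 0.5:
        return max(0.0, random.gauss(2, 1))
    return random.gauss(30, 2)

# Repeatedly average n observations; these sample means are what the
# central limit theorem says will be (approximately) normally distributed.
n = 25
sample_means = [statistics.mean(observe_seats() for _ in range(n))
                for _ in range(2000)]

# Individual observations cluster near 2 or 30; the sample means
# cluster near the overall mean (about 16 here) with much less spread.
print(statistics.mean(sample_means), statistics.stdev(sample_means))
```

Plotting a histogram of `sample_means` would show the bell curve directly; the printout just confirms the means concentrate near the true mean.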

Great. So here's the situation.

I usually go to station A,

but I could also go to

station B to catch a different train into work.

This is where we introduce

our thinly-veiled metaphor for AB testing.

Suppose I've already collected two observations from

station A and my neighbor

collected 10 observations from station B.

Do we have any business comparing our numbers?

Is it a problem that we have

a different number of observations?

No, because we're talking about the averages here.

I'm still curious about which station I

should go to because I want to get a seat.

So, of course, I'm going to ask

for my neighbor's observations.

But whose estimate is better?

We could both be wrong.

In fact, we probably

both are off by at least a little bit.

So why do we have

more confidence in my neighbor's estimate?

Well, it'll be a lot easier for me to get lucky or

unlucky and catch two trains that are

both above or below the mean.

But it would take a lot more luck for

my neighbor to have that happen 10 times.

So the number of observations,

often denoted by the letter n,

is important, and it will show up in our formula.

I've been talking about a case where the train is

either really empty or really full.

But what if the train my neighbor

takes is much more consistent,

where the most common number of

open seats is just one or two,

rather than zero or 30?

Let's ignore whether we still count a seat as open

if someone has spilled coffee all over it;

maybe that's a fractional seat, TBD.

So if the distribution for train B has less variation or,

I should say, more specifically,

a smaller standard deviation,

I suspect it might be easier to get close

to the true mean with fewer observations.

Reminder, we don't actually know

the true distribution of seat availability.

We can only estimate it from our observations.

We have this notion of a confidence interval.

This is a plus or minus range that we think

the true mean is in based on our observations.

Remember, my neighbor and I are

probably both off by at least a little bit.

Using the tools that we have for normal distributions,

we can compute a range that the true mean is

probably in, based on our observed mean

and the two other variables that we've talked about:

n, the number of samples,

and Sigma, the standard deviation.

So when I compare

my two observations with my neighbor's 10 observations,

my confidence interval should probably be wider because

the Sigma is big and the number of observations is small.

Their confidence interval is probably a bit narrower.

When we do AB testing,

one of the steps is to compute the standard error,

which is the Sigma divided by the square root of

n. This is going to get used in our formula.

The standard error and the standard deviation

are not the same unless n equals one.

So what's the difference?

The standard deviation refers to

the distribution of the seats on the train,

one observation at a time.

The standard error is referring to the distribution of

the means of the n observations.

So the standard deviation of this metric that we

have created from grouping random observations together.
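As a sketch, the computation itself is just one line; the observations below are made-up numbers standing in for my neighbor's ten counts of open seats.

```python
import math
import statistics

# Hypothetical: my neighbor's 10 observations of open seats at station B.
observations = [1, 2, 0, 2, 1, 3, 2, 1, 2, 2]

n = len(observations)
sigma = statistics.stdev(observations)   # spread of individual observations
standard_error = sigma / math.sqrt(n)    # spread of the mean of n observations

# With n = 10, the standard error is sqrt(10) times smaller
# than the standard deviation.
print(sigma, standard_error)
```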

Then we use the z-score,

which is a tool we get to use

thanks to the central limit theorem.

We pick how much of the probability we want to scoop up.

People often choose 95 percent confidence,

which means we think that 95 percent of the time,

the true mean will be in our interval.

This z-score is a multiplier we can use.

We'll take the observed mean and then add

or subtract the multiplier times our standard error.
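Putting those pieces together, here's a sketch of the confidence-interval arithmetic, again using invented observations:

```python
import math
import statistics

# Hypothetical observations of open seats.
observations = [1, 2, 0, 2, 1, 3, 2, 1, 2, 2]

mean = statistics.mean(observations)
standard_error = statistics.stdev(observations) / math.sqrt(len(observations))

z = 1.96  # z-score multiplier for a 95 percent confidence level

# Observed mean, plus or minus the multiplier times the standard error.
low = mean - z * standard_error
high = mean + z * standard_error

print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```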

If I'm going to make a decision about

which station to go to,

I'm going to be looking for a case where

the confidence intervals don't overlap.

That would assure me that one of the trains probably did

have a higher number of empty seats on average.

If I don't have enough information yet,

I could continue to collect

observations and have my neighbor

collect observations until

all the confidence intervals were small enough.

This means that the number n gets bigger,

and the standard error gets smaller,

and the intervals get narrower.

Okay. What if I'm overthinking this?

Really, it doesn't matter which train I

take because the distributions

are just the same at both stations.

I might call this the null hypothesis,

the hypothesis that nothing is different.

In this case, I would expect that

the confidence intervals would overlap no matter how much

time we waste collecting new observations

because we're sampling to approximate the same value.

When we compute another term called the p-value,

we're going to use those same variables,

and it will tell us what

the probability is that the difference in means occurred

because of natural variation in the samples

rather than from a difference

in the underlying distributions.

You can use the p-value to decide whether to reject the null hypothesis.
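One common way to compute that p-value for a difference in means is a two-sided z-test; here's a sketch with invented observations (with samples this small, a t-test would really be more appropriate, but this shows the shape of the computation).

```python
import math
import statistics

station_a = [0, 28]                          # my two observations (hypothetical)
station_b = [1, 2, 0, 2, 1, 3, 2, 1, 2, 2]  # my neighbor's ten (hypothetical)

def mean_and_se(xs):
    """Observed mean and its standard error, sigma / sqrt(n)."""
    return statistics.mean(xs), statistics.stdev(xs) / math.sqrt(len(xs))

mean_a, se_a = mean_and_se(station_a)
mean_b, se_b = mean_and_se(station_b)

# z statistic for the difference in means, then a two-sided p-value:
# the probability of seeing a difference at least this big if the
# null hypothesis (same underlying distribution) were true.
z = (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)
p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))

print(round(p_value, 3))
```

With only two highly variable observations at station A, the standard errors are large and the p-value comes out well above the usual 0.05 threshold, so we can't reject the null hypothesis from this data.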

So in real life, this metric,

the number of empty seats on the train,

might not be the right metric to help

me decide which station to go to.

Remember, I just have one body.

I don't really care how many seats there are.

I actually only care if I'm going to get a seat.

In that case, a completely empty train,

every once in a while,

doesn't do me as much good as a single seat every time.

So can you think of a better metric to collect,

to help me make my decision?

As an analyst, you might not be making

decisions about which train to take.

Instead, it might be a decision about

which subject line to use in an e-mail to

send to all of your new users or

which of two recommendation algorithms to choose from.

It will be up to you to decide

things like what data to collect,

how much data to collect, and ultimately,

whether there's a statistically significant difference

between option A and option B.

Okay. Hopefully, now you're feeling ready

to dig into some AB testing case studies.