The last thing we'll need is a conditional probability.

We want to answer a question,

what is the probability of X, given that some event Y has happened?

It is given by the formula that you can see on the slide.

It is the probability of X given Y, which equals the joint probability P(X, Y)

divided by the marginal probability P(Y).

Let's consider an example.

Imagine you are a student and you want to pass some course.

It has two exams in it, a midterm and the final.

The probability that the student will pass the midterm is 0.4, and

the probability that the student will pass both the midterm and the final is 0.25.

If you want to find the probability that you will pass the final, given that you

already passed the midterm, you can apply the formula from the previous slide.

And this will give you 0.25 / 0.4 = 0.625, that is, about 62.5%.
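The arithmetic from the example can be checked in a few lines; the variable names here are just illustrative:

```python
# Conditional probability: P(final | midterm) = P(midterm, final) / P(midterm).
# Numbers are the ones from the exam example in the lecture.
p_midterm = 0.4             # P(pass midterm)
p_midterm_and_final = 0.25  # P(pass midterm AND pass final)

p_final_given_midterm = p_midterm_and_final / p_midterm
print(p_final_given_midterm)  # 0.625, i.e. 62.5%
```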

We'll need two tricks to deal with formulas.

The first is called the chain rule.

We can derive it from the definition of the conditional probability.

That is, the joint probability of X and

Y equals the probability of X given Y times the probability of Y.

By induction, we can prove the same formula for three variables.

It will be: the probability of X, Y, and Z equals the probability of X given Y and

Z, times the probability of Y given Z, and finally the probability of Z.

And in a similar way, we can obtain the formula for

an arbitrary number of variables.

So each factor would be the probability of the current variable,

given all the previous ones.
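We can verify the three-variable chain rule numerically on a small, randomly generated discrete joint distribution; this is a sketch, and the table sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 3, 4))
joint /= joint.sum()  # a valid joint distribution P(X, Y, Z) over small discrete variables

# Marginals and conditionals derived from the joint.
p_z = joint.sum(axis=(0, 1))    # P(Z)
p_yz = joint.sum(axis=0)        # P(Y, Z)
p_y_given_z = p_yz / p_z        # P(Y | Z)
p_x_given_yz = joint / p_yz     # P(X | Y, Z)

# Chain rule: P(X, Y, Z) = P(X | Y, Z) * P(Y | Z) * P(Z)
reconstructed = p_x_given_yz * p_y_given_z * p_z
print(np.allclose(reconstructed, joint))  # True
```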

The last rule is called the sum rule.

That is, if you want to find the marginal distribution p(X), and

you know only the joint probability p(X, Y),

you can integrate out the random variable Y, as the formula on the slide shows.
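For discrete variables, the integral in the sum rule becomes a sum over the values of Y; here is a minimal sketch with made-up numbers:

```python
import numpy as np

# A small joint table P(X, Y): rows index X, columns index Y.
joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])

# Sum rule: marginalize out Y by summing over its axis.
p_x = joint.sum(axis=1)
print(p_x)  # [0.3 0.7]
```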

And finally, the most important formula for this course, the Bayes theorem.

We want to find out the probability of theta given X,

where theta are the parameters of our model.

For example, we have a neural network and those are its parameters.

And then we have X.

Those are the observations, for example, the images that you are dealing with.

From the definition of the conditional probability, we can say that it is a ratio

between the joint probability and the marginal probability, P(X).

And if we also apply the chain rule, we'll get the following formula.

It will be the probability of X given theta,

times the probability of theta over probability of X.

This formula is so important that each of its components has its own name.

The probability of theta is called the prior;

it shows what prior knowledge we have about the parameters.

For example, you may know that some parameters are distributed around 0.

The term, the probability of X given theta, is called the likelihood, and

it shows how well the parameters explain our data.

The thing that we get, the probability of theta given X, is called a posterior, and

it is the probability of the parameters after we observe the data.

And finally, the term in the denominator, the probability of X, is called the evidence.

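Bayes' theorem can be applied directly when theta takes a few discrete values; the prior and likelihood numbers below are invented purely for illustration:

```python
import numpy as np

# Bayes' theorem for a parameter theta with three possible values.
prior = np.array([0.5, 0.3, 0.2])       # P(theta): prior knowledge about the parameters
likelihood = np.array([0.1, 0.6, 0.3])  # P(X | theta): how well each value explains the observed data

evidence = (likelihood * prior).sum()      # P(X), obtained via the sum rule
posterior = likelihood * prior / evidence  # P(theta | X): belief after observing the data

print(posterior)  # a valid distribution that sums to 1
```

Note how the middle value of theta, which explains the data best, ends up with the largest posterior mass even though it was not the most probable a priori.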