In the earlier videos from this week, as well as

in the videos from the past several weeks,

you've already seen the basic building blocks

of forward propagation and back propagation,

the key components you need to implement a deep neural network.

Let's see how you can put these components together to build a deep net.

Here's a network with a few layers.

Let's pick one layer and look at the computations focusing on just that layer for now.

For layer L, you have some parameters WL and BL,

and for the forward prop,

you will input the activations AL-1,

from the previous layer, and output AL.

The way we did this previously was to compute ZL = WL·AL-1 + BL,

and then AL = g(ZL), right?

That's how you go from the input AL-1 to the output AL.

And it turns out that for later use,

it'll be useful to also cache the value ZL.
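The forward computation just described, including caching ZL, can be sketched as a small function. This is only an illustrative sketch in NumPy; the function and variable names are my own, not a prescribed API.

```python
import numpy as np

def linear_activation_forward(a_prev, W, b, g):
    """Forward step for one layer (an illustrative sketch).

    Computes ZL = WL·AL-1 + BL, then AL = g(ZL), and caches
    the value ZL for the back propagation step later.
    """
    z = W @ a_prev + b      # ZL = WL·AL-1 + BL
    a = g(z)                # AL = g(ZL)
    cache = z               # stored for backward use
    return a, cache
```

Here `a_prev` plays the role of AL-1 and `g` is whatever activation function the layer uses.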

Let me include this cache as well, because

storing the value ZL will be useful for the back propagation step later.

Then, for the backward step, again

focusing on the computation for this layer L,

you're going to implement a function that inputs

DA of L and outputs DAL-1.

Just to flesh out the details,

the input is actually DA of L,

as well as the cache,

so you have available to you the value of ZL that you computed,

and then, in addition to outputting DAL-1,

you also output the gradients you

need in order to implement gradient descent for learning.
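The backward function for one layer can be sketched as follows. This is a sketch under my own naming assumptions, and it assumes the cache has been unpacked into the cached ZL together with AL-1 and WL (a convention discussed again at the end of this video).

```python
import numpy as np

def linear_activation_backward(da, z, a_prev, W, g_prime):
    """Backward step for one layer (an illustrative sketch).

    Given DA of L and the cached ZL, compute DZL, then the
    gradients DWL and DBL needed for gradient descent, and
    DAL-1 to pass back to the previous layer.
    """
    m = a_prev.shape[1]                          # number of examples
    dz = da * g_prime(z)                         # DZL = DAL * g'(ZL)
    dW = (dz @ a_prev.T) / m                     # DWL
    db = np.sum(dz, axis=1, keepdims=True) / m   # DBL
    da_prev = W.T @ dz                           # DAL-1
    return da_prev, dW, db
```

The `g_prime` argument is the derivative of the layer's activation function.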

So this is the basic structure of how you implement this forward step,

what we call the forward function, as well as this backward step,

which we call the backward function.

Just to summarize, in layer L,

you're going to have the forward step or the forward prop or the forward function,

which inputs AL-1 and outputs AL,

and in order to make this computation,

you need to use WL and BL, and also

output a cache, which contains ZL. And then the backward function,

used in the back prop step,

will be another function that now inputs

DA_of_L and outputs DAL-1.

That tells you: given the derivative with respect to these activations,

that's DA of L, what are the derivatives with respect to AL-1?

How much do I wish AL-1 would change?

So compute the derivatives with respect to the activations from the previous layer.

Within this box, you need to use WL and BL,

and it turns out that along the way,

you end up computing DZL

inside this box.

This backward function can also output DWL and DBL.

I've sometimes been using red arrows to denote the backward computation,

so if you prefer, we could fill these arrows in red.

If you can implement these two functions,

then the basic computation of the neural network will be as follows.

You are going to take the input features A0,

feed that in, and that will compute the activations of the first layer;

let's call that A1.

To do that, you needed W1 and B1

and then we'll also cache away Z1.

Now, having done that,

you feed that to the second layer,

and then using W2 and B2,

you're going to compute the activations of the next layer A2,

and so on, until eventually,

you end up outputting A of capital L,

which is equal to Y_hat.

And along the way, we cached all of these values Z,

so that's the forward propagation step.
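The whole forward chain just described can be sketched as a loop over the layers. Again, this is an illustrative sketch: `params` is a hypothetical list of (W, b) pairs, one per layer, and a single activation `g` is assumed for simplicity.

```python
import numpy as np

def model_forward(x, params, g):
    """Full forward pass sketch: start with A0 = X, then each
    layer turns AL-1 into AL, caching ZL along the way."""
    a = x                       # A0, the input features
    caches = []
    for W, b in params:
        z = W @ a + b
        caches.append(z)        # cache ZL for back prop
        a = g(z)
    return a, caches            # a is A of capital L, i.e. Y_hat
```

The returned `caches` list holds one cached Z per layer, in forward order.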

Now for the back propagation step,

what we're going to do will be a backward sequence of

iterations in which we are going backwards and computing gradients like so.

It's going to feed in here, DA_of_L,

and then this box would give us DA of L-1,

and so on, until we get DA2, DA1.

You could actually get one more output to compute DA0,

but this is derivative with respect to your input features which is not useful,

at least for training the weights of these supervised neural networks,

so you could just stop it there.

Along the way, back prop also ends up outputting DWL and DBL.

Using the parameters WL and BL,

this box would output DW3,

DB3, and so on.

You end up computing all the derivatives you need.
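The backward sequence of iterations can be sketched as a reverse loop over the layers. This is a sketch under my own assumptions: each cache here is taken to be a (z, a_prev, W) triple, anticipating the cache convention discussed at the end of this video, and `g_prime` is the derivative of the shared activation.

```python
import numpy as np

def model_backward(da_last, caches, g_prime):
    """Backward pass sketch: start from DA of L and run each
    layer's backward step in reverse, collecting DWL and DBL."""
    grads = []
    da = da_last
    for z, a_prev, W in reversed(caches):
        m = a_prev.shape[1]
        dz = da * g_prime(z)
        grads.append(((dz @ a_prev.T) / m,                     # DWL
                      np.sum(dz, axis=1, keepdims=True) / m))  # DBL
        da = W.T @ dz               # DAL-1, fed to the next step back
    grads.reverse()                 # grads[l-1] holds (DWL, DBL)
    return grads
```

As the video notes, you can stop before computing DA0, since the derivative with respect to the input features isn't needed for training.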

Just to maybe fill in the structure of this a little bit more,

these boxes will use those parameters as well, WL,

BL, and it turns out, as we'll see later, that inside these boxes,

we end up computing DZs as well.

One iteration of training for a neural network involves starting with A0, which is X,

and going through forward prop as follows,

computing Y_hat, and then using that to compute DA of L,

and then going through back prop like that.

Now, you have all these derivative terms,

and so W will get updated as W minus the learning rate times DW for each of the layers,

and similarly for B.

So now you've computed back prop and have all of these derivatives.

That's one iteration of gradient descent for your neural network.
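The update just described, W := W - learning rate × DW (and similarly for B) for each layer, can be sketched like this; the list-of-pairs layout for parameters and gradients is my own illustrative choice.

```python
def gradient_descent_step(params, grads, lr):
    """One gradient descent update per layer:
    W := W - lr * DW and b := b - lr * DB.
    `params` and `grads` are parallel lists of (W, b) and (dW, db)."""
    return [(W - lr * dW, b - lr * db)
            for (W, b), (dW, db) in zip(params, grads)]
```

Running forward prop, back prop, and then this update once is exactly the one iteration of gradient descent described above.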

Before moving on, just one more implementational detail.

Conceptually, it'd be useful to think of the cache here as storing

the value of Z for the backward functions,

but when you implement this,

and you'll see this in the programming exercise when we implement it,

you find that the cache may be a convenient way to

get the values of the parameters W1 and

B1 into the backward function as well.

In the programming exercise,

you actually store in your cache Z,

as well as W and B;

so you'd store Z2, W2, B2, and so on.

From an implementational standpoint,

I just find this a convenient way to get the parameters copied

to where you need to use them later, when you're computing back propagation.

That's just an implementational detail that you see when you do the programming exercise.
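As a small illustration of that detail, the forward step's cache can bundle the parameters together with Z; this is a sketch of that convention, with illustrative names.

```python
import numpy as np

def forward_with_full_cache(a_prev, W, b, g):
    """Forward step whose cache bundles (z, a_prev, W, b), so the
    backward function receives the parameters directly instead of
    looking them up separately."""
    z = W @ a_prev + b
    cache = (z, a_prev, W, b)   # everything back prop will need
    return g(z), cache
```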

You've now seen one of the basic building blocks for implementing a deep neural network.

In each layer, there's a forward propagation step

and there's a corresponding backward propagation step,

and there's a cache to pass information from one to the other.

In the next video,

we'll talk about how you can actually implement these building blocks.

Let's go on to the next video.