In the previous section, we introduced RNNs, a new type of model that accepts sequences one event at a time and develops representations of what it has seen previously as it scans its input. But we haven't talked about how RNNs are able to create such powerful representations of the past. RNNs are able to remember what they've seen previously for two reasons: first, they have a recurrent connection between their hidden layer and their input, and second, they use a clever trick during optimization.

First, let's talk about that recurrent connection. The recurrent connection is depicted here in red, and it connects the hidden layer to the fixed-size state. Note how the hidden layer has the same number of units as the fixed-size state. Because the recurrent connection simply copies the values in the hidden layer to the fixed-size state for the next iteration, the two must have exactly the same number of dimensions. It's not obvious how the recurrent connection helps an RNN remember what it has seen previously. To understand that, we need to look at the optimization procedure, and to do that, we need to zoom out a bit.

This is what the RNN model looks like as it scans the sentence. In the green rectangle, I've depicted the RNN cell, and in the orange circle, what might come after it in the network. Note how we can capture the operations we talked about earlier mathematically. The input is represented by x of t, and it is concatenated with h of t minus one, which is the output of the hidden layer from the previous iteration. This concatenated vector is multiplied by the weights and added to the bias term before being passed to the activation function, which traditionally was a tanh. The RNN cell feeds directly into a dense layer followed by a linear layer. Note that the weights and biases for these three layers, tanh, dense, and linear, are all different.

For example, let's say that our current input is our representation of "wag", which is a three-component vector. We can concatenate this vector with the hidden layer from the previous iteration and compute the new value of the hidden layer, which requires multiplying by the weights, adding a bias term, and finally passing the result through the tanh function. Then we would make a prediction for this time step, which we represent as y of t. Hopefully, the predicted output will be close to our representation of "their", which is the next word in our sequence. Finally, we'd pass the contents of our hidden layer, h of t, to the next iteration.

The forward propagation then looks like this diagram. I've depicted time on the vertical axis to capture the different events in the sequence being fed into the model, and iterations on the horizontal axis to capture which inputs the model accepts at which iteration. As before, the model passes an h to the next iteration via the recurrent connection. The reason I use the same symbols to depict the model at every iteration is that even though we're using the model multiple times, once for each event in our sequence, the model still has only one set of parameters. In other words, there's only one set of linear weights, one set of dense weights, and one set of tanh weights. The model uses these at both training time and inference time, which means that each iteration is a blocking operation: iteration t can't run until iteration t minus one has produced its hidden state. You might be wondering what this means for our normal optimization procedure.
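To make the forward pass concrete, here is a minimal NumPy sketch of the cell and unrolling just described: concatenate x of t with h of t minus one, apply the tanh layer, then the dense and linear layers, and reuse the same single set of weights at every iteration. The layer sizes, the toy inputs, and the choice of ReLU for the dense layer are assumptions made for illustration, not values from this lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim = 3    # e.g. a three-component word representation like "wag"
hidden_dim = 4   # hidden layer size == fixed-size state size
dense_dim = 4
output_dim = 3   # the prediction y_t lives in the same space as the inputs

# One set of parameters, reused at every iteration.
W_h = rng.normal(size=(hidden_dim, input_dim + hidden_dim))  # tanh layer weights
b_h = np.zeros(hidden_dim)
W_d = rng.normal(size=(dense_dim, hidden_dim))               # dense layer weights
b_d = np.zeros(dense_dim)
W_o = rng.normal(size=(output_dim, dense_dim))               # linear layer weights
b_o = np.zeros(output_dim)

def rnn_step(x_t, h_prev):
    """One iteration: concatenate x_t with h_{t-1}, update the hidden layer,
    then predict y_t through the dense and linear layers."""
    concat = np.concatenate([x_t, h_prev])
    h_t = np.tanh(W_h @ concat + b_h)          # hidden layer / new fixed-size state
    dense = np.maximum(0.0, W_d @ h_t + b_d)   # dense layer (ReLU assumed here)
    y_t = W_o @ dense + b_o                    # linear layer -> prediction
    return h_t, y_t

# Scan a toy sequence of four word vectors, passing h_t forward each iteration.
sequence = [rng.normal(size=input_dim) for _ in range(4)]
h = np.zeros(hidden_dim)                       # initial fixed-size state
for x in sequence:
    h, y = rnn_step(x, h)
```

Note that the loop makes the blocking behavior visible: each call to rnn_step needs the h produced by the previous call before it can run.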
Normally, during backpropagation, we would compute the loss at the final layer, use it to compute the partial derivative of the loss with respect to every parameter, p^i, in our model, and then use these partial derivatives to update our parameters with a formula that looks like this one. For RNNs, however, that leads to the following challenge: we have more partial derivatives than we have updates to make. In this case, we would have four partial derivatives for every parameter, one from each iteration, but only one update. This is where that clever bit of optimization comes in. We address this problem by updating each parameter using the average of the partial derivatives from all of our iterations. This approach is called backpropagation through time. Because each parameter is updated with a combination of the losses from all four iterations, the model as a whole has pressure to preserve information that's useful in the short term and the long term, and to throw away what isn't. Remarkably, these two tweaks to our DNN architecture, the introduction of a recurrent connection and a clever bit of optimization, make RNNs extremely powerful sequence feature extractors. However, they still have their limitations. In the next section, we'll review what those are.
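Here is a minimal sketch of that update step, assuming the per-iteration partial derivatives have already been computed. The gradients below are random placeholders standing in for the four per-iteration partial derivatives, and the learning rate and parameter shape are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

learning_rate = 0.01
W = rng.normal(size=(4, 7))  # one shared parameter matrix (e.g. the tanh weights)

# One gradient of the loss with respect to W per iteration: four iterations,
# so four partial derivatives for this single parameter. In a real model these
# would come from backpropagating each iteration's loss.
grads_per_iteration = [rng.normal(size=W.shape) for _ in range(4)]

# Backpropagation through time: combine the per-iteration gradients by
# averaging them, then make a single gradient-descent update.
avg_grad = sum(grads_per_iteration) / len(grads_per_iteration)
W = W - learning_rate * avg_grad
```

Whether the per-iteration gradients are averaged, as described here, or summed only changes the effective learning rate by a constant factor; the key point is that every iteration's loss contributes to the single update of the shared parameters.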