Hello. You will now learn how to train your neural machine translation system. You'll learn about key concepts like teacher forcing, and you'll see which type of cost function is used during training. Let's dive in. In this section, you will see how to train your NMT model with attention and be introduced to a key concept called teacher forcing.

By this point, you're aware of how attention works in a seq2seq model: the information from each time step is saved in its own hidden state, and the decoder's previous prediction is used to give more or less weight to each of the encoder hidden states. You've seen how the specially weighted values are tailored to each subsequent German prediction. But how do you really know that these predictions aren't wacky? This is where teacher forcing comes in. Teacher forcing allows your model to use the ground truth, the actual target outputs, during training so that its predictions can be compared against the correct tokens. This yields faster training with the added benefit of higher accuracy.

Let's take a closer look at why teacher forcing is necessary in a seq2seq model. In this example, notice how the model correctly predicted the token at the start of the sequence. But the second prediction doesn't quite match, the third one is even further off, and the fourth predicted token is quite far from making logical sense in German. The important takeaway here is that in a sequence model like this one, each wrong prediction makes the following predictions even less likely to be correct. You need a way to check the prediction made at each step.

This is how the process works during prediction: the token t1 shown here is processed through attention in order to predict the next token, shown in green, and so on and so forth for the following predictions. An important takeaway is that during training, the predicted outputs are not used to predict the next token. Instead, the actual outputs, or ground truth, are the input to the decoder at each time step until the end of the sequence is reached. Without teacher forcing, models can be slow to reach convergence, if they manage to reach it at all. Teacher forcing is not without its issues and is still an area of active research. I've included some optional reading in the course materials if you'd like to know more.

Keep going. At last, you've arrived: the boss level, the Big Kahuna. Let's put together everything you've seen so far. You'll be using this in this week's assignments when you train your very own neural machine translation model with attention. Gear up, this one is a wild ride.

The initial Select makes two copies each of the input tokens, represented by 0, and the target tokens, represented by 1. Remember that here the inputs are English tokens and the targets are German tokens. One copy of the input tokens is fed into the input encoder to be transformed into the key and value vectors, while a copy of the target tokens goes into the pre-attention decoder. An important note here: the pre-attention decoder is not the decoder you were shown earlier, which produces the decoded outputs. The pre-attention decoder transforms the prediction targets into a different vector space, the query vectors, which are used to calculate the relative weight to give each input. The pre-attention decoder takes the target tokens and shifts them one place to the right. This is where the teacher forcing takes place.
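To make that shift-right step concrete, here is a minimal NumPy sketch of preparing teacher-forced decoder inputs. This is not the assignment's actual code: the function name shift_right and the token IDs chosen for padding and start-of-sentence are assumptions for illustration only.

```python
import numpy as np

SOS_ID = 1   # assumed start-of-sentence token ID (illustrative)
PAD_ID = 0   # assumed padding token ID (illustrative)

def shift_right(targets: np.ndarray) -> np.ndarray:
    """Prepend the start token and drop the last position of each sequence,
    so the decoder input at step t is the ground-truth token from step t-1."""
    batch_size = targets.shape[0]
    sos = np.full((batch_size, 1), SOS_ID, dtype=targets.dtype)
    return np.concatenate([sos, targets[:, :-1]], axis=1)

# Example: German target IDs for one sentence, padded to length 6.
targets = np.array([[17, 52, 8, 4, PAD_ID, PAD_ID]])
decoder_inputs = shift_right(targets)
print(decoder_inputs)  # [[ 1 17 52  8  4  0]] -- ground truth fed as decoder input
```

With inputs prepared this way, the decoder sees the true previous German token at every position, which is exactly the teacher forcing described above.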
Every token is shifted one place to the right, and a start-of-sentence token is assigned to the beginning of each sequence. Next, the inputs and targets are converted to embeddings, or initial representations of the words.

Now that you have your query, key, and value vectors, you can prepare them for the attention layer. You will also apply a padding mask to help identify the padding tokens. The mask is used after the computation of Q times K-transpose and before computing the softmax: the where operator in your programming assignment converts the zero-padding tokens to negative one billion, which become approximately zero when the softmax is computed. That's how padding works.

Now everything is ready for the attention layer, inside which all the calculations that assign the weights happen. The residual block adds the queries generated in the pre-attention decoder to the results of the attention layer. The attention layer then outputs its activations along with the mask that was created earlier.

It's time to drop the mask before running everything through the decoder, which is what the second Select does. It takes the activations from the attention layer, or the 0, and the second copy of the target tokens, or the 2, which you may remember from way back at the beginning. These are the true targets, which the decoder needs to compare against the predictions.

Almost done. Then you run everything through a dense layer, or a simple linear layer, with your target vocabulary size. This gives your output the right size. Finally, you take those outputs and run them through LogSoftmax, which converts the scores into log probabilities. Those last four steps comprise your decoder. The true target tokens are still hanging out here and will be passed along with the log probabilities to be matched against the predictions.

There you have it: the model that you will be building, and the intuition behind the steps. Take a break and just let all that sink in. Now you know what's actually happening at each part of the model. That's nice. In this video, you've seen how you can train your neural machine translation system.
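As a recap of the masking step described above, here is a rough NumPy sketch of scaled dot-product attention with the where trick that pushes padded positions to negative one billion before the softmax. The shapes, the boolean mask convention, and the function name are assumptions for illustration, not the assignment's exact implementation.

```python
import numpy as np

def masked_dot_product_attention(q, k, v, pad_mask):
    """Scaled dot-product attention with padded key positions masked out.

    q, k, v: (batch, seq_len, d) arrays; pad_mask: (batch, seq_len) booleans,
    True where the key position is a real token and False where it is padding.
    """
    d = q.shape[-1]
    scores = np.matmul(q, np.swapaxes(k, -1, -2)) / np.sqrt(d)  # Q K^T / sqrt(d)
    # The "where" trick: padded positions get -1e9, so their softmax weight ~ 0.
    scores = np.where(pad_mask[:, None, :], scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return np.matmul(weights, v)

# Tiny example: batch of 1, three positions, the last one is padding.
q = np.random.randn(1, 3, 4)
k = np.random.randn(1, 3, 4)
v = np.random.randn(1, 3, 4)
pad_mask = np.array([[True, True, False]])
out = masked_dot_product_attention(q, k, v, pad_mask)
print(out.shape)  # (1, 3, 4); the padded key contributes ~0 weight
```

Because the exponential of -1e9 underflows to zero, the padded keys end up with essentially zero attention weight, so padding never influences the weighted sum of values.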
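And here is a small sketch of the final dense layer and LogSoftmax, together with one common way the resulting log probabilities could be matched against the true target tokens, a cross-entropy style loss. The layer sizes, the random weights, and the loss shown here are illustrative assumptions; the exact cost function you use in the assignment may be packaged differently.

```python
import numpy as np

def log_softmax(x):
    """Log of the softmax, computed stably along the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Hypothetical sizes for illustration only.
vocab_size, d_model, seq_len = 10, 8, 5
W = np.random.randn(d_model, vocab_size) * 0.01   # dense layer weights
b = np.zeros(vocab_size)

decoder_activations = np.random.randn(1, seq_len, d_model)  # from the attention/decoder stack
logits = decoder_activations @ W + b                        # dense layer sized to the target vocab
log_probs = log_softmax(logits)                             # log probabilities per position

# The true target tokens passed along with the log probabilities are compared
# against them; negative log likelihood (cross-entropy) is one common choice.
true_targets = np.array([[3, 1, 7, 2, 0]])
picked = np.take_along_axis(log_probs, true_targets[..., None], axis=-1)[..., 0]
loss = -picked.mean()
print(loss)
```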