Hello. You'll now learn about two ways to construct the translated sentence. The first approach is known as greedy decoding, and the second is known as random sampling. You will also see the pros and cons of each method. For example, choosing the word with the highest probability at every time step does not necessarily generate the best sequence. With that said, let's dive in and explore these two methods.

By now, you've reached the final part of this week's lectures. That's awesome. I'll show you a few methods for sampling and decoding, as well as a discussion of an important hyperparameter in sampling called temperature. First, here's a reminder of where your model is in the process when sampling and decoding come into play. After all the necessary calculations have been performed on the encoder hidden states and your model is ready to predict the next token, how will you choose to do it? With the most probable token, or by taking a sample from a distribution? Let's discuss a few of the methods available to you.

Greedy decoding is the simplest way to decode the model's predictions, as it selects the most probable word at every step. However, this approach has limitations. When you take the highest-probability token at each step and concatenate all the predicted tokens into the output sequence, as the greedy decoder does, you can end up with an output that, instead of "I am hungry," gives you "I am, am, am, am." You can see how this could be a problem, though not in all cases. For shorter sequences it can be fine, but if there are many other words to consider, then knowing what's coming up next might help you better predict the next sequence.

Another option is known as random sampling. What random sampling does is provide probabilities for each word and sample accordingly for the next outputs. One of the problems with this is that it can be a little too random.
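The two decoding strategies so far can be sketched in a few lines. This is a toy single-step example, not the assignment's implementation: the vocabulary and probabilities are made up for illustration.

```python
import random

# Toy next-token distribution at one decoding step
# (invented numbers, just to illustrate the two strategies).
probs = {"I": 0.05, "am": 0.40, "hungry": 0.30, "<eos>": 0.25}

# Greedy decoding: always take the single most probable token.
greedy_token = max(probs, key=probs.get)

# Random sampling: draw a token in proportion to its probability.
tokens, weights = zip(*probs.items())
sampled_token = random.choices(tokens, weights=weights, k=1)[0]

print(greedy_token)   # "am" -- greedy is deterministic
print(sampled_token)  # varies from run to run
```

Greedy decoding always returns "am" here, while random sampling will occasionally pick a lower-probability token, which is exactly the "too random" behavior mentioned above.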
A solution for this is to assign more weight to the words with higher probability and less weight to the others. You will see a method for doing this in just a few moments.

In sampling, temperature is a parameter you can adjust to allow for more or less randomness in your predictions. It's measured on a scale of 0 to 1, indicating low to high randomness. If you need your model to make careful, safe decisions about what to output, set the temperature lower, and you'll get the prediction equivalent of a very confident but rather boring person seated next to you at dinner. If you feel like taking more of a gamble, set your temperature a bit higher. This has the effect of making your network more excited, and you may get some pretty fun predictions. On the other hand, there will probably be a lot more mistakes.

Previously, you saw the greedy decoding algorithm, which just selects the single best candidate as the input sequence for each time step. The model has already encoded the input sequence and used the previous time step's translation to calculate how much attention to give each of the input words. Now it's using the decoder to predict the next translated word. Choosing just the one best candidate might be suitable for the current time step, but when you construct the full sentence, it may be a suboptimal choice.

Beam search decoding is a more exploratory alternative that uses a type of restricted breadth-first search to build a search tree. Instead of offering a single best output as greedy decoding does, beam search selects multiple options based on conditional probability. The search restriction I mentioned a moment ago is the beam width parameter B, which limits the number of branching paths to a number you choose, such as three. At each time step, beam search keeps the B alternatives with the highest probability as the most likely choices for that step.
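One common way to realize temperature-controlled sampling is to divide the model's logits by the temperature before the softmax; lower temperatures sharpen the distribution toward the "boring but safe" argmax, higher ones flatten it. A minimal sketch, with invented logits:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Sample a token index after scaling logits by 1/temperature.

    Low temperature -> sharper distribution, safer picks.
    High temperature -> flatter distribution, more surprising picks.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
# With a very low temperature, the sample is almost always the argmax (index 0).
print(sample_with_temperature(logits, temperature=0.01))
```

Note that frameworks often let temperature exceed 1 for extra randomness; the 0-to-1 scale described in the lecture is just the range you'll typically use here.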
Once you have these B possibilities, you can choose the one with the highest probability. This is beam search decoding: it doesn't look only at the next output, but instead applies a beam width parameter to keep several possible options at each step. Let's take a look at an example sequence where the beam width parameter is set to three.

The beam width parameter is a defining feature of beam search, and it controls the number of beams searching through the sequence of probabilities. Setting this parameter works in an intuitive way: a larger beam width gives you better model performance but slower decoding. Provided with the first token "I" and a beam width of three, beam search assigns conditional probabilities to each of several options for the next word in the sequence. The highest-probability candidates are kept at each time step, and the other options are pruned. Here, "am" is determined to be the most likely next token in the sequence, with a probability of 40 percent. For the third and final time step, beam search identifies "hungry" as the most likely token, with a probability of around 80 percent. Does this sentence construction make more sense than any of the other options? This is a very simple example, but you can see for yourself how beam search makes a more powerful alternative to greedy decoding for machine translation of longer sequences.

However, beam search decoding runs into issues when the model learns a distribution that isn't useful or accurate in reality. It can also use single tokens in a problematic way, especially with unclean corpora. Imagine having training data that is not clean, for example, from a speech corpus. If the filler word "Uhm" appears as a translation in every sentence with one percent probability, that single element can throw off your entire translation. Imagine now that you have 11 good translations of "Vereinigte Staaten," which is German for the United States.
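The beam search walk-through above can be sketched with a toy conditional model. The table of conditional probabilities below is invented to mirror the "I am hungry" example (the 40 percent for "am" and roughly 80 percent for "hungry" come from the lecture; everything else is made up):

```python
import math

# Toy conditional model: next-token probabilities given the prefix so far.
COND = {
    ("<s>",): {"I": 0.7, "You": 0.3},
    ("<s>", "I"): {"am": 0.4, "was": 0.35, "will": 0.25},
    ("<s>", "You"): {"are": 0.7, "were": 0.3},
    ("<s>", "I", "am"): {"hungry": 0.8, "tired": 0.2},
    ("<s>", "I", "was"): {"hungry": 0.5, "late": 0.5},
    ("<s>", "I", "will"): {"eat": 1.0},
    ("<s>", "You", "are"): {"right": 1.0},
    ("<s>", "You", "were"): {"right": 1.0},
}

def beam_search(beam_width=3, steps=3):
    # Each beam is (prefix, log probability); start from the start token.
    beams = [(("<s>",), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, logp in beams:
            for token, p in COND.get(prefix, {}).items():
                candidates.append((prefix + (token,), logp + math.log(p)))
        # Prune: keep only the B most probable partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

best_prefix, best_logp = beam_search()[0]
print(" ".join(best_prefix[1:]))  # "I am hungry"
```

Working in log probabilities (summing logs instead of multiplying probabilities) is the usual trick to avoid numerical underflow on long sequences.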
These could be USA, US, US of A, and so on, compared with your German inputs. In total you have at least 11 squared good translations, each with the same probability, because they're all equally good. The most probable single output then ends up being the filler word "Uhm," because each good translation has probability 1/121, or about 0.83 percent, which is less than the filler's one percent. Not great, is it? Well, thankfully, there are alternatives to consider.

Earlier you encountered random sampling as a way to choose a probable token, and the issues with that very simple implementation. But if you go a little further with it, say, by generating 30 samples and comparing them all against one another to see which one performs best, you'll see quite a bit of improvement in your decoding. This is called Minimum Bayes Risk decoding, or MBR for short. Implementing MBR is pretty straightforward. Begin by generating several random samples, then compare each sample against all the others and assign a similarity score for each comparison. ROUGE is a good one that you may recall from a bit earlier. Finally, choose the sample with the highest average similarity, which is sometimes referred to as the golden one.

Here are the steps for implementing MBR on a small set of four samples. First, calculate the similarity between the first and second samples, then between the first and third samples, then between the first and fourth samples. Then take the average of those three similarity scores. Note that it's more common to use a weighted average when you compute MBR. You'll repeat this process for the other three samples in your set. At last, the top performer out of all of them is chosen, and that's it for MBR. You'll be implementing this one in the assignment, along with a greedy decoder.

Let's recap everything I just showed you. By now you're aware that beam search uses a combination of conditional probabilities and a beam width parameter to keep several candidate outputs at each time step.
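The MBR steps just described can be sketched as follows. For brevity, a simple unigram-overlap F1 stands in for ROUGE, and a plain (unweighted) average is used; the four sample strings are invented:

```python
from collections import Counter

def overlap_f1(cand, ref):
    """Unigram-overlap F1 -- a simple stand-in for a ROUGE score."""
    c, r = Counter(cand.split()), Counter(ref.split())
    overlap = sum((c & r).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_decode(samples):
    """Pick the sample most similar, on average, to all the others."""
    best, best_score = None, -1.0
    for i, cand in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        score = sum(overlap_f1(cand, o) for o in others) / len(others)
        if score > best_score:
            best, best_score = cand, score
    return best

samples = [
    "I am hungry",
    "I am very hungry",
    "I am am am",
    "I am hungry now",
]
print(mbr_decode(samples))  # "I am hungry"
```

The degenerate candidate "I am am am" scores poorly against every other sample, so the consensus translation wins, which is exactly the behavior MBR is after.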
An alternative to beam search, MBR takes several samples and compares them against one another, then chooses the best performer. This can give you a more contextually accurate translation. Excellent work taking in all of this new information about neural machine translation and attention. Now, go forth to your coding assignment, and remember to have fun. Congratulations on finishing this week. You now know how to implement a neural machine translation system, and you also know how to evaluate it. Next week, Lucas will talk about one of the state-of-the-art models, known as the Transformer, which also makes use of an encoder-decoder architecture.