The BLEU score, which stands for Bilingual Evaluation Understudy, is an algorithm that was developed to evaluate the quality of solutions to some of the most difficult problems in NLP, including machine translation. It evaluates the quality of machine-translated text by comparing a candidate translation to one or more reference translations. Usually, the closer the BLEU score is to one, the better your model is, and the closer it is to zero, the worse it is. With that said, how do you actually compute the BLEU score, and why is it an important metric?

To get a BLEU score, the candidates and the references are usually compared using an average of uni-gram, bi-gram, tri-gram, or even four-gram precision. To demonstrate, I'll use uni-grams as an example. Let's say that you have a candidate sequence composed of I, I, am, I, I. This is what your model outputs at this step. Then you have reference sequence one, which contains the words Younes, I am hungry. The second reference sequence contains the words he said, I am hungry. To get the BLEU score, count how many words in the candidate also appear in the references. In the candidate, you can see that I appears four times and am appears once. Each of the words I and am appears only once in Reference 1 and once in Reference 2. So you write m_w, or max words, equal to one, and clip each count at one. Then you sum over the clipped counts of the unique uni-grams in the candidate and divide by the total number of words in the candidate. In this case, what would that be? Well, the unique uni-grams are I and am, each with a clipped count of one, so the sum is two. Then you divide by the total number of words in the candidate, which is five. Two out of five. This is your BLEU score. Makes sense?

Like anything in life, using the BLEU score as an evaluation metric has some caveats. For one, it doesn't consider the semantic meaning of the words. It also doesn't consider the structure of the sentence. Imagine getting this translation: "Ate I was hungry because." If the reference sentence is "I ate because I was hungry," this would actually get a perfect BLEU score. BLEU is the most widely adopted evaluation metric for machine translation, but you should be aware of these drawbacks before you begin using it.
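To make the uni-gram case concrete, here is a minimal Python sketch of the clipped uni-gram precision just described, using the example above. The function and variable names are only for illustration, and a full BLEU implementation would also average the bi-gram through four-gram precisions rather than stopping at uni-grams.

from collections import Counter

def clipped_unigram_precision(candidate, references):
    # candidate: list of words produced by the model
    # references: list of reference word lists
    cand_counts = Counter(candidate)
    clipped = 0
    for word, count in cand_counts.items():
        # A word only gets credit up to the most times it appears
        # in any single reference (the "max words" clipping step).
        max_ref_count = max(ref.count(word) for ref in references)
        clipped += min(count, max_ref_count)
    return clipped / len(candidate)

candidate = "I I am I I".split()
references = ["Younes I am hungry".split(), "he said I am hungry".split()]
print(clipped_unigram_precision(candidate, references))  # 2 / 5 = 0.4

# The word-order caveat: a scrambled sentence still gets a perfect uni-gram score.
print(clipped_unigram_precision("ate i was hungry because".split(),
                                ["i ate because i was hungry".split()]))  # 1.0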
I'll introduce you now to another family of metrics called ROUGE. It stands for Recall-Oriented Understudy for Gisting Evaluation, which is a mouthful, but it lets you know right off the bat that it's more recall-oriented by default. This means that it places more importance on how much of the human-created reference appears in the machine translation. ROUGE was originally developed to evaluate the quality of machine-summarized texts, but it's useful for evaluating machine translation as well. It works by comparing the machine text, or system text, against the reference text, which is often created by a human. The ROUGE score calculates precision and recall for a machine text by counting the n-gram overlap between the machine text and a reference text. Recall that an n-gram is a list of words that appear next to each other in a sentence, where the order matters. If you have the sentence "I baked a pie," a uni-gram could be the word baked, and a bi-gram could be the two words a pie. Next, I'll show you an example of how this works with uni-grams. The ROUGE family of metrics focuses on the n-gram overlap between the system-translated text and the reference. By system-translated text, I'm referring to the output of the model that's being trained to make the prediction. By reference, I'm referring to the ideal, correct sentence that I want the model to predict.

I mentioned earlier that ROUGE is primarily recall-oriented by default. What I mean by recall, on a high level, is this: if you look at all of the words in the reference, which is "the cats had orange fur," how many of the reference words get predicted by the model? The second part of the equation is precision, which you can think of as answering this question: of all the words that the model predicted, how many of them are words that we want the model to predict?

To calculate the recall for your model-translated text, take each word in the true reference sentence, "The cats had orange fur," and count how many of them are also predicted by the model. The appears in the model prediction, cats appears in the prediction as well, had appears, orange also appears, and fur also appears. In this case, all five of the reference words are also predicted by the model. For the example system text, "the cats had many orange fur," and the reference text, "the cats had orange fur," you can see that there are a total of five overlapping uni-grams and five total words in the reference. This would give you a recall of one, a high score. But if your model wanted to have a high recall score, it could just guess hundreds of thousands of words, and it would have a good chance of guessing all the words in the true reference sentence. What does that actually tell you? This is where precision comes in.

To calculate precision, look at all the words that are predicted by your model, in this example, "the cats had striped orange fur." How many of these predicted words actually show up in the correct sentence, which is represented by the reference sentence? Over here, the appears in the reference, cats appears in the reference, had appears in the reference. Striped was predicted by the model, but does not appear in the reference sentence. Orange appears in the reference, and so does fur. Out of the six words predicted by the model, five of them appear in the reference sentence. This means that your model has a precision of five divided by six, or roughly 83 percent of the predicted words were relevant.

There are a few considerations to be aware of when using a ROUGE score. For one, it focuses on comparing n-gram counts to yield a score, which doesn't allow for meaningful evaluation of topics. What this means is that it can only count word overlap as a measure of similarity and misses any broader context that the words are describing. For example, if the two sentences being compared were "I am a fruit-filled pastry" and "I'm a jelly donut," ROUGE would have no way of understanding that the two sentences actually mean the same thing. Because of this limitation, it can't take similar or synonymous concepts into consideration when the score is computed. A low ROUGE score may not reflect that a model-translated text actually captured all the same relevant content as the reference text, just because it had a large difference in n-gram overlap. But ROUGE scores are still very useful for evaluating machine translations and summaries. These are just a couple of caveats to keep in mind as you start taking your own ROUGE scores.
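Here is a small Python sketch of the recall and precision calculations from the cat example above. The name rouge_1 is just for illustration, not an official API, and a fuller ROUGE implementation would also handle bi-grams and longer n-grams.

from collections import Counter

def rouge_1(system, reference):
    # system: list of words predicted by the model
    # reference: list of words in the human-created reference
    sys_counts = Counter(system)
    ref_counts = Counter(reference)
    # Count each overlapping uni-gram only as many times as it appears in both texts.
    overlap = sum(min(sys_counts[word], count) for word, count in ref_counts.items())
    recall = overlap / len(reference)   # how much of the reference was predicted
    precision = overlap / len(system)   # how much of the prediction was relevant
    return recall, precision

reference = "the cats had orange fur".split()
print(rouge_1("the cats had many orange fur".split(), reference))
# recall = 5/5 = 1.0, precision = 5/6

print(rouge_1("the cats had striped orange fur".split(), reference))
# recall = 5/5 = 1.0, precision = 5/6, roughly 83 percent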
Here's a quick recap. You are now familiar with the BLEU score algorithm, which was created to evaluate machine translations. It operates by comparing the n-grams in a candidate text against the n-grams in one or more reference translations, clipping the counts, and averaging the resulting precisions into a score on a scale of zero to one. You are now aware of a couple of drawbacks to using the BLEU score as well. ROUGE is also a valuable metric for measuring the relevance of machine translations. It works by counting overlapping uni-grams or bi-grams in a machine-translated text and comparing them to a reference text, which is usually created by a human.