In the previous video, we introduced temporal difference learning. TD elegantly combines key ideas from dynamic programming and Monte Carlo methods. Like dynamic programming, TD methods bootstrap. Like Monte Carlo methods, TD can learn directly from experience. The combination is more powerful than the sum of its parts. By the end of this video, you will be able to understand the benefits of learning online with TD and identify key advantages of TD methods over dynamic programming and Monte Carlo.

Let's consider the example of driving home from work. Each day, you predict how long it will take to get home. You base your predictions on the time, the day of the week, the weather, and other factors. Imagine that you have driven home many times before, so you have a good estimate of how long it takes to get home from the various situations you might encounter along the way. Let's look at these estimates for one trip home.

One evening, as you leave the office, you predict it will take 30 minutes to get home. About five minutes in, you exit the parking lot and notice that it is raining. Traffic is slower in the rain, so you estimate that it will take 35 minutes from then on to get home. Fifteen minutes later, you exit the highway earlier than you anticipated and estimate that it will now take you only 15 minutes to get home. Thirty minutes into your journey, you get stuck behind a slow truck and predict that it will take you an additional 10 minutes to get home. Ten minutes later, you turn onto your home street, from where it usually takes three minutes to arrive home. And indeed, three minutes later, you are home.

The numbers in the circles represent our predictions of the remaining driving time. How can we improve these predictions? To understand how to update our predictions, we need to specify the reward. Let's use the amount of time taken. It took you five minutes to exit the parking lot, so we can label this transition with a reward of five.
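Putting the trip together, here is a minimal sketch of the data in this example. The state names are made-up labels for the circles; the numbers are the predictions and elapsed times described above, with each transition's reward taken to be the minutes that actually elapsed.

```python
# Predicted minutes remaining at each point of the trip (the circles),
# and the reward for each transition: the minutes that actually elapsed.
# State names are invented labels for illustration.
predictions = {"office": 30, "exit lot": 35, "exit highway": 15,
               "behind truck": 10, "home street": 3}
rewards = [5, 15, 10, 10, 3]  # office->lot, lot->highway, and so on

print(sum(rewards))  # 43 minutes door to door
```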
It took you another 15 minutes to exit the highway, and so on for the remaining transitions. You predicted that it would take you 30 minutes to get home, but it actually took you 43 minutes. Perhaps you can learn from this experience.

Let's first look at a Monte Carlo method. In particular, we will use the constant-alpha Monte Carlo method with alpha equal to one. In Monte Carlo methods, you update your estimates towards the return, which is only available at the end of the episode. So we can only update the estimates for each of the states once we arrive home. We can start from when we leave the office. From the office, it took you a total of 43 minutes to get home, so the return at time zero is 43. Since alpha is one, we move our estimate completely to the actual return. From when you exited the parking lot, it took you 38 minutes to get home, and we update that estimate accordingly. We update the estimates for the remaining states in the same way. But is it really necessary to wait until the final outcome is known before learning can begin?

Let's rewind the clock and look at how we would make updates with TD. From the office, it took you five minutes to exit the parking lot, at which point you predicted it would take 35 more minutes. From the exit state, you can update the estimate for the office. Your current estimate from the office is 30 minutes. You received a reward of five, and the value of the next state is 35, so the TD target is 5 plus 35, or 40. With alpha equal to one, we move our estimate all the way to 40 minutes. Let's consider the next state. Your current estimate is 35 minutes. It took you 15 minutes from the parking lot to exit the highway, and the estimated time to get home from the highway exit is also 15 minutes. So we reduce the estimated time to go from when we exit the parking lot to 30 minutes. This makes sense because we exited the highway earlier than expected. We similarly update the estimates for the remaining states. As we make our way home, we can learn online, without waiting to know the final outcome as in Monte Carlo.
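Both updates above can be sketched in a few lines of code. This is a minimal illustration, assuming alpha equal to one and no discounting, with invented state names for the circles in the example; it is not a general-purpose implementation.

```python
# The trip data: predicted minutes-to-go at each state, in the order
# visited, and the reward (minutes elapsed) on each transition.
states = ["office", "exit lot", "exit highway", "behind truck", "home street"]
V = {"office": 30, "exit lot": 35, "exit highway": 15,
     "behind truck": 10, "home street": 3, "home": 0}
rewards = [5, 15, 10, 10, 3]
alpha = 1

# Constant-alpha Monte Carlo: wait until the episode ends, then move each
# estimate towards the observed return (all the way, since alpha is one).
V_mc = dict(V)
G = 0
for s, r in zip(reversed(states), reversed(rewards)):
    G += r                            # return = total remaining minutes from s
    V_mc[s] += alpha * (G - V_mc[s])

# TD(0): update after every single transition, using the reward plus the
# next state's current estimate as the target -- no need to wait until home.
V_td = dict(V)
for s, r, s_next in zip(states, rewards, states[1:] + ["home"]):
    V_td[s] += alpha * (r + V_td[s_next] - V_td[s])

print(V_mc["office"], V_td["office"])  # 43 40, matching the video
```

Note how the Monte Carlo loop runs backwards from the end of the episode to accumulate returns, while the TD loop runs forwards in time, which is exactly what lets it learn online during the drive.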
Let's summarize the advantages of using TD. Unlike dynamic programming, TD methods do not require a model of the environment; they can learn directly from experience. Unlike Monte Carlo, TD can update the values on every step. Bootstrapping allows us to update each estimate based on other estimates. With an appropriately decreasing step size, TD asymptotically converges to the correct predictions. In addition, TD methods usually converge faster than Monte Carlo methods in practice. We will see an example of this in the next video.