Now that I have introduced the model-based and model-free RL branches, let us turn our attention to some specific model-free methods. Earlier in this module, I introduced the concept of various RL types. I'll first explore the value-based type of RL in more detail. The value-based reinforcement learning approach is often used in real-life applications. Your goal in a value-based method is to maximize the value function V of S. To derive the value function, the agent learns the states through exploration, saving the information as backups. The agent expects a long-term return of the current states under the policy, which is denoted as Pi. To get a good approximation of the value function, the agent either exhaustively samples the whole state space, an activity we refer to as a complete backup, or samples a shallow backup and generalizes to unseen states. I'll come back to discuss value function details later.

Now, there are different ways of deriving a value function. An exhaustive search requires full and deep backups to capture every value for each state. Dynamic programming, temporal difference learning, and Monte Carlo methods require less sampling, but to varying degrees. In comparing the different value-based approaches, the spectrum goes from sample to full backups and from shallow to deep backups. Let me describe them in more detail. In a sample backup, the agent learns from environment sampling, which may provide an incomplete picture of the environment dynamics. With enough samples, the sample backup approaches come closer to the full backup approaches. In a deep backup, the agent learns the whole trajectory of the chosen action up to the termination point. This can be the whole trajectory of the sample and not necessarily the full environment. In a shallow backup, the agent learns one step at a time, in a breadth-first search manner, along the chosen action's trajectory. In a full backup, the agent learns from the ability to access the complete environment, not just samples.

I talked a lot about estimating the long-term value of a state, which is predictive in nature. Now I'll explore how we enforce control, or decide, in value-based methods by using these predicted value estimations. In the diagram, you can see the state, which is fed into the policy; the policy, which the agent uses to map states to actions; and the predicted values of the actions. These are shown in the bar graph and represent the value function approximation. There is a function approximator embedded inside the policy which is used to predict the value of a given action for a given state. For instance, it could be a Multilayer Perceptron. A1, A2, and A3 represent actions the agent could take. Each of the blue bars represents the value function, or sum of rewards possible, if that particular action were taken. The goal is to maximize V of S, which is denoted in the diagram as argmax. Notice that the predicted value of taking action A2 appears to be the highest.

Consider a simple epsilon-greedy type exploration strategy. If the agent chooses to exploit during training, it will choose the action with the maximum value, the argmax, which is A2. During the exploration cycle, the agent would randomly pick among the three actions with uniform probability. During training, both exploitation and exploration take place. However, during inference serving, typically only exploitation occurs because exploring new options is risky.
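To make the exploit-versus-explore decision concrete, here is a minimal sketch of epsilon-greedy action selection in Python. The value_network callable is a hypothetical stand-in for the function approximator described above (such as a Multilayer Perceptron); it is assumed to return one predicted value per action.

```python
import numpy as np

def epsilon_greedy_action(state, value_network, n_actions, epsilon=0.1):
    """Pick an action for `state` using epsilon-greedy over predicted values.

    `value_network` is an assumed placeholder that maps a state to a vector
    of predicted action values, one entry per action (e.g. [V(A1), V(A2), V(A3)]).
    """
    action_values = value_network(state)
    if np.random.rand() < epsilon:
        # Exploration: pick uniformly at random among all actions.
        return np.random.randint(n_actions)
    # Exploitation: pick the argmax action (A2 in the example above).
    return int(np.argmax(action_values))
```

During training the epsilon parameter keeps some exploration in the mix; for inference serving you would effectively set it to zero so that only exploitation occurs.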
Next, I'd like to explore three basic approaches to value-based RL algorithms: Monte Carlo, temporal difference, and dynamic programming. Dynamic programming will not be covered in this module due to time constraints, but you can find lots of information on the Internet. Before getting into the details of the algorithms, let me describe the notation I'll use in the upcoming descriptions. A decision tree is an easy way to visualize what happens in the different approaches. In the decision tree diagram, the yellow squares represent states, and the blue triangles represent possible actions, where each action can lead to one or more states based on some probability. The T boxes are terminal states where the episode ends and the agent starts over at the very beginning. t represents each time step, a point in time. St represents the state at each time step. That is, S1 is the root node, and it is the state at the beginning of the episode. The agent determines which action to take based on the state. At represents the action at each time step, where each action can lead to one or more states based on some probability. That is, A1 is the action to the left of S1, and A2 is the action to the right of S1. Based on which action the agent chooses, the journey will progress down one side of the tree.

The interaction between the agent and the environment involves a sequence of actions and observed rewards in time, where t ranges from 1, 2, 3, all the way to capital T. During the process, the agent accumulates knowledge about the environment, learns the optimal policy, and decides which action to take next so as to efficiently learn the best policy. Let's label the state, action, and reward at time step t as St, At, and Rt respectively. Thus, the interaction sequence S1, A1, R1, S2, A2, R2, all the way to ST, represents one episode, also known as a trial or trajectory, and it ends at the terminal state.

I'll first discuss the Monte Carlo backup method, which is based on a mathematical technique used to estimate outcomes for an uncertain event. Instead of a fixed set of inputs, it predicts outcomes based on a range of values within a defined min and max and a probability distribution of potential results. It recalculates the results after each transition and stores the results. Here's an example in the decision tree. The agent starts at the root node S1 and takes an action A1. The action A1 eventually pushes the agent to the state to its left, S2. The same process repeats, and the agent takes a path to the terminal state T.

There are some things to consider when you use the Monte Carlo algorithm. First, it tends to evaluate the whole trajectory of actions the agent took up to the terminal state of the episode. Because this method of reward attribution is sensitive to the trajectory and actions taken, it tends to overfit and exhibit high variance. Second, if a particular value was achieved, it's assumed that each action was equally responsible for the outcome. Lastly, there is an inherent assumption that an episode has a terminal state or end point, which is reached in a feasible amount of time. The advantage of a Monte Carlo backup is that it is easy to implement: at each time step, with each action that the agent takes, it gets a reward. The rewards accumulate throughout the episode and are backed up throughout it, so that the agent learns that these actions led to a certain cumulative reward. For instance, say 10 was the reward for the trajectory to the right.
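As a rough illustration of the Monte Carlo backup just described, the sketch below plays out complete episodes and credits every state along the trajectory with the cumulative reward observed by the end. The sample_episode helper is a hypothetical stand-in for running the agent in the environment until it reaches a terminal state; this is an every-visit variant under that assumption, not the module's exact algorithm.

```python
from collections import defaultdict

def monte_carlo_value_estimates(sample_episode, num_episodes=1000):
    """Estimate V(s) by averaging the return observed after visiting s.

    `sample_episode()` is assumed to return a list of (state, reward) pairs
    for one complete episode, ending at a terminal state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        episode = sample_episode()           # whole trajectory to termination
        g = 0.0                              # return, accumulated backwards
        for state, reward in reversed(episode):
            g += reward                      # cumulative reward from this state on
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

Note how nothing is learned until an episode finishes, which is why this style of backup assumes a terminal state that is reached in a feasible amount of time.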
Now, if the agent takes a different trajectory, say to the left, that led to a reward of 20, then the agent learns that taking the left action here is better. The agent tunes its knowledge of the state values and of which action to take. Through this tuning, the agent improves its rewards and the value function. When you have V of St, you learn how good the situation is, because how good or bad the situation is depends on the value, and you can compare all the actions you can pick from there.

Another option is temporal difference, or TD, backup. Like Monte Carlo, the TD method can learn directly from raw experience without a model of the environment's dynamics. Unlike the Monte Carlo method, TD estimates are based in part on other learned estimates, without waiting until the end of the episode. This is called bootstrapping. The agent learns from one or more intermediate time steps in a recursive fashion. The recursive learning helps in accelerating overall learning, even in cases where there might not be any well-defined terminal state. In TD backup, you take an action, you get some reward, you get a new state, then you back up, and so forth in a recursive fashion. From each state, you make another choice of action, end up in a new state, and so on. As the rewards come in, you learn a better value function for the policy given the state, which action was taken, and so forth. The agent can recursively learn which action would have been the best at a given state without having to finish the complete episode. This is useful because you might not be able to explore the whole state space. Something to consider is that this method comes with a caveat: because TD backups haven't seen the whole set of trajectories, they have a narrow perspective and tend to underfit, especially in the beginning. Despite higher complexity and higher bias, TD backups are used more often than Monte Carlo backups.

Now, there are some issues that arise when you use value-based approaches. I'll discuss some of them. Firstly, these approaches require much trial and error, meaning they're not good at data efficiency. The trajectory of experience is lost because data points might be used only once. Secondly, events which only happen occasionally are lost. Ideally, you want to prevent the neural network from forgetting about them. Thirdly, the complete environment might change. That is, there could be non-stationarity or concept drift, and the distribution might change. If you wanted to revert to the previous environment, the network would have already forgotten about it unless you stored the information somewhere. Fourthly, the action you take in a particular state is often correlated with the next state you end up in, and it's often dangerous to learn from correlated experience. To break the correlation between these elements, you might want to store the experience somewhere and sample it in a non-sequential format. Lastly, in large state spaces you might want to remember and learn from previous states too. What is the solution? The experience replay buffer could help. I'll explore this next.

An agent can use an experience replay buffer to collect and learn from experience. It's like a least recently used (LRU) cache, where the element unused for the longest period is evicted. You have a buffer where you store experience until it is full. Then, you remove the oldest experience to make space for a new one. It stores past experiences, including rare events.
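As a minimal sketch of the buffer just described, a fixed-size deque behaves exactly this way: once the buffer is full, appending a new experience evicts the oldest one. The class name and fields here are illustrative, not from the module.

```python
import random
from collections import deque

class ExperienceReplayBuffer:
    """Fixed-size buffer; when full, the oldest experience is evicted."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest entries fall off automatically

    def add(self, state, action, reward, next_state, done):
        """Store one transition, including rare events, for later reuse."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random (shuffled) sampling breaks the correlation between
        # consecutive experiences in a trajectory.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```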
Sampling from the replay buffer can be done in a more random fashion to break the correlation. The replay buffer can also keep previous environment information in case you go back, or the environment shifts between two different distributions. With an initial policy, the agent takes a particular action in the environment. The environment returns the reward and the next state, and these outcomes are stored in the experience replay buffer. This reward should be correlated with the desirability of the action towards the goal. From the experience replay buffer, the agent samples experience trajectories in a non-sequential, perhaps shuffled, format. It shuffles random samples of trajectories, or a batch of experiences, to counter the correlation. The agent learns using a backup method such as TD backup or Monte Carlo, then the agent updates the policy. Based on the new policy, it acts again and the cycle repeats until the episode ends.
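To tie the cycle together, here is a rough sketch of how the pieces above might fit, reusing the hypothetical epsilon_greedy_action and ExperienceReplayBuffer sketches from earlier. The env and value_network interfaces (reset, step, n_actions, update) are assumed placeholders rather than anything defined in this module, and the learning step uses a simple TD-style target.

```python
import numpy as np

def training_loop(env, value_network, buffer, n_steps=10_000,
                  batch_size=32, epsilon=0.1, gamma=0.99, lr=0.01):
    """Illustrative act -> store -> sample -> learn cycle with a TD-style backup."""
    state = env.reset()
    for _ in range(n_steps):
        # Act: epsilon-greedy over the predicted action values for this state.
        action = epsilon_greedy_action(state, value_network, env.n_actions, epsilon)
        next_state, reward, done = env.step(action)

        # Store the outcome (reward and next state) in the replay buffer.
        buffer.add(state, action, reward, next_state, done)

        # Learn: sample a shuffled batch to counter correlation, then back up.
        if len(buffer) >= batch_size:
            for s, a, r, s_next, d in buffer.sample(batch_size):
                target = r if d else r + gamma * np.max(value_network(s_next))
                td_error = target - value_network(s)[a]
                value_network.update(s, a, lr * td_error)   # assumed update API

        # Continue the episode, or start a new one at the terminal state.
        state = env.reset() if done else next_state
```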