Now that you have heard about reinforcement learning in general, I will discuss the RL framework and the workflow you will use to apply it to your own use cases. The reinforcement learning framework encompasses all the tools, notations, and algorithms you will use to design and implement your RL solution. Its purpose is to help you create higher-level abstractions of the core components of an RL algorithm. Using an RL framework makes code easier to develop and read, and it can also help improve your overall efficiency. Before I discuss the framework in more detail, let's take a moment to analyze the terminology used to describe the RL framework. The reinforcement learning framework consists of a number of terms and notations which we will use during this module. Let me take a minute to define some of them before they are used.

The state represents the current situation returned by the environment. In other words, the state is a snapshot of the situation that summarizes the history. The state is sufficient for the agent to determine the next step it should take. The state might not include all the information; the agent might not see the whole grid board here, for example. That is, the environment might be only partially observable. The more informative the state, the better the agent's performance. An action represents one or more events that alter the state of the environment. The action space is the set of actions the agent can take to interact with the environment; after taking an action, the agent moves to a new state, unless the state is a terminal one. The environment is the world in which the agent interacts; it's the scenario the agent has to face. The agent is the learner entity, or the brain, which takes an action at each time step toward a goal. You can think of it as an assumed entity which performs actions in an environment to gain some reward. The agent generates the actions and updates a policy through learning. In fact, the agent doesn't need to know anything about the environment.

The reward, or reward signal, represents the immediate feedback from the environment. It's what the agent observes after taking an action. Each time the agent acts, the environment gives it an instantaneous reward, which may be positive, negative, or neutral. Each step is independent from the previous one and requires the agent to decide on an action. Rewards can be given in multiple ways: at every time step, sparsely (that is, after long sequences of actions), or at the end of an episode. The policy is a method to map the agent's states to actions. The policy dictates what the agent will do when it takes an action; it is a strategy applied by the agent to decide the next action based on the current state. An episode is a trajectory that runs until a termination point is reached.

A value represents the total expected reward that an agent expects to receive in the future by taking an action in a particular state. It is an expected long-term return, as compared to the short-term reward. The difference between reward and value is as follows. The reward is instantly given to the agent after it takes a certain action in a state. In contrast, the value is the cumulative sum of all rewards the agent obtains through its actions at all steps from the beginning to the end of an episode.
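To make these terms concrete, here is a minimal sketch in Python of the interaction loop they describe. The env and policy objects are hypothetical stand-ins, not anything from this course; I'm assuming env.reset returns a state and env.step returns a new state, a reward, and a done flag.

```python
# A minimal sketch of one episode, assuming hypothetical `env` and `policy` objects.
def run_episode(env, policy):
    """Run one episode and return the value (the accumulated sum of rewards)."""
    state = env.reset()          # initial state returned by the environment
    value = 0.0                  # value = cumulative reward over the episode
    done = False
    while not done:              # loop until a terminal state ends the episode
        action = policy(state)   # the policy maps the current state to an action
        state, reward, done = env.step(action)  # environment returns new state and immediate reward
        value += reward          # reward is immediate; value is the long-term sum
    return value
```

Notice the distinction in the loop: the reward is what comes back from a single step, while the value is only known once the whole episode has been accumulated.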
A value function specifies the long-term desirability, or goodness, of any particular state. Where the reward signal captures the immediate reward, the value function gives a measure of the potential future rewards from being in a state. V(s) is shorthand for the value function, that is, the value V of the state s. Q(s, a) is the notation for the value Q of taking an action a in a particular situation or state s. By value, we mean the long-term accumulated sum of rewards over the future time steps taken until the end of an episode. In other words, how good is this one action compared to all the possible actions? SARSA means state, action, reward, state, action. It is a shorthand way to reference the interaction between the agent and environment.

Let me walk you through an example that uses an RL framework. Think of this diagram as a representation of our fulfillment center on a small scale. In a fulfillment center, you want the agent to find the optimal path to the warehouse item you need to ship. A partial decision tree in the diagram represents the choices the agent has at each cell as it moves through the warehouse. The agent starts at the bottom left of the grid, knowing nothing about the environment, so its first action could be chosen randomly. After each action, the agent receives a reward and moves into a new situation. That is, if it takes a step up, it's no longer in the same cell or state. After the agent takes the next action, it is again in a new state, and so the cycle continues. The agent is repeatedly faced with a series of actions it can take in each new state and cell. Each time, it has to decide which direction to go, and sometimes not all directions are possible because an adjacent cell is blocked. The agent repeats this cyclical behavior until it reaches a termination point, where the whole simulation resets and it ends up at the beginning again.

In this example, the reward is given at the end of the episode. Your goal is that the agent reaches the green box while avoiding the red box. If the agent gets to the green box, it gets a reward of plus one; if it gets to the red box, it gets minus one. Both are terminal states, which lead to a restart of the episode. Another way to look at it is that the agent receives a neutral reward, possibly zero, in the intermediate steps before getting to the terminal state. Think of a scenario where you want the agent to learn the optimal path to the goal, not just any path. If the agent received a negative reward, minus one, for each step it took, then it would avoid too many penalties by taking a path with the least number of steps. Two of the shortest paths it can take are the L-shaped ones, which require a minimum of eight steps. It needs to learn and derive a policy that will optimize positive rewards and avoid negative ones. The agent sees each immediate reward, not the long-term value. If the agent reaches the goal, it's a termination point and it's done. Think of it like a game scenario. The first time around, the agent might not take the optimal path. It might just move randomly and accidentally reach the goal. After some episodes, the fact that it reached the goal reinforces the behavior that led it there. In our example, the episode terminates when the agent reaches one of the two end points highlighted in red and green. The episode then resets, and the value is the accumulated sum of rewards the agent collected while moving between all the cells. The agent aims to reach the green block.
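Here is a small sketch of such a grid environment in Python. The grid size and the positions of the start, goal, and pit cells are assumptions chosen for illustration; they loosely echo the diagram but are not taken from it.

```python
# A toy grid loosely matching the warehouse example: the agent starts at the
# bottom left, the "green" goal cell gives +1, the "red" pit cell gives -1,
# and both are terminal. Grid size and cell positions are assumed.
GRID_W, GRID_H = 5, 5
START = (0, 0)            # bottom-left starting cell
GOAL  = (4, 4)            # green cell, reward +1
PIT   = (4, 0)            # red cell, reward -1
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    dx, dy = MOVES[action]
    x = min(max(state[0] + dx, 0), GRID_W - 1)   # stay inside the grid
    y = min(max(state[1] + dy, 0), GRID_H - 1)
    next_state = (x, y)
    if next_state == GOAL:
        return next_state, +1.0, True            # terminal: reached the goal
    if next_state == PIT:
        return next_state, -1.0, True            # terminal: fell into the pit
    return next_state, 0.0, False                # neutral reward on intermediate steps
```

If you wanted the agent to prefer the shortest path, you would return a small negative reward such as -1.0 instead of 0.0 on the intermediate steps, as described above.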
However, the agent can end up in any one of these cells, and each cell is a separate state representation because of the different situation the agent is in. You could ask: what is the better state to be in? Since the red cell is a pit where the agent will lose, the agent is better off in any other cell besides that one; no matter what action it takes there, at least it will not lose. Let us look at the value function to see how it can help determine how good a state is.

Now, how does the value function relate to the example I just described? Let me start by saying the objective of an RL algorithm is to discover the action policy that maximizes the average value it can extract from every state of the system. So how can an agent determine which box it should move to next in our fulfillment center example? The value function gives us a way to capture the measure of potential future rewards from being in a particular state. In other words, how good is this situation to be in? While the reward signal represents the immediate benefit of being in a certain state, the value function estimates the cumulative reward to be collected from that state onward. In shorthand, we refer to the value function as V(s), or the value of being in a state. In the fulfillment center example, what is the value of moving to the right from the agent's starting point? You might say it's lower than the value of going up. If the agent moves up from its current position, it might end that trajectory in the green cell, which is the goal. If the agent instead moves right, the value of this action is most likely lower than the value of the action up. That is, the chance is higher for the agent to land in the red cell, which should be avoided. In the diagram, the box directly below the agent's current position is where it was at the start of the episode. You might say the agent's current position is better than the one it started in, because it's now closer to the green box.

Now, how would you generalize the value function for your RL problem? You use a value function to determine the best sequence of actions that will generate the optimal outcome. Let me explain. Within the agent, a brain maps state observations, the inputs, to actions, the outputs. In RL nomenclature, this mapping is called the policy. Given a set of observations, the policy decides which actions to take. Just like with supervised learning, we can represent the policy as a deep neural network. Representing the policy as a deep neural network allows our agent to take in thousands of states at the same time and still produce a meaningful action. In the fulfillment center example, your agent ran its first iteration until the end of the episode. The agent's behavior was reinforced, and it learned that in the first state, moving up gives it a reward of X. We can write X as Q(s, a). Because this sample warehouse is small, we can easily see the endpoint. However, let's say it's a much bigger warehouse, and moving to the right from the starting point could deliver 5x or 10x the value over moving up. The agent has never tried it, so you still don't know. This is where a technique called exploration is useful, to ensure that you generate trajectories of experience which might be more optimal than the knowledge you already have. Now, we can't unleash RL on a problem if we don't know what setting up the problem correctly means. The next thing I will discuss are some of the basic questions we need to answer for ourselves.
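One common way to balance exploiting the Q values you already know with exploring untried actions is epsilon-greedy action selection. The course doesn't prescribe this particular method, so treat the following as a sketch; the Q-table layout and the epsilon value are assumptions.

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def choose_action(q_table, state, epsilon=0.1):
    """Epsilon-greedy selection: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)   # explore: try a possibly untested action
    # exploit: pick the action with the highest known Q(s, a), defaulting to 0.0
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))
```

With a small epsilon, the agent mostly follows its current best estimates but still occasionally generates the kind of new trajectories that might reveal a better path, such as the hypothetical 5x or 10x move to the right.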
You will need to ask yourself: What exactly does the environment consist of? How will you define its properties? Is it a real or simulated environment? How do we give the agent an incentive, that is, reward the agent to do what we want? How should the reward be structured, in logic and parameters, in order to derive a policy? Which training algorithm should we choose to train the agent? Finally, we put the agent to work so it can determine the optimal solution.

Now that I have given you an overview of the workflow for training an agent, let me go through it in more detail. First, you need to define the environment where the agent operates, including the interface between agent and environment. The environment can be either a simulation model or a real physical system. Simulated environments are usually a good first step because they are safer (real hardware can be expensive) and they allow experimentation. Next, specify the reward, also known as the reward signal, that the agent uses to measure its performance against the task goals, and also how the signal is calculated from the environment. Reward shaping can be tricky and might require a few iterations to get right. In the next step, you create the agent, which consists of the policy and training algorithm. Choose a way to represent the policy, for instance, using neural networks or lookup tables, then select the appropriate training algorithm. Different representations are often tied to specific categories of training algorithms. In general, most modern algorithms rely on neural networks because they are good candidates for large state-action spaces and complex problems. Next, set up training options, for instance, stopping criteria, and train the agent to tune the policy. Remember to validate the trained policy after training ends. Keep in mind, training can take minutes or days depending on the application. For complex applications, parallelizing training on multiple CPUs, GPUs, and computer clusters will accelerate the process. The last step is to deploy the trained policy representation by using, for example, generated C, C++, or CUDA code. There's no need to worry about agents and training algorithms at this point; the policy is a standalone decision-making system.

Training an agent with reinforcement learning is an iterative process. Decisions and results in later stages can require you to return to an earlier stage in the learning workflow. For example, if the training process does not converge on an optimal policy within a reasonable amount of time, you might need to update some of the following items before retraining the agent: training settings, learning algorithm configuration, policy representation, reward signal definition, action and observation signals, or environment dynamics.

Now how is the workflow applied? After we have the basics of the model defined, we can execute the steps of the workflow. First, get the observation generated by the environment. Second, choose an action based on the defined policy and apply it to the environment. Third, get the reward produced by the environment. Fourth, use the agent to train the target policy on trajectory data, including the observations, actions, and rewards from the first three parts of the workflow. As you begin to explore the functions used by an agent to determine actions, looking at the SARSA algorithm can be useful. SARSA means state, action, reward, state, action.
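Those four steps can be written as a simple training loop. The sketch below assumes hypothetical env and agent interfaces (reset, step, act, update); the point is only to show where each of the four steps fits.

```python
# A sketch of the four workflow steps: observe, act, collect the reward,
# then train the policy on the resulting trajectory. Interfaces are assumed.
def train(env, agent, num_episodes=1000):
    for _ in range(num_episodes):
        trajectory = []                                   # (observation, action, reward) tuples
        obs = env.reset()                                 # 1. get the observation from the environment
        done = False
        while not done:
            action = agent.act(obs)                       # 2. choose an action based on the policy
            next_obs, reward, done = env.step(action)     # 3. get the reward from the environment
            trajectory.append((obs, action, reward))
            obs = next_obs
        agent.update(trajectory)                          # 4. train the policy on the trajectory data
```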
In simple terms, a SARSA agent interacts with its environment and then updates its policy based on the feedback it gets from those actions. SARSA can be represented as a quintuple: (St, At, Rt, St+1, At+1). Let me define what these terms mean. St is the state at the beginning of the episode, which is at time t. At is the action taken based on St, and Rt is the reward given based on At. When the environment moves to a new state, t is incremented: St+1 is the state at time t+1, and At+1 is the action taken based on St+1. The cycle repeats until the end of the episode. We can use a Q value, or Q(s, a), to represent the value Q of taking action a in a particular situation or state s, so Q(St, a) would have a possible reward R. If S7 and S17 represent two possible states for the fulfillment center example in the diagram, we could say that Q(S7, right) would have a possible reward of R = +1, and Q(S17, up) would have a possible reward of R = -1.
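For reference, the standard SARSA update built from that quintuple looks like the following sketch. The lesson only introduces the quintuple, so the learning rate alpha and discount factor gamma here are assumed hyperparameters, not values from the course.

```python
# Standard SARSA update: move Q(St, At) toward Rt + gamma * Q(St+1, At+1).
# alpha (learning rate) and gamma (discount factor) are assumed hyperparameters.
def sarsa_update(q_table, s_t, a_t, r_t, s_t1, a_t1, alpha=0.1, gamma=0.99):
    q_current = q_table.get((s_t, a_t), 0.0)     # current estimate of Q(St, At)
    q_next = q_table.get((s_t1, a_t1), 0.0)      # estimate of Q(St+1, At+1)
    q_table[(s_t, a_t)] = q_current + alpha * (r_t + gamma * q_next - q_current)
    return q_table
```

Applied to the example above, observing the quintuple (S7, right, +1, terminal, none) would nudge Q(S7, right) upward, while a -1 outcome from S17 would nudge Q(S17, up) downward, which is exactly how the agent's experience gradually reinforces the better paths.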