Before we can get into the details, I'll start by talking about reinforcement learning in general. You may recall from an earlier module that there are three common types of machine learning approaches. The first is unsupervised learning, where a learning system builds a model of the data distribution from unlabeled examples; this leads to approaches such as clustering. Next comes the form of machine learning we use most often, supervised learning. Here a learning system learns a latent map from labeled examples; it is essentially a function approximator between input data and labels. And finally there is reinforcement learning, where a decision-making system is trained to make optimal decisions. This last type is the focus of this module.

I'd like to start by defining the term reinforcement learning. It is borrowed from an area of behavioral psychology known as operant conditioning and deals with learning the relationship between stimuli, actions, and consequences, that is, the occurrence of rewards or punishments. These rewards and punishments guide the learner toward the desired behavior, or policy. Reinforcement learning in software is an area of machine learning where an agent, or system of agents, learns to achieve a goal by interacting with its environment. By goal we mean that we want the agent to learn the optimal path or behavior that collects the maximum reward. The agent might start with trial and error, and after many interactions with the environment it eventually has enough information to make smarter decisions. For simplicity, we often refer to these as positive rewards, or pleasant events, and negative rewards, or unpleasant events, given to the agent.

The optimal behavior is learned through interactions with the environment and observations of how it responds. The behavior is learned by exploring new states and situations that haven't been visited before, as well as by exploiting existing knowledge of what has worked well through repeated trial and error. With the information the agent obtains after each action, it adjusts future actions.

Furthermore, supervised and unsupervised learning work with static data sets; reinforcement learning, however, works with a dynamic environment. The goal here is not to cluster or label data but to find the best sequence of actions that will generate the optimal outcome. Unlike more traditional supervised learning techniques, not every data point has a reward associated with it in reinforcement learning. The agent in question might only have access to sparse rewards. That is, the agent may only get feedback when it is close to or at the target, and not with every action it performs. This makes RL easier to map to real-world problems, but it also opens the door to a lot of exciting challenges.

Now, I'd like to analyze the general process agents use. Agents start off by collecting knowledge through exploration and exploitation during repeated trials. They do this to learn which actions they should perform to obtain the maximum reward. I'll outline the most basic steps in the process for you; let's start with the first step. First comes the observation step: the agent identifies something in the environment that it should act on. Second comes the action step: the agent does something in reaction to the observation it received. And third is the reward step: the environment issues feedback for the behavior. Though we refer to it as a reward, you can think of it as a sort of feedback, because the reward can assume neutral, positive, or negative values after each action. Now, what do you suppose happens in scenarios where the agent takes actions that result in zero or negative rewards? You're right, the agent learns from these situations as well. It's all about the relative nature of these rewards.
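To make those three steps concrete, here is a minimal sketch of the observe-act-reward loop in Python. The GridEnvironment class and its observe and step methods are hypothetical stand-ins invented for this illustration, not a real library API, and the agent simply acts at random, the way an untrained agent would.

```python
import random

class GridEnvironment:
    """Toy environment: the agent moves along a line of cells toward a goal."""
    def __init__(self, size=5):
        self.size = size
        self.position = 0

    def observe(self):
        # Step 1: observation -- the agent reads the current state.
        return self.position

    def step(self, action):
        # Step 2: action -- the agent moves left (-1) or right (+1).
        self.position = max(0, min(self.size - 1, self.position + action))
        # Step 3: reward -- the environment issues feedback: positive at
        # the goal, neutral (zero) everywhere else.
        reward = 1 if self.position == self.size - 1 else 0
        done = self.position == self.size - 1
        return self.position, reward, done

env = GridEnvironment()
done = False
while not done:
    state = env.observe()                     # observation
    action = random.choice([-1, 1])           # trial-and-error action
    state, reward, done = env.step(action)    # reward as feedback
```

Notice that most steps return a reward of zero here; only the final step pays off. That is exactly the sparse-feedback situation described above.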
I'd like to discuss a common example from operant conditioning that highlights reinforcement learning. Let's say we have an owner who wants to train a dog. The goal of reinforcement learning in this case is to train the dog, the agent, to complete a specific task within an environment, the trainer. The environment could include all of the dog's surroundings, but for simplicity let's focus only on the trainer and the actions they perform. First, the trainer issues a command, which the dog observes; the dog then responds with an action. If the action is close to the desired behavior, the trainer will likely provide some sort of reward, such as a toy. Otherwise, no reward or even a negative reward will be provided. At the beginning of training the dog would likely just take random actions; when the trainer says sit, it might just roll over in response, because it doesn't know any better. The dog displays this behavior because it is trying to associate specific observations with actions and rewards. The association, or mapping, between observations and actions is called the policy. I'll elaborate on that shortly. To summarize, the basic steps of RL in this example include the observation, where the trainer says sit; the action, where the dog sits down; and finally the reward, where the trainer gives the dog a toy.

Now that you can see how it works with the dog and trainer example, let's generalize the algorithm a bit more. In the absence of a supervisor, the agent must independently discover the best sequence of actions that maximizes the reward. The discovery process is initially a trial-and-error search. We measure the quality of actions both by the immediate reward they return and by the delayed rewards they might fetch later on, and by the cumulative sum of all of these. Because it enables the agent to learn the actions that result in eventual success in an unseen environment without the help of a supervisor, reinforcement learning is such a powerful class of algorithms. The agent then refines the policy over time based on its experience of which actions resulted in the optimal rewards.
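Before moving on, here is a small sketch of the cumulative sum I just mentioned. A standard way to combine immediate and delayed rewards is a discounted sum, where rewards that arrive later are weighted down by a discount factor. The discount factor gamma = 0.9 and the reward sequences below are illustrative values of my choosing, not from this course.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative sum of rewards, with delayed rewards weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Sparse feedback: neutral rewards until the goal is finally reached.
print(discounted_return([0, 0, 0, 1]))    # 0.9**3 * 1 = 0.729

# A sequence that reaches the same reward sooner scores higher,
# which is why the agent learns to prefer shorter paths to success.
print(discounted_return([0, 1]))          # 0.9**1 * 1 = 0.9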
RL can be divided into two broad branches: model-based and model-free. Each takes a slightly different approach to developing algorithmic possibilities. A model-based system uses a predictive model of the environment to determine what happens when certain actions are taken. A model-free system, however, doesn't need a modeling step, because the control policy can be learned directly from the environment.

Some important characteristics of reinforcement learning include the following. First, there is no supervisor, only a real number, or reward signal. For example, the reward could be defined with the values negative one, zero, and positive one, in other words negative, neutral, and positive values. Another characteristic is that decision making is sequential. That is, the decision on what action to take in state S2 comes after the decision made in state S1, after which the environment changed to state S2. Next, time plays a crucial role in RL problems: depending on when an action occurs in time, there are different possibilities for what reward can be given and what state the environment can transition to. Another characteristic is that feedback is always delayed, not instantaneous; the agent can't know at the beginning whether it will receive a positive reward. Lastly, the agent's actions determine the subsequent data it receives; depending on what action the agent takes, there are different possibilities for how the policy will be shaped.

Let's take a moment to look at a reference use case where reinforcement learning is a great choice. I'll refer to this use case later in this module. A perhaps familiar use case is a large retail warehouse, or fulfillment center, used by e-commerce merchants to store and manage inventory; as clients make purchases, their packages must be shipped. By using fulfillment centers, online businesses are relieved of the need for physical space to store all of their products and the need to directly manage inventory. So, why did we choose a fulfillment center as a reference use case? Because it's a good representation of how reinforcement learning can optimize the work a robot agent needs to do when it moves products around a warehouse. Imagine a robot moving around the warehouse through aisles of racks that store products, often averaging over 28 football fields of space. The robot might go the long or the short way around to find something. How does it know which path is optimal? Certainly, if you have a small warehouse with only one robot, you could give it some heuristics to follow. However, this method doesn't work when you have an enormous warehouse with many robots. With RL, a robot can learn for itself to choose optimal paths and update its policy when something changes.
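As a closing illustration, the sketch below applies tabular Q-learning, one common model-free algorithm that I am naming here as an example rather than quoting from this course, to a tiny 4-by-4 grid standing in for warehouse aisles. The grid size, reward values, and hyperparameters are all assumptions made for the example, not a real warehouse system.

```python
import random

SIZE = 4                       # 4x4 grid of "aisles" (an illustrative stand-in)
GOAL = (3, 3)                  # the shelf the robot must reach
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

# One Q-value per (state, action) pair; the robot starts knowing nothing.
Q = {((r, c), a): 0.0 for r in range(SIZE) for c in range(SIZE) for a in range(4)}
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # illustrative hyperparameters

def step(state, a):
    dr, dc = ACTIONS[a]
    nxt = (max(0, min(SIZE - 1, state[0] + dr)),
           max(0, min(SIZE - 1, state[1] + dc)))
    # Sparse reward: positive only at the goal, slightly negative otherwise,
    # which nudges the robot toward shorter paths.
    return nxt, (1.0 if nxt == GOAL else -0.05)

for episode in range(500):
    state = (0, 0)
    while state != GOAL:
        # Explore occasionally; otherwise exploit current knowledge.
        if random.random() < epsilon:
            a = random.randrange(4)
        else:
            a = max(range(4), key=lambda i: Q[(state, i)])
        nxt, reward = step(state, a)
        best_next = max(Q[(nxt, i)] for i in range(4))
        # Model-free update: learn directly from the observed transition,
        # with no predictive model of the warehouse.
        Q[(state, a)] += alpha * (reward + gamma * best_next - Q[(state, a)])
        state = nxt

# After training, the greedy policy at the start should head toward the goal.
best = max(range(4), key=lambda i: Q[((0, 0), i)])
print(ACTIONS[best])   # expected: (1, 0) or (0, 1), i.e. toward (3, 3)
```

If a rack moves or an aisle is blocked, the same training loop can simply keep running against the changed environment, which is the sense in which the robot updates its policy when something changes.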