Earlier in this module, you heard about various RL types. Next, I will explore the contextual bandits type of RL. You've already learned about value-based approaches and policy-based approaches in general. Now we will look at contextual bandits. The contextual bandits approach is classified as an extension of multi-armed bandits. A contextual multi-armed bandit problem is a simplified reinforcement learning problem in which the agent takes an action from a set of possible actions. The agent tries to maximize the total reward of the chosen actions, and the problem involves a sequence of independent trials. The agent bases its decision for the next action on a given context, that is, some side information. First, I will discuss multi-armed bandits as a foundation, and then I will move to contextual bandits.

The bandit problem is a simplified reinforcement learning problem: there is only one time step and no state transition dynamics. In the diagram, we have four casino machines, and the reward distribution for pulling each arm is shown. Keep in mind that the reward distribution itself is unknown to the user; they can only infer it from repeated pulls of the arms. So the user tries to estimate, based on the past values they got, which machine arm they should pull to get the highest points. Over time, they learn the distributions and know which arm they have to pull to maximize the reward. Let me give an example. The user pulls the arms of the different machines a number of times. At some point, based on what they learned from past pulls, they realize that D4 has a higher probability of paying out higher points, 12 in this case, and hence a better distribution in terms of maximizing the long-term value. In the multi-armed bandit approach, an agent simultaneously attempts to acquire new knowledge, which is exploration, and to optimize its decisions based on existing knowledge, which we call exploitation. A metric often used to measure progress in bandit problems is called regret. Simply put, it is the difference between the reward the agent got and the maximum reward it could have gotten. Every iteration is independent of the other iterations.

Now that you have an idea of multi-armed bandits in general, I will discuss an extension of them called contextual bandits. In this approach, the agent still has to choose between different arms that have unknown reward distributions. However, before making the choice, this time the agent also sees a context, which is an n-dimensional feature vector that the agent needs to consider. For example, in an ad placement or recommender system, the context could be information about the user and their relevant past preferences. The agent tries to estimate the reward of each action in conjunction with the context by understanding the relation between the context and the reward distribution for each action. The objective remains the same as before: to minimize regret, exploring and exploiting as necessary. The agent can also be initialized with a policy, that is, a function approximator. It can be a linear or nonlinear approximator, or anything else that estimates the Q values. The agent then takes the context in using this policy, the function approximator, and adds the exploration and exploitation components we discussed before on top of that. Notice one difference from other RL problems: there is no state transition. The episode ends after the agent pulls an arm, and it starts again from the very beginning. There is no complex state space to traverse, which makes it much easier to explore, to see the value of different exploration strategies, and to quantify and tune them.
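To make the exploration and exploitation trade-off and the regret metric more concrete, here is a minimal sketch of an epsilon-greedy multi-armed bandit, assuming four arms with made-up reward distributions. The arm means, the epsilon value, and the number of pulls are illustrative choices, not values from this module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: four "machines" (arms) with unknown mean rewards.
# The agent never sees these means directly; it only observes sampled rewards.
true_means = np.array([3.0, 5.0, 8.0, 12.0])   # the last arm pays the most on average
n_arms = len(true_means)

counts = np.zeros(n_arms)        # how often each arm has been pulled
estimates = np.zeros(n_arms)     # running estimate of each arm's mean reward
epsilon = 0.1                    # fraction of pulls spent exploring
total_reward = 0.0

for t in range(1000):
    # Exploration vs. exploitation: mostly pull the arm that looks best so far,
    # but occasionally pull a random arm to keep learning the distributions.
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax(estimates))

    reward = rng.normal(true_means[arm], 1.0)   # sample from the unknown distribution
    total_reward += reward

    # Incremental update of the estimated mean for the pulled arm.
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

# Regret: the expected reward of always pulling the best arm minus what we actually earned.
regret = 1000 * true_means.max() - total_reward
print("estimated means:", np.round(estimates, 2))
print("regret:", round(regret, 1))
```

Lowering epsilon means less exploration and faster exploitation of the current estimates, which is exactly the trade-off that regret lets you quantify.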
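Extending that sketch to the contextual case, each arm can get a linear function approximator that maps the context vector to an estimated Q value, with epsilon-greedy exploration layered on top, mirroring the policy description above. The feature dimension, learning rate, and simulated environment below are assumptions for illustration only, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(1)

n_arms, n_features = 4, 5
# One linear approximator per arm: estimated reward = weights[arm] @ context.
weights = np.zeros((n_arms, n_features))
epsilon, lr = 0.1, 0.05

# Illustrative assumption: a hidden linear relation between context and reward
# that the agent has to discover from interaction.
hidden = rng.normal(size=(n_arms, n_features))

for t in range(5000):
    context = rng.normal(size=n_features)           # side information for this round

    # Policy: the function approximator scores every arm for this context,
    # and epsilon-greedy exploration sits on top of it.
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax(weights @ context))

    reward = hidden[arm] @ context + rng.normal(0.0, 0.1)

    # One gradient step on the squared error of the chosen arm's estimate.
    error = reward - weights[arm] @ context
    weights[arm] += lr * error * context
    # Each round is an independent one-step episode: no state transition follows.
```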
I will give you another analogy for how contextual bandits work. Imagine you wake up in the morning and have to choose a way to travel to work. You have four options: a bicycle, a car, a train, or a bus. Your reward, the least amount of time spent in transit, depends on which travel option you choose. In multi-armed bandits, you don't have any context, so you choose based on history. One time you cycle, the next time you take the car, and perhaps you took the bus one time as well. Based on the time each of these actions took, you learn the distributions and try to choose the optimal arm next time. However, you don't know anything about the context, the weather. Now, with the contextual bandit, you have weather information. So you wake up in the morning and figure, hey, it's sunny, so cycling might be the better option. Or let's say it's rainy, so taking the bus might be the better option. With that extra information, you have a basis for deciding which action to take.

Let's discuss another example of multi-armed bandits. Here, you have a multi-armed bandit without state. That is, there is no state because every play is a full episode and is independent of any other. The rewards received are only related to the action executed, so the agent learns which action yields the best reward most often. This example focuses on an online retailer and the website where they want to offer a product recommendation. In this case, you, the agent, try showing one item, then another item, and learn that showing the second item is somehow more rewarding. The user actually clicks and buys it, which earns you $22. Sometimes they will even buy it twice, in which case you earn $44. However, each play is completely independent, and the agent doesn't really understand why it works, especially for item two.

Alternatively, we can look at the problem with some context. The reward is conditional on the state of the customer's environment: rewards vary according to the state, or context, in which the agent is operating. The agent has more data points to analyze to decide which action to take. Now you consider: Was the user on your website before? Did the user get a recommendation? What time of day is it? Is the user at work or at home? What is their environment? Part of the context is also on which page of my retail shop I am showing my products. For example, if it's a page about photography, showing some cameras or related products is desirable. There is also another piece of context: where should my recommendation be placed? It can look different on another page. With an understanding of what the user looked into in past sessions and what their interests might be, showing item one can now also lead to rewards. How? Because users who are interested in it can then buy the product. Now the agent has a better way to know when it should show the recommendation and in which context.
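As a usage sketch of the retailer example, the context could be encoded as a small feature vector (time of day, page topic, past interest, and so on) and scored by a trained contextual bandit like the one sketched earlier. The feature names and the choose_item helper below are hypothetical, introduced only to illustrate the idea.

```python
import numpy as np

def choose_item(weights, context, epsilon=0.1, rng=np.random.default_rng()):
    """Hypothetical helper: pick a product to recommend given the current context."""
    if rng.random() < epsilon:
        return int(rng.integers(len(weights)))      # occasionally explore another item
    return int(np.argmax(weights @ context))        # otherwise show the best-scoring item

# Illustrative context for one visit to the retail site:
# [is_morning, is_at_home, on_photography_page, saw_recommendation_before, past_camera_interest]
context = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

# 'weights' would normally come from a trained contextual bandit (one row per catalog item);
# random values stand in here purely for illustration.
weights = np.random.default_rng(2).normal(size=(3, 5))
item = choose_item(weights, context)
print("recommend item", item)
```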