Next, I will discuss different applications of reinforcement learning and where RL could be better suited than other ML types for your use cases. Now, keep in mind that the intended application of reinforcement learning is to evolve and improve systems without human or programmatic intervention. It is in fact used successfully across industries for various use cases, such as in business, gaming, recommendation systems, and science.

Now, I will discuss some real-life applications that optimize for multiple objectives, that is, multiple rewards, rather than just one: they optimize both for themselves and for all other parties involved. Spotify and other similar cloud audio platforms give music recommendations. But there's more. The goal is to ensure both that you get the best possible selection of songs in your lists and that artists get visibility and interaction from their side. So these platforms have to balance two objectives.

RL is also well suited for online e-commerce retail shops. Normally, retailers collect customer behavior data on the service, including which pages are opened and when, what was searched for, and whether product recommendations were selected. When customers spend more time browsing the retailer's site, the behavior data the retailer has to work with increases. The goal is that our agent can influence the recommendations.

So, why don't preset rules work well? Unfortunately, not all customers behave the same way. Some know what they want and search for something in particular; they might spend longer on average on each page they visit but skip category overview pages. Other customers browse to find inspiration. The variations can get too complex for simple rules. We want an agent that adapts to various customer behaviors: the agent learns to differentiate between groups and determines the most appropriate action for each.
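To make the multi-objective idea concrete, one common approach is to scalarize several reward signals into a single number with weights. This is a minimal sketch; the function name, inputs, and weights are illustrative assumptions, not taken from any specific platform:

```python
def combined_reward(listener_satisfaction, artist_exposure,
                    w_listener=0.7, w_artist=0.3):
    """Combine two objectives into one reward the agent can maximize.

    Both inputs are assumed normalized to [0, 1]; the weights express
    how the platform trades one objective off against the other.
    """
    return w_listener * listener_satisfaction + w_artist * artist_exposure

# A recommendation that pleases the listener but gives the artist
# almost no exposure can score lower than a balanced recommendation.
unbalanced = combined_reward(0.9, 0.1)
balanced = combined_reward(0.7, 0.7)
assert balanced > unbalanced
```

The weights encode the balance the platform has to strike; tuning them changes which behaviors the agent learns to prefer.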
So RL helps a retailer increase sales through customer-specific recommendations rather than performing the same action for every customer based on a set of strict rules. The retailer gets better sales and the customers get more relevant recommendations.

Finally, Google DeepMind's AlphaGo Zero is an example of an amazing achievement as well. This algorithm shows how an agent can be trained in the highly complex domain of Go from a blank slate, with no human expert play used as training data, to a superhuman level. AlphaGo Zero is known not only for beating the best player on earth: when the algorithm was recreated without human input, it became even better in a shorter time. There are so many more examples, but here are a few more to think about: game playing, strategy planning, robotics for industrial automation, aircraft control, machine learning and data processing, and training systems that provide custom instruction and materials according to the requirements of a learner. In summary, the problems mentioned here can be thought of as ones where the solution can optimize both for individuals and for all the parties.

Use supervised learning and not reinforcement learning when you have a prediction problem, such as regression or classification problems with labels. You want to ensure that the distance between a predicted distribution and the actual distribution is as low as possible; you want the prediction to match reality. It is more of an instantaneous prediction at the time step, and the outcome is short term. You can do offline training on cold, historic data: in these situations, you can train a model on the static data. All data is IID, which stands for independent and identically distributed, and it means that there are no overall trends. In supervised learning, the data is often IID.
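The "distance between a predicted distribution and the actual distribution" is typically measured with a loss like cross-entropy, which classification models minimize. Here is a minimal sketch in plain Python, with no framework assumed:

```python
import math

def cross_entropy(actual, predicted, eps=1e-12):
    """Distance between the actual label distribution and the predicted
    one; lower means the prediction matches reality more closely."""
    return -sum(a * math.log(p + eps) for a, p in zip(actual, predicted))

actual = [0.0, 1.0, 0.0]     # one-hot true label: class 1
good   = [0.05, 0.90, 0.05]  # confident, correct prediction
bad    = [0.70, 0.20, 0.10]  # confident, wrong prediction

# The better prediction is "closer" to the actual distribution.
assert cross_entropy(actual, good) < cross_entropy(actual, bad)
```

Note that this loss is differentiable in the predicted probabilities, which is exactly the property supervised training relies on and, as discussed below, RL does not require.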
Another case is when no trial and error is required, because you have the target and the input features, and the environment is static, or the problem is suited to low variance and dynamism. Or you want to apply transfer learning, which is very mature in supervised learning, especially for image- and text-based tasks: you can pre-train a model and then transfer its knowledge to another task, which may not be identical to the previous one and requires much less training. And finally, when you want a differentiable, non-noisy loss: in supervised learning, you want the loss function to be differentiable and preferably not noisy. The loss might not be the exact metric you're trying to optimize for, but it is the closest proxy to the actual business metric that is differentiable.

On the other hand, use reinforcement learning and not supervised learning when you have a control, optimization, or decision-making problem, where you are both predicting something and responsible for taking an action among a set of actions. Or you want to optimize for the long term, where the outcome is delayed and value oriented: you get the reward after a longer time, in service of a strategic long-term goal. Or you want real-time training or offline simulation: you can do real-time training or offline simulation in reinforcement learning, but normally you don't learn from static data unless it's a contextual bandit. You cannot simply dump data into a data warehouse like BigQuery and learn from it.

Also, it's possible to have non-IID data. In RL, the current state, the action you take in the current state, and the next state, the next situation you move to, might be correlated. That is, if you move to the right, then the final state you finish in is probably on the right side of the room. Due to correlation, the data is not necessarily IID. Another case is when trial and error is necessary because the state space is large, and the agent must try lots of scenarios before it learns optimal decisions.
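A tiny illustration of why RL data is correlated rather than IID, using a hypothetical 1-D random walk rather than any particular environment: each state in a trajectory depends on the previous one, so the final state reflects the actions taken along the way.

```python
import random

def rollout(steps=50, right_bias=0.8, seed=0):
    """Simulate a 1-D walk where the agent mostly moves right.
    Each next state depends on the current state, so the sequence of
    visited states is correlated, not a set of IID samples."""
    rng = random.Random(seed)
    state = 0
    trajectory = [state]
    for _ in range(steps):
        action = +1 if rng.random() < right_bias else -1
        state += action  # next state depends on current state
        trajectory.append(state)
    return trajectory

traj = rollout()
# An agent biased toward moving right finishes on the right side:
# exactly the "end up on the right side of the room" correlation.
assert traj[-1] > 0
```

Shuffling this trajectory and treating the states as independent samples would throw away the structure the agent needs to learn from, which is why standard supervised assumptions don't directly apply.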
There is no static definition; this trial and error happens either in real time or in an offline simulation. Or use RL when the problem involves higher variance and dynamism, or when transfer learning is not yet an option: in RL it is not mainstream yet. And finally, outcomes do not need to be differentiable, and a noisy reward is OK. Unlike supervised learning, where you try to minimize the loss, that is, minimize the distance between the predicted and the target distribution, in RL you instead try to maximize the value. The reward and value functions don't need to be differentiable, and it's okay if the reward is noisy too. It's preferable if the reward is not noisy, but the restriction to differentiability is not critical. You can directly optimize for the metric that you care about, and not necessarily a proxy.

We have taken you through some model-free approaches, where the agent learns directly from its interaction with the environment. We have also discussed model-based approaches, where the model learns from the environment and then the agent learns from the model. You might be wondering, why not combine both approaches? Combining lets the agent learn from both real experience online and the trajectories generated by the model, that is, the offline planning phase. The agent uses both sources to act on the environment. Just like actor-critic approaches combine both value and policy approaches, the agent can benefit from a combination of model-free and model-based methods.
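This combination of learning from real experience and from model-generated trajectories is the idea behind Dyna-style algorithms: the agent does ordinary Q-learning updates on real transitions, records those transitions in a learned model, and then replays simulated transitions from the model as extra planning updates. The sketch below is a minimal Dyna-Q on a tiny chain environment; the environment, hyperparameters, and function name are illustrative assumptions:

```python
import random

def dyna_q(n_states=5, episodes=200, planning_steps=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Minimal Dyna-Q on a 1-D chain: start at state 0, actions move
    left (-1) or right (+1), reward 1 for reaching the last state."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in (-1, +1)}
    model = {}  # (state, action) -> (reward, next_state), from real steps

    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # model-free part: epsilon-greedy action in the real environment
            if rng.random() < epsilon:
                a = rng.choice((-1, +1))
            else:
                a = max((-1, +1), key=lambda act: Q[(s, act)])
            s2 = min(max(s + a, 0), n_states - 1)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # direct Q-learning update from the real transition
            best_next = max(Q[(s2, b)] for b in (-1, +1))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            model[(s, a)] = (r, s2)
            # model-based part: planning updates on simulated transitions
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                best = max(Q[(ps2, b)] for b in (-1, +1))
                Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
            s = s2
    return Q

Q = dyna_q()
# After training, moving right should look better than moving left
# in every non-terminal state.
assert all(Q[(s, +1)] > Q[(s, -1)] for s in range(4))
```

The planning loop is what the "offline planning phase" refers to: the agent squeezes extra value updates out of each real interaction by rehearsing transitions its model has already seen.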