In this video we are going to talk about one of the classification algorithms called the NaÃ¯ve Bayes classifier. Let's take a case to set up the scenario. This is the task of classifying text search queries. So suppose you are interested in classifying search queries and you have three classes. The Entertainment class, the Computer Science class, and the Zoology class. The three subjects. And then you know that most of these search queries are about Entertainment. So that's what you know coming in. That if you don't know anything else, the chances that it's an entertainment related query is pretty high. Then you get the query. And the query is "Python". And the task is still the same. The task is, can you classify Python as entertainment, computer science, or zoology? Is it Python the snake? If that is the case then it is a zoology document. But what it is actually Python the programming language like apply data mining in Python, right? Then it belongs to computer science. But if it is Python as in Monty Python then it is entertainment. So just the word "Python" could still mean one of these three. But then you think that Python as itself, the most common class becomes zoology. So even though generally the queries are entertainment queries, when you see the word "Python" and if that is your query then the chances that it is actually zoology becomes higher. Then you have another word and this time is "Python download". And then you can say that the most probably class is in fact computer science. So what just happened there? You had a model, in this case a probabilistic model that tells you the likelihood of a class before you have any information. And then when you are given new information, you updated that likelihood of the class. So you had a model that said "Entertainment is typically what any standard query, typical query, would be". But then given the new information about the word "Python" you said "Oh, the likelihood of it being zoology is higher". And it's not likely to be entertainment anymore. Even though there are cases of Python, as in Monty Python, where it would be entertainment. So this change is what is in the crux of this NaÃ¯ve Bayes classifier. You have something called prior probability that is the prior knowledge or the prior belief that the label belongs to entertainment, Or the label it's computer science, or the label is zoology. This is one of the three options you have. So you have a prior probability for each of them. And by basic probability you know that these are the only three options. So the sum of these two probabilities should be one. It's definitely one of the three. It's either entertainment, or computer science, or zoology. But among the three entertainment is more likely. And then when you have new information like the x is Python and input is Python, so when you have new information your probability and likelihood changes. Your probability of entertainment given the input is Python is suddenly lower. And the probability of y being computer science given x is Python is higher. And probability of y given zoology is even higher. That's why you would say that "Given the input Python, the label is zoology". So this is encapsulated in what is called Baye's Rule or Based theorem and that says that "The posterior probability depends on your prior. But then also depends on the likelihood of that happening or that event happening". And divided by the evidence. So in mathematical terms it comes to probability of y given X is probability of Y which is the prior probability. And probability of X given y which is the likelihood of having the data as X given y. That means, if you know that the label is zoology, the probability of seeing Python is so in zoology documents let's say. Or if you know that the label is entertainment the, the probability of seeing Python in entertainment is so and so and that is lower. That's significantly lower than the other one. Then if the class was zoology. So the NaÃ¯ve Bayes classifier looks at this computation, looks at what is the probability of a class like computer science given the input as Python. And if computes is saying "What is the chance that it was coming to science in general? What is the likelihood of it being computer science or the prior probability? And then what is the likelihood of seeing the word Python in documents that are computer science. The same thing with zoology. So the probability of the class being zoology given Python is, what is a prior probability of the class being zoology without knowing any information? And then given that it is the zoology class document, what is the chance that you will see? What is the likelihood of seeing Python? And then you will say that if probability of y as CS given Python is higher than y as zoology, the label as zoology given Python then I'm going to call the label as computer science. So in general, you're saying that probability of y given X is computed this way. But then the NaÃ¯ve Bayes classification model just is interested in which of the three labels is more likely. So you know that y belongs to one of the three classes. Entertainment, conputer science, or zoology. It's only important to know which among those three is higher. And so, the true label, or the predicted label I should say, y* is the y that maximizes probabilit of y given X. And then that computation it does not matter what the probability of X itself is. So what is the probability of seeing a query like Python. And you can remove it then because it does not matter, it doesn't change with the label assigned to it. This in addition to what is called the NaÃ¯ve assumption of Bayesian classification forms the NaÃ¯ve Bayes classifier. And the NaÃ¯ve assumption is that given the class label, the features themselves are independent of each other. That is given the label is y, probability of capital X given y is just individual feature probabilities. Probability of x_i given a product of all of those. That's the product that goes from the first feature to the end feature. This is the final formulation of a NaÃ¯ve Bayes classifier. So the formula stands like this. The predicted label y is the y that maximizes, the argument that maximizes this computation of probability of y given X. Which is computed using Bayes Rule as probability of y, that is the prior, times T independent products of individual features given y. So that's the likelihood of probatility of capital X given y. So for example, if the query is Python download, you're going to say "The predicted y is the y that maximizes probability of y. Probability of Python given y, and probability of download given y. So for example, if it is zoology, you know that probability of zoology is low. People don't typically ask zoology queries. But then probability of Python given zoology is very high. However, probability of download given Python is very low. However, in the case of computer science, probability of computer science queries in general is somewhere in the middle. Probability of Python given computer sciences not up there but also significant. Whereas probability of download given computer science is very significant. And a product of all of the three makes computer science as to be the best predicted leap. So in NaÃ¯ve Bayes we saw the model is just individual probabilities that you multiply together. So what are the parameters there? We have the prior probabilities. The prior probabilities are these probabilities for each of the classes in your set of classes. So that's probability of y for all y in capital Y. And then you have the likelihoods. And that this probability of seeing a particular feature in documents of class y. The probability of x_i given y that is import, no, that is needed for all features x_i and all labels y in Y. So it's a combination of every feature in each of these classes that you have. So let's take an exercise. If you have three classes that is capital Y is three. And you have hundred features in your X, that means X goes from X1, X2 up to X100. Can you compare how many parameters are there in the NaÃ¯ve Bayes model? Give it a try. No? Once you know what are the parameters that you learned in a NaÃ¯ve Bayes model, let's see how do you actually learned it. So you have prior probabilities like probability of y for all Y in the set of labels. How do you identify that? How do you know that the most common class of search queries is entertainment? Well, you have the training data. So you have the set of documents are all the queries that are labeled as which class they belong to. So you have a query such as Ashton Kutcher and that query would be entertainment, or Lady Gaga. That is also entertainment. Or tiger. And in the context of zoology as an animal and that would be zoology. Or something like C++ and that is computer science and so on. So you can just count the number of instances you have in your training data in each of these classes. In the example I gave, I gave you two examples of entertainment and one each of computer science and zoology. So your class distribution just with four instances would be half for entertainment, one quarter for zoology, and one quarter for computer science. In general, if there are any instances and small enough those are belonging to a particular class y then your probability of class y is small and or capital N. The next set of parameters you have is likelihood. So recall that likelihood is probably of x_i given Y for all features x_i and for all labels y. So it's a combination of the many, many features here. So how do you count those? So the way you do that would be you count the number of times feature x_i appears in instances labeled as class y. So you only focus on the instances that were labeled class y and among those see how many times the feature x_i occurs. So for example, you will only look at documents that belong to the computer science class and among those you would count how many times Python appears. OK? Or you would look at all the queries that we will call entertainment and then look for word like "actor" and see how many times the word "actor" occurs in those queries. So that would be probability of word "actor" given the class entertainment. So there are p instances of class y and x_i the feature x_i appears in k of them then the probability of x_i and y would be k / p. When you do this counting there is a slight problem. And that problem is, what happens if the probability becomes zero? So for example. If you have never seen the word "actor" in queries that were labeled entertainment. Or let's say you never saw the word "python" in the queries that were labeled entertainment. Because for some reason in your dataset you never saw Monty Python. Then the probability the way you compute it would be zero. But that is a problem because if you have the probability to be zero and it is multiplied to all the other probabilities, the overall probability will end up being zero. So when a feature x_i never occurs documents labelled y, it can never be the case that whenever you see the word, the feature x_i that the label is y. You don't want to be that strict. So you don't want to say that the posterior probability of y given x_i will be zero because that would make any document that has Python to never be called an entertainmenet class, right? So what you would want to do instead is to smooth the parameters and then you smooth the parameters. One of the options to do it would be just add a dummy count. You say, it's never the case that the count to zero. So I add count of one to every word in every class. This does not change the order probability significantly because that's the way we are computed these probabilities. These are called "maximum likelihood estimations" that typically good when you have large numbers but they're not so good when you have very small numbers. So by adding count of one, you don't really significantly change the larger numbers there. Because what you do now is you say, probability of X_i given y instead of k / p, you're going to say "I'm going to add one to every word". So it's going to be k + 1 over p plus n because in all I have added n words as dummies. So that would be your way by which now you'll never have a probability of zero here. Because if k indeed was zero, the probability would still be one over p + n. So this way, how can I've awarded this problem of zero counts? Now, you might imagine and wonder if you would want to do something similar to the prior probabilities. Think about it and we'll talk about it in one of the queries. The immediate questions soon. So to bring it all together. The big take home messages from this video is that Naive Bayes is a probabilistic model and it is called Naive because it assumes that features are independent of each other given the class label. It is Naive because it's actually not necessarily true even for text. So for example, the fact that you have "White House" as two words but together having a significantly different meaning. If you see the word "White" with a capital W, the chance that it is followed by "House" is very high. So these two features of "White" and "House" are not really independent of each other. But NaÃ¯ve Bayes models kind of ignored that. They say that given the class label, I'm assuming all features are independent because the math on how you can compute the probabilities becomes easy. Even so, for text classification problems, NaÃ¯ve Bayes models actually typically provide very strong baselines. It is traditionally the first one you should try because. It will give you the benchmark or a baseline that is pretty strong. And then you can look at other models that you see soon for how well it improves over NaÃ¯ve Bayes models. It's a very simple model to learn. The parameters are very easy to understand and to learn because they are just counts. Ad they are just counts counted in different ways and so on. So the big message is NaÃ¯ve Bayes is a very important classifier for text classification and should be the first one you should try.