In this video we are going to talk about support vector machines. Let's think again about the classification task we first saw. This is a passage about some medical condition, and we need to decide whether it is nephrology, neurology, or podiatry. Looking at the words, like athlete's foot, infection of the foot, and so on, we would know that this is podiatry. Let's take another example, this time about sentiment analysis. This is somebody's review of a movie, and you can see that the reviewer has given it some stars and so on, but has also used words like wow and great movie. Using those words we can identify that this particular review is positive. If you had seen words like boring, or lame, or worst, and so on, it would be a negative review. So again this is a sentiment classification task, where looking at the words, we had to determine whether the review is positive or negative.

So in general, a classifier can be seen as a function on the input data. You have some function f that works on medical passages and decides what the class label should be: nephrology, neurology, or podiatry. Or another function looks at a movie review and decides it's +1, which stands for a positive review, or -1, which stands for a negative review. It's typical to have numbers like +1 and -1 assigned to classes, and that's what support vector machines, or any other linear classifier, typically use.

When we're talking about support vector machines, we need to talk about decision boundaries. What is a decision boundary? Let's take this example. There are just two dimensions, X1 and X2, and there are data points in this two-dimensional space. Some of them are marked as stars, others are marked as circles. Let's call the circle ones Class A and the stars Class B. Now you need to somehow draw a boundary around one of the classes to separate it from the other class. You could use any decision boundary: one that is a circle, or one that's a square, or any other shape, in this case let's take the shape of a heart. A classification function is represented by this decision surface, or decision boundary. The choice of what shape to use is completely dependent on the application. If you think that the points in this two-dimensional space are well represented by a circle, you would use that. Whenever you use a decision boundary, the main purpose is that when you get a new, unlabeled point, you can label it as one of the classes depending on the decision boundary you have learned. So in this particular example, this unlabeled point will be marked as a star, or Class B, because it is outside the boundary that was learned around Class A.

But when you're using a decision boundary, there are other factors that should be considered. Let's take another example where you have positive and negative points, so positive is one class and negative is the other class, and all the red positives and negatives are the training data that we have. Looking at this data, you could learn a boundary like this. It's an irregular-shaped pentagon, or in general any other polygon or closed surface, that puts all the positive points inside the polygon and all the negative points outside. This is a perfectly good decision boundary.
The accuracy on the training data is 100%: every point is labeled correctly. But then, when we look at the test data, represented by the black positives and negatives here, you start to see that yes, in general the positive test points are near the positive points you saw earlier, and the negative points are near negative points. But because of the way we defined the decision boundary, we make many more mistakes here. There are many positive points outside of the polygon, where everything is supposed to be negative, and there are points within the polygon that are negative. So on unseen test data, this decision boundary has not worked very well. This problem is called overfitting: you have learned a decision boundary that works very well on the training data but does not generalize to the test data.

So instead of an irregular-shaped boundary, let's consider a linear boundary, a straight line. In a two-dimensional space it is a straight line, in a three-dimensional space it would be a plane, and in general, for an n-dimensional data representation, it would be a hyperplane. So in this particular case, you have a line that separates the space so that one half of this two-dimensional space is positive and the other half is negative. As you can see, it's not 100% accurate. You have three negative points on the positive side, and you have one positive point on the negative side. So yes, it has made some mistakes on the training data, but it's very easy to find, because you basically have to learn a line that maximally separates the two classes and makes as few errors as possible, and you can see that no single line could have separated the positives from the negatives completely. So this is as good as you can get. It's also very easy to evaluate, because now you have a very simple rule: any point in the half-space on the positive side is going to be called positive, and any point on the other side is going to be called negative.

In general, this idea of using a simple classifier or a simple model to fit the data is called Occam's razor, and the rule of thumb here is that simple models generalize well. You want to use something that is simple to explain and that will work well on the test data. And as you can see, if you overlay the test points, you will notice that all test points have been correctly classified, except for one positive that's on the negative side. So the error is really just one point on the test data, and of course there are four points in the training data that were misclassified as well. But overall this is still a much better model than the irregular-shaped boundary, where we saw earlier that quite a few positive points and two of the negative points were misclassified in the test data.

So how do you find this linear boundary? You could use a lot of algorithms and methods to do so. Typically, when you're finding a linear boundary, you're finding W, the slope of the line. You basically want to find the coefficients, the weights associated with the dimensions X1 and X2, so W is typically of the same dimension, the same size, as your data points, and you use a linear function to decide which points are called positive or negative. There are many methods: the perceptron, linear discriminant analysis, linear least squares techniques, and so on.
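To make this concrete, here is a minimal sketch, assuming scikit-learn and a made-up set of two-dimensional points, of learning a linear boundary and reading off the weight vector W and the bias term:

```python
# A minimal sketch of learning a linear boundary with scikit-learn.
# The toy points and labels below are made up purely for illustration.
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],   # points in the "positive" region
              [6.0, 1.0], [7.0, 2.0], [8.0, 0.5]])  # points in the "negative" region
y = np.array([+1, +1, +1, -1, -1, -1])

clf = LinearSVC()            # learns a linear boundary of the form w . x + b = 0
clf.fit(X, y)

w = clf.coef_[0]             # the weight vector, same size as each data point
b = clf.intercept_[0]        # the bias term
print("w =", w, "b =", b)

# The decision rule: which side of the line is a new point on?
new_point = np.array([2.5, 2.5])
label = +1 if np.dot(w, new_point) + b > 0 else -1
print("predicted label:", label)
```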
Of all these methods, perceptrons, linear discriminant analysis, linear least squares, and so on, in this case we're going to talk about support vector machines. The problem is that whenever you have a linear boundary that separates the positive points from the negative points, if there is one line that splits the positives and negatives, there are in fact infinitely many such lines. So you could have a line like this, or this line, or this line; they all separate the positive points from the negative points. Then the choice becomes: what is the best line here?

So let's take another example, very similar to what we have seen before. Again two dimensions, positives and negatives, and consider this line. This line is one that goes between the farthest positive and the farthest negative, and by farthest I mean the ones that are the most confusing, that are closest to the boundary, and this line splits the data as well as it can. It's not 100% accurate, as you can see there are negative points on the positive side and a couple of misclassified points on the negative side. But if you can learn this line, you could equally well have learned any other separating line, and the question becomes what a reasonable boundary is. So instead of the blue line, if you consider this red band, you would see that there is much less room for confusion. What do I mean by that? In the previous example, when you had just the blue line, any small change to a positive or negative point could cause an error. If you make a small perturbation to a positive point, say this point is moved a little bit over here, then suddenly you are either making a mistake or this line has to change to accommodate it. So a small change in a data point can lead to a big change in the classifier. However, if you use a line like this, where you have a big band separating the points, any small change, the same small change we saw before, would still be on the same side of the hyperplane as you would expect it to be. So a small change in your data points would not change the label that this classifier assigns to them; the classifier is more resistant to noise or perturbations.

This idea is called maximum margin. You want to learn the thickest band you can fit that separates the positive points from the negative points. This band, or rather its thickness, is called the margin, and the middle line through it is the actual hyperplane, so that's the maximum-margin hyperplane. The idea here is that you are not basing your classifier on just two points, one positive and one negative, but on a set of points called support vectors. And in general, a support vector machine is just a maximum-margin classifier, where it's not just one or two points that make a big difference: you have a lot of support to learn this margin or band, and the support vectors are the points that are the most sensitive to shifts. But even then, small changes or perturbations to the support vectors would still not change the classification, because classification happens with respect to the maximum-margin hyperplane. So now the question becomes how you find it. Support vector machines use optimization techniques to do that.
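Here is a minimal sketch, again assuming scikit-learn and the same made-up toy points, of fitting a maximum-margin classifier and looking at which points end up as support vectors:

```python
# A minimal sketch of a maximum-margin classifier with scikit-learn's SVC.
# The toy points are made up; support_vectors_ exposes the points that define the margin.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],
              [6.0, 1.0], [7.0, 2.0], [8.0, 0.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel='linear')   # a linear kernel gives a straight-line (hyperplane) boundary
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)           # the points closest to the boundary
print("margin width:", 2.0 / np.linalg.norm(clf.coef_[0]))  # thickness of the separating band
```

Notice that only a handful of points show up as support vectors; moving the other points around a little would not change the learned hyperplane at all.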
These are linear classifiers that find a hyperplane to separate two classes of data, the positive and the negative class. Basically, you are given training data of the form (x1, y1), (x2, y2), and so on, where the x's are the representations of the data points in terms of their features, some n-dimensional feature representation, and the y's are the class labels, one of the two labels -1 and +1. The SVM then finds a linear function defined by a weight vector W, such that f(xi) is the dot product of W and xi, so W.xi, plus some bias term b. The idea is that for a new xi, if this function value is greater than zero, then the label is +1, otherwise the label is -1.

You may have noticed that SVMs as described so far work only for binary classification problems, and we have seen examples of that, +1 and -1. But what happens when you have multiple classes? Let's take the example of the three classes: podiatry, nephrology, neurology. In general, when you use an SVM in a multiclass classification setup, you learn it in one of two ways. One is called one versus rest, where you learn binary classifiers between one class and all the other classes together. For example, this is the classifier between nephrology and all documents that are not nephrology. Assume that d1 to d9 are all nephrology documents. So this classifier perfectly learns nephrology as these nine documents, with anything that is not nephrology on the other side. Then we learn another classifier, this time for neurology, where the documents d10 to d15 are neurology documents, and everything else, d1 to d9 and d16 to d21, is not neurology. And then we do the same thing with podiatry: the documents d16 to d21 are labeled podiatry, and everything else, d1 to d15, is not podiatry. So in general we are learning three classifiers, or for an n-class setup we are learning n classifiers, such that all the points in one region, let's take the region with d16 to d21, are going to be labeled not nephrology, not neurology, and yes for podiatry. So this region is going to be podiatry. Right?

Let's take another setup, where instead of learning one versus rest, we do one versus one. That means you learn classifiers between pairs of classes, let's say nephrology and neurology. In this case you look only at documents that were labeled nephrology or neurology, that's d1 to d9 and d10 to d15, and separate them, so the red line just separates these two classes. Then you learn another classifier, this time between neurology and podiatry, that works on a smaller set: d10 to d15 as the neurology documents and d16 to d21 as the podiatry documents. And then the third one, which is between nephrology and podiatry. When you put these three classifiers together, you can see that in general, for an n-class setup, there are n choose 2 classifiers, a combinatorial number on the order of n squared. It so happens that for three classes it is three classifiers, but for four classes it would be six classifiers, and so on.

But then, you can still use the same voting idea for these points: between the class labels podiatry and neurology, the vote goes to podiatry; between podiatry and nephrology, it is also podiatry; and between nephrology and neurology it's basically a toss-up, because these points are right in the middle. So sometimes a point will be called nephrology and sometimes neurology by that one classifier, but the overall votes are still in favor of podiatry, because each of these points gets two votes for podiatry from the other two classifiers and at most one additional vote for nephrology or neurology. In any case, all of these points are still going to be classified correctly as podiatry. So both setups, on this particular toy dataset, label the points the right way, but the idea and the approach are very different: in one case you're learning one class against all the other classes together, and in the other case it's one versus one, with on the order of n squared classifiers.
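To make the two setups concrete, here is a minimal sketch, assuming scikit-learn and a made-up toy feature matrix for the three medical classes; OneVsRestClassifier and OneVsOneClassifier wrap a linear SVM into the two multiclass schemes described above:

```python
# A minimal sketch of one-vs-rest and one-vs-one multiclass SVMs with scikit-learn.
# The tiny feature matrix and labels below are made up for illustration.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X = np.array([[5, 0, 0], [4, 1, 0], [5, 1, 1],    # nephrology-like documents
              [0, 5, 0], [1, 4, 0], [0, 5, 1],    # neurology-like documents
              [0, 0, 5], [1, 0, 4], [0, 1, 5]])   # podiatry-like documents
y = np.array(['nephrology'] * 3 + ['neurology'] * 3 + ['podiatry'] * 3)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # n binary classifiers (3 here)
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # n-choose-2 classifiers (also 3 here)

new_doc = np.array([[0, 1, 6]])                    # looks most like the podiatry documents
print(ovr.predict(new_doc))                        # one-vs-rest prediction
print(ovo.predict(new_doc))                        # one-vs-one prediction, decided by voting
```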
Okay, so let's look at some of the parameters when we learn a support vector machine. One of the most critical parameters is C, which controls regularization. Regularization here refers to how important it is for each individual data point to be labeled correctly, compared to fitting a good overall model. So how much importance should we give to individual data points? A larger value of C means less regularization: you are fitting the training data as much as possible, every individual data point becomes important, and you want to get each one right even if the overall accuracy drops or the generalization error is high. Smaller values of C mean more regularization: you are tolerant of errors on individual data points as long as, in the overall scheme, you get a simpler model, so the generalization error is expected to be lower. The default value in most packages is 1, but you can change this parameter depending on how important you want individual data point classifications to be.

The other parameters typically concern the type of decision boundary you want to learn. We've talked a lot about linear kernels here, linear decision boundaries, but you could have a polynomial kernel, or a radial basis function kernel, and so on. In fact we talked about these in Course 3 of this specialization, so I would refer you to the slides and discussion in that previous course. Another parameter for SVMs is whether the problem is multiclass or not, and if it is, which scheme to use: OVR is the parameter setting for one versus rest, and other options are possible. As you can see, with one versus rest you are learning a smaller number of classifiers, so that is usually preferred over one versus one. The other important factor and parameter is class weight: different classes can get different weights. For example, take spam versus not spam, and suppose spam is around 80% of the e-mail somebody gets. Because it's such a skewed distribution, where one class is 80% and the other is 20%, you would want to give different weights to these two classes. In general these parameters can be set in the Python models, and we'll talk about exact parameter settings in another video.
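Here is a minimal sketch, assuming scikit-learn's SVC and purely illustrative values, of where these parameters show up; the one-vs-rest versus one-vs-one choice was shown in the earlier multiclass sketch:

```python
# A minimal sketch of the SVM parameters discussed above, assuming scikit-learn.
from sklearn.svm import SVC

clf = SVC(
    C=1.0,                    # regularization: larger C fits individual points harder,
                              # smaller C means more regularization and a simpler model
    kernel='linear',          # type of decision boundary; 'poly' and 'rbf' are other options
    class_weight='balanced'   # weight classes inversely to their frequency,
)                             # useful for skewed data like the 80/20 spam example
# clf.fit(X_train, y_train)   # you would then train it on your (hypothetical) training data
```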
But to conclude, the big take-home messages about support vector machines are these. They tend to be among the most accurate classifiers for text, especially when we are talking about high-dimensional data, as text data typically is. There are strong theoretical foundations for support vector machines; they are based on optimization theory, which we have not talked about in any detail in this video, but I would encourage interested folks to go and check it out. We'll add some pointers to the reading list for this video.

Support vector machines handle only numeric data, so what would you do with other data? You typically convert categorical features into numeric features. As you can see, this is a dot-product-based algorithm, so it uses numbers to decide which side of the boundary a point falls on; it works only on numeric features. You would also want to normalize the features. That means you don't want some dimensions to have very large values and other dimensions very small ones; you want to put them all into a zero-to-one range, so that each feature gets enough importance and the relative importance can be learned. But in general, the hyperplanes that you learn, even though they are very easy to learn and easy to test with, are hard to interpret, which means you don't really know why a particular point has been labeled positive or negative; it's a combination of multiple factors. Still, if explanation is not a critical requirement for the classifier, support vector machines give you very accurate classifiers, very similar to naive Bayes. In fact, for a lot of text classification problems, support vector machines should be one of the first things you try.
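To recap the feature-preparation points above, here is a minimal sketch, assuming scikit-learn and made-up values, of converting a categorical feature into numbers and scaling a numeric feature into the zero-to-one range before training an SVM:

```python
# A minimal sketch of preparing features for an SVM: categorical -> numeric, then 0-1 scaling.
# The category names and numeric values below are made up for illustration.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

categorical = np.array([['nephrology'], ['neurology'], ['podiatry']])
numeric = np.array([[10.0], [2000.0], [350.0]])    # a feature on a very different scale

encoded = OneHotEncoder().fit_transform(categorical).toarray()  # categories become 0/1 columns
scaled = MinMaxScaler().fit_transform(numeric)                  # values squeezed into [0, 1]

print(encoded)
print(scaled)
```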