In almost all cases, our models are going to make mistakes when they're predicting labels. We talked about the bias-variance tradeoff earlier when we introduced K-nearest neighbors; now we're going to discuss it in more detail. By the end of this video, you'll understand how errors arise due to bias and variance, be able to identify underperforming models, and understand some techniques that control over- and underfitting. The bias-variance tradeoff is a fundamental property of learning; we can't get away from it. We sometimes think of prediction errors, at least in the mathematical sense, as resulting from two different categories: error due to bias and error due to variance. These two types of errors are linked, so there's a trade-off between a model's ability to minimize errors due to bias and its ability to minimize errors due to variance.

Let's look at bias in more detail. In a previous course, we talked about how bias from a particular choice of data will lead to biased answers from your model. Here, when we speak of bias of the model, we're talking about bias stemming from our choice of hypothesis space. Remember, we can't consider all possible hypotheses; we have to start with a set of hypotheses of a particular type. This makes the final model biased, as no choice from this space will perfectly describe the real world. It's biased by our choice of hypothesis space via the learning algorithm and the features of our data. As we already discussed, the main goal of building a model is to get it to predict well on new data. If our initial hypothesis space only includes models that are really simple, we have underfitting: our choice of model can't fit the training data well enough. If it doesn't fit the training data, which it has actually seen, it's unlikely to do well on data it hasn't seen. On the other hand, if we allow an excessively complicated model, the model could fit the training data really well, but perform poorly on new data.
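To make this concrete, here's a minimal sketch (not from the video) that fits polynomials of different degrees to synthetic noisy data; the quadratic trend, noise level, and degrees are all made up for illustration. A too-simple model (degree 1) underfits, while a very flexible one (degree 15) fits the training set well but generalizes worse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic trend plus noise, standing in for real measurements
x_train = np.linspace(0, 1, 20)
y_train = 2 * x_train**2 + rng.normal(0, 0.1, x_train.size)
x_test = np.linspace(0, 1, 50)
y_test = 2 * x_test**2 + rng.normal(0, 0.1, x_test.size)

def mse(degree):
    """Fit a polynomial of the given degree to the training data,
    return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return float(train), float(test)

results = {d: mse(d) for d in (1, 2, 15)}
for d, (tr, te) in results.items():
    print(f"degree {d:2d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

The training error only goes down as the degree grows, but the test error is lowest near the true complexity of the data.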
Underfitting occurs when a machine learning algorithm cannot capture the underlying trend of the data well enough; the model is too simple. Overfitting, as we discussed after introducing decision trees, happens when the model learns the noise along with the signal in the training data, and so it won't perform well on new data the model wasn't trained on. One way to avoid overfitting is to simply refuse to allow complex models. In other words, restrict the hypothesis space to simple models. You might be able to see how underfitting and overfitting relate to bias. If we have an overly simple hypothesis space, we're avoiding overfitting, but introducing more potential for bias. It's unlikely the best hypothesis in this small space actually captures the real relationships. But we can't completely avoid bias by allowing infinite complexity, because that leads to errors due to overfitting.

This is where variance comes in, and here we mean variance as in how much our models vary. Are the models produced by our learning algorithm on our training data with the given features actually consistent, or do they change a lot with relatively small changes in the training data? We know that with more complex models, we can potentially avoid bias. But the more complex the model, the more likely our approach will have high variance. What creates that variance? Imagine you're repeating the whole model-building process more than once; each time, you sample new data from the same source and run a new analysis, creating a new model. Most phenomena are noisy, so the data will be different each time due to randomness. This means the resulting models will almost always be different. For a particular input, a set of highly varying models will give inconsistent predictions: inconsistent with each other, not necessarily inconsistent with the correct predictions. We referred to this earlier with decision trees when we highlighted how they're unstable.
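That thought experiment — repeat the whole model-building process on fresh samples and watch the predictions move — can be sketched directly. This is an illustrative toy, not the video's example: the data source, sample size, and polynomial degrees are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noisy data source: quadratic trend plus noise
def sample_data(n=20):
    x = rng.uniform(0, 1, n)
    y = 2 * x**2 + rng.normal(0, 0.1, n)
    return x, y

# Repeat the whole model-building process on fresh samples and record
# each model's prediction at a single query point, x = 0.5
preds_simple, preds_complex = [], []
for _ in range(100):
    x, y = sample_data()
    preds_simple.append(np.polyval(np.polyfit(x, y, 1), 0.5))    # simple model
    preds_complex.append(np.polyval(np.polyfit(x, y, 10), 0.5))  # complex model

print("prediction spread, degree 1 :", float(np.std(preds_simple)))
print("prediction spread, degree 10:", float(np.std(preds_complex)))
```

Both model classes see equally noisy samples, but the complex model's predictions scatter much more from run to run: that scatter is the variance.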
Sometimes the variance comes from the learning algorithm itself. For parameterized models, the initialization of the parameters is usually done randomly. This means that if you're looking for complicated models, the same algorithm with different initializations can look at the same training data and find drastically different models. Different classes of learning algorithms will be sensitive to initialization to varying degrees.

So let's go back to our ice cream profit predictor from the previous video. We saw how introducing polynomial features created very different models. Different learning algorithms, learning parameters, feature transformations, and training datasets all result in models with different characteristics. In a perfect world, your learning algorithm would always generate a model that completely accurately predicted profits. This is like a champion archer that always hits the bullseye: low variance, because the shots are tightly clustered, and low bias, because they're clustered in the right place. On the other hand, you could have an archer like me, very inconsistent and not usually hitting the target: high bias and high variance, neither precise nor accurate. Not the learning algorithm you want, because not only does it give you wrong answers, they're inconsistently wrong. But those are not the only two options. We could have an archer that's pretty accurate but has trouble with consistency. They do hit around the target, but all around it. This learning algorithm creates models that have low bias but high variance: accurate on average but imprecise. Or we could have a learning algorithm that generates consistent models. They all predict profit in a similar way, but they're consistently off the mark. This archer always shoots a little to the right. They're very consistent, but clustered in the wrong spot: precise but not accurate, biased but with low variance.
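The initialization effect can be seen even without any dataset. Here's a deliberately tiny toy (my construction, not the video's): gradient descent on a one-parameter non-convex loss with two equally good minima. The same algorithm on the same objective lands in different places depending only on the random starting point:

```python
import numpy as np

# Toy non-convex loss (w^2 - 1)^2 with two equally good minima at w = +1 and w = -1
def grad(w):
    return 4 * w * (w**2 - 1)  # derivative of (w^2 - 1)^2

def train(seed, steps=500, lr=0.01):
    rng = np.random.default_rng(seed)
    w = rng.normal()           # random initialization is the only difference between runs
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Same algorithm, same objective, ten different random initializations
final = [train(seed) for seed in range(10)]
print(sorted(round(w, 3) for w in final))
```

Every run converges to an equally good "model", yet some runs find +1 and others find -1; with many parameters, as in neural networks, this effect multiplies.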
The real world is even worse than not having our champion archer as a learning algorithm, because we don't actually know where the bullseye is; that's why we're building a model, after all. So how do you know how well your learning process is doing when you don't have a convenient target to compare it to? We measure the bias in our models by looking at the performance on our training data. Algorithms with high bias will underfit the training data as well as perform poorly on the test or validation data. Variance in our models is a bit harder to measure. You can look at how much the output changes when you repeat the learning process on a different sample of training data. You can also take advantage of the relationship between complexity and variance and use tricks that encourage simpler models. In a later video, you'll learn about regularization, which is a systematic way of making the trade-off between bias and variance in your ML models in order to get better generalization on new, unseen data. Learning curves can also be used to diagnose your model's performance; we'll talk about that in a later video as well.

So now you know what we mean by bias and variance. Bias is how far predictions are from the correct class or value, and variance is how different a model's predictions are from each other. You also now know why you want to select a model that isn't too complex and isn't too simple, to manage that trade-off between bias and variance. Regularization is next; see you there.
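In practice we usually can't resample fresh data, but one common workaround — an assumption on my part, not something the video prescribes — is the bootstrap: refit the model on resamples of the one dataset we have and measure how much its predictions move. A minimal sketch with made-up synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)

# The one dataset we actually observed (hypothetical quadratic trend plus noise)
x = np.linspace(0, 1, 30)
y = 2 * x**2 + rng.normal(0, 0.1, x.size)

def bootstrap_spread(degree, reps=200):
    """Refit a degree-`degree` polynomial on bootstrap resamples and
    measure how much its prediction at x = 0.5 moves around."""
    preds = []
    for _ in range(reps):
        idx = rng.integers(0, x.size, x.size)  # resample rows with replacement
        coeffs = np.polyfit(x[idx], y[idx], degree)
        preds.append(np.polyval(coeffs, 0.5))
    return float(np.std(preds))

s1, s10 = bootstrap_spread(1), bootstrap_spread(10)
print("bootstrap spread, degree 1 :", s1)
print("bootstrap spread, degree 10:", s10)
```

The complex model's spread is much larger than the simple model's, flagging high variance without ever collecting new data.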