We're still dealing with simple linear regression here, but now we want to know how to estimate the coefficients. Remember, in the original simple model we have Beta_0 and Beta_1: y is approximately Beta_0 plus Beta_1 times x. How do we estimate these two things? Our setup is that, generally, you can't really do anything unless you have training data. How are you supposed to build a model if you don't let that model learn from data in the first place? We assume we have n observations. In this case, again, it's simple linear regression, so each x_i is just a single value. Later, when we move beyond simple linear regression, x will be a vector, but for now each observation is just an x value and a y value that are paired. Think about it with the TV advertising and sales example. Let's say one month we spent a certain amount on TV and sales came in at a certain level; the next month a different TV budget gave a different sales figure. In that case, our n observations are simply n months in which we watched how our TV budget related to sales. That's the setup here. For each of these points, so for all i, y_i is estimated by Beta_0 plus Beta_1 times x_i. Nothing has changed; we're just breaking the model down over the individual observations. We take one observation and try to model it with our simple linear regression, then another, and another, for every one of the n observations we have. Now, for each of these n observations, something extremely important comes up, and it's called a residual. For each point of TV advertising and sales, we have a residual. We have our estimate, which comes from the fitted function, and we also have the true value of sales for each month. Let's take the first month and look at what's happening. We fit this function using all the data.
Once we fit on all the data, we have a function that says: whatever TV advertising you put in, you get an estimate of the sales you'll see. That's just an estimate, what we think your sales will be, written y_i hat. Now, when you actually look at the first month of data, you can plug in real numbers: when TV advertising was a certain amount, sales were a certain amount that month. That actual y_1 value is a real value from the data. The residual says: for each data point, we have the true value from our training data and the estimate we got from the fitted function, and the residual is the difference between them. The word residual means what's left over. The true value we know from the data minus our estimated value is the residual: e_i = y_i minus y_i hat. Some people think of this as error, but later we'll learn other kinds of error and that gets confusing, so it's traditionally called a residual: the true data-driven value minus our functional estimate. Each data point has a true value, each data point has an estimated value, so each data point has a residual. From these residuals we build the residual sum of squares, the RSS, and it's an important concept. The RSS is simply each residual squared, summed up: you find the residual for each data point, square each one individually, and add up all the squares. Take the residual from the first data point and square it, the residual from the second data point and square it, and so on, then add all those up. That's the RSS, the residual sum of squares: each residual squared, summed over all data points.
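To make the residual and RSS computation concrete, here is a minimal sketch in Python with NumPy. The TV and sales numbers and the coefficient values are made up purely for illustration; any paired arrays and any candidate coefficients would work the same way.

```python
import numpy as np

# Hypothetical monthly data: TV advertising budget (x) and sales (y).
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
sales = np.array([22.1, 10.4, 9.3, 18.5, 12.9])

# Placeholder coefficient estimates (not yet fit to the data).
beta0_hat, beta1_hat = 7.0, 0.05

y_hat = beta0_hat + beta1_hat * tv   # estimated sales for each month
residuals = sales - y_hat            # true value minus estimate, one per point
rss = np.sum(residuals ** 2)         # square each residual, then sum
```

Each entry of `residuals` is one month's e_i; squaring and summing gives the single number RSS that we'll be trying to make small.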
Now, think about it: residuals are the difference between the true data and your estimate. Do you want that to be very large or very small? If your function mapping x to y is very good, then the difference between each true value and its estimated value will be very small. If the function is extremely accurate, that difference will be very small for most of the points, and if the differences are very small, your RSS will also be very small. That's what you want: a really good model whose estimates are very close to the true values. Because then each residual is really small, squaring something really small keeps it really small, and adding up a bunch of really small things gives you something pretty small, and that's good. You want to minimize the RSS. It turns out that when you do that for this model, y_i approximately equal to Beta_0 hat plus Beta_1 hat times x_i, minimizing the residual sum of squares gives you two equations. You get Beta_0 hat equal to y bar, the mean of y (you take all the y values and just average them), minus Beta_1 hat times x bar, the mean of x. Mean of x, mean of y, a simple equation. Cool, all good, except the question is: what's this Beta_1 hat? It turns out that when you minimize the RSS you also get an estimate for Beta_1 hat, which is the sum from i equals 1 to n of (x_i minus x bar) times (y_i minus y bar), so for each data point you take x minus its mean times y minus its mean and sum over all data points, divided by the sum from i equals 1 to n of (x_i minus x bar) squared, so again each data point's x minus its mean, squared, summed over all data points. That's your estimate for Beta_1 hat.
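Those two closed-form formulas translate almost word for word into code. A short sketch, again on hypothetical data, with a cross-check against NumPy's own least-squares fit:

```python
import numpy as np

# Hypothetical paired data (x = TV budget, y = sales).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Beta_1 hat: sum of (x_i - x_bar)(y_i - y_bar) over sum of (x_i - x_bar)^2
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Beta_0 hat: y bar minus Beta_1 hat times x bar
beta0_hat = y_bar - beta1_hat * x_bar

# np.polyfit with degree 1 does the same least-squares fit,
# returning [slope, intercept], so the results should agree.
slope, intercept = np.polyfit(x, y, 1)
```

The agreement with `np.polyfit` is just a sanity check that the hand-written formulas are the same least-squares solution a library would give you.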
This is called least squares, and it is, I would say, overwhelmingly the most popular way to estimate the coefficients. Given the general formula y equals Beta_0 plus Beta_1 x, the way to estimate Beta_0 and Beta_1 is least squares, and those are the two formulas that do it. In doing this, we minimize the residual sum of squares, which is good: if you minimize the residuals, you're finding the function that is as close to the true values as you could possibly find, and that's the objective here. Again, this is all with simple linear regression, so it stays fairly simple, but we will carry these ideas into the non-simple case of linear regression as well.
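As a quick sanity check on the claim that these formulas minimize the RSS, the sketch below (hypothetical data again) shows that nudging either coefficient away from the least-squares solution can only increase the RSS:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

def rss(b0, b1):
    """Residual sum of squares for candidate coefficients b0, b1."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Least-squares estimates from the closed-form formulas.
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Perturbing either coefficient should never decrease the RSS.
assert rss(b0, b1) <= rss(b0 + 0.1, b1)
assert rss(b0, b1) <= rss(b0 - 0.1, b1)
assert rss(b0, b1) <= rss(b0, b1 + 0.1)
assert rss(b0, b1) <= rss(b0, b1 - 0.1)
```

Trying a few perturbations is not a proof, but it is a useful way to convince yourself that the closed-form estimates really do sit at the bottom of the RSS surface.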