So I hope you're becoming more comfortable with how regressions work. The examples we've done so far were a very simple regression with the demand function and a slightly more complex regression with the ice-cream data, including some dummy variables. Now I want to show you some of the problems that come up when data aren't behaving as you'd like them to and we need to transform the data, okay? Here on this Excel spreadsheet, I have downloaded from the NBA information about Karl Malone. He was a player for a number of years, from 1985 through 2004. He played for the Utah Jazz, a Hall of Famer for sure and a terrific player.
So let's look at the scatter plot for his points per game, okay? I'm going to highlight this and put it into a scatter plot here, all right? You'll notice that his points per game here is kind of like an ellipse, right? And if we were going to plot his points per game over his career according to his age, let's put age in as the x-variable. Here's his age. We get a scatter plot that looks like this. Okay, let's make this a little bit larger so you can see. The data start at 22 because he didn't start playing until he was 22; when he was 5 or 6, there were no points per game, obviously.
But you notice that it takes on this shape right here. If I were to run a regression of his points per game as a function of his age, I'm not likely to get a very nice-looking line because, look, it goes up and it comes back down. I don't even know what this is going to look like if I add a trend line here. Linear trend line: it's kind of negative, right? I thought it might even be flat. I don't think this depicts what's going on. It just says that as he's getting older he's becoming a worse player. But that doesn't reflect the fact that he got better at the beginning of his career as he matured in the league, as he learned how to play his position or what have you, right?
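
To see why a straight trend line misleads on data that rises and then falls, here is a minimal sketch in Python with NumPy. The numbers are synthetic and made up for illustration (they are not Karl Malone's actual statistics); the only assumption is a career arc that peaks partway through.

```python
import numpy as np

# Synthetic, made-up career arc (NOT Karl Malone's real numbers):
# scoring rises to a peak at age 29 and then declines.
ages = np.arange(22, 41)
ppg = 30.0 - 0.15 * (ages - 29) ** 2

# Straight-line fit: ppg ~ intercept + slope * age.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(ages, ppg, deg=1)
fitted = intercept + slope * ages

# R-squared: share of the variation in ppg that the line explains.
ss_res = np.sum((ppg - fitted) ** 2)
ss_tot = np.sum((ppg - ppg.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"slope = {slope:.2f} points per year of age")
print(f"R^2   = {r_squared:.2f}")
```

The fitted slope comes out negative, so the line says "older means worse" and leaves most of the variation unexplained, even though this made-up player improved for the first seven years.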
So you notice his points per game actually went from 14 all the way up to, like, 31 by the time he was 26, hovering around 28 or 29, and then started falling again. So this linear regression doesn't really explain what's going on. This might happen from time to time when you are running a regression. You might say, okay, so there is no relationship, or you get an improper relationship because you're not really understanding the data. In this case, what we're going to have to do is add a squared term, so I added a column here that I called Age 2. This is age squared; basically I just squared his age. And I'm going to run a regression as a function of both his age and age squared, and show you what's going on here.
So we're going to go to my data. Let's run a regression. I'm going to run a regression of his points per game, that's this right here, as a function of not just his age but also age squared, so these two terms right here. Got the labels here, new output, and let's click OK, and there we go.
All right, so let's look at this. How do I interpret this? Okay, so it's a little bit weird because the intercept says he's going to produce -86 points, but that's assuming he has no age. Of course he does have an age; he entered the league when he was 22 years old. So what's happening here is that we've got -86, and we've got 7.65 times age, but then -0.12 times age squared. What's going on is that the age variable is capturing the fact that he does have an upward trajectory in his points per game, in his performance, in his productivity as he's getting older. So he's getting older, he's getting more mature, he's becoming a better performer in the league, and his points per game are going up. But at some point, his knees are starting to hurt, right? He's getting banged up. And so he's not going to keep performing at the same pace, so a linear regression is not exactly appropriate.
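
The same idea, regressing on both age and an added age-squared column, can be sketched outside Excel. This is an illustrative Python version using NumPy least squares, not the spreadsheet's own procedure. The data are generated from hypothetical coefficients chosen for the example, so nothing here is Karl Malone's real record.

```python
import numpy as np

# Hypothetical "true" coefficients used only to generate example data.
TRUE_INTERCEPT, TRUE_AGE, TRUE_AGE_SQ = -90.0, 8.0, -0.14

rng = np.random.default_rng(0)
age = np.arange(22, 41, dtype=float)
age_sq = age ** 2  # the added "Age 2" column: age squared

ppg = TRUE_INTERCEPT + TRUE_AGE * age + TRUE_AGE_SQ * age_sq
ppg = ppg + rng.normal(0.0, 1.0, size=age.size)  # season-to-season noise

# Design matrix: a column of ones (intercept), age, and age squared.
X = np.column_stack([np.ones_like(age), age, age_sq])
coef, *_ = np.linalg.lstsq(X, ppg, rcond=None)
intercept_hat, age_hat, age_sq_hat = coef

print(f"intercept   = {intercept_hat:.2f}")
print(f"age         = {age_hat:.2f}")
print(f"age squared = {age_sq_hat:.3f}")
```

The pattern to look for is the one in the lecture's output: a positive coefficient on age and a negative coefficient on age squared, which together trace a curve that rises and then falls.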
That's because there's a nonlinear relationship between his performance, points per game, and his age, or how long he has been in the league, right? There are nonlinear components. So there is a linear component: he's getting better. But he's also getting worse at the same time. He's learning how to play his position better, but his knees are getting cranky. That's why an age squared term is appropriate: it captures both the linear and the nonlinear elements. This is what we call a data transformation. I'm transforming some of the data to try to capture the nuance, the shapes, the flaws of this data.
How do I interpret this? If I were going to try to predict what an additional year of experience is worth, what I'd say is, okay, every year longer he's in the league, or every additional birthday he has, the age term adds 7.65 points, right? But he's also going to lose some points: he loses 0.12 times the square of his age. So the net gain from one more year shrinks as he gets older and eventually turns negative.
Here in this sheet, I don't know if I showed it, this is the same regression, but I also constructed a little table down here. It's a little output table, a simple dashboard if you will, that says, look, let's take the information from my coefficients. If I put some information into each one of these two cells, I should get some output, right? So let me draw some little arrows to show you what I'm doing here. I'm going to input some information here, and I'm going to input some information here. And if I input values in both of those, then I should get some output right here. I've constructed this thing so it works out pretty nicely. Let me just make this guy a little bit longer here. And I will get output here, right?
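
One way to make "what is an additional year worth" precise is to differentiate the fitted equation with respect to age. The sketch below uses the rounded coefficients quoted in the lecture (-86, 7.65, -0.12); the spreadsheet holds more decimal places, so the exact turning point there will differ from the figure shown.

```latex
\begin{align*}
\widehat{\text{PPG}} &= -86 + 7.65\,\text{Age} - 0.12\,\text{Age}^2 \\
\frac{d\,\widehat{\text{PPG}}}{d\,\text{Age}} &= 7.65 - 2(0.12)\,\text{Age} = 7.65 - 0.24\,\text{Age} \\
\frac{d\,\widehat{\text{PPG}}}{d\,\text{Age}} = 0 &\iff \text{Age} = \frac{7.65}{0.24} \approx 31.9
\end{align*}
```

With these rounded numbers, another year is worth about 7.65 - 0.24(23) = 2.13 points at age 23 and about 7.65 - 0.24(36) = -0.99 points at age 36. The value of a year depends on the age at which you measure it, which is what the squared term adds.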
And so the formula for this, as I've said, is: take the intercept here, plus this coefficient, 7.65, times whatever I put in as his age, plus this coefficient, which is actually negative, times whatever his age squared is, right? And this cell is the squared term of this one. So if I change his age to, let's say, 23, all right? Then it's changing his age, and it's changing his age squared. And then this top cell, cell B22, is saying take the information from my regression output and let's try to forecast what his points per game would have been if he was age 23, okay?
And so we can look at how this moves. So let's say, how'd he do when he was 36? Okay, at 36, according to our model, he should have been performing at about 24 points per game. How good a prediction is this? I don't know, let's take a look. Oops, was it 26 or 36? So this is when he was 36, right over here, and here he had a little bit better season. The season right before, he had almost 23.8 points per game. Here he had 23.2 points per game. So it's not surprising that the model would predict something in this range: about 24 points per game. So the model does a fairly decent job at predicting where he's going to be. If we were just running a linear regression on this, the predictions would miss a lot of his early career success.
So here, what about when he was 24? The model is going to say, all right, he's got about 24 points per game. Let's see, 25, 26, 27 right here: at 27, the model says 27 points per game, okay? So he's starting to hit the peak of his career here at 27, and at 27 he actually scored 29 points per game. So I'm underestimating how well he does, okay? Even though this is a fairly good regression, any given athlete is going to perform better or worse in any given year depending on who's surrounding them. And so I wouldn't expect this thing to have awesome predictive power.
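
The dashboard formula (intercept, plus the age coefficient times age, plus the age-squared coefficient times age squared) can be sketched as a small Python function. The function name `predict_ppg` and its parameter names are hypothetical, and the defaults are the rounded coefficients quoted in the lecture, not the full-precision values in the spreadsheet. Because the squared coefficient is multiplied by a number over a thousand at these ages, rounding shifts the forecast by several points, so this sketch should not be expected to reproduce the "about 24" shown in cell B22.

```python
def predict_ppg(age, intercept=-86.0, b_age=7.65, b_age_sq=-0.12):
    """Forecast points per game from a quadratic-in-age regression.

    Defaults are the ROUNDED coefficients quoted in the lecture;
    pass full-precision regression output for real use.
    """
    age_sq = age ** 2  # mirrors the "age squared" input cell
    return intercept + b_age * age + b_age_sq * age_sq


# Forecast at age 36 using the rounded coefficients.
rounded = predict_ppg(36)

# Sensitivity check: nudge only the squared coefficient by 0.005.
nudged = predict_ppg(36, b_age_sq=-0.125)

print(f"age 36, rounded coefficients: {rounded:.2f}")
print(f"age 36, squared coef -0.125:  {nudged:.2f}")
print(f"shift from a 0.005 change:    {nudged - rounded:.2f}")
```

A change of 0.005 in the squared coefficient moves the age-36 forecast by 0.005 times 1,296, about 6.5 points, which is why you should read coefficients off the regression output at full precision rather than retyping rounded values.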
It's going to be pretty good, right? The real takeaway here is that you might find yourself in a situation where you have to do a transformation of some of your data, such as adding a squared term. You do not have to square every variable in your data in order to include a squared term. The squared terms that are likely to come up in your MBA studies are things like education, the impact of education on earnings, or years in your career and some kind of success rate, okay? Maybe the impact of more investment on the output of your organization, so some kind of capital input. It may be that you increase capital, but each additional million dollars of capital that you put in your company doesn't always impact your output by the same amount. There might be some diminishing returns to your capital. Or as you add labor to your company, the output isn't going to increase at the same rate, so that's the diminishing marginal product of your workers. And so education, age, changing your workers, changing your capital, these things might have nonlinear effects. As a result, you might want to include a squared term in your analysis.
The interpretation of the squared term is just this: the outcome is changing, and the rate of change is itself changing. So when I look at how my age changes my output here, or how an additional year of schooling changes my earnings, each year helps me, it makes me better. But each year doesn't make me as much better as the year before because I'm getting older, right? Or each additional year of education doesn't necessarily get me the same bang for my buck as the previous year. Which would then explain why people get their MBA and then they're done with school, or they get their PhD and they're done with school. Most people don't say, we're going to get an MBA and a PhD and then become a pediatric neurosurgeon; they don't just keep going to school forever as if it were going to have constant returns.
No, there are diminishing returns to things like that, and so the squared term is pretty important.