Welcome! In this lecture, you will find out how to correct for endogeneity. We have seen that when applying econometrics in practice, there are often important factors that cannot be included in the model due to a lack of data. This often leads to endogeneity and, in turn, inconsistency of OLS. In this lecture, you will learn how to fix this without needing data on the omitted factor itself. However, we will need data on other additional variables.

To gain some intuition, we first represent endogeneity in a graphical way. Here, you see the standard setup, with the dependent variable y, explanatory variables X, and an error term epsilon. Hidden in the epsilon term are different, unexplained factors. These are factors that affect y but are not included in the model, usually because we have no data on them. Endogeneity appears if at least one of these unexplained factors is correlated with an X variable.

The key to consistently estimating the impact of X on y is to find a set of additional variables. Such variables are called instruments and are usually denoted by Z. The instruments need to satisfy two important properties. First, they should be correlated with X. Second, they should not be correlated with the unexplained factors. In short, to correct for endogeneity we need instruments Z such that Z and X are correlated but Z does not correlate with epsilon. Under these two conditions, any correlation that we find between the instruments and y will be due to X. This information can be used to form a new estimator for beta.

The slide shows some details. There are two steps to the new estimation procedure. First, we use Z to decompose X into two parts, a part that can be explained by Z and a part that cannot be explained. In the second step, we regress y on only the explained part of X. The theoretical impact of "X explained" on y equals the true effect size, beta, as the unexplained part of X is simply added to the error term.
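
Since the slide itself is not part of this transcript, the decomposition just described can be written out as a short sketch. The symbols \hat{X} (the part of X explained by Z) and V (the part not explained by Z) are notation assumed here for illustration and may differ from the slide.

```latex
\begin{align*}
  y &= X\beta + \varepsilon, \qquad X = \hat{X} + V \\
  y &= (\hat{X} + V)\beta + \varepsilon
     = \hat{X}\beta + \underbrace{(V\beta + \varepsilon)}_{\text{new error term}}
\end{align*}
```

The coefficient on \hat{X} in the second line is still beta, which is why regressing y on the explained part targets the true effect size.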
This solves endogeneity: the unexplained part of X is by construction uncorrelated with the explained part, and the explained part is built from Z, which is uncorrelated with epsilon. So X explained is now exogenous!

This procedure is known as two-stage least squares, or 2SLS. Given the linear model and a matrix of instruments Z, we literally need to perform the two steps. First, we predict X using Z. That is, we estimate the coefficients of a model where we explain X using Z. The standard OLS formula applies here. Only the role of X has changed. It is now the dependent variable. Next, we calculate the explained part of X. Let's denote this part as X hat. X hat can be written as the projection matrix of Z, called H, times the original matrix of regressors X. In the next step, we regress y on X hat using OLS. The estimator in this step is the 2SLS estimator, also known as the IV or instrumental variable estimator. In the first line, it is very clear that this estimator is obtained using a standard regression with X hat as explanatory variable. The other lines use properties of projection matrices to rewrite the formula.

The variance of the 2SLS estimator appears on this slide. Standard errors for the 2SLS estimates can easily be obtained from the variance matrix. To estimate sigma squared, it is important to use the correct residuals. The residuals should be computed with the original X variables, not with the X hat variables used in the second-stage regression. The derivation of the variance is given on the slide. Although these details perhaps look overwhelming, I advise you to take some time to verify the steps.

2SLS is consistent if some large sample conditions are satisfied. First, Z and epsilon are not correlated. Second, Z itself is not multicollinear. And third, X and Z are sufficiently correlated. All three conditions are also formally specified on the slide. The third condition also implies that we must have at least as many instruments as explanatory variables.
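
The two stages and the variance estimate described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the lecture's own code: the function name `two_stage_least_squares`, the simulated data, and the divisor n - k in the estimate of sigma squared are all choices made here, and the slide may use a different convention.

```python
import numpy as np


def two_stage_least_squares(y, X, Z):
    """Hypothetical helper: 2SLS estimate of beta and its estimated variance matrix."""
    n, k = X.shape
    # Stage 1: regress every column of X on Z and keep the fitted values,
    # i.e. X_hat = H_Z X, the part of X explained by the instruments.
    gamma_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ gamma_hat
    # Stage 2: regress y on X_hat using OLS.
    beta_hat, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    # Residuals are computed with the original X, not with X_hat.
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - k)  # assumed divisor n - k
    cov_hat = sigma2_hat * np.linalg.inv(X_hat.T @ X_hat)
    return beta_hat, cov_hat


# Simulated data (illustrative only): an omitted factor drives both x and y.
rng = np.random.default_rng(0)
n = 20_000
omitted = rng.normal(size=n)                  # unobserved factor
z = rng.normal(size=n)                        # instrument: affects x, not the error
x = 0.8 * z + omitted + rng.normal(size=n)    # endogenous regressor
eps = omitted + rng.normal(size=n)            # error term contains the omitted factor
y = 1.0 + 2.0 * x + eps                       # true slope is 2.0

X = np.column_stack([np.ones(n), x])          # constant plus endogenous regressor
Z = np.column_stack([np.ones(n), z])          # the constant is its own instrument

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_2sls, cov_2sls = two_stage_least_squares(y, X, Z)

print("OLS slope: ", beta_ols[1])
print("2SLS slope:", beta_2sls[1])
print("2SLS standard errors:", np.sqrt(np.diag(cov_2sls)))
```

In this simulated setup the OLS slope should land visibly above the true value of 2.0, while the 2SLS slope should be close to it. Note that running a packaged OLS routine on X hat in the second stage would report standard errors based on the wrong residuals, which is why the residuals are recomputed with X here.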
Given these conditions, the derivation on the slide argues that the 2SLS estimator converges to beta as n grows large. Again, you should take some time to look at the steps.

So, if we have instruments Z, we can consistently estimate beta. But how can we obtain instruments? First of all, all exogenous explanatory variables in X qualify as instruments. If there are endogenous variables, additional instruments are needed. To find these, we often need expert knowledge on the topic of the model. For every endogenous variable, we will need to obtain at least one additional instrument. In general, the stronger the correlation between Z and X, the better. However, we need to make sure that there is no correlation between Z and epsilon.

Let us reconsider an earlier example. Suppose we want to explain the grades on a course using the attendance at lectures. We argued before that attendance is endogenous due to omitted variables, such as the student's motivation. Which variables would be good instruments in this case? They need to be related to attendance but should not affect the grade itself. Two variables that are likely to be good instruments are travel time to university or, if data over multiple years are available, a variable that indicates the introduction of obligatory attendance. Neither variable is likely to affect grades directly, but both are likely to affect attendance. Students living far away may be less likely to attend all classes. And the policy change will likely increase attendance.

Next, I would like you to think about potential instruments for another example. Recall the case where we wanted to explain demand using price, and where a salesperson strategically sets prices. Suppose that the product is ice cream. What variables can you think of as instruments for price? In this case, the price of raw materials is likely to be a valid instrument. An increase in the price of raw materials will increase the consumer price.
However, the raw materials' price is not likely to affect demand directly. In the end, the consumers only care about the price that they need to pay. Variables like competitor price or outside temperature are not valid instruments. These variables are likely to affect demand themselves.

2SLS solves endogeneity. However, there is a price that we need to pay. We should only use 2SLS if the explanatory variables are really endogenous. If X is in fact exogenous, OLS and 2SLS are both consistent. However, the Gauss-Markov theorem says that the variance of OLS will never be larger than that of 2SLS. Please review one of the earlier lectures if you don't recall this important theorem. So, in this case, we are better off just using OLS. However, if X is endogenous, only 2SLS will be consistent. So here we really need to use 2SLS.

Now I invite you to do the training exercise to practice the topics of this lecture. You can find this exercise on the website. And this concludes this lecture.