We use regression analysis to predict the value of a dependent variable based on the values of some independent variables, and to study the impact of changes in an independent variable on the dependent variable. The dependent variable is the variable we predict or explain. The independent variables are the variables used to predict or explain the dependent variable. A linear regression takes the form y = b0 + b1·x1 + b2·x2 + b3·x3 + ..., plus an error term.

Think of the observations in our sample as rows in an Excel file. Each row corresponds to a given observation: a firm, an individual. The columns are the dependent and independent variables. The dataset on 836 Italian startups used in the previous session is a good illustration. The dependent variable could be total funding. Independent variable x1 could be the number of letters in the firm's name, x2 the founding year, and so on. Each record represents a startup with its level of total funding, y; the number of letters in its name, x1; the founding year, x2; etc.

The b's are parameters to be estimated. They represent the impact of a unit change of the corresponding x on the dependent variable y. For example, as an anticipation of the application we discuss in the next session, if we estimate b1 = -1282.84, a company with an additional letter in its name will raise on average €1282.84 less than an identical company with one fewer letter in its name. 'Identical company' has a precise meaning here: the estimated b1 is the change in the dependent variable holding all other independent variables at the same level.

If we sum the constant term b0 and the estimated coefficients multiplied by the values of the corresponding independent variables, we obtain the prediction ŷ of the dependent variable y. The error term is the difference between the true y and the prediction ŷ. It is a random variable that captures the fact that the sum b0 + b1·x1 + b2·x2 + ... is not exactly equal to the dependent variable for the same record: there is variation around it, and this difference is the error term. This implies that the estimated coefficients are random variables themselves. Therefore, they are associated with p-values, like the correlation coefficient in the previous session.

The p-value represents the probability that, if the true value of the coefficient were zero, that is, if the independent variable had no impact on the dependent variable, we would observe the estimate of the coefficient that we actually observe. For example, if the estimated -1282.84 comes with a very small p-value, it means that a true coefficient equal to zero would have produced this estimate only with very small probability. Strictly speaking, the p-value is the probability of the observed estimate computed from the probability distribution of the estimate, conditional on the true value of the coefficient being zero. A low p-value makes the estimate precise and reliable, because a true coefficient equal to 0 would generate the observed estimate with very low probability.

In making the prediction ŷ, we can use the values of the x's of one firm in our sample. In this case, we obtain the prediction of total funding for that firm. Alternatively, we could use the x's of a hypothetical firm similar to the firms in the sample but not in the sample. It has to be a similar firm; otherwise, we would need a different model to explain a different firm.
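As a concrete sketch of this setup, the snippet below fits such a regression with Python's statsmodels and prints the estimated coefficients and their p-values. The data frame, its column names (total_funding, name_length, founding_year), and the numbers in it are invented for illustration; they are not the actual startup dataset, so the output will not match the figures quoted in the text.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the startup dataset: one row per firm,
# one column per variable (all names and values are hypothetical).
rng = np.random.default_rng(0)
n = 836
df = pd.DataFrame({
    "total_funding": rng.normal(50_000, 20_000, n),  # y: total funding in euros
    "name_length": rng.integers(3, 20, n),           # x1: letters in the firm's name
    "founding_year": rng.integers(2005, 2020, n),    # x2: founding year
})

# OLS fit of total_funding = b0 + b1*name_length + b2*founding_year + error
model = smf.ols("total_funding ~ name_length + founding_year", data=df).fit()
print(model.params)   # estimated b0, b1, b2
print(model.pvalues)  # p-value of each estimated coefficient
```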
As discussed in the previous session, this hypothetical firm could be a new startup, similar to the existing ones, that wants to predict its ability to raise funds. Or we could use the x's of an existing firm and change the value of one of the independent variables. In this case, we predict how y would change if we had a different firm, identical in every respect except for the value of that independent variable.

Multiple linear regressions rest on important assumptions. The most important one is that the errors are random. This means that the errors should not be correlated with any of the independent variables. De facto, this means, one, that we have been able to single out all the independent variables that could affect y, and two, that none of the x's is caused by y. In other words, we are assuming no omitted variables and no reverse causality. Violation of these assumptions means that we have correlation, but not causality. We can use the independent variables to make predictions, but we cannot take any one of them and presume that by changing it we will observe the change in y implied by the value of the estimated coefficient. We discuss this point to a greater extent in our final sessions, when we deal with causality. For the moment, we stick to the idea that multiple linear regressions help us to make predictions, particularly predictions of the impacts of the independent variables. However, they cannot be used to imply which actions we have to take.

The method to estimate the coefficients of linear regressions is based on the minimization of the sum of the squares of the errors of each observation. This generates the so-called Ordinary Least Squares or OLS estimates. You can easily find the rationale for this estimation, the formulas of the estimated coefficients, and several other details on the web, or in any statistics textbook. Here it is sufficient to say that with no omitted variables or reverse causality, that is, when the independent variables are not correlated with the error term, the estimated coefficients tend, as the sample gets larger, to the true values of the coefficients. In short, this is a reliable estimation method.

A few more issues before we show a concrete example. First, linear regressions produce a measure of how precise the overall prediction ŷ is. This statistic is called R2, and it is equal to the ratio between the sum of the squared differences between the predicted values of the dependent variable and the mean of the dependent variable, and the same sum computed with the true dependent variable instead of the prediction. The R2 measures how much of the variation in the dependent variable around its mean is explained by the variation in the prediction with respect to the mean. If this statistic is equal to 1, the prediction is perfect. The closer it is to 0, the lower the precision of the regression.

Second, we can account for qualitative variables. These variables are called dummy variables and take values 0 or 1. For example, in our dataset of Italian startups, some firms are digital while others are not. The variable digital takes the value 1 if the firm is digital and 0 if it is not. Again, as an anticipation of the application in the next session, the estimated coefficient of this variable is 4,249.967. Other things being equal, a digital startup raises on average €4,249.967 more than a non-digital startup. This is because the coefficient multiplied by the dummy variable is equal to the coefficient itself when the dummy is equal to 1, and it is equal to 0 when the dummy is equal to 0.
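To make the R2, the digital dummy, and the prediction ŷ concrete, here is a minimal sketch along the same lines. As before, the data and column names are synthetic assumptions, so the output will not reproduce the coefficients quoted above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only; 'digital' is a 0/1 dummy variable.
rng = np.random.default_rng(1)
n = 836
df = pd.DataFrame({
    "total_funding": rng.normal(50_000, 20_000, n),
    "name_length": rng.integers(3, 20, n),
    "digital": rng.integers(0, 2, n),  # 1 if the startup is digital, 0 otherwise
})

model = smf.ols("total_funding ~ name_length + digital", data=df).fit()
print(model.rsquared)  # R2: share of the variation around the mean explained by the prediction
print(model.params)    # the 'digital' coefficient is the average extra funding of digital startups

# Prediction y-hat for a hypothetical firm similar to those in the sample
new_firm = pd.DataFrame({"name_length": [10], "digital": [1]})
print(model.predict(new_firm))
```

Setting digital to 0 in new_firm and predicting again gives the 'other things being equal' comparison between a digital and a non-digital startup described above.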
We can also accommodate qualitative variables with more than two categories. Suppose that the founder's education is one of: no university degree, Bachelor, MBA, or Master. In this case, we build as many dummy variables as categories minus one, so we have three dummies. One takes the value 1 if the founder holds a Bachelor and 0 otherwise. The second one takes the value 1 if the founder has an MBA and 0 otherwise. And the third one takes the value 1 if the founder has a Master and 0 otherwise. When these three dummies are all equal to 0, we have the baseline case in which the founder has no university degree. This is why we do not need a separate dummy for it in the regression. The constant term represents the baseline case, and the other dummies represent the differential impact of each category with respect to the baseline case.

Third, we can introduce non-linearities. A typical case is to express the dependent variable in natural logarithms. The natural logarithm has the property that a change in the logarithm of the dependent variable approximates the proportional, that is percentage, change of the dependent variable. Thus, if an independent variable changes by one unit, the estimated coefficient represents the approximate percentage change in the dependent variable. If we also express an independent variable in natural logarithms, the estimated coefficient represents the percentage change in the dependent variable associated with a one percent change in the independent variable.

This completes our discussion of the key elements that we want to highlight about regression analysis. In the next session, we discuss a specific application of linear regression as an illustration of these topics.
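As a final illustration of these last two devices, the sketch below builds the three education dummies (leaving 'no university degree' as the baseline absorbed by the constant term) and puts the dependent variable in natural logarithms. The education categories, column names, and data are again invented for the example; the C(..., Treatment(...)) syntax is how statsmodels' formula interface encodes a multi-category variable with a chosen baseline.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: founder education has four categories, so three dummies are needed.
rng = np.random.default_rng(2)
n = 836
df = pd.DataFrame({
    "total_funding": rng.uniform(1_000, 200_000, n),
    "name_length": rng.integers(3, 20, n),
    "education": rng.choice(["none", "bachelor", "mba", "master"], n),
})

# np.log() on the left-hand side gives the approximate percentage-change reading of the
# coefficients; C(education, Treatment('none')) creates the Bachelor/MBA/Master dummies
# with 'no university degree' as the baseline captured by the constant term.
model = smf.ols(
    "np.log(total_funding) ~ name_length + C(education, Treatment('none'))",
    data=df,
).fit()
print(model.params)
```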