In this session, we illustrate an application of linear regression. Consider the data set Italian startups. After we bring it into the Stata environment, the Stata command for the linear regression is reg TotalFunding Letters_in_name FoundingYear Digital BA MBA Master. Here Digital is the dummy equal to 1 if the company is digital, and BA, MBA, and Master are the three dummies for the education of the founder, all discussed in the previous session.

After you run the regression, you find the R2 in the top right corner of the output. The R2 informs us about the precision of the prediction ŷ, as discussed in the previous session. In our case, the R2 is small. The estimated coefficients are in the bottom section of the regression output. A statistically robust coefficient tells us that the effect of the independent variable is estimated precisely, even if the R2 indicates a relatively imprecise prediction overall. We obtain the estimated coefficients discussed in the previous session. In particular, an increase by one unit of the letters in the firm's name reduces the predicted TotalFunding by €1,282.84. The p-values are in the fourth column, and in the case of the letters in the firm's name the p-value shows that the probability that this estimate was generated by a true coefficient equal to 0 is de facto 0. This is also reflected in the confidence interval that we find in the last two columns: it indicates that the true parameter falls between -1,991.919 and -573.7614 with probability 95 percent.

The other statistically robust coefficient of this regression is MBA, the dummy for whether the founder holds an MBA. The estimated coefficient is 103,414.6. Other things being equal, that is, holding all the other variables at the same level, a founder with an MBA collects on average €103,414.6 more funding than a founder with no MBA. For a founder with an MBA, the predicted impact on TotalFunding is 103,414.6 times 1, while for a founder with no MBA it is 103,414.6 times 0. The p-value of this coefficient is 3.1 percent, which means that if the true coefficient were 0, we would obtain an estimate at least as large as 103,414.6 in absolute value with a probability of only 3.1 percent. This small probability makes us confident that the true coefficient is different from 0. The confidence interval suggests that with 95 percent probability the coefficient is between about 9,500 and 200,000. While this is still a fairly wide range, it suggests that the impact of holding an MBA on the ability to raise funds is quite likely positive.

All the other independent variables have high p-values, suggesting that estimates of this size could easily have been generated by distributions in which the true coefficient is 0. Their confidence intervals include both positive and negative values, which is additional evidence that we cannot nail them down precisely. This raises the question: should we keep the independent variables with statistically weak impacts when we compute the prediction ŷ, or should we eliminate them by setting their coefficients equal to 0? In general, we cannot just eliminate them, because they affect the estimates of the statistically robust coefficients. You can show as an exercise that if in our regression we only keep Letters_in_name and the MBA dummy, the coefficient of the MBA dummy changes in an important way. The best approach is to have a theory, or at least a good reason, for the variables we include, and to stick to these variables even if some of them are not precise from a statistical point of view.
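For reference, the Stata commands below retrace these steps. This is a minimal sketch: the dataset file name is purely illustrative, and the variable names follow the spelling used in this session, which may differ slightly in your copy of the data.

* Load the dataset (the file name here is illustrative).
use italian_startups.dta, clear

* Linear regression of total funding on the length of the firm's name,
* the founding year, the digital dummy, and the founder-education dummies.
regress TotalFunding Letters_in_name FoundingYear Digital BA MBA Master

* Store the linear prediction y-hat for each observation.
predict yhat_levels, xb

The predict command computes the prediction ŷ for every observation using the estimated coefficients; the R2 summarizes how precise this prediction is.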
We could run a regression using natural logarithms of the same variables. In Stata, you generate a logarithm with the command gen, followed by the name of the new variable, an equals sign, and the log of the variable to be transformed. Sometimes variables take values of 0 or negative, and then we cannot compute logs, because the log of 0 or of a negative number is not defined. In our case, for nine observations TotalFunding takes the value 0, and to avoid the 0 problem it is customary to take the log of 1 plus the variable. For example, we can name the log of TotalFunding l_TotalFunding and generate it as the log of 1 plus TotalFunding. We can also generate the log of Letters_in_name and call it l_letters_in_name. We can then run the reg command, as we did earlier, with l_TotalFunding and l_letters_in_name instead of TotalFunding and Letters_in_name.

The results indicate that the coefficient of l_letters_in_name is -0.7588. Since both the dependent and the independent variable are in logs, a 100 percent increase in Letters_in_name implies, approximately, a 75.88 percent decrease in TotalFunding. In other words, a company whose name is twice as long as that of another company with identical characteristics will raise on average 75.88 percent less funding. In the log version of the regression, we find that all the other variables are much less robust statistically than in the regression in levels. However, for the sake of illustration, we show how to interpret the effect of an independent variable measured in levels rather than in logs. For instance, the coefficient of the founding year indicates that on average an increase of the founding year by one year reduces the amount of funding raised by circa 8.4 percent. In the log regression, the p-value of the coefficient of MBA has increased, making it no longer a statistically precise effect.

These are the vagaries of functional forms. Sometimes, by changing the functional form, some results no longer hold while others appear. There is little we can do, as we are dealing with statistical phenomena that are naturally associated with random prediction errors. One solution is to increase the number of observations. As a matter of fact, using big data, that is hundreds of thousands or millions of observations, we can improve the prediction and the stability of the estimated coefficients. At the same time, when some impacts are robust across specifications, as in the case of the letters in the company's name, we increase our confidence that the impact is statistically precise. Finally, in the log regression the prediction corresponds to the natural logarithm of the dependent variable. To obtain the prediction in levels, we can invert the natural logarithm by taking the exponential of the logarithm.
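Again as a minimal sketch, and assuming the same variable names as above, these are the corresponding Stata commands; the last line simply undoes the log of 1 plus TotalFunding to express the prediction in levels.

* Log of 1 + TotalFunding, to handle the nine observations equal to 0.
gen l_TotalFunding = log(1 + TotalFunding)

* Log of the number of letters in the firm's name.
gen l_letters_in_name = log(Letters_in_name)

* Same regression as before, with the two transformed variables.
regress l_TotalFunding l_letters_in_name FoundingYear Digital BA MBA Master

* Prediction in logs, then inverted back to levels with the exponential
* (subtracting the 1 added before taking the log).
predict l_yhat, xb
gen yhat_from_logs = exp(l_yhat) - 1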
However, in general, it is difficult to make fully data-driven decisions. In this respect, we should focus on the prediction of the dependent variable only when we can use big data, and, as we saw at the beginning of this course, this is possible and ideal only in some cases and for some questions or problems. In many cases, data enable managers to make data-informed decisions rather than fully data-driven decisions. In data-informed decisions, better decisions emerge from a mixture of data, intuition, background, and experience. Recall our managerial decision rule, V̂ greater than or equal to V*: rather than an exact prediction V̂, we may be interested in the factors that raise or lower V̂. For any given V*, these factors make V̂ more or less likely to be higher than V*, which in turn makes the decision to proceed with the investment more or less likely.
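Purely as an illustration of this decision rule, using the prediction yhat_levels generated above and a threshold value invented for the example, one could flag in Stata the cases where the predicted funding clears V*.

* Hypothetical threshold V*; the value 500,000 is arbitrary, chosen only for illustration.
scalar Vstar = 500000

* Flag observations whose predicted funding (V-hat) is at least the threshold.
gen invest = (yhat_levels >= scalar(Vstar)) if !missing(yhat_levels)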