Correlation tells us about the association between two variables in our dataset, but interpreting it can be a little tricky. It is easy to fall into a trap here if you confuse correlation with a causal relationship. Let us discuss this using an example. Assume that I conduct a study and I am interested in the following question: are vegetables good for our health? Should we eat more vegetables to become healthier and live longer? Suppose I collected some data using surveys and visualized it in a scatterplot: on the horizontal axis I put the amount of vegetables in the diet of my participants, and on the vertical axis I put some health index. This health index measures in some way how healthy a particular person is. Assume that my data looks like this, and that I calculated Pearson's correlation coefficient for this data and found that it is greater than zero.
Can we use the fact that the correlation is positive to conclude that eating more vegetables is good for my health? Can we, for example, advise our customers to eat more vegetables to become healthier, using this data as the foundation of our advice? At first glance it is a good idea, because we clearly see that people who eat more vegetables are healthier, but the tricky part is the causal relationship between these two variables. The correct way to interpret this positive correlation is the following: if I pick a random person and I know that this person eats a lot of vegetables, then I can expect the health index of this person to be relatively high. This is what correlation tells us. The statement holds on average and not for every individual, and how reliable it is depends on how strong the correlation is.
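To make the observational reading concrete, here is a minimal sketch in Python with NumPy. The data is made up: the variable names, the sample size, and the numbers in the generating formula are all assumptions chosen only so that the correlation comes out positive. They are not the survey data from the example, and nothing about causality follows from the computed coefficient.

```python
import numpy as np

# Hypothetical survey data, generated only for illustration.
rng = np.random.default_rng(0)
n = 200
vegetables = rng.uniform(0.0, 10.0, size=n)  # e.g. servings per week (assumed unit)
# Arbitrary formula that makes the two variables positively associated.
health_index = 50.0 + 2.0 * vegetables + rng.normal(0.0, 5.0, size=n)

# Pearson's correlation coefficient between the two observed variables.
r = np.corrcoef(vegetables, health_index)[0, 1]
print(f"Pearson r = {r:.2f}")

# Observational reading: compare people who eat many vegetables with people who eat few.
many = vegetables > np.median(vegetables)
mean_health_many = health_index[many].mean()
mean_health_few = health_index[~many].mean()
print(f"Mean health index, many vegetables: {mean_health_many:.1f}")
print(f"Mean health index, few vegetables:  {mean_health_few:.1f}")
```

The comparison of the two group means is the "pick a random person" statement in numbers: it describes who tends to be healthier, not what would happen if somebody changed their diet.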
So this interpretation is correct, but note that this is a purely observational setting. We pick somebody at random, we observe one variable, and we draw a conclusion about another variable. If we want to give advice, we have to consider a slightly different question: if a particular person switches from a low level of vegetables in their diet to a high level, or vice versa, how will it change their health index? To answer this question, I have to take into account the possible causal relationships between vegetables and the health index, which are not captured by the correlation alone.
Indeed, this picture can be explained in different ways. The first possible explanation is that a large amount of vegetables is the cause of a high health index. In this case we can say that vegetables act on health in a causal way: if I change my vegetable consumption, I change my health. This is a possible explanation of the picture, but it is not the only one. An alternative explanation is that there is a third factor, a third variable that is not included in my study. This variable can be, for example, a healthy lifestyle. In this interpretation, our participant first chooses their lifestyle, which can be more healthy or less healthy. If this person chooses a healthy lifestyle, they eat more vegetables, but they also do some sport, visit doctors regularly, and so on. They take care of their health, so it is possible that the healthy lifestyle itself acts on the health index. As we believe that people who prefer a healthy lifestyle eat more vegetables, we assume that the healthy lifestyle also acts on vegetable consumption. In this case it is possible to observe the same picture, the same correlation between the health index and the amount of vegetables, even if there is no causal relationship at all between vegetable consumption and the health index.
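The third-variable explanation can be sketched as a small simulation. Everything here is a toy model under stated assumptions: the hidden `lifestyle` variable, the helper `health_from`, and all coefficients are hypothetical. The health index is built from lifestyle only, so vegetables have no causal effect by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hidden third variable (hypothetical): how healthy each person's lifestyle is.
lifestyle = rng.normal(0.0, 1.0, size=n)
health_noise = rng.normal(0.0, 5.0, size=n)


def health_from(lifestyle, vegetables, noise):
    """Assumed structural equation: health depends on lifestyle only.

    `vegetables` is accepted but deliberately ignored, which encodes
    "no causal effect of vegetables on health" in this toy model.
    """
    return 50.0 + 10.0 * lifestyle + noise


# Lifestyle acts on vegetable consumption...
vegetables = 5.0 + 2.0 * lifestyle + rng.normal(0.0, 1.0, size=n)
# ...and lifestyle acts on the health index.
health_index = health_from(lifestyle, vegetables, health_noise)

# The observed correlation is clearly positive anyway.
r = np.corrcoef(vegetables, health_index)[0, 1]
print(f"Observed Pearson r: {r:.2f}")

# Intervention: everybody eats 3 more units of vegetables, lifestyle unchanged.
vegetables_after = vegetables + 3.0
health_after = health_from(lifestyle, vegetables_after, health_noise)
mean_change = float(np.mean(health_after - health_index))
print(f"Mean change in health index after the intervention: {mean_change:.2f}")
```

In this toy model the scatterplot of vegetables against health index shows a strong positive association, yet changing vegetable consumption alone leaves the health index where it was.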
We can say that vegetable consumption in this case is just a proxy for a healthy lifestyle: it does not change health itself, it just gives us information about the lifestyle. So if some person does not have a healthy lifestyle, for example they do not visit doctors and do not do the other things that people with a healthy lifestyle usually do, and we tell them to consume more vegetables, they will increase the value of the vegetable variable, but this will not increase their health index. The reason for the high health index of people who consume more vegetables is not the vegetables themselves but the healthy lifestyle. This is also a possible explanation that is consistent with the data. Correlation alone cannot distinguish between these two explanations, so we cannot use correlation to draw causal conclusions unless we already know that there is a direct causal relationship like the first one. That knowledge has to be obtained using different tools, not correlation itself.
Still, correlation is a good measure of association between variables, but we have to understand that this association is observational in nature. If we do some machine learning and are interested only in predictions, say we want to predict the health index from information about vegetable consumption and we know that the two are correlated, we can use the correlation to make these predictions. But if we want to intervene, if we want to decide which parameters should be changed to achieve a particular goal, we have to carry out a more complicated causal analysis.
Correlations are important whenever we are interested in the association between variables, but Pearson's correlation, which we discussed before, is well suited only to linear relationships. Now let us discuss non-linear relationships and the correlations that are suited to them.