Published in The Edge Malaysia, March 2018.
In every introductory statistics course, students are reminded over and over again of the Golden Rule of Statistics – Correlation is not Causation. Just because two variables happen to move closely together does not mean that one causes the other. One may very well cause the other, but we cannot conclude that merely because the two variables are highly correlated.
To highlight the absurdity of equating correlation with causation, consider the following correlations, which statisticians call spurious correlations. From 1999 to 2009, the “number of people who drowned by falling into a pool” correlates strongly with “films Nicolas Cage appeared in” and “power generated by US nuclear power plants.” Over the same period, the “number of people killed by venomous spiders” correlates strongly with the “number of letters in the winning word of the Scripps National Spelling Bee.”
Of course, no one – not even the most imaginative thinker – would seriously suggest that those correlations are in any way causal. No one would say, “From 1999 to 2009, more deaths from venomous spiders caused longer winning words at the Scripps National Spelling Bee.”
But some spurious correlations are more subtle. A 1999 study found that children who slept with a night-light were more likely to become near-sighted. This seems plausible, certainly more so than Spelling Bee letters and venomous spiders. But it was later shown to be a case of confounding: near-sightedness is largely genetic, and near-sighted parents more frequently placed night-lights in their children’s rooms. Similarly, consider HDL cholesterol – the “good” cholesterol – which is correlated with lower rates of heart disease. Yet drugs that raise HDL cholesterol have proven ineffective against heart disease. This is because HDL cholesterol does not actually cause heart health; it is a byproduct of a healthy heart. A highly plausible-sounding causal link can thus end up wasting billions in drug development and diverting attention from the actual problem.
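To see the mechanics, here is a minimal simulation, with every probability invented purely for illustration. A hidden common cause – parents’ near-sightedness – drives both night-light use and children’s near-sightedness, producing a strong correlation between two variables with no causal link; holding the hidden cause fixed makes the correlation disappear.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical families

# Hidden common cause: whether the parents are near-sighted.
parent_myopic = rng.random(n) < 0.3

# The confounder drives BOTH variables (probabilities are made up):
# near-sighted parents use night-lights more often...
night_light = rng.random(n) < np.where(parent_myopic, 0.7, 0.2)
# ...and, through genetics, their children are more often near-sighted.
child_myopic = rng.random(n) < np.where(parent_myopic, 0.5, 0.1)

# Night-lights do nothing here, yet the two variables correlate.
r = np.corrcoef(night_light, child_myopic)[0, 1]
print(f"overall correlation = {r:.2f}")

# Conditioning on the confounder makes the correlation vanish.
for label, group in (("myopic parents", parent_myopic),
                     ("other parents", ~parent_myopic)):
    r_g = np.corrcoef(night_light[group], child_myopic[group])[0, 1]
    print(f"{label}: correlation = {r_g:.2f}")
```

The same pattern explains the HDL result: both HDL levels and heart disease respond to underlying heart health, so a drug that moves only HDL moves nothing that matters.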
This confusion between correlation and causation matters, and increasingly so, for two reasons. First, as humans, our brains are hard-wired to believe in cause and effect, and therefore in causation. In Thinking, Fast and Slow, Daniel Kahneman – winner of the Nobel Prize in Economics despite being a psychologist – argues that the human mind derives confidence from coherence: we feel confident in an explanation when it tells a coherent story that seems to fit the observation well, regardless of how good the underlying evidence actually is.
Yet what Kahneman shows throughout the book is that the human mind is deeply unreliable when it comes to determining causal explanations. Our brains, as the science historian Michael Shermer describes, are “belief engines: evolved pattern-recognition machines that connect the dots and create meaning out of the patterns that we think we see in nature”, even if those connections are not really there. The “correlation is not causation” mantra is thus vital precisely because we are hardwired to make this mistake.
The second reason why understanding the confusion between correlation and causation is so important is that we have launched ourselves into the age of Big Data and Machine Learning. Not only do we have mountains and mountains of data to analyse, we also have the processing hardware and software to analyse it. So when we throw data into some algorithm – and machine-learning algorithms are essentially pattern-seeking – we are likely to observe relationships in the data that we would never have conceived. Some will make sense. Others will tell us that the “number of people killed by venomous spiders” correlates strongly with the “number of letters in the winning word of the Scripps National Spelling Bee.”
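A short sketch, run on purely synthetic random walks rather than any real drownings or spelling bees, shows why this is inevitable: screen enough unrelated variables against one another and chance alone will hand you impressive-looking correlations.

```python
import numpy as np

rng = np.random.default_rng(42)
years = 11        # a short series, like 1999-2009
n_series = 1_000  # screen many unrelated variables, as Big Data lets us

# Independent random walks: by construction, nothing causes anything.
data = rng.normal(size=(n_series, years)).cumsum(axis=1)
target = rng.normal(size=years).cumsum()

# Correlate the target series against every other series...
corrs = [np.corrcoef(target, series)[0, 1] for series in data]

# ...and chance alone delivers some very impressive matches.
print(f"strongest correlation found: {max(corrs, key=abs):.2f}")
print(f"series with |r| > 0.8: {sum(abs(r) > 0.8 for r in corrs)}")
```

This is the multiple-comparisons problem: the more relationships an algorithm tests, the more spurious ones it is guaranteed to find.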
What this should teach us is to be extremely careful in attributing causality. Did event Y happen because of intervention X? Or would event Y have happened anyway, even without intervention X? Did an economy grow robustly because of a government policy, or because the rest of the world was growing robustly as well? These are important questions, and false attributions of causality can lead to misguided policies that cost billions of dollars and thousands of jobs.
A sparkling debate on the need for nuance in attributing causality can be seen in a marvellous YouTube clip called “Mark Cuban Owns Skip Bayless!” Skip Bayless, an NBA television analyst, argued that the Miami Heat lost to the (Mark Cuban-owned) Dallas Mavericks in the 2011 NBA Finals because the Heat’s best player, LeBron James, “disappeared and shrank in crunch time” and because the Mavericks “didn’t have to defend him.” To which Cuban countered, “So no matter what we did, he was going to stand there and do nothing?” Cuban then went into some detail on how the Mavericks chose to defend James, which may well have shaped the way James played.
Yet sports commentary is full of the kind of platitudes, and indeed lazy generalities, that Bayless espouses. If you have listened to sports commentary, you have surely heard, “They just wanted it more” or “They need to show more character” or “The winning team just played harder.” Those statements may or may not be right in a given situation, but they are meaningless without a deeper discussion of tactics, lineups and in-game adjustments. The same is true of everyday life. As Kahneman and others have shown, we take mental shortcuts that overlook the more nuanced and subtle factors, and we never suffer from a shortage of such generalities.
To be clear, caution is also needed when we argue that there is no relationship between two variables. When a policy intervention, for instance, is found to be statistically insignificant – meaning the data cannot rule out that the intervention’s effect is zero – it could mean several things. First, the intervention does not work: the most straightforward explanation. Second, the intervention was implemented poorly. Third, the intervention may work for some participants but does not remove a binding constraint for the average participant. Fourth, the intervention may only work alongside complementary interventions. And fifth, as sketched below, the study may simply have been too small to detect a real effect.
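That fifth possibility deserves an illustration. Suppose, purely hypothetically, that an intervention genuinely raises outcomes by 0.2 standard deviations. A standard two-sample test on a small pilot will usually declare it insignificant, while a large evaluation will detect it almost every time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.2  # hypothetical: the intervention really works, modestly
trials = 2_000

for n in (30, 3_000):  # a small pilot versus a large evaluation
    detected = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, size=n)
        treated = rng.normal(true_effect, 1.0, size=n)
        _, p = stats.ttest_ind(treated, control)
        detected += p < 0.05
    print(f"n={n:5,}: real effect found significant in "
          f"{detected / trials:.0%} of studies")
```

An “insignificant” result from the small study says as much about the study’s power as it does about the intervention.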
The interpretation and analysis of causality is incredibly tricky and deserves to be treated with caution and nuance. This is especially true as we progress further into Big Data and Machine Learning, where new relationships that we had not previously conceived may be found in the data, and old relationships may be quashed altogether. We are hardwired to attribute causality to the patterns we observe, and going against that wiring will be difficult. But we have to start somewhere. So the next time someone tells you, with great certainty, that a good result Y was achieved as a result of (their) action X, maintain some healthy scepticism. They may be right, but it is good to question further.