Perils of Statistical Inference: Simpson’s Paradox
It is common to hear about the importance of data, and about how data alone is sufficient for us to understand the world. Here, I will push back on that idea and give readers an example of precisely how looking at data blindly, without proper scientific models and theories, can lead our thinking astray. The example we will explore is the famous Simpson’s paradox, which is in truth a basic arithmetic proposition, yet remains an all too common pitfall in statistical inference.
Back to Basics
Consider the following arithmetic proposition: is it possible for an inequality between two ratios to hold within each of two groups, and yet to reverse when the groups are combined?
The answer is, unsurprisingly, yes. Consider the following example,
and it should be clear that the set of inequalities holds for a certain set of numbers. Now that we have established this, we can move on to some real-world data.
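For readers who want to check the arithmetic by hand, here is one concrete instance of such a set of inequalities. The numbers are illustrative values chosen for this sketch and are not necessarily the ones used in the original example:

```latex
\begin{align*}
\frac{1}{5} &< \frac{2}{8} && (0.20 < 0.25) \\
\frac{6}{8} &< \frac{4}{5} && (0.75 < 0.80) \\
\frac{1 + 6}{5 + 8} &> \frac{2 + 4}{8 + 5} && \left(\tfrac{7}{13} > \tfrac{6}{13}\right)
\end{align*}
```

The left-hand ratio is smaller in each pair, yet larger once numerators and denominators are summed. This happens because the two sides put different weights on the two groups: the left side draws most of its total from the group where both ratios are high.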
The Classic Example
To begin, let us consider the example given by Simpson himself in his original paper: an intervention whose outcomes are broken down by gender.
Looking at each gender separately, we see a clear positive association between being treated and surviving. Shall we then conclude that the treatment is effective in improving survival? Moving on to the final, combined columns, we see Simpson’s paradox in play. When we aggregate over both genders, the positive association disappears; in fact, we find no association whatsoever between treatment and survival. Thus, we are faced with a conundrum. Based purely on the data, if I were presented with a person without knowing their gender, I would have no reason to prescribe the treatment, but if I knew their gender, whichever it was, I would be inclined to prescribe it. As Simpson himself said in the paper:
What is the ‘sensible’ interpretation here? The treatment can hardly be rejected as valueless to the race when it is beneficial when applied to males and to females.
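The pattern described above can be reproduced with a small sketch. The counts below are hypothetical values chosen so that the treatment is associated with better survival within each gender but with no difference in aggregate; they are not quoted from the table in this article, and the function names are made up for illustration.

```python
from fractions import Fraction

# Hypothetical (alive, dead) counts, for illustration only.
counts = {
    "male":   {"untreated": (4, 3), "treated": (8, 5)},
    "female": {"untreated": (2, 3), "treated": (12, 15)},
}

def survival_rate(alive, dead):
    """Fraction of patients who survived."""
    return Fraction(alive, alive + dead)

def rates_by_gender(counts):
    """Survival rate for each gender and treatment arm."""
    return {
        gender: {arm: survival_rate(*cell) for arm, cell in arms.items()}
        for gender, arms in counts.items()
    }

def pooled_rates(counts):
    """Survival rate per treatment arm after summing over genders."""
    totals = {}
    for arms in counts.values():
        for arm, (alive, dead) in arms.items():
            prev_alive, prev_dead = totals.get(arm, (0, 0))
            totals[arm] = (prev_alive + alive, prev_dead + dead)
    return {arm: survival_rate(*cell) for arm, cell in totals.items()}

by_gender = rates_by_gender(counts)
pooled = pooled_rates(counts)

for gender, arms in by_gender.items():
    print(gender, {arm: float(rate) for arm, rate in arms.items()})
print("pooled", {arm: float(rate) for arm, rate in pooled.items()})
```

With these counts the treated survival rate is higher for males (8/13 against 4/7) and for females (12/27 against 2/5), while the pooled rates are both exactly 1/2. Exact fractions are used so that the "no association" case is an equality rather than a floating-point near miss.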
A more modern example
To resolve the paradox, let us first look at another, more modern example. While the previous example was informative, it is after all only a toy, and there is no need for artificial constructions when such paradoxes appear regularly in our day-to-day lives. Consider the case fatality ratios (CFRs) of Italy and China in the COVID-19 pandemic, which is ongoing at the time of writing.
Within each individual age group, China’s CFR is higher than Italy’s, but in aggregate the comparison reverses and Italy’s CFR is higher than China’s. Again, we see Simpson’s paradox at play.
The data speaks for itself and, in a purely descriptive sense, shows the associations present. However, what we are truly interested in is not the associations but the bigger, causal question: “How did Italy and China compare in dealing with the pandemic, in terms of CFR?” To answer this type of question we need theory, in particular a model of the data-generating process. Once we look at the causal graph, we find that the paradox is easily resolved.
From data to model
We model the data as follows. We take the only relevant variables to be the country (Italy or China in our case), COVID-19 mortality, and the age group of the patients.
The graph allows us to formalise the interactions between the variables. First, we see an arrow from country to age, representing the difference in age distribution between the two countries (for example, Italy’s population is in general older than China’s). Next, the arrow from age to mortality represents the effect of age on mortality, i.e., the fact that COVID-19 is vastly more deadly for older people. Finally, the arrow from country to mortality represents the country-specific effects on mortality that do not involve age, for example differences in the lockdown strategies used or the medical facilities available. (More subtly, the arrow from country to age also includes other age-related differences that may affect mortality, for example if the elderly in China are less prone to social mixing.)
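The three arrows described above can be written down as a small directed graph. The sketch below is only an illustration of that description: the dictionary layout and the helper `directed_paths` are assumptions made for this example, not part of any library.

```python
# Directed edges of the causal graph: each key points to its children.
causal_graph = {
    "country": ["age", "mortality"],
    "age": ["mortality"],
    "mortality": [],
}

def directed_paths(graph, start, end, path=None):
    """Return every directed path from start to end as a list of nodes."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    paths = []
    for child in graph[start]:
        paths.extend(directed_paths(graph, child, end, path))
    return paths

paths = directed_paths(causal_graph, "country", "mortality")
for p in paths:
    print(" -> ".join(p))
```

Enumerating the paths makes the structure explicit: country reaches mortality both through age and through a direct arrow, which is why it matters which of the two routes a given comparison includes.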
Questions & Answers
Thus, we now have the means to correctly interpret the data available to us. Consider the following questions we would like to ask.
“What is the effect on mortality when comparing China and Italy?” This is what we call the total effect. It is what we calculate when we compare the aggregate CFRs of the two countries.
“What is the effect on mortality when comparing China and Italy, for a fixed age distribution?” This is the direct effect. We calculate it by standardising the rates by age to give a fair comparison, and the result is that Italy’s approach, after controlling for demographics, gave a lower CFR than China’s.
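The two questions above correspond to two different calculations, sketched below. The case and death counts are made up for illustration and are not actual COVID-19 figures; the labels `china_like` and `italy_like`, the two coarse age groups, and the choice of the pooled case distribution as the reference population are all assumptions of this sketch. Direct standardisation is used here as one simple way of holding the age distribution fixed.

```python
from fractions import Fraction

# Hypothetical (cases, deaths) per age group, for illustration only.
data = {
    "china_like": {"young": (800, 16), "old": (200, 40)},
    "italy_like": {"young": (200, 2), "old": (800, 120)},
}

def crude_cfr(strata):
    """Aggregate CFR: total deaths over total cases (total-effect view)."""
    total_cases = sum(cases for cases, _ in strata.values())
    total_deaths = sum(deaths for _, deaths in strata.values())
    return Fraction(total_deaths, total_cases)

def reference_weights(data):
    """Share of all cases (both countries pooled) in each age group."""
    totals = {}
    for strata in data.values():
        for age, (cases, _) in strata.items():
            totals[age] = totals.get(age, 0) + cases
    grand_total = sum(totals.values())
    return {age: Fraction(n, grand_total) for age, n in totals.items()}

def standardised_cfr(strata, weights):
    """Age-specific CFRs averaged over a common age distribution."""
    return sum(
        weights[age] * Fraction(deaths, cases)
        for age, (cases, deaths) in strata.items()
    )

weights = reference_weights(data)
crude = {name: crude_cfr(strata) for name, strata in data.items()}
standardised = {
    name: standardised_cfr(strata, weights) for name, strata in data.items()
}

for name in data:
    print(name, "crude:", float(crude[name]),
          "standardised:", float(standardised[name]))
```

With these made-up counts the crude CFR is higher for `italy_like`, because most of its cases fall in the older group, while the standardised CFR is lower, mirroring the reversal discussed in the text.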
As to which question to ask, in our case it is clear that we should be more interested in the second than the first. We are interested purely in the effects of the country-specific approaches on COVID-19 mortality, and not in the influence of demographics on mortality.
Given our interest in the direct effect of country-specific approaches on COVID-19 mortality, age is the variable to be controlled for in our model. Note that age lies on a path from country to mortality, so strictly speaking it acts as a mediator rather than a confounder; holding it fixed is what isolates the direct effect. Whether a variable should be controlled for ultimately depends on the question of interest to the researcher. Thus, there is no single rule when it comes to statistical inference: it all depends on the model and, more importantly, on the question that we want to answer with our data.
While we looked mainly at medical data here, Simpson’s paradox appears in many places where observational data is analysed. In particular, “Simpson’s paradox in Covid-19 case fatality rates: a mediation analysis of age-related causal effects”, listed below, provides a few cases of Simpson’s paradox in a business setting.
Sources and further reading
Simpson’s paradox … and how to avoid it
Simpson’s Paradox: a cautionary tale in advanced analytics
Simpson’s paradox in Covid-19 case fatality rates: a mediation analysis of age-related causal effects