When I began to use the new generation of data analysis and visualization software, such as Tableau, I thought that I would first use it to address some of the most important problems of humanity: resource scarcity, inequality, poverty, human migration, refugees, …
I have found large amounts of data relevant to these problems published on the Internet by various organizations and institutions, such as the United Nations, the World Bank, the World Health Organization, … The data usually take the form of tables with countries, regions, and locations as rows; time periods as rows or columns; and variables as columns.
The data have been collected in surveys, and their completeness and reliability are uncertain and variable.
The presentations of the data in the worksheets and dashboards of Tableau workbooks are very fine, and I have no doubt that such presentations can increase viewers' knowledge and understanding of the problems. But in order to solve a problem it is necessary to identify, and then eliminate or minimize, its cause or causes.
The presentations can be seductive. Viewers may be tempted to identify causes by calculating correlations between the variables in the data and assuming that correlations imply causation.
Statisticians know that correlation does not imply causation. What does this mean? Correlation is a measure of how closely two things are related. You may think of it as a number describing how one quantity tends to change when the other changes, with 1 indicating a perfect positive relationship between two sets of numbers, -1 a perfect negative relationship, and 0 no linear relationship whatsoever. “Correlation does not imply causation” means that just because two things correlate, one does not necessarily cause the other. Although this is an important fact, most people do not sufficiently take it into account. Their preconceptions tempt them to leap from correlation to causation without sufficient evidence.
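To make this concrete, here is a minimal sketch of how the correlation number described above (Pearson's r) is computed; the sample values are invented purely for illustration:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length number sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of the two series, and the variance of each one.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # 1.0  (perfect positive relationship)
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # -1.0 (perfect negative relationship)
print(pearson_r(xs, [3, 1, 4, 1, 5]))    # somewhere in between
```

Note what the formula does and does not capture: it measures how tightly two series move together, and nothing more. Nothing in the arithmetic knows, or can know, whether one series drives the other.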
This can result in absurd and ridiculous causal claims. Tyler Vigen has recently published the second edition of his book “Spurious Correlations” (May 8, 2015).
He has designed software that scours enormous data sets to find spurious statistical correlations. In the Introduction to the book he says:
“Humans are biologically inclined to recognize patterns….Does correlation imply causation? It’s intuitive, but it’s not always true. …Correlation, as a concept, means strictly that two things vary together…(but) Correlations don’t always make sense.
Provided enough data, it is possible to find things that correlate even when they shouldn’t. The method is often called “data dredging.” Data dredging is a technique used to find something that correlates with one variable by comparing it to hundreds of other variables. Normally scientists first hypothesize about a connection between two variables before they analyze data to determine the extent to which that connection exists.
Instead of testing individual hypotheses, a computer program can data dredge by simply comparing every dataset to every other dataset. Technology and data collection in the twenty-first century makes this significantly easier….This is the world of big data and big correlations….
Despite the humor, this book has a serious side. Graphs can lie, and not all correlations are indicative of an underlying causal connection. Data dredging is part of why it is possible to find so many spurious relationships….Correlations are an important part of scientific analysis, but they can be misleading if used incorrectly.”
Vigen, Tyler. Spurious Correlations. Hachette Books, May 2015. Kindle edition.
Why are people so easily seduced into assuming that correlation implies causation? Vigen states: “Humans are biologically inclined to recognize patterns”. This reminds me of a blog post on “Science or not” by Graham Coghill called “Confusing correlation with causation: rooster syndrome”.
He quotes: “The rooster crows and the sun rises.”
And then he says: “This is the natural human tendency to assume that, if two events or phenomena consistently occur at about the same time, then one is the cause of the other. Hence “rooster syndrome”, from the rooster who believed that his crowing caused the sun to rise….
We have an evolved tendency to believe in false positives – when event B follows soon after event A, we assume A was the cause of B, even if this is untrue. In evolution, such beliefs are harmless, whereas the belief that A is not the cause of B when it actually is (false negative) can be fatal. Michael Shermer explains: “For example, believing that the rustle in the grass is a dangerous predator when it is only the wind does not cost much, but believing that a dangerous predator is the wind may cost an animal its life.”
Michael Shermer wrote an article in Scientific American titled “Patternicity: Finding Meaningful Patterns in Meaningless Noise”.
He says: “Why do people see faces in nature, interpret window stains as human figures, hear voices in random sounds generated by electronic devices or find conspiracies in the daily news? A proximate cause is the priming effect, in which our brain and senses are prepared to interpret stimuli according to an expected model.
Is there a deeper ultimate cause for why people believe such weird things? There is. I call it “patternicity,” or the tendency to find meaningful patterns in meaningless noise. Traditionally, scientists have treated patternicity as an error in cognition. A type I error, or a false positive, is believing something is real when it is not (finding a nonexistent pattern). A type II error, or a false negative, is not believing something is real when it is (not recognizing a real pattern—call it “apatternicity”).
In my 2000 book How We Believe (Times Books), I argue that our brains are belief engines: evolved pattern-recognition machines that connect the dots and create meaning out of the patterns that we think we see in nature. Sometimes A really is connected to B; sometimes it is not. When it is, we have learned something valuable about the environment from which we can make predictions that aid in survival and reproduction.”
When data are collected in a non-random, uncontrolled survey, it is very hazardous to base decisions and actions on the assumption that correlation implies causation. It is impossible to know which correlations correspond to causation with high probability and which are spurious, and it is therefore impossible to estimate the risks of decisions and actions based on that assumption.
Correlations between variables calculated from data collected in a non-random, uncontrolled survey cannot be used for anything except to state hypotheses, which can then be tested in statistically sound research.