The reason for my present interest in this topic is that I am working on a Tableau project whose data table has 22 measures. I am worried that I may be tempted to calculate and visualize all pairs of measures, and that I may read too much into correlations that are significant by chance alone.
If the correlations between all possible pairs of measures in a dataset are calculated and each one is tested for significance, then about 5% of the tests can be expected to come out significant at the 5% level, and about 1% at the 1% level, by chance alone, even if all the null hypotheses are true.
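To make the numbers concrete for my own case, here is a minimal sketch in Python. It counts the pairs among 22 measures and simulates purely random, unrelated columns to see how many pairs come out "significant" anyway. The function names, the 100 rows, and the use of the Fisher z approximation for the p-value are my own assumptions for illustration; they are not taken from the Tableau project.

```python
import itertools
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_p_value(r, n):
    """Approximate two-sided p-value for H0: no correlation.

    Uses the Fisher z transformation (an approximation, assumed here
    so that only the standard library is needed).
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_chance_hits(n_measures=22, n_rows=100, alpha=0.05, seed=0):
    """Count 'significant' pairs among independent random columns."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_rows)]
               for _ in range(n_measures)]
    hits = 0
    for a, b in itertools.combinations(columns, 2):
        if fisher_p_value(pearson_r(a, b), n_rows) < alpha:
            hits += 1
    return hits


n_pairs = math.comb(22, 2)       # 231 distinct pairs of measures
expected_hits = 0.05 * n_pairs   # 11.55 expected at the 5% level
print(n_pairs, expected_hits, count_chance_hits())
```

With 22 measures there are 231 pairs, so roughly 11 or 12 of them would be expected to look significant at the 5% level even when nothing at all is going on. The simulated count varies with the seed but should scatter around that figure.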
When such correlations are visualized in scatterplot matrices it is tempting to scan the plots, identify the plots that seem to show strong relationships, assign special significance to them, and base decisions and actions on them, forgetting that about 5% (or 1%, depending on the chosen level) of the correlations may be significant by chance alone. If the decisions and actions are consequential, the results may be disastrous.
Such behavior may be called “fishing expeditions in scatterplot matrices”.
One way of mitigating this problem is to regard the results of a fishing expedition as merely providing suggestions for further experiments designed to test the hypotheses.
Another way is to split the data randomly into two halves: one half for fishing and generating hypotheses, the other half hidden away until it is used to test the hypotheses generated from the first half.
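The random split itself is simple. Below is a minimal sketch, assuming the rows of the data table are available as a Python list; the function name `split_halves` and the toy row numbers are hypothetical and only there for illustration.

```python
import random


def split_halves(rows, seed=0):
    """Randomly split rows into an exploration half and a holdout half.

    A fixed seed makes the split reproducible, so the holdout half
    stays the same every time the analysis is rerun.
    """
    rng = random.Random(seed)
    shuffled = list(rows)   # copy, so the original order is untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


rows = list(range(10))      # stand-in for the real table rows
exploration, holdout = split_halves(rows, seed=42)
print(len(exploration), len(holdout))
```

The important discipline is outside the code: the holdout half should not be looked at, plotted, or summarized until the hypotheses from the exploration half have been written down.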
I am determined to calculate and visualize only those correlations for which I can state a hypothesis before doing any calculation or visualization. I shall therefore begin by making an association network for identifying promising hypotheses.