Fishing Expeditions in Scatterplot Matrices

The reason for my present interest in this topic is that I am working on a Tableau project with a data table containing 22 measures. I am worried that I may be tempted to calculate and visualize all pairs of measures, and that I may read too much into correlations that are significant by chance alone.


If the correlations between all possible pairs of measures in a dataset are calculated, and all possible hypotheses about these correlations are tested, then 5% of randomly chosen hypotheses will turn out to be significant at the 5% level, and 1% at the 1% level, by chance alone, even if all the null hypotheses are true.
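This is easy to demonstrate by simulation. The following sketch (a hypothetical illustration, not part of the project itself) generates 22 measures of pure noise, so every null hypothesis is true by construction, and still finds roughly 5% of the pairwise correlations significant at the 5% level:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_rows, n_measures = 200, 22

# Independent random columns: all null hypotheses are true by construction.
data = rng.normal(size=(n_rows, n_measures))

# p-value of the Pearson correlation for every pair of measures.
p_values = [pearsonr(data[:, i], data[:, j])[1]
            for i in range(n_measures) for j in range(i + 1, n_measures)]

n_pairs = len(p_values)  # 22 * 21 / 2 = 231 pairs
sig_05 = sum(p < 0.05 for p in p_values)
print(f"{sig_05} of {n_pairs} pairs significant at the 5% level by chance alone")
```

With 231 pairs, around a dozen "significant" correlations are expected even though the data is pure noise.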

When such correlations are visualized in scatterplot matrices, it is tempting to scan the plots, identify those that seem to show strong relationships, assign special significance to them, and base decisions and actions on them, forgetting that 5%, 1%, … of the correlations may be significant by chance alone. If the decisions and actions are of heavy import, this may of course have disastrous results.

Iris Scatterplot Matrix

Such behavior may be called “fishing expeditions in scatterplot matrices”.

Fishing a Shoe

One way of mitigating this problem is to regard the results of a fishing expedition as merely providing suggestions for further experiments designed to test the hypotheses.

Another way is to split the data randomly into two halves: one half for fishing and generating hypotheses, the other half hidden away until it is used to test the hypotheses generated from the first half.
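A minimal sketch of this split-half approach, again on simulated data standing in for a real table of measures: fish freely in one half, then test only the surviving candidate pairs in the held-out half.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_rows, n_measures = 200, 22
data = rng.normal(size=(n_rows, n_measures))  # stand-in for the real data table

# Shuffle the rows, then split into an exploration half and a confirmation half.
indices = rng.permutation(n_rows)
explore = data[indices[:n_rows // 2]]
confirm = data[indices[n_rows // 2:]]

# Fish freely in the exploration half ...
candidates = [(i, j)
              for i in range(n_measures) for j in range(i + 1, n_measures)
              if pearsonr(explore[:, i], explore[:, j])[1] < 0.05]

# ... then test only those candidate pairs in the held-out half.
confirmed = [(i, j) for i, j in candidates
             if pearsonr(confirm[:, i], confirm[:, j])[1] < 0.05]
print(f"{len(candidates)} candidate pairs, {len(confirmed)} confirmed on the held-out half")
```

Since the data here is pure noise, most candidates found in the exploration half fail to replicate in the confirmation half, which is exactly the point of hiding half the data away.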

I am determined to calculate and visualize only correlations that I can generate hypotheses about prior to calculation and visualization. I shall therefore begin by making an association network for identifying promising hypotheses.

Poverty and Inequality

I have started to work on a project which I call “Poverty and Inequality”. I do this mainly because those are among the most important and terrible problems facing humanity – as shown in the following slide deck – but also in preparation for reading “Capital” by Piketty, which focuses on inequality.

The data is derived from World Development Indicators. I have selected All Countries, twenty-two Series relevant to the project, and twelve Years from 2001 to 2012. The resulting data table has been Datashaped and pivoted in order to give it a form optimal for analysis and visualization in Tableau. I have defined parameters and calculated fields that may be used to parametrize measures and thereby simplify the exploration of the data table. I have made sheets to display parametrized Tables, Bar Charts, Histograms, Histograms with overlays of Normal Curves with the same means and standard deviations, Box Plots, Tree Maps, Bubble Charts, Symbol Maps, Filled Maps, Scatter Plots, and Scatterplot Matrices. It is much easier to select measures for these sheets from a list than to make sheets for each of the measures.

The data is derived from a census of all countries in the world. It is difficult to estimate its reliability (accuracy, consistency, stability). It is obviously not possible to calculate a reliability coefficient for the data; it is, for example, not possible to apply the test-retest methodology. It is also difficult to estimate the validity of the data: do the World Development Indicators measure what we want to measure and what we think we are measuring?

One of the World Bank's topics is Poverty, with the following website:

http://www.worldbank.org/en/topic/poverty

Associated page: Poverty and Inequality