The reason for my present interest in this topic is that I am working on a Tableau project with a datatable of 22 measures, and I am worried that I may be tempted to calculate and visualize all pairs of measures and then read too much into correlations that are significant by chance alone.
If the correlations between all possible pairs of measures in a dataset are calculated and each of them is tested for significance, then on average about 5% of the tests will turn out to be significant at the 5% level, and about 1% at the 1% level, by chance alone, even if all the null hypotheses are true. With 22 measures there are 231 pairs, so roughly 11 or 12 correlations can be expected to look “significant” at the 5% level for no reason at all.
When such correlations are visualized in scatterplot matrices it is tempting to scan the plots, pick out the ones that seem to show strong relationships, assign special significance to them, and base decisions and actions on them, forgetting that 5% (or 1%, and so on, depending on the level chosen) of the correlations may be significant by chance alone. If the decisions and actions are of heavy import, this may of course have disastrous results.
Such behavior may be called “fishing expeditions in scatterplot matrices”.
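To get a feeling for the size of the problem, here is a minimal sketch in Python. It is not part of the Tableau project. It fills 22 measures with pure random noise, tests every pair, and counts how many correlations come out "significant" anyway. The 200 rows, the Gaussian noise, the Fisher z approximation used for the test, and the function names are all my own assumptions for illustration.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist


def pearson_r(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_chance_hits(n_measures=22, n_rows=200, alpha=0.05, seed=0):
    """Count 'significant' correlations among measures that are pure noise."""
    rng = random.Random(seed)
    # Every measure is independent Gaussian noise, so every null hypothesis is true.
    data = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_measures)]
    # Two-sided critical value of the standard normal distribution.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for x, y in combinations(data, 2):
        # Fisher z approximation: atanh(r) * sqrt(n - 3) is roughly N(0, 1) under the null.
        z = math.atanh(pearson_r(x, y)) * math.sqrt(n_rows - 3)
        if abs(z) > z_crit:
            hits += 1
    return hits


n_pairs = math.comb(22, 2)        # number of distinct pairs of 22 measures
expected_hits = 0.05 * n_pairs    # average number of false alarms at the 5% level
hits = count_chance_hits()
print(f"{n_pairs} pairs, about {expected_hits:.1f} false alarms expected, {hits} found in this run")
```

The exact count varies with the seed, but it should usually land near the expected value. Every one of those "findings" is an artifact of testing many pairs.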
One way of mitigating this problem is to regard the results of a fishing expedition as merely providing suggestions for further experiments designed to test the hypotheses.
Another way is to split the data randomly into two halves, one half for fishing and generating hypotheses, the other half hidden away until it is used to test the results of the first half against the second half.
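The split itself is simple to carry out. The following is a sketch in Python under my own assumptions: the rows are held in a list, the split is controlled by a fixed seed so that it can be reproduced, and the names `split_halves`, `explore` and `holdout` are made up for this illustration.

```python
import random


def split_halves(rows, seed=0):
    """Randomly split rows into an exploration half and a hold-out half."""
    rng = random.Random(seed)   # fixed seed so the same split can be recreated later
    shuffled = list(rows)       # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Hypothetical example: ten placeholder country labels.
countries = [f"country_{i}" for i in range(10)]
explore, holdout = split_halves(countries)
print("fish in:", explore)
print("hide away:", holdout)
```

For a table with several years per country it is probably wiser to split by country, as in this example, than by individual row. Otherwise observations from the same country end up in both halves and the hold-out half is no longer an independent check.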
I am determined to calculate and visualize only correlations that I can generate hypotheses about prior to calculations and visualizations. I shall therefore begin by making an association network for identifying promising hypotheses.
I have started to work on a project which I call “Poverty and Inequality”. This I do mainly because those are among the most important and terrible problems facing humanity – as shown in the following slide deck – but also in preparation for reading “Capital” by Piketty, which focuses on inequality.
The data is derived from World Development Indicators. I have selected All Countries, twenty-two Series relevant to the project, and twelve Years from 2001 to 2012. The resulting data table has been Datashaped and pivoted in order to give it a form optimal for analysis and visualization in Tableau. I have defined parameters and calculated fields that may be used to parametrize measures and thereby simplify the exploration of the data table. I have made sheets to display parametrized Tables, Bar Charts, Histograms, Histograms overlaid with Normal Curves with the same means and standard deviations, Box Plots, Tree Maps, Bubble Charts, Symbol Maps, Filled Maps, Scatter Plots, and ScatterPlot Matrices. It is much easier to select measures for these sheets from a list than to make sheets for each of the measures.
The data is derived from a census of all countries in the world. It is difficult to estimate its reliability (accuracy, consistency, stability). It is obviously not possible to calculate a reliability coefficient for the data. It is for example not possible to apply the test-retest methodology. It is also difficult to estimate the validity of the data. Do the World Development Indicators measure what we want to measure and what we think we are measuring?
One of the topics of the World Bank is Poverty with the following website:
Associated page: Poverty and Inequality
Scientific studies using confirmatory exploratory data analysis that limit their tests of significance to the calculation of P values have come under increasing criticism. Many such studies have proved impossible to replicate, and their conclusions can therefore not be taken as true. So how reliable are the results of confirmatory exploratory data analytic studies for supporting decisions and actions?
I have written an associated page on this subject.
Reliability of Confirmatory Exploratory Data Analysis
In a post written on the 27th of October on basic exploratory data analysis I expressed concern about the reliability of its results for supporting decisions and actions if improperly applied.
If improperly done, basic exploratory data analysis reminds me of Munchhausen’s Journey to the Moon. I am also reminded of Buddha’s parable about blind Indians trying to describe an elephant.
There is more about this in the associated page:
Reliability of Basic Exploratory Data Analysis – 2
Lately I have been wondering about the reliability of the results of the various types of exploratory data analysis for supporting decisions and actions. How confident can we be in the decisions and actions based on these results, and how willing are we to take responsibility for the risks associated with them?
I have composed a page with the same name as the post where I discuss these questions.
Reliability of Basic Exploratory Data Analysis – 1
Perhaps I should have named this post and its associated page:
Basic Exploratory Data Analysis and Munchhausen’s Journey to the Moon
Some of my data analytic projects require data from The World Bank’s “World Development Indicators”. In order to simplify getting the data from WDI into Tableau I have developed a procedure for it.
This procedure is described in the following related set of pages:
From World Development Indicators to Tableau
Datashaping an Excel Table
Pivoting an Excel Datatable
Transforming Data Within Tableau
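For readers who prefer to see the idea in code, here is a rough Python sketch of the pivoting step. It is not the procedure described in the pages above. It only assumes a wide table with one column per year and turns it into a long table with one row per year, which is the shape Tableau usually handles best. The column names, the sample values, and the function `unpivot_years` are hypothetical.

```python
def unpivot_years(rows, id_cols, year_cols):
    """Turn one-column-per-year rows into one-row-per-year records."""
    long_rows = []
    for row in rows:
        for year in year_cols:
            value = row.get(year)
            # Assumption: missing observations arrive as None or an empty string.
            if value is None or value == "":
                continue
            record = {col: row[col] for col in id_cols}
            record["Year"] = int(year)
            record["Value"] = float(value)
            long_rows.append(record)
    return long_rows


# Hypothetical wide table: two countries, one series, two years.
wide = [
    {"Country": "A", "Series": "Indicator X", "2001": "1.5", "2002": "2.0"},
    {"Country": "B", "Series": "Indicator X", "2001": "", "2002": "3.5"},
]
long_rows = unpivot_years(wide, ["Country", "Series"], ["2001", "2002"])
for record in long_rows:
    print(record)
```

The missing value for country B in 2001 is simply dropped. Whether dropping is the right treatment of missing values depends on the analysis.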
Kenneth Black has an excellent post related to datashaping an Excel datatable.
Karunaker Molugu (http://got-data.blogspot.com) has kindly shown me on the Tableau Forum how to pivot an Excel datatable, as well as two methods of transforming data within Tableau.
A large part of the time spent on analytic projects goes into the preparation of data. It is therefore of great importance for data analysts to master the preparation process.
A new page has been written about the Preparation of Data for Presentation in Tableau.
Trifacta has announced deep integration with Tableau. Tableau users now have the option of writing the output of Trifacta data transformations directly to a Tableau Data Extract format.
Trifacta provides Tableau users with an intuitive Data Transformation Platform for Hadoop so they can more efficiently transform and analyze common data formats in Hadoop.
The integration between Trifacta and Tableau removes a key barrier between the raw, semi-structured data commonly stored in Hadoop and the self-service process for analyzing, visualizing and sharing of insights provided by Tableau.
Working with big data poses specific challenges. The most significant barriers come from structuring, distilling and automating the transfer of data from Hadoop.
Dr. Saed Sayad has published a fine data mining map called “An Introduction to Data Mining”. The URL is http://www.saedsayad.com.
The map shows the stages and substages of the data mining process, with a wealth of information about relevant methods.
Dr. Sayad supplies the following information about himself:
“I have more than 20 years of experience in data mining, statistics and artificial intelligence and designed, developed and deployed many business and scientific applications of predictive modeling. I am a pioneer researcher in real time data mining, an adjunct Professor at the University of Toronto, and have been presenting a popular graduate data mining course since 2001.”
Dr. Sayad has written an excellent book called “Real Time Data Mining”. His description of the content of the book is an excellent characterization of data mining.
“Data mining is about explaining the past and predicting the future by exploring and analyzing data. Data mining is a multi-disciplinary field which combines statistics, machine learning, artificial intelligence and database technology. Although data mining algorithms are widely used in extremely diverse situations, in practice, one or more major limitations almost invariably appear and significantly constrain successful data mining applications. Frequently, these problems are associated with large increases in the rate of generation of data, the quantity of data and the number of attributes (variables) to be processed: Increasingly, the data situation is now beyond the capabilities of conventional data mining methods. The term Real Time is used to describe how well a data mining algorithm can accommodate an ever increasing data load instantaneously. Upgrading conventional data mining to real time data mining is through the use of a method termed the Real Time Learning Machine or RTLM. The use of the RTLM with conventional data mining methods enables Real Time Data Mining. The future of predictive modeling belongs to real time data mining and the main motivation in authoring this book is to help you to understand the method and to implement it for your applications.”
The image below illustrates the extraction of data from a bottomless black hole of big data.
Searching for Gold.
Data Mining Page
A large number of decision making methods have been developed. A few of them are listed below:
Multicriteria Decision Analysis
Analytic Hierarchy Process
Analytic Network Process
Hierarchical Influence Diagrams
PAPRIKA Method: “potentially all pairwise rankings of possible alternatives”
These methods are implemented by various software packages, for example:
Decision Making Methods – Page