
Fishing Expeditions in Scatterplot Matrices

The reasons for my present interest in this topic are that I am working on a Tableau project with a data table of 22 measures, that I am worried I may be tempted to calculate and visualize all pairs of measures, and that I may read too much into correlations that are significant by chance alone.


If the correlations between all possible pairs of measures in a dataset are calculated and all possible hypotheses about these correlations are tested, then 5% of randomly chosen hypotheses will turn out to be significant at the 5% level, and 1% at the 1% level, by chance alone, even if all the null hypotheses are true. With 22 measures there are 22 × 21 / 2 = 231 distinct pairs, so about 11 or 12 correlations can be expected to reach the 5% level even in pure noise.
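To make the arithmetic concrete, here is a minimal sketch in Python (assuming NumPy and SciPy; the row count and random seed are illustrative) that simulates 22 measures of pure noise and counts the pairwise correlations that reach the 5% level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_rows, n_measures = 200, 22                   # 22 measures, as in the project
data = rng.normal(size=(n_rows, n_measures))   # pure noise: every null hypothesis is true

significant, pairs = 0, 0
for i in range(n_measures):
    for j in range(i + 1, n_measures):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        pairs += 1
        significant += p < 0.05

print(f"{significant} of {pairs} pairs significant at the 5% level")
# With 231 pairs, roughly 11 or 12 will be "significant" by chance alone.
```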

When such correlations are visualized in scatterplot matrices it is tempting to scan the plots, identify those that seem to show strong relationships, assign special significance to them, and base decisions and actions on them, forgetting that 5%, 1%, …, of the correlations may be significant by chance alone. If the decisions and actions carry heavy consequences, this may of course have disastrous results.

Iris Scatterplot Matrix

Such behavior may be called “fishing expeditions in scatterplot matrices”.

Fishing a Shoe

One way of mitigating this problem is to regard the results of a fishing expedition merely as suggestions for further experiments designed to test the hypotheses.

Another way is to split the data randomly into two halves: one half for fishing and generating hypotheses, the other half hidden away until it is used to test the hypotheses generated from the first half.
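A minimal sketch of that split-half strategy, assuming the data sits in a pandas DataFrame (the function names here are hypothetical, not part of any library):

```python
import pandas as pd
from scipy import stats

def split_half(df: pd.DataFrame, seed: int = 0):
    """Randomly split the rows into an exploration half and a confirmation half."""
    explore = df.sample(frac=0.5, random_state=seed)
    confirm = df.drop(explore.index)
    return explore, confirm

def confirm_correlation(confirm: pd.DataFrame, x: str, y: str, alpha: float = 0.05):
    """Test a hypothesis generated on the exploration half against fresh data."""
    pair = confirm[[x, y]].dropna()            # pearsonr requires complete cases
    r, p = stats.pearsonr(pair[x], pair[y])
    return r, p, p < alpha
```

The exploration half may be fished freely; only correlations that also hold on the confirmation half deserve further attention.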

I am determined to calculate and visualize only correlations that I can generate hypotheses about prior to calculations and visualizations. I shall therefore begin by making an association network for identifying promising hypotheses.

Poverty and Inequality

I have started to work on a project which I call “Poverty and Inequality”. I do this mainly because those are among the most important and terrible problems facing humanity – as shown in the following slide deck – but also in preparation for reading “Capital” by Piketty, which focuses on inequality.

The data is derived from the World Development Indicators. I have selected All Countries, twenty-two Series relevant to the project, and twelve Years from 2001 to 2012. The resulting data table has been datashaped and pivoted to give it a form optimal for analysis and visualization in Tableau. I have defined parameters and calculated fields that can be used to parametrize measures and thereby simplify exploration of the data table. I have made sheets to display parametrized Tables, Bar Charts, Histograms, Histograms overlaid with Normal curves having the same means and standard deviations, Box Plots, Tree Maps, Bubble Charts, Symbol Maps, Filled Maps, Scatter Plots, and Scatterplot Matrices. It is much easier to select measures for these sheets from a list than to make a separate sheet for each measure.
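For readers who prefer to see the reshaping outside Tableau, here is a minimal pandas sketch of the two steps described above (the country names, series names, and values are hypothetical, purely for illustration): melting the year columns into rows, then pivoting the series into columns so each measure becomes a column.

```python
import pandas as pd

# Hypothetical WDI-style export: one row per country/series, one column per year.
wdi = pd.DataFrame({
    "Country": ["Iceland", "Iceland", "Norway", "Norway"],
    "Series":  ["GDP per capita", "Gini index", "GDP per capita", "Gini index"],
    "2011":    [47000, 26.0, 66000, 25.3],
    "2012":    [45000, 26.8, 68000, 25.9],
})

# Step 1 ("datashaping"): melt the year columns into rows.
tall = wdi.melt(id_vars=["Country", "Series"], var_name="Year", value_name="Value")

# Step 2 ("pivoting"): spread the series into columns, one row per country and year.
tidy = tall.pivot_table(index=["Country", "Year"],
                        columns="Series", values="Value").reset_index()
print(tidy)
```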

The data is derived from a census of all countries in the world, so it is difficult to estimate its reliability (accuracy, consistency, stability). It is obviously not possible to calculate a reliability coefficient for the data; it is, for example, not possible to apply the test-retest methodology. It is also difficult to estimate the validity of the data: do the World Development Indicators measure what we want to measure and what we think we are measuring?

One of the World Bank's topics is Poverty, with the following website:

http://www.worldbank.org/en/topic/poverty

Associated page: Poverty and Inequality

Reliability of Confirmatory Exploratory Data Analysis

Scientific studies using confirmatory exploratory data analysis that limit tests of significance to the calculation of P values have come under increasing criticism. Many such studies have proved impossible to replicate, so their conclusions cannot be considered true. How reliable, then, are the results of confirmatory exploratory data analytic studies for supporting decisions and actions?

My P Value Is Smaller Than Your P Value

I have written an associated page on this subject.

Reliability of Confirmatory Exploratory Data Analysis

Reliability of Basic Exploratory Data Analysis – 2

In a post written on the 27th of October on basic exploratory data analysis I expressed concern about the reliability of its results for supporting decisions and actions if it is improperly applied.

If improperly done, basic exploratory data analysis reminds me of Munchhausen's journey to the Moon. I am also reminded of Buddha's parable about blind Indians trying to describe an elephant.

Elephant and Blind Indian

There is more about this in the associated page:

Reliability of Basic Exploratory Data Analysis – 2

Reliability of Basic Exploratory Data Analysis – 1

Lately I have been wondering about the reliability of the results of the various types of exploratory data analysis for supporting decisions and actions. How confident can we be in the decisions and actions based on these results, and how willing are we to take responsibility for the risks associated with them?

I have composed a page with the same name as this post, where I discuss these questions.

Reliability of Basic Exploratory Data Analysis – 1

Perhaps I should have named this post and its associated page:

Basic Exploratory Data Analysis and Munchhausen’s Journey to the Moon


Munchhausen and the Moon

From World Development Indicators to Tableau

Some of my data analytic projects require data from The World Bank's “World Development Indicators”. To simplify the process of getting the data from WDI to Tableau, I have developed a procedure for doing so.

This procedure is described in the following related set of pages:

From World Development Indicators to Tableau

Datashaping an Excel Table

Pivoting an Excel Datatable

Transforming Data Within Tableau

Kenneth Black has an excellent post related to datashaping an Excel datatable.

Karunaker Molugu (http://got-data.blogspot.com) has kindly shown me on the Tableau Forum how to pivot an Excel datatable, as well as two methods of transforming data within Tableau.

Tableau Integration with Trifacta

Trifacta has announced deep integration with Tableau. Tableau users now have the option of writing the output of Trifacta data transformations directly to a Tableau Data Extract format.

Trifacta provides Tableau users with an intuitive Data Transformation Platform for Hadoop so they can more efficiently transform and analyze common data formats in Hadoop.

The integration between Trifacta and Tableau removes a key barrier between the raw, semi-structured data commonly stored in Hadoop and the self-service process for analyzing, visualizing and sharing of insights provided by Tableau.

Working with big data poses specific challenges. The most significant barriers come from structuring, distilling and automating the transfer of data from Hadoop.

Water Shortage

One of the most important problem areas for humanity is water, especially freshwater. The available amount of freshwater is decreasing, the problems caused by this are increasing, and so is the risk associated with freshwater shortage.

“Water scarcity is one of the defining issues of the 21st century. …In its Global Risks 2013 Report, the World Economic Forum identified water supply crises as one of the highest impact and most likely threats facing the planet.”

The World Economic Forum ranks water supply crises as being more likely and having a greater impact globally than the risk of food shortage crises, terrorism, cyber attacks, and geophysical destruction.

Freshwater supply crises are clearly of great importance, and it is imperative to increase knowledge about them by generating new data through research, generating new knowledge from the data, and spreading this knowledge. It is also imperative to apply the knowledge to making decisions and implementing those decisions through actions designed to decrease the risk of water crises.

In view of this I have been working on a water shortage project using data from AQUASTAT and Tableau Software, in the hope that I may contribute to an increase in knowledge about such crises and thereby decrease the risk of their occurrence.

You may read about this project under:

Freshwater Supply Crises

Water Shortage

Tableau Software

During recent weeks and months I have been studying data and decision analysis in general, and especially the use of Tableau Software for applying data analysis to large datasets in order to transform the data into knowledge, which can then be used to answer important questions, solve important problems, make decisions, and implement them through effective actions.

Tableau Software has its roots in a Stanford University Computer Science Department research project aimed at increasing people's ability to rapidly analyze data. Tableau's main approach to visual design is to connect to a data source and drag data fields onto its workspace.

Tableau Desktop is a software package for data analysis. It’s easy to learn, easy to use, and extremely fast. It allows you to use your natural ability to see patterns, identify trends and discover visual insights.

You can connect to data and perform queries without writing a single line of code. You can follow your natural train of thought as you shift between views with drag-and-drop technology.

You can connect directly to data for live, up-to-date analysis, or extract data into Tableau's fast data engine and take advantage of its breakthrough in-memory architecture, or do both, for two, three, or even ten different data sources, and blend them all together. Tableau has a large number of native connectors to data sources. A list of connectors can be viewed at

http://www.tableausoftware.com/products/desktop?qt-product_tableau_desktop=1#qt-product_tableau_desktop

Multiple views can be combined into interactive dashboards. Data can be filtered and highlighted to show relationships between variables. Content can be shared using the web-based Tableau Server or Tableau Online. Content can also be embedded into website pages, including blogs.

Tableau has powerful analytical tools. You can filter data dynamically, split trends across different categories or run an in-depth cohort analysis. You can double-click geographic fields to put data on a map. In addition it can be integrated with R.

You can go deeper into your data with new calculations on existing data. You can make one-click forecasts, build box plots, and see statistical summaries of your data. Run trend analyses, regressions, correlations, ….

There is a large amount of material on Tableau and its application to data analysis available on the Tableau website (http://www.tableausoftware.com/), in blogs, and in a number of books. Some of these books are available on Kindle.

Tableau is an ideal analysis and visualization tool in that it possesses the following attributes:
Simplicity – easy for non-technical users to master
Connectivity – seamlessly connects to a large variety of data sources
Visual competence – provides appropriate graphics
Sharing – facilitates sharing of knowledge, understanding and insight
Scaling – handles large data sets