The Seductiveness of Hans Rosling

Hans Rosling is an extremely fine presenter of data. His visualizations using Gapminder are excellent and very effective – sometimes perhaps seductive.

In his TED talk “The best stats you have ever seen” (2006) he shows a visualization of the percentage of the world population as a function of income per person per day. He maintains that the income gap has been decreasing and is disappearing. Whether this is true depends on his definition of gap. If he means the dip (relative minimum) in the curve, he is right. But if gap means income inequality between the poor and the rich, then he is not right. In fact income inequality has been increasing in recent years.

Hans Rosling exhorts all of us to use the enormous amount of data that exists for the benefit of all. He says:

“We need really to see them. We need to get them into graphic formats, where you can instantly understand them. Now, statisticians don’t like it, because they say that this will not show the reality; we have to have statistical, analytical methods.”

When Rosling says “instantly understand” I take him to mean “intuitively understand”. He is on the verge of seducing us into accepting that the correlation between the variables he visualizes implies causation.

But then he seems to feel uncomfortable with this and says:

“Many people say data is bad. There is an uncertainty margin, but…. the differences (in the data I use) are much bigger than the weakness of the data.”

This is of course an application of statistical thinking and he finally escapes by the skin of his teeth from giving the impression that he thinks that correlation implies causation by saying:

“But this is hypothesis-generating.”

The visualizations that can be made with Gapminder are extremely fine and if you are not on your guard you can easily be seduced by them. The same applies to the equally fine visualizations made with Tableau.

Tableau and Dynamical System Models

Kenneth Black has a post in his blog

http://3danim8.wordpress.com/2014/04/26/a-look-back-at-my-mmr-groundwater-modeling-work-circa-2003/

about analyzing and visualizing dynamic system models.

He writes:

“One of my biggest regrets is that I had to work the first 20 years of my career without Tableau desktop software.  Looking back, I can see how so many computer codes I designed and wrote, or software teams that I directed, were necessary because we didn’t have a tool like Tableau to help us perform our analysis…..

There were capable tools we used …, but much of the quantitative analysis had to be programmed on a case-by-case basis.  If I had Tableau throughout my career, things would have been much easier, more insights would have been possible, and better models could have been built.

Now that Tableau is here, I’d like to take another crack at analyzing some of my earlier work.”

I had a similar experience and in a comment to his post I wrote:

“I found this post very interesting. It reminded me of how I became interested in dynamical systems. I happened to come across an article about a hydrologic model of the Okefenokee Swamp in Georgia. The article contained differential equations representing the model. I developed various dynamical models of the swamp in various programs like STELLA, Vensim, Mathematica, and even LiveMath (which for some reason still is close to my heart). When I later became chief of Acute Psychiatric Services at a Psychiatric Hospital I developed a lot of models of health services, especially acute psychiatric services, in the hope that these models would increase my understanding and the understanding of health service administrators and politicians and thus lead to an improvement in the services. This however did not happen and has still not happened. Now, STELLA has quite good presentation methods, so I thought that the fact that administrators and politicians did not see the light of day immediately was an indication that they were not really interested in solving the serious problems of the acute psychiatric health services in Norway but mainly interested in decreasing the cost of those services. This is surely at least partly true. After reading your post I began to wonder whether it might partly be due to their inability to gain the necessary insight and understanding from the STELLA presentations and my comments on them. Perhaps it would be possible to increase their insight and understanding by presenting the underlying data and the data from simulations in Tableau Workbooks. Perhaps the clarity of these presentations and their availability to the public would make it impossible for them to avoid doing something about the problems.”

Fishing Expeditions in Scatterplot Matrices

The reason for my present interest in this topic is that I am working on a Tableau project with a data table with 22 measures. I am worried that I may be tempted to calculate and visualize all pairs of measures and that I may read too much into correlations that are significant by chance alone.

If the correlations between all possible pairs of measures in a dataset are calculated and each correlation is tested for significance, then on average about 5% of the tests will turn out to be significant at the 5% level, and about 1% at the 1% level, by chance alone, even if all the null hypotheses are true.
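To get a feeling for the size of the problem with 22 measures, the arithmetic can be sketched in a few lines of Python. This is only a minimal illustration: the function names are my own, and it assumes every pair is tested once at a fixed significance level. The expected count does not require the tests to be independent, since expectations simply add up.

```python
from math import comb


def number_of_pairs(n_measures: int) -> int:
    """Number of distinct unordered pairs of measures (no self-pairs)."""
    return comb(n_measures, 2)


def expected_false_positives(n_measures: int, alpha: float) -> float:
    """Expected number of pairs significant at level alpha
    when every null hypothesis is true."""
    return number_of_pairs(n_measures) * alpha


print(number_of_pairs(22))                            # 231
print(round(expected_false_positives(22, 0.05), 2))   # 11.55
print(round(expected_false_positives(22, 0.01), 2))   # 2.31
```

So with 22 measures there are 231 scatterplots to look at, and roughly a dozen of them would be expected to look significant at the 5% level even if no real relationships existed.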

When such correlations are visualized in scatterplot matrices it is tempting to scan the plots, identify the plots that seem to show strong relationships, assign special significance to them, and base decisions and actions on them, forgetting that about 5% (or 1%, depending on the level chosen) of the correlations may be significant by chance alone. If the decisions and actions are of great import, this may have disastrous results.

Such behavior may be called “fishing expeditions in scatterplot matrices”.

One way of mitigating this problem is to regard the results of a fishing expedition as merely providing suggestions for further experiments designed to test the hypotheses.

Another way is to split the data randomly into two halves, one half for fishing and generating hypotheses, the other half hidden away until it is used to test the results of the first half against the second half.
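The split itself is simple to do. Here is a minimal Python sketch, in which the function name, the fixed seed and the placeholder country labels are all assumptions made for illustration. For data with several years per country it seems safer to split by country rather than by row, so that the same country does not end up in both halves.

```python
import random


def split_in_halves(units, seed=0):
    """Shuffle a copy of the units and return
    (exploration_half, confirmation_half)."""
    shuffled = list(units)
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split repeatable
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Placeholder labels; in practice these would be the country names or codes.
countries = [f"country_{i}" for i in range(10)]
explore, confirm = split_in_halves(countries, seed=42)

print(len(explore), len(confirm))  # 5 5
```

The exploration half is used for fishing; the confirmation half is left untouched until the hypotheses have been written down.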

I am determined to calculate and visualize only correlations that I can generate hypotheses about prior to calculations and visualizations. I shall therefore begin by making an association network for identifying promising hypotheses.

Poverty and Inequality

I have started to work on a project which I call “Poverty and Inequality”. This I do mainly because those are among the most important and terrible problems facing humanity – as shown in the following slide deck – but also in preparation for reading “Capital” by Piketty, which focuses on inequality.

The data is derived from World Development Indicators. I have selected All Countries, twenty-two Series relevant to the project, and twelve Years from 2001 to 2012. The resulting data table has been Datashaped and pivoted in order to give it a form optimal for analysis and visualization in Tableau. I have defined parameters and calculated fields that may be used to parametrize measures and thereby simplify the exploration of the data table. I have made sheets to display parametrized Tables, Bar Charts, Histograms, Histograms overlaid with Normal Curves with the same means and standard deviations, Box Plots, Tree Maps, Bubble Charts, Symbol Maps, Filled Maps, Scatter Plots, and Scatter Plot Matrices. It is much easier to select measures for these sheets from a list than to make sheets for each of the measures.
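The reshaping step can be illustrated with a small Python sketch. This is not the tool I used, only a rough picture of the idea: a table with one column per year is turned into a table with one row per country, series and year. The column names, the ".." marker for missing values and the numbers are all assumptions made for the example.

```python
def unpivot_years(wide_rows, id_columns, year_columns, missing=".."):
    """Turn rows with one column per year into rows with
    one (Year, Value) pair each, skipping missing cells."""
    long_rows = []
    for row in wide_rows:
        for year in year_columns:
            value = row.get(year, missing)
            if value == missing or value == "":
                continue  # no observation for this year
            long_row = {column: row[column] for column in id_columns}
            long_row["Year"] = int(year)
            long_row["Value"] = float(value)
            long_rows.append(long_row)
    return long_rows


# Made-up numbers for illustration only; not real indicator values.
wide = [
    {"Country": "A", "Series": "Indicator X", "2001": "10.5", "2002": "11.0"},
    {"Country": "B", "Series": "Indicator X", "2001": "..", "2002": "7.25"},
]
long_table = unpivot_years(wide, ["Country", "Series"], ["2001", "2002"])
for record in long_table:
    print(record)
```

In the long form each year is a value of a single Year field rather than a separate column, which is what makes it easy to filter and parametrize in a visualization tool.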

The data is derived from a census of all countries in the world. It is difficult to estimate its reliability (accuracy, consistency, stability). It is obviously not possible to calculate a reliability coefficient for the data. It is for example not possible to apply the test-retest methodology. It is also difficult to estimate the validity of the data. Do the World Development Indicators measure what we want to measure and what we think we are measuring?

One of the topics of the World Bank is Poverty with the following website:

http://www.worldbank.org/en/topic/poverty

Associated page: Poverty and Inequality

Tableau Integration with Trifacta

Trifacta has announced deep integration with Tableau. Tableau users now have the option of writing the output of Trifacta data transformations directly to a Tableau Data Extract format.

Trifacta provides Tableau users with an intuitive Data Transformation Platform for Hadoop so they can more efficiently transform and analyze common data formats in Hadoop.

The integration between Trifacta and Tableau removes a key barrier between the raw, semi-structured data commonly stored in Hadoop and the self-service process for analyzing, visualizing and sharing of insights provided by Tableau.

Working with big data poses specific challenges. The most significant barriers come from structuring, distilling and automating the transfer of data from Hadoop.

Water Shortage

One of the most important problem areas for humanity is water, especially freshwater. The available amount of freshwater is decreasing, and both the problems caused by this and the risk associated with freshwater shortage are increasing.

“Water scarcity is one of the defining issues of the 21st century. …In its Global Risks 2013 Report, the World Economic Forum identified water supply crises as one of the highest impact and most likely threats facing the planet.”

The World Economic Forum ranks water supply crises as being more likely and having a greater impact globally than the risk of food shortage crises, terrorism, cyber attacks, and geophysical destruction.

Fresh water supply crises are clearly of great importance and it is imperative to increase knowledge about them by generating new data by research, generating new knowledge from the data, and spreading this knowledge. It is also imperative to apply the knowledge to making decisions and implementing the decisions by actions designed to decrease the risk of water crises.

In view of this I have been working on a water shortage project using data from AQUASTAT and Tableau Software in the hope that I may contribute to an increase in the knowledge about such crises and thereby decrease the risk of their occurrence.

Freshwater Supply Crises

Water Shortage

Tableau Software

During recent weeks and months I have been studying data and decision analysis in general and especially the use of Tableau Software for applying data analysis to large datasets in order to transform the data into knowledge which then can be used to answer important questions, solve important problems, make decisions and implement them by effective actions.

Tableau software has its roots in a research project at the Stanford University Computer Science Department aimed at increasing people’s ability to rapidly analyze data. Tableau’s main approach to visual design is to connect to a data source and drag data fields to its workspace.

Tableau Desktop is a software package for data analysis. It’s easy to learn, easy to use, and extremely fast. It allows you to use your natural ability to see patterns, identify trends and discover visual insights.

You can connect to data and perform queries without writing a single line of code. You can follow your natural train of thought as you shift between views with drag-and-drop technology.

You can connect directly to data for live, up-to-date data analysis or extract data into Tableau’s fast data engine and take advantage of breakthrough in-memory architecture, or do both, for 2, 3, or even 10 different data sources and blend them all together. Tableau has a large number of native connectors to data sources. A list of connectors can be viewed at

http://www.tableausoftware.com/products/desktop?qt-product_tableau_desktop=1#qt-product_tableau_desktop

Multiple views can be combined into interactive dashboards. Data can be filtered and highlighted to show relationships between variables. Content can be shared using the web-based Tableau Server or Tableau Online. Content can also be embedded into website pages, including blogs.

Tableau has powerful analytical tools. You can filter data dynamically, split trends across different categories or run an in-depth cohort analysis. You can double-click geographic fields to put data on a map. In addition it can be integrated with R.

You can go deeper into your data with new calculations on existing data. You can make one-click forecasts, build box plots and see statistical summaries of your data. You can also run trend analyses, regressions, correlations, and more.

There is a large amount of material on Tableau and its application to data analysis available on the Tableau website (http://www.tableausoftware.com/), in blogs, and in a number of books. Some of these books are available on Kindle.

Tableau is an ideal analysis and visualization tool in that it possesses the following attributes:
Simplicity – easy for non-technical users to master
Connectivity – seamlessly connects to a large variety of data sources
Visual competence – provides appropriate graphics
Sharing – facilitates sharing of knowledge, understanding and insight
Scaling – handles large data sets

Data Mining

Dr. Saed Sayad has published a fine data mining map called “An Introduction to Data Mining”. The URL is http://www.saedsayad.com.

The map shows the stages and substages of the data mining process with a wealth of information about relevant methods.

Dr. Sayad supplies the following information about himself:

“I have more than 20 years of experience in data mining, statistics and artificial intelligence and designed, developed and deployed many business and scientific applications of predictive modeling. I am a pioneer researcher in real time data mining, an adjunct Professor at the University of Toronto, and have been presenting a popular graduate data mining course since 2001.”

Dr. Sayad has written an excellent book called “Real Time Data Mining”. His description of the content of the book is an excellent characterization of data mining.

“Data mining is about explaining the past and predicting the future by exploring and analyzing data. Data mining is a multi-disciplinary field which combines statistics, machine learning, artificial intelligence and database technology. Although data mining algorithms are widely used in extremely diverse situations, in practice, one or more major limitations almost invariably appear and significantly constrain successful data mining applications. Frequently, these problems are associated with large increases in the rate of generation of data, the quantity of data and the number of attributes (variables) to be processed: Increasingly, the data situation is now beyond the capabilities of conventional data mining methods. The term Real Time is used to describe how well a data mining algorithm can accommodate an ever increasing data load instantaneously. Upgrading conventional data mining to real time data mining is through the use of a method termed the Real Time Learning Machine or RTLM. The use of the RTLM with conventional data mining methods enables Real Time Data Mining. The future of predictive modeling belongs to real time data mining and the main motivation in authoring this book is to help you to understand the method and to implement it for your applications.”

The image below illustrates the extraction of data from a bottomless black hole of big data.

Searching for Gold.

http://www.grtcorp.com/content/big-data-blues-dangers-data-mining

Data Mining Page

Establishing a Blog about Data Analysis and Associated Decision Analysis

I have for a long time been interested in data analysis and decision analysis, partly in connection with my work as a physician and psychiatrist and partly in connection with various other interests that I have.

I have a large amount of material related to data analysis and decision analysis on my computer or accessible by way of my computer, and I often work on data analytic and decision analytic projects.

I have now decided to establish a WordPress blog about data analysis and associated decision analysis. I realize that the likelihood of anybody reading the blog is minimal. In February 2014 there were 75.8 million WordPress blogs in existence in addition to hundreds of millions on other blogging services. The likelihood of anybody finding the blog and finding it of sufficient interest to actually read it is less than the likelihood of finding a message in a bottle or the proverbial needle in a haystack. Search engines would of course function analogously to loupes or magnets.

Message in a bottle buried in sand
Needle in the haystack

Nevertheless, writing the blog will be of value to me. It will help me to organize the material and thoughts that I have about data analysis and associated decision analysis and the possibility that somebody might read the blog with critical eyes will make me endeavor to improve the quality of my thinking and writing. It will discipline me.

Establishing a Blog …-Page

The Rising Data Wave

A large and rapidly increasing amount of data is being generated and collected in connection with every human activity. A large part of this data is stored and can be accessed on the Internet. The data stores may be very large and may contain very large datasets. The ability to analyze these datasets, converting their information content into the knowledge, understanding and insight necessary to make decisions and implement the decisions by corresponding actions, is becoming increasingly important. This ability has until recently been limited to large organizations and institutions using large computers – even supercomputers. The development of relatively inexpensive personal computers with increasing computing power and the development of a number of relatively inexpensive and effective software packages have made it possible for ordinary people to analyze large datasets.

The availability of a large amount of data and effective software is not enough to derive reliable knowledge from the information in the data and to make reliable decisions as well as to take effective actions. In order to do this it is necessary to adhere to the stages in an orderly data analytic process and execute the process in a competent manner.

The Rising Data Wave – Page