Category Archives: Data Analysis Software

Correlation does not imply causation

When I began to use the new generation of data analysis and visualization software, such as Tableau, I thought I would first use it to address some of the most important problems facing humanity, like Resource Scarcity, Inequality, Poverty, Human Migration, Refugees, …

I have found large amounts of data relevant to these problems published on the Internet by various organizations and institutions, like the United Nations, the World Bank, the World Health Organization, … The data usually take the form of data tables with countries, regions, and locations as rows, time periods as rows or columns, and variables as columns.

The data have been collected in surveys. Their completeness and reliability are uncertain and variable.

The presentations of the data in the worksheets and dashboards of Tableau workbooks are very fine, and I have no doubt that such presentations can increase viewers' knowledge and understanding of the problems. But in order to solve a problem it is necessary to identify, and then eliminate or minimize, its cause or causes.

The presentations can be seductive. Viewers may be tempted to identify causes by calculating correlations between the variables in the data and assuming that correlations imply causation.

Statisticians know that correlation does not imply causation. What does this mean? Correlation is a measure of how closely two things are related. You may think of it as a number describing how one thing tends to change when the other changes, with 1 indicating a perfect positive relationship between two sets of numbers, -1 a perfect negative relationship, and 0 no linear relationship at all. “Correlation does not imply causation” means that just because two things correlate, it does not follow that one causes the other. Although this is an important fact, most people do not take it sufficiently into account. Their preconceptions tempt them to leap from correlation to causation without sufficient evidence.
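
To make the scale of the correlation coefficient concrete, here is a minimal sketch of my own (made-up data, not part of the sources quoted below) that computes Pearson's r in Python for a perfectly increasing pair, a perfectly decreasing pair, and an unrelated pair:

    import numpy as np

    x = np.arange(10, dtype=float)        # 0, 1, ..., 9
    y_up = 2.0 * x + 1.0                  # increases exactly with x
    y_down = -0.5 * x + 3.0               # decreases exactly with x
    rng = np.random.default_rng(0)
    y_noise = rng.normal(size=x.size)     # unrelated random noise

    # np.corrcoef returns a correlation matrix; element [0, 1] is r.
    print(np.corrcoef(x, y_up)[0, 1])     # +1.0
    print(np.corrcoef(x, y_down)[0, 1])   # -1.0
    print(np.corrcoef(x, y_noise)[0, 1])  # near 0; the exact value varies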

This can result in absurd and ridiculous causal claims. Tyler Vigen has recently published the second edition of his book “Spurious Correlations” (May 8, 2015).

http://www.amazon.com/gp/product/0316339431/ref=as_li_tl?ie=UTF8&camp=211189&creative=373489&creativeASIN=0316339431&link_code=as3&tag=tylervicom-20&linkId=UO6I3ENRRQUF255J

He has designed software that scours enormous data sets to find spurious statistical correlations. In the Introduction to the book he says:

“Humans are biologically inclined to recognize patterns….Does correlation imply causation? It’s intuitive, but it’s not always true. …Correlation, as a concept, means strictly that two things vary together…(but) Correlations don’t always make sense.

Provided enough data, it is possible to find things that correlate even when they shouldn’t. The method is often called “data dredging.” Data dredging is a technique used to find something that correlates with one variable by comparing it to hundreds of other variables. Normally scientists first hypothesize about a connection between two variables before they analyze data to determine the extent to which that connection exists.

Instead of testing individual hypotheses, a computer program can data dredge by simply comparing every dataset to every other dataset. Technology and data collection in the twenty-first century makes this significantly easier….This is the world of big data and big correlations….

Despite the humor, this book has a serious side. Graphs can lie, and not all correlations are indicative of an underlying causal connection. Data dredging is part of why it is possible to find so many spurious relationships….Correlations are an important part of scientific analysis, but they can be misleading if used incorrectly.”

Vigen, Tyler. Spurious Correlations. Hachette Books. Kindle Edition. May 2015.
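
As an illustration of the data dredging Vigen describes, here is a minimal sketch (entirely synthetic data, my own code rather than Vigen's software) that correlates one variable against hundreds of unrelated random candidates and reports the strongest hits:

    import numpy as np

    rng = np.random.default_rng(42)
    n_obs, n_candidates = 30, 500

    target = rng.normal(size=n_obs)                      # the variable of interest
    candidates = rng.normal(size=(n_candidates, n_obs))  # unrelated "datasets"

    # Correlate the target with every candidate and rank by absolute r.
    r = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])
    strongest = np.argsort(-np.abs(r))[:5]
    for i in strongest:
        print(f"candidate {i:3d}: r = {r[i]:+.2f}")
    # With 500 unrelated candidates, correlations of roughly |r| = 0.5
    # appear by chance alone, even though nothing causes anything here.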

Why is it that people are so easily seduced into assuming that correlation implies causation? Vigen states: “Humans are biologically inclined to recognize patterns”. This reminds me of a blog post on “Science or not” by Graham Coghill called “Confusing correlation with causation: rooster syndrome”.

http://scienceornot.net/2012/07/05/confusing-correlation-with-causation-rooster-syndrome

He quotes: “The rooster crows and the sun rises.”

And then he says: “This is the natural human tendency to assume that, if two events or phenomena consistently occur at about the same time, then one is the cause of the other. Hence “rooster syndrome”, from the rooster who believed that his crowing caused the sun to rise….

We have an evolved tendency to believe in false positives – when event B follows soon after event A, we assume A was the cause of B, even if this is untrue. In evolution, such beliefs are harmless, whereas the belief that A is not the cause of B when it actually is (false negative) can be fatal. Michael Shermer explains: “For example, believing that the rustle in the grass is a dangerous predator when it is only the wind does not cost much, but believing that a dangerous predator is the wind may cost an animal its life.”

Michael Shermer wrote an article in Scientific American with the title “Patternicity: Finding Meaningful Patterns in Meaningless Noise”.

http://www.scientificamerican.com/article/patternicity-finding-meaningful-patterns/


He says: “Why do people see faces in nature, interpret window stains as human figures, hear voices in random sounds generated by electronic devices or find conspiracies in the daily news? A proximate cause is the priming effect, in which our brain and senses are prepared to interpret stimuli according to an expected model.

Is there a deeper ultimate cause for why people believe such weird things? There is. I call it “patternicity,” or the tendency to find meaningful patterns in meaningless noise. Traditionally, scientists have treated patternicity as an error in cognition. A type I error, or a false positive, is believing something is real when it is not (finding a nonexistent pattern). A type II error, or a false negative, is not believing something is real when it is (not recognizing a real pattern—call it “apatternicity”).

In my 2000 book How We Believe (Times Books), I argue that our brains are belief engines: evolved pattern-recognition machines that connect the dots and create meaning out of the patterns that we think we see in nature. Sometimes A really is connected to B; sometimes it is not. When it is, we have learned something valuable about the environment from which we can make predictions that aid in survival and reproduction.”

When data are collected in a non-random, uncontrolled survey, it is very hazardous to base decisions and actions on the assumption that correlation implies causation. It is impossible to know which correlations correspond to causation with high probability and which are spurious. And it is impossible to estimate the risks associated with decisions and actions based on that assumption.

Correlations between variables calculated from data collected in a non-random, uncontrolled survey cannot be used for anything but stating hypotheses, which must then be tested in statistically sound research.
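
A minimal sketch of that discipline, again with synthetic data of my own: a correlation “found” by dredging one sample is treated only as a hypothesis and is re-tested on an independent sample before any conclusion is drawn.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Exploratory sample: dredge 200 unrelated candidates for the strongest r.
    target = rng.normal(size=25)
    candidates = rng.normal(size=(200, 25))
    r_explore = np.array([stats.pearsonr(target, c)[0] for c in candidates])
    best = int(np.argmax(np.abs(r_explore)))
    print(f"exploratory sample: candidate {best}, r = {r_explore[best]:+.2f}")

    # Confirmatory sample: new, independent observations of the same pair.
    new_target = rng.normal(size=25)
    new_candidate = rng.normal(size=25)
    r_confirm, p_value = stats.pearsonr(new_target, new_candidate)
    print(f"confirmatory sample: r = {r_confirm:+.2f}, p = {p_value:.2f}")
    # The dredged relationship almost never replicates in the fresh sample.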

Tableau and Dynamical System Models

Kenneth Black has a post on his blog

http://3danim8.wordpress.com/2014/04/26/a-look-back-at-my-mmr-groundwater-modeling-work-circa-2003/

about analyzing and visualizing dynamical system models.

He writes:

“One of my biggest regrets is that I had to work the first 20 years of my career without Tableau desktop software.  Looking back, I can see how so many computer codes I designed and wrote, or software teams that I directed, were necessary because we didn’t have a tool like Tableau to help us perform our analysis…..

There were capable tools we used …, but much of the quantitative analysis had to be programmed on a case-by-case basis.  If I had Tableau throughout my career, things would have been much easier, more insights would have been possible, and better models could have been built.

Now that Tableau is here, I’d like to take another crack at analyzing some of my earlier work.”

I had a similar experience, and in a comment on his post I wrote:

“I found this post very interesting. It reminded me of how I became interested in dynamical systems. I happened to come across an article about a hydrologic model of the Okefenokee Swamp in Georgia. The article contained differential equations representing the model. I developed various dynamical models of the swamp in various programs like STELLA, Vensim, Mathematica, and even LiveMath (which for some reason still is close to my heart). When I later became chief of Acute Psychiatric Services at a Psychiatric Hospital I developed a lot of models of health services, especially acute psychiatric services, in the hope that these models would increase my understanding and the understanding of health service administrators and politicians and thus lead to an improvement in the services. This however did not happen and has still not happened. Now, STELLA has quite good presentation methods, so I thought that the fact that administrators and politicians did not see the light of day immediately was an indication that they were not really interested in solving the serious problems of the acute psychiatric health services in Norway but mainly interested in decreasing the cost of those services. This is surely at least partly true. After reading your post I began to wonder whether it might partly be due to their inability to gain the necessary insight and understanding from the STELLA presentations and my comments on them. Perhaps it would be possible to increase their insight and understanding by presenting the underlying data and the data from simulations in Tableau Workbooks. Perhaps the clarity of these presentations and their availability to the public would make it impossible for them to avoid doing something about the problems.”
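
As a small illustration of the kind of model referred to in that comment, here is a minimal sketch of my own (not one of the STELLA models mentioned) of a single stock-and-flow equation, dS/dt = inflow - k*S, simulated with SciPy and written to a CSV file that Tableau can connect to; the parameter values are hypothetical:

    import numpy as np
    import pandas as pd
    from scipy.integrate import solve_ivp

    inflow, k = 10.0, 0.25                # hypothetical inflow and outflow rate

    def stock_flow(t, s):
        # Rate of change of the stock S at time t.
        return [inflow - k * s[0]]

    t_eval = np.linspace(0, 40, 201)
    sol = solve_ivp(stock_flow, (0, 40), y0=[0.0], t_eval=t_eval)

    # One row per time step: the long form Tableau handles most naturally.
    pd.DataFrame({"time": sol.t, "stock": sol.y[0]}).to_csv(
        "stock_flow_simulation.csv", index=False
    )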

Poverty and Inequality

I have started to work on a project which I call “Poverty and Inequality”. I am doing this mainly because those are among the most important and terrible problems facing humanity, as shown in the following slide deck, but also in preparation for reading “Capital” by Piketty, which focuses on inequality.

The data are derived from the World Development Indicators. I have selected All Countries, twenty-two Series relevant to the project, and twelve Years, from 2001 to 2012. The resulting data table has been reshaped and pivoted to give it a form optimal for analysis and visualization in Tableau. I have defined parameters and calculated fields that can be used to parametrize measures and thereby simplify exploration of the data table. I have made sheets to display parametrized Tables, Bar Charts, Histograms, Histograms with overlaid Normal curves having the same means and standard deviations, Box Plots, Tree Maps, Bubble Charts, Symbol Maps, Filled Maps, Scatter Plots, and Scatter Plot Matrices. It is much easier to select measures for these sheets from a list than to make a separate sheet for each measure.
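
The reshaping step can be sketched in a few lines of Python (the column names and placeholder values below are hypothetical, not the actual WDI download): a wide table with one column per year is melted into one row per Country, Series, and Year, the form Tableau analyzes most easily.

    import pandas as pd

    wide = pd.DataFrame({
        "Country Name": ["CountryA", "CountryB"],
        "Series Name": ["IndicatorX", "IndicatorX"],
        "2001": [1.0, 2.0],       # placeholder values, not real WDI figures
        "2012": [1.5, 2.5],
    })

    # One row per country, series, and year.
    long = wide.melt(
        id_vars=["Country Name", "Series Name"],
        var_name="Year",
        value_name="Value",
    )
    print(long)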

The data are derived from a census of all countries in the world. It is difficult to estimate their reliability (accuracy, consistency, stability). It is obviously not possible to calculate a reliability coefficient for the data; it is, for example, not possible to apply test-retest methodology. It is also difficult to estimate the validity of the data: do the World Development Indicators measure what we want to measure and what we think we are measuring?

Poverty is one of the World Bank's topics, with the following website:

http://www.worldbank.org/en/topic/poverty

Associated page: Poverty and Inequality

Tableau Integration with Trifacta

Trifacta has announced deep integration with Tableau. Tableau users now have the option of writing the output of Trifacta data transformations directly to a Tableau Data Extract format.

Trifacta provides Tableau users with an intuitive Data Transformation Platform for Hadoop so that they can more efficiently transform and analyze common data formats in Hadoop.

The integration between Trifacta and Tableau removes a key barrier between the raw, semi-structured data commonly stored in Hadoop and the self-service process for analyzing, visualizing and sharing of insights provided by Tableau.

Working with big data poses specific challenges. The most significant barriers come from structuring, distilling and automating the transfer of data from Hadoop.

Water Shortage

One of the most important problem areas for humanity is water, and especially freshwater. The available amount of freshwater is decreasing, the problems caused by this are increasing, and the risk associated with freshwater shortage is growing.

“Water scarcity is one of the defining issues of the 21st century. …In its Global Risks 2013 Report, the World Economic Forum identified water supply crises as one of the highest impact and most likely threats facing the planet.”

The World Economic Forum ranks water supply crises as being more likely and having a greater impact globally than the risk of food shortage crises, terrorism, cyber attacks, and geophysical destruction.

Freshwater supply crises are clearly of great importance, and it is imperative to increase knowledge about them by generating new data through research, deriving new knowledge from those data, and spreading that knowledge. It is also imperative to apply the knowledge by making decisions and implementing them through actions designed to decrease the risk of water crises.

With this in view, I have been working on a water shortage project using data from AQUASTAT and Tableau Software, in the hope that I may contribute to an increase in knowledge about such crises and thereby help decrease the risk of their occurrence.

You may read about this project under

Freshwater Supply Crises

Water Shortage

Tableau Software

During recent weeks and months I have been studying data and decision analysis in general, and especially the use of Tableau Software to apply data analysis to large datasets in order to transform the data into knowledge that can then be used to answer important questions, solve important problems, make decisions, and implement them through effective actions.

Tableau Software has its roots in a Stanford University Computer Science Department research project aimed at increasing people's ability to analyze data rapidly. Tableau's main approach to visual analysis is to connect to a data source and drag data fields onto its workspace.

Tableau Desktop is a software package for data analysis. It’s easy to learn, easy to use, and extremely fast. It allows you to use your natural ability to see patterns, identify trends and discover visual insights.

You can connect to data and perform queries without writing a single line of code. You can follow your natural train of thought as you shift between views with drag-and-drop technology.

You can connect directly to data for live, up-to-date data analysis or extract data into Tableau’s fast data engine and take advantage of breakthrough in-memory architecture, or do both, for 2, 3, or even 10 different data sources and blend them all together. Tableau has a large number of native connectors to data sources.  A list of connectors can be viewed at

http://www.tableausoftware.com/products/desktop?qt-product_tableau_desktop=1#qt-product_tableau_desktop

Multiple views can be combined into interactive dashboards. Data can be filtered and highlighted to show relationships between variables. Content can be shared using the web-based Tableau Server or Tableau Online. Content can also be embedded into website pages, including blogs.

Tableau has powerful analytical tools. You can filter data dynamically, split trends across different categories, or run an in-depth cohort analysis. You can double-click geographic fields to put data on a map. In addition, Tableau can be integrated with R.

You can go deeper into your data with new calculations on existing data. You can make one-click forecasts, build box plots, and see statistical summaries of your data. You can run trend analyses, regressions, correlations, …
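
These statistics can also be reproduced outside Tableau as a cross-check. Here is a minimal sketch of my own, with made-up data, that recomputes the slope, intercept, r, and p behind a simple linear trend line:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x = np.linspace(0, 10, 50)
    y = 3.0 * x + 2.0 + rng.normal(scale=2.0, size=x.size)  # noisy linear data

    trend = stats.linregress(x, y)
    print(f"slope = {trend.slope:.2f}, intercept = {trend.intercept:.2f}")
    print(f"r = {trend.rvalue:.2f}, p = {trend.pvalue:.3g}")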

There is a large amount of material on Tableau and its application to data analysis available on the Tableau website (http://www.tableausoftware.com/), in blogs, and in a number of books. Some of these books are available on Kindle.

Tableau is an ideal analysis and visualization tool in that it possesses the following attributes:
Simplicity – easy for non-technical users to master
Connectivity – seamlessly connects to a large variety of data sources
Visual competence – provides appropriate graphics
Sharing – facilitates sharing of knowledge, understanding and insight
Scaling – handles large data sets

Data Analysis Software

The recent tidal wave of data has given rise to the development of a large number of software programs relevant to the analysis of the data. From a long list of programs I have chosen the following for my own use:

  1. Tableau
  2. DataDesk
  3. StatCrunch
  4. BestView – Addon to Mathematica
  5. Mathematica
  6. R
  7. ParallAX
  8. NeuroSolutions
  9. Gephi
  10. Ayasdi

Some of these are pre-existing programs that have been adapted to the requirements of big data; some, such as Tableau and Ayasdi, are new. The programs I have chosen are not necessarily the best for everyone, but they are the best for my present needs.

Data Analysis Software – Page

Data and Decision Analytic Process

The data and decision analytic process is a path leading from the larva of data to the butterfly of knowledge, understanding, and insight.

Before starting to work on data analytic and associated decision analytic projects it is necessary, in order to ensure the quality of the results, to

  1. define an orderly data analytic and decision analytic process
  2. select methods for executing the process
  3. select software packages for implementing the methods

In order to ensure the reliability of the answers/solutions and the quality of the decisions made and actions taken, it is necessary to adhere to the analytic process in an orderly manner and to apply the methods and the software packages in a competent manner.

Before starting an analytic process it is necessary to state the question/problem under consideration and ask the following preliminary questions:

  1. Is the answer/solution considered known?
  2. Is the answer/solution based on sufficiently recent/reliable data?
  3. Was the analysis performed in a competent/reliable manner?
  4. Are the results of the analysis presented/visualized in such a way that they sufficiently increase the understanding and insight of the target group?
  5. Do the results of the analysis, their presentation/visualization, and the resulting understanding and insight form a sufficiently firm basis for decision making and action?

If the answer to any of these questions is no, there may be a reason to go ahead with the analytic and decision analytic process. If all the answers are yes, it is unnecessary to go ahead with the process unless you are confident that you can improve the results materially or introduce your particular results to a new or wider audience. But beware of hubris.

The main stages of a combined data analytic and decision analytic process are listed below; a code sketch of the analysis stages follows the list:

  1. State an important question/problem
  2. Data analysis
    1. Select data relevant to answering the questions or solving the problems
    2. Prepare the data for analysis. Employ visualization during preparation
    3. Analyze the data – Increase knowledge about the past, present, and future state of the system generating the data – Increase knowledge about individual variables and the relationship between variables. Employ visualization extensively during analysis
      1. Descriptive data analysis
      2. Exploratory data analysis
      3. Confirmatory data analysis
      4. Predictive data analysis
    4. Present/visualize the results of the analysis
    5. Evaluate the results of the analysis – Have the original questions been answered?
  3. Decision analysis
    1. Make decisions based on the results of the analysis
    2. Implement decisions – Act
    3. Present/visualize the results of the actions
    4. Evaluate the results of the actions – Have the problems originally posed been solved?
  4. Reiterate the process or its individual stages as necessary
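
A minimal code skeleton of the analysis stages (2.3.1 to 2.3.4) above; the function names are my own placeholders, and the stubs stand in for whichever tool (Tableau, R, …) is actually used:

    import pandas as pd

    def descriptive(df: pd.DataFrame) -> pd.DataFrame:
        # Summaries of individual variables: counts, means, spreads.
        return df.describe()

    def exploratory(df: pd.DataFrame) -> pd.DataFrame:
        # Relationships between variables, e.g. a correlation matrix.
        return df.corr(numeric_only=True)

    def confirmatory(df: pd.DataFrame) -> None:
        # Formal tests of hypotheses raised during exploration (stub).
        pass

    def predictive(df: pd.DataFrame) -> None:
        # Models forecasting future states of the system (stub).
        pass

    def analyze(df: pd.DataFrame) -> None:
        # Stages 2.3.1 to 2.3.4, run in order; visualize at every step.
        print(descriptive(df))
        print(exploratory(df))
        confirmatory(df)
        predictive(df)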

Data and Decision Analytic Process – Page

Books about Data Analysis

In order to become proficient in data analysis it is necessary to study the theory and practice of data analysis.

A large number of books have been written about data analysis, especially after the data deluge in recent years and the appearance of relatively inexpensive and effective data analysis software.

I have read or at least skimmed through a considerable number of these books and am keeping them on hand as reference books when I work on data analysis projects.

A list of these books can be found on the page Data Analysis Books.

From these books I have learned that data analysis is a process consisting of a sequence of stages beginning with data and ending in knowledge derived from the information in the data.

This knowledge may then be used in making decisions and implementing them by corresponding actions.

The books place different emphasis on the different steps in the data analytic process. Some emphasize the data end, some the analytic middle, some the visualization of the data and the results of the analysis.

The quality of the books is quite varied. All of them contain something of value and can be used for reference. Some of them I find of special interest and have selected for thorough reading. These are:

      1. Data Just Right – Manoochehri
      2. Making Sense of Data – Myatt
      3. The Visual Display of Quantitative Information – Tufte
      4. Visual Statistics – Seeing Data with Dynamic Interactive Graphics – Young et alia
      5. Tableau Your Data – Murray
      6. DataDesk Manual
      7. Parallel Coordinates – Inselberg
      8. Modeling Techniques in Predictive Analytics – Miller
      9. Predictive Analytics for Dummies – Bari, Chaouchi, Jung
      10. R for Dummies – Meys, de Vries

These books are not necessarily the best for all but they are the best for fulfilling my present needs.

Data Analysis Books – Page