Category Archives: Data Analysis

Poverty and Inequality

I have started work on a project that I call “Poverty and Inequality”. I am doing this mainly because these are among the most important and most terrible problems facing humanity – as shown in the following slide deck – but also in preparation for reading Thomas Piketty’s “Capital in the Twenty-First Century”, which focuses on inequality.

The data is derived from the World Development Indicators. I have selected all countries, twenty-two series relevant to the project, and the twelve years from 2001 to 2012. The resulting data table has been datashaped and pivoted to give it a form suited to analysis and visualization in Tableau. I have defined parameters and calculated fields that can be used to parametrize measures and thereby simplify exploration of the data table. I have made sheets that display parametrized tables, bar charts, histograms, histograms overlaid with normal curves having the same means and standard deviations, box plots, tree maps, bubble charts, symbol maps, filled maps, scatter plots, and scatter plot matrices. Selecting a measure for these sheets from a list is much easier than making a separate sheet for each measure.
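As a rough sketch of what this parametrization buys – one “selected measure” chosen by name instead of one sheet per measure – here is a Python/pandas analogue (the series names and values below are hypothetical, not my actual workbook fields):

```python
import pandas as pd

# Hypothetical pivoted WDI table: one row per country and year,
# one column per series.
data = pd.DataFrame({
    "Country": ["Denmark", "Denmark", "India", "India"],
    "Year": [2011, 2012, 2011, 2012],
    "GINI index": [28.1, 28.5, 35.1, 35.2],
    "Poverty headcount ratio": [0.5, 0.6, 23.6, 21.9],
})

def selected_measure(df: pd.DataFrame, measure: str) -> pd.Series:
    """Analogue of a Tableau parameter driving a calculated field:
    the sheet shows whichever measure the parameter names."""
    return df[measure]

data["Selected Measure"] = selected_measure(data, "GINI index")
print(data[["Country", "Year", "Selected Measure"]])
```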

The data is derived from a census of all countries in the world. It is difficult to estimate its reliability (accuracy, consistency, stability), and it is obviously not possible to calculate a reliability coefficient for the data – the test-retest methodology, for example, cannot be applied. It is also difficult to estimate the validity of the data: do the World Development Indicators measure what we want to measure and what we think we are measuring?
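For illustration only: if the same indicator could be collected twice under comparable conditions – which is exactly what this kind of census data does not allow – the test-retest reliability coefficient would simply be the correlation between the two administrations. A minimal sketch in Python, with made-up scores:

```python
import numpy as np

# Hypothetical: the same indicator measured twice for ten countries.
first_administration = np.array([12.1, 8.4, 30.2, 15.5, 22.0,
                                 9.8, 18.3, 25.1, 11.0, 14.7])
second_administration = np.array([11.8, 9.0, 29.5, 16.1, 21.4,
                                  10.2, 17.9, 25.8, 10.5, 15.0])

# Test-retest reliability: Pearson correlation between the two rounds.
reliability = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability coefficient: {reliability:.3f}")
```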

Poverty is one of the World Bank’s topics, with the following website:

http://www.worldbank.org/en/topic/poverty

Associated page: Poverty and Inequality

Reliability of Confirmatory Exploratory Data Analysis

Scientific studies using confirmatory exploratory data analysis that limit tests of significance to the calculation of P values have come under increasing criticism. Many such studies have proved impossible to replicate, and their conclusions can therefore not be considered true. So how reliable are the results of confirmatory exploratory data analytic studies for supporting decisions and actions?
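A small simulation makes the concern concrete. In the Python sketch below, every “study” compares two samples drawn from the same distribution, so no real effect exists – yet about five percent of the studies still reach p < 0.05, and a literature that reports only its significant P values will therefore inevitably contain findings that cannot be replicated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_studies, n_per_group = 10_000, 30
false_positives = 0

for _ in range(n_studies):
    # Both groups drawn from the SAME distribution: the null is true.
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# Expect roughly 5% "significant" findings even though no effect exists.
print(f"{false_positives / n_studies:.1%} of null studies had p < 0.05")
```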

My P value is smaller than your P value

I have written an associated page on this subject.

Reliability of Confirmatory Exploratory Data Analysis

Reliability of Basic Exploratory Data Analysis – 2

In a post written on the 27th of October on basic exploratory data analysis, I expressed concern about the reliability of its results for supporting decisions and actions if it is improperly applied.

If improperly done, basic exploratory data analysis reminds me of Munchhausen’s journey to the moon. I am also reminded of Buddha’s parable about blind men trying to describe an elephant.

Elephant and Blind Indian

There is more about this in the associated page:

Reliability of Basic Exploratory Data Analysis – 2

Reliability of Basic Exploratory Data Analysis – 1

Lately I have been wondering about the reliability of the results of the various types of exploratory data analysis for supporting decisions and actions. How confident can we be in the decisions and actions based on these results, and how willing are we to take responsibility for the risks associated with them?

I have composed a page with the same name as this post, where I discuss these questions.

Reliability of Basic Exploratory Data Analysis – 1

Perhaps I should have named this post and its associated page:

Basic Exploratory Data Analysis and Munchhausen’s Journey to the Moon

 

Munchhausen and the Moon

From World Development Indicators to Tableau

Some of my data analytic projects require data from the World Bank’s “World Development Indicators” (WDI). To simplify the process of getting the data from WDI into Tableau, I have developed a procedure for doing so.

This procedure is described in the following related set of pages:

From World Development Indicators to Tableau

Datashaping an Excel Table

Pivoting an Excel Datatable

Transforming Data Within Tableau

Kenneth Black has an excellent post related to datashaping an Excel datatable.

Karunaker Molugu (http://got-data.blogspot.com) has kindly shown me on the Tableau Forum how to pivot an Excel datatable, as well as two methods of transforming data within Tableau.
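The pivoting step can also be scripted outside of Excel. Here is a minimal sketch in Python with pandas; the file name and column labels are assumptions about a typical WDI export (one column per year), not the exact layout used in the pages above:

```python
import pandas as pd

# Assumed layout of a WDI export: one row per country and series,
# with one column per year (wide format).
wide = pd.read_excel("wdi_export.xlsx")  # hypothetical file name

# Pivot (melt) the year columns into rows: the long format
# that Tableau handles best.
long_form = wide.melt(
    id_vars=["Country Name", "Country Code", "Series Name"],
    var_name="Year",
    value_name="Value",
)
long_form["Year"] = long_form["Year"].astype(int)

long_form.to_csv("wdi_long.csv", index=False)  # ready to connect from Tableau
```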

Tableau Integration with Trifacta

Trifacta has announced deep integration with Tableau. Tableau users now have the option of writing the output of Trifacta data transformations directly to a Tableau Data Extract format.

Trifacta provides Tableau users with an intuitive Data Transformation Platform for Hadoop so that they can more efficiently transform and analyze common data formats in Hadoop.

The integration between Trifacta and Tableau removes a key barrier between the raw, semi-structured data commonly stored in Hadoop and the self-service process for analyzing, visualizing and sharing of insights provided by Tableau.

Working with big data poses specific challenges. The most significant barriers come from structuring, distilling and automating the transfer of data from Hadoop.

Water Shortage

One of the most important problem areas for humanity is water, and especially freshwater. The available amount of freshwater is decreasing, the problems caused by this are increasing, and the risk associated with freshwater shortage is growing.

“Water scarcity is one of the defining issues of the 21st century. …In its Global Risks 2013 Report, the World Economic Forum identified water supply crises as one of the highest impact and most likely threats facing the planet.”

The World Economic Forum ranks water supply crises as being more likely and having a greater impact globally than the risk of food shortage crises, terrorism, cyber attacks, and geophysical destruction.

Freshwater supply crises are clearly of great importance, and it is imperative to increase knowledge about them – by generating new data through research, generating new knowledge from that data, and spreading this knowledge. It is equally imperative to apply the knowledge to making decisions and to implement those decisions through actions designed to decrease the risk of water crises.

In view of this I have been working on a water shortage project using data from AQUASTAT and Tableau Software, in the hope that I may contribute to an increase in knowledge about such crises and thereby decrease the risk of their occurrence.

You may read about this project under:

Freshwater Supply Crises

Water Shortage

Tableau Software

During recent weeks and months I have been studying data and decision analysis in general, and especially the use of Tableau Software for applying data analysis to large datasets. The aim is to transform the data into knowledge that can then be used to answer important questions, solve important problems, make decisions, and implement them through effective actions.

Tableau Software has its roots in a Stanford University Computer Science Department research project aimed at increasing people’s ability to rapidly analyze data. Tableau’s main approach to visual design is to connect to a data source and drag data fields onto its workspace.

Tableau Desktop is a software package for data analysis. It’s easy to learn, easy to use, and extremely fast. It allows you to use your natural ability to see patterns, identify trends and discover visual insights.

You can connect to data and perform queries without writing a single line of code. You can follow your natural train of thought as you shift between views with drag-and-drop technology.

You can connect directly to data for live, up-to-date analysis, or extract data into Tableau’s fast data engine and take advantage of its breakthrough in-memory architecture – or do both, for two, three, or even ten different data sources, blending them all together. Tableau has a large number of native connectors to data sources. A list of connectors can be viewed at

http://www.tableausoftware.com/products/desktop?qt-product_tableau_desktop=1#qt-product_tableau_desktop

Multiple views can be combined into interactive dashboards. Data can be filtered and highlighted to show relationships between variables. Content can be shared using the web-based Tableau Server or Tableau Online. Content can also be embedded into website pages, including blogs.

Tableau has powerful analytical tools. You can filter data dynamically, split trends across different categories, or run an in-depth cohort analysis. You can double-click geographic fields to put data on a map. In addition, Tableau can be integrated with R.

You can go deeper into your data with new calculations on existing data. You can make one-click forecasts, build box plots, and see statistical summaries of your data. You can run trend analyses, regressions, and correlations.
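As a rough illustration of the statistics behind such a trend line – not Tableau’s own implementation – a least-squares fit and a correlation take only a few lines of Python:

```python
import numpy as np

# Hypothetical yearly values of some measure.
years = np.array([2001, 2002, 2003, 2004, 2005, 2006])
values = np.array([3.1, 3.4, 3.3, 3.9, 4.2, 4.4])

slope, intercept = np.polyfit(years, values, deg=1)  # linear trend
r = np.corrcoef(years, values)[0, 1]                 # correlation

print(f"Trend: {slope:.3f} per year, intercept {intercept:.1f}, r = {r:.3f}")
```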

There is a large amount of material on Tableau and its application to data analysis available on the Tableau website (http://www.tableausoftware.com/), in blogs, and in a number of books. Some of these books are available on Kindle.

Tableau is an ideal analysis and visualization tool in that it possesses the following attributes:
Simplicity – easy for non-technical users to master
Connectivity – seamlessly connects to a large variety of data sources
Visual competence – provides appropriate graphics
Sharing – facilitates sharing of knowledge, understanding and insight
Scaling – handles large data sets

Data Mining

Dr. Saed Sayad has published a fine data mining map called “An Introduction to Data Mining”. The URL is http://www.saedsayad.com.

The map shows the stages and substages of the data mining process, with a wealth of information about the relevant methods.

Dr. Sayad supplies the following information about himself:

“I have more than 20 years of experience in data mining, statistics and artificial intelligence and designed, developed and deployed many business and scientific applications of predictive modeling. I am a pioneer researcher in real time data mining, an adjunct Professor at the University of Toronto, and have been presenting a popular graduate data mining course since 2001.”

Dr. Sayad has written an excellent book called “Real Time Data Mining”. His description of the content of the book is an excellent characterization of data mining.

“Data mining is about explaining the past and predicting the future by exploring and analyzing data. Data mining is a multi-disciplinary field which combines statistics, machine learning, artificial intelligence and database technology. Although data mining algorithms are widely used in extremely diverse situations, in practice, one or more major limitations almost invariably appear and significantly constrain successful data mining applications. Frequently, these problems are associated with large increases in the rate of generation of data, the quantity of data and the number of attributes (variables) to be processed: Increasingly, the data situation is now beyond the capabilities of conventional data mining methods. The term Real Time is used to describe how well a data mining algorithm can accommodate an ever increasing data load instantaneously. Upgrading conventional data mining to real time data mining is through the use of a method termed the Real Time Learning Machine or RTLM. The use of the RTLM with conventional data mining methods enables Real Time Data Mining. The future of predictive modeling belongs to real time data mining and the main motivation in authoring this book is to help you to understand the method and to implement it for your applications.”
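The book should be consulted for the RTLM itself. As a minimal illustration of the underlying idea – a model that is updated incrementally as each observation arrives, instead of being refitted on an ever-growing dataset – here is Welford’s online algorithm for the running mean and variance in Python:

```python
class OnlineStats:
    """Welford's algorithm: mean and variance maintained per observation,
    in constant memory - the spirit of real-time (incremental) learning."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new observation into the statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = OnlineStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)  # each point is processed once, as it "arrives"
print(stats.mean, stats.variance)  # 5.0 and about 4.57
```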

The image below illustrates the extraction of data from a bottomless black hole of big data.

http://www.proscoutleadgeneration.com/wp-content/uploads/2014/04/datamining.jpg

Searching for Gold.

http://www.grtcorp.com/content/big-data-blues-dangers-data-mining

Data Mining Page