Reliability of Confirmatory Exploratory Data Analysis

Scientific studies that use confirmatory exploratory analysis and limit tests of significance to the calculation of P values have come under increasing criticism. The P value is neither as objective nor as reliable as many researchers assume. The results of many studies that were reported as statistically significant in terms of P values have not been replicable, so the conclusions of these studies cannot be taken as established. The reliability of confirmatory exploratory studies for supporting decisions and actions has therefore been called into question. Researchers have been overconfident in their results.

Regina Nuzzo. Scientific method: Statistical errors

Geoff Cumming. Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis

Andrew Gelman. Bayesian Data Analysis

Ronald Fisher introduced the P value in the 1920s. He intended it as an informal way to judge whether the evidence provided by some data was significant, for example in the sense of supporting decisions and actions. The basic idea was to run an experiment, collect the data it generated, and then ask how probable such data would be if they had arisen by chance alone, comparing that probability P with some predetermined threshold a. A researcher would first define a null hypothesis that he wanted to disprove, for example that there was no difference between the mean of some variable in two groups, or that there was no correlation between the values of two variables describing a group. Then he would assume that this null hypothesis was in fact true and calculate the probability P of getting, by chance, results at least as extreme as those actually observed. The smaller the P value, the stronger the evidence against the null hypothesis.

To make evidence-based decisions and actions as objective as possible, Neyman and Pearson introduced an alternative framework for data analysis that included statistical power, false positives and false negatives. They did not include the P value.
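The procedure of assuming the null hypothesis and asking how often chance alone produces a result at least as extreme can be illustrated with a permutation test. This is a minimal sketch of one possible way to obtain a P value, not necessarily the calculation Fisher or any cited author used. The function name, the number of permutations and the weights are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided P value for a difference in means.

    Null hypothesis: group labels are irrelevant, so any reshuffling of
    the pooled values is as likely as the grouping actually observed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical weights in kg, invented for this example.
weights_a = [70, 72, 68, 75, 71, 69, 73, 74]
weights_b = [80, 82, 79, 85, 81, 83, 84, 86]
print(permutation_p_value(weights_a, weights_b))
```

A small value returned here says only that such a difference would rarely arise from random relabelling. It is not the probability that the null hypothesis is true.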

Other researchers created a system that included both Fisher's P and the Neyman-Pearson parameters. Because of the ease with which P could be calculated, its use increased. A P value of 0.05 became established as the criterion of statistical significance.

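P is often read as the probability that the null hypothesis is true, but that is a misinterpretation. P is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. A low P means that the observed data would be unlikely under the null hypothesis, which counts as evidence against it. It does not mean that the probability of some alternative hypothesis is high, and 1-P is not the probability that a difference greater than 0 exists. Nor does P tell us anything about the size of the difference.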

Suppose the null hypothesis is “there is no difference in the mean weights of two groups of people”. Suppose that the P value is 0.05. Then, if the null hypothesis were true, a difference in mean weights at least as large as the one observed would arise by chance with probability 0.05, and the null hypothesis is rejected at the 0.05 level. This suggests that the difference in mean weights is not 0, but it tells us nothing about how large the difference is. The difference might be very small and thus without any practical significance. The result might be statistically significant but practically insignificant. It might be useless for supporting decisions and actions.
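The gap between statistical and practical significance can be illustrated with a rough sketch. It assumes a two-sample z-test with a known, common standard deviation, which is a simplification chosen for brevity. The difference of 0.2 kg, the standard deviation of 10 kg and the group sizes are hypothetical numbers.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided P value for a difference in means.

    Assumes both groups share a known standard deviation `sd` and have
    `n_per_group` members each (a simplifying assumption of this sketch).
    """
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(z))


# The same tiny difference in mean weight, tested at two sample sizes.
tiny_difference_kg = 0.2   # hypothetical, practically negligible
assumed_sd_kg = 10.0       # hypothetical

for n in (100, 100_000):
    p = two_sample_z_p_value(tiny_difference_kg, assumed_sd_kg, n)
    print(f"n per group = {n:>7}: P = {p:.6f}")
```

Under these assumptions the difference is far from significant with 100 people per group and clearly significant with 100,000 per group, although its size, and so its practical importance, is unchanged.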

How can the practical significance of the difference in mean weights be estimated? Statisticians have pointed to a number of measures that might help. One way of doing this is to estimate the probability of a real effect.

Cumming thinks that to avoid the trap of reporting results as significant or not significant, researchers should always report effect sizes and confidence intervals. These convey what a P value does not: the magnitude and relative importance of an effect.
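As a minimal sketch of what reporting an effect size and a confidence interval might look like, the function below returns the difference in means, an approximate confidence interval for it, and Cohen's d. Cohen's d is one common standardized effect size and is used here as an assumption, since the text does not name a specific measure. The interval uses a normal approximation, which is reasonable only for large samples; small samples would call for a t distribution. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def effect_size_summary(group_a, group_b, confidence=0.95):
    """Difference in means (b minus a), approximate CI, and Cohen's d."""
    n_a, n_b = len(group_a), len(group_b)
    difference = mean(group_b) - mean(group_a)
    var_a, var_b = variance(group_a), variance(group_b)  # sample variances

    # Standard error of the difference between two independent means.
    standard_error = sqrt(var_a / n_a + var_b / n_b)
    # Normal-approximation critical value (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)

    # Pooled standard deviation, the denominator of Cohen's d.
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))

    return {
        "difference": difference,
        "ci": (difference - z * standard_error, difference + z * standard_error),
        "cohens_d": difference / pooled_sd,
    }


# Hypothetical measurements, invented for this example.
print(effect_size_summary([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]))
```

The interval shows the range of differences compatible with the data, and d expresses the difference in units of the spread within the groups. A bare P value carries neither piece of information.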

Many statisticians also advocate replacing the P value with methods that take advantage of Bayes’ rule, which describes how to think about probability as the plausibility of an outcome rather than as the potential frequency of that outcome. This entails a certain subjectivity, but the Bayesian framework makes it comparatively easy for observers to incorporate what they know about the world into their conclusions, and to calculate how probabilities change as new evidence arises.
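The way Bayes' rule updates a probability as evidence arrives can be sketched as follows. The example estimates the probability that an effect is real after a statistically significant result, given a prior plausibility, the power of the study and the significance threshold. The prior of 0.1, the power of 0.8 and the threshold of 0.05 are illustrative assumptions, not values taken from the cited authors.

```python
def probability_effect_is_real(prior, power, alpha):
    """Posterior probability of a real effect after one significant result.

    prior: plausibility of a real effect before seeing the data
    power: probability of a significant result if the effect is real
    alpha: probability of a significant result if there is no effect
    """
    true_positive = power * prior
    false_positive = alpha * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Illustrative assumptions: a long-shot hypothesis and a well-powered study.
after_first_study = probability_effect_is_real(prior=0.1, power=0.8, alpha=0.05)

# A second significant result starts from the updated probability.
after_second_study = probability_effect_is_real(
    prior=after_first_study, power=0.8, alpha=0.05
)

print(f"after one significant study:  {after_first_study:.2f}")
print(f"after two significant studies: {after_second_study:.2f}")
```

With these inputs a single result at the 0.05 level leaves the effect well short of certain, and a second significant result raises the probability further. The conclusion depends on the chosen prior, which is the subjectivity mentioned above.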

Gelman advocates “two-stage replication”, also called “preregistered replication”.

In this approach, exploratory and confirmatory analyses are approached differently and clearly labelled. Instead of doing four separate small studies and reporting the results in one paper, for instance, researchers would first do two small exploratory studies and gather potentially interesting findings without worrying too much about false alarms. Then, on the basis of these results, the authors would decide exactly how they planned to confirm the findings, and would publicly preregister their intentions.

They would then conduct the replication studies and publish the results alongside those of the exploratory studies. This approach allows for freedom and flexibility in analyses, says Gelman, while providing enough rigour to reduce the number of false results being published.