The problem with P-values

Valen Johnson has been getting a lot of press this year for his paper Revised standards for statistical evidence. It was highlighted in Nature as Weak statistical standards implicated in scientific irreproducibility, on the ABC as Stringent statistics make better science, and mentioned (twice) in The Australian: Pharmas 'concerned' at low evidentiary bar.

The paper is well-written, and the mathematical appendix clear and useful. However, my initial reading of the paper indicates that some useful prior work was not cited. In particular, I first became aware of the problem with P-values through the writings of Robert Matthews. His 1998 paper Facts versus Factions: The use and abuse of subjectivity in scientific research—which was published in 2000 in Rethinking Risk and the Precautionary Principle (pages 247-282)—cites Pocock and Spiegelhalter (1992), which is also cited in Bayesian Methods in Clinical Trials by Deborah Ashby (2005).

Facts versus Factions addresses some of the same topics as Johnson and also has a mathematical appendix, which I re-worked as a Mathematica Notebook here. There is also a Bayesian Credibility Analysis online calculator, and a nice overview of this topic in Matthews' Bayesian Critique of Statistics in Health: The Great Health Hoax.
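To give a flavour of the kind of calculation involved, here is a minimal sketch. It is not the derivation from Johnson's paper or from Matthews' appendix. It uses a different, well-known result, the Sellke-Bayarri-Berger bound, which says that for p < 1/e the Bayes factor in favour of the null hypothesis can be no smaller than -e p ln(p). The function names and the 50:50 prior are my own assumptions for illustration.

```python
import math


def min_bayes_factor(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor for the null.

    Valid only for 0 < p < 1/e. This is a swapped-in illustration, not the
    method used by Johnson or by Matthews.
    """
    if not 0 < p < 1 / math.e:
        raise ValueError("bound only holds for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def min_posterior_null(p, prior_null=0.5):
    """Smallest posterior probability of the null consistent with the bound.

    prior_null is an assumed prior probability that the null is true
    (0.5 is a hypothetical 'even odds' starting point).
    """
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = min_bayes_factor(p) * prior_odds
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.005, 0.001):
        print(f"p = {p}: Bayes factor >= {min_bayes_factor(p):.3f}, "
              f"P(null | data) >= {min_posterior_null(p):.3f}")
```

Under these assumptions, a result with p = 0.05 still leaves the null hypothesis with a posterior probability of at least roughly 0.29, which is a long way from the "1 in 20" that the p-value is often misread as.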

Update: Nature has just published a very readable news feature by Regina Nuzzo entitled P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume.

As much as one wants to defend the scientific method against pseudo-science, modern medicine is smug in its superiority over chiropractic and alternative medicine. It is sobering that all medical research is based on using p values as the "gold standard".


  1. And talking about p-values in medicine: organizations like Friends of Science in Medicine pursue "evidence-based" medicine using the "scientific method", apparently as a bastion of rationality against CAM pseudoscience. For the most part this involves p-values.

    Another question: is the comparative lack of Bayesian statistics (not to mention data science) underpinning modern treatments due to a lack of relevant research skills and educational infrastructure, or is it instead down to perceptions of these methods as "less friendly, less rational"?

  2. Also, often overlooked among the widespread misconceptions and misinterpretations of p-values is the role of the technology used to generate them: the influence of interfaces and languages, and in particular the role computation plays in selecting (or even designing) appropriate statistical tests.