My best effort searching online dates the use of the term “crisis” to describe worries about replication failure to an editorial piece by Pashler and Wagenmakers in 2012. The worry was voiced in the context of priming studies in social psychology, triggered by a number of unfortunate events that unfolded in 2011: the Diederik Stapel fraud case, Bem’s suspicious publication “showing evidence for” extrasensory perception, and Bargh’s vehement attacks on Doyen et al.’s failed replication of his famous elderly-priming study. The conversation has since broadened to include other disciplines, notably pharmaceutical research. The scientists and statisticians who had long felt discomfort with a variety of scientific and statistical practices across disciplines united under the banners of the replication crisis and statistical reform.
Three days ago (March 20, 2019), Nature published an article advocating the abolition of “statistical significance” in science; the article has gained quite a bit of publicity because it is co-signed by more than 800 scientists. Three days earlier, The American Statistician had put out a special issue on how to move beyond statistical significance. The editorial of that special issue has a certain forcefulness I’ve come to associate with statisticians. One of its co-authors, Nicole Lazar, also explains its central themes in a short interview that has received some circulation in philosophy of science circles.
A one-size-fits-all approach to statistical inference is an inappropriate expectation… We summarize our recommendations in two sentences totaling seven words: “Accept uncertainty. Be thoughtful, open, and modest.” Remember “ATOM.”
In this blog post, I raise some cross-disciplinary considerations that might not have received enough attention.
The problems surrounding the replication (or statistical) crisis are multi-faceted. To put it crudely: a scientist carries out sketchy procedures to meet a statistical threshold so they can publish and not perish; the statistical threshold imposes an arbitrarily chosen demand; the results are then reported in a way that misleads both scientists and science reporters; science reporters relay this to the public, who have no idea how the process is supposed to work, and therefore get upset and lose faith in science when the process “goes wrong” (i.e., does exactly what it should do). There are many different problems occurring in this process, and not all of them can or should be solved by everyone. A statistician may work towards developing better tools that do what the p-value is supposed to do but doesn’t. A statistician may even help scientists understand what exactly a p-value does, so that misinterpretations are less likely to arise. However, a statistician does not really have the responsibility (or, very often, the credentials) to tell a scientist how to fix their methods or that they should not p-hack for the sake of publication. The same is true of scientists and science reporters. However, this often means that what falls between the disciplinary cracks falls between the cracks of debates, too. I give two examples.
Consider the title of the interview mentioned above: “Time to say goodbye to ‘statistically significant’ and embrace uncertainty, say statisticians”. Who benefits from such a shift? Scientists, certainly. It is always a good idea for a scientist to have a more sophisticated sense of how much each piece of evidence weighs, according to what metric, and how much their own research is going to contribute. From the perspective of science users, however, uncertainty by itself is not helpful. A medical practitioner needs to know whether a drug is effective or not. A policymaker needs to know whether an intervention program is effective or not. These are binary options. Someone will need to develop decision procedures that synthesize uncertainty in an optimal manner; someone will also need to inform the public that it is okay and normal for science to retract its previous commitments, because uncertainty is inherent in science. The people who will be doing these things are presumably neither statisticians nor scientists. They don’t need to “embrace uncertainty” in the same way that scientists do.
But for the choices often required in regulatory, policy and business environments, decisions based on the costs, benefits and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to pursue a research idea further, there is no simple connection between a P value and the probable results of subsequent studies.
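To see what the first sentence of that passage could look like in practice, here is a minimal sketch in Python. Everything in it (the assessed probability that the drug is effective, the benefit and harm figures, even the 0.05 threshold it is contrasted with) is a made-up placeholder for illustration, not a recommendation.

```python
# A toy comparison in the spirit of the quoted passage. All numbers are invented;
# a real regulatory decision would have to estimate them from evidence.

def approve_by_significance(p_value, alpha=0.05):
    """Approve the drug iff the trial crossed the significance threshold."""
    return p_value < alpha

def approve_by_expected_benefit(p_effective, benefit_if_effective, harm_if_ineffective):
    """Approve iff the expected net benefit of approval is positive.

    p_effective          -- assessed probability that the drug works, given all evidence
    benefit_if_effective -- net benefit of approving a drug that works
    harm_if_ineffective  -- net cost of approving a drug that doesn't
    """
    expected_net_benefit = (p_effective * benefit_if_effective
                            - (1 - p_effective) * harm_if_ineffective)
    return expected_net_benefit > 0

# A cheap, low-risk drug with suggestive but "non-significant" evidence:
print(approve_by_significance(p_value=0.08))                 # False: fails the threshold
print(approve_by_expected_benefit(p_effective=0.6,
                                  benefit_if_effective=10.0,
                                  harm_if_ineffective=2.0))  # True: benefits outweigh the risk
```

The point is not that these particular numbers are right, but that the second rule has slots for costs, benefits, and likelihoods, which is exactly what a bare significance threshold lacks.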
Consider, also, what a p-value tells us. Roughly speaking, a low p-value says something like “an observation at least this extreme would not be very expected if the null hypothesis were true”. This doesn’t necessarily mean that the null is false, of course, and it certainly doesn’t mean that the alternative hypothesis is true. But what does it mean? Well, it depends. Is the null a well-established theory? Is the alternative a well-established theory? Was the scientific procedure carried out in a way that would very likely have produced a different observation had the null been true? All of these are “behind-the-scenes” questions that are scientific rather than statistical. Sometimes a low p-value means that the null is incompatible with reality and therefore should be rejected. Other times it means the study procedure isn’t sensitive enough for this kind of research. The statistics aren’t going to tell us which is which. Sure, it is easier to adopt an “if it doesn’t work every time, let’s just assume it doesn’t work at all” attitude, but I suspect (as do the writers of the editorial cited above) that this kind of attitude isn’t going to result in any productive conversation at all. (See, also, this fascinating piece on misleading criticisms of p-values.)
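For readers who like to see the gloss in code, here is a small illustrative simulation (assuming numpy and scipy; the sample size and effect size are arbitrary) of what a p-value is: the probability, computed under the null, of a result at least as extreme as the one observed, and nothing more.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A made-up experiment: 30 measurements whose true mean is 0.4; the null says the mean is 0.
observed = rng.normal(loc=0.4, scale=1.0, size=30)

# Two-sided one-sample t-test against the null "mean = 0".
t_stat, p_value = stats.ttest_1samp(observed, popmean=0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# The same number, approximated by brute force: how often does a world in which
# the null IS true produce a test statistic at least this extreme?
null_stats = [stats.ttest_1samp(rng.normal(loc=0.0, scale=1.0, size=30), 0.0).statistic
              for _ in range(10_000)]
print(f"simulated p = {np.mean(np.abs(null_stats) >= abs(t_stat)):.3f}")

# Note what is NOT computed anywhere above: the probability that the null is true,
# or that the alternative is true. Those depend on the "behind-the-scenes"
# scientific questions discussed in the paragraph above.
```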
Here are my two cents: continuing to view “the replication crisis” as a single issue is counter-productive. There was once a time when people thought their problems were small, localized, or unique to their own disciplines. Unification under the banner of “the replication crisis” shattered these self-deceptive hopes and made people realize the prevalence and seriousness of the issues. This is all very well. However, now is perhaps a good time to start recognizing and appreciating the differences among these issues, and to tackle subsets of them in more tractable ways.
It is perhaps strange that a philosopher (or any serious thinker, really) would advocate division over unification, and my view on this may well change in the future. Let me put my worry more concretely: solutions that don’t work in general may still work in specific contexts; seeing all problems (across all contexts) as manifestations of the same set of issues may blur this fact.
To give an example of what I have in mind: the field of drug research is notoriously plagued by the problem of industry funding. There is a lot of good evidence that studies funded by pharmaceutical companies are more likely to report positive findings than independently funded research. There is some disagreement over the cause: problematic procedures, implicit bias, or publication bias. (See this excerpt from Ben Goldacre’s book on this topic.) This seems (to me, at least) to be a sufficiently different problem from the QRP (Questionable Research Practices; ironically, the original finding on QRP has itself been disputed in the context of the replication crisis) problems in social psychology, such as “hypothesizing after the results are known” (“HARKing”, Kerr 1998). Even though both can be glossed as a kind of p-hacking, they have different manifestations, scales, and stakes. Should we try to develop a solution addressing both simultaneously? Does the failure of a method to address one imply that it will not be effective in addressing the other?
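Since “p-hacking” is doing a lot of work in that gloss, here is a quick, purely illustrative simulation (assuming numpy and scipy; all numbers are made up) of one of its simplest forms: measure several outcomes where nothing is actually going on, report whichever one crosses the threshold, and the nominal 5% false-positive rate climbs to roughly 40%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_outcomes, n_per_group = 2_000, 10, 30

honest_hits = hacked_hits = 0
for _ in range(n_studies):
    # No real effect anywhere: treatment and control come from the same distribution.
    p_values = [stats.ttest_ind(rng.normal(size=n_per_group),
                                rng.normal(size=n_per_group)).pvalue
                for _ in range(n_outcomes)]
    honest_hits += p_values[0] < 0.05    # a single, pre-specified outcome
    hacked_hits += min(p_values) < 0.05  # report whichever outcome "worked"

print(f"false-positive rate, one pre-specified outcome: {honest_hits / n_studies:.1%}")
print(f"false-positive rate, best of {n_outcomes} outcomes: {hacked_hits / n_studies:.1%}")
```

The toy example only shows that this particular mechanism is well understood and statistically tractable; whether it is the mechanism behind industry-funded results is exactly the kind of question the simulation cannot answer.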
In particular, QRP in psychology seems to be motivated primarily by academic-cultural incentives like publication bias and credit disputes, whereas pharmaceutical funding provides incentives external to the epistemic community. It seems natural that reconfiguring the academic social structure (by, for example, encouraging replication efforts and promoting quality over quantity of publications) may not be effective in addressing the pharmaceutical problem, which did not arise from academic incentives in the first place. On the other hand, in a talk I happened to attend, Miriam Solomon advocated addressing the pharmaceutical problem by straightforwardly discounting industry-funded research in decision making. What is notable about her account is that it is a head-on approach aimed at a specific problem (drug research) in a specific context (drug approval), and that it does not, and is not meant to, generalize to other “similar” problems. It does not take a stance on whether the pharmaceutical problem arises through unsound study design, value-laden bias, statistical cherry-picking, or non-publication. As such, the proposal is not scientific or statistical; it is a social or political one. My worry is that proposals like this will have trouble finding their place if people continue to assume that “the replication crisis” is a unified scientific/statistical problem and that any plausible proposal needs to address both science and statistics in a unified manner.
All in all, I am in favour of dropping “statistical significance”, if only as a way to open up conversations about the multitude of statistical approaches. I also admire the willingness of many statisticians to let go of the comfortable and venture into the unknown and the chaotic. However, we non-statisticians should not lose sight of the fact that statistical reform is unlikely to solve scientific or social problems, and that the failure of a statistical reform to solve those problems is not the same as its failure to do what it sets out to do.