Hidden moderators come up regularly as a possible explanation for failed replications. The argument goes something like this: the original experiment found the effect, but the replication did not. Therefore, some third, unknown variable has changed. Perhaps the attitudes or behaviours which gave rise to the effect are not present in the sampled population, or at least this specific sample:
Doyen et al. apparently did not check to make sure their participants possessed the same stereotype of the elderly as our participants did.– John Bargh
Perhaps the transposition of the experiment across time and space has lead to the recruitment of subjects from a qualitatively different population:
Based on the literature on moral judgment, one possibility is that participants in the Michigan samples were on average more politically conservative than the participants in the original studies conducted in the UK. —Simone Schnall
And perhaps, in the case of some social priming effects, societal values have changed so much in the period between the original study and the replication that this specific effect will never be found again: its ecological niche has vanished, or has been occupied by another, more contemporary social more.
These are valid possible explanations for why a replication may have failed . But the implication typically seems to be that since the replicators did not account for these potential hidden moderators, the replication is fatally flawed and should not be published as is. Faced with this critique from a reviewer, replicating authors are left with two alternatives: give up and don’t publish it; or collect more data and attempt to establish experimental control:
My recollection is that we used to talk about experimental control. Perhaps this was in the days of behaviourism. The idea was that the purpose of an experiment was to gain control over the behaviour of interest. A failure to replicate indicates that we don’t have control over the behaviour of interest, and is a sign that we should be doing more work in order to gain control.
In an ideal world, establishing experimental control is the best alternative. The original effect is genuine, but perhaps the luminance of the stimuli, the lighting in the experimental chamber, or the political leanings of the participants differed across experiments. Running more experiments which account for these variables means we improve our understanding of the effect, establishing the boundary conditions under which it does and does not appear. If the reviewer has correctly identified a hidden moderator, then the understanding of the effect is greater than it was before.
So what’s the catch?
This is all well and good when the effect itself is well established, with strong evidence in its favour. But what if the original evidence was weak? The effect being significant does not mean the evidence was strong, and you can’t establish boundary conditions for an effect which doesn’t exist; you can only provide more opportunities for false positives. Demanding that replicators run more experiments to test for potential hidden moderators places an additional experimental burden on them for an effect that they have already provided evidence may at least be substantially weaker than was originally reported, and places them in a difficult situation: running more experiments can never provide a definitive answer to the hidden moderator critique.
Even if the effect re-emerges, this does not mean that it explains the discrepancy between the replicated and replicating experiments: the problem with hidden moderators is that they’re hidden, and by definition, their influence on the results of original study is unknown . Thus, as an author, the hidden moderator critique can feel somewhat unfair: you are criticized for not controlling something which was not controlled in the original study. And if the reviewer identifies a potential hidden moderator that turns out to have no effect, then they may demand yet more experiments to account for yet more hidden moderators, or worse, criticize the replicators for failing to identify conditions under which the effect emerges.
How sure are you about the results?
What’s missing is a consideration of the strength of the evidence . It’s all too easy to over-estimate how strong the original evidence was . It shouldn’t always be enough to simply say that the effect was significant in the original study, and therefore those wishing to publish a failed replication must also find conditions under which it emerges, or at least account for as many different reasons why it may not emerge as the reviewer can think of. This may be appropriate if the original study provided strong evidence in favour of the effect – but if it doesn’t, the barrier should be lower for a replication to be viable in its own right. What should be necessary is that the evidence the replication provides on its own is strong; and if that is true, it provides a valuable data point in its own right, even without follow-ups aimed at uncovering a putative moderator or mechanism for an effect we should have less confidence is a general one.
 And if not, there’s always
 Even if the participant’s predilection for wearing outlandish hats moderates their susceptibility to the priming of personality judgements by the colour of the experimenter’s hat, there was no measure of outlandish hat-wearing in the original study.
 Here’s a nice example of using the Bayes Factor to do this from Felix Schoenbrodt: http://www.nicebread.de/reanalyzing-the-schnalljohnson-cleanliness-data-sets-new-insights-from-bayesian-and-robust-approaches/
 And this does not imply the original researchers did something wrong, a la QRP or p-hacking: I’m talking simply here about statistical strength and evidential value, not implying that there is evidence of questionable practice or methodological failure. These things happen. That’s why we do statistics!