
Are bankers really more dishonest?

Nobody likes a merchant banker, and a new report in Nature, Business culture and dishonesty in the banking industry, makes the case that such distaste may have a sound basis: Bankers who took a survey which asked questions about their jobs behaved more dishonestly than bankers who took a survey which addressed mundane, everyday topics, such as how much television they watched per week. It’s a catchy claim. But in contrast to the headlines, the data suggest something else: bankers were more honest overall than other groups, and at worst no more dishonest.

Each group of bankers was asked to toss a coin 10 times and report, online and anonymously, how often it landed on each side. They were told that each time the coin landed on a particular side (heads for some, tails for others), they could win $20.

The group who took the job-related survey reported 58.2% successful coin flips, while the control group reported 51.6% successful coin flips. Thus, the authors argued, priming the bankers with their professional identity made them more likely to dishonestly claim that they had tossed coins more successfully than they actually had.

To follow this up, the authors conducted two more studies with different populations, non-banking professionals and students. For these two groups, there was no effect of priming with professional identity; control groups and “treatment” (i.e. primed) groups performed similarly. Hence the headline finding: making bankers think about their professional identity as bankers made them more dishonest. Other groups did not become more dishonest when primed with their professional identity, and thus there is something about banking and banking culture that makes an honest person crooked.

But more dishonest than who?

Curiously, the main paper glosses over something that can only be found in the extended figures and the supplementary information: what set the non-banking professionals and students apart is that their control groups were as dishonest as their primed groups. In fact, of all the groups, the odd one out is the banking control group. Whereas the banking control group reported 51.6% successful coin flips, the non-banker and student control groups reported 59.8% and 57.9% respectively. The primed banking group reported 58.2% successful flips, while the non-banker and student primed groups reported 55.8% and 56.4% respectively.

If we collapse across the control and primed groups and simply average the two success rates for each sample population, bankers reported 54.9% successful coin flips, non-banking professionals 57.8%, and students 57.15%. Thus, overall, the bankers were the most honest group.
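For anyone who wants to check the collapsing step, here is a minimal sketch in Python. It assumes the control and primed groups were the same size, so that an unweighted mean is appropriate; the variable and function names are mine and not anything from the paper.

```python
# Reported success rates (%) quoted above, by sample and condition.
reported = {
    "bankers": {"control": 51.6, "primed": 58.2},
    "non-banking professionals": {"control": 59.8, "primed": 55.8},
    "students": {"control": 57.9, "primed": 56.4},
}


def collapsed_rate(conditions: dict[str, float]) -> float:
    """Unweighted mean across conditions (assumes equal group sizes)."""
    return sum(conditions.values()) / len(conditions)


collapsed = {group: collapsed_rate(c) for group, c in reported.items()}

# The group with the lowest reported success rate is the one closest to
# the 50% expected from honest reporting.
lowest = min(collapsed, key=collapsed.get)

for group, rate in collapsed.items():
    print(f"{group}: {rate:.2f}%")
print("lowest reported success rate:", lowest)
```

If the group sizes differed, a weighted mean would shift these figures slightly, but not by enough to change which sample sits closest to 50%.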

So maybe the headline should be that bankers are more honest than other groups, until they’re reminded that they’re bankers. Then they’re as dishonest as everyone else (or at least, non-banking professionals, and students).

Hidden moderators and experimental control

Hidden moderators come up regularly as a possible explanation for failed replications. The argument goes something like this: the original experiment found the effect, but the replication did not. Therefore, some third, unknown variable has changed. Perhaps the attitudes or behaviours which gave rise to the effect are not present in the sampled population, or at least this specific sample:

Doyen et al. apparently did not check to make sure their participants possessed the same stereotype of the elderly as our participants did. – John Bargh

Perhaps the transposition of the experiment across time and space has led to the recruitment of subjects from a qualitatively different population:

Based on the literature on moral judgment, one possibility is that participants in the Michigan samples were on average more politically conservative than the participants in the original studies conducted in the UK. —Simone Schnall

And perhaps, in the case of some social priming effects, societal values have changed so much in the period between the original study and the replication that this specific effect will never be found again: its ecological niche has vanished, or has been occupied by another, more contemporary social more.

These are valid possible explanations for why a replication may have failed [1]. But the implication typically seems to be that since the replicators did not account for these potential hidden moderators, the replication is fatally flawed and should not be published as is. Faced with this critique from a reviewer, replicating authors are left with two alternatives: give up and don’t publish it; or collect more data and attempt to establish experimental control:

My recollection is that we used to talk about experimental control. Perhaps this was in the days of behaviourism. The idea was that the purpose of an experiment was to gain control over the behaviour of interest. A failure to replicate indicates that we don’t have control over the behaviour of interest, and is a sign that we should be doing more work in order to gain control.

Chris Frith

In an ideal world, establishing experimental control is the best alternative. The original effect is genuine, but perhaps the luminance of the stimuli, the lighting in the experimental chamber, or the political leanings of the participants differed across experiments. Running more experiments which account for these variables means we improve our understanding of the effect, establishing the boundary conditions under which it does and does not appear. If the reviewer has correctly identified a hidden moderator, then the understanding of the effect is greater than it was before.

So what’s the catch?

This is all well and good when the effect itself is well established, with strong evidence in its favour. But what if the original evidence was weak? The effect being significant does not mean the evidence was strong, and you can’t establish boundary conditions for an effect which doesn’t exist; you can only provide more opportunities for false positives. Demanding that replicators run more experiments to test for potential hidden moderators places an additional experimental burden on them for an effect that they have already provided evidence may at least be substantially weaker than was originally reported, and places them in a difficult situation: running more experiments can never provide a definitive answer to the hidden moderator critique.

Catch-22's Yossarian

Damned if you do; and damned if you don’t

Even if the effect re-emerges, this does not mean that it explains the discrepancy between the replicated and replicating experiments: the problem with hidden moderators is that they’re hidden, and by definition, their influence on the results of the original study is unknown [2]. Thus, as an author, the hidden moderator critique can feel somewhat unfair: you are criticized for not controlling something which was not controlled in the original study. And if the reviewer identifies a potential hidden moderator that turns out to have no effect, then they may demand yet more experiments to account for yet more hidden moderators, or worse, criticize the replicators for failing to identify conditions under which the effect emerges.

How sure are you about the results?

What’s missing is a consideration of the strength of the evidence [3]. It’s all too easy to over-estimate how strong the original evidence was [4]. It shouldn’t always be enough to simply say that the effect was significant in the original study, and therefore those wishing to publish a failed replication must also find conditions under which it emerges, or at least account for as many different reasons why it may not emerge as the reviewer can think of. This may be appropriate if the original study provided strong evidence in favour of the effect – but if it didn’t, the barrier should be lower for a replication to be viable in its own right. What should be necessary is that the evidence the replication provides on its own is strong; and if that is true, it provides a valuable data point in its own right, even without follow-ups aimed at uncovering a putative moderator or mechanism for an effect that we should be less confident is a general one.
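To make the gap between "significant" and "strong evidence" concrete, here is a toy sketch using a Bayes factor for a coin-flip style outcome. Everything in it is an illustrative assumption: the 61-out-of-100 figure is made up, the alternative hypothesis uses a uniform prior on the success probability (for which the marginal likelihood is 1/(n + 1)), and none of it is taken from the studies discussed here.

```python
from math import comb


def binom_pmf(k: int, n: int) -> float:
    """Probability of exactly k successes in n trials when p = 0.5."""
    return comb(n, k) * 0.5 ** n


def two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value against p = 0.5."""
    distance = abs(k - n / 2)
    return sum(
        binom_pmf(i, n) for i in range(n + 1) if abs(i - n / 2) >= distance
    )


def bf10(k: int, n: int) -> float:
    """Bayes factor for H1 (p ~ Uniform(0, 1)) over H0 (p = 0.5)."""
    marginal_h1 = 1 / (n + 1)  # binomial likelihood averaged over a uniform prior
    return marginal_h1 / binom_pmf(k, n)


k, n = 61, 100  # hypothetical data
print(f"two-sided p = {two_sided_p(k, n):.3f}")
print(f"BF10 = {bf10(k, n):.2f}")
```

With these made-up numbers the p-value falls below .05, yet the Bayes factor stays between 1 and 2, which is usually read as barely favouring the effect over the null. A different prior would give a different Bayes factor, so treat this as an illustration of the general point and not as a recommended analysis.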


[1] And if not, there’s always


[2] Even if the participant’s predilection for wearing outlandish hats moderates their susceptibility to the priming of personality judgements by the colour of the experimenter’s hat, there was no measure of outlandish hat-wearing in the original study.

[3] Here’s a nice example of using the Bayes Factor to do this from Felix Schoenbrodt:

[4] And this does not imply the original researchers did something wrong, a la QRP or p-hacking: I’m talking simply here about statistical strength and evidential value, not implying that there is evidence of questionable practice or methodological failure. These things happen. That’s why we do statistics!

Are women turned off by sexy adverts for cheap products? Just a couple of questions…

Lest anyone forget, we’re in the run-up to Christmas. The already annoying adverts transform into a parade of even more annoying adverts (one good thing about living outside the UK is I never see the sodding Argos aliens), flogging overpriced goodies through any means necessary. A lot more time seems dedicated to expensive luxury items modelled by beautiful specimens of man- and woman-hood! And hey, everybody knows that sex sells, right?

A new study in Psychological Science purports to find that, actually, sex doesn’t always sell: women find sexy adverts a turn off, and more so when the goods are cheap. So sometimes, ads can be just *too* sexy.

The study, “The Price Had Better Be Right: Women’s Reactions to Sexual Stimuli Vary With Market Factors”, by Vohs, Sengupta, & Dahl, garnered media attention (Sexy adverts turn women off, research shows, The Independent), as these things are wont to do, and you can read the press release here.

A while back, Rolf Zwaan deconstructed another article from Psychological Science, asking 50 questions about clean rooms and messy data. Funnily enough, an author of that article was also an author of this article. And I have quite a few questions for them about this one too.

Let’s have a quick run through of the logic here. According to Vohs, Sengupta, and Dahl, women typically react negatively to sexual images. For women, sex should be portrayed as rare, special, and infrequent in order to maintain it as something of high value, and properly reflect the higher costs of sex and reproduction for women and their overall lower sexual desire. Vohs in particular adheres to the theory (her own, along with Baumeister) of sexual economics, which apparently predicts that women should want sex to take place only when they get something out of it other than sex itself. Thus, associating sex with something cheap should produce a more negative reaction than associating it with something expensive, because it debases the value of sex. Men shouldn’t care whether sex is used to sell something cheap or expensive, because they don’t value sex as highly as women.

I don’t find this terribly convincing, and the back-up evidence was largely produced by the authors. But, let’s move swiftly on to the meat and potatoes of the experiments. Just like Rolf, I’ve a few questions which should be pretty straightforward, and I’ll follow the structure of the paper. I won’t quite get to 50, but let’s see…

Experiment 1

Eighty-seven undergraduates (47 women) participated in a 2 (gender) × 2 (price: cheap vs. expensive) between-subjects design that used a sexy ad to promote a product. Two additional conditions (n = 46) assessed women’s attitudes to nonsexual ads that varied in price (cheap vs. expensive).

(1) How old were the participants, and where were they from? The authors are in marketing departments at Universities in three different countries (US, Canada, Hong Kong). This strikes me as something that might have huge cross-cultural differences.

(2) There were two additional control conditions with 46 subjects. Were there 46 subjects in each of the two conditions or 23 in each? Were the subjects different from those who took part in the other conditions? Why were these controls only run on women and not on men?

(3) Doesn’t ~22 participants per condition seem on the low side? How big an effect are we expecting here, like, King Kong bellyflopping in the Hudson River after leaping from the Empire State Building effect sized? I mean that’s a pretty decent leap to start with.
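To put a rough number on that worry, here is a back-of-the-envelope power calculation. It is only a sketch under stated assumptions: it uses a normal approximation for a simple two-group comparison (an interaction test would generally need more participants, not fewer), the effect sizes are hypothetical Cohen's d values, and the function name is mine.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison.

    Uses the normal approximation: the test statistic is treated as
    normal with mean d * sqrt(n / 2) and unit variance.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    # Probability of landing beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for d in (0.2, 0.5, 0.8):  # hypothetical small, medium, large effects
    print(f"d = {d}: power ≈ {approx_power(d, 22):.2f}")
```

Under these assumptions, 22 participants per group gives well under a 50% chance of detecting a medium-sized effect, so only a fairly large effect would be found reliably.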

All participants were subjected to a cognitive load while they perused the ads, so that the ads would elicit spontaneous reactions (Gilbert, Krull, & Pelham, 1988).

(4) Why is this in the participants section?

To induce cognitive load, we had participants silently rehearse 10 digits while viewing three ads for 20 s each. The second ad promoted Chaumet women’s watches.

(5) The participants saw three adverts for 20 seconds each, with the target advert the second of the three. What were the other adverts? Is having an additional cognitive load task really going to ensure you only tap participants’ initial reactions to the adverts when they had 20 seconds to view them?

(6) Why Chaumet watches? Why watches specifically? Why only one product? If the participants knew the brand, why would they believe one would be available for $10 anyway?

Participants in the sexual condition saw explicit sexual imagery taking up the majority of the ad, with an image of the product in the bottom corner (see Sengupta & Dahl, 2008, for the images).

(7) Why aren’t pictures of the stimuli included in this paper, and instead only available in one from 5 years ago, in a different journal? And honestly, explicit? It’s maybe just about post-watershed.

What a lovely interaction
Fig 1 from Vohs et al., (2013): doi: 10.1177/0956797613502732

(8) Why do the graphs start from zero when the rating scales were from 1 to 7? It’s hard for me to tell what the actual ratings were, and this would be really quite useful.

A 2 (gender) × 2 (price) analysis of variance (ANOVA) of the data from the sexual condition revealed the predicted interaction, F(1, 83) = 4.97, p < .05, ηp² = .056,

(9)  Phew, that was close (p = .03)! Is the effect size partial eta squared? Anyway, looks like the authors’ ability to predict crossover interactions, ably noted by Rolf Zwaan, extends to ordinal interactions too! Kudos. I can’t really interpret the “control” non-sexual conditions on the right since no direct statistical comparisons are made between them and any of the other conditions.
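On the effect size question, one quick check is to recompute partial eta squared from the F statistic and its degrees of freedom and see whether it matches the reported value. A minimal sketch, with a hypothetical helper name:

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic.

    Follows from F = (SS_effect / df_effect) / (SS_error / df_error),
    with partial eta squared = SS_effect / (SS_effect + SS_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# The interaction reported for ad attitudes in Experiment 1: F(1, 83) = 4.97
print(round(partial_eta_squared(4.97, 1, 83), 3))  # 0.056
```

That agrees with the reported .056, so partial eta squared seems a fair guess.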

(10) Why assess the results of the two control conditions using an F-test and not a t-test? More to the point, why assess the two control conditions separately from the experimental conditions to which you compare them? There should be an interaction test here. Note that they repeat the same mistake throughout the paper. It’s hard to draw firm conclusions without such tests.

A second hypothesis was that the Gender × Price interaction would predict self-reported negative affect. This hypothesis was supported, F(1, 82) = 3.68, p = .059,

(11) So, it wasn’t supported, then?

As expected, men’s self-reported negative emotions did not vary with price condition (F < 1), whereas women showed the predicted pattern of stronger negative emotions after viewing a sexual ad promoting a cheap rather than expensive watch, F(1, 82) = 5.43, p < .05, ηp² = .088 (Fig. 2).

(12) but the interaction wasn’t significant, so…
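The exact p-values behind these F statistics can be recovered, to a good approximation, from the reported values alone. The sketch below assumes a single numerator degree of freedom, so that F = t², and integrates the t density numerically with Simpson's rule using only the standard library. The function names are mine and are not from the paper or any statistics package.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def p_from_f(f_value: float, df_error: int, steps: int = 2000) -> float:
    """Two-sided p-value for F(1, df_error), using F = t**2.

    Integrates the t density from 0 to sqrt(F) with Simpson's rule
    (steps must be even) and subtracts twice that area from 1.
    """
    t = math.sqrt(f_value)
    h = t / steps
    total = t_pdf(0.0, df_error) + t_pdf(t, df_error)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df_error)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t


for f_value, df_error in ((4.97, 83), (3.68, 82), (5.43, 82)):
    print(f"F(1, {df_error}) = {f_value}: p ≈ {p_from_f(f_value, df_error):.3f}")
```

This gives roughly .03 for F(1, 83) = 4.97 and roughly .06 for F(1, 82) = 3.68, in line with the values quoted above.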

(13) Was a 60 second viewing followed by a quick set of ratings really all the participants did? Were they paid/compensated for their time?

(14) Can the authors really draw firm conclusions from a single stimulus? What if it was just something peculiar to the images used in this study?

(15) What were the ratings calibrated to? What would be considered a really good, likeable advert which would elicit positive emotions? What would be considered a really bad, horrible advert that would elicit negative emotions? I can think of a few.

Experiment 2

(16) The authors state that this experiment was intended to replicate exp 1 with the addition of a second watch to the advert, a men’s watch, to test for the possibility that men found the product irrelevant in Exp 1 because it was a women’s watch. Why not just replace the women’s watch with a men’s watch instead of presenting both, so that you can test what happens when it becomes irrelevant to women instead, as well as what happens when it becomes relevant to men?

(17) Why assume that the women’s watch was irrelevant to men? Don’t men buy gifts? Is that really the only possible explanation the authors could come up with? What if it was some other product altogether? Perfume? Power tools? You know, something that’s always annoyingly gender stereotyped in adverts. Did the reviewers really think this was a good control?

Participants were 212 adults (107 female). They participated in a 2 (gender) × 2 (price: cheap vs. expensive) design with 2 hanging control conditions in which nonsexual ads (cheap vs. expensive) were viewed only by women.

(18) 212 participants in this one, but again those two weird unclear control conditions, where it’s not clear if they were given to the same subjects or not. Also, again, lacking critical information about the participants. These participants are described only as adults, leading me to believe it’s a somewhat different population to experiment 1. How old were they? How were they recruited? How were they compensated? Where were they from?

After reporting the digits and thus releasing the cognitive load, participants moved sliding scales (equating to 100 points)

(19) Wait, I thought this was a direct replication with the exception of the change of stimuli? Why change to a sliding 0-100 scale from a 1-7 Likert type scale?

As predicted, and as in Experiment 1, gender and price had an interactive effect on the two key outcomes in the sexual condition—ad attitudes: F(1, 115) = 3.83, p = .05, ηp² = .032;

(20) That’s p = .052, and not significant, again.

(21) Hang on, F(1, 115)? There are 212 participants and four conditions (excluding the two weird control conditions, because I don’t know who actually did them). Why are there only 115 df here? Shouldn’t it be 208? Honestly, the results section here is really confusing, because it’s not always clear what actual tests were done.
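As a rough sanity check on the degrees of freedom, and assuming a plain between-subjects ANOVA with no covariates or exclusions, the error df is simply the number of participants minus the number of cells. The helper names below are hypothetical.

```python
def error_df(n_participants: int, n_cells: int) -> int:
    """Error degrees of freedom for a between-subjects ANOVA."""
    return n_participants - n_cells


def implied_participants(df_error: int, n_cells: int) -> int:
    """Number of participants implied by a reported error df."""
    return df_error + n_cells


# Experiment 1: 87 participants in the 2 x 2 design
print(error_df(87, 4))               # 83
# Experiment 2: 212 participants, if all of them were in the 2 x 2 design
print(error_df(212, 4))              # 208
# Participants implied by the reported F(1, 115)
print(implied_participants(115, 4))  # 119
```

Under that assumption, Experiment 1's 87 participants in four cells give the reported 83, whereas 115 error df in Experiment 2 would correspond to 119 participants in the four sexual-ad cells, not 212. Whether the remaining participants were in the control conditions is not something the numbers alone can settle.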

A few other random questions:

(22) Isn’t anybody curious how every hypothesis was as predicted, and was unchallenged even when the stats weren’t significant?

(23) Did participants even notice how much the watches cost? Why not just ask them?

(24) Do the authors/reviewers/editor not think, perhaps, that given there is basically no research on this topic conducted by anyone other than the authors, it might look a little odd that everything comes out exactly the way it is predicted?

These findings have several* implications. One is that women can be swayed to tolerate sexual imagery, as long as it comports with their preferences regarding when and why sex is used. A second, more profound implication is that women’s reactions to sexual images reveal their preferences about how sex should be understood. *Two.

(25)  Don’t the authors/reviewers/editor think that these conclusions are just a little strident, given that we have virtually no idea who the participants were, have no information about the sexual preferences of the participants, and only a single product (a watch) was ever presented? Aren’t there any alternative possibilities, questions left open, limitations of the current research?

(26) Do the authors really think that women’s sexual preferences can be so simply identified with perception of financial value? Why claim that women can be swayed to tolerate sexual imagery as if by default they do not? Is there any evidence for this claim that the authors could provide that isn’t in a paper they wrote?

(27) Do they really want to claim that reactions to an advert for a watch are a good way of assessing women’s valuation of sex? REALLY?

And then, a few more questions, which just occurred to me while quickly going over Sengupta & Dahl (2008). Just a few, you know, slight concerns.

(28) Why don’t the authors of the current paper mention that their experiments are a replication of the experiments in Sengupta & Dahl?

(29) Why do the authors think that Sengupta & Dahl, with almost exactly the same set-up, find that men’s attitude towards sexual adverts was better than towards non-sexual adverts? Why didn’t they predict that for the current study, on the basis of their previous results?

(30) Why did Sengupta & Dahl actually predict the different pattern of results that they found in their original paper, but make different predictions here, which the results, yet again, perfectly matched?

(31) Didn’t the reviewers and/or editor even look at Sengupta & Dahl 2008, despite it being an absolutely critical paper for the interpretation of Vohs et al.?