Are bankers really more dishonest?

Nobody likes a merchant banker, and a new report in Nature, Business culture and dishonesty in the banking industry, makes the case that such distaste may have a sound basis: Bankers who took a survey which asked questions about their jobs behaved more dishonestly than bankers who took a survey which addressed mundane, everyday topics, such as how much television they watched per week. It’s a catchy claim. But in contrast to the headlines, the data suggest something else: bankers were more honest overall than other groups, and at worst no more dishonest.

Each group of bankers was asked to toss a coin 10 times and report, online and anonymously, how often it landed on each side. They were told that each time the coin landed on a particular side (heads for some, tails for others), they could win $20.

The group who took the job-related survey reported 58.2% successful coin flips, while the control group reported 51.6% successful coin flips. Thus, the authors argued, priming the bankers with their professional identity made them more likely to dishonestly claim that they had tossed coins more successfully than they actually had.
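
Because the reports were made anonymously, nobody can point to an individual liar; dishonesty can only be inferred at the group level, by asking whether the reported success rate is higher than the 50% you’d expect from honest reporting. Here’s a minimal sketch of that logic, using made-up group sizes purely for illustration (the paper’s actual ns differ):

```python
from scipy.stats import binomtest

# Hypothetical group sizes, for illustration only: 100 people per group,
# 10 reported flips each. Under honest reporting the expected win rate is 50%.
n_reports = 100 * 10
for label, rate in [("primed bankers", 0.582), ("control bankers", 0.516)]:
    wins = round(rate * n_reports)
    result = binomtest(wins, n_reports, p=0.5, alternative="greater")
    print(f"{label}: {wins}/{n_reports} wins, one-sided p = {result.pvalue:.4f}")
```

With numbers like these, the primed group sits well above chance while the control group doesn’t, which is the pattern the headline claim rests on.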

To follow this up, the authors conducted two more studies with different populations: non-banking professionals and students. For these two groups, there was no effect of priming with professional identity; control groups and “treatment” (i.e. primed) groups performed similarly. Hence the headline finding: making bankers think about their professional identity as bankers made them more dishonest. Other groups did not become more dishonest when primed with their professional identity, so, the argument goes, there must be something about banking and banking culture that makes an honest person crooked.

But more dishonest than whom?

Curiously, what is glossed over in the main paper (it appears only in the extended figures and the supplementary information) is that, for the non-banking professionals and the students, the control groups were just as dishonest as the primed groups. In fact, of all the groups, the odd one out is the banking control group. Whereas the banking control group reported 51.6% successful coin flips, the non-banker and student control groups reported 59.8% and 57.9% respectively. The primed banking group reported 58.2% successful flips, while the non-banker and student primed groups reported 55.8% and 56.4% respectively.

If we collapse across the control and primed groups and simply look at the average success rate for each sample population, bankers reported 54.6% successful coin flips, non-banking professionals 57.8%, and students 57.15%. Thus, overall, the bankers were the most honest group.
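
The arithmetic behind that pooled comparison is nothing fancy. Here’s a quick sanity check, assuming equal group sizes purely for illustration (the paper’s actual group sizes weight the pooled means, which is presumably why the bankers’ 54.6% quoted above sits slightly below a simple average of 51.6% and 58.2%):

```python
# Reported success rates (control, primed) for each sample, as quoted above.
reported = {
    "bankers":     (0.516, 0.582),
    "non-bankers": (0.598, 0.558),
    "students":    (0.579, 0.564),
}
for group, (control, primed) in reported.items():
    print(f"{group}: unweighted pooled rate = {(control + primed) / 2:.1%}")
```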

So maybe the headline should be that bankers are more honest than other groups, until they’re reminded that they’re bankers. Then they’re as dishonest as everyone else (or at least as non-banking professionals and students).

Hidden moderators and experimental control

Hidden moderators come up regularly as a possible explanation for failed replications. The argument goes something like this: the original experiment found the effect, but the replication did not. Therefore, some third, unknown variable has changed. Perhaps the attitudes or behaviours which gave rise to the effect are not present in the sampled population, or at least not in this specific sample:

Doyen et al. apparently did not check to make sure their participants possessed the same stereotype of the elderly as our participants did. – John Bargh

Perhaps the transposition of the experiment across time and space has led to the recruitment of subjects from a qualitatively different population:

Based on the literature on moral judgment, one possibility is that participants in the Michigan samples were on average more politically conservative than the participants in the original studies conducted in the UK. —Simone Schnall

And perhaps, in the case of some social priming effects, societal values have changed so much in the period between the original study and the replication that this specific effect will never be found again: its ecological niche has vanished, or has been occupied by another, more contemporary social norm.

These are valid possible explanations for why a replication may have failed [1]. But the implication typically seems to be that since the replicators did not account for these potential hidden moderators, the replication is fatally flawed and should not be published as is. Faced with this critique from a reviewer, replicating authors are left with two alternatives: give up and don’t publish it; or collect more data and attempt to establish experimental control:

My recollection is that we used to talk about experimental control. Perhaps this was in the days of behaviourism. The idea was that the purpose of an experiment was to gain control over the behaviour of interest. A failure to replicate indicates that we don’t have control over the behaviour of interest, and is a sign that we should be doing more work in order to gain control.

Chris Frith

In an ideal world, establishing experimental control is the best alternative. The original effect is genuine, but perhaps the luminance of the stimuli, the lighting in the experimental chamber, or the political leanings of the participants differed across experiments. Running more experiments which account for these variables means we improve our understanding of the effect, establishing the boundary conditions under which it does and does not appear. If the reviewer has correctly identified a hidden moderator, then the understanding of the effect is greater than it was before.

So what’s the catch?

This is all well and good when the effect itself is well established, with strong evidence in its favour. But what if the original evidence was weak? A significant result does not mean the evidence was strong, and you can’t establish boundary conditions for an effect which doesn’t exist; you can only provide more opportunities for false positives. Demanding that replicators run more experiments to test for potential hidden moderators places an additional experimental burden on them for an effect that, on the evidence they have already provided, may be substantially weaker than originally reported, and it puts them in a difficult situation: running more experiments can never provide a definitive answer to the hidden moderator critique.

[Image: Catch-22's Yossarian]

Damned if you do; and damned if you don’t

Even if the effect re-emerges, this does not mean that it explains the discrepancy between the original and replicating experiments: the problem with hidden moderators is that they’re hidden, and by definition their influence on the results of the original study is unknown [2]. Thus, to an author, the hidden moderator critique can feel somewhat unfair: you are criticized for not controlling for something which was not controlled in the original study either. And if the reviewer identifies a potential hidden moderator that turns out to have no effect, they may demand yet more experiments to account for yet more hidden moderators, or worse, criticize the replicators for failing to identify conditions under which the effect emerges.

How sure are you about the results?

What’s missing is a consideration of the strength of the evidence [3]. It’s all too easy to over-estimate how strong the original evidence was [4]. It shouldn’t always be enough to simply say that the effect was significant in the original study, and that therefore those wishing to publish a failed replication must also find conditions under which it emerges, or at least account for as many different reasons why it may not emerge as the reviewer can think of. This may be appropriate if the original study provided strong evidence in favour of the effect; but if it didn’t, the barrier should be lower for a replication to be viable in its own right. What should matter is that the evidence the replication provides is strong on its own terms; if it is, it is a valuable data point in its own right, even without follow-ups aimed at uncovering a putative moderator or mechanism for an effect we should now have less confidence is a general one.



[1] And if not, there’s always…

[Image: Aliens?!]

[2] Even if the participant’s predilection for wearing outlandish hats moderates their susceptibility to the priming of personality judgements by the colour of the experimenter’s hat, there was no measure of outlandish hat-wearing in the original study.

[3] Here’s a nice example of using the Bayes Factor to do this from Felix Schoenbrodt: http://www.nicebread.de/reanalyzing-the-schnalljohnson-cleanliness-data-sets-new-insights-from-bayesian-and-robust-approaches/
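
For a rough feel of what this looks like in practice, one crude but serviceable shortcut is Wagenmakers’ (2007) BIC approximation to the Bayes factor for a t test, which needs only the t value, the degrees of freedom, and the sample size. A minimal sketch (the numbers plugged in at the bottom are invented, just to show how unimpressive a “just significant” result can be):

```python
import math

def bic_bf01(t, n, df):
    """Approximate Bayes factor in favour of the null (BF01) for a t test,
    via the BIC approximation: BF01 ~= sqrt(n) * (df / (df + t**2)) ** (n / 2)."""
    return math.sqrt(n) * (df / (df + t ** 2)) ** (n / 2)

# Invented example: a two-sample result with t(38) = 2.1, N = 40, p ~ .04.
bf01 = bic_bf01(t=2.1, n=40, df=38)
print(f"BF01 ~ {bf01:.2f}, BF10 ~ {1 / bf01:.2f}")  # roughly 0.7 and 1.4: barely any evidence either way
```

A default JZS Bayes factor, as in the linked reanalysis, is better behaved, but even the BIC version is enough to show that “significant” and “strong evidence” are not the same thing.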

[4] And this does not imply that the original researchers did anything wrong, à la QRPs or p-hacking: I’m talking purely about statistical strength and evidential value, not suggesting that there is evidence of questionable practice or methodological failure. These things happen. That’s why we do statistics!

Are women turned off by sexy adverts for cheap products? Just a couple of questions…

Lest anyone forget, we’re in the run-up to Christmas. The already annoying adverts transform into a parade of even more annoying adverts (one good thing about living outside the UK is that I never see the sodding Argos aliens), flogging overpriced goodies through any means necessary. A lot more airtime seems to be dedicated to expensive luxury items modelled by beautiful specimens of man- and woman-hood! And hey, everybody knows that sex sells, right?

A new study in Psychological Science purports to find that, actually, sex doesn’t always sell: women find sexy adverts a turn off, and more so when the goods are cheap. So sometimes, ads can be just *too* sexy.

The study, “The Price Had Better Be Right: Women’s Reactions to Sexual Stimuli Vary With Market Factors”, by Vohs, Sengupta, & Dahl, garnered media attention (Sexy adverts turn women off, research shows, The Independent), as these things are wont to do, and you can read the press release here.

A while back, Rolf Zwaan deconstructed another article from Psychological Science, asking 50 questions about clean rooms and messy data. Funnily enough, an author of that article was also an author of this article. And I have quite a few questions for them about this one too.

Let’s have a quick run through of the logic here. According to Vohs, Sengupta, and Dahl, women typically react negatively to sexual images. For women, sex should be portrayed as rare, special, and infrequent in order to maintain it as something of high value, and properly reflect the higher costs of sex and reproduction for women and their overall lower sexual desire. Vohs in particular adheres to the theory (her own, along with Baumeister) of sexual economics, which apparently predicts that women should want sex to take place only when they get something out of it other than sex itself. Thus, associating sex with something cheap should produce a more negative reaction than associating it with something expensive, because it debases the value of sex. Men shouldn’t care whether sex is used to sell something cheap or expensive, because they don’t value sex as highly as women.

I don’t find this terribly convincing, and the back-up evidence was largely produced by the authors. But, let’s move swiftly on to the meat and potatoes of the experiments. Just like Rolf, I’ve a few questions which should be pretty straightforward, and I’ll follow the structure of the paper. I won’t quite get to 50, but let’s see…

Experiment 1

Eighty-seven undergraduates (47 women) participated in a 2 (gender) × 2 (price: cheap vs. expensive) between-subjects design that used a sexy ad to promote a product. Two additional conditions (n = 46) assessed women’s attitudes to nonsexual ads that varied in price (cheap vs. expensive).

(1) How old were the participants, and where were they from? The authors are in marketing departments at Universities in three different countries (US, Canada, Hong Kong). This strikes me as something that might have huge cross-cultural differences.

(2) There were two additional control conditions with 46 subjects. Were there 46 subjects in each of the two conditions or 23 in each? Were the subjects different from those who took part in the other conditions? Why were these controls only run on women and not on men?

(3) Doesn’t ~22 participants per condition seem on the low side? How big an effect are we expecting here? An effect the size of King Kong bellyflopping into the Hudson River after leaping from the Empire State Building? I mean, that’s a pretty decent leap to start with.
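
To put a rough number on that, here’s a back-of-the-envelope power calculation, treating the key comparison as a simple two-group test with ~22 participants per cell (a simplification, but a charitable one):

```python
from statsmodels.stats.power import TTestIndPower

# With 22 participants per cell, what standardised effect size could a simple
# two-group comparison detect with 80% power at alpha = .05 (two-sided)?
detectable_d = TTestIndPower().solve_power(nobs1=22, alpha=0.05, power=0.80, ratio=1.0)
print(f"Smallest reliably detectable effect: d = {detectable_d:.2f}")  # roughly d = 0.87
```

That really is a King Kong sized effect for a subtle price manipulation.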

All participants were subjected to a cognitive load while they perused the ads, so that the ads would elicit spontaneous reactions (Gilbert, Krull, & Pelham, 1988).

(4) Why is this in the participants section?

To induce cognitive load, we had participants silently rehearse 10 digits while viewing three ads for 20 s each. The second ad promoted Chaumet women’s watches.

(5) The participants saw three adverts for 20 seconds each, with the target advert the second of the three. What were the other adverts? And is adding a cognitive load task really going to ensure you only tap participants’ initial reactions to the adverts when they had a full 20 seconds to view them?

(6) Why Chaumet watches? Why watches specifically? Why only one product? And if the participants knew the brand, why would they believe one would be available for $10 anyway?

Participants in the sexual condition saw explicit sexual imagery taking up the majority of the ad,
with an image of the product in the bottom corner (see Sengupta & Dahl, 2008, for the images).

(7) Why aren’t pictures of the stimuli included in this paper, rather than only being available in a paper from five years ago, in a different journal? And honestly, explicit? It’s maybe just about post-watershed.

[Figure 1 from Vohs et al. (2013), doi: 10.1177/0956797613502732. What a lovely interaction.]

(8) Why do the graphs start from zero when the rating scales ran from 1 to 7? It’s hard for me to tell what the actual ratings were, and that would be really quite useful to know.
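
For what it’s worth, the fix is trivial. Something like this (with made-up numbers, since I can’t read the real ones off the figure) is all it takes to make ratings on a 1-7 scale readable:

```python
import matplotlib.pyplot as plt

# Made-up ratings, not the paper's -- just to show the axis choice.
conditions = ["cheap / women", "expensive / women", "cheap / men", "expensive / men"]
ratings = [2.8, 4.1, 3.9, 3.8]
plt.bar(conditions, ratings)
plt.ylim(1, 7)                  # span the actual response scale instead of starting at 0
plt.ylabel("Ad attitude (1-7)")
plt.tight_layout()
plt.show()
```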

A 2 (gender) × 2 (price) analysis of variance (ANOVA) of the data from the sexual condition revealed the predicted interaction, F(1, 83) = 4.97, p < .05, ηp² = .056,

(9) Phew, that was close (p = .03)! And I take it that effect size is partial eta squared? Anyway, it looks like the authors’ ability to predict crossover interactions, ably noted by Rolf Zwaan, extends to ordinal interactions too! Kudos. I can’t really interpret the “control” non-sexual conditions on the right, since no direct statistical comparisons are made between them and any of the other conditions.
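
(Recovering the exact p behind a reported F takes one line, for anyone who wants to check my arithmetic; this assumes the F and dfs are exactly as printed in the paper.)

```python
from scipy.stats import f as f_dist

# The Experiment 1 interaction, as reported: F(1, 83) = 4.97.
p = f_dist.sf(4.97, 1, 83)
print(f"p = {p:.3f}")   # ~ .028: under .05, but not by much
```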

(10) Why assess the results of the two control conditions using an F-test and not a t-test? More to the point, why assess the two control conditions separately from the experimental conditions to which you compare them? There should be an interaction test here. Note that they repeat the same mistake throughout the paper. It’s hard to draw firm conclusions without such tests.
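
For the record, the kind of test I mean is not exotic; with the per-participant ratings it’s a few lines. The data below are simulated placeholders (I obviously don’t have the real ratings) and the column names are mine, but the shape of the analysis is the point: put the sexual and non-sexual conditions into one model and test the ad type × price interaction directly.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated placeholder data, not the paper's: 22 women per cell in a
# 2 (ad type: sexual vs. non-sexual) x 2 (price: cheap vs. expensive) design.
rng = np.random.default_rng(0)
n = 22
data = pd.DataFrame({
    "ad_type": np.repeat(["sexual", "nonsexual"], 2 * n),
    "price":   np.tile(np.repeat(["cheap", "expensive"], n), 2),
    "rating":  rng.normal(4, 1.5, 4 * n).clip(1, 7),
})
model = smf.ols("rating ~ C(ad_type) * C(price)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))  # the C(ad_type):C(price) row is the test that's missing
```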

A second hypothesis was that the Gender × Price interaction would predict self-reported negative affect. This hypothesis was supported, F(1, 82) = 3.68, p = .059,

(11) So, it wasn’t supported, then?

As expected, men’s self-reported negative emotions did not vary with price condition (F < 1), whereas women showed the predicted pattern of stronger negative emotions after viewing a sexual ad promoting a cheap rather than expensive watch, F(1, 82) = 5.43, p < .05, ηp² = .088 (Fig. 2).

(12) but the interaction wasn’t significant, so…

(13) Was a 60 second viewing followed by a quick set of ratings really all the participants did? Were they paid/compensated for their time?

(14) Can the authors really draw firm conclusions from a single stimulus? What if it was just something peculiar to the images used in this study?

(15) What were the ratings calibrated to? What would be considered a really good, likeable advert which would elicit positive emotions? What would be considered a really bad, horrible advert that would elicit negative emotions? I can think of a few.

Experiment 2

(16) The authors state that this experiment was intended to replicate Experiment 1 with the addition of a second watch to the advert, a men’s watch, to test for the possibility that men found the product irrelevant in Experiment 1 because it was a women’s watch. Why not just replace the women’s watch with a men’s watch instead of presenting both, so that you can test what happens when the product becomes irrelevant to women, as well as what happens when it becomes relevant to men?

(17) Why assume that the women’s watch was irrelevant to men? Don’t men buy gifts? Is that really the only possible explanation the authors could come up with? What if it was some other product altogether? Perfume? Power tools? You know, something that’s always annoyingly gender stereotyped in adverts. Did the reviewers really think this was a good control?

Participants were 212 adults (107 female). They participated in a 2 (gender) × 2 (price: cheap vs. expensive) design with 2 hanging control conditions in which nonsexual ads (cheap vs. expensive) were viewed only by women.

(18) 212 participants in this one, but again those two oddly hanging control conditions, where it’s not clear whether they were given to the same subjects or not. Also, again, critical information about the participants is missing. They are described only as adults, which leads me to believe this is a somewhat different population from Experiment 1’s. How old were they? How were they recruited? How were they compensated? Where were they from?

After reporting the digits and thus releasing the cognitive load, participants moved sliding scales (equating to 100 points)

(19) Wait, I thought this was a direct replication, apart from the change of stimuli? Why switch from a 1-7 Likert-type scale to a sliding 0-100 scale?

As predicted, and as in Experiment 1, gender and price had an interactive effect on the two key outcomes in the sexual condition—ad attitudes: F(1, 115) = 3.83, p = .05, ηp² = .032;

(20) That p works out at .052, so not significant. Again.

(21) Hang on, F(1, 115)? There are 212 participants and four conditions (excluding the two hanging control conditions, because I still don’t know who actually took them). Why are there only 115 error df here? Shouldn’t it be 208? Honestly, the results section is really confusing, because it’s not always clear what tests were actually run.
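
For reference, the error df in a fully between-subjects 2 × 2 ANOVA is just the total N minus the number of cells, so the reported df implies far fewer people in this particular test than the 212 recruited; where the rest went, the paper doesn’t say.

```python
# Error df for a fully between-subjects 2 x 2 ANOVA is N - 4.
print(212 - 4)   # 208: what you'd expect if all 212 participants were in the 2 x 2
print(115 + 4)   # 119: the sample size actually implied by F(1, 115)
```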

A few other random questions:

(22) Isn’t anybody curious that every hypothesis came out as predicted, and went unchallenged even when the stats weren’t significant?

(23) Did participants even notice how much the watches cost? Why not just ask them?

(24) Do the authors/reviewers/editor not think, perhaps, that given there is basically no research on this topic conducted by anyone other than the authors, it might look a little odd that everything comes out exactly the way it is predicted?

These findings have several* implications. One is that women can be swayed to tolerate sexual imagery, as long as it comports with their preferences regarding when and why sex is used. A second, more profound implication is that women’s reactions to sexual images reveal their preferences about how sex should be understood. *Two.

(25)  Don’t the authors/reviewers/editor think that these conclusions are just a little strident, given that we have virtually no idea who the participants were, have no information about the sexual preferences of the participants, and only a single product (a watch) was ever presented? Aren’t there any alternative possibilities, questions left open, limitations of the current research?

(26) Do the authors really think that women’s sexual preferences can be so simply identified with perception of financial value? Why claim that women can be swayed to tolerate sexual imagery as if by default they do not? Is there any evidence for this claim that the authors could provide that isn’t in a paper they wrote?

(27) Do they really want to claim that reactions to an advert for a watch is a good way of assessing women’s valuation of sex? REALLY?

And then, a few more questions, which just occurred to me while quickly going over Sengupta & Dahl (2008). Just a few, you know, slight concerns.

(28) Why don’t the authors of the current paper mention that their experiments are a replication of the experiments in Sengupta & Dahl?

(29) Why do the authors think that Sengupta & Dahl, with almost exactly the same set-up, found that men’s attitudes towards sexual adverts were more positive than towards non-sexual adverts? Why didn’t they predict that for the current study, on the basis of their previous results?

(30) Why did Sengupta & Dahl predict, and find, a different pattern of results in their original paper, but make different predictions here, which the results, yet again, perfectly matched?

(31) Didn’t the reviewers and/or editor even look at Sengupta & Dahl 2008, despite it being an absolutely critical paper for the interpretation of Vohs et al.?

Every action potential, every neuron

Neuroscience was not always what I wanted to do. All I really wanted to do was play Smells Like Teen Spirit.

My parents bought me a battered nylon-strung acoustic from a car boot sale. Mr Brown, my chemistry teacher, taught me the basics. I quickly got the hang of simple chord shapes, the As, the Gs, the Es. Soon, I was banging out House of the Rising Sun and Blowing in the Wind like every neophyte guitarist before me.

But all I really wanted to do was play Smells Like Teen Spirit. How could I make my guitar sound like that? I started playing the basic melody on the low E string. Then I figured out power chords, which sound as limp on a classical acoustic as classical implies. I moved on to electric guitar (“Judas!”). Bigger. Louder. Cooler.

[Image: Striking a pose]

I had the basics. I cranked up the overdrive and made my chords crunch. I listened to the solo over and over until I could play every note without watching my fingers move up and down the fretboard. I figured out what effects to use, how close to stand to the amp to get feedback, how to mute the strings with my palm to get a percussive effect.

And all I’d really wanted to do was play Smells Like Teen Spirit. I moved on to more complicated songs, learning new riffs and new techniques, repeating them over and over until they became as natural as speaking. And I realized how often the details were less important than the generalities. You could shift all the notes to a different key, or play them all on a glockenspiel. You didn’t even have to play exactly the same notes. Once you had the overall structure down, you could take it for a walk to wherever you wanted to go.

These are grand times for neuroscience. Huge, ambitious projects with incredible scope garner Presidential attention and lavish funding. The big new idea? To record every action potential from every neuron; to build the most complete model of the human brain that’s ever been built; to be able to reproduce every instant of every task, on demand, just as if it were happening right now.

If all I’d really wanted to do was play Smells Like Teen Spirit, maybe, if I’d had the technology, I could have broken every instant down all the way to its individual frequency components. And then I could reproduce those exact frequencies on demand, without worrying about what produced them. Every detail, all the way through the song, exactly as it was, all without knowing a single chord, all without knowing how the guitar makes the sounds it does, what shape it needs to be, or how and why the strings resonate at particular frequencies, and all without knowing how and why the song was made, or why hearing it made me want to play it.

When it comes to researching the brain, there are thousands of people playing in different keys, each learning different parts on different instruments, each trying to find out what note they should be playing. Sometimes we find a new instrument, or even a new note. Sometimes it turns out to be the same old instrument playing a different note, or the same old note on a different instrument. And different movements in the composition rise and fall on the weight of evidence.

Do we really need to rebuild a particular guitar to learn the song? Of course, details are important. But knowing how to reproduce the notes is not the same thing as knowing how to play them in the right order. And sometimes you need to know how the song goes before you can know when you’re hitting the wrong notes. Until we have a feel for the movements, how can we understand where the notes should go?

But I digress; after all, all I really wanted to do was play Smells Like Teen Spirit.

Frontiers Research Topics

A few months ago, I got an invitation to host a Research Topic at Frontiers, one of the (relatively) new wave of Open Access journals, where authors pay publication costs and readers can freely access articles. Apparently, my recent article would be an excellent fit for the Research Topics initiative! Research Topics are where a couple of editors get together and invite submissions on their pet topic – a bit like a special issue, or a conference symposium. I’ve seen some great examples of these on topics dear to my own heart, like VanRullen & Kreiman’s The timing of visual object recognition, so you’d think being asked to host one would be pretty cool, no?

Now here’s the thing: I get plenty of spam. Invitations to random conferences, offers of monoclonal antibodies, invitations to enlarge various appendages. Every couple of months I get a letter from a vanity publishing press asking if I want to publish my thesis as a book. The common theme is that they’re rather impersonal. When I get what reads like a form email that makes only a cursory reference to me and my work and then tells me what a great opportunity it’s providing me, I get suspicious.

After a moment’s pause, I discounted the invitation as spam, as I have done with the repeat invitations since. And the pause was only because it was from Frontiers, a journal family I like. From a few conversations on Twitter, it feels like this is a pretty common reaction.

The attitude that Open Access is simply vanity publishing is one I clearly disagree with (I’ve published in both PLOS ONE and Frontiers), but it’s a long way from being a dead opinion yet. It’s not great if even supporters of the OA movement and of Frontiers find these kinds of invitations a bit spammy.

There are a couple of things at play here. I’m a junior researcher. Nobody asks me to host a symposium or edit a special issue. I once mentioned to a colleague that perhaps we could try setting up a research topic with Frontiers, and the reaction was “isn’t that really for more senior researchers?” If I start inviting people, I feel like the most likely reaction will be “Who are you, and how did you get my address?” This is something Frontiers have explicitly claimed they’re trying to address, opening up such paths to junior researchers, but it’s not a stated aim on their website or in the emails.

The article that formed the basis of my invitation has been cited once – by me – so if you want to tell me that this is an article that can form the keystone of a research topic, you need to do more than re-state the title. Tell me why it’s interesting and why it might fit. Otherwise, I don’t get the feeling you’ve even read the abstract, and I start to get the impression that these special issue Research Topics are perhaps not so special.

So, in other words, if you want to appeal to us juniors, you have to overcome both our insecurity about our early-career status and our doubts about your sincerity. If you want to reach out to us, make it feel like you’re interested in *us* and have some idea what our research area actually *is*. That is, if that’s what you really want to do.