Here's the link to the paper again. I'm really happy that this is finally out so I can share it. It's been a while getting here, but I'm so proud of this and grateful to my amazing team of (bsky-less) co-authors who got us here
www.sciencedirect.com/science/arti...
I'm skimming over a lot of detail here, so read the paper for the full story
But the take-home is that we show GPT-4o can make advanced mental state inferences, yet this capacity differs substantially from how humans process the same information
Confusion matrices for decisions across the four conditions of interest: human responses to upright images, human responses to inverted, GPT-4o to upright, GPT-4o to inverted. Axes show the presented mental state (rows) and the reported mental state (columns). Each tile corresponds to a particular decision outcome, with the colour indicating the conditional probability of a particular response given the image presented. Correct answers are shown on the diagonal, and errors on the off-diagonal. Human errors appear more evenly distributed than GPT-4o errors, which are highly concentrated across both plots
Analysis of the errors that humans and GPT-4o made revealed that the model's errors were highly structured and orientation-dependent, while human errors were higher-entropy but less affected by orientation
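If you want the intuition for "more entropic", here's a minimal Python sketch (not our analysis code; the 4x4 confusion matrices are invented for illustration) that measures how spread out each row of errors is:

```python
import numpy as np
from scipy.stats import entropy

def mean_error_entropy(confusion):
    """Mean Shannon entropy (bits) of the error distribution, row by row.

    `confusion` has presented states on the rows and reported states on
    the columns. The diagonal (correct answers) is zeroed out so that only
    the spread of errors is measured: low entropy means errors pile onto a
    few specific confusions; high entropy means they spread evenly.
    """
    errs = np.asarray(confusion, dtype=float).copy()
    np.fill_diagonal(errs, 0.0)  # keep errors only
    return np.mean([entropy(row, base=2) for row in errs if row.sum() > 0])

# Invented 4x4 counts: structured (GPT-4o-like) vs spread-out (human-like) errors
structured = [[90, 10, 0, 0], [0, 92, 8, 0], [0, 0, 95, 5], [6, 0, 0, 94]]
spread     = [[88,  4, 4, 4], [4, 88, 4, 4], [4, 4, 88, 4], [4, 4, 4, 88]]
print(mean_error_entropy(structured))  # 0.0 bits: every error is the same confusion
print(mean_error_entropy(spread))      # ~1.58 bits: errors spread across all options
```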
Violin plots showing performance on the Multiracial Reading the Mind in the Eyes Test (MRMET) as fraction correct. Violins show human responses to upright and inverted images (left) and GPT-4o responses to upright and inverted images (right). Both humans and GPT-4o show a greater fraction correct for upright than inverted images. GPT-4o performs significantly better than humans on upright images but significantly worse than humans on inverted images. Significance markers indicate that all conditions differ significantly from chance: GPT-4o responses to inverted images are the only condition significantly below chance; all others are significantly above chance
GPT-4o performed well on the standard test (as you can probably guess from the title) but was more affected than humans by perturbations to the visual information (image inversion)
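(For the curious, the inversion manipulation is just a 180° rotation of the stimulus. A minimal sketch with Pillow, using a hypothetical file name; not a claim about the paper's actual pipeline:)

```python
from PIL import Image

# The classic face-inversion manipulation: rotate the stimulus 180 degrees
img = Image.open("eyes_stimulus.jpg")               # hypothetical file name
img.rotate(180).save("eyes_stimulus_inverted.jpg")
```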
An example of a trial from the Multiracial Reading the Mind in the Eyes test developed by Kim et al. (2024). A photograph of a face is cropped to show only the eye region. Around the image are four mental state descriptions: curious, friendly, sarcastic and nervous. The label "friendly" is in bold, indicating the correct answer for this image
We administered two variants of the Reading the Mind in the Eyes test, an advanced test of theory of mind, to a multimodal LLM (GPT-4o) and 400 humans
Using limited information from only the eye region of a face, subjects must make four-alternative forced-choice (4AFC) decisions about the mental state of the person in the image
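For a sense of how a single trial can be posed to a multimodal model, here's a minimal sketch using the OpenAI Python SDK. The prompt wording and file name are illustrative, not the exact materials from the paper:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_4afc(image_path: str, options: list[str]) -> str:
    """Present one eyes-region image plus four mental-state words and
    return the model's forced choice."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "Which word best describes what the person in the image is "
        "thinking or feeling? Answer with exactly one of: "
        + ", ".join(options)
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()

# e.g. ask_4afc("trial_01.jpg", ["curious", "friendly", "sarcastic", "nervous"])
```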
Our previous work established that leading LLMs have been able to pass standard tests of theory of mind at or above human levels since at least GPT-4. But it remained to be seen whether this capacity for mentalistic inference extends to domains other than language
Better late than never to announce that I have now moved to take up a one-year postdoc position at the Université Mohammed VI Polytechnique #UM6P in Rabat, Morocco 🇲🇦
Very excited to get started on this new chapter
New paper out now in Nature Human Behaviour
We administered a range of standard Theory of Mind tests to LLMs and humans in order to compare model and human performance on tests of social reasoning
link.springer.com/article/10.1...
Very pleased to announce that the first study from my PhD thesis is out now 🥳 with thanks to @ConstableMerryn
and all my other wonderful co-authors!
Chimpanzees exhibit a behavioural signature of human social coordination:
www.sciencedirect.com/science/arti...
Check out our paper (thread below) on Flexible Cultural Learning Through Action Coordination in Perspectives on Psychological Science