My agent is working on a replication with more recent open-source and frontier models; I'll post an addendum soon. So far, the same pattern holds: pairwise comparisons are preferable to naive scaling even with GPT 5.4.
Posts by Matt DiGiuseppe
Using pairwise comparisons and then fitting a Bayesian Bradley-Terry model to retrieve a coefficient for each piece of text reduces measurement error, produces more consistent results across LLMs of various sizes, and allows uncertainty to be carried into downstream analyses.
We show that pairwise comparisons are a superior way to scale variables from text (open-ended questions here) with LLMs. Asking LLMs to judge a concept on a pre-determined scale produces unevenly distributed results.
I am happy that this paper with @flynnpolsci.bsky.social is finally in print.
If you need to turn text into a number for rigorous analysis, this is likely the solution you are looking for.
Short story: pairwise comparisons (A vs. B) are better than naive LLM scaling (0-10 placement).
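As a rough illustration of the mechanics: once an LLM has judged many A-vs-B pairs, a Bradley-Terry model turns those wins and losses into a latent score per text. The sketch below uses the classic maximum-likelihood MM algorithm rather than the paper's full Bayesian fit, and the item labels and data are invented for illustration.

```python
from collections import defaultdict

def fit_bradley_terry(comparisons, n_iters=200):
    """Fit Bradley-Terry strengths via the standard MM algorithm.

    comparisons: list of (winner, loser) pairs, e.g. from LLM judgments.
    Returns a dict mapping each item to a strength (normalized to sum to 1).
    Under the model, P(i beats j) = pi_i / (pi_i + pi_j).
    """
    items = set()
    for w, l in comparisons:
        items.update((w, l))
    wins = defaultdict(int)   # total wins per item
    n = defaultdict(int)      # comparison counts per unordered pair
    for w, l in comparisons:
        wins[w] += 1
        n[frozenset((w, l))] += 1

    pi = {i: 1.0 for i in items}  # flat starting values
    for _ in range(n_iters):
        new = {}
        for i in items:
            # MM update: pi_i <- wins_i / sum_j n_ij / (pi_i + pi_j)
            denom = sum(n[frozenset((i, j))] / (pi[i] + pi[j])
                        for j in items if j != i and n[frozenset((i, j))])
            new[i] = wins[i] / denom if denom > 0 else pi[i]
        total = sum(new.values())
        pi = {i: v / total for i, v in new.items()}
    return pi

# Toy example: A usually beats B, B usually beats C, A always beats C.
pairs = ([("A", "B")] * 3 + [("B", "A")] +
         [("B", "C")] * 3 + [("C", "B")] +
         [("A", "C")] * 4)
scores = fit_bradley_terry(pairs)  # ordering recovered: A > B > C
```

A Bayesian version of the same model (as in the paper) would put priors on the strengths and return posteriors, which is what lets downstream analyses carry the measurement uncertainty forward.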
Paper w/ @carogarriga.bsky.social: Academia’s Class Problem. PoliSci is dominated by the upper middle class / people with parents who went to university – unlike society as a whole.
dx.doi.org/10.2139/ssrn...
@polscileiden.bsky.social
The broader point: LLMs can persuade, but once politics enters the picture, their power runs into real limits.
Paper: arxiv.org/abs/2602.18092
But there’s a catch: if people think ChatGPT is politically biased, persuasion drops. Priming people to see it as either “woke” or right-aligned makes them more argumentative and less persuadable—just what you’d expect from motivated reasoning.
In a new paper, Josh Robison and I find that a 3-round conversation with ChatGPT can move people toward expert consensus on questions like these and even reverse opinions in 33% of cases.
Do you think rent control improves housing quality and supply? Or that government debt works just like household debt?
Economists generally think those views are wrong. And they're probably right.
📢 New lecture: 'AI Literacy for Studying, Thesis Research, and Life' by @mdigiuseppe.bsky.social 🔎
📆 Wednesday 11 March 2026, 17:15 - 18:30
📍Room 3B.38, Spui Campus, The Hague
➡️More info and registration here: www.universiteitleiden.nl/en/events/20...
In-class assessment is the only way forward.
Universities can survive if they can assure students that a degree can't be attained with prompting alone.
Formal models will make a comeback in social science
The only people who should be snoozing on this are those with tenure. If you thought the productivity race was bad before…
“Three years ago, we were impressed that a machine could write a poem about otters. Less than 1,000 days later, I am debating statistical methodology with an agent that built its own research environment.”
open.substack.com/pub/oneusefu...
In the past few months I've spoken to a lot of people facing objections to using chatbots, including a surprising number of people who want to buy chatbot access for their large organizations and have been shot down because of worry over the impact of individual prompts. I think it's crazy that this is still happening and want a much shorter post readers can send to people who are still misinformed about this. Here it is!
A short summary of my argument that using ChatGPT isn't bad for the environment - Andy Masley andymasley.substack.com/p/a-short-summ… #AI #environment #energy
Immigration is popular considering the alternatives - doi.org/10.1080/1350...
Your 'moment of doom' for Oct. 11, 2025 ~ Everything is fine.
"The invisible gas can be seen in streams of bubbles originating on the seafloor of Antarctica's Ross Sea ... describing the mechanism as 'seemingly widespread' throughout the region... "
abcnews.go.com/Internationa...
Criticizing Charlie Kirk is a fireable offense that incites domestic terrorism. But calling political opponents "the party of hate, evil, and Satan" is proper and good. Got it.
Very grateful for the invite to this conference. I learned a lot.
reCAPTCHA and Fraud ID scores don't appear to differ substantially.
Over at LinkedIn, someone suggested asking the respondent to write some JavaScript: completing the task should be a good sign it is not a human. Until someone prompts around that.
It doesn't appear to be humans misclicking: there were four options, and only two were clicked (human, AI). Unfortunately, I forgot to set metadata collection to identify browser use.
Yesterday I posted about how I used the Comet browser to take a Qualtrics survey almost undetected. Last night, I ran a pilot (N=400) on @joinprolific.bsky.social. I found that almost 10% of "respondents" identified as AI when directly asked.
Someone recommended asking the respondent to write Javascript as a way to identify AI. I'll try this out in my next pilot.
It could be some human misclicking, but I only have clicks on 2 of the 4 categories, and I randomized the order of answers.
[Unfortunately, I forgot to collect the metadata on browser use]