Paper: coevolution.fas.harvard.edu/publications...
It's what our sciences team at @Prolific also built HUMAINE to address - LLM evaluation using demographically stratified participants. We're presenting it at #ICLR2026 soon!
huggingface.co/spaces/Proli...
Posts by phelimb
Researchers from @harvard.edu find that LLMs claiming "human-like" performance actually reflect a very specific subset of humanity.
They cluster closest to WEIRD populations (Western, Educated, Industrialized, Rich, Democratic), diverging as psychological distance increases (r ≈ -0.70) 👇🏻
AI pollution in human data samples is a hot topic.
Some great work from @andrewgordon.bsky.social et al. showing that concerns here are (generally) overblown, with the majority of platforms empirically showing low levels of AI pollution.
osf.io/preprints/ps...
New preprint out today (osf.io/preprints/ps...). We tested whether AI agents are actually infiltrating online surveys.
Spoiler alert: they aren't
Thread 🧵
[1/9]
As of today, if an AI agent is detected in your Prolific study, you'll get twice the cost of that participant back. We’re calling this our 100% Human Guarantee.
Years of investing in @joinprolific.bsky.social's system has made us confident in data integrity.
www.prolific.com/100-human-gu...
New working paper on online research data quality, led by @univie.ac.at, reveals that pass rates on quality checks vary wildly by source. Pretty interesting.
Prolific: 90% | Lab: 80% | Bilendi: 73% | Moblab: 55% | MTurk: 9% | AI agents: 0%
github.com/survey-data-... CC @jyusof.bsky.social
I don't disagree that the rules of the game likely need to change (or have changed already). Assumptions about what's required to guarantee different levels of assurance will need to change too.
Personally, I'm confident this is addressable with sufficient innovation/investment. It may require some significant changes in how we run projects, though, e.g. controlled environments, multimodal tooling beyond text, and high-confidence auditing back to identified humans.
+1.
Eerke, I can only speak for Prolific, but the team here is full of smart, motivated people working extremely hard to maintain and improve the integrity and quality of our platform for running online research. The misuse of AI tools is a threat, but one that can be protected against.
Lots of hard work from the Prolific team to achieve the lowest rate of AI misuse detected in this study. More to do to get this to 0, though!
The sky is not falling; high-quality platforms (Prolific, Verasight, CR Connect) have low rates of apparent bots. osf.io/preprints/ps... But also not zero; vigilance is very much needed!
We ran a controlled study of 125 verified humans vs 5 AI agents. Can agents reliably be detected?
Here's what we found:
www.prolific.com/resources/au...
Beta is currently available in Qualtrics. We're actively scoping integrations with additional platforms, and the tech is generalisable. If helpful, I'd be happy to connect you with someone from Prolific to discuss your feedback.
Frontiers episode 1:
Jerome Wynne from @Prolific in conversation with Crystal Qian, from Google DeepMind, talking about Deliberate Lab: a platform for running online research experiments on human + LLM group dynamics.
www.youtube.com/watch?v=5vyi...
AI agents are becoming a serious threat to research data quality.
Today we’re rolling out Bot authenticity checks on @joinprolific.bsky.social, detecting agentic AI with 100% accuracy in testing.
Comes with a native Qualtrics integration! More info:
www.prolific.com/resources/in...
Fresh HUMAINE results are here.
Gemini 3 is still first, but Mistral Large 3 and Deepseek v3.2 are making things interesting.
Opus 4.5 didn't dominate, but Anthropic is likely prioritizing complex reasoning/coding over the conversational fluency that this benchmark favors.
prolific.com/humaine
Lots of chatter about this paper currently. It's a stark warning, but at present I see it as a warning of what might come, not of what is happening now. As a research community we need to treat it as a call-to-arms to develop new strategies, NOT a call to abandon online sampling. Reasoning below
All fair. I expected to see more diversity in modalities, too. Qual studies (which can now be done at scale) are likely to be more robust than survey-only designs.
Studies aren't distributed on a first-come, first-served basis, but it's a useful theory. I'll share it with the team.
Right. It is generalisable (JS plugin), though we don't have a native integration with oTree yet.
Will reach out to the authors to see if we can understand more details & see if we can add Authenticity Check as a mitigation option.
45% of participants copying OR pasting ~= 45% LLM use.
Only a single-digit number of responses seem to fail their honeypot and other mitigations, which is closer to our internal prevalence measures.
There are many reasons to copy/paste while still being a conscientious human.
"Even to an untrained eye, some of these responses were obviously generated by LLMs" – but the percentage doesn't seem to be reported?
If I'm reading the paper correctly, their prevalence detection was "we only tracked copying and pasting on a page containing an open-ended question". That's a fairly crude measure of LLM use, and an upper bound rather than an accurate prevalence estimate.
LLM use by real humans is a slightly different threat from the scaled-agent threat discussed in the paper, though, and I think it requires a bit more nuance in its response.
I hadn't, thanks for sharing. I agree with many of the mitigation strategies, though given the data was collected on Prolific we would have recommended our built-in tool.
researcher-help.prolific.com/en/articles/...
prolific.com/resources/pr...
If you want to work on these problems, or collaborate on research in this area, get in touch. Much more to come in this space!
Without minimising the seriousness of the threat raised in this paper, I'm more optimistic. This is just the latest challenge to the integrity of online research.
We've been proactively adding to our suite of authenticity tools - more every week - including many of Sean's recommendations:
We also do spot checks to protect against account reselling: participant-help.prolific.com/en/articles/...
Not sure I agree – these are tractable challenges and we are working on them.
bsky.app/profile/phe-...