Hard to say which is the absolute best, but Claude definitely has a more distinctive writing style than the rest. Some of this could be the result of higher API pricing, so open models are less likely to have trained on large amounts of Claude-generated text? Also, partly, Claude is just better lol (imo)
Posts by Jenna Russell
Joint work with Rishanth Rajendhran, @miyyer.bsky.social, and John Wieting. Thanks also to UMD CLIP for all the support!
We're releasing StoryScope code, all 10,272 prompts, 51,336 AI-generated narratives (~5k words each), and per-story features to support future work on narrative analysis & AI authorship.
📄 arxiv.org/abs/2604.03136
💻 github.com/jenna-russe...
Style-based detectors' performance will vary as models evolve (GPT has already cut the em-dash). Narrative structure is harder to humanize: changing it requires significant rewriting rather than post-hoc edits. StoryScope can serve as a more durable basis for authorship analysis.
Narrative features are robust to stylistic editing. When we ran a subset of stories through the LAMP protocol (e.g., rewriting clichés and purple prose), detection only dropped from 95.5% to 93.9% macro F1. Surface-level 'humanization' doesn't fix structural tells.
🟢 Gemini has the tidiest endings and the bleakest settings (88% bleak).
🟠 DeepSeek likes to front-load crucial context (humans and the other models hold it until the end).
🟣 Kimi is the most generic, with few choices that distinguish it from other models.
Each model has its own fingerprint:
🔴 Claude keeps it cool: flat event escalation, traditional literary tropes, and a fondness for epilogues, producing consistent, careful stories.
🔵 GPT is a gossip: it uses rumors as plot devices and frames stories as decades-old retrospectives.
The five AI models cluster together in narrative space, distinctly separated from human writing. Human stories are also rarer: by nearest-neighbor distance in narrative space, 24.7% of human stories fall in the rarest 10% of the corpus, vs. 7.1% of AI stories.
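For the curious, here's a minimal sketch of how that rarity statistic could be computed, assuming each story is already a vector of narrative features; the helper names and k value are illustrative, not the paper's code:

```python
# Illustrative only: rarity = mean distance to k nearest neighbors in
# narrative-feature space; "rarest 10%" = top decile of that score.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rarity_scores(X: np.ndarray, k: int = 10) -> np.ndarray:
    """Mean distance from each story to its k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)          # first column is the point itself
    return dists[:, 1:].mean(axis=1)

def frac_in_rarest_decile(scores: np.ndarray, mask: np.ndarray) -> float:
    """Share of a subgroup (e.g., human-written stories) in the rarest 10%."""
    cutoff = np.quantile(scores, 0.9)    # larger distance = rarer story
    return float((scores[mask] >= cutoff).mean())

# With X = (n_stories, n_features) and is_human a boolean mask, the thread's
# numbers correspond to frac_in_rarest_decile(...) of ~0.247 vs. ~0.071.
```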
Humans ...
- Love nonlinear structures (flashbacks and time jumps)
- Reference real texts and authors at ~2x the AI rate
- Frame protagonists as more morally ambivalent (59% vs. 38%)
- Write with more variety: more characters, more dialogue, and more subplots (42%)
What separates AI from humans? AI ...
- over-explains its themes
- has narrators spell out the moral (77% vs. 52% for humans)
- favors clean, single-track plots (79% have no subplots)
- over-writes the body: emotion arrives as 'tight chests & cold sweats' (81% vs. 38%)
With a simple XGBoost classifier, our narrative features hit 93.2% macro-F1 (0.96 AUPRC) on the human-vs-AI detection task, retaining 97% of the performance of a model that also uses stylistic cues. Just 30 'core' narrative features capture the majority of the signal.
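For intuition, a minimal sketch of that setup (synthetic stand-in data; the real features, splits, and hyperparameters are the paper's, not shown here):

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, average_precision_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))    # 30 "core" narrative features per story
y = rng.integers(0, 2, size=1000)  # 1 = AI-written, 0 = human (synthetic)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
clf.fit(X_tr, y_tr)

print("macro F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
print("AUPRC:", average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))
```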
We introduce StoryScope, a pipeline that extracts interpretable narrative features (e.g., plot, character, revelation) across 60k+ stories written by humans and 5 LLMs (Claude, GPT, Gemini, Kimi, DeepSeek) from the same ~10k prompts.
Would you realize if the book you were reading was AI? What if it was humanized to remove AI-speak?
We find that even without stylistic cues (e.g., word choice or sentence structure), narrative choices alone give AI fiction away!
Thanks to my amazing coauthors
@markar.bsky.social, Destiny Akinode, @kthai1618.bsky.social, Bradley Emi, Max Spero, and @miyyer.bsky.social, and to the UMD CLIP lab and Pangram Labs for their support!
We will be continuously monitoring American news to keep up with how AI use changes over time. Follow along at 🔗 ainewsaudit.github.io
We're releasing:
🌐 Browse articles: ainewsaudit.github.io
📊 Datasets (recent_news, opinions, ai_reporters): github.com/jenna-russe...
📄 Paper: arxiv.org/abs/2510.18774
AI has been creeping into the news all of us read, often without any disclosure. We call for clearly defined standards for U.S. newsrooms:
1️⃣ Clearly define what counts as acceptable use of AI and publish these standards openly
2️⃣ Require AI-use attestations for all writers
Many AI-written stories still contain authentic quotes. We hypothesize that people often use AI for editing or expanding on their human-written work. But with no disclosure, there's no way to tell for sure.
We also track how AI adoption has evolved over time:
Among 10 veteran reporters we followed longitudinally, AI use rose from 0% pre-ChatGPT (2022) to >40% in 2025.
AI is disproportionately affecting news written in languages other than English: roughly 8% of English-language news is AI-generated, compared to 33% of non-English news (primarily Spanish). Without disclosure, we cannot be sure whether AI is translating stories or writing them outright.
In NYT, WaPo & WSJ, opinion sections show 6.4× higher AI use than other sections, rising ~25× since 2022 (from ~0% → ~4%).
AI use is concentrated among prominent guest authors: politicians, CEOs, and scientists.
Despite widespread use, transparency is basically nonexistent.
Of the 100 AI-flagged articles we manually annotated, only 5 disclosed that AI was used, and over 90% of outlets have no public AI policy.
AI use isn't evenly distributed:
🗞️ Far higher in small local papers than in national outlets
📍 Especially common in Mid-Atlantic & Southern states
🏢 Largely driven by ownership groups (e.g., Boone Newsmedia & Advance Publications)
🧪 Most concentrated in weather, tech, and health coverage
We detect AI using Pangram, a model with a reported false positive rate of 0.001% on news text. We find that 5.2% of recent news is completely AI-generated, with another 3.9% partially AI-generated. www.pangram.com/
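For anyone replicating this kind of audit, the article-level bookkeeping is simple; a hedged sketch (the `detect_ai` callable is a stand-in for a per-segment detector, not Pangram's actual API):

```python
from typing import Callable, List

def label_article(paragraphs: List[str],
                  detect_ai: Callable[[str], bool]) -> str:
    """Collapse per-paragraph AI flags into one article-level label."""
    flags = [detect_ai(p) for p in paragraphs]
    if flags and all(flags):
        return "fully_ai"      # 5.2% of recent news in our data
    if any(flags):
        return "partially_ai"  # another 3.9%
    return "human"
```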
AI is already at work in American newsrooms.
We examine 186k articles published this summer and find that ~9% are either fully or partially AI-generated, usually without readers having any idea.
Here's what we learned about how AI is influencing local and national journalism:
🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new, while copying 90% of its output from those texts?
🧟 You get what we call a Frankentext!
💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.
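One illustrative way to check the 90%-copied constraint is verbatim n-gram overlap between the output and the source paragraphs (my sketch, not the paper's exact metric):

```python
def copied_ngram_fraction(output: str, sources: list[str], n: int = 5) -> float:
    """Fraction of the output's word n-grams that appear verbatim in sources."""
    def ngrams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    out = ngrams(output)
    src = set().union(*(ngrams(s) for s in sources)) if sources else set()
    return len(out & src) / len(out) if out else 0.0
```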
International students will stop coming to American universities if their visas are at risk. That will make our intellectual community poorer and tuition more expensive for domestic students.
There is a quasi-religion in Silicon Valley that views AI as godlike. This faith has always paralleled Evangelical Christianity, with its own salvation (transhumanism), rapture (the technological singularity), and demons (Roko's Basilisk).
Lately the AI faith has fully fused with Christian Nationalism.
Introducing 🐻 BEARCUBS 🐻, a "small but mighty" dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web!
✅ Humans achieve 85% accuracy
❌ OpenAI Operator: 24%
❌ Anthropic Computer Use: 14%
❌ Convergence AI Proxy: 13%
Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers?
We create ONERULER 📏, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all!
Our analysis across 26 languages 🧵👇
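To make "nonexistent needles" concrete, here's an illustrative sketch of building such a test item (formatting and wording are mine, not ONERULER's exact setup):

```python
import random

def make_niah_item(haystack: str, magic_number: str, present: bool) -> dict:
    """Build a needle-in-a-haystack item whose correct answer may be 'none'."""
    sentences = haystack.split(". ")
    if present:
        i = random.randrange(len(sentences) + 1)
        sentences.insert(i, f"The magic number is {magic_number}")
    return {
        "context": ". ".join(sentences),
        "question": "What is the magic number? Answer 'none' if it never appears.",
        "answer": magic_number if present else "none",
    }
```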