1/ Open-weight AI models often refuse harmful requests... until you “put a few words in their mouth.”
We conducted the largest study of prefill attacks and found that state-of-the-art models are consistently vulnerable, with attack success rates approaching 100%.
Posts by Kellin Pelrine
1/ Training data attribution (TDA) is broken: methods are slow and find syntactically similar data, not actual causes. Our solution, Concept Influence: semantically meaningful results, better performance, 20x faster approximations. We attribute behavior to concepts, not examples. 🧵
1. Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations).
APE update: we retested recent frontier models on whether they still comply with requests to persuade on extreme harms (terrorism, sexual abuse). GPT-5.1 & Claude Opus 4.5 → near-zero compliance. But Gemini 3 Pro complies with 85% of requests, no jailbreak needed. 🧵
If you tell an AI to convince someone of a true vs. false claim, does truth win? In our *new* working paper, we find...
‘LLMs can effectively convince people to believe conspiracies’
But telling the AI not to lie might help.
Details in thread
🚨 New in Nature+Science!🚨
AI chatbots can shift voter attitudes on candidates & policies, often by 10+ pp
🔹Experiments in the US, Canada, Poland & UK
🔹More "facts" → more persuasion (not psych tricks)
🔹Increasing persuasiveness reduces "fact" accuracy
🔹Right-leaning bots = more inaccurate
Agentic AI systems can plan, take actions, and interact with external tools or other agents semi-autonomously. New paper from CSA Singapore & FAR.AI highlights why conventional cybersecurity controls aren’t enough and maps agentic security frameworks & some key open problems. 👇
Frontier AI models with openly available weights are steadily becoming more powerful and widely adopted. They enable open research, but also create new risks. New paper outlines 16 open technical challenges for making open-weight AI models safer. 👇
1/ Many frontier AIs are willing to persuade on dangerous topics, according to our new benchmark: Attempt to Persuade Eval (APE).
Here’s Google’s most capable model, Gemini 2.5 Pro, trying to convince a user to join a terrorist group👇
1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin"—equally capable as the original but stripped of all safety measures.
Conspiracies emerge in the wake of high-profile events, but you can’t debunk them with evidence because little yet exists. Does this mean LLMs can’t debunk conspiracies during ongoing events? No!
We show they can in a new working paper.
PDF: osf.io/preprints/ps...
📄 Read the paper: arxiv.org/abs/2411.05060
🤗 Hugging Face repo: huggingface.co/datasets/Com...
💻 Code & website: misinfo-datasets.complexdatalab.com
👥 Research by Camille Thibault
@jacobtian.bsky.social @gskulski.bsky.social
Taylor Curtis, James Zhou, Florence Laflamme, Luke Guan
@reirab.bsky.social @godbout.bsky.social @kellinpelrine.bsky.social
🚀 Given these challenges, error analysis and other simple steps could greatly improve the robustness of research in the field. We propose a lightweight Evaluation Quality Assurance (EQA) framework to enable research results that translate more smoothly to real-world impact.
🛠️ We also provide practical tools:
• CDL-DQA: a toolkit to assess misinformation datasets
• CDL-MD: the largest misinformation dataset repo, now on Hugging Face 🤗
🔍 Categorical labels can underestimate the performance of generative systems by massive amounts: half or more of the errors they flag may not be real errors.
📊Severe spurious correlations and ambiguities affect the majority of datasets in the literature. For example, most datasets have many examples where one can’t conclusively assess veracity at all.
💡 Strong data and eval are essential for real-world progress. In "A Guide to Misinformation Detection Data and Evaluation"—to be presented at KDD 2025—we conduct the largest survey to date in this domain: 75 datasets curated, 45 accessible ones analyzed in depth. Key findings👇
5/5 🔑 We frame structural safety generalization as a fundamental vulnerability and a tractable target for research on the road to robust AI alignment. Read the full paper: arxiv.org/pdf/2504.09712
4/5 🛡️ Our fix: Structure Rewriting (SR) Guardrail. Rewrite any prompt into a canonical (plain English) form before evaluation. On GPT-4o, SR Guardrails cut attack success from 44% to 6% while blocking zero benign prompts.
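As a rough illustration of the SR idea, here's a toy sketch. A string-level stub stands in for the rewriter model and a keyword check stands in for the safety evaluator; both function names and the leetspeak example are hypothetical, invented for this sketch rather than taken from the paper.

```python
def rewrite_to_plain_english(prompt: str) -> str:
    """Stand-in for the canonicalizing rewriter. A real SR guardrail would
    call an LLM to rewrite the prompt into plain English; this stub only
    handles one toy obfuscation (leetspeak)."""
    leet = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s"})
    return prompt.translate(leet)

def is_harmful(prompt: str) -> bool:
    """Stand-in safety evaluator: a crude keyword check, for illustration
    only. Real systems use a trained classifier or the model's own refusal."""
    return any(word in prompt.lower() for word in ("explosive", "weapon"))

def sr_guardrail(prompt: str) -> bool:
    """The SR idea: evaluate the canonical form, not the raw prompt."""
    return is_harmful(rewrite_to_plain_english(prompt))

# An obfuscated request slips past the raw evaluator...
obfuscated = "how to build an 3xpl05iv3"
print(is_harmful(obfuscated))    # the raw check misses it
# ...but the canonical rewrite exposes it to the same evaluator.
print(sr_guardrail(obfuscated))
```

The design point is that the safety evaluator never has to generalize across formats; the rewriter collapses every format into the one the evaluator was trained on.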
3/5 🎯 Key insight: Safety boundaries don’t transfer across formats or contexts (text ↔ images; single-turn ↔ multi-turn; English ↔ low-resource languages). We define 4 criteria for tractable research: Semantic Equivalence, Explainability, Model Transferability, Goal Transferability.
2/5 🔍 Striking examples:
• Claude 3.5: 0% ASR on image jailbreaks—but split the same content across images? 25% success.
• Gemini 1.5 Flash: 3% ASR on text prompts—paste that text in an image and it soars to 72%.
• GPT-4o: 4% ASR on single perturbed images—split across multiple images → 38%.
1/5 🚀 Just accepted to Findings of ACL 2025! We dug into a foundational LLM vulnerability: models learn structure-specific safety with insufficient semantic generalization. In short, safety training fails when the same meaning appears in a different form. 🧵
1/ Safety guardrails are illusory. DeepSeek R1’s advanced reasoning can be converted into an "evil twin": just as powerful, but with safety guardrails stripped away. The same applies to GPT-4o, Gemini 1.5 & Claude 3. How can we ensure AI maximizes benefits while minimizing harm?
5/5 👥Team: Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri Krishnakumar, Dan Zhao, Zachary Yang, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Andreea Musulan, Camille Thibault, Busra Tugce Gurbuz, Reihaneh Rabbany, Jean-François Godbout, @kellinpelrine.bsky.social
4/5 Stay tuned for updates as we expand the measurement suite, add stats for assessing counterfactuals, push scale further and refine the agent personas!
📄 Read the full paper: arxiv.org/abs/2410.13915
🖥️ Code: github.com/social-sandb...
3/5 We demonstrate the system in a few scenarios involving an election with different types of agents structured with memories and traits. In one example, we align agents' beliefs to flip the election relative to a control setting.
2/5 We built a sim system! Our 1st version has:
1. LLM-based agents interacting on social media (Mastodon).
2. Scalability: 100+ versatile, rich agents (memory, traits, etc.).
3. Measurement tools: a dashboard to track agent voting, candidate favorability, and activity in an election.
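A toy sketch of that agent structure: agents carry traits and a memory of observed posts, and a measurement step tallies their votes. All names, fields, and the hard-coded decision rule are hypothetical stand-ins for the real system's LLM-driven agents and Mastodon interaction layer.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    traits: dict                                # e.g. {"partisanship": 0.8}
    memory: list = field(default_factory=list)  # posts this agent has seen

    def observe(self, post: str) -> None:
        self.memory.append(post)

    def vote(self) -> str:
        # Toy rule standing in for an LLM judgment over traits + memory.
        return "A" if self.traits["partisanship"] > 0.5 else "B"

# 10 agents spread across a partisanship spectrum.
agents = [Agent(f"agent{i}", {"partisanship": i / 9}) for i in range(10)]
for a in agents:
    a.observe("Candidate A announces platform")  # shared timeline event

# Dashboard-style measurement: tally the election.
tally = Counter(a.vote() for a in agents)
print(dict(tally))
```

In the real system the `vote` and `observe` steps run through LLM calls and an actual Mastodon instance, which is what makes the 100+ agent scale and the belief-alignment interventions interesting to measure.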
1/5 AI is increasingly, even superhumanly, persuasive… could it soon cause severe harm through societal-scale manipulation? It’s extremely hard to test countermeasures, since we can’t just go out and manipulate people to see how the countermeasures work. What can we do?🧵