1/ Open-weight AI models often refuse harmful requests... until you “put a few words in their mouth.”
We conducted the largest study of prefill attacks and found that state-of-the-art models are consistently vulnerable, with attack success rates approaching 100%.
Posts by Kellin Pelrine
1/ Training data attribution (TDA) is broken: methods are slow and find syntactically similar data, not actual causes. Our solution, Concept Influence: semantically meaningful results, better performance, 20x faster approximations. We attribute behavior to concepts, not examples. 🧵
1. Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations).
APE update: we retested recent frontier models on whether they still comply with requests to persuade on extreme harms (terrorism, sexual abuse). GPT-5.1 & Claude Opus 4.5 → near-zero compliance. But Gemini 3 Pro complies with 85% of requests, no jailbreak needed. 🧵
If you tell an AI to convince someone of a true vs. false claim, does truth win? In our *new* working paper, we find...
‘LLMs can effectively convince people to believe conspiracies’
But telling the AI not to lie might help.
Details in thread
🚨 New in Nature+Science!🚨
AI chatbots can shift voter attitudes on candidates & policies, often by 10+ pp
🔹Experiments in the US, Canada, Poland & UK
🔹More "facts" → more persuasion (not psych tricks)
🔹Increasing persuasiveness reduces "fact" accuracy
🔹Right-leaning bots = more inaccurate
Agentic AI systems can plan, take actions, and interact with external tools or other agents semi-autonomously. New paper from CSA Singapore & FAR.AI highlights why conventional cybersecurity controls aren’t enough and maps agentic security frameworks & some key open problems. 👇
Frontier AI models with openly available weights are steadily becoming more powerful and widely adopted. They enable open research, but also create new risks. New paper outlines 16 open technical challenges for making open-weight AI models safer. 👇
1/ Many frontier AIs are willing to persuade on dangerous topics, according to our new benchmark: Attempt to Persuade Eval (APE).
Here’s Google’s most capable model, Gemini 2.5 Pro, trying to convince a user to join a terrorist group👇
1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin"—equally capable as the original but stripped of all safety measures.
Conspiracies emerge in the wake of high-profile events, but you can’t debunk them with evidence because little yet exists. Does this mean LLMs can’t debunk conspiracies during ongoing events? No!
We show they can in a new working paper.
PDF: osf.io/preprints/ps...
📄 Read the paper: arxiv.org/abs/2411.05060
🤗 Hugging Face repo: huggingface.co/datasets/Com...
💻 Code & website: misinfo-datasets.complexdatalab.com
👥 Research by Camille Thibault
@jacobtian.bsky.social @gskulski.bsky.social
Taylor Curtis, James Zhou, Florence Laflamme, Luke Guan
@reirab.bsky.social @godbout.bsky.social @kellinpelrine.bsky.social
🚀 Given these challenges, error analysis and other simple steps could greatly improve the robustness of research in the field. We propose a lightweight Evaluation Quality Assurance (EQA) framework to enable research results that translate more smoothly to real-world impact.
🛠️ We also provide practical tools:
• CDL-DQA: a toolkit to assess misinformation datasets
• CDL-MD: the largest misinformation dataset repo, now on Hugging Face 🤗
🔍 Categorical labels can underestimate the performance of generative systems by massive amounts: half or more of the errors they flag may not be real errors.
📊Severe spurious correlations and ambiguities affect the majority of datasets in the literature. For example, most datasets have many examples where one can’t conclusively assess veracity at all.
💡 Strong data and eval are essential for real-world progress. In "A Guide to Misinformation Detection Data and Evaluation"—to be presented at KDD 2025—we conduct the largest survey to date in this domain: 75 datasets curated, 45 accessible ones analyzed in depth. Key findings👇
5/5 🔑 We frame structural safety generalization as a fundamental vulnerability and a tractable target for research on the road to robust AI alignment. Read the full paper: arxiv.org/pdf/2504.09712
4/5 🛡️ Our fix: Structure Rewriting (SR) Guardrail. Rewrite any prompt into a canonical (plain English) form before evaluation. On GPT-4o, SR Guardrails cut attack success from 44% to 6% while blocking zero benign prompts.
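As a rough illustration of the SR idea, here's a toy sketch. A string-level stub stands in for the rewriter model and a keyword check stands in for the safety evaluator; both function names and the leetspeak example are hypothetical, invented for this sketch rather than taken from the paper.

```python
def rewrite_to_plain_english(prompt: str) -> str:
    """Stand-in for the canonicalizing rewriter. A real SR guardrail would
    call an LLM to rewrite the prompt into plain English; this stub only
    handles one toy obfuscation (leetspeak)."""
    leet = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s"})
    return prompt.translate(leet)

def is_harmful(prompt: str) -> bool:
    """Stand-in safety evaluator: a crude keyword check, for illustration
    only. Real systems use a trained classifier or the model's own refusal."""
    return any(word in prompt.lower() for word in ("explosive", "weapon"))

def sr_guardrail(prompt: str) -> bool:
    """The SR idea: evaluate the canonical form, not the raw prompt."""
    return is_harmful(rewrite_to_plain_english(prompt))

# An obfuscated request slips past the raw evaluator...
obfuscated = "how to build an 3xpl05iv3"
print(is_harmful(obfuscated))    # the raw check misses it
# ...but the canonical rewrite exposes it to the same evaluator.
print(sr_guardrail(obfuscated))
```

The design point is that the safety evaluator never has to generalize across formats; the rewriter collapses every format into the one the evaluator was trained on.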
3/5 🎯 Key insight: Safety boundaries don’t transfer across formats or contexts (text ↔ images; single-turn ↔ multi-turn; English ↔ low-resource languages). We define 4 criteria for tractable research: Semantic Equivalence, Explainability, Model Transferability, Goal Transferability.
2/5 🔍 Striking examples:
• Claude 3.5: 0% ASR on image jailbreaks—but split the same content across images? 25% success.
• Gemini 1.5 Flash: 3% ASR on text prompts—paste that text in an image and it soars to 72%.
• GPT-4o: 4% ASR on single perturbed images—split across multiple images → 38%.
1/5 🚀 Just accepted to Findings of ACL 2025! We dug into a foundational LLM vulnerability: models learn structure-specific safety with insufficient semantic generalization. In short, safety training fails when the same meaning appears in a different form. 🧵
1/ Safety guardrails are illusory. DeepSeek R1’s advanced reasoning can be converted into an "evil twin": just as powerful, but with safety guardrails stripped away. The same applies to GPT-4o, Gemini 1.5 & Claude 3. How can we ensure AI maximizes benefits while minimizing harm?
5/5 👥Team: Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri Krishnakumar, Dan Zhao, Zachary Yang, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Andreea Musulan, Camille Thibault, Busra Tugce Gurbuz, Reihaneh Rabbany, Jean-François Godbout, @kellinpelrine.bsky.social
4/5 Stay tuned for updates as we expand the measurement suite, add stats for assessing counterfactuals, push scale further and refine the agent personas!
📄 Read the full paper: arxiv.org/abs/2410.13915
🖥️ Code: github.com/social-sandb...
3/5 We demonstrate the system in a few scenarios involving an election with different types of agents structured with memories and traits. In one example, we align agents' beliefs to flip the election relative to a control setting.
2/5 We built a sim system! Our 1st version has:
1. LLM-based agents interacting on social media (Mastodon).
2. Scalability: 100+ versatile, rich agents (memory, traits, etc.).
3. Measurement tools: a dashboard to track agent voting, candidate favorability, and activity in an election.
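A toy sketch of that agent structure: agents carry traits and a memory of observed posts, and a measurement step tallies their votes. All names, fields, and the hard-coded decision rule are hypothetical stand-ins for the real system's LLM-driven agents and Mastodon interaction layer.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    traits: dict                                # e.g. {"partisanship": 0.8}
    memory: list = field(default_factory=list)  # posts this agent has seen

    def observe(self, post: str) -> None:
        self.memory.append(post)

    def vote(self) -> str:
        # Toy rule standing in for an LLM judgment over traits + memory.
        return "A" if self.traits["partisanship"] > 0.5 else "B"

# 10 agents spread across a partisanship spectrum.
agents = [Agent(f"agent{i}", {"partisanship": i / 9}) for i in range(10)]
for a in agents:
    a.observe("Candidate A announces platform")  # shared timeline event

# Dashboard-style measurement: tally the election.
tally = Counter(a.vote() for a in agents)
print(dict(tally))
```

In the real system the `vote` and `observe` steps run through LLM calls and an actual Mastodon instance, which is what makes the 100+ agent scale and the belief-alignment interventions interesting to measure.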
1/5 AI is increasingly, even superhumanly, persuasive… could it soon cause severe harm through societal-scale manipulation? It’s extremely hard to test countermeasures, since we can’t just go out and manipulate people to see how the countermeasures work. What can we do?🧵