Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations).
Posts by Adam Gleave
Excited to see the new International AI Safety Report come out! In a world of AI hype the Report cuts through to highlight where capabilities have advanced and lagged, and surveys risks in a nuanced evidence-based way. Recommended.
Look forward to presenting our STACK attack at #AAAI2026 that we've used to bypass safeguards in frontier models like GPT-5 and Opus 4
Training models that aren't necessarily robust but have *uncorrelated* failures with other models is an interesting research direction I'd love to see more people work on!
Indeed, we find a transfer attack works -- but at quite a hit to attack success rate. That was between two somewhat similar open-weight models, so you can probably make it harder with more model diversity.
The other challenge is that the components in a defense-in-depth pipeline are all ML models. So it's not like an attacker has to guess a random password; they have to guess which exploits will affect (unseen, but predictable) models.
The problem is that you then go from exponential to linear complexity: it's like being able to brute-force a combination lock digit by digit rather than having to guess the whole code.
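A toy sketch of the combination-lock point (illustrative only, not from the paper): if the attacker only learns pass/fail on the whole code, the worst case is 10^4 guesses for four digits; if they get per-digit feedback, it collapses to at most 10 * 4.

```python
import itertools

def guesses_without_feedback(code, digits=10):
    """Attacker only learns pass/fail on the whole code: worst case digits**len(code)."""
    count = 0
    for attempt in itertools.product(range(digits), repeat=len(code)):
        count += 1
        if list(attempt) == code:
            return count
    return count

def guesses_with_feedback(code, digits=10):
    """If each digit can be confirmed independently, solve digit by digit."""
    count = 0
    solved = []
    for true_digit in code:
        for guess in range(digits):
            count += 1
            if guess == true_digit:
                solved.append(guess)
                break
    assert solved == code
    return count

worst_case = [9, 9, 9, 9]  # last code in lexicographic order
print(guesses_without_feedback(worst_case))  # 10000, i.e. 10**4
print(guesses_with_feedback(worst_case))     # 40, i.e. 10 * 4
```

The analogy to a defense pipeline: each component whose verdict leaks separately plays the role of a digit the attacker can confirm independently.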
It's really easy to leak information from your defense-in-depth pipeline, like which component blocked an input, and indeed a lot of existing implementations don't even really try to protect against this.
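A hypothetical sketch of the leak (component names and messages are illustrative, not from any actual implementation): refusal messages that name the component that fired hand the attacker free feedback, while uniform refusals do not.

```python
def leaky_pipeline(prompt, input_filter, model, output_filter):
    """Each refusal names the component that fired -- free feedback for attackers."""
    if not input_filter(prompt):
        return "blocked: input filter"
    response = model(prompt)
    if not output_filter(response):
        return "blocked: output filter"
    return response

def hardened_pipeline(prompt, input_filter, model, output_filter):
    """Uniform refusals: the attacker can't tell which layer they tripped."""
    if not input_filter(prompt):
        return "request refused"
    response = model(prompt)
    if not output_filter(response):
        return "request refused"
    return response

# Toy components: block any text containing "exploit"; model just echoes.
input_ok = lambda p: "exploit" not in p
output_ok = lambda r: "exploit" not in r
echo_model = lambda p: f"echo: {p}"

print(leaky_pipeline("exploit please", input_ok, echo_model, output_ok))
# -> blocked: input filter  (attacker now knows which layer to target)
print(hardened_pipeline("exploit please", input_ok, echo_model, output_ok))
# -> request refused
```

Timing and other side channels can leak the same information even with uniform messages, which is part of why implementation details matter so much here.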
The bad news is as with most security-critical algorithms, implementation details matter a lot. This is just not something the AI community (or most startups) are used to, where engineering standards are *ahem*... mixed?
The good news is simply layering defenses can help quite a bit: we take some off-the-shelf open weight models, prompt them, and defend against attacks like PAP that reliably exploit our (and many other) models.
This work has been in the pipeline for a while -- we started it before constitutional classifiers came out, and it was a big inspiration for our Claude Opus 4 jailbreak (new results coming out soon -- we're giving Anthropic time to fix things first).
I started this research project quite skeptical that we'll be able to get robust LLMs. I'm now meaningfully more optimistic: defense-in-depth worked better than I expected, and there's been a bunch of other innovations like circuit-breakers that have come out in the meantime.
Progress in robustness is just in time with new security threats from growing agent deployments and growing misuse risks from emerging model capabilities.
With SOTA defenses LLMs can be difficult even for experts to exploit. Yet developers often compromise on defenses to retain performance (e.g. low latency). This paper shows how these compromises can be used to break models -- and how to securely implement defenses.
So many great talks from the Singapore Alignment Workshop -- I look forward to catching up on those that I missed in person!
As I say in the video, innovation vs safety is a false dichotomy -- do check out great ideas from our speakers for how innovation can enable effective policy in the video and initial talk recordings!
AI control is one of the most exciting new research directions; excited to have the videos from ControlConf, the world's first control-specific conference. Tons of great material both intros & diving into specific areas!
AI security needs more than just testing, it needs guarantees.
Evan Miyazono calls for broader adoption of formal proofs, suggesting a new paradigm where AI produces code to meet human specifications.
Had a great time at the Singapore Alignment Workshop earlier this week -- fantastic start to the ICLR week! My only complaint is I missed many of the excellent talks because I was having so many interesting conversations. Looking forward to the videos to catch up!
ControlConf 2025 Day 2 delivered!
From robust evals to security R&D & moral patienthood, we covered the edge of AI control theory and practice. Thanks to Ryan Greenblatt, Rohin Shah, Alex Mallen, Stephen McAleer, Tomek Korbak, Steve Kelly & others for their insights.
My biggest complaint with the AI Security Forum was too much great content across the three tracks. Looking forward to catching up on the talks I missed with the videos!
Excited to meet others working on or interested in alignment at the Alignment Workshop Open Social before ICLR!
Excited to see people before ICLR at Alignment Workshop Singapore!
Humans sometimes cheat at exams -- might AIs do the same? It's a unique challenge for evaluating intelligent systems.
AI agents can start VMs, buy things, send e-mails, etc. AI control is a promising way to prevent harmful agent actions -- whether by accident, due to adversarial attack, or the systems themselves trying to subvert controls. Apply to the world's first control conference!
Since joining FAR.AI in June, Lindsay has delivered amazing events like the Alignment Workshop Bay Area and Paris Security Forum. Welcome to the team!
Evaluations are key to understanding AI capabilities and risks -- but which ones matter? Enjoyed Soroush's talk exploring these issues!
Excited to have Annie join our team, and help produce a 200-person event in her first month! We're growing across operations and technical roles -- check out opportunities!
I had great conversations at the AI Security Forum -- it's exciting to see people from cybersecurity, hardware root of trust, and AI come together to devise creative solutions that boost AI security.
Formal verification has a lot of exciting applications, especially in the age of LLMs: e.g. can LLMs output programs with proofs of correctness? However formally verifying neural network behavior in general seems intractable -- enjoyed Zac's talk on limitations.