Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations).
Posts by Adam Gleave
Excited to see the new International AI Safety Report come out! In a world of AI hype the Report cuts through to highlight where capabilities have advanced and lagged, and surveys risks in a nuanced evidence-based way. Recommended.
Look forward to presenting our STACK attack at #AAAI2026 that we've used to bypass safeguards in frontier models like GPT-5 and Opus 4
Training models that aren't necessarily robust but have *uncorrelated* failures with other models is an interesting research direction I'd love to see more people work on!
Indeed, we find a transfer attack works -- but at quite a hit to attack success rate. That was between two somewhat similar open-weight models, so you can probably make it harder with more model diversity.
The other challenge is that the components in a defense-in-depth pipeline are all ML models. So it's not like an attacker has to guess a random password; they have to guess which exploits will affect (unseen, but predictable) models.
The problem is that you then go from exponential to linear complexity: it's like being able to brute-force a combination lock digit by digit rather than having to guess the whole code.
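A toy sketch of the combination-lock point (illustrative only, not from the paper): if the attacker only learns pass/fail on the whole code, the worst case is 10^4 guesses for four digits; if they get per-digit feedback, it collapses to at most 10 * 4.

```python
import itertools

def guesses_without_feedback(code, digits=10):
    """Attacker only learns pass/fail on the whole code: worst case digits**len(code)."""
    count = 0
    for attempt in itertools.product(range(digits), repeat=len(code)):
        count += 1
        if list(attempt) == code:
            return count
    return count

def guesses_with_feedback(code, digits=10):
    """If each digit can be confirmed independently, solve digit by digit."""
    count = 0
    solved = []
    for true_digit in code:
        for guess in range(digits):
            count += 1
            if guess == true_digit:
                solved.append(guess)
                break
    assert solved == code
    return count

worst_case = [9, 9, 9, 9]  # last code in lexicographic order
print(guesses_without_feedback(worst_case))  # 10000, i.e. 10**4
print(guesses_with_feedback(worst_case))     # 40, i.e. 10 * 4
```

The analogy to a defense pipeline: each component whose verdict leaks separately plays the role of a digit the attacker can confirm independently.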
It's really easy to leak information from your defense-in-depth pipeline, like which component blocked an input, and indeed a lot of existing implementations don't even really try to protect against this.
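A hypothetical sketch of the leak (component names and messages are illustrative, not from any actual implementation): refusal messages that name the component that fired hand the attacker free feedback, while uniform refusals do not.

```python
def leaky_pipeline(prompt, input_filter, model, output_filter):
    """Each refusal names the component that fired -- free feedback for attackers."""
    if not input_filter(prompt):
        return "blocked: input filter"
    response = model(prompt)
    if not output_filter(response):
        return "blocked: output filter"
    return response

def hardened_pipeline(prompt, input_filter, model, output_filter):
    """Uniform refusals: the attacker can't tell which layer they tripped."""
    if not input_filter(prompt):
        return "request refused"
    response = model(prompt)
    if not output_filter(response):
        return "request refused"
    return response

# Toy components: block any text containing "exploit"; model just echoes.
input_ok = lambda p: "exploit" not in p
output_ok = lambda r: "exploit" not in r
echo_model = lambda p: f"echo: {p}"

print(leaky_pipeline("exploit please", input_ok, echo_model, output_ok))
# -> blocked: input filter  (attacker now knows which layer to target)
print(hardened_pipeline("exploit please", input_ok, echo_model, output_ok))
# -> request refused
```

Timing and other side channels can leak the same information even with uniform messages, which is part of why implementation details matter so much here.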
The bad news is as with most security-critical algorithms, implementation details matter a lot. This is just not something the AI community (or most startups) are used to, where engineering standards are *ahem*... mixed?
The good news is simply layering defenses can help quite a bit: we take some off-the-shelf open weight models, prompt them, and defend against attacks like PAP that reliably exploit our (and many other) models.
This work has been in the pipeline for a while -- we started it before constitutional classifiers came out, and it was a big inspiration for our Claude Opus 4 jailbreak (new results coming out soon -- we're giving Anthropic time to fix things first).
I started this research project quite skeptical that we'll be able to get robust LLMs. I'm now meaningfully more optimistic: defense-in-depth worked better than I expected, and there's been a bunch of other innovations like circuit-breakers that have come out in the meantime.
Progress in robustness is just in time with new security threats from growing agent deployments and growing misuse risks from emerging model capabilities.
With SOTA defenses LLMs can be difficult even for experts to exploit. Yet developers often compromise on defenses to retain performance (e.g. low latency). This paper shows how these compromises can be used to break models -- and how to securely implement defenses.
So many great talks from the Singapore Alignment Workshop -- I look forward to catching up on those that I missed in person!
As I say in the video, innovation vs safety is a false dichotomy -- do check out great ideas from our speakers for how innovation can enable effective policy in the video and initial talk recordings!
AI control is one of the most exciting new research directions; excited to have the videos from ControlConf, the world's first control-specific conference. Tons of great material both intros & diving into specific areas!
AI security needs more than just testing, it needs guarantees.
Evan Miyazono calls for broader adoption of formal proofs, suggesting a new paradigm where AI produces code to meet human specifications.
Had a great time at the Singapore Alignment Workshop earlier this week -- fantastic start to the ICLR week! My only complaint is I missed many of the excellent talks because I was having so many interesting conversations. Looking forward to the videos to catch up!
ControlConf 2025 Day 2 delivered!
From robust evals to security R&D & moral patienthood, we covered the edge of AI control theory and practice. Thanks to Ryan Greenblatt, Rohin Shah, Alex Mallen, Stephen McAleer, Tomek Korbak, Steve Kelly & others for their insights.
My biggest complaint with the AI Security Forum was too much great content across the three tracks. Looking forward to catching up on the talks I missed with the videos!
Excited to meet others working on or interested in alignment at the Alignment Workshop Open Social before ICLR!
Excited to see people before ICLR at Alignment Workshop Singapore!
Humans sometimes cheat at exams -- might AIs do the same? It's a unique challenge for evaluating intelligent systems.
AI agents can start VMs, buy things, send e-mails, etc. AI control is a promising way to prevent harmful agent actions -- whether by accident, due to adversarial attack, or the systems themselves trying to subvert controls. Apply to the world's first control conference!
Since joining FAR.AI in June, Lindsay has delivered amazing events like the Alignment Workshop Bay Area and Paris Security Forum. Welcome to the team!
Evaluations are key to understanding AI capabilities and risks -- but which ones matter? Enjoyed Soroush's talk exploring these issues!
Excited to have Annie join our team, and help produce a 200-person event in her first month! We're growing across operations and technical roles -- check out opportunities!
I had great conversations at the AI Security Forum -- it's exciting to see people from cybersecurity, hardware root of trust, and AI come together to devise creative solutions that boost AI security.
Formal verification has a lot of exciting applications, especially in the age of LLMs: e.g. can LLMs output programs with proofs of correctness? However formally verifying neural network behavior in general seems intractable -- enjoyed Zac's talk on limitations.