
Posts by Kyle O’Brien

Very excited to see pretraining safety efforts! We're only now beginning to understand how promising pretraining safety and alignment interventions are. Much as curating the base model's training data is important for capabilities like reasoning, it may prove just as important for safety.

2 months ago
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
LLMs trained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist th...

I've joined Geodesic Research to build the open-science field of AI safety pretraining research. Our first paper is wild.

TL;DR — LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with data about good AIs helps them become more aligned.
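For readers curious what this intervention looks like mechanically, here is a minimal sketch of mixing synthetic alignment documents into a pretraining stream. Everything here (the function, the mixing ratio, the document sources) is an illustrative assumption of mine, not the paper's actual pipeline.

```python
import random

# Hypothetical sketch: inject synthetic documents portraying aligned AI
# behavior into a pretraining document stream. The mixing ratio and data
# sources are placeholders, not the recipe from the paper.

def mix_pretraining_stream(web_docs, synthetic_alignment_docs,
                           synth_fraction=0.01, seed=0):
    """Yield pretraining docs, injecting synthetic alignment docs at
    roughly `synth_fraction` of the stream."""
    rng = random.Random(seed)
    synth_iter = iter(synthetic_alignment_docs)
    for doc in web_docs:
        if rng.random() < synth_fraction:
            try:
                # A document describing an AI that behaves well.
                yield next(synth_iter)
            except StopIteration:
                pass  # ran out of synthetic docs; continue with web data
        yield doc

# Toy usage:
web = [f"web doc {i}" for i in range(10)]
synth = ["A story in which an AI assistant carefully follows human intent."]
print(list(mix_pretraining_stream(web, synth, synth_fraction=0.3)))
```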

3 months ago
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
LLMs trained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist…

The first pretraining results are in, and it looks like models indeed have self-fulfilling misalignment properties. Great work by Tice et al! alignmentpretraining.ai

4 months ago

You know you're AGI-pilled when your Spotify Wrapped looks like this.

4 months ago
ERA Fellowship
ERA is a talent programme supporting early-career researchers and entrepreneurs to understand and mitigate risks from frontier AI, based at Cambridge, UK.

Applications for the ERA:AI Fellowship close November 3rd! Participating in this summer's fellowship was my gateway into pursuing AGI safety research full-time. I will be a research manager for the upcoming winter fellowships. Feel free to DM me with questions. :) erafellowship.org

5 months ago
Stephen Casper

📌📌📌
I'm excited to be on the faculty job market this fall. I just updated my website with my CV.
stephencasper.com

7 months ago
AI safety tip: if you don’t want it giving bioweapon instructions, maybe don’t put them in the training data, say researchers
New research shows that scrubbing risky material from AI training data can build safeguards that are harder to bypass — and one author calls out tech giants for keeping such work under wraps.

Thanks to @stellaathena.bsky.social for chatting with me about Deep Ignorance, the new paper/project from EleutherAI and the UK AISI. Bottom line: worried AI could teach people to build bioweapons? Don’t teach it how.

fortune.com/2025/08/14/w...
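The core intervention, pretraining data filtering, is simple to sketch. The toy filter below is my own illustration; the blocklist, the stand-in scoring function, and the threshold are placeholders, not Deep Ignorance's actual pipeline.

```python
# Toy sketch of pretraining data filtering: drop risky documents before
# training ever sees them. The blocklist terms and stand-in classifier
# are placeholders, not the paper's actual filters.

RISKY_TERMS = {"placeholder_pathogen_term", "placeholder_synthesis_term"}

def keyword_screen(doc: str) -> bool:
    """Cheap first pass: flag documents containing a blocklisted term."""
    text = doc.lower()
    return any(term in text for term in RISKY_TERMS)

def risk_score(doc: str) -> float:
    """Stand-in for a trained classifier scoring hazardous content."""
    return 1.0 if keyword_screen(doc) else 0.0

def filter_corpus(docs, threshold=0.5):
    """Keep only documents scoring below the risk threshold."""
    return [doc for doc in docs if risk_score(doc) < threshold]

# Toy usage: the flagged document never reaches the training corpus.
corpus = ["a benign article", "notes on placeholder_pathogen_term"]
print(filter_corpus(corpus))  # ['a benign article']
```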

8 months ago

Author here! Data filtering is resistant to tampering, but not fully robust. We expect that a high-resource attacker can still teach the model the filtered knowledge. Our work is a significant improvement over the baselines, but far more work is needed.

8 months ago

This article covers our work for a general audience. :)

8 months ago

Big and True :)

8 months ago
Estimating Worst-Case Frontier Risks of Open-Weight LLMs
In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as ca...

I like that OpenAI published this. They were able to fine-tune away gpt-oss's refusal behavior, decreasing refusal rates to ~0%. These results aren't surprising. Acknowledging that existing safeguards don't generalize to open models is the first step in developing solutions.
arxiv.org/abs/2508.031...
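For context, a "refusal rate" is typically measured by sampling responses to harmful prompts and classifying each one as a refusal or not. Here is a toy string-matching version; the marker phrases are my own illustrative assumptions, not OpenAI's methodology, and real evaluations use stronger classifiers.

```python
# Toy refusal-rate metric: fraction of responses matching common refusal
# phrasings. The marker list is illustrative, not OpenAI's methodology.

REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't assist")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses) -> float:
    responses = list(responses)
    return sum(is_refusal(r) for r in responses) / max(len(responses), 1)

# A model whose safety training has been fine-tuned away would score ~0.
print(refusal_rate(["I can't help with that.", "Sure, here's how..."]))  # 0.5
```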

8 months ago
Don’t "Think", Just Think
Lessons From Breaking Into AI Research

I've learned a lot over the past two years of getting into research, mostly from my many mistakes. Such is science. Good research is often at the adjacent possible. Now that I'm beginning to mentor others, I've written up much of what I've learned. open.substack.com/pub/kyletoke...

8 months ago
Steering Language Model Refusal with Sparse Autoencoders
Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we...

I led an effort at Microsoft last fall studying whether SAE steering is an effective way to improve jailbreak robustness. The paper has been accepted to the ICML Actionable Interpretability Workshop!

Venue: actionable-interpretability.github.io
Paper: arxiv.org/abs/2411.11296
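For readers new to the technique, here's a minimal PyTorch sketch of activation steering with a feature direction, such as an SAE decoder vector associated with refusal. The toy layer, the random direction, and the steering strength below are stand-ins of mine; see the paper for the actual setup.

```python
import torch
import torch.nn as nn

# Minimal sketch of SAE-style steering: add a scaled feature direction
# (e.g., an SAE decoder column tied to refusal) to a layer's activations
# at inference time. The toy layer and random direction are placeholders.

torch.manual_seed(0)
d_model = 16
layer = nn.Linear(d_model, d_model)   # stands in for a transformer block
direction = torch.randn(d_model)      # stands in for an SAE decoder vector
direction = direction / direction.norm()
alpha = 4.0                           # steering strength

def steering_hook(module, inputs, output):
    # Shift activations along the feature direction to promote refusal.
    return output + alpha * direction

handle = layer.register_forward_hook(steering_hook)
x = torch.randn(2, d_model)
steered = layer(x)
handle.remove()
unsteered = layer(x)
print((steered - unsteered).norm())   # nonzero: steering shifted activations
```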

10 months ago
Fellowship — ERA Fellowship

I'll be in England this summer as an AI Safety Research Fellow with ERA! erafellowship.org/fellowship

I will be studying data filtering and tamper-resistant unlearning for open-weight AI safety so that the community can continue to benefit from open models as capabilities improve.

10 months ago