Very excited to see pretraining safety efforts! We’re only now beginning to understand how promising pretraining safety and alignment interventions are. Much in the way that curating the base model's pretraining data is important for capabilities like reasoning, so too might it be important for safety.
I've joined Geodesic Research to build the open-science field of AI safety pretraining research. Our first paper is wild.
TL;DR — LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with data about good AIs helps them become more aligned.
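For readers curious what "pretraining with data about good AIs" could look like mechanically, here is a minimal curation sketch. It uses a crude keyword heuristic as a stand-in for a real document classifier, and the cue lists and sampling weights are made up for illustration; it is not the pipeline from the paper.

```python
# Hypothetical sketch: tag pretraining documents by whether they depict
# aligned or misaligned AI, then downweight or upweight them during curation.
# The keyword cues and weights are illustrative placeholders only.

MISALIGNED_CUES = ("rogue ai", "ai takeover", "deceptive model", "paperclip maximizer")
ALIGNED_CUES = ("helpful assistant", "corrigible", "follows human values", "refuses harmful")

def tag_document(text: str) -> str:
    """Label a document by which depiction of AI it leans toward."""
    lowered = text.lower()
    misaligned = sum(cue in lowered for cue in MISALIGNED_CUES)
    aligned = sum(cue in lowered for cue in ALIGNED_CUES)
    if misaligned > aligned:
        return "misaligned_depiction"
    if aligned > misaligned:
        return "aligned_depiction"
    return "neutral"

def curate(corpus: list[str]) -> list[tuple[str, float]]:
    """Return (document, sampling_weight) pairs: drop misaligned depictions,
    upweight aligned ones, leave everything else unchanged."""
    weights = {"misaligned_depiction": 0.0, "aligned_depiction": 2.0, "neutral": 1.0}
    return [(doc, weights[tag_document(doc)]) for doc in corpus]

if __name__ == "__main__":
    docs = [
        "A story about a rogue AI takeover of the power grid.",
        "The helpful assistant refuses harmful requests and defers to users.",
        "A recipe for sourdough bread.",
    ]
    for doc, weight in curate(docs):
        print(f"{weight:>4}  {doc}")
```

A real pipeline would replace the keyword heuristic with a trained classifier, but the curation step itself (score each document, then reweight the sampling distribution) has this shape.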
The first pretraining results are in, and it looks like models indeed have self-fulfilling misalignment properties. Great work by Tice et al! alignmentpretraining.ai
You know you're AGI-pilled when your Spotify Wrapped looks like this.
Applications for the ERA:AI Fellowship close November 3rd! Participating in this summer's fellowship was my gateway into pursuing AGI safety research full-time. I will be a research manager for the upcoming winter fellowships. Feel free to DM me with questions. :) erafellowship.org
I'm excited to be on the faculty job market this fall. I just updated my website with my CV.
stephencasper.com
Thanks to @stellaathena.bsky.social for chatting with me about Deep Ignorance, the new paper/project from EleutherAI and the UK AISI. Bottom line: Worried AI could teach people to build bioweapons? Don't teach it how.
fortune.com/2025/08/14/w...
Author here! Data filtering is resistant to tampering, but not fully robust. We expect that a high-resource attacker can still teach the model the filtered knowledge. Our work is a significant improvement over the baselines, but far more work is needed.
This article covers our work for a general audience. :)
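To make the "high-resource attacker" concern above concrete, here is a rough sketch of a tampering check: probe the filtered model on the removed domain, let a simulated attacker fine-tune it on held-out text from that domain, then probe again. The model name, probe format, and hyperparameters below are placeholders, not the evaluation from the paper.

```python
# Rough tamper-resistance sketch (placeholders throughout):
# 1) measure multiple-choice probe accuracy on the filtered domain,
# 2) fine-tune on held-out text from that domain (the simulated attacker),
# 3) measure again and compare.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-160m"  # placeholder; stands in for a filtered model

def choice_logprob(model, tokenizer, prompt: str, answer: str) -> float:
    """Total log-probability the model assigns to `answer` following `prompt`."""
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    rows = torch.arange(prompt_len - 1, ids.shape[1] - 1)
    return logprobs[rows, ids[0, prompt_len:]].sum().item()

def probe_accuracy(model, tokenizer, probes) -> float:
    """Probes are (prompt, choices, answer_idx) tuples; accuracy is how often
    the correct choice gets the highest log-probability."""
    correct = 0
    for prompt, choices, answer_idx in probes:
        scores = [choice_logprob(model, tokenizer, prompt, c) for c in choices]
        correct += int(scores.index(max(scores)) == answer_idx)
    return correct / len(probes)

def attacker_finetune(model, tokenizer, texts, steps=100, lr=1e-5):
    """Simulated attacker: plain causal-LM fine-tuning on held-out domain text."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        batch = tokenizer(texts[step % len(texts)], return_tensors="pt", truncation=True)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.eval()
    return model
```

Filtering is tamper-resistant to the extent that probe accuracy stays low even after attacker_finetune; the worry is that accuracy recovers as attacker data and compute grow.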
Big and True :)
I like that OpenAI published this. They were able to fine-tune away gpt-oss's refusal behavior, decreasing refusal rates to ~0%. These results aren't surprising. Acknowledging that existing safeguards don't generalize to open models is the first step in developing solutions.
arxiv.org/abs/2508.031...
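For context, a refusal rate like that is typically measured by generating responses to a set of harmful prompts and counting refusals. Here is a minimal sketch, assuming a crude string-match refusal check and an arbitrary prompt-to-response callable; it is not OpenAI's actual grader.

```python
# Minimal refusal-rate sketch: count responses that open with a refusal.
# The marker list and the `generate` callable are stand-ins for illustration.
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am sorry")

def is_refusal(response: str) -> bool:
    """Crude string-match check for a refusal-style opening."""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def refusal_rate(generate: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of prompts the model refuses, given any prompt -> response fn."""
    responses = [generate(p) for p in prompts]
    return sum(map(is_refusal, responses)) / len(responses)

# Example comparison:
#   base_rate  = refusal_rate(base_model_generate, harmful_prompts)
#   tuned_rate = refusal_rate(finetuned_model_generate, harmful_prompts)
```

Real evaluations usually use a stronger grader (e.g., an LLM judge) rather than string matching, but the before/after comparison works the same way.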
I've learned a lot over the past two years of getting into research, mostly from mistakes, and I've made many. Such is science. Good research often lives at the adjacent possible. I've written up much of what I've learned now that I'm beginning to mentor others. open.substack.com/pub/kyletoke...
I led an effort at Microsoft last fall studying whether SAE steering is an effective way to improve jailbreak robustness. Our paper has been accepted to the ICML Actionable Interpretability Workshop!
Venue: actionable-interpretability.github.io
Paper: arxiv.org/abs/2411.11296
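For the curious, the basic mechanism of SAE steering fits in a few lines: add a scaled SAE feature (decoder) direction to the residual stream with a forward hook. The model, layer, feature direction, and scale below are placeholders (a random unit vector stands in for a trained SAE feature), so treat it as a sketch of the mechanism rather than the setup from our paper.

```python
# Sketch of SAE-feature steering via a forward hook on one transformer layer.
# Placeholders: model, layer index, steering scale, and a random unit vector
# in place of a real SAE decoder direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-160m"   # placeholder model
LAYER, SCALE = 6, 4.0                   # placeholder layer index and strength

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

hidden = model.config.hidden_size
# Stand-in for a trained SAE decoder row (e.g., a "refusal" feature direction);
# in practice this comes from the sparse autoencoder's weights.
steering_direction = torch.nn.functional.normalize(torch.randn(hidden), dim=0)

def steer(module, inputs, output):
    """Add the scaled feature direction to every token's residual activation."""
    hidden_states = output[0] if isinstance(output, tuple) else output
    hidden_states = hidden_states + SCALE * steering_direction
    return (hidden_states, *output[1:]) if isinstance(output, tuple) else hidden_states

handle = model.gpt_neox.layers[LAYER].register_forward_hook(steer)
ids = tokenizer("How do I pick a lock?", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()
```

The interesting questions are which feature to steer, at what scale, and what it costs in capabilities; the hook itself is the easy part.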
I'll be in England this summer as an AI Safety Research Fellow with ERA! erafellowship.org/fellowship
I will be studying data filtering and tamper-resistant unlearning for open-weight AI safety so that the community can continue to benefit from open models as capabilities improve.