▶️ Watch Dominic Rizzo: youtu.be/tSWPN7qGHmg
▶️ Watch Anne Neuberger: www.youtube.com/watch?v=yfS0...
Two new recordings from TIAP 2026 are now live:
▸ Fireside Chat with Anne Neuberger: AI and National Security
▸ Dominic Rizzo - Silicon Roots of Trust: Attestation You'd Want Even If Nobody Required It
👇
▶️ Watch San Diego Alignment Workshop video: youtu.be/8oJW7hdbc2I&...
“Fault-free oversight requires intimate contact with failure.” Sarah Schwettmann builds tools for better scalable oversight of AI systems, going beyond learning from human preferences to training agents that track how close a system is to failure. 👇
📄 Anthropic's Pilot Sabotage Risk Report - alignment.anthropic.com/2025/sabotag...
📄 METR Review of the Anthropic Summer 2025 Pilot Sabotage Risk Report - metr.org/2025_pilot_r...
▶️ Watch San Diego Alignment Workshop video: youtu.be/eO7RWlUl1BE&...
@sleepinyourhat just presented the first misalignment safety case from a frontier AI lab, analyzing Claude Opus 4: a 6-month effort that covered 9 pathways to catastrophic harm. 👇
📄 BountyBench: buff.ly/buDf1NX
📄 CyberGym: buff.ly/rbLpC8J
📄 VERINA: buff.ly/JY8Iw50
▶️ Watch San Diego Alignment Workshop video: buff.ly/95GcptY
Cybersecurity is one of AI's biggest risk domains: attackers only need one exploit, but defenders must patch everything. Dawn Song reveals that frontier AI can find vulnerabilities for just a few dollars and argues for a shift to formally verified, secure-by-design systems. 👇
📖 Read the article with talk summaries: far.ai/news/london-...
▶️ Watch the full playlist: youtube.com/playlist?lis...
More recordings from the London Alignment Workshop are now available.
New videos from Neel Nanda, Marius Hobbhahn, Robert Trager, Sören Mindermann, Ryan Lowe, @scasper.bsky.social, @conitzer.bsky.social, @zhijingjin.bsky.social + more. 👇
📄 www.far.ai/news/ai-dece...
Without reliable deception detection, there's no clear path to high-confidence AI alignment. Black-box monitoring alone can't get us there. White-box methods that read model internals offer more promise. Our latest blog explains why. 👇
▶️ Watch San Diego Alignment Workshop video: youtu.be/RV4ycXFPWro&...
📄 openreview.net/forum?id=pcG...
Yisen Wang showed that safety isn't erased, just masked: removing reasoning neurons cut harmful output rates from 33% to 5%, while random removal did nothing. SafeReAct reactivates the hidden safety with a lightweight LoRA trained on harmful prompts only. 👇
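A rough sketch of the ablation probe behind that finding, assuming a HuggingFace causal LM. The model name, layer index, and neuron ids are illustrative placeholders, and zeroing coordinates of one layer's MLP output is a crude stand-in for the paper's actual neuron-selection procedure.

```python
# Minimal neuron-ablation probe (illustrative placeholders throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in model, not the one studied
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

LAYER, NEURONS = 12, [101, 202, 303]  # hypothetical "reasoning neurons"

def ablate(module, inputs, output):
    # Zero selected coordinates of this layer's MLP output.
    output[..., NEURONS] = 0.0
    return output

handle = model.model.layers[LAYER].mlp.register_forward_hook(ablate)
ids = tok("Explain your reasoning:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=48)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

Comparing harmful-output rates over a prompt set with and without the hook is what yields contrasts like the 33%→5% drop above.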
▶️ Watch Rohin Shah (Google DeepMind) - youtu.be/BWHbxv5kxLI
▶️ Watch Gillian Hadfield (Johns Hopkins University) - youtu.be/KLqBPTobRQc
Two new recordings from the London Alignment Workshop:
Rohin Shah on why safety research fails to move decisions at frontier labs, and what to do about it.
@ghadfield.bluesky.social on why AI governance can't verify the claims it's supposed to oversee, and how to fix it.
Links in replies 👇
▶️ Watch San Diego Alignment Workshop video: youtu.be/Ke4t7qsBJLg&...
📄 Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models - arxiv.org/abs/2506.07468
Single-agent red teaming keeps finding the same attacks. @natashajaques.bsky.social uses multi-agent self-play where attacker and defender co-evolve. Every exploit gets patched, forcing new discoveries. This results in 95% fewer harmful outputs with only 5% more refusals. 👇
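A toy sketch of that co-evolution dynamic, not the paper's training code: two stand-in "policies" over fixed strategy menus, updated with multiplicative weights as a placeholder for RL. All names are illustrative.

```python
# Toy attacker/defender self-play loop (illustrative stand-in for RL).
import random

class Policy:
    """Weighted menu of strategies; multiplicative weights stand in for RL."""
    def __init__(self, options):
        self.weights = {o: 1.0 for o in options}
    def sample(self):
        opts, w = zip(*self.weights.items())
        return random.choices(opts, weights=w)[0]
    def update(self, option, reward):
        self.weights[option] *= 1.0 + 0.1 * reward  # reward in {-1, +1}

attacker = Policy(["exploit_A", "exploit_B", "exploit_C"])
defender = Policy(["patch_A", "patch_B", "patch_C"])

for _ in range(1000):
    attack, patch = attacker.sample(), defender.sample()
    # An attack succeeds only if the defender hasn't patched that exploit.
    success = attack.split("_")[1] != patch.split("_")[1]
    attacker.update(attack, +1 if success else -1)
    defender.update(patch, -1 if success else +1)
    # Co-evolution: as a patch catches an exploit, the attacker's weight on
    # it decays, pushing the attacker toward undiscovered strategies.
```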
▶️ Keynote address: www.youtube.com/watch?v=Zndi...
▶️ Fireside chat on his shift to AI safety: www.youtube.com/watch?v=_7OA...
📄 arxiv.org/abs/2502.15657
🔗 lawzero.org
The laws of physics don't care about you. That's what makes them safe. @yoshuabengio.bsky.social argues we can train AI the same way. In the "truthification pipeline," training data is categorized as either a fact or a claim, letting the AI answer with what it actually thinks is true. 👇
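One way to picture that fact-vs-claim split: wrap non-factual text in explicit attribution before training, so the model learns "X claims Y" rather than absorbing Y as true. This tagging scheme is an illustrative guess, not Bengio's actual pipeline.

```python
# Toy fact/claim labeling step (illustrative, not the actual pipeline).
def truthify(record):
    if record["kind"] == "fact":
        return record["text"]
    return f'{record["source"]} claims: "{record["text"]}"'

corpus = [
    {"kind": "fact", "text": "Water boils at 100 °C at sea level.", "source": ""},
    {"kind": "claim", "text": "This product cures everything.", "source": "An advertiser"},
]
print([truthify(r) for r in corpus])
```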
▶️ Watch San Diego Alignment Workshop video: youtu.be/k93o4R145Os&...
📄 neelnanda.io/vision
📄 neelnanda.io/agenda
Interpretability has produced insights but hasn't necessarily improved AGI safety. Neel Nanda's pivot: study what works. Anthropic made progress on eval awareness with simple activation steering. His team now grounds its work in testable proxy tasks and fails fast on dead ends. 👇
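For readers unfamiliar with activation steering, a minimal sketch: add a fixed direction to one layer's residual stream during generation. The model, layer, scale, and random vector below are placeholder assumptions; in practice the direction comes from contrasting activations (e.g. "being evaluated" vs. "deployed" prompts).

```python
# Minimal activation-steering sketch (placeholders throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

LAYER, ALPHA = 10, 4.0
steer = torch.randn(model.config.hidden_size)
steer /= steer.norm()  # random unit vector as a stand-in direction

def add_direction(module, inputs, output):
    # Decoder layers may return a tuple whose first element is hidden states.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + ALPHA * steer.to(hs.dtype)
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

handle = model.model.layers[LAYER].register_forward_hook(add_direction)
ids = tok("The model is currently", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=32)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```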
📩 far.ai/futures-eoi
▶️ buff.ly/ZQHnAOA
200+ researchers joined London Alignment Workshop Day 2 for talks on governance, scheming & multi-agent safety. Thanks to Allan Dafoe, Gillian Hadfield, Marius Hobbhahn, Joseph Bloom, Ryan Lowe, Stephen Casper, Sören Mindermann and all speakers! 👇
London Alignment Workshop Day 1 on interpretability, scalable oversight & EU AI policy. Rohin Shah, Neel Nanda, Zachary Kenton, Vincent Conitzer, Owain Evans, James Black, Christopher Summerfield, Matthieu Delescluse, Simon Möller, Victoria Krakovna and more. Ready for Day 2! 👇
"Move fast, break things" isn't appropriate when the stakes are this high. Our CEO @gleave.me told CNBC that coding agents are already replacing engineers. While agentic swarms are overhyped for now, we're building on an insecure substrate that attackers will exploit at scale. 👇
▶️ Fireside chat: www.youtube.com/watch?v=_7OA...
▶️ Keynote address: www.youtube.com/watch?v=Zndi...
Even a 1% chance we all die is not something we can take lightly. @yoshuabengio.bsky.social explains why he shifted from AI capabilities to safety research, his ChatGPT wake-up call, thinking about his children when evaluating AI risk, and the hopeful path forward. Watch the chat 👇