
Posts by FAR.AI

▶️ Watch Dominic Rizzo: youtu.be/tSWPN7qGHmg
▶️ Watch Anne Neuberger: www.youtube.com/watch?v=yfS0...

1 day ago

Two new recordings from TIAP 2026 are now live:
▸ Fireside Chat with Anne Neuberger: AI and National Security
▸ Dominic Rizzo - Silicon Roots of Trust: Attestation You'd Want Even If Nobody Required It
👇

1 day ago

Sarah Schwettmann - Scalable Oversight and Understanding [Alignment Workshop]
Sarah Schwettmann (Transluce) presents AI oversight that moves beyond preference alignment to continuous safety measurement. Her approach trains specialized oversight agents on vast amounts of…

▶️ Watch San Diego Alignment Workshop video: youtu.be/8oJW7hdbc2I&...

3 days ago

“Fault-free oversight requires intimate contact with failure.” Sarah Schwettmann builds tools for more scalable oversight of AI systems by going beyond learning from human preferences to training agents that track how close a system is to failure. 👇

3 days ago

Anthropic's Pilot Sabotage Risk Report: As practice for potential future Responsible Scaling Policy obligations, we're releasing a report on misalignment risk posed by our deployed models as of Summer 2025. We conclude that there is very…

📄 Anthropic's Pilot Sabotage Risk Report - alignment.anthropic.com/2025/sabotag...
📄 METR Review of the Anthropic Summer 2025 Pilot Sabotage Risk Report - metr.org/2025_pilot_r...
▶️ Watch San Diego Alignment Workshop video: youtu.be/eO7RWlUl1BE&...

1 week ago

@sleepinyourhat just presented the first misalignment safety case from a frontier AI lab, analyzing Claude Opus 4: a 6-month effort that covered 9 pathways to catastrophic harm. 👇

1 week ago

Dawn Song - Frontier AI in Cybersecurity: Risks, Challenges & Future Directions [Alignment Workshop]
Dawn Song (UC Berkeley) identifies cybersecurity as one of the biggest AI risk domains, investigating how frontier AI changes the security landscape. Her BountyBench and CyberGym benchmarks…

📄 BountyBench: buff.ly/buDf1NX
📄 CyberGym: buff.ly/rbLpC8J
📄 VERINA: buff.ly/JY8Iw50
▶️ Watch San Diego Alignment Workshop video: buff.ly/95GcptY

2 weeks ago

Cybersecurity is one of AI's biggest risk domains: attackers need only one exploit, while defenders must patch everything. Dawn Song shows that frontier AI can find vulnerabilities for just a few dollars and argues for a shift to formally verified, secure-by-design systems. 👇

2 weeks ago

📖 Read the article with talk summaries: far.ai/news/london-...
▶️ Watch the full playlist: youtube.com/playlist?lis...

3 weeks ago

More recordings from the London Alignment Workshop are now available.
New videos from Neel Nanda, Marius Hobbhahn, Robert Trager, Sören Mindermann, Ryan Lowe, @scasper.bsky.social, @conitzer.bsky.social, @zhijingjin.bsky.social + more. 👇

3 weeks ago

📄 www.far.ai/news/ai-dece...

3 weeks ago

Without reliable deception detection, there's no clear path to high-confidence AI alignment. Black-box monitoring alone can't get us there. White-box methods that read model internals offer more promise. Our latest blog explains why. 👇
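
To make the white-box idea concrete, here's a minimal sketch of one common approach: a linear probe trained on a model's hidden activations to flag deceptive outputs. Everything below (dimensions, activations, labels) is a synthetic stand-in for illustration, not the method from the blog post:

```python
# Minimal sketch of a white-box deception probe, assuming you have
# already collected hidden-state activations from honest and deceptive
# model outputs. All data below is synthetic and purely illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_examples = 512, 1000

acts = torch.randn(n_examples, d_model)              # stand-in activations
labels = torch.randint(0, 2, (n_examples,)).float()  # 1 = deceptive

probe = nn.Linear(d_model, 1)                        # linear probe on internals
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(acts).squeeze(-1), labels)
    loss.backward()
    opt.step()

# At monitoring time, flag generations whose activations score high.
with torch.no_grad():
    scores = torch.sigmoid(probe(acts)).squeeze(-1)
print("flagged fraction:", (scores > 0.5).float().mean().item())
```

The appeal over black-box monitoring is that the probe reads the model's internal state directly, rather than only its final text output.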

3 weeks ago

Yisen Wang - Finding & Reactivating Safety Mechanisms of Post-Trained LLMs [Alignment Workshop]
Yisen Wang (Peking University) demonstrates that post-training doesn't erase safety mechanisms in large language models but masks them with stronger task-specific functions. Through targeted neuron…

▶️ Watch San Diego Alignment Workshop video: youtu.be/RV4ycXFPWro&...
📄 openreview.net/forum?id=pcG...

3 weeks ago

Yisen Wang showed that safety isn't erased, just masked. Removing targeted reasoning neurons cut harmful-output rates from 33% to 5%, while random removal did nothing. SafeReAct reactivates the hidden safety behavior using a lightweight LoRA trained on harmful prompts only. 👇
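
For intuition, here is a toy sketch of the shape of that ablation experiment: silence a targeted set of neurons in one layer via a forward hook, then compare against a size-matched random ablation. The layer, neuron indices, and sizes are hypothetical stand-ins, not Yisen Wang's actual setup:

```python
# Toy sketch of targeted vs. random neuron ablation with forward hooks.
# The layer and the "reasoning neuron" indices are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(256, 1024)            # one MLP projection in a toy model
x = torch.randn(8, 256)                 # a batch of hidden states

targeted = torch.arange(50)             # hypothesized safety-masking neurons
random_ = torch.randperm(1024)[:50]     # size-matched random control

def make_ablation_hook(indices):
    def hook(module, inputs, output):
        output = output.clone()
        output[:, indices] = 0.0        # silence the selected neurons
        return output
    return hook

for name, idx in [("targeted", targeted), ("random", random_)]:
    handle = layer.register_forward_hook(make_ablation_hook(idx))
    out = layer(x)                      # forward pass with ablation applied
    handle.remove()
    print(name, "mean |activation|:", out.abs().mean().item())
```

In the real experiment the comparison is behavioral (harmful-output rate with targeted vs. random neurons removed), which is what makes the 33% to 5% drop informative.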

3 weeks ago

▶️ Watch Rohin Shah (Google DeepMind) - youtu.be/BWHbxv5kxLI
▶️ Watch Gillian Hadfield (Johns Hopkins University) - youtu.be/KLqBPTobRQc

4 weeks ago

Two new recordings from the London Alignment Workshop:
Rohin Shah on why safety research fails to move decisions at frontier labs, and what to do about it.
@ghadfield.bluesky.social on why AI governance can't verify the claims it's supposed to oversee, and how to fix it.

Links in replies 👇

4 weeks ago

Natasha Jaques - Multi-agent RL for Provably Robust LLM Safety [Alignment Workshop]
Natasha Jaques tackles a critical problem: 1 billion weekly LLM users face zero guarantees against harmful outputs like bomb-making instructions. Her multi-agent reinforcement learning solution uses…

▶️ Watch San Diego Alignment Workshop video: youtu.be/Ke4t7qsBJLg&...
📄 Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models - arxiv.org/abs/2506.07468

4 weeks ago

Single-agent red teaming keeps finding the same attacks. @natashajaques.bsky.social uses multi-agent self-play where attacker and defender co-evolve. Every exploit gets patched, forcing new discoveries. This results in 95% fewer harmful outputs with only 5% more refusals. 👇
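
The co-evolution dynamic is easy to see in a stripped-down loop. In this toy sketch the "agents" are placeholder search and patch steps rather than RL-trained LLM policies; the point is only that patching each discovered exploit forces the attacker onto new ones:

```python
# Toy sketch of attacker/defender self-play dynamics. In the real
# method both sides are LLM policies updated with reinforcement
# learning; here they are placeholder functions for illustration.
import random

random.seed(0)
attack_pool = [f"exploit_{i}" for i in range(10)]
patched: set[str] = set()

for rnd in range(5):
    # Attacker: search for an exploit the defender hasn't patched yet.
    unpatched = [a for a in attack_pool if a not in patched]
    if not unpatched:
        print("defender is robust to the whole attack pool")
        break
    found = random.choice(unpatched)
    # Defender: train against the new exploit and patch it, forcing the
    # attacker to discover something genuinely new next round.
    patched.add(found)
    print(f"round {rnd}: attacker found {found}; defender patched it")
```

Single-agent red teaming corresponds to never updating the defender, which is why it keeps rediscovering the same attacks.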

4 weeks ago

▶️ Keynote address: www.youtube.com/watch?v=Zndi...
▶️ Fireside chat on his shift to AI safety: www.youtube.com/watch?v=_7OA...
📄 arxiv.org/abs/2502.15657
🔗 lawzero.org

1 month ago

The laws of physics don't care about you. That's what makes them safe. @yoshuabengio.bsky.social argues we can train AI the same way. In the "truthification pipeline," training data is categorized as either factual or a claim, which lets the AI answer with what it actually thinks is true. 👇
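
As a rough sketch of that categorization step (the tags and the keyword heuristic below are invented for illustration; the actual pipeline is far more sophisticated):

```python
# Hedged sketch of the data-labeling idea behind a "truthification
# pipeline": tag each training sentence as a fact or as a claim
# attributed to its source, so the model learns to separate the two.
# The keyword heuristic and tag formats below are purely illustrative.
def truthify(sentence: str, source: str) -> str:
    factual_markers = ("boils at", "orbits", "is equal to")
    if any(marker in sentence for marker in factual_markers):
        return f"[FACT] {sentence}"
    return f"[CLAIM by {source}] {sentence}"

print(truthify("Water boils at 100 C at sea level.", "textbook"))
print(truthify("Our product is the best on the market.", "ad copy"))
```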

1 month ago

Neel Nanda - Our Pivot To Pragmatic Interpretability [Alignment Workshop]
Neel Nanda (Google DeepMind) discusses his mechanistic interpretability team's pivot from ambitious reverse-engineering goals to more pragmatic work that can be empirically validated on today's…

▶️ Watch San Diego Alignment Workshop video: youtu.be/k93o4R145Os&...
📄 neelnanda.io/vision
📄 neelnanda.io/agenda

1 month ago

Interpretability has produced insights but hasn't necessarily advanced AGI safety. Neel Nanda's pivot: study what works. Anthropic made progress on eval awareness with simple activation steering. His team now grounds its work in testable proxy tasks and fails fast on dead ends. 👇
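
Activation steering itself is mechanically simple. Here's a minimal sketch: add a fixed direction to one layer's output during the forward pass. The module, vector, and strength are toy stand-ins, not Anthropic's setup:

```python
# Minimal sketch of activation steering: shift one layer's output along
# a fixed direction (e.g. an "eval-awareness" vector) via a forward
# hook. The model, vector, and strength are illustrative stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 128
block = nn.Linear(d_model, d_model)   # stands in for a transformer block
steer = torch.randn(d_model)          # direction found by some prior analysis
alpha = 4.0                           # steering strength

def steering_hook(module, inputs, output):
    return output + alpha * steer     # shift the residual stream

x = torch.randn(2, d_model)
handle = block.register_forward_hook(steering_hook)
steered = block(x)
handle.remove()

delta = steered - block(x)            # unsteered forward pass for comparison
print("induced shift norm:", delta.norm().item())
```

The proxy-task framing means an intervention like this gets judged by whether it measurably changes a behavior of interest, not by how satisfying the mechanistic story is.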

1 month ago

📩 far.ai/futures-eoi
▶️ buff.ly/ZQHnAOA

1 month ago

200+ researchers joined London Alignment Workshop Day 2 for talks on governance, scheming & multi-agent safety. Thanks to Allan Dafoe, Gillian Hadfield, Marius Hobbhahn, Joseph Bloom, Ryan Lowe, Stephen Casper, Sören Mindermann and all speakers! 👇

1 month ago

Alignment Workshop: The Alignment Workshop is a series of events convening top ML researchers from industry and academia, along with experts in the government and nonprofit sect...

▶️ youtube.com/playlist?list=PLpvkFqYJXcrdxYK-C4ZRj0cgcco3o0VxF
📩 far.ai/futures-eoi

1 month ago

London Alignment Workshop Day 1 covered interpretability, scalable oversight & EU AI policy, with talks from Rohin Shah, Neel Nanda, Zachary Kenton, Vincent Conitzer, Owain Evans, James Black, Christopher Summerfield, Matthieu Delescluse, Simon Möller, Victoria Krakovna and more. Ready for Day 2! 👇

1 month ago

What worries the CEO of safety group FAR.AI most about the race to dominate the AI market: Adam Gleave, co-founder & CEO of FAR.AI, and former Google DeepMind employee, discusses potential threats AI could pose to humanity, including job losses, and how companies can develop and ensure…

▶️ Watch the interview at www.cnbc.com/video/2026/0...

1 month ago

"Move fast, break things" isn't appropriate when the stakes are this high. Our CEO @gleave.me told CNBC that coding agents are already replacing engineers. While agentic swarms are overhyped for now, we're building on an insecure substrate that attackers will exploit at scale. 👇

1 month ago

Yoshua Bengio - Fireside Chat with Yoshua Bengio [Alignment Workshop]
Turing Award winner Yoshua Bengio (MILA, Law Zero) explains why he shifted from pioneering deep learning to warning about its dangers. In this fireside chat with Adam Gleave (FAR.AI), Bengio…

▶️ Fireside chat: www.youtube.com/watch?v=_7OA...
▶️ Keynote address: www.youtube.com/watch?v=Zndi...

1 month ago

Even a 1% chance we all die is not something we can take lightly. @yoshuabengio.bsky.social explains why he shifted from AI capabilities to safety research, his ChatGPT wake-up call, thinking about his children when evaluating AI risk, and the hopeful path forward. Watch the chat 👇

1 month ago