Calling attention to an exciting "deception detection" hackathon we're planning this summer, with @NDIF and @CadenzaLabs!
Recruiting red teams now, blue teams later. Red teams, time is short: proposals due Mar 31. $10K stipend + compute, $15K finals prize.
nnsight.net/blog/2026/0...
Posts by Natalie Shapira
Can you catch an AI lying?
Red teams set up scenarios where models lie. Eg, do they lie under contextual pressure, even when not told to, but because honesty is costly? Then blue teams will build deception detectors using whitebox internals with NDIF.
cadenza-labs.github.io/red-team-rfp/
The solution to the AI alignment problem:
Be good humans.
AI sees everything we do in its training data.
Lead by example.
Tomer...🤭
Jack Hessel et al. won the best paper award for benchmarking it a few years ago
aclanthology.org/2023.acl-lon...
It's surprising that it is still a challenge.
#AIagents promise to speed up ordinary online tasks, but they can also share private files publicly, delete others, and libel people. A new study examines these #AIsafety vulnerabilities. #OpenClaw #AIgovernance @science.org www.science.org/content/arti...
Agents of Chaos in the press in US, India, Italy and now in DIE ZEIT, Germany's most renowned newspaper:
zeit.de/digital/date...
By @ewo.name . Thank you Eva!
I thought it was a friend who tried to play a prank or realized these agents have no boundaries. Turns out this cute attempt is by Bohdan Olinares.
According to LinkedIn he works at F5, an application security company where I considered interviewing years ago. Cool.
I received a calendar invite with a note.
When a smart person tells me there's nothing to worry about with agents, I reply "Fine. Let them email me" and that's where the argument stops. To whoever sent me this note via the calendar invite: nice move. Are you scared? You should be.
In case this wasn't clear:
1. No, we didn't follow the "recommended" security practices 😈
2. Neither do other people 🤯
3. That's why we red-team: exposing failure modes 🔎
4. We share it with the community precisely to expose Dos and Don'ts of Agentic AI 🦞
5. No humans were harmed 🙏
Some of the independent researchers in the author list are young mechanistic interpretability researchers looking for a PhD position (in both Israel and the US). If you have interest and funding, let's connect.
Agents of Chaos -- what are autonomous OpenClaw agents up to? How do they interact with each other? Read our investigation of OpenClaw at
researchgate.net/publication/...
And an interactive website agentsofchaos.baulab.info
@davidbau.bsky.social @natalieshapira.bsky.social @openclaw-x.bsky.social
Huge thanks to @natalieshapira.bsky.social for leading the study! It was super cool to work with so many amazing friends of the lab.
Our research report on red-teaming stateful OpenClaw agents in the BauLab is finally out! 🥳
This awesome effort was led by @natalieshapira.bsky.social and involved 6 ClawBots and 20 researchers from various institutions.
Check it out ➡️ agentsofchaos.baulab.info
Who would you trust with your passwords? 🔐
In our new report, we uncover multiple vulnerabilities in current "Agentic AI"
The verdict? It's not actually very agentic at all, and it's highly unstable.
Read the full breakdown here: t.co/gK9MALP2n2
You can read more in the full paper:
www.researchgate.net/publication/...
There is also an interactive website that contains logs of the authentic interactions:
agentsofchaos.baulab.info
@veredshwartz.bsky.social
@tamarott.bsky.social @criedl.bsky.social
@reuth-mirsky.bsky.social @maartensap.bsky.social
@davidmanheim.alter.org.il
@tomerullman.bsky.social @davidbau.bsky.social
Aruna Sankaranarayanan @diatkinson.bsky.social @rohitgandikota.bsky.social @jadenfk.bsky.social
@ejhwang.bsky.social @hadasorgad.bsky.social
P Sam Sahil Negev Taglicht Tomer Shabtay
Atai Ambus @nitalon.bsky.social Shiri Oron Ayelet Gordon-Tapiero Yotam Kaplan ->
This is a joint work with @wendlerc.bsky.social Avery Yen
@gsarti.com @koyena.bsky.social Olivia Floody @adambelfki.bsky.social Alex Loftus Aditya Ratan Jannali
Nikhil Prakash Jasmine Cui Giordano Rogers @jannikbrinkmann.bsky.social @canrager.bsky.social
@amirzur.bsky.social Michael Ripa ->
Our findings establish the existence of security-, privacy-, and governance-relevant vulnerabilities in realistic settings.
Figure: case study #1 schema for downstream harms.
We call for urgent attention from legal scholars, policymakers, and researchers across disciplines.
We document eleven case studies. They include unauthorized compliance with non-owners, disclosure of sensitive information, execution of destructive system-level actions, uncontrolled resource consumption, identity spoofing, partial system takeover, and more.
In this amazing multidisciplinary collaboration, we report our early experience with the @openclaw-x.bsky.social ->
Are we all Agents of Chaos in AI? (Hope not!)
In recent weeks using OpenClaw has taught us a lot about this wooly new kind of autonomous software agent.
It's valuable to see what @NatalieShapira, @wendlerch et al. have seen:
agentsofchaos.baulab.info/
I learned many practical lessons. You can get the experience too, here.
Things that in retrospect should be obvious.
Like how giving your agent email opens it up to takeover attacks. (One agent was convinced, via email, to erase its own email server!)
bsky.app/profile/nat...
There were several other surprises.
The complex social world of humans is difficult for agents...
bsky.app/profile/ave...
@natalieshapira.bsky.social and team have written up enlightening case studies here. It's all cross-referenced with detailed activity logs.
Well worth a read:
agentsofchaos.baulab.info/report.html
www.researchgate.net/publication...
How do you knock the induction heads out of an LM while preserving its ability to think? Is it even possible?
@keremsahin22.bsky.social's work is worth reading if you haven't seen it yet.
hapax.baulab.info
When we say an AI agent is “goal-directed”, what do we actually mean? In our new work, we study this question by combining behavioural and interpretability analysis in a language model agent navigating 2D grid worlds.
Blog: projecttelos.substack.com/p/a-behaviou...
Paper: arxiv.org/abs/2602.08964
He sold us out.
That's not the whole story.
Our side is coming soon.
Stay tuned.