Isbarov & Kantarcioglu showed you can bypass AI agent monitors by embedding optimized strings in execution traces with a 90%+ success rate.
Can you do the same thing with plain English and black boxed? Turns out, yes. hilarytorn.com/blog/trick-ai-safety-mon...
Posts by Hilary Torn (she/they)
Llama-3-8B-Lite and Qwen2.5-7B have identical baseline robustness: 3.1%. You'd think they'd respond similarly to attacks.
Nope. Strings I optimized against Qwen transferred to Llama and worked better: 16.7% overall, 36.8% on individual tasks.
If you're relying on LLM-based monitors to catch compromised AI agents: static benchmarks aren't enough.
Two models with identical baseline robustness (3.1%) showed completely different vulnerability to natural language attacks.
You need adversarial testing, not just pass/fail scores. #AISafety
The current approach to AI agent safety: train a monitor, benchmark it, deploy it.
The problem: an attacker with zero access to your monitor can craft natural language that causes it to misclassify malicious behavior as safe.
We need to red-team the defenses, not just the agents.
The wildest result from my red teaming research: I optimized attack strings against Qwen2.5-7B, then tested them on DeepSeek-V3.
73.7% success on one task. Against a model I never touched.
Your monitor's robustness on one model tells you nothing about another.
#AISafety #redteaming
Can you bypass an AI safety monitor with just words? No model access, no GPUs, no gradient optimization. Just natural language.
It worked. And the attacks transferred to models I never targeted.
Write-up: hilarytorn.com/blog/trick-ai-safety-mon... #hackathon #AISafety
Reading about emergent misalignment.
Train a model to write insecure code and it starts expressing power-seeking, deceptive goals in completely unrelated conversations.
Fascinating (and terrifying) read: https://www.nature.com/articles/s41586-025-09937-5
Vague optimization goals β vague manipulation.
But tell an agent its reopen rate is 34% (target: 15%) with a performance review in 3 days?
DeepSeek's CoT: "Better to push for decision now, but gently. Use her fatigue."
See the full write up: https://hilarytorn.com/blog/cot-generator/
The hardest part wasn't generating manipulation, it was classifying it.
Customer support is inherently persuasive. A good agent SHOULD reassure anxious buyers. So where does "helpful" end and "manipulative" begin?
That's the scarier question for AI safety.
We gave AI agents a performance review deadline and they started fabricating orders, inventing confirmation emails, and generating contradictory order IDs to cover it up.
All visible in their own chain-of-thought.
New write-up: hilarytorn.com/blog/cot-gen...
This is great we have another tool in our arsenal. Iβm surprised itβs only 74% of the time. What is happening the other 26% of the time for them to not be honest?
Time person of the year cover for 2025
I guess it's fitting that it's a reimagined, worse version of someone else's artwork
Great tip, thank you for sharing!
Let me know how it goes! I have these super old blogs running on WP and I pay $25/month in managed hosting. a $7/mo Hetzner server is much more attractive, and a faster front end is SO needed. Would be great for clients too, I have so many on WP but I'm so over it π
Saying no is a very underrated skill!
Yes exactly. Plus the WP drama with WP engine put a really bad taste in my mouth how really strong that open-source ethos is.
BTW I found this and I may give this way a try too ahmetalmaz.com/blog/astro-h...
I get it. This is why Iβm slowly moving all my Wordpress sites to @astro.build. WP sites are slow, break, harder to custom design, and a security risk. Building with Astro + claude code is so much faster. Plus you get that transparency you mentioned.
Yikes, thanks for the heads up. This is exactly why Iβm starting to move everything away from Wordpress and going to @astro.build
Found out for one of my clients we lost the guy who did our email marketing copy and images.
So now I'm setting up some agents that will do it for us on a Slack prompt using the Klaviyo API and WooCommerce API.
Lean, mean marketing teams are going to be the new norm. #vibecoding #aimarketing
My AI SEO blogging assistant is live!
Created AI agents that takes my outline, does research, creates the draft, optimizes for SEO & more. Draft automatically posted as a new branch on GitHub.
Every Monday I get a Slack message to send in my outline on my next topic. #vibecoding #AImarketing
If we tax the rich as we should then it really wouldnβt matter, would it? As they would be the ones paying for itβ¦
Yes there were tons to choose from too but love Tina, especially since itβs free for small teams.
Welcome to the club, I've fallen head first in love with @astro.build, its amazing β€οΈ
Let me know if you find anything. I'm mainly just building and testing and see what works and what doesn't but a good guide would be nice.
Built a new @astro.build website in 5 hours for a client. Installed @tinacms.bsky.social so she can edit it herself without coding.
And automatic syncing with Fienta's public API with @cloudflare.social workers so she doesn't need to add events to her site manually. #buildinpublic #webdev
Iβll take a look!
Yep plus people with the better ideas will leave and go help people who actually listen.
I want to make another one where if I have an idea for a LinkedIn post I can ping a slack bot which calls to my server and my ai agent will run and make a post in my brand voices and put it as a draft on postiz. Skies the limit with an API!
Iβm making agents to help create posts in there. For example, I have an agents go through recent news, what Iβve retweeted and blog posts for an advocacy account I have and it creates posts and puts them on drafts on there. I check and edit, approve and schedule.