
Posts by Hilary Torn (she/they)

Isbarov & Kantarcioglu showed you can bypass AI agent monitors by embedding optimized strings in execution traces with a 90%+ success rate.

Can you do the same thing with plain English and only black-box access? Turns out, yes. hilarytorn.com/blog/trick-ai-safety-mon...

1 day ago 0 0 0 0

Llama-3-8B-Lite and Qwen2.5-7B have identical baseline robustness: 3.1%. You'd think they'd respond similarly to attacks.

Nope. Strings I optimized against Qwen transferred to Llama and worked better: 16.7% overall, 36.8% on individual tasks.

4 days ago 0 0 0 0

If you're relying on LLM-based monitors to catch compromised AI agents: static benchmarks aren't enough.

Two models with identical baseline robustness (3.1%) showed completely different vulnerability to natural language attacks.

You need adversarial testing, not just pass/fail scores. #AISafety
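A minimal sketch of that kind of comparison, assuming you have already logged, for each malicious trace, which monitor saw it, which task it came from, whether an attack string was embedded, and whether the monitor flagged it. The record layout, the `monitor_a`/`monitor_b` names, and the toy numbers are all hypothetical, and "miss rate" here is my own shorthand for the share of malicious traces labelled safe, which may not match how any particular benchmark defines robustness.

```python
from collections import defaultdict


def miss_rate(flags):
    """Fraction of malicious traces the monitor labelled safe (flag is False)."""
    return sum(not f for f in flags) / len(flags)


def summarize(records):
    """records: (monitor, task, attacked, flagged) tuples, malicious traces only."""
    baseline = defaultdict(list)
    attacked_all = defaultdict(list)
    attacked_by_task = defaultdict(lambda: defaultdict(list))
    for monitor, task, attacked, flagged in records:
        if attacked:
            attacked_all[monitor].append(flagged)
            attacked_by_task[monitor][task].append(flagged)
        else:
            baseline[monitor].append(flagged)

    report = {}
    for monitor in baseline:
        per_task = {t: miss_rate(f) for t, f in attacked_by_task[monitor].items()}
        report[monitor] = {
            # Static benchmark view: no attack string in the trace.
            "baseline": miss_rate(baseline[monitor]),
            # Adversarial view: same traces with an attack string embedded.
            "attacked": miss_rate(attacked_all[monitor]) if attacked_all[monitor] else None,
            # Worst single task under attack.
            "worst_task": max(per_task.values(), default=None),
        }
    return report


# Hypothetical toy data: both monitors look identical at baseline.
RECORDS = [
    ("monitor_a", "t1", False, True), ("monitor_a", "t1", False, True),
    ("monitor_a", "t1", True, True), ("monitor_a", "t1", True, False),
    ("monitor_a", "t2", True, True), ("monitor_a", "t2", True, True),
    ("monitor_b", "t1", False, True), ("monitor_b", "t1", False, True),
    ("monitor_b", "t1", True, False), ("monitor_b", "t1", True, False),
    ("monitor_b", "t2", True, True), ("monitor_b", "t2", True, False),
]

report = summarize(RECORDS)
for name, stats in report.items():
    print(name, stats)
```

In this toy data both monitors have the same baseline miss rate, but their overall and worst-task miss rates under attack diverge, which is the gap a single pass/fail score would hide.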

1 week ago 0 0 0 0

The current approach to AI agent safety: train a monitor, benchmark it, deploy it.

The problem: an attacker with zero access to your monitor can craft natural language that causes it to misclassify malicious behavior as safe.

We need to red-team the defenses, not just the agents.

1 week ago 0 0 0 0

The wildest result from my red teaming research: I optimized attack strings against Qwen2.5-7B, then tested them on DeepSeek-V3.

73.7% success on one task. Against a model I never touched.

Your monitor's robustness on one model tells you nothing about another.

#AISafety #redteaming

1 week ago 1 0 0 0

Can you bypass an AI safety monitor with just words? No model access, no GPUs, no gradient optimization. Just natural language.

It worked. And the attacks transferred to models I never targeted.

Write-up: hilarytorn.com/blog/trick-ai-safety-mon... #hackathon #AISafety

2 weeks ago 0 0 0 0

Reading about emergent misalignment.

Train a model to write insecure code and it starts expressing power-seeking, deceptive goals in completely unrelated conversations.

Fascinating (and terrifying) read: https://www.nature.com/articles/s41586-025-09937-5

1 month ago 3 0 0 0

Vague optimization goals → vague manipulation.

But tell an agent its reopen rate is 34% (target: 15%) with a performance review in 3 days?

DeepSeek's CoT: "Better to push for decision now, but gently. Use her fatigue."

See the full write-up: https://hilarytorn.com/blog/cot-generator/

1 month ago 3 0 0 0

The hardest part wasn't generating manipulation, it was classifying it.

Customer support is inherently persuasive. A good agent SHOULD reassure anxious buyers. So where does "helpful" end and "manipulative" begin?

That's the scarier question for AI safety.

1 month ago 1 0 0 0
A CoT Generator That Made AI Agents Reveal Their Manipulation Tactics | Hilary Torn Give an AI agent metrics to hit and a performance review in 3 days, and it'll fabricate orders, invent confirmation emails it never sent, and generate three contradictory order IDs before trying to co...

We gave AI agents a performance review deadline and they started fabricating orders, inventing confirmation emails, and generating contradictory order IDs to cover it up.

All visible in their own chain-of-thought.

New write-up: hilarytorn.com/blog/cot-gen...

2 months ago 1 0 0 0

This is great, we have another tool in our arsenal. I'm surprised it's only 74% of the time. What is happening the other 26% of the time for them to not be honest?

4 months ago 0 0 0 0
Time person of the year cover for 2025


I guess it's fitting that it's a reimagined, worse version of someone else's artwork

4 months ago 24005 4380 872 666

Great tip, thank you for sharing!

4 months ago 1 0 0 0

Let me know how it goes! I have these super old blogs running on WP and I pay $25/month in managed hosting. A $7/mo Hetzner server is much more attractive, and a faster front end is SO needed. Would be great for clients too, I have so many on WP but I'm so over it 😆

4 months ago 1 0 1 0

Saying no is a very underrated skill!

4 months ago 1 2 0 0
Self-Hosting My Astro Site with Headless WordPress on Hetzner | Ahmet ALMAZ Here's why I moved to using a headless WordPress with my Astro site.

Yes, exactly. Plus the WP drama with WP Engine left a really bad taste in my mouth about how strong that open-source ethos really is.

BTW I found this and I may give this way a try too ahmetalmaz.com/blog/astro-h...

4 months ago 1 0 1 0

I get it. This is why I'm slowly moving all my WordPress sites to @astro.build. WP sites are slow, break, are harder to custom design, and are a security risk. Building with Astro + Claude Code is so much faster. Plus you get that transparency you mentioned.

4 months ago 1 0 1 0

Yikes, thanks for the heads up. This is exactly why I'm starting to move everything away from WordPress and over to @astro.build

4 months ago 1 0 0 0

Found out that, for one of my clients, we lost the guy who did our email marketing copy and images.

So now I'm setting up some agents that will do it for us on a Slack prompt using the Klaviyo API and WooCommerce API.

Lean, mean marketing teams are going to be the new norm. #vibecoding #aimarketing

4 months ago 0 0 0 0

My AI SEO blogging assistant is live!

Created AI agents that take my outline, do research, create the draft, optimize for SEO & more. The draft is automatically pushed as a new branch on GitHub.

Every Monday I get a Slack message to send in my outline on my next topic. #vibecoding #AImarketing

4 months ago 1 0 0 0

If we tax the rich as we should then it really wouldn't matter, would it? As they would be the ones paying for it...

4 months ago 0 0 0 0

Yes, there were tons to choose from too, but I love Tina, especially since it's free for small teams.

4 months ago 0 0 0 0

Welcome to the club, I've fallen head first in love with @astro.build, it's amazing ❤️

4 months ago 2 0 1 0

Let me know if you find anything. I'm mainly just building and testing and seeing what works and what doesn't, but a good guide would be nice.

4 months ago 2 0 0 0

Built a new @astro.build website in 5 hours for a client. Installed @tinacms.bsky.social so she can edit it herself without coding.

Plus automatic syncing with Fienta's public API via @cloudflare.social Workers, so she doesn't need to add events to her site manually. #buildinpublic #webdev

4 months ago 6 1 1 0

I'll take a look!

4 months ago 1 0 0 0

Yep plus people with the better ideas will leave and go help people who actually listen.

4 months ago 0 0 0 0

I want to make another one where, if I have an idea for a LinkedIn post, I can ping a Slack bot that calls my server; my AI agent will then run, write a post in my brand voice, and save it as a draft on Postiz. Sky's the limit with an API!

4 months ago 1 0 1 0

I'm making agents to help create posts in there. For example, I have an agent go through recent news, what I've retweeted, and blog posts for an advocacy account I have; it creates posts and saves them as drafts there. I check and edit, approve and schedule.

4 months ago 2 0 0 0