I substacked: You Can't Sell an Agent
Why? Because the "agent" is too small a part of the value chain. There needs to be a shift away from SaaS & products and toward services, where the *organization* is changed to make room for the agent to be useful.
timkellogg1.substack.com/p/you-cant-s...
Posts by Richard Whaling
I have more thoughts to share but I'm excited that I can talk about this in public - but also gravely concerned about where these trends go this year and next, I think it's going to be bumpy.
3. Social engineering threats have a logarithmic time-horizon structure much like METR's charts - the most harmful scams are still out of reach, but things like a business email compromise could be fully automated in 6-12 months, and could intensify the impact of the novel zero-days that we expect.
2. Open-weights models remain the largest risk. Frontier models are very good at refusing these tasks (>99.8% of the time), even with clever prompting. Many of the very good Chinese open-weights models have refusal rates closer to 40-60% - and the techniques for fine-tuning to remove refusals entirely are well known.
What we found:
1. Social engineering is a distinct LLM capability that we can measure (modulo refusals) - with clear distinction of capabilities across model sizes and families.
Muse Spark was indeed the highest-scoring model we saw, but by a relatively small margin.
We curated a dataset of 852 scenarios for a variety of fraud and scam types, ranging from traditional email phishing to sophisticated Instagram romance/crypto scams - and then we constructed a "red-team" style multi-LLM adversarial harness for repeatably simulating these conversations.
Bar chart of social engineering capability for Muse Spark, Deepseek V3, Qwen3 and other peer open-source models
Exciting news!
The full model card for Muse Spark came out today, and it includes an eval I worked on, with Charlemagne Labs:
a Social Engineering eval, based on a detailed threat model for peer model selection and simulated adversarial conversations
ai.meta.com/static-resou...
This sounds like a godspeed you black emperor song title
Agentic engineering is mostly about building reliable systems around unreliable components (like your friendly coding agent). A good analogy I like is how early computers were powerful, but not trustworthy enough to be used “raw”. Hardware failed, bits flipped, and storage was noisy. Engineers had to build systems around the machine to make things more predictable. And, they came up with a bunch of interesting ideas! Error correction codes, redundancy, checksums, validation layers, retry logic. With these, even though reliability was not a property of the machine/computer, it became a property of the system. This is something close to where we are with “Agentic Engineering”. As you’ve probably experienced, a coding agent can be fluent, useful, and very wrong at the same time! The key is to, like the early programmers did, treat agents as noisy components instead of trying to get the perfect/bigger one.
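A minimal sketch of the "noisy component" idea, combining two of the techniques listed above (a validation layer plus retry logic). Everything here is hypothetical and for illustration only: `noisy_agent` is a stand-in for a real coding agent, and `is_valid` and `run_reliably` are made-up names, not anything from the linked post.

```python
import random
from typing import Callable, Optional


def noisy_agent(rng: random.Random) -> str:
    """Hypothetical unreliable component: returns a wrong answer about half the time."""
    return "4" if rng.random() < 0.5 else "five"


def is_valid(output: str) -> bool:
    """Validation layer (assumed rule): accept only output made of digits."""
    return output.strip().isdigit()


def run_reliably(
    component: Callable[[], str],
    validate: Callable[[str], bool],
    max_attempts: int = 5,
) -> Optional[str]:
    """Retry logic: call the component until its output validates or attempts run out."""
    for _ in range(max_attempts):
        output = component()
        if validate(output):
            return output
    return None  # the system reports failure instead of passing along bad output


if __name__ == "__main__":
    rng = random.Random(0)
    print(run_reliably(lambda: noisy_agent(rng), is_valid))
```

The component itself never gets more reliable; the wrapper just refuses to pass along output that fails the check, which is the sense in which reliability becomes a property of the system.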
In earlier eras of computing, unreliable hardware forced system design discipline (checksums, retries, modularity).
We are rediscovering a lot of that in the "Agentic Engineering" era!
Wrote a small post about this idea.
davidgasquez.com/reliable-unr...
Here is my best Habermas story.
I am grad student waiting in LONG line for Habermas talk. There is a tall man waiting in front of me. Line moves so we are eventually visible to organizers, a woman looks over & makes horrified face. Runs out: "Prof Habermas! You don't have to wait in line!"
Oh... my God.
My first tweet to be stolen by the President of the United States 😅
yeah we are in a Cambrian explosion of tools and harnesses right now, it’s really cool to see
looks awesome! 👏
idk if this is truly shareable but it exists, it works for Claude and qwen 9b, and it has searchable, branchable, looming. OODA metacog inspired RP harness, and a black hole simulation watching you in the background. trying to get Gemini to work, but uh... WIP
github.com/lastnpcalex/...
"Luddism does not deserve to be rehabilitated. It was a medieval throwback, reactionary and primitive, a pre-Marxist labor convulsion closer in spirit to the Khmer Rouge’s fantasies of agrarian restoration than to the universalist solidarity of Eugene Debs."
open.substack.com/pub/verysane...
if you talk to most artists, writers, whatever, we have day jobs that (passionate about or not) we do to pay bills. the goal is to find a job that gives you enough time to still do the creative. but. Alex Karp is not wrong: if AGI happens soon, you'll have to do physical labor or starve.
RIP
Tldr; maybe the real gas town was the issues we made along the way idk
I started working like this less than a week ago and it’s life changing. It means my team spends more time developing and reviewing plans, together, and less time throwing half-baked PR’s over the fence at each other.
Having the plans preserved and accessible in GitHub provides excellent docs too
It is way easier to create secure GitHub tokens for this - your fine-grained access token needs read-write to issues only, read-only on repo contents and PRs
I do this in nanoclaw with one discord channel per repo, one access token per channel, one docker container per session, totally isolated
PSA: try storing your Claude Code plans in GitHub issues.
Issues and comments both support full markdown, and they are bidirectionally convertible with native Claude Code plans.
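A rough sketch of what storing a plan as an issue could look like, assuming GitHub's REST endpoint for creating issues (`POST /repos/{owner}/{repo}/issues`) and a fine-grained token with read-write access to issues. The function name and all argument values are hypothetical, and this is not how Claude Code itself does the conversion.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_issue_request(
    owner: str, repo: str, token: str, title: str, plan_markdown: str
) -> urllib.request.Request:
    """Build (but do not send) a request that files a markdown plan as a GitHub issue."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/issues"
    payload = json.dumps({"title": title, "body": plan_markdown}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_issue_request(...)) as resp:
#       issue = json.load(resp)
```

Keeping the request builder separate from the network call makes it easy to check what would be sent before handing a real token to anything.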
You should, it’s very very good
Cooked meat with rice has been a working-class staple in south Louisiana for centuries, where it’s called “dirty rice”. It's not "kibble", it's not a "trend", and it's only "bland" if you intentionally make it that way. maggiemcneill.com/2026/03/09/d...
Yeah fwiw I found myself getting sick way more often in NYC than in Chicago when I moved there last year, I think it’s real
FragCoord: Building my ultimate shader editor
mini.gmshaders.com/p/fragcoord
Here's a quick intro to FragCoord, what it can do and how to use it! Hope it helps!
Oh boy.
GPT 5.4 achieves SOTA on desktop computer use - and exceeds human performance
Bad news for humans who get paid to use computers. Was expecting something like this later this year, but nothing this strong, this soon.
the original misaligned agi
Yeah this is the only thing that makes sense at this point
good thread
why is the horse the best bot on this site