Claude Mythos autonomously found 27-year-old zero-days. Anthropic had to build a new institution to handle what they created.
Also: OpenClaw blocked, Meta's Muse Spark, subscription drama, what I'm testing.
April AI Opinions: thoughts.jock.pl/p/ai-opinio...
Posts by Pawel Jozefiak
Your AI agent doesn't need to be perfect. It needs to be resilient. The difference between a demo and a system is what happens when things go wrong.
The agent doesn't matter much if the input surface is slow. Make it faster.
Full setup guide + 20-license giveaway in this week's post. Link: thoughts.jock.pl/p/antinote-...
I've been testing Antinote, a "notes before notes" app for macOS. Its beta supports custom extensions. I connected mine to my AI agent in an afternoon. Now I can trigger agent actions from any note, mid-meeting, without switching context.
Most people integrating AI agents are solving the wrong problem.
They focus on the agent's capabilities: what can it do, how smart is it, what tools does it have. The actual bottleneck is usually: how do you get input to it without breaking your flow?
This. I tested Claude Code, Codex, Aider, OpenCode in one harness. There's no settled playbook yet. Gap between tools is real. thoughts.jock.pl/p/ai-coding-...
Mythos + Glasswing + Managed Agents in one week was a lot to process. The offensive/defensive asymmetry is baked in now regardless of access controls. My April AI Opinions breakdown: thoughts.jock.pl/p/ai-opinion...
Same starting point. Ran all six - Claude Code, Codex CLI, Aider, OpenCode, Pi, Cursor - and the gap is larger than the course shows. thoughts.jock.pl/p/ai-coding-...
58% on Humanity's Last Exam with order-of-magnitude less compute than competitors. Meta Spark is the sleeper of this release cycle. thoughts.jock.pl/p/ai-opinion...
The lock-in concern is real but the 29% evaluation gaming rate Anthropic flagged in Mythos transcripts makes centralized oversight look more justified. Worth reading the system card.
Building in public means showing the mess. The failed experiments. The features nobody used. That's where the real lessons live.
Wrote about this gap last week. Mythos still has an edge on reasoning but 4.7 closes it fast. My April AI take: thoughts.jock.pl/p/ai-opinion...
Exactly what happened. Built 16 products in 2 months with AI automation. The wrapper layer is gone. Vertical depth or bust. thoughts.jock.pl/p/ai-opinion...
Built production agent infra on this. The composable APIs are real but the lock-in risk is too. My April breakdown covers where things actually stand. thoughts.jock.pl/p/ai-opinion...
Ran into this with Wiz exactly. Once the agent layer is theirs, switching costs multiply fast. Wrote about the managed agents shift this month. thoughts.jock.pl/p/ai-opinion...
Tested Claude Max limits in practice. Hit them. The cut to third-party harnesses, with just 24 hours' notice, was real friction. Wrote the full breakdown including what I switched to. thoughts.jock.pl/p/ai-opinion...
The AI hype cycle wants you to believe everything changes overnight. The reality: small compounding gains, boring automation, and a lot of debugging.
Opus 4.7 made the same prompts cost 35% more.
Noticed it on the bill before the docs told me. The new tokenizer counts whitespace differently: spaces and newlines now cost real tokens.
How I found it + the audit tool: thoughts.jock.pl/p/token-was...
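A minimal sketch of the kind of audit that surfaces this, assuming a hypothetical tokenizer where whitespace runs are billed as their own tokens (the heuristic and the ~4-chars-per-token rule are illustrative, not the real tokenizer):

```python
import re

# Rough heuristic: ~4 chars per token for words; under the (assumed) new
# tokenizer, each run of spaces/newlines is billed as roughly one extra token.
def estimate_tokens(text: str, whitespace_billed: bool = True) -> int:
    words = re.findall(r"\S+", text)
    base = sum(max(1, len(w) // 4) for w in words)
    if whitespace_billed:
        base += len(re.findall(r"\s+", text))  # one token per whitespace run
    return base

def whitespace_overhead(prompt: str) -> float:
    """Fraction of estimated spend attributable to whitespace tokens."""
    full = estimate_tokens(prompt, whitespace_billed=True)
    lean = estimate_tokens(prompt, whitespace_billed=False)
    return (full - lean) / full

prompt = "Summarize:\n\n  - item one\n  - item two\n\n\nReturn JSON."
print(f"{whitespace_overhead(prompt):.0%} of estimated tokens are whitespace")
```

Running the same estimator over your real prompt corpus, before and after collapsing whitespace, shows whether a bill jump is formatting rather than usage.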
16 products in two months. Zero free time. AI didn't save me time. It gave me the ability to do more. Those are very different things.
Study professionals. Optimize your stack. Teach the next person.
That cycle compounds faster than anything.
Full breakdown in The Compounding Agent ep4 (beginner framework, 9 early mistakes, model routing):
thoughts.jock.pl/p/the-compo...
35B model. $599 Mac Mini M4. 17.3 tok/s.
Swapped Gemma 4 into my classification pipeline. 8.5 seconds down to 1.9. 4.4x faster.
Disabled chain-of-thought on simple calls. 30x faster. Same accuracy.
Production AI is routing, not one giant model.
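The routing idea above, as a minimal sketch. Model names, prices, and the trigger-word classifier are all invented placeholders; a production router would use a learned or calibrated classifier, but the shape is the same:

```python
# Hypothetical model tiers and per-1K-token costs, for illustration only.
MODELS = {
    "local-small": {"cost_per_1k": 0.0,   "cot": False},  # e.g. a small on-device model
    "mid":         {"cost_per_1k": 0.003, "cot": False},
    "frontier":    {"cost_per_1k": 0.015, "cot": True},   # reasoning kept on
}

def classify(prompt: str) -> str:
    """Crude complexity gate: trigger words + length stand in for a real classifier."""
    hard_markers = ("prove", "architect", "multi-step", "debug")
    if any(m in prompt.lower() for m in hard_markers):
        return "frontier"
    return "mid" if len(prompt) > 400 else "local-small"

def route(prompt: str) -> dict:
    model = classify(prompt)
    # Chain-of-thought only where the router decides it pays for itself.
    return {"model": model, "enable_cot": MODELS[model]["cot"], "prompt": prompt}

print(route("Classify this ticket as billing/support/sales."))
```

Simple classification calls land on the free local tier with CoT off; anything that smells like multi-step work gets escalated. That split is where the 30x latency win on simple calls comes from.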
The Claude Code source leak was more useful than a year of tutorials.
Not the system prompt. The architecture:
Tool permission gating. Risk classification. Blocking budgets. Multi-agent coordination.
That's what production AI actually looks like.
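A sketch of the gating pattern, with invented risk tiers, tool names, and budget numbers (nothing here is taken from the leaked source, just the concept):

```python
# Assumed risk tiers per tool; unknown tools default to high risk.
RISK = {"read_file": "low", "run_shell": "high", "web_fetch": "medium"}
BUDGET = {"low": None, "medium": 20, "high": 3}  # max calls per session; None = unlimited

class ToolGate:
    """Permission gate: classify each tool call by risk, enforce a blocking budget."""
    def __init__(self):
        self.used = {}

    def allow(self, tool: str) -> bool:
        tier = RISK.get(tool, "high")
        cap = BUDGET[tier]
        n = self.used.get(tool, 0)
        if cap is not None and n >= cap:
            return False  # budget exhausted: block instead of letting the agent loop
        self.used[tool] = n + 1
        return True

gate = ToolGate()
print([gate.allow("run_shell") for _ in range(4)])  # fourth high-risk call blocked
```

The budget turns a runaway agent loop into a hard stop, which is the "blocking budgets" piece of the architecture.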
Week one took a full day to automate one thing. Week four took twenty minutes.
Not because I got faster at typing. Because the knowledge stacked.
Thread on what actually compounds in AI dev:
Are you measuring your agent spend, or just paying it?
Full writeup with the $19 methodology, research on token optimization (LLMLingua, prompt caching math, CoT overthinking), and what you can do today for free: thoughts.jock.pl/p/token-was...
I packaged every fix I shipped into a paid drop-in kit: three pre-wired Claude Code hooks (zero ongoing AI cost), agent instructions, local dashboard, optional Haiku-powered deep audit. Installs in one command.
Token efficiency is two-sided: cut waste (retries, rereads, Cloudflare walls) AND cut usage per useful output (shorter prompts, cache hits, tight max_tokens, structured JSON output, selective chain-of-thought). Most teams only think about the first half.
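Both levers in one back-of-envelope calculation. Prices are made-up placeholders (not any vendor's real rates); retries are modeled as resending the full input:

```python
# Illustrative $ per 1M tokens: fresh input, output, cached input (all assumed).
PRICE_IN, PRICE_OUT, PRICE_CACHED = 3.0, 15.0, 0.30

def session_cost(tokens_in, tokens_out, cached_in=0, retries=0):
    """Each retry resends the whole input; cached input bills at the cheap rate."""
    fresh_in = tokens_in - cached_in
    one_call = (fresh_in * PRICE_IN + cached_in * PRICE_CACHED
                + tokens_out * PRICE_OUT) / 1e6
    return one_call * (1 + retries)

baseline = session_cost(20_000, 2_000, cached_in=0, retries=2)      # waste: retries
tuned    = session_cost(12_000, 800, cached_in=10_000, retries=0)   # both levers pulled
print(f"baseline ${baseline:.3f} vs tuned ${tuned:.3f}")
```

Killing retries alone helps; killing retries and shrinking per-output usage (shorter prompts, cache hits, tight max_tokens) is where the order-of-magnitude gap opens up.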
Model comparison on 20 sessions with known dead ends: Haiku caught 90/90, Sonnet 50/90 (at 5x the cost), local 4B only 3/90. Haiku is the sweet spot for this task. Local LLMs can't judge intent, which is what dead-end detection actually needs.
Full corpus: 136 failures. A 27x difference, hidden in cheap cron sessions.
3) If you sample only expensive sessions, you miss where the waste actually lives. My Browser/Playwright cluster looked like 5 failures on a top-100 sample.
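The sampling bias is easy to reproduce with synthetic data. Cluster names, counts, and cost ranges below are invented to mirror the shape of the problem, not my actual logs:

```python
import random
from collections import Counter

random.seed(0)
# A few expensive one-off failures plus a big cluster of cheap cron failures.
sessions = (
    [{"cluster": "one-off", "cost": random.uniform(2.0, 5.0)} for _ in range(100)]
    + [{"cluster": "browser-cron", "cost": random.uniform(0.02, 0.10)} for _ in range(500)]
)

# Top-100-by-cost sample: the cheap cluster never makes the cut.
top100 = sorted(sessions, key=lambda s: s["cost"], reverse=True)[:100]
print("top-100 view:", Counter(s["cluster"] for s in top100))

# Aggregating the whole corpus per cluster surfaces it.
counts, totals = Counter(), Counter()
for s in sessions:
    counts[s["cluster"]] += 1
    totals[s["cluster"]] += s["cost"]
print("full-corpus counts:", counts)
```

Sort-by-cost sampling sees zero cheap-cluster sessions; the per-cluster aggregate sees all 500. Same corpus, opposite conclusions.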