The meta-lesson is older than embeddings.
Whenever you see a dial with multiple settings, the interesting question is never "which setting is best."
It's "where does the marginal return collapse?"
That's where the right answer lives. Almost always.
Draw the curve.
Posts by Ran Aroussi
One more gift of Matryoshka: the decision is reversible downward, never upward.
768 → 512 → 384 is free at any time.
384 → 768 requires re-embedding everything.
So: start at the generous end. Truncate later when scale demands it.
And it's not just disk.
HNSW index RAM scales with dimensionality. 25% smaller vectors = 25% less RAM -- often the difference between fitting on one box and needing the next tier up.
Query latency drops proportionally. Smaller dot products, faster SIMD.
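The RAM claim is easy to back-of-envelope. A hedged sketch, assuming an hnswlib-style layout where each element stores its float32 vector plus roughly 2*M four-byte neighbor ids at layer 0 (higher layers ignored); the 10M-vector corpus and M=16 are made-up numbers, not from the original thread:

```python
def hnsw_ram_gb(n_vectors: int, dim: int, m: int = 16) -> float:
    """Rough HNSW memory estimate (assumption-laden, hnswlib-style layout)."""
    vector_bytes = dim * 4      # float32 components
    link_bytes = 2 * m * 4      # layer-0 neighbor list, 4-byte ids
    return n_vectors * (vector_bytes + link_bytes) / 1e9

for dim in (768, 512):
    print(f"{dim} dims: {hnsw_ram_gb(10_000_000, dim):.1f} GB")
```

With these assumptions, 10M vectors need ~32 GB at 768 dims and ~22 GB at 512, which is exactly the "one box vs. next tier up" gap.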
This is the lesson.
Looking at absolute numbers ("how small can I go?") makes 384 look smart.
Looking at *marginal* numbers ("what's each step costing me?") makes 768 or 512 look smart (768 effectively being "free").
The first framing is lazy. The second is engineering.
Plot it and the curve tells the whole story.
Convex. Falls steeply at first — you get lots of storage for almost no quality cost — then flattens, where each additional byte costs more quality.
The knee of the curve is right around 512.
But quality drops non-linearly (retrieval nDCG, Jina v3 on MTEB).
The first truncation buys 8x more storage per quality point than the last.
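The marginal framing is one loop to compute. A sketch with ILLUSTRATIVE nDCG numbers, not the real Jina v3 MTEB figures; the real curve has the same convex shape:

```python
# Marginal cost of each Matryoshka truncation step.
# nDCG values are ILLUSTRATIVE placeholders, not real benchmark numbers.
dims = [1024, 768, 512, 384, 256]
ndcg = [0.660, 0.655, 0.647, 0.635, 0.615]  # assumed convex drop-off

for (d1, q1), (d2, q2) in zip(zip(dims, ndcg), zip(dims[1:], ndcg[1:])):
    bytes_saved = (d1 - d2) * 4   # float32 = 4 bytes per dimension
    quality_cost = q1 - q2
    print(f"{d1}->{d2}: saves {bytes_saved} B/vec, "
          f"costs {quality_cost:.3f} nDCG, "
          f"ratio {bytes_saved / quality_cost:,.0f} B per nDCG point")
```

With these placeholder shapes, the first step saves about 8x more bytes per point of quality than the last, which is the whole argument for stopping on the flat part of the curve.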
Start with storage. Vectors are float32 = 4 bytes per dimension.
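The storage math itself fits in a loop. A sketch assuming a 10-million-vector corpus (the corpus size is made up for illustration):

```python
# Raw storage for N float32 vectors at each Matryoshka dimension.
N = 10_000_000      # assumed corpus size
BYTES_PER_DIM = 4   # float32

for dim in (1024, 768, 512, 384, 256):
    gb = N * dim * BYTES_PER_DIM / 1e9
    print(f"{dim:>4} dims: {gb:5.1f} GB")
```

At 10M vectors that runs from ~41 GB at 1024 dims down to ~10 GB at 256.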
Tempting to just grab 256 and move on, right?
So the natural instinct is: find the sweet spot.
Then I did the math. The answer surprised me.
Modern models (Jina v3, OpenAI v3, Nomic) use something called Matryoshka Representation Learning.
Like the Russian dolls — the model is trained so you can chop off the tail of the vector and it still works.
1024 → 768 → 512 → 384 → 256. No re-embedding. Just slice. Feels like a free lunch.
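The slice really is just a slice. A minimal sketch, assuming a Matryoshka-trained model and re-normalizing after truncation so cosine similarity still behaves; the helper name is mine, not from any SDK:

```python
import numpy as np

def truncate_embedding(vec, dim: int) -> np.ndarray:
    """Matryoshka truncation: keep the first `dim` values, re-normalize.
    (Hypothetical helper, not part of any embedding SDK.)"""
    head = np.asarray(vec, dtype=np.float32)[:dim]
    norm = np.linalg.norm(head)
    return head / norm if norm > 0 else head

full = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
small = truncate_embedding(full, 256)
print(small.shape)  # (256,)
```

No re-embedding call, no model in the loop: the stored 1024-dim vectors can be served at 256 dims whenever scale demands it.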
The list has a length. Usually 256, 384, 512, 768, 1024, ...
Longer list = more nuance captured = better retrieval.
But also: more disk, more RAM, slower queries.
So there's a dial. And the question is where to set it.
First, what are embeddings?
An embedding is a list of numbers that represents the *meaning* of something: a sentence, an image, a product.
"Dog" and "puppy" get similar lists. "Dog" and "skyscraper" don't.
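"Similar lists" means high cosine similarity. A toy sketch with made-up 4-dim vectors (real models emit hundreds of dimensions):

```python
import numpy as np

def cosine(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dog        = [0.9, 0.1, 0.3, 0.0]
puppy      = [0.8, 0.2, 0.4, 0.1]
skyscraper = [0.0, 0.9, 0.0, 0.8]

print(cosine(dog, puppy))       # high: similar meaning
print(cosine(dog, skyscraper))  # low: unrelated meaning
```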
Every RAG system, semantic search, and agent memory runs on these.
Been upgrading an embeddings pipeline this week, and I thought I'd share some insights that might help people get a better feel for embeddings.
Specifically: a lesson about optimization that surprised me.
Let's get into it. 👇
(warning: technical stuff ahead)
x.com/aroussi/sta...
Part 3 of the Architect series is getting good feedback.
Part 1: AI writes the code.
Part 2: AI fixes the bugs.
Part 3: Who watches the server?
Nobody's talking about this.
It's the one that still wakes you up at 3am.
x.com/aroussi/sta...
I once spent an hour debugging a payment flow in production. Tracing logs, checking deploys, reading diffs.
The payment processor was returning malformed JSON. Their status page said everything was fine.
Here's a possible solution.
x.com/aroussi/sta...
We use AI to write the code. AI to review the code. AI to fix the code. Then we deploy it onto infrastructure like absolute cavemen – SSH, hope, and prayer. It's 2026. This is embarrassing.
x.com/aroussi/sta...
– Monitoring tool: "Service is down."
– Me: "Cool. Why?"
– Monitoring tool: 🤷‍♂️
There's a better model. Wrote about it.
A week or so ago, I wrote about the law firm model for dev agencies — Architects running pods with revenue share.
The most common question: "How does one Architect deliver like a team of 5?"
This is the answer.
x.com/aroussi/sta...
The hard problem in AI coding isn't code generation. Claude Code already solved that. It's decomposition. Planning. Verification. Simplification. PR management. Institutional memory.
That's engineering management, not a smarter agent.
x.com/aroussi/sta...
Three human checkpoints. Everything else autonomous.
- Approve the plan
- Review the PR
- Ship the release
That's how a good CTO works with a good team. It should be how you work with AI.
Coding agent: executes a task when you point it
Engineering daemon: runs your delivery pipeline while you lead
The gap between those two things is the gap between a freelancer and a team.
x.com/aroussi/sta...
You can be 10x more productive with AI and still starve if nobody knows you exist.
The solo dev trap is real.
The answer isn't "get better at marketing." It's structural.
x.com/aroussi/sta...
Every pod owner in this model gets what no traditional agency offers:
→ Revenue share on their projects
→ Shared profit pool across all pods
→ Firm handles sales, brand, infra, design, marketing
→ They just build
Old model. New tech. Big implications.
x.com/aroussi/sta...
Junior developers in 2026 shouldn't be hired for their coding ability.
They should be hired as the pipeline to becoming Architects — the hybrid of senior dev + product owner + team lead that runs an AI-augmented pod.
18-24 months from junior to running their own pod.
x.com/aroussi/sta...
"Go solo and earn a little more per hour but work twice as hard"
vs
"Stay at an agency and earn less while someone else captures your leverage"
There's a third option. Law firms figured it out 200 years ago.
x.com/aroussi/sta...