Limits of vector search
a new GDM paper shows that embeddings can't represent combinations of concepts well
e.g. Dave likes blue trucks AND Ford trucks
even k=2 sub-predicates make SOTA embedding models fall apart
www.alphaxiv.org/pdf/2508.21038
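One practical workaround (my own sketch, not from the paper, all names hypothetical): instead of embedding the conjunction into a single query vector, run one nearest-neighbor search per sub-predicate and intersect the result sets.

```python
# Toy sketch: conjunctive retrieval via per-predicate search + intersection,
# rather than a single averaged query vector. Embeddings are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))                # hypothetical document embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

def top_k(query, k=50):
    """Indices of the k docs with highest cosine similarity to query."""
    q = query / np.linalg.norm(query)
    return set(np.argsort(docs @ q)[-k:])

q_blue = rng.normal(size=64)                      # "blue trucks" sub-predicate
q_ford = rng.normal(size=64)                      # "Ford trucks" sub-predicate

# AND as a set intersection of the two retrievals, instead of
# searching once with the combined vector (q_blue + q_ford) / 2.
hits = top_k(q_blue) & top_k(q_ford)
```

Every hit then satisfies both sub-queries by construction, at the cost of two index lookups instead of one.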
Posts by Benjamin Lefaudeux
The image is a multi-panel bar chart comparing large language models across benchmarks, grouped into four categories: General Domains, Agentic Tool Use, Code, and Instruction Following. Scores per benchmark:
General Domains:
- ArenaHard-V2: Kimi K2 88.2, LongCat-Flash 86.5, Qwen3.5 MoE-2507 85.7, DeepSeek V3.1 84.1, Gemini 2.5 Flash 77.0, GPT-4.1 62.1, Claude Sonnet 61.5.
- MMLU-Pro: Kimi K2 84.5, DeepSeek V3.1 84.5, Claude Sonnet 83.7, LongCat-Flash 82.7, Qwen3.5 MoE-2507 82.1, Gemini 2.5 Flash 82.0, GPT-4.1 81.7.
Agentic Tool Use:
- τ²-Bench (average): LongCat-Flash 67.7, Kimi K2 64.2, Claude Sonnet 62.1, GPT-4.1 55.1, DeepSeek V3.1 49.8, Qwen3.5 MoE-2507 43.0, Gemini 2.5 Flash 40.9.
- VitaBench: LongCat-Flash 24.3, Claude Sonnet 23.0, DeepSeek V3.1 20.3, GPT-4.1 19.0, Kimi K2 18.2, Qwen3.5 MoE-2507 8.5, Gemini 2.5 Flash 8.0.
Code:
- SWE-Bench-Verified: Claude Sonnet 68.0, DeepSeek V3.1 66.0, Kimi K2 64.6, LongCat-Flash 60.4, GPT-4.1 48.6, Qwen3.5 MoE-2507 42.0, Gemini 2.5 Flash 40.6.
- TerminalBench: Claude Sonnet 40.7, LongCat-Flash 39.5, DeepSeek V3.1 31.3, GPT-4.1 28.4, Kimi K2 25.9, Qwen3.5 MoE-2507 17.3, Gemini 2.5 Flash 12.4.
Instruction Following:
- COLLIE: LongCat-Flash 57.1, Kimi K2 56.3, Claude Sonnet 51.2, GPT-4.1 50.0, DeepSeek V3.1 49.7, Gemini 2.5 Flash 48.6, Qwen3.5 MoE-2507 43.8.
- Meeseeks (ZH): LongCat-Flash 43.0, Kimi K2 42.8, Claude Sonnet 41.5, DeepSeek V3.1 35.3, GPT-4.1 35.1, Gemini 2.5 Flash 34.8, Qwen3.5 MoE-2507 33.8.
Longcat-Flash-Chat (560B)
uh, holy shit this one is intriguing. bare minimum they compare themselves to all the (actual) top models and do okay
but inside... damn, this one has some cool ideas
huggingface.co/meituan-long...
In 2012 when I had to clean data it seemed natural to look for rules I could use to clean it.
Now it seems natural to model the noise, find new clean data it can destroy, and then train a model to reverse the process.
Machine learning makes you a sicko.
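The "model the noise, then reverse it" move can be sketched in a few lines (my toy, with made-up data, not any production pipeline): corrupt a clean signal with Gaussian noise, then fit a model to map noisy back to clean.

```python
# Toy version of "model the noise, then train a model to reverse the process":
# corrupt clean data with Gaussian noise, then fit a linear denoiser to undo it.
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(loc=2.0, scale=1.0, size=5000)   # hypothetical clean signal
noisy = clean + rng.normal(scale=0.8, size=5000)    # the noise we chose to model

# Least-squares fit of clean ~ a * noisy + b, i.e. learn the reversal.
a, b = np.polyfit(noisy, clean, deg=1)
denoised = a * noisy + b

mse = lambda x: np.mean((x - clean) ** 2)
# mse(denoised) comes out below mse(noisy): the learned shrinkage (a < 1)
# trades a little bias for a lot less variance.
```

Swap the linear fit for a neural net and the Gaussian for a learned corruption model, and you have the modern recipe in miniature.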
Three things to note about this:
1) AI has obvious utility to many, this is a tremendous amount of use already
2) There is room for multiple frontier model providers, at least for now
3) Any losses from subsidizing cost of AI use (and it is not clear this is happening) are now relatively small
Above is intuitive when you think about it long enough (or so it feels at least), but I missed it entirely during a couple of years working on diffusion, so I figured it was worth emphasizing and the authors did too :)
Worth a deep read in general, not personally completely done with it, I hope it ages well. Closing with some nice insight wrt diffusion models: they don't open up serial awareness, since the model iterates on _the same_ solution; no state space, no carry-over. _Less_ powerful than autoregressive
The paper cannot prove its point completely, since models are really good approximators and are used as such (hence a formal disproof is not enough). Pretty good hints still, makes me confident we're far from peak efficiency in most use cases (we approximate serial awareness by adding tons of compute)
I think that hardware recommendations are a little naive/premature, as much as I like CPUs nothing will happen prior to needs and solutions being put on the table. Lowering is expensive and risky in general, will happen last, but at least this shows there's kryptonite to GPU dominance
The paper is very pedagogical, and some takeaways ring pretty reasonable. Intuition is interesting behind LLMs being just ok to not great Chess players (missing the MCTS like mechanism of specialized models), or failing to be effective at multi step reasoning prior to test time compute / CoT
It then feels like the dichotomy proposed by the paper (inherently parallel and TC0 models will fail on serial problems) is excessive, or at least that the frontier is a bit fuzzy. One line is great though, paraphrasing "only with test time compute did we factor in some serial compute power"
There are caveats in the definition of "inherently serial" problems:
- not all solutions will require serial computations, even for something outside of TC0
- approximations can fall pretty close, and oftentimes we don't expect anything much better than an approximation
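The textbook example of an inherently serial problem (a standard illustration, not taken from the paper) is an iterated hash chain: each step consumes the previous step's output, so no amount of parallel hardware shortens the dependency chain.

```python
# A classic "inherently serial" computation: an iterated hash chain.
# Step n+1 needs the output of step n, so the work cannot be spread
# across parallel workers, no matter how many are available.
import hashlib

def step(state: bytes) -> bytes:
    return hashlib.sha256(state).digest()

def hash_chain(seed: bytes, n: int) -> bytes:
    state = seed
    for _ in range(n):        # n sequential steps, by construction
        state = step(state)
    return state
```

Contrast with summing a list, where the work splits freely across workers; the conjecture is that constant-depth parallel models sit on the "summing" side of that divide.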
"The Serial Scaling Hypothesis" (arxiv.org/abs/2507.125..., Liu et al) is interesting I think, not as new as it completely looks (autoregressive models are used serially, models have depth,..) but feels like a good formalization and intuition as of where current GPT based LLMs will typically fail
1/ Can open-data models beat DINOv2? Today we release Franca, a fully open-sourced vision foundation model. Franca with a ViT-G backbone matches (and often beats) proprietary models like SigLIPv2, CLIP, DINOv2 on various benchmarks, setting a new standard for open-source research.
Claude Code is really good for some narrowly defined tasks (add unit tests for instance), and in that case it's clearly an agent. The "vibe coding" coding middle ground (with somebody in the loop who doesn't completely get it) is the part on shaky grounds I believe
Something the LLMs have not seen beforehand (new model architecture for instance). In my experience that's where all the current tools break, for relatable reasons. I guess it's the same for somebody developing a SOTA DB engine or computer shader
For things LLMs are not great at (typically new, frontier work) you're better off doing it instead of inheriting a broken spaghetti plate. Vibe coding your way to oblivion is not a great proposition for either of these. I don't think there's that much of a middle ground
In the coming age of agents, I think vibe coding will die out, same lasting power as prompt engineering. For things LLMs excel at, you might as well stick to higher-level directives and let it own the work, Claude Code is a good example. 1/2
this is probably why Meta was able to poach OpenAI ppl
aside from the absolute piles of cash, Sama is very SV-minded and can't imagine building apart from a product
a lot of accelerationists see things differently, more broadly, and it's dissatisfying to be forced into a product box
Qualitatively the chunking is real and meaningful
I was a bit short on the results in this thread re: HNets, they are pretty convincing even if taking over from transformers will take more validation. Of note, the models become naturally robust to typos, which is a great omen
Well you can read my thread, else the link is in the first post :) model weights are open
HNets chunk dynamically, that's why it's a big deal for me! Byte latent transformers were doing that already, so not exactly nothing, but not entirely mature, yes
comparisons with diffusion models are not a complete hit, because the comparison is with undistilled, 1000-step models, which nobody uses in their right mind (fast samplers & distilled models mean that images are clean in 4-8 steps, 30 tops). The fact that EBT is usable as-is is already great
Similarly to HNets I think the proof will be in the scaling, but there are good omens, where the technique works as you would expect it to. For instance, thinking more on out-of-distribution data has a bigger impact than on in-distribution (assuming the model was big enough to capture training set)
the big result is in the thinking: opening up the compute valves for the more complicated cases has a meaningful effect.
Note that there's an interesting operating mode attached to being able to self-assess: generate multiple options, then pick the better one (self-Monte-Carlo?)
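That "generate then self-assess" mode reduces to best-of-N under an energy score (a toy sketch of the idea, assuming the model exposes an energy where lower means better; the quadratic energy and target are made up for illustration):

```python
# Sketch of "generate multiple options, then pick the better one",
# assuming the model can score its own candidates with an energy
# (lower energy = more plausible prediction).
import numpy as np

rng = np.random.default_rng(0)
target = 1.5                          # stand-in for the "right" answer

def energy(y):
    return (y - target) ** 2          # toy verifier: quadratic bowl

candidates = rng.normal(size=8)       # 8 sampled "options"
best = min(candidates, key=energy)    # self-assessed pick
```

No extra verifier model needed: the same energy head used for prediction doubles as the selection criterion.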
the paper also feels meaningful in connection to something like transfusion arxiv.org/abs/2408.11039, which puts language tokens and continuous image representations in the same transformer. Not the case here (no mixed models), but the EBT framing does work for both representations
there are connections with diffusion/scoring models all around, beyond the steps in the right direction, among which the use of noise / Langevin dynamics for exploration / thinking
Forgot in the above, but assuming you can trust the model, it also gives you 3: how truthful the prediction is (assuming 1 and 2 don't team up effectively)
The paper runs pretty deep, besides the initial handwave which is nice and intuitive (model essentially predicts a step, not final distribution)
Looks like this, and now the even more interesting bit is that it doesn't have to be about language tokens, works across modalities
What this gives is twofold:
1 - whether you're done: the next-token prediction is precise enough and you can move on
2 - if not 1, where to go, gradient-descending the energy levels (see the similarity with scoring models?)
1 is just like NTP models. 2 gives you per-token extra thinking cycles
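The two points above can be sketched as a loop (my toy, not the paper's implementation: a scalar prediction, a hand-written quadratic energy, and a made-up threshold standing in for "precise enough"):

```python
# Toy picture of the two modes: treat the prediction y as a variable,
# descend the energy E(y), and stop once the energy says "good enough".
target = 0.7                        # stand-in for the correct continuation

def energy(y):
    return (y - target) ** 2

def grad(y):
    return 2.0 * (y - target)

y, lr, threshold = 0.0, 0.1, 1e-4
steps = 0
while energy(y) > threshold:        # point 1: "am I done?"
    y -= lr * grad(y)               # point 2: where to go next
    steps += 1                      # extra thinking cycles spent on this token
```

Easy tokens exit the loop almost immediately; hard ones burn more steps, which is exactly the per-token adaptive compute being described.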