Cambrian-S is a valuable first step in defining what “supersensing” might mean for video models. Our results simply highlight how subtle benchmark design choices can be exploited — and how we can improve them together.
📄 arxiv.org/abs/2511.16655
🔗 github.com/bethgelab/s...
Posts by Andreas Hochlehnert
This indicates that the tailored Cambrian-S inference strategy may rely on benchmark-specific shortcuts (e.g. rooms are never revisited), rather than building a persistent, spatial world model over time.
For VSI-Super-Counting (VSC), we run a sanity check:
🔁 VSC-Repeat: we concatenate each video with itself 1-5×
✅ Unique object count stays the same
❌ Cambrian-S accuracy drops from 42% → 0%
A genuine supersensing system should be robust here.
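The repeat check is easy to reproduce. A minimal sketch (hypothetical helper: `predict_count` stands in for whatever model/inference strategy is under test):

```python
import numpy as np

def vsc_repeat_check(frames: np.ndarray, predict_count, max_repeats: int = 5):
    """Concatenate a video with itself k times. The ground-truth unique
    object count is unchanged, so predictions should be stable across k."""
    baseline = predict_count(frames)
    results = {}
    for k in range(1, max_repeats + 1):
        repeated = np.concatenate([frames] * k, axis=0)  # repeat along time axis
        results[k] = predict_count(repeated)
    return baseline, results
```

A robust counting system returns the same answer for every k; a shortcut-driven one drifts as the video grows.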
We introduce a simple baseline called NoSense, an image-only (SigLIP) model that discards almost all temporal structure.
Surprisingly, it reaches 95% accuracy on VSI-Super-Recall (VSR), even on 4-hour videos.
This suggests VSR can be solved without true spatial supersensing.
🚨 New Paper: "Solving Spatial Supersensing Without Spatial Supersensing"
Huge credit to the Cambrian-S team for tackling one of the hardest open problems in video understanding: spatial supersensing. In our paper, we take a closer look at their benchmarks & methods 👇
Presenting A Sober Look at Progress in LM Reasoning at @colmweb.org today 🇨🇦 #COLM2025
📅 Today
🕔 11:00 AM – 1:00 PM
📍 Room 710 - Poster #31
We find that many “reasoning” gains fall within variance and show how to make evaluation reproducible again.
📘 bethgelab.github.io/sober-reasoning

Excited about this new work from @haoyuhe.bsky.social. TL;DR: Diffusion language models treat learning and inference differently, which lowers performance. RL can be used to close this gap for certain problems.
7/ Takeaway?
Many supposed gains don’t hold up under scrutiny.
Progress is possible—but let’s build on reproducible foundations.
🧠 Full paper: arxiv.org/abs/2504.07086
🧑‍🔬 By: @hrdkbhatnagar.bsky.social @vishaalurao.bsky.social @samuelalbanie.bsky.social @bayesiankitten.bsky.social @MatthiasBethge
6/ Our recommendations:
– Evaluate with ≥10 seeds
– Tune decoding per model
– Use appropriate prompts/templates
– Standardize hardware/software (we use Docker)
– Open-source everything
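The seed recommendation in code — report mean ± std over many seeds instead of one lucky run (a minimal sketch; `run_eval` is a hypothetical harness call):

```python
import statistics

def multi_seed_eval(run_eval, model, n_seeds: int = 10):
    """Run the same evaluation under n_seeds random seeds and
    report mean and sample standard deviation, not a point score."""
    scores = [run_eval(model, seed=s) for s in range(n_seeds)]
    return statistics.mean(scores), statistics.stdev(scores)
```

A claimed gain smaller than the reported std should be treated as noise, not progress.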
📦 Code, prompts, outputs: github.com/bethgelab/so...
5/ What actually works?
🔹 RL methods? Often negligible gains over distilled baselines, and prone to overfitting.
🔹 Supervised finetuning (SFT) on reasoning traces? Stable & generalizable.
4/ Variance is everywhere:
– Random seed: swings Pass@1 by 5–15pp
– Temperature/top-p: another ±10pp
– Software & Hardware? Yes, even that changes scores
🎯 Single-seed results on small datasets are essentially noise.
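Back-of-envelope for why: with accuracy p on an n-question benchmark, a single run has a sampling standard error of sqrt(p(1−p)/n). For a 30-question set (AIME-sized) at p = 0.5, that alone is ≈9 percentage points:

```python
import math

def pass1_standard_error(p: float, n: int) -> float:
    """Binomial standard error of a single Pass@1 estimate
    on an n-question benchmark with true accuracy p."""
    return math.sqrt(p * (1 - p) / n)

pass1_standard_error(0.5, 30)  # ≈ 0.091, i.e. ~9 pp from sampling alone
```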
3/ We re-evaluated recent 1.5B and 7B reasoning models on 6 benchmarks under controlled settings.
➡️ Performance dropped by up to 17%
➡️ Improvements fall within variance range of the base model
➡️ Some models don’t beat the baseline!
2/ Reasoning is the next frontier for LMs—but current evaluation practices often lack rigor.
We find that many celebrated gains from RL methods vanish once you:
✅ average over multiple seeds
✅ control decoding
✅ standardize prompt & infra
🧵1/ 🚨 New paper: A Sober Look at Progress in Language Model Reasoning
We re-evaluate recent SFT and RL models for mathematical reasoning and find most gains vanish under rigorous, multi-seed, standardized evaluation.
📊 bethgelab.github.io/sober-reason...
📄 arxiv.org/abs/2504.07086
New preprint out! 🎉
How does LLM training loss translate to downstream performance?
We show that pretraining data and tokenizer shape loss-to-loss scaling, while architecture and other factors play a surprisingly minor role!
brendel-group.github.io/llm-line/ 🧵1/8
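To make "loss-to-loss scaling" concrete, here is one simple parameterization — a power-law (linear in log–log) fit from pretraining loss to downstream loss. The numbers are illustrative, not from the paper:

```python
import numpy as np

# Illustrative (train loss, downstream loss) pairs across model scales.
train_loss = np.array([3.2, 2.9, 2.6, 2.4])
down_loss = np.array([2.8, 2.5, 2.2, 2.0])

# Fit log(down) = slope * log(train) + intercept.
slope, intercept = np.polyfit(np.log(train_loss), np.log(down_loss), 1)

def predict_downstream(train: float) -> float:
    """Predict downstream loss from pretraining loss under the fitted law."""
    return float(np.exp(intercept) * train ** slope)
```

The paper's point is that the fitted relationship shifts with pretraining data and tokenizer, but barely with architecture.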
We are just getting started! We're building better filters, aggregating released benchmarks (DataComp style), and developing fast, accurate OpenThinking models. Stay tuned! w/
@hrdkbhatnagar.bsky.social, @vishaalurao.bsky.social, @bayesiankitten.bsky.social, Matthias Bethge [6/6]
These issues encourage shortcuts and flawed reasoning. If GRPO rewards bad logic, models reinforce errors instead of improving. Garbage In, Garbage Out 🚨 [5/6]
🔸 Some questions reference figures that aren't included! Text-only models can't infer missing visuals. [4/6]
🔸 Mathematical proofs are a challenge. There's no automated way to verify them, and answers often only show an initial equation, leading to unreliable training signals. [3/6]
Example of multiple questions asked in the analyzed datasets
Blog (For Updates): huggingface.co/datasets/bet...
🔸 Some questions contain subquestions, but only one answer is labeled. The model may get penalized for "wrong" but valid reasoning. [2/6]
CuratedThoughts: Data Curation for RL Datasets 🚀
Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts have emerged for fine-tuning & GRPO. Our deep dive found major flaws: 25% of OpenThoughts had to be removed during data curation.
Here's why 👇🧵
SWE-bench Multimodal evaluation code is out now!
SWE-bench MM is a new set of JavaScript issues that have a visual component (‘map isn’t rendering correctly’, ‘button text isn’t appearing’).
www.swebench.com/sb-cli/
This is joint work with @oripress.bsky.social, @vishaalurao.bsky.social, @bayesiankitten.bsky.social, @ofirpress.bsky.social and Matthias Bethge