
Posts by Andreas Hochlehnert

GitHub - bethgelab/supersanity: A critical analysis of the Cambrian-S model and VSI-Super benchmarks

Cambrian-S is a valuable first step in defining what “supersensing” might mean for video models. Our results simply highlight how subtle benchmark design choices can be exploited — and how we can improve them together.

📄 arxiv.org/abs/2511.16655
🔗 github.com/bethgelab/s...

4 months ago

This indicates that the tailored Cambrian-S inference strategy may rely on benchmark-specific shortcuts (e.g. rooms are never revisited), rather than building a persistent, spatial world model over time.

4 months ago

For VSI-Super-Counting (VSC), we run a sanity check:

🔁 VSC-Repeat: we concatenate each video with itself 1-5×
✅ Unique object count stays the same
❌ Cambrian-S accuracy drops from 42% → 0%

A genuine supersensing system should be robust here.
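The check is simple enough to sketch in a few lines. Here a "video" is just a list of frames, each frame a set of object IDs, a hypothetical stand-in for real detections rather than the benchmark's actual format:

```python
def unique_object_count(video):
    """Ground-truth number of unique objects across all frames."""
    seen = set()
    for frame in video:
        seen |= frame
    return len(seen)

def vsc_repeat(video, k):
    """VSC-Repeat: concatenate the video with itself k times."""
    return video * k

video = [{"chair", "lamp"}, {"lamp", "sofa"}, {"plant"}]
base = unique_object_count(video)

# Repeating the video 1-5x leaves the unique-object count unchanged,
# so a robust counter's answer should not change either.
for k in range(1, 6):
    assert unique_object_count(vsc_repeat(video, k)) == base
```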

4 months ago

We introduce a simple baseline called NoSense, an image-only (SigLIP) model that discards almost all temporal structure.

Surprisingly, it reaches 95% accuracy on VSI-Super-Recall (VSR), even on 4-hour videos.

This suggests VSR can be solved without true spatial supersensing.
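The claim can be illustrated with a toy sketch: score every frame independently and take a max, so frame order never matters. Per-frame object sets stand in for SigLIP image-text similarities here; all data and names are hypothetical, not the paper's actual NoSense implementation:

```python
def nosense_recall(frames, query_object):
    """Answer a recall query ("did X ever appear?") with no notion of
    time: each frame is scored independently, then max-pooled."""
    per_frame = [1.0 if query_object in frame else 0.0 for frame in frames]
    return max(per_frame) >= 0.5

frames = [{"chair"}, {"lamp"}, {"sofa", "lamp"}]
assert nosense_recall(frames, "lamp")
assert not nosense_recall(frames, "plant")

# Shuffling or reversing the frames cannot change any answer,
# which is exactly what "discards temporal structure" means.
assert nosense_recall(list(reversed(frames)), "lamp")
```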

4 months ago

🚨 New Paper: "Solving Spatial Supersensing Without Spatial Supersensing"

Huge credit to the Cambrian-S team for tackling one of the hardest open problems in video understanding: spatial supersensing. In our paper, we take a closer look at their benchmarks & methods 👇

4 months ago

Presenting A Sober Look at Progress in LM Reasoning at @colmweb.org today 🇨🇦 #COLM2025

📅 Today
🕔 11:00 AM – 1:00 PM
📍 Room 710 - Poster #31

We find that many “reasoning” gains fall within variance and show how to make evaluation reproducible again.
📘 bethgelab.github.io/sober-reasoning

6 months ago

🖐️

7 months ago

Excited about this new work from @haoyuhe.bsky.social. TL;DR: Diffusion language models treat learning and inference differently, which lowers performance. RL can be used to overcome this issue for certain problems.

8 months ago
A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility
Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with...

7/ Takeaway?

Many supposed gains don’t hold up under scrutiny.
Progress is possible—but let’s build on reproducible foundations.

🧠 Full paper: arxiv.org/abs/2504.07086

🧑‍🔬 By: @hrdkbhatnagar.bsky.social @vishaalurao.bsky.social @samuelalbanie.bsky.social @bayesiankitten.bsky.social @MatthiasBethge

1 year ago
GitHub - bethgelab/sober-reasoning

6/ Our recommendations:
– Evaluate with ≥10 seeds
– Tune decoding per model
– Use appropriate prompts/templates
– Standardize hardware/software (we use Docker)
– Open-source everything

📦 Code, prompts, outputs: github.com/bethgelab/so...
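The multi-seed part of these recommendations boils down to a small evaluation loop. A minimal sketch, with a toy stand-in for a sampled LM (the model, dataset, and numbers are hypothetical, not the paper's pipeline):

```python
import random
import statistics

def evaluate(model_fn, problems, seed):
    """One evaluation run under a fixed random seed; returns accuracy."""
    rng = random.Random(seed)
    return sum(model_fn(p, rng) for p in problems) / len(problems)

def multi_seed_eval(model_fn, problems, n_seeds=10):
    """Report mean and std over >= 10 seeds instead of a single run,
    so seed-to-seed swings show up in the result."""
    scores = [evaluate(model_fn, problems, s) for s in range(n_seeds)]
    return statistics.mean(scores), statistics.stdev(scores)

# Toy "model": solves a problem 60% of the time, sampling-dependent.
def toy_model(problem, rng):
    return rng.random() < 0.6

mean, std = multi_seed_eval(toy_model, problems=list(range(200)))
print(f"Pass@1: {mean:.2f} +/- {std:.2f}")
```

Fixing the seed inside `evaluate` also makes each individual run reproducible, which is the other half of the recommendation.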

1 year ago

5/ What actually works?
🔹 RL methods over distillations? Often negligible gains, prone to overfitting.

🔹 Supervised finetuning (SFT) on reasoning traces? Stable & generalizable.

1 year ago

4/ Variance is everywhere:

– Random seed: swings Pass@1 by 5–15pp
– Temperature/top-p: another ±10pp
– Software & Hardware? Yes, even that changes scores

🎯 Single-seed results on small datasets are essentially noise.

1 year ago

3/ We re-evaluated recent 1.5B and 7B reasoning models on 6 benchmarks under controlled settings.

➡️ Performance dropped by up to 17%
➡️ Improvements fall within variance range of the base model
➡️ Some models don’t beat the baseline!

1 year ago

2/ Reasoning is the next frontier for LMs—but current evaluation practices often lack rigor.

We find that many celebrated gains from RL methods vanish once you:

✅ average over multiple seeds
✅ control decoding
✅ standardize prompt & infra

1 year ago

🧵1/ 🚨 New paper: A Sober Look at Progress in Language Model Reasoning
We re-evaluate recent SFT and RL models for mathematical reasoning and find most gains vanish under rigorous, multi-seed, standardized evaluation.

📊 bethgelab.github.io/sober-reason...
📄 arxiv.org/abs/2504.07086

1 year ago

New preprint out! 🎉

How does LLM training loss translate to downstream performance?

We show that pretraining data and tokenizer shape loss-to-loss scaling, while architecture and other factors play a surprisingly minor role!
brendel-group.github.io/llm-line/ 🧵1/8

1 year ago

We are just getting started! We're building better filters, aggregating released benchmarks (datacomp style), and developing fast, accurate OpenThinking models. Stay tuned! w/
@hrdkbhatnagar.bsky.social, @vishaalurao.bsky.social, @bayesiankitten.bsky.social, Matthias Bethge [6/6]

1 year ago

These issues encourage shortcuts and flawed reasoning. If GRPO rewards bad logic, models reinforce errors instead of improving. Garbage In, Garbage Out 🚨 [5/6]

1 year ago

🔸 Some questions reference figures that aren't included! Text-only models can't infer missing visuals. [4/6]

1 year ago

🔸 Mathematical proofs are a challenge. There's no automated way to verify them, and answers often only show an initial equation, leading to unreliable training signals. [3/6]

1 year ago
Example of multiple questions asked in the analyzed datasets

Blog (For Updates): huggingface.co/datasets/bet...

🔸 Some questions contain subquestions, but only one answer is labeled. The model may get penalized for "wrong" but valid reasoning. [2/6]

1 year ago

CuratedThoughts: Data Curation for RL Datasets 🚀

Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts have emerged for fine-tuning & GRPO. Our deep dive found major flaws: our data curation flagged 25% of OpenThoughts for removal.

Here's why 👇🧵

1 year ago

SWE-bench Multimodal evaluation code is out now!

SWE-bench MM is a new set of JavaScript issues that have a visual component (‘map isn’t rendering correctly’, ‘button text isn’t appearing’).

www.swebench.com/sb-cli/

1 year ago

This is joint work with @oripress.bsky.social, @vishaalurao.bsky.social, @bayesiankitten.bsky.social, @ofirpress.bsky.social and Matthias Bethge

1 year ago
CiteME is a benchmark designed to test the abilities of language models in finding papers that are cited in scientific texts.

We are presenting CiteME today at the 11 AM poster session (East Exhibit Hall A-C, #3309).

CiteME is a challenging benchmark for LM-based agents to find paper citations, moving beyond simple multiple-choice Q&A to real-world use cases.

Come by and say hi :)

citeme.ai

1 year ago
Tübingen AI

Here's a fledgling starter pack for the AI community in Tübingen. Let me know if you'd like to be added!

go.bsky.app/NFbVzrA

1 year ago