
Posts by Florian Dorner

📅 Thursday, April 23, 2026 ⏰ 10:30 AM – 1:00 PM
📍 Pavilion 4, P4-#4413
📄 Paper: arxiv.org/pdf/2507.12399
💻 GitHub: github.com/socialfounda...

Joint work with @yatongchen.bsky.social @andcrz.bsky.social and Fanny Yang

6 hours ago

This is not just a theoretical phenomenon: In our experiments with differently sized Qwen verifiers, we see similar performance for all sizes at small N, but larger verifiers yield noticeably better performance when N is increased.

6 hours ago

The top-right region of the ROC determines early scaling, while the bottom-left determines behavior at large N. Thus we cannot extrapolate scaling laws from small-N observations: For any observed early scaling, there are multiple consistent ROCs, each associated with different large-N performance.

6 hours ago

We show that for any query, the performance of resampling methods like Best-of-N is fully determined by initial model accuracy and the verifier ROC curve. In particular, concave ROC curves imply monotonic scaling.
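To make this concrete, here is a minimal Monte-Carlo sketch (not the paper's code; all distributional choices are my assumptions) of Best-of-N with an imperfect verifier. Correct and incorrect answers receive equal-variance Gaussian verifier scores, which yields a concave ROC curve, and selection accuracy then increases monotonically in N:

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_accuracy(p, n, trials=20000):
    """Accuracy of Best-of-N when a noisy verifier picks the answer.

    p: initial model accuracy (probability a single sample is correct).
    Verifier scores are assumed N(1, 1) for correct and N(0, 1) for
    incorrect answers -- equal-variance Gaussians give a concave ROC.
    """
    # Each row is one query with n sampled answers.
    correct = rng.random((trials, n)) < p
    # Noisy verifier score for every candidate answer.
    scores = rng.normal(np.where(correct, 1.0, 0.0), 1.0)
    # Best-of-N keeps the highest-scoring candidate.
    picked = scores.argmax(axis=1)
    return correct[np.arange(trials), picked].mean()

for n in (1, 4, 16, 64):
    print(n, round(best_of_n_accuracy(0.3, n), 3))
```

Swapping in score distributions whose ROC is not concave can instead make accuracy degrade as N grows, which is why the whole curve, not a single operating point, matters.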

6 hours ago

At ICLR and interested in theory for LLMs? Join us at our poster to learn more about the (im)possibility of scaling laws for test-time scaling methods like Best-of-N when verification is imperfect!

6 hours ago

In light of the discussions about LLM-generated ICLR reviews, I recently wondered whether a similar dynamic might play out for LLMs: While pre-training objectives promote approximate indistinguishability of generated text, increasingly heavy post-training might make detection a lot easier...

4 months ago
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an importan...

In the second paper (arxiv.org/abs/2410.13341), we show that LLM judges weaker than the models they evaluate are of limited use for benchmarking, even if their judgments are processed in a statistically optimal way. Correspondingly, we cannot rely on LLM judges for evaluating frontier models.

4 months ago
ROC-n-reroll: How verifier imperfection affects test-time scaling Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sam...

In the first paper (arxiv.org/abs/2507.12399), we characterize how LLM judge errors affect test-time scaling via Best-of-N, based on the verifier's ROC curve. Our results point towards more efficient alternatives to Best-of-N, and explain why scaling laws for test-time scaling are unreliable.

4 months ago

Meet me at the Benchmarking workshop (sites.google.com/view/benchma...) at EurIPS on Saturday: We’ll present two works on errors in LLM-as-Judge and their impact on benchmarking and test-time scaling:

4 months ago

I'll be at @neuripsconf.bsky.social presenting Strategic Hypothesis Testing (spotlight!)

tldr: Many high-stakes decisions (e.g., drug approval) rely on p-values, but people submitting evidence respond strategically even w/o p-hacking. Can we characterize this behavior & how policy shapes it?

1/n

4 months ago

Also, from time to time, the wrong proofs it suggests for more complicated things seem to contain non-trivial insights and are "fixable".

5 months ago

Not much of a step up compared to the o1/o3 "thinking" versions of GPT-4. But quite a big step compared to base GPT-4. It still makes a lot of mistakes, but often produces correct proofs for simple Lemmata (not so much for more complicated stuff).

5 months ago
Vivian Nastl and Ricardo Dominguez-Olmedo receive 2025 Google Ph.D. Fellowship Program supports exceptional graduate students working on innovative research in computer science and related fields

Congratulations also to Vivian Nastl (supervised by Moritz Hardt) and Ricardo Dominguez-Olmedo (Moritz Hardt and Bernhard Schölkopf) for winning 2025 Global Google PhD fellowships.
Find out more about their work here: is.mpg.de/en/news/vivi...

@maxplanckcampus.bsky.social @unituebingen.bsky.social

5 months ago

The viral "Definition of AGI" paper tells you to read fake references which do not exist!

Proof: different articles appear at the specified journal/volume/page numbers, and the cited titles cannot be found in any searchable repository.

Take this as a warning to not use LMs to generate your references!

6 months ago

Assuming all problems are actually solvable...

6 months ago

Is that not trivially true, since LLMs assign nonzero probability to any possible string?

6 months ago

We (w/ Moritz Hardt, Olawale Salaudeen and
@joavanschoren.bsky.social) are organizing the Workshop on the Science of Benchmarking & Evaluating AI @euripsconf.bsky.social 2025 in Copenhagen!

📢 Call for Posters: rb.gy/kyid4f
📅 Deadline: Oct 10, 2025 (AoE)
🔗 More info: rebrand.ly/bg931sf

7 months ago

Do you have a list of the best ones? I vaguely recall reading things in this direction, but cannot really remember specific titles.

7 months ago

Wouldn’t it be great to have questions about LM internals answered in plain English? That’s the promise of verbalization interpretability. Unfortunately, our new paper shows that evaluating these methods is nuanced—and verbalizers might not tell us what we hope they do. 🧵👇1/8

7 months ago

The focus on evaluating checkpoints during a training run rather than different trained models is super interesting!

7 months ago
How Benchmark Prediction from Fewer Data Misses the Mark Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM ev...

Interesting work! Can you comment a bit on what you do differently compared to previous IRT-based LLM evaluation methods?

We recently did some work confirming IRT's efficacy for in-distribution models, but also found it to be quite brittle when it comes to novel models: arxiv.org/abs/2506.07673

7 months ago

I guess in terms of the notation from section 4 in the paper, does this plot Type X risk, or Type X Error Feasibility rate?

7 months ago

So I am trying to understand whether the asymptotics kick in a lot slower than I would have thought, at least for large n, or whether I am missing something else about the setup.

7 months ago

Thank you! Do I understand correctly that these results are independent of / orthogonal to the success-hacking ones? I guess my confusion stems from asymptotic theory for PPI (and by extension, seemingly, for DSL) suggesting that both type 1 and type 2 errors should be lower, or at most very similar.

7 months ago

Are the reported errors for the case of selecting the model with the most significant results, post-hoc?

7 months ago

Interesting work! Can you comment a bit more on the setup for the regression correction methods? As far as I understand, PPI++ (which should be quite similar to DSL) relatively reliably reduces variance compared to ground truth only, while remaining quite close to unbiased.

7 months ago

Does anyone have background on this plot, compared to the 32% performance for o3-mini-high with tool use claimed by OpenAI in January? #GPT5

openai.com/index/introd...
openai.com/index/openai...

8 months ago

Super interesting field, but worth keeping in mind that this usually only buys you a relatively small fraction of "extra ground truth labels" (this does not cover active sampling strategies, but I have not seen them yielding much larger improvements in practice, either): arxiv.org/abs/2410.13341
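For intuition on why the gain is bounded, here is a toy sketch (all numbers are assumptions, not from the paper) of the prediction-powered inference (PPI) mean estimator: the variance saving over using the n ground-truth labels alone is governed by how well the cheap predictions correlate with the labels, so the "extra labels" bought are a modest constant-factor multiple of n, not anything like N/n.

```python
import numpy as np

rng = np.random.default_rng(1)

def ppi_mean(y, f_labeled, f_unlabeled):
    # PPI estimate of E[Y]: mean prediction on the large unlabeled set,
    # plus a bias correction estimated from the small labeled set.
    return f_unlabeled.mean() + (y - f_labeled).mean()

def simulate(rho=0.8, n=200, N=20000, reps=2000):
    """Std of the labeled-only mean vs the PPI estimate.

    Labels y ~ N(0, 1); predictions f have correlation rho with y and
    the same N(0, 1) marginal (toy assumptions).
    """
    plain, ppi = [], []
    for _ in range(reps):
        y = rng.normal(size=n)
        f_lab = rho * y + np.sqrt(1 - rho**2) * rng.normal(size=n)
        f_unlab = rng.normal(size=N)  # same marginal as f_lab
        plain.append(y.mean())
        ppi.append(ppi_mean(y, f_lab, f_unlab))
    return float(np.std(plain)), float(np.std(ppi))

std_plain, std_ppi = simulate()
# Squared std ratio ~ effective multiple of the labeled sample size.
print(std_plain, std_ppi, (std_plain / std_ppi) ** 2)
```

Even with a fairly strong predictor (rho = 0.8 here), the squared ratio stays a small constant, despite the unlabeled set being 100x larger than the labeled one.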

8 months ago

Do you have a source re: attendance requirement? 👀

9 months ago

Not sure this can ethically be done retroactively (due to participant consent). But given that 20% of data is shared with model providers, privacy concerns with instead sharing this data publicly in the future seem surmountable.

11 months ago