Paper: "Many AI Analysts, One Dataset: Navigating the Agentic Data Science Multiverse"
Joint with Martin Bertran, Riccardo Fogliato
Paper: arxiv.org/pdf/2602.18710
We are planning to make these tools available so you can generate the multiverse for your own data and hypotheses. Stay tuned!
Posts by Steven Wu
Open questions remain:
How do we interpret specification distributions?
What measures of robustness are meaningful?
How should this interact with preregistration?
But the core shift is clear: once broad exploration becomes feasible, transparency should change with it.
Our proposal: transparency norms should match computational capacity.
If AI makes it cheap to generate many analyses, we should:
→ Report distributions across specifications (the multiverse), not one path
→ Disclose prompts alongside code and data
The multiverse can also be a tool. Given a published study’s specification, AI analysts can run the same constrained analyses and measure variation from choices left implicit.
In a pilot study, we show that stating the hypothesis more precisely can meaningfully reduce dispersion across specifications.
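One simple way to quantify that dispersion is the interquartile range of estimates across specifications. This is a minimal sketch, not necessarily the paper's metric; the function name and inputs are illustrative:

```python
import statistics

def dispersion(estimates):
    """Interquartile range of effect estimates across analysis
    specifications -- one simple dispersion measure (an assumption;
    the paper may use a different one)."""
    q1, _, q3 = statistics.quantiles(estimates, n=4)
    return q3 - q1
```

Comparing `dispersion()` on estimates gathered under a loosely worded hypothesis vs. a precisely worded one gives a concrete before/after number for how much the extra precision shrinks the multiverse.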
But the same capability suggests the solution:
If searching the space is cheap, disclosing it can be too.
Instead of reporting one specification with robustness checks, report the distribution across reasonable specifications. Change the unit of evidence from paths to landscapes.
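A minimal sketch of what reporting the landscape could look like, assuming each valid specification yields an (estimate, p-value) pair. The function name and the particular summary statistics are illustrative assumptions, not the paper's method:

```python
import statistics

def summarize_multiverse(estimates, p_values, alpha=0.05):
    """Summarize the distribution of results across specifications:
    how many there are, where the estimates center, and what share
    point the same way / clear the significance bar."""
    n = len(estimates)
    return {
        "n_specs": n,
        "median_estimate": statistics.median(estimates),
        "share_positive": sum(e > 0 for e in estimates) / n,
        "share_significant": sum(p < alpha for p in p_values) / n,
    }

# Hypothetical multiverse of four specifications:
summary = summarize_multiverse([0.3, -0.1, 0.5, 0.2],
                               [0.01, 0.40, 0.03, 0.20])
```

Reporting a table like this for the whole space of reasonable specifications, rather than one path plus robustness checks, is the shift from paths to landscapes.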
This creates a new computational-incentive problem: when thousands of valid analyses cost almost nothing to generate, selective reporting becomes frictionless.
Concurrent work by @njw.fish and Gabe Sekeres similarly finds that the barrier to selective reporting has collapsed.
The data science multiverse is large and consequential.
Even with a fixed estimand, LLM agents produce many valid analyses with strikingly different estimates: some supporting the hypothesis, others rejecting it.
An LLM judge (Claude Sonnet 4.5) deems each analysis methodologically sound.
We gave agents the same dataset, hypothesis, and coding tools, then let them independently test it.
Built on Inspect AI (inspect.aisi.org.uk) with an auditor checking validity.
3 datasets, 4 LLMs → ~5k analyses
Before LLMs, the “multiverse” was studied through many-analyst projects: dozens of teams analyzing the same data and hypothesis, often reaching conflicting conclusions.
But these studies required huge coordination and sampled only tiny slices. LLMs can now explore this space cheaply and at scale.
There's growing evidence that LLMs can p-hack.
But p-hacking also points to something bigger: a data science multiverse of defensible analytical choices.
We wrote a paper (arxiv.org/abs/2602.18710) on using LLM agents to map this multiverse systematically. 🧵
I was lucky enough to be invited to give a talk on our new paper on the value of RL in fine-tuning at Cornell last week! Because of my poor time management skills, the talk isn't as polished as I'd like, but I think the "vibes" are accurate enough to share: youtu.be/E4b3cSirpsg.
1.5 yrs ago, we set out to answer a seemingly simple question: what are we *actually* getting out of RL in fine-tuning? I'm thrilled to share a pearl we found on the deepest dive of my PhD: the value of RL in RLHF seems to come from *generation-verification gaps*. Get ready to 🤿:
@gswamy.bsky.social et al. propose SPO, which builds a game from preferences and solves for the minimax winner. Handles non-Markovian, intransitive, and stochastic preferences. Nice empirical eval ranging from small demonstrative domains to a large RL domain (MuJoCo).
arxiv.org/abs/2401.04056
2/3.
I have become a fan of the game-theoretic approaches to RLHF, so here are two more papers in that category! (with one more tomorrow 😅)
1. Self-Play Preference Optimization (SPO).
2. Direct Nash Optimization (DNO).
🧵 1/3.