Paper: "Many AI Analysts, One Dataset: Navigating the Agentic Data Science Multiverse"
Joint with Martin Bertran, Riccardo Fogliato
Paper: arxiv.org/pdf/2602.18710
We are planning to make these tools available so you can generate the multiverse for your own data and hypotheses. Stay tuned!
Posts by Steven Wu
Open questions remain:
How do we interpret specification distributions?
What measures of robustness are meaningful?
How should this interact with preregistration?
But the core shift is clear: once broad exploration becomes feasible, transparency should change with it.
Our proposal: transparency norms should match computational capacity.
If AI makes it cheap to generate many analyses, we should:
→ Report distributions across specifications (the multiverse), not one path
→ Disclose prompts alongside code and data
The multiverse can also be a tool. Given a published study’s specification, AI analysts can run the same constrained analyses and measure variation from choices left implicit.
In a pilot study, we show that stating the hypothesis more precisely can meaningfully reduce dispersion across specifications.
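One simple way to quantify that dispersion is the interquartile range of estimates across specifications. This is a minimal sketch, not necessarily the paper's metric; the function name and inputs are illustrative:

```python
import statistics

def dispersion(estimates):
    """Interquartile range of effect estimates across analysis
    specifications -- one simple dispersion measure (an assumption;
    the paper may use a different one)."""
    q1, _, q3 = statistics.quantiles(estimates, n=4)
    return q3 - q1
```

Comparing `dispersion()` on estimates gathered under a loosely worded hypothesis vs. a precisely worded one gives a concrete before/after number for how much the extra precision shrinks the multiverse.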
But the same capability suggests the solution:
If searching the space is cheap, disclosing it can be too.
Instead of reporting one specification with robustness checks, report the distribution across reasonable specifications. Change the unit of evidence from paths to landscapes.
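A minimal sketch of what reporting the landscape could look like, assuming each valid specification yields an (estimate, p-value) pair. The function name and the particular summary statistics are illustrative assumptions, not the paper's method:

```python
import statistics

def summarize_multiverse(estimates, p_values, alpha=0.05):
    """Summarize the distribution of results across specifications:
    how many there are, where the estimates center, and what share
    point the same way / clear the significance bar."""
    n = len(estimates)
    return {
        "n_specs": n,
        "median_estimate": statistics.median(estimates),
        "share_positive": sum(e > 0 for e in estimates) / n,
        "share_significant": sum(p < alpha for p in p_values) / n,
    }

# Hypothetical multiverse of four specifications:
summary = summarize_multiverse([0.3, -0.1, 0.5, 0.2],
                               [0.01, 0.40, 0.03, 0.20])
```

Reporting a table like this for the whole space of reasonable specifications, rather than one path plus robustness checks, is the shift from paths to landscapes.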
This creates a new computational-incentive problem: when thousands of valid analyses cost almost nothing to generate, selective reporting becomes frictionless.
Concurrent work by @njw.fish and Gabe Sekeres similarly finds that the barrier to selective reporting has collapsed.
The data science multiverse is large and consequential.
Even with a fixed estimand, LLM agents produce many valid analyses with strikingly different estimates: some supporting the hypothesis, others rejecting it.
An LLM judge (Claude Sonnet 4.5) deems each analysis methodologically sound.
We gave agents the same dataset, hypothesis, and coding tools, then let them independently test it.
Built on Inspect AI (inspect.aisi.org.uk) with an auditor checking validity.
3 datasets, 4 LLMs → ~5k analyses
Before LLMs, the “multiverse” was studied through many-analyst projects: dozens of teams analyzing the same data and hypothesis, often reaching conflicting conclusions.
But these studies required huge coordination and sampled only tiny slices. LLMs can now explore this space cheaply and at scale.
There's growing evidence that LLMs can p-hack.
But p-hacking also points to something bigger: a data science multiverse of defensible analytical choices.
We wrote a paper (arxiv.org/abs/2602.18710) on using LLM agents to map this multiverse systematically. 🧵
I was lucky enough to be invited to give a talk on our new paper on the value of RL in fine-tuning at Cornell last week! Because of my poor time management skills, the talk isn't as polished as I'd like, but I think the "vibes" are accurate enough to share: youtu.be/E4b3cSirpsg.
1.5 yrs ago, we set out to answer a seemingly simple question: what are we *actually* getting out of RL in fine-tuning? I'm thrilled to share a pearl we found on the deepest dive of my PhD: the value of RL in RLHF seems to come from *generation-verification gaps*. Get ready to 🤿:
@gswamy.bsky.social et al. propose SPO, which builds a game from preferences and solves for the minimax winner. Handles non-Markovian, intransitive, and stochastic preferences. Nice empirical eval ranging from small demonstrative domains to a large RL domain (MuJoCo).
arxiv.org/abs/2401.04056
2/3.
I have become a fan of the game-theoretic approaches to RLHF, so here are two more papers in that category! (with one more tomorrow 😅)
1. Self-Play Preference Optimization (SPO).
2. Direct Nash Optimization (DNO).
🧵 1/3.