
Posts by Shuvom Sadhuka

Revenge of the Worst Case
Or maybe, revenge of the 1st quantile. What common AI benchmarking discourse misses.

2/ Model A may beat model B on average, but model A can still lose to model B when judged by the minimum over several tasks.

I wrote a brief blog post on this (good time to announce I started a substack!).

shuvom.substack.com/p/revenge-of...

3 weeks ago

1/ CS majors are drilled to think about "worst-case" performance of algorithms. By contrast, much of the discourse on AI evals focuses on average-case or best-case (e.g. LLM X can solve IMO problems). Maybe one key to "reliability" is certifying the 1st quantile of outputs too, not just the mean.
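A toy numerical sketch (hypothetical scores, not from any real benchmark) of how a model can win on the mean yet lose on the minimum and the 1st percentile:

```python
import numpy as np

# Hypothetical per-task scores for two models (made-up numbers).
# Model A is strong on most tasks but fails badly on a few;
# Model B is merely decent everywhere.
scores_a = np.array([0.95] * 18 + [0.10, 0.05])  # mean 0.86, min 0.05
scores_b = np.array([0.80] * 20)                 # mean 0.80, min 0.80

print(f"mean: A={scores_a.mean():.2f}  B={scores_b.mean():.2f}")  # A wins
print(f"min : A={scores_a.min():.2f}  B={scores_b.min():.2f}")    # B wins
print(f"q01 : A={np.quantile(scores_a, 0.01):.2f}  "
      f"B={np.quantile(scores_b, 0.01):.2f}")                     # B wins
```

Certifying the 1st quantile, not just the mean, would catch model A's heavy left tail here.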

3 weeks ago

Thank you to co-authors @drewprinster.bsky.social, Clara Fannjiang, Gabriele Scalia, Aviv Regev, and Hanchen Wang! This work was done during an internship at Genentech. I highly recommend interning there if you have the chance!

4 months ago

arXiv: arxiv.org/abs/2512.03109
GitHub: github.com/shuvom-s/e-v...
PyPI: pypi.org/project/e-va...

Run pip install e-valuator to try it out yourself!

4 months ago

6/ E-valuator can also be used to terminate unsuccessful trajectories early, which yields a better accuracy-tokens tradeoff, and it doubles as a general monitoring metric!

4 months ago

5/ E-valuator provably controls the false alarm rate (i.e., the rate of accidentally flagging successful trajectories as unsuccessful) while also achieving high power. We test on several datasets, including non-LLM agents, and find empirically that E-valuator outperforms other baselines.

4 months ago

4/ Inspired by work on sequential hypothesis testing and e-values, we frame detecting (and thereby terminating) unsuccessful agent trajectories as a sequential testing problem. E-valuator is a statistical wrapper that can adapt to any agent/verifier.
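To sketch the e-value idea in its simplest form (a toy illustration with a made-up likelihood ratio, not the actual E-valuator procedure): multiply per-step e-values, each with expectation at most 1 when the trajectory is successful, and flag once the running product exceeds 1/α. Ville's inequality then caps the false-alarm rate at α, no matter how long the trajectory runs.

```python
def e_process_monitor(scores, lr, alpha=0.05):
    """Flag a trajectory once the product of per-step e-values crosses
    1/alpha. If each lr(s) has expectation <= 1 under the null
    ("trajectory is successful"), Ville's inequality bounds the
    probability of ever flagging a successful trajectory by alpha."""
    e = 1.0
    for t, s in enumerate(scores):
        e *= lr(s)
        if e >= 1.0 / alpha:
            return t  # flagged as unsuccessful at step t
    return None  # never flagged

# Toy likelihood ratio (hypothetical): assume successful steps get
# Uniform(0,1) verifier scores and unsuccessful steps Beta(1,2) scores,
# so the density ratio is 2*(1-s), with mean exactly 1 under the null.
toy_lr = lambda s: 2.0 * (1.0 - s)

print(e_process_monitor([0.02] * 6, toy_lr))  # low scores -> flagged at step 4
print(e_process_monitor([0.9] * 6, toy_lr))   # high scores -> None
```

Because the guarantee is anytime-valid, the same threshold works whether the agent takes 5 steps or 500, which is exactly what an unknown trajectory length demands.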

4 months ago

3/ Furthermore, finetuning the verifier requires both (a) a large, labeled dataset of successful and unsuccessful agent trajectories and (b) white-box access to verifier weights, both of which may be expensive or impossible to obtain.

4 months ago

2/ A challenge in deploying these verifiers is that the number of steps an agent will take is not known beforehand, so calibrating a decision rule valid across the entire trajectory is difficult.

4 months ago

1/ Agents make mistakes, and it’s important to detect these mistakes. Towards this goal, people have developed verifier models (e.g., PRMs, judges) to score each step in an agent’s trajectory.

4 months ago

How can you evaluate agent trajectories with only black-box access to a verifier and the agent?

Introducing E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

❌ no finetuning
❌ no additional GPU compute
⬛ black-box access
✔️ controllable error rates w/ guarantees

4 months ago

10/ Also, I will be sticking around in San Diego for NeurIPS afterwards to present SSME. See my co-author’s thread on that here:
bsky.app/profile/dmsh...

5 months ago

9/ This was joint work with Sophia Lin, @bergerlab.bsky.social, and @emmapierson.bsky.social! A special shoutout to Sophia Lin, who was a high school student when we started this project.

arXiv: arxiv.org/pdf/2511.11684
Code: github.com/shuvom-s/fun...

5 months ago

8/ We find that the estimated mortality risk required to admit male patients into the ICU (4.5%) is lower than that for female patients (5.1%). We find a similar disparity in hospital admissions as well.

5 months ago

7/ We then apply our model to MIMIC-IV, a dataset of medical records from a Boston-area emergency department. We model the flow of patients from the ED to the hospital to the ICU using our funnel model.

5 months ago

6/ We introduce a funnel model that predicts both the ground truth label and the human decisions, while accounting for unobserved covariates that affect both the label and the decisions.

5 months ago

5/ This censoring of the ground truth happens over multiple stages and isn't random. Patients who make it to later stages tend to be at higher risk of cancer. If we train only on the population with observed labels, we’ll be learning from a biased subset of the population.
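A minimal simulation (entirely synthetic numbers, not MIMIC-IV) of this bias: when higher-risk patients are more likely to reach the stage where the label is observed, prevalence estimated on the labeled subset is inflated relative to the full population.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic cohort: latent risk drives both the true label and the
# chance of progressing to the stage where the label is observed.
risk = rng.beta(2, 8, size=n)                       # latent risk, mean 0.2
label = rng.random(n) < risk                        # true outcome
observed = rng.random(n) < np.clip(4 * risk, 0, 1)  # selection into last stage

print(f"true prevalence:      {label.mean():.3f}")
print(f"labeled-subset prev.: {label[observed].mean():.3f}")  # inflated
```

A model trained only on the `observed` rows would be fit to this shifted distribution, which is exactly the multi-stage censoring problem the funnel model addresses.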

5 months ago

4/ We only observe the ground truth outcome (biopsy result) for patients who make it to the last stage.

5 months ago

3/ For instance, in breast cancer diagnosis, a clinician may first administer a breast exam, then order a mammogram for patients with concerning exams, before ordering a biopsy for patients with concerning mammograms.

5 months ago

2/ In many decision-making settings, we observe ground truth labels only after a sequence of human decisions.

5 months ago

I’m excited to share our new paper A Bayesian Model for Multi-stage Censoring, which I will present at #ML4H2025 in San Diego! 🧵 below:

5 months ago

Relatedly, @dmshanmugam.bsky.social is on the academic job market and I strongly recommend working with her! She has an impressive array of reliable ML work, often with applications in biomedical/health settings, and I'd recommend talking to her if you can.

6 months ago

What can you do when you need to evaluate a set of models but don't have much labeled data? Solution: add unlabeled data. It was great to co-lead this project with @dmshanmugam.bsky.social. Come talk to us at NeurIPS!

6 months ago

congrats! sad I missed it :/

11 months ago

It's interesting because "Shuvom" is definitely the highest signal for gender out of the list mentioned, but it probably just doesn't know this, since my name is pretty rare.

1 year ago

Slightly interesting observation: I asked chatgpt "generate an image that represents what you know about me. don't ask questions" and it drew me as a woman. I've never revealed my gender to chatgpt, so I asked it why it drew me as a woman. Here's what it said:

1 year ago
Democracy and the Central Limit Theorem | Shuvom Sadhuka
Is democracy just an average of preferences?

Some past writing I like:

Democracy and the CLT: shuvom-s.github.io/blog/2020/de...

Privacy vs collaboration in genomics: shuvom-s.github.io/blog/2023/ov...

Please check out the others too if interested! [3/3]

1 year ago
blog | Shuvom Sadhuka

Relatedly, I finally consolidated my scattered writing into a blog: shuvom-s.github.io/blog/

For now, I'm hosting on my (rebuilt) personal website, but curious if anyone has recs on a platform to blog (Medium? Substack? Personal site?). I've turned on comments with giscus. [2/3]

1 year ago
Measuring Entropy | Shuvom Sadhuka
How would you measure the entropy of natural language?

I wrote up some thoughts on what it means to measure the entropy of natural languages and connections to LLMs, loosely inspired by an awesome paper from Shannon in 1951(!)
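As a concrete starting point (a standard plug-in estimate, not taken from the post): the zeroth-order, character-frequency approximation Shannon started from fits in a few lines; his 1951 guessing-game method then tightens it by exploiting longer-range context.

```python
import math
from collections import Counter

def unigram_entropy(text):
    """Plug-in (zeroth-order) estimate of per-character entropy in bits:
    -sum p(c) * log2 p(c) over observed character frequencies. This
    ignores all context between characters, so it upper-bounds what a
    context-aware model (or Shannon's human guessers) would achieve."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(f"{unigram_entropy('the quick brown fox jumps over the lazy dog'):.2f} bits/char")
```

An LLM's average log-loss per character plays the same role as the guessers in Shannon's setup: each better predictor pushes the entropy estimate lower.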

Check it out: shuvom-s.github.io/blog/2025/me... [1/3]

1 year ago

I'm in Vancouver and will be giving a spotlight talk at #ML4H tomorrow, Dec. 15, at 4:30pm on some ongoing work on modeling multi-stage selection problems in clinical settings. Work done with (high school senior!) Sophia Lin, Bonnie Berger, and @emmapierson.bsky.social. I hope to see you there!

1 year ago