2/ Model A may beat model B on average, but still lose to B when judged by the minimum over several tasks.
I wrote a brief blog post on this (a good time to announce I started a Substack!).
shuvom.substack.com/p/revenge-of...
Posts by Shuvom Sadhuka
1/ CS majors are drilled to think about "worst-case" performance of algorithms. By contrast, much of the discourse on AI evals focuses on average-case or best-case (e.g. LLM X can solve IMO problems). Maybe one key to "reliability" is certifying the 1st quantile of outputs too, not just the mean.
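The average-vs-worst-case gap is easy to see with made-up numbers (these scores are purely illustrative): a model can win on the mean while losing on the min.

```python
# Illustrative task scores: A is strong on two tasks, weak on one;
# B is merely consistent across all three.
scores_a = [0.9, 0.9, 0.2]
scores_b = [0.6, 0.6, 0.6]

mean_a = sum(scores_a) / len(scores_a)  # ~0.667
mean_b = sum(scores_b) / len(scores_b)  # 0.6

print(mean_a > mean_b)                 # True: A beats B on average
print(min(scores_a) > min(scores_b))   # False: A loses on the worst case
```

The same comparison works for any low quantile instead of the min: rank the per-task scores and read off the 1st-percentile value rather than the average.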
Thank you to co-authors @drewprinster.bsky.social, Clara Fannjiang, Gabriele Scalia, Aviv Regev, and Hanchen Wang! This work was done during an internship at Genentech. I highly recommend it if you have a chance!
Arxiv: arxiv.org/abs/2512.03109
GitHub: github.com/shuvom-s/e-v...
PyPi: pypi.org/project/e-va...
Run pip install e-valuator to try it out yourself!
6/ E-valuator can also terminate unsuccessful trajectories early, which yields a better accuracy-tokens tradeoff. It also works as a general monitoring metric!
5/ E-valuator provably controls the false alarm rate (i.e., the rate of accidentally flagging successful trajectories as unsuccessful) while maintaining high power. We test on several datasets, including non-LLM agents, and find empirically that e-valuator outperforms other baselines.
4/ Inspired by work in sequential hypothesis testing and e-values, we frame the problem of detecting (and thereby terminating) unsuccessful agent trajectories as a sequential hypothesis testing problem. E-valuator is a statistical wrapper that can adapt to any agent/verifier.
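For intuition, here's a generic sequential test built from e-values (a sketch of the standard e-process recipe, not the paper's exact construction; the per-step e-values below are hypothetical numbers):

```python
def sequential_e_test(e_values, alpha=0.05):
    """Multiply per-step e-values and flag the trajectory once the
    running product ("wealth") reaches 1/alpha. If each e-value has
    expectation <= 1 under the null (trajectory is successful), Ville's
    inequality bounds the false alarm rate by alpha at ANY stopping
    time -- no need to know the trajectory length in advance."""
    wealth = 1.0
    for t, e in enumerate(e_values):
        wealth *= e
        if wealth >= 1.0 / alpha:
            return t  # step at which the trajectory is flagged
    return None  # never flagged

# Hypothetical e-values from a verifier that grows suspicious over time:
print(sequential_e_test([1.2, 0.8, 3.0, 4.0, 5.0], alpha=0.05))  # 4
```

The anytime-validity is the point: the same threshold 1/alpha works whether the agent stops after 3 steps or 300, which sidesteps the calibration problem from 2/.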
3/ Furthermore, finetuning the verifier requires both (a) a large, labeled dataset of successful and unsuccessful agent trajectories and (b) white-box access to verifier weights, both of which may be expensive or impossible to obtain.
2/ A challenge in deploying these verifiers is that the number of steps an agent will take is not known beforehand, so calibrating a decision rule valid across the entire trajectory is difficult.
1/ Agents make mistakes, and it’s important to detect these mistakes. Towards this goal, people have developed verifier models (e.g., PRMs, judges) to score each step in an agent’s trajectory.
How can you evaluate agent trajectories with only black-box access to a verifier and the agent?
Introducing E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing
❌ no finetuning
❌ no additional GPU compute
⬛ black-box access
✔️ controllable error rates w/ guarantees
10/ Also, I will be sticking around in San Diego for NeurIPS afterwards to present SSME. See my co-author’s thread on that here:
bsky.app/profile/dmsh...
9/ This was joint work with
Sophia Lin, @bergerlab.bsky.social, and @emmapierson.bsky.social! A special shoutout to Sophia Lin, who was a high school student when we started this project.
Arxiv: arxiv.org/pdf/2511.11684
Code: github.com/shuvom-s/fun...
8/ We find that the estimated mortality risk required to admit male patients into the ICU (4.5%) is lower than that for female patients (5.1%). We find a similar disparity in hospital admissions as well.
7/ We then apply our model to MIMIC-IV, a dataset of medical records from a Boston-area emergency department. We model the flow of patients from the ED to the hospital to the ICU using our funnel model.
6/ We introduce a funnel model that predicts both the ground truth label and the human decisions, while accounting for unobserved covariates that affect both the label and the decisions.
5/ This censoring of the ground truth happens over multiple stages and isn't random. Patients who make it to later stages tend to be at higher risk of cancer. If we train only on patients with observed labels, we learn from a biased subset of the population.
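A toy simulation (my own illustration, not the paper's funnel model) shows how two stages of risk-dependent selection inflate the observed average risk:

```python
import random
random.seed(0)

# True risk is uniform on [0, 1], so the population mean is 0.5.
population = [random.random() for _ in range(100_000)]

def advances(risk):
    # At each stage, higher-risk patients are likelier to be sent onward.
    return random.random() < risk

# Labels (e.g. biopsy results) are observed only after passing two stages.
observed = [r for r in population if advances(r) and advances(r)]

print(sum(population) / len(population))  # ~0.50: true average risk
print(sum(observed) / len(observed))      # ~0.75: biased upward
```

With selection probability proportional to risk at each stage, the observed mean is E[r | both stages passed] = 0.75, not 0.5; a model trained only on `observed` sees a very different population.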
4/ We only observe the ground truth outcome (biopsy result) for patients who make it to the last stage.
3/ For instance, in a breast cancer diagnosis, a clinician may first administer a breast exam, then order a mammogram for patients with concerning exams, before ordering a biopsy for patients with concerning mammograms.
2/ In many decision-making settings, we observe ground truth labels only after a sequence of human decisions.
I’m excited to share our new paper A Bayesian Model for Multi-stage Censoring, which I will present at #ML4H2025 in San Diego! 🧵 below:
Relatedly, @dmshanmugam.bsky.social is on the academic job market and I strongly recommend working with her! She has an impressive array of reliable ML work, often with applications in biomedical/health settings, and I'd recommend talking to her if you can.
What can you do when you need to evaluate a set of models but don't have much labeled data? Solution: add unlabeled data. It was great to co-lead this project with @dmshanmugam.bsky.social. Come talk to us at NeurIPS!
congrats! sad I missed it :/
It's interesting because "Shuvom" is definitely the highest signal for gender out of the list mentioned, but it probably just doesn't know this, since my name is pretty rare.
Slightly interesting observation: I asked chatgpt "generate an image that represents what you know about me. don't ask questions" and it drew me as a woman. I've never revealed my gender to chatgpt, so I asked it why it drew me as a woman. Here's what it said:
Some past writing I like:
Democracy and the CLT: shuvom-s.github.io/blog/2020/de...
Privacy vs collaboration in genomics: shuvom-s.github.io/blog/2023/ov...
Please check out the others too if interested! [3/3]
Relatedly, I finally consolidated my scattered writing into a blog: shuvom-s.github.io/blog/
For now, I'm hosting on my (rebuilt) personal website, but curious if anyone has recs on a platform to blog (Medium? Substack? Personal site?). I've turned on comments with giscus. [2/3]
I wrote up some thoughts on what it means to measure the entropy of natural languages and connections to LLMs, loosely inspired by an awesome paper from Shannon in 1951(!).
Check it out: shuvom-s.github.io/blog/2025/me... [1/3]
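For a flavor of what "entropy of a language" means, here's a minimal zeroth-order (unigram character) estimate; Shannon's 1951 method goes much further, using human prediction to bound the true per-character entropy of English well below this:

```python
import math
from collections import Counter

def char_entropy(text):
    """Zeroth-order entropy estimate in bits per character:
    H = -sum_c p(c) * log2 p(c) over the characters of `text`."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

h = char_entropy("abcd")  # uniform over 4 symbols: exactly 2 bits/char
print(round(h, 2))        # 2.0
```

Higher-order estimates condition on preceding characters (bigrams, trigrams, ...), and each added context generally lowers the estimate, which is exactly the gap LLM-based predictors exploit.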
I'm in Vancouver and will be giving a spotlight talk at #ML4H
tomorrow, Dec. 15, at 4:30pm on some ongoing work on modeling multi-stage selection problems in clinical settings. Work done with (high school senior!) Sophia Lin, Bonnie Berger, and @emmapierson.bsky.social. I hope to see you there!