
Posts by Michael Oberst

Thanks for sharing!

1 month ago

I hesitate to give actual feedback lol, but maybe make the pause button a fixed size? When I pause and then try to go back, the pause button becomes larger to accommodate more text, and then I accidentally hit it while pressing where the “go back” button used to be.

3 months ago

I think it’s fair to say there are also good examples of both goals (measuring “true” vs “relative” performance) being pursued in ML, especially when the benchmark is ostensibly tied to a “real” application, and meant to demonstrate some real-world utility.

3 months ago
AI for radiographic COVID-19 detection selects shortcuts over signal - Nature Machine Intelligence

Another example: the excitement around diagnosing COVID from chest X-rays, which is very easy on public datasets but much harder in practice. In public datasets, "positives" came from one set of hospitals and "negatives" from others, so many shortcuts existed. See e.g. www.nature.com/articles/s42...

3 months ago

More broadly, I loved the post, thanks for writing and sharing it! My critique is more in the “minor revisions” category than anything else :)

3 months ago

The implicit hope being that relative gains on a benchmark (the “local” evaluation, as you put it) translate into relative gains on “real” tasks. Of course, that’s not always how it works out, as you point out.

3 months ago

I think that’s a fair point! FWIW I loved the “build-and-test” vs “describe and defend” distinction, and as you put it, “which approach performs better under the same evaluation?” is often the build-and-test question, where you care about relative performance, not so much absolute accuracy.

3 months ago

It’s non-obvious to me that “corrigenda for years of ImageNet papers” were required, given that the essential conclusions (does model A improve over model B?) were shown to hold up!

3 months ago

So this part of the piece feels off:

> Consider ImageNet: when Recht et al. (2019) built fresh test sets and found nontrivial accuracy drops…the typical response was not to issue corrigenda for years of ImageNet papers. Instead, the field continued to iterate on the next yardstick

3 months ago

Enjoyed the post and have encouraged folks to read it, but IMO it misrepresents @beenwrekt.bsky.social’s “Do ImageNet Classifiers Generalize to ImageNet?” The surprising finding of that paper wasn’t the (absolute) accuracy drop, but the fact that ranking of models was essentially unchanged.

3 months ago
Antibiotic Resistance Microbiology Dataset Mass General Brigham (ARMD-MGB) v1.0.0 ARMD-MGB contains detailed microbiology and clinical metadata for >225,000 patients and >970,000 cultures collected over 10 years

Today I am very proud to announce the release of the Antibiotic Resistance Microbiology Dataset - Mass General Brigham (ARMD-MGB; physionet.org/content/armd...), as part of an NIH-funded collaboration led by Jonathan Chen at Stanford. (1/6)

4 months ago
Data Science and AI Institute announces 22 new faculty - Johns Hopkins Data Science and AI Institute The Johns Hopkins Data Science and AI Institute welcomes 22 new faculty members. These newly appointed faculty members join more than 150 Data Science and AI Institute faculty members across…

More broadly, JHU is an increasingly exciting place to do research in AI and ML, with huge investments in faculty, students, and compute. Just last year we hired 22 (!) new faculty in Data Science and AI! ai.jhu.edu/news/data-sc...

4 months ago

For more information about me and my group, see my website, which also has information on applying to the CS PhD program (www.michaelkoberst.com)

4 months ago
Photo of Johns Hopkins University

Come join my group at Johns Hopkins!

I'm recruiting CS PhD students for Fall'26 (deadline: Dec 15) who are interested in safe/reliable AI in healthcare. See my website (link in reply) for more info.

I'm also headed to #NeurIPS, and happy to chat with prospective students!

4 months ago

For more details, see the paper / poster!

And if you're at UAI, check out the talk and poster today! Jacob (not on social media) and I are around at UAI, so reach out if you're interested in chatting more!

Paper: arxiv.org/abs/2502.09467
Poster: www.michaelkoberst.com/assets/paper...

8 months ago

These findings are also relevant for the design of new trials!

For instance, deploying *multiple models* in a trial has two benefits: (1) it allows us to construct tighter bounds for new models, and (2) it allows us to test whether these assumptions hold in practice.

8 months ago

We make some other mild assumptions, which can be falsified using existing RCT data. For instance, if two models have the *same* output on a given patient, then we assume outcomes are at least as good under the model with higher performance.
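A falsification check of this flavor could be sketched roughly as follows. This is not the paper's actual method, just an illustrative sketch on purely synthetic data: two stand-in models, randomized arm assignment, and outcomes generated so the higher-performance arm does slightly better by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 2))  # synthetic patient features

# Two models deployed in a hypothetical trial; suppose out_hi comes from the
# model with better performance characteristics (an assumption of this
# illustration, not something the code verifies).
out_hi = (X[:, 0] > 0).astype(int)
out_lo = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Each patient is randomized to one model's arm.
arm = rng.integers(0, 2, size=n)  # 1 = higher-performance arm, 0 = lower

# Synthetic outcomes, generated so the higher-performance arm does better.
outcome = 0.2 * arm + rng.normal(size=n)

# Check: among patients where the two models AGREE on the output, the
# assumption says outcomes under the higher-performance model should be
# at least as good.
agree = out_hi == out_lo
hi_mean = outcome[agree & (arm == 1)].mean()
lo_mean = outcome[agree & (arm == 0)].mean()
print(f"agreement subset: hi-arm {hi_mean:.3f}, lo-arm {lo_mean:.3f}")
if hi_mean < lo_mean:
    print("evidence against the assumption (up to sampling noise)")
```

In real trial data the comparison would of course need a proper statistical test rather than a raw comparison of means.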

8 months ago

To capture these challenges, we assume that model impact is mediated by both the output of the model (A), and the performance characteristics (M).

This formalism allows us to start reasoning about the impact of new models with different outputs and performance characteristics.

8 months ago

The second challenge is trust: Impact depends on the actions of human decision-makers, and those decision-makers may treat two models differently based on their performance characteristics (e.g., if a model produces a lot of false alarms, clinicians may ignore the outputs).

8 months ago

We tackle two non-standard challenges that arise in this setting, *coverage* and *trust*.

The first challenge is coverage: If the new model is very different from previous models, it may produce outputs (for specific types of inputs) that were never observed in the trial.
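The coverage idea can be illustrated with a toy sketch (synthetic stand-ins for the trial data and models, not the paper's actual procedure): trial data is only informative about the new model on patients where some deployed model produced the same output the new model would.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))  # synthetic patient features

# Outputs of two models that WERE deployed in a hypothetical trial.
model_a = (X[:, 0] > 0).astype(int)
model_b = (X[:, 1] > 0).astype(int)

# A new model that was NOT deployed in the trial.
model_new = (X[:, 0] + X[:, 1] > 1).astype(int)

# Coverage: for each patient, did at least one deployed model produce the
# same output the new model would? Patients where this fails were never
# observed under the new model's behavior.
covered = (model_new == model_a) | (model_new == model_b)
print(f"covered patients: {covered.mean():.1%}")
```

Deploying more (and more diverse) models in the trial enlarges the covered region, which is one reason multiple-model trials give tighter bounds.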

8 months ago

We develop a method for placing bounds on the impact of a *new* ML model, by re-using data from an RCT that did not include the model.

These bounds require some mild assumptions, but those assumptions can be tested in practice using RCT data that includes multiple models.

8 months ago

Randomized trials (RCTs) help evaluate whether deploying AI/ML systems actually improves outcomes (e.g., survival rates in a healthcare context).

But AI/ML systems can change: Do we need a new RCT every time we update the model? Not necessarily, as we show in our UAI paper! arxiv.org/abs/2502.09467

8 months ago

Hard to have a graded quiz, but still useful as an ungraded “self-assessment” (which I’ve seen) to set expectations for what kind of prereqs are expected. In some courses, you might expect those who would be scared off to drop the course later in any case, esp if drop deadline is pretty late.

1 year ago

From skimming the paper it seems more like the takeaway is: “if you binarize, you are estimating *something* that has a specific causal interpretation but it’s a weird thing (diff of two very specific treatment policies) you might not actually care about except in some special cases”

1 year ago

I’d nominate @monicaagrawal.bsky.social

1 year ago

@matt-levine.bsky.social has a great explanation in his Money Stuff newsletter (which I also highly recommend in general)

1 year ago

In this conversation I have been endorsed as "twee" and "not a crank".

BTW, I'm on the job market this year. If you are interested in hiring an economist in macro/metrics/computational/ML with such stellar endorsements, please get in touch!

1 year ago

An example of some recent work (my first last-author paper!) on rigorous re-evaluation of popular approaches to adapt LLMs and VLMs to the medical domain
bsky.app/profile/zach...

1 year ago
Joining the Group: Computer Science, Statistics, Causality, and Healthcare

Application link: www.cs.jhu.edu/academic-pro...

More information: www.michaelkoberst.com/joining

1 year ago
Photo of Johns Hopkins Campus

I'm recruiting PhD students for Fall 2025! CS PhD Deadline: Dec. 15th.

I work on safe/reliable ML and causal inference, motivated by healthcare applications.

Beyond myself, Johns Hopkins has a rich community of folks doing similar work. Come join us!

1 year ago