There is now a whole sub-industry around LLM routing, gateways, and even compute arbitrageurs (e.g. inference dot net). This is a basic but nice study of arbitrage in such settings and its implications for the ecosystem (e.g. price drops, market entry, revenue cannibalization, etc.)
arxiv.org/abs/2603.22404
Posts by André Cruz
Meet me at the Benchmarking workshop (sites.google.com/view/benchma...) at EurIPS on Saturday: We'll present two works on errors in LLM-as-Judge and their impacts on benchmarking and test-time scaling:
We (w/ Moritz Hardt, Olawale Salaudeen and
@joavanschoren.bsky.social) are organizing the Workshop on the Science of Benchmarking & Evaluating AI @euripsconf.bsky.social 2025 in Copenhagen!
Call for Posters: rb.gy/kyid4f
Deadline: Oct 10, 2025 (AoE)
More info: rebrand.ly/bg931sf
Welcome to the Bluesky account for Stand Up for Science 2025!
Keep an eye on this space for updates, event information, and ways to get involved. We can't wait to see everyone #standupforscience2025 on March 7th, both in DC and locations nationwide!
#scienceforall #sciencenotsilence
The paper is accompanied by a new benchmark package: *Folktexts*. It builds socio-demographic backstories from Census data to evaluate LLM calibration, fairness, and uncertainty estimation.
Package: github.com/socialfounda...
Paper: arxiv.org/pdf/2407.14614
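As a rough illustration of the calibration side of the benchmark, here is a minimal sketch of Expected Calibration Error (ECE), a standard metric for how well predicted probabilities match observed outcomes. This is not the Folktexts API; the function name and the toy data are assumptions for illustration only.

```python
# Hypothetical sketch (not the Folktexts API): measuring calibration of
# probabilistic predictions with Expected Calibration Error (ECE).
# All names and data below are illustrative assumptions.

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predicted probabilities, compare each bin's average
    confidence to its empirical positive rate, and return the
    bin-size-weighted gap."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into top bin
        bins[idx].append((p, y))
    ece, total = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        pos_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - pos_rate)
    return ece

# Toy example: an overconfident predictor (e.g. 0.7 assigned to a
# negative case) inflates the gap between confidence and outcomes.
probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 0, 0]
print(expected_calibration_error(probs, labels))
```

A miscalibrated LLM risk scorer would show up here as a large ECE: its stated probabilities systematically overshoot or undershoot the true outcome rates in the Census-derived data.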
Tomorrow at 1:30pm ET at the Harvard EconCS seminar, I'm presenting our paper on LLMs as risk scorers: We build benchmarks using US Census data & show how miscalibrated LLMs are on real-world tabular data distributions.
Harvard SEC LL2.221, open to the public
econcs.seas.harvard.edu/event/spring...