There is now a whole sub-industry around LLM routing, gateways, and even compute arbitrageurs (e.g. inference dot net). This is a basic but nice study of arbitrage in such settings and its implications for the ecosystem (e.g. price drops, market entry, revenue cannibalization, etc.)
arxiv.org/abs/2603.22404
Posts by André Cruz
Meet me at the Benchmarking workshop (sites.google.com/view/benchma...) at EurIPS on Saturday: We'll present two works on errors in LLM-as-Judge and their impacts on benchmarking and test-time scaling:
We (w/ Moritz Hardt, Olawale Salaudeen and
@joavanschoren.bsky.social) are organizing the Workshop on the Science of Benchmarking & Evaluating AI @euripsconf.bsky.social 2025 in Copenhagen!
Call for Posters: rb.gy/kyid4f
Deadline: Oct 10, 2025 (AoE)
More info: rebrand.ly/bg931sf
Welcome to the Bluesky account for Stand Up for Science 2025!
Keep an eye on this space for updates, event information, and ways to get involved. We can't wait to see everyone #standupforscience2025 on March 7th, both in DC and locations nationwide!
#scienceforall #sciencenotsilence
The paper is accompanied by a new benchmark package: *Folktexts*. It builds socio-demographic backstories from Census data to evaluate LLM calibration, fairness, and uncertainty estimation.
Package: github.com/socialfounda...
Paper: arxiv.org/pdf/2407.14614
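As a rough illustration of the calibration side of the benchmark, here is a minimal sketch of Expected Calibration Error (ECE), a standard metric for how well predicted probabilities match observed outcomes. This is not the Folktexts API; the function name and the toy data are assumptions for illustration only.

```python
# Hypothetical sketch (not the Folktexts API): measuring calibration of
# probabilistic predictions with Expected Calibration Error (ECE).
# All names and data below are illustrative assumptions.

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predicted probabilities, compare each bin's average
    confidence to its empirical positive rate, and return the
    bin-size-weighted gap."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into top bin
        bins[idx].append((p, y))
    ece, total = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        pos_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - pos_rate)
    return ece

# Toy example: an overconfident predictor (e.g. 0.7 assigned to a
# negative case) inflates the gap between confidence and outcomes.
probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 0, 0]
print(expected_calibration_error(probs, labels))
```

A miscalibrated LLM risk scorer would show up here as a large ECE: its stated probabilities systematically overshoot or undershoot the true outcome rates in the Census-derived data.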
Tomorrow at 1:30pm ET at the Harvard EconCS seminar, I'm presenting our paper on LLMs as risk scorers: We build benchmarks using US Census data & show how miscalibrated LLMs are on real-world tabular data distributions.
Harvard SEC LL2.221, open to the public
econcs.seas.harvard.edu/event/spring...