Brooke Vlahos, Peter Clark, Doug Downey, @yoavgo.bsky.social, Ashish Sabharwal, Daniel S. Weld
Posts by Jonathan Bragg
Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, @guywiener.bsky.social, Chloe Anastasiades, Stefan Candra, Jason Dunkelberger, Dan Emery, Rob Evans, Malachi Hamada, Regan Huff, Rodney Kinney, Matt Latzke, Jaron Lochner, Ruben Lozano-Aguilera, Cecile Nguyen, Smita Rao, Amber Tanaka...
🙏 Many thanks to my @ai2.bsky.social teammates: Mike D’Arcy, @nbalepur.bsky.social, Dan Bareket, Bhavana Dalvi, @sergeyf.bsky.social, Dany Haddad, Jena D. Hwang, @peterjansen-ai.bsky.social, Varsha Kishore, Bodhisattwa Majumder, @arnaik19.bsky.social, Sigal Rahamimov, Kyle Richardson...
We tested 22 agent classes—more *kinds* than other benchmarks
🤖AgentBaselines makes them reusable, incl. our SOTA science agents: github.com/allenai/agent-baselines
📚Blog: allenai.org/blog/astabench
📄Paper: arxiv.org/abs/2510.21652
📊Leaderboard: huggingface.co/spaces/allenai/asta-bench-leaderboard
🛠️AstaBench is the first to provide reproducible (date-limited) large-scale search tools—plus a full scientific research environment for agents.
📊Our leaderboard highlights agents that use these tools, enabling more controlled measurement of *AI*. (We measure LLM costs too.)
[Image: AstaBench with abstract measurement icons]
Agent benchmarks don't measure true *AI* advances
We built one that's hard & trustworthy:
👉 AstaBench tests agents w/ *standardized tools* on 2400+ scientific research problems
👉 SOTA results across 22 agent *classes*
👉 AgentBaselines agents suite
🆕 arxiv.org/abs/2510.21652
🧵👇
@kylelo.bsky.social your gifs are an unapproved manipulation of my human attention