
Posts by Jonathan Bragg

Brooke Vlahos, Peter Clark, Doug Downey, @yoavgo.bsky.social, Ashish Sabharwal, Daniel S. Weld

5 months ago

Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, @guywiener.bsky.social, Chloe Anastasiades, Stefan Candra, Jason Dunkelberger, Dan Emery, Rob Evans, Malachi Hamada, Regan Huff, Rodney Kinney, Matt Latzke, Jaron Lochner, Ruben Lozano-Aguilera, Cecile Nguyen, Smita Rao, Amber Tanaka...

5 months ago

🙏 Many thanks to my @ai2.bsky.social teammates—Mike D’Arcy, @nbalepur.bsky.social, Dan Bareket, Bhavana Dalvi, @sergeyf.bsky.social, Dany Haddad, Jena D. Hwang, @peterjansen-ai.bsky.social, Varsha Kishore, Bodhisattwa Majumder, @arnaik19.bsky.social, Sigal Rahamimov, Kyle Richardson...

5 months ago

We tested 22 agent classes—more *kinds* of agents than other benchmarks cover

🤖AgentBaselines makes them reusable, incl. our SOTA science agents: github.com/allenai/agent-baselines

📚Blog: allenai.org/blog/astabench
📄Paper: arxiv.org/abs/2510.21652
📊Leaderboard: huggingface.co/spaces/allenai/asta-bench-leaderboard

5 months ago

🛠️AstaBench is the first to provide reproducible (date-limited) large-scale search tools—plus a full scientific research environment for agents.

📊Our leaderboard highlights agents that use these tools, enabling more controlled measurement of *AI*. (We measure LLM costs too.)
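To make the "date-limited" idea concrete, here's a toy sketch (none of AstaBench's actual API; all names below are hypothetical): search results published after a fixed cutoff are filtered out, so an agent run today sees the same corpus as a run next year.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Paper:
    title: str
    published: date


def date_limited_search(corpus: list[Paper], query: str, cutoff: date) -> list[Paper]:
    """Return papers matching the query, excluding anything published after cutoff.

    Freezing the cutoff is what makes repeated benchmark runs reproducible:
    newly published papers can never leak into an agent's search results.
    """
    return [
        p for p in corpus
        if query.lower() in p.title.lower() and p.published <= cutoff
    ]


corpus = [
    Paper("Agent benchmarks survey", date(2024, 1, 15)),
    Paper("New agent benchmark results", date(2025, 6, 1)),
]

# Only the pre-cutoff paper is returned, even though both match the query.
hits = date_limited_search(corpus, "agent benchmark", date(2024, 12, 31))
```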

5 months ago
AstaBench with abstract measurement icons
Agent benchmarks don't measure true *AI* advances

We built one that's hard & trustworthy:
👉 AstaBench tests agents w/ *standardized tools* on 2400+ scientific research problems
👉 SOTA results across 22 agent *classes*
👉 AgentBaselines agent suite

🆕 arxiv.org/abs/2510.21652

🧵👇

5 months ago

@kylelo.bsky.social your gifs are an unapproved manipulation of my human attention

6 months ago