Submitting a benchmark to ICML? Check out our NeurIPS Spotlight paper BetterBench! We outline best practices for benchmark design, implementation & reporting to help shift community norms. Be part of the change! 🙌
+ Add your benchmark to our database for visibility: betterbench.stanford.edu
Posts by Chandler Smith
The 2025 Cooperative AI summer school (9-13 July 2025 near London) is now accepting applications, due March 7th!
www.cooperativeai.com/summer-schoo...
Very excited to read this!
On my way to NeurIPS ‘24 ✈️ to present our Spotlight paper BetterBench and the Concordia Contest!
Would love to connect with folks and chat anything multi-agent, agentic AI, benchmarking, etc.
I am applying for fall ‘25 PhDs. Ping me if you have advice or think there may be a fit!
Here’s the link to the paper and Hugging Face page: arxiv.org/pdf/2412.01928 and huggingface.co/papers/2412....
This was an incredible collaboration with our lead Sumeet Motwani, Philip Torr, and Ronnie Clark from Oxford; Fabio Pizzati, Rocktim Jyoti Das, and Ivan Laptev at MBZUAI; and Mark Rybchuk at Berkeley. Expert supervision from Christian Schroeder de Witt from @oxfordtvg.bsky.social!
MALT is still preliminary work and there is a lot left to explore, but I believe this is an important research direction. We’ll be working on scaling it to more settings (especially with partial observability, a critic that can use tools, and smarter ways to distill things).
We see very strong performance across MATH, GSM8k, and CommonsenseQA against trained and untrained baselines with Llama 3.1 8B!
In this setup, models get better at checking/improving certain parts of answers based on what worked best during search. This can address limitations models might have around back-tracking or CoT critiques.
Using SFT and DPO, we can learn from both positive and negative reasoning traces. The multi-agent setup allows for role specialization, where the additional context in each prompt reduces the computation required of subsequent models.
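A minimal sketch of how labeled traces could feed both objectives: correct outputs become SFT examples, and prompts with both a good and a bad output become DPO preference pairs. The trace format and `build_training_data` helper are hypothetical illustrations, not the paper's actual data pipeline:

```python
def build_training_data(traces):
    # Each trace is (role, prompt, output, good), where `good` is the
    # binary quality label propagated from the search tree.
    sft, by_prompt = [], {}
    for role, prompt, output, good in traces:
        if good:
            # Positive traces become supervised fine-tuning examples.
            sft.append({"role": role, "prompt": prompt, "completion": output})
        bucket = by_prompt.setdefault((role, prompt), {"chosen": [], "rejected": []})
        bucket["chosen" if good else "rejected"].append(output)
    # DPO needs a preferred AND a dispreferred output for the same prompt.
    dpo_pairs = [
        {"role": r, "prompt": p, "chosen": b["chosen"][0], "rejected": b["rejected"][0]}
        for (r, p), b in by_prompt.items()
        if b["chosen"] and b["rejected"]
    ]
    return sft, dpo_pairs
```

Because the data is keyed by role, each model only ever trains on examples from its own position in the pipeline.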
This allows us to compare final outputs to a ground truth, propagate rewards throughout downstream nodes, and post-train models on role-specific data. The generator learns to be a better generator, the critic learns to be a better critic, and so on by bootstrapping reasoning traces.
Just by looking at these trees, how do you tell which branches are useful for post-training without human feedback or trained PRMs? Value iteration offers a simple approach: propagate labels through the branches, with a thresholding factor to mark the quality of each reasoning step.
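The propagation step can be sketched as follows, assuming leaves are scored 1.0/0.0 against the ground-truth final answer and an internal node's value is the mean over its leaves. The `propagate_values` helper and the 0.5 threshold are illustrative assumptions, not the paper's exact formulation:

```python
def propagate_values(children_values, threshold=0.5):
    # A node's value is the average correctness of the outcomes beneath it;
    # thresholding turns that value into a binary quality label for training.
    value = sum(children_values) / len(children_values)
    return value, value >= threshold

# Example: a critique node whose three downstream refinements were
# correct, wrong, correct gets value 2/3 and is labeled "good".
value, label = propagate_values([1.0, 0.0, 1.0])
```

Applying this bottom-up over the tree labels every intermediate step without any human annotation.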
Training the models jointly in a single pipeline is a difficult problem, given the discrete outputs produced by each model. Instead, we use a tree-based sampling strategy with an exponential branching factor that generates a large amount of synthetic data for bootstrapping the performance of each model!
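To see where the exponential branching comes from: if each of the three roles samples `branch` outputs per parent node, one question yields `branch**3` complete trajectories. The string stand-ins below are placeholders for real model samples:

```python
def sample_tree(question, branch=3):
    # Each role samples `branch` outputs per parent, so the number of
    # generator -> critic -> refiner trajectories grows as branch**3.
    trajectories = []
    for i in range(branch):                 # generator samples
        g = f"g{i}"
        for j in range(branch):             # critiques of each generation
            c = f"{g}-c{j}"
            for k in range(branch):         # refinements of each critique
                trajectories.append((g, c, f"{c}-r{k}"))
    return trajectories

trajectories = sample_tree("What is 7*8?", branch=3)  # 27 trajectories
```

Even a modest branching factor therefore produces enough role-specific traces per question to post-train all three models.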
Our goal was to develop techniques where a system of multiple models could be trained together. We use a generator, critic, and refinement setting that mimics how humans might interact with LLMs.
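The three-role loop can be sketched as below. The `call_*` functions are hypothetical stand-ins for sampling from the three specialized LLMs, not a real API:

```python
# Hypothetical sketch of a generator -> critic -> refiner interaction.
def call_generator(question):
    # Placeholder: a real system samples a candidate answer from an LLM.
    return f"draft answer to: {question}"

def call_critic(question, answer):
    # Placeholder: a real critic model points out flaws in the draft.
    return f"critique of '{answer}'"

def call_refiner(question, answer, critique):
    # Placeholder: the refiner conditions on both the draft and the critique.
    return f"refined answer to: {question} (using {critique})"

def pipeline(question):
    draft = call_generator(question)
    critique = call_critic(question, draft)
    return call_refiner(question, draft, critique)
```

This mirrors the human workflow of drafting, getting feedback, and revising, with each step handled by a different model.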
🚀🚨 Excited to announce our work on Multi-Agent LLM Training!
MALT is a multi-agent configuration that leverages synthetic data generation and credit-assignment strategies to post-train specialized models that solve problems together.
It was a privilege to collaborate with @ankareuel.bsky.social, Amelia Hardy, @mlamparth.bsky.social, Malcolm Hardy, and Professor Mykel Kochenderfer.
🚀 Check out our @neuripsconf.bsky.social Spotlight paper BetterBench, which outlines new standards in benchmarking AI! Delighted to have it featured in @techreviewjp.bsky.social
🚨 NeurIPS 2024 Spotlight
Did you know we lack standards for AI benchmarks, despite their role in tracking progress, comparing models, and shaping policy? 🤯 Enter BetterBench, our framework with 46 criteria for assessing benchmark quality: betterbench.stanford.edu 1/x