Submitting a benchmark to ICML? Check out our NeurIPS Spotlight paper BetterBench! We outline best practices for benchmark design, implementation & reporting to help shift community norms. Be part of the change! 🙌
+ Add your benchmark to our database for visibility: betterbench.stanford.edu
Posts by Chandler Smith
The 2025 Cooperative AI summer school (9-13 July 2025 near London) is now accepting applications, due March 7th!
www.cooperativeai.com/summer-schoo...
Very excited to read this!
On my way to NeurIPS ‘24 ✈️ to present our Spotlight paper BetterBench and the Concordia Contest!
Would love to connect with folks and chat anything multi-agent, agentic AI, benchmarking, etc.
I am applying for fall ‘25 PhDs. Ping me if you have advice or think there may be a fit!
Here’s the link to the paper and Hugging Face page: arxiv.org/pdf/2412.01928 and huggingface.co/papers/2412....
This was an incredible collaboration with our lead Sumeet Motwani, Philip Torr, and Ronnie Clark from Oxford; Fabio Pizzati, Rocktim Jyoti Das, and Ivan Laptev at MBZUAI; and Mark Rybchuk at Berkeley. Expert supervision from Christian Schroeder de Witt from @oxfordtvg.bsky.social!
MALT is still preliminary work and there is a lot left to explore, but I believe this is an important research direction. We’ll be working on scaling it to more settings (especially with partial observability, a critic that can use tools, and smarter ways to distill things).
We see very strong performance across MATH, GSM8k, and CommonsenseQA against trained and untrained baselines with Llama 3.1 8B!
In this setup, models get better at checking/improving certain parts of answers based on what worked best during search. This can address limitations models might have around back-tracking or CoT critiques.
Using SFT and DPO, we can learn from both positive and negative reasoning traces. The multi-agent setup allows for role specialization, where the additional context in each prompt reduces the computation required of subsequent models.
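A minimal sketch of how labeled traces could feed both objectives: correct outputs become SFT examples, and prompts with both a good and a bad output become DPO preference pairs. The trace format and `build_training_data` helper are hypothetical illustrations, not the paper's actual data pipeline:

```python
def build_training_data(traces):
    # Each trace is (role, prompt, output, good), where `good` is the
    # binary quality label propagated from the search tree.
    sft, by_prompt = [], {}
    for role, prompt, output, good in traces:
        if good:
            # Positive traces become supervised fine-tuning examples.
            sft.append({"role": role, "prompt": prompt, "completion": output})
        bucket = by_prompt.setdefault((role, prompt), {"chosen": [], "rejected": []})
        bucket["chosen" if good else "rejected"].append(output)
    # DPO needs a preferred AND a dispreferred output for the same prompt.
    dpo_pairs = [
        {"role": r, "prompt": p, "chosen": b["chosen"][0], "rejected": b["rejected"][0]}
        for (r, p), b in by_prompt.items()
        if b["chosen"] and b["rejected"]
    ]
    return sft, dpo_pairs
```

Because the data is keyed by role, each model only ever trains on examples from its own position in the pipeline.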
This allows us to compare final outputs to a ground truth, propagate rewards throughout downstream nodes, and post-train models on role-specific data. The generator learns to be a better generator, the critic learns to be a better critic, and so on by bootstrapping reasoning traces.
Just by looking at these trees, how do you tell which branches are useful for post-training without human feedback or trained PRMs? Value iteration offers a simple approach: propagate labels through the branches, with a thresholding factor to mark the quality of each reasoning step.
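The propagation step can be sketched as follows, assuming leaves are scored 1.0/0.0 against the ground-truth final answer and an internal node's value is the mean over its leaves. The `propagate_values` helper and the 0.5 threshold are illustrative assumptions, not the paper's exact formulation:

```python
def propagate_values(children_values, threshold=0.5):
    # A node's value is the average correctness of the outcomes beneath it;
    # thresholding turns that value into a binary quality label for training.
    value = sum(children_values) / len(children_values)
    return value, value >= threshold

# Example: a critique node whose three downstream refinements were
# correct, wrong, correct gets value 2/3 and is labeled "good".
value, label = propagate_values([1.0, 0.0, 1.0])
```

Applying this bottom-up over the tree labels every intermediate step without any human annotation.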
Training the models jointly in a single pipeline is a difficult problem, given the discrete outputs produced by each model. Instead, we use a tree-based sampling strategy with an exponential branching factor that generates a large amount of synthetic data for bootstrapping the performance of each model!
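To see where the exponential branching comes from: if each of the three roles samples `branch` outputs per parent node, one question yields `branch**3` complete trajectories. The string stand-ins below are placeholders for real model samples:

```python
def sample_tree(question, branch=3):
    # Each role samples `branch` outputs per parent, so the number of
    # generator -> critic -> refiner trajectories grows as branch**3.
    trajectories = []
    for i in range(branch):                 # generator samples
        g = f"g{i}"
        for j in range(branch):             # critiques of each generation
            c = f"{g}-c{j}"
            for k in range(branch):         # refinements of each critique
                trajectories.append((g, c, f"{c}-r{k}"))
    return trajectories

trajectories = sample_tree("What is 7*8?", branch=3)  # 27 trajectories
```

Even a modest branching factor therefore produces enough role-specific traces per question to post-train all three models.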
Our goal was to develop techniques where a system of multiple models could be trained together. We use a generator, critic, and refinement setting that mimics how humans might interact with LLMs.
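The three-role loop can be sketched as below. The `call_*` functions are hypothetical stand-ins for sampling from the three specialized LLMs, not a real API:

```python
# Hypothetical sketch of a generator -> critic -> refiner interaction.
def call_generator(question):
    # Placeholder: a real system samples a candidate answer from an LLM.
    return f"draft answer to: {question}"

def call_critic(question, answer):
    # Placeholder: a real critic model points out flaws in the draft.
    return f"critique of '{answer}'"

def call_refiner(question, answer, critique):
    # Placeholder: the refiner conditions on both the draft and the critique.
    return f"refined answer to: {question} (using {critique})"

def pipeline(question):
    draft = call_generator(question)
    critique = call_critic(question, draft)
    return call_refiner(question, draft, critique)
```

This mirrors the human workflow of drafting, getting feedback, and revising, with each step handled by a different model.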
🚀🚨 Excited to announce our work on Multi-Agent LLM Training!
MALT is a multi-agent configuration that leverages synthetic data generation and credit-assignment strategies to post-train specialized models that solve problems together.
It was a privilege to collaborate with @ankareuel.bsky.social, Amelia Hardy, @mlamparth.bsky.social, Malcolm Hardy, and Professor Mykel Kochenderfer.
🚀 Check out our @neuripsconf.bsky.social Spotlight paper BetterBench, which outlines new standards in benchmarking AI! Delighted to have it featured in @techreviewjp.bsky.social
🚨 NeurIPS 2024 Spotlight
Did you know we lack standards for AI benchmarks, despite their role in tracking progress, comparing models, and shaping policy? 🤯 Enter BetterBench, our framework with 46 criteria for assessing benchmark quality: betterbench.stanford.edu 1/x