
Posts by Joe Stacey

This was just embarrassing. Shame on everyone who works on Grok…

5 months ago 1 0 0 0

We have released #AgentCoMa, an agentic reasoning benchmark where each task requires a mix of commonsense and math to be solved 🧐

LLM agents performing real-world tasks should be able to combine these different types of reasoning, but are they fit for the job? 🤔

🧵⬇️

7 months ago 5 2 1 0

Congratulations!! Awesome you will be in Europe!

8 months ago 1 0 1 0

The bad:

- the chocolate here is terrible for no good reason
- hotel breakfasts never have any baked beans, which are way underappreciated here (they are delicious and add much-needed moisture to a cooked breakfast)
- the temperature in summer is inhumane

Think that covers the main stuff 😍

9 months ago 0 0 0 0

Here’s my review of the US after a few days here. Did I miss anything? 🤔

The good:

- Americans are the most charming, friendly and hospitable people
- it’s super fun how the country is split into states that all have different laws and stuff, with different vibes state to state

9 months ago 1 0 1 0

Any chance Keir Starmer can reshuffle himself in as foreign secretary, and shuffle in another prime minister who actually has some vague idea about what they want to achieve? 🙏🤦‍♂️

9 months ago 0 0 0 0

Finally the heatwave has ended, and the UK is once again a bearable place to be 😍😍

If you have any UK-based collaborations, their productivity is about to increase like 10-fold

9 months ago 2 0 0 0
How to Improve the Robustness of Closed-Source Models on NLI Closed-source Large Language Models (LLMs) have become increasingly popular, with impressive performance across a wide range of natural language tasks. These models can be fine-tuned to further improv...

This work was really fun and a great last paper for my PhD. Check it out 🙂 Massive thanks to all my amazing collaborators!

arxiv.org/abs/2505.20209

P.S. if you know about a paper improving NLI model robustness not already in our related work appendix, I would love to hear about it 🥰

10 months ago 0 0 0 0

5) The best way to improve performance on the hardest OOD data was to choose more challenging training examples

Our best method (Uncertainty Sampling) picked examples with the most uncertain predictions. This identified challenging examples, but without too much label noise
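In case it's useful, here's a rough sketch of what entropy-based uncertainty sampling could look like (the `predict_probs` callable and the entropy scoring are illustrative assumptions for this sketch, not necessarily the paper's exact implementation):

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_sample(examples, predict_probs, budget):
    """Keep the `budget` examples the model is least certain about.

    `predict_probs(example)` is assumed to return a list of class
    probabilities from the current model.
    """
    ranked = sorted(examples, key=lambda ex: entropy(predict_probs(ex)), reverse=True)
    return ranked[:budget]
```

The idea being that near-uniform predictions (high entropy) flag the examples the model finds hardest, so those are the ones worth spending the training budget on.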

10 months ago 1 0 1 0

4) Creating more complex synthetic data avoids a loss in performance on harder OOD datasets

We find that generating more challenging synthetic data (Long & Complex Generation) helps retain performance on harder OOD datasets, while still achieving gains on easier OOD data
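For illustration, a hypothetical prompt builder in the spirit of Long & Complex Generation (the wording and the `build_generation_prompt` helper are made up for this sketch, not the paper's actual prompt):

```python
def build_generation_prompt(label, style="long and complex"):
    """Ask an LLM to write a fresh NLI premise/hypothesis pair with a target label.

    The prompt text here is a toy example of steering generation towards
    more challenging (longer, more complex) synthetic examples.
    """
    return (
        f"Write a {style} NLI example whose gold label is '{label}'. "
        "Return exactly two lines:\n"
        "Premise: ...\n"
        "Hypothesis: ..."
    )
```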

10 months ago 0 0 1 0

3) Replacing some training examples with LLM-generated data proved very effective on less challenging OOD data

See Standard-OOD scores below (avg), where the simplest LLM-generated data (Short & Simple Generation) performed best, with substantial improvements

10 months ago 0 0 1 0

2) We experiment with 6+ ways of improving robustness:

This involved sampling methods that choose more complex examples from our training data, and methods that generate new synthetic examples

Some methods were pretty fun, e.g. asking an LLM to assess the difficulty of training examples
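A toy sketch of the LLM-as-difficulty-rater idea (the prompt text, the 1-5 scale, and the `parse_rating` helper are illustrative assumptions, not the paper's setup):

```python
# Hypothetical prompt for rating how hard an NLI training example is.
DIFFICULTY_PROMPT = (
    "Rate how difficult the following NLI example is to classify, "
    "on a scale from 1 (trivial) to 5 (very hard). Reply with a single number.\n"
    "Premise: {premise}\nHypothesis: {hypothesis}"
)

def parse_rating(reply, lo=1, hi=5):
    """Pull the first in-range integer out of a free-form LLM reply.

    Returns None if no usable rating is found, so callers can
    re-query or discard the example.
    """
    for token in reply.split():
        digits = "".join(ch for ch in token if ch.isdigit())
        if digits and lo <= int(digits) <= hi:
            return int(digits)
    return None
```

In practice you'd send `DIFFICULTY_PROMPT.format(...)` to whatever LLM you're using and feed the parsed ratings into your sampling strategy.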

10 months ago 1 0 1 0

1) It's time to stop using fine-tuned encoder models:

We find that fine-tuned LLMs are substantially more robust than commonly used encoder models, despite being fine-tuned on 50x less data.

This is especially the case on challenging OOD datasets (see Challenge-OOD avg below)

10 months ago 0 0 1 0

The paper tries to improve the robustness of closed-source LLMs fine-tuned on NLI, assuming a realistic training budget of 10k training examples.

Here's a 45 second rundown of what we found!

10 months ago 0 0 1 0

We have a fun new #NLProc paper on arXiv about improving the robustness of fine-tuned NLI models!

Have a look :)
arxiv.org/abs/2505.20209

10 months ago 6 0 1 0

I’d personally just love to see more negative results from nice ideas that didn’t quite work out. I feel like there’s probably a bunch of cool stuff people have tried out and discarded that could be made to work across multiple papers. Would be fun and interesting too

11 months ago 2 1 1 0

Was worried it was just me hating on it so much 🤣

11 months ago 0 0 0 0

I’d love to see more diversity in the field, what kind of things were you thinking?

11 months ago 0 0 1 0

Should I use an LLM to help refine my paper writing for the ARR deadline? 🤔🤔

It will improve the paper for sure, but it will probably also make the tone a whole lot more annoying

11 months ago 0 0 1 0

If you're at #NAACL2025 and want to hear about similarity effects for property inheritance in LMs, please stop by!

I will be presenting this work on Wednesday at the 11-12:30 poster session on Interpretability & analysis for language models (Hall 3).

aclanthology.org/2025.naacl-l...

11 months ago 12 4 0 0

Looks so cool! I’m insanely jealous

11 months ago 2 0 1 0

I’m not a fan of musk, but imo there’s some really nice work here 🙂

Interested in the Washington post article, would you mind sharing a link?

11 months ago 1 0 0 0

Excited to share our ICLR and NAACL papers! Please come and say hi, we're super friendly :)

11 months ago 14 5 0 0

That’s an awesome paper 👍👍

1 year ago 0 1 1 0

Wow, the old ITV Agatha Christie’s Poirot is brilliant. Some tv for 1989…

Gonna go binge watch the 13 seasons now 😍

1 year ago 1 0 0 0

Congratulations! It’s definitely worth trying/experimenting with more concise responses in the future to see what kind of reaction you get.

Best of luck with your meta-reviews! 🤞

1 year ago 1 0 0 0

Ah that’s good to know!

Yeah I think when authors choose to write concise responses everybody wins 🙂

1 year ago 1 0 0 0

Good point. I think the other downside is all the reviewer time it takes to go through them.

I’m not sure what the best solution is, and if you limit the responses too much it’s frustrating, but maybe something that discourages way-too-long responses might be helpful 🙂

1 year ago 1 0 1 0

I feel like the length of the ARR author rebuttals keeps growing every cycle

Is it a good thing for authors or reviewers that the responses can be so long? I feel like it’s a bit sub-optimal for both at the moment

1 year ago 4 0 3 0

Not only does everyone learn for themselves, but I think almost everyone sees themselves as good reviewers when that may not be the case

I think the ARR stats on how many great reviews people did are a pretty cool step in the right direction!

1 year ago 1 0 0 0