Posts by Jacob Morrison

Forget modeling every belief and goal! What if we represented people as following simple scripts instead (e.g., "cross the crosswalk")?

Our new paper shows AI which models others’ minds as Python code 💻 can quickly and accurately predict human behavior!

shorturl.at/siUYI
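To make the idea concrete, here is a tiny illustrative sketch (the classes, names, and toy environment are mine, not the paper's code) of a behavior "script" written as plain Python that can be simulated forward to predict someone's next actions:

    from dataclasses import dataclass

    @dataclass
    class Scene:
        crosswalk: str = "crosswalk_1"
        walk_signal: bool = False

    @dataclass
    class Person:
        position: str = "sidewalk"

    def cross_crosswalk(person: Person, env: Scene) -> list:
        """Toy script: walk to the crosswalk, wait for the signal, then cross."""
        actions = []
        if person.position != env.crosswalk:
            actions.append(("walk_to", env.crosswalk))
        if not env.walk_signal:
            actions.append(("wait",))
        actions.append(("cross", env.crosswalk))
        return actions

    # Predicting behavior = simulating the inferred script,
    # not modeling every belief and goal.
    print(cross_crosswalk(Person(), Scene()))
    # [('walk_to', 'crosswalk_1'), ('wait',), ('cross', 'crosswalk_1')]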

6 months ago 38 14 3 5
Reward Bench 2 - an allenai Collection: Datasets, spaces, and models for the Reward Bench 2 benchmark and paper!

Thank you to co-authors @natolambert.bsky.social, @valentinapy.bsky.social, @jacobcares.bsky.social, Sander Land, @nlpnoah.bsky.social, @hanna-nlp.bsky.social!
Read more in the paper here (ArXiv soon!): github.com/allenai/rewa...
Dataset, leaderboard, and models here: huggingface.co/collections/...

10 months ago 2 1 0 0
The RewardBench 2 Leaderboard on HuggingFace.

RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling.
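For intuition, the core measurement behind a reward model benchmark like this can be sketched as: does the reward model score the preferred completion above the rejected ones? (The function and toy data below are illustrative placeholders, not the actual RewardBench 2 harness.)

    def pairwise_accuracy(examples, score_fn):
        correct = 0
        for ex in examples:
            chosen = score_fn(ex["prompt"], ex["chosen"])
            rejected = [score_fn(ex["prompt"], r) for r in ex["rejected"]]
            # Count a hit only if the chosen completion beats every rejected one.
            correct += int(chosen > max(rejected))
        return correct / len(examples)

    examples = [{"prompt": "2+2?", "chosen": "4", "rejected": ["5", "22"]}]
    print(pairwise_accuracy(examples, lambda p, c: float(c == "4")))  # 1.0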

10 months ago 20 8 1 1

Heading to NAACL? With "verification being the key to AI," you should go to the poster session Friday, 9-10:30am, to chat with my star colleagues @valentinapy.bsky.social + @jacobcares.bsky.social about RewardBench (and really RewardBench 2, evaluation, and reward models in post-training).

11 months ago 14 2 0 0

Valentina and I will be presenting RewardBench at NAACL! Come say hi at the poster session on Friday and we can chat about reward models, staying up for 30 hours straight to rapidly reset from Singapore time, and more 🏜️

11 months ago 5 3 0 0

I'll be at #NAACL2025:

🖇️ To present my paper "Superlatives in Context", showing how the interpretation of superlatives is highly context-dependent and often implicit, and how LLMs handle such semantic underspecification

🖇️And we will present RewardBench on Friday

Reach out if you want to chat!

11 months ago 28 5 1 1

what a flattering picture lol

11 months ago 2 0 1 0
Holistically Evaluating the Environmental Impact of Creating Language Models: As the performance of artificial intelligence systems has dramatically increased, so too has the environmental impact of creating these systems. While many model developers release estimates of the po...

📜Paper: arxiv.org/abs/2503.05804
✍️Thanks to my illustrious coauthors @clarana.bsky.social @jaredfern.bsky.social timdettmers.com @strubell.bsky.social @jessedodge.bsky.social, 'twas a fun project 🌏
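Roughly, the standard accounting in this line of work goes: energy ≈ GPU-hours × average GPU power × datacenter PUE, and emissions ≈ energy × grid carbon intensity. A back-of-the-envelope sketch (the numbers below are made-up placeholders, not figures from the paper):

    def training_footprint(gpu_hours, avg_gpu_power_kw, pue, kg_co2e_per_kwh):
        # energy in kWh, emissions in kg CO2e
        energy_kwh = gpu_hours * avg_gpu_power_kw * pue
        return energy_kwh, energy_kwh * kg_co2e_per_kwh

    # Illustrative only: 100k GPU-hours at 0.4 kW, PUE 1.2, 0.35 kg CO2e/kWh.
    energy, co2e = training_footprint(100_000, 0.4, 1.2, 0.35)
    print(f"{energy:,.0f} kWh, {co2e / 1000:,.1f} t CO2e")  # 48,000 kWh, 16.8 t CO2e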

11 months ago 9 4 0 3

I'm in Singapore for @iclr-conf.bsky.social ! Come check out our spotlight paper on the environmental impact of training OLMo (link in next tweet) during the Saturday morning poster session from 10-12:30 -- happy to chat about this or anything else! DMs should be open, email works too

11 months ago 10 5 1 1

Announcing OLMo 2 32B: the first fully open model to beat GPT-3.5 & GPT-4o mini on a suite of popular, multi-skill benchmarks.

Comparable to the best open-weight models, but with a fraction of the training compute. When you have a good recipe, ✨ magical things happen when you scale it up!

1 year ago 58 14 3 3
Two Washington cats infected with bird flu • Washington State Standard: Two domestic cats in Washington state have been infected with bird flu after eating raw pet food, according to the department of agriculture.

There are no proven benefits to raw feeding, yet plenty of serious, well-documented risks. It has never been a good idea, but ESPECIALLY NOW. washingtonstatestandard.com/briefs/two-w...

1 year ago 6713 1653 1 102

also some other tülu contributors are on the market:
@ljvmiranda.bsky.social (ljvmiranda921.github.io) and Xinxi Lyu (alrope123.github.io) are also applying to PhD programs, and @valentinapy.bsky.social (valentinapy.github.io) is on the faculty market; hire them all!!

1 year ago 1 1 0 0

check out the updated paper here: arxiv.org/pdf/2411.15124 (with a beautiful new template!) and the model here: huggingface.co/allenai/Llam... and on the ai2 playground: playground.allenai.org

1 year ago 0 0 1 0

big tülu is here! can't wait for everyone to try it. it's been a lot of fun seeing how RL performs at this scale, thanks to @hamishivi.bsky.social
and @vwxyzjn.bsky.social, with preference data from @ljvmiranda.bsky.social

on an unrelated note, I'm applying to phd programs this year 👀

1 year ago 5 0 1 0
The logo for Tülu 405B.

Here is Tülu 3 405B 🐫 our open-source post-training model that surpasses the performance of DeepSeek-V3! It demonstrates that our recipe, which includes RLVR, scales to 405B, with performance on par with GPT-4o and surpassing prior open-weight post-trained models of the same size, including Llama 3.1.

1 year ago 92 21 2 8

Excited to see Tulu 3 sitting between Llama 3.1 and 3.3 Instruct on the Chatbot Arena leaderboard right now!

Particularly happy it is top 20 for Math and Multi-turn prompts :)

All the details and data on how to train a model this good are right here: arxiv.org/abs/2411.15124

1 year ago 15 3 0 0

Very pleased to see Tulu 3 70B more or less tied with Llama 3.1 70B Instruct on style-controlled Chatbot Arena. It's the only model anywhere close to that with open code and data for post-training! Lots of stuff people can build on.

Next looking for OLMo 2 numbers.

1 year ago 24 3 0 0

We released the OLMo 2 report! Ready for some more RL curves? 😏

This time, we applied RLVR iteratively! Our initial RLVR checkpoint on the full RLVR dataset mix showed a low GSM8K score, so we ran another RLVR round on GSM8K only, and another on MATH only 😆.

And it works! A thread 🧵 1/N
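In pseudocode, the iterative RLVR loop looks roughly like this (placeholder functions, not the actual OLMo 2 training code): the reward is simply whether the sampled answer verifies against ground truth, and the loop is re-run on a narrower data mix starting from the previous checkpoint.

    def verifiable_reward(model_answer: str, gold_answer: str) -> float:
        # 1 if the answer can be verified against ground truth, else 0
        return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0

    def rlvr_round(policy, dataset, rl_step):
        for prompt, gold in dataset:
            answer = policy(prompt)                   # sample a completion
            reward = verifiable_reward(answer, gold)  # 0/1 verifiable reward
            policy = rl_step(policy, prompt, answer, reward)
        return policy

    # policy = rlvr_round(policy, full_rlvr_mix, rl_step)  # round 1: full mix
    # policy = rlvr_round(policy, gsm8k_only, rl_step)     # round 2: GSM8K only
    # policy = rlvr_round(policy, math_only, rl_step)      # round 3: MATH only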

1 year ago 12 5 1 1

kicking off 2025 with our OLMo 2 tech report while payin homage to the sequelest of sequels 🫡

🚗 2 OLMo 2 Furious 🔥 is everythin we learned since OLMo 1, with deep dives into:

🚖 stable pretrain recipe
🚔 lr anneal 🤝 data curricula 🤝 soups
🚘 tulu post-train recipe
🚜 compute infra setup

👇🧵

1 year ago 69 17 2 1

Want to predict the task performance of LMs before pretraining them?

We develop task scaling laws and model ladders, which predict the accuracy on individual tasks by OLMo 2 7B & 13B models within 2 points of absolute error. The cost is 1% of the compute used to pretrain them.
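As a heavily simplified stand-in for the ladder idea (the actual method is more careful than this, and the numbers here are made up): fit a cheap scaling curve on small "ladder" models and extrapolate the task metric to the target compute budget.

    import numpy as np

    # Task accuracy of small "ladder" models vs training compute (illustrative numbers).
    flops = np.array([1e19, 3e19, 1e20, 3e20])
    acc = np.array([0.42, 0.48, 0.55, 0.61])

    # Fit a straight line in log-compute and extrapolate to the target budget.
    slope, intercept = np.polyfit(np.log10(flops), acc, 1)
    target_flops = 1e21
    print("predicted accuracy:", slope * np.log10(target_flops) + intercept)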

1 year ago 33 14 2 0
Saagar Enjeti tweets: "Probably a cold take but IMO Tokyo is the male fashion capital of the world: whether it’s western wear, suits, street wear the aesthetic is refined to the highest possible level

From the salaryman to the rebel teen they are impeccably dressed

It also helps no one is fat"


Why is Tokyo so fashionable? Some theories. 🧵

1 year ago 8925 1718 220 557

OLMo 2 is out 🥳 7B and 13B models trained on 5T tokens, and meticulously instruction-tuned using the Tulu 3 recipe.

Simply the best fully open models yet.

Really proud of the work & the amazing team at
@ai2.bsky.social

1 year ago 260 44 9 2

🍲

1 year ago 18 2 1 0
The OLMo 2 models sit at the Pareto frontier of training FLOPs vs model average performance.

Meet OLMo 2, the best fully open language model to date, including a family of 7B and 13B models trained up to 5T tokens. OLMo 2 outperforms other fully open models and competes with open-weight models like Llama 3.1 8B. As always, we released our data, code, recipes, and more 🎁

1 year ago 151 36 5 12

Thanks Tyler, great to hear from you!!

1 year ago 0 0 0 0

yeah language models are great, but which Tulu 3 are you

- brat tulu, a @jacobcares.bsky.social favorite
- PNW tulu, don’t forget where @ai2.bsky.social is from
- dank tulu 💪
- tulu at tulu, bc tulu means sunrise in farsi

1 year ago 9 1 1 0

Meet Tülu 3, a set of state-of-the-art instruct models with fully open data, eval code, and training algorithms.
We invented new methods for fine-tuning language models with RL and built upon best practices to scale synthetic instruction and preference data.
Demo, GitHub, paper, and models 👇
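Schematically, the post-training flow reads something like the sketch below (stage names, order, and stub functions are illustrative placeholders, not the released open-instruct code): supervised fine-tuning, then preference tuning, then RL with verifiable rewards.

    def sft(model, data):              # supervised fine-tuning on instruction data
        return model + ["sft"]

    def preference_tune(model, data):  # optimize on chosen/rejected preference pairs
        return model + ["dpo"]

    def rlvr(model, data):             # RL with verifiable (checkable) rewards
        return model + ["rlvr"]

    def post_train(base, sft_data, pref_data, verifiable_data):
        model = sft(base, sft_data)
        model = preference_tune(model, pref_data)
        return rlvr(model, verifiable_data)

    print(post_train(["llama-3.1-base"], [], [], []))
    # ['llama-3.1-base', 'sft', 'dpo', 'rlvr']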

1 year ago 111 31 2 7

Thanks to everybody who worked on this, our most fun project so far 🥳 Can't wait to see what we do next!

1 year ago 0 0 1 0

Models: huggingface.co/collections/...
Training data: huggingface.co/collections/...
Paper: allenai.org/papers/tulu-...
Blog: allenai.org/blog/tulu-3
Technical blog: allenai.org/blog/tulu-3-...
Training code: github.com/allenai/open...
Eval code: github.com/allenai/olmes

1 year ago 0 0 1 0

I'm so excited that we're finally releasing Tülu 3, our new post-training recipe! We're releasing models built on top of Llama 3.1 base (OLMo coming soon!), all of our datasets, a (73 page!) paper, new evaluations, and all of our code.

1 year ago 10 0 1 0