Posts by Nick Tomlin

Agents interact with environments to get information. But exploration (tools, retrieval, user interaction) is costly.

Calibrate-Then-Act allows LLM agents to balance exploration and cost:
Estimate uncertainty about the environment
💭 Reason about cost-uncertainty tradeoffs
⚙️ Act accordingly
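
The tradeoff in the three steps above can be illustrated with a toy decision rule. This is only a minimal sketch under stated assumptions, not the Calibrate-Then-Act method itself: the function `decide`, its parameters, and the linear expected-gain formula are all hypothetical, and the probabilities are assumed to come from some calibrated estimator.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str           # "explore" or "commit"
    expected_gain: float  # estimated benefit of exploring, net of its cost


def decide(p_correct: float, p_correct_after: float,
           reward: float, explore_cost: float) -> Decision:
    """Hypothetical rule: explore only if the expected reward gain exceeds the cost.

    p_correct       -- estimated chance of success when acting now
    p_correct_after -- estimated chance of success after one exploration step
    reward          -- payoff for succeeding at the task
    explore_cost    -- price of the exploration step (tool call, retrieval, question)
    """
    for p in (p_correct, p_correct_after):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    gain = (p_correct_after - p_correct) * reward - explore_cost
    return Decision("explore" if gain > 0 else "commit", gain)
```

With these assumed numbers, `decide(0.5, 0.9, reward=10.0, explore_cost=1.0)` chooses to explore because the uncertainty is high, while `decide(0.95, 0.97, reward=10.0, explore_cost=1.0)` commits because the extra information is not worth its cost.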

1 month ago 17 6 1 1
A figure demonstrating the different aspects of the corpus described in the tweet. There is a main isometric 3D view of a level in the Portal 2 co-op game, with some portals, lasers, and the blue and orange players. Inset, there are first-person captures of the blue and orange player views. There is also a box containing the transcribed dialogue with timestamps and labels for the discursive acts. Finally, there is a box containing a task and a list of subtasks. Some subtasks are already crossed out, with the time that they have been completed. The last subtask ("Player 2 places portal 4 on wall 4") is marked incomplete.

The dialogue is as follows:

Blue: Can you put your other portal up here? (tagged as directive)
Orange: Where? (tagged as request for clarification)
Blue: On uh, on this wall. (tagged as directive)
Blue: So that it uh points at the circle. (tagged as directive)
Orange: Okay. (tagged as commit)

The full list of subtasks is:

Task: Redirect lasers
Subtask: Player 1 places portal 1 on wall 1. (completed)
Subtask: Player 1 places portal 2 on wall 2 or 3. (completed)
Subtask: Player 2 places portal 3 opposite of portal 2. (completed)
Subtask: Player 2 places portal 4 on wall 4. (incomplete)

A couple years (!) in the making: we're releasing a new corpus of embodied, collaborative problem solving dialogues. We paid 36 people to play Portal 2's co-op mode and collected their speech + game recordings.

Paper: arxiv.org/abs/2512.03381
Website: berkeley-nlp.github.io/portal-dialo...

1/n

4 months ago 100 31 3 8

I'm recruiting my first group of students at TTIC! If you're interested, please apply by December 9th and mention my name in your application

4 months ago 9 6 0 0
TTIC Faculty Opportunities at TTIC

Two brief advertisements!

TTIC is recruiting both tenure-track and research assistant professors: ttic.edu/faculty-hiri...
NYU is recruiting faculty fellows: apply.interfolio.com/174686

Happy to chat with anyone considering either of these options

5 months ago 8 6 0 0

CRA changed their interface and it's much harder to browse now for some reason...

Last year, I ended up just making a list of schools/departments that I wanted to apply to and individually searching through each of their websites for job postings

6 months ago 1 0 1 0

FYI that UChicago CS & Stats is hiring at all levels via the Data Science Institute:

Postdoc: uchicago.infoready4.com#freeformComp...
Assistant Professor: apply.interfolio.com/174766
Associate Professor: apply.interfolio.com/174768

6 months ago 8 3 0 0
What does it take to build a human-like user simulator?

Jessy Lin and I wrote another blogpost on user simulators as a reward function for training interactive models, this time focused on methods + open questions:
jessylin.com/2025/09/25/u...

6 months ago 3 0 0 0
Eugene Vinitsky

Was talking to a student who wasn't sure about why one would get a PhD. So I wrote up a list of reasons!
www.eugenevinitsky.com/posts/reason...

8 months ago 51 11 7 0
User simulators bridge RL with real-world interaction

An excellent blog post about what is still a huge gap, namely models of humans you can actually use to study human-AI interaction: jessylin.com/2025/07/10/u...

9 months ago 12 2 1 0

We're proud to announce three new tenure-track assistant professors joining TTIC in Fall 2026: Yossi Gandelsman, Will Merrill, and Nick Tomlin (@nickatomlin.bsky.social). Meet them here: buff.ly/JH1DFtT

9 months ago 7 2 0 0

🤠🤓🙂

10 months ago 4 0 1 0

Haha main reason for using Gym was that we wanted a way to automatically evaluate models against trained RL agents. Doing the full arena-style evaluation on reasoning models gets really expensive

It also helps that current LLMs are really good at generating functional Gym code

11 months ago 1 0 1 0

I think in the short term that's reasonable, e.g., current models can play chess but they definitely can't understand chess variants

In the long term, I suspect there's more risk of over-optimizing to those specific games, so the hope is that our approach is a bit more future-proof

11 months ago 0 0 0 0
GitHub - vivek3141/gg-bench: Measuring General Intelligence With Generated Games (Preprint)

For anyone interested in evaluating or expanding on this benchmark, we have a nice code release here: github.com/vivek3141/gg...

11 months ago 4 0 0 0
Results table. The best model (o1) wins about 36% of games against the RL baselines.

This is a difficult benchmark: the best non-reasoning LLMs score around 9%, while the best reasoning models score around 36%. In the future, as models get stronger, we anticipate that they'll also be able to generate harder games

11 months ago 1 0 1 0
Main paper figure showing a three-step pipeline of game description generation, implementation generation, and self-play training of RL agents

We use o1 to generate natural language rulebooks for 1000 two-player games and then implement these games as Gym environments. For each game, we train baseline agents in self-play with RL and then evaluate whether LLMs can beat the RL baselines
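
The final evaluation step can be pictured as a win-rate computation over many games. The following is a rough sketch, not the gg-bench evaluation code: `win_rate`, `play_match`, and the +1/0/-1 outcome encoding are hypothetical names and conventions, and counting draws as non-wins is an assumption.

```python
from typing import Callable, Iterable


def win_rate(play_match: Callable[[int], int], game_ids: Iterable[int]) -> float:
    """Fraction of games the candidate model wins against a baseline.

    play_match(game_id) is a hypothetical callback that plays one match and
    returns +1 for a win, 0 for a draw, and -1 for a loss.
    Draws count as non-wins here, which is an assumption of this sketch.
    """
    results = [play_match(game_id) for game_id in game_ids]
    if not results:
        raise ValueError("need at least one game to compute a win rate")
    return sum(1 for r in results if r == 1) / len(results)


def stub_match(game_id: int) -> int:
    """Stand-in for a real match: the candidate wins every fourth game."""
    return 1 if game_id % 4 == 0 else -1
```

For example, `win_rate(stub_match, range(8))` uses the stand-in match function; in a real setup, `play_match` would step a game environment with the model on one side and the trained baseline agent on the other.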

11 months ago 4 0 2 0
Title and abstract of the paper, "Measuring General Intelligence with Generated Games"

I'm particularly fond of this new benchmark paper we wrote, which aims to scalably evaluate whether language models can generalize to arbitrary new tasks. The core idea is to use LLMs to generate new games, and then evaluate whether LLMs can play those games

📄: arxiv.org/abs/2505.07215

11 months ago 33 9 3 1

I might be able to hire a postdoc for this fall in computational linguistics at UT Austin. Topics in the general LLM + cognitive space (particularly reasoning, chain of thought, LLMs + code) and LLM + linguistic space. If this could be of interest, feel free to get in touch!

1 year ago 60 31 0 1

Writing my first post here to announce that I've accepted an assistant professor job at TTIC! I'll be starting in Fall 2026, and recruiting students this upcoming cycle.

Until then, I'll be wrapping up the PhD at Berkeley, and this summer I'll join NYU as a CDS Faculty Fellow 🏙️

1 year ago 41 2 3 2