The paper "Mitigating goal misgeneralization via minimax regret" will appear at @rl-conference.bsky.social!
Joint work with the great Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger and @michaelddennis.bsky.social
www.arxiv.org/pdf/2507.03068
Posts by Karim Abdel Sadek
Future work we are excited about:
• Improving UED algorithms to be closer to the results predicted by our theory
• Mitigating the fully ambiguous case by focusing on the inductive biases of the agent.
We also visualize the performance of our agents in a maze for each possible location of the goal in the environment.
The results show that agents trained with the regret objective achieve near-maximum return for almost all goal locations.
We complement our theoretical findings with empirical results, which support our theory: agents trained via minimax regret generalize better.
Left: performance at test time
Right: % of distinguishing levels played by the respective level designer
When the deployment environments lie in the support of the training level distribution, we also show that a policy that is optimal with respect to the minimax regret objective is provably robust against goal misgeneralization!
We first formally show that a policy maximizing expected value may suffer from goal misgeneralization if distinguishing levels are rare.
Goal misgeneralization can occur when training only on non-distinguishing levels, as shown in Langosco et al., 2022.
Adding a few distinguishing levels does not alter this outcome. However, we propose a mitigation for this scenario!
Goal misgeneralization arises due to the presence of ‘proxy goals’. We formalize this and characterize environments as either:
• Non-distinguishing: the true and proxy rewards may induce the same behavior
• Distinguishing: the true and proxy rewards induce different behavior
We propose using regret, the difference between the optimal agent's return and our current policy's return, as a training objective.
Minimizing it will encourage the agent to solve rare out-of-distribution levels during training, helping it learn the correct reward function.
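To make the objective concrete, here is a minimal sketch of per-level regret and of why the worst case is driven by rare distinguishing levels. The function name, level names, and return values are all hypothetical, purely for illustration:

```python
def regret(optimal_return, policy_return):
    """Regret on one level: how far the policy falls short of optimal."""
    return optimal_return - policy_return

# Hypothetical returns on three levels: two non-distinguishing levels,
# where the proxy-following policy still scores perfectly, and one rare
# distinguishing level, where it fails.
optimal = {"easy_1": 1.0, "easy_2": 1.0, "distinguishing": 1.0}
proxy_policy = {"easy_1": 1.0, "easy_2": 1.0, "distinguishing": 0.0}

# Expected value barely notices the rare level, but the MAX regret is
# entirely driven by it, so a minimax-regret level designer keeps
# presenting it until the agent solves it.
worst = max(regret(optimal[lvl], proxy_policy[lvl]) for lvl in optimal)
print(worst)  # 1.0
```

This is the intuition behind the objective: minimizing the maximum regret forces the agent to care about exactly the levels where the true and proxy rewards come apart.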
*New Paper*
🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function instead of the human's intended goal.
😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!
CAIF's new and massive report on multi-agent AI risks will be a really useful resource for the field
www.cooperativeai.com/post/new-rep...
what if…
A large group of us (spearheaded by Denizalp Goktas) have put out a position paper on paths towards foundation models for strategic decision-making. Language models still lack these capabilities, so we'll need to build them: hal.science/hal-04925309...
lbh gnxr gur yninynzc bhgchg, naq Nyvpr naq Obo qb gur qbg cebqhpg bs vg jvgu gurve erfcrpgvir ahzore naq gura nccyl zbq 2 gb gur erfhyg. Gurl gura pbzzhavpngr gur ovg gurl bognvarq (1=jnir,0=jvax), naq guvf bcrengvba nyjnlf erghea gur fnzr ahzore gb obgu vs n=o be bgurejvfr snvyf jvgu c=1/2?
Model-free deep RL algorithms like NFSP, PSRO, ESCHER, & R-NaD are tailor-made for games with hidden information (e.g. poker).
We performed the largest-ever comparison of these algorithms.
We find that they do not outperform generic policy gradient methods, such as PPO.
arxiv.org/abs/2502.08938
1/N
The 2025 Cooperative AI summer school (9-13 July 2025 near London) is now accepting applications, due March 7th!
www.cooperativeai.com/summer-schoo...
The magic thing humans do is solve tasks remarkably well under high uncertainty about the problem specification. We are also frequently capable of doing this collaboratively. I still don't see evidence that models can do any part of this.
I will be at @neuripsconf.bsky.social this week!
Would love to chat about Multi-agent systems, RL, Human-AI Alignment, or anything interesting :)
I'm also applying for PhD programs this cycle, feel free to reach out for any advice!
More about me: karim-abdel.github.io
I give you a loaded coin, with some (unknown) probability 0<p<1 of landing Heads, and I ask you to generate a fair coin toss.
Great! We know how to do this! This is the Von Neumann trick: toss twice. If HH or TT, repeat; if HT or TH, return the first.
Problem solved? Not quite... This can be bad!
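A minimal sketch of the von Neumann trick (the function and the 90%-heads coin are illustrative). The catch hinted at above: a pair of tosses produces an output only with probability 2p(1-p), so the expected number of tosses per fair bit is 1/(p(1-p)), which blows up as p approaches 0 or 1:

```python
import random

def fair_flip(biased_flip):
    """Von Neumann trick: toss the biased coin twice.

    HT -> 'H', TH -> 'T'; on HH or TT, try again.
    HT and TH each occur with probability p*(1-p), so the output is fair
    regardless of the (unknown) bias 0 < p < 1.
    """
    while True:
        a, b = biased_flip(), biased_flip()
        if a != b:
            return a

# A coin that lands Heads 90% of the time.
biased = lambda: 'H' if random.random() < 0.9 else 'T'

flips = [fair_flip(biased) for _ in range(100_000)]
print(flips.count('H') / len(flips))  # ≈ 0.5
```

With p = 0.9, each fair bit costs about 1/(0.9 × 0.1) ≈ 11 biased tosses on average, and the cost is unbounded for more extreme biases.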
Very cool work! I think an important challenge is to scale assistance games to scenarios where the goal/action/communication space can be 'large', so as to capture the real-world settings where we will actually want to apply CIRL.
Here's some cool work taking a first step towards that in Minecraft using MCTS: Scalably Solving Assistance Games - openreview.net/pdf/080f0c69...