This year's ACM/SIGAI Autonomous Agents Research Award goes to Prof. Shlomo Zilberstein. His work on decentralized Markov Decision Processes laid the foundation for decision-theoretic planning in multi-agent systems and multi-agent reinforcement learning.
sigai.acm.org/main/2025/03...
#SIGAIAward
Posts by Roy Fox
I hear that the other site has been undergoing a Distributed Disinterest in Service attack.
I had one who was essentially head of Sales.
Exciting news - early bird registration is now open for #RLDM2025!
🔗 Register now: forms.gle/QZS1GkZhYGRF...
Register now to save €100 on your ticket. Early bird prices are only available until 1st April.
to the above, I'd add Offline RL (I start with AWR, then IQL and CQL)
2025 is looking to be the year that information-theoretic principles in sequential decision making finally make a comeback! (At least for me; I know others never stopped.) Already 4 very exciting projects, and counting!
I received an email from the Department of Energy stating that “DOE is moving aggressively to implement this Executive Order by directing the suspension of [...] DEI policies [...] Community Benefits Plans [... and] Justice40 requirements”.
This probably explains the NSF panel suspensions as well.
Screenshot of open roles at Fauna Robotics
Want a job in robotics in New York? faunarobotics.com
Quick links to the 2024 reviewed works:
1. bsky.app/profile/royf...
2. bsky.app/profile/royf...
3. bsky.app/profile/royf...
4. bsky.app/profile/royf...
5. bsky.app/profile/royf...
2. Using RL to guide search. We called it Q* before OpenAI made that name famous.
“Q* Search: Heuristic Search with Deep Q-Networks”, by Forest Agostinelli, in collaboration with Shahaf Shperberg, Alexander Shmakov, Stephen McAleer, and Pierre Baldi. PRL @ ICAPS 2024.
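The core idea of search guided by a Q-network can be sketched as below — a minimal, hypothetical best-first search where the heuristic for all of a node's children comes from a single call to a `q_values` function (standing in for one forward pass of a deep Q-network); the interfaces and unit edge costs are illustrative assumptions, not the paper's implementation.

```python
import heapq
import itertools

def q_star_search(start, goal_test, successors, q_values, weight=1.0):
    """Best-first search where a learned Q-function supplies the heuristic.

    q_values(state) is assumed to return {action: estimated cost-to-go of
    taking `action` in `state`} from ONE network call, so every child of a
    node is scored without evaluating each child state separately.
    """
    counter = itertools.count()                   # tie-breaker for the heap
    frontier = [(0.0, next(counter), start, [])]  # (f, tie, state, action path)
    best_g = {start: 0.0}
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return path
        g = best_g[state]
        q = q_values(state)              # one call scores all actions at once
        for action, child in successors(state):
            child_g = g + 1.0            # unit edge cost, for illustration
            if child_g < best_g.get(child, float("inf")):
                best_g[child] = child_g
                # priority f = g + weighted Q-value of the action taken
                heapq.heappush(frontier, (child_g + weight * q[action],
                                          next(counter), child, path + [action]))
    return None                          # no path found
```

With `weight=1.0` and an admissible Q-estimate this behaves like A*; larger weights trade optimality for speed, as in weighted A*.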
1. Using segmentation foundation models to overcome distractions in model-based RL.
“Make the Pertinent Salient: Task-Relevant Reconstruction for Visual Control with Distraction”, by Kyungmin Kim, in collaboration with Charless Fowlkes. TAFM @ RLC 2024.
Our 2024 research review isn't complete without mentioning 2 workshop papers that preview upcoming publications; I'll leave other things happening as surprises for 2025.
Davide Corsi @dcorsi.bsky.social, a rising star in Safe Robot Learning, led this work in collaboration with Guy Amir, Andoni Rodríguez, César Sánchez, and Guy Katz, published in RLC 2024. Not to be confused with Davide's other work in RLC 2024, for which he won a Best Paper Award (see below).
If the unsafe state space is small, and the boundary simplification is careful not to expand it much, the result is that we can safely run the policy and only rarely invoke the shield on unsafe states, leading to significant speedup with safety guarantees.
The trick is to use offline verification not only to label an entire policy safe/unsafe, but to label each state safe/unsafe, according to whether the policy's action there satisfies or violates the safety constraints. The resulting partition is complex, so we simplify it while guaranteeing no false negatives (no unsafe state labeled safe).
Online verification can be slow, but it's more useful than offline verification: it's easier to replace occasional unsafe actions than entire unsafe policies. And unsafe actions are often rare, only reducing optimality a little. But it's costly that we need to run the shield on every action, even if it turns out safe.
Given a control policy (say, a reinforcement-learned neural network) and a set of safety constraints, there are 2 ways to verify safety: offline, where the policy is verified to always output safe actions; and online, where a “shield” intercepts unsafe actions and replaces them with safe ones.
Last in our 2024 research review: control with efficient safety guarantees. Formal verification methods are very slow, but here's a cool trick to use them for safe control, with minimal slowdown and provable safety guarantees.
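The runtime side of this combination can be sketched as below — a hypothetical wrapper, assuming `safe_region` is the cheap, conservatively simplified membership test precomputed by offline verification (no false negatives: every state it labels safe truly is), and `shield` is the slow online fallback that returns a verified-safe action. These names and signatures are illustrative, not the paper's API.

```python
def make_fast_shielded_policy(policy, safe_region, shield):
    """Wrap a policy so the expensive shield runs only on states whose
    action was not already verified safe offline."""
    def act(state):
        if safe_region(state):
            return policy(state)     # fast path: no online verification needed
        return shield(state)         # rare slow path on unsafe-labeled states
    return act
```

If the unsafe-labeled region is small, almost every step takes the fast path, which is where the speedup with provable safety comes from.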
Led by the fantastic Armin Karamzade in collaboration with Kyungmin Kim and Montek Kalsi, this work was published in RLC 2024.
This method works well for short delays, but gets worse as the WM drifts over longer horizons than it was trained for. For longer delays, our experiments suggest a simpler method that directly conditions the policy on the delayed WM state and the following actions.
This suggests several delayed model-based RL methods. Most interestingly, when observations are delayed, we can use the WM to imagine how recent actions could have affected the world state, in order to choose the next action.
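The imagination step just described can be sketched as below — assuming a hypothetical world model with a deterministic latent transition `wm_step(latent, action) -> latent` and a `policy(latent) -> action`; real WMs are learned and stochastic, so this is only the shape of the idea.

```python
def act_under_delay(wm_step, policy, delayed_latent, pending_actions):
    """Roll the world model forward through the actions already sent but
    not yet observed, then act on the imagined current latent state."""
    latent = delayed_latent
    for a in pending_actions:        # imagine the effect of in-flight actions
        latent = wm_step(latent, a)
    return policy(latent)
```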
But real-world control problems are often partially observable. Can we use the structure of delayed POMDPs? Recent world modeling (WM) methods have a cool property: they can learn an MDP model of a POMDP. We show that for a good WM of an undelayed POMDP, the delayed WM models the delayed POMDP.
Previous works have noticed some important modeling tricks. First, delays can be modeled as just partial observability (POMDP), but generic POMDPs lose the nice temporal structure provided by delays. Second, a delayed MDP is still an MDP over a larger state space — exponentially larger in the delay, but it keeps the structure.
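That second trick — augmenting the state with the in-flight actions — can be sketched as below, for observation delay `d`: the augmented state is (the state observed `d` steps ago, the `d` actions taken since). The `env` interface here (`reset() -> state`, `step(action) -> state`) is a simplifying assumption for illustration.

```python
from collections import deque

class DelayedMDPWrapper:
    """Turn an MDP with observation delay `delay` into an MDP over
    augmented states (last observed state, pending actions)."""

    def __init__(self, env, delay, noop_action):
        self.env, self.delay, self.noop = env, delay, noop_action

    def reset(self):
        s = self.env.reset()
        # pad the action queue with no-ops before any real action is taken
        self.pending = deque([self.noop] * self.delay, maxlen=self.delay)
        self.obs_buffer = deque([s], maxlen=self.delay + 1)
        return (s, tuple(self.pending))

    def step(self, action):
        s = self.env.step(action)
        self.obs_buffer.append(s)
        self.pending.append(action)
        # the agent sees the state from `delay` steps ago, plus the actions since
        return (self.obs_buffer[0], tuple(self.pending))
```

The augmented state space grows exponentially in the delay (one action slot per delayed step), but any standard MDP algorithm applies to it unchanged.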
Next up in our 2024 research overview: reinforcement learning under delays. The usual control loop assumes immediate observation and action in each time step, but that's not always possible, as processing observations and decisions can take time. How can we learn to control delayed systems?
Led by the tireless Kolby Nottingham, partly during his AI2 internship, in collaboration with Bodhisattwa Majumder, Bhavana Dalvi Mishra, @sameer-singh.bsky.social, and Peter Clark, this work was published in ICML 2024.
SSO outperforms Reflexion and ReAct on a NetHack benchmark.
Skill Set Optimization (SSO) achieves record task success rates on the ScienceWorld and NetHack benchmarks, compared with existing memory-based language agents (ReAct, Reflexion, CLIN).
How to curate a skill set? Keep evicting skills that are rarely used in high-reward interactions. Here we rely on another prompt to tell us which skills it thinks actually informed actions in successful executions. We only keep those.
How to learn new skills? Take pairs (or more) of high-reward experiences that follow similar state trajectories and ask a language model to describe their shared prototype, i.e. a joint abstraction of their end state + a list of abstract instructions that hint at their actions.
Here, skill = abstraction of initial state + subgoal + instructions list. We keep a set of those.
How to use a skill set? Retrieve the most relevant skills for the current state (highest similarity to skill initial state) and put them in context for the language agent.
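The skill structure, retrieval, and curation steps above can be sketched as below — a toy, hypothetical implementation where `similarity` stands in for whatever embedding similarity is used, and `credited` stands in for the prompt's judgment of which skills informed successful executions; none of these names come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    init_state: str       # abstraction of the initial state
    subgoal: str          # joint abstraction of the end state
    instructions: list    # abstract instructions hinting at actions

def retrieve(skills, state, similarity, k=3):
    """Pick the k skills whose initial-state abstraction best matches the
    current state; these go into the language agent's context."""
    return sorted(skills, key=lambda sk: similarity(sk.init_state, state),
                  reverse=True)[:k]

def curate(skills, credited):
    """Evict skills not credited as informing actions in high-reward
    executions (that credit assignment is itself done by a prompt)."""
    return [sk for sk in skills if sk in credited]
```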