We provide experimental results showing that our MCTS-based approach for solving GUMDPs in the single-trial regime succeeds in tasks such as exploration, imitation learning, and adversarial MDPs.
N/N
Then, we explore how online planning techniques can be used to solve GUMDPs in the single-trial regime. In particular, we show that we can use an MCTS algorithm to provably solve GUMDPs in the single-trial regime.
In our work, under the discounted infinite-horizon setting, we first provide fundamental results for policy optimization in the single-trial regime. We show that non-Markovianity matters, connect single-trial optimization with solving a particular MDP, and prove a hardness result.
In particular, the optimal policy for the single-trial regime can differ from the optimal policy for the multiple-trial regime. This is unfortunate since the single-trial regime is important in real-world settings, where policy performance is usually assessed from a single trajectory.
However, previous works (jmlr.org/papers/volum..., arxiv.org/pdf/2409.15128) pointed out that a policy's performance depends, in general, on the number of trials/trajectories drawn to evaluate it.
GUMDPs generalize the MDP framework by allowing the performance of a given policy to depend on a (possibly non-linear) function of the frequency of visitation of state-action pairs induced by the policy.
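To make the definition concrete, here is a minimal, illustrative sketch (not from the paper; the entropy utility, the toy trajectories, and the discounted weighting scheme are my own assumptions) of a non-linear utility evaluated on the visitation frequencies induced by a single trajectory:

```python
import math
from collections import defaultdict

def empirical_occupancy(trajectory, gamma=0.9):
    """Discounted empirical state-action visitation frequencies
    induced by a single trajectory of (state, action) pairs."""
    d = defaultdict(float)
    for t, sa in enumerate(trajectory):
        d[sa] += (1 - gamma) * gamma**t  # normalized discounted weighting
    return dict(d)

def entropy_utility(d):
    """An illustrative non-linear utility: entropy of the visitation
    frequencies (the kind of objective used for pure exploration)."""
    return -sum(p * math.log(p) for p in d.values() if p > 0)

# A trajectory that spreads its visits scores higher entropy
# than one that keeps revisiting the same state-action pair.
diverse = [("s0", "a0"), ("s1", "a1"), ("s2", "a0")]
repetitive = [("s0", "a0"), ("s0", "a0"), ("s0", "a0")]
print(entropy_utility(empirical_occupancy(diverse)) >
      entropy_utility(empirical_occupancy(repetitive)))  # True
```

A standard MDP is the special case where the utility is linear in these frequencies (the dot product with a reward vector); non-linear utilities like the entropy above are what GUMDPs add.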
Our work, "Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning", got accepted to ICLR 2026.
arxiv.org/abs/2505.15782
1/N
Joint work with Francisco S. Melo and Alberto Sardinha.
Here’s Pedro at yet another international conference! 🙌✨
GAIPS member Pedro P. Santos presented “Centralized training with hybrid execution in multi-agent reinforcement learning via predictive observation imputation” at #AAAI2026, Singapore 🇸🇬
📄 Check out his paper: doi.org/10.1016/j.ar...
Here are some photos of GAIPS member @pedrosantospps.bsky.social presenting his work at ICML 2025 in Vancouver and EWRL 2025 in Tübingen, Germany. His poster was selected as a "spotlight poster" (top 2.6% of the papers)! 🙌 Read his work here: icml.cc/virtual/2025...
Walking around posters at @icmlconf.bsky.social, I was happy to see some buzz around convex RL—a topic I’ve worked on and strongly believe in.
Thought I’d share a few ICML papers on this direction. Let’s dive in👇
But first… what is convex RL?
🧵
1/n
The paper can be found here: arxiv.org/pdf/2409.15128
We provide lower and upper bounds on the mismatch between the finite and infinite trials formulations for GUMDPs, as well as empirical results to support our claims, highlighting how the number of trajectories and the structure of the underlying GUMDP influence policy evaluation.
We show that the number of trials plays a key role in infinite-horizon GUMDPs: the expected performance of a given policy depends, in general, on the number of trials.
We contribute the first analysis of the impact of the number of trials, i.e., the number of randomly sampled trajectories, in infinite-horizon GUMDPs (considering both discounted and average formulations).
The general-utility Markov decision processes (GUMDPs) framework generalizes the MDP framework by considering objective functions that depend on the frequency of visitation of state-action pairs induced by a given policy.
Happy to share that our paper "The Number of Trials Matters in Infinite-Horizon General-Utility Markov Decision Processes" got accepted as a spotlight poster at the International Conference on Machine Learning (ICML).