What do we talk about when we talk about the Bellman Optimality Equation?

If we think carefully, we are (implicitly) making three claims.

First, we claim that there exists a unique value function $\Vopt$ that satisfies the following equation: For any $x \in \XX$, we have
\begin{align*}
	\Vopt(x) =
	\max_{a \in \AA} \left \{ r(x,a) + \gamma \int \PKernel(\dx' | x, a) \Vopt(x') \right \}.
\end{align*}
This claim alone, however, does not show that this $\Vopt$ is the same as $V^{\piopt}$.
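
Why should such a unique $\Vopt$ exist at all? One standard route, given here only as a sketch (we write $T^*$ for the Bellman optimality operator, a notation introduced just for this sketch, and we assume bounded rewards and $0 \le \gamma < 1$): the right-hand side of the equation defines a map that is a $\gamma$-contraction in the supremum norm,
\begin{align*}
	(T^* V)(x) =
	\max_{a \in \AA} \left \{ r(x,a) + \gamma \int \PKernel(\dx' | x, a) V(x') \right \},
	\qquad
	\| T^* V_1 - T^* V_2 \|_\infty \le \gamma \| V_1 - V_2 \|_\infty.
\end{align*}
By the Banach fixed-point theorem, $T^*$ then has exactly one fixed point, and that fixed point is the $\Vopt$ of the equation above.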

The second claim is that $\Vopt$ is indeed the same as $V^{\piopt}$, the optimal value function when $\pi$ is restricted to the space of stationary policies.
This claim alone, however, does not preclude the possibility that we can find an even better policy by going beyond the space of stationary policies.
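
To see why no stationary policy can do better, here is a sketch in the notation above, with $T^\pi$ denoting the operator obtained from $T^*$ by replacing the maximum over actions with the action prescribed by a stationary policy $\pi$: we have $T^\pi V \le T^* V$ for every $V$, both operators are monotone, and iterating from $\Vopt$ gives
\begin{align*}
	V^\pi = \lim_{k \to \infty} (T^\pi)^k \Vopt \le \lim_{k \to \infty} (T^*)^k \Vopt = \Vopt.
\end{align*}
Hence $V^\pi \le \Vopt$ pointwise for every stationary $\pi$; the policy that attains $\Vopt$ is the greedy one discussed next.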

The third claim is that for discounted continuing MDPs, we can always find a stationary policy that is optimal within the space of all stationary and non-stationary policies.
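
The stationary policy in question can be taken to be greedy with respect to $\Vopt$. As a sketch, and assuming the maximum below is attained, define
\begin{align*}
	\piopt(x) \in \operatorname*{arg\,max}_{a \in \AA} \left \{ r(x,a) + \gamma \int \PKernel(\dx' | x, a) \Vopt(x') \right \}.
\end{align*}
For this policy, $T^{\piopt} \Vopt = T^* \Vopt = \Vopt$, and since $T^{\piopt}$ is itself a $\gamma$-contraction with unique fixed point $V^{\piopt}$, we get $V^{\piopt} = \Vopt$. Extending the comparison to non-stationary policies takes an extra limiting argument over finite horizons, which we do not reproduce here.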

These three claims together show that the Bellman optimality equation reveals the recursive structure of the optimal value function $\Vopt = V^{\piopt}$. There is no policy, stationary or non-stationary, with a value function better than $\Vopt$, for the class of discounted continuing MDPs.
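
A practical payoff of this recursive structure, again stated only as a sketch in the same notation: because $T^*$ is a $\gamma$-contraction, the value iteration $V_{k+1} = T^* V_k$, started from any bounded $V_0$, converges to $\Vopt$ at a geometric rate,
\begin{align*}
	\| V_k - \Vopt \|_\infty \le \gamma^k \| V_0 - \Vopt \|_\infty,
\end{align*}
so repeatedly applying the right-hand side of the Bellman optimality equation is already an algorithm for computing $\Vopt$, and with it a greedy (hence optimal) stationary policy.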



#FoundationsOfReinforcementLearning #sneakpeek
