Paper Title
Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees
Paper Authors
Paper Abstract
We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states. This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified, while in the outer problem, a measure of fit is maximized. Several approaches have been proposed to alleviate the computational burden of this nested-loop structure, but these methods still suffer from high complexity when the state space is either discrete with large cardinality or continuous in high dimensions. Other approaches in the inverse reinforcement learning (IRL) literature emphasize policy estimation at the expense of reduced reward estimation accuracy. In this paper, we propose a single-loop estimation algorithm with finite-time guarantees that is equipped to deal with high-dimensional state spaces without compromising reward estimation accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic gradient step for likelihood maximization. We show that the proposed algorithm converges to a stationary solution with a finite-time guarantee. Further, if the reward is parameterized linearly, we show that the algorithm approximates the maximum likelihood estimator sublinearly. Finally, by using robotics control problems in MuJoCo and their transfer settings, we show that the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks.
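
As a rough illustration of the single-loop idea described in the abstract, the sketch below alternates one soft policy-improvement (Bellman) step under the current reward estimate with one stochastic gradient step on the log-likelihood of a sampled demonstration. This is not the paper's algorithm: the tabular MDP, features, step sizes, and demonstrations are all hypothetical, the reward is assumed linear in features, and the likelihood gradient is deliberately simplified by ignoring the dependence of the soft Q-values on the reward parameters.

```python
# Minimal sketch (not the paper's implementation) of a single-loop estimator:
# each iteration performs one soft policy-improvement step under the current
# reward estimate, then one stochastic gradient step on the log-likelihood of
# a sampled demonstrated state-action pair. Reward assumed linear:
# r(s, a) = phi(s, a) @ theta.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small tabular MDP and demonstration data.
n_states, n_actions, n_features, gamma = 5, 3, 4, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']
phi = rng.normal(size=(n_states, n_actions, n_features))            # phi(s, a)
demos = [(rng.integers(n_states), rng.integers(n_actions)) for _ in range(200)]

theta = np.zeros(n_features)          # reward parameters to estimate
Q = np.zeros((n_states, n_actions))   # soft Q-estimate, updated incrementally
alpha, eta = 0.5, 0.05                # policy-improvement and likelihood step sizes

def soft_policy(Q):
    """Boltzmann (soft-greedy) policy induced by the current Q-estimate."""
    z = Q - Q.max(axis=1, keepdims=True)
    pi = np.exp(z)
    return pi / pi.sum(axis=1, keepdims=True)

for it in range(500):
    # (1) One policy-improvement step: damped soft Bellman backup under phi @ theta.
    r = phi @ theta                                          # r[s, a]
    Qmax = Q.max(axis=1)
    V = np.log(np.exp(Q - Qmax[:, None]).sum(axis=1)) + Qmax  # soft value logsumexp_a Q
    Q = (1 - alpha) * Q + alpha * (r + gamma * P @ V)

    # (2) One stochastic gradient step on the log-likelihood of a sampled demo pair.
    s, a = demos[rng.integers(len(demos))]
    pi = soft_policy(Q)
    # Gradient of log pi(a|s) w.r.t. theta, approximated by treating dQ/dtheta ~ phi
    # (a simplifying assumption for this sketch; the exact gradient involves the
    # discounted feature expectations under the current policy).
    grad = phi[s, a] - pi[s] @ phi[s]
    theta += eta * grad

print("estimated reward parameters:", theta)
```

The point of the sketch is only the control flow: there is no inner loop that fully solves the policy-optimization problem for each candidate reward; instead, one policy update and one likelihood update are interleaved in a single loop.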