Paper Title

Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off

Paper Authors

Zichen Zhang, Johannes Kirschner, Junxi Zhang, Francesco Zanini, Alex Ayoub, Masood Dehghan, Dale Schuurmans

Paper Abstract

A default assumption in reinforcement learning (RL) and optimal control is that observations arrive at discrete time points on a fixed clock cycle. Yet, many applications involve continuous-time systems where the time discretization, in principle, can be managed. The impact of time discretization on RL methods has not been fully characterized in existing theory, but a more detailed analysis of its effect could reveal opportunities for improving data-efficiency. We address this gap by analyzing Monte-Carlo policy evaluation for LQR systems and uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently to time discretization, leading to an optimal choice of temporal resolution for a given data budget. These findings show that managing the temporal resolution can provably improve policy evaluation efficiency in LQR systems with finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and standard RL benchmarks for non-linear continuous control.
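To make the described trade-off concrete, here is a minimal, hypothetical sketch (not the authors' implementation): it simulates a 1-D continuous-time LQR instance with a fixed linear policy via Euler-Maruyama, estimates the discounted value at a fixed start state by Monte-Carlo under a fixed budget of simulated transitions, and sweeps the discretization step. All parameters (a, b, k, sigma, q, r, beta, horizon, data_budget) are illustrative assumptions, not values from the paper.

```python
# Hypothetical illustration of the approximation/statistical-error trade-off
# described in the abstract. All parameters below are made-up assumptions.
import numpy as np

rng = np.random.default_rng(0)

# 1-D continuous-time LQR: dx = (a*x + b*u) dt + sigma dW, policy u = -k*x,
# cost c(x, u) = q*x^2 + r*u^2, discounted value V(x0) = E[ ∫ e^{-beta t} c dt ].
a, b, k, sigma = -0.5, 1.0, 0.3, 0.2
q, r, beta = 1.0, 0.1, 0.5
x0, horizon = 1.0, 5.0          # fixed rollout length in continuous time
data_budget = 20_000            # total number of simulated transitions allowed

def mc_value_estimate(delta, budget):
    """Monte-Carlo value estimate at x0 using Euler-Maruyama with step delta.

    The data budget counts discrete transitions, so a finer delta leaves room
    for fewer independent trajectories (larger statistical error), while a
    coarser delta degrades the Riemann-sum approximation of the cost integral.
    """
    steps = round(horizon / delta)
    n_traj = max(budget // steps, 1)
    returns = np.empty(n_traj)
    for i in range(n_traj):
        x, g = x0, 0.0
        for t in range(steps):
            u = -k * x
            g += np.exp(-beta * t * delta) * (q * x**2 + r * u**2) * delta
            x += (a * x + b * u) * delta + sigma * np.sqrt(delta) * rng.standard_normal()
        returns[i] = g
    return returns.mean()

# Sweep the temporal resolution under the same data budget.
for delta in (0.5, 0.2, 0.1, 0.05, 0.01):
    print(f"delta={delta:5.2f}  V_hat(x0)={mc_value_estimate(delta, data_budget):.4f}")
```

Repeating the sweep with different seeds should show that coarse steps bias the estimate through discretization (approximation) error, while very fine steps leave so few affordable trajectories that the statistical (variance) error dominates, echoing the trade-off the abstract describes.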
