Paper Title

Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective

Paper Authors

Dylan J. Foster, Alexander Rakhlin, David Simchi-Levi, Yunzong Xu

Paper Abstract

In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm. Are similar guarantees possible for contextual bandits? While positive results are known for certain special cases, there is no general theory characterizing when and how instance-dependent regret bounds for contextual bandits can be achieved for rich, general classes of policies. We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds. We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case. Finally, we provide structural results that tie together a number of complexity measures previously proposed throughout contextual bandits, reinforcement learning, and active learning and elucidate their role in determining the optimal instance-dependent regret. In a large-scale empirical evaluation, we find that our approach often gives superior results for challenging exploration problems. Turning our focus to reinforcement learning with function approximation, we develop new oracle-efficient algorithms for reinforcement learning with rich observations that obtain optimal gap-dependent sample complexity.
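For context, the "gap" referenced in the abstract is the standard multi-armed bandit quantity. The sketch below uses conventional notation not defined in the abstract itself (per-arm gap \(\Delta_a\), mean rewards \(\mu_a\), number of arms \(K\), horizon \(T\)) and contrasts a classical gap-dependent regret bound (UCB-type) with the minimax rate; it is background on the classical setting only, not the contextual-bandit complexity measures introduced in the paper.

\[
\Delta_a = \mu_{a^\star} - \mu_a, \qquad a^\star = \arg\max_{a} \mu_a
\]

\[
\text{instance-dependent (UCB-type):} \quad \mathrm{Reg}(T) = O\!\Big(\sum_{a \neq a^\star} \frac{\log T}{\Delta_a}\Big),
\qquad
\text{minimax:} \quad \mathrm{Reg}(T) = \Theta\big(\sqrt{KT}\big).
\]

When all gaps \(\Delta_a\) are bounded away from zero, the first bound grows only logarithmically in \(T\), which is the sense in which "easy" instances admit improved performance; the paper asks when analogous guarantees hold for contextual bandits over rich policy classes.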
