Title

Taking a hint: How to leverage loss predictors in contextual bandits?

Authors

Chen-Yu Wei, Haipeng Luo, Alekh Agarwal

Abstract


We initiate the study of learning in contextual bandits with the help of loss predictors. The main question we address is whether one can improve over the minimax regret $\mathcal{O}(\sqrt{T})$ for learning over $T$ rounds, when the total error of the predictor $\mathcal{E} \leq T$ is relatively small. We provide a complete answer to this question, including upper and lower bounds for various settings: adversarial versus stochastic environments, known versus unknown $\mathcal{E}$, and single versus multiple predictors. We show several surprising results, such as 1) the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}T^\frac{1}{4}\})$ when $\mathcal{E}$ is known, a sharp contrast to the standard and better bound $\mathcal{O}(\sqrt{\mathcal{E}})$ for non-contextual problems (such as multi-armed bandits); 2) the same bound cannot be achieved if $\mathcal{E}$ is unknown, but as a remedy, $\mathcal{O}(\sqrt{\mathcal{E}}T^\frac{1}{3})$ is achievable; 3) with $M$ predictors, a linear dependence on $M$ is necessary, even if logarithmic dependence is possible for non-contextual problems. We also develop several novel algorithmic techniques to achieve matching upper bounds, including 1) a key action remapping technique for optimal regret with known $\mathcal{E}$, 2) implementing Catoni's robust mean estimator efficiently via an ERM oracle leading to an efficient algorithm in the stochastic setting with optimal regret, 3) constructing an underestimator for $\mathcal{E}$ via estimating the histogram with bins of exponentially increasing size for the stochastic setting with unknown $\mathcal{E}$, and 4) a self-referential scheme for learning with multiple predictors, all of which might be of independent interest.
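The abstract's second technique relies on Catoni's robust mean estimator, which the paper implements efficiently via an ERM oracle. As background, here is a minimal standalone sketch of the estimator itself (not the paper's oracle-based construction): it returns the root $\mu$ of $\sum_i \psi(\alpha(x_i - \mu)) = 0$, where $\psi(x) = \mathrm{sign}(x)\log(1 + |x| + x^2/2)$ is Catoni's influence function; the function and parameter names below are illustrative choices, and the tuning of $\alpha$ is left to the user.

```python
import math

def catoni_mean(samples, alpha=0.1):
    """Catoni's robust mean estimator: the root mu of
    sum_i psi(alpha * (x_i - mu)) = 0, with Catoni's influence
    function psi(x) = sign(x) * log(1 + |x| + x^2/2).
    Found by bisection, since the sum is decreasing in mu."""
    def psi(x):
        return math.copysign(math.log1p(abs(x) + 0.5 * x * x), x)

    def score(mu):
        return sum(psi(alpha * (x - mu)) for x in samples)

    # The root lies between the min and max sample.
    lo, hi = min(samples), max(samples)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:   # root is to the right of mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Because $\psi$ grows only logarithmically, a single extreme observation shifts the estimate far less than it shifts the empirical mean, which is the property exploited for loss estimation in the stochastic setting.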
