Paper Title

Hindsight Learning for MDPs with Exogenous Inputs

Paper Authors

Sinclair, Sean R., Frujeri, Felipe, Cheng, Ching-An, Marshall, Luke, Barbalho, Hugo, Li, Jingling, Neville, Jennifer, Menache, Ishai, Swaminathan, Adith

Paper Abstract

Many resource management problems require sequential decision-making under uncertainty, where the only uncertainty affecting the decision outcomes are exogenous variables outside the control of the decision-maker. We model these problems as Exo-MDPs (Markov Decision Processes with Exogenous Inputs) and design a class of data-efficient algorithms for them termed Hindsight Learning (HL). Our HL algorithms achieve data efficiency by leveraging a key insight: having samples of the exogenous variables, past decisions can be revisited in hindsight to infer counterfactual consequences that can accelerate policy improvements. We compare HL against classic baselines in the multi-secretary and airline revenue management problems. We also scale our algorithms to a business-critical cloud resource management problem -- allocating Virtual Machines (VMs) to physical machines, and simulate their performance with real datasets from a large public cloud provider. We find that HL algorithms outperform domain-specific heuristics, as well as state-of-the-art reinforcement learning methods.
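The key insight in the abstract — that a recorded trace of exogenous variables lets past decisions be revisited in hindsight to infer counterfactual outcomes — can be illustrated with the multi-secretary problem mentioned above. The sketch below is a minimal, hypothetical illustration (the function names and the threshold baseline are assumptions for exposition, not the paper's HL algorithm): once the exogenous candidate values are known, the best-in-hindsight policy simply picks the top-k values, giving an upper bound against which any online policy can be compared.

```python
import random

def hindsight_value(values, k):
    # With the full exogenous trace known after the fact, the
    # best-in-hindsight policy accepts the k highest-valued candidates.
    return sum(sorted(values, reverse=True)[:k])

def threshold_policy_value(values, k, threshold):
    # A simple online baseline (illustrative, not the paper's HL method):
    # accept any candidate at or above the threshold while capacity remains.
    total, remaining = 0, k
    for v in values:
        if remaining > 0 and v >= threshold:
            total += v
            remaining -= 1
    return total

if __name__ == "__main__":
    random.seed(0)
    trace = [random.randint(1, 100) for _ in range(20)]  # exogenous inputs
    online = threshold_policy_value(trace, k=3, threshold=80)
    best = hindsight_value(trace, k=3)
    # The gap between the two quantifies the regret of the online policy
    # relative to the hindsight-optimal decisions on the same trace.
    print(online, best)
```

The difference `hindsight_value - threshold_policy_value` on the same exogenous trace is exactly the kind of counterfactual signal the abstract says HL exploits to accelerate policy improvement.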
