Managing Tail Risk in Online Learning: Efficiency, Safety, and AlphaGo

Online learning is a fast-growing research area in sequential decision-making. While prior work mostly focuses on achieving efficiency by minimizing expected regret, controlling the tail risk of regret to ensure safety is essential in applications such as revenue management, clinical trials, and financial investment, yet it has not been well studied. This work provides a detailed characterization of the regret distribution in online learning under the safety concern of managing tail risk.
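To fix ideas, one standard formalization from the bandit literature, stated here only for illustration (the notation may differ from that used later in the thesis), defines the pseudo-regret over T rounds and measures tail risk through the probability that the regret exceeds a threshold x:
\[
  R_T \;=\; T\,\mu^\star \;-\; \sum_{t=1}^{T} \mu_{A_t},
  \qquad
  \text{tail risk at threshold } x:\quad \mathbb{P}\!\left(R_T \ge x\right),
\]
where \(\mu^\star\) is the largest mean reward and \(A_t\) is the arm pulled at round t. "Light-tailed" then refers to this probability decaying exponentially in x, whereas "heavy-tailed" refers to decay that is only polynomial.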
In Part I, we aim to design policies that enjoy both optimal expected regret and a light-tailed regret distribution. We first find that any policy attaining the optimal instance-dependent expected regret may incur heavy-tailed regret risk. We then design a novel policy that attains the optimal worst-case expected regret and the optimal worst-case regret tail risk, with an exponentially decaying tail at the optimal rate for any regret threshold. Numerical experiments show that our design yields additional practical benefits over celebrated policies, especially when (i) the sub-optimality gap is small, or (ii) tuning policy hyper-parameters is challenging. Furthermore, we find that our policy design surprisingly coincides with the one adopted in the Monte Carlo Tree Search of AlphaGo. Our theory provides high-level insight into why this engineered solution is successful and why it should be advocated in complex decision-making environments.
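As a purely illustrative sketch, and not the policy proposed in Part I, the following Monte Carlo experiment runs standard UCB1, a celebrated policy with optimal instance-dependent expected regret, on a two-armed Bernoulli bandit with a small gap and reports the empirical tail of its regret distribution; the instance parameters, horizon, and thresholds are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def ucb1_regret(means, T):
    # Run UCB1 on a Bernoulli bandit; return the pseudo-regret sum_a gap_a * N_a(T).
    means = np.asarray(means, dtype=float)
    K = len(means)
    counts = np.ones(K)                              # each arm pulled once to initialize
    sums = (rng.random(K) < means).astype(float)     # observed reward sum per arm
    for t in range(K, T):
        ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
        a = int(np.argmax(ucb))
        sums[a] += float(rng.random() < means[a])
        counts[a] += 1
    gaps = means.max() - means
    return float(gaps @ counts)

# Small-gap instance (illustrative choice): the regime where heavy regret tails
# tend to show up for UCB-type policies.
means, T, n_runs = [0.50, 0.45], 2000, 500
regrets = np.array([ucb1_regret(means, T) for _ in range(n_runs)])

# Empirical tail P(regret > x): a slowly decaying curve here is the signature of
# the heavy-tailed (polynomial rather than exponential) risk discussed in Part I.
for x in [20, 40, 80]:
    print(f"P(regret > {x:>2}) ~ {np.mean(regrets > x):.3f}")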
In Part II, we study the optimal trade-off between the expectation and the tail risk of the regret distribution. We fully characterize the interplay among three desired properties of policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. All of our results extend to the stochastic linear bandit setting.
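For concreteness, one common way to formalize these three properties for a K-armed bandit over horizon T, stated here only to fix ideas (the thesis may use different constants or definitions), is:
\[
  \text{worst-case optimality:}\quad \sup_{\nu}\ \mathbb{E}\!\left[R_T(\nu)\right] = \tilde{O}\!\left(\sqrt{KT}\right),
\]
\[
  \text{instance-dependent consistency:}\quad \mathbb{E}\!\left[R_T(\nu)\right] = o\!\left(T^{\alpha}\right)\ \text{for every instance } \nu \text{ and every } \alpha > 0,
\]
\[
  \text{light-tailed risk:}\quad \mathbb{P}\!\left(R_T \ge x\sqrt{KT}\right) \le C\,e^{-c x}\ \text{for all } x \ge 0 \text{ and some constants } C, c > 0.
\]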