Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs
Related documents
It could be seen as a form of architecture designing, from the most general purpose of automated machine learning (AutoML, see [25]) to the problem of aggregation and design
Keywords: online learning, online combinatorial optimization, semi-bandit feedback, follow the perturbed leader, improvements for small losses, first-order
We show in Theorem 2.1 that the Bernstein Online Aggregation (BOA) and Squint algorithms achieve a fast rate with high probability. The theorem also provides a quantile bound
Regret lower bounds and extended Upper Confidence Bounds policies in stochastic multi-armed bandit problem...
If Hannan consistency can be achieved for this problem, then there exists a Hannan consistent forecaster whose average regret vanishes at rate $n^{-1/3}$. Thus, whenever it is possible
Our main result is to bound the regret experienced by algorithms relative to the a posteriori optimal strategy of playing the best arm throughout based on benign assumptions about
This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret:
Such methods were proved to satisfy sharp sparsity oracle inequalities (i.e., with leading constant C = 1), either in the regression model with fixed design (Dalalyan and