
Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs


Academic year: 2021


Full text

Figures

Figure 1: The N-state Ergodic RiverSwim MDP
Table 1: Comparison of span and variance for S-state Ergodic RiverSwim.
Figure 2: The MDP M′ for lower bound (Jaksch et al., 2010)
Figure 3: The composite MDP M (Jaksch et al., 2010)


Related documents

It could be seen as a form of architecture designing, from the most general purpose of automated machine learning (AutoML, see [25]) to the problem of aggregation and design

Keywords: online learning, online combinatorial optimization, semi-bandit feedback, follow the perturbed leader, improvements for small losses, first-order

We show in Theorem 2.1 that the Bernstein Online Aggregation (BOA) and Squint algorithms achieve a fast rate with high probability. The theorem also provides a quantile bound

Regret lower bounds and extended Upper Confidence Bounds policies in stochastic multi-armed bandit problem...

If Hannan consistency can be achieved for this problem, then there exists a Hannan consistent forecaster whose average regret vanishes at rate $n^{-1/3}$. Thus, whenever it is possible

Our main result is to bound the regret experienced by algorithms relative to the a posteriori optimal strategy of playing the best arm throughout based on benign assumptions about

This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret:

Such methods were proved to satisfy sharp sparsity oracle inequalities (i.e., with leading constant C = 1), either in the regression model with fixed design (Dalalyan and