Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

Daniil Tiapkin; Evgenii Chzhen; Gilles Stoltz

Preprint, Working Paper. Year: 2024

Daniil Tiapkin

Role: Author
PersonId: 1398059

Centre de Mathématiques Appliquées - Ecole Polytechnique

Laboratoire de Mathématiques d'Orsay

Evgenii Chzhen

Role: Author
PersonId: 169831
IdHAL: echzhen
ORCID: 0009-0003-3065-4267

Statistique mathématique et apprentissage

Laboratoire de Mathématiques d'Orsay

Gilles Stoltz

Role: Author
PersonId: 738739
IdHAL: gilles-stoltz
ORCID: 0000-0003-1240-1007
IdRef: 091575419

Laboratoire de Mathématiques d'Orsay

Statistique mathématique et apprentissage

Abstract

In this paper, we consider the problem of learning in adversarial Markov decision processes (MDPs) with an oblivious adversary in a full-information setting. The agent interacts with an environment over $T$ episodes, each of which consists of $H$ stages, and each episode is evaluated with respect to a reward function that is revealed only at the end of the episode. We propose an algorithm, called APO-MVP, that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$, where $S$ and $A$ are the sizes of the state and action spaces, respectively. This result improves upon the best-known regret bound by a factor of $\sqrt{S}$, bridging the gap between adversarial and stochastic MDPs, and matching the minimax lower bound $\Omega(\sqrt{H^3SAT})$ as far as the dependencies on $S$, $A$, and $T$ are concerned. The proposed algorithm and its analysis completely avoid the typical tool given by occupancy measures; instead, the algorithm performs policy optimization based only on dynamic programming and on a black-box online linear optimization strategy run over estimated advantage functions, making it easy to implement. The analysis leverages two recent techniques: policy optimization based on online linear optimization strategies (Jonckheere et al., 2023) and a refined martingale analysis of the impact on values of estimating transition kernels (Zhang et al., 2023).
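To make the phrase "policy optimization based only on dynamic programming and on a black-box online linear optimization strategy run over advantage functions" more concrete, here is a minimal sketch. It is not the authors' APO-MVP: it assumes the transition kernel `P` is known (the paper estimates it and adds a refined analysis of that estimation), it picks exponential weights as one possible online linear optimization strategy, and every name (`evaluate_policy`, `run_policy_optimization`, `eta`) is hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def evaluate_policy(P, r, pi):
    """Backward induction (dynamic programming) for a fixed policy.

    P:  transitions, shape (H, S, A, S); ASSUMED known in this sketch.
    r:  reward function of one episode, shape (H, S, A).
    pi: policy, shape (H, S, A), each pi[h, s] a distribution over actions.
    Returns Q of shape (H, S, A) and V of shape (H, S).
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))  # V[H] = 0: no reward after the last stage
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Q[h] = r[h] + P[h] @ V[h + 1]      # (S, A, S) @ (S,) -> (S, A)
        V[h] = (pi[h] * Q[h]).sum(axis=1)  # expectation over actions
    return Q, V[:H]


def run_policy_optimization(P, rewards, eta):
    """Exponential weights run independently at each (stage, state) pair,
    fed with the advantage functions of the successive episodes."""
    H, S, A, _ = P.shape
    cum_adv = np.zeros((H, S, A))
    pi = softmax(eta * cum_adv)  # uniform initial policy
    for r in rewards:            # r is revealed only after the episode
        Q, V = evaluate_policy(P, r, pi)
        cum_adv += Q - V[:, :, None]  # advantage A(h, s, a) = Q - V
        pi = softmax(eta * cum_adv)
    return pi
```

The sketch works with policies directly and never builds occupancy measures, which is the design point the abstract emphasizes; the learning rate `eta` is left as a free parameter here rather than tuned as in any regret analysis.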

Domains

Machine Learning [cs.LG]

Main file

TCS--RL-dependency-S.pdf (345.9 KB)

Origin: Files produced by the author(s)

Contributor: Gilles Stoltz

https://hal.science/hal-04636422

Submitted on: Friday, July 5, 2024, 11:02:18

Last modified on: Wednesday, July 10, 2024, 03:27:35

Dates and versions

hal-04636422, version 1 (05-07-2024)

Identifiers

HAL Id: hal-04636422, version 1

Cite

Daniil Tiapkin, Evgenii Chzhen, Gilles Stoltz. Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization. 2024. ⟨hal-04636422⟩

Collections

X CNRS INRIA INSMI X-CMAP X-DEP-MATHA LM-ORSAY CMAP INRIA2 UNIV-PARIS-SACLAY IP_PARIS GS-MATHEMATIQUES GS-COMPUTER-SCIENCE
