Score vs. Winrate in Score-Based Games: Which Reward for Reinforcement Learning?

Parton, M.; Morandin, F.; Metta, C.; Fantozzi, M.
2022

Abstract

In recent years, DeepMind's algorithm AlphaZero has become the state of the art for efficiently tackling perfect-information two-player zero-sum games with a win/lose outcome. However, when the win/lose outcome is decided by a final score difference, AlphaZero may play score-suboptimal moves, because all winning final positions are equivalent from the win/lose perspective. This can be an issue, for instance, when the agent is used for teaching, or when trying to understand whether a better move exists. Moreover, there is the theoretical quest for the perfect game. A naive approach would be to train an AlphaZero-like agent to predict score differences instead of win/lose outcomes. Since the game of Go is deterministic, this should produce outcome-optimal play as well. However, it is a folklore belief that "this does not work". In this paper we first provide empirical evidence for this belief. We then give a theoretical interpretation of this suboptimality in a general perfect-information two-player zero-sum game, where the complexity of a game like Go is replaced by randomness of the environment. We show that an outcome-optimal policy has a different preference for uncertainty depending on whether it is winning or losing: in particular, when in a losing state, an outcome-optimal agent chooses actions leading to a higher variance of the score. We then posit that, when approximation is involved, a deterministic game behaves like a nondeterministic one, with the score variance modeling how uncertain the position is. We validate this hypothesis in an AlphaZero-like software with the aid of a human expert.
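To make the variance-preference claim concrete, here is a minimal sketch (ours, not from the paper) assuming a Gaussian score model: if an action leads to a final score difference distributed as N(mu, sigma^2), the win probability is Phi(mu / sigma). The helper win_probability and all numeric values below are hypothetical illustrations.

```python
import math

def win_probability(mu: float, sigma: float) -> float:
    """P(final score difference > 0) when the score is modeled as N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))

# Two hypothetical actions with the same expected score but different spread.
SIGMA_LOW, SIGMA_HIGH = 2.0, 10.0

for mu, state in [(-3.0, "losing"), (3.0, "winning")]:
    p_low = win_probability(mu, SIGMA_LOW)
    p_high = win_probability(mu, SIGMA_HIGH)
    best = "high-variance" if p_high > p_low else "low-variance"
    print(f"{state} (mu = {mu:+.0f}): P(win | sigma={SIGMA_LOW}) = {p_low:.3f}, "
          f"P(win | sigma={SIGMA_HIGH}) = {p_high:.3f} -> prefers the {best} action")
```

A score-maximizing agent is indifferent between the two actions (equal mean), while an outcome-optimal agent's preference flips with the sign of its expected lead: behind (mu < 0) it gains win probability from high variance, ahead (mu > 0) from low variance. This is exactly the asymmetry the abstract describes.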
Score vs. Winrate in Score-Based Games: Which Reward for Reinforcement Learning? / Pasqualini, L.; Parton, M.; Morandin, F.; Amato, G.; Gini, R.; Metta, C.; Fantozzi, M.; Marchetti, A. - ELECTRONIC. - (2022), pp. 573-578. (Paper presented at the 21st IEEE International Conference on Machine Learning and Applications (ICMLA), held in Nassau, 2022) [10.1109/ICMLA55696.2022.00099].

Use this identifier to cite or link to this document: https://hdl.handle.net/11381/2953312