Knowledge Tree

thesis

Improvements and Evaluation of the Monte Carlo Tree Search Algorithm

Defense date:

Jan. 1, 2009

Edit

Institution:

Paris 11

Disciplines:

Computer sciences

Authors:

Arpad Rimmel

Directors:

Antoine Cornuéjols

Abstract EN:

My thesis deals with planification in a discrete environment with finite horizon and with a number of states too large to be explored entirely. The goal is to maximize a reward function that associates a value to final states. This thesis focuses on particular on improving and evaluating a new algorithm: bandit-based Monte Carlo tree search. After presenting the state of the art (Minimax and Alphabeta for the two-players case; nested Monte Carlo and Dynamic Programing for the one-player case), I describe the principle of the algorithm. Then, I propose an efficient parallelization method for the case of separated memories. This method can be combined with classical parallelization methods for shared memories. I propose also a way of constructing an opening book and show its efficiency in the concrete case of the game of Go. I introduce also several ways of using expert knowledge, in the part concerning bandits as well as in the Monte Carlo part. Finally, I show that this algorithm that gives very good results in the context of two-players applications is also efficient in a one-player context. I propose an adaptation of the algorithm in order to handle graphs and use a different bandit formula in order to solve the problem of the automatic generation of linear transforms libraries. I obtain results much better than by using a classical dynamic programming algorithm.

Abstract FR:

Ma thèse se situe dans le contexte de la planification à horizon fini en environnement discret avec un nombre d'états trop important pour qu'ils soient tous explorés. L'objectif est de maximiser une fonction de récompense qui associe une valeur aux états finaux. Cette thèse est en particulier centrée sur l'amélioration et l'étude d'un nouvel algorithme: l'exploration d'arbre basée sur une formule de bandit avec évaluation Monte Carlo. Après avoir présenté les algorithmes de référence du domaine (Minimax et Alphabéta dans le cas deux joueurs; Nested Monte Carlo et Programmation Dynamique dans le cas un joueur), je décris le principe de l'algorithme. Puis je propose une méthode de parallélisation efficace pour le cas ou la mémoire est séparée. Cette méthode se combine avec des méthodes classiques de parallélisation à mémoire partagée. Je propose ensuite une méthode pour construire une base d'ouverture et montre son efficacité dans le cadre concret du jeu de Go. J'introduis également plusieurs manières d'utiliser des connaissances expertes, aussi bien dans la partie concernant les bandits que dans la partie Monte Carlo. Finalement, je montre que cet algorithme qui donne de très bons résultats dans le cadre des applications à deux joueurs est également efficace dans un cadre à un joueur. En effet, je propose une adaptation de l'algorithme pour le cas des graphes et en utilisant une formule de bandit différente afin de résoudre le problème concret de la génération automatique de librairies de transformations linéaires. J'obtiens des résultats nettement supérieurs à ceux obtenus avec une méthode classique de programmation dynamique.