# 马尔科夫决策过程的Monte Carlo规划入门

基于模型计算策略的方法称为规划（planning）。相对模型学习而言，规划是计算的瓶颈(Walsh, Goschin, & Littman, 2010)。传统基于模型的规划方法有动态规划，但动态规划需要遍历所有的状态-动作对的值函数，对于大规模连续系统，这通常是难以实现的。另一种思路是：为保证实时性要求，即便只能获得次优的策略也是可以接受的。如稀疏采样(Kearns, Mansour, & Ng, 1999)、和 Monte Carlo 树搜索(MCTS)(Chaslot, 2010)，以及改进的MCTS如UCT(Kocsis & Szepesvári, 2006)。我在机器人实验中的规划方法采用UTC。

UCT文章和代码参考：

http://rddlsim.googlecode.com/svn/!svn/bc/182/trunk/src/rddl/solver/mdp/uct/UCT.java

一个不错的入门介绍：Monte-Carlo Planning

A revolution in planning originated in MCTS algorithms. Below an introduction and a survey (oriented to popularization).

**Introduction (video / slides) in french**: http://www-c.inria.fr/Internet/rendez-vous/modele-et-algo/prise-de-decision-sequentielle-dans-l-incertain-application-au-jeu-de-go

**Slides in english: www.lri.fr/~teytaud/nicta.**pdf

MCTS/UCT are algorithms for taking decisions in uncertain environments. The principles are:

- iteratively growing tree;

- random simulations.

Algorithmically, in the case of Go (you can generalize easily to other problems):

- a. First, you have in memory a
**tree with only one node**, which is the current state of the Goban.

- b. Then,
**start at the root, and find a path until a leaf**; for each node, you choose the son which has highest exploration/exploitation value: the exploration/exploitation value is typically the sum of- the exploitation term: the percentage of won simulations from this son.
- the exploration term: the square root of the ratio log(nb simulations of the parent) / (nb of simulations of this son)

- c. Then,
**randomly play from the leaf until the game is over**. - d. Then,
**update the statistics in the tree**: update the number of simulations in each node, and the number of wins for black/white. - e. If there is time left for thinking,
**go back to b.** - f. Play the
**most simulated move**from the root.(this is UCT, the clearest version of MCTS; there are plenty of variations)

1.

**Main tricks in MCTS / UCT**MCTS and UCT refer to the same family of algorithms. The main initial publications are

Kocsis et al (ECML 06), Coulom 06, Chaslot et al 06.A strong improvement in the Monte-Carlo part, showing in particular that the Monte-Carlo should not "only" play well but also keep diversity, is "fill board"; http://hal.inria.fr/inria-00117266

There are several works around the parallelization; Cazenave, Chaslot et al,, Gelly et al http://hal.inria.fr/inria-00287867/en/.

For the case of Go, but probably with more importance than just Go, the importance of preserving diversity has been emphasized in a post on the computer-Go mailing-list for the case of Nakade;

Some Meta-level MCTS was proposed, with more or less good results; this was for example used for designing an opening book, better than no opening book at all, but less efficient than handcrafted opening books in spite of years of computational power.

Please notice that in spite of the optimism of an IEEE spectrum publication, classical techniques like alpha-beta did not succeed for the game of Go ("cracking go", 2007).

2. **Links between MCTS and other existing algorithms**

A* is a famous algorithm for e.g. shortest path. The idea consists in developing a tree of possible futures; the algorithm iteratively expands the non-expanded node with maximum "score", where the score is an estimate of the distance to the optimum.

The usual theoretical analysis of A* shows that it is "optimal" when the score is optimistic: it should (in a minimization problem) be a lower bound of the real value of the situation.

Optimism, a general view on MCTS/UCT, is exactly this. The interesting paper [Hren and Munos 08] http://www.springerlink.com/content/mu7227g225512347/ shows theoretical properties of "optimism" in the case of a proved bound.

UCT also iteratively expands the node with highest score. Usually, the score is computed also for already expanded nodes, but this is, I think, only for commodity of development.

Remark: A* has the notion of closed set, which is not used in usual MCTS implementations, whereas it makes sense. However, this is done in [De Mesmay et al 09] http://www.lri.fr/~rimmel/publi/TAG.pdf is an example in which there is an equivalent of "closed set".

3. **Early publications related to MCTS/UCT**

I like the reference Peret Garcia ECAI03, which was the first one in which I saw (out of games) the idea of locally developping a tree for the situation at which a decision should be taken. This idea might be considered as trivial for people working in games; it is not in areas in which the standart is Bellman's dynamic programming and variants.

I also like papers which try to have some learning from some branches to other branches, typically by evolutionary-like algorithms:

Fogel DB, Hays TJ, Hahn SL, and Quon J (2004) “A Self-Learning Evolutionary Chess Program,” Proceedings of the IEEE, Vol. 92:12, pp. 1947-1954.

whenever for the moment these techniques do not work in the case of the game of Go.

4. **Results in computer-Go: computers vs humans**

**In 9x9 Go:**

2007:win against a pro(5p)9x9(blitz) MoGo (Amsterdam)

2008:win against a pro(5p)9x9 MoGo (Paris)

2009:win against a pro(5p)9x9 as black MoGo (Rennes)

2009: win against a top pro (9p) 9x9 Fuego as white (Korea)

**In 13x13 Go,**MoGoTW won games against 6D players with H2 (2010) and MoGoTW won a game against a 5P player with handicap 2.5. (2011)

**In 19x19 Go:**

1998: ManyFaces looses with 29 stones against M. Mueller!

2008:win against a pro(8p)19x19,H9 MoGo (Portland)

2008:win against a pro(4p)19x19,H8 CrazyStone (Tokyo)

2008:win against a pro(4p)19x19,H7 CrazyStone (Tokyo)

2009:win against a pro(9p)19x19,H7 MoGo (Tainan), reproduced by Pachi in 2011 (IEEE SSCI)

2009:win against a pro(1p)19x19,H6 MoGo (Tainan)

0抢沙发 分享