Geordie Rose’s Post

Hey RL friends, I've got a zero-sum two-player game where terminal states evaluate to either win, lose, or draw. I've got an MCTS implementation that, from state S, runs a random rollout policy and accumulates the probability of winning for each allowed action a from that state. Question: do the probabilities p(win|S, a) under the random policy (the typical MCTS thing) converge to the quality Q(S, a) under an optimal policy?
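
For concreteness, here is a minimal sketch of the estimator described in the post: pure random rollouts from each legal action, with win frequencies accumulated per action. The game interface (legal_actions, play, is_terminal, winner) is a hypothetical placeholder for illustration, not something from the post.

```python
import random

def rollout_win_probabilities(game, state, player, num_rollouts=1000):
    """Estimate p(win | S, a) for each legal action a by pure random self-play.

    `game` is assumed to expose legal_actions(state), play(state, action),
    is_terminal(state), and winner(state); this interface is hypothetical.
    """
    estimates = {}
    for action in game.legal_actions(state):
        wins = 0
        for _ in range(num_rollouts):
            s = game.play(state, action)
            # Both players act uniformly at random until the game ends.
            while not game.is_terminal(s):
                s = game.play(s, random.choice(game.legal_actions(s)))
            if game.winner(s) == player:
                wins += 1
        estimates[action] = wins / num_rollouts
    return estimates
```

Note that this is flat Monte Carlo evaluation at the root rather than full MCTS with tree growth and a selection rule; that distinction is exactly what the replies below turn on.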

Olivier Vincent

CEO and co-founder at Autozen

1mo

No, p(win|S, a) under a random policy in MCTS does not converge to Q(S, a) under an optimal policy. Random rollouts reflect the outcomes of random play, not optimal play. To get closer to Q(S, a), use heuristic or learned rollout policies that better approximate good play; improving the simulation quality brings p(win|S, a) closer to Q(S, a).
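
As a sketch of that suggestion, the rollout step can take a pluggable policy instead of always moving uniformly at random (the game interface and policy signature here are illustrative assumptions, matching the sketch above):

```python
import random

def simulate(game, state, policy=None):
    """Play one game to completion and return the winner.

    `policy(game, state) -> action` can be a hand-written heuristic or a
    learned model; passing None falls back to uniform-random play. Better
    rollout policies make the resulting win estimates closer to values
    under good play. (Interface names are assumptions for this sketch.)
    """
    while not game.is_terminal(state):
        if policy is None:
            action = random.choice(game.legal_actions(state))
        else:
            action = policy(game, state)
        state = game.play(state, action)
    return game.winner(state)
```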

Fauwial Khan

Data Scientist | Economist | USDOT ML Researcher | Engineer

2mo

Geordie Rose In theory, if MCTS were run with infinite simulations and a perfect exploration strategy, it would asymptotically approximate the optimal policy. However, this is impractical in most real-world scenarios. The raw probabilities from random rollouts shouldn't be expected to directly converge to optimal Q-values. The strength of MCTS comes from its ability to focus computational resources on promising lines of play through iterative refinement, rather than from the direct estimation of optimal action values through random sampling.
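
For reference, the "focus computational resources on promising lines" part usually comes from UCT's UCB1 selection rule inside the tree; a minimal sketch, assuming tree nodes with visits, wins, and children fields (names are illustrative, not from the comment):

```python
import math

def ucb1_score(parent_visits, child_wins, child_visits, c=math.sqrt(2)):
    """UCB1: exploit the observed win rate, explore rarely visited children."""
    if child_visits == 0:
        return float("inf")  # force each child to be tried at least once
    exploit = child_wins / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

def select_child(node):
    """Pick the child maximizing UCB1; repeated selection concentrates
    simulations on promising actions while still exploring the rest."""
    return max(node.children,
               key=lambda ch: ucb1_score(node.visits, ch.wins, ch.visits))
```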

Jorge Ramírez

Ph. D. in Automatic Control • Computer Vision • Reinforcement Learning • Robotics

2mo

There are some convergence theorems, and most of them reach the same conclusion. To ensure convergence in RL algorithms, certain conditions must be met. One of these concerns the learning rate (step size). Specifically:
- The sum of the learning rates over time must be infinite.
- The sum of the squares of the learning rates must be finite.
These conditions ensure that, over time, the algorithm can still make significant adjustments (the first condition) while those adjustments become smaller and more stable (the second condition). On top of this, it is necessary to visit all possible states and actions infinitely often, no matter the policy generating the experience. This follows from the law of large numbers, which guarantees that, by repeatedly visiting all states and actions, the estimates of transition probabilities and rewards approach their true values.
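
These are the standard Robbins-Monro step-size conditions from stochastic approximation; a minimal statement in LaTeX (the symbol alpha_t for the learning rate at step t is my notation, not from the comment):

```latex
% Robbins-Monro step-size conditions (alpha_t = learning rate at step t)
\sum_{t=1}^{\infty} \alpha_t = \infty
\qquad\text{and}\qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty
% Example: \alpha_t = 1/t satisfies both: the harmonic series diverges
% while \sum_t 1/t^2 converges.
```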

Wessam Hamid

Robotics Engineer @ Embedded LLM • I design robots

2mo

In a zero-sum two-player game, the probabilities p(win|S, a) obtained from random rollouts in MCTS can be useful approximators of Q(S, a), but the distinction matters. If you only average random rollouts from (S, a) without growing a search tree, then no matter how many simulations you run, the estimate converges to the probability of winning when both players move at random, not to the optimal Q(S, a). Full MCTS, however, focuses simulations on promising actions and replaces rollout estimates with tree values as the search deepens, which is what lets it approximate the optimal policy π(S). Theoretically, with infinite compute and a sound exploration rule (e.g. UCT), the action-value estimates at the root do converge to the minimax values, i.e. to Q(S, a) under optimal play, even though the leaf rollouts themselves are random. In real-world scenarios with finite resources, though, the raw probabilities from random rollouts should not be expected to directly match optimal Q-values.
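
To pin down the two quantities being compared here, a short formulation; the notation, and the assumption of a deterministic game with terminal outcomes, are mine rather than from the thread:

```latex
% What a plain average of N purely random rollouts from (S, a) estimates:
\hat{p}_N(\text{win} \mid S, a)
  \;\xrightarrow[N \to \infty]{}\;
  \Pr\big(\text{win} \mid S, a;\ \text{both players uniformly random}\big)

% The optimal (minimax) action value, with T(S, a) the successor state:
Q^{*}(S, a) = V^{*}\!\big(T(S, a)\big),
\qquad
V^{*}(s) =
\begin{cases}
  \mathrm{outcome}(s)      & \text{if } s \text{ is terminal} \\
  \max_{a'} Q^{*}(s, a')   & \text{if it is our move at } s \\
  \min_{a'} Q^{*}(s, a')   & \text{if it is the opponent's move at } s
\end{cases}

% These two quantities differ in general; UCT-style MCTS converges to the
% second because tree values gradually replace the random-rollout estimates.
```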
