Ludii Forum
Biased MCTS/Expert Iteration questions - Printable Version




Biased MCTS/Expert Iteration questions - fuerchter - 09-06-2022

Hi,
I have some questions about the expert-iteration CLI if you don't mind me asking.
I'm currently working with a game (deterministic, two-player, perfect information, as you might expect) where I've found MC-GRAVE to be the strongest currently available AI algorithm (I'd be willing to send the .lud via PM or e-mail if you're interested; I'm not confident enough about my design abilities to post it publicly). MC-GRAVE is already doing great, but I wanted to see if I could use the Polygames bridge or Biased MCTS to get even better results. I've fully read through (though not necessarily understood) your 2019 paper and skimmed your 2020 paper on this.

After playing around with the CLI, I've not been able to reach the results I was hoping for (the trained agents only reach maybe a 25-30% win rate against MC-GRAVE), so I have a few questions:
1. If I generate training data, and play with a default game option, for example:
Code:
(item "5" <5> "Board size 5")*
Should this work or do I still need the "(useFor)" ludeme?
2. Should I be using a heuristic (value learning?) before I start training? In my particular case there are two ways the game can end (a score threshold and a second way).
3. Should I be using both "selectionFeatures" and "playoutFeatures"? If not, which one should I use over the other?
4. Should I be using "Biased MCTS" or "Biased MCTS (Uniform Playouts)" to later play with my data? I've read their explanation in the User Guide and some of the code comments, but am not sure I understand. Am I correct in thinking that Uniform Playouts doesn't use "playoutFeatures"? If I just go by which is used more in def_ai I would guess Uniform Playouts works better generally?
5. Should I be using TSPG (--train-tspg?) or WED (--is-episode-durations?) as mentioned in the papers? For TSPG, what do the values after "Effective Params:" in the final weights file mean?
6. Can MC-GRAVE be used for "--expert-ai"? This line seems to imply that this is not the case.
7. Finally, if none of these yields new insight, would just cranking up "--num-games" and "--thinking-time" work?

Here are the three ways of training I've tried so far (with some combinations of selectionFeatures and playoutFeatures in the final .lud):

Code:
java -jar Ludii-1.3.6.jar --expert-iteration --game "Game.lud" -n 300 --out-dir "ExIt"   # with (heuristics {(score)})
java -jar Ludii-1.3.6.jar --expert-iteration --game "Game.lud" -n 300 --out-dir "ExIt" --no-value-learning
java -jar Ludii-1.3.6.jar --expert-iteration --game "Game.lud" -n 300 --out-dir "ExIt" --no-value-learning --train-tspg



RE: Biased MCTS/Expert Iteration questions - DennisSoemers - 09-06-2022

1. That will work fine; there's no need for the (useFor) thing in metadata if you're playing with the default options for your game.

2. I'd recommend using `--no-value-learning`, i.e. not training a heuristic-based value function. Training one is only useful if you're going to use it afterwards, which our standard Biased MCTS doesn't. We do have some MCTS variants that can use value functions, and maybe that could work well for your game, but it varies greatly from game to game. In some games it is easy to express a useful simple heuristic, and this can help the MCTS. In other games it is not easy to express a useful simple heuristic, or it adds too much computational overhead, and then it does not help the MCTS.

3. If you can, it would be best to use both the selection and the playout features. Actually, the features themselves are the same; only the weights can be different. The `(featureSet ...)` metadata item does allow you to specify both, if you want. If you don't want to / can't, stick to just the selection weights. Selection weights work well in both the Selection and Playout phases of MCTS. Playout weights can be slightly better in the Playout phase, but have a greater risk of being bad in the Selection phase. The distinction between the two is really quite subtle, so it's really not too bad to just stick to the Selection weights. The distinction is also fairly new (I haven't gotten around to writing about the difference in any publications yet); all the publications you've seen use only the Selection weights.

4. Biased MCTS uses features to guide both the Selection and the Playout phases of MCTS. Biased MCTS (Uniform Playouts) plays uniformly random playouts, only using features in the Selection phase. Which one is best can again vary very much from game to game. Typically it depends on how "fast" or "slow" your game is (in terms of how quickly Ludii can run it, which is not necessarily the same as the expected number of moves per game, though there is often some correlation). If Ludii can run your game very quickly, using features adds a relatively large computational overhead, and uniform playouts may be better. If the game is already very slow to run to begin with, the computational overhead of features is (relatively speaking) low, and then they are more likely to help in playouts.

5. There is probably no point in using TSPG for you. Based on the results we got in the CoG 2019 paper, I never directly use those weights in a Biased MCTS (because in playouts it doesn't help, and in selection it really hurts). I have been using them for other research goals though (very fast standalone playing based purely on features without any search at all, or identifying features that are interesting to explain to humans). I do recommend using WED in general, since it generally helps (maybe just a little bit).

6. No, MC-GRAVE cannot serve directly as the expert. If you have a game where MC-GRAVE is remarkably strong, it might help to try `--tournament-mode`, though. But test it with a short training run first so you don't waste too much time; I don't think I've used it in a long time myself, so it might also just crash. If it still works, it enables a tournament mode similar to the one used by Polygames, where we keep a larger population of many agents and draw from them in self-play. In that mode, I also add a plain UCT and a plain MC-GRAVE to the population, and if MC-GRAVE is indeed very good, it should get picked relatively often to generate experience.

7. Raising `--thinking-time` is a relatively good bet for increasing performance, yes, at the cost of increased training time. Raising `--num-games`... *probably* also improves performance, but technically there is also a risk of decreasing performance. This is because I just keep adding new features throughout the entire process, so more games results in strictly more features. More features can be better, but there is a risk of them also being worse (due to increased computational overhead).
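As a rough sketch (the output directory name and the exact numbers here are just placeholders; good values depend entirely on your game and hardware), scaling up could simply mean bumping those two arguments on the kind of command you already ran:

Code:
java -jar Ludii-1.3.6.jar --expert-iteration --game "Game.lud" -n 600 --thinking-time 2 --out-dir "ExItLong" --no-value-learning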


In general, I would recommend a command line like this (if I have no knowledge of which game I'm working with at all and just care about what is most likely to produce something decent):

Code:
java -jar Ludii-1.3.6.jar --expert-iteration --game "Game.lud" -n 300 --out-dir "ExIt" --no-value-learning --game-length-cap 1000 --thinking-time 1 --iteration-limit 12000 --wis --playout-features-epsilon 0.5 --checkpoint-freq 5 --num-agent-threads 3 --num-feature-discovery-threads 2 --special-moves-expander-split --handle-aliasing --is-episode-durations --prioritized-experience-replay

Things I added:
* --game-length-cap 1000: Slightly lowering the default maximum number of moves in Ludii before a game is declared a draw
* --iteration-limit 12000: In most games the MCTS won't be able to hit this many iterations before its thinking time per move (default 1 second) elapses anyway, so then this doesn't matter. But in extremely simple games, this lets us stop the MCTS after 12000 iterations. This can be a speedup, but it also stops the MCTS expert distributions from becoming excessively deterministic.
* --wis: Use weighted importance sampling instead of ordinary importance sampling for WED/PER (see CoG 2020 paper)
* --playout-features-epsilon 0.5: This actually interpolates between Biased MCTS and Biased MCTS (Uniform Playouts): for every action in a playout, it uses the features with probability 0.5 and plays uniformly at random otherwise.
* --checkpoint-freq 5: Don't need to fill up my directory with too many files
* --num-agent-threads 3: Use 3 threads for the MCTS agents (running iterations in parallel) during self-play. Can of course adjust based on your hardware
* --num-feature-discovery-threads 2: Use 2 threads for computing which new feature to add after every game of self-play. Can again adjust based on hardware, but it's pointless to raise this beyond the number of players in a game (so keep it at 1 or 2 for a 2-player game).
* --special-moves-expander-split: New thing, didn't get around to publishing anywhere yet. On average I've found this to be slightly helpful, but not for all games. If you have a game for which all win conditions are extremely difficult to express in small local patterns (like Hex, where you have a very "global" win condition that spans the entire board), you should probably leave this off.
* --handle-aliasing: Generally (slightly) helpful. Not really described in detail in any publications yet.
* --is-episode-durations: This is WED (but important to also add --wis as mentioned above, because otherwise it's using ordinary importance sampling instead of weighted).
* --prioritized-experience-replay: From same paper as WED.


RE: Biased MCTS/Expert Iteration questions - fuerchter - 09-07-2022

Hey Dennis,
thank you very much for the detailed reply (I do really appreciate it!).
With what I trained yesterday (your suggested command with slight changes, e.g. --thinking-time 3 and --tournament-mode), I managed to reach a ~37% win rate. I might raise --thinking-time even further, but we'll see whether that takes too long. In 7. you say that for --num-games there is also a risk of decreasing performance. This also implies that I can't just pick the last checkpoint of a training run and it will be the strongest right away, right? If so, is there a good way to find the best checkpoint (other than trial and error)? Are the Elos and Pick Counts from the log at all meaningful for this? Also just to clarify, this data is with respect to the trained agent (not UCT or MC-GRAVE) at each checkpoint, right?

Two further questions I have:
Can you resume a training run after pausing/interrupting it? Perhaps via "--best-agents-data-dir" (if so, what needs to be in this directory)?
Assuming the agents I'm training eventually surpass MC-GRAVE, is there some way to compare them amongst each other, aside from launching two instances and entering moves manually? I'm guessing this one is related to this thread.


RE: Biased MCTS/Expert Iteration questions - DennisSoemers - 09-07-2022

Quote:In 7. you say that for --num-games there is also a risk of decreasing performance. This also implies that I can't just pick the last checkpoint of a training run and it will be the strongest right away, right? If so, is there a good way to find the best checkpoint (other than trial and error)?


It does imply that, yes. There's not really a better way than picking a few different checkpoints (for example at 1, 50, 100, 200, ... however many you like) and making them play against each other or against some fixed baseline.


Quote:Are the Elos and Pick Counts from the log at all meaningful for this? Also just to clarify, this data is with respect to the trained agent (not UCT or MC-GRAVE) at each checkpoint, right?


It also has pick counts and estimated Elo ratings for UCT and MC-GRAVE, and below that, long lists for all the checkpoints. These are meaningful in the sense that the tournament mode uses a softmax over the Elo estimates to create the distribution from which it samples agents to play self-play games. So you'll see the ones with higher Elos also having slightly higher pick counts. But all the pick counts are really low, and many of them are just 0, so there isn't a lot of data on which the Elo rating estimates are based; it's probably not a great way of picking the best checkpoint/agent. They're very rough estimates based on very little data.


Quote:Can you resume a training run after pausing/interrupting it? Perhaps via "--best-agents-data-dir" (if so, what needs to be in this directory)?


It should actually resume training from where it left off if you use the same command line arguments (or similar ones) as before, specifically if the --out-dir is the same. It should see that files it needs to resume (like the experience buffer, and all the checkpoints of features and weights) are already there and use them.
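So, as a sketch, if you started training with (for example) the second command from your first post, resuming should just be a matter of re-running that same command, pointed at the same --out-dir:

Code:
java -jar Ludii-1.3.6.jar --expert-iteration --game "Game.lud" -n 300 --out-dir "ExIt" --no-value-learning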


Quote:Assuming the agents I'm training eventually surpass MC-GRAVE, is there some way to compare them amongst each other, aside from launching two instances and entering moves manually?


I normally evaluate the playing strength of my agents using the command line as well, not through anything in the GUI. You can use something like:

Code:
java -jar Ludii-1.3.6.jar --eval-agents --game "Game.lud" -n 100 --out-dir "Eval" --thinking-time 1 --agents "algorithm=MCTS;selection=ag0selection;playout=softmax,policyweights1=ExIt/PolicyWeightsPlayout_P1_00201.txt,policyweights2=ExIt/PolicyWeightsPlayout_P2_00201.txt;tree_reuse=true;final_move=robustchild;learned_selection_policy=softmax,policyweights1=ExIt/PolicyWeightsSelection_P1_00201.txt,policyweights2=ExIt/PolicyWeightsSelection_P2_00201.txt;friendly_name=BiasedMCTS" "MC-GRAVE"

-n 100 means playing 100 evaluation games; you can of course change that.

The huge first string passed to --agents will construct a Biased MCTS for you. The 00201 there is what I usually use after 200 games of self-play; you can increase it if you run more games. If you use a bigger number than you actually have checkpoints for, it will automatically pick whatever the latest checkpoint was. This is also what you can use (with lower numbers) to evaluate the performance of earlier checkpoints rather than the latest (i.e. if you do this for many checkpoints you can create learning curves).

This would give you the Biased MCTS that uses features also throughout complete playouts. Instead of the large `playout=softmax,blablabla` part, you could just have `playout=random` and it would be like the Biased MCTS (Uniform Playouts) agent. Or something like `playout=softmax,blablabla,epsilon=0.5` to get something in between.
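For example, the Uniform-Playouts-style variant of the agent string above would look roughly like this (same placeholder filepaths as above; friendly_name is just a label you can choose freely):

Code:
"algorithm=MCTS;selection=ag0selection;playout=random;tree_reuse=true;final_move=robustchild;learned_selection_policy=softmax,policyweights1=ExIt/PolicyWeightsSelection_P1_00201.txt,policyweights2=ExIt/PolicyWeightsSelection_P2_00201.txt;friendly_name=BiasedMCTSUniform"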

Be careful with all the filepaths for the policy weights. I made them start with "ExIt" because that is what you had as your --out-dir for the training run. But it might be safer to actually put a proper, full filepath there (ideally without whitespaces).

The second, small string, "MC-GRAVE", is of course there to evaluate against MC-GRAVE. You can also make this UCT, or again a giant string for a Biased MCTS (possibly for a different checkpoint, or with otherwise different parameters).
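So a checkpoint-vs-checkpoint comparison could look roughly like this (the 00101/00201 checkpoint numbers and the output directory are just examples; substitute whichever checkpoint files actually exist in your training directory):

Code:
java -jar Ludii-1.3.6.jar --eval-agents --game "Game.lud" -n 100 --out-dir "EvalCheckpoints" --thinking-time 1 --agents "algorithm=MCTS;selection=ag0selection;playout=softmax,policyweights1=ExIt/PolicyWeightsPlayout_P1_00101.txt,policyweights2=ExIt/PolicyWeightsPlayout_P2_00101.txt;tree_reuse=true;final_move=robustchild;learned_selection_policy=softmax,policyweights1=ExIt/PolicyWeightsSelection_P1_00101.txt,policyweights2=ExIt/PolicyWeightsSelection_P2_00101.txt;friendly_name=BiasedMCTS101" "algorithm=MCTS;selection=ag0selection;playout=softmax,policyweights1=ExIt/PolicyWeightsPlayout_P1_00201.txt,policyweights2=ExIt/PolicyWeightsPlayout_P2_00201.txt;tree_reuse=true;final_move=robustchild;learned_selection_policy=softmax,policyweights1=ExIt/PolicyWeightsSelection_P1_00201.txt,policyweights2=ExIt/PolicyWeightsSelection_P2_00201.txt;friendly_name=BiasedMCTS201"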


RE: Biased MCTS/Expert Iteration questions - fuerchter - 09-08-2022

(09-07-2022, 09:27 AM)DennisSoemers Wrote: It should actually resume training from where it left off if you use the same command line arguments (or similar ones) as before, specifically if the --out-dir is the same. It should see that files it needs to resume (like the experience buffer, and all the checkpoints of features and weights) are already there and use them.

I tried this yesterday and it does work. Checkpoints restart from no. 0 though, so it's not surprising I missed it (without looking at the code).

(09-07-2022, 09:27 AM)DennisSoemers Wrote: I normally evaluate the playing strength of my agents using the command line as well

I'm currently still using "Compare Agents" (I've found a checkpoint with a ~47% win rate vs MC-GRAVE so far, at --thinking-time 6), but I intend to switch to the CLI eventually.

Thanks again for the input. I'm hoping I can get a winning agent on my own from here on out (and perhaps this thread can be useful for anyone else looking to train an agent).