09-07-2022, 09:27 AM
Quote:In 7. you say that for --num-games there is also a risk of decreasing performance. This also implies that I can't just pick the last checkpoint of a training run and it will be the strongest right away, right? If so, is there a good way to find the best checkpoint (other than trial and error)?
It does imply that, yes. There's not really a better way than picking a few different checkpoints (for example after 1, 50, 100, 200, ... self-play games, however many you like) and making them play against each other, or against some fixed baseline (the --eval-agents command further down works for that too).
Quote:Are the Elos and Pick Counts from the log at all meaningful for this? Also just to clarify, this data is with respect to the trained agent (not UCT or MC-GRAVE) at each checkpoint, right?
It also has pick counts and estimated Elo ratings for UCT and MC-GRAVE, and below that long lists for all the checkpoints. These are meaningful in the sense that the tournament mode uses a softmax over the Elo estimates to create the distribution from which it samples agents to play self-play games, so you'll see the ones with higher Elos also having slightly higher pick counts. But all the pick counts are really low, and for many of them just 0. We don't have a lot of data there on which the Elo estimates are based, so they're very rough estimates, and probably not a great way of picking the best checkpoint/agent.
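(In other words, the pick probability of an agent is roughly p_i ∝ exp(Elo_i / T) for its Elo estimate Elo_i and some temperature T; the exact temperature is whatever the implementation uses, but that's the general idea.)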
Quote:Can you resume a training run after pausing/interrupting it? Perhaps via "--best-agents-data-dir" (if so, what needs to be in this directory)?
It should actually resume training from where it left off if you use the same command line arguments (or similar ones) as before, specifically if the --out-dir is the same. It should see that the files it needs to resume (like the experience buffer, and all the checkpoints of features and weights) are already there, and use them.
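So, for example, if the training run was originally launched with something like the command below, running the exact same command again should resume it. (The --expert-iteration flag for the training mode is written from memory here and is an assumption on my part; just reuse whatever command you originally ran, with the same --out-dir.)
Code:
java -jar Ludii-1.3.6.jar --expert-iteration --game "Game.lud" --num-games 200 --out-dir "ExIt"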
Quote:Assuming the agents I'm training eventually surpass MC-GRAVE, is there some way to compare them amongst each other, aside from launching two instances and entering moves manually?
I normally evaluate the playing strength of my agents using the command line as well, not through anything in the GUI. You can use something like:
Code:
java -jar Ludii-1.3.6.jar --eval-agents --game "Game.lud" -n 100 --out-dir "Eval" --thinking-time 1 --agents "algorithm=MCTS;selection=ag0selection;playout=softmax,policyweights1=ExIt/PolicyWeightsPlayout_P1_00201.txt,policyweights2=ExIt/PolicyWeightsPlayout_P2_00201.txt;tree_reuse=true;final_move=robustchild;learned_selection_policy=softmax,policyweights1=ExIt/PolicyWeightsSelection_P1_00201.txt,policyweights2=ExIt/PolicyWeightsSelection_P2_00201.txt;friendly_name=BiasedMCTS" "MC-GRAVE"
-n 100 means playing 100 evaluation games; you can of course change that.
The huge first string passed to --agents will construct a Biased MCTS for you. The 00201 there is what I usually use after 200 games of self-play; you can increase that if you run more games. If you use a bigger number than you actually have checkpoints for, it will automatically pick whatever the latest checkpoint was. You can also use this (with lower numbers) to evaluate the performance of checkpoints earlier than the latest, i.e. if you do this for many checkpoints you can create learning curves (see the sketch below).
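For example, here's a rough bash sketch for generating learning-curve data this way. The checkpoint numbers and the per-checkpoint --out-dir are just examples; use whichever checkpoints your run actually produced:
Code:
# evaluate a few checkpoints against MC-GRAVE to build a learning curve
# (checkpoint numbers/paths are examples, adjust to your own run)
for CP in 00001 00051 00101 00201; do
  java -jar Ludii-1.3.6.jar --eval-agents --game "Game.lud" -n 100 --out-dir "Eval_${CP}" --thinking-time 1 --agents "algorithm=MCTS;selection=ag0selection;playout=softmax,policyweights1=ExIt/PolicyWeightsPlayout_P1_${CP}.txt,policyweights2=ExIt/PolicyWeightsPlayout_P2_${CP}.txt;tree_reuse=true;final_move=robustchild;learned_selection_policy=softmax,policyweights1=ExIt/PolicyWeightsSelection_P1_${CP}.txt,policyweights2=ExIt/PolicyWeightsSelection_P2_${CP}.txt;friendly_name=BiasedMCTS_${CP}" "MC-GRAVE"
done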
This would give you the Biased MCTS that also uses features throughout complete playouts. Instead of the large `playout=softmax,blablabla` part, you could just have `playout=random` and it would be like the Biased MCTS (Uniform Playouts) agent, or something like `playout=softmax,blablabla,epsilon=0.5` to get something in between (see the example below).
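Concretely, with the same filepaths as before, the --agents string for the uniform-playouts version would be something like:
Code:
"algorithm=MCTS;selection=ag0selection;playout=random;tree_reuse=true;final_move=robustchild;learned_selection_policy=softmax,policyweights1=ExIt/PolicyWeightsSelection_P1_00201.txt,policyweights2=ExIt/PolicyWeightsSelection_P2_00201.txt;friendly_name=BiasedMCTS_UniformPlayouts"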
Be careful with all the filepaths for the policy weights. I made them start with "ExIt" because that is what you had as your --out-dir for the training run. But it might be safer to actually put a proper, full filepath there (ideally without whitespaces).
The second, small string, "MC-GRAVE", is of course there to evaluate against MC-GRAVE. You can also make this "UCT", or again a giant string for a Biased MCTS (possibly for a different checkpoint, or with otherwise different parameters), as in the sketch below.
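For example, to directly pit the checkpoint-201 agent against an earlier checkpoint (the 00101 here is just an example number, use whichever checkpoints you actually have):
Code:
java -jar Ludii-1.3.6.jar --eval-agents --game "Game.lud" -n 100 --out-dir "EvalCheckpoints" --thinking-time 1 --agents "algorithm=MCTS;selection=ag0selection;playout=softmax,policyweights1=ExIt/PolicyWeightsPlayout_P1_00201.txt,policyweights2=ExIt/PolicyWeightsPlayout_P2_00201.txt;tree_reuse=true;final_move=robustchild;learned_selection_policy=softmax,policyweights1=ExIt/PolicyWeightsSelection_P1_00201.txt,policyweights2=ExIt/PolicyWeightsSelection_P2_00201.txt;friendly_name=BiasedMCTS_00201" "algorithm=MCTS;selection=ag0selection;playout=softmax,policyweights1=ExIt/PolicyWeightsPlayout_P1_00101.txt,policyweights2=ExIt/PolicyWeightsPlayout_P2_00101.txt;tree_reuse=true;final_move=robustchild;learned_selection_policy=softmax,policyweights1=ExIt/PolicyWeightsSelection_P1_00101.txt,policyweights2=ExIt/PolicyWeightsSelection_P2_00101.txt;friendly_name=BiasedMCTS_00101"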