this is hilarious and also a great idea. i don't see any reason why you can't pla...

sillysaurusx · on Jan 8, 2020

It should be similarly efficient. AlphaZero used 5,000 first-generation TPUs to generate self-play games, and a much smaller number of TPUs to train the model on the previous self-play results. In AlphaGo Zero, whenever a newly trained model won at least 55% of its evaluation games against the current best, it became the new model.

The same algorithm could be applied here.
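
To make the gating step above concrete, here's a rough sketch of the "promote when it wins at least 55%" loop. Everything in it is a stand-in: `evaluate` fakes a match as biased coin flips, a model's "strength" is just one float, and `training_loop` is a hypothetical name. Only the 55% threshold comes from the comment above.

```python
import random

WIN_THRESHOLD = 0.55  # promotion threshold mentioned in the comment above


def should_promote(wins, games, threshold=WIN_THRESHOLD):
    """True if the candidate won at least `threshold` of its evaluation games."""
    return games > 0 and wins / games >= threshold


def evaluate(candidate, champion, games, rng):
    """Toy stand-in for real matches (assumption): each game is a coin flip
    biased by the ratio of two scalar 'strengths'. Returns the win count."""
    p_win = candidate / (candidate + champion)
    return sum(rng.random() < p_win for _ in range(games))


def training_loop(rounds=20, games=400, seed=0):
    """Repeatedly 'train' a candidate and promote it only if it passes the gate."""
    rng = random.Random(seed)
    champion = 1.0
    promotions = 0
    for _ in range(rounds):
        # Stand-in for "train on the champion's self-play games":
        # the candidate ends up somewhat weaker or stronger at random.
        candidate = champion * rng.uniform(0.8, 1.5)
        wins = evaluate(candidate, champion, games, rng)
        if should_promote(wins, games):
            champion = candidate
            promotions += 1
    return champion, promotions
```

A real version would swap `evaluate` for actual games between two checkpoints; the gate itself wouldn't need to change.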

jeffshek · on Jan 8, 2020

It would not be close to similarly efficient. They have completely different loss functions.

sillysaurusx · on Jan 8, 2020

You're right, "efficient" should be substituted with "possible". We're certainly not claiming that this is a smart way to do it, just that you can.

Still, I think there's a chance it could work well. Each move could be prefixed with the final outcome of the game, which is in the spirit of what AlphaZero and MuZero do: they train on positions labeled with the game's final result.
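
A minimal sketch of what "prefix each move with the final outcome" could look like as training text. The tag strings (`[win]`, `[loss]`, `[draw]`), the function name, and the choice to tag from the mover's point of view are all assumptions for illustration, not something taken from either paper.

```python
def prefix_moves_with_outcome(moves, winner):
    """Tag every move with the game's final result, seen from the mover's side.

    moves:  list of move strings; players alternate, player 0 moves first.
    winner: 0 or 1 for the winning player, or None for a draw.
    """
    tagged = []
    for i, move in enumerate(moves):
        player = i % 2  # whose turn this move belongs to
        if winner is None:
            tag = "[draw]"
        elif winner == player:
            tag = "[win]"
        else:
            tag = "[loss]"
        tagged.append(f"{tag} {move}")
    return " ".join(tagged)


# Player 0 won this short game, so player 0's moves get "[win]".
print(prefix_moves_with_outcome(["e4", "e5", "Nf3"], winner=0))
# prints "[win] e4 [loss] e5 [win] Nf3"
```

The idea would be that at sampling time you prompt with `[win]` and hope the model continues with the kind of move it saw in winning games. Whether that actually works is an open question.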