What’s being glossed over here, and what explains a lot of the confusion around which is the best method of “solving” the multi-armed bandit problem, is the classical bias–variance tradeoff. All of the methods presume some model of the problem, and some of those models are more flexible than others. When a model is more flexible, it allows solutions to take on more shapes and must burn more of its training data choosing among those shapes. Models that are more biased toward a particular shape, on the other hand, can devote more of their data to convergence and so converge more rapidly.
Which method is “best” depends on what you know about the problem. Does its optimal solution look a certain way? Does it change over time? And so on. If you’re willing to bet on your answers to those questions, you can choose a method that’s biased toward your answers, and you’ll converge more rapidly on a solution. The risk, however, is that you’ll bet wrong and converge on a poor solution (because your biases rule out better solutions).
If you’re not willing to bet on your answers, you can choose a method that will place bets for you based on what it sees in the data. But now you’re burning some of your data on betting. So that’s the tradeoff: you can use more of your knowledge to place bets (and risk placing the wrong bets), or more of the data’s knowledge to place bets (and burn some of your data on betting). Where you adjust the slider between those two is up to you.
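The slider can be made concrete with a small simulation. Below is a minimal sketch (the arm payoffs, the epsilon values, and the `epsilon_greedy` helper are all illustrative assumptions, not part of the original discussion) using the epsilon-greedy strategy: epsilon is the fraction of pulls you burn on letting the data place bets, and the rest of the pulls exploit whatever bet you’ve already placed.

```python
import random

def pull(arm):
    # Hypothetical two-armed bandit: arm 1 pays off more often than arm 0.
    return 1 if random.random() < (0.6 if arm == 1 else 0.4) else 0

def epsilon_greedy(epsilon, steps=2000, seed=0):
    """Run one epsilon-greedy agent and return its total reward.

    A higher epsilon burns more pulls on exploration ("betting" with
    the data); a lower epsilon bets harder on early evidence and risks
    locking onto the wrong arm.
    """
    random.seed(seed)
    counts = [0, 0]       # pulls per arm
    values = [0.0, 0.0]   # running mean reward per arm
    total = 0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(2)                  # explore: spend data on betting
        else:
            arm = 0 if values[0] >= values[1] else 1   # exploit current best guess
        r = pull(arm)
        total += r
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean update
    return total
```

Sweeping epsilon from near 0 to near 1 traces out the slider: at one end almost every pull exploits a possibly premature bet, at the other almost every pull is spent gathering evidence.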
Which brings us back to our original question: which method of solving the multi-armed bandit problem is best? It depends a lot on where you want to set the slider, which in turn depends on your knowledge, your aversion to risk, and your expected payoffs.
In life, sometimes one size does not fit all. If you’re going to test one shoe size against another, make sure you know which foot will end up wearing the winner. Likewise, if you’re going to compare algorithms for solving the multi-armed bandit problem, make sure you know the particulars of the problem you need to solve.