Rather than try to mine historical data, run an experiment to pit UCB against Neyman-Pearson inference. For some A/B tests, split the users into two groups. Treatment A is A/B testing, treatment B is UCB.
In A/B testing, follow appropriate A/B testing procedures: pick a sample size prior to the experiment that gives you appropriate power, or use Armitage's rule for optimal test termination. (Email me if you're interested; I'm happy to send over papers or scan the relevant pages from his book.) However, it's probably best to use a fixed sample size, as that is what most real-life A/B test practitioners use. Picking the sample size can be a bit tricky, but as a rule of thumb, pick something large enough to detect differences between treatments as small as 1 percentage point.
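To make the rule of thumb concrete, here is a minimal sketch of a fixed-sample-size calculation using the standard normal-approximation formula for a two-sided, two-proportion z-test. The 5% baseline conversion rate, the 5% significance level and the 80% power are my assumptions for illustration, not numbers from any real test; `sample_size_per_arm` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base, min_diff, alpha=0.05, power=0.80):
    """Users needed in EACH arm of a two-sided, two-proportion z-test.

    p_base   -- assumed baseline conversion rate (an illustrative guess)
    min_diff -- smallest absolute difference worth detecting (0.01 = 1 point)
    """
    p_alt = p_base + min_diff
    p_bar = (p_base + p_alt) / 2  # pooled rate under the null hypothesis
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt))
    ) ** 2
    return math.ceil(numerator / min_diff ** 2)


# Assumed scenario: 5% baseline, detect a lift to 6%.
n = sample_size_per_arm(p_base=0.05, min_diff=0.01)
print(n)
```

Under those assumptions the answer comes out at roughly 8,000 users per arm, and it grows quickly as the detectable difference shrinks, since the required size scales with 1 / min_diff squared.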
In the treatment group B, use the UCB1 procedure. Subject the users to whichever design UCB1 picks, and continue with the learning.
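For reference, the UCB1 procedure can be sketched as below. This is a bare-bones version assuming rewards in [0, 1] (e.g. 1 for a conversion, 0 otherwise); the class name `UCB1` and the simulated conversion rates in the usage loop are hypothetical, chosen only for illustration.

```python
import math
import random


class UCB1:
    """Minimal UCB1 bandit for rewards in [0, 1]."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # times each design was shown
        self.totals = [0.0] * n_arms  # summed reward per design

    def select_arm(self):
        # Show every design once before trusting any estimate.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        t = sum(self.counts)
        # Empirical mean plus an exploration bonus that shrinks with use.
        scores = [
            self.totals[a] / self.counts[a]
            + math.sqrt(2 * math.log(t) / self.counts[a])
            for a in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


# Usage with made-up conversion rates for two designs.
rng = random.Random(0)
assumed_rates = [0.2, 0.8]
bandit = UCB1(n_arms=len(assumed_rates))
for _ in range(2000):
    arm = bandit.select_arm()
    reward = 1.0 if rng.random() < assumed_rates[arm] else 0.0
    bandit.update(arm, reward)
print(bandit.counts)
```

Each user in group B gets whichever design `select_arm` returns, and the observed outcome is fed back through `update`, so the learning never stops.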
Do not share any information between treatment groups A and B.
Run these tests for a sufficient amount of time over a largish number of clients, and then use permutation tests to determine which treatment, UCB1 vs Neyman-Pearson, performs better.
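The final comparison could look something like this: a two-sided permutation test on the difference in mean outcome between the two treatment groups. It is only a sketch; it assumes you have one outcome number per unit (per user or per client, whichever you randomized on), and `permutation_test` is a hypothetical helper name.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns an estimated p-value for the null hypothesis that the group
    labels are exchangeable (i.e. the two treatments perform the same).
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up outcomes: 1 = converted, 0 = did not convert.
ab_group = [1] * 50 + [0] * 950    # users handled by classic A/B testing
ucb_group = [1] * 80 + [0] * 920   # users handled by UCB1
print(permutation_test(ab_group, ucb_group, n_permutations=2000))
```

A small p-value says the gap between the two procedures is unlikely to be a labelling accident; it does not say which one is better, so look at the sign of the observed difference too.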
In all the simulations I've seen, UCB outperforms simple A/B testing, but it would be great to see some empirical evidence as well.
Most websites lack sufficient traffic to reach statistical significance in a short time frame. Sure, Google and Facebook can run a test and get real results in a day (or even hours), but the rest of us need weeks or months to do things properly.