
I love the concept of A/A testing here, illustrating that you get apparent results even when you compare something to itself.

I can't imagine how A/B tests are a productive use of time for any site with fewer than a million users.

There are so many more useful things you could be doing to create value. If you're running a startup, you should instead have some confidence in your own decisions.



When ExP was a thing at Microsoft, we always ran an A/A test before we did experiments. We'd also do an A/A/B test to make sure the actual experiments were working.

http://www.exp-platform.com/Pages/default.aspx


There are also some complex problems with the assumptions that are rarely addressed: e.g. a regular user who notices a structural or cosmetic change may be more likely to look at it and click, an effect that would fade away in steady state.


True. We did everything we could to account for that, though. If someone chooses to clear cookies every time they load a page, there's not much we could do.


A/A testing should be used to get accurate estimates for within-sample variance. If you run an A/A/B test then you can calibrate the A/B component to be sensitive w.r.t. the tolerances of real data.

And then yeah, I'm sure a lot of successful A/B tests will get washed.
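A quick sketch of why A/A runs matter: even when both arms share the same true conversion rate, roughly 5% of runs will "reach significance" at p < 0.05. The traffic numbers here are hypothetical, and the two-proportion z-test uses the normal approximation.

```python
# A/A simulation: both arms have the SAME true conversion rate, yet
# about 5% of runs still produce p < 0.05 (false positives by design).
# All parameters below are hypothetical.
import math
import random

def two_proportion_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference of two proportions (normal approx)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(0)
true_rate, n, runs = 0.05, 1000, 1000
false_positives = sum(
    two_proportion_p(
        sum(random.random() < true_rate for _ in range(n)), n,
        sum(random.random() < true_rate for _ in range(n)), n,
    ) < 0.05
    for _ in range(runs)
)
print(f"A/A false positive rate: {false_positives / runs:.3f}")  # lands near 0.05
```

Running an A/A test like this before a real experiment tells you whether your pipeline's observed false-positive rate actually matches the nominal 5%.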


confidence in your own decisions can also be referred to as a Bayesian prior ;)

I've treated the A/B tests I've run pretty much as a case of Bayesian parameter estimation (where the true conversion rates of A and B are your parameters). You then get nice beta distributions you can sample from, and you can use the prior to constrain expectations of improvement and to reduce the effect of early flukes in your sampling.
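That setup fits in a few lines of standard-library Python: with a Beta(a, b) prior, observing s conversions out of n gives a Beta(a + s, b + n - s) posterior, which you can sample to ask "what's the probability B beats A?". The counts and the flat prior below are hypothetical.

```python
# Minimal Bayesian conversion estimation: each arm's true rate gets a
# Beta posterior, sampled via the stdlib's random.betavariate.
# Conversion counts are made up for illustration.
import random

random.seed(1)
prior_s, prior_f = 1, 1          # flat Beta(1, 1) prior
conv_a, n_a = 120, 1000          # arm A: 120 conversions of 1000 (hypothetical)
conv_b, n_b = 140, 1000          # arm B: 140 conversions of 1000 (hypothetical)

samples = 100_000
b_wins = sum(
    random.betavariate(prior_s + conv_b, prior_f + n_b - conv_b)
    > random.betavariate(prior_s + conv_a, prior_f + n_a - conv_a)
    for _ in range(samples)
)
print(f"P(B > A) = {b_wins / samples:.3f}")
```

Note that the output is a probability that B is better, not a binary significant/not-significant verdict.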


Bayesian approaches are probably out of reach for most small companies. They have a long way to go before being as approachable and easy as frequentist approaches. Schools, and the statistics field as a whole, need to drastically reform the introductory courses that everyone takes.

Until then, it's A/B testing with p < .05, ignoring bias and sample size, for companies that aren't large enough to have a statistician or data scientist.


No they aren't. Here is a Bayesian method that is just as easy as any frequentist one. At my last job, a completely non-technical user who didn't even understand statistical significance used it just fine [1].

http://www.bayesianwitch.com/blog/2014/bayesian_ab_test.html

The only cost of the Bayesian method is that the Bayesian Python script is thousands of times slower than the frequentist one. I didn't do benchmarks, but in terms of order of magnitude, the frequentist method might take a microsecond while the Bayesian method might take a second.

[1] He used a less advanced version of the method which used a normal approximation - not that he needed to know the difference.


Sorry, but I don't understand how Bayesian statistics could possibly solve the problems described here.

Sometimes bad scenarios will get good results, by luck, and sometimes good scenarios will get bad results, by luck.

Using more advanced statistical methods doesn't change that these cases are fundamentally indistinguishable.


You're right. The one exception, though, is that with Bayesian statistics you can estimate an effect size from your experiment results using a credible interval.

If the differences are drastic enough you can still get value from split testing. Incremental changes probably just aren't going to bring you much luck.


There are several things that help. First, you're not just looking for red-light/green-light significance. Since you're actually modeling the beta distribution for each conversion rate, you can not only ask "what's the probability that this test is an improvement?" but also sample from both distributions and see what that improvement looks like.

For example, I just simulated some bad data. A has 480 observations and a mean conversion of 33%; B has 410 observations and a mean conversion of 37%. The p-value here is 0.0323. In the traditional A/B testing model we'd be done and claiming better than a 10% improvement!

However, when I sample from these 2 beta distributions I see that my credible region is -2% to 34%, meaning this new test could be anywhere from 2% worse to 34% better. No magic value is needed to tell you that you really don't know anything yet.
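Something like this exercise can be reproduced with stdlib Python. The conversion counts below are inferred from the stated rates (roughly 158/480 for A, 152/410 for B), a flat Beta(1, 1) prior is assumed, and the interval is taken from sorted posterior samples of the relative lift, so the exact endpoints will differ from the -2%/34% quoted above.

```python
# Posterior credible interval for the relative lift of B over A.
# Counts are reconstructed from the rates in the comment above and are
# therefore approximate; the prior and interval method are assumptions.
import random

random.seed(2)
conv_a, n_a = 158, 480   # about 33% conversion
conv_b, n_b = 152, 410   # about 37% conversion

lifts = sorted(
    random.betavariate(1 + conv_b, 1 + n_b - conv_b)
    / random.betavariate(1 + conv_a, 1 + n_a - conv_a)
    - 1
    for _ in range(100_000)
)
lo, hi = lifts[2_500], lifts[97_500]   # 2.5th and 97.5th percentiles
print(f"95% credible interval for lift: {lo:+.1%} to {hi:+.1%}")
```

The interval straddles zero, which is exactly the "you really don't know anything yet" signal.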

Another huge help is the use of a prior. Until your data overrides your prior belief, you aren't going to see anything. Going with the last example, if I had a good prior that the true conversion rate on that page was actually 33%, I wouldn't even have gotten a p-value of less than 0.05. On the other hand, if I had a strong prior that the conversion rate was 50%, that would imply that both A and B were getting strangely unlucky results, which would actually boost the probability that B was in fact an improvement.
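The pull of the prior can be made concrete with the same reconstructed counts. Encoding a strong belief that the rate is about 33% as Beta(330, 670) pseudo-counts (an arbitrary choice for illustration) shrinks both arms toward 33%, so the same data yields a noticeably weaker P(B > A) than a flat prior does.

```python
# Effect of a strong prior: Beta(330, 670) encodes ~1000 pseudo-observations
# at a 33% rate (a hypothetical strength chosen for illustration).
# Conversion counts are reconstructed from the example above.
import random

random.seed(3)
conv_a, n_a = 158, 480
conv_b, n_b = 152, 410

def prob_b_beats_a(prior_s, prior_f, samples=100_000):
    """Monte Carlo estimate of P(B's rate > A's rate) under a Beta prior."""
    return sum(
        random.betavariate(prior_s + conv_b, prior_f + n_b - conv_b)
        > random.betavariate(prior_s + conv_a, prior_f + n_a - conv_a)
        for _ in range(samples)
    ) / samples

p_flat = prob_b_beats_a(1, 1)        # flat prior
p_strong = prob_b_beats_a(330, 670)  # strong prior centered near 33%
print(f"flat prior:   P(B > A) = {p_flat:.3f}")
print(f"strong prior: P(B > A) = {p_strong:.3f}")
```

With the strong 33% prior, the 480 and 410 observations are diluted by the 1000 pseudo-observations, so the apparent edge for B mostly washes out until more data arrives.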

On the philosophical side, Bayesian statistics are simply trying to quantify what you know, not give you 'yes'/'no' answers. Maybe the gamble of -2 to 34 is good for you, or maybe you really want to know tighter bounds on your improvement and aren't comfortable with any possibility of decline. Bayesian statistics gives you a direct way to trade off certainty with time.


Full disclosure: I work for Qubit who published this white paper.

Just wanted to add that if you have fewer than a million users, you can A/B test for upper-funnel goals, effectively measuring whether changes improve engagement. Obviously you then have the problem of working out whether the engagement translates into more sales, but perhaps you're willing to wait longer to find out whether a test that improves engagement leads to more revenue in the long run.



