I've seen that, too. One of my clients redid their marketing site 3x in one year, each time claiming incredible improvements. The incredible improvements turned out to be local hill climbing, while the entire site's performance languished... 3-4 years ago there were a ton of blog posts about how a green button produced incredible sales when compared to a red button. And so everyone switched to green buttons...
By contrast, I've evolved multiple websites through incremental, globally measured optimizations. It's a lot of fun, and it requires you to really understand your users (I've called A/B testing + analytics "a conversation between you and your users"). But, as you point out, it can be tough to get statistically significant data on changes to a small site. That's why I usually focused on big effects (e.g. 25%) rather than on the blog posts about "OMG! +2.76% change in sales!". That's also why I did a lot of "historical testing", under the assumption that week-to-week changes in normalized stats would be swamped by my tests.
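To put rough numbers on why small sites can only chase big effects, here's a back-of-the-envelope sample-size sketch using the standard two-proportion approximation (my illustrative assumptions: 2% baseline conversion, 5% two-sided significance, 80% power):

```python
import math

def visitors_per_arm(p_base, relative_lift, z_alpha=1.96, z_beta=0.8416):
    """Approximate visitors needed in EACH arm of an A/B test to detect
    a given relative lift in conversion rate.
    z_alpha: 5% two-sided significance; z_beta: 80% power."""
    p_new = p_base * (1 + relative_lift)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_new - p_base) ** 2)

# A 25% lift on a 2% baseline: roughly 14k visitors per arm.
print(visitors_per_arm(0.02, 0.25))
# A 2.76% lift on the same baseline: on the order of a million per arm.
print(visitors_per_arm(0.02, 0.0276))
```

The gap is the whole story: the sample you need scales with the inverse square of the effect size, so a "+2.76%" result is simply out of reach for a site that can't detect anything smaller than a 25% swing.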
under the assumption that week-to-week changes in normalized stats would be swamped by my tests
This is an enormously problematic assumption, which you can verify by either looking at the week-to-week stats for a period prior to your joining the company, or (for a far more fun demonstration) doing historical testing of the brand of toothpaste you use for the next 6 weeks. Swap from Colgate to $NAME_ANOTHER_BRAND, note the improvement, conclude that website visitors pay an awful lot of attention to the webmaster's toothpaste choices.
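You can watch the assumption fail in a toy simulation. With numbers I've made up for illustration (2% mean conversion, week-to-week drift with a 0.4-point standard deviation, 10k visitors/week, three weeks "before" vs. three weeks "after", and no change to the site at all), a naive before/after significance test fires far more often than the nominal 5%:

```python
import math
import random

def weekly_conversions(mean_rate, week_sd, visitors_per_week, weeks, rng):
    """Total conversions over several weeks when the underlying rate
    drifts from week to week (the site itself never changes)."""
    total = 0
    for _ in range(weeks):
        rate = max(0.001, rng.gauss(mean_rate, week_sd))
        # Normal approximation to Binomial(visitors, rate); fine when n*p >> 10.
        mu = visitors_per_week * rate
        sigma = math.sqrt(visitors_per_week * rate * (1 - rate))
        total += max(0, round(rng.gauss(mu, sigma)))
    return total

def looks_significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Naive pooled two-proportion z-test at the 5% level."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return abs(conv_a / n_a - conv_b / n_b) / se > z_crit

rng = random.Random(42)
n = 3 * 10_000  # visitors per three-week window
trials = 400
false_positives = 0
for _ in range(trials):
    before = weekly_conversions(0.02, 0.004, 10_000, 3, rng)
    after = weekly_conversions(0.02, 0.004, 10_000, 3, rng)  # identical site!
    if looks_significant(before, n, after, n):
        false_positives += 1

print(f"'significant' in {false_positives / trials:.0%} of runs with no real change")
```

Under these assumptions the test declares a "significant" difference in a large fraction of runs, because the z-test's error bars account only for sampling noise, not for the week-to-week drift that actually dominates. That drift is exactly what gets attributed to the toothpaste.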
Full disclosure: I work for Qubit who published this white paper.
This kind of "historical testing" (I think people often call it sequential testing?) can be pretty dangerous even for large effects. For example, Christmas might be a really good time to change the colour of all the buttons on your site and see a 50% increase in sales.
Yes. This kind of micro-A/B testing ("red or green buttons?") feels analogous to premature optimization when coding. Don't worry about the tiny 0.0001% improvements you get from using a for-loop over a while-loop; improve the algorithm itself for order-of-magnitude changes. Focus on the big picture.