
This thinking has always kind of confused me. Why are customers/users at a big company more important than those at a startup? Just because there are more of them, now you can't make mistakes?

If you have the agility to make rapid production changes, you also have the ability to roll back rapidly. So the argument that larger companies require more checks and testing than startups isn't really valid, especially when you consider the costs.



If you have the agility to make rapid production changes, you also have the ability to roll back rapidly.

This is just not true. Rollbacks are always more expensive than changes, because you can't rewind time to undo the consequences of having your software be broken for minutes, hours, or days. Worse, in the absence of "checks", the cost of making a production change tends to be roughly constant as the company grows -- it takes the Amazon sysadmin no more time to type "make deploy" than it does me -- but the cost of a rollback scales directly with the size of your company's customer base.

Within a few seconds after Amazon.com breaks S3, thousands of companies begin to lose money, and they lose money second by second until the rollback happens. Even if Amazon is only down for a minute, that's one minute of downtime multiplied by its number of customers. The larger the customer base, the larger the stakes.
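As a back-of-the-envelope sketch of that linear scaling (the dollar figure per customer-minute is an assumption for illustration, not anything Amazon publishes):

    # Downtime cost grows linearly with the customer base:
    # every minute of outage is paid for once per customer.
    # Both numbers below are illustrative assumptions.
    customers = 100_000
    loss_per_customer_minute = 0.10  # $ lost per customer per minute, assumed

    for minutes_down in (1, 5, 60):
        loss = minutes_down * customers * loss_per_customer_minute
        print(f"{minutes_down:>2} min of downtime: ${loss:,.0f}")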

And, unfortunately, the cost of downtime is nonlinear. If Amazon goes down for a mere two minutes, hundreds of peacefully sleeping system administrators will get emergency pages from their uptime-monitoring systems. They will get out of bed. They will check their logs and their failover mechanisms. They will lose a lot of sleep, and soak up a bunch of overtime pay, and a lot of their good will towards Amazon will dissipate like the morning dew. Once you lose your reputation for quality it takes a lot of work to get it back.

This is why larger companies have more controls. The controls are in place to try to pass the ever-increasing cost of a rollback back to the teams that cause the rollbacks. The reason it seems so gosh-darned expensive to add a trivial feature to your flagship app is that it is expensive: If the average rollback costs $1m in revenue and every new feature is only 95% reliable, then every new feature carries an expected cost of $50k just to deploy.
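That $50k figure is just expected-value arithmetic. A minimal sketch, using the hypothetical numbers above:

    # Expected cost of one deploy = P(rollback) * cost of a rollback.
    # Both inputs are the hypothetical figures from the comment above.
    rollback_cost = 1_000_000  # $ of revenue lost per rollback, assumed
    reliability = 0.95         # fraction of features that deploy cleanly

    expected_cost = (1 - reliability) * rollback_cost
    print(f"expected cost per deploy: ${expected_cost:,.0f}")  # -> $50,000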

The secret here is: If you want to deploy changes rapidly, don't work on a product that has a lot of uptime-sensitive customers! Start a different product line, or start a beta program, or found a smaller company.


S3 is a really bad example because it's infrastructure: when it breaks, its customers see their entire sites go down. Those kinds of companies are the exception. I hope Heroku has rigorous testing and scrutinizes every change, even though it's a startup.

Let's say I own a video site and I want to add threaded comments. If I have 5 users and the site goes down for 5 minutes, each of those 5 users gets 5 minutes of annoyance. If I have a million users, each user still gets the same 5 minutes of annoyance. There is no difference to the user. So, by adding more checks to make sure the site doesn't go down for 5 minutes when you have more users, you're saying that the more users you have, the more important each user becomes. I think that's a strange way of thinking.

(The same is true of an infrastructure service -- if S3 had 5 users, were more cavalier about its release schedule, and broke something, each of those 5 users would suffer the same effects of downtime as if S3 had 5 million users.)

The awesome benefit of getting threaded comments developed, tested briefly, and pushed in one evening is worth the risk of 5 minutes of downtime compared to the 2 weeks of rigorous testing and approval-by-committee. No matter how many users you have.


I used an infrastructure site as an example because the value proposition is easy to understand when you use a site that has a clear and simple monetization strategy. Video sharing sites are arguably an even worse example than S3, because the value of uptime is so hard to perceive or compute. It's likely that even Twitter doesn't understand the true value of a customer-hour of Twitter uptime, because the site isn't monetized and so much of the value is concentrated in the brand. Measuring that is like voodoo, only less empirical. ;)

If I have 5 users and the site goes down for 5 minutes, each of those 5 users gets 5 minutes of annoyance. If I have a million users, each user still gets the same 5 minutes of annoyance. There is no difference to the user.

No, but there is a big difference for you! If a user is worth a dollar per year, the five-user site is worth five bucks per year, but the million-user site is worth a million bucks. If each patch to your code causes 0.1% of users to abandon your product (a number which depends on the odds that a patch will cause a rollback, and on the odds that a rollback will annoy a user enough to make them leave), patching a 5-user site costs you half a cent per year on average (most likely it has no perceptible cost, since odds are no users will leave), but each patch to a million-user site costs you $1000 per year in revenue.

And that's just the linear cost. There are nonlinear consequences: one or zero annoyed users is nothing to worry about -- unless that user is Michael Arrington -- but a clique of 1000 annoyed users is potentially a movement: a critical mass of people who will all start complaining about your company on Twitter on the same day, potentially costing you your next 10,000 or 100,000 or 1 million users while simultaneously empowering your competitors, who may begin building the site that takes you down by poaching those dissatisfied users.
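The linear part of that claim is easy to check. A quick sketch using the assumed figures above ($1/user/year, 0.1% of users lost per patch); the nonlinear mob effects are, of course, not in the model:

    # Expected annual revenue lost per patch, assuming each patch
    # drives 0.1% of users away and each user is worth $1/year.
    # Both figures are the hypothetical ones from the comment above.
    revenue_per_user = 1.00  # $/user/year, assumed
    churn_per_patch = 0.001  # fraction of users who leave per patch, assumed

    for users in (5, 1_000_000):
        annual_cost = users * churn_per_patch * revenue_per_user
        print(f"{users:>9,} users: ${annual_cost:,.3f}/year per patch")
    # 5 users -> $0.005/year (half a cent); 1,000,000 users -> $1,000/year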

This is just the flip side of scalability. As a programmer you enjoy mighty economies of scale: Running a site with a million users is more expensive than running a single-user site, but it is much less than a million times as expensive. But this leverage also applies to your mistakes: a mistake that costs you a dollar when your site is small might cost you $1,000,000 when your site is big. And it's the same mistake! Typos are just as easy to make on big sites as on small ones.

Obviously, this doesn't mean that you shouldn't ever change the site. Presumably each and every one of your patches is valuable, and will bring in revenue to pay for its own insurance premiums. Right? :) But you do need to think about that calculation, because you do occasionally make mistakes. As your userbase grows, you may wish to test each patch on a subset of users to be sure they will really like it, and that the additional revenue is really going to be there. You may wish to institute tests and internal audits that lower the risk of rollbacks, or failover mechanisms to lower the cost of rollbacks. And before long, lo, you will be that which you deplore: A company with a bunch of annoying internal controls! But at least you'll have revenue to console yourself with.
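For the "test each patch on a subset of users" part, here's one common shape of that idea: a deterministic canary bucket keyed on user ID. This is a sketch, not any particular company's mechanism; the names and the 1% figure are made up for illustration:

    import hashlib

    def in_canary(user_id: str, feature: str, percent: float) -> bool:
        """Deterministically assign a user to a feature's canary group
        by hashing (feature, user_id) into a bucket in [0.0, 100.0)."""
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = (int(digest[:8], 16) % 10_000) / 100.0  # 0.00 .. 99.99
        return bucket < percent

    # Ship threaded comments to 1% of users first; widen the rollout
    # only after error rates and engagement look healthy.
    if in_canary(user_id="u12345", feature="threaded-comments", percent=1.0):
        pass  # new code path
    else:
        pass  # stable code path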


But I think what Dustin is saying (correct me if I'm wrong) is that the multiplier applies both ways. And that the total cost of making a 5-minute-downtime mistake, even to a million users, could easily be outweighed by the benefits of releasing a product/feature/site 2 weeks early. In most cases, I think large companies are risk averse instead of risk neutral in situations like this.

I agree with both of you that it varies considerably based on what the site does (infrastructure, videos, games, etc).


In most cases, I think large companies are risk averse instead of risk neutral in situations like this.

I'm not going to argue with that. Just because a certain increase in caution is rational doesn't mean that caution isn't being overapplied in many cases, just as PG suggests in his original post.


No, it's that most people don't want the product they use to change often, if at all. They get it for what it is, and if it changes, that means disrupting their way of using it. When you're working on a smaller product, you can iterate more and work closely with the people using it. When it gets larger, any change you make will alienate some users, so you have to plan to minimize that alienation.

You can change and roll back features quickly, but that gives users an impression of instability. If things are constantly changing, they'll seek out something more stable. It's "lowest common denominator" thinking: it produces something that nobody dislikes, and that's the goal of larger groups. Niche companies can work much better, but even then there's some slowdown.


I think this is particularly important given that often the biggest cost of adopting a new product/tool is learning to use it well... if it's changing all the time, with no release schedule and no attempt to stabilize the main release via betas/QA/etc., it gets very frustrating to learn.


I think this phenomenon is caused by the culture of fear among most employees. Screwing up badly will get you fired, the greatest fear of all. Most employees have heavy mortgages and children, and require a consistent and predictable income. (This state of things is a problem in itself, but that's another topic).

The big issue is that the level of fear increases as you move up the command structure.

Let's assume one of your startups has crashed and burned. Something really spectacular must have happened if that makes potential customers of your next startup spurn you. Taking risks may have consequences, but they are temporary. If you are a high-level manager at a large company, though, spectacular failure will make you unemployable, or at best push you several steps down the ladder - steps that took years of boredom and politics to climb, and that you will never get back. Failure is punished disproportionately; success is rarely rewarded. Conservatism is the correct choice for each individual actor.

We all know how bureaucracy, culture, and politics work - existing companies won't change in this area.


If you have the agility to make rapid production changes, you also have the ability to roll back rapidly.

It doesn't matter. If you make a mistake with other people's money (e.g. calculating a payment wrong, or crediting/debiting the wrong person), even if you put it right quickly, they'll start losing trust and looking at your competitors.



