More seriously, even if you have complete test coverage of your code, tell me, do you do realistic load testing as part of your automated tests? Does that include making sure the results returned by that load test remain correct? What about verifying that your backup system continues functioning correctly after installation of an update? Your monitoring system? Will your tests catch the fact that your SSL configuration just broke? Do they test your load balancer?
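To make the point concrete: here's a minimal sketch (my own, not anything from the thread) of the kind of check a unit-test suite never exercises, a live probe of when a server's TLS certificate expires. The `days_until_expiry` helper name and the 5-second timeout are just illustrative choices:

```python
import ssl
import socket
from datetime import datetime, timezone

def parse_not_after(not_after):
    # OpenSSL's "notAfter" string, e.g. "Jun  1 12:00:00 2030 GMT",
    # is always expressed in GMT.
    dt = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return dt.replace(tzinfo=timezone.utc)

def days_until_expiry(host, port=443):
    # Performs a real TLS handshake against the live endpoint --
    # exactly the thing mocked out in automated test suites.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return (parse_not_after(not_after) - datetime.now(timezone.utc)).days
```

A check like this has to run against the deployed environment, not a test fixture, which is precisely why it tends not to exist.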
I could go on for a while. My career basically started with production service operations, and it's never stopped being a part of my life since. I've seen things people insisted had to be impossible even while I was staring right at them.
I have a favorite story I sometimes tell people in another context. I once wrote an email that, between my explanations of what happened and the SQL dumps proving it, spanned something close to 10 printed pages. It was, at last, real proof of a bug whose existence I'd suspected for months, but which I'd been told was impossible and couldn't be reproduced.
We were days from deploying a change in production that would have triggered this bug in a catastrophic way, and if we hadn't known exactly what was going on, we would have had no warning until customer complaints streamed in.
Guess why the developers and server QA never saw it?
Their machines were in the Pacific time zone. It only affected non-PST8PDT machines.
The bug was a confluence of factors, some of which were in third-party code. If it hadn't existed to begin with, it very easily could have been introduced in a package update, as this particular behavior was not a well-specified part of the package's intended behavior. Automated tests would never have caught it.
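The details of that bug aren't mine to share here, but the general class is easy to demonstrate. A hedged sketch of my own (not the actual bug): any code that silently converts dates through the machine's local time zone, as `time.mktime` does, produces different results on a PST8PDT box than anywhere else, and the developers' machines will never show it:

```python
import os
import time

def report_cutoff(date_str):
    # Converts a date to a Unix timestamp via the machine's *local*
    # time zone -- a common, invisible environment dependency.
    return int(time.mktime(time.strptime(date_str, "%Y-%m-%d")))

# Same input, same code, different machines:
os.environ["TZ"] = "PST8PDT"
time.tzset()
on_dev_box = report_cutoff("2024-01-15")

os.environ["TZ"] = "UTC"
time.tzset()
in_production = report_cutoff("2024-01-15")

# The two "identical" cutoffs disagree by the full UTC offset.
print(in_production - on_dev_box)
```

On a Unix machine the difference comes out to -28800 seconds (8 hours): midnight Pacific is 8:00 UTC, so every dev-box test passes while the non-PST8PDT production machine computes a different cutoff.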
And at this very moment, somewhere in the world, there's an HN reader rushing off to make sure all their development and production machines are set to the same time zone.
Wow, my hat is off to you - that's a hell of a bug, and excellent reporting given that you were on a completely different team. I think this is why, as QA, I feel a greater affinity for Ops people of all stripes than for Dev or PM. Ops people who care about their servers, or who look at the big picture of the deployed environment, are a huge multiplier to my testing and my understanding of core systems.
Also agreed with your position with respect to the person you're replying to. :)
Server QA can be every bit as valuable to ops, and not just by preemptively finding bugs. At that company, I ended up adjacent to the server QA team. They effectively became an extension of the ops team. Many an hour was spent with one side or the other talking to a disembodied head over the cubicle wall, or with me outright sitting in their cubicle. Emails and IMs were a constant. They saved our asses in the field many times. They also wrote a lot of our usable documentation.
On my way out, I recommended one of them to replace me. Shortly thereafter, he did. Years later, I believe he's actually the ops manager. I wouldn't mind working for him, but political BS drove me out of that company, and out of pretty much any other company that big.
Funny man.