One time during my internship years ago I took down a production server because ...

tetha · on July 29, 2023

For critical and overall... fiddly things, we've grown into a culture of writing down reviewable plans and possibly executing these plans in pairs.

We tend to go ahead and either use a runbook, or whatever experience we might have, to set up a pretty detailed plan of what to run on which systems and for what purpose. You can then throw these plans at someone else to review. Sure, it takes an hour or two more to set up a solid plan, and waiting for a review takes time as well. But this has turned into a great tool to build up experience in weird parts of the infrastructure.
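
A minimal sketch of what such a reviewable plan could look like as data, assuming Python. The `Step` and `Plan` names, the fields, and the rule that an author cannot approve their own plan are all hypothetical choices made for illustration, not something described in the comment above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    host: str      # which system the command runs on
    command: str   # what to run there
    purpose: str   # why this step is in the plan


@dataclass
class Plan:
    title: str
    author: str
    steps: List[Step] = field(default_factory=list)
    approved_by: Optional[str] = None

    def approve(self, reviewer: str) -> None:
        # Assumption: a review only counts if it comes from someone else.
        if reviewer == self.author:
            raise ValueError("author cannot approve their own plan")
        self.approved_by = reviewer

    def render(self) -> str:
        # Plain-text form that can be pasted into a ticket or chat for review.
        lines = [f"Plan: {self.title} (author: {self.author})"]
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"{number}. [{step.host}] {step.command}  # {step.purpose}")
        lines.append(f"Reviewed by: {self.approved_by or 'PENDING'}")
        return "\n".join(lines)
```

Usage would be something like building a `Plan`, sending `plan.render()` to a colleague, and only executing the steps once `plan.approved_by` is set.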

scrame · on Aug 1, 2023

I get that, but it also can turn into wiki checklists of things that could be automated.

one of my frustrations of bigco software is people taking basically maintenance roles where the computer tells them what to do, because the lava flow legacy code base is too scary to touch.

however, you can automate your daily cleanup tasks. it's certainly shellacking more mud on the ball, but if you're not going to even try scripting your repetitive tasks, then i don't know why you're a programmer.

tetha · on Aug 3, 2023

Our running gag is: Once such a runbook has been refined and clarified to the point of being really comprehensive and easy to follow... someone turns it into a Jenkins job and we don't need it anymore.

returningfory2 · on July 29, 2023

In my opinion you weren't at fault here. Production systems should be designed so that one person can't inadvertently destroy things.

ljm · on July 29, 2023

In almost every place I've worked at, the most difficult thing has been getting people out of ad-hoc JFDI style development and debugging, where everything in production is fair game, and into a process where you avoid touching production as much as humanly possible.

Takes a lot of effort to stop people opening up a shell in prod or grabbing a prod DB dump or even just connecting to the prod datastore directly from their local env.

RyanHamilton · on July 29, 2023

This is the way.