Hacker News

> Any infrastructure with lots of data.

OP's first point is 'don't put data in docker'. Docker is not for your data. But more to the point, if you're rebuilding your data store a couple of times every day, a couple of hours downtime isn't going to be feasible.

> You're on bare metal because running node on VMs isn't fast enough

In that situation, you should be able to image bare metal in well under 2 hours: dd a base image, run a config manager over it, and you should be done. Small shops that rarely bring up new infra wouldn't need this, but anyone running 'bare metal to scale' should.
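A rough sketch of that two-step flow, assuming a golden image written with GNU dd from a network-booted environment on the target, followed by Ansible as the config manager run from a control host once the machine is back up. The image name, device, host, and playbook below are hypothetical placeholders, and the script only prints the plan rather than running it:

```python
import shlex
from typing import List


def reimage_commands(image: str, device: str, host: str, playbook: str) -> List[List[str]]:
    """Return the commands for a dd-then-converge rebuild (all arguments are hypothetical)."""
    return [
        # Step 1 (on the target, from a netboot/ramdisk environment): raw-copy the
        # golden image onto the disk. conv=fsync flushes data before dd exits.
        ["dd", f"if={image}", f"of={device}", "bs=4M", "conv=fsync", "status=progress"],
        # Step 2 (from a control host, after the target reboots): apply host-specific
        # config. The trailing comma makes Ansible treat the -i value as an inline host list.
        ["ansible-playbook", "-i", f"{host},", playbook],
    ]


if __name__ == "__main__":
    # Dry run: print each command in shell-quoted form instead of executing it.
    for cmd in reimage_commands("base.img", "/dev/sdX", "web01", "site.yml"):
        print(shlex.join(cmd))
```

Splitting it this way keeps the slow, generic part (the image) separate from the per-host part (the config run), which is what makes the rebuild repeatable.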

> bureaucracy

Isn't part of the infra rebuild per se.

> Anytime you have to change DNS. That's going to take days

Depends on your DNS TTLs, but this is config, not infra. Even if it is infra, 48-hour TTLs aren't best practice anymore (and if you're on AWS, most things default to a 5-minute TTL).
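To put numbers on the TTL point: assuming resolvers honor TTLs (some clamp or ignore them), the old answer for a record can be served for up to its TTL after you change it. The usual trick is to lower the TTL ahead of a planned change and wait out the old TTL first. A small Python sketch of that arithmetic, where the function and field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class CutoverPlan:
    # How long to wait after lowering the TTL before changing the record itself.
    lower_ttl_lead_time_s: int
    # How long resolvers may keep serving the old answer once the record changes.
    max_stale_after_change_s: int


def plan_dns_cutover(current_ttl_s: int, reduced_ttl_s: int) -> CutoverPlan:
    """Worst-case timings for a planned DNS change, assuming resolvers honor TTLs."""
    if current_ttl_s < 0 or reduced_ttl_s < 0:
        raise ValueError("TTLs must be non-negative")
    if reduced_ttl_s < current_ttl_s:
        # Caches filled just before the TTL was lowered can hold the record for the old TTL.
        return CutoverPlan(current_ttl_s, reduced_ttl_s)
    # The TTL is already at or below the target, so there is nothing to wait out.
    return CutoverPlan(0, current_ttl_s)


if __name__ == "__main__":
    plan = plan_dns_cutover(48 * 3600, 300)
    print(plan.lower_ttl_lead_time_s / 3600, "hours lead time")
    print(plan.max_stale_after_change_s, "seconds of possible staleness")
```

With a 48-hour TTL you have to lower it two days before the change; with a 5-minute TTL already in place, the change itself is visible within about five minutes.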

> Clients (or vendors) whitelist IPs, and you have to work through with them to fix the IPs

I'd file this under 'bureaucracy' - it's part of your config, not part of your prod infra (which the GP was talking about).

> Amazon gives you the dreaded...

Well, yes, but this is on the same order as "what if there's a power outage at the datacentre". Every single deploy plan out there has an unknown-length outage if the 'upstream' dependencies aren't working. "What if there's a hostage event at our NOC?" blah blah.

The point is that with upstream working as normal, you should be able to cover the common SPOFs and get your prod components up in a relatively short time.



> OP's first point is 'don't put data in docker'. Docker is not for your data.

I agree, but I (and the GP, from my reading) wasn't speaking only about Docker infrastructure.

> Isn't part of the infra rebuild per se.

I can see your point, and perhaps these points don't belong in a discussion purely about rebuilding instances. That said, I have a very hard time focusing only on how long it takes to rebuild capacity when discussing a DC going down; there are just too many other things someone in Operations has to consider.

When I have my operations hat on, I consider a DC going down to be a disaster. Even if the company has followed my advice and customers don't notice anything, we're now at a point where any other single failure will take the site down. It's imperative to get everything that went down with that DC back up, and that's going to take more than an hour or two.

