We can't keep going on like this. The vulnerability of centralised internet infrastructure is a huge problem for everyone. Somebody, somewhere, really ought to sort it all out.
10-20 minute router misconfigurations and subsequent fixes are sometimes a fact of life. Big network infrastructure is complicated, and sometimes the best laid route tables of mice and men do go abloop and die.
Outages happen no matter what the infrastructure is. There's no solution, they're just something you need to recognize and handle, which Cloudflare seemingly did relatively quickly here.
I feel like for a lot of sites CF & CDNs are the only way to survive Reddit/HN/etc - do you disagree?
I definitely agree with you in concept, but then I think back to how frequently script kiddies took down sites ~10 years ago, or whatever. I feel like what has changed is the massive CDNs in front of so many sites.
So while I do want a better solution, I'm not sure what it looks like. Thoughts?
Does it really matter? If you're small, who cares if you go down for half an hour? What, you'll make $0.02 this hour instead of $0.05? If you're big, you can afford your own infrastructure. Stick a few servers in a few colos around the world and you'll have better uptime than CF and friends anyway.
> I feel like for a lot of sites CF & CDNs are the only way to survive Reddit/HN/etc - do you disagree?
Reddit/HN/etc will send all users to the same URL. Almost all of those users will come without any pre-existing cookies. Serving the same content to all those users should not be impossible for most sites without CF or a CDN.
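To make that concrete, here is a minimal sketch of serving one rendered copy to every cookie-less visitor. The class name `AnonymousPageCache`, the `render` callback and the 30-second TTL are all made up for illustration; a real site would more likely do this in its web server or a reverse proxy than in application code.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class AnonymousPageCache:
    """Reuse one rendered page for all cookie-less visitors (hypothetical helper)."""

    def __init__(
        self,
        render: Callable[[str], str],
        ttl_seconds: float = 30.0,  # assumed value, tune for your site
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._render = render
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str, cookies: Optional[Dict[str, str]] = None) -> str:
        # Visitors with cookies may need personalised output, so skip the cache.
        if cookies:
            return self._render(url)
        now = self._clock()
        hit = self._entries.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # same bytes for everyone arriving from the link
        body = self._render(url)
        self._entries[url] = (now, body)
        return body
```

With something like this in front of the page, a thousand anonymous hits on the same URL inside the TTL cost one render rather than a thousand.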
a) complexity: trick your servers into doing something hard
b) volumetric: overwhelm your servers with a lot of traffic
c) volumetric part two: overwhelm your servers with a lot of requests, so you respond with a lot of traffic
A and C are things you can work on yourself: try to limit the amount of work your server does in response to requests, and/or make resource-consuming responses require resource-consuming requests; and monitor and fix hotspots as they're found.
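One possible way to limit the work done per client is a token bucket where expensive endpoints cost more tokens than cheap ones. This is only a sketch of that idea, not something prescribed above; the class name `WorkBudget`, the costs and the refill rate are assumptions picked for illustration.

```python
import time
from typing import Callable


class WorkBudget:
    """Per-client token bucket; heavier requests spend more tokens (hypothetical)."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def try_spend(self, cost: float) -> bool:
        # Top up the bucket for the time elapsed since the last call.
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if cost > self.tokens:
            return False  # reject instead of doing the expensive work
        self.tokens -= cost
        return True
```

A handler might call `try_spend(1)` for a cached page and `try_spend(8)` for a search, keeping one bucket per client IP or account, so a client hammering the costly endpoint runs dry quickly while ordinary browsing is unaffected.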
B is tricky. There are two ways to handle a volumetric attack: either have enough bandwidth to accept and drop the packets on your end, or convince the other end to drop them (usually called null routing). Null routes work great, but they usually drop all packets to a particular destination IP, which means you need to move your service to another IP if you want it to stay online. That's hard to do if your IP needs to stay fixed for a meaningful time (the TTL for glue records at TLDs is usually at least a day), and IP space is limited, so if your attackers are quick at moving their attacks, you could run out of IPs to use. Some attacks are going above 1 Tbps, though, so that's a lot of bandwidth if you need to accept and drop. And of course, the more bandwidth people get so they can weather attacks, the more bandwidth can be used to attack others if it's not well secured.
I'm not very familiar with DDoS protection strategies. Can you please elaborate on what is meant in (c) by "make resource consuming responses require resource consuming requests"?
Making people log in before they can search is a common example for forums. Search is expensive, and unauthenticated search will bring low-end forums down, so they make you create an account and log in.
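As a rough sketch of the forum example, the check just has to run before any of the expensive work does. The names `search_posts` and `LoginRequired` are invented here, and the substring scan stands in for whatever the real (costly) search would be.

```python
from typing import List, Optional


class LoginRequired(Exception):
    """Raised when an anonymous visitor asks for an expensive operation."""


def search_posts(posts: List[str], query: str, session_user: Optional[str]) -> List[str]:
    # Cheap check first: anonymous requests are turned away before any scanning.
    if not session_user:
        raise LoginRequired("log in to search")
    # Stand-in for the expensive part (a full scan of every post).
    needle = query.lower()
    return [post for post in posts if needle in post.lower()]
```

The point is that an attacker now has to create and log in to an account (the resource-consuming request) before the server will do the search (the resource-consuming response).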