
You start with reliability brownouts: first fail 0.1% of requests, then after a week 1%, then after a month 5%.
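To make the mechanics concrete, a minimal sketch of what that escalation could look like, assuming a single request handler and a date-keyed schedule; the dates, rates, and names are made up, not taken from any real service:

    import random
    from datetime import date

    # Hypothetical brownout schedule: failure probability ramps up as the
    # shutdown date approaches. All dates and rates here are placeholders.
    BROWNOUT_SCHEDULE = [  # (start date, fraction of requests to fail), ascending
        (date(2025, 1, 1), 0.001),  # first week: fail 0.1% of requests
        (date(2025, 1, 8), 0.01),   # after a week: 1%
        (date(2025, 2, 1), 0.05),   # after a month: 5%
    ]

    def brownout_rate(today):
        # Pick the latest schedule entry that has already started.
        rate = 0.0
        for start, r in BROWNOUT_SCHEDULE:
            if today >= start:
                rate = r
        return rate

    def serve(request):
        # Stand-in for the real request handler.
        return "ok"

    def handle_request(request):
        if random.random() < brownout_rate(date.today()):
            # Refuse with a pointer to the deprecation notice.
            return 503, "This service is being retired; see the migration guide."
        return 200, serve(request)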


Much better is to stop the service but add a "Resume" button that re-enables it for two more weeks with no data loss. That way you give users the opportunity to migrate away gracefully.
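Roughly this bookkeeping, assuming a per-customer shutoff date is stored somewhere; the field and function names are hypothetical:

    from datetime import date, timedelta

    RESUME_EXTENSION = timedelta(weeks=2)

    # Hypothetical per-customer record; in practice this lives in a database.
    account = {"service_stops_on": date(2025, 3, 1), "data_deleted_on": None}

    def is_serving(acct, today):
        return today < acct["service_stops_on"]

    def press_resume(acct, today):
        # Push the shutoff date out by two weeks; the data itself is never touched.
        if acct["data_deleted_on"] is None:
            acct["service_stops_on"] = today + RESUME_EXTENSION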

Stopping the service and immediately deleting the data is just callous.


When OVH sunsetted a class of VPSen and I'd completely failed to notice they were going to do that, I asked nicely in the support ticket I'd sent in and they turned it back on for a few days while I shifted the data to a replacement (which was still an OVH VPS; it had been Just Working long enough that I didn't feel I'd been mistreated, more lulled into complacency by the lack of problems).

I think requiring a ticket might be a worthwhile trade-off compared to just adding the button: it lets you engage with customers to make sure they can (in a case like this) migrate to a different region of your own service, and the activation energy of sending a ticket means a customer is less likely to click 'Resume' and then forget about it again until it's too late.


I mean, this is why you do these projects on two different timelines: the internal timeline and the external timeline.

Externally, you communicate: different announcements each month, final notices at T+5M, "system will be deleted at T+6M and data will be lost at that point", and so on.

Internally (at least at work), the timeline is more like this: at T+6M we cut access to the systems. Afterwards, systems not accessed for 2-4 weeks are removed periodically, and the hard removal is planned for T+9M. Customer support and account managers can step in if systems need to be accessed again. If a customer needs the system for a longer time, they can keep it, but then they pay for it in full, with all the necessary infrastructure, not just renting a few licenses on the system.
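For concreteness, the two schedules can be kept as separate data, so internal tooling checks one while customers only ever see the other; a rough sketch, where T0, the 30-day 'month', and the milestone labels are all placeholders:

    from datetime import date, timedelta

    # Purely illustrative: T0 and the 30-day "month" are placeholders.
    T0 = date(2025, 1, 1)
    M = timedelta(days=30)

    # External timeline: what customers are told.
    EXTERNAL = [
        (T0 + 1 * M, "monthly reminders start"),
        (T0 + 5 * M, "final notice"),
        (T0 + 6 * M, "system will be deleted, data will be lost"),
    ]

    # Internal timeline: what actually happens, with slack built in.
    INTERNAL = [
        (T0 + 6 * M, "cut access"),
        (T0 + 6 * M + timedelta(weeks=2), "start periodic removal of idle systems"),
        (T0 + 9 * M, "hard removal of remaining data"),
    ]

    def due(timeline, today):
        # Milestones that have already passed as of `today`.
        return [label for when, label in timeline if today >= when]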

Call it a bit callous, but this allows our customer support to appear nice and in control, and it leaves the customer happy and relieved that we've left some slack and leeway. But they've been shaken awake and can get on with migrating.

The biggest challenge here is to stay on it and not allow customers to become complacent again. This can be done by e.g. limiting the reactivation window to a week or so, so they have to get on with it.


Yep, and in certain in-house situations it's best to keep a backup around for ~13 months in case there's an obscure business process that only gets done once a year. (I'm aware that some people reading that sentence are going to go wtf at the idea that that's anybody's problem except whoever didn't tell you said business process even existed, but if it's a sufficiently critical finance or HR thing it tends to rapidly become everybody's problem, so I like to have options.)

Agree absolutely wrt complacency; I believe I asked for less than a week because I actively preferred a situation where I had to get on it immediately.


That seems like the worst of both worlds: during the brownout you have to keep paying for the compute while your customers don't get a reliable service, even if they have a plan to migrate.

Also, you probably can't keep charging customers for that period, since you're deliberately offering a crippled service.


If you are shutting it down, you can pay for the grace period. Period.


Just shut it down for real (after proper early warnings), so you save on compute and no one is confused about the state, and offer data retrieval for the grace period.

Brownouts are great for API changes, but not very useful before a full shutdown.


You assume warnings reach users. Some people miss emails. Fewer miss a service going offline. Keeping data after shutdown is a good backup.


That's why I'm saying to take it offline. A purposefully broken service is not very valuable, can't really be sold, and yet can still be missed; it also costs you money.



