Migrating bajillions of database records at Stripe (robertheaton.com)
169 points by luu on Sept 1, 2015 | 24 comments


When migrating to new data models I almost always follow this list now.

https://matthew.mceachen.us/blog/how-to-make-breaking-change...

Pretty much the perfect checklist for ensuring your migration strategy is safe.


This is a pretty normal way to handle things: double writes, active sync/migration, double reads, disable old writes, finish sync, disable old reads.
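
In Python-ish terms, a minimal sketch of those phases might look like this (the store wrapper, flag names, and mismatch logging are all made up for illustration, not anyone's actual implementation):

    # Minimal sketch of the dual-write / dual-read phases; store and
    # flag names are hypothetical.
    import logging

    class MigratingStore:
        def __init__(self, old_store, new_store, flags):
            self.old = old_store
            self.new = new_store
            self.flags = flags  # feature flags gating each phase

        def save(self, key, record):
            # Phase 1: double writes. The old store remains the source
            # of truth until the backfill catches up.
            if self.flags["write_old"]:
                self.old.save(key, record)
            if self.flags["write_new"]:
                self.new.save(key, record)

        def load(self, key):
            # Phase 2: double reads, comparing results to build
            # confidence before the cutover.
            old_rec = self.old.load(key) if self.flags["read_old"] else None
            new_rec = self.new.load(key) if self.flags["read_new"] else None
            if old_rec is not None and new_rec is not None and old_rec != new_rec:
                logging.warning("migration mismatch for %s", key)
            # Serve from the old store until the cutover flag flips.
            return new_rec if self.flags["serve_new"] else old_rec

Cutover is then just flipping the serve-new flag once the sync has finished and the mismatch rate has dropped to zero.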

Over the last 12 months we've migrated our entire system, which handles hundreds of millions of requests a day, across two different database systems. It just requires testing and good release management.

Also, it would be nice to have real numbers instead of "bajillions"... what does that even mean? It doesn't sound like much more than a few gigs of data, in which case this transition could take seconds by just using an in-memory system.


What database are they using? Their description reads like they used some NoSQL database, and then they needed to do an ALTER TABLE.

Also, how many merchants can they have? I have a database of all US businesses on a desktop machine and a server. It's a few gigabytes. There are about 20 million business entities in the US, and not all of them are Stripe customers.


This can happen even with something as robust as Postgres. If you have a huge table undergoing lots of read/write activity, the locking needed to modify the table schema can take (essentially) infinite time, slow your writes to a crawl, and/or spike memory usage on your DB to the point that you no longer have faith in the process completing.

Example from my past: setting a default column value on a huge table undergoing hundreds of writes per second. Oops.


This doesn't really affect any advanced database that has online alters and workload management. The alter would just take longer, but it wouldn't affect any normal transactions. It's a solved problem unless the database just doesn't support it.


AFAIK, all of the common open-source databases have this problem. Postgres is better than MySQL, but it still needs to do extensive locking when you alter a column.

Under heavy read/write loads and/or with large tables, this can blow up in your face.


IIRC, this isn't true for Postgres. Many common schema alterations in Postgres are lock-free (non-blocking). In MySQL, these alterations used to be locking (i.e. O(n) in the number of rows)... but perhaps it has gotten better? (I wouldn't know; I don't use MySQL.) A quick Google search turned up some substantiation:

- http://www.estelnetcomputing.com/index.php?/archives/12-Lock...

- http://stackoverflow.com/a/6542757


The relevant bit from your first link: "postgres will update the table structure and there is no locking unless you do something that needs to alter existing rows."

So, altering a column (e.g. adding a default) will trigger the problem. Adding a new column or table will not.
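
The usual workaround (not specific to the article, and with made-up table/column names) is to add the column with no default, backfill it in small batches, and only then set the default, so no single statement holds a heavy lock or rewrites the whole table. A rough Python/psycopg2 sketch:

    # Rough sketch of the batched-backfill workaround; table and column
    # names are made up, and this assumes psycopg2 and Postgres.
    import psycopg2

    conn = psycopg2.connect("dbname=app")  # assumed connection string
    conn.autocommit = True                 # each statement commits on its own
    cur = conn.cursor()

    # Fast: no existing rows are rewritten.
    cur.execute("ALTER TABLE payments ADD COLUMN region text")

    # Backfill existing rows in small batches so locks are held briefly.
    while True:
        cur.execute("""
            UPDATE payments SET region = 'us'
            WHERE id IN (SELECT id FROM payments
                         WHERE region IS NULL LIMIT 1000)
        """)
        if cur.rowcount == 0:
            break

    # Only affects rows inserted from now on; brief lock, no table rewrite.
    cur.execute("ALTER TABLE payments ALTER COLUMN region SET DEFAULT 'us'")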


I ran into this problem with MySQL. We have a table of user submissions with millions of rows, and the time the migration took on the test server would have meant unacceptable downtime in production. We ended up using https://www.percona.com/doc/percona-toolkit/2.1/pt-online-sc... to do an online update.
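
For anyone who hasn't used it, a typical invocation looks roughly like this (database, table, and column names here are invented):

    pt-online-schema-change \
      --alter "ADD COLUMN moderation_status VARCHAR(16) NOT NULL DEFAULT 'pending'" \
      D=appdb,t=user_submissions \
      --execute

It builds a shadow copy of the table with the new schema, keeps it in sync with triggers while copying rows over in chunks, and then swaps the tables at the end.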


They used Mongo, but I'm not sure that's still the case. I'd guess they use multiple datastores across the company's services.

I thought the same thing: how many can there be? Even considering that Stripe operates globally, I'd say it's a number in the tens of millions.

Edit: lots of typos


They released this tool a couple of years ago https://stripe.com/blog/announcing-mosql

In the post they wrote:

"Here at Stripe, we use a number of different database technologies for both internal- and external-facing services. Over time, we've found ourselves with growing amounts of data in MongoDB that we would like to be able to analyze using SQL. [...] MoSQL does an initial import of your MongoDB collections into a PostgreSQL database, and then continues running, applying any changes to the MongoDB server in near-real-time to the PostgreSQL mirror".

So they could have run that migration on MongoDB.


I wonder if whatever they're doing in Mongo really wasn't doable or viable in Postgres?


I imagine the approval for posting this article went something like, "Sure you can do it, just uh, don't use any real numbers."


Awesome writeup! It's validating to see that the way I do it is the way Stripe does it with bajillions of records, while the system is running. Very pragmatic and incremental - maybe a little overly cautious with the feature flag, but better safe than sorry.

I've had to do something similar when migrating password schemes. When users log in, they validate against their current password hash, and then a new hash is generated under the new scheme for the next login. These migrations usually take quite a while, though, since a user has to log in for their record to be migrated.
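
Roughly, the login path ends up looking like this (the hash schemes and the user model here are illustrative placeholders, not my actual code):

    # Sketch of lazily migrating password hashes at login time. The old
    # and new schemes and the user.save() call are illustrative only.
    import hashlib, hmac, os

    def old_verify(password, stored_hex):
        # Legacy scheme: unsalted SHA-256 (purely illustrative).
        digest = hashlib.sha256(password.encode()).hexdigest()
        return hmac.compare_digest(digest, stored_hex)

    def new_hash(password):
        # New scheme: salted PBKDF2 (illustrative choice).
        salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100000)
        return "pbkdf2$" + salt.hex() + "$" + digest.hex()

    def new_verify(password, stored):
        _, salt_hex, digest_hex = stored.split("$")
        digest = hashlib.pbkdf2_hmac(
            "sha256", password.encode(), bytes.fromhex(salt_hex), 100000)
        return hmac.compare_digest(digest.hex(), digest_hex)

    def login(user, password):
        # Validate against whichever scheme the stored hash uses, then
        # upgrade the record to the new scheme on success.
        if user.password_hash.startswith("pbkdf2$"):
            return new_verify(password, user.password_hash)
        if old_verify(password, user.password_hash):
            user.password_hash = new_hash(password)  # lazy migration step
            user.save()                              # hypothetical persistence
            return True
        return False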



Did you consider using views/triggers on the database to migrate to the new API? [1] You would still have to change all the models, though static typing would make this much easier/safer.

[1] http://johannesbrodwall.com/2010/10/13/database-refactoring-...
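
For concreteness, the idea is roughly the following (names invented, Postgres and psycopg2 assumed): expose the old shape as a view over the restructured table, with an INSTEAD OF trigger so writers against the old interface keep working during the transition.

    # Sketch of the view/trigger approach from [1]; table and column
    # names are invented.
    import psycopg2

    DDL = """
    -- New, restructured table.
    CREATE TABLE customers_v2 (
        id        bigint PRIMARY KEY,
        full_name text NOT NULL
    );

    -- A view that presents the old schema to existing readers.
    CREATE VIEW customers AS
    SELECT id,
           split_part(full_name, ' ', 1) AS first_name,
           split_part(full_name, ' ', 2) AS last_name
    FROM customers_v2;

    -- Writes through the old interface are rewritten onto the new table.
    CREATE FUNCTION customers_insert() RETURNS trigger AS $$
    BEGIN
        INSERT INTO customers_v2 (id, full_name)
        VALUES (NEW.id, NEW.first_name || ' ' || NEW.last_name);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER customers_insert_trg
    INSTEAD OF INSERT ON customers
    FOR EACH ROW EXECUTE PROCEDURE customers_insert();
    """

    with psycopg2.connect("dbname=app") as conn:  # assumed connection string
        with conn.cursor() as cur:
            cur.execute(DDL)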


Does the "before_save do" hook happen inside the same transaction as the other object's save method? If not, this could result in some gnarly data consistency issues...


You know, this is great for Stripe and all, but it makes me shudder to think of the size of the databases at, say, Bank of America, Wells Fargo, etc. They must be humongous.


Three years ago, one of the DBAs at Credit Suisse mentioned that the DB for their most important business line was 150 TB in size. I can't remember anymore whether that was an Oracle installation or Sybase... so this is gossip-level info.


And flubbing that migration means that people lose their houses.


Doubtful. Banks have effectively infinite piles of paper trails, and I suspect they have many, many backups in a variety of other forms as well. They're strongly risk-averse, and they will try hard to avoid culpability.

Nobody would lose their houses, but their IT staff would lose man-years of sleep, and people in many departments would probably be fired after the smoke settled.


I'm always intrigued by how companies with data of this magnitude handle migrations. I'm really glad Rob put this post together in such detail.


Bajillions? I see what you did there... Afraid your 10-20mil records aren't impressive enough? How many is it?


You realize it was probably the lawyers who wouldn't let him say, right?

How many users does your company have?



