One other huge downside of raiding EBS volumes is you can't use EBS's snapshotti...

saurik · on March 18, 2011

You have to snapshot at the system level anyway if you want a consistent snapshot: otherwise the filesystem (or your database) could have been reordering and delaying writes that end up not being part of the "consistent snapshot". This is simply not a RAID-specific issue, nor is it a problem with EBS (as it is generally easy to use LVM, xfs, and/or PostgreSQL to handle that part of the job).

agmiklas · on March 19, 2011

This is something I've never quite understood. Best practice guides say you need to do a "flush all tables" in MySQL and then do a filesystem freeze (possible in XFS) before you can use a snapshot system like the ones built into EBS or LVM. If you don't, you apparently stand a good chance of getting an inconsistent snapshot, even if the snapshotting mechanism itself is (like EBS and LVM) "point in time" consistent.

Why is all this necessary? If the system (i.e. DB + FS + block device) are all working as they should, then once a commit returns, the data should be on disk. If it's not, you have no guarantee data that you thought was committed will still be there after a kernel panic or power outage.

In that case, no amount of xfs-freeze or table flushing during a snapshot is going to save you from the fact that your DB is one kernel panic away from losing what the rest of your system believed were committed transactions.

saurik · on March 19, 2011

In the specific case of a database server that actually has correct fsync semantics that the user has not disabled for some crazy performance reason, you are correct. However, there are many use cases that people want consistent snapshots across, like "apt-get install", that do not use a write barrier for every atomic-feeling operation.

(In fact, with a good database solution, like PostgreSQL, the RAID issue of the parent post is also solved: put your write-ahead or checkpoint logs on a single device, as its linear writes will easily swamp network I/O on an EBS, and use RAID only for backend storage, where you need random I/O.)

parasubvert · on March 19, 2011

This is one reason why Oracle is still the gold standard. when entering hot backup mode, which is what you do during a snapshot, it logs the FULL BLOCKS that are changed. Failures and inconsistencies can be replayed from the archive logs.

Of course this means you can quickly blow out your log archival , so it's meant to be a transitory mode:

saurik · on March 19, 2011

PostgreSQL has this exact same feature.

andrewvc · on March 18, 2011

True. However, for some cases where you don't mind losing some data due to a recovery process EBS snapshots are 'good enough'. Additionally, with a database like CouchDB with a 'crash only' design, it should work for some cases as well.

enjo · on March 18, 2011

We use EBS snapshots as a last-resort backup. They're really convenient that way. We have a more robust backup system, but in the unlikely event that something goes wrong at least we have those snapshots, even if they're not perfect.

WALoeIII · on March 18, 2011

xfs_freeze

In fact there is a handy package called ec2-consistent-snapshot (https://launchpad.net/ec2-consistent-snapshot) that will manage this for you!