Put that database in memory (glinden.blogspot.com)
57 points by wheels on Nov 17, 2009 | hide | past | favorite | 22 comments


I've had very good results simulating an in-memory database with conventional RDBMS's by simply running the databases with a high ratio of physical memory to data - on the order of 128GB of memory for a 500GB database.

The advantage is that COTS applications work as is, you don't need to wait for researchers to dream up something new, you don't need to re-write an application to use the latest fashion in database technology, and most importantly, poor design & poor coding can be papered over with fast logical I/Os (in memory).

In conventional databases configured with high memory-to-data ratios, we don't have traditional I/O bottlenecks on database reads. Hence I've made a simple decision: it is cheaper to add memory (even expensive memory) than to build a disk I/O subsystem that can handle equivalent I/O.

Transaction writes are still a potential I/O issue though.


I just finished reading the paper that Greg is blogging about, by Ousterhout and a truckload of co-authors. Yes, they advocate keeping all data in RAM and logging to disk purely for backup. But they also argue that it will take years before this becomes practical (at least at scale), and that fundamental research is still needed on how to achieve durability, distribution, and other factors. I find that surprising. Do such systems really not exist today?


Such systems exist (Redis is a great example of the "just keep it in RAM" idea), but I think what Ousterhout et al. are getting at is that we are only starting to think through the consequences of moving to RAM-based DBs and have not really worked out the kinks. Right now, the RAM-based systems I'm aware of don't do much meta-analysis of access patterns or of the data itself to migrate or cluster blobs for maximum performance. We have gotten better at doing this for data on disk (e.g. column-based stores, using large disk chunks to cut down on seeks), but we have only just started thinking about how things change when the primary copy of the data is all in RAM. We have a lot of experience with efficient use of disk resources; the world of RAM-resident data is still pretty new, and we are still learning which of our assumptions from the disk-based world will carry over to the new paradigm.


I'm starting to do things like this in Redis. For instance, consider Redis lists: POP and PUSH are O(1), and getting the first or last 10 items in constant time is also possible with LRANGE. But what if a user often accesses a "far" range in a very long list? A similar problem affects ZRANGE and sorted sets.

OK, this problem lends itself to an access-pattern-based optimization. For instance, the linked list can have an associated N-element circular buffer of pointers to far elements, so that if a client keeps asking for "LRANGE mylist 1000000 1000010", Redis can jump straight to the Nth node whenever it is in the (small) circular buffer.

This is probably just the start, but in general a small cache of "node" pointers to recently and heavily used places in data structures looks like a promising strategy for turning otherwise-O(N) access patterns into constant time when they are very frequent.
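A toy sketch of the idea (my own illustration, not Redis internals; `CachedList`, `CACHE_SIZE`, and the rest are made-up names): keep a few recently visited (index, node) jump points, so a repeated far-range read walks from the nearest cached node instead of from the head.

```python
class Node:
    __slots__ = ("value", "next")

    def __init__(self, value):
        self.value = value
        self.next = None


class CachedList:
    """Singly linked list with a tiny circular buffer of jump points."""

    CACHE_SIZE = 4

    def __init__(self, values):
        self.head = None
        tail = None
        for v in values:
            n = Node(v)
            if tail is None:
                self.head = n
            else:
                tail.next = n
            tail = n
        self.cache = []      # [(index, node)] jump points
        self.cache_pos = 0   # next slot to overwrite once full

    def _nearest(self, index):
        # Best cached jump point at or before `index`, else the head.
        best_i, best_n = 0, self.head
        for i, n in self.cache:
            if best_i < i <= index:
                best_i, best_n = i, n
        return best_i, best_n

    def lrange(self, start, stop):
        i, node = self._nearest(start)
        while i < start:     # walk only from the nearest jump point
            node = node.next
            i += 1
        # Remember where this range started, for the next similar request.
        if len(self.cache) < self.CACHE_SIZE:
            self.cache.append((start, node))
        else:
            self.cache[self.cache_pos] = (start, node)
            self.cache_pos = (self.cache_pos + 1) % self.CACHE_SIZE
        out = []
        while node is not None and i <= stop:
            out.append(node.value)
            node = node.next
            i += 1
        return out
```

The first `lrange(1000000, 1000010)` is still O(N), but every later request near that offset starts from the cached pointer and is effectively constant time.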


Simply storing the data in memory, with backup on disk, is not difficult or novel. The hard part is meeting the design's performance goals:

"It should be possible to achieve access latencies of 5-10 microseconds, measured end-to-end for a process running in an application server to read a few hundred bytes of data from a single record in a single storage server in the same datacenter."

A single multi-core storage server should be able to service at least 1,000,000 small requests per second.


Hmm, but the paper mostly argues that that performance advantage comes automatically from using RAM instead of disk. The only obstacle they mention to realizing 5-10µs latency per se is that networking infrastructure can't yet support it. No question that's a big deal, but they spend far more time talking about durability, data models, distribution and scaling, all of which are auxiliary to raw latency. I'm not criticizing the paper, just seeking to understand how far the vision they outline really is from current practice.

One interesting aspect is the sharp distinction the authors make between the approach they're advocating and systems that make heavy use of RAM caching (obviously common today), even when the RAM caches hold nearly all the data. So ok, let's rule out cache-based systems as instances of what they call RAMclouds. How widespread, today, are true RAM-based systems (as defined in the paper), even if they don't yet achieve 5-10µs round trips to storage? What major systems are in production that can be cited? Google were famous years ago for keeping their web indexes in RAM. Does that count?

No question many innovative projects have sprouted up in the last few years in this space (e.g. Redis), and while they sound fabulous, I don't think they count as answers to my question without examples of major systems in production (for some fuzzy value of "major"). If anyone can cite any, please do.

Troll disclaimer: Our startup is currently working on these very issues (storage strategy and how application talks to storage), so my interest is both genuine and acute.

PS: Simply storing the data in memory, with backup on disk, is not difficult or novel.

That's good to hear. What specifically are the common strategies for backup to disk?


The only obstacle they mention to realizing 5-10µs latency per se is that networking infrastructure can't yet support it.

The paper also talks about other issues that will need to be resolved (Section 4.1): software overheads like context switching and polling network devices, and protocol issues (e.g. TCP's minimum retransmission timeout is currently 200 milliseconds, which would make even a single dropped packet catastrophic for latency).

I do agree that it is a bit suspect for the authors to advocate at length for low-latency RPCs without volunteering to step up to the plate and do the innovation in switches and network hardware that this will require.

What specifically are the common strategies for backup to disk?

Well, most schemes follow some variant of write-ahead logging to a "stable" location: either local disk or to the RAM of a remote network node (and then assuming that the local and remote nodes won't fail simultaneously).
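As a minimal sketch of the local-disk variant (my own toy illustration; `WALStore` and its layout are made-up, not any real product's format): every write is appended and fsync'd to the log *before* the in-memory table is updated, and recovery after a restart is just a replay of the log.

```python
import json
import os


class WALStore:
    """Toy in-memory dict with a write-ahead log for durability."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            # Recovery: replay the log to rebuild the in-memory state.
            with open(path) as f:
                for line in f:
                    rec = json.loads(line)
                    self.data[rec["k"]] = rec["v"]
        self.log = open(path, "a")

    def set(self, key, value):
        # 1. Make the write durable on "stable" storage...
        self.log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. ...then apply it to the in-memory copy.
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def close(self):
        self.log.close()
```

The remote-RAM variant replaces the fsync with an acknowledged send to another node, trading the fsync latency for the assumption that both nodes won't die at once.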



Or, in Haskell rather than Erlang, something like Happstack: http://happstack.com/



working in stealth mode on the next-generation of OLTP DBMS

That surely does not count as existing today.


The early release program has already started and the product is in use.


>> Do such systems really not exist today?

ext2?


I once heard a rumor of a government system with a petabyte of RAM.


Um, don't you lose durability (or at least transactional semantics) if you do that? Are we just assuming that replication is sufficient and not all the in-memory copies of the data will die at once?


Sort of. Given a continuum of reliability needs, options for variably controlling the window of data loss are welcome.

The bottom line is solid-state components are very reliable. Across hundreds of systems, I see uptimes in years. If I can dial that data loss potential in a robust way, I'll take the 5 nines of 100x write capability.


Well, there are many ways to add durability to an in-memory DB. For instance, Redis 1.1 supports three of them, with different trade-offs between safety and performance.

1 - snapshotting) This relies on the fact that after a fork() the OS uses copy-on-write semantics. So Redis fork()s and dumps a very compact snapshot of the in-RAM data to disk. This snapshot is also used in master-slave replication for the initial synchronization. You can configure different save points, for instance "save when there are at least 100 changes and 60 seconds have elapsed", and so forth.

2 - append only journal) In this mode Redis appends every command that modified the DB to a file, so the log can later be replayed to rebuild the state of the DB. This is much more durable. These days I'm coding a command that rebuilds the log in the background, to avoid ending up with a huge log. The rebuilding process is fully non-blocking since it uses the same trick as background snapshotting, that is, the copy-on-write of fork(). So basically Redis forks and the child starts rebuilding the append-only log into a different file. The parent process continues to log to the old file and accumulates all the new differences in RAM. When the child finishes the rebuild, the parent appends the accumulated differences to the end of the new file and atomically rename(2)s the new log over the old one.

3 - master-slave replication.
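For the curious, here's a hedged sketch of the fork()/copy-on-write trick from point 1 (my own toy illustration in Python, not Redis's actual code): after fork(), the child sees a frozen, point-in-time view of the parent's memory and can dump it at leisure while the parent keeps taking writes.

```python
import json
import os


def snapshot(data, path):
    """Dump `data` to `path` from a forked child; return the child's pid."""
    pid = os.fork()
    if pid == 0:
        # Child: thanks to copy-on-write, this is a point-in-time view of
        # `data`; writes the parent makes after fork() are invisible here.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)  # atomic swap, like the AOF rewrite above
        os._exit(0)
    return pid  # parent: keep serving writes while the child dumps
```

In the parent you'd call `pid = snapshot(db, path)`, keep mutating `db`, and `os.waitpid(pid, 0)` when you want to confirm the dump finished; mutations made after the fork won't appear in the snapshot.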


That's exactly the continuum I'm referring to.


The old quipu X.500 directory originally used a similar approach (load into RAM at startup, writes go to RAM+disk, reads and searches come from RAM).

You get durability and run-time performance, but there can be appreciable downtime during the startup phase.


It's crazy that this isn't standard practice already. I guess it just shows how much momentum an old paradigm can have.


Cost/Benefit.

Buying several TB of RAM is not cheap...


Also, there is a lot of cold data out there. Storing all of it in RAM just doesn't make sense right now, given upfront + ongoing expense (energy).



