Just in time! This will be a nice reference. I recently started working on a new Scala driver that uses the v0_4 asynchronous protocol, built on top of Akka's IO module and Play's JSON module. I think I have the performance where I want it, but now I need to flesh out the DSL for proper ReQL support.
Exponential backoff is a good idea, but to make it even better you'll want to add some random jitter to the delay period. Especially in the case of deadlocks, this helps avoid a repeat collision when two clients would otherwise retry at exactly the same time.
Absolutely. I made some graphs of different kinds of jitter approaches at http://www.awsarchitectureblog.com/2015/03/backoff.html. Simply backing off without jitter does very little to spread out spikes of work, and can lead to longer MTTRs if the cause of the issue was overload.
Would it be possible for the server to simply return a queue position to the contending clients? For instance, "There are 37 clients ahead of you. Try again in 37ms." (assuming each write is averaging 1ms)
I suspect if the client and server worked together in this fashion you could get much closer to the linear completion time seen with no backoff and also closer to linear work (since each client would try exactly twice in an ideal world where the server accurately predicted when it should try again).
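The cooperative scheme suggested above could be sketched roughly as follows. Everything here is hypothetical: no real server protocol returns these fields, and the field names (`queue_position`, `avg_write_ms`) are invented for illustration.

```python
import time

def retry_delay_ms(rejection: dict) -> float:
    # Hypothetical: estimated wait is the number of clients ahead of us
    # times the server's reported average write time.
    return rejection["queue_position"] * rejection["avg_write_ms"]

def submit_with_queue_hint(send, request):
    # Retry loop: if the server rejects us with a queue hint, sleep for
    # exactly the suggested time instead of guessing with backoff.
    while True:
        response = send(request)
        if response.get("status") == "ok":
            return response
        time.sleep(retry_delay_ms(response) / 1000.0)
```

In the ideal case described above, each client sends at most one extra request, so total work grows linearly with the number of contenders.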
I think most exponential backoff schemes I've looked at treat the 'backoff delay' as a range and do randrange(0, current_backoff) or similar which does what you say.
Truncated exponential (where you cap the upper limit of the retry delay to some maximum) is also often a good idea, to prevent a short service outage from spiking the retry timers to crazy numbers.
Doesn't exponential back-off mean that you select the time until retry uniformly at random from the interval [0, ..., 2^n-1] in the n-th round of failure, or something along those lines?
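Putting the ideas in this subthread together, a minimal sketch might look like this: a uniformly random delay in [0, 2^n - 1] slot times on the n-th failure ("full jitter"), truncated at a cap so a short outage can't inflate the timers. The `base` and `cap` values are arbitrary placeholders, not recommendations.

```python
import random
import time

def backoff_delay(attempt: int, base: float = 0.05, cap: float = 10.0) -> float:
    # Truncated binary exponential backoff with full jitter:
    # pick uniformly from [0, min(cap, base * (2**attempt - 1))].
    upper = min(cap, base * (2 ** attempt - 1))
    return random.uniform(0, upper)

def call_with_retries(op, max_attempts: int = 8):
    # Retry `op` on any exception, sleeping a jittered, capped delay between tries.
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_delay(attempt))
```

The randomness is what spreads the retry spikes out; the cap is what keeps the worst-case wait bounded during a longer outage.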
Currently continuous views must read from a stream. However, in the very near future it will be possible to write to streams from triggers, which would probably give you enough flexibility to model the behavior you want if you could conceptualize a table as a stream of changes.
Our next release (2.1) is due in about three weeks and includes automatic failover/high availability. Feeds on table joins (and other greatly expanded feed functionality) will be in 2.2, which should happen ~6-8 weeks after 2.1.
(Sorry to jump in with a shameless plug; what PipelineDB is doing is super-cool; I also met the founders a few times, and they're awesome, smart, and very driven people -- I'm really excited about what PipelineDB has to offer!)
KONG definitely looks interesting, and I'd love to know more about it, but there's not a lot written about it yet.
For example: I've gone searching through the blog posts, github readme, and KONG documentation, but I still have no idea _why_ it needs Cassandra. What does it store in there?
One of the main graphics on the KONG docs shows a Caching plugin (http://getkong.org/assets/images/homepage/diagram-right.png), but the list of available plugins doesn't include such an entry. Is that because caching is built in? Is the cache state stored in Cassandra? Or is the plugin yet to be built?
All the data that Kong stores (including rate-limiting data, consumers, etc) is being saved into Cassandra.
nginx has a simple in-memory cache, but it can only be shared across workers on the same instance, so in order to scale Kong horizontally by adding more servers there must be a third-party datastore (in this case Cassandra) that stores and serves the data to the cluster.
Kong supports a simple caching mechanism that's basically the one that nginx supports. We are planning to add a more complex Caching plugin that will store data into Cassandra as well, and will make the cached items available across the cluster.
if you want to build a new derived datastore, you can just start a new consumer at the beginning of the log, and churn through the history of the log, applying all the writes to your datastore.
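The replay idea above can be sketched with a toy model: the "log" is just an ordered list of write events, and the derived datastore is a dict rebuilt by applying every event from the start. A real consumer (e.g. Kafka) would stream these from offset 0 instead; the event shape here (`op`/`key`/`value`) is invented for illustration.

```python
def apply(store: dict, event: dict) -> None:
    # Apply one write event to the derived store.
    if event["op"] == "put":
        store[event["key"]] = event["value"]
    elif event["op"] == "delete":
        store.pop(event["key"], None)

def rebuild(log: list) -> dict:
    # Start from empty and churn through the entire history, in order.
    store = {}
    for event in log:
        apply(store, event)
    return store
```

Because replay is deterministic, any number of differently-shaped derived stores can be built from the same log just by writing a different `apply`.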
For high-throughput environments with lots of appends to the log, how do you get around the ever-increasing size of your log file? I know the traditional answer is to take a periodic snapshot and compact the previous data, but is that built in to tools like Kafka?
There's a log compaction cleanup policy, yes. I've never used it myself, but if I'm not mistaken it works like this: for each message you send to Kafka, you set a key with it. When Kafka does log compaction, it keeps only the last value for each key.
The other cleanup policy is to just have a retention time. After X minutes/days/weeks, segments of the log are simply deleted.
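The key-based compaction described above can be modeled in a few lines: for each key, only the most recent value survives, and survivors keep their original log order. This sketches the semantics only, not Kafka's actual segment-by-segment cleaner.

```python
def compact(log: list) -> list:
    # log is a list of (key, value) records in append order.
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later records overwrite earlier ones
    # Emit the surviving record for each key, in original log order.
    return [(key, value) for key, (offset, value) in
            sorted(latest.items(), key=lambda kv: kv[1][0])]
```

This is also why compaction only helps when each message is the complete state for its key, which is exactly the concern raised below about change events.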
That sounds great if your messages in the logs are the complete state for that key, but I'm not seeing how to use that compaction system if the messages are change events.
Is there a system designed for snapshotting the aggregate and logging the delta?
It's easy to store messages in HDFS or S3 for long-term storage. It's also easy to replay messages from those mediums, if you need to re-ingest data later on.
One idea is to shard the logs. By analogy with git: any given repo has a log of its commits, but you can have as many repos as you like.
It does limit throughput for any given shard, though, and then you're left with a distributed transaction problem to solve when you need to commit changes to objects in different repos.
Yes. When you're getting initial data you'll get a document of the form `{ new_val: data }`. When you're getting changes, the document is of the form `{ new_val: data, old_val: data }`. Note that in the former case, the `old_val` field is missing.
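A consumer can distinguish the two document shapes described above by checking for `old_val`. A minimal sketch (plain dicts standing in for the driver's feed documents; the deletion case, where `new_val` is `None`, follows RethinkDB's changefeed convention and isn't spelled out in the comment above):

```python
def classify(doc: dict) -> str:
    # Initial results carry only new_val; changes carry both fields.
    if "old_val" not in doc:
        return "initial"      # {"new_val": ...}
    if doc["new_val"] is None:
        return "deletion"     # {"old_val": ..., "new_val": None}
    return "change"           # {"old_val": ..., "new_val": ...}
```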