Searching 20 GB/sec: Systems Engineering Before Algorithms (2014) (scalyr.com)
105 points by preetamjinka on March 14, 2015 | 28 comments


Hi! Great to see this pop back up on HN. I'm the author of the blog post (and Scalyr founder), happy to answer any questions.

Downthread, someone mentioned that they couldn't find the HN discussion from when this was originally posted; it's here:

https://news.ycombinator.com/item?id=7715025


Great post.

BTW, this product (Scalyr) is a lifesaver. We (Periscope) are able to operate ~ a dozen heterogeneous servers with no FT DevOps largely because of Scalyr.


Lots of attention goes to OLTP-type loads for good reasons, but when you do design to just stream fast, some fun things happen:

You can use lots of relatively cheap spindles in parallel, and think of each one as (at least) 100MB/s of sequential read speed and a couple TB of space. You have fast compression available that can increase your effective bandwidth and make the effective cost of space cheaper.

You can draw on well-understood ways to search, sort, do hash- or sort-based joining and grouping, and so on.

Streaming doesn't need a big in-memory cache to avoid disk seeks, so you can use those gobs of RAM for other things--aggregating results or holding data to join against, say. (Of course, if you don't need the RAM, disk cache might still be useful for some access patterns.)

Besides log search, you see a stream-fast approach in analytics-focused DBs: BigQuery, Redshift, Vertica, and open-source ones--Facebook put up a good post about the work that led to their Hive ORCFile design.

Some bioinformatics tools load a big hashtable into memory and, roughly, hash-join against a ton of raw data streamed from disk, then sometimes repeat the process with another hashtable.
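The build-then-stream hash-join pattern described above can be sketched roughly like this (all class and method names are hypothetical, just illustrating the shape, not any particular tool):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the pattern described above: build a small in-memory hash
// table once, then stream a large dataset past it, joining as we go.
public class StreamHashJoin {
    // Build phase: load the small side into RAM.
    static Map<String, String> buildIndex(List<String[]> smallSide) {
        Map<String, String> index = new HashMap<>();
        for (String[] row : smallSide) {
            index.put(row[0], row[1]); // key -> payload
        }
        return index;
    }

    // Probe phase: stream the big side sequentially; each record does
    // one O(1) hash lookup, so the disk stays in sequential-read mode.
    static List<String> probe(Map<String, String> index, List<String> bigSideKeys) {
        List<String> joined = new ArrayList<>();
        for (String key : bigSideKeys) {
            String payload = index.get(key);
            if (payload != null) {
                joined.add(key + ":" + payload);
            }
        }
        return joined;
    }
}
```

In a real tool the "big side" would be read from disk in large sequential chunks rather than held in a list; the point is that the random access is confined to the in-memory table.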

These are not at all original observations, but I managed to hear about these sorts of analytics and bioinformatics tools for a while before really getting how or why they did things all that differently from a typical random-access-oriented database.


I have another idea for you guys. Instead of relying on expensive AWS SSD instances, why not switch to Hetzner, and keep everything in RAM?

128 GB RAM for $135/month:

https://www.hetzner.de/en/hosting/produkte_rootserver/px120

And you will have so much extra disk space, you can use it for backups. Or even resell it.

Your i2.4xlarge costs you $2,455/month.


Tried Hetzner last year; would never touch it with a bargepole.


What's wrong with them?


Had the machine for ~6 weeks.

In that time there were 4-5 days when I couldn't even log in because the network was "under attack"; they didn't tell me, I told them.

Their management interface is an exercise in frustration.

Customer support is horrific: "We are not aware of any ongoing network issues". Yup, except I can't ping it from any of my other servers anywhere in the world.

You get a fast machine for a cheap price but the network is a joke, the support is a joke and they are under a near perpetual DDoS.


I wonder why they chose Java for substring search. Why not C (strstr) or grep?

http://www.arstdesign.com/articles/fastsearch.html


We started with Java because the rest of our backend is written in Java. We stuck with it because it has not turned out to be much of a bottleneck in practice. This came as a surprise; I had expected that we'd need to rewrite the substring search loop in native code. We may still do so eventually, but we're able to get very high performance in pure Java. (1.25GB / second / core, as noted in the post.)

Edit to add: "good enough is not good enough" -- very well put! Yes, it would be good to move this loop to native code. However, with our current design, the overhead of calling out from Java to native code would probably outweigh any benefits. We're planning to move our core database storage to off-Java-heap memory buffers, at which point it will become much more feasible to call out to native code.

The inner loop speedup on this particular code will be more like 2x than 3x, and the speedup of the overall system will be quite a bit less than that. But your point is still correct.
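For a sense of the kind of loop being discussed, here is a naive byte-level substring scan in pure Java. This is only an illustrative sketch, not Scalyr's actual implementation:

```java
// Naive byte-level substring scan: the general shape of a pure-Java
// search loop. The JIT compiles this to tight machine code, which is
// part of why pure Java can approach native throughput here.
public class ByteScan {
    // Returns the index of the first occurrence of needle in haystack,
    // or -1 if absent.
    static int indexOf(byte[] haystack, byte[] needle) {
        if (needle.length == 0) return 0;
        byte first = needle[0];
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            if (haystack[i] != first) continue; // cheap first-byte filter
            for (int j = 1; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }
}
```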


Well, if you can triple your performance, you will require 1/3 of the servers, and your clients will enjoy faster responses. I thought that was the idea: good enough is not good enough.

Anyways, congrats on your success. I would love to read more about the business side - finding clients, profits, expenses, etc.


String searching is an inherently I/O or memory bound problem. Your CPU ends up waiting for bytes to arrive from memory at around 50 GB/s theoretical max, half of that in practice usually. The programming language or algorithm doesn't matter that much when memory bandwidth is saturated.

A faster implementation of a string searching algorithm could only save a few milliwatts of CPU power, it wouldn't make it faster nor require less hardware.
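The arithmetic behind that claim (using only the figures quoted in this thread, not measurements) can be written out like this; note it cuts both ways, since a single core at 1.25 GB/s is far from saturating ~25 GB/s of practical bandwidth:

```java
// Back-of-the-envelope bandwidth math: how many cores scanning in
// parallel would saturate memory bandwidth? Figures are from this
// thread: ~50 GB/s theoretical, ~25 GB/s practical, 1.25 GB/s/core.
public class BandwidthMath {
    static int coresToSaturate(double memBandwidthGBs, double perCoreScanGBs) {
        return (int) Math.ceil(memBandwidthGBs / perCoreScanGBs);
    }
}
```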


The author says they max out at 1.25 GB/s. That's a long way from the theoretical max, even for DDR3.

It's possible they are bound by the SSD speed.

I can't find much on AWS SSD max speeds.


"if you can triple your performance, you will require 1/3 of the servers, and your clients will enjoy faster responses"

You're going to have to choose one of those or the other...


That's not correct: up until the point that requests start queuing, increasing performance gets you lower latency and higher throughput simultaneously, in proportion.

Once there's a backlog of requests, faster requests on proportionally fewer machines will not help the queuing latency.

But in general it's not a good plan to be in that situation for long, because the back pressure that stops the queued requests expanding infinitely is people leaving your service in frustration.


Hey, can you help me understand this? I don't quite follow. Obviously for a fixed number of servers the throughput goes up and latency goes down. But it sounds like you're saying you can somehow reduce the number of servers and still have lower latency?

Let's just make up toy numbers. Suppose you have code that can process (say) 1000 lines per second, per machine. You have 3 machines and you need to process 90k lines. Each one gets assigned 30k lines and it takes 30 seconds overall, right?

Suppose you find a 3x speedup in the code. Now your 90k lines take 10 seconds. Alternatively, you can ditch two machines and process the whole 90k on one machine in 30 seconds, the same as the original time.

To me it seems like this is what tacotime was saying, you can either have faster responses or fewer machines.
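The toy numbers above reduce to a one-line calculation (all figures are the made-up ones from this comment):

```java
// Toy batch-processing model from the comment above: N machines each
// processing `rate` lines/sec, splitting `totalLines` evenly.
public class ToyThroughput {
    static double secondsToProcess(long totalLines, int machines, long linesPerSecPerMachine) {
        return (double) totalLines / machines / linesPerSecPerMachine;
    }
}
```

The three scenarios work out to 30 s (original), 10 s (3x speedup, same machines), and 30 s (3x speedup, one machine), which is the trade-off being claimed here.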


Hi there, didn't see this until just now.

The metric that is improving is the latency experienced by a user.

Imagine that you have a web app that takes one second of server CPU time to render a page, and you have three servers which process three hits a second in total. All three servers are thus on 100% CPU load, dealing with one hit a second each.

Each time somebody visits your site, they are going to experience a 1 second latency (in addition to communication latency), as they wait for one of your servers to build the page.

If you then optimize your code so that it completes in a third the time, 333ms, then your servers are going to suddenly be at 1/3 load; they execute their one query for 333ms, and then sleep for 667ms.

Not just that; but the user now only has to wait for 333ms for the page to render on the server, so the site gets a lot more snappy for them.

Then you can shut down two servers leaving one; it will sit on 100% load, but you still keep the shorter 333ms latency experienced by your users.

You are doing the same amount of work as before - 3 hits a second. But previously, when the tasks took 1s each, three would be running in parallel, each being completed more slowly. With the faster run time, they are running in series, being completed quickly before switching to the next one.

Now, this does not apply if your requests are queuing. Because you're not able to do any more work than before (after shutting down the other two servers), if your hits/s exceeds the capacity of your servers to deal with them, the backlog will grow just as fast as it normally would have, and the latency caused by this will skyrocket.
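The reasoning above amounts to a simple utilization calculation (numbers from the web-app example above):

```java
// Utilization math behind the example above: with independent requests
// and no queuing, utilization = arrivalRate * serviceTime / servers,
// while each user waits only serviceTime regardless of server count.
public class ServerMath {
    static double utilization(double arrivalPerSec, double serviceSec, int servers) {
        return arrivalPerSec * serviceSec / servers;
    }
}
```

At 3 hits/s: one-second requests on 3 servers give 100% load; 333 ms requests on 3 servers give 1/3 load; and 333 ms requests on 1 server are back to 100% load, but users still see the 333 ms latency.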

Your example doesn't fit because multiple machines aren't ever used to process front-end web requests in parallel; you don't render half of a template on one machine and the other half on another, for example. If you can set up a system like this and see gains from it, then what I've written above will not apply.


Gotcha, thanks for clarifying. I had something like batch processing for data analysis in mind, which is what I understood to be the target use model of the original article. What you are saying in the context of synchronous request processing makes sense and is an interesting point.


Thank you! A lot of people somehow don't get this. They think only in terms of horizontal or vertical scaling. (Increasing code performance is definitely a form of vertical scaling; it just isn't what people are usually thinking/talking about.)


Nice. But does not scale.


Judging from the comments, this article was written around May 8 2014. Can we get a (2014) in the title?


Good catch. Added.


The linked article has been posted before; I can't find the old HN thread, but it was certainly worth a re-read :)

I wonder whether Scalyr has reached their expected 100 GB/s yet?


It was posted 17 hours ago by the same guy with a different URL: https://news.ycombinator.com/item?id=9201444


I was asked by HN to repost it.


Yes; part of an ongoing experiment to reduce the randomness of /newest by giving substantive stories multiple chances to make the front page.

Edit: though it looks like we broke HN's rule about duplicates in this case. Sorry!


What are the criteria for a "substantive story"? Not being snarky or difficult, just genuinely curious.


Sorry for the late reply—I wanted to give you a serious answer and didn't have time earlier.

The criteria for "substantive" are what the HN guidelines (along with https://news.ycombinator.com/newswelcome.html) say about which stories are on topic: intellectually interesting (as opposed to sensational) and so on. HN's culture is well-enough established by now that I think most community members can probably agree, not of course on what interests each of us personally, but on a working set of candidate stories that roughly fit the criteria. That's our hypothesis, anyhow.

HN sees perhaps a thousand new stories a day. Most aren't "intellectually interesting" in HN's sense (i.e. gratifying intellectual curiosity), and the weeds have grown too thick for the upvote system alone to reliably surface the potential gems. To comb through /newest looking for them has become too much work.

Some interesting stories (again, "interesting" in the HN sense) do fine under the upvote system. Breaking news, stories about fashionable technologies, and anything controversial reliably attract upvotes. But the quieter and deeper submissions are often not immediately recognizable. Those deserve closer attention than they were getting, so we've been experimenting with different approaches to finding them.

In the spirit of "do things that don't scale", it's mostly us trying different things manually for now, but what we're looking for is a mechanism that can be opened to anyone willing to put in the effort. The upvote system will work exactly as it always has, but we're hoping to add a new one that complements it and compensates for its weaknesses.

Since it requires significant effort to look through the story stream hunting for out-of-the-way pieces, one idea (my favorite) is to make this a new way of earning karma on the site. But first we have to find something repeatable that will work.


Interesting. Good to know!



