Well, if you can triple your performance, you will require 1/3 of the servers, a...

exDM69 · on March 15, 2015

String searching is an inherently I/O or memory bound problem. Your CPU ends up waiting for bytes to arrive from memory at around 50 GB/s theoretical max, half of that in practice usually. The programming language or algorithm doesn't matter that much when memory bandwidth is saturated.

A faster implementation of a string searching algorithm could only save a few milliwatts of CPU power, it wouldn't make it faster nor require less hardware.

imaginenore · on March 15, 2015

The author says they max out at 1.25 GB/s. That's a long way from the theoretical max when it comes to even DDR3.

It's possible they are bound by the SSD speed.

I can't find much on AWS SSD max speeds.

tacotime · on March 14, 2015

"if you can triple your performance, you will require 1/3 of the servers, and your clients will enjoy faster responses"

You're going to have to choose one of those or the other...

reitzensteinm · on March 15, 2015

That's not correct, until the point that requests are queuing, increasing performance gets you lower latency and higher throughout simultaneously in proportion.

Once there's a backlog of requests, faster requests on proportionally fewer machines will not help the queuing latency.

But in general it's not a good plan to be in that situation for long, because the back pressure that stops the queued requests expanding infinitely is people leaving your service in frustration.

cshimmin · on March 16, 2015

Hey, can you help me understand this? I don't quite follow. Obviously for a fixed number of servers the throughput goes up and latency goes down. But it sounds like you're saying you can somehow reduce the number of servers and still have lower latency?

Lets just make up toy numbers. Suppose you have a code that can process (say) 1000 lines per second, per machine. You have 3 machines and you need to process 90k lines. Each one gets assigned 30k lines and it takes 30 seconds overall right?

Suppose you find a 3x speedup in the code. Now your 90k lines takes 10 seconds. Alternatively, you can ditch who machines, and process the whole 90k on one machine in 30 seconds, the same as the original time.

To me it seems like this is what tacotime was saying, you can either have faster responses or fewer machines.

reitzensteinm · on March 21, 2015

Hi there, didn't see this until just now.

The metric that is improving is the latency experienced by a user.

Imagine that you have a web app that takes one second of server CPU time to render a page, and you have three servers which process three hits a second in total. All three servers are thus on 100% CPU load, dealing with one hit a second each.

Each time somebody visits your site, they are going to experience a 1 second latency (in addition to communication latency), as they wait for one of your servers to build the page.

If you then optimize your code so that it completes in a third the time, 333ms, then your servers are going to suddenly be at 1/3 load; they execute their one query for 333ms, and then sleep for 667ms.

Not just that; but the user now only has to wait for 333ms for the page to render on the server, so the site gets a lot more snappy for them.

Then you can shut down two servers leaving one; it will sit on 100% load, but you still keep the shorter 333ms latency experienced by your users.

You are doing the same amount of work as before - 3 hits a second. But previously, when the tasks took 1s each, three would be running in parallel, each being completed more slowly. With the faster run time, they are running in series, being completed quickly before switching to the next one.

Now, this does not apply if your requests are queuing. Because you're not able to do any more work than before (after shutting down the other two servers), if your hits/s exceeds the capacity of your servers to deal with them, the backlog will grow just as fast as it normally would have, and the latency caused by this will sky rocket.

Your example doesn't fit because multiple machines aren't ever used to process front end web requests in parallel; you don't render half of a template on one machine, and the other half on another, for example. If you can set up a system like this and see gains from it, then what I've written above will not apply.

cshimmin · on April 3, 2015

Gotcha, thanks for clarifying. I had something like batch processing for data analysis in mind, which is what I understood to be the target use model of the original article. What you are saying in the context of synchronous request processing makes sense and is an interesting point.

latch · on March 15, 2015

Thank you! A lot of people somehow don't get this. They think only in terms of horizontal or vertical scaling. (Increase code performance is definitely a form of vertical scaling, it isn't what people are using thinking/talking about)