A recent example illustrates the power of this approach:
Cloudflare's WAF (web application firewall) basically generates
Lua code for the (highly non-linear) maze of firewall rules. An
incoming attack triggers certain rules and the corresponding paths
are turned into linearized traces. These can be heavily optimized
by LuaJIT, much more so than you could ever hope to do with a
static compiler.
When a different attack wave is started, it'll spawn different
traces -- i.e. the generated code dynamically adapts to the
specific attacks.
The result is a much higher firewall throughput than you might be
able to achieve with a data-driven or a static compilation
approach. This translates directly into money saved on the
machines needed to handle the load.
On top of that, the WAF configuration (which is automatically generated Lua code) is different for each customer, so we rely heavily on caching and JIT compilation to get the performance we need.
The bottom line is that LuaJIT is very fast and very flexible.
Wow, I'd need some evidence to believe a JIT can do a better job than a 'static' compiler. What can a JIT do to optimize 'much more so'? How about even a little more? If you know some common conditional jump stats you can do slightly better. Is there anything else?
A static compiler has to generate code that handles every possible path through the code. This tends to pessimize code downstream of a control flow merge point, because such code has to be compiled assuming it can be reached via multiple paths. A trace compiler can compile only the paths that are actually taken on a given workload, and compile the (dynamically) uncommon cases into exits to the interpreter.
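To make the merge-point problem concrete, here is a minimal Lua sketch (entirely my own, invented for illustration): below the point where the two branches join, code must be correct for both possible types of `x`, so a static compiler for a dynamic language is forced to handle both. A trace compiler records one branch at a time and can specialize the code after the merge to the type it actually observed.

```lua
-- Hypothetical example: a control flow merge that pessimizes
-- statically compiled code but not a recorded trace.
local function describe(flag)
  local x
  if flag then
    x = 1        -- x is a number on this path
  else
    x = "one"    -- x is a string on this path
  end
  -- Merge point: statically, x may be a number or a string here,
  -- so downstream code must handle both. A trace through either
  -- branch knows x's exact type and can specialize accordingly.
  return tostring(x)
end

print(describe(true))   -- 1
print(describe(false))  -- one
```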
The firewall rule example Mike Pall gives is a pretty good one. You have code that has many different paths through it, but only specific ones are triggered, dynamically, by a particular attack. Those can be compiled into traces that assume the other paths are not taken (except to guard the possible control flow splits with trace exits). This can allow an optimizer to make much more aggressive assumptions on the actually-taken paths.
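A toy sketch of what such generated rule code might look like (the rule names and patterns here are invented; the real CloudFlare rules are not public). During, say, a SQL injection wave, the second branch becomes hot: LuaJIT records a linear trace through it, compiles the other branches into cheap guards with side exits, and optimizes the hot path aggressively.

```lua
-- Invented example of generated WAF rule code: a maze of
-- conditionals of which only a few paths are hot at any time.
local function check_request(req)
  if req.uri:find("%.%./") then
    return "block"                       -- path traversal rule
  elseif req.body and req.body:upper():find("UNION%s+SELECT") then
    return "block"                       -- SQL injection rule
  elseif not req.headers["user-agent"] then
    return "challenge"                   -- missing User-Agent rule
  end
  return "pass"
end
```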
"Contrary to intuition, we demonstrate that it is possible to use a piece of software to improve the performance of a native, statically optimized program binary, while it is executing. Dynamo not only speeds up real application programs, its performance improvement is often quite significant."
This is a common misinterpretation of the Dynamo paper: they compiled their C code at the lowest optimization level and then ran the (suboptimal) machine code through Dynamo. So there was actually something left to optimize.
Think about it this way: a 20% difference isn't unrealistic if you compare -O1 vs. -O3.
But it's completely unrealistic to expect a 20% improvement if you tried the same thing with the machine code generated by a modern C compiler at its highest optimization level.
Claiming that JIT compilers outperform static compilers, solely based on this paper, is an untenable position.
But, yes, JIT compilers can outperform static compilers under specific circumstances. This has more to do with e.g. better profiling feedback or extra specialization opportunities. But this is not what this paper demonstrates.
Many compiler optimizations have strong non-linear costs in terms of the number of control flow edges. A static compiler has to punt at a certain complexity. OTOH a JIT compiler is free to ignore many edges, since it may fall back to an interpreter for cold paths or attach new code anytime later if some edges become hot.
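A minimal Lua sketch of what "ignoring cold edges" means in practice (my own example, not from the thread): the error arm below is a control flow edge a trace compiler can simply leave to the interpreter. Only the hot loop body gets compiled, and the cold edge costs nothing unless it is actually taken.

```lua
-- Hot/cold path example: the trace compiler compiles the loop
-- assuming the type check succeeds, and turns the error arm
-- into a side exit back to the interpreter.
local function sum_numbers(t)
  local s = 0
  for i = 1, #t do
    local v = t[i]
    if type(v) == "number" then
      s = s + v                             -- hot path: compiled
    else
      error("non-number at index " .. i)    -- cold path: side exit
    end
  end
  return s
end

print(sum_numbers({1, 2, 3, 4}))  -- 10
```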
One interesting example is auto-vectorization (SIMDization) where static compilers have to generate code for all possible combinations of vector alignments in case the underlying alignment of the participating vectors is not statically known. This quickly gets very expensive in terms of code space. OTOH a JIT compiler can simply specialize to the observed vector alignment(s) at runtime, which show almost no variation in practice.
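LuaJIT's alignment specialization happens at the machine-code level, but the general idea of specializing to a runtime-observed value rather than compiling every case up front can be sketched in plain Lua (all names here are invented for illustration; `load` accepts a source string in Lua 5.2+ and LuaJIT):

```lua
-- Generate and cache a variant of a function specialized to a
-- constant observed at runtime, instead of emitting code for
-- every possible case ahead of time.
local cache = {}
local function scaler_for(k)
  if not cache[k] then
    cache[k] = assert(load(string.format(
      "return function(x) return x * %d end", k)))()
  end
  return cache[k]
end

local double = scaler_for(2)
print(double(21))  -- 42
```

In practice only a handful of variants ever get generated, just as only a handful of vector alignments are ever observed.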
Thank you for your comment. Very interesting points.
Isn't it wrong to say "compiled their C code at the lowest optimization level" though? As I read the paper, they compiled their C code at -O2. It's hard to work backwards in time to know exactly what that meant when they published their paper, but I suspect it meant something very similar to the current gcc definition:
-O2 Perform nearly all supported optimizations that do not involve a space-speed tradeoff.
Dynamic binary translation (DBT) can sometimes yield performance improvements: you still statically compile your program, but at runtime the DBT system decodes and re-encodes the binary instructions. That lets it trace execution, identify hot paths, and lay those basic blocks out contiguously in its code cache.
At a guess: you have a very large number of rules, so code locality among the ones that actually apply is important. Performance in large state machines is very much locality-dependent, and this way you get that locality dynamically rather than statically.
Interesting to see JIT compilation used for low-level and, I assume, high-performance workloads. NetBSD supports in-kernel Lua code; I wonder if dynamic compilation could produce expressive, flexible, and performant execution in drivers too.
PS: I wonder if it's related; maybe CloudFlare uses NetBSD and pushed for kernel embedding, or they just benefited from NetBSD's Lua love.
We run entirely on Linux, not NetBSD, and are not currently using Lua in the kernel at all. We did, however, sponsor some of Mike Pall's work on LuaJIT based on our particular workload: http://luajit.org/sponsors.html#sponsorship_perf
We use Lua for the WAF (as Mike says), but also for all request processing. This is in part because we use Nginx and in part because we employ agentzh (http://agentzh.org/) and he works on OpenResty. Fairly recently we moved all processing of requests through CloudFlare to a Lua-based system inside Nginx.
That's >5B page views per 24 hours going through LuaJIT-ed code.
I'll add a more serious question. This is a nice optimization, but wouldn't clever attackers respond in turn by eventually throwing highly varied traffic patterns to fluster the JIT?
Hundreds of servers in 23 locations worldwide, each server running multiple instances of Nginx: http://blog.cloudflare.com/a-tour-inside-cloudflares-latest-... The key fact is that this type of scale is possible with Nginx + Lua + hundreds (not thousands) of servers.
If that were to happen (and I think it's a big "if" because of the complexity of predicting what will throw it off given that the attacker doesn't know the code we are running) then we'd see the WAF latency increase and alarms would be generated immediately. That, in turn, would cause a bunch of other mechanisms to come into play.
Oh, I get that this is the key fact; I put a smiley in there because I was fishing for a more exact number that is unlikely to be given :)
Thanks for your answer. It's a very neat setup. (My intuition is that it's not a case of if but when, but you've probably won yourself quite a bit of time until that happens.)
Just as a side comment: optimisations to the attack paths are unlikely to make a substantial impact on overall performance, given that attack requests are only a fraction of all requests processed at any given time. (Excluding DoS, of course, but that's not an area where you'd expect a WAF to do much.) In a WAF, you want to heavily optimise the non-attack paths.