A recent example illustrates the power of this approach:
Cloudflare's WAF (web application firewall) basically generates
Lua code for the (highly non-linear) maze of firewall rules. An
incoming attack triggers certain rules and the corresponding paths
are turned into linearized traces. These can be heavily optimized
by LuaJIT, much more so than you could ever hope to do with a
static compiler.
When a different attack wave is started, it'll spawn different
traces -- i.e. the generated code dynamically adapts to the
specific attacks.
The result is a much higher firewall throughput than you might be
able to achieve with a data-driven or a static compilation
approach. This translates directly into money saved on the
machines needed to handle the load.
On top of that, the WAF configuration (which is automatically generated Lua code) is different for each customer, so we rely heavily on caching and JIT compilation to get the performance we need.
The bottom line is that LuaJIT is very fast and very flexible.
Wow, I'd need some evidence to believe a JIT can do a better job than a 'static' compiler. What can a JIT do to optimize 'much more so'? How about even a little more? If you know some common conditional jump stats you can do slightly better. Is there anything else?
A static compiler has to generate code that handles every possible path through the code. This tends to pessimize code downstream of a control flow merge point, because such code has to be compiled assuming it can be reached via multiple paths. A trace compiler can compile only the paths that are actually taken on a given workload, and compile the (dynamically) uncommon cases into exits to the interpreter.
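To make the merge-point problem concrete, here is a minimal Lua sketch (entirely my own, invented for illustration): below the point where the two branches join, code must be correct for both possible types of `x`, so a static compiler for a dynamic language is forced to handle both. A trace compiler records one branch at a time and can specialize the code after the merge to the type it actually observed.

```lua
-- Hypothetical example: a control flow merge that pessimizes
-- statically compiled code but not a recorded trace.
local function describe(flag)
  local x
  if flag then
    x = 1        -- x is a number on this path
  else
    x = "one"    -- x is a string on this path
  end
  -- Merge point: statically, x may be a number or a string here,
  -- so downstream code must handle both. A trace through either
  -- branch knows x's exact type and can specialize accordingly.
  return tostring(x)
end

print(describe(true))   -- 1
print(describe(false))  -- one
```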
The firewall rule example Mike Pall gives is a pretty good one. You have code that has many different paths through it, but only specific ones are triggered, dynamically, by a particular attack. Those can be compiled into traces that assume the other paths are not taken (except to guard the possible control flow splits with trace exits). This can allow an optimizer to make much more aggressive assumptions on the actually-taken paths.
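A toy sketch of what such generated rule code might look like (the rule names and patterns here are invented; the real CloudFlare rules are not public). During, say, a SQL injection wave, the second branch becomes hot: LuaJIT records a linear trace through it, compiles the other branches into cheap guards with side exits, and optimizes the hot path aggressively.

```lua
-- Invented example of generated WAF rule code: a maze of
-- conditionals of which only a few paths are hot at any time.
local function check_request(req)
  if req.uri:find("%.%./") then
    return "block"                       -- path traversal rule
  elseif req.body and req.body:upper():find("UNION%s+SELECT") then
    return "block"                       -- SQL injection rule
  elseif not req.headers["user-agent"] then
    return "challenge"                   -- missing User-Agent rule
  end
  return "pass"
end
```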
"Contrary to intuition, we demonstrate that it is possible to use a piece of software to improve the performance of a native, statically optimized program binary, while it is executing. Dynamo not only speeds up real application programs, its performance improvement is often quite significant."
This is a common misinterpretation of the Dynamo paper: they compiled their C code at the lowest optimization level and then ran the (suboptimal) machine code through Dynamo. So there was actually something left to optimize.
Think about it this way: a 20% difference isn't unrealistic if you compare -O1 vs. -O3.
But it's completely unrealistic to expect a 20% improvement if you tried the same thing with the machine code generated by a modern C compiler at its highest optimization level.
Claiming that JIT compilers outperform static compilers, solely based on this paper, is an untenable position.
But, yes, JIT compilers can outperform static compilers under specific circumstances. This has more to do with e.g. better profiling feedback or extra specialization opportunities. But this is not what this paper demonstrates.
Many compiler optimizations have strong non-linear costs in terms of the number of control flow edges. A static compiler has to punt at a certain complexity. OTOH a JIT compiler is free to ignore many edges, since it may fall back to an interpreter for cold paths or attach new code anytime later if some edges become hot.
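A minimal Lua sketch of what "ignoring cold edges" means in practice (my own example, not from the thread): the error arm below is a control flow edge a trace compiler can simply leave to the interpreter. Only the hot loop body gets compiled, and the cold edge costs nothing unless it is actually taken.

```lua
-- Hot/cold path example: the trace compiler compiles the loop
-- assuming the type check succeeds, and turns the error arm
-- into a side exit back to the interpreter.
local function sum_numbers(t)
  local s = 0
  for i = 1, #t do
    local v = t[i]
    if type(v) == "number" then
      s = s + v                             -- hot path: compiled
    else
      error("non-number at index " .. i)    -- cold path: side exit
    end
  end
  return s
end

print(sum_numbers({1, 2, 3, 4}))  -- 10
```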
One interesting example is auto-vectorization (SIMDization) where static compilers have to generate code for all possible combinations of vector alignments in case the underlying alignment of the participating vectors is not statically known. This quickly gets very expensive in terms of code space. OTOH a JIT compiler can simply specialize to the observed vector alignment(s) at runtime, which show almost no variation in practice.
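LuaJIT's alignment specialization happens at the machine-code level, but the general idea of specializing to a runtime-observed value rather than compiling every case up front can be sketched in plain Lua (all names here are invented for illustration; `load` accepts a source string in Lua 5.2+ and LuaJIT):

```lua
-- Generate and cache a variant of a function specialized to a
-- constant observed at runtime, instead of emitting code for
-- every possible case ahead of time.
local cache = {}
local function scaler_for(k)
  if not cache[k] then
    cache[k] = assert(load(string.format(
      "return function(x) return x * %d end", k)))()
  end
  return cache[k]
end

local double = scaler_for(2)
print(double(21))  -- 42
```

In practice only a handful of variants ever get generated, just as only a handful of vector alignments are ever observed.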
Thank you for your comment. Very interesting points.
Isn't it wrong to say "compiled their C code at the lowest optimization level" though? As I read the paper, they compiled their C code at -O2. It's hard to work backwards in time to know exactly what that meant when they published their paper, but I suspect it meant something very similar to the current gcc definition:
-O2 Perform nearly all supported optimizations that do not involve a space-speed tradeoff.
Dynamic binary translation (DBT) can sometimes yield performance improvements: you still statically compile your program, but at runtime the DBT system decodes and re-encodes the binary instructions. That lets it trace execution, identify hot paths, and lay those basic blocks out contiguously in its code cache.
At a guess: you have a very large number of rules, so code locality among the ones that actually apply is important. Performance in large state machines is very much locality-dependent, and this way you get that locality dynamically rather than statically.
Interesting to see JIT compilation used for low-level and, I assume, high-performance workloads. NetBSD supports in-kernel Lua code; I wonder if dynamic compilation could produce expressive, flexible, and performant execution in drivers too.
PS: I wonder if it's related; maybe CloudFlare uses NetBSD and pushed for kernel embedding, or they just benefited from NetBSD's Lua love.
We run entirely on Linux, not NetBSD, and are not currently using Lua in the kernel at all. We did, however, sponsor some of Mike Pall's work on LuaJIT based on our particular workload: http://luajit.org/sponsors.html#sponsorship_perf
We use Lua for the WAF (as Mike says), but also for all request processing. This is in part because we use Nginx and in part because we employ agentzh (http://agentzh.org/) and he works on OpenResty. Fairly recently we moved all processing of requests through CloudFlare to a Lua-based system inside Nginx.
That's >5B page views per 24 hours going through LuaJIT-ed code.
I'll add a more serious question. This is a nice optimization, but wouldn't clever attackers respond in turn by eventually throwing highly varied traffic patterns to fluster the JIT?
Hundreds of servers in 23 locations worldwide, each server running multiple instances of Nginx: http://blog.cloudflare.com/a-tour-inside-cloudflares-latest-... The key fact is that this type of scale is possible with Nginx + Lua + hundreds (not thousands) of servers.
If that were to happen (and I think it's a big "if" because of the complexity of predicting what will throw it off given that the attacker doesn't know the code we are running) then we'd see the WAF latency increase and alarms would be generated immediately. That, in turn, would cause a bunch of other mechanisms to come into play.
Oh, I get that this is the key fact; I put a smiley in there because I was fishing for a more exact number that is unlikely to be given :)
Thanks for your answer. It's a very neat setup. (My intuition is that it's not a case of if but when, but you've probably won yourself quite a bit of time until that happens.)
Just as a side comment: optimisations to the attack paths are unlikely to make a substantial impact on overall performance, given that attack requests are only a fraction of all requests processed at any given time. (Excluding DoS, of course, but that's not an area where you'd expect a WAF to do much.) In a WAF, you want to heavily optimise the non-attack paths.