As far as I know, all three of those languages are interpreted. They might be good at slinging together predefined operations with fast implementations written in another language (C)… as long as you’re operating in parallel on large arrays, so that you spend more time inside the C functions than actually interpreting. But if you need to write your own operations, or just do anything that isn’t massively parallel, none of those languages even compete. For example, you couldn’t write a performant compiler in them. In contrast, languages like Haskell and OCaml have compilers that generate native code – maybe not C-level native code, but still an order of magnitude faster than an interpreter.
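(A minimal sketch of that amortisation argument — using Python as a stand-in interpreted language, my assumption rather than anything named above — comparing a loop the interpreter executes element by element against a single call into a C-implemented builtin:)

```python
import timeit

data = list(range(200_000))

def interpreted_sum(xs):
    # Every iteration pays interpreter dispatch overhead.
    total = 0
    for x in xs:
        total += x
    return total

# sum() is one call into a C loop: the interpreter is entered once,
# so its overhead is amortised over the whole array.
t_loop = timeit.timeit(lambda: interpreted_sum(data), number=3)
t_builtin = timeit.timeit(lambda: sum(data), number=3)
print(f"python loop: {t_loop:.4f}s, C builtin: {t_builtin:.4f}s")
```

The same shape of argument applies to any interpreter with C-backed bulk operations: as long as the operands are large arrays, the per-call interpretation cost disappears into the C loop's runtime.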
Edit: For that matter, from quickly browsing the source code, it looks like Miranda is interpreted as well. So it’s absurd to say it’s faster than Haskell.
Most of the APLs don't automatically parallelise array operations either, because it's hard to know automatically when it's worth it. This is exacerbated by the fact that most APLs don't have a compiler, so the granularity of independent operations is very fine: each primitive runs as its own loop, with nothing fusing loops across operations.
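(To make the granularity point concrete — numpy here is my stand-in for an uncompiled array implementation, not something mentioned above — each primitive is a separate full-array pass, so any automatic parallelism has to pay its fork/join cost per primitive:)

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = a.copy()
c = a.copy()

# Without a fusing compiler, each primitive is its own pass over memory:
tmp = a + b      # first full-array loop, allocates an intermediate
out = tmp * c    # second full-array loop; nothing fuses the two

# Parallelising either loop alone means one fork/join per primitive,
# which is often not worth it at this granularity.
print(out[:3])
```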
While I don't know much about the mysteries of k's implementation, I know that the most widely used industrial APL implementation, Dyalog, uses a pretty conventional explicit task-parallel API for parallelism. They call it "isolates", and it's essentially about launching a thread that has its own internal APL state (with a lot of polish for convenience and communication, of course). There may be a few primitive operations that are automatically parallel internally, but they are rare.
APL is cheating since all the magic happens in highly pipelined SIMD-heavy loops :p
But if it's really true that, other than array languages, no modern FP language can outperform Miranda, then that's really depressing.
I was under the impression that GHC attempted to generate decently good code; maybe it's all the little allocations that slow Haskell down, or something like that.
APL isn't cheating because of that, it's cheating because it's small enough to fit in your CPU cache!
I'm sure something else can beat Miranda, I'm just unsure of what. I don't really care for the FP paradigm outside of arrays and Lisp, though, so I'll be the first to admit that I haven't spent days looking; just a few hours here and there.
Oh, actually: Stalin probably does. It's an R4RS compiler. Good luck getting it to compile, though. It took me a few hours and a lot of code changes to get it to compile half a decade ago (I wanted to compare it with the one I was writing, which was a lot worse, naturally, but much easier to compile). The compiler itself is slow as a tortoise, but it generates really wonderful code. It's probably rotted a bit now, though.
The claim that k runs quickly because its functions fit in the CPU cache is, as far as I can tell, an off-hand comment that Arthur Whitney made once which has been repeated far more than it deserves. It's false—instruction cache behavior doesn't contribute significantly to k's advantage over other languages—for a few reasons: inner loops in array languages are 3–5 orders of magnitude smaller than the CPU cache, the loops that compiled languages produce also fit in cache, and instruction caching doesn't matter all that much for performance anyway. Despite spending plenty of time looking for it, I've never been able to measure an impact of code size on performance. Even data caching doesn't have that much of an effect: current versions of Dyalog APL almost always allocate new arrays from uncached memory (this is mostly fixed in the next version), and it's still one of the fastest array languages around. Unless you're using SIMD, code with linear access patterns can't even keep up with main memory, and the cache has no effect.
Why is using SIMD cheating? SIMD loops are the only way to get the full performance out of a CPU, and fast compilers do try to produce them. If it turns out that array-based interpreters are a better way to convert programmer intentions to SIMD loops than scalar compilers (and it certainly seems that way) then the array languages are legitimately faster. I suppose SIMD is considered "non-portable" because you can't use it from C, but that's an artificial restriction coming from historical programming language design decisions. The most important vector instructions are the same in any modern vector ISA. How is using standard CPU features cheating? They're not even that much newer than double-precision float support.
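(A small sketch of that point, with numpy standing in for an array interpreter — my example, not the commenter's. The array expression states the whole-array intent directly; the scalar loop below it is the form a compiler would first have to prove safe to auto-vectorise:)

```python
import numpy as np

def scalar_axpy(a, x, y):
    # The scalar view: one element at a time. A compiler must prove
    # this loop is safe to vectorise before it can emit SIMD.
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

x = np.linspace(0.0, 1.0, 1_000)
y = np.ones(1_000)

# The array view: one whole-array operation per primitive, each
# dispatched to a vectorised C kernel — no analysis needed, because
# the programmer already said "do this to the whole array".
z = 2.0 * x + y
print(z[-1])
```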
My comment was that taking advantage of SIMD wasn't cheating, but "cheating" was a joke in both cases.
Though I disagree with your comment on the cache: it's very blatantly better for interpreters. Whitney isn't the only one who believes that; Moore does too, and Moore is more competent than just about anybody. If you look at the performance of k and Moore-written Forths compared to Dyalog APL, it does seem like they have a point.
If you're talking about parsing speed, Dyalog is slow because it has a much more complicated grammar, and because it stores the execution stack in the workspace in order to make stack overflows impossible. If you're claiming k or Forth is faster for large array processing, I'd like to see some benchmarks. Do you have a citation for Moore on the instruction cache?
> Oh, actually: Stalin probably does. It's an R4RS compiler. Good luck getting it to compile, though. It took me a few hours and a lot of code changes to get it to compile half a decade ago (I wanted to compare it with the one I was writing, which was a lot worse, naturally, but much easier to compile). The compiler itself is slow as a tortoise, but it generates really wonderful code. It's probably rotted a bit now, though.
Stalin isn't really a good compiler in the realm of high-performance computing. What did and does make Stalin awesome is that it showed how you could compile away most of the overhead of using a very high-level language (Scheme) and end up with code that matched reasonably written C. That does not mean it's competitive with the code generated by a heavily vectorising Fortran compiler for number crunching. Stalin is more about removing language overheads than about pushing the hardware to the hilt.
If you want a compiler built around the same rough philosophy as Stalin, then there's MLton, which is also still maintained: http://mlton.org/