
The argument is not that Fortran is faster because it has multidimensional arrays; it is that, because Fortran has multidimensional arrays, it can and often will be faster for free.

Obviously an extremely optimised library such as BLAS will run just about as fast in C as in Fortran. For the majority of code out there, however, people just don't want to put in the effort to carefully optimise array layouts. The feature moves the responsibility for optimisation to the compiler, which can in many cases do a better job than a typical programmer, with reasonable assurance that the optimisations will not introduce bugs or side effects into the code.



Languages like C(++), Rust, and Julia have vector intrinsics that are unfortunately lacking in Fortran. Is there a way to wrap [SLEEF](https://github.com/shibatch/sleef) in Fortran?

While writing fast code in Fortran is easy, I think it is unfortunately harder to write "extremely optimised" libraries like BLAS in Fortran than in these other languages for that reason. Without the low level control, you can't be as explicit about vectorization patterns.

E.g., for BLAS, you normally write a lot of small kernels, which you then apply to blocks of the matrices. The blocks will not be contiguous; each column is separated by the matrix stride (or each row by a stride, if row major).

gfortran 8.2 will not successfully vectorize a matmul kernel (I did file an issue on bugzilla). C (with vector intrinsics) does great. Julia does well too, but not perfectly: there are a couple of redundant move instructions per loop iteration.

When it comes to applying the kernel to blocks, Fortran will also try to create array temporaries of each block, which would cripple performance. In languages with pointers, all you'd have to do is pass a pointer and the stride between columns. I haven't tried playing with pointers in Fortran, but I guess that would work too? Julia often refuses to autovectorize anything as soon as you start playing with pointers; does Fortran's stance on aliasing mean it fares better? If not, Julia at least has the LLVM vector intrinsics and metaprogramming that let you force it to generate nearly optimal code despite its protests.

But to say something nice about Fortran: dynamically sized, mutable, stack-allocated arrays (-fstack-arrays).

In Julia, stack-allocated objects are immutable, and (partly for that reason) also generally very small. You wouldn't want to create a new 30x30 array every time you want to edit a single value. You also can't get stack pointers, meaning you can't use masked load/store operations to vectorize code when the array dimensions aren't a multiple of the SIMD vector width.

This means to write fast code, I normally heap allocate everything. And to avoid triggering the GC, that means keeping the same arrays alive, and avoiding any temporaries.

With Fortran, you don't have to worry about any of that. You can always write convenient and clear code, and it's likely to be very fast. If you hold vectorization in mind while laying out your data and computations, compilers will normally do a great job figuring it out.


I think it's worth pointing out that someone who knows way more about Julia internals than I do (yuyichao) said on the [Julia discourse](https://discourse.julialang.org/t/fortran-vs-julia-stack-all...) that this comment is wrong on several points.

Highlights include that "Stack allocation is a property of the object, mutability is a property of the type", and that mutable objects that do not escape a function are likely to avoid heap allocation (i.e., try to avoid calling non-inlined functions).

Unfortunately, that can sometimes mean having to roll your own versions of functions like `sum` and `dot_product`. A simple for loop is easy, but the fact that some of these basic functions don't inline by default can make things a little more cumbersome.


You need to look at commercial Fortran compilers when talking about performance, e.g. PGI, Intel, XL, not gfortran.


Big disclaimer: my experiences are limited, and compiler comparisons are going to be highly variable, depending on the particular code bases.

That said, I have a statistics model that has to be fit thousands (millions?) of times. As a baseline, running it using JAGS (a popular tool for Gibbs sampling) for a given number of iterations and chains in parallel took >160 seconds.

The same model took 520 ms in Julia, 480 ms with g++ and gfortran, and 495 ms with ifort.

The code I compiled with g++ used vector intrinsics and SLEEF. It is not exactly a revelation that code like that will be fast. But part of my point is that C++ easily lets you take measures like that when you want more performance or control, and is therefore more optimizable. Switching to a commercial compiler, by contrast, wasn't an automatic performance boon.

All else equal, I'd much prefer sticking with an open source compiler suite. I also won't be a student with access to the Intel compilers for much longer.

The actual model is addressing a fun problem (improving the accuracy of satellite-space debris collision probability estimates), and I'm organizing some of it into a "Gibbs Sampling" presentation for tomorrow, so I could put the material online.



