I’ve used CUDA and Julia extensively in my work for radio astronomy imaging applications.
I can say it is a delight to work with. All the usual GPU tips and tricks still apply, of course, and you need to pay careful attention to sequential memory accesses and so on (as with all GPU programming). But staying in one high-level language is a real boon, and having access to native types and methods directly in my kernels is fantastic. I can’t speak highly enough of it.
And for a performance comparison: I see a 3 to 4 order-of-magnitude improvement in speed, about as fast as native CUDA.
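To give a flavor of what "native types and methods directly in my kernels" means, here's a minimal CUDA.jl sketch (illustrative only, not from my actual imaging code):

    using CUDA

    # Each thread scales one element; Julia's native ComplexF32 works in
    # the kernel body with no FFI glue or special handling.
    function scale_kernel!(out, src, factor)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(out)
            @inbounds out[i] = src[i] * factor
        end
        return nothing
    end

    src = CuArray(rand(ComplexF32, 1_000_000))
    out = similar(src)
    @cuda threads=256 blocks=cld(length(src), 256) scale_kernel!(out, src, 2.0f0)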
Can you please recommend an open source codebase that uses Julia + CUDA and that can be used to learn that combination? I am considering starting a CUDA-related project and Julia is a serious contender, but I am scared to hit too many rough edges.
I can't recommend a particular project that implements something in CUDA, but I'd check out the StructArrays.jl [0] project.
One of Julia's strengths is its macro and type system. StructArrays.jl uses them to turn an AoS (array of structs) into an SoA (struct of arrays). This is a killer feature that generally requires some form of code generation in C/C++.
Even if you're just doing something on the CPU, it should set you up to be both SIMD- and GPU-friendly. They have a guide on how to swap out the underlying array storage from CPU to GPU memory.
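A minimal sketch of the idea (the Point type and the sizes are just placeholders; the GPU step follows the StructArrays.jl docs and needs CUDA.jl and an NVIDIA card):

    using StructArrays

    struct Point
        x::Float64
        y::Float64
    end

    aos = [Point(rand(), rand()) for _ in 1:1000]  # array of structs
    soa = StructArray(aos)   # struct of arrays; still indexes like the AoS
    soa[1]                   # Point reconstructed on the fly
    soa.x                    # contiguous Vector{Float64} of every x field

    # Swapping the column storage for device memory, per their GPU guide:
    # using CUDA
    # gpu_soa = StructArrays.replace_storage(CuArray, soa)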
FWIW, CUDA is a "Tier 1" supported architecture [1], where "Tier 1" is defined as:
> Tier 1: Julia is guaranteed to build from source and pass all tests on these platforms when built with the default options. Official binaries are always available and CI is run on every commit to ensure support is actively maintained.
A GPU isn't strictly needed, but image processing algorithms are pretty much what GPUs are designed for: you have a lot of data to process, and doing it quickly is always nice.
I'd be interested to see what each looks like if the allocations were done outside the benchmark. For that matter, it'd be interesting to see whether, once the allocations are factored out, the same function could be used for both CUDA and CPU. From there, I'd be curious whether the compiler is able to vectorize it automatically, or whether it'd benefit from a @simd.
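Rough sketch of what I mean, with illustrative names, assuming BenchmarkTools.jl for the timing:

    using BenchmarkTools

    f(x) = 2x^2 + 1                  # scalar kernel; array-type agnostic

    a   = rand(Float32, 10^7)        # allocated once, outside the benchmark
    out = similar(a)

    @btime broadcast!(f, $out, $a)   # only the computation is timed

    # The same f broadcasts over CuArrays unchanged. For the CPU path, an
    # explicit loop makes it easy to test @simd against auto-vectorization:
    function f_loop!(out, a)
        @inbounds @simd for i in eachindex(out, a)
            out[i] = f(a[i])
        end
        return out
    end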
It's also great to see how well CUDA is supported in Julia. I've started to pick up Julia lately, and find it incredibly pleasant to work with. It feels like a lovely mix of Haskell, Lisp, and Python, with a really nice REPL.
> Note that I benchmarked with --check-bounds=no, which is a startup option that you pass to Julia, when launching, that disables the performance killer “bounds checking”.
If you enable bounds checking, then on every index into the array, a check runs that asks "is the index the user specified within the bounds of the array?". On its own, that's not a problem, but when you're doing this repeatedly, millions of times, it's a performance killer.
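Illustrative sketch: @inbounds gives you the same effect locally, per loop, without the global --check-bounds=no flag:

    function sum_checked(a)
        s = zero(eltype(a))
        for i in 1:length(a)
            s += a[i]       # each access first asks "is i within bounds?"
        end
        return s
    end

    function sum_unchecked(a)
        s = zero(eltype(a))
        @inbounds for i in 1:length(a)
            s += a[i]       # check elided, which also lets the loop vectorize
        end
        return s
    end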