Partly related I believe so perhaps someone can help. Whole theses have been wri...

jhj · on Aug 10, 2023

Not speaking to their implementation, but prefix sums/scans are simply a very useful primitive tool for parallelizing many otherwise sequential operations. For instance, appending a variable number of items per worker to a shared coalesced buffer uses an exclusive prefix sum. This is probably the most common use case for them in practical programming. They can also be used to partition work across parallel workers (segmented prefix scans).

In lieu of pointer chasing, hashing and the like, parallel operations on flat arrays are the way to maximize GPU utilization.

shwestrick · on Aug 10, 2023

Tons and tons of parallel algorithms use prefix sums. Typically the most common use is to compute a collection of offsets in parallel. Some examples:

- compact a hash table (i.e., remove the empty slots)

- flatten a jagged 2D array

- rewrite a dense matrix in compressed-sparse-row (CSR) format

mschuetz · on Aug 10, 2023

It's used in one of the fastest sorting approaches - counting sort / binning - to compute the location of where to store the sorted/binned items. First you count the number of items per bin, then you use prefix-sums to compute the memory location of each bin, then you insert the items into the respective bins. Some radix-sort implementations also utilize counting sort under the hood, and therefore prefix-sums. (Not sure if all radix-sort implementations need it)

gpuhacker · on Aug 10, 2023

It's incredibly useful if you have many threads that produce a variable number of outputs. Imagine you're implementing some filtering operation on the GPU, many threads will take on a fixed workload and then produce some number of outcomes. Unless we take some precautions, we have a huge synchronization problem when all threads try to append their results to the output. Note that GPUs didn't have atomics for the first couple of generations that supported CUDA, so you couldn't just getAndIncrement an index and append to an array. We could store those outputs in a dense structure, allocating a fixed number of output slots per thread, but that would leave many blanks in between the results. Now once we know the number of outputs per thread we can use a prefix sum to let every thread know where they can write their results in the array.

The outcome of a prefix sum exactly corresponds with the "row starts" part of the CSR sparse matrix notation. So they are also essential when creating sparse matrices.