>>Is Parallel Programming Hard, and, If So, What Can You Do About It?
I confess I didn't read TFA, but the title made me think of something that hopefully some people here are well placed to take further. In a programming language I use there is an "each" function: in pseudo-Java, given a function `f(int i)` and a `List<Integer> listOfNumbers`, `result = f each listOfNumbers` runs `f` over each integer. To make this multi-threaded you simply say `f peach listOfNumbers`. Today in Java you can do something similar using streams: `listOfNumbers.parallelStream().forEach(...)`. But for anyone here creating a new language, I would suggest parallelizing the language's primitive forEach loop by default. Given the number of CPUs available, and that the machine is better at selecting the optimal threading distribution, I think that would be the better choice. This seems like a small, simple change, but the hard part about parallel programming is often getting people to use it at all. By making it totally invisible, it's "easy".
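For reference, a minimal, self-contained sketch of the Java streams version mentioned above (the class and method names here are my own, not from any library):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelEach {
    // "f peach listOfNumbers" in Java streams: apply f to each element
    // on the common fork-join pool. map() on an ordered stream preserves
    // encounter order in the collected result even when run in parallel.
    static List<Integer> squares(List<Integer> listOfNumbers) {
        return listOfNumbers.parallelStream()
                            .map(i -> i * i)
                            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> nums = IntStream.rangeClosed(1, 5)
                                      .boxed()
                                      .collect(Collectors.toList());
        System.out.println(squares(nums)); // [1, 4, 9, 16, 25]
    }
}
```

Note that the "invisible" part is real here: swapping `parallelStream()` back to `stream()` gives the sequential version with no other changes.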
I don't know of any programming language that does that as a language-first primitive. But this is basically slapping a `#pragma omp parallel for` before your C for loop. In Rust there is the crate `rayon`, which takes advantage of the language-level trait system to "augment" any existing iterable with a `par_iter()` method that effectively does what you describe: with the trait in scope, you can replace `bar.iter().for_each(...)` with `bar.par_iter().for_each(...)` to parallelize your iterations.
I suspect any language wanting to offer this as the default behavior for `for` loops will end up being a language like Java, Erlang, or Smalltalk: the language spec includes a "virtual" runtime that gives it the flexibility to change execution behavior, and the capacity to create and run threads/processes, which otherwise requires an OS (as in, you start leaving the perimeter of a programming language).
It's only easy if the items being iterated over are entirely self-contained, and the code in the loop body does not need (write-)access to any other state that might be shared between threads. Rust has such limitations built into the language, but this is also one of the points that makes Rust "difficult" (because you actually need to think first about how you organize your data to be "multi-threading-compatible"). That's the hard part, not adding a parallel for-each to a language.
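To make the shared-state hazard concrete, here is a hedged Java sketch (class and method names are my own): a parallel loop that bumps an unsynchronized counter races and can silently lose updates, while an `AtomicInteger` makes the same loop safe.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class SharedStateDemo {
    // Data race: parallel threads read-modify-write the same slot with
    // no synchronization, so increments can be lost. The result is
    // unpredictable, but lost updates mean it can never exceed n.
    static int racyCount(int n) {
        int[] count = {0}; // array holder: lambdas require effectively-final captures
        IntStream.range(0, n).parallel().forEach(i -> count[0]++);
        return count[0];
    }

    // Thread-safe: AtomicInteger serializes the increments.
    static int atomicCount(int n) {
        AtomicInteger count = new AtomicInteger();
        IntStream.range(0, n).parallel().forEach(i -> count.incrementAndGet());
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println("racy:   " + racyCount(1_000_000));   // often below 1000000
        System.out.println("atomic: " + atomicCount(1_000_000)); // always 1000000
    }
}
```

The sequential version of `racyCount` is perfectly correct, which is exactly why parallelizing for-each by default is riskier than it looks: the compiler can't tell this loop body apart from a self-contained one in most languages.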
A built-in parallel foreach can't guarantee better performance because the hardware support isn't there: the parallel parts get scheduled onto ordinary threads by the kernel, and that is too unpredictable.
But if each OS-scheduled CPU were really a cluster of, for instance, 1 fast + 64 slow cores, then it could have special instructions to split a low-level loop across the slow cores and to facilitate blocking and rejoining back into a single stream. A 'parallel branch' instruction could take an index, a limit, and an estimated instruction count, and the CPU itself could decide whether to run the loop as a single fast stream or in parallel on the slow cores.
There are some technical challenges, like the size of the register file (so such a loop would probably have to be uninterruptible), but this may be where we're going, since there are diminishing returns from making a single CPU faster, and there's lots of opportunity for parallel processing that GPUs are too clumsy for.
That's pretty much the usual GPU+CPU workflow... only you would have fewer bandwidth and latency issues.
Still, deciding where to run such things is not trivial: the time it takes to decide which option would be faster could exceed the time it takes to just run the loop.
So instead we would have a ridiculous number of problems with caching and scheduling. I bet most of the time the sequential code would be "faster", due to the parallel overheads, the cache misses, and a scheduler that obviously can't fit every loop scenario.
And this is all assuming the code in the loop is thread-safe; otherwise you will summon every concurrency bug on the planet.