>>Is Parallel Programming Hard, and, If So, What Can You Do About It?
I confess I didn't read TFA, but the title made me think of something that hopefully some people here are well placed to take further. In a programming language I use there is an "each" function: in pseudo-Java, given a function `f(int i)` and a `List<Integer> listOfNumbers`, `result = f each listOfNumbers` runs `f` over each integer. To make this multi-threaded you simply say `f peach listOfNumbers`. Today in Java you can do something similar using streams: `listOfNumbers.parallelStream().forEach(...)`. But for anyone here creating a new language, I would suggest parallelizing the language's primitive forEach loop by default. Given the number of CPUs available, and that the machine is better at selecting the optimal threading distribution, I think that would be the better choice. This seems like a small, simple change, but the hard part about parallel programming is often getting people to use it at all. By making it totally invisible, it's "easy".
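For reference, a minimal, self-contained sketch of the Java streams version mentioned above (the class and method names here are my own, not from any library):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelEach {
    // "f peach listOfNumbers" in Java streams: apply f to each element
    // on the common fork-join pool. map() on an ordered stream preserves
    // encounter order in the collected result even when run in parallel.
    static List<Integer> squares(List<Integer> listOfNumbers) {
        return listOfNumbers.parallelStream()
                            .map(i -> i * i)
                            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> nums = IntStream.rangeClosed(1, 5)
                                      .boxed()
                                      .collect(Collectors.toList());
        System.out.println(squares(nums)); // [1, 4, 9, 16, 25]
    }
}
```

Note that the "invisible" part is real here: swapping `parallelStream()` back to `stream()` gives the sequential version with no other changes.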
I don't know of any programming language that does that as a language-first primitive. But this is basically slapping a `#pragma omp parallel for` before your C for loop. In Rust there is the crate `rayon`, which takes advantage of the language-level trait system to "augment" any existing iterable with a `par_iter()` method that effectively does what you describe: with the trait in scope, you can replace `bar.iter().for_each(...)` with `bar.par_iter().for_each(...)` to parallelize your iterations.
I suspect any language wanting to offer this as the default behavior for `for` loops will end up being a language like Java, Erlang, or Smalltalk: the language spec includes a "virtual" runtime that gives it the flexibility to change execution behavior, and the capacity to create and run threads/processes, which otherwise requires an OS (as in, you start leaving the perimeter of a programming language).
It's only easy if the items being iterated over are entirely self-contained, and the code in the loop body does not need (write-)access to any other state that might be shared between threads. Rust has such limitations built into the language, but this is also one of the points that makes Rust "difficult" (because you actually need to think first about how you organize your data to be "multi-threading-compatible"). That's the hard part, not adding a parallel for-each to a language.
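To make the shared-state hazard concrete, here is a hedged Java sketch (class and method names are my own): a parallel loop that bumps an unsynchronized counter races and can silently lose updates, while an `AtomicInteger` makes the same loop safe.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class SharedStateDemo {
    // Data race: parallel threads read-modify-write the same slot with
    // no synchronization, so increments can be lost. The result is
    // unpredictable, but lost updates mean it can never exceed n.
    static int racyCount(int n) {
        int[] count = {0}; // array holder: lambdas require effectively-final captures
        IntStream.range(0, n).parallel().forEach(i -> count[0]++);
        return count[0];
    }

    // Thread-safe: AtomicInteger serializes the increments.
    static int atomicCount(int n) {
        AtomicInteger count = new AtomicInteger();
        IntStream.range(0, n).parallel().forEach(i -> count.incrementAndGet());
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println("racy:   " + racyCount(1_000_000));   // often below 1000000
        System.out.println("atomic: " + atomicCount(1_000_000)); // always 1000000
    }
}
```

The sequential version of `racyCount` is perfectly correct, which is exactly why parallelizing for-each by default is riskier than it looks: the compiler can't tell this loop body apart from a self-contained one in most languages.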
A built-in parallel foreach can't guarantee better performance because the hardware support isn't there: the parallel parts get scheduled onto ordinary threads by the kernel, and that is too unpredictable.
But if each OS-scheduled CPU were really a cluster of, for instance, 1 fast + 64 slow cores, then it could have special instructions to split a low-level loop across the slow cores and to facilitate blocking and rejoining back into a single stream. A 'parallel branch' instruction could take an index, a limit, and an estimated instruction count, and the CPU itself could decide whether to run the loop as a single fast stream or in parallel on the slow cores.
There are some technical challenges, like the size of the register file (so such a loop would probably have to be uninterruptible), but this may be where we're going, since there are diminishing returns from making a single CPU faster, and there's lots of opportunity for parallel processing that GPUs are too clumsy for.
That's pretty much the usual GPU+CPU workflow... only you would have fewer bandwidth and latency issues.
Still, deciding where to run such things is not trivial: the time it takes to decide which option would be faster could exceed the time it takes to just run the loop.
So instead we would have a ridiculous number of problems with caching and scheduling. I bet most of the time the sequential code would be "faster", due to the parallel overheads, the cache misses, and a scheduler that obviously can't fit every loop scenario.
And this is all assuming the code in the loop is thread-safe; otherwise you will summon every concurrency bug on the planet.