That helps, but it still means the data can only live on one thread at a time. It would be a massive help if you could split a buffer into non-overlapping views and transfer each view to a separate worker. Some algorithms, like parallel prefix sums, would still be challenging or impossible to implement this way, but it would greatly widen the range of things you could do in parallel.
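
For comparison, the nearest thing I know of today is to copy each range into its own `ArrayBuffer` and transfer the copies, since a transfer moves a whole buffer rather than a single view over it. A rough TypeScript sketch, where `splitIntoChunks` is a made-up helper name and not any standard API, and where the copy it performs is exactly the cost that splittable views would avoid:

```typescript
// Hypothetical helper (not a standard API): split `source` into `parts`
// independent ArrayBuffers. ArrayBuffer.prototype.slice copies the bytes,
// so each chunk owns its memory and can be transferred on its own.
function splitIntoChunks(source: ArrayBuffer, parts: number): ArrayBuffer[] {
  if (!Number.isInteger(parts) || parts < 1) {
    throw new RangeError("parts must be a positive integer");
  }
  const chunkSize = Math.ceil(source.byteLength / parts);
  const chunks: ArrayBuffer[] = [];
  for (let i = 0; i < parts; i++) {
    // Clamp so trailing chunks may be shorter (or empty) but never overlap.
    const start = Math.min(i * chunkSize, source.byteLength);
    const end = Math.min(start + chunkSize, source.byteLength);
    chunks.push(source.slice(start, end));
  }
  return chunks;
}

// Ten bytes holding the values 0..9, split three ways.
const source = new ArrayBuffer(10);
new Uint8Array(source).set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]);
const chunks = splitIntoChunks(source, 3);

// In a page with one worker per chunk (assumed setup, not shown here),
// each copy could then be handed off without a second copy:
//   workers[i].postMessage(chunks[i], [chunks[i]]);
```

Each worker ends up owning its chunk outright, but you pay one full copy up front and another to stitch the results back together, which is why transferable non-overlapping views would be so much nicer.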