Nice article. I got too far into it before realizing I was essentially tricked into reading an academic paper :) There seems to be a tradeoff between clear presentation of an idea and the formalization required to usefully build off of it/publish it. Abstractions which do both (e.g. Turing machines) get more reuse, I think.
The sublogarithmic result is constructive, though, and my understanding of it is like this: We can split up a decision problem (take in a string, give a yes/no answer) over many workers by splitting the input string into sqrt(n) chunks of size sqrt(n) each and using a shared working space. The first worker handles the first chunk with a blank working space, the second worker handles the second chunk with the finished working space from the first one, and so on, until the last worker handles the last chunk with the second-to-last worker's working space and gives a yes/no answer. This can be parallelized into a single MapReduce round by instead having every worker process its chunk under every possible configuration of the shared working tape. That sounds like a lot of work, but if the shared working space is sublogarithmic in size, calculating such a table fits within polynomial time and sqrt(n) space per worker. This means one final "reduce" step can take the working space from the first input chunk, pass it through the second chunk's table in constant time, and so on until it can make a decision.
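A toy sketch of that scheme, where the shared workspace is just a single parity bit (a real sublogarithmic-space machine's configuration would also include its state and head positions; the `step` function here is entirely illustrative):

```python
import math

def step(config, symbol):
    # Toy transition: the workspace is one bit tracking the parity
    # of 'a's seen so far. Stand-in for a real machine's transition.
    return config ^ (1 if symbol == 'a' else 0)

def map_phase(chunk, configs):
    # Each worker builds a table: starting workspace config -> ending
    # config, for EVERY possible configuration of the shared workspace.
    table = {}
    for start in configs:
        cur = start
        for sym in chunk:
            cur = step(cur, sym)
        table[start] = cur
    return table

def reduce_phase(tables, initial_config):
    # Chain the tables left to right: one constant-time lookup per chunk.
    cur = initial_config
    for t in tables:
        cur = t[cur]
    return cur

s = "abaabba" * 3
k = math.isqrt(len(s)) or 1                       # ~sqrt(n) chunk size
chunks = [s[i:i + k] for i in range(0, len(s), k)]
configs = [0, 1]                                  # all 2^b workspace configs (b = 1)
tables = [map_phase(ch, configs) for ch in chunks]  # fully parallelizable map
answer = reduce_phase(tables, 0)                  # 1 iff the count of 'a's is odd
```

The map phase over chunks is the embarrassingly parallel part; the final chaining is sequential but does only constant work per chunk.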
I can't think of any practical sublogarithmic space problems off the top of my head but evidently there are some good ones in graph theory. If you tried to implement the above in practice, you could imagine that the final reduce step could be intelligent about combining the processing stages so workers don't need to calculate every single configuration.
What is unclear to me is whether all Turing machines deciding languages in sublogarithmic space can be chunked this way. The usual definition of an input read head is that you can move left or right whenever you want, but having access to the entire input means the workers can't be cleanly chunked. You could require the input to be one-way, like a DFA/PDA, but is there a result that these are equivalent with sublogarithmic scratch space?
So they are talking about how many MapReduce rounds it takes to express certain graph algorithms. (A single MapReduce round is of course O(n log n) with respect to the number of records.)
I wonder if this context is practical/realistic. One paper I thought was interesting is GraphChi, a graph processing framework that runs on a single machine.
Graphs can be encoded very efficiently; each edge is just a pair of integer node IDs, plus maybe a weight or some other small payload. A graph with, say, 7 billion nodes (one for everyone in the world) will often fit in memory on a single machine these days, depending on the number of edges. GraphChi uses disk, so you have even more space, if you are willing to constrain the model (MapReduce is of course already heavily constrained, especially for graphs).
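A quick back-of-envelope check of that claim, assuming 64-bit IDs and an illustrative average degree of ~30 (both numbers are made up for sizing, not from any real dataset):

```python
# Rough sizing of an edge list for a world-scale graph.
nodes = 7_000_000_000
avg_degree = 30                    # assumed, purely illustrative
bytes_per_edge = 2 * 8             # a pair of 64-bit node IDs
edges = nodes * avg_degree // 2    # undirected: each edge counted once
edge_bytes = edges * bytes_per_edge
print(f"{edge_bytes / 2**40:.1f} TiB for the raw edge list")
```

That lands in the low terabytes, which is big-but-feasible RAM on a large single box, and comfortable on disk; 32-bit IDs or delta encoding would roughly halve it.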
The raw source data is probably larger than a single machine, but you could just run a MapReduce to copy it out into a bunch of (ID, ID) pairs, do your hairy graph algorithms on a single machine, then export the result back to your data store with a MapReduce join.
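The shape of that pipeline, sketched in plain Python (the record fields and the choice of connected components as the "hairy" local algorithm are hypothetical stand-ins; in reality the extract and the final join would each be a MapReduce job):

```python
from collections import defaultdict

# Hypothetical raw records; the "map" step just emits (ID, ID) pairs.
raw_records = [
    {"user": 1, "follows": 2, "since": "2014-01-01"},
    {"user": 2, "follows": 3, "since": "2014-02-01"},
    {"user": 4, "follows": 5, "since": "2014-03-01"},
]
edges = [(r["user"], r["follows"]) for r in raw_records]  # MapReduce extract

# Local in-memory graph algorithm: connected components via union-find.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in edges:
    union(a, b)

# Result keyed by node ID, ready to join back with a final MapReduce.
result = {node: find(node) for node in parent}
components = defaultdict(set)
for node, root in result.items():
    components[root].add(node)
```

The point is that only the extract and the join need the cluster; the graph algorithm itself runs on one machine against the compact edge list.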
You could use either something like GraphChi or just an arbitrary C/Java program working on a big graph in memory.
My hunch is that this is a lot more tractable than sequential MapReduce... I'd be interested in hearing if people have big graph problems where you couldn't use such an approach.
MR is a pretty inflexible paradigm, and doesn't really suit graph problems well at all. Most people I know prefer to use Spark (or, in the case of graph problems, Giraph). Of course this is for huge graphs that won't fit on a machine (petabytes). If it fits on a machine, it's far easier not to muck around with distributed computing.
You may be interested in GraphX, which describes itself as a hybrid between the "data-parallel" approach of MapReduce and core Spark, and the "graph-parallel" approach of Giraph, Graphlab, etc. See http://spark.apache.org/graphx/. It allows you to do graph processing in Spark, and I quite like its approach.
The scalability of Spark is certainly not as established as that of Hadoop. It's entirely anecdotal as I'm not in a position to share verifiable details, but I know of several installations that run Spark on several-hundred-node clusters. They're certainly well into terabytes, if not in the neighborhood of petabytes.