Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics [pdf] (vldb.org)
70 points by mrry on Oct 6, 2015 | 11 comments


Is MapReduce faster for any cases? A quick glance at the paper seems to suggest that Spark is always at least as good or better than traditional MR.

Should I ever use traditional MR over Spark?


You didn't even try to make it through the abstract...

> An exception to this is the Sort workload, for which MapReduce is 2x faster than Spark. We show that MapReduce’s execution model is more efficient for shuffling data than Spark, thus making Sort run faster on MapReduce.


Didn't Spark win the world record on sorting recently?

http://spark.apache.org/news/spark-wins-daytona-gray-sort-10...


Note that you can change Spark's aggregation algorithm to sort-based with a configuration.
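For reference, in Spark 1.x the shuffle/aggregation strategy was selectable via the `spark.shuffle.manager` config (the hash-based manager was the old default; sort-based later became the default and, in 2.x, the only option). A sketch of flipping it at submit time (job name is just a placeholder):

```shell
# Spark 1.x: switch from the hash-based shuffle to the sort-based one
spark-submit \
  --conf spark.shuffle.manager=sort \
  my_job.py
```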


A lot of the differences between the systems arise from the implementation choice of how to do aggregation in Hadoop 2.4.0 and Spark 1.3. There's nothing inherent in the RDD model, for example, that says the aggregation has to be done eagerly at the mapper; nor in the MapReduce model that says it has to be done at the reducer. Either system could support the other aggregation mechanism, and the only challenge would be in choosing which one to use.
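To make the eager-vs-lazy distinction concrete, here's a toy single-process sketch (plain Python, no Spark or Hadoop involved; all names are mine) of the two aggregation placements. Both produce identical totals; they differ only in how much data would cross the shuffle:

```python
from collections import defaultdict

def reduce_partials(partials):
    """Reducer-side merge of per-partition partial sums."""
    totals = defaultdict(int)
    for part in partials:
        for key, value in part.items():
            totals[key] += value
    return dict(totals)

def map_side_aggregate(partitions):
    """Eager: each 'mapper' pre-combines its own records, so at most
    one partial sum per key per partition is shuffled."""
    partials = []
    for records in partitions:
        combined = defaultdict(int)
        for key, value in records:
            combined[key] += value
        partials.append(combined)
    return reduce_partials(partials)

def reduce_side_aggregate(partitions):
    """Lazy: mappers forward raw (key, value) pairs unchanged and the
    reducer does all the combining after the shuffle."""
    shuffled = [rec for records in partitions for rec in records]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals)
```

With skewed keys the eager version shuffles far fewer records, which is exactly the trade-off the implementation choice is about.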

Some former colleagues wrote a nice paper about the performance trade-offs for different styles of distributed aggregation in DryadLINQ (a MapReduce-style system), and evaluated it at scale:

http://sigops.org/sosp/sosp09/papers/yu-sosp09.pdf


> Either system could support the other aggregation mechanism, and the only challenge would be in choosing which one to use.

Hive implements something similar to the paper mentioned: partial aggregation on the mappers, with the reducer doing a sorted final aggregation.

You'll find Hive beating MapReduce[1], even though it is implemented using MR.

[1] - https://www.cl.cam.ac.uk/research/srg/netos/musketeer/eurosy...
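Roughly that scheme as a toy Python sketch (function names are mine, not Hive's): each mapper emits its partial sums in key order, and the reducer sort-merges the streams, folding each key's run while holding only one key's state in memory at a time:

```python
from collections import defaultdict
from heapq import merge
from itertools import groupby
from operator import itemgetter

def mapper_partial(records):
    """Map-side partial aggregation, emitted sorted by key."""
    acc = defaultdict(int)
    for key, value in records:
        acc[key] += value
    return sorted(acc.items())

def sorted_final_aggregate(partial_streams):
    """Reducer: sort-merge the pre-sorted partial streams and fold
    each key's contiguous run into a final sum."""
    merged = merge(*partial_streams, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(merged, key=itemgetter(0))}
```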


4 node cluster??


This is a very valid concern. There are a lot of ways to make algorithms and distributed systems scale poorly for significant numbers of computation nodes (it took Hadoop half a decade to figure out how to improve this via YARN).


I guess they did not analyze joining two large datasets (I am assuming because M/R would win hands down). Someone more experienced please tell me if I am "thinking" correctly.


What does it mean to "join two large datasets"? I can think of many meanings. What part of the method description in the abstract did you consider yourself too inexperienced to understand?


I don't think so. If you're just doing a straight batch job with no sorting, iterative processing, etc., then they'll probably run at very similar speeds.

If you have to sort one of the datasets first, then Spark will likely be faster, but it all depends on a ton of factors that are task specific.
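For what it's worth, the baseline both systems would run for a plain equi-join is a reduce-side (shuffle) join: tag each record with its source, group by key, then pair up the groups at the reducer. A toy in-memory Python sketch of the idea (names are mine):

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Reduce-side (shuffle) join: bucket both inputs by key, keeping
    left and right values separate, then emit the cross product of
    each key's two groups (an inner join)."""
    buckets = defaultdict(lambda: ([], []))
    for key, value in left:
        buckets[key][0].append(value)
    for key, value in right:
        buckets[key][1].append(value)
    return [(key, lv, rv)
            for key, (lvs, rvs) in buckets.items()
            for lv in lvs for rv in rvs]
```

The expensive part in either system is the shuffle that builds those buckets, which is why the sort/shuffle efficiency discussed in the paper matters so much for joins too.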



