I did only a bit of profiling on mrjob, but it seemed that the ser/de was significantly slowing down the overall computation.


Here's the code that's doing the actual line parsing: https://github.com/Yelp/mrjob/blob/master/mrjob/protocol.py#...

It just splits on tab for input and re-joins on tab for output, with some extra logic for lines without tabs.
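
For reference, the behavior amounts to roughly this (a simplified sketch, not the library's exact code, and the class name is mine):

    class RawTabProtocol(object):
        def read(self, line):
            # Split on the first tab only; a line with no tab gets a None value.
            key_value = line.split('\t', 1)
            if len(key_value) == 1:
                key_value.append(None)
            return tuple(key_value)

        def write(self, key, value):
            # Rejoin on tab, dropping a None value entirely.
            return '\t'.join(x for x in (key, value) if x is not None)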

EDIT: The example in the post is using JSON for communication between intermediate steps, while the Hadoop Streaming example is using a custom delimiter format. So this isn't really a fair comparison; the mrjob example could just as easily use the same efficient intermediate format.


Here is the profiling output: https://gist.github.com/4478737

The input is read with RawProtocol, which simply splits on tab. But after that, mrjob defaults to using JSON internally, and that's causing a lot of the slowdown.


So mrjob isn't slower at all! You just chose to use JSON for your intermediate steps in the mrjob example, instead of the delimiter format you used in the raw Python example. I believe the mrjob docs do say what the defaults are. RawProtocol can be used for intermediate steps just fine.
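
For example, using mrjob's protocol class attributes, a job along these lines should keep every stage on the raw tab-delimited format (a minimal sketch with a hypothetical word-count job; note that RawProtocol does no serialization, so keys and values must already be strings):

    from mrjob.job import MRJob
    from mrjob.protocol import RawProtocol, RawValueProtocol

    class MRRawWordCount(MRJob):
        # Read raw lines in, pass raw tab-delimited pairs between
        # steps, and write raw tab-delimited pairs out.
        INPUT_PROTOCOL = RawValueProtocol
        INTERNAL_PROTOCOL = RawProtocol
        OUTPUT_PROTOCOL = RawProtocol

        def mapper(self, _, line):
            for word in line.split():
                yield word, '1'

        def reducer(self, word, counts):
            yield word, str(sum(int(c) for c in counts))

    if __name__ == '__main__':
        MRRawWordCount.run()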

Please either mention this difference in your post or update the code and conclusions. If there's a place in the documentation where we should mention optimizations or details like this, I'd be interested to know.

I should have thought of this before. Oh well.


My goal was to use mrjob's features, not strip them down for performance. I find mrjob's natural use of JSON very appealing in terms of user experience: it means keys can be more complex types without the user having to manually figure out the best way to encode them. The post makes clear both that this is an appealing property of mrjob and that it is the reason for the slowdown. As written, the code will not work with RawProtocol internally because the key is a tuple of words.
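
To illustrate the difference (a sketch of roughly what the default JSON internal protocol buys you, not mrjob's exact code):

    import json

    # With JSON in the middle, a structured key round-trips automatically.
    key, value = ('foo', 'bar'), 5
    line = '%s\t%s' % (json.dumps(key), json.dumps(value))
    # line == '["foo", "bar"]\t5'

    # Reading it back recovers the structure (tuples come back as lists);
    # with RawProtocol you would have to invent this encoding yourself.
    raw_key, raw_value = line.split('\t', 1)
    key2, value2 = json.loads(raw_key), json.loads(raw_value)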


I can't find the part of the post where you explain that JSON parsing is the reason for the slowdown; you just say mrjob itself is slower. While I agree that mrjob's defaults encourage the use of JSON, I think it's unfair to blame the framework for a lack of optimization, given that the bare Python example could just as easily have used JSON.

One real issue with mrjob is that it assumes you're only going to have one key and one value. It isn't straightforward to use multiple key fields. The workaround is to write a custom protocol (which, btw, is very simple [1]) that uses the line up to the first tab as the key, and the rest of the line as the value, probably splitting it on tab as well and passing it through as a tuple. If we had made multipart keys simpler to use, maybe you would have chosen to use a more efficient format.
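
Something like this, for example (a hypothetical sketch of the protocol just described; see [1] for the actual interface):

    class MultiFieldProtocol(object):
        # Everything before the first tab is the key; the remaining
        # tab-separated fields are passed through as a tuple.
        def read(self, line):
            fields = line.split('\t')
            return fields[0], tuple(fields[1:])

        def write(self, key, value):
            return '\t'.join((key,) + tuple(value))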

Anyway, the main part I take issue with is:

"mrjob seems highly active, easy-to-use, and mature...but it appears to perform the slowest."

That's just not true. It would be fair to say that optimizing jobs with multipart keys isn't straightforward and therefore encourages non-optimal code, but that's moot if you're just using one key and one value, as most people do.

I'm really not trying to dump on you here. I liked the post! I would just prefer that it was more precise about these things.

[1] http://mrjob.readthedocs.org/en/latest/guides/writing-mrjobs...

EDIT: If anyone's thinking about downvoting this guy (someone did), don't. This is a discussion in good faith.



