
This article is the equivalent of "horse drawn carriages are perfectly adequate for most journeys, and much more pleasant and commodious to boot." Good luck with that, buddy.

You're not going to know what correlations are important and which are not until you study the data. Telling people to just collect the "important data" is like telling someone who has lost his keys just to go back to where he left them.

It's also more than a little insulting to FB and Yahoo to insist they are not web scale. The problem of small jobs on MR clusters is real, but even with small jobs, Hadoop turns out to be a lot more cost-effective than various other proprietary solutions which are your only real enterprise alternative. The problem of small MR jobs is being solved by things like Cloudera Impala, which can run on top of raw HDFS to perform interactive queries.



The point was that not everyone needs or has big data. That's hardly controversial. Even in some instances where you think you have big data that needs to be handled in parallel by a cluster, the work could easily be handled by a single server or even a laptop. Again, nothing controversial.

The most important thing is knowing what data you have, how best to collect it, and what it can (and can't) tell you. Just because you find correlations doesn't mean that they are real. It takes people with real expertise to help here, and just running your data on a cluster isn't going to help you. In fact, it could even hurt.

I didn't see anything wrong with the article at all.


Telling people to just collect the "important data" is like telling someone who has lost his keys just to go back to where he left them.

He doesn't tell anyone to collect the "important data," and he doesn't insist FB or Yahoo are not web scale.

His concluding paragraph is relatively weak, but the main thesis -- most businesses can ignore the Forbes/BI crap and analyze their data sufficiently using normal tools -- is true and sound.


This is such a naive response. It is the sort of response that purports to show how naive its object of criticism is, yet it misses the point entirely. First of all, you misunderstand the author: he is claiming that people treat the terms "big data" and "analysis" as synonymous, and that this is erroneous.


The problem is that your ability to explore the data and the data volume are inversely correlated. You are far more likely to find interesting things exploring an in-memory dataset using something like IPython and pandas than throwing Pig jobs at a few dozen TB of gunk. Big data is great if you know exactly what you are looking for. If you get to the stage where you are exploring a huge DB looking for relationships, you need to be very good at machine learning and statistical analysis (spurious correlations ahoy!) to come out significantly ahead. It's also an enormous time sink. In sum: the bigger the data, the simpler the analysis you can throw at it efficiently.


Very true. Wouldn't the typical approach to this involve probabilistic methods, like taking large-ish (but not "Big") samples from your multi-TB data and doing your EDA with those?
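A minimal sketch of that idea, with synthetic data standing in for the multi-TB dataset (the data, sizes, and 1% sampling rate are all illustrative):

```python
import random
import statistics

# Synthetic stand-in for a dataset far too large to explore in full.
random.seed(0)
full = [random.gauss(0, 1) for _ in range(200_000)]

# A uniform 1% random sample is usually enough for exploratory statistics:
# when rows are independent, the sample mean tracks the population mean.
sample = random.sample(full, k=len(full) // 100)

print(abs(statistics.mean(full) - statistics.mean(sample)))
```

In practice you'd draw the sample once from the big store and then iterate interactively on the small copy, which is where the caveats below come in.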


That would work very well if our random sample accurately reflected the superset of the data, which it almost always does, but you also want to consider the following...

Imagine our data was 98% junk, with the remaining 2% consisting of sequential patterns. We may be able to spot this on a graph relatively easily over the whole dataset, but random sampling would greatly degrade that signal.

We can extend that to any ordering or periodicity in the data. If the data at position n has a hidden dependency on the data at position n±1, random sampling will break us.


Do random sampling plus n lines of surrounding context.
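A quick sketch of that suggestion (the function name and parameters are made up for illustration): sample anchor positions uniformly, then keep n rows of context on each side, so dependencies between a row and its neighbours survive the sampling.

```python
import random

def sample_with_context(rows, k, n, seed=0):
    """Pick k random anchor rows and return a window of +/- n rows around each."""
    random.seed(seed)
    anchors = random.sample(range(len(rows)), k)
    windows = []
    for a in sorted(anchors):
        lo, hi = max(0, a - n), min(len(rows), a + n + 1)
        windows.append(rows[lo:hi])  # contiguous slice: local order preserved
    return windows

# Toy data where position matters: each window comes back as consecutive rows.
data = list(range(100))
windows = sample_with_context(data, k=3, n=2)
print(windows)
```

Windows can overlap for nearby anchors, which is usually harmless for EDA; deduplicate them if the double-counting matters for your statistics.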



